2026-02-21T08:04:43.3589766Z Current runner version: '2.331.0' 2026-02-21T08:04:43.3593650Z Runner name: 'dgxb200-04-1004' 2026-02-21T08:04:43.3594217Z Runner group name: 'default' 2026-02-21T08:04:43.3594802Z Machine name: 'c6df6bced02c' 2026-02-21T08:04:43.3596438Z ##[group]GITHUB_TOKEN Permissions 2026-02-21T08:04:43.3597948Z Contents: read 2026-02-21T08:04:43.3598328Z Metadata: read 2026-02-21T08:04:43.3598744Z ##[endgroup] 2026-02-21T08:04:43.3600176Z Secret source: Actions 2026-02-21T08:04:43.3600698Z Prepare workflow directory 2026-02-21T08:04:43.3965106Z Prepare all required actions 2026-02-21T08:04:43.3993252Z Getting action download info 2026-02-21T08:04:43.8808420Z Download action repository 'actions/checkout@v6' (SHA:de0fac2e4500dabe0009e67214ff5f5447ce83dd) 2026-02-21T08:04:44.2270953Z Download action repository 'actions/setup-python@v6' (SHA:a309ff8b426b58ec0e2a45f0f869d46889d02405) 2026-02-21T08:04:44.6398521Z Download action repository 'astral-sh/setup-uv@v7' (SHA:eac588ad8def6316056a12d4907a9d4d84ff7a3b) 2026-02-21T08:04:45.0209495Z Download action repository 'pytorch/test-infra@main' (SHA:bb8f04ff3961233c844fde6533c7c6c5f0857909) 2026-02-21T08:04:45.6590246Z Download action repository 'actions/upload-artifact@v6' (SHA:b7c566a772e6b6bfb58ed0dc250532a479d7789f) 2026-02-21T08:04:46.1047203Z Getting action download info 2026-02-21T08:04:46.3058055Z Uses: pytorch/helion/.github/workflows/benchmark.yml@refs/heads/main (874a7d0cadab18218a84ad3579d329dc95c51820) 2026-02-21T08:04:46.3061000Z ##[group] Inputs 2026-02-21T08:04:46.3061347Z runner: linux.dgx.b200 2026-02-21T08:04:46.3061619Z python-version: 3.12 2026-02-21T08:04:46.3062005Z image: nvidia/cuda:13.0.1-devel-ubuntu24.04 2026-02-21T08:04:46.3062330Z runtime-version: cu130 2026-02-21T08:04:46.3062648Z container-options: --gpus all 2026-02-21T08:04:46.3062898Z alias: b200 2026-02-21T08:04:46.3063188Z kernels: welford 2026-02-21T08:04:46.3063422Z env-vars: 2026-02-21T08:04:46.3063674Z custom-args: 
2026-02-21T08:04:46.3064217Z run_h100: true 2026-02-21T08:04:46.3064455Z run_b200: true 2026-02-21T08:04:46.3064703Z run_mi325x: true 2026-02-21T08:04:46.3064959Z ##[endgroup] 2026-02-21T08:04:46.3065307Z Complete job name: run-b200 (welford) / benchmark-cu130-welford-py3.12-b200 2026-02-21T08:04:46.3297980Z ##[group]Checking docker version 2026-02-21T08:04:46.3307272Z ##[command]/usr/bin/docker version --format '{{.Server.APIVersion}}' 2026-02-21T08:04:46.4288405Z '1.53' 2026-02-21T08:04:46.4307648Z Docker daemon API version: '1.53' 2026-02-21T08:04:46.4308193Z ##[command]/usr/bin/docker version --format '{{.Client.APIVersion}}' 2026-02-21T08:04:46.5163336Z '1.52' 2026-02-21T08:04:46.5178522Z Docker client API version: '1.52' 2026-02-21T08:04:46.5182793Z ##[endgroup] 2026-02-21T08:04:46.5184989Z ##[group]Clean up resources from previous jobs 2026-02-21T08:04:46.5188190Z ##[command]/usr/bin/docker ps --all --quiet --no-trunc --filter "label=199f0c" 2026-02-21T08:04:46.5305337Z ##[command]/usr/bin/docker network prune --force --filter "label=199f0c" 2026-02-21T08:04:46.5397179Z ##[endgroup] 2026-02-21T08:04:46.5397515Z ##[group]Create local container network 2026-02-21T08:04:46.5403856Z ##[command]/usr/bin/docker network create --label 199f0c github_network_bb902c0f1908469581a19a15aa9ed8d1 2026-02-21T08:04:46.9714401Z 509ef16cc3c18d0bba200d0a01fcb181441e562719ce2026b1c05e16fbdea924 2026-02-21T08:04:46.9739015Z ##[endgroup] 2026-02-21T08:04:46.9762283Z ##[group]Starting job container 2026-02-21T08:04:46.9780887Z ##[command]/usr/bin/docker pull nvidia/cuda:13.0.1-devel-ubuntu24.04 2026-02-21T08:04:47.7649918Z 13.0.1-devel-ubuntu24.04: Pulling from nvidia/cuda 2026-02-21T08:04:48.0511060Z 1cd98a0b9132: Pulling fs layer 2026-02-21T08:04:48.0515472Z 76249c7cd503: Pulling fs layer 2026-02-21T08:04:48.0520537Z 401d11fb2a09: Pulling fs layer 2026-02-21T08:04:48.0521372Z ab7341a40ee7: Pulling fs layer 2026-02-21T08:04:48.0521625Z c20926c42231: Pulling fs layer 
2026-02-21T08:04:48.0522038Z afcf80b42416: Pulling fs layer 2026-02-21T08:04:48.0522382Z 8fb7ecb711ef: Pulling fs layer 2026-02-21T08:04:48.0522913Z d7913b78456a: Pulling fs layer 2026-02-21T08:04:48.0523178Z e93dd1223ff5: Pulling fs layer 2026-02-21T08:04:48.0523427Z c03b8ec8dd33: Pulling fs layer 2026-02-21T08:04:48.0523665Z eea924c2c8fb: Pulling fs layer 2026-02-21T08:04:48.3935207Z c20926c42231: Download complete 2026-02-21T08:04:48.3941903Z afcf80b42416: Download complete 2026-02-21T08:04:48.3950642Z 8fb7ecb711ef: Download complete 2026-02-21T08:04:48.3958867Z d7913b78456a: Download complete 2026-02-21T08:04:48.3966068Z 1cd98a0b9132: Download complete 2026-02-21T08:04:48.3972847Z c03b8ec8dd33: Download complete 2026-02-21T08:04:48.5928100Z 401d11fb2a09: Download complete 2026-02-21T08:04:48.5934061Z 76249c7cd503: Download complete 2026-02-21T08:04:49.5971575Z 76249c7cd503: Pull complete 2026-02-21T08:04:49.8920648Z ab7341a40ee7: Download complete 2026-02-21T08:05:00.6925389Z eea924c2c8fb: Download complete 2026-02-21T08:05:01.5957601Z 401d11fb2a09: Pull complete 2026-02-21T08:05:06.1922327Z e93dd1223ff5: Download complete 2026-02-21T08:05:06.1940266Z d7913b78456a: Pull complete 2026-02-21T08:05:06.1949352Z c03b8ec8dd33: Pull complete 2026-02-21T08:05:06.1953782Z ab7341a40ee7: Pull complete 2026-02-21T08:05:22.5930747Z c20926c42231: Pull complete 2026-02-21T08:05:22.5931154Z afcf80b42416: Pull complete 2026-02-21T08:05:22.5935395Z 8fb7ecb711ef: Pull complete 2026-02-21T08:05:22.5942918Z eea924c2c8fb: Pull complete 2026-02-21T08:06:01.3670503Z 1cd98a0b9132: Pull complete 2026-02-21T08:06:01.3677178Z e93dd1223ff5: Pull complete 2026-02-21T08:06:01.3678660Z Digest: sha256:7d2f6a8c2071d911524f95061a0db363e24d27aa51ec831fcccf9e76eb72bc92 2026-02-21T08:06:01.3681381Z Status: Downloaded newer image for nvidia/cuda:13.0.1-devel-ubuntu24.04 2026-02-21T08:06:01.3687825Z docker.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 2026-02-21T08:06:01.3760637Z ##[command]/usr/bin/docker 
create --name 1ba5cc7795af4ec8a97beebf24e9b59a_nvidiacuda1301develubuntu2404_bf8b79 --label 199f0c --workdir /__w/helion/helion --network github_network_bb902c0f1908469581a19a15aa9ed8d1 --gpus all -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/bob/_work":"/__w" -v "/home/bob/externals":"/__e":ro -v "/home/bob/_work/_temp":"/__w/_temp" -v "/home/bob/_work/_actions":"/__w/_actions" -v "/home/bob/_work/_tool":"/__w/_tool" -v "/home/bob/_work/_temp/_github_home":"/github/home" -v "/home/bob/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" nvidia/cuda:13.0.1-devel-ubuntu24.04 "-f" "/dev/null" 2026-02-21T08:06:01.4089065Z 748244fffba051af5f5cff673877559f5b5ba31ac47e4c49ff2680b83d7c9486 2026-02-21T08:06:01.4106753Z ##[command]/usr/bin/docker start 748244fffba051af5f5cff673877559f5b5ba31ac47e4c49ff2680b83d7c9486 2026-02-21T08:06:02.0088772Z 748244fffba051af5f5cff673877559f5b5ba31ac47e4c49ff2680b83d7c9486 2026-02-21T08:06:02.0105661Z ##[command]/usr/bin/docker ps --all --filter id=748244fffba051af5f5cff673877559f5b5ba31ac47e4c49ff2680b83d7c9486 --filter status=running --no-trunc --format "{{.ID}} {{.Status}}" 2026-02-21T08:06:02.0252353Z 748244fffba051af5f5cff673877559f5b5ba31ac47e4c49ff2680b83d7c9486 Up Less than a second 2026-02-21T08:06:02.0267748Z ##[command]/usr/bin/docker inspect --format "{{range .Config.Env}}{{println .}}{{end}}" 748244fffba051af5f5cff673877559f5b5ba31ac47e4c49ff2680b83d7c9486 2026-02-21T08:06:02.0369713Z GITHUB_ACTIONS=true 2026-02-21T08:06:02.0371089Z CI=true 2026-02-21T08:06:02.0371424Z HOME=/github/home 2026-02-21T08:06:02.0372041Z PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:06:02.0372522Z NVARCH=x86_64 2026-02-21T08:06:02.0377757Z NVIDIA_REQUIRE_CUDA=cuda>=13.0 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 
brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571 brand=unknown,driver>=575,driver<576 brand=grid,driver>=575,driver<576 brand=tesla,driver>=575,driver<576 brand=nvidia,driver>=575,driver<576 brand=quadro,driver>=575,driver<576 brand=quadrortx,driver>=575,driver<576 brand=nvidiartx,driver>=575,driver<576 brand=vapps,driver>=575,driver<576 brand=vpc,driver>=575,driver<576 brand=vcs,driver>=575,driver<576 
brand=vws,driver>=575,driver<576 brand=cloudgaming,driver>=575,driver<576 2026-02-21T08:06:02.0383251Z NV_CUDA_CUDART_VERSION=13.0.88-1 2026-02-21T08:06:02.0383471Z CUDA_VERSION=13.0.1 2026-02-21T08:06:02.0383891Z LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:02.0384250Z NVIDIA_VISIBLE_DEVICES=all 2026-02-21T08:06:02.0384515Z NVIDIA_DRIVER_CAPABILITIES=compute,utility 2026-02-21T08:06:02.0384835Z NV_CUDA_LIB_VERSION=13.0.1-1 2026-02-21T08:06:02.0385205Z NV_NVTX_VERSION=13.0.85-1 2026-02-21T08:06:02.0385446Z NV_LIBNPP_VERSION=13.0.1.2-1 2026-02-21T08:06:02.0385723Z NV_LIBNPP_PACKAGE=libnpp-13-0=13.0.1.2-1 2026-02-21T08:06:02.0386025Z NV_LIBCUSPARSE_VERSION=12.6.3.3-1 2026-02-21T08:06:02.0386272Z NV_LIBCUBLAS_PACKAGE_NAME=libcublas-13-0 2026-02-21T08:06:02.0386581Z NV_LIBCUBLAS_VERSION=13.0.2.14-1 2026-02-21T08:06:02.0386828Z NV_LIBCUBLAS_PACKAGE=libcublas-13-0=13.0.2.14-1 2026-02-21T08:06:02.0387133Z NV_LIBNCCL_PACKAGE_NAME=libnccl2 2026-02-21T08:06:02.0387413Z NV_LIBNCCL_PACKAGE_VERSION=2.28.3-1 2026-02-21T08:06:02.0387657Z NCCL_VERSION=2.28.3-1 2026-02-21T08:06:02.0387928Z NV_LIBNCCL_PACKAGE=libnccl2=2.28.3-1+cuda13.0 2026-02-21T08:06:02.0388179Z NVIDIA_PRODUCT_NAME=CUDA 2026-02-21T08:06:02.0388443Z NV_CUDA_CUDART_DEV_VERSION=13.0.88-1 2026-02-21T08:06:02.0388696Z NV_NVML_DEV_VERSION=13.0.87-1 2026-02-21T08:06:02.0388962Z NV_LIBCUSPARSE_DEV_VERSION=12.6.3.3-1 2026-02-21T08:06:02.0389219Z NV_LIBNPP_DEV_VERSION=13.0.1.2-1 2026-02-21T08:06:02.0389512Z NV_LIBNPP_DEV_PACKAGE=libnpp-dev-13-0=13.0.1.2-1 2026-02-21T08:06:02.0389876Z NV_LIBCUBLAS_DEV_VERSION=13.0.2.14-1 2026-02-21T08:06:02.0390171Z NV_LIBCUBLAS_DEV_PACKAGE_NAME=libcublas-dev-13-0 2026-02-21T08:06:02.0390512Z NV_LIBCUBLAS_DEV_PACKAGE=libcublas-dev-13-0=13.0.2.14-1 2026-02-21T08:06:02.0390779Z NV_CUDA_NSIGHT_COMPUTE_VERSION=13.0.1-1 2026-02-21T08:06:02.0391154Z NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE=cuda-nsight-compute-13-0=13.0.1-1 
2026-02-21T08:06:02.0391535Z NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev 2026-02-21T08:06:02.0391782Z NV_LIBNCCL_DEV_PACKAGE_VERSION=2.28.3-1 2026-02-21T08:06:02.0392188Z NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.28.3-1+cuda13.0 2026-02-21T08:06:02.0392475Z LIBRARY_PATH=/usr/local/cuda/lib64/stubs 2026-02-21T08:06:02.0398278Z ##[endgroup] 2026-02-21T08:06:02.0405576Z ##[group]Waiting for all services to be ready 2026-02-21T08:06:02.0407075Z ##[endgroup] 2026-02-21T08:06:02.0540837Z ##[group]Run echo "Detected NVIDIA image" 2026-02-21T08:06:02.0541197Z echo "Detected NVIDIA image" 2026-02-21T08:06:02.0541497Z nvidia-smi || echo "nvidia-smi not found" 2026-02-21T08:06:02.0543993Z shell: bash -l {0} 2026-02-21T08:06:02.0544291Z env: 2026-02-21T08:06:02.0544521Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:02.0544745Z ##[endgroup] 2026-02-21T08:06:02.1169477Z Detected NVIDIA image 2026-02-21T08:06:02.1318589Z Sat Feb 21 08:06:02 2026 2026-02-21T08:06:02.1318962Z +-----------------------------------------------------------------------------------------+ 2026-02-21T08:06:02.1319484Z | NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 | 2026-02-21T08:06:02.1319905Z +-----------------------------------------+------------------------+----------------------+ 2026-02-21T08:06:02.1320442Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2026-02-21T08:06:02.1321013Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2026-02-21T08:06:02.1321455Z | | | MIG M. 
| 2026-02-21T08:06:02.1321837Z |=========================================+========================+======================| 2026-02-21T08:06:02.1431163Z | 0 NVIDIA B200 Off | 00000000:43:00.0 Off | 0 | 2026-02-21T08:06:02.1431911Z | N/A 31C P0 140W / 750W | 0MiB / 183359MiB | 0% Default | 2026-02-21T08:06:02.1436207Z | | | Disabled | 2026-02-21T08:06:02.1436745Z +-----------------------------------------+------------------------+----------------------+ 2026-02-21T08:06:02.1437020Z 2026-02-21T08:06:02.1437318Z +-----------------------------------------------------------------------------------------+ 2026-02-21T08:06:02.1437821Z | Processes: | 2026-02-21T08:06:02.1438207Z | GPU GI CI PID Type Process name GPU Memory | 2026-02-21T08:06:02.1438604Z | ID ID Usage | 2026-02-21T08:06:02.1438962Z |=========================================================================================| 2026-02-21T08:06:02.1439308Z | No running processes found | 2026-02-21T08:06:02.1439713Z +-----------------------------------------------------------------------------------------+ 2026-02-21T08:06:02.1920263Z ##[group]Run set -x 2026-02-21T08:06:02.1920564Z set -x 2026-02-21T08:06:02.1920794Z apt-get update 2026-02-21T08:06:02.1921043Z apt-get install -y git 2026-02-21T08:06:02.1921359Z shell: bash -l {0} 2026-02-21T08:06:02.1921602Z env: 2026-02-21T08:06:02.1921788Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:02.1922104Z ##[endgroup] 2026-02-21T08:06:02.2434490Z + apt-get update 2026-02-21T08:06:02.3062979Z Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64 InRelease [1581 B] 2026-02-21T08:06:02.4292321Z Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64 Packages [1218 kB] 2026-02-21T08:06:02.6153798Z Get:3 http://security.ubuntu.com/ubuntu noble-security InRelease [126 kB] 2026-02-21T08:06:02.6155019Z Get:4 http://archive.ubuntu.com/ubuntu noble InRelease [256 kB] 2026-02-21T08:06:03.4104003Z Get:5 
http://security.ubuntu.com/ubuntu noble-security/universe amd64 Packages [1207 kB] 2026-02-21T08:06:03.5465054Z Get:6 http://archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB] 2026-02-21T08:06:03.7780081Z Get:7 http://archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB] 2026-02-21T08:06:04.0156317Z Get:8 http://archive.ubuntu.com/ubuntu noble/universe amd64 Packages [19.3 MB] 2026-02-21T08:06:04.2108497Z Get:9 http://security.ubuntu.com/ubuntu noble-security/main amd64 Packages [1857 kB] 2026-02-21T08:06:04.4319770Z Get:10 http://security.ubuntu.com/ubuntu noble-security/multiverse amd64 Packages [34.8 kB] 2026-02-21T08:06:04.4340922Z Get:11 http://security.ubuntu.com/ubuntu noble-security/restricted amd64 Packages [3196 kB] 2026-02-21T08:06:05.3913869Z Get:12 http://archive.ubuntu.com/ubuntu noble/multiverse amd64 Packages [331 kB] 2026-02-21T08:06:05.4174183Z Get:13 http://archive.ubuntu.com/ubuntu noble/main amd64 Packages [1808 kB] 2026-02-21T08:06:05.4830428Z Get:14 http://archive.ubuntu.com/ubuntu noble/restricted amd64 Packages [117 kB] 2026-02-21T08:06:05.4882584Z Get:15 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 Packages [2240 kB] 2026-02-21T08:06:05.6027569Z Get:16 http://archive.ubuntu.com/ubuntu noble-updates/restricted amd64 Packages [3381 kB] 2026-02-21T08:06:05.7232437Z Get:17 http://archive.ubuntu.com/ubuntu noble-updates/multiverse amd64 Packages [38.1 kB] 2026-02-21T08:06:05.7233120Z Get:18 http://archive.ubuntu.com/ubuntu noble-updates/universe amd64 Packages [2016 kB] 2026-02-21T08:06:05.8239557Z Get:19 http://archive.ubuntu.com/ubuntu noble-backports/main amd64 Packages [49.5 kB] 2026-02-21T08:06:05.8258303Z Get:20 http://archive.ubuntu.com/ubuntu noble-backports/universe amd64 Packages [34.6 kB] 2026-02-21T08:06:06.2415243Z Fetched 37.5 MB in 4s (9437 kB/s) 2026-02-21T08:06:06.8742534Z Reading package lists... 
2026-02-21T08:06:06.8871180Z W: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details. 2026-02-21T08:06:06.8880921Z + apt-get install -y git 2026-02-21T08:06:07.5144317Z Reading package lists... 2026-02-21T08:06:07.6292701Z Building dependency tree... 2026-02-21T08:06:07.6293041Z Reading state information... 2026-02-21T08:06:07.7612681Z The following additional packages will be installed: 2026-02-21T08:06:07.7613283Z git-man krb5-locales less libbrotli1 libbsd0 libcbor0.10 libcurl3t64-gnutls 2026-02-21T08:06:07.7613855Z libedit2 liberror-perl libexpat1 libfido2-1 libgssapi-krb5-2 libk5crypto3 2026-02-21T08:06:07.7614297Z libkeyutils1 libkrb5-3 libkrb5support0 libnghttp2-14 libpsl5t64 librtmp1 2026-02-21T08:06:07.7614686Z libssh-4 libx11-6 libx11-data libxau6 libxcb1 libxdmcp6 libxext6 libxmuu1 2026-02-21T08:06:07.7615045Z openssh-client publicsuffix xauth 2026-02-21T08:06:07.7618966Z Suggested packages: 2026-02-21T08:06:07.7619299Z gettext-base git-daemon-run | git-daemon-sysvinit git-doc git-email git-gui 2026-02-21T08:06:07.7620114Z gitk gitweb git-cvs git-mediawiki git-svn krb5-doc krb5-user keychain 2026-02-21T08:06:07.7620460Z libpam-ssh monkeysphere ssh-askpass 2026-02-21T08:06:07.8022172Z The following NEW packages will be installed: 2026-02-21T08:06:07.8022679Z git git-man krb5-locales less libbrotli1 libbsd0 libcbor0.10 2026-02-21T08:06:07.8023148Z libcurl3t64-gnutls libedit2 liberror-perl libexpat1 libfido2-1 2026-02-21T08:06:07.8023649Z libgssapi-krb5-2 libk5crypto3 libkeyutils1 libkrb5-3 libkrb5support0 2026-02-21T08:06:07.8024085Z libnghttp2-14 libpsl5t64 librtmp1 libssh-4 libx11-6 libx11-data libxau6 2026-02-21T08:06:07.8024543Z libxcb1 libxdmcp6 libxext6 libxmuu1 openssh-client publicsuffix xauth 2026-02-21T08:06:08.1598514Z 0 upgraded, 31 newly installed, 0 to remove and 86 not upgraded. 
2026-02-21T08:06:08.1598994Z Need to get 8886 kB of archives. 2026-02-21T08:06:08.1599448Z After this operation, 38.0 MB of additional disk space will be used. 2026-02-21T08:06:08.1600099Z Get:1 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 krb5-locales all 1.20.1-6ubuntu2.6 [14.8 kB] 2026-02-21T08:06:08.5031058Z Get:2 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 less amd64 590-2ubuntu2.1 [142 kB] 2026-02-21T08:06:08.9666812Z Get:3 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libbsd0 amd64 0.12.1-1build1.1 [41.2 kB] 2026-02-21T08:06:09.0389292Z Get:4 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libexpat1 amd64 2.6.1-2ubuntu0.4 [88.2 kB] 2026-02-21T08:06:09.1292901Z Get:5 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libkrb5support0 amd64 1.20.1-6ubuntu2.6 [34.4 kB] 2026-02-21T08:06:09.1557549Z Get:6 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libk5crypto3 amd64 1.20.1-6ubuntu2.6 [82.0 kB] 2026-02-21T08:06:09.2139281Z Get:7 http://archive.ubuntu.com/ubuntu noble/main amd64 libkeyutils1 amd64 1.6.3-3build1 [9490 B] 2026-02-21T08:06:09.2189913Z Get:8 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libkrb5-3 amd64 1.20.1-6ubuntu2.6 [348 kB] 2026-02-21T08:06:09.3717865Z Get:9 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libgssapi-krb5-2 amd64 1.20.1-6ubuntu2.6 [143 kB] 2026-02-21T08:06:09.4239299Z Get:10 http://archive.ubuntu.com/ubuntu noble/main amd64 libcbor0.10 amd64 0.10.2-1.2ubuntu2 [25.8 kB] 2026-02-21T08:06:09.4265349Z Get:11 http://archive.ubuntu.com/ubuntu noble/main amd64 libedit2 amd64 3.1-20230828-1build1 [97.6 kB] 2026-02-21T08:06:09.4564155Z Get:12 http://archive.ubuntu.com/ubuntu noble/main amd64 libfido2-1 amd64 1.14.0-1build3 [83.5 kB] 2026-02-21T08:06:09.4693816Z Get:13 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libnghttp2-14 amd64 1.59.0-1ubuntu0.2 [74.3 kB] 2026-02-21T08:06:09.4817350Z Get:14 http://archive.ubuntu.com/ubuntu noble/main 
amd64 libpsl5t64 amd64 0.21.2-1.1build1 [57.1 kB] 2026-02-21T08:06:09.4908936Z Get:15 http://archive.ubuntu.com/ubuntu noble/main amd64 libxau6 amd64 1:1.0.9-1build6 [7160 B] 2026-02-21T08:06:09.4927025Z Get:16 http://archive.ubuntu.com/ubuntu noble/main amd64 libxdmcp6 amd64 1:1.1.3-0ubuntu6 [10.3 kB] 2026-02-21T08:06:09.4966327Z Get:17 http://archive.ubuntu.com/ubuntu noble/main amd64 libxcb1 amd64 1.15-1ubuntu2 [47.7 kB] 2026-02-21T08:06:09.5422531Z Get:18 http://archive.ubuntu.com/ubuntu noble/main amd64 libx11-data all 2:1.8.7-1build1 [115 kB] 2026-02-21T08:06:09.5992052Z Get:19 http://archive.ubuntu.com/ubuntu noble/main amd64 libx11-6 amd64 2:1.8.7-1build1 [650 kB] 2026-02-21T08:06:09.6738583Z Get:20 http://archive.ubuntu.com/ubuntu noble/main amd64 libxext6 amd64 2:1.3.4-1build2 [30.4 kB] 2026-02-21T08:06:09.6801734Z Get:21 http://archive.ubuntu.com/ubuntu noble/main amd64 libxmuu1 amd64 2:1.1.3-3build2 [8958 B] 2026-02-21T08:06:09.6815187Z Get:22 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 openssh-client amd64 1:9.6p1-3ubuntu13.14 [906 kB] 2026-02-21T08:06:09.7812930Z Get:23 http://archive.ubuntu.com/ubuntu noble/main amd64 publicsuffix all 20231001.0357-0.1 [129 kB] 2026-02-21T08:06:09.7924180Z Get:24 http://archive.ubuntu.com/ubuntu noble/main amd64 xauth amd64 1:1.1.2-1build1 [25.6 kB] 2026-02-21T08:06:09.7941760Z Get:25 http://archive.ubuntu.com/ubuntu noble/main amd64 libbrotli1 amd64 1.1.0-2build2 [331 kB] 2026-02-21T08:06:09.8242314Z Get:26 http://archive.ubuntu.com/ubuntu noble/main amd64 librtmp1 amd64 2.4+20151223.gitfa8646d.1-2build7 [56.3 kB] 2026-02-21T08:06:09.8293844Z Get:27 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libssh-4 amd64 0.10.6-2ubuntu0.3 [190 kB] 2026-02-21T08:06:09.8447335Z Get:28 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libcurl3t64-gnutls amd64 8.5.0-2ubuntu10.6 [333 kB] 2026-02-21T08:06:09.8709244Z Get:29 http://archive.ubuntu.com/ubuntu noble/main amd64 liberror-perl all 0.17029-2 
[25.6 kB] 2026-02-21T08:06:09.8722392Z Get:30 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 git-man all 1:2.43.0-1ubuntu7.3 [1100 kB] 2026-02-21T08:06:09.9272272Z Get:31 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 git amd64 1:2.43.0-1ubuntu7.3 [3680 kB] 2026-02-21T08:06:10.1699434Z debconf: delaying package configuration, since apt-utils is not installed 2026-02-21T08:06:10.1914446Z Fetched 8886 kB in 2s (3926 kB/s) 2026-02-21T08:06:10.2070910Z Selecting previously unselected package krb5-locales. 2026-02-21T08:06:10.2137239Z (Reading database ... 15507 files and directories currently installed.) 2026-02-21T08:06:10.2142937Z Preparing to unpack .../00-krb5-locales_1.20.1-6ubuntu2.6_all.deb ... 2026-02-21T08:06:10.2161792Z Unpacking krb5-locales (1.20.1-6ubuntu2.6) ... 
2026-02-21T08:06:10.2335774Z Selecting previously unselected package less. 2026-02-21T08:06:10.2345496Z Preparing to unpack .../01-less_590-2ubuntu2.1_amd64.deb ... 2026-02-21T08:06:10.2370345Z Unpacking less (590-2ubuntu2.1) ... 2026-02-21T08:06:10.2564248Z Selecting previously unselected package libbsd0:amd64. 2026-02-21T08:06:10.2575719Z Preparing to unpack .../02-libbsd0_0.12.1-1build1.1_amd64.deb ... 2026-02-21T08:06:10.2613898Z Unpacking libbsd0:amd64 (0.12.1-1build1.1) ... 2026-02-21T08:06:10.2809600Z Selecting previously unselected package libexpat1:amd64. 2026-02-21T08:06:10.2815519Z Preparing to unpack .../03-libexpat1_2.6.1-2ubuntu0.4_amd64.deb ... 2026-02-21T08:06:10.2833458Z Unpacking libexpat1:amd64 (2.6.1-2ubuntu0.4) ... 2026-02-21T08:06:10.3028299Z Selecting previously unselected package libkrb5support0:amd64. 2026-02-21T08:06:10.3035882Z Preparing to unpack .../04-libkrb5support0_1.20.1-6ubuntu2.6_amd64.deb ... 2026-02-21T08:06:10.3071199Z Unpacking libkrb5support0:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:10.3262690Z Selecting previously unselected package libk5crypto3:amd64. 2026-02-21T08:06:10.3275390Z Preparing to unpack .../05-libk5crypto3_1.20.1-6ubuntu2.6_amd64.deb ... 2026-02-21T08:06:10.3292496Z Unpacking libk5crypto3:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:10.3462678Z Selecting previously unselected package libkeyutils1:amd64. 2026-02-21T08:06:10.3473130Z Preparing to unpack .../06-libkeyutils1_1.6.3-3build1_amd64.deb ... 2026-02-21T08:06:10.3489257Z Unpacking libkeyutils1:amd64 (1.6.3-3build1) ... 2026-02-21T08:06:10.3684419Z Selecting previously unselected package libkrb5-3:amd64. 2026-02-21T08:06:10.3698302Z Preparing to unpack .../07-libkrb5-3_1.20.1-6ubuntu2.6_amd64.deb ... 2026-02-21T08:06:10.3734911Z Unpacking libkrb5-3:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:10.3973177Z Selecting previously unselected package libgssapi-krb5-2:amd64. 
2026-02-21T08:06:10.3981421Z Preparing to unpack .../08-libgssapi-krb5-2_1.20.1-6ubuntu2.6_amd64.deb ... 2026-02-21T08:06:10.3995514Z Unpacking libgssapi-krb5-2:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:10.4188988Z Selecting previously unselected package libcbor0.10:amd64. 2026-02-21T08:06:10.4196635Z Preparing to unpack .../09-libcbor0.10_0.10.2-1.2ubuntu2_amd64.deb ... 2026-02-21T08:06:10.4213476Z Unpacking libcbor0.10:amd64 (0.10.2-1.2ubuntu2) ... 2026-02-21T08:06:10.4402874Z Selecting previously unselected package libedit2:amd64. 2026-02-21T08:06:10.4409141Z Preparing to unpack .../10-libedit2_3.1-20230828-1build1_amd64.deb ... 2026-02-21T08:06:10.4424888Z Unpacking libedit2:amd64 (3.1-20230828-1build1) ... 2026-02-21T08:06:10.4629555Z Selecting previously unselected package libfido2-1:amd64. 2026-02-21T08:06:10.4637933Z Preparing to unpack .../11-libfido2-1_1.14.0-1build3_amd64.deb ... 2026-02-21T08:06:10.4647552Z Unpacking libfido2-1:amd64 (1.14.0-1build3) ... 2026-02-21T08:06:10.4843680Z Selecting previously unselected package libnghttp2-14:amd64. 2026-02-21T08:06:10.4851598Z Preparing to unpack .../12-libnghttp2-14_1.59.0-1ubuntu0.2_amd64.deb ... 2026-02-21T08:06:10.4866020Z Unpacking libnghttp2-14:amd64 (1.59.0-1ubuntu0.2) ... 2026-02-21T08:06:10.5011638Z Selecting previously unselected package libpsl5t64:amd64. 2026-02-21T08:06:10.5016233Z Preparing to unpack .../13-libpsl5t64_0.21.2-1.1build1_amd64.deb ... 2026-02-21T08:06:10.5022196Z Unpacking libpsl5t64:amd64 (0.21.2-1.1build1) ... 2026-02-21T08:06:10.5115678Z Selecting previously unselected package libxau6:amd64. 2026-02-21T08:06:10.5124349Z Preparing to unpack .../14-libxau6_1%3a1.0.9-1build6_amd64.deb ... 2026-02-21T08:06:10.5129852Z Unpacking libxau6:amd64 (1:1.0.9-1build6) ... 2026-02-21T08:06:10.5281050Z Selecting previously unselected package libxdmcp6:amd64. 2026-02-21T08:06:10.5289931Z Preparing to unpack .../15-libxdmcp6_1%3a1.1.3-0ubuntu6_amd64.deb ... 
2026-02-21T08:06:10.5301292Z Unpacking libxdmcp6:amd64 (1:1.1.3-0ubuntu6) ... 2026-02-21T08:06:10.5457727Z Selecting previously unselected package libxcb1:amd64. 2026-02-21T08:06:10.5459581Z Preparing to unpack .../16-libxcb1_1.15-1ubuntu2_amd64.deb ... 2026-02-21T08:06:10.5471686Z Unpacking libxcb1:amd64 (1.15-1ubuntu2) ... 2026-02-21T08:06:10.5608076Z Selecting previously unselected package libx11-data. 2026-02-21T08:06:10.5615602Z Preparing to unpack .../17-libx11-data_2%3a1.8.7-1build1_all.deb ... 2026-02-21T08:06:10.5632135Z Unpacking libx11-data (2:1.8.7-1build1) ... 2026-02-21T08:06:10.5879429Z Selecting previously unselected package libx11-6:amd64. 2026-02-21T08:06:10.5886098Z Preparing to unpack .../18-libx11-6_2%3a1.8.7-1build1_amd64.deb ... 2026-02-21T08:06:10.5888341Z Unpacking libx11-6:amd64 (2:1.8.7-1build1) ... 2026-02-21T08:06:10.6078337Z Selecting previously unselected package libxext6:amd64. 2026-02-21T08:06:10.6087041Z Preparing to unpack .../19-libxext6_2%3a1.3.4-1build2_amd64.deb ... 2026-02-21T08:06:10.6103789Z Unpacking libxext6:amd64 (2:1.3.4-1build2) ... 2026-02-21T08:06:10.6252711Z Selecting previously unselected package libxmuu1:amd64. 2026-02-21T08:06:10.6262654Z Preparing to unpack .../20-libxmuu1_2%3a1.1.3-3build2_amd64.deb ... 2026-02-21T08:06:10.6269500Z Unpacking libxmuu1:amd64 (2:1.1.3-3build2) ... 2026-02-21T08:06:10.6402797Z Selecting previously unselected package openssh-client. 2026-02-21T08:06:10.6403255Z Preparing to unpack .../21-openssh-client_1%3a9.6p1-3ubuntu13.14_amd64.deb ... 2026-02-21T08:06:10.6469955Z Unpacking openssh-client (1:9.6p1-3ubuntu13.14) ... 2026-02-21T08:06:10.6788208Z Selecting previously unselected package publicsuffix. 2026-02-21T08:06:10.6790175Z Preparing to unpack .../22-publicsuffix_20231001.0357-0.1_all.deb ... 2026-02-21T08:06:10.6806170Z Unpacking publicsuffix (20231001.0357-0.1) ... 2026-02-21T08:06:10.7002783Z Selecting previously unselected package xauth. 
2026-02-21T08:06:10.7008572Z Preparing to unpack .../23-xauth_1%3a1.1.2-1build1_amd64.deb ... 2026-02-21T08:06:10.7026824Z Unpacking xauth (1:1.1.2-1build1) ... 2026-02-21T08:06:10.7179630Z Selecting previously unselected package libbrotli1:amd64. 2026-02-21T08:06:10.7180881Z Preparing to unpack .../24-libbrotli1_1.1.0-2build2_amd64.deb ... 2026-02-21T08:06:10.7194384Z Unpacking libbrotli1:amd64 (1.1.0-2build2) ... 2026-02-21T08:06:10.7370266Z Selecting previously unselected package librtmp1:amd64. 2026-02-21T08:06:10.7379138Z Preparing to unpack .../25-librtmp1_2.4+20151223.gitfa8646d.1-2build7_amd64.deb ... 2026-02-21T08:06:10.7389157Z Unpacking librtmp1:amd64 (2.4+20151223.gitfa8646d.1-2build7) ... 2026-02-21T08:06:10.7501603Z Selecting previously unselected package libssh-4:amd64. 2026-02-21T08:06:10.7511333Z Preparing to unpack .../26-libssh-4_0.10.6-2ubuntu0.3_amd64.deb ... 2026-02-21T08:06:10.7518055Z Unpacking libssh-4:amd64 (0.10.6-2ubuntu0.3) ... 2026-02-21T08:06:10.7668615Z Selecting previously unselected package libcurl3t64-gnutls:amd64. 2026-02-21T08:06:10.7681147Z Preparing to unpack .../27-libcurl3t64-gnutls_8.5.0-2ubuntu10.6_amd64.deb ... 2026-02-21T08:06:10.7681751Z Unpacking libcurl3t64-gnutls:amd64 (8.5.0-2ubuntu10.6) ... 2026-02-21T08:06:10.7843415Z Selecting previously unselected package liberror-perl. 2026-02-21T08:06:10.7850172Z Preparing to unpack .../28-liberror-perl_0.17029-2_all.deb ... 2026-02-21T08:06:10.7861726Z Unpacking liberror-perl (0.17029-2) ... 2026-02-21T08:06:10.8000872Z Selecting previously unselected package git-man. 2026-02-21T08:06:10.8015356Z Preparing to unpack .../29-git-man_1%3a2.43.0-1ubuntu7.3_all.deb ... 2026-02-21T08:06:10.8023453Z Unpacking git-man (1:2.43.0-1ubuntu7.3) ... 2026-02-21T08:06:10.8199258Z Selecting previously unselected package git. 2026-02-21T08:06:10.8203913Z Preparing to unpack .../30-git_1%3a2.43.0-1ubuntu7.3_amd64.deb ... 2026-02-21T08:06:10.8265641Z Unpacking git (1:2.43.0-1ubuntu7.3) ... 
2026-02-21T08:06:10.9307788Z Setting up libexpat1:amd64 (2.6.1-2ubuntu0.4) ... 2026-02-21T08:06:10.9331642Z Setting up libxau6:amd64 (1:1.0.9-1build6) ... 2026-02-21T08:06:10.9356806Z Setting up libkeyutils1:amd64 (1.6.3-3build1) ... 2026-02-21T08:06:10.9383746Z Setting up libcbor0.10:amd64 (0.10.2-1.2ubuntu2) ... 2026-02-21T08:06:10.9414954Z Setting up libbrotli1:amd64 (1.1.0-2build2) ... 2026-02-21T08:06:10.9428490Z Setting up libpsl5t64:amd64 (0.21.2-1.1build1) ... 2026-02-21T08:06:10.9441292Z Setting up libnghttp2-14:amd64 (1.59.0-1ubuntu0.2) ... 2026-02-21T08:06:10.9451532Z Setting up less (590-2ubuntu2.1) ... 2026-02-21T08:06:10.9513134Z Setting up krb5-locales (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:10.9536758Z Setting up libkrb5support0:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:10.9562697Z Setting up liberror-perl (0.17029-2) ... 2026-02-21T08:06:10.9590347Z Setting up libx11-data (2:1.8.7-1build1) ... 2026-02-21T08:06:10.9616324Z Setting up librtmp1:amd64 (2.4+20151223.gitfa8646d.1-2build7) ... 2026-02-21T08:06:10.9630662Z Setting up libk5crypto3:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:10.9643536Z Setting up git-man (1:2.43.0-1ubuntu7.3) ... 2026-02-21T08:06:10.9650984Z Setting up libkrb5-3:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:10.9662000Z Setting up libfido2-1:amd64 (1.14.0-1build3) ... 2026-02-21T08:06:10.9667987Z Setting up libbsd0:amd64 (0.12.1-1build1.1) ... 2026-02-21T08:06:10.9677631Z Setting up publicsuffix (20231001.0357-0.1) ... 2026-02-21T08:06:10.9692773Z Setting up libxdmcp6:amd64 (1:1.1.3-0ubuntu6) ... 2026-02-21T08:06:10.9704849Z Setting up libxcb1:amd64 (1.15-1ubuntu2) ... 2026-02-21T08:06:10.9718535Z Setting up libedit2:amd64 (3.1-20230828-1build1) ... 2026-02-21T08:06:10.9726398Z Setting up libgssapi-krb5-2:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:10.9748884Z Setting up libssh-4:amd64 (0.10.6-2ubuntu0.3) ... 2026-02-21T08:06:10.9766647Z Setting up libx11-6:amd64 (2:1.8.7-1build1) ... 
2026-02-21T08:06:10.9775248Z Setting up libxmuu1:amd64 (2:1.1.3-3build2) ... 2026-02-21T08:06:10.9787688Z Setting up openssh-client (1:9.6p1-3ubuntu13.14) ... 2026-02-21T08:06:11.0218466Z Setting up libcurl3t64-gnutls:amd64 (8.5.0-2ubuntu10.6) ... 2026-02-21T08:06:11.0229298Z Setting up libxext6:amd64 (2:1.3.4-1build2) ... 2026-02-21T08:06:11.0241539Z Setting up git (1:2.43.0-1ubuntu7.3) ... 2026-02-21T08:06:11.0295807Z Setting up xauth (1:1.1.2-1build1) ... 2026-02-21T08:06:11.0324617Z Processing triggers for libc-bin (2.39-0ubuntu8.5) ... 2026-02-21T08:06:11.0650893Z ##[group]Run actions/checkout@v6 2026-02-21T08:06:11.0651176Z with: 2026-02-21T08:06:11.0651384Z repository: pytorch/helion 2026-02-21T08:06:11.0651826Z token: *** 2026-02-21T08:06:11.0652071Z ssh-strict: true 2026-02-21T08:06:11.0652294Z ssh-user: git 2026-02-21T08:06:11.0652492Z persist-credentials: true 2026-02-21T08:06:11.0652748Z clean: true 2026-02-21T08:06:11.0652981Z sparse-checkout-cone-mode: true 2026-02-21T08:06:11.0653199Z fetch-depth: 1 2026-02-21T08:06:11.0653419Z fetch-tags: false 2026-02-21T08:06:11.0653611Z show-progress: true 2026-02-21T08:06:11.0653993Z lfs: false 2026-02-21T08:06:11.0654329Z submodules: false 2026-02-21T08:06:11.0654552Z set-safe-directory: true 2026-02-21T08:06:11.0654735Z env: 2026-02-21T08:06:11.0654978Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:11.0655213Z ##[endgroup] 2026-02-21T08:06:11.0688386Z ##[command]/usr/bin/docker exec 748244fffba051af5f5cff673877559f5b5ba31ac47e4c49ff2680b83d7c9486 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:06:11.2454304Z Syncing repository: pytorch/helion 2026-02-21T08:06:11.2455329Z ##[group]Getting Git version info 2026-02-21T08:06:11.2455640Z Working directory is '/__w/helion/helion' 2026-02-21T08:06:11.2456086Z [command]/usr/bin/git version 2026-02-21T08:06:11.2456309Z git version 2.43.0 2026-02-21T08:06:11.2470678Z ##[endgroup] 2026-02-21T08:06:11.2476734Z Temporarily overriding 
HOME='/__w/_temp/d6aea2a3-ace9-48e1-8733-3775e7da028e' before making global git config changes 2026-02-21T08:06:11.2477292Z Adding repository directory to the temporary git global config as a safe directory 2026-02-21T08:06:11.2479079Z [command]/usr/bin/git config --global --add safe.directory /__w/helion/helion 2026-02-21T08:06:11.2506592Z Deleting the contents of '/__w/helion/helion' 2026-02-21T08:06:11.2507611Z ##[group]Initializing the repository 2026-02-21T08:06:11.2508542Z [command]/usr/bin/git init /__w/helion/helion 2026-02-21T08:06:11.2534257Z hint: Using 'master' as the name for the initial branch. This default branch name 2026-02-21T08:06:11.2534712Z hint: is subject to change. To configure the initial branch name to use in all 2026-02-21T08:06:11.2535101Z hint: of your new repositories, which will suppress this warning, call: 2026-02-21T08:06:11.2535459Z hint: 2026-02-21T08:06:11.2535675Z hint: git config --global init.defaultBranch 2026-02-21T08:06:11.2535982Z hint: 2026-02-21T08:06:11.2536221Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and 2026-02-21T08:06:11.2536599Z hint: 'development'. 
The just-created branch can be renamed via this command: 2026-02-21T08:06:11.2536955Z hint: 2026-02-21T08:06:11.2537139Z hint: git branch -m 2026-02-21T08:06:11.2538292Z Initialized empty Git repository in /__w/helion/helion/.git/ 2026-02-21T08:06:11.2543555Z [command]/usr/bin/git remote add origin https://github.com/pytorch/helion 2026-02-21T08:06:11.2569775Z ##[endgroup] 2026-02-21T08:06:11.2570148Z ##[group]Disabling automatic garbage collection 2026-02-21T08:06:11.2570467Z [command]/usr/bin/git config --local gc.auto 0 2026-02-21T08:06:11.2600212Z ##[endgroup] 2026-02-21T08:06:11.2600594Z ##[group]Setting up auth 2026-02-21T08:06:11.2600906Z Removing SSH command configuration 2026-02-21T08:06:11.2601295Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2026-02-21T08:06:11.2628077Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2026-02-21T08:06:11.2848598Z Removing HTTP extra header 2026-02-21T08:06:11.2849009Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2026-02-21T08:06:11.2872899Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2026-02-21T08:06:11.3089862Z Removing includeIf entries pointing to credentials config files 2026-02-21T08:06:11.3098204Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2026-02-21T08:06:11.3120698Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2026-02-21T08:06:11.3331372Z [command]/usr/bin/git config --file /__w/_temp/git-credentials-131e5f73-3f26-44a5-8995-5ab231ffa76d.config http.https://github.com/.extraheader 
AUTHORIZATION: basic *** 2026-02-21T08:06:11.3370527Z [command]/usr/bin/git config --local includeIf.gitdir:/__w/helion/helion/.git.path /__w/_temp/git-credentials-131e5f73-3f26-44a5-8995-5ab231ffa76d.config 2026-02-21T08:06:11.3392602Z [command]/usr/bin/git config --local includeIf.gitdir:/__w/helion/helion/.git/worktrees/*.path /__w/_temp/git-credentials-131e5f73-3f26-44a5-8995-5ab231ffa76d.config 2026-02-21T08:06:11.3417216Z [command]/usr/bin/git config --local includeIf.gitdir:/github/workspace/.git.path /github/runner_temp/git-credentials-131e5f73-3f26-44a5-8995-5ab231ffa76d.config 2026-02-21T08:06:11.3441611Z [command]/usr/bin/git config --local includeIf.gitdir:/github/workspace/.git/worktrees/*.path /github/runner_temp/git-credentials-131e5f73-3f26-44a5-8995-5ab231ffa76d.config 2026-02-21T08:06:11.3463132Z ##[endgroup] 2026-02-21T08:06:11.3463527Z ##[group]Fetching the repository 2026-02-21T08:06:11.3467831Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +874a7d0cadab18218a84ad3579d329dc95c51820:refs/remotes/origin/main 2026-02-21T08:06:11.8172875Z From https://github.com/pytorch/helion 2026-02-21T08:06:11.8175189Z * [new ref] 874a7d0cadab18218a84ad3579d329dc95c51820 -> origin/main 2026-02-21T08:06:11.8196783Z [command]/usr/bin/git branch --list --remote origin/main 2026-02-21T08:06:11.8228337Z origin/main 2026-02-21T08:06:11.8231025Z [command]/usr/bin/git rev-parse refs/remotes/origin/main 2026-02-21T08:06:11.8251234Z 874a7d0cadab18218a84ad3579d329dc95c51820 2026-02-21T08:06:11.8264635Z ##[endgroup] 2026-02-21T08:06:11.8265775Z ##[group]Determining the checkout info 2026-02-21T08:06:11.8266212Z ##[endgroup] 2026-02-21T08:06:11.8266499Z [command]/usr/bin/git sparse-checkout disable 2026-02-21T08:06:11.8285000Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig 2026-02-21T08:06:11.8311493Z ##[group]Checking out the ref 2026-02-21T08:06:11.8312162Z [command]/usr/bin/git 
checkout --progress --force -B main refs/remotes/origin/main 2026-02-21T08:06:11.8536627Z Switched to a new branch 'main' 2026-02-21T08:06:11.8541947Z branch 'main' set up to track 'origin/main'. 2026-02-21T08:06:11.8549016Z ##[endgroup] 2026-02-21T08:06:11.8573535Z [command]/usr/bin/git log -1 --format=%H 2026-02-21T08:06:11.8594901Z 874a7d0cadab18218a84ad3579d329dc95c51820 2026-02-21T08:06:11.8773940Z ##[group]Run actions/setup-python@v6 2026-02-21T08:06:11.8774273Z with: 2026-02-21T08:06:11.8774571Z python-version: 3.12 2026-02-21T08:06:11.8774854Z check-latest: false 2026-02-21T08:06:11.8775270Z token: *** 2026-02-21T08:06:11.8775580Z update-environment: true 2026-02-21T08:06:11.8775863Z allow-prereleases: false 2026-02-21T08:06:11.8776207Z freethreaded: false 2026-02-21T08:06:11.8776471Z env: 2026-02-21T08:06:11.8776724Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:11.8776959Z ##[endgroup] 2026-02-21T08:06:11.8781612Z ##[command]/usr/bin/docker exec 748244fffba051af5f5cff673877559f5b5ba31ac47e4c49ff2680b83d7c9486 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:06:12.0900961Z ##[group]Installed versions 2026-02-21T08:06:12.0910624Z Version 3.12 was not found in the local cache 2026-02-21T08:06:12.7882654Z Version 3.12 is available for downloading 2026-02-21T08:06:12.7883314Z Download from "https://github.com/actions/python-versions/releases/download/3.12.12-18393146713/python-3.12.12-linux-24.04-x64.tar.gz" 2026-02-21T08:06:13.6840060Z Extract downloaded archive 2026-02-21T08:06:13.6941131Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /__w/_temp/229b0fb7-ef07-440e-b0bf-1785344de81f -f /__w/_temp/ea271516-a3da-461d-9e48-f37819372d44 2026-02-21T08:06:15.4813044Z Execute installation script 2026-02-21T08:06:15.4918463Z Check if Python hostedtoolcache folder exist... 2026-02-21T08:06:15.4921511Z Creating Python hostedtoolcache folder... 
2026-02-21T08:06:15.4924238Z Create Python 3.12.12 folder 2026-02-21T08:06:15.4932730Z Copy Python binaries to hostedtoolcache folder 2026-02-21T08:06:15.7463680Z Create additional symlinks (Required for the UsePythonVersion Azure Pipelines task and the setup-python GitHub Action) 2026-02-21T08:06:15.7498472Z Upgrading pip... 2026-02-21T08:06:17.1042456Z Looking in links: /tmp/tmpxs1cyve_ 2026-02-21T08:06:17.1043632Z Requirement already satisfied: pip in /__w/_tool/Python/3.12.12/x64/lib/python3.12/site-packages (25.0.1) 2026-02-21T08:06:17.1082411Z ##[error]WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning. 2026-02-21T08:06:17.6846031Z ##[error]WARNING: The directory '/github/home/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag. 
2026-02-21T08:06:17.8295314Z Collecting pip 2026-02-21T08:06:17.8561257Z Downloading pip-26.0.1-py3-none-any.whl.metadata (4.7 kB) 2026-02-21T08:06:17.8649886Z Downloading pip-26.0.1-py3-none-any.whl (1.8 MB) 2026-02-21T08:06:17.8953215Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 97.5 MB/s eta 0:00:00 2026-02-21T08:06:17.9040246Z Installing collected packages: pip 2026-02-21T08:06:17.9044094Z Attempting uninstall: pip 2026-02-21T08:06:17.9053345Z Found existing installation: pip 25.0.1 2026-02-21T08:06:17.9226642Z Uninstalling pip-25.0.1: 2026-02-21T08:06:17.9259028Z Successfully uninstalled pip-25.0.1 2026-02-21T08:06:18.5056466Z Successfully installed pip-26.0.1 2026-02-21T08:06:18.5546149Z Create complete file 2026-02-21T08:06:18.5580486Z Successfully set up CPython (3.12.12) 2026-02-21T08:06:18.5580938Z ##[endgroup] 2026-02-21T08:06:18.5765183Z ##[group]Run astral-sh/setup-uv@v7 2026-02-21T08:06:18.5765448Z with: 2026-02-21T08:06:18.5765629Z activate-environment: false 2026-02-21T08:06:18.5765948Z working-directory: /home/bob/_work/helion/helion 2026-02-21T08:06:18.5766327Z github-token: *** 2026-02-21T08:06:18.5766535Z enable-cache: auto 2026-02-21T08:06:18.5767044Z cache-dependency-glob: **/*requirements*.txt **/*requirements*.in **/*constraints*.txt **/*constraints*.in **/pyproject.toml **/uv.lock **/*.py.lock 2026-02-21T08:06:18.5767502Z restore-cache: true 2026-02-21T08:06:18.5767794Z save-cache: true 2026-02-21T08:06:18.5768014Z prune-cache: true 2026-02-21T08:06:18.5768233Z cache-python: false 2026-02-21T08:06:18.5768450Z ignore-nothing-to-cache: false 2026-02-21T08:06:18.5768723Z ignore-empty-workdir: false 2026-02-21T08:06:18.5768977Z add-problem-matchers: true 2026-02-21T08:06:18.5769205Z resolution-strategy: highest 2026-02-21T08:06:18.5769459Z env: 2026-02-21T08:06:18.5769730Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:18.5770017Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:18.5770323Z PKG_CONFIG_PATH: 
/__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:18.5770649Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:18.5770901Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:18.5771227Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:18.5771664Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:18.5772072Z ##[endgroup] 2026-02-21T08:06:18.5778250Z ##[command]/usr/bin/docker exec 748244fffba051af5f5cff673877559f5b5ba31ac47e4c49ff2680b83d7c9486 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:06:18.8002898Z (node:799) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead. 2026-02-21T08:06:18.8003619Z (Use `node --trace-deprecation ...` to show where the warning was created) 2026-02-21T08:06:18.8112488Z Trying to find version for uv in: /__w/helion/helion/uv.toml 2026-02-21T08:06:18.8114700Z Could not find file: /__w/helion/helion/uv.toml 2026-02-21T08:06:18.8115288Z Trying to find version for uv in: /__w/helion/helion/pyproject.toml 2026-02-21T08:06:18.8121489Z Could not determine uv version from uv.toml or pyproject.toml. Falling back to latest. 2026-02-21T08:06:18.8129318Z Getting latest version from GitHub API... 2026-02-21T08:06:19.0804584Z manifest-file not provided, reading from local file. 2026-02-21T08:06:19.0833512Z manifest-file does not contain version 0.10.4, arch x86_64, platform unknown-linux-gnu. Falling back to GitHub releases. 2026-02-21T08:06:19.0834238Z Downloading uv from "https://github.com/astral-sh/uv/releases/download/0.10.4/uv-x86_64-unknown-linux-gnu.tar.gz" ... 
2026-02-21T08:06:19.3695293Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /__w/_temp/9e3ecc23-80cd-4bef-b82e-0839adf2df2c -f /__w/_temp/fb822711-3491-44db-bcbc-e5af75f59079 2026-02-21T08:06:19.7584353Z Added /github/home/.local/bin to the path 2026-02-21T08:06:19.7588327Z Added /__w/_tool/uv/0.10.4/x86_64 to the path 2026-02-21T08:06:19.7589706Z Set UV_PYTHON_INSTALL_DIR to /github/home/.local/share/uv/python 2026-02-21T08:06:19.7590082Z Added /github/home/.local/share/uv/python to the path 2026-02-21T08:06:19.7595892Z Successfully installed uv version 0.10.4 2026-02-21T08:06:19.9022990Z ##[group]Run uv venv --python 3.12 2026-02-21T08:06:19.9023378Z uv venv --python 3.12 2026-02-21T08:06:19.9023917Z shell: bash -l {0} 2026-02-21T08:06:19.9024139Z env: 2026-02-21T08:06:19.9024383Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:19.9024656Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:19.9025067Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:19.9025427Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:19.9025689Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:19.9026030Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:19.9026453Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:19.9026910Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:06:19.9027233Z ##[endgroup] 2026-02-21T08:06:20.0403644Z Using CPython 3.12.12 interpreter at: /__w/_tool/Python/3.12.12/x64/bin/python3.12 2026-02-21T08:06:20.0404143Z Creating virtual environment at: .venv 2026-02-21T08:06:20.0404499Z Activate with: source .venv/bin/activate 2026-02-21T08:06:20.0471174Z ##[group]Run source .venv/bin/activate 2026-02-21T08:06:20.0471457Z source .venv/bin/activate 2026-02-21T08:06:20.0471968Z uv pip install -U "torch==2.9.*" --index-url 
https://download.pytorch.org/whl/cu130 2026-02-21T08:06:20.0472441Z shell: bash -l {0} 2026-02-21T08:06:20.0472606Z env: 2026-02-21T08:06:20.0472845Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:20.0473094Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:20.0473398Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:20.0473740Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:20.0474010Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:20.0474297Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:20.0474705Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:20.0475239Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:06:20.0475491Z ##[endgroup] 2026-02-21T08:06:20.9763800Z Resolved 26 packages in 831ms 2026-02-21T08:06:20.9800178Z Downloading networkx (2.0MiB) 2026-02-21T08:06:20.9836434Z Downloading sympy (6.0MiB) 2026-02-21T08:06:20.9860458Z Downloading nvidia-cuda-runtime (2.1MiB) 2026-02-21T08:06:20.9954618Z Downloading nvidia-cuda-cupti (10.2MiB) 2026-02-21T08:06:20.9960701Z Downloading nvidia-cusolver (184.5MiB) 2026-02-21T08:06:21.0032722Z Downloading torch (584.2MiB) 2026-02-21T08:06:21.0033038Z Downloading triton (162.6MiB) 2026-02-21T08:06:21.0161808Z Downloading nvidia-curand (56.8MiB) 2026-02-21T08:06:21.0211477Z Downloading nvidia-cufft (204.2MiB) 2026-02-21T08:06:21.0263090Z Downloading nvidia-cufile (1.2MiB) 2026-02-21T08:06:21.0284961Z Downloading nvidia-nvjitlink (38.8MiB) 2026-02-21T08:06:21.0358746Z Downloading nvidia-cudnn-cu13 (332.4MiB) 2026-02-21T08:06:21.0603458Z Downloading nvidia-cusparse (133.8MiB) 2026-02-21T08:06:21.0705577Z Downloading nvidia-nvshmem-cu13 (57.6MiB) 2026-02-21T08:06:21.0893149Z Downloading nvidia-cusparselt-cu13 (162.0MiB) 2026-02-21T08:06:21.0932191Z Downloading nvidia-cuda-nvrtc (86.0MiB) 2026-02-21T08:06:21.1042355Z 
Downloading nvidia-nccl-cu13 (184.9MiB) 2026-02-21T08:06:21.1065809Z Downloading nvidia-cublas (400.0MiB) 2026-02-21T08:06:21.3500375Z Downloaded nvidia-cufile 2026-02-21T08:06:21.5194083Z Downloaded nvidia-cuda-runtime 2026-02-21T08:06:22.1082744Z Downloaded networkx 2026-02-21T08:06:22.5674853Z Downloaded nvidia-cuda-cupti 2026-02-21T08:06:23.8687984Z Downloaded sympy 2026-02-21T08:06:24.1532195Z Downloaded triton 2026-02-21T08:06:25.1670539Z Downloaded nvidia-nvjitlink 2026-02-21T08:06:26.0285560Z Downloaded nvidia-nvshmem-cu13 2026-02-21T08:06:26.2461622Z Downloaded nvidia-curand 2026-02-21T08:06:27.2708246Z Downloaded nvidia-cuda-nvrtc 2026-02-21T08:06:28.5464510Z Downloaded nvidia-cusolver 2026-02-21T08:06:28.8163604Z Downloaded nvidia-cusparse 2026-02-21T08:06:29.4442935Z Downloaded nvidia-cufft 2026-02-21T08:06:29.6460138Z Downloaded nvidia-cusparselt-cu13 2026-02-21T08:06:29.8095507Z Downloaded nvidia-nccl-cu13 2026-02-21T08:06:30.9988984Z Downloaded nvidia-cudnn-cu13 2026-02-21T08:06:31.6871117Z Downloaded nvidia-cublas 2026-02-21T08:06:35.9702484Z Downloaded torch 2026-02-21T08:06:36.1295339Z Prepared 26 packages in 15.15s 2026-02-21T08:06:36.1336268Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:06:36.1336893Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:06:36.1337451Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:06:36.8837177Z Installed 26 packages in 753ms 2026-02-21T08:06:36.8837560Z + filelock==3.20.0 2026-02-21T08:06:36.8837758Z + fsspec==2025.12.0 2026-02-21T08:06:36.8838001Z + jinja2==3.1.6 2026-02-21T08:06:36.8838197Z + markupsafe==3.0.2 2026-02-21T08:06:36.8838736Z + mpmath==1.3.0 2026-02-21T08:06:36.8838923Z + networkx==3.6.1 2026-02-21T08:06:36.8839205Z + nvidia-cublas==13.0.0.19 2026-02-21T08:06:36.8839437Z + nvidia-cuda-cupti==13.0.48 2026-02-21T08:06:36.8839689Z + nvidia-cuda-nvrtc==13.0.48 2026-02-21T08:06:36.8839964Z + nvidia-cuda-runtime==13.0.48 2026-02-21T08:06:36.8840189Z + nvidia-cudnn-cu13==9.13.0.50 2026-02-21T08:06:36.8840443Z + nvidia-cufft==12.0.0.15 2026-02-21T08:06:36.8840684Z + nvidia-cufile==1.15.0.42 2026-02-21T08:06:36.8840923Z + nvidia-curand==10.4.0.35 2026-02-21T08:06:36.8841144Z + nvidia-cusolver==12.0.3.29 2026-02-21T08:06:36.8841405Z + nvidia-cusparse==12.6.2.49 2026-02-21T08:06:36.8841639Z + nvidia-cusparselt-cu13==0.8.0 2026-02-21T08:06:36.8842135Z + nvidia-nccl-cu13==2.27.7 2026-02-21T08:06:36.8842406Z + nvidia-nvjitlink==13.0.39 2026-02-21T08:06:36.8843145Z + nvidia-nvshmem-cu13==3.3.24 2026-02-21T08:06:36.8843402Z + nvidia-nvtx==13.0.39 2026-02-21T08:06:36.8843625Z + setuptools==70.2.0 2026-02-21T08:06:36.8843881Z + sympy==1.14.0 2026-02-21T08:06:36.8844069Z + torch==2.9.1+cu130 2026-02-21T08:06:36.8844301Z + triton==3.5.1 2026-02-21T08:06:36.8844508Z + typing-extensions==4.15.0 2026-02-21T08:06:36.8966312Z ##[group]Run source .venv/bin/activate 2026-02-21T08:06:36.8966662Z source .venv/bin/activate 2026-02-21T08:06:36.8967023Z SETUPTOOLS_SCM_PRETEND_VERSION="0.0.0" uv pip install -e .'[dev]' 2026-02-21T08:06:36.8967428Z python -c "import helion; print(helion.__name__)" 2026-02-21T08:06:36.8967902Z shell: bash -l {0} 2026-02-21T08:06:36.8968159Z env: 2026-02-21T08:06:36.8968387Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:36.8968658Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:36.8969188Z 
PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:36.8969495Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:36.8969812Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:36.8970088Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:36.8970647Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:36.8971082Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:06:36.8971418Z ##[endgroup] 2026-02-21T08:06:37.9064166Z Resolved 30 packages in 911ms 2026-02-21T08:06:37.9075005Z Building helion @ file:///__w/helion/helion 2026-02-21T08:06:37.9146500Z Downloading virtualenv (5.6MiB) 2026-02-21T08:06:37.9174751Z Downloading scikit-learn (8.5MiB) 2026-02-21T08:06:37.9246068Z Downloading pygments (1.2MiB) 2026-02-21T08:06:37.9350303Z Downloading scipy (33.4MiB) 2026-02-21T08:06:37.9381168Z Downloading numpy (15.8MiB) 2026-02-21T08:06:38.0602734Z Built helion @ file:///__w/helion/helion 2026-02-21T08:06:38.1173983Z Downloaded virtualenv 2026-02-21T08:06:38.1596287Z Downloaded pygments 2026-02-21T08:06:38.6022797Z Downloaded scikit-learn 2026-02-21T08:06:38.6064348Z Downloaded numpy 2026-02-21T08:06:38.8377580Z Downloaded scipy 2026-02-21T08:06:38.8385606Z Prepared 27 packages in 931ms 2026-02-21T08:06:38.8388094Z Uninstalled 1 package in 0.57ms 2026-02-21T08:06:38.8392476Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:06:38.8393148Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:06:38.8393713Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:06:38.9787319Z Installed 29 packages in 139ms 2026-02-21T08:06:38.9792987Z + cfgv==3.5.0 2026-02-21T08:06:38.9794367Z + distlib==0.4.0 2026-02-21T08:06:38.9794629Z + expecttest==0.3.0 2026-02-21T08:06:38.9795036Z + filecheck==1.0.3 2026-02-21T08:06:38.9795237Z - filelock==3.20.0 2026-02-21T08:06:38.9795469Z + filelock==3.24.3 2026-02-21T08:06:38.9795732Z + helion==0.0.0 (from file:///__w/helion/helion) 2026-02-21T08:06:38.9795991Z + hypothesis==6.151.9 2026-02-21T08:06:38.9796225Z + identify==2.6.16 2026-02-21T08:06:38.9796789Z + iniconfig==2.3.0 2026-02-21T08:06:38.9797051Z + joblib==1.5.3 2026-02-21T08:06:38.9797247Z + markdown-it-py==4.0.0 2026-02-21T08:06:38.9797492Z + mdurl==0.1.2 2026-02-21T08:06:38.9797677Z + nodeenv==1.10.0 2026-02-21T08:06:38.9797890Z + numpy==2.4.2 2026-02-21T08:06:38.9798100Z + packaging==26.0 2026-02-21T08:06:38.9798291Z + platformdirs==4.9.2 2026-02-21T08:06:38.9798522Z + pluggy==1.6.0 2026-02-21T08:06:38.9798688Z + pre-commit==4.5.1 2026-02-21T08:06:38.9798927Z + psutil==7.2.2 2026-02-21T08:06:38.9799110Z + pygments==2.19.2 2026-02-21T08:06:38.9799322Z + pytest==9.0.2 2026-02-21T08:06:38.9799535Z + pytest-timeout==2.4.0 2026-02-21T08:06:38.9799769Z + pyyaml==6.0.3 2026-02-21T08:06:38.9799957Z + rich==14.3.3 2026-02-21T08:06:38.9800157Z + scikit-learn==1.8.0 2026-02-21T08:06:38.9800385Z + scipy==1.17.0 2026-02-21T08:06:38.9800567Z + sortedcontainers==2.4.0 2026-02-21T08:06:38.9800818Z + threadpoolctl==3.6.0 2026-02-21T08:06:38.9801018Z + virtualenv==20.38.0 2026-02-21T08:06:49.9923243Z helion 2026-02-21T08:06:50.6910615Z ##[group]Run set -x 2026-02-21T08:06:50.6910844Z set -x 2026-02-21T08:06:50.6911111Z source .venv/bin/activate 2026-02-21T08:06:50.6911372Z uv pip install pip 2026-02-21T08:06:50.6911605Z uv pip install quack-kernels --no-deps 2026-02-21T08:06:50.6911983Z mkdir -p benchmarks/ && pushd benchmarks/ 2026-02-21T08:06:50.6912297Z git clone https://github.com/pytorch-labs/tritonbench/ 
2026-02-21T08:06:50.6912598Z pushd tritonbench/ 2026-02-21T08:06:50.6913004Z git submodule update --init --recursive 2026-02-21T08:06:50.6913293Z uv pip install -r requirements.txt 2026-02-21T08:06:50.6913564Z python install.py --liger 2026-02-21T08:06:50.6913823Z uv pip install -e . --no-deps 2026-02-21T08:06:50.6914082Z popd 2026-02-21T08:06:50.6914265Z popd 2026-02-21T08:06:50.6914590Z shell: bash -l {0} 2026-02-21T08:06:50.6914769Z env: 2026-02-21T08:06:50.6915075Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:50.6915325Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:50.6915651Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:50.6915919Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:50.6916231Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:50.6916519Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:50.6916897Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:50.6917432Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:06:50.6917682Z ##[endgroup] 2026-02-21T08:06:51.0698547Z + source .venv/bin/activate 2026-02-21T08:06:51.0700480Z ++ '[' -z '' ']' 2026-02-21T08:06:51.0700771Z ++ '[' -n x ']' 2026-02-21T08:06:51.0701054Z ++ SCRIPT_PATH=.venv/bin/activate 2026-02-21T08:06:51.0701470Z ++ '[' .venv/bin/activate = /__w/_temp/e78c4f0e-59c7-4ede-81c7-a3bf964202ff.sh ']' 2026-02-21T08:06:51.0702176Z ++ deactivate nondestructive 2026-02-21T08:06:51.0702452Z ++ unset -f pydoc 2026-02-21T08:06:51.0702722Z ++ '[' -z '' ']' 2026-02-21T08:06:51.0702989Z ++ '[' -z '' ']' 2026-02-21T08:06:51.0703238Z ++ hash -r 2026-02-21T08:06:51.0703459Z ++ '[' -z '' ']' 2026-02-21T08:06:51.0703701Z ++ unset VIRTUAL_ENV 2026-02-21T08:06:51.0703972Z ++ unset VIRTUAL_ENV_PROMPT 2026-02-21T08:06:51.0704241Z ++ '[' '!' 
nondestructive = nondestructive ']' 2026-02-21T08:06:51.0704590Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv 2026-02-21T08:06:51.0704886Z ++ '[' linux-gnu = cygwin ']' 2026-02-21T08:06:51.0707861Z ++ '[' linux-gnu = msys ']' 2026-02-21T08:06:51.0708198Z ++ export VIRTUAL_ENV 2026-02-21T08:06:51.0708470Z ++ '[' -z '' ']' 2026-02-21T08:06:51.0708654Z ++ unset SCRIPT_PATH 2026-02-21T08:06:51.0709525Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:06:51.0710787Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:06:51.0711493Z ++ export PATH 2026-02-21T08:06:51.0711775Z ++ '[' xhelion '!=' x ']' 2026-02-21T08:06:51.0712083Z ++ VIRTUAL_ENV_PROMPT=helion 2026-02-21T08:06:51.0712371Z ++ export VIRTUAL_ENV_PROMPT 2026-02-21T08:06:51.0712684Z ++ '[' -z '' ']' 2026-02-21T08:06:51.0712889Z ++ '[' -z '' ']' 2026-02-21T08:06:51.0713116Z ++ _OLD_VIRTUAL_PS1= 2026-02-21T08:06:51.0713384Z ++ PS1='(helion) ' 2026-02-21T08:06:51.0713619Z ++ export PS1 2026-02-21T08:06:51.0713807Z ++ alias pydoc 2026-02-21T08:06:51.0714046Z ++ true 2026-02-21T08:06:51.0714221Z ++ hash -r 2026-02-21T08:06:51.0714753Z + uv pip install pip 2026-02-21T08:06:51.2964567Z Resolved 1 package in 217ms 2026-02-21T08:06:51.2999116Z Downloading pip (1.7MiB) 2026-02-21T08:06:51.4959352Z Downloaded pip 2026-02-21T08:06:51.4959774Z Prepared 1 package in 199ms 2026-02-21T08:06:51.5006028Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 
2026-02-21T08:06:51.5006631Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:06:51.5007226Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T08:06:52.0771143Z Installed 1 package in 580ms 2026-02-21T08:06:52.0778245Z + pip==26.0.1 2026-02-21T08:06:52.0805532Z + uv pip install quack-kernels --no-deps 2026-02-21T08:06:52.1022139Z Resolved 1 package in 13ms 2026-02-21T08:06:52.1477548Z Prepared 1 package in 45ms 2026-02-21T08:06:52.1520864Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:06:52.1521463Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:06:52.1522205Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T08:06:52.3162198Z Installed 1 package in 167ms 2026-02-21T08:06:52.3162503Z + quack-kernels==0.2.10 2026-02-21T08:06:52.3186561Z + mkdir -p benchmarks/ 2026-02-21T08:06:52.3190536Z + pushd benchmarks/ 2026-02-21T08:06:52.3190927Z + git clone https://github.com/pytorch-labs/tritonbench/ 2026-02-21T08:06:52.3191590Z /__w/helion/helion/benchmarks /__w/helion/helion 2026-02-21T08:06:52.3201831Z Cloning into 'tritonbench'... 
2026-02-21T08:06:53.8077019Z + pushd tritonbench/ 2026-02-21T08:06:53.8077571Z /__w/helion/helion/benchmarks/tritonbench /__w/helion/helion/benchmarks /__w/helion/helion 2026-02-21T08:06:53.8078229Z + git submodule update --init --recursive 2026-02-21T08:06:54.0607674Z Submodule 'submodules/ThunderKittens' (https://github.com/HazyResearch/ThunderKittens.git) registered for path 'submodules/ThunderKittens' 2026-02-21T08:06:54.0969245Z Submodule 'submodules/aiter' (https://github.com/ROCm/aiter.git) registered for path 'submodules/aiter' 2026-02-21T08:06:54.2054006Z Submodule 'submodules/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/cutlass' 2026-02-21T08:06:54.3060193Z Submodule 'submodules/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'submodules/flash-attention' 2026-02-21T08:06:54.3744070Z Submodule 'submodules/generative-recommenders' (https://github.com/facebookresearch/generative-recommenders.git) registered for path 'submodules/generative-recommenders' 2026-02-21T08:06:54.4598689Z Submodule 'submodules/xformers' (https://github.com/facebookresearch/xformers.git) registered for path 'submodules/xformers' 2026-02-21T08:06:54.4629761Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/ThunderKittens'... 2026-02-21T08:06:56.9370137Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/aiter'... 2026-02-21T08:07:08.5459291Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/cutlass'... 2026-02-21T08:07:12.2751554Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention'... 2026-02-21T08:07:13.0844192Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/generative-recommenders'... 2026-02-21T08:07:13.5590576Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers'... 
2026-02-21T08:07:15.2024646Z Submodule path 'submodules/ThunderKittens': checked out '25f7568450b412a1984a4f619fb28373df06fa1b' 2026-02-21T08:07:15.5050736Z Submodule path 'submodules/aiter': checked out '1f5b378dcc9d9b0bcd9456c8c767b7424a5e8190' 2026-02-21T08:07:15.5070930Z Submodule '3rdparty/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/aiter/3rdparty/composable_kernel' 2026-02-21T08:07:15.5096518Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/aiter/3rdparty/composable_kernel'... 2026-02-21T08:07:19.8749946Z Submodule path 'submodules/aiter/3rdparty/composable_kernel': checked out 'e31a7a4f29b371c32ea9daf9211b6ae1fed2fa40' 2026-02-21T08:07:20.3233638Z Submodule path 'submodules/cutlass': checked out 'ad7b2f5e84fcfa124cb02b91d5bd26d238c0459e' 2026-02-21T08:07:20.3864643Z Submodule path 'submodules/flash-attention': checked out '43375aab2893018dfb7950db1cfa623c14946ad6' 2026-02-21T08:07:20.3880521Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/flash-attention/csrc/composable_kernel' 2026-02-21T08:07:20.3881584Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/flash-attention/csrc/cutlass' 2026-02-21T08:07:20.3904594Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention/csrc/composable_kernel'... 2026-02-21T08:07:24.4919943Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention/csrc/cutlass'... 
2026-02-21T08:07:28.8529522Z Submodule path 'submodules/flash-attention/csrc/composable_kernel': checked out 'e8709c24f403173ad21a2da907d1347957e324fb' 2026-02-21T08:07:29.3542581Z Submodule path 'submodules/flash-attention/csrc/cutlass': checked out 'b1d6e2c9b334dfa811e4183dfbd02419249e4b52' 2026-02-21T08:07:29.3817546Z Submodule path 'submodules/generative-recommenders': checked out '88512dbd71b053226bc4ef8ec1630e3db53e55e5' 2026-02-21T08:07:29.3833679Z Submodule 'generative_recommenders/ops/cpp/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass' 2026-02-21T08:07:29.3859543Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass'... 2026-02-21T08:07:33.3991628Z Submodule path 'submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass': checked out 'dc4817921edda44a549197ff3a9dcf5df0636e7b' 2026-02-21T08:07:33.4578709Z Submodule path 'submodules/xformers': checked out '8fc8ec5a4d6498ff81c0c418b89bbaf133ae3a44' 2026-02-21T08:07:33.4694869Z Submodule 'third_party/composable_kernel_tiled' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/xformers/third_party/composable_kernel_tiled' 2026-02-21T08:07:33.4696526Z Submodule 'third_party/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/xformers/third_party/cutlass' 2026-02-21T08:07:33.4697244Z Submodule 'third_party/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'submodules/xformers/third_party/flash-attention' 2026-02-21T08:07:33.4697883Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/composable_kernel_tiled'... 2026-02-21T08:07:37.6006860Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/cutlass'... 
2026-02-21T08:07:41.2723360Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention'... 2026-02-21T08:07:42.3589264Z Submodule path 'submodules/xformers/third_party/composable_kernel_tiled': checked out '4f54fa30583704f34da2ac50372d524cae6bad7d' 2026-02-21T08:07:42.7911314Z Submodule path 'submodules/xformers/third_party/cutlass': checked out 'e9627ce55b42fd2599f58cd4396da9380954def0' 2026-02-21T08:07:42.8410835Z Submodule path 'submodules/xformers/third_party/flash-attention': checked out '979702c87a8713a8e0a5e9fee122b90d2ef13be5' 2026-02-21T08:07:42.8431966Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/xformers/third_party/flash-attention/csrc/composable_kernel' 2026-02-21T08:07:42.8435893Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/xformers/third_party/flash-attention/csrc/cutlass' 2026-02-21T08:07:42.8456203Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention/csrc/composable_kernel'... 2026-02-21T08:07:47.6133506Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention/csrc/cutlass'... 
2026-02-21T08:07:51.8442728Z Submodule path 'submodules/xformers/third_party/flash-attention/csrc/composable_kernel': checked out '888317e698e9803c62bd38568abc9e05d7709f33' 2026-02-21T08:07:52.2728367Z Submodule path 'submodules/xformers/third_party/flash-attention/csrc/cutlass': checked out 'c506e16788cb08416a4a57e11a9067beeee29420' 2026-02-21T08:07:52.2769507Z + uv pip install -r requirements.txt 2026-02-21T08:07:52.2839123Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:07:52.4156764Z Resolved 30 packages in 130ms 2026-02-21T08:07:52.4272403Z Downloading pillow (6.7MiB) 2026-02-21T08:07:52.4307614Z Downloading matplotlib (8.3MiB) 2026-02-21T08:07:52.4307849Z Downloading fonttools (4.7MiB) 2026-02-21T08:07:52.4308072Z Downloading hf-xet (3.2MiB) 2026-02-21T08:07:52.4311785Z Downloading transformers (10.3MiB) 2026-02-21T08:07:52.4312095Z Downloading tokenizers (3.0MiB) 2026-02-21T08:07:52.4312299Z Downloading kiwisolver (1.4MiB) 2026-02-21T08:07:52.5547305Z Downloaded kiwisolver 2026-02-21T08:07:52.6295595Z Downloaded tokenizers 2026-02-21T08:07:52.6330934Z Downloaded hf-xet 2026-02-21T08:07:52.7767791Z Downloaded pillow 2026-02-21T08:07:52.8081711Z Downloaded fonttools 2026-02-21T08:07:52.9182366Z Downloaded matplotlib 2026-02-21T08:07:53.8476403Z Downloaded transformers 2026-02-21T08:07:53.8482658Z Prepared 23 packages in 1.43s 2026-02-21T08:07:53.8510085Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:07:53.8510638Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:07:53.8511195Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:07:53.9328618Z Installed 23 packages in 84ms 2026-02-21T08:07:53.9328944Z + certifi==2026.1.4 2026-02-21T08:07:53.9329196Z + charset-normalizer==3.4.4 2026-02-21T08:07:53.9329446Z + contourpy==1.3.3 2026-02-21T08:07:53.9329658Z + cycler==0.12.1 2026-02-21T08:07:53.9329858Z + fonttools==4.61.1 2026-02-21T08:07:53.9330072Z + hf-xet==1.2.0 2026-02-21T08:07:53.9330289Z + huggingface-hub==0.36.2 2026-02-21T08:07:53.9330535Z + idna==3.11 2026-02-21T08:07:53.9330724Z + kiwisolver==1.4.9 2026-02-21T08:07:53.9330944Z + matplotlib==3.10.8 2026-02-21T08:07:53.9331209Z + nvidia-ml-py==13.590.48 2026-02-21T08:07:53.9331439Z + pillow==12.1.1 2026-02-21T08:07:53.9331648Z + pyparsing==3.3.2 2026-02-21T08:07:53.9332053Z + python-dateutil==2.9.0.post0 2026-02-21T08:07:53.9332320Z + regex==2026.2.19 2026-02-21T08:07:53.9332517Z + requests==2.32.5 2026-02-21T08:07:53.9332732Z + safetensors==0.7.0 2026-02-21T08:07:53.9332939Z + six==1.17.0 2026-02-21T08:07:53.9333141Z + tabulate==0.9.0 2026-02-21T08:07:53.9333354Z + tokenizers==0.21.4 2026-02-21T08:07:53.9333568Z + tqdm==4.67.3 2026-02-21T08:07:53.9333781Z + transformers==4.53.0 2026-02-21T08:07:53.9333999Z + urllib3==2.6.3 2026-02-21T08:07:53.9413792Z + python install.py --liger 2026-02-21T08:07:58.6462700Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:07:58.6498245Z Audited 6 packages in 3ms 2026-02-21T08:07:58.7057091Z INFO:__main__:[tritonbench] installing liger-kernels... 2026-02-21T08:07:58.7123238Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:07:58.7420696Z Resolved 1 package in 28ms 2026-02-21T08:07:58.7652690Z Prepared 1 package in 23ms 2026-02-21T08:07:58.7677311Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:07:58.7677860Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 
2026-02-21T08:07:58.7678772Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T08:07:58.7708942Z Installed 1 package in 6ms 2026-02-21T08:07:58.7709217Z + liger-kernel-nightly==0.7.0.dev20260219183429 2026-02-21T08:07:58.7740418Z INFO:__main__:[tritonbench] installation complete! 2026-02-21T08:07:59.1606034Z + uv pip install -e . --no-deps 2026-02-21T08:07:59.2023216Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:07:59.2058150Z Resolved 1 package in 2ms 2026-02-21T08:07:59.2070880Z Building tritonbench @ file:///__w/helion/helion/benchmarks/tritonbench 2026-02-21T08:07:59.9329087Z Built tritonbench @ file:///__w/helion/helion/benchmarks/tritonbench 2026-02-21T08:07:59.9347473Z Prepared 1 package in 728ms 2026-02-21T08:07:59.9353572Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:07:59.9355753Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:07:59.9356295Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:07:59.9360097Z Installed 1 package in 0.46ms 2026-02-21T08:07:59.9364969Z + tritonbench==0.0.1 (from file:///__w/helion/helion/benchmarks/tritonbench) 2026-02-21T08:07:59.9432024Z /__w/helion/helion/benchmarks /__w/helion/helion 2026-02-21T08:07:59.9432385Z /__w/helion/helion 2026-02-21T08:07:59.9432889Z + popd 2026-02-21T08:07:59.9433108Z + popd 2026-02-21T08:07:59.9482187Z ##[group]Run rm -rf /tmp/torchinductor_*/ || true 2026-02-21T08:07:59.9482620Z rm -rf /tmp/torchinductor_*/ || true 2026-02-21T08:07:59.9482867Z 2026-02-21T08:07:59.9483053Z source .venv/bin/activate 2026-02-21T08:07:59.9483266Z 2026-02-21T08:07:59.9483473Z TEST_REPORTS_DIR=$(pwd)/test/test-reports 2026-02-21T08:07:59.9483745Z mkdir -p "$TEST_REPORTS_DIR" 2026-02-21T08:07:59.9483985Z echo "$TEST_REPORTS_DIR" 2026-02-21T08:07:59.9484190Z 2026-02-21T08:07:59.9484364Z KERNEL_LIST="welford" 2026-02-21T08:07:59.9484603Z for kernel in ${KERNEL_LIST//,/ }; do 2026-02-21T08:07:59.9484882Z echo "==========================================" 2026-02-21T08:07:59.9485194Z echo "Running benchmark for kernel: $kernel" 2026-02-21T08:07:59.9485479Z echo "==========================================" 2026-02-21T08:07:59.9485726Z 2026-02-21T08:07:59.9486004Z # Get available implementations and baseline for this kernel 2026-02-21T08:07:59.9486527Z KERNEL_INFO=$(python benchmarks/run.py --list-impls-for-benchmark-ci --op $kernel | grep "^$kernel:") 2026-02-21T08:07:59.9487039Z IMPLS=$(echo "$KERNEL_INFO" | sed -n 's/.*impls=\([^ ]*\).*/\1/p') 2026-02-21T08:07:59.9487443Z BASELINE=$(echo "$KERNEL_INFO" | sed -n 's/.*baseline=\([^ ]*\).*/\1/p') 2026-02-21T08:07:59.9487758Z 2026-02-21T08:07:59.9487934Z if [[ -z "$IMPLS" ]]; then 2026-02-21T08:07:59.9488270Z echo "Warning: No implementations found for kernel $kernel, skipping..." 
2026-02-21T08:07:59.9488644Z continue 2026-02-21T08:07:59.9488824Z fi 2026-02-21T08:07:59.9489024Z if [[ -z "$BASELINE" ]]; then 2026-02-21T08:07:59.9489345Z echo "Warning: No baseline found for kernel $kernel, skipping..." 2026-02-21T08:07:59.9489653Z continue 2026-02-21T08:07:59.9489839Z fi 2026-02-21T08:07:59.9490032Z echo "Using baseline: $BASELINE" 2026-02-21T08:07:59.9490346Z echo "Available implementations for $kernel: $IMPLS" 2026-02-21T08:07:59.9490616Z 2026-02-21T08:07:59.9490821Z # Do autotuning but do not record the results 2026-02-21T08:07:59.9491102Z python benchmarks/run.py \ 2026-02-21T08:07:59.9491351Z --op $kernel \ 2026-02-21T08:07:59.9491586Z --metrics speedup,accuracy \ 2026-02-21T08:07:59.9491930Z --latency-measure-mode triton_do_bench \ 2026-02-21T08:07:59.9492204Z --cudagraph \ 2026-02-21T08:07:59.9492414Z --only $IMPLS \ 2026-02-21T08:07:59.9492667Z --only-match-mode prefix-with-baseline \ 2026-02-21T08:07:59.9492934Z --baseline $BASELINE \ 2026-02-21T08:07:59.9493162Z --atol 1e-2 \ 2026-02-21T08:07:59.9493365Z --rtol 1e-2 \ 2026-02-21T08:07:59.9493599Z --input-sample-mode equally-spaced-k \ 2026-02-21T08:07:59.9494018Z --keep-going \ 2026-02-21T08:07:59.9494213Z 2026-02-21T08:07:59.9494382Z 2026-02-21T08:07:59.9494542Z # Relax the GPU 2026-02-21T08:07:59.9494747Z sleep 2m 2026-02-21T08:07:59.9494921Z 2026-02-21T08:07:59.9495124Z # Run again with cache and record results 2026-02-21T08:07:59.9495515Z HELION_PRINT_OUTPUT_CODE=1 HELION_ASSERT_CACHE_HIT=1 python benchmarks/run.py \ 2026-02-21T08:07:59.9495879Z --op $kernel \ 2026-02-21T08:07:59.9496111Z --metrics speedup,accuracy \ 2026-02-21T08:07:59.9496380Z --latency-measure-mode triton_do_bench \ 2026-02-21T08:07:59.9496647Z --cudagraph \ 2026-02-21T08:07:59.9496855Z --only $IMPLS \ 2026-02-21T08:07:59.9497257Z --only-match-mode prefix-with-baseline \ 2026-02-21T08:07:59.9497538Z --baseline $BASELINE \ 2026-02-21T08:07:59.9497769Z --atol 1e-2 \ 
2026-02-21T08:07:59.9497969Z --rtol 1e-2 \ 2026-02-21T08:07:59.9498208Z --input-sample-mode equally-spaced-k \ 2026-02-21T08:07:59.9498526Z --output "$TEST_REPORTS_DIR/helionbench.json" \ 2026-02-21T08:07:59.9498806Z --append-to-output \ 2026-02-21T08:07:59.9499044Z --keep-going \ 2026-02-21T08:07:59.9499234Z 2026-02-21T08:07:59.9499401Z 2026-02-21T08:07:59.9499625Z echo "✅ Completed benchmark for kernel: $kernel" 2026-02-21T08:07:59.9499891Z done 2026-02-21T08:07:59.9500055Z 2026-02-21T08:07:59.9500278Z if [[ ! -s "$TEST_REPORTS_DIR/helionbench.json" ]]; then 2026-02-21T08:07:59.9500612Z echo "❌ helionbench.json is missing or empty" 2026-02-21T08:07:59.9500857Z exit 1 2026-02-21T08:07:59.9501037Z fi 2026-02-21T08:07:59.9501234Z cat "$TEST_REPORTS_DIR/helionbench.json" 2026-02-21T08:07:59.9501618Z shell: bash -l {0} 2026-02-21T08:07:59.9501794Z env: 2026-02-21T08:07:59.9502010Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:07:59.9502277Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:59.9502581Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:07:59.9502892Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:59.9503157Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:59.9503431Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:59.9503910Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:07:59.9504375Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:07:59.9504645Z ##[endgroup] 2026-02-21T08:08:00.0157221Z /__w/helion/helion/test/test-reports 2026-02-21T08:08:00.0157664Z ========================================== 2026-02-21T08:08:00.0157972Z Running benchmark for kernel: welford 2026-02-21T08:08:00.0158245Z ========================================== 2026-02-21T08:08:04.4836087Z Using baseline: eager_layer_norm 2026-02-21T08:08:04.4836488Z Available 
implementations for welford: helion_welford,torch_compile_welford,triton_welford 2026-02-21T08:08:09.5704849Z Applying custom args for welford: {'num_inputs': 6} 2026-02-21T08:08:09.6096861Z Running welford benchmark with Helion implementation... 2026-02-21T08:08:09.6101111Z 2026-02-21T08:08:09.9367923Z Equally-spaced-k mode: Selected 6 equally spaced inputs (total available: 10) 2026-02-21T08:08:09.9372292Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 2, 4, 5, 7, 9] 2026-02-21T08:08:09.9382194Z 2026-02-21T08:08:09.9393269Z 0%| | 0/6 [00:00 {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:13:09.5007002Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:13:09.5007259Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:13:09.5007488Z %cst = arith.constant dense<1.000000e+00> : tensor<64xf32> 2026-02-21T08:13:09.5007756Z %cst_0 = arith.constant dense<3.200000e+01> : tensor<64xf32> 2026-02-21T08:13:09.5007977Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:13:09.5008168Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:13:09.5008380Z %c2368_i32 = arith.constant 2368 : i32 2026-02-21T08:13:09.5008966Z %cst_1 = arith.constant dense<0.000000e+00> : tensor<64xf32> 2026-02-21T08:13:09.5009182Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:13:09.5009363Z %c262144_i32 = arith.constant 262144 : i32 2026-02-21T08:13:09.5009556Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:13:09.5009730Z %c1024_i64 = arith.constant 1024 : i64 2026-02-21T08:13:09.5009908Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:13:09.5010226Z %0 = tt.make_tensor_descriptor %arg0, [%c262144_i32, %c1024_i32], [%c1024_i64, %c1_i64] : , > 2026-02-21T08:13:09.5010677Z %1 = tt.make_tensor_descriptor %arg0, [%c262144_i32, %c1024_i32], [%c1024_i64, %c1_i64] : , > 2026-02-21T08:13:09.5011124Z %2 = 
tt.make_tensor_descriptor %arg3, [%c262144_i32, %c1024_i32], [%c1024_i64, %c1_i64] : , > 2026-02-21T08:13:09.5011434Z %3 = tt.get_program_id x : i32 2026-02-21T08:13:09.5011824Z scf.for %arg5 = %3 to %c4096_i32 step %c2368_i32 : i32 { 2026-02-21T08:13:09.5012145Z %4 = arith.muli %arg5, %c64_i32 : i32 2026-02-21T08:13:09.5012347Z %c960_i32 = arith.constant 960 : i32 2026-02-21T08:13:09.5012535Z %c96_i32 = arith.constant 96 : i32 2026-02-21T08:13:09.5013026Z %5:3 = scf.for %arg6 = %c0_i32 to %c960_i32 step %c96_i32 iter_args(%arg7 = %cst_1, %arg8 = %cst_1, %arg9 = %cst_1) -> (tensor<64xf32>, tensor<64xf32>, tensor<64xf32>) : i32 { 2026-02-21T08:13:09.5013532Z %37 = tt.descriptor_load %0[%4, %arg6] : !tt.tensordesc> -> tensor<64x32xbf16> 2026-02-21T08:13:09.5013840Z %38 = "tt.reduce"(%37) <{axis = 1 : i32}> ({ 2026-02-21T08:13:09.5014035Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T08:13:09.5014231Z %110 = arith.addf %arg10, %arg11 : bf16 2026-02-21T08:13:09.5014419Z tt.reduce.return %110 : bf16 2026-02-21T08:13:09.5014616Z }) : (tensor<64x32xbf16>) -> tensor<64xbf16> 2026-02-21T08:13:09.5014828Z %39 = arith.mulf %37, %37 : tensor<64x32xbf16> 2026-02-21T08:13:09.5015033Z %40 = "tt.reduce"(%39) <{axis = 1 : i32}> ({ 2026-02-21T08:13:09.5015228Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T08:13:09.5015411Z %110 = arith.addf %arg10, %arg11 : bf16 2026-02-21T08:13:09.5015603Z tt.reduce.return %110 : bf16 2026-02-21T08:13:09.5015784Z }) : (tensor<64x32xbf16>) -> tensor<64xbf16> 2026-02-21T08:13:09.5016006Z %41 = arith.extf %38 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:13:09.5016232Z %42 = arith.divf %41, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5016427Z %43 = arith.mulf %38, %38 : tensor<64xbf16> 2026-02-21T08:13:09.5016645Z %44 = arith.extf %43 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:13:09.5016859Z %45 = arith.divf %44, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5017080Z %46 = arith.extf %40 : tensor<64xbf16> to tensor<64xf32> 
2026-02-21T08:13:09.5017289Z %47 = arith.subf %46, %45 : tensor<64xf32> 2026-02-21T08:13:09.5017491Z %48 = arith.subf %42, %arg8 : tensor<64xf32> 2026-02-21T08:13:09.5017699Z %49 = arith.addf %arg7, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5017896Z %50 = arith.divf %cst, %49 : tensor<64xf32> 2026-02-21T08:13:09.5018095Z %51 = arith.mulf %50, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5018286Z %52 = arith.mulf %48, %51 : tensor<64xf32> 2026-02-21T08:13:09.5018485Z %53 = arith.addf %arg8, %52 : tensor<64xf32> 2026-02-21T08:13:09.5018679Z %54 = arith.addf %arg9, %47 : tensor<64xf32> 2026-02-21T08:13:09.5018877Z %55 = arith.mulf %48, %48 : tensor<64xf32> 2026-02-21T08:13:09.5019073Z %56 = arith.mulf %arg7, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5019287Z %57 = arith.divf %56, %49 : tensor<64xf32> 2026-02-21T08:13:09.5019482Z %58 = arith.mulf %55, %57 : tensor<64xf32> 2026-02-21T08:13:09.5019670Z %59 = arith.addf %54, %58 : tensor<64xf32> 2026-02-21T08:13:09.5019860Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:13:09.5020137Z %60 = arith.muli %c32_i32, %c1_i32 : i32 2026-02-21T08:13:09.5020326Z %61 = arith.addi %arg6, %60 : i32 2026-02-21T08:13:09.5020595Z %62 = tt.descriptor_load %0[%4, %61] : !tt.tensordesc> -> tensor<64x32xbf16> 2026-02-21T08:13:09.5020888Z %63 = "tt.reduce"(%62) <{axis = 1 : i32}> ({ 2026-02-21T08:13:09.5021083Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T08:13:09.5021264Z %110 = arith.addf %arg10, %arg11 : bf16 2026-02-21T08:13:09.5021456Z tt.reduce.return %110 : bf16 2026-02-21T08:13:09.5021637Z }) : (tensor<64x32xbf16>) -> tensor<64xbf16> 2026-02-21T08:13:09.5021928Z %64 = arith.mulf %62, %62 : tensor<64x32xbf16> 2026-02-21T08:13:09.5022133Z %65 = "tt.reduce"(%64) <{axis = 1 : i32}> ({ 2026-02-21T08:13:09.5022330Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T08:13:09.5022599Z %110 = arith.addf %arg10, %arg11 : bf16 2026-02-21T08:13:09.5022791Z tt.reduce.return %110 : bf16 2026-02-21T08:13:09.5022984Z }) : (tensor<64x32xbf16>) -> 
tensor<64xbf16> 2026-02-21T08:13:09.5023193Z %66 = arith.extf %63 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:13:09.5023414Z %67 = arith.divf %66, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5023605Z %68 = arith.mulf %63, %63 : tensor<64xbf16> 2026-02-21T08:13:09.5023820Z %69 = arith.extf %68 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:13:09.5024034Z %70 = arith.divf %69, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5024252Z %71 = arith.extf %65 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:13:09.5024467Z %72 = arith.subf %71, %70 : tensor<64xf32> 2026-02-21T08:13:09.5024660Z %73 = arith.subf %67, %53 : tensor<64xf32> 2026-02-21T08:13:09.5024861Z %74 = arith.addf %49, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5025056Z %75 = arith.divf %cst, %74 : tensor<64xf32> 2026-02-21T08:13:09.5025258Z %76 = arith.mulf %75, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5025451Z %77 = arith.mulf %73, %76 : tensor<64xf32> 2026-02-21T08:13:09.5025643Z %78 = arith.addf %53, %77 : tensor<64xf32> 2026-02-21T08:13:09.5025834Z %79 = arith.addf %59, %72 : tensor<64xf32> 2026-02-21T08:13:09.5026020Z %80 = arith.mulf %73, %73 : tensor<64xf32> 2026-02-21T08:13:09.5026219Z %81 = arith.mulf %49, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5026406Z %82 = arith.divf %81, %74 : tensor<64xf32> 2026-02-21T08:13:09.5026593Z %83 = arith.mulf %80, %82 : tensor<64xf32> 2026-02-21T08:13:09.5026780Z %84 = arith.addf %79, %83 : tensor<64xf32> 2026-02-21T08:13:09.5026970Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:13:09.5027158Z %85 = arith.muli %c32_i32, %c2_i32 : i32 2026-02-21T08:13:09.5027339Z %86 = arith.addi %arg6, %85 : i32 2026-02-21T08:13:09.5027609Z %87 = tt.descriptor_load %0[%4, %86] : !tt.tensordesc> -> tensor<64x32xbf16> 2026-02-21T08:13:09.5027893Z %88 = "tt.reduce"(%87) <{axis = 1 : i32}> ({ 2026-02-21T08:13:09.5028079Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T08:13:09.5028258Z %110 = arith.addf %arg10, %arg11 : bf16 2026-02-21T08:13:09.5028445Z tt.reduce.return 
%110 : bf16 2026-02-21T08:13:09.5028625Z }) : (tensor<64x32xbf16>) -> tensor<64xbf16> 2026-02-21T08:13:09.5028829Z %89 = arith.mulf %87, %87 : tensor<64x32xbf16> 2026-02-21T08:13:09.5029027Z %90 = "tt.reduce"(%89) <{axis = 1 : i32}> ({ 2026-02-21T08:13:09.5029207Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T08:13:09.5029395Z %110 = arith.addf %arg10, %arg11 : bf16 2026-02-21T08:13:09.5029578Z tt.reduce.return %110 : bf16 2026-02-21T08:13:09.5029766Z }) : (tensor<64x32xbf16>) -> tensor<64xbf16> 2026-02-21T08:13:09.5029974Z %91 = arith.extf %88 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:13:09.5030195Z %92 = arith.divf %91, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5030451Z %93 = arith.mulf %88, %88 : tensor<64xbf16> 2026-02-21T08:13:09.5030660Z %94 = arith.extf %93 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:13:09.5030879Z %95 = arith.divf %94, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5031085Z %96 = arith.extf %90 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:13:09.5031297Z %97 = arith.subf %96, %95 : tensor<64xf32> 2026-02-21T08:13:09.5031483Z %98 = arith.subf %92, %78 : tensor<64xf32> 2026-02-21T08:13:09.5031684Z %99 = arith.addf %74, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5031920Z %100 = arith.divf %cst, %99 : tensor<64xf32> 2026-02-21T08:13:09.5032125Z %101 = arith.mulf %100, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5032331Z %102 = arith.mulf %98, %101 : tensor<64xf32> 2026-02-21T08:13:09.5032519Z %103 = arith.addf %78, %102 : tensor<64xf32> 2026-02-21T08:13:09.5032766Z %104 = arith.addf %84, %97 : tensor<64xf32> 2026-02-21T08:13:09.5032954Z %105 = arith.mulf %98, %98 : tensor<64xf32> 2026-02-21T08:13:09.5033156Z %106 = arith.mulf %74, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5033351Z %107 = arith.divf %106, %99 : tensor<64xf32> 2026-02-21T08:13:09.5033549Z %108 = arith.mulf %105, %107 : tensor<64xf32> 2026-02-21T08:13:09.5033745Z %109 = arith.addf %104, %108 : tensor<64xf32> 2026-02-21T08:13:09.5033985Z scf.yield %99, %103, %109 
: tensor<64xf32>, tensor<64xf32>, tensor<64xf32> 2026-02-21T08:13:09.5034243Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:13:09.5034655Z %6:3 = scf.for %arg6 = %c960_i32 to %c1024_i32 step %c32_i32 iter_args(%arg7 = %5#0, %arg8 = %5#1, %arg9 = %5#2) -> (tensor<64xf32>, tensor<64xf32>, tensor<64xf32>) : i32 { 2026-02-21T08:13:09.5035150Z %37 = tt.descriptor_load %0[%4, %arg6] : !tt.tensordesc> -> tensor<64x32xbf16> 2026-02-21T08:13:09.5035446Z %38 = "tt.reduce"(%37) <{axis = 1 : i32}> ({ 2026-02-21T08:13:09.5035630Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T08:13:09.5035834Z %60 = arith.addf %arg10, %arg11 : bf16 2026-02-21T08:13:09.5036031Z tt.reduce.return %60 : bf16 2026-02-21T08:13:09.5036215Z }) : (tensor<64x32xbf16>) -> tensor<64xbf16> 2026-02-21T08:13:09.5036418Z %39 = arith.mulf %37, %37 : tensor<64x32xbf16> 2026-02-21T08:13:09.5036607Z %40 = "tt.reduce"(%39) <{axis = 1 : i32}> ({ 2026-02-21T08:13:09.5036793Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T08:13:09.5036977Z %60 = arith.addf %arg10, %arg11 : bf16 2026-02-21T08:13:09.5037162Z tt.reduce.return %60 : bf16 2026-02-21T08:13:09.5037349Z }) : (tensor<64x32xbf16>) -> tensor<64xbf16> 2026-02-21T08:13:09.5037559Z %41 = arith.extf %38 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:13:09.5037779Z %42 = arith.divf %41, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5037970Z %43 = arith.mulf %38, %38 : tensor<64xbf16> 2026-02-21T08:13:09.5038189Z %44 = arith.extf %43 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:13:09.5038405Z %45 = arith.divf %44, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5038613Z %46 = arith.extf %40 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:13:09.5038825Z %47 = arith.subf %46, %45 : tensor<64xf32> 2026-02-21T08:13:09.5039014Z %48 = arith.subf %42, %arg8 : tensor<64xf32> 2026-02-21T08:13:09.5039219Z %49 = arith.addf %arg7, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5039414Z %50 = arith.divf %cst, %49 : tensor<64xf32> 2026-02-21T08:13:09.5039613Z %51 = arith.mulf 
%50, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5039809Z %52 = arith.mulf %48, %51 : tensor<64xf32> 2026-02-21T08:13:09.5039999Z %53 = arith.addf %arg8, %52 : tensor<64xf32> 2026-02-21T08:13:09.5040199Z %54 = arith.addf %arg9, %47 : tensor<64xf32> 2026-02-21T08:13:09.5040389Z %55 = arith.mulf %48, %48 : tensor<64xf32> 2026-02-21T08:13:09.5040685Z %56 = arith.mulf %arg7, %cst_0 : tensor<64xf32> 2026-02-21T08:13:09.5040879Z %57 = arith.divf %56, %49 : tensor<64xf32> 2026-02-21T08:13:09.5041074Z %58 = arith.mulf %55, %57 : tensor<64xf32> 2026-02-21T08:13:09.5041259Z %59 = arith.addf %54, %58 : tensor<64xf32> 2026-02-21T08:13:09.5041499Z scf.yield %49, %53, %59 : tensor<64xf32>, tensor<64xf32>, tensor<64xf32> 2026-02-21T08:13:09.5041753Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:13:09.5041969Z %7 = arith.divf %6#2, %6#0 : tensor<64xf32> 2026-02-21T08:13:09.5042173Z %8 = tt.splat %arg4 : f32 -> tensor<64xf32> 2026-02-21T08:13:09.5042367Z %9 = arith.addf %7, %8 : tensor<64xf32> 2026-02-21T08:13:09.5042731Z %10 = tt.extern_elementwise %9 {libname = "", libpath = "", pure = true, symbol = "__nv_rsqrtf"} : (tensor<64xf32>) -> tensor<64xf32> 2026-02-21T08:13:09.5043200Z %11 = tt.expand_dims %6#1 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> 2026-02-21T08:13:09.5043546Z %12 = tt.expand_dims %10 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> 2026-02-21T08:13:09.5043801Z %c1008_i32 = arith.constant 1008 : i32 2026-02-21T08:13:09.5044001Z %c48_i32 = arith.constant 48 : i32 2026-02-21T08:13:09.5044227Z scf.for %arg6 = %c0_i32 to %c1008_i32 step %c48_i32 : i32 { 2026-02-21T08:13:09.5044511Z %37 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:13:09.5044768Z %38 = tt.splat %arg6 : i32 -> tensor<16xi32> 2026-02-21T08:13:09.5044975Z %39 = arith.addi %38, %37 : tensor<16xi32> 2026-02-21T08:13:09.5045282Z %40 = tt.descriptor_load %1[%4, %arg6] : !tt.tensordesc> -> tensor<64x16xbf16> 2026-02-21T08:13:09.5045619Z %41 = tt.splat 
%arg1 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:13:09.5045903Z %42 = tt.addptr %41, %39 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:13:09.5046202Z %43 = tt.load %42 evictionPolicy = evict_first : tensor<16x!tt.ptr> 2026-02-21T08:13:09.5046532Z %44 = tt.expand_dims %43 {axis = 0 : i32} : tensor<16xbf16> -> tensor<1x16xbf16> 2026-02-21T08:13:09.5046837Z %45 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:13:09.5047106Z %46 = tt.addptr %45, %39 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:13:09.5047402Z %47 = tt.load %46 evictionPolicy = evict_first : tensor<16x!tt.ptr> 2026-02-21T08:13:09.5047753Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<16xbf16> -> tensor<1x16xbf16> 2026-02-21T08:13:09.5048056Z %49 = arith.extf %40 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T08:13:09.5048330Z %50 = tt.broadcast %11 : tensor<64x1xf32> -> tensor<64x16xf32> 2026-02-21T08:13:09.5048568Z %51 = arith.subf %49, %50 : tensor<64x16xf32> 2026-02-21T08:13:09.5048810Z %52 = tt.broadcast %12 : tensor<64x1xf32> -> tensor<64x16xf32> 2026-02-21T08:13:09.5049044Z %53 = arith.mulf %51, %52 : tensor<64x16xf32> 2026-02-21T08:13:09.5049287Z %54 = arith.extf %44 : tensor<1x16xbf16> to tensor<1x16xf32> 2026-02-21T08:13:09.5049548Z %55 = tt.broadcast %54 : tensor<1x16xf32> -> tensor<64x16xf32> 2026-02-21T08:13:09.5049812Z %56 = arith.mulf %53, %55 : tensor<64x16xf32> 2026-02-21T08:13:09.5050045Z %57 = arith.extf %48 : tensor<1x16xbf16> to tensor<1x16xf32> 2026-02-21T08:13:09.5050297Z %58 = tt.broadcast %57 : tensor<1x16xf32> -> tensor<64x16xf32> 2026-02-21T08:13:09.5050531Z %59 = arith.addf %56, %58 : tensor<64x16xf32> 2026-02-21T08:13:09.5050762Z %60 = arith.truncf %59 : tensor<64x16xf32> to tensor<64x16xbf16> 2026-02-21T08:13:09.5051096Z tt.descriptor_store %2[%4, %arg6], %60 : !tt.tensordesc>, tensor<64x16xbf16> 2026-02-21T08:13:09.5051403Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:13:09.5051594Z %61 = arith.muli %c16_i32, %c1_i32 : i32 
2026-02-21T08:13:09.5051800Z %62 = arith.addi %arg6, %61 : i32 2026-02-21T08:13:09.5052121Z %63 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:13:09.5052379Z %64 = tt.splat %62 : i32 -> tensor<16xi32> 2026-02-21T08:13:09.5052579Z %65 = arith.addi %64, %63 : tensor<16xi32> 2026-02-21T08:13:09.5052869Z %66 = tt.descriptor_load %1[%4, %62] : !tt.tensordesc> -> tensor<64x16xbf16> 2026-02-21T08:13:09.5053204Z %67 = tt.splat %arg1 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:13:09.5053474Z %68 = tt.addptr %67, %65 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:13:09.5053771Z %69 = tt.load %68 evictionPolicy = evict_first : tensor<16x!tt.ptr> 2026-02-21T08:13:09.5054070Z %70 = tt.expand_dims %69 {axis = 0 : i32} : tensor<16xbf16> -> tensor<1x16xbf16> 2026-02-21T08:13:09.5054361Z %71 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:13:09.5054679Z %72 = tt.addptr %71, %65 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:13:09.5054959Z %73 = tt.load %72 evictionPolicy = evict_first : tensor<16x!tt.ptr> 2026-02-21T08:13:09.5055260Z %74 = tt.expand_dims %73 {axis = 0 : i32} : tensor<16xbf16> -> tensor<1x16xbf16> 2026-02-21T08:13:09.5055538Z %75 = arith.extf %66 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T08:13:09.5055795Z %76 = tt.broadcast %11 : tensor<64x1xf32> -> tensor<64x16xf32> 2026-02-21T08:13:09.5056020Z %77 = arith.subf %75, %76 : tensor<64x16xf32> 2026-02-21T08:13:09.5056249Z %78 = tt.broadcast %12 : tensor<64x1xf32> -> tensor<64x16xf32> 2026-02-21T08:13:09.5056476Z %79 = arith.mulf %77, %78 : tensor<64x16xf32> 2026-02-21T08:13:09.5056697Z %80 = arith.extf %70 : tensor<1x16xbf16> to tensor<1x16xf32> 2026-02-21T08:13:09.5056943Z %81 = tt.broadcast %80 : tensor<1x16xf32> -> tensor<64x16xf32> 2026-02-21T08:13:09.5057164Z %82 = arith.mulf %79, %81 : tensor<64x16xf32> 2026-02-21T08:13:09.5057389Z %83 = arith.extf %74 : tensor<1x16xbf16> to tensor<1x16xf32> 2026-02-21T08:13:09.5057629Z %84 = tt.broadcast %83 : 
tensor<1x16xf32> -> tensor<64x16xf32> 2026-02-21T08:13:09.5057854Z %85 = arith.addf %82, %84 : tensor<64x16xf32> 2026-02-21T08:13:09.5058085Z %86 = arith.truncf %85 : tensor<64x16xf32> to tensor<64x16xbf16> 2026-02-21T08:13:09.5058391Z tt.descriptor_store %2[%4, %62], %86 : !tt.tensordesc>, tensor<64x16xbf16> 2026-02-21T08:13:09.5058674Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:13:09.5058854Z %87 = arith.muli %c16_i32, %c2_i32 : i32 2026-02-21T08:13:09.5059045Z %88 = arith.addi %arg6, %87 : i32 2026-02-21T08:13:09.5059265Z %89 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:13:09.5059510Z %90 = tt.splat %88 : i32 -> tensor<16xi32> 2026-02-21T08:13:09.5059707Z %91 = arith.addi %90, %89 : tensor<16xi32> 2026-02-21T08:13:09.5059980Z %92 = tt.descriptor_load %1[%4, %88] : !tt.tensordesc> -> tensor<64x16xbf16> 2026-02-21T08:13:09.5060301Z %93 = tt.splat %arg1 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:13:09.5060560Z %94 = tt.addptr %93, %91 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:13:09.5060848Z %95 = tt.load %94 evictionPolicy = evict_first : tensor<16x!tt.ptr> 2026-02-21T08:13:09.5061150Z %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<16xbf16> -> tensor<1x16xbf16> 2026-02-21T08:13:09.5061435Z %97 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:13:09.5061697Z %98 = tt.addptr %97, %91 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:13:09.5061999Z %99 = tt.load %98 evictionPolicy = evict_first : tensor<16x!tt.ptr> 2026-02-21T08:13:09.5062304Z %100 = tt.expand_dims %99 {axis = 0 : i32} : tensor<16xbf16> -> tensor<1x16xbf16> 2026-02-21T08:13:09.5062602Z %101 = arith.extf %92 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T08:13:09.5062919Z %102 = tt.broadcast %11 : tensor<64x1xf32> -> tensor<64x16xf32> 2026-02-21T08:13:09.5063160Z %103 = arith.subf %101, %102 : tensor<64x16xf32> 2026-02-21T08:13:09.5063387Z %104 = tt.broadcast %12 : tensor<64x1xf32> -> tensor<64x16xf32> 
2026-02-21T08:13:09.5063624Z %105 = arith.mulf %103, %104 : tensor<64x16xf32> 2026-02-21T08:13:09.5063852Z %106 = arith.extf %96 : tensor<1x16xbf16> to tensor<1x16xf32> 2026-02-21T08:13:09.5064109Z %107 = tt.broadcast %106 : tensor<1x16xf32> -> tensor<64x16xf32> 2026-02-21T08:13:09.5064338Z %108 = arith.mulf %105, %107 : tensor<64x16xf32> 2026-02-21T08:13:09.5064566Z %109 = arith.extf %100 : tensor<1x16xbf16> to tensor<1x16xf32> 2026-02-21T08:13:09.5064823Z %110 = tt.broadcast %109 : tensor<1x16xf32> -> tensor<64x16xf32> 2026-02-21T08:13:09.5065104Z %111 = arith.addf %108, %110 : tensor<64x16xf32> 2026-02-21T08:13:09.5065345Z %112 = arith.truncf %111 : tensor<64x16xf32> to tensor<64x16xbf16> 2026-02-21T08:13:09.5065654Z tt.descriptor_store %2[%4, %88], %112 : !tt.tensordesc>, tensor<64x16xbf16> 2026-02-21T08:13:09.5065928Z } {tt.flatten} 2026-02-21T08:13:09.5066124Z %13 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:13:09.5066377Z %14 = tt.splat %c1008_i32 : i32 -> tensor<16xi32> 2026-02-21T08:13:09.5066588Z %15 = arith.addi %14, %13 : tensor<16xi32> 2026-02-21T08:13:09.5066875Z %16 = tt.descriptor_load %1[%4, %c1008_i32] : !tt.tensordesc> -> tensor<64x16xbf16> 2026-02-21T08:13:09.5067207Z %17 = tt.splat %arg1 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:13:09.5067461Z %18 = tt.addptr %17, %15 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:13:09.5067745Z %19 = tt.load %18 evictionPolicy = evict_first : tensor<16x!tt.ptr> 2026-02-21T08:13:09.5068046Z %20 = tt.expand_dims %19 {axis = 0 : i32} : tensor<16xbf16> -> tensor<1x16xbf16> 2026-02-21T08:13:09.5068330Z %21 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:13:09.5068591Z %22 = tt.addptr %21, %15 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:13:09.5068859Z %23 = tt.load %22 evictionPolicy = evict_first : tensor<16x!tt.ptr> 2026-02-21T08:13:09.5069165Z %24 = tt.expand_dims %23 {axis = 0 : i32} : tensor<16xbf16> -> tensor<1x16xbf16> 
2026-02-21T08:13:09.5069447Z %25 = arith.extf %16 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T08:13:09.5069705Z %26 = tt.broadcast %11 : tensor<64x1xf32> -> tensor<64x16xf32> 2026-02-21T08:13:09.5069933Z %27 = arith.subf %25, %26 : tensor<64x16xf32> 2026-02-21T08:13:09.5070155Z %28 = tt.broadcast %12 : tensor<64x1xf32> -> tensor<64x16xf32> 2026-02-21T08:13:09.5070380Z %29 = arith.mulf %27, %28 : tensor<64x16xf32> 2026-02-21T08:13:09.5070601Z %30 = arith.extf %20 : tensor<1x16xbf16> to tensor<1x16xf32> 2026-02-21T08:13:09.5070874Z %31 = tt.broadcast %30 : tensor<1x16xf32> -> tensor<64x16xf32> 2026-02-21T08:13:09.5071091Z %32 = arith.mulf %29, %31 : tensor<64x16xf32> 2026-02-21T08:13:09.5071313Z %33 = arith.extf %24 : tensor<1x16xbf16> to tensor<1x16xf32> 2026-02-21T08:13:09.5071560Z %34 = tt.broadcast %33 : tensor<1x16xf32> -> tensor<64x16xf32> 2026-02-21T08:13:09.5071775Z %35 = arith.addf %32, %34 : tensor<64x16xf32> 2026-02-21T08:13:09.5072043Z %36 = arith.truncf %35 : tensor<64x16xf32> to tensor<64x16xbf16> 2026-02-21T08:13:09.5072362Z tt.descriptor_store %2[%4, %c1008_i32], %36 : !tt.tensordesc>, tensor<64x16xbf16> 2026-02-21T08:13:09.5072685Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 2026-02-21T08:13:09.5072882Z tt.return 2026-02-21T08:13:09.5073016Z } 2026-02-21T08:13:09.5073142Z } 2026-02-21T08:13:09.5073210Z 2026-02-21T08:13:09.5073262Z {-# 2026-02-21T08:13:09.5073400Z external_resources: { 2026-02-21T08:13:09.5073608Z mlir_reproducer: { 2026-02-21T08:13:09.5077951Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, 
tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:13:09.5082548Z disable_threading: false, 2026-02-21T08:13:09.5082713Z verify_each: true 2026-02-21T08:13:09.5082865Z } 2026-02-21T08:13:09.5082981Z } 2026-02-21T08:13:09.5083103Z #-} 2026-02-21T08:13:09.5083527Z /tmp/torchinductor_root/r6/cr6giw7vdouyzl4agvroul4clh3tj4z2e7lgu5zckhhr46sqk2j3.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:13:09.5084714Z /tmp/torchinductor_root/r6/cr6giw7vdouyzl4agvroul4clh3tj4z2e7lgu5zckhhr46sqk2j3.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, 
please share the reproducer above with Triton project.` 2026-02-21T08:13:09.5085697Z [66s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:13:09.5086976Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 32, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first', 'first', 'first'], maxnreg=64, num_sm_multiplier=16, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True, True], range_multi_buffers=[False, None, True], range_num_stages=[0, 4, 0], range_unroll_factors=[0, 3, 3], range_warp_specializes=[True, None, None]), static_shapes=True) 2026-02-21T08:13:09.5088204Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:13:09.5088462Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:13:12.6642775Z [69s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[8, 512, 1], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', '', 'last', ''], num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, False, None], range_multi_buffers=[None, False, None], range_num_stages=[0, 1, 3], range_unroll_factors=[0, 3, 2], range_warp_specializes=[None, None, None]) 2026-02-21T08:13:12.6643853Z Tensor-likes are not close! 2026-02-21T08:13:12.6644004Z 2026-02-21T08:13:12.6644379Z Mismatched elements: 9 / 268435456 (0.0%) 2026-02-21T08:13:12.6644644Z Greatest absolute difference: 0.03125 at index (9373, 437) (up to 0.01 allowed) 2026-02-21T08:13:12.6644996Z Greatest relative difference: 1.9921875 at index (206021, 427) (up to 0.01 allowed) 2026-02-21T08:13:12.6645308Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:13:12.6645466Z 2026-02-21T08:13:18.6731581Z [75s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 1024, 4], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first', 'first', 'first'], maxnreg=256, num_sm_multiplier=4, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, True, None], range_multi_buffers=[True, True, None], range_num_stages=[3, 4, 2], range_unroll_factors=[0, 1, 2], range_warp_specializes=[True, None, None]) 2026-02-21T08:13:18.6733128Z Tensor-likes are not close! 2026-02-21T08:13:18.6733632Z 2026-02-21T08:13:18.6733741Z Mismatched elements: 422 / 268435456 (0.0%) 2026-02-21T08:13:18.6734068Z Greatest absolute difference: 0.03125 at index (38640, 908) (up to 0.01 allowed) 2026-02-21T08:13:18.6734458Z Greatest relative difference: 25.0 at index (30765, 437) (up to 0.01 allowed) 2026-02-21T08:13:18.6734806Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:13:18.6734992Z 2026-02-21T08:13:19.1006634Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 8.2 configs/s 2026-02-21T08:13:19.1017397Z [75s] Adaptive compile timeout: 30s (90% percentile=1.5s, bounds=[30.0s, 60s]) 2026-02-21T08:13:19.5518920Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━ 588/588 1023.3 configs/s 2026-02-21T08:13:19.6842806Z [76s] Initial random population of 100, 5 starting points: 2026-02-21T08:13:19.6844453Z error=10 2026-02-21T08:13:19.6844621Z timeout=1 2026-02-21T08:13:19.6844747Z ok=89 2026-02-21T08:13:19.6844880Z min=0.3645 2026-02-21T08:13:19.6845043Z mid=4.4319 2026-02-21T08:13:19.6845191Z max=155.2220 2026-02-21T08:13:19.6845342Z best={'block_sizes': [16, 16, 16], 2026-02-21T08:13:19.6845595Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:13:19.6845851Z 'load_eviction_policies': ['', '', '', ''], 2026-02-21T08:13:19.6846029Z 'num_stages': 1, 2026-02-21T08:13:19.6846175Z 'num_warps': 4, 2026-02-21T08:13:19.6846311Z 'pid_type': 'flat', 2026-02-21T08:13:19.6846472Z 'range_flattens': [None, None, None], 2026-02-21T08:13:19.6846663Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:13:19.6846853Z 'range_num_stages': [0, 0, 0], 2026-02-21T08:13:19.6847016Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T08:13:19.6847218Z 'range_warp_specializes': [None, None, None]} 2026-02-21T08:13:19.6861710Z [76s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:13:20.9815919Z [77s] Generation 1 starting: 101 neighbors, 5 active search path(s) 2026-02-21T08:13:30.5267830Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 104/104 5.3 configs/s 2026-02-21T08:13:37.6624786Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 104/104 14.6 configs/s 2026-02-21T08:13:42.1195359Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━ 720/720 185.7 configs/s 2026-02-21T08:13:42.2727981Z [99s] Generation 1 complete: 2026-02-21T08:13:42.2729862Z ok=106 
2026-02-21T08:13:42.2730074Z min=0.2855 2026-02-21T08:13:42.2735357Z mid=0.8637 2026-02-21T08:13:42.2736850Z max=12.6280 2026-02-21T08:13:42.2737097Z best={'block_sizes': [16, 16, 64], 2026-02-21T08:13:42.2742632Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:13:42.2744919Z 'load_eviction_policies': ['', '', '', ''], 2026-02-21T08:13:42.2745190Z 'num_stages': 1, 2026-02-21T08:13:42.2750629Z 'num_warps': 2, 2026-02-21T08:13:42.2756197Z 'pid_type': 'flat', 2026-02-21T08:13:42.2761296Z 'range_flattens': [None, None, True], 2026-02-21T08:13:42.2763319Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:13:42.2763635Z 'range_num_stages': [0, 0, 0], 2026-02-21T08:13:42.2766673Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T08:13:42.2766963Z 'range_warp_specializes': [None, None, None]} 2026-02-21T08:13:42.2767195Z [99s] Fitting surrogate: 206 points, 206 targets 2026-02-21T08:13:43.5664342Z [100s] Generation 2 starting: 94 neighbors, 5 active search path(s) 2026-02-21T08:14:18.7997230Z [135s] Timeout after 30s compiling Config(block_sizes=[512, 64, 64], indexing=['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'last', 'first', 'first'], num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, True, None], range_multi_buffers=[None, None, False], range_num_stages=[0, 1, 4], range_unroll_factors=[0, 0, 4], range_warp_specializes=[None, None, False]) 2026-02-21T08:14:18.8015440Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98/98 0.7 configs/s 2026-02-21T08:14:25.1099748Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 98/98 15.6 configs/s 2026-02-21T08:14:31.1072139Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━ 824/824 135.7 configs/s 2026-02-21T08:14:31.2775574Z [148s] Generation 2 complete: 2026-02-21T08:14:31.2778068Z error=1 2026-02-21T08:14:31.2778354Z timeout=1 2026-02-21T08:14:31.2778554Z ok=98 2026-02-21T08:14:31.2778738Z 
min=0.2478 2026-02-21T08:14:31.2778953Z mid=0.4874 2026-02-21T08:14:31.2779128Z max=14.7836 2026-02-21T08:14:31.2779320Z best={'block_sizes': [16, 16, 256], 2026-02-21T08:14:31.2779716Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:14:31.2780066Z 'load_eviction_policies': ['', '', '', ''], 2026-02-21T08:14:31.2780310Z 'num_stages': 1, 2026-02-21T08:14:31.2780515Z 'num_warps': 1, 2026-02-21T08:14:31.2780709Z 'pid_type': 'flat', 2026-02-21T08:14:31.2780907Z 'range_flattens': [None, None, False], 2026-02-21T08:14:31.2781160Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:14:31.2781411Z 'range_num_stages': [0, 0, 0], 2026-02-21T08:14:31.2781671Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T08:14:31.2782224Z 'range_warp_specializes': [None, None, None]} 2026-02-21T08:14:31.2801474Z [148s] Fitting surrogate: 306 points, 306 targets 2026-02-21T08:14:32.6393598Z [149s] Generation 3 starting: 95 neighbors, 5 active search path(s) 2026-02-21T08:14:40.0341081Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98/98 11.9 configs/s 2026-02-21T08:14:45.9540812Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 98/98 16.7 configs/s 2026-02-21T08:14:52.9184715Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━ 935/935 133.2 configs/s 2026-02-21T08:14:53.1048614Z [169s] Generation 3 complete: 2026-02-21T08:14:53.1052779Z error=3 2026-02-21T08:14:53.1057810Z ok=98 2026-02-21T08:14:53.1062477Z min=0.2315 2026-02-21T08:14:53.1066326Z mid=0.4076 2026-02-21T08:14:53.1069638Z max=2.6024 2026-02-21T08:14:53.1074873Z best={'block_sizes': [16, 32, 256], 2026-02-21T08:14:53.1075288Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:14:53.1075609Z 'load_eviction_policies': ['', '', '', ''], 2026-02-21T08:14:53.1080147Z 'num_stages': 1, 2026-02-21T08:14:53.1084849Z 'num_warps': 4, 2026-02-21T08:14:53.1089220Z 'pid_type': 'flat', 2026-02-21T08:14:53.1093792Z 
'range_flattens': [None, None, True], 2026-02-21T08:14:53.1098363Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:14:53.1099800Z 'range_num_stages': [0, 0, 0], 2026-02-21T08:14:53.1100019Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T08:14:53.1100235Z 'range_warp_specializes': [None, None, None]} 2026-02-21T08:14:53.1100530Z [170s] Fitting surrogate: 407 points, 407 targets 2026-02-21T08:14:54.4410702Z [171s] Generation 4 starting: 97 neighbors, 5 active search path(s) 2026-02-21T08:15:30.2676127Z [207s] Timeout after 30s compiling Config(block_sizes=[1024, 16, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['last', 'last', 'first', 'first'], num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, True, None], range_multi_buffers=[None, None, None], range_num_stages=[0, 1, 4], range_unroll_factors=[0, 0, 4], range_warp_specializes=[None, False, None]) 2026-02-21T08:15:30.2697666Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99/99 0.7 configs/s 2026-02-21T08:15:36.4690179Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 99/99 16.1 configs/s 2026-02-21T08:15:44.3723134Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━ 957/957 120.2 configs/s 2026-02-21T08:15:44.5642565Z [221s] Generation 4 complete: 2026-02-21T08:15:44.5647213Z error=2 2026-02-21T08:15:44.5649372Z timeout=1 2026-02-21T08:15:44.5649581Z ok=100 2026-02-21T08:15:44.5649717Z min=0.1914 2026-02-21T08:15:44.5649875Z mid=0.3583 2026-02-21T08:15:44.5650033Z max=15.0456 2026-02-21T08:15:44.5650207Z best={'block_sizes': [16, 64, 256], 2026-02-21T08:15:44.5650521Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:15:44.5651610Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T08:15:44.5651962Z 'num_stages': 1, 2026-02-21T08:15:44.5652116Z 'num_warps': 4, 2026-02-21T08:15:44.5657671Z 'pid_type': 'flat', 2026-02-21T08:15:44.5657905Z 'range_flattens': [None, 
None, True], 2026-02-21T08:15:44.5662280Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:15:44.5666261Z 'range_num_stages': [0, 0, 0], 2026-02-21T08:15:44.5670324Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T08:15:44.5672075Z 'range_warp_specializes': [None, None, None]} 2026-02-21T08:15:44.5672335Z [221s] Fitting surrogate: 510 points, 510 targets 2026-02-21T08:15:46.0200339Z [222s] Generation 5 starting: 95 neighbors, 5 active search path(s) 2026-02-21T08:15:53.5292592Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 96/96 6.9 configs/s 2026-02-21T08:15:59.3958206Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 96/96 16.5 configs/s 2026-02-21T08:16:04.7032991Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 186.7 2026-02-21T08:16:04.7034323Z configs/s 2026-02-21T08:16:04.8628773Z [241s] Generation 5 complete: 2026-02-21T08:16:04.8630571Z ok=101 2026-02-21T08:16:04.8630730Z min=0.1854 2026-02-21T08:16:04.8630869Z mid=0.3113 2026-02-21T08:16:04.8630998Z max=1.8768 2026-02-21T08:16:04.8631140Z best={'block_sizes': [4, 64, 256], 2026-02-21T08:16:04.8631416Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:16:04.8631690Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T08:16:04.8632946Z 'num_stages': 1, 2026-02-21T08:16:04.8633247Z 'num_warps': 1, 2026-02-21T08:16:04.8633400Z 'pid_type': 'flat', 2026-02-21T08:16:04.8633562Z 'range_flattens': [None, None, True], 2026-02-21T08:16:04.8633770Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:16:04.8633964Z 'range_num_stages': [0, 0, 0], 2026-02-21T08:16:04.8634130Z 'range_unroll_factors': [0, 1, 0], 2026-02-21T08:16:04.8634347Z 'range_warp_specializes': [None, None, None]} 2026-02-21T08:16:04.8659831Z [241s] Fitting surrogate: 611 points, 611 targets 2026-02-21T08:16:06.3484582Z [243s] Generation 6 starting: 96 neighbors, 5 active search path(s) 2026-02-21T08:16:12.5860074Z Generation 6: precompiling 100% 
━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 37.1 configs/s 2026-02-21T08:16:18.5321642Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 97/97 16.4 configs/s 2026-02-21T08:16:25.8191399Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 136.4 2026-02-21T08:16:25.8191788Z configs/s 2026-02-21T08:16:26.0120844Z [262s] Generation 6 complete: 2026-02-21T08:16:26.0124550Z ok=102 2026-02-21T08:16:26.0128242Z min=0.1833 2026-02-21T08:16:26.0132350Z mid=0.3041 2026-02-21T08:16:26.0136231Z max=3.3162 2026-02-21T08:16:26.0141046Z best={'block_sizes': [4, 64, 512], 2026-02-21T08:16:26.0145421Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:16:26.0146688Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T08:16:26.0146970Z 'num_stages': 1, 2026-02-21T08:16:26.0147120Z 'num_warps': 1, 2026-02-21T08:16:26.0147278Z 'pid_type': 'flat', 2026-02-21T08:16:26.0147456Z 'range_flattens': [None, None, False], 2026-02-21T08:16:26.0147664Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:16:26.0147870Z 'range_num_stages': [0, 0, 0], 2026-02-21T08:16:26.0148054Z 'range_unroll_factors': [0, 1, 0], 2026-02-21T08:16:26.0148270Z 'range_warp_specializes': [None, None, None]} 2026-02-21T08:16:26.0154715Z [262s] Fitting surrogate: 713 points, 713 targets 2026-02-21T08:16:27.4871588Z [264s] Generation 7 starting: 95 neighbors, 5 active search path(s) 2026-02-21T08:16:34.8233026Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 96/96 6.1 configs/s 2026-02-21T08:16:39.6490317Z [276s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['last', 'last', 'first', 'first'], num_stages=2, num_warps=16, pid_type='flat', range_flattens=[None, None, None], range_multi_buffers=[None, None, None], range_num_stages=[0, 1, 4], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, False, True]) 
2026-02-21T08:16:39.6491528Z Tensor-likes are not close! 2026-02-21T08:16:39.6491655Z 2026-02-21T08:16:39.6491749Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T08:16:39.6492274Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T08:16:39.6492689Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T08:16:39.6493018Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:16:39.6493201Z 2026-02-21T08:16:39.9019160Z [276s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 64], indexing=['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'last', 'first', 'first'], num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, None, None], range_multi_buffers=[None, None, None], range_num_stages=[0, 1, 4], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, False, True]) 2026-02-21T08:16:39.9020334Z Tensor-likes are not close! 2026-02-21T08:16:39.9020468Z 2026-02-21T08:16:39.9020550Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T08:16:39.9020848Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T08:16:39.9021231Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T08:16:39.9021563Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:16:39.9021736Z 2026-02-21T08:16:40.2323699Z [277s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['last', 'last', 'first', 'first'], num_stages=2, num_warps=32, pid_type='flat', range_flattens=[None, None, None], range_multi_buffers=[None, None, None], range_num_stages=[0, 1, 4], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, False, True]) 2026-02-21T08:16:40.2324732Z Tensor-likes are not close! 
2026-02-21T08:16:40.2324847Z 2026-02-21T08:16:40.2324928Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T08:16:40.2325194Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T08:16:40.2325540Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T08:16:40.2325835Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:16:40.2326000Z 2026-02-21T08:16:40.5352387Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 96/96 16.9 configs/s 2026-02-21T08:16:50.0590070Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 104.6 2026-02-21T08:16:50.0595074Z configs/s 2026-02-21T08:16:50.2916236Z [287s] Generation 7 complete: 2026-02-21T08:16:50.2921132Z error=3 2026-02-21T08:16:50.2926271Z ok=97 2026-02-21T08:16:50.2930139Z min=0.1791 2026-02-21T08:16:50.2934680Z mid=0.2857 2026-02-21T08:16:50.2938549Z max=3.0781 2026-02-21T08:16:50.2942455Z best={'block_sizes': [4, 64, 512], 2026-02-21T08:16:50.2946580Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:16:50.2948710Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T08:16:50.2948970Z 'num_stages': 1, 2026-02-21T08:16:50.2949138Z 'num_warps': 1, 2026-02-21T08:16:50.2949304Z 'pid_type': 'flat', 2026-02-21T08:16:50.2949482Z 'range_flattens': [None, True, False], 2026-02-21T08:16:50.2949700Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:16:50.2949890Z 'range_num_stages': [0, 0, 0], 2026-02-21T08:16:50.2950070Z 'range_unroll_factors': [0, 1, 0], 2026-02-21T08:16:50.2950265Z 'range_warp_specializes': [None, None, None]} 2026-02-21T08:16:50.2953071Z [287s] Fitting surrogate: 813 points, 813 targets 2026-02-21T08:16:51.7298895Z [288s] Generation 8 starting: 95 neighbors, 5 active search path(s) 2026-02-21T08:17:01.6374161Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 3.1 configs/s 2026-02-21T08:17:07.5997140Z Generation 8: 
exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 97/97 16.4 configs/s 2026-02-21T08:17:19.2776666Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 85.5 configs/s 2026-02-21T08:17:19.5517178Z [316s] Generation 8 complete: 2026-02-21T08:17:19.5520884Z ok=100 2026-02-21T08:17:19.5525905Z min=0.1814 2026-02-21T08:17:19.5527431Z mid=0.2775 2026-02-21T08:17:19.5527594Z max=7.7496 2026-02-21T08:17:19.5527735Z best={'block_sizes': [4, 64, 512], 2026-02-21T08:17:19.5528017Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:17:19.5528300Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T08:17:19.5528500Z 'num_stages': 1, 2026-02-21T08:17:19.5528669Z 'num_warps': 1, 2026-02-21T08:17:19.5528829Z 'pid_type': 'flat', 2026-02-21T08:17:19.5528984Z 'range_flattens': [None, True, True], 2026-02-21T08:17:19.5529179Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:17:19.5529369Z 'range_num_stages': [0, 0, 0], 2026-02-21T08:17:19.5529535Z 'range_unroll_factors': [0, 1, 0], 2026-02-21T08:17:19.5529737Z 'range_warp_specializes': [None, None, None]} 2026-02-21T08:17:19.5557719Z [316s] Fitting surrogate: 913 points, 913 targets 2026-02-21T08:17:20.6878223Z [317s] Generation 9 starting: 69 neighbors, 4 active search path(s) 2026-02-21T08:17:33.9606105Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 1.3 configs/s 2026-02-21T08:17:38.3842788Z [335s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 128], indexing=['tensor_descriptor', 'pointer', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['last', 'last', 'first', 'first'], num_stages=2, num_warps=16, pid_type='flat', range_flattens=[None, None, None], range_multi_buffers=[None, None, None], range_num_stages=[0, 0, 4], range_unroll_factors=[0, 0, 3], range_warp_specializes=[None, True, None]) 2026-02-21T08:17:38.3844275Z Tensor-likes are not close! 
2026-02-21T08:17:38.3844388Z 2026-02-21T08:17:38.3844461Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T08:17:38.3844732Z Greatest absolute difference: 0.03125 at index (174394, 437) (up to 0.01 allowed) 2026-02-21T08:17:38.3845069Z Greatest relative difference: 0.7890625 at index (35973, 437) (up to 0.01 allowed) 2026-02-21T08:17:38.3845375Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:17:38.3845534Z 2026-02-21T08:17:38.3858710Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 73/73 16.7 configs/s 2026-02-21T08:17:47.3140801Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 111.6 2026-02-21T08:17:47.3144590Z configs/s 2026-02-21T08:17:47.5445188Z [344s] Generation 9 complete: 2026-02-21T08:17:47.5446389Z error=1 2026-02-21T08:17:47.5446541Z ok=73 2026-02-21T08:17:47.5446661Z min=0.1833 2026-02-21T08:17:47.5446791Z mid=0.2754 2026-02-21T08:17:47.5446908Z max=3.4222 2026-02-21T08:17:47.5447050Z best={'block_sizes': [4, 64, 512], 2026-02-21T08:17:47.5447313Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:17:47.5447597Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T08:17:47.5447793Z 'num_stages': 1, 2026-02-21T08:17:47.5447926Z 'num_warps': 1, 2026-02-21T08:17:47.5448070Z 'pid_type': 'flat', 2026-02-21T08:17:47.5448225Z 'range_flattens': [None, True, True], 2026-02-21T08:17:47.5448421Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:17:47.5448603Z 'range_num_stages': [0, 0, 0], 2026-02-21T08:17:47.5448772Z 'range_unroll_factors': [0, 1, 1], 2026-02-21T08:17:47.5448959Z 'range_warp_specializes': [None, None, None]} 2026-02-21T08:17:47.5492132Z [344s] Fitting surrogate: 987 points, 987 targets 2026-02-21T08:17:48.7100670Z [345s] Generation 10 starting: 70 neighbors, 4 active search path(s) 2026-02-21T08:17:53.4457280Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 30.0 configs/s 2026-02-21T08:17:57.8047448Z Generation 
10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 71/71 16.4 configs/s 2026-02-21T08:18:09.6341104Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 84.4 configs/s 2026-02-21T08:18:09.9178890Z [366s] Generation 10 complete: 2026-02-21T08:18:09.9180577Z ok=74 2026-02-21T08:18:09.9180752Z min=0.1832 2026-02-21T08:18:09.9180882Z mid=0.2611 2026-02-21T08:18:09.9181010Z max=3.3680 2026-02-21T08:18:09.9181147Z best={'block_sizes': [4, 64, 512], 2026-02-21T08:18:09.9181422Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:18:09.9181702Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T08:18:09.9182241Z 'num_stages': 1, 2026-02-21T08:18:09.9182384Z 'num_warps': 1, 2026-02-21T08:18:09.9182573Z 'pid_type': 'flat', 2026-02-21T08:18:09.9182749Z 'range_flattens': [None, False, True], 2026-02-21T08:18:09.9182947Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:18:09.9183138Z 'range_num_stages': [0, 0, 0], 2026-02-21T08:18:09.9183307Z 'range_unroll_factors': [0, 0, 1], 2026-02-21T08:18:09.9183502Z 'range_warp_specializes': [None, None, None]} 2026-02-21T08:18:09.9226914Z [366s] Fitting surrogate: 1061 points, 1061 targets 2026-02-21T08:18:11.1103003Z [368s] Generation 11 starting: 71 neighbors, 4 active search path(s) 2026-02-21T08:18:16.6550761Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 6.0 configs/s 2026-02-21T08:18:19.0073708Z [375s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 128], indexing=['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'last', 'first', ''], num_stages=3, num_warps=8, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, None, None], range_num_stages=[0, 3, 1], range_unroll_factors=[0, 2, 1], range_warp_specializes=[None, None, False]) 2026-02-21T08:18:19.0075207Z Tensor-likes are not close! 
2026-02-21T08:18:19.0075325Z 2026-02-21T08:18:19.0075413Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T08:18:19.0075714Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T08:18:19.0076088Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T08:18:19.0076440Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:18:19.0076639Z 2026-02-21T08:18:19.4292415Z [376s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 128], indexing=['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'last', 'first', ''], num_stages=3, num_warps=8, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, None, None], range_num_stages=[0, 3, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, False]) 2026-02-21T08:18:19.4293553Z Tensor-likes are not close! 2026-02-21T08:18:19.4293673Z 2026-02-21T08:18:19.4293747Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T08:18:19.4294015Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T08:18:19.4294359Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T08:18:19.4294666Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:18:19.4294822Z 2026-02-21T08:18:19.6177121Z [376s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 128], indexing=['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'last', 'first', ''], num_stages=3, num_warps=8, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, None, None], range_num_stages=[0, 3, 2], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, False]) 2026-02-21T08:18:19.6178346Z Tensor-likes are not close! 
2026-02-21T08:18:19.6184149Z 2026-02-21T08:18:19.6186658Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T08:18:19.6187012Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T08:18:19.6192421Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T08:18:19.6197018Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:18:19.6198708Z 2026-02-21T08:18:20.8160233Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 72/72 17.5 configs/s 2026-02-21T08:18:34.4524900Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 73.3 configs/s 2026-02-21T08:18:34.7515556Z [391s] Generation 11 complete: 2026-02-21T08:18:34.7515875Z error=3 2026-02-21T08:18:34.7516056Z ok=73 2026-02-21T08:18:34.7516213Z min=0.1833 2026-02-21T08:18:34.7516361Z mid=0.2407 2026-02-21T08:18:34.7516490Z max=0.6287 2026-02-21T08:18:34.7516710Z best={'block_sizes': [4, 64, 512], 2026-02-21T08:18:34.7517016Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:18:34.7517330Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T08:18:34.7517527Z 'num_stages': 1, 2026-02-21T08:18:34.7517682Z 'num_warps': 1, 2026-02-21T08:18:34.7517825Z 'pid_type': 'flat', 2026-02-21T08:18:34.7517988Z 'range_flattens': [None, True, True], 2026-02-21T08:18:34.7518182Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:18:34.7518371Z 'range_num_stages': [0, 0, 0], 2026-02-21T08:18:34.7518533Z 'range_unroll_factors': [0, 0, 1], 2026-02-21T08:18:34.7518724Z 'range_warp_specializes': [None, None, None]} 2026-02-21T08:18:34.7564066Z [391s] Fitting surrogate: 1137 points, 1137 targets 2026-02-21T08:18:35.8547684Z [392s] Generation 12 starting: 68 neighbors, 4 active search path(s) 2026-02-21T08:18:40.7115271Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 33.0 configs/s 2026-02-21T08:18:44.8546538Z Generation 12: exploring neighbors 100% 
━━━━━━━━━━━━━━━━━━━ 68/68 16.6 configs/s 2026-02-21T08:18:58.7682254Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 71.8 configs/s 2026-02-21T08:18:59.0906684Z [415s] Generation 12 complete: 2026-02-21T08:18:59.0908395Z ok=72 2026-02-21T08:18:59.0908587Z min=0.1821 2026-02-21T08:18:59.0908732Z mid=0.2303 2026-02-21T08:18:59.0908872Z max=3.3013 2026-02-21T08:18:59.0909037Z best={'block_sizes': [4, 128, 512], 2026-02-21T08:18:59.0909341Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:18:59.0909660Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T08:18:59.0909878Z 'num_stages': 1, 2026-02-21T08:18:59.0910039Z 'num_warps': 2, 2026-02-21T08:18:59.0910192Z 'pid_type': 'flat', 2026-02-21T08:18:59.0910376Z 'range_flattens': [None, True, True], 2026-02-21T08:18:59.0910588Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:18:59.0910805Z 'range_num_stages': [0, 0, 1], 2026-02-21T08:18:59.0911550Z 'range_unroll_factors': [0, 0, 1], 2026-02-21T08:18:59.0911805Z 'range_warp_specializes': [None, None, None]} 2026-02-21T08:18:59.0947810Z [415s] Fitting surrogate: 1209 points, 1209 targets 2026-02-21T08:19:00.0869289Z [416s] Generation 13 starting: 54 neighbors, 3 active search path(s) 2026-02-21T08:19:04.2910292Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 55/55 8.9 configs/s 2026-02-21T08:19:04.3040764Z [421s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 512, 512], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first', '', ''], num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, None, None], range_num_stages=[0, 0, 1], range_unroll_factors=[0, 0, 1], range_warp_specializes=[None, None, None]) 2026-02-21T08:19:04.3041956Z Tensor-likes are not close! 
2026-02-21T08:19:04.3046040Z 2026-02-21T08:19:04.3049759Z Mismatched elements: 17 / 268435456 (0.0%) 2026-02-21T08:19:04.3054235Z Greatest absolute difference: 0.013671875 at index (32168, 938) (up to 0.01 allowed) 2026-02-21T08:19:04.3055881Z Greatest relative difference: 1.9296875 at index (95396, 437) (up to 0.01 allowed) 2026-02-21T08:19:04.3056314Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:19:04.3060372Z 2026-02-21T08:19:04.3757090Z [421s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 512, 512], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first', '', ''], num_stages=1, num_warps=8, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, None, None], range_num_stages=[0, 0, 1], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, None, None]) 2026-02-21T08:19:04.3758556Z Tensor-likes are not close! 2026-02-21T08:19:04.3758683Z 2026-02-21T08:19:04.3758766Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T08:19:04.3759141Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T08:19:04.3764985Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T08:19:04.3767984Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:19:04.3768198Z 2026-02-21T08:19:04.6927458Z [421s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 512, 512], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first', '', ''], num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, None, True], range_num_stages=[0, 0, 1], range_unroll_factors=[0, 0, 1], range_warp_specializes=[None, None, None]) 2026-02-21T08:19:04.6928691Z Tensor-likes are not close! 
2026-02-21T08:19:04.6928812Z 2026-02-21T08:19:04.6928897Z Mismatched elements: 17 / 268435456 (0.0%) 2026-02-21T08:19:04.6929203Z Greatest absolute difference: 0.013671875 at index (32168, 938) (up to 0.01 allowed) 2026-02-21T08:19:04.6934342Z Greatest relative difference: 1.9296875 at index (95396, 437) (up to 0.01 allowed) 2026-02-21T08:19:04.6938745Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:19:04.6942436Z 2026-02-21T08:19:06.5671649Z [423s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 128], indexing=['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['last', '', 'first', 'last'], num_stages=1, num_warps=8, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, False, None], range_num_stages=[0, 0, 4], range_unroll_factors=[0, 0, 3], range_warp_specializes=[None, True, None]) 2026-02-21T08:19:06.5673026Z Tensor-likes are not close! 2026-02-21T08:19:06.5677299Z 2026-02-21T08:19:06.5680858Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T08:19:06.5686263Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T08:19:06.5691200Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T08:19:06.5692644Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:19:06.5692832Z 2026-02-21T08:19:07.4173381Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 55/55 17.2 configs/s 2026-02-21T08:19:18.6793361Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 88.6 configs/s 2026-02-21T08:19:18.9461441Z [435s] Generation 13 complete: 2026-02-21T08:19:18.9465584Z error=4 2026-02-21T08:19:18.9469422Z ok=54 2026-02-21T08:19:18.9470877Z min=0.1812 2026-02-21T08:19:18.9471039Z mid=0.2325 2026-02-21T08:19:18.9471166Z max=0.5878 2026-02-21T08:19:18.9471302Z best={'block_sizes': [4, 128, 512], 2026-02-21T08:19:18.9471581Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:19:18.9471933Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T08:19:18.9472135Z 'num_stages': 2, 2026-02-21T08:19:18.9472271Z 'num_warps': 2, 2026-02-21T08:19:18.9472450Z 'pid_type': 'flat', 2026-02-21T08:19:18.9472610Z 'range_flattens': [None, None, True], 2026-02-21T08:19:18.9472826Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:19:18.9473019Z 'range_num_stages': [0, 0, 1], 2026-02-21T08:19:18.9473187Z 'range_unroll_factors': [0, 0, 1], 2026-02-21T08:19:18.9473390Z 'range_warp_specializes': [None, None, None]} 2026-02-21T08:19:18.9507769Z [435s] Fitting surrogate: 1267 points, 1267 targets 2026-02-21T08:19:19.7053865Z [436s] Generation 14 starting: 38 neighbors, 2 active search path(s) 2026-02-21T08:19:22.0768870Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 38/38 42.9 configs/s 2026-02-21T08:19:24.4173890Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 38/38 16.5 configs/s 2026-02-21T08:19:31.9121212Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 132.8 2026-02-21T08:19:31.9121649Z configs/s 2026-02-21T08:19:32.1225645Z [449s] Generation 14 complete: 2026-02-21T08:19:32.1227577Z ok=40 2026-02-21T08:19:32.1227808Z min=0.1894 2026-02-21T08:19:32.1227988Z mid=0.2201 2026-02-21T08:19:32.1228158Z max=3.9347 
2026-02-21T08:19:32.1228344Z best={'block_sizes': [4, 128, 512], 2026-02-21T08:19:32.1228694Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:19:32.1229045Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T08:19:32.1229262Z 'num_stages': 2, 2026-02-21T08:19:32.1229418Z 'num_warps': 2, 2026-02-21T08:19:32.1229585Z 'pid_type': 'flat', 2026-02-21T08:19:32.1229767Z 'range_flattens': [None, None, True], 2026-02-21T08:19:32.1230003Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:19:32.1230226Z 'range_num_stages': [0, 0, 1], 2026-02-21T08:19:32.1230415Z 'range_unroll_factors': [0, 0, 1], 2026-02-21T08:19:32.1230650Z 'range_warp_specializes': [None, False, None]} 2026-02-21T08:19:32.1276449Z [449s] Fitting surrogate: 1307 points, 1307 targets 2026-02-21T08:19:32.8490964Z [449s] Generation 15 starting: 34 neighbors, 2 active search path(s) 2026-02-21T08:19:35.0795689Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 25.6 configs/s 2026-02-21T08:19:37.1398718Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 34/34 16.9 configs/s 2026-02-21T08:19:45.1295878Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 135.8 2026-02-21T08:19:45.1296429Z configs/s 2026-02-21T08:19:45.3203099Z [462s] Generation 15 complete: 2026-02-21T08:19:45.3207318Z ok=36 2026-02-21T08:19:45.3211181Z min=0.1884 2026-02-21T08:19:45.3212646Z mid=0.2181 2026-02-21T08:19:45.3212812Z max=0.9430 2026-02-21T08:19:45.3212956Z best={'block_sizes': [4, 64, 512], 2026-02-21T08:19:45.3213239Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:19:45.3213533Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T08:19:45.3213723Z 'num_stages': 2, 2026-02-21T08:19:45.3214382Z 'num_warps': 1, 2026-02-21T08:19:45.3214562Z 'pid_type': 'flat', 2026-02-21T08:19:45.3214834Z 'range_flattens': [None, None, True], 2026-02-21T08:19:45.3215054Z 'range_multi_buffers': [None, 
None, True], 2026-02-21T08:19:45.3215243Z 'range_num_stages': [0, 0, 1], 2026-02-21T08:19:45.3215408Z 'range_unroll_factors': [0, 0, 1], 2026-02-21T08:19:45.3215605Z 'range_warp_specializes': [None, False, None]} 2026-02-21T08:19:45.3241378Z [462s] Fitting surrogate: 1343 points, 1343 targets 2026-02-21T08:19:45.8127822Z [462s] Generation 16 starting: 17 neighbors, 1 active search path(s) 2026-02-21T08:19:46.9982228Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 40.7 configs/s 2026-02-21T08:19:48.0110778Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.6 configs/s 2026-02-21T08:19:52.3618575Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 227.5 2026-02-21T08:19:52.3622772Z configs/s 2026-02-21T08:19:52.5133688Z [469s] Generation 16 complete: 2026-02-21T08:19:52.5139204Z ok=18 2026-02-21T08:19:52.5143498Z min=0.1916 2026-02-21T08:19:52.5150497Z mid=0.1976 2026-02-21T08:19:52.5152164Z max=0.2754 2026-02-21T08:19:52.5152354Z best={'block_sizes': [4, 64, 512], 2026-02-21T08:19:52.5152659Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:19:52.5152968Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T08:19:52.5153173Z 'num_stages': 1, 2026-02-21T08:19:52.5153325Z 'num_warps': 1, 2026-02-21T08:19:52.5153471Z 'pid_type': 'flat', 2026-02-21T08:19:52.5153643Z 'range_flattens': [None, True, True], 2026-02-21T08:19:52.5153846Z 'range_multi_buffers': [None, None, True], 2026-02-21T08:19:52.5154050Z 'range_num_stages': [0, 0, 1], 2026-02-21T08:19:52.5154226Z 'range_unroll_factors': [0, 0, 1], 2026-02-21T08:19:52.5154440Z 'range_warp_specializes': [None, False, None]} 2026-02-21T08:19:52.5189651Z [469s] Fitting surrogate: 1361 points, 1361 targets 2026-02-21T08:19:53.0205693Z [469s] Generation 17 starting: 18 neighbors, 1 active search path(s) 2026-02-21T08:19:54.2603520Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 60.5 configs/s 
2026-02-21T08:19:55.3974843Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 17.4 configs/s 2026-02-21T08:19:59.9933287Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 215.5 2026-02-21T08:19:59.9937065Z configs/s 2026-02-21T08:20:00.1494476Z [477s] Generation 17 complete: 2026-02-21T08:20:00.1499287Z ok=20 2026-02-21T08:20:00.1500942Z min=0.1965 2026-02-21T08:20:00.1501173Z mid=0.1987 2026-02-21T08:20:00.1506548Z max=0.3524 2026-02-21T08:20:00.1508532Z best={'block_sizes': [4, 64, 512], 2026-02-21T08:20:00.1508846Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:20:00.1509581Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T08:20:00.1509806Z 'num_stages': 2, 2026-02-21T08:20:00.1509970Z 'num_warps': 1, 2026-02-21T08:20:00.1510122Z 'pid_type': 'flat', 2026-02-21T08:20:00.1510283Z 'range_flattens': [None, True, True], 2026-02-21T08:20:00.1510485Z 'range_multi_buffers': [None, None, True], 2026-02-21T08:20:00.1510670Z 'range_num_stages': [0, 0, 1], 2026-02-21T08:20:00.1510845Z 'range_unroll_factors': [0, 0, 1], 2026-02-21T08:20:00.1511039Z 'range_warp_specializes': [None, False, None]} 2026-02-21T08:20:00.1546836Z [477s] Fitting surrogate: 1381 points, 1381 targets 2026-02-21T08:20:00.6511493Z [477s] Generation 18 starting: 15 neighbors, 1 active search path(s) 2026-02-21T08:20:01.7379803Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 42.0 configs/s 2026-02-21T08:20:02.6239879Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 17.9 configs/s 2026-02-21T08:20:06.4342584Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 259.3 2026-02-21T08:20:06.4343547Z configs/s 2026-02-21T08:20:06.5712662Z [483s] Generation 18 complete: 2026-02-21T08:20:06.5716970Z ok=16 2026-02-21T08:20:06.5719210Z min=0.1975 2026-02-21T08:20:06.5719396Z mid=0.1987 2026-02-21T08:20:06.5719541Z max=0.3564 2026-02-21T08:20:06.5719761Z best={'block_sizes': 
[4, 64, 512], 2026-02-21T08:20:06.5720148Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:20:06.5720471Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T08:20:06.5720702Z 'num_stages': 2, 2026-02-21T08:20:06.5720859Z 'num_warps': 1, 2026-02-21T08:20:06.5721024Z 'pid_type': 'flat', 2026-02-21T08:20:06.5721207Z 'range_flattens': [None, True, True], 2026-02-21T08:20:06.5721434Z 'range_multi_buffers': [None, None, True], 2026-02-21T08:20:06.5721647Z 'range_num_stages': [0, 0, 1], 2026-02-21T08:20:06.5721845Z 'range_unroll_factors': [0, 0, 1], 2026-02-21T08:20:06.5722290Z 'range_warp_specializes': [None, False, None]} 2026-02-21T08:20:06.5765018Z [483s] Fitting surrogate: 1397 points, 1397 targets 2026-02-21T08:20:06.9028590Z [483s] Autotuning complete in 483.8s after searching 1344 configs. 2026-02-21T08:20:06.9030357Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:20:06.9031372Z @helion.kernel(config=helion.Config(block_sizes=[4, 64, 512], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first', '', ''], num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, None, True], range_num_stages=[0, 0, 1], range_unroll_factors=[0, 0, 1], range_warp_specializes=[None, False, None]), static_shapes=True) 2026-02-21T08:20:06.9033737Z 2026-02-21T08:20:06.9033996Z [483s] Code of selected kernel: /tmp/torchinductor_root/26/c26klrbrkvex4yaw4pwxdrrgxrzshez7kx53eo7r3fhr4vmqvmho.py 2026-02-21T08:20:08.1599593Z WARNING:tritonbench.utils.triton_op:Completed input ID 0: 2026-02-21T08:20:08.1600619Z x_val 2026-02-21T08:20:08.1601035Z ------- 2026-02-21T08:20:08.1601164Z 1024 2026-02-21T08:20:08.1601264Z 2026-02-21T08:20:08.1632946Z 17%|█▋ | 1/6 [11:58<59:51, 718.22s/it]WARNING:tritonbench.utils.triton_op:Running input ID 2: 2026-02-21T08:20:08.1633306Z x_val 2026-02-21T08:20:08.1633441Z ------- 
2026-02-21T08:20:08.1633574Z 2048 2026-02-21T08:20:08.1633839Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for eager_layer_norm 2026-02-21T08:20:08.8532221Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_welford 2026-02-21T08:20:10.1570239Z INFO:tritonbench.utils.triton_op:Took 2.63ms to get benchmark function for torch_compile_welford 2026-02-21T08:27:52.2816738Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:27:52.2818365Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:27:52.2818604Z 'dtype': 'torch.bfloat16', 2026-02-21T08:27:52.2818817Z 'shape': (2048,), 2026-02-21T08:27:52.2819472Z 'stride': (1,)}, 2026-02-21T08:27:52.2819691Z { 'device': 'cuda:0', 2026-02-21T08:27:52.2819884Z 'dtype': 'torch.bfloat16', 2026-02-21T08:27:52.2820070Z 'shape': (2048,), 2026-02-21T08:27:52.2820239Z 'stride': (1,)}, 2026-02-21T08:27:52.2820402Z { 'device': 'cuda:0', 2026-02-21T08:27:52.2820585Z 'dtype': 'torch.bfloat16', 2026-02-21T08:27:52.2820770Z 'shape': (262144, 2048), 2026-02-21T08:27:52.2820958Z 'stride': (2048, 1)}), 2026-02-21T08:27:52.2821134Z 'kwargs': {}} 2026-02-21T08:27:52.2844782Z INFO:tritonbench.utils.triton_op:Took 3.16ms to get benchmark function for helion_welford 2026-02-21T08:27:53.8262889Z [0s] Autotune random seed: 2134763656 2026-02-21T08:27:53.9964535Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:28:28.3422979Z [34s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 128], indexing=['tensor_descriptor', 'pointer', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['last', '', '', 'last'], maxnreg=256, num_sm_multiplier=32, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None, True], range_multi_buffers=[False, True, False], range_num_stages=[0, 0, 1], range_unroll_factors=[0, 1, 3], range_warp_specializes=[None, True, None]) 2026-02-21T08:28:28.3443765Z 
Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s 2026-02-21T08:28:33.0184775Z module attributes {ttg.maxnreg = 64 : i32} { 2026-02-21T08:28:33.0189466Z tt.func public @_helion_welford(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:28:33.0193579Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:28:33.0195108Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:28:33.0195399Z %cst = arith.constant dense<1.000000e+00> : tensor<64xf32> 2026-02-21T08:28:33.0195666Z %cst_0 = arith.constant dense<3.200000e+01> : tensor<64xf32> 2026-02-21T08:28:33.0195895Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:28:33.0196072Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:28:33.0196258Z %c2368_i32 = arith.constant 2368 : i32 2026-02-21T08:28:33.0196470Z %cst_1 = arith.constant dense<0.000000e+00> : tensor<64xf32> 2026-02-21T08:28:33.0196690Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:28:33.0196878Z %c262144_i32 = arith.constant 262144 : i32 2026-02-21T08:28:33.0197066Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:28:33.0197252Z %c2048_i64 = arith.constant 2048 : i64 2026-02-21T08:28:33.0197424Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:28:33.0197755Z %0 = tt.make_tensor_descriptor %arg0, [%c262144_i32, %c2048_i32], [%c2048_i64, %c1_i64] : , > 2026-02-21T08:28:33.0198222Z %1 = tt.make_tensor_descriptor %arg0, [%c262144_i32, %c2048_i32], [%c2048_i64, %c1_i64] : , > 2026-02-21T08:28:33.0199032Z %2 = tt.make_tensor_descriptor %arg3, [%c262144_i32, %c2048_i32], [%c2048_i64, %c1_i64] : , > 2026-02-21T08:28:33.0199351Z %3 = tt.get_program_id x : i32 2026-02-21T08:28:33.0199563Z scf.for %arg5 = %3 to %c4096_i32 step %c2368_i32 : i32 { 2026-02-21T08:28:33.0199798Z %4 = arith.muli %arg5, %c64_i32 : i32 2026-02-21T08:28:33.0199992Z %c2016_i32 = 
arith.constant 2016 : i32 2026-02-21T08:28:33.0200184Z %c96_i32 = arith.constant 96 : i32 2026-02-21T08:28:33.0200600Z %5:3 = scf.for %arg6 = %c0_i32 to %c2016_i32 step %c96_i32 iter_args(%arg7 = %cst_1, %arg8 = %cst_1, %arg9 = %cst_1) -> (tensor<64xf32>, tensor<64xf32>, tensor<64xf32>) : i32 { 2026-02-21T08:28:33.0201112Z %35 = tt.descriptor_load %0[%4, %arg6] : !tt.tensordesc> -> tensor<64x32xbf16> 2026-02-21T08:28:33.0201561Z %36 = "tt.reduce"(%35) <{axis = 1 : i32}> ({ 2026-02-21T08:28:33.0201768Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T08:28:33.0202306Z %108 = arith.addf %arg10, %arg11 : bf16 2026-02-21T08:28:33.0202496Z tt.reduce.return %108 : bf16 2026-02-21T08:28:33.0202697Z }) : (tensor<64x32xbf16>) -> tensor<64xbf16> 2026-02-21T08:28:33.0202902Z %37 = arith.mulf %35, %35 : tensor<64x32xbf16> 2026-02-21T08:28:33.0203108Z %38 = "tt.reduce"(%37) <{axis = 1 : i32}> ({ 2026-02-21T08:28:33.0203300Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T08:28:33.0203483Z %108 = arith.addf %arg10, %arg11 : bf16 2026-02-21T08:28:33.0203679Z tt.reduce.return %108 : bf16 2026-02-21T08:28:33.0203862Z }) : (tensor<64x32xbf16>) -> tensor<64xbf16> 2026-02-21T08:28:33.0204090Z %39 = arith.extf %36 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:28:33.0204311Z %40 = arith.divf %39, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0204520Z %41 = arith.mulf %36, %36 : tensor<64xbf16> 2026-02-21T08:28:33.0204743Z %42 = arith.extf %41 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:28:33.0204954Z %43 = arith.divf %42, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0205168Z %44 = arith.extf %38 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:28:33.0205380Z %45 = arith.subf %44, %43 : tensor<64xf32> 2026-02-21T08:28:33.0205580Z %46 = arith.subf %40, %arg8 : tensor<64xf32> 2026-02-21T08:28:33.0205779Z %47 = arith.addf %arg7, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0205982Z %48 = arith.divf %cst, %47 : tensor<64xf32> 2026-02-21T08:28:33.0206181Z %49 = arith.mulf %48, %cst_0 : 
tensor<64xf32> 2026-02-21T08:28:33.0206370Z %50 = arith.mulf %46, %49 : tensor<64xf32> 2026-02-21T08:28:33.0206566Z %51 = arith.addf %arg8, %50 : tensor<64xf32> 2026-02-21T08:28:33.0206757Z %52 = arith.addf %arg9, %45 : tensor<64xf32> 2026-02-21T08:28:33.0206956Z %53 = arith.mulf %46, %46 : tensor<64xf32> 2026-02-21T08:28:33.0207154Z %54 = arith.mulf %arg7, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0207353Z %55 = arith.divf %54, %47 : tensor<64xf32> 2026-02-21T08:28:33.0207536Z %56 = arith.mulf %53, %55 : tensor<64xf32> 2026-02-21T08:28:33.0207724Z %57 = arith.addf %52, %56 : tensor<64xf32> 2026-02-21T08:28:33.0207914Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:28:33.0208096Z %58 = arith.muli %c32_i32, %c1_i32 : i32 2026-02-21T08:28:33.0208283Z %59 = arith.addi %arg6, %58 : i32 2026-02-21T08:28:33.0208552Z %60 = tt.descriptor_load %0[%4, %59] : !tt.tensordesc> -> tensor<64x32xbf16> 2026-02-21T08:28:33.0208844Z %61 = "tt.reduce"(%60) <{axis = 1 : i32}> ({ 2026-02-21T08:28:33.0209031Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T08:28:33.0209221Z %108 = arith.addf %arg10, %arg11 : bf16 2026-02-21T08:28:33.0209415Z tt.reduce.return %108 : bf16 2026-02-21T08:28:33.0209598Z }) : (tensor<64x32xbf16>) -> tensor<64xbf16> 2026-02-21T08:28:33.0209899Z %62 = arith.mulf %60, %60 : tensor<64x32xbf16> 2026-02-21T08:28:33.0210095Z %63 = "tt.reduce"(%62) <{axis = 1 : i32}> ({ 2026-02-21T08:28:33.0210288Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T08:28:33.0210473Z %108 = arith.addf %arg10, %arg11 : bf16 2026-02-21T08:28:33.0210672Z tt.reduce.return %108 : bf16 2026-02-21T08:28:33.0210918Z }) : (tensor<64x32xbf16>) -> tensor<64xbf16> 2026-02-21T08:28:33.0211135Z %64 = arith.extf %61 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:28:33.0211347Z %65 = arith.divf %64, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0211545Z %66 = arith.mulf %61, %61 : tensor<64xbf16> 2026-02-21T08:28:33.0211752Z %67 = arith.extf %66 : tensor<64xbf16> to tensor<64xf32> 
2026-02-21T08:28:33.0212017Z %68 = arith.divf %67, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0212297Z %69 = arith.extf %63 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:28:33.0212514Z %70 = arith.subf %69, %68 : tensor<64xf32> 2026-02-21T08:28:33.0212707Z %71 = arith.subf %65, %51 : tensor<64xf32> 2026-02-21T08:28:33.0212897Z %72 = arith.addf %47, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0213093Z %73 = arith.divf %cst, %72 : tensor<64xf32> 2026-02-21T08:28:33.0213285Z %74 = arith.mulf %73, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0213478Z %75 = arith.mulf %71, %74 : tensor<64xf32> 2026-02-21T08:28:33.0213663Z %76 = arith.addf %51, %75 : tensor<64xf32> 2026-02-21T08:28:33.0213854Z %77 = arith.addf %57, %70 : tensor<64xf32> 2026-02-21T08:28:33.0214045Z %78 = arith.mulf %71, %71 : tensor<64xf32> 2026-02-21T08:28:33.0214233Z %79 = arith.mulf %47, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0214428Z %80 = arith.divf %79, %72 : tensor<64xf32> 2026-02-21T08:28:33.0214614Z %81 = arith.mulf %78, %80 : tensor<64xf32> 2026-02-21T08:28:33.0214802Z %82 = arith.addf %77, %81 : tensor<64xf32> 2026-02-21T08:28:33.0214984Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:28:33.0215170Z %83 = arith.muli %c32_i32, %c2_i32 : i32 2026-02-21T08:28:33.0215348Z %84 = arith.addi %arg6, %83 : i32 2026-02-21T08:28:33.0215615Z %85 = tt.descriptor_load %0[%4, %84] : !tt.tensordesc> -> tensor<64x32xbf16> 2026-02-21T08:28:33.0215900Z %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({ 2026-02-21T08:28:33.0216077Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T08:28:33.0216257Z %108 = arith.addf %arg10, %arg11 : bf16 2026-02-21T08:28:33.0216438Z tt.reduce.return %108 : bf16 2026-02-21T08:28:33.0216622Z }) : (tensor<64x32xbf16>) -> tensor<64xbf16> 2026-02-21T08:28:33.0216814Z %87 = arith.mulf %85, %85 : tensor<64x32xbf16> 2026-02-21T08:28:33.0217013Z %88 = "tt.reduce"(%87) <{axis = 1 : i32}> ({ 2026-02-21T08:28:33.0217197Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T08:28:33.0217375Z %108 = 
arith.addf %arg10, %arg11 : bf16 2026-02-21T08:28:33.0217563Z tt.reduce.return %108 : bf16 2026-02-21T08:28:33.0217741Z }) : (tensor<64x32xbf16>) -> tensor<64xbf16> 2026-02-21T08:28:33.0217957Z %89 = arith.extf %86 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:28:33.0218170Z %90 = arith.divf %89, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0218368Z %91 = arith.mulf %86, %86 : tensor<64xbf16> 2026-02-21T08:28:33.0218580Z %92 = arith.extf %91 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:28:33.0218788Z %93 = arith.divf %92, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0219003Z %94 = arith.extf %88 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:28:33.0219207Z %95 = arith.subf %94, %93 : tensor<64xf32> 2026-02-21T08:28:33.0219401Z %96 = arith.subf %90, %76 : tensor<64xf32> 2026-02-21T08:28:33.0219592Z %97 = arith.addf %72, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0219795Z %98 = arith.divf %cst, %97 : tensor<64xf32> 2026-02-21T08:28:33.0220053Z %99 = arith.mulf %98, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0220256Z %100 = arith.mulf %96, %99 : tensor<64xf32> 2026-02-21T08:28:33.0220455Z %101 = arith.addf %76, %100 : tensor<64xf32> 2026-02-21T08:28:33.0220645Z %102 = arith.addf %82, %95 : tensor<64xf32> 2026-02-21T08:28:33.0220841Z %103 = arith.mulf %96, %96 : tensor<64xf32> 2026-02-21T08:28:33.0221033Z %104 = arith.mulf %72, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0221236Z %105 = arith.divf %104, %97 : tensor<64xf32> 2026-02-21T08:28:33.0221432Z %106 = arith.mulf %103, %105 : tensor<64xf32> 2026-02-21T08:28:33.0221631Z %107 = arith.addf %102, %106 : tensor<64xf32> 2026-02-21T08:28:33.0221904Z scf.yield %97, %101, %107 : tensor<64xf32>, tensor<64xf32>, tensor<64xf32> 2026-02-21T08:28:33.0222153Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:28:33.0222502Z %6 = tt.descriptor_load %0[%4, %c2016_i32] : !tt.tensordesc> -> tensor<64x32xbf16> 2026-02-21T08:28:33.0222802Z %7 = "tt.reduce"(%6) <{axis = 1 : i32}> ({ 2026-02-21T08:28:33.0222993Z 
^bb0(%arg6: bf16, %arg7: bf16): 2026-02-21T08:28:33.0223172Z %35 = arith.addf %arg6, %arg7 : bf16 2026-02-21T08:28:33.0223362Z tt.reduce.return %35 : bf16 2026-02-21T08:28:33.0223550Z }) : (tensor<64x32xbf16>) -> tensor<64xbf16> 2026-02-21T08:28:33.0223739Z %8 = arith.mulf %6, %6 : tensor<64x32xbf16> 2026-02-21T08:28:33.0223936Z %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({ 2026-02-21T08:28:33.0224118Z ^bb0(%arg6: bf16, %arg7: bf16): 2026-02-21T08:28:33.0224305Z %35 = arith.addf %arg6, %arg7 : bf16 2026-02-21T08:28:33.0224481Z tt.reduce.return %35 : bf16 2026-02-21T08:28:33.0224665Z }) : (tensor<64x32xbf16>) -> tensor<64xbf16> 2026-02-21T08:28:33.0224876Z %10 = arith.extf %7 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:28:33.0225101Z %11 = arith.divf %10, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0225304Z %12 = arith.mulf %7, %7 : tensor<64xbf16> 2026-02-21T08:28:33.0225511Z %13 = arith.extf %12 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:28:33.0225729Z %14 = arith.divf %13, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0225936Z %15 = arith.extf %9 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T08:28:33.0226146Z %16 = arith.subf %15, %14 : tensor<64xf32> 2026-02-21T08:28:33.0226333Z %17 = arith.subf %11, %5#1 : tensor<64xf32> 2026-02-21T08:28:33.0226530Z %18 = arith.addf %5#0, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0226725Z %19 = arith.divf %cst, %18 : tensor<64xf32> 2026-02-21T08:28:33.0226915Z %20 = arith.mulf %19, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0227110Z %21 = arith.mulf %17, %20 : tensor<64xf32> 2026-02-21T08:28:33.0227295Z %22 = arith.addf %5#1, %21 : tensor<64xf32> 2026-02-21T08:28:33.0227487Z %23 = arith.addf %5#2, %16 : tensor<64xf32> 2026-02-21T08:28:33.0227671Z %24 = arith.mulf %17, %17 : tensor<64xf32> 2026-02-21T08:28:33.0227868Z %25 = arith.mulf %5#0, %cst_0 : tensor<64xf32> 2026-02-21T08:28:33.0228058Z %26 = arith.divf %25, %18 : tensor<64xf32> 2026-02-21T08:28:33.0228250Z %27 = arith.mulf %24, %26 : tensor<64xf32> 
2026-02-21T08:28:33.0228440Z %28 = arith.addf %23, %27 : tensor<64xf32> 2026-02-21T08:28:33.0228622Z %29 = arith.divf %28, %18 : tensor<64xf32> 2026-02-21T08:28:33.0228822Z %30 = tt.splat %arg4 : f32 -> tensor<64xf32> 2026-02-21T08:28:33.0229020Z %31 = arith.addf %29, %30 : tensor<64xf32> 2026-02-21T08:28:33.0229390Z %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_rsqrtf"} : (tensor<64xf32>) -> tensor<64xf32> 2026-02-21T08:28:33.0229814Z %33 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> 2026-02-21T08:28:33.0230135Z %34 = tt.expand_dims %32 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> 2026-02-21T08:28:33.0230456Z %c2016_i32_2 = arith.constant 2016 : i32 2026-02-21T08:28:33.0230651Z %c48_i32 = arith.constant 48 : i32 2026-02-21T08:28:33.0230889Z scf.for %arg6 = %c0_i32 to %c2016_i32_2 step %c48_i32 : i32 { 2026-02-21T08:28:33.0231195Z %35 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:28:33.0231460Z %36 = tt.splat %arg6 : i32 -> tensor<16xi32> 2026-02-21T08:28:33.0231668Z %37 = arith.addi %36, %35 : tensor<16xi32> 2026-02-21T08:28:33.0231991Z %38 = tt.descriptor_load %1[%4, %arg6] : !tt.tensordesc> -> tensor<64x16xbf16> 2026-02-21T08:28:33.0232336Z %39 = tt.splat %arg1 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:28:33.0232612Z %40 = tt.addptr %39, %37 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:28:33.0232913Z %41 = tt.load %40 evictionPolicy = evict_first : tensor<16x!tt.ptr> 2026-02-21T08:28:33.0233316Z %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<16xbf16> -> tensor<1x16xbf16> 2026-02-21T08:28:33.0233627Z %43 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:28:33.0233907Z %44 = tt.addptr %43, %37 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:28:33.0234196Z %45 = tt.load %44 evictionPolicy = evict_first : tensor<16x!tt.ptr> 2026-02-21T08:28:33.0234515Z %46 = tt.expand_dims %45 {axis = 0 : i32} : tensor<16xbf16> -> 
tensor<1x16xbf16> 2026-02-21T08:28:33.0234811Z %47 = arith.extf %38 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T08:28:33.0235088Z %48 = tt.broadcast %33 : tensor<64x1xf32> -> tensor<64x16xf32> 2026-02-21T08:28:33.0235332Z %49 = arith.subf %47, %48 : tensor<64x16xf32> 2026-02-21T08:28:33.0235563Z %50 = tt.broadcast %34 : tensor<64x1xf32> -> tensor<64x16xf32> 2026-02-21T08:28:33.0235802Z %51 = arith.mulf %49, %50 : tensor<64x16xf32> 2026-02-21T08:28:33.0236037Z %52 = arith.extf %42 : tensor<1x16xbf16> to tensor<1x16xf32> 2026-02-21T08:28:33.0236310Z %53 = tt.broadcast %52 : tensor<1x16xf32> -> tensor<64x16xf32> 2026-02-21T08:28:33.0236527Z %54 = arith.mulf %51, %53 : tensor<64x16xf32> 2026-02-21T08:28:33.0236750Z %55 = arith.extf %46 : tensor<1x16xbf16> to tensor<1x16xf32> 2026-02-21T08:28:33.0237001Z %56 = tt.broadcast %55 : tensor<1x16xf32> -> tensor<64x16xf32> 2026-02-21T08:28:33.0237217Z %57 = arith.addf %54, %56 : tensor<64x16xf32> 2026-02-21T08:28:33.0237449Z %58 = arith.truncf %57 : tensor<64x16xf32> to tensor<64x16xbf16> 2026-02-21T08:28:33.0237759Z tt.descriptor_store %2[%4, %arg6], %58 : !tt.tensordesc>, tensor<64x16xbf16> 2026-02-21T08:28:33.0238045Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:28:33.0238226Z %59 = arith.muli %c16_i32, %c1_i32 : i32 2026-02-21T08:28:33.0238415Z %60 = arith.addi %arg6, %59 : i32 2026-02-21T08:28:33.0238642Z %61 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:28:33.0238879Z %62 = tt.splat %60 : i32 -> tensor<16xi32> 2026-02-21T08:28:33.0239073Z %63 = arith.addi %62, %61 : tensor<16xi32> 2026-02-21T08:28:33.0239342Z %64 = tt.descriptor_load %1[%4, %60] : !tt.tensordesc> -> tensor<64x16xbf16> 2026-02-21T08:28:33.0239660Z %65 = tt.splat %arg1 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:28:33.0239914Z %66 = tt.addptr %65, %63 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:28:33.0240198Z %67 = tt.load %66 evictionPolicy = evict_first : tensor<16x!tt.ptr> 
2026-02-21T08:28:33.0240498Z %68 = tt.expand_dims %67 {axis = 0 : i32} : tensor<16xbf16> -> tensor<1x16xbf16> 2026-02-21T08:28:33.0240782Z %69 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:28:33.0241044Z %70 = tt.addptr %69, %63 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:28:33.0241314Z %71 = tt.load %70 evictionPolicy = evict_first : tensor<16x!tt.ptr> 2026-02-21T08:28:33.0241683Z %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<16xbf16> -> tensor<1x16xbf16> 2026-02-21T08:28:33.0241988Z %73 = arith.extf %64 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T08:28:33.0242235Z %74 = tt.broadcast %33 : tensor<64x1xf32> -> tensor<64x16xf32> 2026-02-21T08:28:33.0242467Z %75 = arith.subf %73, %74 : tensor<64x16xf32> 2026-02-21T08:28:33.0242683Z %76 = tt.broadcast %34 : tensor<64x1xf32> -> tensor<64x16xf32> 2026-02-21T08:28:33.0242909Z %77 = arith.mulf %75, %76 : tensor<64x16xf32> 2026-02-21T08:28:33.0243125Z %78 = arith.extf %68 : tensor<1x16xbf16> to tensor<1x16xf32> 2026-02-21T08:28:33.0243375Z %79 = tt.broadcast %78 : tensor<1x16xf32> -> tensor<64x16xf32> 2026-02-21T08:28:33.0243605Z %80 = arith.mulf %77, %79 : tensor<64x16xf32> 2026-02-21T08:28:33.0243874Z %81 = arith.extf %72 : tensor<1x16xbf16> to tensor<1x16xf32> 2026-02-21T08:28:33.0244135Z %82 = tt.broadcast %81 : tensor<1x16xf32> -> tensor<64x16xf32> 2026-02-21T08:28:33.0244351Z %83 = arith.addf %80, %82 : tensor<64x16xf32> 2026-02-21T08:28:33.0244578Z %84 = arith.truncf %83 : tensor<64x16xf32> to tensor<64x16xbf16> 2026-02-21T08:28:33.0244880Z tt.descriptor_store %2[%4, %60], %84 : !tt.tensordesc>, tensor<64x16xbf16> 2026-02-21T08:28:33.0245161Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:28:33.0245350Z %85 = arith.muli %c16_i32, %c2_i32 : i32 2026-02-21T08:28:33.0245584Z %86 = arith.addi %arg6, %85 : i32 2026-02-21T08:28:33.0245813Z %87 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:28:33.0246050Z %88 = tt.splat %86 : i32 -> tensor<16xi32> 
2026-02-21T08:28:33.0246247Z %89 = arith.addi %88, %87 : tensor<16xi32> 2026-02-21T08:28:33.0246518Z %90 = tt.descriptor_load %1[%4, %86] : !tt.tensordesc> -> tensor<64x16xbf16> 2026-02-21T08:28:33.0246836Z %91 = tt.splat %arg1 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:28:33.0247101Z %92 = tt.addptr %91, %89 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:28:33.0247379Z %93 = tt.load %92 evictionPolicy = evict_first : tensor<16x!tt.ptr> 2026-02-21T08:28:33.0247685Z %94 = tt.expand_dims %93 {axis = 0 : i32} : tensor<16xbf16> -> tensor<1x16xbf16> 2026-02-21T08:28:33.0247970Z %95 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:28:33.0248234Z %96 = tt.addptr %95, %89 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:28:33.0248511Z %97 = tt.load %96 evictionPolicy = evict_first : tensor<16x!tt.ptr> 2026-02-21T08:28:33.0248815Z %98 = tt.expand_dims %97 {axis = 0 : i32} : tensor<16xbf16> -> tensor<1x16xbf16> 2026-02-21T08:28:33.0249103Z %99 = arith.extf %90 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T08:28:33.0249357Z %100 = tt.broadcast %33 : tensor<64x1xf32> -> tensor<64x16xf32> 2026-02-21T08:28:33.0249600Z %101 = arith.subf %99, %100 : tensor<64x16xf32> 2026-02-21T08:28:33.0249824Z %102 = tt.broadcast %34 : tensor<64x1xf32> -> tensor<64x16xf32> 2026-02-21T08:28:33.0250061Z %103 = arith.mulf %101, %102 : tensor<64x16xf32> 2026-02-21T08:28:33.0250294Z %104 = arith.extf %94 : tensor<1x16xbf16> to tensor<1x16xf32> 2026-02-21T08:28:33.0250543Z %105 = tt.broadcast %104 : tensor<1x16xf32> -> tensor<64x16xf32> 2026-02-21T08:28:33.0250782Z %106 = arith.mulf %103, %105 : tensor<64x16xf32> 2026-02-21T08:28:33.0251000Z %107 = arith.extf %98 : tensor<1x16xbf16> to tensor<1x16xf32> 2026-02-21T08:28:33.0251249Z %108 = tt.broadcast %107 : tensor<1x16xf32> -> tensor<64x16xf32> 2026-02-21T08:28:33.0251473Z %109 = arith.addf %106, %108 : tensor<64x16xf32> 2026-02-21T08:28:33.0251711Z %110 = arith.truncf %109 : tensor<64x16xf32> to tensor<64x16xbf16> 
2026-02-21T08:28:33.0252114Z tt.descriptor_store %2[%4, %86], %110 : !tt.tensordesc>, tensor<64x16xbf16> 2026-02-21T08:28:33.0252382Z } {tt.flatten} 2026-02-21T08:28:33.0252579Z scf.for %arg6 = %c2016_i32_2 to %c2048_i32 step %c16_i32 : i32 { 2026-02-21T08:28:33.0252844Z %35 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:28:33.0253089Z %36 = tt.splat %arg6 : i32 -> tensor<16xi32> 2026-02-21T08:28:33.0253285Z %37 = arith.addi %36, %35 : tensor<16xi32> 2026-02-21T08:28:33.0253575Z %38 = tt.descriptor_load %1[%4, %arg6] : !tt.tensordesc> -> tensor<64x16xbf16> 2026-02-21T08:28:33.0253898Z %39 = tt.splat %arg1 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:28:33.0254156Z %40 = tt.addptr %39, %37 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:28:33.0254494Z %41 = tt.load %40 evictionPolicy = evict_first : tensor<16x!tt.ptr> 2026-02-21T08:28:33.0254794Z %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<16xbf16> -> tensor<1x16xbf16> 2026-02-21T08:28:33.0255084Z %43 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:28:33.0255377Z %44 = tt.addptr %43, %37 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:28:33.0255651Z %45 = tt.load %44 evictionPolicy = evict_first : tensor<16x!tt.ptr> 2026-02-21T08:28:33.0255955Z %46 = tt.expand_dims %45 {axis = 0 : i32} : tensor<16xbf16> -> tensor<1x16xbf16> 2026-02-21T08:28:33.0256234Z %47 = arith.extf %38 : tensor<64x16xbf16> to tensor<64x16xf32> 2026-02-21T08:28:33.0256495Z %48 = tt.broadcast %33 : tensor<64x1xf32> -> tensor<64x16xf32> 2026-02-21T08:28:33.0256723Z %49 = arith.subf %47, %48 : tensor<64x16xf32> 2026-02-21T08:28:33.0256960Z %50 = tt.broadcast %34 : tensor<64x1xf32> -> tensor<64x16xf32> 2026-02-21T08:28:33.0257197Z %51 = arith.mulf %49, %50 : tensor<64x16xf32> 2026-02-21T08:28:33.0257418Z %52 = arith.extf %42 : tensor<1x16xbf16> to tensor<1x16xf32> 2026-02-21T08:28:33.0257671Z %53 = tt.broadcast %52 : tensor<1x16xf32> -> tensor<64x16xf32> 2026-02-21T08:28:33.0257890Z %54 = 
arith.mulf %51, %53 : tensor<64x16xf32> 2026-02-21T08:28:33.0258124Z %55 = arith.extf %46 : tensor<1x16xbf16> to tensor<1x16xf32> 2026-02-21T08:28:33.0258366Z %56 = tt.broadcast %55 : tensor<1x16xf32> -> tensor<64x16xf32> 2026-02-21T08:28:33.0258612Z %57 = arith.addf %54, %56 : tensor<64x16xf32> 2026-02-21T08:28:33.0258841Z %58 = arith.truncf %57 : tensor<64x16xf32> to tensor<64x16xbf16> 2026-02-21T08:28:33.0259153Z tt.descriptor_store %2[%4, %arg6], %58 : !tt.tensordesc>, tensor<64x16xbf16> 2026-02-21T08:28:33.0259452Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:28:33.0259666Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 2026-02-21T08:28:33.0259869Z tt.return 2026-02-21T08:28:33.0260037Z } 2026-02-21T08:28:33.0260191Z } 2026-02-21T08:28:33.0260260Z 2026-02-21T08:28:33.0260317Z {-# 2026-02-21T08:28:33.0260444Z external_resources: { 2026-02-21T08:28:33.0260604Z mlir_reproducer: { 2026-02-21T08:28:33.0264998Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, 
triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:28:33.0274590Z disable_threading: false, 2026-02-21T08:28:33.0274803Z verify_each: true 2026-02-21T08:28:33.0274954Z } 2026-02-21T08:28:33.0275086Z } 2026-02-21T08:28:33.0275201Z #-} 2026-02-21T08:28:33.0275656Z /tmp/torchinductor_root/fx/cfxrranglcqxa35phfcznzrtpjpyvuh723qmftxk55ltatdclj2z.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:28:33.0276921Z /tmp/torchinductor_root/fx/cfxrranglcqxa35phfcznzrtpjpyvuh723qmftxk55ltatdclj2z.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:28:33.0277951Z [39s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:28:33.0279284Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 32, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first', 'first', 'first'], maxnreg=64, num_sm_multiplier=16, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True, True], range_multi_buffers=[False, None, True], range_num_stages=[0, 4, 0], range_unroll_factors=[0, 3, 3], range_warp_specializes=[True, None, None]), static_shapes=True) 2026-02-21T08:28:33.0280621Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:28:33.0280919Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:28:51.7360398Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 4.3 configs/s 2026-02-21T08:28:51.7373243Z [57s] Adaptive compile timeout: 30s (90% percentile=3.3s, bounds=[30.0s, 30s]) 2026-02-21T08:28:52.0950766Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 357/357 633.8 configs/s 2026-02-21T08:28:52.3099123Z [58s] Initial random population of 100, 5 starting points: 2026-02-21T08:28:52.3103945Z error=5 2026-02-21T08:28:52.3108856Z timeout=1 2026-02-21T08:28:52.3110381Z ok=94 2026-02-21T08:28:52.3110553Z min=0.6031 2026-02-21T08:28:52.3110711Z mid=12.4601 2026-02-21T08:28:52.3110849Z max=310.2516 2026-02-21T08:28:52.3111026Z best={'block_sizes': [128, 32, 64], 2026-02-21T08:28:52.3111297Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:28:52.3111611Z 'load_eviction_policies': ['', 'first', 'first', 'first'], 2026-02-21T08:28:52.3111833Z 'maxnreg': 256, 2026-02-21T08:28:52.3112260Z 'num_sm_multiplier': 128, 2026-02-21T08:28:52.3112437Z 'num_stages': 1, 2026-02-21T08:28:52.3112583Z 'num_warps': 16, 2026-02-21T08:28:52.3112748Z 'pid_type': 'persistent_blocked', 2026-02-21T08:28:52.3112941Z 'range_flattens': [None, None, True], 2026-02-21T08:28:52.3113158Z 
'range_multi_buffers': [False, True, True], 2026-02-21T08:28:52.3113354Z 'range_num_stages': [1, 0, 2], 2026-02-21T08:28:52.3113539Z 'range_unroll_factors': [1, 4, 1], 2026-02-21T08:28:52.3113761Z 'range_warp_specializes': [False, False, True]} 2026-02-21T08:28:52.3122986Z [58s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:28:53.6414784Z [59s] Generation 1 starting: 97 neighbors, 5 active search path(s) 2026-02-21T08:29:01.4305865Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 101/101 10.7 configs/s 2026-02-21T08:29:08.5970124Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 101/101 14.0 configs/s 2026-02-21T08:29:14.2930718Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━ 363/363 61.8 configs/s 2026-02-21T08:29:14.5229682Z [80s] Generation 1 complete: 2026-02-21T08:29:14.5233958Z error=4 2026-02-21T08:29:14.5236206Z ok=98 2026-02-21T08:29:14.5236407Z min=0.5661 2026-02-21T08:29:14.5236568Z mid=1.1673 2026-02-21T08:29:14.5236723Z max=13.5660 2026-02-21T08:29:14.5236907Z best={'block_sizes': [16, 32, 64], 2026-02-21T08:29:14.5237249Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:29:14.5237890Z 'load_eviction_policies': ['', '', '', ''], 2026-02-21T08:29:14.5238102Z 'num_stages': 1, 2026-02-21T08:29:14.5238248Z 'num_warps': 2, 2026-02-21T08:29:14.5238385Z 'pid_type': 'flat', 2026-02-21T08:29:14.5238551Z 'range_flattens': [None, None, None], 2026-02-21T08:29:14.5238744Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:29:14.5238970Z 'range_num_stages': [0, 0, 0], 2026-02-21T08:29:14.5239155Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T08:29:14.5239359Z 'range_warp_specializes': [None, None, None]} 2026-02-21T08:29:14.5268420Z [80s] Fitting surrogate: 202 points, 202 targets 2026-02-21T08:29:15.7726681Z [81s] Generation 2 starting: 91 neighbors, 5 active search path(s) 2026-02-21T08:29:31.5393438Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/95 1.2 configs/s 
2026-02-21T08:29:38.1168938Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 95/95 14.5 configs/s 2026-02-21T08:29:53.0709623Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━ 363/363 24.0 configs/s 2026-02-21T08:29:53.3641572Z [119s] Generation 2 complete: 2026-02-21T08:29:53.3646496Z ok=97 2026-02-21T08:29:53.3651558Z min=0.5601 2026-02-21T08:29:53.3652948Z mid=0.8376 2026-02-21T08:29:53.3653106Z max=8.4009 2026-02-21T08:29:53.3653245Z best={'block_sizes': [16, 32, 64], 2026-02-21T08:29:53.3653503Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:29:53.3653764Z 'load_eviction_policies': ['', '', '', ''], 2026-02-21T08:29:53.3653942Z 'num_stages': 1, 2026-02-21T08:29:53.3654085Z 'num_warps': 2, 2026-02-21T08:29:53.3654221Z 'pid_type': 'flat', 2026-02-21T08:29:53.3654383Z 'range_flattens': [None, None, None], 2026-02-21T08:29:53.3654572Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:29:53.3654759Z 'range_num_stages': [0, 0, 0], 2026-02-21T08:29:53.3654921Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T08:29:53.3655117Z 'range_warp_specializes': [None, None, None]} 2026-02-21T08:29:53.3701251Z [119s] Fitting surrogate: 299 points, 299 targets 2026-02-21T08:29:54.5373123Z [120s] Generation 3 starting: 85 neighbors, 5 active search path(s) 2026-02-21T08:30:02.1760064Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 17.3 configs/s 2026-02-21T08:30:08.2270374Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 14.8 configs/s 2026-02-21T08:30:23.6857100Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━ 392/392 27.4 configs/s 2026-02-21T08:30:23.9730276Z [149s] Generation 3 complete: 2026-02-21T08:30:23.9736115Z ok=91 2026-02-21T08:30:23.9738379Z min=0.5232 2026-02-21T08:30:23.9738542Z mid=0.7678 2026-02-21T08:30:23.9738715Z max=7.9636 2026-02-21T08:30:23.9738862Z best={'block_sizes': [128, 32, 128], 2026-02-21T08:30:23.9743889Z 'indexing': ['pointer', 'pointer', 
'pointer', 'pointer', 'pointer'], 2026-02-21T08:30:23.9746155Z 'load_eviction_policies': ['', 'first', 'first', 'first'], 2026-02-21T08:30:23.9746434Z 'num_stages': 1, 2026-02-21T08:30:23.9751386Z 'num_warps': 8, 2026-02-21T08:30:23.9752963Z 'pid_type': 'flat', 2026-02-21T08:30:23.9753198Z 'range_flattens': [None, None, True], 2026-02-21T08:30:23.9753760Z 'range_multi_buffers': [None, True, True], 2026-02-21T08:30:23.9753945Z 'range_num_stages': [0, 0, 2], 2026-02-21T08:30:23.9754122Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T08:30:23.9754322Z 'range_warp_specializes': [None, False, True]} 2026-02-21T08:30:23.9785998Z [149s] Fitting surrogate: 390 points, 390 targets 2026-02-21T08:30:25.2960288Z [151s] Generation 4 starting: 90 neighbors, 5 active search path(s) 2026-02-21T08:30:59.0720168Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 93/93 0.6 configs/s 2026-02-21T08:31:05.7027140Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 93/93 14.1 configs/s 2026-02-21T08:31:17.7879405Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 398/398 32.5 configs/s 2026-02-21T08:31:18.0560801Z [204s] Generation 4 complete: 2026-02-21T08:31:18.0564977Z ok=95 2026-02-21T08:31:18.0569396Z min=0.4988 2026-02-21T08:31:18.0574391Z mid=0.7998 2026-02-21T08:31:18.0579524Z max=33.5319 2026-02-21T08:31:18.0579804Z best={'block_sizes': [16, 32, 128], 2026-02-21T08:31:18.0583728Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:31:18.0588064Z 'load_eviction_policies': ['', '', 'last', ''], 2026-02-21T08:31:18.0589627Z 'num_stages': 1, 2026-02-21T08:31:18.0589853Z 'num_warps': 2, 2026-02-21T08:31:18.0590010Z 'pid_type': 'flat', 2026-02-21T08:31:18.0594189Z 'range_flattens': [None, None, None], 2026-02-21T08:31:18.0599335Z 'range_multi_buffers': [None, False, None], 2026-02-21T08:31:18.0603791Z 'range_num_stages': [0, 1, 0], 2026-02-21T08:31:18.0608330Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T08:31:18.0612837Z 
'range_warp_specializes': [None, None, None]} 2026-02-21T08:31:18.0622832Z [204s] Fitting surrogate: 485 points, 485 targets 2026-02-21T08:31:19.2656021Z [205s] Generation 5 starting: 85 neighbors, 5 active search path(s) 2026-02-21T08:31:26.8445262Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88/88 11.2 configs/s 2026-02-21T08:31:33.2353134Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 88/88 13.8 configs/s 2026-02-21T08:31:40.8893780Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━ 430/430 55.2 configs/s 2026-02-21T08:31:41.1183638Z [227s] Generation 5 complete: 2026-02-21T08:31:41.1187917Z error=1 2026-02-21T08:31:41.1189907Z ok=89 2026-02-21T08:31:41.1190072Z min=0.4947 2026-02-21T08:31:41.1190208Z mid=0.8652 2026-02-21T08:31:41.1190326Z max=21.8537 2026-02-21T08:31:41.1190473Z best={'block_sizes': [32, 64, 128], 2026-02-21T08:31:41.1190731Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:31:41.1191007Z 'load_eviction_policies': ['', 'first', 'first', 'first'], 2026-02-21T08:31:41.1191220Z 'num_stages': 1, 2026-02-21T08:31:41.1191360Z 'num_warps': 8, 2026-02-21T08:31:41.1191505Z 'pid_type': 'flat', 2026-02-21T08:31:41.1191662Z 'range_flattens': [None, None, True], 2026-02-21T08:31:41.1192222Z 'range_multi_buffers': [None, True, True], 2026-02-21T08:31:41.1192418Z 'range_num_stages': [0, 0, 2], 2026-02-21T08:31:41.1192923Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T08:31:41.1193115Z 'range_warp_specializes': [None, False, True]} 2026-02-21T08:31:41.1234260Z [227s] Fitting surrogate: 575 points, 575 targets 2026-02-21T08:31:42.4517647Z [228s] Generation 6 starting: 86 neighbors, 5 active search path(s) 2026-02-21T08:32:17.4229781Z [263s] Timeout after 30s compiling Config(block_sizes=[256, 32, 256], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first', 'last', ''], num_stages=5, num_warps=1, 
pid_type='flat', range_flattens=[None, None, None], range_multi_buffers=[None, True, None], range_num_stages=[0, 1, 4], range_unroll_factors=[0, 1, 3], range_warp_specializes=[None, None, False]) 2026-02-21T08:32:17.4251221Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88/88 0.7 configs/s 2026-02-21T08:32:24.0118826Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 88/88 13.4 configs/s 2026-02-21T08:32:35.1760170Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 430/430 38.1 configs/s 2026-02-21T08:32:35.4396565Z [281s] Generation 6 complete: 2026-02-21T08:32:35.4400894Z error=1 2026-02-21T08:32:35.4402943Z timeout=1 2026-02-21T08:32:35.4403147Z ok=89 2026-02-21T08:32:35.4403321Z min=0.5068 2026-02-21T08:32:35.4403493Z mid=0.7618 2026-02-21T08:32:35.4403649Z max=28.1406 2026-02-21T08:32:35.4403841Z best={'block_sizes': [32, 64, 128], 2026-02-21T08:32:35.4404122Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:32:35.4404401Z 'load_eviction_policies': ['', 'first', 'first', 'first'], 2026-02-21T08:32:35.4404641Z 'num_stages': 1, 2026-02-21T08:32:35.4404832Z 'num_warps': 8, 2026-02-21T08:32:35.4405014Z 'pid_type': 'flat', 2026-02-21T08:32:35.4405229Z 'range_flattens': [None, None, True], 2026-02-21T08:32:35.4405479Z 'range_multi_buffers': [None, True, True], 2026-02-21T08:32:35.4405765Z 'range_num_stages': [0, 0, 2], 2026-02-21T08:32:35.4406011Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T08:32:35.4406266Z 'range_warp_specializes': [None, False, True]} 2026-02-21T08:32:35.4439716Z [281s] Fitting surrogate: 666 points, 666 targets 2026-02-21T08:32:36.3814222Z [282s] Generation 7 starting: 51 neighbors, 3 active search path(s) 2026-02-21T08:32:41.1500456Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 10.1 configs/s 2026-02-21T08:32:44.6975424Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 15.1 configs/s 2026-02-21T08:32:50.6856161Z Generation 7: verifying top 
configs 100% ━━━━━━━━━━━━━━━━ 430/430 70.2 configs/s 2026-02-21T08:32:50.8984012Z [296s] Generation 7 complete: 2026-02-21T08:32:50.8985103Z ok=55 2026-02-21T08:32:50.8985319Z min=0.4925 2026-02-21T08:32:50.8990816Z mid=0.8489 2026-02-21T08:32:50.8995384Z max=2.9920 2026-02-21T08:32:50.8996748Z best={'block_sizes': [16, 64, 128], 2026-02-21T08:32:50.8997067Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:32:50.8997358Z 'load_eviction_policies': ['last', '', 'first', ''], 2026-02-21T08:32:50.8997562Z 'num_stages': 6, 2026-02-21T08:32:50.8997702Z 'num_warps': 2, 2026-02-21T08:32:50.8997851Z 'pid_type': 'flat', 2026-02-21T08:32:50.8998009Z 'range_flattens': [None, None, True], 2026-02-21T08:32:50.8998208Z 'range_multi_buffers': [None, True, True], 2026-02-21T08:32:50.8998393Z 'range_num_stages': [0, 1, 1], 2026-02-21T08:32:50.8998567Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T08:32:50.8998761Z 'range_warp_specializes': [None, False, False]} 2026-02-21T08:32:50.9037752Z [296s] Fitting surrogate: 721 points, 721 targets 2026-02-21T08:32:51.7381006Z [297s] Generation 8 starting: 50 neighbors, 3 active search path(s) 2026-02-21T08:33:06.5597501Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 1.3 configs/s 2026-02-21T08:33:10.2956587Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 14.0 configs/s 2026-02-21T08:33:16.9907384Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━ 430/430 77.4 configs/s 2026-02-21T08:33:17.2008471Z [323s] Generation 8 complete: 2026-02-21T08:33:17.2012898Z ok=53 2026-02-21T08:33:17.2015172Z min=0.4905 2026-02-21T08:33:17.2015337Z mid=0.8570 2026-02-21T08:33:17.2015478Z max=31.4789 2026-02-21T08:33:17.2015628Z best={'block_sizes': [16, 64, 128], 2026-02-21T08:33:17.2015914Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:33:17.2016214Z 'load_eviction_policies': ['last', '', 'first', ''], 2026-02-21T08:33:17.2016428Z 
'num_stages': 6, 2026-02-21T08:33:17.2016584Z 'num_warps': 2, 2026-02-21T08:33:17.2016732Z 'pid_type': 'flat', 2026-02-21T08:33:17.2016908Z 'range_flattens': [None, None, True], 2026-02-21T08:33:17.2017121Z 'range_multi_buffers': [None, True, True], 2026-02-21T08:33:17.2017332Z 'range_num_stages': [0, 1, 1], 2026-02-21T08:33:17.2017512Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T08:33:17.2018102Z 'range_warp_specializes': [None, False, False]} 2026-02-21T08:33:17.2057818Z [323s] Fitting surrogate: 774 points, 774 targets 2026-02-21T08:33:18.0941357Z [324s] Generation 9 starting: 52 neighbors, 3 active search path(s) 2026-02-21T08:33:24.1915815Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 6.2 configs/s 2026-02-21T08:33:28.0239482Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 54/54 14.2 configs/s 2026-02-21T08:33:32.2885818Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━ 430/430 97.6 configs/s 2026-02-21T08:33:32.4889186Z [338s] Generation 9 complete: 2026-02-21T08:33:32.4891112Z ok=55 2026-02-21T08:33:32.4891363Z min=0.4680 2026-02-21T08:33:32.4896105Z mid=0.9136 2026-02-21T08:33:32.4900526Z max=17.2708 2026-02-21T08:33:32.4905033Z best={'block_sizes': [16, 64, 512], 2026-02-21T08:33:32.4909535Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:33:32.4913884Z 'load_eviction_policies': ['last', '', 'first', ''], 2026-02-21T08:33:32.4915819Z 'num_stages': 6, 2026-02-21T08:33:32.4916189Z 'num_warps': 2, 2026-02-21T08:33:32.4916376Z 'pid_type': 'flat', 2026-02-21T08:33:32.4916586Z 'range_flattens': [None, None, True], 2026-02-21T08:33:32.4916856Z 'range_multi_buffers': [None, True, True], 2026-02-21T08:33:32.4917041Z 'range_num_stages': [0, 0, 1], 2026-02-21T08:33:32.4917216Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T08:33:32.4917408Z 'range_warp_specializes': [None, False, False]} 2026-02-21T08:33:32.4944704Z [338s] Fitting surrogate: 829 points, 829 targets 
2026-02-21T08:33:33.4288059Z [339s] Generation 10 starting: 53 neighbors, 3 active search path(s) 2026-02-21T08:33:47.6456926Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56/56 1.8 configs/s 2026-02-21T08:33:51.6715795Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 56/56 14.0 configs/s 2026-02-21T08:33:57.3609299Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━ 458/458 78.6 configs/s 2026-02-21T08:33:57.5689977Z [363s] Generation 10 complete: 2026-02-21T08:33:57.5691717Z ok=57 2026-02-21T08:33:57.5692263Z min=0.4475 2026-02-21T08:33:57.5697280Z mid=0.7016 2026-02-21T08:33:57.5698792Z max=23.0810 2026-02-21T08:33:57.5699022Z best={'block_sizes': [16, 64, 256], 2026-02-21T08:33:57.5704658Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:33:57.5706838Z 'load_eviction_policies': ['last', '', 'first', ''], 2026-02-21T08:33:57.5707113Z 'num_stages': 6, 2026-02-21T08:33:57.5711829Z 'num_warps': 4, 2026-02-21T08:33:57.5714232Z 'pid_type': 'flat', 2026-02-21T08:33:57.5714449Z 'range_flattens': [None, None, True], 2026-02-21T08:33:57.5714663Z 'range_multi_buffers': [None, True, True], 2026-02-21T08:33:57.5714852Z 'range_num_stages': [0, 0, 1], 2026-02-21T08:33:57.5715031Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T08:33:57.5715227Z 'range_warp_specializes': [None, False, False]} 2026-02-21T08:33:57.5746071Z [363s] Fitting surrogate: 886 points, 886 targets 2026-02-21T08:33:58.5328785Z [364s] Generation 11 starting: 55 neighbors, 3 active search path(s) 2026-02-21T08:34:03.8402546Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58/58 9.8 configs/s 2026-02-21T08:34:07.7467609Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 58/58 15.0 configs/s 2026-02-21T08:34:13.6175237Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━━ 460/460 76.6 configs/s 2026-02-21T08:34:13.8247464Z [379s] Generation 11 complete: 2026-02-21T08:34:13.8248662Z ok=59 2026-02-21T08:34:13.8248825Z 
min=0.4556 2026-02-21T08:34:13.8248963Z mid=0.7066 2026-02-21T08:34:13.8249082Z max=5.6576 2026-02-21T08:34:13.8249227Z best={'block_sizes': [8, 64, 256], 2026-02-21T08:34:13.8249475Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:34:13.8249749Z 'load_eviction_policies': ['last', '', 'first', ''], 2026-02-21T08:34:13.8249944Z 'num_stages': 6, 2026-02-21T08:34:13.8250088Z 'num_warps': 2, 2026-02-21T08:34:13.8250231Z 'pid_type': 'flat', 2026-02-21T08:34:13.8250783Z 'range_flattens': [None, None, False], 2026-02-21T08:34:13.8251020Z 'range_multi_buffers': [None, True, True], 2026-02-21T08:34:13.8251214Z 'range_num_stages': [0, 0, 1], 2026-02-21T08:34:13.8251394Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T08:34:13.8251592Z 'range_warp_specializes': [None, False, False]} 2026-02-21T08:34:13.8293856Z [379s] Fitting surrogate: 945 points, 945 targets 2026-02-21T08:34:14.7651259Z [380s] Generation 12 starting: 52 neighbors, 3 active search path(s) 2026-02-21T08:34:48.0394972Z [414s] Timeout after 30s compiling Config(block_sizes=[64, 32, 512], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', '', 'last', ''], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, None, None], range_multi_buffers=[None, None, None], range_num_stages=[0, 1, 4], range_unroll_factors=[0, 1, 4], range_warp_specializes=[None, None, False]) 2026-02-21T08:34:48.0412475Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 0.4 configs/s 2026-02-21T08:34:51.6739723Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 54/54 15.0 configs/s 2026-02-21T08:34:59.3035592Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━ 512/512 65.9 configs/s 2026-02-21T08:34:59.5187212Z [425s] Generation 12 complete: 2026-02-21T08:34:59.5191680Z timeout=1 2026-02-21T08:34:59.5195518Z ok=55 2026-02-21T08:34:59.5196905Z min=0.4271 
2026-02-21T08:34:59.5197074Z mid=0.6472 2026-02-21T08:34:59.5197201Z max=17.4802 2026-02-21T08:34:59.5197357Z best={'block_sizes': [8, 64, 512], 2026-02-21T08:34:59.5197627Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:34:59.5197908Z 'load_eviction_policies': ['last', '', 'first', ''], 2026-02-21T08:34:59.5198120Z 'num_stages': 6, 2026-02-21T08:34:59.5198264Z 'num_warps': 2, 2026-02-21T08:34:59.5198409Z 'pid_type': 'flat', 2026-02-21T08:34:59.5198568Z 'range_flattens': [None, None, False], 2026-02-21T08:34:59.5198801Z 'range_multi_buffers': [None, True, None], 2026-02-21T08:34:59.5199009Z 'range_num_stages': [0, 0, 1], 2026-02-21T08:34:59.5199189Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T08:34:59.5199396Z 'range_warp_specializes': [None, False, False]} 2026-02-21T08:34:59.5242258Z [425s] Fitting surrogate: 1001 points, 1001 targets 2026-02-21T08:35:00.4365724Z [426s] Generation 13 starting: 54 neighbors, 3 active search path(s) 2026-02-21T08:35:05.0839153Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 56/56 13.1 configs/s 2026-02-21T08:35:08.8619230Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 56/56 15.0 configs/s 2026-02-21T08:35:17.3439825Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━ 512/512 59.5 configs/s 2026-02-21T08:35:17.5696695Z [443s] Generation 13 complete: 2026-02-21T08:35:17.5699761Z ok=57 2026-02-21T08:35:17.5704346Z min=0.4300 2026-02-21T08:35:17.5708771Z mid=0.6134 2026-02-21T08:35:17.5710359Z max=5.7324 2026-02-21T08:35:17.5710564Z best={'block_sizes': [8, 64, 512], 2026-02-21T08:35:17.5710889Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:35:17.5711521Z 'load_eviction_policies': ['last', '', 'first', ''], 2026-02-21T08:35:17.5711731Z 'num_stages': 6, 2026-02-21T08:35:17.5712084Z 'num_warps': 2, 2026-02-21T08:35:17.5716106Z 'pid_type': 'flat', 2026-02-21T08:35:17.5719990Z 'range_flattens': [None, None, False], 
2026-02-21T08:35:17.5720314Z 'range_multi_buffers': [None, True, None], 2026-02-21T08:35:17.5720545Z 'range_num_stages': [0, 0, 1], 2026-02-21T08:35:17.5725785Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T08:35:17.5730391Z 'range_warp_specializes': [None, None, False]} 2026-02-21T08:35:17.5751217Z [443s] Fitting surrogate: 1058 points, 1058 targets 2026-02-21T08:35:18.5231405Z [444s] Generation 14 starting: 53 neighbors, 3 active search path(s) 2026-02-21T08:35:24.3773771Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 6.0 configs/s 2026-02-21T08:35:28.1081947Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 54/54 14.6 configs/s 2026-02-21T08:35:36.3272323Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━━ 530/530 63.5 configs/s 2026-02-21T08:35:36.5495580Z [462s] Generation 14 complete: 2026-02-21T08:35:36.5499350Z ok=56 2026-02-21T08:35:36.5500833Z min=0.4311 2026-02-21T08:35:36.5500997Z mid=0.6276 2026-02-21T08:35:36.5501117Z max=12.0996 2026-02-21T08:35:36.5501264Z best={'block_sizes': [8, 64, 512], 2026-02-21T08:35:36.5501520Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:35:36.5501782Z 'load_eviction_policies': ['last', '', 'first', ''], 2026-02-21T08:35:36.5502160Z 'num_stages': 6, 2026-02-21T08:35:36.5502299Z 'num_warps': 2, 2026-02-21T08:35:36.5502446Z 'pid_type': 'flat', 2026-02-21T08:35:36.5502601Z 'range_flattens': [None, None, False], 2026-02-21T08:35:36.5502801Z 'range_multi_buffers': [None, True, False], 2026-02-21T08:35:36.5502988Z 'range_num_stages': [0, 0, 1], 2026-02-21T08:35:36.5503188Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T08:35:36.5503387Z 'range_warp_specializes': [None, None, False]} 2026-02-21T08:35:36.5554243Z [462s] Fitting surrogate: 1114 points, 1114 targets 2026-02-21T08:35:37.2657025Z [463s] Generation 15 starting: 34 neighbors, 2 active search path(s) 2026-02-21T08:35:41.4969331Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 
6.1 configs/s 2026-02-21T08:35:43.8739600Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 35/35 15.0 configs/s 2026-02-21T08:35:47.5682694Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━ 530/530 137.9 configs/s 2026-02-21T08:35:47.7556925Z [473s] Generation 15 complete: 2026-02-21T08:35:47.7561500Z ok=37 2026-02-21T08:35:47.7565263Z min=0.4402 2026-02-21T08:35:47.7569664Z mid=0.7454 2026-02-21T08:35:47.7571038Z max=6.9571 2026-02-21T08:35:47.7571267Z best={'block_sizes': [8, 64, 512], 2026-02-21T08:35:47.7576489Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:35:47.7578531Z 'load_eviction_policies': ['last', '', 'first', ''], 2026-02-21T08:35:47.7579119Z 'num_stages': 6, 2026-02-21T08:35:47.7579268Z 'num_warps': 2, 2026-02-21T08:35:47.7579407Z 'pid_type': 'flat', 2026-02-21T08:35:47.7579573Z 'range_flattens': [None, None, False], 2026-02-21T08:35:47.7579766Z 'range_multi_buffers': [None, True, False], 2026-02-21T08:35:47.7579957Z 'range_num_stages': [0, 0, 1], 2026-02-21T08:35:47.7580123Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T08:35:47.7580320Z 'range_warp_specializes': [None, None, False]} 2026-02-21T08:35:47.7611152Z [473s] Fitting surrogate: 1151 points, 1151 targets 2026-02-21T08:35:48.4410342Z [474s] Generation 16 starting: 34 neighbors, 2 active search path(s) 2026-02-21T08:35:53.3086864Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 11.4 configs/s 2026-02-21T08:35:53.3289721Z [479s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first', 'first', 'first'], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, True, True], range_num_stages=[0, 3, 2], range_unroll_factors=[0, 3, 0], range_warp_specializes=[None, False, None]) 2026-02-21T08:35:53.3290935Z Tensor-likes are not close! 
2026-02-21T08:35:53.3291078Z 2026-02-21T08:35:53.3291169Z Mismatched elements: 116 / 536870912 (0.0%) 2026-02-21T08:35:53.3291492Z Greatest absolute difference: 0.03125 at index (44238, 1904) (up to 0.01 allowed) 2026-02-21T08:35:53.3292179Z Greatest relative difference: 3.546875 at index (131167, 592) (up to 0.01 allowed) 2026-02-21T08:35:53.3292553Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:35:53.3292753Z 2026-02-21T08:35:53.4126322Z [479s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first', 'first', 'first'], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, True, True], range_num_stages=[0, 3, 2], range_unroll_factors=[0, 4, 0], range_warp_specializes=[None, False, None]) 2026-02-21T08:35:53.4127772Z Tensor-likes are not close! 2026-02-21T08:35:53.4127900Z 2026-02-21T08:35:53.4127977Z Mismatched elements: 116 / 536870912 (0.0%) 2026-02-21T08:35:53.4128252Z Greatest absolute difference: 0.03125 at index (44238, 1904) (up to 0.01 allowed) 2026-02-21T08:35:53.4128594Z Greatest relative difference: 3.546875 at index (131167, 592) (up to 0.01 allowed) 2026-02-21T08:35:53.4128905Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:35:53.4129065Z 2026-02-21T08:35:53.8745982Z [479s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first', 'first', 'first'], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, True, True], range_num_stages=[0, 3, 3], range_unroll_factors=[0, 3, 0], range_warp_specializes=[None, False, None]) 2026-02-21T08:35:53.8747229Z Tensor-likes are not close! 
2026-02-21T08:35:53.8748935Z 2026-02-21T08:35:53.8749119Z Mismatched elements: 116 / 536870912 (0.0%) 2026-02-21T08:35:53.8749427Z Greatest absolute difference: 0.03125 at index (44238, 1904) (up to 0.01 allowed) 2026-02-21T08:35:53.8749783Z Greatest relative difference: 3.546875 at index (131167, 592) (up to 0.01 allowed) 2026-02-21T08:35:53.8750084Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:35:53.8750257Z 2026-02-21T08:35:55.4980960Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 36/36 16.3 configs/s 2026-02-21T08:36:00.6425747Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━ 530/530 100.2 configs/s 2026-02-21T08:36:00.8363209Z [486s] Generation 16 complete: 2026-02-21T08:36:00.8367357Z error=3 2026-02-21T08:36:00.8370575Z ok=34 2026-02-21T08:36:00.8374517Z min=0.4313 2026-02-21T08:36:00.8379641Z mid=0.5949 2026-02-21T08:36:00.8382738Z max=1.4961 2026-02-21T08:36:00.8386807Z best={'block_sizes': [8, 64, 512], 2026-02-21T08:36:00.8390866Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:36:00.8391251Z 'load_eviction_policies': ['last', '', 'first', ''], 2026-02-21T08:36:00.8391473Z 'num_stages': 6, 2026-02-21T08:36:00.8394164Z 'num_warps': 2, 2026-02-21T08:36:00.8394398Z 'pid_type': 'flat', 2026-02-21T08:36:00.8399246Z 'range_flattens': [None, None, False], 2026-02-21T08:36:00.8402581Z 'range_multi_buffers': [None, True, False], 2026-02-21T08:36:00.8407015Z 'range_num_stages': [0, 0, 1], 2026-02-21T08:36:00.8410858Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T08:36:00.8416174Z 'range_warp_specializes': [None, None, False]} 2026-02-21T08:36:00.8421349Z [486s] Fitting surrogate: 1188 points, 1188 targets 2026-02-21T08:36:01.5505519Z [487s] Generation 17 starting: 35 neighbors, 2 active search path(s) 2026-02-21T08:36:05.9656823Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 5.0 configs/s 2026-02-21T08:36:06.1160249Z [492s] Skipping config with accuracy 
mismatch: helion.Config(block_sizes=[2, 1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first', 'first', 'first'], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True, None], range_multi_buffers=[None, True, True], range_num_stages=[0, 3, 1], range_unroll_factors=[0, 2, 0], range_warp_specializes=[None, False, None]) 2026-02-21T08:36:06.1161610Z Tensor-likes are not close! 2026-02-21T08:36:06.1165701Z 2026-02-21T08:36:06.1170137Z Mismatched elements: 116 / 536870912 (0.0%) 2026-02-21T08:36:06.1173768Z Greatest absolute difference: 0.03125 at index (44238, 1904) (up to 0.01 allowed) 2026-02-21T08:36:06.1178046Z Greatest relative difference: 3.546875 at index (131167, 592) (up to 0.01 allowed) 2026-02-21T08:36:06.1181451Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:36:06.1185607Z 2026-02-21T08:36:08.4048267Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 37/37 15.4 configs/s 2026-02-21T08:36:12.8456046Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━ 530/530 115.8 configs/s 2026-02-21T08:36:13.0345590Z [499s] Generation 17 complete: 2026-02-21T08:36:13.0350122Z error=1 2026-02-21T08:36:13.0352438Z ok=37 2026-02-21T08:36:13.0352694Z min=0.4280 2026-02-21T08:36:13.0352875Z mid=0.6493 2026-02-21T08:36:13.0353053Z max=6.7384 2026-02-21T08:36:13.0353227Z best={'block_sizes': [8, 64, 512], 2026-02-21T08:36:13.0353549Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:36:13.0358950Z 'load_eviction_policies': ['last', '', 'first', ''], 2026-02-21T08:36:13.0363124Z 'num_stages': 6, 2026-02-21T08:36:13.0369444Z 'num_warps': 2, 2026-02-21T08:36:13.0375011Z 'pid_type': 'flat', 2026-02-21T08:36:13.0379702Z 'range_flattens': [None, None, False], 2026-02-21T08:36:13.0381195Z 'range_multi_buffers': [None, True, False], 2026-02-21T08:36:13.0381446Z 'range_num_stages': [0, 0, 1], 2026-02-21T08:36:13.0381636Z 
'range_unroll_factors': [0, 4, 1], 2026-02-21T08:36:13.0381836Z 'range_warp_specializes': [None, None, False]} 2026-02-21T08:36:13.0398935Z [499s] Fitting surrogate: 1226 points, 1226 targets 2026-02-21T08:36:13.7399638Z [499s] Generation 18 starting: 34 neighbors, 2 active search path(s) 2026-02-21T08:36:16.5753681Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 24.3 configs/s 2026-02-21T08:36:18.8496681Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 35/35 15.7 configs/s 2026-02-21T08:36:24.6499796Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━ 577/577 97.3 configs/s 2026-02-21T08:36:24.8416563Z [510s] Generation 18 complete: 2026-02-21T08:36:24.8418220Z ok=37 2026-02-21T08:36:24.8418384Z min=0.4364 2026-02-21T08:36:24.8418511Z mid=0.5540 2026-02-21T08:36:24.8418635Z max=1.9660 2026-02-21T08:36:24.8418806Z best={'block_sizes': [8, 64, 512], 2026-02-21T08:36:24.8419462Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:36:24.8419732Z 'load_eviction_policies': ['last', '', 'first', ''], 2026-02-21T08:36:24.8419936Z 'num_stages': 6, 2026-02-21T08:36:24.8420071Z 'num_warps': 2, 2026-02-21T08:36:24.8420214Z 'pid_type': 'flat', 2026-02-21T08:36:24.8420368Z 'range_flattens': [None, None, False], 2026-02-21T08:36:24.8420564Z 'range_multi_buffers': [None, True, False], 2026-02-21T08:36:24.8420755Z 'range_num_stages': [0, 0, 1], 2026-02-21T08:36:24.8420917Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T08:36:24.8421114Z 'range_warp_specializes': [None, None, False]} 2026-02-21T08:36:24.8464318Z [510s] Fitting surrogate: 1263 points, 1263 targets 2026-02-21T08:36:25.5367135Z [511s] Generation 19 starting: 33 neighbors, 2 active search path(s) 2026-02-21T08:36:28.6565405Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 8.3 configs/s 2026-02-21T08:36:28.8097097Z [514s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 1024, 256], indexing=['pointer', 'pointer', 
'tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first', 'first', 'first'], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True, None], range_multi_buffers=[None, True, None], range_num_stages=[0, 3, 0], range_unroll_factors=[0, 3, 0], range_warp_specializes=[None, False, None]) 2026-02-21T08:36:28.8098468Z Tensor-likes are not close! 2026-02-21T08:36:28.8102958Z 2026-02-21T08:36:28.8107648Z Mismatched elements: 116 / 536870912 (0.0%) 2026-02-21T08:36:28.8112325Z Greatest absolute difference: 0.03125 at index (44238, 1904) (up to 0.01 allowed) 2026-02-21T08:36:28.8117122Z Greatest relative difference: 3.546875 at index (131167, 592) (up to 0.01 allowed) 2026-02-21T08:36:28.8118654Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:36:28.8118842Z 2026-02-21T08:36:29.6066967Z [515s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 2048, 256], indexing=['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first', 'first', 'first'], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True, None], range_multi_buffers=[None, True, None], range_num_stages=[0, 3, 0], range_unroll_factors=[0, 3, 0], range_warp_specializes=[None, None, None]) 2026-02-21T08:36:29.6071388Z Tensor-likes are not close! 2026-02-21T08:36:29.6073126Z 2026-02-21T08:36:29.6073361Z Mismatched elements: 34695 / 536870912 (0.0%) 2026-02-21T08:36:29.6073693Z Greatest absolute difference: 0.125 at index (1159, 1628) (up to 0.01 allowed) 2026-02-21T08:36:29.6078848Z Greatest relative difference: 944.0 at index (24721, 1104) (up to 0.01 allowed) 2026-02-21T08:36:29.6083281Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T08:36:29.6084499Z 2026-02-21T08:36:30.9049669Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 35/35 15.8 configs/s 2026-02-21T08:36:36.2489381Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━━ 577/577 105.6 configs/s 2026-02-21T08:36:36.4387025Z [522s] Generation 19 complete: 2026-02-21T08:36:36.4392383Z error=2 2026-02-21T08:36:36.4394244Z ok=34 2026-02-21T08:36:36.4394416Z min=0.4107 2026-02-21T08:36:36.4394562Z mid=0.5366 2026-02-21T08:36:36.4394687Z max=3.7335 2026-02-21T08:36:36.4394843Z best={'block_sizes': [2, 512, 512], 2026-02-21T08:36:36.4395139Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:36:36.4395473Z 'load_eviction_policies': ['last', 'first', 'first', 'first'], 2026-02-21T08:36:36.4395709Z 'num_stages': 3, 2026-02-21T08:36:36.4395867Z 'num_warps': 1, 2026-02-21T08:36:36.4396012Z 'pid_type': 'flat', 2026-02-21T08:36:36.4396374Z 'range_flattens': [None, True, None], 2026-02-21T08:36:36.4396586Z 'range_multi_buffers': [None, True, None], 2026-02-21T08:36:36.4396783Z 'range_num_stages': [0, 3, 0], 2026-02-21T08:36:36.4396970Z 'range_unroll_factors': [0, 3, 0], 2026-02-21T08:36:36.4397207Z 'range_warp_specializes': [None, None, None]} 2026-02-21T08:36:36.4449473Z [522s] Fitting surrogate: 1299 points, 1299 targets 2026-02-21T08:36:37.1810173Z [523s] Generation 20 starting: 34 neighbors, 2 active search path(s) 2026-02-21T08:36:39.9734757Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 25.6 configs/s 2026-02-21T08:36:42.1409675Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 34/34 16.0 configs/s 2026-02-21T08:36:50.4630588Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━━━ 588/588 69.8 configs/s 2026-02-21T08:36:50.6763856Z [536s] Generation 20 complete: 2026-02-21T08:36:50.6765638Z ok=36 2026-02-21T08:36:50.6765799Z min=0.4168 2026-02-21T08:36:50.6765934Z mid=0.4988 2026-02-21T08:36:50.6766058Z max=2.4647 2026-02-21T08:36:50.6766194Z 
best={'block_sizes': [2, 512, 512], 2026-02-21T08:36:50.6766481Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:36:50.6767165Z 'load_eviction_policies': ['last', 'first', 'first', 'first'], 2026-02-21T08:36:50.6767431Z 'num_stages': 3, 2026-02-21T08:36:50.6767575Z 'num_warps': 1, 2026-02-21T08:36:50.6767729Z 'pid_type': 'flat', 2026-02-21T08:36:50.6767891Z 'range_flattens': [None, True, None], 2026-02-21T08:36:50.6768094Z 'range_multi_buffers': [None, True, None], 2026-02-21T08:36:50.6768280Z 'range_num_stages': [0, 2, 1], 2026-02-21T08:36:50.6768461Z 'range_unroll_factors': [0, 3, 0], 2026-02-21T08:36:50.6768657Z 'range_warp_specializes': [None, None, None]} 2026-02-21T08:36:50.6822573Z [536s] Fitting surrogate: 1335 points, 1335 targets 2026-02-21T08:36:50.9945297Z [536s] Autotuning complete in 537.0s after searching 1300 configs. 2026-02-21T08:36:50.9949682Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:36:50.9952171Z @helion.kernel(config=helion.Config(block_sizes=[2, 512, 512], indexing=['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first', 'first', 'first'], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True, None], range_multi_buffers=[None, True, None], range_num_stages=[0, 2, 1], range_unroll_factors=[0, 3, 0], range_warp_specializes=[None, None, None]), static_shapes=True) 2026-02-21T08:36:50.9953145Z 2026-02-21T08:36:50.9953417Z [536s] Code of selected kernel: /tmp/torchinductor_root/cj/ccjcl2eqgp3s5cpzzkpqjufmc6t4u6jv5ypbylpi3zgwbdh3eyeu.py 2026-02-21T08:36:52.3445512Z WARNING:tritonbench.utils.triton_op:Completed input ID 2: 2026-02-21T08:36:52.3446312Z x_val 2026-02-21T08:36:52.3446459Z ------- 2026-02-21T08:36:52.3446634Z 2048 2026-02-21T08:36:52.3446742Z 2026-02-21T08:36:52.3481834Z 33%|███▎ | 2/6 [28:42<59:05, 886.44s/it]WARNING:tritonbench.utils.triton_op:Running input ID 4: 2026-02-21T08:36:52.3485843Z 
x_val 2026-02-21T08:36:52.3487818Z ------- 2026-02-21T08:36:52.3487978Z 3072 2026-02-21T08:36:52.3501196Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for eager_layer_norm 2026-02-21T08:36:53.1426862Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_welford 2026-02-21T08:36:54.5005140Z INFO:tritonbench.utils.triton_op:Took 2.37ms to get benchmark function for torch_compile_welford 2026-02-21T08:49:14.7549514Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:49:14.7549875Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:49:14.7550122Z 'dtype': 'torch.bfloat16', 2026-02-21T08:49:14.7550381Z 'shape': (3072,), 2026-02-21T08:49:14.7550590Z 'stride': (1,)}, 2026-02-21T08:49:14.7550856Z { 'device': 'cuda:0', 2026-02-21T08:49:14.7551074Z 'dtype': 'torch.bfloat16', 2026-02-21T08:49:14.7551329Z 'shape': (3072,), 2026-02-21T08:49:14.7551524Z 'stride': (1,)}, 2026-02-21T08:49:14.7551752Z { 'device': 'cuda:0', 2026-02-21T08:49:14.7552156Z 'dtype': 'torch.bfloat16', 2026-02-21T08:49:14.7552410Z 'shape': (262144, 3072), 2026-02-21T08:49:14.7552730Z 'stride': (3072, 1)}), 2026-02-21T08:49:14.7553276Z 'kwargs': {}} 2026-02-21T08:49:14.7580418Z INFO:tritonbench.utils.triton_op:Took 3.29ms to get benchmark function for helion_welford 2026-02-21T08:49:15.0544337Z [0s] Autotune random seed: 2134763656 2026-02-21T08:49:15.2142664Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:49:30.7239829Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.4 configs/s 2026-02-21T08:49:35.3751936Z [20s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 4], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'first', 'last', 'last'], maxnreg=32, num_sm_multiplier=32, num_stages=8, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, 
False, False], range_multi_buffers=[True, True, False], range_num_stages=[4, 3, 3], range_unroll_factors=[0, 4, 1], range_warp_specializes=[True, None, None]) 2026-02-21T08:49:35.3753379Z Tensor-likes are not close! 2026-02-21T08:49:35.3757102Z 2026-02-21T08:49:35.3759074Z Mismatched elements: 616793150 / 805306368 (76.6%) 2026-02-21T08:49:35.3759494Z Greatest absolute difference: 2.625 at index (1924, 2693) (up to 0.01 allowed) 2026-02-21T08:49:35.3759922Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T08:49:35.3760291Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T08:49:35.3760519Z 2026-02-21T08:50:06.5752834Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 2.6 configs/s 2026-02-21T08:50:06.5768766Z [51s] Adaptive compile timeout: 30s (90% percentile=5.2s, bounds=[30.0s, 30s]) 2026-02-21T08:50:07.1227214Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 228/228 263.4 configs/s 2026-02-21T08:50:07.4537116Z [52s] Initial random population of 100, 5 starting points: 2026-02-21T08:50:07.4540907Z error=7 2026-02-21T08:50:07.4545761Z ok=93 2026-02-21T08:50:07.4547129Z min=0.9124 2026-02-21T08:50:07.4547405Z mid=20.4094 2026-02-21T08:50:07.4547581Z max=684.7160 2026-02-21T08:50:07.4547815Z best={'block_sizes': [128, 32, 64], 2026-02-21T08:50:07.4548106Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:50:07.4548454Z 'load_eviction_policies': ['', 'first', 'first', 'first'], 2026-02-21T08:50:07.4548700Z 'maxnreg': 256, 2026-02-21T08:50:07.4548920Z 'num_sm_multiplier': 128, 2026-02-21T08:50:07.4549146Z 'num_stages': 1, 2026-02-21T08:50:07.4549330Z 'num_warps': 16, 2026-02-21T08:50:07.4549557Z 'pid_type': 'persistent_blocked', 2026-02-21T08:50:07.4553814Z 'range_flattens': [None, None, True], 2026-02-21T08:50:07.4558313Z 'range_multi_buffers': [False, True, True], 2026-02-21T08:50:07.4560444Z 'range_num_stages': [1, 0, 2], 
2026-02-21T08:50:07.4560751Z 'range_unroll_factors': [1, 4, 1], 2026-02-21T08:50:07.4561115Z 'range_warp_specializes': [False, False, True]} 2026-02-21T08:50:07.4561439Z [52s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:50:08.8896066Z [53s] Generation 1 starting: 104 neighbors, 5 active search path(s) 2026-02-21T08:50:26.0273290Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 108/108 8.8 configs/s 2026-02-21T08:50:34.7404057Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 108/108 12.4 configs/s 2026-02-21T08:50:44.6552063Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━ 238/238 23.4 configs/s 2026-02-21T08:50:44.9989885Z [89s] Generation 1 complete: 2026-02-21T08:50:44.9991209Z error=1 2026-02-21T08:50:44.9991441Z ok=108 2026-02-21T08:50:44.9991615Z min=0.8736 2026-02-21T08:50:44.9991815Z mid=1.5565 2026-02-21T08:50:44.9992153Z max=51.3208 2026-02-21T08:50:44.9992364Z best={'block_sizes': [128, 64, 64], 2026-02-21T08:50:44.9992662Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:50:44.9993007Z 'load_eviction_policies': ['', 'last', 'first', 'first'], 2026-02-21T08:50:44.9993643Z 'maxnreg': 256, 2026-02-21T08:50:44.9993843Z 'num_sm_multiplier': 128, 2026-02-21T08:50:44.9994218Z 'num_stages': 1, 2026-02-21T08:50:44.9994392Z 'num_warps': 16, 2026-02-21T08:50:44.9994609Z 'pid_type': 'persistent_blocked', 2026-02-21T08:50:44.9994834Z 'range_flattens': [True, None, True], 2026-02-21T08:50:44.9995099Z 'range_multi_buffers': [False, True, True], 2026-02-21T08:50:44.9995327Z 'range_num_stages': [1, 0, 2], 2026-02-21T08:50:44.9995565Z 'range_unroll_factors': [1, 4, 1], 2026-02-21T08:50:44.9995801Z 'range_warp_specializes': [False, False, True]} 2026-02-21T08:50:45.0025382Z [89s] Fitting surrogate: 209 points, 209 targets 2026-02-21T08:50:46.3304346Z [91s] Generation 2 starting: 100 neighbors, 5 active search path(s) 2026-02-21T08:51:12.9614419Z Generation 2: precompiling 100% 
━━━━━━━━━━━━━━━━━━━━━━━━━━ 105/105 1.8 configs/s 2026-02-21T08:51:20.9841098Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 105/105 13.1 configs/s 2026-02-21T08:51:34.7654363Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━ 238/238 17.0 configs/s 2026-02-21T08:51:35.1454153Z [139s] Generation 2 complete: 2026-02-21T08:51:35.1456091Z error=1 2026-02-21T08:51:35.1456320Z ok=105 2026-02-21T08:51:35.1456489Z min=0.8725 2026-02-21T08:51:35.1456690Z mid=1.3618 2026-02-21T08:51:35.1456848Z max=8.9150 2026-02-21T08:51:35.1457047Z best={'block_sizes': [128, 64, 64], 2026-02-21T08:51:35.1457369Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:51:35.1457683Z 'load_eviction_policies': ['', 'last', 'first', 'first'], 2026-02-21T08:51:35.1457959Z 'maxnreg': 256, 2026-02-21T08:51:35.1458148Z 'num_sm_multiplier': 128, 2026-02-21T08:51:35.1458369Z 'num_stages': 1, 2026-02-21T08:51:35.1458550Z 'num_warps': 16, 2026-02-21T08:51:35.1458847Z 'pid_type': 'persistent_blocked', 2026-02-21T08:51:35.1462255Z 'range_flattens': [True, None, True], 2026-02-21T08:51:35.1466428Z 'range_multi_buffers': [False, True, True], 2026-02-21T08:51:35.1468481Z 'range_num_stages': [1, 0, 2], 2026-02-21T08:51:35.1468801Z 'range_unroll_factors': [1, 4, 1], 2026-02-21T08:51:35.1473436Z 'range_warp_specializes': [False, False, True]} 2026-02-21T08:51:35.1514442Z [139s] Fitting surrogate: 315 points, 315 targets 2026-02-21T08:51:36.6039851Z [141s] Generation 3 starting: 103 neighbors, 5 active search path(s) 2026-02-21T08:52:04.2899816Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 105/105 0.7 configs/s 2026-02-21T08:52:12.1439430Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 105/105 13.4 configs/s 2026-02-21T08:52:30.1188736Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━ 238/238 13.1 configs/s 2026-02-21T08:52:30.4887183Z [195s] Generation 3 complete: 2026-02-21T08:52:30.4890012Z error=1 2026-02-21T08:52:30.4890295Z 
ok=108 2026-02-21T08:52:30.4894830Z min=0.8755 2026-02-21T08:52:30.4899013Z mid=1.1356 2026-02-21T08:52:30.4901085Z max=9.6891 2026-02-21T08:52:30.4901364Z best={'block_sizes': [128, 64, 64], 2026-02-21T08:52:30.4906218Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:52:30.4911092Z 'load_eviction_policies': ['', 'last', 'first', 'first'], 2026-02-21T08:52:30.4912740Z 'maxnreg': 256, 2026-02-21T08:52:30.4913005Z 'num_sm_multiplier': 128, 2026-02-21T08:52:30.4913223Z 'num_stages': 1, 2026-02-21T08:52:30.4913437Z 'num_warps': 16, 2026-02-21T08:52:30.4913690Z 'pid_type': 'persistent_blocked', 2026-02-21T08:52:30.4913960Z 'range_flattens': [True, None, True], 2026-02-21T08:52:30.4917800Z 'range_multi_buffers': [False, True, True], 2026-02-21T08:52:30.4922131Z 'range_num_stages': [1, 0, 2], 2026-02-21T08:52:30.4924275Z 'range_unroll_factors': [1, 4, 1], 2026-02-21T08:52:30.4924579Z 'range_warp_specializes': [False, False, True]} 2026-02-21T08:52:30.4942343Z [195s] Fitting surrogate: 424 points, 424 targets 2026-02-21T08:52:31.9640453Z [196s] Generation 4 starting: 100 neighbors, 5 active search path(s) 2026-02-21T08:52:43.5574567Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 100/100 16.2 configs/s 2026-02-21T08:52:50.6960120Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 100/100 14.1 configs/s 2026-02-21T08:53:14.2140690Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 259/259 10.9 configs/s 2026-02-21T08:53:14.6094201Z [239s] Generation 4 complete: 2026-02-21T08:53:14.6094557Z ok=106 2026-02-21T08:53:14.6094782Z min=0.7835 2026-02-21T08:53:14.6095023Z mid=1.0494 2026-02-21T08:53:14.6095227Z max=5.4626 2026-02-21T08:53:14.6095484Z best={'block_sizes': [128, 32, 128], 2026-02-21T08:53:14.6095885Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:53:14.6096357Z 'load_eviction_policies': ['', 'first', 'first', 'first'], 2026-02-21T08:53:14.6096669Z 'num_sm_multiplier': 
128, 2026-02-21T08:53:14.6096896Z 'num_stages': 1, 2026-02-21T08:53:14.6097102Z 'num_warps': 16, 2026-02-21T08:53:14.6097299Z 'pid_type': 'persistent_blocked', 2026-02-21T08:53:14.6097552Z 'range_flattens': [True, None, True], 2026-02-21T08:53:14.6097813Z 'range_multi_buffers': [False, True, True], 2026-02-21T08:53:14.6098068Z 'range_num_stages': [1, 0, 2], 2026-02-21T08:53:14.6098295Z 'range_unroll_factors': [1, 4, 1], 2026-02-21T08:53:14.6098558Z 'range_warp_specializes': [False, False, True]} 2026-02-21T08:53:14.6158443Z [239s] Fitting surrogate: 530 points, 530 targets 2026-02-21T08:53:16.0395304Z [240s] Generation 5 starting: 96 neighbors, 5 active search path(s) 2026-02-21T08:53:42.4652056Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 0.5 configs/s 2026-02-21T08:53:51.2757674Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 97/97 11.0 configs/s 2026-02-21T08:54:09.6014900Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━ 263/263 14.2 configs/s 2026-02-21T08:54:09.9649066Z [294s] Generation 5 complete: 2026-02-21T08:54:09.9653456Z ok=101 2026-02-21T08:54:09.9657813Z min=0.7679 2026-02-21T08:54:09.9659469Z mid=1.1129 2026-02-21T08:54:09.9659728Z max=4.7145 2026-02-21T08:54:09.9659947Z best={'block_sizes': [128, 64, 128], 2026-02-21T08:54:09.9664472Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:54:09.9666532Z 'load_eviction_policies': ['', 'first', 'first', 'first'], 2026-02-21T08:54:09.9666849Z 'num_stages': 1, 2026-02-21T08:54:09.9667063Z 'num_warps': 16, 2026-02-21T08:54:09.9667254Z 'pid_type': 'flat', 2026-02-21T08:54:09.9667485Z 'range_flattens': [None, None, True], 2026-02-21T08:54:09.9667724Z 'range_multi_buffers': [None, True, True], 2026-02-21T08:54:09.9667986Z 'range_num_stages': [0, 0, 2], 2026-02-21T08:54:09.9668192Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T08:54:09.9668468Z 'range_warp_specializes': [None, False, True]} 2026-02-21T08:54:09.9723489Z [294s] Fitting 
surrogate: 631 points, 631 targets 2026-02-21T08:54:11.4304533Z [296s] Generation 6 starting: 97 neighbors, 5 active search path(s) 2026-02-21T08:54:26.0322826Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99/99 1.7 configs/s 2026-02-21T08:54:34.4221289Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 99/99 11.8 configs/s 2026-02-21T08:54:54.9693010Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 273/273 13.2 configs/s 2026-02-21T08:54:55.3426435Z [340s] Generation 6 complete: 2026-02-21T08:54:55.3430805Z ok=103 2026-02-21T08:54:55.3432568Z min=0.7681 2026-02-21T08:54:55.3432810Z mid=0.9759 2026-02-21T08:54:55.3432979Z max=12.5583 2026-02-21T08:54:55.3433195Z best={'block_sizes': [64, 64, 256], 2026-02-21T08:54:55.3433490Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:54:55.3433844Z 'load_eviction_policies': ['', 'first', 'first', 'first'], 2026-02-21T08:54:55.3434126Z 'num_sm_multiplier': 128, 2026-02-21T08:54:55.3434328Z 'num_stages': 1, 2026-02-21T08:54:55.3434534Z 'num_warps': 16, 2026-02-21T08:54:55.3434733Z 'pid_type': 'persistent_blocked', 2026-02-21T08:54:55.3434991Z 'range_flattens': [True, None, True], 2026-02-21T08:54:55.3435225Z 'range_multi_buffers': [False, True, True], 2026-02-21T08:54:55.3435480Z 'range_num_stages': [1, 0, 2], 2026-02-21T08:54:55.3436010Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T08:54:55.3436308Z 'range_warp_specializes': [False, False, True]} 2026-02-21T08:54:55.3514305Z [340s] Fitting surrogate: 734 points, 734 targets 2026-02-21T08:54:56.7908507Z [341s] Generation 7 starting: 96 neighbors, 5 active search path(s) 2026-02-21T08:55:05.8033451Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98/98 17.8 configs/s 2026-02-21T08:55:12.6530534Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 98/98 14.4 configs/s 2026-02-21T08:55:34.2073157Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━ 273/273 12.6 configs/s 
2026-02-21T08:55:34.5720946Z [379s] Generation 7 complete: 2026-02-21T08:55:34.5725437Z ok=101 2026-02-21T08:55:34.5728706Z min=0.7639 2026-02-21T08:55:34.5733168Z mid=0.9737 2026-02-21T08:55:34.5734668Z max=3.9137 2026-02-21T08:55:34.5734893Z best={'block_sizes': [32, 64, 128], 2026-02-21T08:55:34.5735273Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:55:34.5735754Z 'load_eviction_policies': ['', 'first', 'first', 'first'], 2026-02-21T08:55:34.5736062Z 'num_sm_multiplier': 64, 2026-02-21T08:55:34.5736291Z 'num_stages': 1, 2026-02-21T08:55:34.5736470Z 'num_warps': 8, 2026-02-21T08:55:34.5736691Z 'pid_type': 'persistent_blocked', 2026-02-21T08:55:34.5736919Z 'range_flattens': [True, None, True], 2026-02-21T08:55:34.5737183Z 'range_multi_buffers': [False, True, True], 2026-02-21T08:55:34.5737406Z 'range_num_stages': [1, 0, 2], 2026-02-21T08:55:34.5737650Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T08:55:34.5737880Z 'range_warp_specializes': [False, False, True]} 2026-02-21T08:55:34.5808564Z [379s] Fitting surrogate: 835 points, 835 targets 2026-02-21T08:55:36.0611208Z [380s] Generation 8 starting: 99 neighbors, 5 active search path(s) 2026-02-21T08:55:45.4529590Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 100/100 9.0 configs/s 2026-02-21T08:55:52.5614662Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 100/100 14.1 configs/s 2026-02-21T08:56:13.3473106Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━ 273/273 13.0 configs/s 2026-02-21T08:56:13.7133453Z [418s] Generation 8 complete: 2026-02-21T08:56:13.7137188Z ok=104 2026-02-21T08:56:13.7140318Z min=0.7638 2026-02-21T08:56:13.7145600Z mid=0.9348 2026-02-21T08:56:13.7149943Z max=6.6223 2026-02-21T08:56:13.7154329Z best={'block_sizes': [32, 64, 128], 2026-02-21T08:56:13.7157714Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:56:13.7160965Z 'load_eviction_policies': ['', 'first', 'first', 'first'], 
2026-02-21T08:56:13.7164781Z 'num_sm_multiplier': 64, 2026-02-21T08:56:13.7165194Z 'num_stages': 1, 2026-02-21T08:56:13.7165404Z 'num_warps': 8, 2026-02-21T08:56:13.7169416Z 'pid_type': 'persistent_blocked', 2026-02-21T08:56:13.7172560Z 'range_flattens': [True, None, True], 2026-02-21T08:56:13.7177215Z 'range_multi_buffers': [False, True, True], 2026-02-21T08:56:13.7179335Z 'range_num_stages': [1, 0, 2], 2026-02-21T08:56:13.7179678Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T08:56:13.7184392Z 'range_warp_specializes': [False, False, True]} 2026-02-21T08:56:13.7188756Z [418s] Fitting surrogate: 939 points, 939 targets 2026-02-21T08:56:15.0880342Z [419s] Generation 9 starting: 93 neighbors, 5 active search path(s) 2026-02-21T08:56:23.9157380Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 93/93 17.8 configs/s 2026-02-21T08:56:30.4089121Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 93/93 14.4 configs/s 2026-02-21T08:56:52.3720309Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━ 280/280 12.7 configs/s 2026-02-21T08:56:52.7470787Z [457s] Generation 9 complete: 2026-02-21T08:56:52.7475228Z ok=98 2026-02-21T08:56:52.7479888Z min=0.7629 2026-02-21T08:56:52.7485053Z mid=0.9047 2026-02-21T08:56:52.7489099Z max=1.9897 2026-02-21T08:56:52.7490590Z best={'block_sizes': [32, 64, 128], 2026-02-21T08:56:52.7490962Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T08:56:52.7491789Z 'load_eviction_policies': ['', 'first', 'first', 'first'], 2026-02-21T08:56:52.7492338Z 'num_sm_multiplier': 64, 2026-02-21T08:56:52.7492576Z 'num_stages': 1, 2026-02-21T08:56:52.7492828Z 'num_warps': 8, 2026-02-21T08:56:52.7493075Z 'pid_type': 'persistent_blocked', 2026-02-21T08:56:52.7493391Z 'range_flattens': [True, None, True], 2026-02-21T08:56:52.7493666Z 'range_multi_buffers': [False, True, True], 2026-02-21T08:56:52.7493949Z 'range_num_stages': [1, 0, 2], 2026-02-21T08:56:52.7494169Z 'range_unroll_factors': [0, 4, 1], 
2026-02-21T08:56:52.7494525Z 'range_warp_specializes': [False, False, True]} 2026-02-21T08:56:52.7522282Z [457s] Fitting surrogate: 1037 points, 1037 targets 2026-02-21T08:56:53.7999063Z [458s] Generation 10 starting: 61 neighbors, 3 active search path(s) 2026-02-21T08:57:02.9862888Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64/64 4.2 configs/s 2026-02-21T08:57:07.5094904Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 64/64 14.3 configs/s 2026-02-21T08:57:20.3267481Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━ 284/284 21.9 configs/s 2026-02-21T08:57:20.6433327Z [485s] Generation 10 complete: 2026-02-21T08:57:20.6437534Z ok=65 2026-02-21T08:57:20.6438968Z min=0.7301 2026-02-21T08:57:20.6439207Z mid=0.9932 2026-02-21T08:57:20.6439376Z max=2.7137 2026-02-21T08:57:20.6439590Z best={'block_sizes': [16, 128, 512], 2026-02-21T08:57:20.6439910Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:57:20.6440284Z 'load_eviction_policies': ['last', 'first', 'first', 'last'], 2026-02-21T08:57:20.6440542Z 'num_stages': 3, 2026-02-21T08:57:20.6440753Z 'num_warps': 8, 2026-02-21T08:57:20.6440935Z 'pid_type': 'flat', 2026-02-21T08:57:20.6441160Z 'range_flattens': [None, None, None], 2026-02-21T08:57:20.6441421Z 'range_multi_buffers': [None, None, True], 2026-02-21T08:57:20.6441651Z 'range_num_stages': [0, 4, 1], 2026-02-21T08:57:20.6442053Z 'range_unroll_factors': [0, 3, 0], 2026-02-21T08:57:20.6442329Z 'range_warp_specializes': [None, None, True]} 2026-02-21T08:57:20.6522493Z [485s] Fitting surrogate: 1102 points, 1102 targets 2026-02-21T08:57:21.7286846Z [486s] Generation 11 starting: 63 neighbors, 3 active search path(s) 2026-02-21T08:57:28.0353161Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 64/64 15.6 configs/s 2026-02-21T08:57:32.5089619Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 64/64 14.4 configs/s 2026-02-21T08:57:45.0384645Z Generation 11: verifying 
top configs 100% ━━━━━━━━━━━━━━━ 298/298 23.5 configs/s 2026-02-21T08:57:45.3464942Z [510s] Generation 11 complete: 2026-02-21T08:57:45.3467596Z ok=66 2026-02-21T08:57:45.3467836Z min=0.7086 2026-02-21T08:57:45.3468014Z mid=0.9379 2026-02-21T08:57:45.3468204Z max=4.3643 2026-02-21T08:57:45.3468463Z best={'block_sizes': [32, 128, 256], 2026-02-21T08:57:45.3468781Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:57:45.3469454Z 'load_eviction_policies': ['last', '', '', ''], 2026-02-21T08:57:45.3469725Z 'num_stages': 1, 2026-02-21T08:57:45.3469954Z 'num_warps': 16, 2026-02-21T08:57:45.3470167Z 'pid_type': 'flat', 2026-02-21T08:57:45.3470430Z 'range_flattens': [None, True, None], 2026-02-21T08:57:45.3474566Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:57:45.3478965Z 'range_num_stages': [0, 0, 0], 2026-02-21T08:57:45.3483899Z 'range_unroll_factors': [0, 2, 1], 2026-02-21T08:57:45.3485809Z 'range_warp_specializes': [None, None, True]} 2026-02-21T08:57:45.3548822Z [510s] Fitting surrogate: 1168 points, 1168 targets 2026-02-21T08:57:46.3195646Z [511s] Generation 12 starting: 55 neighbors, 3 active search path(s) 2026-02-21T08:57:52.1057318Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 56/56 13.1 configs/s 2026-02-21T08:57:56.0768116Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 56/56 14.2 configs/s 2026-02-21T08:58:05.8609034Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━ 324/324 32.5 configs/s 2026-02-21T08:58:06.1461543Z [530s] Generation 12 complete: 2026-02-21T08:58:06.1466470Z ok=58 2026-02-21T08:58:06.1470829Z min=0.6870 2026-02-21T08:58:06.1475183Z mid=0.9451 2026-02-21T08:58:06.1479563Z max=4.3805 2026-02-21T08:58:06.1483942Z best={'block_sizes': [32, 128, 256], 2026-02-21T08:58:06.1485404Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:58:06.1485778Z 'load_eviction_policies': ['last', '', '', ''], 
2026-02-21T08:58:06.1486010Z 'num_stages': 1, 2026-02-21T08:58:06.1486213Z 'num_warps': 16, 2026-02-21T08:58:06.1486389Z 'pid_type': 'flat', 2026-02-21T08:58:06.1486586Z 'range_flattens': [None, True, None], 2026-02-21T08:58:06.1486844Z 'range_multi_buffers': [None, None, None], 2026-02-21T08:58:06.1487059Z 'range_num_stages': [0, 0, 0], 2026-02-21T08:58:06.1487293Z 'range_unroll_factors': [0, 2, 0], 2026-02-21T08:58:06.1487530Z 'range_warp_specializes': [None, None, True]} 2026-02-21T08:58:06.1545122Z [530s] Fitting surrogate: 1226 points, 1226 targets 2026-02-21T08:58:06.9248596Z [531s] Generation 13 starting: 40 neighbors, 2 active search path(s) 2026-02-21T08:58:13.3287818Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40/40 3.8 configs/s 2026-02-21T08:58:16.1622207Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 40/40 14.3 configs/s 2026-02-21T08:58:22.4293970Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━ 324/324 50.1 configs/s 2026-02-21T08:58:22.7013683Z [547s] Generation 13 complete: 2026-02-21T08:58:22.7018007Z ok=42 2026-02-21T08:58:22.7019605Z min=0.7321 2026-02-21T08:58:22.7019850Z mid=0.9759 2026-02-21T08:58:22.7024742Z max=3.2849 2026-02-21T08:58:22.7029695Z best={'block_sizes': [32, 128, 256], 2026-02-21T08:58:22.7031294Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:58:22.7031647Z 'load_eviction_policies': ['last', '', '', ''], 2026-02-21T08:58:22.7032193Z 'num_stages': 1, 2026-02-21T08:58:22.7032384Z 'num_warps': 16, 2026-02-21T08:58:22.7032632Z 'pid_type': 'flat', 2026-02-21T08:58:22.7033224Z 'range_flattens': [None, True, None], 2026-02-21T08:58:22.7033464Z 'range_multi_buffers': [None, False, None], 2026-02-21T08:58:22.7033715Z 'range_num_stages': [0, 0, 0], 2026-02-21T08:58:22.7033922Z 'range_unroll_factors': [0, 2, 0], 2026-02-21T08:58:22.7034186Z 'range_warp_specializes': [None, None, True]} 2026-02-21T08:58:22.7088900Z [547s] Fitting surrogate: 1268 
points, 1268 targets 2026-02-21T08:58:23.0301609Z [547s] Autotuning complete in 547.8s after searching 1229 configs. 2026-02-21T08:58:23.0302390Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:58:23.0304166Z @helion.kernel(config=helion.Config(block_sizes=[32, 128, 256], indexing=['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', '', '', ''], num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, True, None], range_multi_buffers=[None, False, None], range_num_stages=[0, 0, 0], range_unroll_factors=[0, 2, 0], range_warp_specializes=[None, None, True]), static_shapes=True) 2026-02-21T08:58:23.0305427Z 2026-02-21T08:58:23.0305818Z [547s] Code of selected kernel: /tmp/torchinductor_root/jp/cjpeyybn5tjqhhryw7sikgt2xfu34tuzenrw6r457e7t37b4f5qc.py 2026-02-21T08:58:24.3141288Z WARNING:tritonbench.utils.triton_op:Completed input ID 4: 2026-02-21T08:58:24.3142395Z x_val 2026-02-21T08:58:24.3142573Z ------- 2026-02-21T08:58:24.3142757Z 3072 2026-02-21T08:58:24.3142849Z 2026-02-21T08:58:24.3199602Z 50%|█████ | 3/6 [50:14<53:34, 1071.61s/it]WARNING:tritonbench.utils.triton_op:Running input ID 5: 2026-02-21T08:58:24.3200044Z x_val 2026-02-21T08:58:24.3204158Z ------- 2026-02-21T08:58:24.3208022Z 4096 2026-02-21T08:58:24.3212582Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for eager_layer_norm 2026-02-21T08:58:25.2204603Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_welford 2026-02-21T08:58:26.5285169Z INFO:tritonbench.utils.triton_op:Took 2.69ms to get benchmark function for torch_compile_welford 2026-02-21T09:13:18.9932504Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:13:18.9936715Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:13:18.9938560Z 'dtype': 'torch.bfloat16', 2026-02-21T09:13:18.9938775Z 'shape': (4096,), 2026-02-21T09:13:18.9938951Z 'stride': (1,)}, 2026-02-21T09:13:18.9939115Z { 'device': 'cuda:0', 
2026-02-21T09:13:18.9939293Z 'dtype': 'torch.bfloat16', 2026-02-21T09:13:18.9939478Z 'shape': (4096,), 2026-02-21T09:13:18.9939633Z 'stride': (1,)}, 2026-02-21T09:13:18.9939797Z { 'device': 'cuda:0', 2026-02-21T09:13:18.9939961Z 'dtype': 'torch.bfloat16', 2026-02-21T09:13:18.9940140Z 'shape': (262144, 4096), 2026-02-21T09:13:18.9940309Z 'stride': (4096, 1)}), 2026-02-21T09:13:18.9940474Z 'kwargs': {}} 2026-02-21T09:13:18.9969066Z INFO:tritonbench.utils.triton_op:Took 4.13ms to get benchmark function for helion_welford 2026-02-21T09:13:19.2966057Z [0s] Autotune random seed: 2134763656 2026-02-21T09:13:19.4498897Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:13:35.6995716Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.2 configs/s 2026-02-21T09:13:41.4157165Z [21s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 4], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'first', 'last', 'last'], maxnreg=32, num_sm_multiplier=32, num_stages=8, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, False, False], range_multi_buffers=[True, True, False], range_num_stages=[4, 3, 3], range_unroll_factors=[0, 4, 1], range_warp_specializes=[True, None, None]) 2026-02-21T09:13:41.4158417Z Tensor-likes are not close! 2026-02-21T09:13:41.4162034Z 2026-02-21T09:13:41.4166526Z Mismatched elements: 13 / 1073741824 (0.0%) 2026-02-21T09:13:41.4168062Z Greatest absolute difference: 0.01953125 at index (112537, 3028) (up to 0.01 allowed) 2026-02-21T09:13:41.4168445Z Greatest relative difference: 0.8203125 at index (160612, 3987) (up to 0.01 allowed) 2026-02-21T09:13:41.4168761Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T09:13:41.4168926Z 2026-02-21T09:14:19.0502251Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 2.6 configs/s 2026-02-21T09:14:19.0519110Z [59s] Adaptive compile timeout: 30s (90% percentile=7.1s, bounds=[30.0s, 30s]) 2026-02-21T09:14:19.1804118Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 171/171 344.8 configs/s 2026-02-21T09:14:19.5910719Z [60s] Initial random population of 100, 5 starting points: 2026-02-21T09:14:19.5915267Z error=7 2026-02-21T09:14:19.5919624Z ok=93 2026-02-21T09:14:19.5921175Z min=1.2433 2026-02-21T09:14:19.5921426Z mid=24.3564 2026-02-21T09:14:19.5926201Z max=767.9077 2026-02-21T09:14:19.5931767Z best={'block_sizes': [128, 32, 64], 2026-02-21T09:14:19.5932667Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:14:19.5932992Z 'load_eviction_policies': ['', 'first', 'first', 'first'], 2026-02-21T09:14:19.5933298Z 'maxnreg': 256, 2026-02-21T09:14:19.5937882Z 'num_sm_multiplier': 128, 2026-02-21T09:14:19.5941928Z 'num_stages': 1, 2026-02-21T09:14:19.5945888Z 'num_warps': 16, 2026-02-21T09:14:19.5949822Z 'pid_type': 'persistent_blocked', 2026-02-21T09:14:19.5952856Z 'range_flattens': [None, None, True], 2026-02-21T09:14:19.5957372Z 'range_multi_buffers': [False, True, True], 2026-02-21T09:14:19.5961645Z 'range_num_stages': [1, 0, 2], 2026-02-21T09:14:19.5965732Z 'range_unroll_factors': [1, 4, 1], 2026-02-21T09:14:19.5969841Z 'range_warp_specializes': [False, False, True]} 2026-02-21T09:14:19.5970196Z [60s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:14:21.0345521Z [61s] Generation 1 starting: 104 neighbors, 5 active search path(s) 2026-02-21T09:14:37.8239338Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 108/108 1.1 configs/s 2026-02-21T09:14:48.2903901Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 108/108 10.2 configs/s 2026-02-21T09:14:56.1580703Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━ 174/174 21.2 configs/s 
2026-02-21T09:14:56.5869372Z [97s] Generation 1 complete: 2026-02-21T09:14:56.5873712Z error=1 2026-02-21T09:14:56.5875583Z ok=109 2026-02-21T09:14:56.5875752Z min=1.1920 2026-02-21T09:14:56.5875880Z mid=2.4217 2026-02-21T09:14:56.5876009Z max=62.5683 2026-02-21T09:14:56.5876156Z best={'block_sizes': [128, 64, 64], 2026-02-21T09:14:56.5876406Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:14:56.5876690Z 'load_eviction_policies': ['', 'first', 'first', 'first'], 2026-02-21T09:14:56.5876896Z 'maxnreg': 256, 2026-02-21T09:14:56.5877043Z 'num_sm_multiplier': 128, 2026-02-21T09:14:56.5877194Z 'num_stages': 1, 2026-02-21T09:14:56.5877330Z 'num_warps': 16, 2026-02-21T09:14:56.5877503Z 'pid_type': 'persistent_blocked', 2026-02-21T09:14:56.5877708Z 'range_flattens': [None, True, True], 2026-02-21T09:14:56.5877902Z 'range_multi_buffers': [False, True, True], 2026-02-21T09:14:56.5878085Z 'range_num_stages': [1, 0, 2], 2026-02-21T09:14:56.5878255Z 'range_unroll_factors': [1, 4, 1], 2026-02-21T09:14:56.5878444Z 'range_warp_specializes': [None, False, True]} 2026-02-21T09:14:56.5891809Z [97s] Fitting surrogate: 210 points, 210 targets 2026-02-21T09:14:57.9281063Z [98s] Generation 2 starting: 98 neighbors, 5 active search path(s) 2026-02-21T09:15:12.5647840Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 102/102 2.4 configs/s 2026-02-21T09:15:20.7765577Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 102/102 12.3 configs/s 2026-02-21T09:15:34.2609097Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━ 174/174 12.6 configs/s 2026-02-21T09:15:34.6864145Z [135s] Generation 2 complete: 2026-02-21T09:15:34.6864666Z error=4 2026-02-21T09:15:34.6864823Z ok=100 2026-02-21T09:15:34.6865274Z min=1.1880 2026-02-21T09:15:34.6865396Z mid=1.7060 2026-02-21T09:15:34.6865520Z max=20.7852 2026-02-21T09:15:34.6865656Z best={'block_sizes': [128, 64, 64], 2026-02-21T09:15:34.6865905Z 'indexing': ['pointer', 'pointer', 'pointer', 
'pointer', 'pointer'], 2026-02-21T09:15:34.6866191Z 'load_eviction_policies': ['', 'first', 'first', 'first'], 2026-02-21T09:15:34.6870196Z 'maxnreg': 256, 2026-02-21T09:15:34.6871750Z 'num_sm_multiplier': 128, 2026-02-21T09:15:34.6872077Z 'num_stages': 1, 2026-02-21T09:15:34.6876223Z 'num_warps': 16, 2026-02-21T09:15:34.6880109Z 'pid_type': 'persistent_blocked', 2026-02-21T09:15:34.6884196Z 'range_flattens': [None, True, True], 2026-02-21T09:15:34.6888629Z 'range_multi_buffers': [False, True, True], 2026-02-21T09:15:34.6891802Z 'range_num_stages': [1, 0, 2], 2026-02-21T09:15:34.6896300Z 'range_unroll_factors': [1, 4, 1], 2026-02-21T09:15:34.6897754Z 'range_warp_specializes': [None, False, True]} 2026-02-21T09:15:34.6902697Z [135s] Fitting surrogate: 314 points, 314 targets 2026-02-21T09:15:36.0439943Z [136s] Generation 3 starting: 98 neighbors, 5 active search path(s) 2026-02-21T09:15:50.4958241Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 100/100 16.0 configs/s 2026-02-21T09:15:58.2520495Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 100/100 12.9 configs/s 2026-02-21T09:16:17.3107662Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━━ 180/180 9.3 configs/s 2026-02-21T09:16:17.7464284Z [178s] Generation 3 complete: 2026-02-21T09:16:17.7467523Z ok=104 2026-02-21T09:16:17.7470707Z min=1.1601 2026-02-21T09:16:17.7474597Z mid=1.4090 2026-02-21T09:16:17.7476044Z max=8.6692 2026-02-21T09:16:17.7476217Z best={'block_sizes': [32, 16, 128], 2026-02-21T09:16:17.7476475Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:16:17.7476730Z 'load_eviction_policies': ['', '', '', ''], 2026-02-21T09:16:17.7476928Z 'num_stages': 1, 2026-02-21T09:16:17.7477099Z 'num_warps': 2, 2026-02-21T09:16:17.7477248Z 'pid_type': 'flat', 2026-02-21T09:16:17.7477425Z 'range_flattens': [None, None, None], 2026-02-21T09:16:17.7477611Z 'range_multi_buffers': [None, True, None], 2026-02-21T09:16:17.7477800Z 'range_num_stages': [0, 0, 
0], 2026-02-21T09:16:17.7477964Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T09:16:17.7478159Z 'range_warp_specializes': [None, None, None]} 2026-02-21T09:16:17.7538235Z [178s] Fitting surrogate: 418 points, 418 targets 2026-02-21T09:16:19.0173137Z [179s] Generation 4 starting: 89 neighbors, 5 active search path(s) 2026-02-21T09:16:30.0165376Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91/91 12.3 configs/s 2026-02-21T09:16:37.0314849Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 91/91 13.0 configs/s 2026-02-21T09:16:54.5670692Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 193/193 10.9 configs/s 2026-02-21T09:16:54.9944143Z [215s] Generation 4 complete: 2026-02-21T09:16:54.9948505Z ok=94 2026-02-21T09:16:54.9952817Z min=1.0414 2026-02-21T09:16:54.9956105Z mid=1.4224 2026-02-21T09:16:54.9959866Z max=11.8784 2026-02-21T09:16:54.9964195Z best={'block_sizes': [32, 32, 128], 2026-02-21T09:16:54.9969164Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:16:54.9973421Z 'load_eviction_policies': ['', '', '', ''], 2026-02-21T09:16:54.9977255Z 'num_stages': 1, 2026-02-21T09:16:54.9977506Z 'num_warps': 2, 2026-02-21T09:16:54.9977692Z 'pid_type': 'flat', 2026-02-21T09:16:54.9977899Z 'range_flattens': [None, None, None], 2026-02-21T09:16:54.9978113Z 'range_multi_buffers': [None, True, None], 2026-02-21T09:16:54.9978307Z 'range_num_stages': [0, 0, 0], 2026-02-21T09:16:54.9978484Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T09:16:54.9978686Z 'range_warp_specializes': [None, True, None]} 2026-02-21T09:16:55.0025241Z [215s] Fitting surrogate: 512 points, 512 targets 2026-02-21T09:16:56.4493352Z [216s] Generation 5 starting: 96 neighbors, 5 active search path(s) 2026-02-21T09:17:10.0468360Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99/99 2.4 configs/s 2026-02-21T09:17:17.5477767Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 99/99 13.2 configs/s 
2026-02-21T09:17:35.4747523Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━ 193/193 10.7 configs/s 2026-02-21T09:17:35.8943715Z [256s] Generation 5 complete: 2026-02-21T09:17:35.8946057Z error=3 2026-02-21T09:17:35.8952526Z ok=99 2026-02-21T09:17:35.8956169Z min=1.0424 2026-02-21T09:17:35.8960057Z mid=1.4203 2026-02-21T09:17:35.8964047Z max=15.7860 2026-02-21T09:17:35.8967370Z best={'block_sizes': [32, 32, 128], 2026-02-21T09:17:35.8971520Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:17:35.8973326Z 'load_eviction_policies': ['', '', '', ''], 2026-02-21T09:17:35.8973520Z 'num_stages': 1, 2026-02-21T09:17:35.8973683Z 'num_warps': 4, 2026-02-21T09:17:35.8973819Z 'pid_type': 'flat', 2026-02-21T09:17:35.8973986Z 'range_flattens': [None, None, None], 2026-02-21T09:17:35.8974208Z 'range_multi_buffers': [None, False, None], 2026-02-21T09:17:35.8974414Z 'range_num_stages': [0, 0, 0], 2026-02-21T09:17:35.8974584Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T09:17:35.8974775Z 'range_warp_specializes': [None, None, None]} 2026-02-21T09:17:35.9022109Z [256s] Fitting surrogate: 614 points, 614 targets 2026-02-21T09:17:37.4196704Z [257s] Generation 6 starting: 101 neighbors, 5 active search path(s) 2026-02-21T09:17:51.8892764Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 104/104 7.4 configs/s 2026-02-21T09:17:59.4875966Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 104/104 13.6 configs/s 2026-02-21T09:18:17.8310064Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 197/197 10.6 configs/s 2026-02-21T09:18:18.2505534Z [298s] Generation 6 complete: 2026-02-21T09:18:18.2509751Z error=7 2026-02-21T09:18:18.2511077Z ok=100 2026-02-21T09:18:18.2511383Z min=1.0272 2026-02-21T09:18:18.2511591Z mid=1.3885 2026-02-21T09:18:18.2511839Z max=7.4476 2026-02-21T09:18:18.2512256Z best={'block_sizes': [64, 32, 128], 2026-02-21T09:18:18.2512668Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 
2026-02-21T09:18:18.2513027Z 'load_eviction_policies': ['', '', '', ''], 2026-02-21T09:18:18.2513317Z 'num_stages': 1, 2026-02-21T09:18:18.2513579Z 'num_warps': 8, 2026-02-21T09:18:18.2513743Z 'pid_type': 'flat', 2026-02-21T09:18:18.2513930Z 'range_flattens': [None, None, None], 2026-02-21T09:18:18.2514123Z 'range_multi_buffers': [None, False, None], 2026-02-21T09:18:18.2514353Z 'range_num_stages': [0, 1, 0], 2026-02-21T09:18:18.2514520Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T09:18:18.2514714Z 'range_warp_specializes': [None, None, None]} 2026-02-21T09:18:18.2596373Z [298s] Fitting surrogate: 721 points, 721 targets 2026-02-21T09:18:19.6660253Z [300s] Generation 7 starting: 97 neighbors, 5 active search path(s) 2026-02-21T09:18:33.8856021Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98/98 2.8 configs/s 2026-02-21T09:18:41.4428992Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 98/98 13.0 configs/s 2026-02-21T09:18:59.8288044Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━ 199/199 10.7 configs/s 2026-02-21T09:19:00.2455714Z [340s] Generation 7 complete: 2026-02-21T09:19:00.2458917Z ok=102 2026-02-21T09:19:00.2463452Z min=1.0291 2026-02-21T09:19:00.2467225Z mid=1.4162 2026-02-21T09:19:00.2471632Z max=6.4337 2026-02-21T09:19:00.2476056Z best={'block_sizes': [64, 64, 128], 2026-02-21T09:19:00.2479467Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:19:00.2479863Z 'load_eviction_policies': ['last', 'first', 'first', 'first'], 2026-02-21T09:19:00.2480134Z 'num_sm_multiplier': 128, 2026-02-21T09:19:00.2484752Z 'num_stages': 1, 2026-02-21T09:19:00.2489157Z 'num_warps': 16, 2026-02-21T09:19:00.2493477Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:19:00.2497875Z 'range_flattens': [None, True, True], 2026-02-21T09:19:00.2499557Z 'range_multi_buffers': [False, True, True], 2026-02-21T09:19:00.2499810Z 'range_num_stages': [0, 1, 2], 2026-02-21T09:19:00.2500009Z 'range_unroll_factors': [1, 
4, 1], 2026-02-21T09:19:00.2500207Z 'range_warp_specializes': [False, False, True]} 2026-02-21T09:19:00.2545409Z [340s] Fitting surrogate: 823 points, 823 targets 2026-02-21T09:19:01.7612374Z [342s] Generation 8 starting: 101 neighbors, 5 active search path(s) 2026-02-21T09:19:15.0340750Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 102/102 7.1 configs/s 2026-02-21T09:19:22.7254636Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 102/102 13.3 configs/s 2026-02-21T09:19:45.6541825Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━━ 200/200 8.7 configs/s 2026-02-21T09:19:46.0925779Z [386s] Generation 8 complete: 2026-02-21T09:19:46.0930245Z ok=106 2026-02-21T09:19:46.0934274Z min=1.0178 2026-02-21T09:19:46.0935807Z mid=1.3107 2026-02-21T09:19:46.0935961Z max=6.0948 2026-02-21T09:19:46.0936112Z best={'block_sizes': [32, 64, 128], 2026-02-21T09:19:46.0936403Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:19:46.0936715Z 'load_eviction_policies': ['last', 'first', 'first', 'first'], 2026-02-21T09:19:46.0936953Z 'num_sm_multiplier': 128, 2026-02-21T09:19:46.0937112Z 'num_stages': 1, 2026-02-21T09:19:46.0937256Z 'num_warps': 8, 2026-02-21T09:19:46.0937408Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:19:46.0937603Z 'range_flattens': [None, True, True], 2026-02-21T09:19:46.0937793Z 'range_multi_buffers': [False, True, True], 2026-02-21T09:19:46.0937991Z 'range_num_stages': [0, 1, 1], 2026-02-21T09:19:46.0938170Z 'range_unroll_factors': [1, 4, 1], 2026-02-21T09:19:46.0938366Z 'range_warp_specializes': [False, False, True]} 2026-02-21T09:19:46.1013430Z [386s] Fitting surrogate: 929 points, 929 targets 2026-02-21T09:19:47.5327941Z [388s] Generation 9 starting: 94 neighbors, 5 active search path(s) 2026-02-21T09:19:59.3462946Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 11.9 configs/s 2026-02-21T09:20:06.8241491Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 97/97 
13.0 configs/s 2026-02-21T09:20:26.6911203Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━ 200/200 10.0 configs/s 2026-02-21T09:20:27.1022808Z [427s] Generation 9 complete: 2026-02-21T09:20:27.1026060Z ok=100 2026-02-21T09:20:27.1030439Z min=1.0307 2026-02-21T09:20:27.1034790Z mid=1.2851 2026-02-21T09:20:27.1037912Z max=14.6852 2026-02-21T09:20:27.1042328Z best={'block_sizes': [32, 64, 128], 2026-02-21T09:20:27.1043814Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:20:27.1044126Z 'load_eviction_policies': ['last', 'first', 'first', 'first'], 2026-02-21T09:20:27.1044367Z 'num_sm_multiplier': 128, 2026-02-21T09:20:27.1044527Z 'num_stages': 1, 2026-02-21T09:20:27.1044676Z 'num_warps': 8, 2026-02-21T09:20:27.1044845Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:20:27.1045041Z 'range_flattens': [None, True, True], 2026-02-21T09:20:27.1045264Z 'range_multi_buffers': [False, True, True], 2026-02-21T09:20:27.1045748Z 'range_num_stages': [0, 1, 1], 2026-02-21T09:20:27.1045922Z 'range_unroll_factors': [1, 4, 0], 2026-02-21T09:20:27.1046121Z 'range_warp_specializes': [False, False, True]} 2026-02-21T09:20:27.1107969Z [427s] Fitting surrogate: 1029 points, 1029 targets 2026-02-21T09:20:28.4912588Z [429s] Generation 10 starting: 94 neighbors, 5 active search path(s) 2026-02-21T09:20:42.5457550Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 96/96 11.1 configs/s 2026-02-21T09:20:50.0146058Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 96/96 12.9 configs/s 2026-02-21T09:21:10.0775068Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━ 203/203 10.0 configs/s 2026-02-21T09:21:10.4958615Z [471s] Generation 10 complete: 2026-02-21T09:21:10.4960639Z ok=99 2026-02-21T09:21:10.4960849Z min=1.0228 2026-02-21T09:21:10.4961004Z mid=1.2759 2026-02-21T09:21:10.4961139Z max=13.5813 2026-02-21T09:21:10.4961307Z best={'block_sizes': [32, 64, 128], 2026-02-21T09:21:10.4965971Z 'indexing': ['pointer', 
'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:21:10.4969568Z 'load_eviction_policies': ['last', 'first', 'first', 'first'], 2026-02-21T09:21:10.4971104Z 'num_stages': 1, 2026-02-21T09:21:10.4971282Z 'num_warps': 8, 2026-02-21T09:21:10.4971440Z 'pid_type': 'flat', 2026-02-21T09:21:10.4971609Z 'range_flattens': [None, True, True], 2026-02-21T09:21:10.4971819Z 'range_multi_buffers': [None, True, True], 2026-02-21T09:21:10.4972104Z 'range_num_stages': [0, 1, 1], 2026-02-21T09:21:10.4972286Z 'range_unroll_factors': [0, 4, 0], 2026-02-21T09:21:10.4972486Z 'range_warp_specializes': [None, False, True]} 2026-02-21T09:21:10.5066292Z [471s] Fitting surrogate: 1128 points, 1128 targets 2026-02-21T09:21:11.6664505Z [472s] Generation 11 starting: 71 neighbors, 4 active search path(s) 2026-02-21T09:21:20.8133220Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 8.4 configs/s 2026-02-21T09:21:26.4787169Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 73/73 12.9 configs/s 2026-02-21T09:21:41.7253577Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━━ 203/203 13.1 configs/s 2026-02-21T09:21:42.1325366Z [502s] Generation 11 complete: 2026-02-21T09:21:42.1329662Z ok=75 2026-02-21T09:21:42.1333539Z min=1.0068 2026-02-21T09:21:42.1335118Z mid=1.3506 2026-02-21T09:21:42.1335331Z max=12.4785 2026-02-21T09:21:42.1339296Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:21:42.1343875Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:21:42.1345198Z 'load_eviction_policies': ['', '', '', 'last'], 2026-02-21T09:21:42.1345476Z 'num_stages': 6, 2026-02-21T09:21:42.1345701Z 'num_warps': 2, 2026-02-21T09:21:42.1345894Z 'pid_type': 'flat', 2026-02-21T09:21:42.1346125Z 'range_flattens': [None, None, None], 2026-02-21T09:21:42.1346361Z 'range_multi_buffers': [None, None, False], 2026-02-21T09:21:42.1346613Z 'range_num_stages': [0, 4, 4], 2026-02-21T09:21:42.1346849Z 
'range_unroll_factors': [0, 1, 0], 2026-02-21T09:21:42.1347120Z 'range_warp_specializes': [None, None, None]} 2026-02-21T09:21:42.1416065Z [502s] Fitting surrogate: 1203 points, 1203 targets 2026-02-21T09:21:43.3414619Z [503s] Generation 12 starting: 75 neighbors, 4 active search path(s) 2026-02-21T09:21:53.2067337Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 9.8 configs/s 2026-02-21T09:21:59.0550579Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 77/77 13.2 configs/s 2026-02-21T09:22:14.6298963Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━ 203/203 12.9 configs/s 2026-02-21T09:22:15.0272473Z [535s] Generation 12 complete: 2026-02-21T09:22:15.0274444Z ok=79 2026-02-21T09:22:15.0274624Z min=1.0372 2026-02-21T09:22:15.0274826Z mid=1.3240 2026-02-21T09:22:15.0275029Z max=7.9887 2026-02-21T09:22:15.0275201Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:22:15.0275900Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:22:15.0276268Z 'load_eviction_policies': ['', '', '', 'last'], 2026-02-21T09:22:15.0276670Z 'num_stages': 7, 2026-02-21T09:22:15.0276854Z 'num_warps': 2, 2026-02-21T09:22:15.0277058Z 'pid_type': 'flat', 2026-02-21T09:22:15.0277258Z 'range_flattens': [None, False, None], 2026-02-21T09:22:15.0277527Z 'range_multi_buffers': [None, None, False], 2026-02-21T09:22:15.0277786Z 'range_num_stages': [0, 4, 4], 2026-02-21T09:22:15.0277976Z 'range_unroll_factors': [0, 1, 0], 2026-02-21T09:22:15.0278238Z 'range_warp_specializes': [None, None, None]} 2026-02-21T09:22:15.0363952Z [535s] Fitting surrogate: 1282 points, 1282 targets 2026-02-21T09:22:16.2954203Z [536s] Generation 13 starting: 78 neighbors, 4 active search path(s) 2026-02-21T09:22:26.5399501Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80/80 7.4 configs/s 2026-02-21T09:22:32.5675597Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 80/80 13.3 configs/s 
2026-02-21T09:22:47.9622074Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━ 203/203 13.0 configs/s 2026-02-21T09:22:48.3675714Z [568s] Generation 13 complete: 2026-02-21T09:22:48.3678904Z ok=82 2026-02-21T09:22:48.3682720Z min=1.0086 2026-02-21T09:22:48.3684624Z mid=1.3415 2026-02-21T09:22:48.3684857Z max=5.5978 2026-02-21T09:22:48.3685039Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:22:48.3685431Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:22:48.3685801Z 'load_eviction_policies': ['', '', '', 'last'], 2026-02-21T09:22:48.3686059Z 'num_stages': 7, 2026-02-21T09:22:48.3686264Z 'num_warps': 2, 2026-02-21T09:22:48.3686440Z 'pid_type': 'flat', 2026-02-21T09:22:48.3686660Z 'range_flattens': [None, True, None], 2026-02-21T09:22:48.3686898Z 'range_multi_buffers': [None, None, False], 2026-02-21T09:22:48.3687146Z 'range_num_stages': [0, 4, 4], 2026-02-21T09:22:48.3687363Z 'range_unroll_factors': [0, 1, 0], 2026-02-21T09:22:48.3687642Z 'range_warp_specializes': [None, None, None]} 2026-02-21T09:22:48.3785582Z [568s] Fitting surrogate: 1364 points, 1364 targets 2026-02-21T09:22:49.5845991Z [570s] Generation 14 starting: 76 neighbors, 4 active search path(s) 2026-02-21T09:22:59.0669645Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 13.8 configs/s 2026-02-21T09:23:04.9107386Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 78/78 13.4 configs/s 2026-02-21T09:23:21.6257814Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━━ 203/203 12.0 configs/s 2026-02-21T09:23:22.0311281Z [602s] Generation 14 complete: 2026-02-21T09:23:22.0315510Z ok=80 2026-02-21T09:23:22.0319312Z min=1.0085 2026-02-21T09:23:22.0323735Z mid=1.2984 2026-02-21T09:23:22.0328215Z max=6.8628 2026-02-21T09:23:22.0332600Z best={'block_sizes': [16, 128, 128], 2026-02-21T09:23:22.0334538Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], 
2026-02-21T09:23:22.0334943Z 'load_eviction_policies': ['', '', '', 'last'], 2026-02-21T09:23:22.0335239Z 'num_stages': 7, 2026-02-21T09:23:22.0335711Z 'num_warps': 2, 2026-02-21T09:23:22.0335926Z 'pid_type': 'flat', 2026-02-21T09:23:22.0336137Z 'range_flattens': [None, True, None], 2026-02-21T09:23:22.0336397Z 'range_multi_buffers': [None, None, False], 2026-02-21T09:23:22.0336608Z 'range_num_stages': [0, 4, 4], 2026-02-21T09:23:22.0336851Z 'range_unroll_factors': [0, 1, 0], 2026-02-21T09:23:22.0337109Z 'range_warp_specializes': [None, None, None]} 2026-02-21T09:23:22.0424260Z [602s] Fitting surrogate: 1444 points, 1444 targets 2026-02-21T09:23:23.1092423Z [603s] Generation 15 starting: 66 neighbors, 4 active search path(s) 2026-02-21T09:23:36.5225489Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 1.1 configs/s 2026-02-21T09:23:41.9288726Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 68/68 12.6 configs/s 2026-02-21T09:23:56.1742795Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━ 203/203 14.0 configs/s 2026-02-21T09:23:56.5696788Z [637s] Generation 15 complete: 2026-02-21T09:23:56.5700223Z ok=70 2026-02-21T09:23:56.5703783Z min=1.0332 2026-02-21T09:23:56.5705448Z mid=1.2789 2026-02-21T09:23:56.5705653Z max=17.5427 2026-02-21T09:23:56.5705874Z best={'block_sizes': [32, 64, 128], 2026-02-21T09:23:56.5706171Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:23:56.5706526Z 'load_eviction_policies': ['last', 'first', 'first', 'first'], 2026-02-21T09:23:56.5706795Z 'num_sm_multiplier': 128, 2026-02-21T09:23:56.5707026Z 'num_stages': 2, 2026-02-21T09:23:56.5707207Z 'num_warps': 8, 2026-02-21T09:23:56.5707428Z 'pid_type': 'persistent_blocked', 2026-02-21T09:23:56.5707684Z 'range_flattens': [False, True, None], 2026-02-21T09:23:56.5707917Z 'range_multi_buffers': [True, True, True], 2026-02-21T09:23:56.5708168Z 'range_num_stages': [0, 3, 1], 2026-02-21T09:23:56.5708373Z 'range_unroll_factors': 
[1, 4, 0], 2026-02-21T09:23:56.5708630Z 'range_warp_specializes': [False, False, True]} 2026-02-21T09:23:56.5818889Z [637s] Fitting surrogate: 1514 points, 1514 targets 2026-02-21T09:23:57.5763726Z [638s] Generation 16 starting: 57 neighbors, 3 active search path(s) 2026-02-21T09:24:04.8491571Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 12.7 configs/s 2026-02-21T09:24:09.2990507Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 59/59 13.3 configs/s 2026-02-21T09:24:20.2174901Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━ 203/203 18.2 configs/s 2026-02-21T09:24:20.6031329Z [661s] Generation 16 complete: 2026-02-21T09:24:20.6032638Z ok=60 2026-02-21T09:24:20.6032856Z min=1.0301 2026-02-21T09:24:20.6033070Z mid=1.3671 2026-02-21T09:24:20.6033228Z max=5.2900 2026-02-21T09:24:20.6033509Z best={'block_sizes': [32, 64, 128], 2026-02-21T09:24:20.6038014Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:24:20.6038487Z 'load_eviction_policies': ['last', 'first', 'first', 'first'], 2026-02-21T09:24:20.6038818Z 'num_sm_multiplier': 128, 2026-02-21T09:24:20.6044502Z 'num_stages': 2, 2026-02-21T09:24:20.6044744Z 'num_warps': 8, 2026-02-21T09:24:20.6044979Z 'pid_type': 'persistent_blocked', 2026-02-21T09:24:20.6045229Z 'range_flattens': [False, True, None], 2026-02-21T09:24:20.6045499Z 'range_multi_buffers': [True, True, True], 2026-02-21T09:24:20.6052386Z 'range_num_stages': [0, 4, 1], 2026-02-21T09:24:20.6055123Z 'range_unroll_factors': [1, 4, 0], 2026-02-21T09:24:20.6055390Z 'range_warp_specializes': [False, False, True]} 2026-02-21T09:24:20.6132081Z [661s] Fitting surrogate: 1574 points, 1574 targets 2026-02-21T09:24:21.6298207Z [662s] Generation 17 starting: 60 neighbors, 3 active search path(s) 2026-02-21T09:24:29.3101174Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 13.8 configs/s 2026-02-21T09:24:34.1123435Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 
62/62 13.0 configs/s 2026-02-21T09:24:46.8707424Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━━ 203/203 15.6 configs/s 2026-02-21T09:24:47.2613421Z [687s] Generation 17 complete: 2026-02-21T09:24:47.2617910Z ok=63 2026-02-21T09:24:47.2619892Z min=1.0272 2026-02-21T09:24:47.2620128Z mid=1.3158 2026-02-21T09:24:47.2620299Z max=8.8658 2026-02-21T09:24:47.2620510Z best={'block_sizes': [32, 64, 128], 2026-02-21T09:24:47.2620805Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:24:47.2621157Z 'load_eviction_policies': ['last', 'first', 'first', 'first'], 2026-02-21T09:24:47.2621432Z 'num_sm_multiplier': 128, 2026-02-21T09:24:47.2621654Z 'num_stages': 2, 2026-02-21T09:24:47.2621929Z 'num_warps': 8, 2026-02-21T09:24:47.2622130Z 'pid_type': 'persistent_blocked', 2026-02-21T09:24:47.2622384Z 'range_flattens': [None, True, None], 2026-02-21T09:24:47.2622620Z 'range_multi_buffers': [True, True, True], 2026-02-21T09:24:47.2622880Z 'range_num_stages': [0, 4, 1], 2026-02-21T09:24:47.2623091Z 'range_unroll_factors': [1, 4, 0], 2026-02-21T09:24:47.2623359Z 'range_warp_specializes': [False, False, True]} 2026-02-21T09:24:47.2723541Z [687s] Fitting surrogate: 1637 points, 1637 targets 2026-02-21T09:24:48.2520577Z [688s] Generation 18 starting: 57 neighbors, 3 active search path(s) 2026-02-21T09:24:56.5492885Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 4.3 configs/s 2026-02-21T09:25:00.9934486Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 59/59 13.4 configs/s 2026-02-21T09:25:14.0009468Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━ 203/203 15.3 configs/s 2026-02-21T09:25:14.4089842Z [714s] Generation 18 complete: 2026-02-21T09:25:14.4094315Z ok=60 2026-02-21T09:25:14.4096384Z min=1.0239 2026-02-21T09:25:14.4096616Z mid=1.3394 2026-02-21T09:25:14.4096852Z max=4.0941 2026-02-21T09:25:14.4101485Z best={'block_sizes': [32, 64, 128], 2026-02-21T09:25:14.4106671Z 'indexing': ['pointer', 
'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:25:14.4110511Z 'load_eviction_policies': ['last', 'first', 'first', 'first'], 2026-02-21T09:25:14.4110932Z 'num_sm_multiplier': 64, 2026-02-21T09:25:14.4111227Z 'num_stages': 2, 2026-02-21T09:25:14.4118690Z 'num_warps': 8, 2026-02-21T09:25:14.4118939Z 'pid_type': 'persistent_blocked', 2026-02-21T09:25:14.4119199Z 'range_flattens': [None, True, None], 2026-02-21T09:25:14.4119427Z 'range_multi_buffers': [True, True, True], 2026-02-21T09:25:14.4119683Z 'range_num_stages': [0, 4, 1], 2026-02-21T09:25:14.4119892Z 'range_unroll_factors': [1, 4, 0], 2026-02-21T09:25:14.4120163Z 'range_warp_specializes': [False, False, True]} 2026-02-21T09:25:14.4209282Z [714s] Fitting surrogate: 1697 points, 1697 targets 2026-02-21T09:25:15.4086999Z [715s] Generation 19 starting: 59 neighbors, 3 active search path(s) 2026-02-21T09:25:23.5166712Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 6.7 configs/s 2026-02-21T09:25:28.3936441Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 61/61 12.6 configs/s 2026-02-21T09:25:39.2840120Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━━━ 203/203 18.3 configs/s 2026-02-21T09:25:39.6861833Z [740s] Generation 19 complete: 2026-02-21T09:25:39.6866057Z ok=62 2026-02-21T09:25:39.6870177Z min=1.0260 2026-02-21T09:25:39.6874720Z mid=1.3823 2026-02-21T09:25:39.6874983Z max=28.0750 2026-02-21T09:25:39.6875219Z best={'block_sizes': [32, 64, 128], 2026-02-21T09:25:39.6875658Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:25:39.6876141Z 'load_eviction_policies': ['last', 'first', 'first', 'first'], 2026-02-21T09:25:39.6876477Z 'num_sm_multiplier': 64, 2026-02-21T09:25:39.6876677Z 'num_stages': 2, 2026-02-21T09:25:39.6876886Z 'num_warps': 8, 2026-02-21T09:25:39.6877150Z 'pid_type': 'persistent_blocked', 2026-02-21T09:25:39.6877442Z 'range_flattens': [None, True, None], 2026-02-21T09:25:39.6877752Z 'range_multi_buffers': [True, 
True, True], 2026-02-21T09:25:39.6877990Z 'range_num_stages': [0, 4, 0], 2026-02-21T09:25:39.6878202Z 'range_unroll_factors': [1, 4, 0], 2026-02-21T09:25:39.6878444Z 'range_warp_specializes': [False, False, True]} 2026-02-21T09:25:39.6974724Z [740s] Fitting surrogate: 1759 points, 1759 targets 2026-02-21T09:25:40.4600726Z [741s] Generation 20 starting: 41 neighbors, 2 active search path(s) 2026-02-21T09:25:49.2687323Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 3.0 configs/s 2026-02-21T09:25:53.2956445Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 42/42 10.4 configs/s 2026-02-21T09:26:00.5831076Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━━━ 203/203 26.8 configs/s 2026-02-21T09:26:00.9469760Z [761s] Generation 20 complete: 2026-02-21T09:26:00.9474070Z ok=43 2026-02-21T09:26:00.9478564Z min=1.0352 2026-02-21T09:26:00.9481664Z mid=1.3558 2026-02-21T09:26:00.9485657Z max=84.9039 2026-02-21T09:26:00.9488881Z best={'block_sizes': [32, 64, 128], 2026-02-21T09:26:00.9492230Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:26:00.9496133Z 'load_eviction_policies': ['last', 'first', 'first', 'first'], 2026-02-21T09:26:00.9496564Z 'num_sm_multiplier': 64, 2026-02-21T09:26:00.9497188Z 'num_stages': 2, 2026-02-21T09:26:00.9501737Z 'num_warps': 8, 2026-02-21T09:26:00.9505633Z 'pid_type': 'persistent_blocked', 2026-02-21T09:26:00.9507538Z 'range_flattens': [None, True, None], 2026-02-21T09:26:00.9507824Z 'range_multi_buffers': [True, True, True], 2026-02-21T09:26:00.9508090Z 'range_num_stages': [0, 4, 0], 2026-02-21T09:26:00.9508311Z 'range_unroll_factors': [1, 4, 1], 2026-02-21T09:26:00.9508585Z 'range_warp_specializes': [False, False, True]} 2026-02-21T09:26:00.9569343Z [761s] Fitting surrogate: 1802 points, 1802 targets 2026-02-21T09:26:01.2919756Z [761s] Autotuning complete in 761.8s after searching 1756 configs. 
2026-02-21T09:26:01.2924811Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:26:01.2929596Z @helion.kernel(config=helion.Config(block_sizes=[32, 64, 128], indexing=['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['last', 'first', 'first', 'first'], num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True, None], range_multi_buffers=[True, True, True], range_num_stages=[0, 4, 0], range_unroll_factors=[1, 4, 1], range_warp_specializes=[False, False, True]), static_shapes=True) 2026-02-21T09:26:01.2930715Z 2026-02-21T09:26:01.2933631Z [761s] Code of selected kernel: /tmp/torchinductor_root/zi/cziazpiyrvyznuyg3k36sttw6w6nttnxzcvfoz2jdz2lp2i3lgya.py 2026-02-21T09:26:02.5644144Z WARNING:tritonbench.utils.triton_op:Completed input ID 5: 2026-02-21T09:26:02.5648364Z x_val 2026-02-21T09:26:02.5652709Z ------- 2026-02-21T09:26:02.5657144Z 4096 2026-02-21T09:26:02.5661437Z 2026-02-21T09:26:02.5723271Z 67%|██████▋ | 4/6 [1:17:52<43:26, 1303.21s/it]WARNING:tritonbench.utils.triton_op:Running input ID 7: 2026-02-21T09:26:02.5723801Z x_val 2026-02-21T09:26:02.5724000Z ------- 2026-02-21T09:26:02.5724165Z 6144 2026-02-21T09:26:02.5749597Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for eager_layer_norm 2026-02-21T09:26:03.3776298Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_welford 2026-02-21T09:26:04.7034599Z INFO:tritonbench.utils.triton_op:Took 2.63ms to get benchmark function for torch_compile_welford 2026-02-21T09:49:57.0627886Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:49:57.0628223Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:49:57.0628435Z 'dtype': 'torch.bfloat16', 2026-02-21T09:49:57.0628670Z 'shape': (6144,), 2026-02-21T09:49:57.0628892Z 'stride': (1,)}, 2026-02-21T09:49:57.0636246Z { 'device': 'cuda:0', 2026-02-21T09:49:57.0636487Z 'dtype': 'torch.bfloat16', 
2026-02-21T09:49:57.0636702Z 'shape': (6144,), 2026-02-21T09:49:57.0636872Z 'stride': (1,)}, 2026-02-21T09:49:57.0637042Z { 'device': 'cuda:0', 2026-02-21T09:49:57.0637214Z 'dtype': 'torch.bfloat16', 2026-02-21T09:49:57.0637401Z 'shape': (262144, 6144), 2026-02-21T09:49:57.0637594Z 'stride': (6144, 1)}), 2026-02-21T09:49:57.0637779Z 'kwargs': {}} 2026-02-21T09:49:57.0663233Z INFO:tritonbench.utils.triton_op:Took 3.91ms to get benchmark function for helion_welford 2026-02-21T09:49:57.3610495Z [0s] Autotune random seed: 2134763656 2026-02-21T09:49:57.5885272Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:50:31.7560876Z [34s] Timeout after 30s compiling Config(block_sizes=[512, 1, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first', 'last', ''], maxnreg=128, num_sm_multiplier=64, num_stages=8, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False, None], range_multi_buffers=[False, None, None], range_num_stages=[0, 4, 1], range_unroll_factors=[4, 4, 4], range_warp_specializes=[False, None, False]) 2026-02-21T09:50:31.7577561Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.3 configs/s 2026-02-21T09:51:45.7565657Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 1.2 configs/s 2026-02-21T09:51:45.7579345Z [108s] Adaptive compile timeout: 30s (90% percentile=12.3s, bounds=[30.0s, 30s]) 2026-02-21T09:51:45.7744245Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 110/110 - configs/s 2026-02-21T09:51:46.2988549Z [108s] Initial random population of 100, 5 starting points: 2026-02-21T09:51:46.2992874Z error=6 2026-02-21T09:51:46.2996657Z timeout=1 2026-02-21T09:51:46.3000957Z ok=93 2026-02-21T09:51:46.3006089Z min=1.8575 2026-02-21T09:51:46.3010703Z mid=47.8239 2026-02-21T09:51:46.3014669Z max=1359.5135 
2026-02-21T09:51:46.3014938Z best={'block_sizes': [128, 32, 64], 2026-02-21T09:51:46.3015248Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:51:46.3015564Z 'load_eviction_policies': ['', 'first', 'first', 'first'], 2026-02-21T09:51:46.3019695Z 'maxnreg': 256, 2026-02-21T09:51:46.3024864Z 'num_sm_multiplier': 128, 2026-02-21T09:51:46.3028854Z 'num_stages': 1, 2026-02-21T09:51:46.3032494Z 'num_warps': 16, 2026-02-21T09:51:46.3034045Z 'pid_type': 'persistent_blocked', 2026-02-21T09:51:46.3034277Z 'range_flattens': [None, None, True], 2026-02-21T09:51:46.3034493Z 'range_multi_buffers': [False, True, True], 2026-02-21T09:51:46.3034690Z 'range_num_stages': [1, 0, 2], 2026-02-21T09:51:46.3034872Z 'range_unroll_factors': [1, 4, 1], 2026-02-21T09:51:46.3035077Z 'range_warp_specializes': [False, False, True]} 2026-02-21T09:51:46.3035388Z [108s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:51:47.7079556Z [110s] Generation 1 starting: 100 neighbors, 5 active search path(s) 2026-02-21T09:52:05.3067471Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 105/105 9.2 configs/s 2026-02-21T09:52:17.7687816Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 105/105 8.4 configs/s 2026-02-21T09:52:26.3084165Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━ 118/118 13.0 configs/s 2026-02-21T09:52:26.9270622Z [149s] Generation 1 complete: 2026-02-21T09:52:26.9275703Z ok=106 2026-02-21T09:52:26.9280106Z min=1.7541 2026-02-21T09:52:26.9281527Z mid=3.5882 2026-02-21T09:52:26.9281686Z max=30.2223 2026-02-21T09:52:26.9281827Z best={'block_sizes': [128, 64, 64], 2026-02-21T09:52:26.9282242Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:52:26.9282527Z 'load_eviction_policies': ['', 'last', 'first', 'first'], 2026-02-21T09:52:26.9282735Z 'maxnreg': 256, 2026-02-21T09:52:26.9282886Z 'num_sm_multiplier': 128, 2026-02-21T09:52:26.9283037Z 'num_stages': 1, 2026-02-21T09:52:26.9283177Z 
'num_warps': 16, 2026-02-21T09:52:26.9283321Z 'pid_type': 'persistent_blocked', 2026-02-21T09:52:26.9283508Z 'range_flattens': [True, None, True], 2026-02-21T09:52:26.9283698Z 'range_multi_buffers': [False, True, True], 2026-02-21T09:52:26.9283886Z 'range_num_stages': [1, 0, 2], 2026-02-21T09:52:26.9284344Z 'range_unroll_factors': [1, 4, 1], 2026-02-21T09:52:26.9284543Z 'range_warp_specializes': [False, False, True]} 2026-02-21T09:52:26.9303076Z [149s] Fitting surrogate: 206 points, 206 targets 2026-02-21T09:52:28.2249596Z [150s] Generation 2 starting: 96 neighbors, 5 active search path(s) 2026-02-21T09:52:50.0782587Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99/99 0.8 configs/s 2026-02-21T09:52:59.7635350Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 99/99 10.2 configs/s 2026-02-21T09:53:09.8319594Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━ 120/120 11.4 configs/s 2026-02-21T09:53:10.4223887Z [192s] Generation 2 complete: 2026-02-21T09:53:10.4224141Z error=2 2026-02-21T09:53:10.4224350Z ok=99 2026-02-21T09:53:10.4224509Z min=1.7162 2026-02-21T09:53:10.4224714Z mid=2.8468 2026-02-21T09:53:10.4225213Z max=21.7780 2026-02-21T09:53:10.4225425Z best={'block_sizes': [32, 64, 64], 2026-02-21T09:53:10.4225695Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:53:10.4226005Z 'load_eviction_policies': ['', 'last', 'first', 'first'], 2026-02-21T09:53:10.4226241Z 'maxnreg': 256, 2026-02-21T09:53:10.4226386Z 'num_sm_multiplier': 128, 2026-02-21T09:53:10.4226543Z 'num_stages': 1, 2026-02-21T09:53:10.4226677Z 'num_warps': 4, 2026-02-21T09:53:10.4226831Z 'pid_type': 'persistent_blocked', 2026-02-21T09:53:10.4227008Z 'range_flattens': [True, None, True], 2026-02-21T09:53:10.4227207Z 'range_multi_buffers': [False, True, True], 2026-02-21T09:53:10.4227397Z 'range_num_stages': [1, 0, 2], 2026-02-21T09:53:10.4227561Z 'range_unroll_factors': [1, 4, 1], 2026-02-21T09:53:10.4227758Z 
'range_warp_specializes': [False, None, True]} 2026-02-21T09:53:10.4256791Z [192s] Fitting surrogate: 307 points, 307 targets 2026-02-21T09:53:11.8363740Z [194s] Generation 3 starting: 103 neighbors, 5 active search path(s) 2026-02-21T09:53:32.2249870Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 107/107 1.9 configs/s 2026-02-21T09:53:41.8448872Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 107/107 11.1 configs/s 2026-02-21T09:53:59.9973002Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━━ 129/129 7.0 configs/s 2026-02-21T09:54:00.5765672Z [242s] Generation 3 complete: 2026-02-21T09:54:00.5765933Z ok=109 2026-02-21T09:54:00.5766079Z min=1.4541 2026-02-21T09:54:00.5766243Z mid=2.2261 2026-02-21T09:54:00.5766375Z max=18.8641 2026-02-21T09:54:00.5766554Z best={'block_sizes': [4, 128, 256], 2026-02-21T09:54:00.5766829Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:54:00.5767112Z 'load_eviction_policies': ['last', '', '', ''], 2026-02-21T09:54:00.5767304Z 'num_stages': 5, 2026-02-21T09:54:00.5767442Z 'num_warps': 1, 2026-02-21T09:54:00.5767589Z 'pid_type': 'flat', 2026-02-21T09:54:00.5767748Z 'range_flattens': [None, None, False], 2026-02-21T09:54:00.5767951Z 'range_multi_buffers': [None, True, True], 2026-02-21T09:54:00.5768134Z 'range_num_stages': [0, 3, 0], 2026-02-21T09:54:00.5768329Z 'range_unroll_factors': [0, 4, 0], 2026-02-21T09:54:00.5768518Z 'range_warp_specializes': [None, None, False]} 2026-02-21T09:54:00.5817377Z [242s] Fitting surrogate: 416 points, 416 targets 2026-02-21T09:54:01.9058550Z [244s] Generation 4 starting: 95 neighbors, 5 active search path(s) 2026-02-21T09:54:23.5681443Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 1.2 configs/s 2026-02-21T09:54:32.3937363Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 97/97 11.0 configs/s 2026-02-21T09:54:49.5285481Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━━ 137/137 7.9 
configs/s 2026-02-21T09:54:50.0817233Z [292s] Generation 4 complete: 2026-02-21T09:54:50.0817480Z ok=100 2026-02-21T09:54:50.0817630Z min=1.4603 2026-02-21T09:54:50.0817794Z mid=2.1157 2026-02-21T09:54:50.0817921Z max=17.5273 2026-02-21T09:54:50.0818115Z best={'block_sizes': [4, 128, 256], 2026-02-21T09:54:50.0826415Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:54:50.0827062Z 'load_eviction_policies': ['last', '', '', ''], 2026-02-21T09:54:50.0827289Z 'num_stages': 5, 2026-02-21T09:54:50.0827436Z 'num_warps': 1, 2026-02-21T09:54:50.0827693Z 'pid_type': 'flat', 2026-02-21T09:54:50.0827884Z 'range_flattens': [None, None, False], 2026-02-21T09:54:50.0828165Z 'range_multi_buffers': [None, True, True], 2026-02-21T09:54:50.0828429Z 'range_num_stages': [0, 3, 0], 2026-02-21T09:54:50.0828663Z 'range_unroll_factors': [0, 4, 0], 2026-02-21T09:54:50.0828945Z 'range_warp_specializes': [None, None, None]} 2026-02-21T09:54:50.0873514Z [292s] Fitting surrogate: 516 points, 516 targets 2026-02-21T09:54:51.4152770Z [293s] Generation 5 starting: 87 neighbors, 5 active search path(s) 2026-02-21T09:55:05.9552677Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 6.6 configs/s 2026-02-21T09:55:13.8813003Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 11.2 configs/s 2026-02-21T09:55:30.9770458Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━━ 137/137 7.9 configs/s 2026-02-21T09:55:31.5207690Z [333s] Generation 5 complete: 2026-02-21T09:55:31.5208032Z ok=93 2026-02-21T09:55:31.5208285Z min=1.4601 2026-02-21T09:55:31.5208487Z mid=1.9436 2026-02-21T09:55:31.5208706Z max=19.6439 2026-02-21T09:55:31.5208898Z best={'block_sizes': [4, 128, 256], 2026-02-21T09:55:31.5209247Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:55:31.5209610Z 'load_eviction_policies': ['last', '', '', ''], 2026-02-21T09:55:31.5209847Z 'num_stages': 5, 
2026-02-21T09:55:31.5210064Z 'num_warps': 1, 2026-02-21T09:55:31.5210251Z 'pid_type': 'flat', 2026-02-21T09:55:31.5210490Z 'range_flattens': [None, None, None], 2026-02-21T09:55:31.5210739Z 'range_multi_buffers': [None, True, True], 2026-02-21T09:55:31.5211004Z 'range_num_stages': [0, 3, 0], 2026-02-21T09:55:31.5211220Z 'range_unroll_factors': [0, 4, 0], 2026-02-21T09:55:31.5211490Z 'range_warp_specializes': [None, None, None]} 2026-02-21T09:55:31.5260756Z [333s] Fitting surrogate: 609 points, 609 targets 2026-02-21T09:55:32.7032741Z [335s] Generation 6 starting: 74 neighbors, 4 active search path(s) 2026-02-21T09:55:45.3319079Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 13.8 configs/s 2026-02-21T09:55:52.3828403Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 10.8 configs/s 2026-02-21T09:56:05.5571310Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 138/138 10.2 configs/s 2026-02-21T09:56:06.0988447Z [368s] Generation 6 complete: 2026-02-21T09:56:06.0988796Z ok=78 2026-02-21T09:56:06.0989011Z min=1.4582 2026-02-21T09:56:06.0989252Z mid=2.1617 2026-02-21T09:56:06.0995878Z max=16.9810 2026-02-21T09:56:06.0996115Z best={'block_sizes': [4, 128, 256], 2026-02-21T09:56:06.0996498Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:56:06.0996866Z 'load_eviction_policies': ['last', '', '', ''], 2026-02-21T09:56:06.0997100Z 'num_stages': 5, 2026-02-21T09:56:06.0997289Z 'num_warps': 1, 2026-02-21T09:56:06.0997492Z 'pid_type': 'flat', 2026-02-21T09:56:06.0998075Z 'range_flattens': [None, None, True], 2026-02-21T09:56:06.0998303Z 'range_multi_buffers': [None, True, True], 2026-02-21T09:56:06.0998568Z 'range_num_stages': [0, 3, 0], 2026-02-21T09:56:06.0998805Z 'range_unroll_factors': [0, 4, 0], 2026-02-21T09:56:06.0999040Z 'range_warp_specializes': [None, None, None]} 2026-02-21T09:56:06.1039239Z [368s] Fitting surrogate: 687 points, 687 targets 2026-02-21T09:56:07.3184577Z [369s] 
Generation 7 starting: 76 neighbors, 4 active search path(s) 2026-02-21T09:56:20.4198966Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 9.2 configs/s 2026-02-21T09:56:27.2592108Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 11.4 configs/s 2026-02-21T09:56:41.4274449Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━━ 138/138 9.5 configs/s 2026-02-21T09:56:41.9654940Z [404s] Generation 7 complete: 2026-02-21T09:56:41.9659852Z ok=80 2026-02-21T09:56:41.9661454Z min=1.4613 2026-02-21T09:56:41.9662148Z mid=2.1745 2026-02-21T09:56:41.9662359Z max=11.0126 2026-02-21T09:56:41.9662562Z best={'block_sizes': [4, 128, 256], 2026-02-21T09:56:41.9662878Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:56:41.9663237Z 'load_eviction_policies': ['last', '', '', ''], 2026-02-21T09:56:41.9663465Z 'num_stages': 5, 2026-02-21T09:56:41.9663672Z 'num_warps': 1, 2026-02-21T09:56:41.9663852Z 'pid_type': 'flat', 2026-02-21T09:56:41.9664076Z 'range_flattens': [None, None, True], 2026-02-21T09:56:41.9664333Z 'range_multi_buffers': [None, None, True], 2026-02-21T09:56:41.9664549Z 'range_num_stages': [0, 3, 0], 2026-02-21T09:56:41.9664743Z 'range_unroll_factors': [0, 4, 0], 2026-02-21T09:56:41.9664929Z 'range_warp_specializes': [None, None, None]} 2026-02-21T09:56:41.9714096Z [404s] Fitting surrogate: 767 points, 767 targets 2026-02-21T09:56:43.1188903Z [405s] Generation 8 starting: 71 neighbors, 4 active search path(s) 2026-02-21T09:56:56.2689248Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 4.5 configs/s 2026-02-21T09:57:02.5945988Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 11.7 configs/s 2026-02-21T09:57:18.0702298Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━━ 144/144 9.1 configs/s 2026-02-21T09:57:18.6105137Z [441s] Generation 8 complete: 2026-02-21T09:57:18.6105434Z ok=76 2026-02-21T09:57:18.6105578Z min=1.4602 
2026-02-21T09:57:18.6105742Z mid=2.0797 2026-02-21T09:57:18.6105875Z max=7.4516 2026-02-21T09:57:18.6106041Z best={'block_sizes': [4, 128, 256], 2026-02-21T09:57:18.6106339Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:57:18.6106646Z 'load_eviction_policies': ['last', '', '', ''], 2026-02-21T09:57:18.6106854Z 'num_stages': 5, 2026-02-21T09:57:18.6106996Z 'num_warps': 1, 2026-02-21T09:57:18.6107129Z 'pid_type': 'flat', 2026-02-21T09:57:18.6107294Z 'range_flattens': [None, None, True], 2026-02-21T09:57:18.6107508Z 'range_multi_buffers': [None, True, True], 2026-02-21T09:57:18.6107701Z 'range_num_stages': [0, 3, 0], 2026-02-21T09:57:18.6108206Z 'range_unroll_factors': [0, 4, 0], 2026-02-21T09:57:18.6108405Z 'range_warp_specializes': [None, False, None]} 2026-02-21T09:57:18.6169943Z [441s] Fitting surrogate: 843 points, 843 targets 2026-02-21T09:57:19.8177580Z [442s] Generation 9 starting: 76 neighbors, 4 active search path(s) 2026-02-21T09:57:33.4298805Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 9.3 configs/s 2026-02-21T09:57:40.3327754Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 11.6 configs/s 2026-02-21T09:57:54.3639186Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━ 144/144 10.0 configs/s 2026-02-21T09:57:54.9051525Z [477s] Generation 9 complete: 2026-02-21T09:57:54.9051769Z ok=81 2026-02-21T09:57:54.9052108Z min=1.4602 2026-02-21T09:57:54.9053009Z mid=2.1012 2026-02-21T09:57:54.9053141Z max=14.8818 2026-02-21T09:57:54.9053280Z best={'block_sizes': [4, 128, 256], 2026-02-21T09:57:54.9053867Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:57:54.9054172Z 'load_eviction_policies': ['last', '', '', ''], 2026-02-21T09:57:54.9054359Z 'num_stages': 5, 2026-02-21T09:57:54.9054505Z 'num_warps': 1, 2026-02-21T09:57:54.9054641Z 'pid_type': 'flat', 2026-02-21T09:57:54.9054804Z 'range_flattens': [None, True, True], 
2026-02-21T09:57:54.9054991Z 'range_multi_buffers': [None, True, True], 2026-02-21T09:57:54.9055180Z 'range_num_stages': [0, 3, 0], 2026-02-21T09:57:54.9055343Z 'range_unroll_factors': [0, 4, 0], 2026-02-21T09:57:54.9055592Z 'range_warp_specializes': [None, False, None]} 2026-02-21T09:57:54.9112581Z [477s] Fitting surrogate: 924 points, 924 targets 2026-02-21T09:57:56.1793284Z [478s] Generation 10 starting: 80 neighbors, 4 active search path(s) 2026-02-21T09:58:09.4010037Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 9.3 configs/s 2026-02-21T09:58:16.7302861Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 83/83 11.3 configs/s 2026-02-21T09:58:31.4230216Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━━ 144/144 9.6 configs/s 2026-02-21T09:58:31.9574950Z [514s] Generation 10 complete: 2026-02-21T09:58:31.9575228Z ok=85 2026-02-21T09:58:31.9575383Z min=1.4601 2026-02-21T09:58:31.9575537Z mid=2.1186 2026-02-21T09:58:31.9575672Z max=14.0379 2026-02-21T09:58:31.9575833Z best={'block_sizes': [4, 128, 256], 2026-02-21T09:58:31.9576488Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:58:31.9576829Z 'load_eviction_policies': ['last', '', '', ''], 2026-02-21T09:58:31.9577044Z 'num_stages': 5, 2026-02-21T09:58:31.9577189Z 'num_warps': 1, 2026-02-21T09:58:31.9577345Z 'pid_type': 'flat', 2026-02-21T09:58:31.9577511Z 'range_flattens': [None, True, True], 2026-02-21T09:58:31.9577713Z 'range_multi_buffers': [None, True, True], 2026-02-21T09:58:31.9577903Z 'range_num_stages': [0, 3, 1], 2026-02-21T09:58:31.9578072Z 'range_unroll_factors': [0, 4, 0], 2026-02-21T09:58:31.9578297Z 'range_warp_specializes': [None, False, None]} 2026-02-21T09:58:31.9641474Z [514s] Fitting surrogate: 1009 points, 1009 targets 2026-02-21T09:58:33.1511237Z [515s] Generation 11 starting: 75 neighbors, 4 active search path(s) 2026-02-21T09:58:50.7001031Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 
7.8 configs/s 2026-02-21T09:58:57.4881485Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 78/78 11.5 configs/s 2026-02-21T09:59:11.4378752Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━━ 144/144 10.0 configs/s 2026-02-21T09:59:11.9732697Z [554s] Generation 11 complete: 2026-02-21T09:59:11.9732945Z ok=79 2026-02-21T09:59:11.9733122Z min=1.4601 2026-02-21T09:59:11.9733275Z mid=2.1442 2026-02-21T09:59:11.9733450Z max=7.9114 2026-02-21T09:59:11.9733595Z best={'block_sizes': [4, 128, 256], 2026-02-21T09:59:11.9738346Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:59:11.9738707Z 'load_eviction_policies': ['last', '', '', ''], 2026-02-21T09:59:11.9739173Z 'num_stages': 5, 2026-02-21T09:59:11.9739323Z 'num_warps': 1, 2026-02-21T09:59:11.9739459Z 'pid_type': 'flat', 2026-02-21T09:59:11.9739624Z 'range_flattens': [None, True, True], 2026-02-21T09:59:11.9739822Z 'range_multi_buffers': [None, False, True], 2026-02-21T09:59:11.9740015Z 'range_num_stages': [0, 3, 1], 2026-02-21T09:59:11.9740180Z 'range_unroll_factors': [0, 4, 0], 2026-02-21T09:59:11.9740380Z 'range_warp_specializes': [None, False, None]} 2026-02-21T09:59:11.9800600Z [554s] Fitting surrogate: 1088 points, 1088 targets 2026-02-21T09:59:13.1605123Z [555s] Generation 12 starting: 75 neighbors, 4 active search path(s) 2026-02-21T09:59:26.6960256Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 12.1 configs/s 2026-02-21T09:59:33.7252902Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 79/79 11.3 configs/s 2026-02-21T09:59:47.6929252Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━ 144/144 10.0 configs/s 2026-02-21T09:59:48.2431057Z [590s] Generation 12 complete: 2026-02-21T09:59:48.2431383Z ok=80 2026-02-21T09:59:48.2431537Z min=1.4581 2026-02-21T09:59:48.2431689Z mid=2.1693 2026-02-21T09:59:48.2432248Z max=16.2243 2026-02-21T09:59:48.2432458Z best={'block_sizes': [4, 128, 256], 2026-02-21T09:59:48.2432746Z 
'indexing': ['tensor_descriptor', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T09:59:48.2433041Z 'load_eviction_policies': ['last', 'last', '', ''], 2026-02-21T09:59:48.2433245Z 'num_stages': 5, 2026-02-21T09:59:48.2433382Z 'num_warps': 1, 2026-02-21T09:59:48.2433522Z 'pid_type': 'flat', 2026-02-21T09:59:48.2433675Z 'range_flattens': [None, True, True], 2026-02-21T09:59:48.2433870Z 'range_multi_buffers': [None, False, True], 2026-02-21T09:59:48.2434052Z 'range_num_stages': [0, 3, 1], 2026-02-21T09:59:48.2434220Z 'range_unroll_factors': [0, 4, 0], 2026-02-21T09:59:48.2434410Z 'range_warp_specializes': [None, False, None]} 2026-02-21T09:59:48.2508409Z [590s] Fitting surrogate: 1168 points, 1168 targets 2026-02-21T09:59:49.2490493Z [591s] Generation 13 starting: 59 neighbors, 3 active search path(s) 2026-02-21T10:00:04.6133958Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60/60 1.4 configs/s 2026-02-21T10:00:06.9472333Z [609s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 2048, 256], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', '', 'last', ''], num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, False, True], range_multi_buffers=[None, True, False], range_num_stages=[0, 0, 1], range_unroll_factors=[0, 1, 1], range_warp_specializes=[None, None, None]) 2026-02-21T10:00:06.9473476Z Tensor-likes are not close! 2026-02-21T10:00:06.9473617Z 2026-02-21T10:00:06.9473697Z Mismatched elements: 5353 / 1610612736 (0.0%) 2026-02-21T10:00:06.9473978Z Greatest absolute difference: 0.0625 at index (380, 5967) (up to 0.01 allowed) 2026-02-21T10:00:06.9474331Z Greatest relative difference: 241.0 at index (179901, 4672) (up to 0.01 allowed) 2026-02-21T10:00:06.9474662Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T10:00:06.9474822Z 2026-02-21T10:00:06.9949547Z [609s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 2048, 256], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', '', 'last', ''], num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, False, False], range_multi_buffers=[None, True, False], range_num_stages=[0, 0, 1], range_unroll_factors=[0, 1, 1], range_warp_specializes=[None, None, None]) 2026-02-21T10:00:06.9950567Z Tensor-likes are not close! 2026-02-21T10:00:06.9950680Z 2026-02-21T10:00:06.9950760Z Mismatched elements: 5353 / 1610612736 (0.0%) 2026-02-21T10:00:06.9951034Z Greatest absolute difference: 0.0625 at index (380, 5967) (up to 0.01 allowed) 2026-02-21T10:00:06.9951363Z Greatest relative difference: 241.0 at index (179901, 4672) (up to 0.01 allowed) 2026-02-21T10:00:06.9951687Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T10:00:06.9952164Z 2026-02-21T10:00:09.9130129Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 60/60 11.6 configs/s 2026-02-21T10:00:18.9492723Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━ 148/148 15.7 configs/s 2026-02-21T10:00:19.4578223Z [621s] Generation 13 complete: 2026-02-21T10:00:19.4583348Z error=2 2026-02-21T10:00:19.4584723Z ok=61 2026-02-21T10:00:19.4584890Z min=1.4704 2026-02-21T10:00:19.4585018Z mid=2.1402 2026-02-21T10:00:19.4585143Z max=7.5202 2026-02-21T10:00:19.4585279Z best={'block_sizes': [4, 512, 256], 2026-02-21T10:00:19.4585554Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T10:00:19.4585826Z 'load_eviction_policies': ['', '', 'last', ''], 2026-02-21T10:00:19.4586020Z 'num_stages': 2, 2026-02-21T10:00:19.4586164Z 'num_warps': 1, 2026-02-21T10:00:19.4586299Z 'pid_type': 'flat', 2026-02-21T10:00:19.4586794Z 'range_flattens': [None, False, True], 2026-02-21T10:00:19.4587000Z 'range_multi_buffers': [None, True, False], 
2026-02-21T10:00:19.4587210Z 'range_num_stages': [0, 0, 1], 2026-02-21T10:00:19.4587381Z 'range_unroll_factors': [0, 1, 1], 2026-02-21T10:00:19.4587578Z 'range_warp_specializes': [None, None, None]} 2026-02-21T10:00:19.4661071Z [621s] Fitting surrogate: 1231 points, 1231 targets 2026-02-21T10:00:20.2866638Z [622s] Generation 14 starting: 43 neighbors, 2 active search path(s) 2026-02-21T10:00:29.6907527Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44/44 5.0 configs/s 2026-02-21T10:00:33.6894231Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 44/44 11.0 configs/s 2026-02-21T10:00:39.1051648Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━━ 148/148 25.4 configs/s 2026-02-21T10:00:39.6065739Z [642s] Generation 14 complete: 2026-02-21T10:00:39.6070299Z ok=46 2026-02-21T10:00:39.6072470Z min=1.5416 2026-02-21T10:00:39.6072636Z mid=2.3665 2026-02-21T10:00:39.6072755Z max=8.5832 2026-02-21T10:00:39.6072921Z best={'block_sizes': [4, 512, 256], 2026-02-21T10:00:39.6073204Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T10:00:39.6073488Z 'load_eviction_policies': ['', '', 'last', ''], 2026-02-21T10:00:39.6073670Z 'num_stages': 2, 2026-02-21T10:00:39.6073814Z 'num_warps': 1, 2026-02-21T10:00:39.6073956Z 'pid_type': 'flat', 2026-02-21T10:00:39.6074112Z 'range_flattens': [None, False, True], 2026-02-21T10:00:39.6074312Z 'range_multi_buffers': [None, True, False], 2026-02-21T10:00:39.6074492Z 'range_num_stages': [0, 0, 1], 2026-02-21T10:00:39.6074666Z 'range_unroll_factors': [0, 1, 1], 2026-02-21T10:00:39.6074852Z 'range_warp_specializes': [None, None, None]} 2026-02-21T10:00:39.6124615Z [642s] Fitting surrogate: 1277 points, 1277 targets 2026-02-21T10:00:40.4470185Z [642s] Generation 15 starting: 44 neighbors, 2 active search path(s) 2026-02-21T10:01:12.8798374Z [675s] Timeout after 30s compiling Config(block_sizes=[64, 64, 512], indexing=['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 
load_eviction_policies=['', 'last', 'first', 'first'], maxnreg=64, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, False, None], range_multi_buffers=[None, True, True], range_num_stages=[0, 0, 2], range_unroll_factors=[1, 4, 1], range_warp_specializes=[False, None, None]) 2026-02-21T10:01:12.8814527Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45/45 0.4 configs/s 2026-02-21T10:01:16.7535021Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 45/45 11.7 configs/s 2026-02-21T10:01:23.9360813Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━ 148/148 19.5 configs/s 2026-02-21T10:01:24.4461313Z [686s] Generation 15 complete: 2026-02-21T10:01:24.4465637Z timeout=1 2026-02-21T10:01:24.4466908Z ok=46 2026-02-21T10:01:24.4467073Z min=1.4776 2026-02-21T10:01:24.4467200Z mid=2.0761 2026-02-21T10:01:24.4467330Z max=5.5711 2026-02-21T10:01:24.4467496Z best={'block_sizes': [4, 512, 256], 2026-02-21T10:01:24.4467765Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T10:01:24.4468419Z 'load_eviction_policies': ['', '', 'last', ''], 2026-02-21T10:01:24.4468610Z 'num_stages': 2, 2026-02-21T10:01:24.4468760Z 'num_warps': 1, 2026-02-21T10:01:24.4468900Z 'pid_type': 'flat', 2026-02-21T10:01:24.4469071Z 'range_flattens': [None, False, True], 2026-02-21T10:01:24.4469268Z 'range_multi_buffers': [None, True, False], 2026-02-21T10:01:24.4469464Z 'range_num_stages': [0, 0, 1], 2026-02-21T10:01:24.4469637Z 'range_unroll_factors': [0, 1, 1], 2026-02-21T10:01:24.4469826Z 'range_warp_specializes': [None, None, None]} 2026-02-21T10:01:24.4527180Z [686s] Fitting surrogate: 1324 points, 1324 targets 2026-02-21T10:01:25.2816559Z [687s] Generation 16 starting: 43 neighbors, 2 active search path(s) 2026-02-21T10:01:33.7365825Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44/44 5.2 configs/s 2026-02-21T10:01:37.7513783Z Generation 16: exploring neighbors 
100% ━━━━━━━━━━━━━━━━━━━ 44/44 11.0 configs/s 2026-02-21T10:01:44.5735092Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━ 148/148 20.4 configs/s 2026-02-21T10:01:45.1116053Z [707s] Generation 16 complete: 2026-02-21T10:01:45.1120549Z ok=46 2026-02-21T10:01:45.1122678Z min=1.4797 2026-02-21T10:01:45.1127408Z mid=2.1878 2026-02-21T10:01:45.1132433Z max=6.5762 2026-02-21T10:01:45.1134250Z best={'block_sizes': [4, 512, 256], 2026-02-21T10:01:45.1134551Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T10:01:45.1134885Z 'load_eviction_policies': ['', '', 'last', ''], 2026-02-21T10:01:45.1135075Z 'num_stages': 2, 2026-02-21T10:01:45.1135224Z 'num_warps': 1, 2026-02-21T10:01:45.1135361Z 'pid_type': 'flat', 2026-02-21T10:01:45.1139106Z 'range_flattens': [None, False, True], 2026-02-21T10:01:45.1140569Z 'range_multi_buffers': [None, True, False], 2026-02-21T10:01:45.1140847Z 'range_num_stages': [0, 0, 1], 2026-02-21T10:01:45.1145884Z 'range_unroll_factors': [0, 1, 1], 2026-02-21T10:01:45.1150300Z 'range_warp_specializes': [None, None, None]} 2026-02-21T10:01:45.1181530Z [707s] Fitting surrogate: 1370 points, 1370 targets 2026-02-21T10:01:45.8875731Z [708s] Generation 17 starting: 43 neighbors, 2 active search path(s) 2026-02-21T10:01:53.8451470Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45/45 9.3 configs/s 2026-02-21T10:01:57.7380571Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 45/45 11.6 configs/s 2026-02-21T10:02:05.5258712Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━━ 148/148 18.1 configs/s 2026-02-21T10:02:06.0462411Z [728s] Generation 17 complete: 2026-02-21T10:02:06.0464872Z ok=46 2026-02-21T10:02:06.0465049Z min=1.5263 2026-02-21T10:02:06.0465183Z mid=2.0624 2026-02-21T10:02:06.0465316Z max=5.1211 2026-02-21T10:02:06.0465455Z best={'block_sizes': [4, 512, 256], 2026-02-21T10:02:06.0465741Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 
2026-02-21T10:02:06.0466045Z 'load_eviction_policies': ['', '', 'last', ''], 2026-02-21T10:02:06.0466580Z 'num_stages': 2, 2026-02-21T10:02:06.0466719Z 'num_warps': 1, 2026-02-21T10:02:06.0466865Z 'pid_type': 'flat', 2026-02-21T10:02:06.0467021Z 'range_flattens': [None, False, True], 2026-02-21T10:02:06.0467222Z 'range_multi_buffers': [None, True, False], 2026-02-21T10:02:06.0467414Z 'range_num_stages': [0, 0, 1], 2026-02-21T10:02:06.0467580Z 'range_unroll_factors': [0, 1, 1], 2026-02-21T10:02:06.0467776Z 'range_warp_specializes': [None, None, None]} 2026-02-21T10:02:06.0537336Z [728s] Fitting surrogate: 1416 points, 1416 targets 2026-02-21T10:02:06.9259139Z [729s] Generation 18 starting: 45 neighbors, 2 active search path(s) 2026-02-21T10:02:15.2169867Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46/46 8.5 configs/s 2026-02-21T10:02:19.5697980Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 46/46 10.6 configs/s 2026-02-21T10:02:25.6802693Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━ 148/148 22.6 configs/s 2026-02-21T10:02:26.1936708Z [748s] Generation 18 complete: 2026-02-21T10:02:26.1941669Z ok=48 2026-02-21T10:02:26.1946074Z min=1.4966 2026-02-21T10:02:26.1950361Z mid=2.3100 2026-02-21T10:02:26.1954790Z max=24.6415 2026-02-21T10:02:26.1959393Z best={'block_sizes': [4, 512, 256], 2026-02-21T10:02:26.1963329Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T10:02:26.1963682Z 'load_eviction_policies': ['', '', 'last', ''], 2026-02-21T10:02:26.1968191Z 'num_stages': 2, 2026-02-21T10:02:26.1972496Z 'num_warps': 1, 2026-02-21T10:02:26.1973913Z 'pid_type': 'flat', 2026-02-21T10:02:26.1974121Z 'range_flattens': [None, False, True], 2026-02-21T10:02:26.1974329Z 'range_multi_buffers': [None, True, False], 2026-02-21T10:02:26.1974533Z 'range_num_stages': [0, 0, 1], 2026-02-21T10:02:26.1974709Z 'range_unroll_factors': [0, 1, 1], 2026-02-21T10:02:26.1974903Z 'range_warp_specializes': [None, 
None, None]} 2026-02-21T10:02:26.2006443Z [748s] Fitting surrogate: 1464 points, 1464 targets 2026-02-21T10:02:27.0293853Z [749s] Generation 19 starting: 46 neighbors, 2 active search path(s) 2026-02-21T10:02:39.5151157Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47/47 1.3 configs/s 2026-02-21T10:02:43.7974971Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 47/47 11.0 configs/s 2026-02-21T10:02:50.7178567Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━━━ 148/148 20.1 configs/s 2026-02-21T10:02:51.2318690Z [773s] Generation 19 complete: 2026-02-21T10:02:51.2323671Z ok=49 2026-02-21T10:02:51.2325251Z min=1.5155 2026-02-21T10:02:51.2325413Z mid=2.1619 2026-02-21T10:02:51.2325532Z max=9.7132 2026-02-21T10:02:51.2325676Z best={'block_sizes': [4, 512, 256], 2026-02-21T10:02:51.2325950Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T10:02:51.2326238Z 'load_eviction_policies': ['', '', 'last', ''], 2026-02-21T10:02:51.2326434Z 'num_stages': 2, 2026-02-21T10:02:51.2326598Z 'num_warps': 1, 2026-02-21T10:02:51.2326744Z 'pid_type': 'flat', 2026-02-21T10:02:51.2326918Z 'range_flattens': [None, False, True], 2026-02-21T10:02:51.2327116Z 'range_multi_buffers': [None, True, False], 2026-02-21T10:02:51.2327299Z 'range_num_stages': [0, 0, 1], 2026-02-21T10:02:51.2327468Z 'range_unroll_factors': [0, 1, 1], 2026-02-21T10:02:51.2327655Z 'range_warp_specializes': [None, None, None]} 2026-02-21T10:02:51.2388729Z [773s] Fitting surrogate: 1513 points, 1513 targets 2026-02-21T10:02:52.1241043Z [774s] Generation 20 starting: 42 neighbors, 2 active search path(s) 2026-02-21T10:02:59.5769460Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43/43 8.6 configs/s 2026-02-21T10:03:03.3706095Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 43/43 11.4 configs/s 2026-02-21T10:03:11.4643408Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━━━ 148/148 17.4 configs/s 
2026-02-21T10:03:11.9899968Z [794s] Generation 20 complete: 2026-02-21T10:03:11.9903725Z ok=45 2026-02-21T10:03:11.9908144Z min=1.4949 2026-02-21T10:03:11.9912718Z mid=2.0665 2026-02-21T10:03:11.9916536Z max=6.1133 2026-02-21T10:03:11.9920854Z best={'block_sizes': [4, 512, 256], 2026-02-21T10:03:11.9925459Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T10:03:11.9925840Z 'load_eviction_policies': ['', '', 'last', ''], 2026-02-21T10:03:11.9926065Z 'num_stages': 2, 2026-02-21T10:03:11.9932410Z 'num_warps': 1, 2026-02-21T10:03:11.9936399Z 'pid_type': 'flat', 2026-02-21T10:03:11.9937912Z 'range_flattens': [None, False, True], 2026-02-21T10:03:11.9938214Z 'range_multi_buffers': [None, True, False], 2026-02-21T10:03:11.9938410Z 'range_num_stages': [0, 0, 1], 2026-02-21T10:03:11.9943167Z 'range_unroll_factors': [0, 1, 1], 2026-02-21T10:03:11.9945309Z 'range_warp_specializes': [None, None, None]} 2026-02-21T10:03:11.9978217Z [794s] Fitting surrogate: 1558 points, 1558 targets 2026-02-21T10:03:12.3110481Z [794s] Autotuning complete in 794.7s after searching 1516 configs. 
2026-02-21T10:03:12.3115871Z One can hardcode the best config and skip autotuning with: 2026-02-21T10:03:12.3117591Z @helion.kernel(config=helion.Config(block_sizes=[4, 512, 256], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', '', 'last', ''], num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, False, True], range_multi_buffers=[None, True, False], range_num_stages=[0, 0, 1], range_unroll_factors=[0, 1, 1], range_warp_specializes=[None, None, None]), static_shapes=True) 2026-02-21T10:03:12.3118569Z 2026-02-21T10:03:12.3118826Z [794s] Code of selected kernel: /tmp/torchinductor_root/ta/ctapcytwemh4ymsiemy6ax5xeknf4tbg56xy3vp5pedfwbuxtvo6.py 2026-02-21T10:03:13.5760659Z WARNING:tritonbench.utils.triton_op:Completed input ID 7: 2026-02-21T10:03:13.5762676Z x_val 2026-02-21T10:03:13.5762839Z ------- 2026-02-21T10:03:13.5762967Z 6144 2026-02-21T10:03:13.5763048Z 2026-02-21T10:03:13.5880705Z 83%|████████▎ | 5/6 [1:55:03<27:17, 1637.78s/it]WARNING:tritonbench.utils.triton_op:Running input ID 9: 2026-02-21T10:03:13.5882116Z x_val 2026-02-21T10:03:13.5882314Z ------- 2026-02-21T10:03:13.5884418Z 8192 2026-02-21T10:03:13.5894436Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for eager_layer_norm 2026-02-21T10:03:14.5518801Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_welford 2026-02-21T10:03:15.8728405Z INFO:tritonbench.utils.triton_op:Took 2.49ms to get benchmark function for torch_compile_welford 2026-02-21T10:36:00.5891127Z WARNING:__main__:Input tensor metadata: 2026-02-21T10:36:00.5895039Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T10:36:00.5898499Z 'dtype': 'torch.bfloat16', 2026-02-21T10:36:00.5902447Z 'shape': (8192,), 2026-02-21T10:36:00.5906165Z 'stride': (1,)}, 2026-02-21T10:36:00.5909657Z { 'device': 'cuda:0', 2026-02-21T10:36:00.5913222Z 'dtype': 'torch.bfloat16', 2026-02-21T10:36:00.5913577Z 'shape': (8192,), 
2026-02-21T10:36:00.5913832Z 'stride': (1,)}, 2026-02-21T10:36:00.5914069Z { 'device': 'cuda:0', 2026-02-21T10:36:00.5914312Z 'dtype': 'torch.bfloat16', 2026-02-21T10:36:00.5914530Z 'shape': (262144, 8192), 2026-02-21T10:36:00.5914760Z 'stride': (8192, 1)}), 2026-02-21T10:36:00.5914957Z 'kwargs': {}} 2026-02-21T10:36:00.5949371Z INFO:tritonbench.utils.triton_op:Took 6.43ms to get benchmark function for helion_welford 2026-02-21T10:36:00.8927596Z [0s] Autotune random seed: 2134763656 2026-02-21T10:36:01.0526078Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T10:36:35.5507477Z [34s] Timeout after 30s compiling Config(block_sizes=[512, 1, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first', 'last', ''], maxnreg=128, num_sm_multiplier=64, num_stages=8, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False, None], range_multi_buffers=[False, None, None], range_num_stages=[0, 4, 1], range_unroll_factors=[4, 4, 4], range_warp_specializes=[False, None, False]) 2026-02-21T10:36:35.5525568Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s 2026-02-21T10:38:09.2034198Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 0.9 configs/s 2026-02-21T10:38:09.2051275Z [128s] Adaptive compile timeout: 30s (90% percentile=16.4s, bounds=[30.0s, 30s]) 2026-02-21T10:38:09.4479710Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 85.8 configs/s 2026-02-21T10:38:10.2546737Z [129s] Initial random population of 100, 5 starting points: 2026-02-21T10:38:10.2550681Z error=1 2026-02-21T10:38:10.2554880Z timeout=1 2026-02-21T10:38:10.2559165Z ok=98 2026-02-21T10:38:10.2563596Z min=2.4779 2026-02-21T10:38:10.2565192Z mid=51.0464 2026-02-21T10:38:10.2565477Z max=1596.3556 2026-02-21T10:38:10.2570666Z best={'block_sizes': [128, 
32, 64], 2026-02-21T10:38:10.2572725Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:38:10.2576614Z 'load_eviction_policies': ['', 'first', 'first', 'first'], 2026-02-21T10:38:10.2580533Z 'maxnreg': 256, 2026-02-21T10:38:10.2584107Z 'num_sm_multiplier': 128, 2026-02-21T10:38:10.2585694Z 'num_stages': 1, 2026-02-21T10:38:10.2585919Z 'num_warps': 16, 2026-02-21T10:38:10.2586156Z 'pid_type': 'persistent_blocked', 2026-02-21T10:38:10.2586394Z 'range_flattens': [None, None, True], 2026-02-21T10:38:10.2586665Z 'range_multi_buffers': [False, True, True], 2026-02-21T10:38:10.2586897Z 'range_num_stages': [1, 0, 2], 2026-02-21T10:38:10.2587138Z 'range_unroll_factors': [1, 4, 1], 2026-02-21T10:38:10.2587408Z 'range_warp_specializes': [False, False, True]} 2026-02-21T10:38:10.2587724Z [129s] Fitting surrogate: 100 points, 100 targets 2026-02-21T10:38:11.6340907Z [130s] Generation 1 starting: 98 neighbors, 5 active search path(s) 2026-02-21T10:38:32.5303040Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 102/102 6.9 configs/s 2026-02-21T10:38:43.2229912Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 102/102 9.5 configs/s 2026-02-21T10:38:55.0664287Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━ 86/86 6.9 configs/s 2026-02-21T10:38:55.8843080Z [174s] Generation 1 complete: 2026-02-21T10:38:55.8847083Z error=4 2026-02-21T10:38:55.8851106Z ok=99 2026-02-21T10:38:55.8855969Z min=2.3244 2026-02-21T10:38:55.8857433Z mid=3.6552 2026-02-21T10:38:55.8857666Z max=20.4186 2026-02-21T10:38:55.8857855Z best={'block_sizes': [128, 64, 64], 2026-02-21T10:38:55.8858186Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:38:55.8858512Z 'load_eviction_policies': ['', 'last', 'first', 'first'], 2026-02-21T10:38:55.8858789Z 'maxnreg': 256, 2026-02-21T10:38:55.8858975Z 'num_sm_multiplier': 128, 2026-02-21T10:38:55.8859197Z 'num_stages': 1, 2026-02-21T10:38:55.8859447Z 'num_warps': 16, 
2026-02-21T10:38:55.8859638Z 'pid_type': 'persistent_blocked', 2026-02-21T10:38:55.8859903Z 'range_flattens': [True, None, True], 2026-02-21T10:38:55.8860143Z 'range_multi_buffers': [False, True, True], 2026-02-21T10:38:55.8860395Z 'range_num_stages': [1, 0, 2], 2026-02-21T10:38:55.8860598Z 'range_unroll_factors': [1, 4, 1], 2026-02-21T10:38:55.8860859Z 'range_warp_specializes': [False, False, True]} 2026-02-21T10:38:55.8870900Z [174s] Fitting surrogate: 203 points, 203 targets 2026-02-21T10:38:57.1847152Z [176s] Generation 2 starting: 93 neighbors, 5 active search path(s) 2026-02-21T10:39:19.6734627Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98/98 9.4 configs/s 2026-02-21T10:39:29.0988090Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 98/98 10.4 configs/s 2026-02-21T10:39:42.2911095Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━ 89/89 6.5 configs/s 2026-02-21T10:39:43.0502704Z [221s] Generation 2 complete: 2026-02-21T10:39:43.0507706Z error=7 2026-02-21T10:39:43.0512133Z ok=92 2026-02-21T10:39:43.0514004Z min=2.2600 2026-02-21T10:39:43.0514232Z mid=3.3197 2026-02-21T10:39:43.0514399Z max=16.3369 2026-02-21T10:39:43.0514613Z best={'block_sizes': [16, 16, 256], 2026-02-21T10:39:43.0514915Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:39:43.0515247Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T10:39:43.0515474Z 'num_stages': 2, 2026-02-21T10:39:43.0515683Z 'num_warps': 1, 2026-02-21T10:39:43.0515887Z 'pid_type': 'flat', 2026-02-21T10:39:43.0516082Z 'range_flattens': [None, True, None], 2026-02-21T10:39:43.0516342Z 'range_multi_buffers': [None, None, None], 2026-02-21T10:39:43.0516570Z 'range_num_stages': [0, 0, 1], 2026-02-21T10:39:43.0516805Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T10:39:43.0517039Z 'range_warp_specializes': [None, None, None]} 2026-02-21T10:39:43.0537834Z [222s] Fitting surrogate: 302 points, 302 targets 2026-02-21T10:39:44.4492791Z [223s] 
Generation 3 starting: 99 neighbors, 5 active search path(s) 2026-02-21T10:40:09.1463080Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 102/102 1.0 configs/s 2026-02-21T10:40:19.5779753Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 102/102 9.8 configs/s 2026-02-21T10:40:37.2094287Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━ 91/91 5.0 configs/s 2026-02-21T10:40:37.9812049Z [276s] Generation 3 complete: 2026-02-21T10:40:37.9816280Z ok=105 2026-02-21T10:40:37.9820812Z min=2.2272 2026-02-21T10:40:37.9823170Z mid=3.1712 2026-02-21T10:40:37.9828598Z max=34.9297 2026-02-21T10:40:37.9833701Z best={'block_sizes': [16, 16, 256], 2026-02-21T10:40:37.9838199Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:40:37.9839676Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T10:40:37.9839980Z 'num_stages': 2, 2026-02-21T10:40:37.9840176Z 'num_warps': 1, 2026-02-21T10:40:37.9840394Z 'pid_type': 'flat', 2026-02-21T10:40:37.9840635Z 'range_flattens': [None, True, None], 2026-02-21T10:40:37.9840917Z 'range_multi_buffers': [None, None, None], 2026-02-21T10:40:37.9841146Z 'range_num_stages': [0, 0, 1], 2026-02-21T10:40:37.9841380Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T10:40:37.9841620Z 'range_warp_specializes': [None, None, None]} 2026-02-21T10:40:37.9853032Z [276s] Fitting surrogate: 407 points, 407 targets 2026-02-21T10:40:39.3142825Z [278s] Generation 4 starting: 95 neighbors, 5 active search path(s) 2026-02-21T10:41:22.0863039Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 0.3 configs/s 2026-02-21T10:41:32.3581011Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 97/97 9.4 configs/s 2026-02-21T10:41:55.6871917Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━ 96/96 5.4 configs/s 2026-02-21T10:41:56.4216168Z [355s] Generation 4 complete: 2026-02-21T10:41:56.4220494Z ok=100 2026-02-21T10:41:56.4224450Z min=2.1391 
2026-02-21T10:41:56.4229625Z mid=2.7637 2026-02-21T10:41:56.4233453Z max=44.7172 2026-02-21T10:41:56.4237919Z best={'block_sizes': [32, 32, 256], 2026-02-21T10:41:56.4242494Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:41:56.4247281Z 'load_eviction_policies': ['last', 'last', 'first', 'last'], 2026-02-21T10:41:56.4248208Z 'num_stages': 4, 2026-02-21T10:41:56.4248437Z 'num_warps': 4, 2026-02-21T10:41:56.4248631Z 'pid_type': 'flat', 2026-02-21T10:41:56.4248871Z 'range_flattens': [None, None, True], 2026-02-21T10:41:56.4249124Z 'range_multi_buffers': [None, True, None], 2026-02-21T10:41:56.4249384Z 'range_num_stages': [0, 0, 1], 2026-02-21T10:41:56.4249617Z 'range_unroll_factors': [0, 1, 0], 2026-02-21T10:41:56.4249879Z 'range_warp_specializes': [None, None, True]} 2026-02-21T10:41:56.4290466Z [355s] Fitting surrogate: 507 points, 507 targets 2026-02-21T10:41:57.7887502Z [356s] Generation 5 starting: 87 neighbors, 5 active search path(s) 2026-02-21T10:42:27.9969513Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88/88 0.5 configs/s 2026-02-21T10:42:37.3316363Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 88/88 9.4 configs/s 2026-02-21T10:42:52.9664238Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━ 98/98 6.1 configs/s 2026-02-21T10:42:53.6949354Z [412s] Generation 5 complete: 2026-02-21T10:42:53.6953633Z ok=92 2026-02-21T10:42:53.6957956Z min=2.1023 2026-02-21T10:42:53.6959455Z mid=2.9214 2026-02-21T10:42:53.6959663Z max=55.8479 2026-02-21T10:42:53.6959891Z best={'block_sizes': [32, 32, 256], 2026-02-21T10:42:53.6960216Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:42:53.6960587Z 'load_eviction_policies': ['last', 'last', 'first', 'last'], 2026-02-21T10:42:53.6960849Z 'num_stages': 4, 2026-02-21T10:42:53.6961058Z 'num_warps': 4, 2026-02-21T10:42:53.6961242Z 'pid_type': 'flat', 2026-02-21T10:42:53.6961479Z 'range_flattens': [None, 
None, True], 2026-02-21T10:42:53.6961715Z 'range_multi_buffers': [None, True, None], 2026-02-21T10:42:53.6962141Z 'range_num_stages': [0, 0, 2], 2026-02-21T10:42:53.6962400Z 'range_unroll_factors': [0, 1, 0], 2026-02-21T10:42:53.6962638Z 'range_warp_specializes': [None, None, True]} 2026-02-21T10:42:53.7014058Z [412s] Fitting surrogate: 599 points, 599 targets 2026-02-21T10:42:54.8789219Z [413s] Generation 6 starting: 76 neighbors, 4 active search path(s) 2026-02-21T10:43:41.2759894Z [460s] Timeout after 30s compiling Config(block_sizes=[128, 16, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['', 'last', '', ''], num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, True, False], range_multi_buffers=[None, None, False], range_num_stages=[0, 4, 4], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, None, False]) 2026-02-21T10:43:41.2777747Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 0.4 configs/s 2026-02-21T10:43:48.7415225Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 10.3 configs/s 2026-02-21T10:44:03.1437715Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━ 98/98 6.6 configs/s 2026-02-21T10:44:03.8621233Z [482s] Generation 6 complete: 2026-02-21T10:44:03.8625303Z timeout=1 2026-02-21T10:44:03.8625553Z ok=79 2026-02-21T10:44:03.8629798Z min=2.1484 2026-02-21T10:44:03.8633101Z mid=2.7780 2026-02-21T10:44:03.8637549Z max=18.6532 2026-02-21T10:44:03.8638911Z best={'block_sizes': [32, 32, 256], 2026-02-21T10:44:03.8639292Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:44:03.8639642Z 'load_eviction_policies': ['last', 'last', 'first', 'last'], 2026-02-21T10:44:03.8639929Z 'num_stages': 4, 2026-02-21T10:44:03.8640149Z 'num_warps': 2, 2026-02-21T10:44:03.8640332Z 'pid_type': 'flat', 2026-02-21T10:44:03.8640564Z 'range_flattens': [None, False, True], 2026-02-21T10:44:03.8640798Z 
'range_multi_buffers': [None, True, None], 2026-02-21T10:44:03.8641055Z 'range_num_stages': [0, 0, 2], 2026-02-21T10:44:03.8641289Z 'range_unroll_factors': [0, 1, 0], 2026-02-21T10:44:03.8641552Z 'range_warp_specializes': [None, None, True]} 2026-02-21T10:44:03.8683719Z [482s] Fitting surrogate: 679 points, 679 targets 2026-02-21T10:44:05.1472540Z [484s] Generation 7 starting: 80 neighbors, 4 active search path(s) 2026-02-21T10:44:26.2304182Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81/81 1.8 configs/s 2026-02-21T10:44:34.1719073Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 81/81 10.2 configs/s 2026-02-21T10:44:48.6262316Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━ 98/98 6.6 configs/s 2026-02-21T10:44:49.3382377Z [528s] Generation 7 complete: 2026-02-21T10:44:49.3386466Z ok=84 2026-02-21T10:44:49.3387964Z min=2.0972 2026-02-21T10:44:49.3388166Z mid=2.8037 2026-02-21T10:44:49.3388370Z max=14.6504 2026-02-21T10:44:49.3388586Z best={'block_sizes': [64, 64, 128], 2026-02-21T10:44:49.3388911Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:44:49.3389659Z 'load_eviction_policies': ['', 'last', 'first', 'first'], 2026-02-21T10:44:49.3389941Z 'maxnreg': 256, 2026-02-21T10:44:49.3390172Z 'num_sm_multiplier': 128, 2026-02-21T10:44:49.3390371Z 'num_stages': 1, 2026-02-21T10:44:49.3390587Z 'num_warps': 8, 2026-02-21T10:44:49.3390776Z 'pid_type': 'persistent_blocked', 2026-02-21T10:44:49.3391027Z 'range_flattens': [None, None, None], 2026-02-21T10:44:49.3391288Z 'range_multi_buffers': [False, True, True], 2026-02-21T10:44:49.3391510Z 'range_num_stages': [1, 0, 3], 2026-02-21T10:44:49.3391747Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T10:44:49.3392057Z 'range_warp_specializes': [False, None, True]} 2026-02-21T10:44:49.3448245Z [528s] Fitting surrogate: 763 points, 763 targets 2026-02-21T10:44:50.3418364Z [529s] Generation 8 starting: 59 neighbors, 3 active search path(s) 
2026-02-21T10:45:25.7741641Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60/60 0.3 configs/s 2026-02-21T10:45:31.9310937Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 60/60 9.7 configs/s 2026-02-21T10:45:43.6288992Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━ 98/98 8.0 configs/s 2026-02-21T10:45:44.3282741Z [583s] Generation 8 complete: 2026-02-21T10:45:44.3286874Z ok=62 2026-02-21T10:45:44.3288557Z min=2.0869 2026-02-21T10:45:44.3288831Z mid=2.6255 2026-02-21T10:45:44.3293409Z max=23.8208 2026-02-21T10:45:44.3295453Z best={'block_sizes': [64, 64, 128], 2026-02-21T10:45:44.3295810Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:45:44.3296192Z 'load_eviction_policies': ['', 'last', 'first', 'first'], 2026-02-21T10:45:44.3296440Z 'maxnreg': 256, 2026-02-21T10:45:44.3296661Z 'num_sm_multiplier': 128, 2026-02-21T10:45:44.3296849Z 'num_stages': 1, 2026-02-21T10:45:44.3297053Z 'num_warps': 8, 2026-02-21T10:45:44.3297266Z 'pid_type': 'persistent_blocked', 2026-02-21T10:45:44.3297500Z 'range_flattens': [None, None, None], 2026-02-21T10:45:44.3297763Z 'range_multi_buffers': [True, True, True], 2026-02-21T10:45:44.3298030Z 'range_num_stages': [1, 0, 3], 2026-02-21T10:45:44.3298282Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T10:45:44.3298510Z 'range_warp_specializes': [False, None, True]} 2026-02-21T10:45:44.3345264Z [583s] Fitting surrogate: 825 points, 825 targets 2026-02-21T10:45:45.3741697Z [584s] Generation 9 starting: 63 neighbors, 3 active search path(s) 2026-02-21T10:46:27.5618712Z [626s] Timeout after 30s compiling Config(block_sizes=[64, 32, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['', 'last', 'first', ''], num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, True, None], range_multi_buffers=[None, False, False], range_num_stages=[0, 4, 3], range_unroll_factors=[0, 0, 1], 
range_warp_specializes=[None, None, False]) 2026-02-21T10:46:27.5633866Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64/64 0.4 configs/s 2026-02-21T10:46:33.6079543Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 64/64 10.6 configs/s 2026-02-21T10:46:44.8124672Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━ 98/98 8.4 configs/s 2026-02-21T10:46:45.5168067Z [644s] Generation 9 complete: 2026-02-21T10:46:45.5172405Z error=1 2026-02-21T10:46:45.5176856Z timeout=1 2026-02-21T10:46:45.5181299Z ok=64 2026-02-21T10:46:45.5185774Z min=2.0541 2026-02-21T10:46:45.5190199Z mid=2.6051 2026-02-21T10:46:45.5193948Z max=12.5809 2026-02-21T10:46:45.5198623Z best={'block_sizes': [64, 32, 128], 2026-02-21T10:46:45.5199063Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:46:45.5199468Z 'load_eviction_policies': ['', 'last', 'first', 'first'], 2026-02-21T10:46:45.5204137Z 'num_stages': 1, 2026-02-21T10:46:45.5205526Z 'num_warps': 8, 2026-02-21T10:46:45.5205755Z 'pid_type': 'flat', 2026-02-21T10:46:45.5206001Z 'range_flattens': [None, None, None], 2026-02-21T10:46:45.5206240Z 'range_multi_buffers': [None, True, True], 2026-02-21T10:46:45.5206496Z 'range_num_stages': [0, 0, 3], 2026-02-21T10:46:45.5207047Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T10:46:45.5207330Z 'range_warp_specializes': [None, None, True]} 2026-02-21T10:46:45.5230814Z [644s] Fitting surrogate: 891 points, 891 targets 2026-02-21T10:46:46.4412508Z [645s] Generation 10 starting: 52 neighbors, 3 active search path(s) 2026-02-21T10:47:15.4721043Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 0.5 configs/s 2026-02-21T10:47:20.5218786Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 53/53 10.5 configs/s 2026-02-21T10:47:31.4524047Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━━━━ 99/99 8.7 configs/s 2026-02-21T10:47:32.1479671Z [691s] Generation 10 complete: 2026-02-21T10:47:32.1483728Z 
ok=55 2026-02-21T10:47:32.1487611Z min=2.0338 2026-02-21T10:47:32.1489097Z mid=2.5332 2026-02-21T10:47:32.1489334Z max=7.0891 2026-02-21T10:47:32.1489523Z best={'block_sizes': [32, 64, 128], 2026-02-21T10:47:32.1489909Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:47:32.1490279Z 'load_eviction_policies': ['', 'last', 'first', 'first'], 2026-02-21T10:47:32.1490573Z 'num_stages': 1, 2026-02-21T10:47:32.1490756Z 'num_warps': 8, 2026-02-21T10:47:32.1490961Z 'pid_type': 'flat', 2026-02-21T10:47:32.1491158Z 'range_flattens': [None, None, None], 2026-02-21T10:47:32.1491431Z 'range_multi_buffers': [None, True, True], 2026-02-21T10:47:32.1493733Z 'range_num_stages': [0, 0, 3], 2026-02-21T10:47:32.1494053Z 'range_unroll_factors': [0, 4, 0], 2026-02-21T10:47:32.1499473Z 'range_warp_specializes': [None, None, True]} 2026-02-21T10:47:32.1553455Z [691s] Fitting surrogate: 946 points, 946 targets 2026-02-21T10:47:33.1562415Z [692s] Generation 11 starting: 60 neighbors, 3 active search path(s) 2026-02-21T10:48:03.1203376Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60/60 0.3 configs/s 2026-02-21T10:48:08.9264942Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 60/60 10.3 configs/s 2026-02-21T10:48:19.5132865Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━━━ 100/100 9.1 configs/s 2026-02-21T10:48:20.1884514Z [739s] Generation 11 complete: 2026-02-21T10:48:20.1886104Z ok=63 2026-02-21T10:48:20.1886358Z min=2.0358 2026-02-21T10:48:20.1890658Z mid=2.7104 2026-02-21T10:48:20.1895066Z max=8.5259 2026-02-21T10:48:20.1896958Z best={'block_sizes': [16, 128, 128], 2026-02-21T10:48:20.1900046Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:48:20.1901307Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T10:48:20.1901574Z 'num_stages': 2, 2026-02-21T10:48:20.1901758Z 'num_warps': 2, 2026-02-21T10:48:20.1902098Z 'pid_type': 'flat', 
2026-02-21T10:48:20.1902303Z 'range_flattens': [None, True, None], 2026-02-21T10:48:20.1902576Z 'range_multi_buffers': [None, None, False], 2026-02-21T10:48:20.1902834Z 'range_num_stages': [0, 4, 3], 2026-02-21T10:48:20.1903045Z 'range_unroll_factors': [0, 1, 0], 2026-02-21T10:48:20.1903307Z 'range_warp_specializes': [None, None, False]} 2026-02-21T10:48:20.1954634Z [739s] Fitting surrogate: 1009 points, 1009 targets 2026-02-21T10:48:20.8991628Z [739s] Generation 12 starting: 35 neighbors, 2 active search path(s) 2026-02-21T10:48:28.7499890Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 7.4 configs/s 2026-02-21T10:48:31.9770197Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 35/35 10.9 configs/s 2026-02-21T10:48:40.0416922Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━ 101/101 11.8 configs/s 2026-02-21T10:48:40.7048439Z [759s] Generation 12 complete: 2026-02-21T10:48:40.7052841Z ok=37 2026-02-21T10:48:40.7054366Z min=2.0768 2026-02-21T10:48:40.7054614Z mid=2.3452 2026-02-21T10:48:40.7054810Z max=3.3842 2026-02-21T10:48:40.7057766Z best={'block_sizes': [16, 128, 128], 2026-02-21T10:48:40.7058338Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:48:40.7058836Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T10:48:40.7059984Z 'num_stages': 2, 2026-02-21T10:48:40.7060262Z 'num_warps': 2, 2026-02-21T10:48:40.7060564Z 'pid_type': 'flat', 2026-02-21T10:48:40.7060837Z 'range_flattens': [None, True, False], 2026-02-21T10:48:40.7061214Z 'range_multi_buffers': [None, None, False], 2026-02-21T10:48:40.7061546Z 'range_num_stages': [0, 4, 3], 2026-02-21T10:48:40.7062068Z 'range_unroll_factors': [0, 1, 0], 2026-02-21T10:48:40.7062442Z 'range_warp_specializes': [None, None, False]} 2026-02-21T10:48:40.7107065Z [759s] Fitting surrogate: 1046 points, 1046 targets 2026-02-21T10:48:41.4071090Z [760s] Generation 13 starting: 36 neighbors, 2 active search path(s) 
2026-02-21T10:48:49.8066293Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 8.5 configs/s 2026-02-21T10:48:53.3470216Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 37/37 10.5 configs/s 2026-02-21T10:49:01.3651057Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━ 101/101 11.9 configs/s 2026-02-21T10:49:02.0444308Z [780s] Generation 13 complete: 2026-02-21T10:49:02.0444630Z ok=38 2026-02-21T10:49:02.0444854Z min=2.0173 2026-02-21T10:49:02.0445026Z mid=2.4534 2026-02-21T10:49:02.0445220Z max=11.0685 2026-02-21T10:49:02.0445431Z best={'block_sizes': [16, 128, 256], 2026-02-21T10:49:02.0445865Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:49:02.0450159Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T10:49:02.0452287Z 'num_stages': 2, 2026-02-21T10:49:02.0452575Z 'num_warps': 2, 2026-02-21T10:49:02.0455803Z 'pid_type': 'flat', 2026-02-21T10:49:02.0456044Z 'range_flattens': [None, True, False], 2026-02-21T10:49:02.0456286Z 'range_multi_buffers': [None, None, False], 2026-02-21T10:49:02.0456544Z 'range_num_stages': [0, 4, 3], 2026-02-21T10:49:02.0456754Z 'range_unroll_factors': [0, 1, 0], 2026-02-21T10:49:02.0457016Z 'range_warp_specializes': [None, None, False]} 2026-02-21T10:49:02.0506330Z [780s] Fitting surrogate: 1084 points, 1084 targets 2026-02-21T10:49:02.7638812Z [781s] Generation 14 starting: 35 neighbors, 2 active search path(s) 2026-02-21T10:49:13.6269279Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 1.8 configs/s 2026-02-21T10:49:16.9928276Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 36/36 10.7 configs/s 2026-02-21T10:49:25.5140714Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━━ 101/101 11.2 configs/s 2026-02-21T10:49:26.1919338Z [805s] Generation 14 complete: 2026-02-21T10:49:26.1923790Z ok=37 2026-02-21T10:49:26.1925355Z min=1.9723 2026-02-21T10:49:26.1925643Z mid=2.3489 2026-02-21T10:49:26.1931123Z 
max=3.3392 2026-02-21T10:49:26.1935539Z best={'block_sizes': [4, 128, 256], 2026-02-21T10:49:26.1940163Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T10:49:26.1945231Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T10:49:26.1949558Z 'num_stages': 2, 2026-02-21T10:49:26.1953955Z 'num_warps': 1, 2026-02-21T10:49:26.1955545Z 'pid_type': 'flat', 2026-02-21T10:49:26.1956261Z 'range_flattens': [None, True, False], 2026-02-21T10:49:26.1960474Z 'range_multi_buffers': [None, None, False], 2026-02-21T10:49:26.1960844Z 'range_num_stages': [0, 4, 3], 2026-02-21T10:49:26.1961121Z 'range_unroll_factors': [0, 1, 0], 2026-02-21T10:49:26.1966815Z 'range_warp_specializes': [None, None, False]} 2026-02-21T10:49:26.1993665Z [805s] Fitting surrogate: 1121 points, 1121 targets 2026-02-21T10:49:26.9136330Z [805s] Generation 15 starting: 35 neighbors, 2 active search path(s) 2026-02-21T10:49:35.7355642Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 7.6 configs/s 2026-02-21T10:49:39.4004257Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 37/37 10.1 configs/s 2026-02-21T10:49:46.2077593Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━ 101/101 13.7 configs/s 2026-02-21T10:49:46.9071584Z [825s] Generation 15 complete: 2026-02-21T10:49:46.9072199Z ok=38 2026-02-21T10:49:46.9072453Z min=1.9703 2026-02-21T10:49:46.9077495Z mid=2.8466 2026-02-21T10:49:46.9081843Z max=15.7942 2026-02-21T10:49:46.9084773Z best={'block_sizes': [4, 128, 256], 2026-02-21T10:49:46.9089507Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T10:49:46.9093806Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T10:49:46.9098063Z 'num_stages': 2, 2026-02-21T10:49:46.9099689Z 'num_warps': 1, 2026-02-21T10:49:46.9099916Z 'pid_type': 'flat', 2026-02-21T10:49:46.9131844Z 'range_flattens': [None, True, False], 2026-02-21T10:49:46.9132228Z 'range_multi_buffers': 
[None, None, False], 2026-02-21T10:49:46.9132526Z 'range_num_stages': [0, 4, 3], 2026-02-21T10:49:46.9132738Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T10:49:46.9133011Z 'range_warp_specializes': [None, None, False]} 2026-02-21T10:49:46.9133277Z [825s] Fitting surrogate: 1159 points, 1159 targets 2026-02-21T10:49:47.6333097Z [826s] Generation 16 starting: 36 neighbors, 2 active search path(s) 2026-02-21T10:49:55.8808961Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 7.1 configs/s 2026-02-21T10:49:59.3392003Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 37/37 10.7 configs/s 2026-02-21T10:50:08.2897244Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━ 102/102 10.7 configs/s 2026-02-21T10:50:08.9633489Z [847s] Generation 16 complete: 2026-02-21T10:50:08.9638342Z ok=39 2026-02-21T10:50:08.9642835Z min=1.9353 2026-02-21T10:50:08.9644590Z mid=2.3869 2026-02-21T10:50:08.9644880Z max=3.9301 2026-02-21T10:50:08.9650154Z best={'block_sizes': [4, 512, 256], 2026-02-21T10:50:08.9654719Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T10:50:08.9658890Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T10:50:08.9660316Z 'num_stages': 2, 2026-02-21T10:50:08.9660576Z 'num_warps': 1, 2026-02-21T10:50:08.9660780Z 'pid_type': 'flat', 2026-02-21T10:50:08.9661036Z 'range_flattens': [None, True, None], 2026-02-21T10:50:08.9661314Z 'range_multi_buffers': [None, None, False], 2026-02-21T10:50:08.9661579Z 'range_num_stages': [0, 4, 3], 2026-02-21T10:50:08.9662198Z 'range_unroll_factors': [0, 1, 0], 2026-02-21T10:50:08.9662452Z 'range_warp_specializes': [None, None, False]} 2026-02-21T10:50:08.9701492Z [847s] Fitting surrogate: 1198 points, 1198 targets 2026-02-21T10:50:09.7093618Z [848s] Generation 17 starting: 37 neighbors, 2 active search path(s) 2026-02-21T10:50:18.3822530Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 5.4 configs/s 
2026-02-21T10:50:21.2350284Z [860s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 2048, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', '', 'first', ''], num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, True, None], range_multi_buffers=[None, None, False], range_num_stages=[0, 4, 3], range_unroll_factors=[0, 1, 0], range_warp_specializes=[None, None, False]) 2026-02-21T10:50:21.2351407Z Tensor-likes are not close! 2026-02-21T10:50:21.2352089Z 2026-02-21T10:50:21.2352239Z Mismatched elements: 5699 / 2147483648 (0.0%) 2026-02-21T10:50:21.2352589Z Greatest absolute difference: 0.0625 at index (168148, 3534) (up to 0.01 allowed) 2026-02-21T10:50:21.2352968Z Greatest relative difference: 728.0 at index (170455, 4801) (up to 0.01 allowed) 2026-02-21T10:50:21.2353347Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T10:50:21.2353528Z 2026-02-21T10:50:21.2887927Z [860s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 1024, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', '', 'first', ''], num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, None, False], range_num_stages=[0, 4, 4], range_unroll_factors=[0, 1, 0], range_warp_specializes=[None, None, False]) 2026-02-21T10:50:21.2889088Z Tensor-likes are not close! 2026-02-21T10:50:21.2889268Z 2026-02-21T10:50:21.2889383Z Mismatched elements: 1 / 2147483648 (0.0%) 2026-02-21T10:50:21.2889728Z Greatest absolute difference: 0.01123046875 at index (89500, 3161) (up to 0.01 allowed) 2026-02-21T10:50:21.2890191Z Greatest relative difference: 0.1376953125 at index (89500, 3161) (up to 0.01 allowed) 2026-02-21T10:50:21.2890588Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T10:50:21.2890776Z 2026-02-21T10:50:21.7588988Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 37/37 11.0 configs/s 2026-02-21T10:50:29.3400912Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━━ 104/104 12.8 configs/s 2026-02-21T10:50:30.0000127Z [868s] Generation 17 complete: 2026-02-21T10:50:30.0000428Z error=2 2026-02-21T10:50:30.0000676Z ok=37 2026-02-21T10:50:30.0000886Z min=2.0132 2026-02-21T10:50:30.0001089Z mid=2.4464 2026-02-21T10:50:30.0001307Z max=3.5830 2026-02-21T10:50:30.0001485Z best={'block_sizes': [4, 512, 256], 2026-02-21T10:50:30.0004203Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T10:50:30.0004596Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T10:50:30.0004854Z 'num_stages': 2, 2026-02-21T10:50:30.0005037Z 'num_warps': 1, 2026-02-21T10:50:30.0005255Z 'pid_type': 'flat', 2026-02-21T10:50:30.0005455Z 'range_flattens': [None, True, None], 2026-02-21T10:50:30.0005718Z 'range_multi_buffers': [None, None, False], 2026-02-21T10:50:30.0005972Z 'range_num_stages': [0, 4, 3], 2026-02-21T10:50:30.0006189Z 'range_unroll_factors': [0, 1, 1], 2026-02-21T10:50:30.0006451Z 'range_warp_specializes': [None, None, False]} 2026-02-21T10:50:30.0069252Z [868s] Fitting surrogate: 1237 points, 1237 targets 2026-02-21T10:50:30.6627186Z [869s] Generation 18 starting: 31 neighbors, 2 active search path(s) 2026-02-21T10:50:38.0824614Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 4.2 configs/s 2026-02-21T10:50:39.7979070Z [878s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 1024, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', '', 'first', ''], num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, True, None], range_multi_buffers=[None, None, False], range_num_stages=[0, 4, 3], range_unroll_factors=[0, 1, 1], range_warp_specializes=[None, None, 
False]) 2026-02-21T10:50:39.7980464Z Tensor-likes are not close! 2026-02-21T10:50:39.7980600Z 2026-02-21T10:50:39.7980726Z Mismatched elements: 1 / 2147483648 (0.0%) 2026-02-21T10:50:39.7981047Z Greatest absolute difference: 0.01123046875 at index (89500, 3161) (up to 0.01 allowed) 2026-02-21T10:50:39.7981483Z Greatest relative difference: 0.1376953125 at index (89500, 3161) (up to 0.01 allowed) 2026-02-21T10:50:39.7981839Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T10:50:39.7982197Z 2026-02-21T10:50:40.9084596Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 31/31 11.0 configs/s 2026-02-21T10:50:48.1098190Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━ 104/104 13.5 configs/s 2026-02-21T10:50:48.7687854Z [887s] Generation 18 complete: 2026-02-21T10:50:48.7688100Z error=1 2026-02-21T10:50:48.7688299Z ok=32 2026-02-21T10:50:48.7688462Z min=1.9704 2026-02-21T10:50:48.7688746Z mid=2.3860 2026-02-21T10:50:48.7688936Z max=3.7039 2026-02-21T10:50:48.7695124Z best={'block_sizes': [4, 512, 256], 2026-02-21T10:50:48.7695513Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T10:50:48.7695893Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T10:50:48.7696131Z 'num_stages': 2, 2026-02-21T10:50:48.7696341Z 'num_warps': 1, 2026-02-21T10:50:48.7696522Z 'pid_type': 'flat', 2026-02-21T10:50:48.7696753Z 'range_flattens': [None, None, None], 2026-02-21T10:50:48.7696994Z 'range_multi_buffers': [None, None, False], 2026-02-21T10:50:48.7697251Z 'range_num_stages': [0, 3, 3], 2026-02-21T10:50:48.7697489Z 'range_unroll_factors': [0, 1, 1], 2026-02-21T10:50:48.7697741Z 'range_warp_specializes': [None, None, False]} 2026-02-21T10:50:48.7761179Z [887s] Fitting surrogate: 1270 points, 1270 targets 2026-02-21T10:50:49.5262870Z [888s] Generation 19 starting: 38 neighbors, 2 active search path(s) 2026-02-21T10:50:57.6971671Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 
38/38 10.8 configs/s 2026-02-21T10:50:59.6614116Z [898s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 1024, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', '', 'first', ''], num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None, None], range_multi_buffers=[None, None, False], range_num_stages=[0, 3, 3], range_unroll_factors=[0, 1, 0], range_warp_specializes=[None, None, False]) 2026-02-21T10:50:59.6615236Z Tensor-likes are not close! 2026-02-21T10:50:59.6615373Z 2026-02-21T10:50:59.6615470Z Mismatched elements: 1 / 2147483648 (0.0%) 2026-02-21T10:50:59.6615842Z Greatest absolute difference: 0.01123046875 at index (89500, 3161) (up to 0.01 allowed) 2026-02-21T10:50:59.6616518Z Greatest relative difference: 0.1376953125 at index (89500, 3161) (up to 0.01 allowed) 2026-02-21T10:50:59.6616898Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T10:50:59.6617076Z 2026-02-21T10:51:00.5580876Z [899s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 2048, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', '', 'first', ''], num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None, None], range_multi_buffers=[None, None, False], range_num_stages=[0, 3, 2], range_unroll_factors=[0, 1, 1], range_warp_specializes=[None, None, False]) 2026-02-21T10:51:00.5582127Z Tensor-likes are not close! 2026-02-21T10:51:00.5582271Z 2026-02-21T10:51:00.5582377Z Mismatched elements: 5699 / 2147483648 (0.0%) 2026-02-21T10:51:00.5582744Z Greatest absolute difference: 0.0625 at index (168148, 3534) (up to 0.01 allowed) 2026-02-21T10:51:00.5583186Z Greatest relative difference: 728.0 at index (170455, 4801) (up to 0.01 allowed) 2026-02-21T10:51:00.5583548Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T10:51:00.5583740Z 2026-02-21T10:51:01.0789985Z [900s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 2048, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', '', 'first', ''], num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None, None], range_multi_buffers=[None, None, False], range_num_stages=[0, 3, 3], range_unroll_factors=[0, 1, 1], range_warp_specializes=[None, None, False]) 2026-02-21T10:51:01.0791058Z Tensor-likes are not close! 2026-02-21T10:51:01.0791200Z 2026-02-21T10:51:01.0791301Z Mismatched elements: 5699 / 2147483648 (0.0%) 2026-02-21T10:51:01.0791645Z Greatest absolute difference: 0.0625 at index (168148, 3534) (up to 0.01 allowed) 2026-02-21T10:51:01.0792479Z Greatest relative difference: 728.0 at index (170455, 4801) (up to 0.01 allowed) 2026-02-21T10:51:01.0792880Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T10:51:01.0793065Z 2026-02-21T10:51:01.1795020Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 38/38 10.9 configs/s 2026-02-21T10:51:08.3987205Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━━━ 106/106 13.6 configs/s 2026-02-21T10:51:09.0736790Z [908s] Generation 19 complete: 2026-02-21T10:51:09.0737158Z error=3 2026-02-21T10:51:09.0737368Z ok=37 2026-02-21T10:51:09.0737594Z min=1.9016 2026-02-21T10:51:09.0737785Z mid=2.5406 2026-02-21T10:51:09.0737997Z max=5.3240 2026-02-21T10:51:09.0738214Z best={'block_sizes': [4, 512, 512], 2026-02-21T10:51:09.0738963Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T10:51:09.0739402Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T10:51:09.0739666Z 'num_stages': 2, 2026-02-21T10:51:09.0739850Z 'num_warps': 1, 2026-02-21T10:51:09.0740082Z 'pid_type': 'flat', 2026-02-21T10:51:09.0740303Z 'range_flattens': [None, None, None], 2026-02-21T10:51:09.0740572Z 'range_multi_buffers': 
[None, None, False], 2026-02-21T10:51:09.0740798Z 'range_num_stages': [0, 4, 3], 2026-02-21T10:51:09.0741037Z 'range_unroll_factors': [0, 0, 1], 2026-02-21T10:51:09.0741276Z 'range_warp_specializes': [None, None, False]} 2026-02-21T10:51:09.0806184Z [908s] Fitting surrogate: 1310 points, 1310 targets 2026-02-21T10:51:09.5498316Z [908s] Generation 20 starting: 16 neighbors, 1 active search path(s) 2026-02-21T10:51:15.3553190Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.2 configs/s 2026-02-21T10:51:16.8145524Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 11.1 configs/s 2026-02-21T10:51:20.4194361Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━━━ 106/106 25.4 configs/s 2026-02-21T10:51:21.0645035Z [920s] Generation 20 complete: 2026-02-21T10:51:21.0646996Z ok=17 2026-02-21T10:51:21.0647201Z min=1.9024 2026-02-21T10:51:21.0647473Z mid=2.3645 2026-02-21T10:51:21.0647937Z max=3.5790 2026-02-21T10:51:21.0651844Z best={'block_sizes': [4, 512, 512], 2026-02-21T10:51:21.0653461Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:51:21.0653856Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T10:51:21.0654096Z 'num_stages': 2, 2026-02-21T10:51:21.0654308Z 'num_warps': 1, 2026-02-21T10:51:21.0654514Z 'pid_type': 'flat', 2026-02-21T10:51:21.0654711Z 'range_flattens': [None, None, None], 2026-02-21T10:51:21.0655002Z 'range_multi_buffers': [None, None, False], 2026-02-21T10:51:21.0655236Z 'range_num_stages': [0, 4, 3], 2026-02-21T10:51:21.0655474Z 'range_unroll_factors': [0, 0, 1], 2026-02-21T10:51:21.0655708Z 'range_warp_specializes': [None, None, False]} 2026-02-21T10:51:21.0708234Z [920s] Fitting surrogate: 1327 points, 1327 targets 2026-02-21T10:51:21.3974235Z [920s] Autotuning complete in 920.3s after searching 1283 configs. 
2026-02-21T10:51:21.3974990Z One can hardcode the best config and skip autotuning with: 2026-02-21T10:51:21.3976057Z @helion.kernel(config=helion.Config(block_sizes=[4, 512, 512], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['', '', 'first', ''], num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None, None], range_multi_buffers=[None, None, False], range_num_stages=[0, 4, 3], range_unroll_factors=[0, 0, 1], range_warp_specializes=[None, None, False]), static_shapes=True) 2026-02-21T10:51:21.3976995Z 2026-02-21T10:51:21.3977262Z [920s] Code of selected kernel: /tmp/torchinductor_root/up/cup6twyi64pkqxnqb4wbf3u3txtv2bkn2twqsxoqm5n5du7jfmt3.py 2026-02-21T10:51:22.7231275Z WARNING:tritonbench.utils.triton_op:Completed input ID 9: 2026-02-21T10:51:22.7233222Z x_val 2026-02-21T10:51:22.7233423Z ------- 2026-02-21T10:51:22.7233617Z 8192 2026-02-21T10:51:22.7233773Z 2026-02-21T10:51:22.7238398Z 100%|██████████| 6/6 [2:43:12<00:00, 2063.25s/it] 2026-02-21T10:51:22.7242900Z 100%|██████████| 6/6 [2:43:12<00:00, 1632.13s/it] 2026-02-21T10:51:22.7251929Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmp7hhoeeig.csv 2026-02-21T10:51:37.7558117Z x_val triton_welford-speedup triton_welford-accuracy torch_compile_welford-speedup torch_compile_welford-accuracy helion_welford-speedup helion_welford-accuracy 2026-02-21T10:51:37.7559389Z ------- ------------------------ ------------------------- ------------------------------- -------------------------------- ------------------------ ------------------------- 2026-02-21T10:51:37.7559943Z 1024 0.728005 1 0.569143 1 3.53479 1 2026-02-21T10:51:37.7560436Z 2048 0.740912 1 0.408347 1 2.46464 1 2026-02-21T10:51:37.7560950Z 3072 0.809693 1 0.380964 1 1.90844 1 2026-02-21T10:51:37.7566183Z 4096 0.825673 1 0.359 1 1.45602 1 2026-02-21T10:51:37.7567612Z 6144 0.873974 1 0.336458 1 1.46327 1 2026-02-21T10:51:37.7568149Z 8192 0.891128 1 0.32773 1 
1.38334 1 2026-02-21T10:51:37.7568649Z average 0.811564 1 0.39694 1 2.03508 1 2026-02-21T10:54:03.6918905Z Applying custom args for welford: {'num_inputs': 6} 2026-02-21T10:54:03.7038155Z Running welford benchmark with Helion implementation... 2026-02-21T10:54:03.7042178Z 2026-02-21T10:54:03.9255725Z Equally-spaced-k mode: Selected 6 equally spaced inputs (total available: 10) 2026-02-21T10:54:03.9260689Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 2, 4, 5, 7, 9] 2026-02-21T10:54:03.9262264Z 2026-02-21T10:54:03.9268517Z 0%| | 0/6 [00:00 {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}, %arg4: f32) attributes {noinline = false} { 2026-02-21T10:56:45.0172307Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T10:56:45.0172638Z %c128_i32 = arith.constant 128 : i32 2026-02-21T10:56:45.0172915Z %cst = arith.constant dense<1.000000e+00> : tensor<16xf32> 2026-02-21T10:56:45.0173274Z %cst_0 = arith.constant dense<1.600000e+01> : tensor<16xf32> 2026-02-21T10:56:45.0173883Z %c0_i32 = arith.constant 0 : i32 2026-02-21T10:56:45.0174115Z %c296_i32 = arith.constant 296 : i32 2026-02-21T10:56:45.0174399Z %cst_1 = arith.constant dense<0.000000e+00> : tensor<16xf32> 2026-02-21T10:56:45.0174659Z %c16_i32 = arith.constant 16 : i32 2026-02-21T10:56:45.0174923Z %c262144_i32 = arith.constant 262144 : i32 2026-02-21T10:56:45.0175152Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T10:56:45.0175405Z %c1024_i64 = arith.constant 1024 : i64 2026-02-21T10:56:45.0175621Z %c1_i64 = arith.constant 1 : i64 2026-02-21T10:56:45.0176018Z %0 = tt.make_tensor_descriptor %arg0, [%c262144_i32, %c1024_i32], [%c1024_i64, %c1_i64] : , > 2026-02-21T10:56:45.0176583Z %1 = tt.make_tensor_descriptor %arg0, [%c262144_i32, %c1024_i32], [%c1024_i64, %c1_i64] : , > 2026-02-21T10:56:45.0177272Z %2 = tt.make_tensor_descriptor %arg3, [%c262144_i32, %c1024_i32], [%c1024_i64, %c1_i64] : , 
> 2026-02-21T10:56:45.0177659Z %3 = tt.get_program_id x : i32 2026-02-21T10:56:45.0177948Z scf.for %arg5 = %3 to %c16384_i32 step %c296_i32 : i32 { 2026-02-21T10:56:45.0178234Z %4 = arith.muli %arg5, %c16_i32 : i32 2026-02-21T10:56:45.0178468Z %c1008_i32 = arith.constant 1008 : i32 2026-02-21T10:56:45.0178719Z %c48_i32 = arith.constant 48 : i32 2026-02-21T10:56:45.0179181Z %5:3 = scf.for %arg6 = %c0_i32 to %c1008_i32 step %c48_i32 iter_args(%arg7 = %cst_1, %arg8 = %cst_1, %arg9 = %cst_1) -> (tensor<16xf32>, tensor<16xf32>, tensor<16xf32>) : i32 { 2026-02-21T10:56:45.0179776Z %35 = tt.descriptor_load %0[%4, %arg6] : !tt.tensordesc<tensor<16x16xbf16>> -> tensor<16x16xbf16> 2026-02-21T10:56:45.0180115Z %36 = "tt.reduce"(%35) <{axis = 1 : i32}> ({ 2026-02-21T10:56:45.0180374Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T10:56:45.0180655Z %108 = arith.addf %arg10, %arg11 : bf16 2026-02-21T10:56:45.0180888Z tt.reduce.return %108 : bf16 2026-02-21T10:56:45.0181147Z }) : (tensor<16x16xbf16>) -> tensor<16xbf16> 2026-02-21T10:56:45.0181395Z %37 = arith.mulf %35, %35 : tensor<16x16xbf16> 2026-02-21T10:56:45.0181658Z %38 = "tt.reduce"(%37) <{axis = 1 : i32}> ({ 2026-02-21T10:56:45.0181927Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T10:56:45.0182183Z %108 = arith.addf %arg10, %arg11 : bf16 2026-02-21T10:56:45.0182437Z tt.reduce.return %108 : bf16 2026-02-21T10:56:45.0182656Z }) : (tensor<16x16xbf16>) -> tensor<16xbf16> 2026-02-21T10:56:45.0182947Z %39 = arith.extf %36 : tensor<16xbf16> to tensor<16xf32> 2026-02-21T10:56:45.0183212Z %40 = arith.divf %39, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0183481Z %41 = arith.mulf %36, %36 : tensor<16xbf16> 2026-02-21T10:56:45.0183738Z %42 = arith.extf %41 : tensor<16xbf16> to tensor<16xf32> 2026-02-21T10:56:45.0184030Z %43 = arith.divf %42, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0184284Z %44 = arith.extf %38 : tensor<16xbf16> to tensor<16xf32> 2026-02-21T10:56:45.0184558Z %45 = arith.subf %44, %43 : tensor<16xf32> 2026-02-21T10:56:45.0184824Z %46
= arith.subf %40, %arg8 : tensor<16xf32> 2026-02-21T10:56:45.0185068Z %47 = arith.addf %arg7, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0185335Z %48 = arith.divf %cst, %47 : tensor<16xf32> 2026-02-21T10:56:45.0185571Z %49 = arith.mulf %48, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0185829Z %50 = arith.mulf %46, %49 : tensor<16xf32> 2026-02-21T10:56:45.0186062Z %51 = arith.addf %arg8, %50 : tensor<16xf32> 2026-02-21T10:56:45.0186320Z %52 = arith.addf %arg9, %45 : tensor<16xf32> 2026-02-21T10:56:45.0186577Z %53 = arith.mulf %46, %46 : tensor<16xf32> 2026-02-21T10:56:45.0186828Z %54 = arith.mulf %arg7, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0187095Z %55 = arith.divf %54, %47 : tensor<16xf32> 2026-02-21T10:56:45.0187404Z %56 = arith.mulf %53, %55 : tensor<16xf32> 2026-02-21T10:56:45.0187671Z %57 = arith.addf %52, %56 : tensor<16xf32> 2026-02-21T10:56:45.0187902Z %c1_i32 = arith.constant 1 : i32 2026-02-21T10:56:45.0188165Z %58 = arith.muli %c16_i32, %c1_i32 : i32 2026-02-21T10:56:45.0188395Z %59 = arith.addi %arg6, %58 : i32 2026-02-21T10:56:45.0188740Z %60 = tt.descriptor_load %0[%4, %59] : !tt.tensordesc<tensor<16x16xbf16>> -> tensor<16x16xbf16> 2026-02-21T10:56:45.0189097Z %61 = "tt.reduce"(%60) <{axis = 1 : i32}> ({ 2026-02-21T10:56:45.0189326Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T10:56:45.0189581Z %108 = arith.addf %arg10, %arg11 : bf16 2026-02-21T10:56:45.0189805Z tt.reduce.return %108 : bf16 2026-02-21T10:56:45.0190063Z }) : (tensor<16x16xbf16>) -> tensor<16xbf16> 2026-02-21T10:56:45.0190306Z %62 = arith.mulf %60, %60 : tensor<16x16xbf16> 2026-02-21T10:56:45.0190640Z %63 = "tt.reduce"(%62) <{axis = 1 : i32}> ({ 2026-02-21T10:56:45.0190894Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T10:56:45.0191115Z %108 = arith.addf %arg10, %arg11 : bf16 2026-02-21T10:56:45.0191368Z tt.reduce.return %108 : bf16 2026-02-21T10:56:45.0191592Z }) : (tensor<16x16xbf16>) -> tensor<16xbf16> 2026-02-21T10:56:45.0191930Z %64 = arith.extf %61 : tensor<16xbf16> to tensor<16xf32>
2026-02-21T10:56:45.0192191Z %65 = arith.divf %64, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0192460Z %66 = arith.mulf %61, %61 : tensor<16xbf16> 2026-02-21T10:56:45.0192737Z %67 = arith.extf %66 : tensor<16xbf16> to tensor<16xf32> 2026-02-21T10:56:45.0192988Z %68 = arith.divf %67, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0193261Z %69 = arith.extf %63 : tensor<16xbf16> to tensor<16xf32> 2026-02-21T10:56:45.0193516Z %70 = arith.subf %69, %68 : tensor<16xf32> 2026-02-21T10:56:45.0193780Z %71 = arith.subf %65, %51 : tensor<16xf32> 2026-02-21T10:56:45.0194021Z %72 = arith.addf %47, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0194281Z %73 = arith.divf %cst, %72 : tensor<16xf32> 2026-02-21T10:56:45.0194540Z %74 = arith.mulf %73, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0194773Z %75 = arith.mulf %71, %74 : tensor<16xf32> 2026-02-21T10:56:45.0195025Z %76 = arith.addf %51, %75 : tensor<16xf32> 2026-02-21T10:56:45.0195256Z %77 = arith.addf %57, %70 : tensor<16xf32> 2026-02-21T10:56:45.0195508Z %78 = arith.mulf %71, %71 : tensor<16xf32> 2026-02-21T10:56:45.0195739Z %79 = arith.mulf %47, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0195998Z %80 = arith.divf %79, %72 : tensor<16xf32> 2026-02-21T10:56:45.0196225Z %81 = arith.mulf %78, %80 : tensor<16xf32> 2026-02-21T10:56:45.0196481Z %82 = arith.addf %77, %81 : tensor<16xf32> 2026-02-21T10:56:45.0196734Z %c2_i32 = arith.constant 2 : i32 2026-02-21T10:56:45.0196959Z %83 = arith.muli %c16_i32, %c2_i32 : i32 2026-02-21T10:56:45.0197219Z %84 = arith.addi %arg6, %83 : i32 2026-02-21T10:56:45.0197533Z %85 = tt.descriptor_load %0[%4, %84] : !tt.tensordesc> -> tensor<16x16xbf16> 2026-02-21T10:56:45.0197884Z %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({ 2026-02-21T10:56:45.0198119Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T10:56:45.0198368Z %108 = arith.addf %arg10, %arg11 : bf16 2026-02-21T10:56:45.0198615Z tt.reduce.return %108 : bf16 2026-02-21T10:56:45.0198839Z }) : (tensor<16x16xbf16>) -> tensor<16xbf16> 
2026-02-21T10:56:45.0199099Z %87 = arith.mulf %85, %85 : tensor<16x16xbf16> 2026-02-21T10:56:45.0199329Z %88 = "tt.reduce"(%87) <{axis = 1 : i32}> ({ 2026-02-21T10:56:45.0199568Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T10:56:45.0199827Z %108 = arith.addf %arg10, %arg11 : bf16 2026-02-21T10:56:45.0200072Z tt.reduce.return %108 : bf16 2026-02-21T10:56:45.0200318Z }) : (tensor<16x16xbf16>) -> tensor<16xbf16> 2026-02-21T10:56:45.0200670Z %89 = arith.extf %86 : tensor<16xbf16> to tensor<16xf32> 2026-02-21T10:56:45.0200965Z %90 = arith.divf %89, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0201213Z %91 = arith.mulf %86, %86 : tensor<16xbf16> 2026-02-21T10:56:45.0201501Z %92 = arith.extf %91 : tensor<16xbf16> to tensor<16xf32> 2026-02-21T10:56:45.0201765Z %93 = arith.divf %92, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0202194Z %94 = arith.extf %88 : tensor<16xbf16> to tensor<16xf32> 2026-02-21T10:56:45.0202491Z %95 = arith.subf %94, %93 : tensor<16xf32> 2026-02-21T10:56:45.0202730Z %96 = arith.subf %90, %76 : tensor<16xf32> 2026-02-21T10:56:45.0203010Z %97 = arith.addf %72, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0203257Z %98 = arith.divf %cst, %97 : tensor<16xf32> 2026-02-21T10:56:45.0203536Z %99 = arith.mulf %98, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0203851Z %100 = arith.mulf %96, %99 : tensor<16xf32> 2026-02-21T10:56:45.0204132Z %101 = arith.addf %76, %100 : tensor<16xf32> 2026-02-21T10:56:45.0204376Z %102 = arith.addf %82, %95 : tensor<16xf32> 2026-02-21T10:56:45.0204651Z %103 = arith.mulf %96, %96 : tensor<16xf32> 2026-02-21T10:56:45.0204931Z %104 = arith.mulf %72, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0205182Z %105 = arith.divf %104, %97 : tensor<16xf32> 2026-02-21T10:56:45.0205462Z %106 = arith.mulf %103, %105 : tensor<16xf32> 2026-02-21T10:56:45.0205710Z %107 = arith.addf %102, %106 : tensor<16xf32> 2026-02-21T10:56:45.0206037Z scf.yield %97, %101, %107 : tensor<16xf32>, tensor<16xf32>, tensor<16xf32> 2026-02-21T10:56:45.0206332Z } 
{tt.num_stages = 1 : i32} 2026-02-21T10:56:45.0206696Z %6 = tt.descriptor_load %0[%4, %c1008_i32] : !tt.tensordesc> -> tensor<16x16xbf16> 2026-02-21T10:56:45.0207075Z %7 = "tt.reduce"(%6) <{axis = 1 : i32}> ({ 2026-02-21T10:56:45.0207314Z ^bb0(%arg6: bf16, %arg7: bf16): 2026-02-21T10:56:45.0207579Z %35 = arith.addf %arg6, %arg7 : bf16 2026-02-21T10:56:45.0207813Z tt.reduce.return %35 : bf16 2026-02-21T10:56:45.0208073Z }) : (tensor<16x16xbf16>) -> tensor<16xbf16> 2026-02-21T10:56:45.0208328Z %8 = arith.mulf %6, %6 : tensor<16x16xbf16> 2026-02-21T10:56:45.0208587Z %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({ 2026-02-21T10:56:45.0208830Z ^bb0(%arg6: bf16, %arg7: bf16): 2026-02-21T10:56:45.0209045Z %35 = arith.addf %arg6, %arg7 : bf16 2026-02-21T10:56:45.0209287Z tt.reduce.return %35 : bf16 2026-02-21T10:56:45.0209505Z }) : (tensor<16x16xbf16>) -> tensor<16xbf16> 2026-02-21T10:56:45.0209780Z %10 = arith.extf %7 : tensor<16xbf16> to tensor<16xf32> 2026-02-21T10:56:45.0210029Z %11 = arith.divf %10, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0210293Z %12 = arith.mulf %7, %7 : tensor<16xbf16> 2026-02-21T10:56:45.0210546Z %13 = arith.extf %12 : tensor<16xbf16> to tensor<16xf32> 2026-02-21T10:56:45.0210823Z %14 = arith.divf %13, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0211102Z %15 = arith.extf %9 : tensor<16xbf16> to tensor<16xf32> 2026-02-21T10:56:45.0211346Z %16 = arith.subf %15, %14 : tensor<16xf32> 2026-02-21T10:56:45.0211600Z %17 = arith.subf %11, %5#1 : tensor<16xf32> 2026-02-21T10:56:45.0211836Z %18 = arith.addf %5#0, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0212136Z %19 = arith.divf %cst, %18 : tensor<16xf32> 2026-02-21T10:56:45.0212365Z %20 = arith.mulf %19, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0212621Z %21 = arith.mulf %17, %20 : tensor<16xf32> 2026-02-21T10:56:45.0212868Z %22 = arith.addf %5#1, %21 : tensor<16xf32> 2026-02-21T10:56:45.0213097Z %23 = arith.addf %5#2, %16 : tensor<16xf32> 2026-02-21T10:56:45.0213349Z %24 = arith.mulf %17, %17 : 
tensor<16xf32> 2026-02-21T10:56:45.0213580Z %25 = arith.mulf %5#0, %cst_0 : tensor<16xf32> 2026-02-21T10:56:45.0213843Z %26 = arith.divf %25, %18 : tensor<16xf32> 2026-02-21T10:56:45.0214130Z %27 = arith.mulf %24, %26 : tensor<16xf32> 2026-02-21T10:56:45.0214380Z %28 = arith.addf %23, %27 : tensor<16xf32> 2026-02-21T10:56:45.0214601Z %29 = arith.divf %28, %18 : tensor<16xf32> 2026-02-21T10:56:45.0214860Z %30 = tt.splat %arg4 : f32 -> tensor<16xf32> 2026-02-21T10:56:45.0215118Z %31 = arith.addf %29, %30 : tensor<16xf32> 2026-02-21T10:56:45.0215508Z %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_rsqrtf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T10:56:45.0216017Z %33 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T10:56:45.0216375Z %34 = tt.expand_dims %32 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T10:56:45.0216693Z scf.for %arg6 = %c0_i32 to %c1024_i32 step %c128_i32 : i32 { 2026-02-21T10:56:45.0217116Z %35 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T10:56:45.0217416Z %36 = tt.splat %arg6 : i32 -> tensor<128xi32> 2026-02-21T10:56:45.0217685Z %37 = arith.addi %36, %35 : tensor<128xi32> 2026-02-21T10:56:45.0218018Z %38 = tt.descriptor_load %1[%4, %arg6] : !tt.tensordesc> -> tensor<16x128xbf16> 2026-02-21T10:56:45.0218421Z %39 = tt.splat %arg1 : !tt.ptr -> tensor<128x!tt.ptr> 2026-02-21T10:56:45.0218756Z %40 = tt.addptr %39, %37 : tensor<128x!tt.ptr>, tensor<128xi32> 2026-02-21T10:56:45.0219084Z %41 = tt.load %40 evictionPolicy = evict_last : tensor<128x!tt.ptr> 2026-02-21T10:56:45.0219466Z %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<128xbf16> -> tensor<1x128xbf16> 2026-02-21T10:56:45.0219802Z %43 = tt.splat %arg2 : !tt.ptr -> tensor<128x!tt.ptr> 2026-02-21T10:56:45.0220137Z %44 = tt.addptr %43, %37 : tensor<128x!tt.ptr>, tensor<128xi32> 2026-02-21T10:56:45.0220444Z %45 = tt.load %44 : tensor<128x!tt.ptr> 
2026-02-21T10:56:45.0220742Z %46 = tt.expand_dims %45 {axis = 0 : i32} : tensor<128xbf16> -> tensor<1x128xbf16> 2026-02-21T10:56:45.0221102Z %47 = arith.extf %38 : tensor<16x128xbf16> to tensor<16x128xf32> 2026-02-21T10:56:45.0221403Z %48 = tt.broadcast %33 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T10:56:45.0221706Z %49 = arith.subf %47, %48 : tensor<16x128xf32> 2026-02-21T10:56:45.0222008Z %50 = tt.broadcast %34 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T10:56:45.0222304Z %51 = arith.mulf %49, %50 : tensor<16x128xf32> 2026-02-21T10:56:45.0222596Z %52 = arith.extf %42 : tensor<1x128xbf16> to tensor<1x128xf32> 2026-02-21T10:56:45.0222889Z %53 = tt.broadcast %52 : tensor<1x128xf32> -> tensor<16x128xf32> 2026-02-21T10:56:45.0223186Z %54 = arith.mulf %51, %53 : tensor<16x128xf32> 2026-02-21T10:56:45.0223452Z %55 = arith.extf %46 : tensor<1x128xbf16> to tensor<1x128xf32> 2026-02-21T10:56:45.0223763Z %56 = tt.broadcast %55 : tensor<1x128xf32> -> tensor<16x128xf32> 2026-02-21T10:56:45.0224029Z %57 = arith.addf %54, %56 : tensor<16x128xf32> 2026-02-21T10:56:45.0224325Z %58 = arith.truncf %57 : tensor<16x128xf32> to tensor<16x128xbf16> 2026-02-21T10:56:45.0224714Z tt.descriptor_store %2[%4, %arg6], %58 : !tt.tensordesc>, tensor<16x128xbf16> 2026-02-21T10:56:45.0225034Z } {tt.num_stages = 1 : i32} 2026-02-21T10:56:45.0225346Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T10:56:45.0225627Z tt.return 2026-02-21T10:56:45.0225820Z } 2026-02-21T10:56:45.0225982Z } 2026-02-21T10:56:45.0226093Z 2026-02-21T10:56:45.0226161Z {-# 2026-02-21T10:56:45.0226324Z external_resources: { 2026-02-21T10:56:45.0226542Z mlir_reproducer: { 2026-02-21T10:56:45.0230999Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, 
tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T10:56:45.0235737Z disable_threading: false, 2026-02-21T10:56:45.0235971Z verify_each: true 2026-02-21T10:56:45.0236151Z } 2026-02-21T10:56:45.0236331Z } 2026-02-21T10:56:45.0236491Z #-} 2026-02-21T10:56:45.0236984Z /tmp/torchinductor_root/qf/cqfuyqx6a5git56l3n2bw7r6m36opxjvuytc6a73jaosnmny7pyh.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T10:56:45.0238268Z 
/tmp/torchinductor_root/qf/cqfuyqx6a5git56l3n2bw7r6m36opxjvuytc6a73jaosnmny7pyh.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T10:56:45.0239279Z [152s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T10:56:45.0240563Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 16, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first', 'last', ''], num_sm_multiplier=2, num_stages=8, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None, None], range_multi_buffers=[False, None, True], range_num_stages=[2, 1, 1], range_unroll_factors=[0, 3, 0], range_warp_specializes=[True, None, None]), static_shapes=True) 2026-02-21T10:56:45.0241730Z Error: RuntimeError: PassManager::run failed 2026-02-21T10:56:45.0242090Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T10:56:45.1922970Z module { 2026-02-21T10:56:45.1925399Z tt.func public @_helion_welford(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}, %arg4: f32) attributes {noinline = false} { 2026-02-21T10:56:45.1926121Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T10:56:45.1926363Z %c32_i32 = arith.constant 32 : i32 2026-02-21T10:56:45.1926658Z %cst = arith.constant dense<1.000000e+00> : tensor<64xf32> 2026-02-21T10:56:45.1926956Z %cst_0 = arith.constant dense<1.600000e+01> : tensor<64xf32> 2026-02-21T10:56:45.1927244Z %c16_i32 = arith.constant 16 : i32 2026-02-21T10:56:45.1927463Z %c0_i32 = arith.constant 0 : i32 2026-02-21T10:56:45.1927725Z %c296_i32 = arith.constant 296 : i32 2026-02-21T10:56:45.1928235Z %cst_1 = arith.constant dense<0.000000e+00> : tensor<64xf32> 2026-02-21T10:56:45.1928535Z %c64_i32 = arith.constant 64 : i32 2026-02-21T10:56:45.1928793Z %c262144_i32 = arith.constant 262144 : i32 2026-02-21T10:56:45.1929025Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T10:56:45.1929276Z %c1024_i64 = arith.constant 1024 : i64 2026-02-21T10:56:45.1929495Z %c1_i64 = arith.constant 1 : i64 2026-02-21T10:56:45.1929884Z %0 = tt.make_tensor_descriptor %arg0, [%c262144_i32, %c1024_i32], [%c1024_i64, %c1_i64] : , > 2026-02-21T10:56:45.1930372Z %1 = tt.make_tensor_descriptor %arg0, [%c262144_i32, %c1024_i32], [%c1024_i64, %c1_i64] : , > 2026-02-21T10:56:45.1930874Z %2 = tt.make_tensor_descriptor %arg3, [%c262144_i32, %c1024_i32], [%c1024_i64, %c1_i64] : , > 2026-02-21T10:56:45.1931329Z %3 = tt.get_program_id x : i32 2026-02-21T10:56:45.1931580Z scf.for %arg5 = %3 to %c4096_i32 step %c296_i32 : i32 { 2026-02-21T10:56:45.1932042Z %4 = arith.muli %arg5, %c64_i32 : i32 2026-02-21T10:56:45.1932276Z %c1008_i32 = arith.constant 1008 : i32 2026-02-21T10:56:45.1932530Z %c48_i32 = arith.constant 48 : i32 2026-02-21T10:56:45.1932997Z %5:3 = 
scf.for %arg6 = %c0_i32 to %c1008_i32 step %c48_i32 iter_args(%arg7 = %cst_1, %arg8 = %cst_1, %arg9 = %cst_1) -> (tensor<64xf32>, tensor<64xf32>, tensor<64xf32>) : i32 { 2026-02-21T10:56:45.1933594Z %35 = tt.descriptor_load %0[%4, %arg6] : !tt.tensordesc> -> tensor<64x16xbf16> 2026-02-21T10:56:45.1933963Z %36 = "tt.reduce"(%35) <{axis = 1 : i32}> ({ 2026-02-21T10:56:45.1934196Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T10:56:45.1934452Z %108 = arith.addf %arg10, %arg11 : bf16 2026-02-21T10:56:45.1934678Z tt.reduce.return %108 : bf16 2026-02-21T10:56:45.1934937Z }) : (tensor<64x16xbf16>) -> tensor<64xbf16> 2026-02-21T10:56:45.1935206Z %37 = arith.mulf %35, %35 : tensor<64x16xbf16> 2026-02-21T10:56:45.1935446Z %38 = "tt.reduce"(%37) <{axis = 1 : i32}> ({ 2026-02-21T10:56:45.1935696Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T10:56:45.1935920Z %108 = arith.addf %arg10, %arg11 : bf16 2026-02-21T10:56:45.1936168Z tt.reduce.return %108 : bf16 2026-02-21T10:56:45.1936457Z }) : (tensor<64x16xbf16>) -> tensor<64xbf16> 2026-02-21T10:56:45.1936713Z %39 = arith.extf %36 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T10:56:45.1937031Z %40 = arith.divf %39, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1937304Z %41 = arith.mulf %36, %36 : tensor<64xbf16> 2026-02-21T10:56:45.1937568Z %42 = arith.extf %41 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T10:56:45.1937876Z %43 = arith.divf %42, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1938139Z %44 = arith.extf %38 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T10:56:45.1938438Z %45 = arith.subf %44, %43 : tensor<64xf32> 2026-02-21T10:56:45.1938690Z %46 = arith.subf %40, %arg8 : tensor<64xf32> 2026-02-21T10:56:45.1938974Z %47 = arith.addf %arg7, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1939251Z %48 = arith.divf %cst, %47 : tensor<64xf32> 2026-02-21T10:56:45.1939499Z %49 = arith.mulf %48, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1939767Z %50 = arith.mulf %46, %49 : tensor<64xf32> 2026-02-21T10:56:45.1940007Z %51 = arith.addf 
%arg8, %50 : tensor<64xf32> 2026-02-21T10:56:45.1940276Z %52 = arith.addf %arg9, %45 : tensor<64xf32> 2026-02-21T10:56:45.1940514Z %53 = arith.mulf %46, %46 : tensor<64xf32> 2026-02-21T10:56:45.1940796Z %54 = arith.mulf %arg7, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1941043Z %55 = arith.divf %54, %47 : tensor<64xf32> 2026-02-21T10:56:45.1941308Z %56 = arith.mulf %53, %55 : tensor<64xf32> 2026-02-21T10:56:45.1941575Z %57 = arith.addf %52, %56 : tensor<64xf32> 2026-02-21T10:56:45.1941917Z %c1_i32 = arith.constant 1 : i32 2026-02-21T10:56:45.1942187Z %58 = arith.muli %c16_i32, %c1_i32 : i32 2026-02-21T10:56:45.1942426Z %59 = arith.addi %arg6, %58 : i32 2026-02-21T10:56:45.1942784Z %60 = tt.descriptor_load %0[%4, %59] : !tt.tensordesc> -> tensor<64x16xbf16> 2026-02-21T10:56:45.1943130Z %61 = "tt.reduce"(%60) <{axis = 1 : i32}> ({ 2026-02-21T10:56:45.1943400Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T10:56:45.1943668Z %108 = arith.addf %arg10, %arg11 : bf16 2026-02-21T10:56:45.1943905Z tt.reduce.return %108 : bf16 2026-02-21T10:56:45.1944174Z }) : (tensor<64x16xbf16>) -> tensor<64xbf16> 2026-02-21T10:56:45.1944426Z %62 = arith.mulf %60, %60 : tensor<64x16xbf16> 2026-02-21T10:56:45.1944704Z %63 = "tt.reduce"(%62) <{axis = 1 : i32}> ({ 2026-02-21T10:56:45.1944939Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T10:56:45.1945252Z %108 = arith.addf %arg10, %arg11 : bf16 2026-02-21T10:56:45.1945522Z tt.reduce.return %108 : bf16 2026-02-21T10:56:45.1945762Z }) : (tensor<64x16xbf16>) -> tensor<64xbf16> 2026-02-21T10:56:45.1946056Z %64 = arith.extf %61 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T10:56:45.1946326Z %65 = arith.divf %64, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1946595Z %66 = arith.mulf %61, %61 : tensor<64xbf16> 2026-02-21T10:56:45.1946853Z %67 = arith.extf %66 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T10:56:45.1947130Z %68 = arith.divf %67, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1947382Z %69 = arith.extf %63 : tensor<64xbf16> to 
tensor<64xf32> 2026-02-21T10:56:45.1947654Z %70 = arith.subf %69, %68 : tensor<64xf32> 2026-02-21T10:56:45.1947907Z %71 = arith.subf %65, %51 : tensor<64xf32> 2026-02-21T10:56:45.1948137Z %72 = arith.addf %47, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1948395Z %73 = arith.divf %cst, %72 : tensor<64xf32> 2026-02-21T10:56:45.1948627Z %74 = arith.mulf %73, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1948909Z %75 = arith.mulf %71, %74 : tensor<64xf32> 2026-02-21T10:56:45.1949133Z %76 = arith.addf %51, %75 : tensor<64xf32> 2026-02-21T10:56:45.1949381Z %77 = arith.addf %57, %70 : tensor<64xf32> 2026-02-21T10:56:45.1949600Z %78 = arith.mulf %71, %71 : tensor<64xf32> 2026-02-21T10:56:45.1949852Z %79 = arith.mulf %47, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1950104Z %80 = arith.divf %79, %72 : tensor<64xf32> 2026-02-21T10:56:45.1950326Z %81 = arith.mulf %78, %80 : tensor<64xf32> 2026-02-21T10:56:45.1950572Z %82 = arith.addf %77, %81 : tensor<64xf32> 2026-02-21T10:56:45.1950794Z %c2_i32 = arith.constant 2 : i32 2026-02-21T10:56:45.1951040Z %83 = arith.muli %c16_i32, %c2_i32 : i32 2026-02-21T10:56:45.1951260Z %84 = arith.addi %arg6, %83 : i32 2026-02-21T10:56:45.1951595Z %85 = tt.descriptor_load %0[%4, %84] : !tt.tensordesc> -> tensor<64x16xbf16> 2026-02-21T10:56:45.1951981Z %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({ 2026-02-21T10:56:45.1952210Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T10:56:45.1952464Z %108 = arith.addf %arg10, %arg11 : bf16 2026-02-21T10:56:45.1952694Z tt.reduce.return %108 : bf16 2026-02-21T10:56:45.1952947Z }) : (tensor<64x16xbf16>) -> tensor<64xbf16> 2026-02-21T10:56:45.1953192Z %87 = arith.mulf %85, %85 : tensor<64x16xbf16> 2026-02-21T10:56:45.1953456Z %88 = "tt.reduce"(%87) <{axis = 1 : i32}> ({ 2026-02-21T10:56:45.1953710Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T10:56:45.1953933Z %108 = arith.addf %arg10, %arg11 : bf16 2026-02-21T10:56:45.1954188Z tt.reduce.return %108 : bf16 2026-02-21T10:56:45.1954408Z }) : (tensor<64x16xbf16>) -> 
tensor<64xbf16> 2026-02-21T10:56:45.1954697Z %89 = arith.extf %86 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T10:56:45.1954954Z %90 = arith.divf %89, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1955271Z %91 = arith.mulf %86, %86 : tensor<64xbf16> 2026-02-21T10:56:45.1955521Z %92 = arith.extf %91 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T10:56:45.1955808Z %93 = arith.divf %92, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1956090Z %94 = arith.extf %88 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T10:56:45.1956342Z %95 = arith.subf %94, %93 : tensor<64xf32> 2026-02-21T10:56:45.1956602Z %96 = arith.subf %90, %76 : tensor<64xf32> 2026-02-21T10:56:45.1956832Z %97 = arith.addf %72, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1957090Z %98 = arith.divf %cst, %97 : tensor<64xf32> 2026-02-21T10:56:45.1957322Z %99 = arith.mulf %98, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1957580Z %100 = arith.mulf %96, %99 : tensor<64xf32> 2026-02-21T10:56:45.1957839Z %101 = arith.addf %76, %100 : tensor<64xf32> 2026-02-21T10:56:45.1958119Z %102 = arith.addf %82, %95 : tensor<64xf32> 2026-02-21T10:56:45.1958379Z %103 = arith.mulf %96, %96 : tensor<64xf32> 2026-02-21T10:56:45.1958614Z %104 = arith.mulf %72, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1958875Z %105 = arith.divf %104, %97 : tensor<64xf32> 2026-02-21T10:56:45.1959110Z %106 = arith.mulf %103, %105 : tensor<64xf32> 2026-02-21T10:56:45.1959371Z %107 = arith.addf %102, %106 : tensor<64xf32> 2026-02-21T10:56:45.1959676Z scf.yield %97, %101, %107 : tensor<64xf32>, tensor<64xf32>, tensor<64xf32> 2026-02-21T10:56:45.1959960Z } {tt.num_stages = 1 : i32} 2026-02-21T10:56:45.1960310Z %6 = tt.descriptor_load %0[%4, %c1008_i32] : !tt.tensordesc> -> tensor<64x16xbf16> 2026-02-21T10:56:45.1960644Z %7 = "tt.reduce"(%6) <{axis = 1 : i32}> ({ 2026-02-21T10:56:45.1960898Z ^bb0(%arg6: bf16, %arg7: bf16): 2026-02-21T10:56:45.1961113Z %35 = arith.addf %arg6, %arg7 : bf16 2026-02-21T10:56:45.1961365Z tt.reduce.return %35 : bf16 
2026-02-21T10:56:45.1961590Z }) : (tensor<64x16xbf16>) -> tensor<64xbf16> 2026-02-21T10:56:45.1961884Z %8 = arith.mulf %6, %6 : tensor<64x16xbf16> 2026-02-21T10:56:45.1962143Z %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({ 2026-02-21T10:56:45.1962362Z ^bb0(%arg6: bf16, %arg7: bf16): 2026-02-21T10:56:45.1962602Z %35 = arith.addf %arg6, %arg7 : bf16 2026-02-21T10:56:45.1962822Z tt.reduce.return %35 : bf16 2026-02-21T10:56:45.1963064Z }) : (tensor<64x16xbf16>) -> tensor<64xbf16> 2026-02-21T10:56:45.1963309Z %10 = arith.extf %7 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T10:56:45.1963589Z %11 = arith.divf %10, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1963846Z %12 = arith.mulf %7, %7 : tensor<64xbf16> 2026-02-21T10:56:45.1964098Z %13 = arith.extf %12 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T10:56:45.1964381Z %14 = arith.divf %13, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1964635Z %15 = arith.extf %9 : tensor<64xbf16> to tensor<64xf32> 2026-02-21T10:56:45.1964911Z %16 = arith.subf %15, %14 : tensor<64xf32> 2026-02-21T10:56:45.1965139Z %17 = arith.subf %11, %5#1 : tensor<64xf32> 2026-02-21T10:56:45.1965402Z %18 = arith.addf %5#0, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1965636Z %19 = arith.divf %cst, %18 : tensor<64xf32> 2026-02-21T10:56:45.1965891Z %20 = arith.mulf %19, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1966146Z %21 = arith.mulf %17, %20 : tensor<64xf32> 2026-02-21T10:56:45.1966373Z %22 = arith.addf %5#1, %21 : tensor<64xf32> 2026-02-21T10:56:45.1966630Z %23 = arith.addf %5#2, %16 : tensor<64xf32> 2026-02-21T10:56:45.1966856Z %24 = arith.mulf %17, %17 : tensor<64xf32> 2026-02-21T10:56:45.1967109Z %25 = arith.mulf %5#0, %cst_0 : tensor<64xf32> 2026-02-21T10:56:45.1967340Z %26 = arith.divf %25, %18 : tensor<64xf32> 2026-02-21T10:56:45.1967596Z %27 = arith.mulf %24, %26 : tensor<64xf32> 2026-02-21T10:56:45.1967846Z %28 = arith.addf %23, %27 : tensor<64xf32> 2026-02-21T10:56:45.1968112Z %29 = arith.divf %28, %18 : tensor<64xf32> 
2026-02-21T10:56:45.1968370Z %30 = tt.splat %arg4 : f32 -> tensor<64xf32> 2026-02-21T10:56:45.1968602Z %31 = arith.addf %29, %30 : tensor<64xf32> 2026-02-21T10:56:45.1969021Z %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_rsqrtf"} : (tensor<64xf32>) -> tensor<64xf32> 2026-02-21T10:56:45.1969465Z %33 = tt.expand_dims %22 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> 2026-02-21T10:56:45.1969832Z %34 = tt.expand_dims %32 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> 2026-02-21T10:56:45.1970174Z scf.for %arg6 = %c0_i32 to %c1024_i32 step %c32_i32 : i32 { 2026-02-21T10:56:45.1970481Z %35 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T10:56:45.1970795Z %36 = tt.splat %arg6 : i32 -> tensor<32xi32> 2026-02-21T10:56:45.1971093Z %37 = arith.addi %36, %35 : tensor<32xi32> 2026-02-21T10:56:45.1971447Z %38 = tt.descriptor_load %1[%4, %arg6] : !tt.tensordesc> -> tensor<64x32xbf16> 2026-02-21T10:56:45.1971811Z %39 = tt.splat %arg1 : !tt.ptr -> tensor<32x!tt.ptr> 2026-02-21T10:56:45.1972174Z %40 = tt.addptr %39, %37 : tensor<32x!tt.ptr>, tensor<32xi32> 2026-02-21T10:56:45.1972521Z %41 = tt.load %40 evictionPolicy = evict_last : tensor<32x!tt.ptr> 2026-02-21T10:56:45.1972862Z %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<32xbf16> -> tensor<1x32xbf16> 2026-02-21T10:56:45.1973216Z %43 = tt.splat %arg2 : !tt.ptr -> tensor<32x!tt.ptr> 2026-02-21T10:56:45.1973512Z %44 = tt.addptr %43, %37 : tensor<32x!tt.ptr>, tensor<32xi32> 2026-02-21T10:56:45.1973813Z %45 = tt.load %44 : tensor<32x!tt.ptr> 2026-02-21T10:56:45.1974127Z %46 = tt.expand_dims %45 {axis = 0 : i32} : tensor<32xbf16> -> tensor<1x32xbf16> 2026-02-21T10:56:45.1974453Z %47 = arith.extf %38 : tensor<64x32xbf16> to tensor<64x32xf32> 2026-02-21T10:56:45.1974778Z %48 = tt.broadcast %33 : tensor<64x1xf32> -> tensor<64x32xf32> 2026-02-21T10:56:45.1975052Z %49 = arith.subf %47, %48 : tensor<64x32xf32> 2026-02-21T10:56:45.1975340Z %50 = 
tt.broadcast %34 : tensor<64x1xf32> -> tensor<64x32xf32> 2026-02-21T10:56:45.1975604Z %51 = arith.mulf %49, %50 : tensor<64x32xf32> 2026-02-21T10:56:45.1975895Z %52 = arith.extf %42 : tensor<1x32xbf16> to tensor<1x32xf32> 2026-02-21T10:56:45.1976202Z %53 = tt.broadcast %52 : tensor<1x32xf32> -> tensor<64x32xf32> 2026-02-21T10:56:45.1976460Z %54 = arith.mulf %51, %53 : tensor<64x32xf32> 2026-02-21T10:56:45.1976743Z %55 = arith.extf %46 : tensor<1x32xbf16> to tensor<1x32xf32> 2026-02-21T10:56:45.1977027Z %56 = tt.broadcast %55 : tensor<1x32xf32> -> tensor<64x32xf32> 2026-02-21T10:56:45.1977313Z %57 = arith.addf %54, %56 : tensor<64x32xf32> 2026-02-21T10:56:45.1977582Z %58 = arith.truncf %57 : tensor<64x32xf32> to tensor<64x32xbf16> 2026-02-21T10:56:45.1977967Z tt.descriptor_store %2[%4, %arg6], %58 : !tt.tensordesc>, tensor<64x32xbf16> 2026-02-21T10:56:45.1978314Z } {tt.num_stages = 1 : i32} 2026-02-21T10:56:45.1978600Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T10:56:45.1978916Z tt.return 2026-02-21T10:56:45.1979083Z } 2026-02-21T10:56:45.1979267Z } 2026-02-21T10:56:45.1979358Z 2026-02-21T10:56:45.1979427Z {-# 2026-02-21T10:56:45.1979628Z external_resources: { 2026-02-21T10:56:45.1979819Z mlir_reproducer: { 2026-02-21T10:56:45.1984669Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, 
tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T10:56:45.1989482Z disable_threading: false, 2026-02-21T10:56:45.1989714Z verify_each: true 2026-02-21T10:56:45.1989968Z } 2026-02-21T10:56:45.1990150Z } 2026-02-21T10:56:45.1990375Z #-} 2026-02-21T10:56:45.1990930Z /tmp/torchinductor_root/tc/ctc7h5im7px25rte73a6qochgm5f4dml7tvoubjmhldxzgtxa4h6.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T10:56:45.1992362Z /tmp/torchinductor_root/tc/ctc7h5im7px25rte73a6qochgm5f4dml7tvoubjmhldxzgtxa4h6.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T10:56:45.1993521Z [152s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T10:56:45.1994856Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 16, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first', 'last', ''], num_sm_multiplier=2, num_stages=8, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None, None], range_multi_buffers=[False, None, True], range_num_stages=[2, 1, 1], range_unroll_factors=[0, 3, 0], range_warp_specializes=[True, None, None]), static_shapes=True) 2026-02-21T10:56:45.1996075Z Error: RuntimeError: PassManager::run failed 2026-02-21T10:56:45.1996487Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T10:56:47.0128664Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 100/100 16.4 configs/s 2026-02-21T10:56:59.1535534Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━ 856/856 70.3 configs/s 2026-02-21T10:56:59.3997828Z [166s] Generation 3 complete: 2026-02-21T10:56:59.4002200Z error=2 2026-02-21T10:56:59.4003365Z ok=100 2026-02-21T10:56:59.4003609Z min=0.2468 2026-02-21T10:56:59.4003790Z mid=0.3788 2026-02-21T10:56:59.4003981Z max=3.3013 2026-02-21T10:56:59.4004159Z best={'block_sizes': [32, 64, 64], 2026-02-21T10:56:59.4004486Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:56:59.4004811Z 'load_eviction_policies': ['', '', 'last', ''], 2026-02-21T10:56:59.4005041Z 'num_stages': 1, 2026-02-21T10:56:59.4005251Z 'num_warps': 8, 2026-02-21T10:56:59.4005432Z 'pid_type': 'flat', 2026-02-21T10:56:59.4005656Z 'range_flattens': [None, None, False], 2026-02-21T10:56:59.4005897Z 'range_multi_buffers': [None, None, None], 2026-02-21T10:56:59.4006175Z 'range_num_stages': [0, 0, 0], 2026-02-21T10:56:59.4006692Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T10:56:59.4006961Z 'range_warp_specializes': [None, None, None]} 2026-02-21T10:56:59.4028255Z [166s] Fitting surrogate: 416 points, 416 targets 
2026-02-21T10:57:00.7023325Z [167s] Generation 4 starting: 98 neighbors, 5 active search path(s) 2026-02-21T10:57:06.7853808Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 102/102 16.4 configs/s 2026-02-21T10:57:12.9335464Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 102/102 16.7 configs/s 2026-02-21T10:57:30.0377121Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 917/917 54.2 configs/s 2026-02-21T10:57:30.3561755Z [197s] Generation 4 complete: 2026-02-21T10:57:30.3565606Z error=1 2026-02-21T10:57:30.3569978Z ok=103 2026-02-21T10:57:30.3573779Z min=0.2305 2026-02-21T10:57:30.3577645Z mid=0.3348 2026-02-21T10:57:30.3579270Z max=3.0515 2026-02-21T10:57:30.3579570Z best={'block_sizes': [32, 64, 128], 2026-02-21T10:57:30.3585020Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:57:30.3586764Z 'load_eviction_policies': ['', '', 'last', ''], 2026-02-21T10:57:30.3587118Z 'num_stages': 1, 2026-02-21T10:57:30.3592388Z 'num_warps': 8, 2026-02-21T10:57:30.3594168Z 'pid_type': 'flat', 2026-02-21T10:57:30.3598803Z 'range_flattens': [None, False, False], 2026-02-21T10:57:30.3602417Z 'range_multi_buffers': [None, None, None], 2026-02-21T10:57:30.3604683Z 'range_num_stages': [0, 0, 0], 2026-02-21T10:57:30.3609147Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T10:57:30.3612834Z 'range_warp_specializes': [None, None, None]} 2026-02-21T10:57:30.3615789Z [197s] Fitting surrogate: 520 points, 520 targets 2026-02-21T10:57:31.7161036Z [198s] Generation 5 starting: 100 neighbors, 5 active search path(s) 2026-02-21T10:57:38.2477698Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 103/103 13.0 configs/s 2026-02-21T10:57:39.5349420Z [206s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 128], indexing=['pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', '', 'first', ''], maxnreg=128, num_sm_multiplier=1, num_stages=2, 
num_warps=4, pid_type='persistent_blocked', range_flattens=[True, False, None], range_multi_buffers=[True, False, True], range_num_stages=[1, 0, 2], range_unroll_factors=[3, 0, 3], range_warp_specializes=[False, True, None]) 2026-02-21T10:57:39.5350837Z Tensor-likes are not close! 2026-02-21T10:57:39.5355961Z 2026-02-21T10:57:39.5360534Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T10:57:39.5364779Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T10:57:39.5366402Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T10:57:39.5366841Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T10:57:39.5371422Z 2026-02-21T10:57:41.3023785Z [208s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', '', 'first', 'last'], num_sm_multiplier=16, num_stages=7, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False, True], range_multi_buffers=[False, False, False], range_num_stages=[2, 3, 3], range_unroll_factors=[0, 1, 1], range_warp_specializes=[True, None, None]) 2026-02-21T10:57:41.3025066Z Tensor-likes are not close! 2026-02-21T10:57:41.3030122Z 2026-02-21T10:57:41.3031705Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T10:57:41.3032232Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T10:57:41.3032653Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T10:57:41.3033031Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T10:57:41.3033218Z 2026-02-21T10:57:41.6169236Z [208s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', '', 'first', 'last'], num_sm_multiplier=16, num_stages=7, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, False, False], range_multi_buffers=[False, None, False], range_num_stages=[2, 4, 4], range_unroll_factors=[0, 1, 1], range_warp_specializes=[True, None, None]) 2026-02-21T10:57:41.6170813Z Tensor-likes are not close! 2026-02-21T10:57:41.6174616Z 2026-02-21T10:57:41.6178727Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T10:57:41.6182512Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T10:57:41.6187212Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T10:57:41.6187680Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T10:57:41.6191535Z 2026-02-21T10:57:41.8087052Z [209s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', '', 'first', 'last'], maxnreg=256, num_sm_multiplier=16, num_stages=7, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False, False], range_multi_buffers=[False, False, False], range_num_stages=[2, 3, 4], range_unroll_factors=[0, 1, 1], range_warp_specializes=[True, None, None]) 2026-02-21T10:57:41.8088351Z Tensor-likes are not close! 
2026-02-21T10:57:41.8092541Z 2026-02-21T10:57:41.8097114Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T10:57:41.8098548Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T10:57:41.8099000Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T10:57:41.8099351Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T10:57:41.8099567Z 2026-02-21T10:57:44.5719386Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 103/103 16.4 configs/s 2026-02-21T10:57:55.3763560Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 92.3 configs/s 2026-02-21T10:57:55.6227055Z [222s] Generation 5 complete: 2026-02-21T10:57:55.6230247Z error=4 2026-02-21T10:57:55.6234676Z ok=102 2026-02-21T10:57:55.6236873Z min=0.2091 2026-02-21T10:57:55.6237104Z mid=0.3286 2026-02-21T10:57:55.6237278Z max=2.1443 2026-02-21T10:57:55.6237487Z best={'block_sizes': [8, 64, 128], 2026-02-21T10:57:55.6237781Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:57:55.6238115Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T10:57:55.6238349Z 'num_stages': 1, 2026-02-21T10:57:55.6238560Z 'num_warps': 2, 2026-02-21T10:57:55.6238739Z 'pid_type': 'flat', 2026-02-21T10:57:55.6238964Z 'range_flattens': [None, False, False], 2026-02-21T10:57:55.6239195Z 'range_multi_buffers': [None, None, None], 2026-02-21T10:57:55.6239463Z 'range_num_stages': [0, 0, 0], 2026-02-21T10:57:55.6256542Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T10:57:55.6256862Z 'range_warp_specializes': [None, None, None]} 2026-02-21T10:57:55.6257168Z [222s] Fitting surrogate: 626 points, 626 targets 2026-02-21T10:57:56.9910736Z [224s] Generation 6 starting: 93 neighbors, 5 active search path(s) 2026-02-21T10:58:02.4846464Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 96/96 17.5 configs/s 2026-02-21T10:58:04.6038497Z [231s] Skipping config with accuracy mismatch: 
helion.Config(block_sizes=[32, 512, 128], indexing=['pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last', 'first', ''], num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, False, None], range_multi_buffers=[None, None, None], range_num_stages=[0, 0, 2], range_unroll_factors=[0, 0, 3], range_warp_specializes=[None, True, None]) 2026-02-21T10:58:04.6039628Z Tensor-likes are not close! 2026-02-21T10:58:04.6039769Z 2026-02-21T10:58:04.6039894Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T10:58:04.6040616Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T10:58:04.6040997Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T10:58:04.6041372Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T10:58:04.6041555Z 2026-02-21T10:58:08.2633670Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 96/96 16.7 configs/s 2026-02-21T10:58:21.2319358Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 77.0 configs/s 2026-02-21T10:58:21.5116640Z [248s] Generation 6 complete: 2026-02-21T10:58:21.5120877Z error=1 2026-02-21T10:58:21.5122553Z ok=98 2026-02-21T10:58:21.5122783Z min=0.2078 2026-02-21T10:58:21.5122955Z mid=0.3032 2026-02-21T10:58:21.5123146Z max=1.3404 2026-02-21T10:58:21.5123326Z best={'block_sizes': [32, 64, 128], 2026-02-21T10:58:21.5123667Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T10:58:21.5124357Z 'load_eviction_policies': ['last', 'last', 'first', ''], 2026-02-21T10:58:21.5124662Z 'num_stages': 2, 2026-02-21T10:58:21.5124845Z 'num_warps': 8, 2026-02-21T10:58:21.5125061Z 'pid_type': 'flat', 2026-02-21T10:58:21.5125263Z 'range_flattens': [None, False, None], 2026-02-21T10:58:21.5125534Z 'range_multi_buffers': [None, False, None], 2026-02-21T10:58:21.5125791Z 'range_num_stages': [0, 0, 2], 
2026-02-21T10:58:21.5126005Z 'range_unroll_factors': [0, 0, 3], 2026-02-21T10:58:21.5126272Z 'range_warp_specializes': [None, True, None]} 2026-02-21T10:58:21.5152118Z [248s] Fitting surrogate: 725 points, 725 targets 2026-02-21T10:58:22.9292592Z [250s] Generation 7 starting: 97 neighbors, 5 active search path(s) 2026-02-21T10:58:28.9912046Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 101/101 13.7 configs/s 2026-02-21T10:58:31.2832959Z [258s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', '', '', 'last'], num_sm_multiplier=16, num_stages=7, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False, False], range_multi_buffers=[False, False, False], range_num_stages=[2, 4, 4], range_unroll_factors=[0, 1, 1], range_warp_specializes=[True, None, None]) 2026-02-21T10:58:31.2834250Z Tensor-likes are not close! 2026-02-21T10:58:31.2836397Z 2026-02-21T10:58:31.2836714Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T10:58:31.2837081Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T10:58:31.2837504Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T10:58:31.2837859Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T10:58:31.2838072Z 2026-02-21T10:58:32.5573550Z [259s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['last', 'first', 'last', ''], num_sm_multiplier=2, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[True, None, False], range_multi_buffers=[False, True, None], range_num_stages=[2, 1, 2], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, True, None]) 2026-02-21T10:58:32.5575084Z Tensor-likes are not close! 2026-02-21T10:58:32.5579627Z 2026-02-21T10:58:32.5579932Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T10:58:32.5580339Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T10:58:32.5580822Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T10:58:32.5581216Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T10:58:32.5581401Z 2026-02-21T10:58:33.0670478Z [260s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first', 'last', ''], num_sm_multiplier=2, num_stages=8, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None, False], range_multi_buffers=[False, True, True], range_num_stages=[2, 1, 2], range_unroll_factors=[0, 0, 1], range_warp_specializes=[None, True, None]) 2026-02-21T10:58:33.0671781Z Tensor-likes are not close! 
2026-02-21T10:58:33.0676063Z 2026-02-21T10:58:33.0677829Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T10:58:33.0678264Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T10:58:33.0682185Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T10:58:33.0682674Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T10:58:33.0686296Z 2026-02-21T10:58:34.9146405Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 101/101 17.2 configs/s 2026-02-21T10:58:45.8564493Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 91.1 configs/s 2026-02-21T10:58:46.0957859Z [273s] Generation 7 complete: 2026-02-21T10:58:46.0959904Z error=4 2026-02-21T10:58:46.0960145Z ok=99 2026-02-21T10:58:46.0960347Z min=0.1936 2026-02-21T10:58:46.0964940Z mid=0.2857 2026-02-21T10:58:46.0966540Z max=2.5570 2026-02-21T10:58:46.0966839Z best={'block_sizes': [8, 64, 512], 2026-02-21T10:58:46.0971691Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:58:46.0976088Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T10:58:46.0977455Z 'num_stages': 1, 2026-02-21T10:58:46.0977748Z 'num_warps': 2, 2026-02-21T10:58:46.0982682Z 'pid_type': 'flat', 2026-02-21T10:58:46.0986601Z 'range_flattens': [None, None, False], 2026-02-21T10:58:46.0988159Z 'range_multi_buffers': [None, None, None], 2026-02-21T10:58:46.0988483Z 'range_num_stages': [0, 0, 0], 2026-02-21T10:58:46.0993315Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T10:58:46.0997890Z 'range_warp_specializes': [None, None, None]} 2026-02-21T10:58:46.1002327Z [273s] Fitting surrogate: 828 points, 828 targets 2026-02-21T10:58:47.4278427Z [274s] Generation 8 starting: 90 neighbors, 5 active search path(s) 2026-02-21T10:58:52.6325747Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94/94 49.5 configs/s 2026-02-21T10:58:55.9088977Z [283s] Skipping config with accuracy mismatch: 
helion.Config(block_sizes=[32, 512, 256], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', ''], num_sm_multiplier=2, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[None, None, False], range_multi_buffers=[False, True, True], range_num_stages=[3, 1, 2], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, True, None]) 2026-02-21T10:58:55.9090222Z Tensor-likes are not close! 2026-02-21T10:58:55.9095447Z 2026-02-21T10:58:55.9100625Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T10:58:55.9105254Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T10:58:55.9109274Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T10:58:55.9110885Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T10:58:55.9111141Z 2026-02-21T10:58:55.9191677Z [283s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last', 'last', ''], num_sm_multiplier=2, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[True, None, False], range_multi_buffers=[False, True, True], range_num_stages=[3, 1, 2], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, True, None]) 2026-02-21T10:58:55.9196341Z Tensor-likes are not close! 2026-02-21T10:58:55.9199774Z 2026-02-21T10:58:55.9200157Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T10:58:55.9200868Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T10:58:55.9201273Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T10:58:55.9201648Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T10:58:55.9201834Z 2026-02-21T10:58:56.4780585Z [283s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', ''], num_sm_multiplier=2, num_stages=8, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, None, False], range_multi_buffers=[False, None, True], range_num_stages=[3, 1, 2], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, True, None]) 2026-02-21T10:58:56.4781990Z Tensor-likes are not close! 2026-02-21T10:58:56.4786525Z 2026-02-21T10:58:56.4790590Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T10:58:56.4795045Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T10:58:56.4798906Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T10:58:56.4799404Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T10:58:56.4799614Z 2026-02-21T10:58:56.6692008Z [283s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[64, 512, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', ''], num_sm_multiplier=2, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[True, None, False], range_multi_buffers=[False, True, True], range_num_stages=[3, 1, 2], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, True, None]) 2026-02-21T10:58:56.6693315Z Tensor-likes are not close! 
2026-02-21T10:58:56.6695354Z 2026-02-21T10:58:56.6695602Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T10:58:56.6696032Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T10:58:56.6700525Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T10:58:56.6704900Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T10:58:56.6708896Z 2026-02-21T10:58:56.6795149Z [283s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 256], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', ''], num_sm_multiplier=2, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[None, None, None], range_multi_buffers=[False, True, True], range_num_stages=[3, 1, 2], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, True, None]) 2026-02-21T10:58:56.6796421Z Tensor-likes are not close! 2026-02-21T10:58:56.6800437Z 2026-02-21T10:58:56.6804511Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T10:58:56.6804965Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T10:58:56.6805615Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T10:58:56.6810494Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T10:58:56.6812117Z 2026-02-21T10:58:58.0691008Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 94/94 17.4 configs/s 2026-02-21T10:59:09.3755893Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 90.1 configs/s 2026-02-21T10:59:09.6243979Z [296s] Generation 8 complete: 2026-02-21T10:59:09.6248382Z error=5 2026-02-21T10:59:09.6252735Z ok=90 2026-02-21T10:59:09.6254205Z min=0.1914 2026-02-21T10:59:09.6254454Z mid=0.2846 2026-02-21T10:59:09.6254628Z max=2.4290 2026-02-21T10:59:09.6254848Z best={'block_sizes': [8, 64, 512], 2026-02-21T10:59:09.6255185Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:59:09.6255502Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T10:59:09.6255770Z 'num_stages': 1, 2026-02-21T10:59:09.6256282Z 'num_warps': 2, 2026-02-21T10:59:09.6256534Z 'pid_type': 'flat', 2026-02-21T10:59:09.6256747Z 'range_flattens': [None, None, False], 2026-02-21T10:59:09.6257032Z 'range_multi_buffers': [None, None, None], 2026-02-21T10:59:09.6257269Z 'range_num_stages': [0, 0, 0], 2026-02-21T10:59:09.6257515Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T10:59:09.6257786Z 'range_warp_specializes': [None, None, None]} 2026-02-21T10:59:09.6275315Z [296s] Fitting surrogate: 923 points, 923 targets 2026-02-21T10:59:10.9920109Z [298s] Generation 9 starting: 89 neighbors, 5 active search path(s) 2026-02-21T10:59:15.7041320Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91/91 65.4 configs/s 2026-02-21T10:59:17.8841382Z [305s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', '', '', 'last'], num_sm_multiplier=16, num_stages=7, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, False, None], range_multi_buffers=[None, False, False], range_num_stages=[3, 3, 3], range_unroll_factors=[0, 1, 1], 
range_warp_specializes=[True, None, None]) 2026-02-21T10:59:17.8842892Z Tensor-likes are not close! 2026-02-21T10:59:17.8846325Z 2026-02-21T10:59:17.8849717Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T10:59:17.8854232Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T10:59:17.8855722Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T10:59:17.8856140Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T10:59:17.8856326Z 2026-02-21T10:59:19.3669969Z [306s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', ''], num_sm_multiplier=2, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[None, None, False], range_multi_buffers=[False, True, True], range_num_stages=[3, 0, 1], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, True, None]) 2026-02-21T10:59:19.3671261Z Tensor-likes are not close! 2026-02-21T10:59:19.3674822Z 2026-02-21T10:59:19.3678881Z Mismatched elements: 422 / 268435456 (0.0%) 2026-02-21T10:59:19.3683345Z Greatest absolute difference: 0.03125 at index (38640, 908) (up to 0.01 allowed) 2026-02-21T10:59:19.3685166Z Greatest relative difference: 25.0 at index (30765, 437) (up to 0.01 allowed) 2026-02-21T10:59:19.3685576Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T10:59:19.3685761Z 2026-02-21T10:59:21.1174664Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 91/91 16.9 configs/s 2026-02-21T10:59:35.4424241Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 70.9 configs/s 2026-02-21T10:59:35.7337763Z [323s] Generation 9 complete: 2026-02-21T10:59:35.7342293Z error=2 2026-02-21T10:59:35.7346508Z ok=92 2026-02-21T10:59:35.7350841Z min=0.1915 2026-02-21T10:59:35.7354742Z mid=0.2673 2026-02-21T10:59:35.7359141Z max=1.5267 2026-02-21T10:59:35.7363460Z best={'block_sizes': [8, 64, 512], 2026-02-21T10:59:35.7363874Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:59:35.7368718Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T10:59:35.7373045Z 'num_stages': 1, 2026-02-21T10:59:35.7377499Z 'num_warps': 2, 2026-02-21T10:59:35.7378864Z 'pid_type': 'flat', 2026-02-21T10:59:35.7379118Z 'range_flattens': [None, None, False], 2026-02-21T10:59:35.7379401Z 'range_multi_buffers': [None, None, None], 2026-02-21T10:59:35.7379668Z 'range_num_stages': [0, 0, 1], 2026-02-21T10:59:35.7379886Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T10:59:35.7380163Z 'range_warp_specializes': [None, None, None]} 2026-02-21T10:59:35.7380431Z [323s] Fitting surrogate: 1017 points, 1017 targets 2026-02-21T10:59:36.5883275Z [323s] Generation 10 starting: 51 neighbors, 3 active search path(s) 2026-02-21T10:59:42.1001458Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 5.9 configs/s 2026-02-21T10:59:43.5145275Z [330s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', '', '', 'last'], num_sm_multiplier=16, num_stages=7, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, False, False], range_multi_buffers=[None, True, False], range_num_stages=[3, 3, 3], range_unroll_factors=[2, 1, 1], 
range_warp_specializes=[None, False, False]) 2026-02-21T10:59:43.5146584Z Tensor-likes are not close! 2026-02-21T10:59:43.5150262Z 2026-02-21T10:59:43.5155526Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T10:59:43.5160022Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T10:59:43.5161364Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T10:59:43.5161769Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T10:59:43.5162040Z 2026-02-21T10:59:44.0702423Z [331s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', '', '', 'last'], num_sm_multiplier=16, num_stages=7, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, False, False], range_multi_buffers=[None, True, False], range_num_stages=[3, 3, 3], range_unroll_factors=[2, 1, 1], range_warp_specializes=[None, False, False]) 2026-02-21T10:59:44.0703629Z Tensor-likes are not close! 2026-02-21T10:59:44.0703765Z 2026-02-21T10:59:44.0703863Z Mismatched elements: 17 / 268435456 (0.0%) 2026-02-21T10:59:44.0704212Z Greatest absolute difference: 0.013671875 at index (32168, 938) (up to 0.01 allowed) 2026-02-21T10:59:44.0704622Z Greatest relative difference: 1.9296875 at index (95396, 437) (up to 0.01 allowed) 2026-02-21T10:59:44.0705006Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T10:59:44.0705188Z 2026-02-21T10:59:44.6332966Z [331s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 64], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', ''], num_stages=8, num_warps=32, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, True, None], range_num_stages=[0, 1, 1], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, True, None]) 2026-02-21T10:59:44.6334203Z Tensor-likes are not close! 2026-02-21T10:59:44.6337970Z 2026-02-21T10:59:44.6342575Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T10:59:44.6344745Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T10:59:44.6345381Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T10:59:44.6348964Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T10:59:44.6353066Z 2026-02-21T10:59:45.0595790Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 52/52 17.8 configs/s 2026-02-21T10:59:53.7522093Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 114.5 2026-02-21T10:59:53.7526199Z configs/s 2026-02-21T10:59:53.9631428Z [341s] Generation 10 complete: 2026-02-21T10:59:53.9635835Z error=4 2026-02-21T10:59:53.9637381Z ok=51 2026-02-21T10:59:53.9637637Z min=0.1894 2026-02-21T10:59:53.9637822Z mid=0.2642 2026-02-21T10:59:53.9638018Z max=1.4460 2026-02-21T10:59:53.9638258Z best={'block_sizes': [8, 64, 512], 2026-02-21T10:59:53.9638607Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T10:59:53.9642375Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T10:59:53.9644295Z 'num_stages': 1, 2026-02-21T10:59:53.9644959Z 'num_warps': 2, 2026-02-21T10:59:53.9649155Z 'pid_type': 'flat', 2026-02-21T10:59:53.9650734Z 'range_flattens': [None, None, False], 2026-02-21T10:59:53.9651061Z 'range_multi_buffers': 
[None, None, None], 2026-02-21T10:59:53.9653627Z 'range_num_stages': [0, 0, 1], 2026-02-21T10:59:53.9653936Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T10:59:53.9654198Z 'range_warp_specializes': [None, None, None]} 2026-02-21T10:59:53.9672910Z [341s] Fitting surrogate: 1072 points, 1072 targets 2026-02-21T10:59:54.6571338Z [341s] Generation 11 starting: 35 neighbors, 2 active search path(s) 2026-02-21T10:59:56.9719441Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 33.0 configs/s 2026-02-21T10:59:58.5479701Z [345s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 128], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', ''], num_stages=8, num_warps=32, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, True, False], range_num_stages=[0, 1, 1], range_unroll_factors=[0, 4, 1], range_warp_specializes=[None, False, True]) 2026-02-21T10:59:58.5480860Z Tensor-likes are not close! 2026-02-21T10:59:58.5486228Z 2026-02-21T10:59:58.5490328Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T10:59:58.5491757Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T10:59:58.5492407Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T10:59:58.5492788Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T10:59:58.5492973Z 2026-02-21T10:59:58.5582152Z [345s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 1024, 128], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', ''], num_stages=8, num_warps=32, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, False, False], range_num_stages=[0, 1, 1], range_unroll_factors=[0, 3, 0], range_warp_specializes=[None, False, True]) 2026-02-21T10:59:58.5583512Z Tensor-likes are not close! 2026-02-21T10:59:58.5583652Z 2026-02-21T10:59:58.5583778Z Mismatched elements: 422 / 268435456 (0.0%) 2026-02-21T10:59:58.5584113Z Greatest absolute difference: 0.03125 at index (38640, 908) (up to 0.01 allowed) 2026-02-21T10:59:58.5584517Z Greatest relative difference: 25.0 at index (30765, 437) (up to 0.01 allowed) 2026-02-21T10:59:58.5588274Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T10:59:58.5592475Z 2026-02-21T10:59:58.9822505Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 35/35 17.8 configs/s 2026-02-21T11:00:06.3992120Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 134.2 2026-02-21T11:00:06.3992977Z configs/s 2026-02-21T11:00:06.5994498Z [353s] Generation 11 complete: 2026-02-21T11:00:06.5998690Z error=2 2026-02-21T11:00:06.6002767Z ok=36 2026-02-21T11:00:06.6007229Z min=0.1924 2026-02-21T11:00:06.6011088Z mid=0.2263 2026-02-21T11:00:06.6015268Z max=1.1337 2026-02-21T11:00:06.6017292Z best={'block_sizes': [8, 64, 512], 2026-02-21T11:00:06.6017692Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:00:06.6022408Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T11:00:06.6026759Z 'num_stages': 1, 2026-02-21T11:00:06.6030643Z 'num_warps': 2, 2026-02-21T11:00:06.6034582Z 'pid_type': 'flat', 2026-02-21T11:00:06.6038601Z 'range_flattens': [None, None, False], 2026-02-21T11:00:06.6042472Z 
'range_multi_buffers': [None, None, None], 2026-02-21T11:00:06.6044456Z 'range_num_stages': [0, 0, 1], 2026-02-21T11:00:06.6044770Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:00:06.6049652Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:00:06.6053476Z [353s] Fitting surrogate: 1110 points, 1110 targets 2026-02-21T11:00:07.3367727Z [354s] Generation 12 starting: 38 neighbors, 2 active search path(s) 2026-02-21T11:00:09.7267589Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 34.6 configs/s 2026-02-21T11:00:11.4850280Z [358s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 256], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', ''], num_stages=8, num_warps=32, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, True, False], range_num_stages=[0, 1, 1], range_unroll_factors=[0, 3, 0], range_warp_specializes=[None, False, True]) 2026-02-21T11:00:11.4851408Z Tensor-likes are not close! 2026-02-21T11:00:11.4851546Z 2026-02-21T11:00:11.4851644Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T11:00:11.4852360Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T11:00:11.4852757Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T11:00:11.4853168Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:00:11.4853364Z 2026-02-21T11:00:12.0326117Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 39/39 17.2 configs/s 2026-02-21T11:00:20.8271075Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 116.5 2026-02-21T11:00:20.8275208Z configs/s 2026-02-21T11:00:21.0501118Z [368s] Generation 12 complete: 2026-02-21T11:00:21.0505509Z error=1 2026-02-21T11:00:21.0507127Z ok=40 2026-02-21T11:00:21.0507403Z min=0.1914 2026-02-21T11:00:21.0512767Z mid=0.2407 2026-02-21T11:00:21.0516707Z max=1.1986 2026-02-21T11:00:21.0518867Z best={'block_sizes': [8, 64, 512], 2026-02-21T11:00:21.0519231Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:00:21.0519540Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T11:00:21.0519812Z 'num_stages': 1, 2026-02-21T11:00:21.0520020Z 'num_warps': 2, 2026-02-21T11:00:21.0520233Z 'pid_type': 'flat', 2026-02-21T11:00:21.0520475Z 'range_flattens': [None, None, False], 2026-02-21T11:00:21.0520717Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:00:21.0525083Z 'range_num_stages': [0, 0, 1], 2026-02-21T11:00:21.0529442Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:00:21.0534506Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:00:21.0543406Z [368s] Fitting surrogate: 1151 points, 1151 targets 2026-02-21T11:00:21.7921310Z [369s] Generation 13 starting: 40 neighbors, 2 active search path(s) 2026-02-21T11:00:24.4061661Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 40/40 22.0 configs/s 2026-02-21T11:00:26.2552104Z [373s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 1024, 256], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', 'first'], num_sm_multiplier=2, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[None, None, False], range_multi_buffers=[False, False, False], range_num_stages=[3, 1, 1], 
range_unroll_factors=[0, 3, 0], range_warp_specializes=[None, False, True]) 2026-02-21T11:00:26.2553677Z Tensor-likes are not close! 2026-02-21T11:00:26.2557720Z 2026-02-21T11:00:26.2559385Z Mismatched elements: 422 / 268435456 (0.0%) 2026-02-21T11:00:26.2559748Z Greatest absolute difference: 0.03125 at index (38640, 908) (up to 0.01 allowed) 2026-02-21T11:00:26.2560165Z Greatest relative difference: 25.0 at index (30765, 437) (up to 0.01 allowed) 2026-02-21T11:00:26.2560539Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:00:26.2560724Z 2026-02-21T11:00:26.7394164Z [374s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 256], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', ''], num_sm_multiplier=2, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[None, False, False], range_multi_buffers=[False, False, False], range_num_stages=[3, 1, 2], range_unroll_factors=[0, 3, 0], range_warp_specializes=[None, False, True]) 2026-02-21T11:00:26.7395455Z Tensor-likes are not close! 2026-02-21T11:00:26.7395593Z 2026-02-21T11:00:26.7395690Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T11:00:26.7396031Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T11:00:26.7396417Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T11:00:26.7396787Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:00:26.7396970Z 2026-02-21T11:00:26.7407127Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 40/40 17.5 configs/s 2026-02-21T11:00:34.7438883Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 124.5 2026-02-21T11:00:34.7443009Z configs/s 2026-02-21T11:00:34.9404565Z [382s] Generation 13 complete: 2026-02-21T11:00:34.9408438Z error=2 2026-02-21T11:00:34.9410367Z ok=41 2026-02-21T11:00:34.9410601Z min=0.1906 2026-02-21T11:00:34.9410798Z mid=0.2427 2026-02-21T11:00:34.9410962Z max=2.3387 2026-02-21T11:00:34.9411171Z best={'block_sizes': [8, 64, 512], 2026-02-21T11:00:34.9411535Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:00:34.9415025Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T11:00:34.9418073Z 'num_stages': 1, 2026-02-21T11:00:34.9421508Z 'num_warps': 2, 2026-02-21T11:00:34.9426067Z 'pid_type': 'flat', 2026-02-21T11:00:34.9430948Z 'range_flattens': [None, None, False], 2026-02-21T11:00:34.9435303Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:00:34.9439668Z 'range_num_stages': [0, 0, 1], 2026-02-21T11:00:34.9440927Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:00:34.9441223Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:00:34.9448844Z [382s] Fitting surrogate: 1194 points, 1194 targets 2026-02-21T11:00:35.6670199Z [382s] Generation 14 starting: 38 neighbors, 2 active search path(s) 2026-02-21T11:00:37.9825613Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 38/38 33.0 configs/s 2026-02-21T11:00:40.0478806Z [387s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[32, 512, 256], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', ''], num_sm_multiplier=2, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[None, None, False], range_multi_buffers=[False, False, None], range_num_stages=[3, 1, 1], 
range_unroll_factors=[1, 3, 0], range_warp_specializes=[None, False, True]) 2026-02-21T11:00:40.0480134Z Tensor-likes are not close! 2026-02-21T11:00:40.0480316Z 2026-02-21T11:00:40.0480423Z Mismatched elements: 7 / 268435456 (0.0%) 2026-02-21T11:00:40.0480750Z Greatest absolute difference: 0.015625 at index (202563, 437) (up to 0.01 allowed) 2026-02-21T11:00:40.0481231Z Greatest relative difference: 7.65625 at index (202563, 341) (up to 0.01 allowed) 2026-02-21T11:00:40.0482159Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:00:40.0482360Z 2026-02-21T11:00:40.1687526Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 38/38 17.8 configs/s 2026-02-21T11:00:48.5248643Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 123.3 2026-02-21T11:00:48.5252638Z configs/s 2026-02-21T11:00:48.7293138Z [395s] Generation 14 complete: 2026-02-21T11:00:48.7297447Z error=2 2026-02-21T11:00:48.7299094Z ok=39 2026-02-21T11:00:48.7299343Z min=0.1914 2026-02-21T11:00:48.7299586Z mid=0.2243 2026-02-21T11:00:48.7299781Z max=0.6749 2026-02-21T11:00:48.7300047Z best={'block_sizes': [8, 64, 512], 2026-02-21T11:00:48.7300394Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:00:48.7300790Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T11:00:48.7301435Z 'num_stages': 1, 2026-02-21T11:00:48.7301666Z 'num_warps': 2, 2026-02-21T11:00:48.7302014Z 'pid_type': 'flat', 2026-02-21T11:00:48.7302244Z 'range_flattens': [None, None, False], 2026-02-21T11:00:48.7302582Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:00:48.7302852Z 'range_num_stages': [0, 0, 1], 2026-02-21T11:00:48.7303144Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:00:48.7303416Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:00:48.7328850Z [396s] Fitting surrogate: 1235 points, 1235 targets 2026-02-21T11:00:49.2051409Z [396s] Generation 15 starting: 16 neighbors, 1 active search path(s) 
2026-02-21T11:00:51.1982655Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 12.4 configs/s 2026-02-21T11:00:52.1718981Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.2 configs/s 2026-02-21T11:00:56.5458907Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 225.7 2026-02-21T11:00:56.5463078Z configs/s 2026-02-21T11:00:56.6807510Z [403s] Generation 15 complete: 2026-02-21T11:00:56.6812062Z ok=18 2026-02-21T11:00:56.6813799Z min=0.1934 2026-02-21T11:00:56.6814037Z mid=0.2243 2026-02-21T11:00:56.6814208Z max=0.7849 2026-02-21T11:00:56.6814415Z best={'block_sizes': [8, 64, 512], 2026-02-21T11:00:56.6814706Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:00:56.6815040Z 'load_eviction_policies': ['', '', 'first', ''], 2026-02-21T11:00:56.6815270Z 'num_stages': 1, 2026-02-21T11:00:56.6815476Z 'num_warps': 2, 2026-02-21T11:00:56.6815655Z 'pid_type': 'flat', 2026-02-21T11:00:56.6815879Z 'range_flattens': [None, None, False], 2026-02-21T11:00:56.6816111Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:00:56.6816372Z 'range_num_stages': [0, 0, 1], 2026-02-21T11:00:56.6816604Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:00:56.6816835Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:00:56.6862950Z [403s] Fitting surrogate: 1253 points, 1253 targets 2026-02-21T11:00:56.9743251Z [404s] Autotuning complete in 404.2s after searching 1216 configs. 
2026-02-21T11:00:56.9746918Z One can hardcode the best config and skip autotuning with: 2026-02-21T11:00:56.9751219Z @helion.kernel(config=helion.Config(block_sizes=[8, 64, 512], indexing=['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['', '', 'first', ''], num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, None, None], range_num_stages=[0, 0, 1], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, None, None]), static_shapes=True) 2026-02-21T11:00:56.9752241Z 2026-02-21T11:00:56.9752519Z [404s] Code of selected kernel: /tmp/torchinductor_root/dq/cdqrztlkchn7jptrdb6ydizenrg32q6zd3uv4z5frsl2gxiqh43g.py 2026-02-21T11:00:57.0080787Z from __future__ import annotations 2026-02-21T11:00:57.0084635Z 2026-02-21T11:00:57.0086804Z import torch 2026-02-21T11:00:57.0087073Z import triton 2026-02-21T11:00:57.0087274Z import triton.language as tl 2026-02-21T11:00:57.0087580Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T11:00:57.0088009Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T11:00:57.0089462Z 2026-02-21T11:00:57.0089568Z _BLOCK_SIZE_0 = tl.constexpr(8) 2026-02-21T11:00:57.0089811Z _BLOCK_SIZE_1 = tl.constexpr(64) 2026-02-21T11:00:57.0090052Z _BLOCK_SIZE_2 = tl.constexpr(512) 2026-02-21T11:00:57.0094822Z 2026-02-21T11:00:57.0096429Z @triton.jit 2026-02-21T11:00:57.0096749Z def _helion_welford(x, weight, bias, out, eps): 2026-02-21T11:00:57.0097067Z # src[welford.py:45]: for tile_m in hl.tile(m): 2026-02-21T11:00:57.0097309Z pid_0 = tl.program_id(0) 2026-02-21T11:00:57.0097555Z offset_0 = pid_0 * _BLOCK_SIZE_0 2026-02-21T11:00:57.0097829Z indices_0 = (offset_0 + tl.arange(0, _BLOCK_SIZE_0)).to(tl.int32) 2026-02-21T11:00:57.0098253Z # src[welford.py:46]: acc_cnt = torch.zeros_like(x[tile_m, 0], dtype=torch.float32) 2026-02-21T11:00:57.0098612Z acc_cnt = tl.full([_BLOCK_SIZE_0], 0, tl.float32) 2026-02-21T11:00:57.0098914Z # 
src[welford.py:47]: acc_mean = torch.zeros_like(acc_cnt) 2026-02-21T11:00:57.0099241Z acc_mean = tl.full([_BLOCK_SIZE_0], 0, tl.float32) 2026-02-21T11:00:57.0099509Z # src[welford.py:48]: acc_m2 = torch.zeros_like(acc_cnt) 2026-02-21T11:00:57.0099807Z acc_m2 = tl.full([_BLOCK_SIZE_0], 0, tl.float32) 2026-02-21T11:00:57.0100117Z # src[welford.py:50]: for tile_n in hl.tile(n): 2026-02-21T11:00:57.0100379Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:00:57.0100668Z # src[welford.py:52]: Tn = chunk.size(-1) 2026-02-21T11:00:57.0100898Z # src[welford.py:50-63]: ... 2026-02-21T11:00:57.0101154Z for offset_1 in tl.range(0, 1024, _BLOCK_SIZE_1): 2026-02-21T11:00:57.0101448Z indices_1 = offset_1 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T11:00:57.0101746Z acc_mean_copy = acc_mean 2026-02-21T11:00:57.0102015Z acc_cnt_copy = acc_cnt 2026-02-21T11:00:57.0102241Z acc_m2_copy = acc_m2 2026-02-21T11:00:57.0102479Z acc_mean_copy_0 = acc_mean_copy 2026-02-21T11:00:57.0102707Z acc_cnt_copy_0 = acc_cnt_copy 2026-02-21T11:00:57.0102955Z acc_m2_copy_0 = acc_m2_copy 2026-02-21T11:00:57.0103185Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:00:57.0103518Z chunk = tl.load(x + (indices_0[:, None] * 1024 + indices_1[None, :] * 1), None) 2026-02-21T11:00:57.0103842Z # src[welford.py:53]: sum_x = torch.sum(chunk, dim=-1) 2026-02-21T11:00:57.0104142Z sum_x = tl.cast(tl.sum(chunk, 1), tl.bfloat16) 2026-02-21T11:00:57.0104446Z # src[welford.py:54]: sum_x2 = torch.sum(chunk * chunk, dim=-1) 2026-02-21T11:00:57.0104711Z v_0 = chunk * chunk 2026-02-21T11:00:57.0104956Z sum_x2 = tl.cast(tl.sum(v_0, 1), tl.bfloat16) 2026-02-21T11:00:57.0105210Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:00:57.0105471Z _BLOCK_SIZE_1_ = _BLOCK_SIZE_1 2026-02-21T11:00:57.0105944Z # src[welford.py:55]: mean_c = sum_x / Tn 2026-02-21T11:00:57.0106223Z v_1 = tl.cast(_BLOCK_SIZE_1_, tl.bfloat16) 2026-02-21T11:00:57.0106451Z v_2 = sum_x / v_1 
2026-02-21T11:00:57.0106725Z # src[welford.py:56]: m2_c = sum_x2 - (sum_x * sum_x) / Tn 2026-02-21T11:00:57.0107001Z v_3 = sum_x * sum_x 2026-02-21T11:00:57.0107216Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:00:57.0107479Z _BLOCK_SIZE_1__1 = _BLOCK_SIZE_1 2026-02-21T11:00:57.0107736Z # src[welford.py:56]: m2_c = sum_x2 - (sum_x * sum_x) / Tn 2026-02-21T11:00:57.0108034Z v_4 = tl.cast(_BLOCK_SIZE_1__1, tl.bfloat16) 2026-02-21T11:00:57.0108261Z v_5 = v_3 / v_4 2026-02-21T11:00:57.0108474Z v_6 = sum_x2 - v_5 2026-02-21T11:00:57.0108689Z # src[welford.py:58]: delta = mean_c - acc_mean 2026-02-21T11:00:57.0109066Z v_7 = tl.cast(v_2, tl.float32) 2026-02-21T11:00:57.0109314Z v_8 = v_7 - acc_mean_copy_0 2026-02-21T11:00:57.0109546Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:00:57.0109812Z _BLOCK_SIZE_1__2 = _BLOCK_SIZE_1 2026-02-21T11:00:57.0110048Z # src[welford.py:59]: new_cnt = acc_cnt + Tn 2026-02-21T11:00:57.0110314Z v_9 = tl.cast(_BLOCK_SIZE_1__2, tl.float32) 2026-02-21T11:00:57.0110544Z acc_cnt = acc_cnt_copy_0 + v_9 2026-02-21T11:00:57.0110832Z # src[welford.py:60]: new_mean = acc_mean + delta * (Tn / new_cnt) 2026-02-21T11:00:57.0111126Z v_11 = tl.full([], 1, tl.int32) 2026-02-21T11:00:57.0111334Z v_12 = v_11 / acc_cnt 2026-02-21T11:00:57.0111578Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:00:57.0111810Z _BLOCK_SIZE_1__3 = _BLOCK_SIZE_1 2026-02-21T11:00:57.0112130Z # src[welford.py:60]: new_mean = acc_mean + delta * (Tn / new_cnt) 2026-02-21T11:00:57.0112415Z v_13 = tl.cast(_BLOCK_SIZE_1__3, tl.float32) 2026-02-21T11:00:57.0112664Z v_14 = v_12 * v_13 2026-02-21T11:00:57.0112853Z v_15 = v_8 * v_14 2026-02-21T11:00:57.0113082Z acc_mean = acc_mean_copy_0 + v_15 2026-02-21T11:00:57.0113407Z # src[welford.py:61]: new_m2 = acc_m2 + m2_c + delta * delta * (acc_cnt * Tn / new_cnt) 2026-02-21T11:00:57.0113709Z v_17 = tl.cast(v_6, tl.float32) 2026-02-21T11:00:57.0113969Z v_18 = acc_m2_copy_0 + v_17 
2026-02-21T11:00:57.0114171Z v_19 = v_8 * v_8 2026-02-21T11:00:57.0114404Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:00:57.0114643Z _BLOCK_SIZE_1__4 = _BLOCK_SIZE_1 2026-02-21T11:00:57.0114956Z # src[welford.py:61]: new_m2 = acc_m2 + m2_c + delta * delta * (acc_cnt * Tn / new_cnt) 2026-02-21T11:00:57.0115291Z v_20 = tl.cast(_BLOCK_SIZE_1__4, tl.float32) 2026-02-21T11:00:57.0115543Z v_21 = acc_cnt_copy_0 * v_20 2026-02-21T11:00:57.0115774Z v_22 = v_21 / acc_cnt 2026-02-21T11:00:57.0115969Z v_23 = v_19 * v_22 2026-02-21T11:00:57.0116183Z acc_m2 = v_18 + v_23 2026-02-21T11:00:57.0116432Z # src[welford.py:65]: rstd_tile = torch.rsqrt(acc_m2 / acc_cnt + eps) 2026-02-21T11:00:57.0116720Z v_25 = acc_m2 / acc_cnt 2026-02-21T11:00:57.0116906Z v_26 = v_25 + eps 2026-02-21T11:00:57.0117116Z v_27 = libdevice.rsqrt(v_26) 2026-02-21T11:00:57.0117374Z # src[welford.py:66]: mean_col = acc_mean[:, None] 2026-02-21T11:00:57.0117611Z mean_col = acc_mean[:, None] 2026-02-21T11:00:57.0117871Z # src[welford.py:67]: rstd_col = rstd_tile[:, None] 2026-02-21T11:00:57.0118103Z rstd_col = v_27[:, None] 2026-02-21T11:00:57.0118341Z # src[welford.py:69]: for tile_n in hl.tile(n): 2026-02-21T11:00:57.0118601Z # src[welford.py:70]: xi_chuck = x[tile_m, tile_n] 2026-02-21T11:00:57.0118913Z # src[welford.py:71]: w_chuck = weight[tile_n][None, :] 2026-02-21T11:00:57.0119166Z # src[welford.py:69-77]: ... 
2026-02-21T11:00:57.0119473Z for offset_2 in tl.range(0, 1024, _BLOCK_SIZE_2, num_stages=1, flatten=False): 2026-02-21T11:00:57.0119908Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_2).to(tl.int32) 2026-02-21T11:00:57.0120180Z mean_col_copy = mean_col 2026-02-21T11:00:57.0120414Z rstd_col_copy = rstd_col 2026-02-21T11:00:57.0120625Z mean_col_copy_0 = mean_col_copy 2026-02-21T11:00:57.0120873Z rstd_col_copy_0 = rstd_col_copy 2026-02-21T11:00:57.0121113Z # src[welford.py:70]: xi_chuck = x[tile_m, tile_n] 2026-02-21T11:00:57.0121463Z xi_chuck = tl.load(x + (indices_0[:, None] * 1024 + indices_2[None, :] * 1), None) 2026-02-21T11:00:57.0121820Z # src[welford.py:71]: w_chuck = weight[tile_n][None, :] 2026-02-21T11:00:57.0122177Z load_1 = tl.load(weight + indices_2 * 1, None, eviction_policy='evict_first') 2026-02-21T11:00:57.0122500Z w_chuck = load_1[None, :] 2026-02-21T11:00:57.0122743Z # src[welford.py:72]: b_chuck = bias[tile_n][None, :] 2026-02-21T11:00:57.0123120Z load_2 = tl.load(bias + indices_2 * 1, None) 2026-02-21T11:00:57.0123370Z b_chuck = load_2[None, :] 2026-02-21T11:00:57.0123663Z # src[welford.py:74]: y = (xi_chuck - mean_col) * rstd_col 2026-02-21T11:00:57.0123970Z v_28 = tl.cast(xi_chuck, tl.float32) 2026-02-21T11:00:57.0124207Z v_29 = v_28 - mean_col_copy_0 2026-02-21T11:00:57.0124464Z v_30 = v_29 * rstd_col_copy_0 2026-02-21T11:00:57.0124703Z # src[welford.py:75]: y = y * w_chuck + b_chuck 2026-02-21T11:00:57.0124980Z v_31 = tl.cast(w_chuck, tl.float32) 2026-02-21T11:00:57.0125205Z v_32 = v_30 * v_31 2026-02-21T11:00:57.0125439Z v_33 = tl.cast(b_chuck, tl.float32) 2026-02-21T11:00:57.0125662Z v_34 = v_32 + v_33 2026-02-21T11:00:57.0125933Z # src[welford.py:77]: out[tile_m, tile_n] = y.to(x.dtype) 2026-02-21T11:00:57.0126227Z v_35 = tl.cast(v_34, tl.bfloat16) 2026-02-21T11:00:57.0126529Z tl.store(out + (indices_0[:, None] * 1024 + indices_2[None, :] * 1), v_35, None) 2026-02-21T11:00:57.0126755Z 2026-02-21T11:00:57.0127057Z def welford(weight: 
torch.Tensor, bias: torch.Tensor, x: torch.Tensor, eps: float=1e-05, *, _launcher=_default_launcher): 2026-02-21T11:00:57.0127450Z """ 2026-02-21T11:00:57.0127713Z Applies LayerNorm using Welford's algorithm for mean/variance. 2026-02-21T11:00:57.0127985Z Args: 2026-02-21T11:00:57.0128202Z weight: weight tensor of shape [N] 2026-02-21T11:00:57.0128465Z bias: bias tensor of shape [N] 2026-02-21T11:00:57.0128692Z x: input tensor of shape [M, N] 2026-02-21T11:00:57.0128929Z Returns: 2026-02-21T11:00:57.0129117Z Output tensor of shape [M, N] 2026-02-21T11:00:57.0129355Z """ 2026-02-21T11:00:57.0129534Z # src[welford.py:41]: m, n = x.size() 2026-02-21T11:00:57.0129778Z m, n = x.size() 2026-02-21T11:00:57.0130043Z # src[welford.py:43]: out = torch.empty([m, n], dtype=x.dtype, device=x.device) 2026-02-21T11:00:57.0130422Z out = torch.empty([m, n], dtype=x.dtype, device=x.device) 2026-02-21T11:00:57.0130719Z # src[welford.py:45]: for tile_m in hl.tile(m): 2026-02-21T11:00:57.0130946Z _BLOCK_SIZE_0 = 8 2026-02-21T11:00:57.0131177Z # src[welford.py:45]: for tile_m in hl.tile(m): 2026-02-21T11:00:57.0131490Z # src[welford.py:46]: acc_cnt = torch.zeros_like(x[tile_m, 0], dtype=torch.float32) 2026-02-21T11:00:57.0131888Z # src[welford.py:47]: acc_mean = torch.zeros_like(acc_cnt) 2026-02-21T11:00:57.0132147Z # src[welford.py:45-77]: ... 
2026-02-21T11:00:57.0132540Z _launcher(_helion_welford, (triton.cdiv(262144, _BLOCK_SIZE_0),), x, weight, bias, out, eps, num_warps=2, num_stages=1) 2026-02-21T11:00:57.0132955Z # src[welford.py:78]: return out 2026-02-21T11:00:57.0133162Z return out 2026-02-21T11:00:58.3684782Z WARNING:tritonbench.utils.triton_op:Completed input ID 0: 2026-02-21T11:00:58.3686500Z x_val 2026-02-21T11:00:58.3686715Z ------- 2026-02-21T11:00:58.3686923Z 1024 2026-02-21T11:00:58.3687015Z 2026-02-21T11:00:58.3703933Z 17%|█▋ | 1/6 [06:54<34:32, 414.44s/it]WARNING:tritonbench.utils.triton_op:Running input ID 2: 2026-02-21T11:00:58.3704594Z x_val 2026-02-21T11:00:58.3704805Z ------- 2026-02-21T11:00:58.3704974Z 2048 2026-02-21T11:00:58.3719508Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for eager_layer_norm 2026-02-21T11:00:59.1555461Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_welford 2026-02-21T11:01:00.4573705Z INFO:tritonbench.utils.triton_op:Took 2.14ms to get benchmark function for torch_compile_welford 2026-02-21T11:01:10.0373536Z WARNING:__main__:Input tensor metadata: 2026-02-21T11:01:10.0377607Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T11:01:10.0379151Z 'dtype': 'torch.bfloat16', 2026-02-21T11:01:10.0379463Z 'shape': (2048,), 2026-02-21T11:01:10.0379761Z 'stride': (1,)}, 2026-02-21T11:01:10.0380009Z { 'device': 'cuda:0', 2026-02-21T11:01:10.0380681Z 'dtype': 'torch.bfloat16', 2026-02-21T11:01:10.0380980Z 'shape': (2048,), 2026-02-21T11:01:10.0381260Z 'stride': (1,)}, 2026-02-21T11:01:10.0381534Z { 'device': 'cuda:0', 2026-02-21T11:01:10.0381784Z 'dtype': 'torch.bfloat16', 2026-02-21T11:01:10.0382865Z 'shape': (262144, 2048), 2026-02-21T11:01:10.0383088Z 'stride': (2048, 1)}), 2026-02-21T11:01:10.0383339Z 'kwargs': {}} 2026-02-21T11:01:10.0392718Z INFO:tritonbench.utils.triton_op:Took 2.26ms to get benchmark function for helion_welford 2026-02-21T11:01:10.3203971Z [0s] Autotune random seed: 2144717750 
2026-02-21T11:01:10.3613242Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T11:01:45.0631466Z [34s] Timeout after 30s compiling Config(block_sizes=[8192, 1, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', '', 'first', 'last'], num_sm_multiplier=128, num_stages=5, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True, False], range_multi_buffers=[False, True, False], range_num_stages=[0, 4, 0], range_unroll_factors=[3, 0, 0], range_warp_specializes=[False, None, None]) 2026-02-21T11:01:46.2370553Z [35s] Timeout after 30s compiling Config(block_sizes=[8192, 16, 1], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', '', 'last', 'first'], num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, True, None], range_multi_buffers=[None, False, None], range_num_stages=[0, 1, 4], range_unroll_factors=[0, 0, 3], range_warp_specializes=[None, False, False]) 2026-02-21T11:01:46.2390426Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s 2026-02-21T11:01:48.9312622Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T11:01:48.9317230Z tt.func public @_helion_welford(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: !tt.ptr {tt.divisibility = 16 : i32}, %arg4: f32) attributes {noinline = false} { 2026-02-21T11:01:48.9318038Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T11:01:48.9318352Z %cst = arith.constant dense<1.000000e+00> : tensor<16xf32> 2026-02-21T11:01:48.9318660Z %cst_0 = arith.constant dense<3.200000e+01> : tensor<16xf32> 2026-02-21T11:01:48.9318949Z %c32_i32 = arith.constant 32 : i32 2026-02-21T11:01:48.9319172Z %c0_i32 = 
arith.constant 0 : i32 2026-02-21T11:01:48.9319426Z %c4736_i32 = arith.constant 4736 : i32 2026-02-21T11:01:48.9319681Z %cst_1 = arith.constant dense<0.000000e+00> : tensor<16xf32> 2026-02-21T11:01:48.9319970Z %c16_i32 = arith.constant 16 : i32 2026-02-21T11:01:48.9320197Z %c262144_i32 = arith.constant 262144 : i32 2026-02-21T11:01:48.9320459Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T11:01:48.9320712Z %c2048_i64 = arith.constant 2048 : i64 2026-02-21T11:01:48.9321289Z %c1_i64 = arith.constant 1 : i64 2026-02-21T11:01:48.9321687Z %0 = tt.make_tensor_descriptor %arg0, [%c262144_i32, %c2048_i32], [%c2048_i64, %c1_i64] : , > 2026-02-21T11:01:48.9322251Z %1 = tt.make_tensor_descriptor %arg3, [%c262144_i32, %c2048_i32], [%c2048_i64, %c1_i64] : , > 2026-02-21T11:01:48.9322643Z %2 = tt.get_program_id x : i32 2026-02-21T11:01:48.9322898Z scf.for %arg5 = %2 to %c16384_i32 step %c4736_i32 : i32 { 2026-02-21T11:01:48.9323258Z %3 = arith.muli %arg5, %c16_i32 : i32 2026-02-21T11:01:48.9325118Z %c64_i32 = arith.constant 64 : i32 2026-02-21T11:01:48.9325623Z %4:3 = scf.for %arg6 = %c0_i32 to %c2048_i32 step %c64_i32 iter_args(%arg7 = %cst_1, %arg8 = %cst_1, %arg9 = %cst_1) -> (tensor<16xf32>, tensor<16xf32>, tensor<16xf32>) : i32 { 2026-02-21T11:01:48.9326419Z %35 = tt.descriptor_load %0[%3, %arg6] : !tt.tensordesc> -> tensor<16x32xbf16> 2026-02-21T11:01:48.9326797Z %36 = "tt.reduce"(%35) <{axis = 1 : i32}> ({ 2026-02-21T11:01:48.9327031Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T11:01:48.9327290Z %83 = arith.addf %arg10, %arg11 : bf16 2026-02-21T11:01:48.9327522Z tt.reduce.return %83 : bf16 2026-02-21T11:01:48.9327781Z }) : (tensor<16x32xbf16>) -> tensor<16xbf16> 2026-02-21T11:01:48.9328029Z %37 = arith.mulf %35, %35 : tensor<16x32xbf16> 2026-02-21T11:01:48.9328298Z %38 = "tt.reduce"(%37) <{axis = 1 : i32}> ({ 2026-02-21T11:01:48.9328555Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T11:01:48.9328775Z %83 = arith.addf %arg10, %arg11 : bf16 2026-02-21T11:01:48.9329030Z 
tt.reduce.return %83 : bf16 2026-02-21T11:01:48.9329256Z }) : (tensor<16x32xbf16>) -> tensor<16xbf16> 2026-02-21T11:01:48.9329542Z %39 = arith.extf %36 : tensor<16xbf16> to tensor<16xf32> 2026-02-21T11:01:48.9329814Z %40 = arith.divf %39, %cst_0 : tensor<16xf32> 2026-02-21T11:01:48.9330085Z %41 = arith.mulf %36, %36 : tensor<16xbf16> 2026-02-21T11:01:48.9330367Z %42 = arith.extf %41 : tensor<16xbf16> to tensor<16xf32> 2026-02-21T11:01:48.9330623Z %43 = arith.divf %42, %cst_0 : tensor<16xf32> 2026-02-21T11:01:48.9330901Z %44 = arith.extf %38 : tensor<16xbf16> to tensor<16xf32> 2026-02-21T11:01:48.9331171Z %45 = arith.subf %44, %43 : tensor<16xf32> 2026-02-21T11:01:48.9331434Z %46 = arith.subf %40, %arg8 : tensor<16xf32> 2026-02-21T11:01:48.9331673Z %47 = arith.addf %arg7, %cst_0 : tensor<16xf32> 2026-02-21T11:01:48.9331976Z %48 = arith.divf %cst, %47 : tensor<16xf32> 2026-02-21T11:01:48.9332248Z %49 = arith.mulf %48, %cst_0 : tensor<16xf32> 2026-02-21T11:01:48.9332492Z %50 = arith.mulf %46, %49 : tensor<16xf32> 2026-02-21T11:01:48.9332759Z %51 = arith.addf %arg8, %50 : tensor<16xf32> 2026-02-21T11:01:48.9333065Z %52 = arith.addf %arg9, %45 : tensor<16xf32> 2026-02-21T11:01:48.9333316Z %53 = arith.mulf %46, %46 : tensor<16xf32> 2026-02-21T11:01:48.9333594Z %54 = arith.mulf %arg7, %cst_0 : tensor<16xf32> 2026-02-21T11:01:48.9333843Z %55 = arith.divf %54, %47 : tensor<16xf32> 2026-02-21T11:01:48.9334109Z %56 = arith.mulf %53, %55 : tensor<16xf32> 2026-02-21T11:01:48.9334375Z %57 = arith.addf %52, %56 : tensor<16xf32> 2026-02-21T11:01:48.9334612Z %c1_i32 = arith.constant 1 : i32 2026-02-21T11:01:48.9334882Z %58 = arith.muli %c32_i32, %c1_i32 : i32 2026-02-21T11:01:48.9335122Z %59 = arith.addi %arg6, %58 : i32 2026-02-21T11:01:48.9335477Z %60 = tt.descriptor_load %0[%3, %59] : !tt.tensordesc> -> tensor<16x32xbf16> 2026-02-21T11:01:48.9335821Z %61 = "tt.reduce"(%60) <{axis = 1 : i32}> ({ 2026-02-21T11:01:48.9336087Z ^bb0(%arg10: bf16, %arg11: bf16): 
2026-02-21T11:01:48.9336357Z %83 = arith.addf %arg10, %arg11 : bf16 2026-02-21T11:01:48.9336679Z tt.reduce.return %83 : bf16 2026-02-21T11:01:48.9336946Z }) : (tensor<16x32xbf16>) -> tensor<16xbf16> 2026-02-21T11:01:48.9337201Z %62 = arith.mulf %60, %60 : tensor<16x32xbf16> 2026-02-21T11:01:48.9337479Z %63 = "tt.reduce"(%62) <{axis = 1 : i32}> ({ 2026-02-21T11:01:48.9337714Z ^bb0(%arg10: bf16, %arg11: bf16): 2026-02-21T11:01:48.9337975Z %83 = arith.addf %arg10, %arg11 : bf16 2026-02-21T11:01:48.9338216Z tt.reduce.return %83 : bf16 2026-02-21T11:01:48.9338508Z }) : (tensor<16x32xbf16>) -> tensor<16xbf16> 2026-02-21T11:01:48.9338804Z %64 = arith.extf %61 : tensor<16xbf16> to tensor<16xf32> 2026-02-21T11:01:48.9339080Z %65 = arith.divf %64, %cst_0 : tensor<16xf32> 2026-02-21T11:01:48.9339360Z %66 = arith.mulf %61, %61 : tensor<16xbf16> 2026-02-21T11:01:48.9339626Z %67 = arith.extf %66 : tensor<16xbf16> to tensor<16xf32> 2026-02-21T11:01:48.9339992Z %68 = arith.divf %67, %cst_0 : tensor<16xf32> 2026-02-21T11:01:48.9340304Z %69 = arith.extf %63 : tensor<16xbf16> to tensor<16xf32> 2026-02-21T11:01:48.9340563Z %70 = arith.subf %69, %68 : tensor<16xf32> 2026-02-21T11:01:48.9340821Z %71 = arith.subf %65, %51 : tensor<16xf32> 2026-02-21T11:01:48.9341057Z %72 = arith.addf %47, %cst_0 : tensor<16xf32> 2026-02-21T11:01:48.9341321Z %73 = arith.divf %cst, %72 : tensor<16xf32> 2026-02-21T11:01:48.9341579Z %74 = arith.mulf %73, %cst_0 : tensor<16xf32> 2026-02-21T11:01:48.9341808Z %75 = arith.mulf %71, %74 : tensor<16xf32> 2026-02-21T11:01:48.9342117Z %76 = arith.addf %51, %75 : tensor<16xf32> 2026-02-21T11:01:48.9342339Z %77 = arith.addf %57, %70 : tensor<16xf32> 2026-02-21T11:01:48.9342590Z %78 = arith.mulf %71, %71 : tensor<16xf32> 2026-02-21T11:01:48.9342823Z %79 = arith.mulf %47, %cst_0 : tensor<16xf32> 2026-02-21T11:01:48.9343086Z %80 = arith.divf %79, %72 : tensor<16xf32> 2026-02-21T11:01:48.9343341Z %81 = arith.mulf %78, %80 : tensor<16xf32> 2026-02-21T11:01:48.9343570Z %82 
= arith.addf %77, %81 : tensor<16xf32> 2026-02-21T11:01:48.9343872Z scf.yield %72, %76, %82 : tensor<16xf32>, tensor<16xf32>, tensor<16xf32> 2026-02-21T11:01:48.9344139Z } 2026-02-21T11:01:48.9344348Z %5 = arith.divf %4#2, %4#0 : tensor<16xf32> 2026-02-21T11:01:48.9344590Z %6 = tt.splat %arg4 : f32 -> tensor<16xf32> 2026-02-21T11:01:48.9344854Z %7 = arith.addf %5, %6 : tensor<16xf32> 2026-02-21T11:01:48.9345277Z %8 = tt.extern_elementwise %7 {libname = "", libpath = "", pure = true, symbol = "__nv_rsqrtf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T11:01:48.9345726Z %9 = tt.expand_dims %4#1 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T11:01:48.9346099Z %10 = tt.expand_dims %8 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T11:01:48.9346388Z %c2016_i32 = arith.constant 2016 : i32 2026-02-21T11:01:48.9346644Z %c96_i32 = arith.constant 96 : i32 2026-02-21T11:01:48.9346905Z scf.for %arg6 = %c0_i32 to %c2016_i32 step %c96_i32 : i32 { 2026-02-21T11:01:48.9347244Z %35 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T11:01:48.9347554Z %36 = tt.splat %arg6 : i32 -> tensor<32xi32> 2026-02-21T11:01:48.9347796Z %37 = arith.addi %36, %35 : tensor<32xi32> 2026-02-21T11:01:48.9348152Z %38 = tt.descriptor_load %0[%3, %arg6] : !tt.tensordesc> -> tensor<16x32xbf16> 2026-02-21T11:01:48.9348516Z %39 = tt.splat %arg1 : !tt.ptr -> tensor<32x!tt.ptr> 2026-02-21T11:01:48.9348850Z %40 = tt.addptr %39, %37 : tensor<32x!tt.ptr>, tensor<32xi32> 2026-02-21T11:01:48.9349171Z %41 = tt.load %40 evictionPolicy = evict_last : tensor<32x!tt.ptr> 2026-02-21T11:01:48.9349541Z %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<32xbf16> -> tensor<1x32xbf16> 2026-02-21T11:01:48.9349899Z %43 = tt.splat %arg2 : !tt.ptr -> tensor<32x!tt.ptr> 2026-02-21T11:01:48.9350266Z %44 = tt.addptr %43, %37 : tensor<32x!tt.ptr>, tensor<32xi32> 2026-02-21T11:01:48.9350572Z %45 = tt.load %44 : tensor<32x!tt.ptr> 2026-02-21T11:01:48.9350864Z %46 = 
tt.expand_dims %45 {axis = 0 : i32} : tensor<32xbf16> -> tensor<1x32xbf16> 2026-02-21T11:01:48.9351215Z %47 = arith.extf %38 : tensor<16x32xbf16> to tensor<16x32xf32> 2026-02-21T11:01:48.9351544Z %48 = tt.broadcast %9 : tensor<16x1xf32> -> tensor<16x32xf32> 2026-02-21T11:01:48.9351813Z %49 = arith.subf %47, %48 : tensor<16x32xf32> 2026-02-21T11:01:48.9352133Z %50 = tt.broadcast %10 : tensor<16x1xf32> -> tensor<16x32xf32> 2026-02-21T11:01:48.9352398Z %51 = arith.mulf %49, %50 : tensor<16x32xf32> 2026-02-21T11:01:48.9352688Z %52 = arith.extf %42 : tensor<1x32xbf16> to tensor<1x32xf32> 2026-02-21T11:01:48.9353033Z %53 = tt.broadcast %52 : tensor<1x32xf32> -> tensor<16x32xf32> 2026-02-21T11:01:48.9353329Z %54 = arith.mulf %51, %53 : tensor<16x32xf32> 2026-02-21T11:01:48.9353619Z %55 = arith.extf %46 : tensor<1x32xbf16> to tensor<1x32xf32> 2026-02-21T11:01:48.9353901Z %56 = tt.broadcast %55 : tensor<1x32xf32> -> tensor<16x32xf32> 2026-02-21T11:01:48.9354187Z %57 = arith.addf %54, %56 : tensor<16x32xf32> 2026-02-21T11:01:48.9354451Z %58 = arith.truncf %57 : tensor<16x32xf32> to tensor<16x32xbf16> 2026-02-21T11:01:48.9354844Z tt.descriptor_store %1[%3, %arg6], %58 : !tt.tensordesc>, tensor<16x32xbf16> 2026-02-21T11:01:48.9355162Z %c1_i32 = arith.constant 1 : i32 2026-02-21T11:01:48.9355415Z %59 = arith.muli %c32_i32, %c1_i32 : i32 2026-02-21T11:01:48.9355663Z %60 = arith.addi %arg6, %59 : i32 2026-02-21T11:01:48.9355926Z %61 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T11:01:48.9356233Z %62 = tt.splat %60 : i32 -> tensor<32xi32> 2026-02-21T11:01:48.9356467Z %63 = arith.addi %62, %61 : tensor<32xi32> 2026-02-21T11:01:48.9356808Z %64 = tt.descriptor_load %0[%3, %60] : !tt.tensordesc> -> tensor<16x32xbf16> 2026-02-21T11:01:48.9357156Z %65 = tt.splat %arg1 : !tt.ptr -> tensor<32x!tt.ptr> 2026-02-21T11:01:48.9357486Z %66 = tt.addptr %65, %63 : tensor<32x!tt.ptr>, tensor<32xi32> 2026-02-21T11:01:48.9357828Z %67 = tt.load %66 evictionPolicy = 
evict_last : tensor<32x!tt.ptr> 2026-02-21T11:01:48.9358163Z %68 = tt.expand_dims %67 {axis = 0 : i32} : tensor<32xbf16> -> tensor<1x32xbf16> 2026-02-21T11:01:48.9358508Z %69 = tt.splat %arg2 : !tt.ptr -> tensor<32x!tt.ptr> 2026-02-21T11:01:48.9358804Z %70 = tt.addptr %69, %63 : tensor<32x!tt.ptr>, tensor<32xi32> 2026-02-21T11:01:48.9359106Z %71 = tt.load %70 : tensor<32x!tt.ptr> 2026-02-21T11:01:48.9359390Z %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<32xbf16> -> tensor<1x32xbf16> 2026-02-21T11:01:48.9359746Z %73 = arith.extf %64 : tensor<16x32xbf16> to tensor<16x32xf32> 2026-02-21T11:01:48.9360071Z %74 = tt.broadcast %9 : tensor<16x1xf32> -> tensor<16x32xf32> 2026-02-21T11:01:48.9360336Z %75 = arith.subf %73, %74 : tensor<16x32xf32> 2026-02-21T11:01:48.9360627Z %76 = tt.broadcast %10 : tensor<16x1xf32> -> tensor<16x32xf32> 2026-02-21T11:01:48.9360887Z %77 = arith.mulf %75, %76 : tensor<16x32xf32> 2026-02-21T11:01:48.9361179Z %78 = arith.extf %68 : tensor<1x32xbf16> to tensor<1x32xf32> 2026-02-21T11:01:48.9361492Z %79 = tt.broadcast %78 : tensor<1x32xf32> -> tensor<16x32xf32> 2026-02-21T11:01:48.9361753Z %80 = arith.mulf %77, %79 : tensor<16x32xf32> 2026-02-21T11:01:48.9362089Z %81 = arith.extf %72 : tensor<1x32xbf16> to tensor<1x32xf32> 2026-02-21T11:01:48.9362371Z %82 = tt.broadcast %81 : tensor<1x32xf32> -> tensor<16x32xf32> 2026-02-21T11:01:48.9362664Z %83 = arith.addf %80, %82 : tensor<16x32xf32> 2026-02-21T11:01:48.9362991Z %84 = arith.truncf %83 : tensor<16x32xf32> to tensor<16x32xbf16> 2026-02-21T11:01:48.9363373Z tt.descriptor_store %1[%3, %60], %84 : !tt.tensordesc>, tensor<16x32xbf16> 2026-02-21T11:01:48.9363722Z %c2_i32 = arith.constant 2 : i32 2026-02-21T11:01:48.9363947Z %85 = arith.muli %c32_i32, %c2_i32 : i32 2026-02-21T11:01:48.9364202Z %86 = arith.addi %arg6, %85 : i32 2026-02-21T11:01:48.9364464Z %87 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T11:01:48.9364770Z %88 = tt.splat %86 : i32 -> tensor<32xi32> 
2026-02-21T11:01:48.9365001Z %89 = arith.addi %88, %87 : tensor<32xi32> 2026-02-21T11:01:48.9365334Z %90 = tt.descriptor_load %0[%3, %86] : !tt.tensordesc> -> tensor<16x32xbf16> 2026-02-21T11:01:48.9365708Z %91 = tt.splat %arg1 : !tt.ptr -> tensor<32x!tt.ptr> 2026-02-21T11:01:48.9366086Z %92 = tt.addptr %91, %89 : tensor<32x!tt.ptr>, tensor<32xi32> 2026-02-21T11:01:48.9366438Z %93 = tt.load %92 evictionPolicy = evict_last : tensor<32x!tt.ptr> 2026-02-21T11:01:48.9366776Z %94 = tt.expand_dims %93 {axis = 0 : i32} : tensor<32xbf16> -> tensor<1x32xbf16> 2026-02-21T11:01:48.9367123Z %95 = tt.splat %arg2 : !tt.ptr -> tensor<32x!tt.ptr> 2026-02-21T11:01:48.9367423Z %96 = tt.addptr %95, %89 : tensor<32x!tt.ptr>, tensor<32xi32> 2026-02-21T11:01:48.9367726Z %97 = tt.load %96 : tensor<32x!tt.ptr> 2026-02-21T11:01:48.9368047Z %98 = tt.expand_dims %97 {axis = 0 : i32} : tensor<32xbf16> -> tensor<1x32xbf16> 2026-02-21T11:01:48.9368365Z %99 = arith.extf %90 : tensor<16x32xbf16> to tensor<16x32xf32> 2026-02-21T11:01:48.9368678Z %100 = tt.broadcast %9 : tensor<16x1xf32> -> tensor<16x32xf32> 2026-02-21T11:01:48.9368950Z %101 = arith.subf %99, %100 : tensor<16x32xf32> 2026-02-21T11:01:48.9369246Z %102 = tt.broadcast %10 : tensor<16x1xf32> -> tensor<16x32xf32> 2026-02-21T11:01:48.9369520Z %103 = arith.mulf %101, %102 : tensor<16x32xf32> 2026-02-21T11:01:48.9369814Z %104 = arith.extf %94 : tensor<1x32xbf16> to tensor<1x32xf32> 2026-02-21T11:01:48.9370137Z %105 = tt.broadcast %104 : tensor<1x32xf32> -> tensor<16x32xf32> 2026-02-21T11:01:48.9370412Z %106 = arith.mulf %103, %105 : tensor<16x32xf32> 2026-02-21T11:01:48.9370701Z %107 = arith.extf %98 : tensor<1x32xbf16> to tensor<1x32xf32> 2026-02-21T11:01:48.9370988Z %108 = tt.broadcast %107 : tensor<1x32xf32> -> tensor<16x32xf32> 2026-02-21T11:01:48.9371292Z %109 = arith.addf %106, %108 : tensor<16x32xf32> 2026-02-21T11:01:48.9371593Z %110 = arith.truncf %109 : tensor<16x32xf32> to tensor<16x32xbf16> 2026-02-21T11:01:48.9371974Z 
tt.descriptor_store %1[%3, %86], %110 : !tt.tensordesc>, tensor<16x32xbf16> 2026-02-21T11:01:48.9372373Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T11:01:48.9372702Z %11 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T11:01:48.9373017Z %12 = tt.splat %c2016_i32 : i32 -> tensor<32xi32> 2026-02-21T11:01:48.9373260Z %13 = arith.addi %12, %11 : tensor<32xi32> 2026-02-21T11:01:48.9373619Z %14 = tt.descriptor_load %0[%3, %c2016_i32] : !tt.tensordesc> -> tensor<16x32xbf16> 2026-02-21T11:01:48.9374020Z %15 = tt.splat %arg1 : !tt.ptr -> tensor<32x!tt.ptr> 2026-02-21T11:01:48.9374316Z %16 = tt.addptr %15, %13 : tensor<32x!tt.ptr>, tensor<32xi32> 2026-02-21T11:01:48.9374656Z %17 = tt.load %16 evictionPolicy = evict_last : tensor<32x!tt.ptr> 2026-02-21T11:01:48.9374991Z %18 = tt.expand_dims %17 {axis = 0 : i32} : tensor<32xbf16> -> tensor<1x32xbf16> 2026-02-21T11:01:48.9375345Z %19 = tt.splat %arg2 : !tt.ptr -> tensor<32x!tt.ptr> 2026-02-21T11:01:48.9375646Z %20 = tt.addptr %19, %13 : tensor<32x!tt.ptr>, tensor<32xi32> 2026-02-21T11:01:48.9376008Z %21 = tt.load %20 : tensor<32x!tt.ptr> 2026-02-21T11:01:48.9376341Z %22 = tt.expand_dims %21 {axis = 0 : i32} : tensor<32xbf16> -> tensor<1x32xbf16> 2026-02-21T11:01:48.9376677Z %23 = arith.extf %14 : tensor<16x32xbf16> to tensor<16x32xf32> 2026-02-21T11:01:48.9377015Z %24 = tt.broadcast %9 : tensor<16x1xf32> -> tensor<16x32xf32> 2026-02-21T11:01:48.9377300Z %25 = arith.subf %23, %24 : tensor<16x32xf32> 2026-02-21T11:01:48.9377604Z %26 = tt.broadcast %10 : tensor<16x1xf32> -> tensor<16x32xf32> 2026-02-21T11:01:48.9377906Z %27 = arith.mulf %25, %26 : tensor<16x32xf32> 2026-02-21T11:01:48.9378183Z %28 = arith.extf %18 : tensor<1x32xbf16> to tensor<1x32xf32> 2026-02-21T11:01:48.9378507Z %29 = tt.broadcast %28 : tensor<1x32xf32> -> tensor<16x32xf32> 2026-02-21T11:01:48.9378780Z %30 = arith.mulf %27, %29 : tensor<16x32xf32> 2026-02-21T11:01:48.9379141Z %31 = arith.extf 
%22 : tensor<1x32xbf16> to tensor<1x32xf32> 2026-02-21T11:01:48.9379441Z %32 = tt.broadcast %31 : tensor<1x32xf32> -> tensor<16x32xf32> 2026-02-21T11:01:48.9379738Z %33 = arith.addf %30, %32 : tensor<16x32xf32> 2026-02-21T11:01:48.9380042Z %34 = arith.truncf %33 : tensor<16x32xf32> to tensor<16x32xbf16> 2026-02-21T11:01:48.9380423Z tt.descriptor_store %1[%3, %c2016_i32], %34 : !tt.tensordesc>, tensor<16x32xbf16> 2026-02-21T11:01:48.9380837Z } {tt.flatten, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T11:01:48.9381102Z tt.return 2026-02-21T11:01:48.9381301Z } 2026-02-21T11:01:48.9381466Z } 2026-02-21T11:01:48.9381587Z 2026-02-21T11:01:48.9381659Z {-# 2026-02-21T11:01:48.9381833Z external_resources: { 2026-02-21T11:01:48.9382083Z mlir_reproducer: { 2026-02-21T11:01:48.9386674Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, 
triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T11:01:48.9391358Z disable_threading: false, 2026-02-21T11:01:48.9391596Z verify_each: true 2026-02-21T11:01:48.9391781Z } 2026-02-21T11:01:48.9392007Z } 2026-02-21T11:01:48.9392163Z #-} 2026-02-21T11:01:48.9392715Z /tmp/torchinductor_root/ub/cubccsw5j33qlahgkebamakkeza244t2gd2nuw5vukt3pwoxj5ee.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T11:01:48.9394102Z /tmp/torchinductor_root/ub/cubccsw5j33qlahgkebamakkeza244t2gd2nuw5vukt3pwoxj5ee.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T11:01:48.9395305Z [38s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T11:01:48.9396788Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 32, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first', 'last', ''], maxnreg=128, num_sm_multiplier=32, num_stages=6, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, None, True], range_multi_buffers=[True, None, False], range_num_stages=[4, 0, 2], range_unroll_factors=[0, 2, 3], range_warp_specializes=[True, None, None]), static_shapes=True) 2026-02-21T11:01:48.9398080Z Error: RuntimeError: PassManager::run failed 2026-02-21T11:01:48.9398418Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T11:01:52.2013936Z [41s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 1024, 1], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', '', 'first', 'last'], maxnreg=128, num_sm_multiplier=1, num_stages=3, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True, False], range_multi_buffers=[True, False, None], range_num_stages=[1, 3, 1], range_unroll_factors=[2, 1, 3], range_warp_specializes=[False, False, None]) 2026-02-21T11:01:52.2015341Z Tensor-likes are not close! 2026-02-21T11:01:52.2021125Z 2026-02-21T11:01:52.2025800Z Mismatched elements: 11 / 536870912 (0.0%) 2026-02-21T11:01:52.2027496Z Greatest absolute difference: 0.015625 at index (148778, 1010) (up to 0.01 allowed) 2026-02-21T11:01:52.2028038Z Greatest relative difference: 2.5 at index (149651, 1010) (up to 0.01 allowed) 2026-02-21T11:01:52.2032178Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:01:52.2036332Z 2026-02-21T11:01:52.5380124Z [42s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 2048, 1], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first', '', 'first'], num_stages=5, num_warps=8, pid_type='flat', range_flattens=[None, False, None], range_multi_buffers=[None, None, None], range_num_stages=[0, 0, 1], range_unroll_factors=[0, 4, 3], range_warp_specializes=[None, False, False]) 2026-02-21T11:01:52.5381373Z Tensor-likes are not close! 2026-02-21T11:01:52.5385108Z 2026-02-21T11:01:52.5386618Z Mismatched elements: 572 / 536870912 (0.0%) 2026-02-21T11:01:52.5386973Z Greatest absolute difference: 0.03125 at index (12980, 1242) (up to 0.01 allowed) 2026-02-21T11:01:52.5387405Z Greatest relative difference: 213.0 at index (135330, 1455) (up to 0.01 allowed) 2026-02-21T11:01:52.5387762Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:01:52.5387971Z 2026-02-21T11:01:55.1976419Z [44s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[8, 1024, 1], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first', 'last', 'last'], maxnreg=256, num_sm_multiplier=128, num_stages=6, num_warps=32, pid_type='persistent_interleaved', range_flattens=[True, None, True], range_multi_buffers=[None, False, None], range_num_stages=[2, 0, 3], range_unroll_factors=[0, 3, 2], range_warp_specializes=[True, None, None]) 2026-02-21T11:01:55.1977906Z Tensor-likes are not close! 
2026-02-21T11:01:55.1982637Z 2026-02-21T11:01:55.1985981Z Mismatched elements: 12 / 536870912 (0.0%) 2026-02-21T11:01:55.1990458Z Greatest absolute difference: 0.015625 at index (46881, 1563) (up to 0.01 allowed) 2026-02-21T11:01:55.1994285Z Greatest relative difference: 0.63671875 at index (242337, 1242) (up to 0.01 allowed) 2026-02-21T11:01:55.1998369Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:01:55.2002004Z 2026-02-21T11:02:02.4456074Z [52s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 16], indexing=['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['last', '', 'first', 'last'], maxnreg=256, num_sm_multiplier=128, num_stages=3, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, None, None], range_multi_buffers=[False, False, True], range_num_stages=[1, 4, 4], range_unroll_factors=[2, 2, 0], range_warp_specializes=[False, False, None]) 2026-02-21T11:02:02.4457321Z Tensor-likes are not close! 2026-02-21T11:02:02.4457462Z 2026-02-21T11:02:02.4457590Z Mismatched elements: 567 / 536870912 (0.0%) 2026-02-21T11:02:02.4457905Z Greatest absolute difference: 0.0625 at index (234199, 1152) (up to 0.01 allowed) 2026-02-21T11:02:02.4458729Z Greatest relative difference: 64.0 at index (73540, 592) (up to 0.01 allowed) 2026-02-21T11:02:02.4459096Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:02:02.4459307Z 2026-02-21T11:02:06.9707756Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 4.8 configs/s 2026-02-21T11:02:06.9723278Z [56s] Adaptive compile timeout: 30s (90% percentile=2.2s, bounds=[30.0s, 30s]) 2026-02-21T11:02:07.5617157Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 317/317 392.1 configs/s 2026-02-21T11:02:07.7843765Z [57s] Initial random population of 100, 5 starting points: 2026-02-21T11:02:07.7848078Z error=10 2026-02-21T11:02:07.7852413Z timeout=2 2026-02-21T11:02:07.7856984Z ok=88 2026-02-21T11:02:07.7861378Z min=0.6748 2026-02-21T11:02:07.7865304Z mid=10.1807 2026-02-21T11:02:07.7865630Z max=219.6736 2026-02-21T11:02:07.7865857Z best={'block_sizes': [64, 8, 64], 2026-02-21T11:02:07.7866282Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T11:02:07.7866691Z 'load_eviction_policies': ['', 'last', 'first', 'last'], 2026-02-21T11:02:07.7866960Z 'num_stages': 3, 2026-02-21T11:02:07.7867177Z 'num_warps': 4, 2026-02-21T11:02:07.7867361Z 'pid_type': 'flat', 2026-02-21T11:02:07.7867595Z 'range_flattens': [None, False, None], 2026-02-21T11:02:07.7867827Z 'range_multi_buffers': [None, True, None], 2026-02-21T11:02:07.7868088Z 'range_num_stages': [0, 3, 0], 2026-02-21T11:02:07.7868303Z 'range_unroll_factors': [0, 1, 3], 2026-02-21T11:02:07.7868569Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:02:07.7868853Z [57s] Fitting surrogate: 100 points, 100 targets 2026-02-21T11:02:09.0882973Z [58s] Generation 1 starting: 96 neighbors, 5 active search path(s) 2026-02-21T11:02:15.1119893Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 101/101 41.2 configs/s 2026-02-21T11:02:18.8236820Z [68s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[16, 1024, 128], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', '', '', 'last'], num_stages=4, num_warps=32, 
pid_type='flat', range_flattens=[None, False, False], range_multi_buffers=[None, None, True], range_num_stages=[0, 3, 0], range_unroll_factors=[0, 0, 1], range_warp_specializes=[None, True, None]) 2026-02-21T11:02:18.8238008Z Tensor-likes are not close! 2026-02-21T11:02:18.8242099Z 2026-02-21T11:02:18.8247268Z Mismatched elements: 14 / 536870912 (0.0%) 2026-02-21T11:02:18.8251282Z Greatest absolute difference: 0.015625 at index (237166, 1242) (up to 0.01 allowed) 2026-02-21T11:02:18.8252743Z Greatest relative difference: 207.0 at index (237166, 233) (up to 0.01 allowed) 2026-02-21T11:02:18.8253165Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:02:18.8253353Z 2026-02-21T11:02:22.1707102Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 101/101 14.4 configs/s 2026-02-21T11:02:25.9092633Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━ 511/511 131.3 configs/s 2026-02-21T11:02:26.1006747Z [75s] Generation 1 complete: 2026-02-21T11:02:26.1011072Z error=1 2026-02-21T11:02:26.1012723Z ok=101 2026-02-21T11:02:26.1012973Z min=0.4393 2026-02-21T11:02:26.1013199Z mid=0.9432 2026-02-21T11:02:26.1013392Z max=8.0978 2026-02-21T11:02:26.1013647Z best={'block_sizes': [16, 256, 128], 2026-02-21T11:02:26.1014035Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T11:02:26.1014470Z 'load_eviction_policies': ['', '', '', 'last'], 2026-02-21T11:02:26.1014737Z 'num_stages': 4, 2026-02-21T11:02:26.1014978Z 'num_warps': 16, 2026-02-21T11:02:26.1015179Z 'pid_type': 'flat', 2026-02-21T11:02:26.1015453Z 'range_flattens': [None, False, False], 2026-02-21T11:02:26.1015785Z 'range_multi_buffers': [None, None, True], 2026-02-21T11:02:26.1016080Z 'range_num_stages': [0, 3, 0], 2026-02-21T11:02:26.1016412Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:02:26.1016688Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:02:26.1029248Z [75s] Fitting surrogate: 202 points, 202 targets 
2026-02-21T11:02:27.4196248Z [77s] Generation 2 starting: 96 neighbors, 5 active search path(s) 2026-02-21T11:02:33.2511326Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 101/101 28.0 configs/s 2026-02-21T11:02:35.9134537Z [85s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[16, 1024, 128], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', '', '', 'last'], num_stages=4, num_warps=32, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, None, True], range_num_stages=[0, 3, 1], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, True, None]) 2026-02-21T11:02:35.9135731Z Tensor-likes are not close! 2026-02-21T11:02:35.9140271Z 2026-02-21T11:02:35.9145609Z Mismatched elements: 14 / 536870912 (0.0%) 2026-02-21T11:02:35.9149390Z Greatest absolute difference: 0.015625 at index (237166, 1242) (up to 0.01 allowed) 2026-02-21T11:02:35.9150853Z Greatest relative difference: 207.0 at index (237166, 233) (up to 0.01 allowed) 2026-02-21T11:02:35.9151262Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:02:35.9151482Z 2026-02-21T11:02:36.7632849Z [86s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[16, 1024, 128], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', '', '', 'last'], maxnreg=256, num_sm_multiplier=4, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, False, False], range_multi_buffers=[False, None, True], range_num_stages=[1, 3, 1], range_unroll_factors=[1, 3, 0], range_warp_specializes=[True, None, None]) 2026-02-21T11:02:36.7634162Z Tensor-likes are not close! 
2026-02-21T11:02:36.7634306Z 2026-02-21T11:02:36.7634409Z Mismatched elements: 14 / 536870912 (0.0%) 2026-02-21T11:02:36.7634776Z Greatest absolute difference: 0.015625 at index (237166, 1242) (up to 0.01 allowed) 2026-02-21T11:02:36.7635245Z Greatest relative difference: 207.0 at index (237166, 233) (up to 0.01 allowed) 2026-02-21T11:02:36.7635619Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:02:36.7635810Z 2026-02-21T11:02:40.0436946Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 101/101 14.9 configs/s 2026-02-21T11:02:49.2281245Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━ 511/511 54.9 configs/s 2026-02-21T11:02:49.4578710Z [99s] Generation 2 complete: 2026-02-21T11:02:49.4582931Z error=2 2026-02-21T11:02:49.4584526Z ok=100 2026-02-21T11:02:49.4584768Z min=0.4627 2026-02-21T11:02:49.4584946Z mid=0.6217 2026-02-21T11:02:49.4585142Z max=4.3316 2026-02-21T11:02:49.4585320Z best={'block_sizes': [16, 256, 128], 2026-02-21T11:02:49.4585666Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T11:02:49.4586075Z 'load_eviction_policies': ['', '', '', 'last'], 2026-02-21T11:02:49.4589989Z 'num_stages': 4, 2026-02-21T11:02:49.4594348Z 'num_warps': 16, 2026-02-21T11:02:49.4598836Z 'pid_type': 'flat', 2026-02-21T11:02:49.4603336Z 'range_flattens': [None, False, False], 2026-02-21T11:02:49.4604694Z 'range_multi_buffers': [None, None, True], 2026-02-21T11:02:49.4605022Z 'range_num_stages': [0, 3, 0], 2026-02-21T11:02:49.4605253Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:02:49.4605533Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:02:49.4605901Z [99s] Fitting surrogate: 304 points, 304 targets 2026-02-21T11:02:50.6542320Z [100s] Generation 3 starting: 85 neighbors, 5 active search path(s) 2026-02-21T11:02:55.5192611Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 34.3 configs/s 2026-02-21T11:02:58.5524656Z [108s] Skipping config with accuracy 
mismatch: helion.Config(block_sizes=[16, 1024, 256], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', '', '', 'last'], maxnreg=256, num_sm_multiplier=4, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, False, False], range_multi_buffers=[False, None, True], range_num_stages=[1, 3, 0], range_unroll_factors=[1, 3, 0], range_warp_specializes=[True, None, None]) 2026-02-21T11:02:58.5525926Z Tensor-likes are not close! 2026-02-21T11:02:58.5530299Z 2026-02-21T11:02:58.5534928Z Mismatched elements: 14 / 536870912 (0.0%) 2026-02-21T11:02:58.5536528Z Greatest absolute difference: 0.015625 at index (237166, 1242) (up to 0.01 allowed) 2026-02-21T11:02:58.5536992Z Greatest relative difference: 207.0 at index (237166, 233) (up to 0.01 allowed) 2026-02-21T11:02:58.5537369Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:02:58.5537559Z 2026-02-21T11:03:00.9786764Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 15.9 configs/s 2026-02-21T11:03:16.1712429Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━ 529/529 34.6 configs/s 2026-02-21T11:03:16.4485640Z [126s] Generation 3 complete: 2026-02-21T11:03:16.4490071Z error=1 2026-02-21T11:03:16.4491571Z ok=90 2026-02-21T11:03:16.4491814Z min=0.4208 2026-02-21T11:03:16.4492143Z mid=0.5847 2026-02-21T11:03:16.4492339Z max=1.3753 2026-02-21T11:03:16.4492522Z best={'block_sizes': [8, 128, 512], 2026-02-21T11:03:16.4492849Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:03:16.4493147Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T11:03:16.4493407Z 'num_stages': 3, 2026-02-21T11:03:16.4493612Z 'num_warps': 4, 2026-02-21T11:03:16.4493794Z 'pid_type': 'flat', 2026-02-21T11:03:16.4494022Z 'range_flattens': [None, None, True], 2026-02-21T11:03:16.4494259Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:03:16.4494522Z 'range_num_stages': [0, 0, 
1], 2026-02-21T11:03:16.4494768Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:03:16.4495086Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:03:16.4511500Z [126s] Fitting surrogate: 395 points, 395 targets 2026-02-21T11:03:17.9566405Z [127s] Generation 4 starting: 81 neighbors, 5 active search path(s) 2026-02-21T11:03:23.2958785Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 12.0 configs/s 2026-02-21T11:03:28.7399349Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 15.5 configs/s 2026-02-21T11:03:36.4435821Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 597/597 76.5 configs/s 2026-02-21T11:03:36.6491771Z [146s] Generation 4 complete: 2026-02-21T11:03:36.6496230Z ok=86 2026-02-21T11:03:36.6500159Z min=0.3759 2026-02-21T11:03:36.6504524Z mid=0.5683 2026-02-21T11:03:36.6508936Z max=2.5641 2026-02-21T11:03:36.6513370Z best={'block_sizes': [8, 128, 1024], 2026-02-21T11:03:36.6517834Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:03:36.6521765Z 'load_eviction_policies': ['', 'first', '', 'last'], 2026-02-21T11:03:36.6523178Z 'num_stages': 3, 2026-02-21T11:03:36.6523455Z 'num_warps': 4, 2026-02-21T11:03:36.6528116Z 'pid_type': 'flat', 2026-02-21T11:03:36.6532445Z 'range_flattens': [None, None, True], 2026-02-21T11:03:36.6533863Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:03:36.6534557Z 'range_num_stages': [0, 0, 1], 2026-02-21T11:03:36.6534796Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:03:36.6535073Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:03:36.6535468Z [146s] Fitting surrogate: 481 points, 481 targets 2026-02-21T11:03:37.7798337Z [147s] Generation 5 starting: 83 neighbors, 5 active search path(s) 2026-02-21T11:03:43.1730158Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 22.7 configs/s 2026-02-21T11:03:46.5333117Z [156s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[16, 1024, 128], 
indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', '', '', 'last'], num_stages=3, num_warps=16, pid_type='flat', range_flattens=[None, False, True], range_multi_buffers=[None, None, True], range_num_stages=[0, 3, 0], range_unroll_factors=[0, 0, 1], range_warp_specializes=[None, True, None]) 2026-02-21T11:03:46.5334599Z Tensor-likes are not close! 2026-02-21T11:03:46.5339295Z 2026-02-21T11:03:46.5341158Z Mismatched elements: 14 / 536870912 (0.0%) 2026-02-21T11:03:46.5341564Z Greatest absolute difference: 0.015625 at index (237166, 1242) (up to 0.01 allowed) 2026-02-21T11:03:46.5345550Z Greatest relative difference: 207.0 at index (237166, 233) (up to 0.01 allowed) 2026-02-21T11:03:46.5349455Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:03:46.5353029Z 2026-02-21T11:03:48.6954914Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 15.3 configs/s 2026-02-21T11:03:57.7738352Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━ 597/597 64.9 configs/s 2026-02-21T11:03:57.9910297Z [167s] Generation 5 complete: 2026-02-21T11:03:57.9914816Z error=1 2026-02-21T11:03:57.9919399Z ok=87 2026-02-21T11:03:57.9923824Z min=0.3900 2026-02-21T11:03:57.9927686Z mid=0.5673 2026-02-21T11:03:57.9932091Z max=6.6622 2026-02-21T11:03:57.9932427Z best={'block_sizes': [8, 128, 1024], 2026-02-21T11:03:57.9932804Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:03:57.9937537Z 'load_eviction_policies': ['', 'first', '', 'last'], 2026-02-21T11:03:57.9941345Z 'num_stages': 3, 2026-02-21T11:03:57.9945411Z 'num_warps': 4, 2026-02-21T11:03:57.9949136Z 'pid_type': 'flat', 2026-02-21T11:03:57.9950868Z 'range_flattens': [None, None, True], 2026-02-21T11:03:57.9951160Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:03:57.9951396Z 'range_num_stages': [0, 0, 1], 2026-02-21T11:03:57.9951650Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:03:57.9951990Z 
'range_warp_specializes': [None, True, None]} 2026-02-21T11:03:57.9952275Z [167s] Fitting surrogate: 569 points, 569 targets 2026-02-21T11:03:59.1904998Z [168s] Generation 6 starting: 78 neighbors, 5 active search path(s) 2026-02-21T11:04:03.8020987Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 48.2 configs/s 2026-02-21T11:04:08.9779828Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 15.4 configs/s 2026-02-21T11:04:20.4178161Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 597/597 53.3 configs/s 2026-02-21T11:04:20.6577673Z [190s] Generation 6 complete: 2026-02-21T11:04:20.6582090Z ok=83 2026-02-21T11:04:20.6583349Z min=0.3881 2026-02-21T11:04:20.6583591Z mid=0.5570 2026-02-21T11:04:20.6583762Z max=3.4852 2026-02-21T11:04:20.6583979Z best={'block_sizes': [8, 128, 1024], 2026-02-21T11:04:20.6584286Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:04:20.6584640Z 'load_eviction_policies': ['', 'first', '', 'last'], 2026-02-21T11:04:20.6584884Z 'num_stages': 3, 2026-02-21T11:04:20.6585097Z 'num_warps': 4, 2026-02-21T11:04:20.6585281Z 'pid_type': 'flat', 2026-02-21T11:04:20.6585523Z 'range_flattens': [None, None, True], 2026-02-21T11:04:20.6585802Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:04:20.6587859Z 'range_num_stages': [0, 0, 1], 2026-02-21T11:04:20.6588189Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:04:20.6588454Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:04:20.6606927Z [190s] Fitting surrogate: 652 points, 652 targets 2026-02-21T11:04:21.7410851Z [191s] Generation 7 starting: 67 neighbors, 4 active search path(s) 2026-02-21T11:04:25.6223189Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67/67 55.8 configs/s 2026-02-21T11:04:27.9572733Z [197s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[16, 1024, 128], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 
'', '', 'last'], num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, False, True], range_multi_buffers=[None, None, True], range_num_stages=[0, 2, 0], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, True, None]) 2026-02-21T11:04:27.9573911Z Tensor-likes are not close! 2026-02-21T11:04:27.9578584Z 2026-02-21T11:04:27.9583148Z Mismatched elements: 14 / 536870912 (0.0%) 2026-02-21T11:04:27.9585065Z Greatest absolute difference: 0.015625 at index (237166, 1242) (up to 0.01 allowed) 2026-02-21T11:04:27.9585570Z Greatest relative difference: 207.0 at index (237166, 233) (up to 0.01 allowed) 2026-02-21T11:04:27.9585947Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:04:27.9586128Z 2026-02-21T11:04:27.9722853Z [197s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[16, 1024, 128], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', '', '', 'last'], num_stages=4, num_warps=16, pid_type='flat', range_flattens=[None, False, True], range_multi_buffers=[None, None, True], range_num_stages=[0, 2, 0], range_unroll_factors=[0, 0, 1], range_warp_specializes=[None, True, None]) 2026-02-21T11:04:27.9723941Z Tensor-likes are not close! 2026-02-21T11:04:27.9724114Z 2026-02-21T11:04:27.9724215Z Mismatched elements: 14 / 536870912 (0.0%) 2026-02-21T11:04:27.9724538Z Greatest absolute difference: 0.015625 at index (237166, 1242) (up to 0.01 allowed) 2026-02-21T11:04:27.9724952Z Greatest relative difference: 207.0 at index (237166, 233) (up to 0.01 allowed) 2026-02-21T11:04:27.9725327Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:04:27.9725508Z 2026-02-21T11:04:28.6395881Z [198s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[16, 1024, 128], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', '', '', 'last'], num_stages=4, num_warps=16, pid_type='flat', range_flattens=[None, False, True], range_multi_buffers=[None, None, True], range_num_stages=[0, 3, 0], range_unroll_factors=[0, 0, 1], range_warp_specializes=[None, True, None]) 2026-02-21T11:04:28.6396944Z Tensor-likes are not close! 2026-02-21T11:04:28.6399667Z 2026-02-21T11:04:28.6404253Z Mismatched elements: 14 / 536870912 (0.0%) 2026-02-21T11:04:28.6407586Z Greatest absolute difference: 0.015625 at index (237166, 1242) (up to 0.01 allowed) 2026-02-21T11:04:28.6412071Z Greatest relative difference: 207.0 at index (237166, 233) (up to 0.01 allowed) 2026-02-21T11:04:28.6416624Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:04:28.6416909Z 2026-02-21T11:04:28.6570883Z [198s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[16, 1024, 128], indexing=['tensor_descriptor', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', '', '', 'last'], num_stages=4, num_warps=16, pid_type='flat', range_flattens=[None, False, True], range_multi_buffers=[None, None, True], range_num_stages=[0, 2, 0], range_unroll_factors=[0, 0, 1], range_warp_specializes=[None, True, None]) 2026-02-21T11:04:28.6572079Z Tensor-likes are not close! 2026-02-21T11:04:28.6572243Z 2026-02-21T11:04:28.6572341Z Mismatched elements: 13 / 536870912 (0.0%) 2026-02-21T11:04:28.6572665Z Greatest absolute difference: 0.013671875 at index (248326, 1455) (up to 0.01 allowed) 2026-02-21T11:04:28.6573098Z Greatest relative difference: 2.125 at index (52004, 1242) (up to 0.01 allowed) 2026-02-21T11:04:28.6573465Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:04:28.6573650Z 2026-02-21T11:04:29.7392259Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 67/67 16.4 configs/s 2026-02-21T11:04:36.5081267Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━ 597/597 86.8 configs/s 2026-02-21T11:04:36.7109657Z [206s] Generation 7 complete: 2026-02-21T11:04:36.7114567Z error=4 2026-02-21T11:04:36.7118851Z ok=68 2026-02-21T11:04:36.7122831Z min=0.4045 2026-02-21T11:04:36.7124189Z mid=0.5909 2026-02-21T11:04:36.7124419Z max=1.4541 2026-02-21T11:04:36.7124604Z best={'block_sizes': [8, 128, 1024], 2026-02-21T11:04:36.7124926Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:04:36.7125233Z 'load_eviction_policies': ['', 'first', '', 'last'], 2026-02-21T11:04:36.7125490Z 'num_stages': 3, 2026-02-21T11:04:36.7125698Z 'num_warps': 4, 2026-02-21T11:04:36.7125875Z 'pid_type': 'flat', 2026-02-21T11:04:36.7126368Z 'range_flattens': [None, None, True], 2026-02-21T11:04:36.7126623Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:04:36.7126876Z 'range_num_stages': [0, 0, 1], 2026-02-21T11:04:36.7127080Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:04:36.7127338Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:04:36.7144393Z [206s] Fitting surrogate: 724 points, 724 targets 2026-02-21T11:04:37.6564821Z [207s] Generation 8 starting: 55 neighbors, 4 active search path(s) 2026-02-21T11:04:41.2409979Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 55/55 19.9 configs/s 2026-02-21T11:04:45.1411495Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 55/55 14.2 configs/s 2026-02-21T11:04:51.7227766Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━ 597/597 89.0 configs/s 2026-02-21T11:04:51.9210251Z [221s] Generation 8 complete: 2026-02-21T11:04:51.9215297Z ok=60 2026-02-21T11:04:51.9219664Z min=0.3903 2026-02-21T11:04:51.9223214Z mid=0.5489 2026-02-21T11:04:51.9227684Z max=2.2384 2026-02-21T11:04:51.9229124Z best={'block_sizes': [8, 
128, 1024], 2026-02-21T11:04:51.9229493Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:04:51.9229852Z 'load_eviction_policies': ['', 'first', '', 'last'], 2026-02-21T11:04:51.9230130Z 'num_stages': 3, 2026-02-21T11:04:51.9230318Z 'num_warps': 4, 2026-02-21T11:04:51.9230533Z 'pid_type': 'flat', 2026-02-21T11:04:51.9230737Z 'range_flattens': [None, None, True], 2026-02-21T11:04:51.9231011Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:04:51.9231231Z 'range_num_stages': [0, 0, 1], 2026-02-21T11:04:51.9231466Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:04:51.9231694Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:04:51.9252964Z [221s] Fitting surrogate: 784 points, 784 targets 2026-02-21T11:04:52.7435568Z [222s] Generation 9 starting: 45 neighbors, 3 active search path(s) 2026-02-21T11:04:55.5897610Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46/46 43.5 configs/s 2026-02-21T11:04:58.5723571Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 46/46 15.6 configs/s 2026-02-21T11:05:05.2975338Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━ 597/597 87.0 configs/s 2026-02-21T11:05:05.4962157Z [235s] Generation 9 complete: 2026-02-21T11:05:05.4965881Z ok=49 2026-02-21T11:05:05.4970272Z min=0.4024 2026-02-21T11:05:05.4973677Z mid=0.5826 2026-02-21T11:05:05.4977125Z max=1.5329 2026-02-21T11:05:05.4978617Z best={'block_sizes': [8, 128, 1024], 2026-02-21T11:05:05.4978982Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:05:05.4979305Z 'load_eviction_policies': ['', 'first', '', 'last'], 2026-02-21T11:05:05.4979579Z 'num_stages': 3, 2026-02-21T11:05:05.4979761Z 'num_warps': 4, 2026-02-21T11:05:05.4979969Z 'pid_type': 'flat', 2026-02-21T11:05:05.4980166Z 'range_flattens': [None, None, True], 2026-02-21T11:05:05.4980432Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:05:05.4981063Z 'range_num_stages': [0, 0, 1], 
2026-02-21T11:05:05.4981300Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:05:05.4981577Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:05:05.5001250Z [235s] Fitting surrogate: 833 points, 833 targets 2026-02-21T11:05:06.1612669Z [235s] Generation 10 starting: 33 neighbors, 2 active search path(s) 2026-02-21T11:05:08.3126114Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 45.1 configs/s 2026-02-21T11:05:10.5352472Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 34/34 15.6 configs/s 2026-02-21T11:05:15.1504117Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━ 597/597 126.0 configs/s 2026-02-21T11:05:15.3352114Z [244s] Generation 10 complete: 2026-02-21T11:05:15.3356032Z ok=36 2026-02-21T11:05:15.3359931Z min=0.3985 2026-02-21T11:05:15.3364434Z mid=0.5673 2026-02-21T11:05:15.3368810Z max=2.5661 2026-02-21T11:05:15.3373191Z best={'block_sizes': [8, 128, 1024], 2026-02-21T11:05:15.3377645Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:05:15.3381471Z 'load_eviction_policies': ['', 'first', '', 'last'], 2026-02-21T11:05:15.3381907Z 'num_stages': 3, 2026-02-21T11:05:15.3382146Z 'num_warps': 4, 2026-02-21T11:05:15.3382331Z 'pid_type': 'flat', 2026-02-21T11:05:15.3382567Z 'range_flattens': [None, None, True], 2026-02-21T11:05:15.3387244Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:05:15.3391477Z 'range_num_stages': [0, 0, 1], 2026-02-21T11:05:15.3395379Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:05:15.3399915Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:05:15.3404423Z [244s] Fitting surrogate: 869 points, 869 targets 2026-02-21T11:05:16.0028445Z [245s] Generation 11 starting: 36 neighbors, 2 active search path(s) 2026-02-21T11:05:18.4493132Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 35.3 configs/s 2026-02-21T11:05:21.2241323Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 37/37 13.5 configs/s 
2026-02-21T11:05:24.9942500Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━ 597/597 153.1 configs/s 2026-02-21T11:05:25.1662401Z [254s] Generation 11 complete: 2026-02-21T11:05:25.1666754Z ok=39 2026-02-21T11:05:25.1670644Z min=0.3851 2026-02-21T11:05:25.1675082Z mid=0.5867 2026-02-21T11:05:25.1679351Z max=1.7940 2026-02-21T11:05:25.1683165Z best={'block_sizes': [8, 128, 1024], 2026-02-21T11:05:25.1687637Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:05:25.1689155Z 'load_eviction_policies': ['', 'first', '', 'last'], 2026-02-21T11:05:25.1689437Z 'num_stages': 3, 2026-02-21T11:05:25.1689659Z 'num_warps': 4, 2026-02-21T11:05:25.1689868Z 'pid_type': 'flat', 2026-02-21T11:05:25.1690073Z 'range_flattens': [None, None, True], 2026-02-21T11:05:25.1690335Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:05:25.1690563Z 'range_num_stages': [0, 0, 1], 2026-02-21T11:05:25.1690795Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:05:25.1691025Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:05:25.1691346Z [254s] Fitting surrogate: 908 points, 908 targets 2026-02-21T11:05:25.7802488Z [255s] Generation 12 starting: 31 neighbors, 2 active search path(s) 2026-02-21T11:05:27.9688905Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 31.9 configs/s 2026-02-21T11:05:30.0366382Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 32/32 15.8 configs/s 2026-02-21T11:05:33.4910863Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━ 597/597 166.6 configs/s 2026-02-21T11:05:33.6578595Z [263s] Generation 12 complete: 2026-02-21T11:05:33.6582441Z ok=34 2026-02-21T11:05:33.6582699Z min=0.3871 2026-02-21T11:05:33.6587381Z mid=0.5602 2026-02-21T11:05:33.6591759Z max=1.2258 2026-02-21T11:05:33.6596080Z best={'block_sizes': [8, 128, 1024], 2026-02-21T11:05:33.6600488Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:05:33.6602156Z 'load_eviction_policies': ['', 
'first', '', 'last'], 2026-02-21T11:05:33.6602496Z 'num_stages': 3, 2026-02-21T11:05:33.6607064Z 'num_warps': 4, 2026-02-21T11:05:33.6611252Z 'pid_type': 'flat', 2026-02-21T11:05:33.6615754Z 'range_flattens': [None, None, True], 2026-02-21T11:05:33.6619698Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:05:33.6620083Z 'range_num_stages': [0, 0, 1], 2026-02-21T11:05:33.6620349Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:05:33.6624265Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:05:33.6627617Z [263s] Fitting surrogate: 942 points, 942 targets 2026-02-21T11:05:34.1214406Z [263s] Generation 13 starting: 17 neighbors, 1 active search path(s) 2026-02-21T11:05:36.6379069Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 9.0 configs/s 2026-02-21T11:05:37.8272332Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 15.6 configs/s 2026-02-21T11:05:39.5153528Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━ 597/597 324.4 configs/s 2026-02-21T11:05:39.6876853Z [269s] Generation 13 complete: 2026-02-21T11:05:39.6881770Z ok=19 2026-02-21T11:05:39.6885102Z min=0.4149 2026-02-21T11:05:39.6889572Z mid=0.5774 2026-02-21T11:05:39.6893423Z max=1.5759 2026-02-21T11:05:39.6896614Z best={'block_sizes': [8, 128, 1024], 2026-02-21T11:05:39.6901173Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:05:39.6905893Z 'load_eviction_policies': ['', 'first', '', 'last'], 2026-02-21T11:05:39.6906246Z 'num_stages': 3, 2026-02-21T11:05:39.6906487Z 'num_warps': 4, 2026-02-21T11:05:39.6910447Z 'pid_type': 'flat', 2026-02-21T11:05:39.6914953Z 'range_flattens': [None, None, True], 2026-02-21T11:05:39.6918846Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:05:39.6923212Z 'range_num_stages': [0, 0, 1], 2026-02-21T11:05:39.6927048Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:05:39.6931539Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:05:39.6935352Z [269s] Fitting 
surrogate: 961 points, 961 targets 2026-02-21T11:05:39.9853104Z [269s] Autotuning complete in 269.6s after searching 922 configs. 2026-02-21T11:05:39.9856287Z One can hardcode the best config and skip autotuning with: 2026-02-21T11:05:39.9861983Z @helion.kernel(config=helion.Config(block_sizes=[8, 128, 1024], indexing=['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['', 'first', '', 'last'], num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, None, True], range_multi_buffers=[None, True, True], range_num_stages=[0, 0, 1], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, True, None]), static_shapes=True) 2026-02-21T11:05:39.9862987Z 2026-02-21T11:05:39.9867798Z [269s] Code of selected kernel: /tmp/torchinductor_root/lv/clvh3f47sog7m6umuyimnlrqlows5dtrvu7x4nxuenyvunbwxi3l.py 2026-02-21T11:05:40.0192811Z from __future__ import annotations 2026-02-21T11:05:40.0197088Z 2026-02-21T11:05:40.0200533Z import torch 2026-02-21T11:05:40.0205067Z import triton 2026-02-21T11:05:40.0206876Z import triton.language as tl 2026-02-21T11:05:40.0207204Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T11:05:40.0207840Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T11:05:40.0208034Z 2026-02-21T11:05:40.0208157Z _BLOCK_SIZE_0 = tl.constexpr(8) 2026-02-21T11:05:40.0208372Z _BLOCK_SIZE_1 = tl.constexpr(128) 2026-02-21T11:05:40.0208620Z _BLOCK_SIZE_2 = tl.constexpr(1024) 2026-02-21T11:05:40.0208754Z 2026-02-21T11:05:40.0208829Z @triton.jit 2026-02-21T11:05:40.0209054Z def _helion_welford(x, weight, bias, out, eps): 2026-02-21T11:05:40.0209315Z # src[welford.py:45]: for tile_m in hl.tile(m): 2026-02-21T11:05:40.0209596Z pid_0 = tl.program_id(0) 2026-02-21T11:05:40.0209973Z offset_0 = pid_0 * _BLOCK_SIZE_0 2026-02-21T11:05:40.0210274Z indices_0 = (offset_0 + tl.arange(0, _BLOCK_SIZE_0)).to(tl.int32) 2026-02-21T11:05:40.0210657Z # src[welford.py:46]: acc_cnt = 
torch.zeros_like(x[tile_m, 0], dtype=torch.float32) 2026-02-21T11:05:40.0210995Z acc_cnt = tl.full([_BLOCK_SIZE_0], 0, tl.float32) 2026-02-21T11:05:40.0211430Z # src[welford.py:47]: acc_mean = torch.zeros_like(acc_cnt) 2026-02-21T11:05:40.0211745Z acc_mean = tl.full([_BLOCK_SIZE_0], 0, tl.float32) 2026-02-21T11:05:40.0212182Z # src[welford.py:48]: acc_m2 = torch.zeros_like(acc_cnt) 2026-02-21T11:05:40.0212473Z acc_m2 = tl.full([_BLOCK_SIZE_0], 0, tl.float32) 2026-02-21T11:05:40.0212733Z # src[welford.py:50]: for tile_n in hl.tile(n): 2026-02-21T11:05:40.0213028Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:05:40.0213282Z # src[welford.py:52]: Tn = chunk.size(-1) 2026-02-21T11:05:40.0213555Z # src[welford.py:50-63]: ... 2026-02-21T11:05:40.0213907Z for offset_1 in tl.range(0, 2048, _BLOCK_SIZE_1, warp_specialize=True, disallow_acc_multi_buffer=False): 2026-02-21T11:05:40.0214291Z indices_1 = offset_1 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T11:05:40.0214578Z acc_mean_copy = acc_mean 2026-02-21T11:05:40.0214777Z acc_cnt_copy = acc_cnt 2026-02-21T11:05:40.0215006Z acc_m2_copy = acc_m2 2026-02-21T11:05:40.0215229Z acc_mean_copy_0 = acc_mean_copy 2026-02-21T11:05:40.0215491Z acc_cnt_copy_0 = acc_cnt_copy 2026-02-21T11:05:40.0215726Z acc_m2_copy_0 = acc_m2_copy 2026-02-21T11:05:40.0215962Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:05:40.0216287Z chunk = tl.load(x + (indices_0[:, None] * 2048 + indices_1[None, :] * 1), None) 2026-02-21T11:05:40.0216611Z # src[welford.py:53]: sum_x = torch.sum(chunk, dim=-1) 2026-02-21T11:05:40.0216907Z sum_x = tl.cast(tl.sum(chunk, 1), tl.bfloat16) 2026-02-21T11:05:40.0217185Z # src[welford.py:54]: sum_x2 = torch.sum(chunk * chunk, dim=-1) 2026-02-21T11:05:40.0217476Z v_0 = chunk * chunk 2026-02-21T11:05:40.0217694Z sum_x2 = tl.cast(tl.sum(v_0, 1), tl.bfloat16) 2026-02-21T11:05:40.0217978Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:05:40.0218248Z _BLOCK_SIZE_1_ = 
_BLOCK_SIZE_1 2026-02-21T11:05:40.0218478Z # src[welford.py:55]: mean_c = sum_x / Tn 2026-02-21T11:05:40.0218752Z v_1 = tl.cast(_BLOCK_SIZE_1_, tl.bfloat16) 2026-02-21T11:05:40.0218977Z v_2 = sum_x / v_1 2026-02-21T11:05:40.0219257Z # src[welford.py:56]: m2_c = sum_x2 - (sum_x * sum_x) / Tn 2026-02-21T11:05:40.0219505Z v_3 = sum_x * sum_x 2026-02-21T11:05:40.0219747Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:05:40.0220010Z _BLOCK_SIZE_1__1 = _BLOCK_SIZE_1 2026-02-21T11:05:40.0220262Z # src[welford.py:56]: m2_c = sum_x2 - (sum_x * sum_x) / Tn 2026-02-21T11:05:40.0220557Z v_4 = tl.cast(_BLOCK_SIZE_1__1, tl.bfloat16) 2026-02-21T11:05:40.0220778Z v_5 = v_3 / v_4 2026-02-21T11:05:40.0220988Z v_6 = sum_x2 - v_5 2026-02-21T11:05:40.0221248Z # src[welford.py:58]: delta = mean_c - acc_mean 2026-02-21T11:05:40.0221477Z v_7 = tl.cast(v_2, tl.float32) 2026-02-21T11:05:40.0221716Z v_8 = v_7 - acc_mean_copy_0 2026-02-21T11:05:40.0221985Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:05:40.0222338Z _BLOCK_SIZE_1__2 = _BLOCK_SIZE_1 2026-02-21T11:05:40.0222573Z # src[welford.py:59]: new_cnt = acc_cnt + Tn 2026-02-21T11:05:40.0222844Z v_9 = tl.cast(_BLOCK_SIZE_1__2, tl.float32) 2026-02-21T11:05:40.0223104Z acc_cnt = acc_cnt_copy_0 + v_9 2026-02-21T11:05:40.0223373Z # src[welford.py:60]: new_mean = acc_mean + delta * (Tn / new_cnt) 2026-02-21T11:05:40.0223674Z v_11 = tl.full([], 1, tl.int32) 2026-02-21T11:05:40.0223884Z v_12 = v_11 / acc_cnt 2026-02-21T11:05:40.0224131Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:05:40.0224361Z _BLOCK_SIZE_1__3 = _BLOCK_SIZE_1 2026-02-21T11:05:40.0224648Z # src[welford.py:60]: new_mean = acc_mean + delta * (Tn / new_cnt) 2026-02-21T11:05:40.0224924Z v_13 = tl.cast(_BLOCK_SIZE_1__3, tl.float32) 2026-02-21T11:05:40.0225170Z v_14 = v_12 * v_13 2026-02-21T11:05:40.0225452Z v_15 = v_8 * v_14 2026-02-21T11:05:40.0225651Z acc_mean = acc_mean_copy_0 + v_15 2026-02-21T11:05:40.0225970Z # 
src[welford.py:61]: new_m2 = acc_m2 + m2_c + delta * delta * (acc_cnt * Tn / new_cnt) 2026-02-21T11:05:40.0226266Z v_17 = tl.cast(v_6, tl.float32) 2026-02-21T11:05:40.0226496Z v_18 = acc_m2_copy_0 + v_17 2026-02-21T11:05:40.0226694Z v_19 = v_8 * v_8 2026-02-21T11:05:40.0226930Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:05:40.0227192Z _BLOCK_SIZE_1__4 = _BLOCK_SIZE_1 2026-02-21T11:05:40.0227484Z # src[welford.py:61]: new_m2 = acc_m2 + m2_c + delta * delta * (acc_cnt * Tn / new_cnt) 2026-02-21T11:05:40.0227820Z v_20 = tl.cast(_BLOCK_SIZE_1__4, tl.float32) 2026-02-21T11:05:40.0228046Z v_21 = acc_cnt_copy_0 * v_20 2026-02-21T11:05:40.0228277Z v_22 = v_21 / acc_cnt 2026-02-21T11:05:40.0228469Z v_23 = v_19 * v_22 2026-02-21T11:05:40.0228684Z acc_m2 = v_18 + v_23 2026-02-21T11:05:40.0228939Z # src[welford.py:65]: rstd_tile = torch.rsqrt(acc_m2 / acc_cnt + eps) 2026-02-21T11:05:40.0229235Z v_25 = acc_m2 / acc_cnt 2026-02-21T11:05:40.0229449Z v_26 = v_25 + eps 2026-02-21T11:05:40.0229635Z v_27 = libdevice.rsqrt(v_26) 2026-02-21T11:05:40.0229889Z # src[welford.py:66]: mean_col = acc_mean[:, None] 2026-02-21T11:05:40.0230122Z mean_col = acc_mean[:, None] 2026-02-21T11:05:40.0230378Z # src[welford.py:67]: rstd_col = rstd_tile[:, None] 2026-02-21T11:05:40.0230607Z rstd_col = v_27[:, None] 2026-02-21T11:05:40.0230846Z # src[welford.py:69]: for tile_n in hl.tile(n): 2026-02-21T11:05:40.0231103Z # src[welford.py:70]: xi_chuck = x[tile_m, tile_n] 2026-02-21T11:05:40.0231408Z # src[welford.py:71]: w_chuck = weight[tile_n][None, :] 2026-02-21T11:05:40.0231691Z # src[welford.py:69-77]: ... 
2026-02-21T11:05:40.0232063Z for offset_2 in tl.range(0, 2048, _BLOCK_SIZE_2, num_stages=1, disallow_acc_multi_buffer=False, flatten=True): 2026-02-21T11:05:40.0232490Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_2).to(tl.int32) 2026-02-21T11:05:40.0232757Z mean_col_copy = mean_col 2026-02-21T11:05:40.0232982Z rstd_col_copy = rstd_col 2026-02-21T11:05:40.0233188Z mean_col_copy_0 = mean_col_copy 2026-02-21T11:05:40.0233427Z rstd_col_copy_0 = rstd_col_copy 2026-02-21T11:05:40.0233698Z # src[welford.py:70]: xi_chuck = x[tile_m, tile_n] 2026-02-21T11:05:40.0234079Z xi_chuck = tl.load(x + (indices_0[:, None] * 2048 + indices_2[None, :] * 1), None, eviction_policy='evict_first') 2026-02-21T11:05:40.0234499Z # src[welford.py:71]: w_chuck = weight[tile_n][None, :] 2026-02-21T11:05:40.0234773Z load_1 = tl.load(weight + indices_2 * 1, None) 2026-02-21T11:05:40.0235034Z w_chuck = load_1[None, :] 2026-02-21T11:05:40.0235269Z # src[welford.py:72]: b_chuck = bias[tile_n][None, :] 2026-02-21T11:05:40.0235609Z load_2 = tl.load(bias + indices_2 * 1, None, eviction_policy='evict_last') 2026-02-21T11:05:40.0235918Z b_chuck = load_2[None, :] 2026-02-21T11:05:40.0236229Z # src[welford.py:74]: y = (xi_chuck - mean_col) * rstd_col 2026-02-21T11:05:40.0236521Z v_28 = tl.cast(xi_chuck, tl.float32) 2026-02-21T11:05:40.0236744Z v_29 = v_28 - mean_col_copy_0 2026-02-21T11:05:40.0236979Z v_30 = v_29 * rstd_col_copy_0 2026-02-21T11:05:40.0237205Z # src[welford.py:75]: y = y * w_chuck + b_chuck 2026-02-21T11:05:40.0237471Z v_31 = tl.cast(w_chuck, tl.float32) 2026-02-21T11:05:40.0237687Z v_32 = v_30 * v_31 2026-02-21T11:05:40.0237914Z v_33 = tl.cast(b_chuck, tl.float32) 2026-02-21T11:05:40.0238152Z v_34 = v_32 + v_33 2026-02-21T11:05:40.0238377Z # src[welford.py:77]: out[tile_m, tile_n] = y.to(x.dtype) 2026-02-21T11:05:40.0238686Z v_35 = tl.cast(v_34, tl.bfloat16) 2026-02-21T11:05:40.0238966Z tl.store(out + (indices_0[:, None] * 2048 + indices_2[None, :] * 1), v_35, None) 
2026-02-21T11:05:40.0239261Z 2026-02-21T11:05:40.0239516Z def welford(weight: torch.Tensor, bias: torch.Tensor, x: torch.Tensor, eps: float=1e-05, *, _launcher=_default_launcher): 2026-02-21T11:05:40.0239918Z """ 2026-02-21T11:05:40.0240137Z Applies LayerNorm using Welford's algorithm for mean/variance. 2026-02-21T11:05:40.0240422Z Args: 2026-02-21T11:05:40.0240597Z weight: weight tensor of shape [N] 2026-02-21T11:05:40.0240851Z bias: bias tensor of shape [N] 2026-02-21T11:05:40.0241071Z x: input tensor of shape [M, N] 2026-02-21T11:05:40.0241308Z Returns: 2026-02-21T11:05:40.0241487Z Output tensor of shape [M, N] 2026-02-21T11:05:40.0241724Z """ 2026-02-21T11:05:40.0241948Z # src[welford.py:41]: m, n = x.size() 2026-02-21T11:05:40.0242195Z m, n = x.size() 2026-02-21T11:05:40.0242482Z # src[welford.py:43]: out = torch.empty([m, n], dtype=x.dtype, device=x.device) 2026-02-21T11:05:40.0242819Z out = torch.empty([m, n], dtype=x.dtype, device=x.device) 2026-02-21T11:05:40.0243125Z # src[welford.py:45]: for tile_m in hl.tile(m): 2026-02-21T11:05:40.0243359Z _BLOCK_SIZE_0 = 8 2026-02-21T11:05:40.0243598Z # src[welford.py:45]: for tile_m in hl.tile(m): 2026-02-21T11:05:40.0243913Z # src[welford.py:46]: acc_cnt = torch.zeros_like(x[tile_m, 0], dtype=torch.float32) 2026-02-21T11:05:40.0244294Z # src[welford.py:47]: acc_mean = torch.zeros_like(acc_cnt) 2026-02-21T11:05:40.0244586Z # src[welford.py:45-77]: ... 
2026-02-21T11:05:40.0244951Z _launcher(_helion_welford, (triton.cdiv(262144, _BLOCK_SIZE_0),), x, weight, bias, out, eps, num_warps=4, num_stages=3) 2026-02-21T11:05:40.0245368Z # src[welford.py:78]: return out 2026-02-21T11:05:40.0245576Z return out 2026-02-21T11:05:41.3970447Z WARNING:tritonbench.utils.triton_op:Completed input ID 2: 2026-02-21T11:05:41.3974229Z x_val 2026-02-21T11:05:41.3978638Z ------- 2026-02-21T11:05:41.3982910Z 2048 2026-02-21T11:05:41.3984225Z 2026-02-21T11:05:41.3993695Z 33%|███▎ | 2/6 [11:37<22:28, 337.14s/it]WARNING:tritonbench.utils.triton_op:Running input ID 4: 2026-02-21T11:05:41.3995276Z x_val 2026-02-21T11:05:41.3995502Z ------- 2026-02-21T11:05:41.3995667Z 3072 2026-02-21T11:05:41.4022899Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for eager_layer_norm 2026-02-21T11:05:42.1415538Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_welford 2026-02-21T11:05:43.4360405Z INFO:tritonbench.utils.triton_op:Took 2.14ms to get benchmark function for torch_compile_welford 2026-02-21T11:05:57.4198240Z WARNING:__main__:Input tensor metadata: 2026-02-21T11:05:57.4202550Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T11:05:57.4207225Z 'dtype': 'torch.bfloat16', 2026-02-21T11:05:57.4213018Z 'shape': (3072,), 2026-02-21T11:05:57.4215477Z 'stride': (1,)}, 2026-02-21T11:05:57.4219146Z { 'device': 'cuda:0', 2026-02-21T11:05:57.4220126Z 'dtype': 'torch.bfloat16', 2026-02-21T11:05:57.4220418Z 'shape': (3072,), 2026-02-21T11:05:57.4220969Z 'stride': (1,)}, 2026-02-21T11:05:57.4221187Z { 'device': 'cuda:0', 2026-02-21T11:05:57.4221454Z 'dtype': 'torch.bfloat16', 2026-02-21T11:05:57.4221692Z 'shape': (262144, 3072), 2026-02-21T11:05:57.4222003Z 'stride': (3072, 1)}), 2026-02-21T11:05:57.4222225Z 'kwargs': {}} 2026-02-21T11:05:57.4222590Z INFO:tritonbench.utils.triton_op:Took 1.51ms to get benchmark function for helion_welford 2026-02-21T11:05:57.6994251Z [0s] Autotune random seed: 2144717750 
2026-02-21T11:05:57.7403647Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T11:06:32.6934562Z [34s] Timeout after 30s compiling Config(block_sizes=[8192, 128, 1], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first', 'first', ''], maxnreg=64, num_sm_multiplier=2, num_stages=7, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, None, True], range_multi_buffers=[None, None, True], range_num_stages=[0, 4, 0], range_unroll_factors=[3, 3, 3], range_warp_specializes=[None, None, None]) 2026-02-21T11:06:32.7626140Z [35s] Timeout after 30s compiling Config(block_sizes=[16384, 32, 1], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first', 'first', ''], num_stages=4, num_warps=32, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, None, None], range_num_stages=[0, 2, 1], range_unroll_factors=[0, 4, 0], range_warp_specializes=[None, None, None]) 2026-02-21T11:06:32.7643551Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s 2026-02-21T11:07:06.9006960Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 3.4 configs/s 2026-02-21T11:07:06.9019871Z [69s] Adaptive compile timeout: 30s (90% percentile=3.4s, bounds=[30.0s, 30s]) 2026-02-21T11:07:07.5798394Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 186/186 180.5 configs/s 2026-02-21T11:07:07.9381053Z [70s] Initial random population of 100, 5 starting points: 2026-02-21T11:07:07.9384688Z error=6 2026-02-21T11:07:07.9388657Z timeout=2 2026-02-21T11:07:07.9393965Z ok=92 2026-02-21T11:07:07.9398248Z min=1.1069 2026-02-21T11:07:07.9399585Z mid=13.5516 2026-02-21T11:07:07.9399787Z max=402.8560 2026-02-21T11:07:07.9400046Z best={'block_sizes': [4, 512, 64], 
2026-02-21T11:07:07.9400444Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:07:07.9400794Z 'tensor_descriptor'], 2026-02-21T11:07:07.9401088Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:07:07.9401353Z 'num_stages': 7, 2026-02-21T11:07:07.9401600Z 'num_warps': 8, 2026-02-21T11:07:07.9401790Z 'pid_type': 'flat', 2026-02-21T11:07:07.9402095Z 'range_flattens': [None, True, False], 2026-02-21T11:07:07.9402418Z 'range_multi_buffers': [None, False, None], 2026-02-21T11:07:07.9402667Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:07:07.9402925Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:07:07.9403184Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:07:07.9403485Z [70s] Fitting surrogate: 100 points, 100 targets 2026-02-21T11:07:09.2100821Z [71s] Generation 1 starting: 95 neighbors, 5 active search path(s) 2026-02-21T11:07:15.8682589Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98/98 34.4 configs/s 2026-02-21T11:07:16.3890542Z [78s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 2048, 64], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=7, num_warps=2, pid_type='flat', range_flattens=[None, True, False], range_multi_buffers=[None, False, None], range_num_stages=[0, 4, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, None]) 2026-02-21T11:07:16.3892409Z Tensor-likes are not close! 2026-02-21T11:07:16.3895470Z 2026-02-21T11:07:16.3898421Z Mismatched elements: 617152719 / 805306368 (76.6%) 2026-02-21T11:07:16.3902250Z Greatest absolute difference: 2.5625 at index (30977, 217) (up to 0.01 allowed) 2026-02-21T11:07:16.3903174Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T11:07:16.3903571Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:07:16.3903756Z 2026-02-21T11:07:17.0756099Z [79s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 2048, 64], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', '', 'last'], num_stages=7, num_warps=4, pid_type='flat', range_flattens=[None, True, False], range_multi_buffers=[None, False, None], range_num_stages=[0, 4, 1], range_unroll_factors=[0, 2, 3], range_warp_specializes=[None, None, None]) 2026-02-21T11:07:17.0761659Z Tensor-likes are not close! 2026-02-21T11:07:17.0762353Z 2026-02-21T11:07:17.0762596Z Mismatched elements: 616793150 / 805306368 (76.6%) 2026-02-21T11:07:17.0762985Z Greatest absolute difference: 2.625 at index (1924, 2693) (up to 0.01 allowed) 2026-02-21T11:07:17.0763402Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T11:07:17.0763800Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:07:17.0764001Z 2026-02-21T11:07:19.7428629Z [82s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[16, 2048, 128], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', '', 'last'], num_stages=4, num_warps=8, pid_type='flat', range_flattens=[None, False, False], range_multi_buffers=[None, None, None], range_num_stages=[0, 3, 0], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, True, None]) 2026-02-21T11:07:19.7429844Z Tensor-likes are not close! 2026-02-21T11:07:19.7434439Z 2026-02-21T11:07:19.7438821Z Mismatched elements: 616703324 / 805306368 (76.6%) 2026-02-21T11:07:19.7442058Z Greatest absolute difference: 2.5625 at index (30977, 217) (up to 0.01 allowed) 2026-02-21T11:07:19.7443347Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T11:07:19.7443941Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:07:19.7445654Z 2026-02-21T11:07:23.4759312Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 98/98 12.9 configs/s 2026-02-21T11:07:31.4416673Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━ 263/263 32.0 configs/s 2026-02-21T11:07:31.7603400Z [94s] Generation 1 complete: 2026-02-21T11:07:31.7608215Z error=3 2026-02-21T11:07:31.7611548Z ok=97 2026-02-21T11:07:31.7615251Z min=0.8202 2026-02-21T11:07:31.7618997Z mid=1.4182 2026-02-21T11:07:31.7624179Z max=19.9388 2026-02-21T11:07:31.7624374Z best={'block_sizes': [32, 128, 128], 2026-02-21T11:07:31.7624742Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T11:07:31.7625352Z 'load_eviction_policies': ['first', 'first', 'first', 'last'], 2026-02-21T11:07:31.7625620Z 'maxnreg': 64, 2026-02-21T11:07:31.7625822Z 'num_sm_multiplier': 16, 2026-02-21T11:07:31.7626019Z 'num_stages': 7, 2026-02-21T11:07:31.7626297Z 'num_warps': 4, 2026-02-21T11:07:31.7626490Z 'pid_type': 'persistent_blocked', 2026-02-21T11:07:31.7626764Z 'range_flattens': [False, False, False], 2026-02-21T11:07:31.7627032Z 'range_multi_buffers': [False, False, False], 2026-02-21T11:07:31.7627301Z 'range_num_stages': [2, 3, 4], 2026-02-21T11:07:31.7627549Z 'range_unroll_factors': [0, 1, 0], 2026-02-21T11:07:31.7627811Z 'range_warp_specializes': [True, None, None]} 2026-02-21T11:07:31.7628094Z [94s] Fitting surrogate: 200 points, 200 targets 2026-02-21T11:07:33.0769462Z [95s] Generation 2 starting: 97 neighbors, 5 active search path(s) 2026-02-21T11:07:39.4573412Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 101/101 35.8 configs/s 2026-02-21T11:07:40.6795684Z [102s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 2048, 256], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, False], 
range_multi_buffers=[None, False, None], range_num_stages=[0, 4, 1], range_unroll_factors=[0, 0, 2], range_warp_specializes=[None, True, None]) 2026-02-21T11:07:40.6796943Z Tensor-likes are not close! 2026-02-21T11:07:40.6801424Z 2026-02-21T11:07:40.6805494Z Mismatched elements: 616793150 / 805306368 (76.6%) 2026-02-21T11:07:40.6807868Z Greatest absolute difference: 2.625 at index (1924, 2693) (up to 0.01 allowed) 2026-02-21T11:07:40.6808360Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T11:07:40.6808756Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:07:46.5148910Z 2026-02-21T11:07:46.5149438Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 101/101 14.4 configs/s 2026-02-21T11:07:56.4817915Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━ 343/343 33.7 configs/s 2026-02-21T11:07:56.7689660Z [119s] Generation 2 complete: 2026-02-21T11:07:56.7691548Z error=1 2026-02-21T11:07:56.7696156Z ok=102 2026-02-21T11:07:56.7696474Z min=0.6500 2026-02-21T11:07:56.7696726Z mid=0.9820 2026-02-21T11:07:56.7697368Z max=4.6692 2026-02-21T11:07:56.7697678Z best={'block_sizes': [4, 256, 512], 2026-02-21T11:07:56.7698014Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:07:56.7698530Z 'tensor_descriptor'], 2026-02-21T11:07:56.7698821Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:07:56.7699073Z 'num_stages': 7, 2026-02-21T11:07:56.7699328Z 'num_warps': 4, 2026-02-21T11:07:56.7699515Z 'pid_type': 'flat', 2026-02-21T11:07:56.7699752Z 'range_flattens': [None, False, False], 2026-02-21T11:07:56.7700159Z 'range_multi_buffers': [None, False, None], 2026-02-21T11:07:56.7700443Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:07:56.7700682Z 'range_unroll_factors': [0, 0, 2], 2026-02-21T11:07:56.7700967Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:07:56.7719481Z [119s] Fitting surrogate: 303 points, 303 targets 
2026-02-21T11:07:57.9847545Z [120s] Generation 3 starting: 91 neighbors, 5 active search path(s) 2026-02-21T11:08:05.0963498Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/95 13.5 configs/s 2026-02-21T11:08:11.8423493Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 95/95 14.1 configs/s 2026-02-21T11:08:24.0505422Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━ 355/355 28.6 configs/s 2026-02-21T11:08:24.3424354Z [146s] Generation 3 complete: 2026-02-21T11:08:24.3429544Z ok=97 2026-02-21T11:08:24.3433673Z min=0.6133 2026-02-21T11:08:24.3437561Z mid=0.9134 2026-02-21T11:08:24.3439262Z max=7.9237 2026-02-21T11:08:24.3439537Z best={'block_sizes': [4, 256, 1024], 2026-02-21T11:08:24.3440258Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:08:24.3440614Z 'tensor_descriptor'], 2026-02-21T11:08:24.3440940Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:08:24.3441186Z 'num_stages': 7, 2026-02-21T11:08:24.3441442Z 'num_warps': 4, 2026-02-21T11:08:24.3441643Z 'pid_type': 'flat', 2026-02-21T11:08:24.3441926Z 'range_flattens': [None, False, False], 2026-02-21T11:08:24.3442241Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:08:24.3442487Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:08:24.3442721Z 'range_unroll_factors': [0, 0, 2], 2026-02-21T11:08:24.3443003Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:08:24.3451968Z [146s] Fitting surrogate: 400 points, 400 targets 2026-02-21T11:08:25.5521834Z [147s] Generation 4 starting: 89 neighbors, 5 active search path(s) 2026-02-21T11:08:31.4372077Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 32.1 configs/s 2026-02-21T11:08:38.3294303Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 13.4 configs/s 2026-02-21T11:08:51.9935480Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 380/380 27.5 configs/s 2026-02-21T11:08:52.2916541Z [174s] Generation 4 
complete: 2026-02-21T11:08:52.2921004Z ok=95 2026-02-21T11:08:52.2924618Z min=0.6558 2026-02-21T11:08:52.2926140Z mid=0.8511 2026-02-21T11:08:52.2926474Z max=6.3396 2026-02-21T11:08:52.2931618Z best={'block_sizes': [4, 256, 1024], 2026-02-21T11:08:52.2935689Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:08:52.2936636Z 'tensor_descriptor'], 2026-02-21T11:08:52.2936974Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:08:52.2937280Z 'num_stages': 7, 2026-02-21T11:08:52.2937486Z 'num_warps': 4, 2026-02-21T11:08:52.2937726Z 'pid_type': 'flat', 2026-02-21T11:08:52.2937958Z 'range_flattens': [None, False, False], 2026-02-21T11:08:52.2938271Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:08:52.2938539Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:08:52.2938838Z 'range_unroll_factors': [0, 0, 2], 2026-02-21T11:08:52.2942259Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:08:52.2942641Z [174s] Fitting surrogate: 495 points, 495 targets 2026-02-21T11:08:53.5017030Z [175s] Generation 5 starting: 89 neighbors, 5 active search path(s) 2026-02-21T11:08:59.4502115Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 20.3 configs/s 2026-02-21T11:09:05.8938639Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 14.0 configs/s 2026-02-21T11:09:18.0161486Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━ 380/380 30.9 configs/s 2026-02-21T11:09:18.3078351Z [200s] Generation 5 complete: 2026-02-21T11:09:18.3080307Z ok=95 2026-02-21T11:09:18.3080558Z min=0.6564 2026-02-21T11:09:18.3080800Z mid=0.8869 2026-02-21T11:09:18.3081193Z max=6.5526 2026-02-21T11:09:18.3081415Z best={'block_sizes': [8, 128, 1024], 2026-02-21T11:09:18.3081803Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T11:09:18.3082310Z 'load_eviction_policies': ['', '', 'last', 'last'], 2026-02-21T11:09:18.3082619Z 'num_stages': 3, 
2026-02-21T11:09:18.3082821Z 'num_warps': 4, 2026-02-21T11:09:18.3083049Z 'pid_type': 'flat', 2026-02-21T11:09:18.3083271Z 'range_flattens': [None, False, False], 2026-02-21T11:09:18.3083559Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:09:18.3083834Z 'range_num_stages': [0, 3, 0], 2026-02-21T11:09:18.3084066Z 'range_unroll_factors': [0, 3, 0], 2026-02-21T11:09:18.3084350Z 'range_warp_specializes': [None, None, False]} 2026-02-21T11:09:18.3116100Z [200s] Fitting surrogate: 590 points, 590 targets 2026-02-21T11:09:19.6093121Z [201s] Generation 6 starting: 88 neighbors, 5 active search path(s) 2026-02-21T11:09:26.0764495Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 28.4 configs/s 2026-02-21T11:09:31.4379707Z [213s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 2048, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['', 'first', '', ''], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, True, False], range_num_stages=[0, 0, 2], range_unroll_factors=[0, 2, 0], range_warp_specializes=[None, False, None]) 2026-02-21T11:09:31.4381272Z Tensor-likes are not close! 2026-02-21T11:09:31.4381422Z 2026-02-21T11:09:31.4381582Z Mismatched elements: 618298450 / 805306368 (76.8%) 2026-02-21T11:09:31.4382158Z Greatest absolute difference: 2.625 at index (1924, 2693) (up to 0.01 allowed) 2026-02-21T11:09:31.4382705Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T11:09:31.4383112Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:09:31.4383360Z 2026-02-21T11:09:31.5315392Z [213s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 1024, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['', 'first', '', ''], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, True, True], range_num_stages=[0, 0, 2], range_unroll_factors=[0, 2, 1], range_warp_specializes=[None, False, None]) 2026-02-21T11:09:31.5316642Z Tensor-likes are not close! 2026-02-21T11:09:31.5320048Z 2026-02-21T11:09:31.5324277Z Mismatched elements: 9 / 805306368 (0.0%) 2026-02-21T11:09:31.5327795Z Greatest absolute difference: 0.0234375 at index (171338, 1559) (up to 0.01 allowed) 2026-02-21T11:09:31.5328590Z Greatest relative difference: 0.302734375 at index (207681, 2025) (up to 0.01 allowed) 2026-02-21T11:09:31.5329188Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:09:31.5329478Z 2026-02-21T11:09:31.7599887Z [214s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 2048, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['', 'first', 'last', ''], num_sm_multiplier=64, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None, False], range_multi_buffers=[True, True, True], range_num_stages=[1, 0, 2], range_unroll_factors=[0, 2, 1], range_warp_specializes=[False, False, None]) 2026-02-21T11:09:31.7601288Z Tensor-likes are not close! 2026-02-21T11:09:31.7604724Z 2026-02-21T11:09:31.7608318Z Mismatched elements: 618298450 / 805306368 (76.8%) 2026-02-21T11:09:31.7612665Z Greatest absolute difference: 2.625 at index (1924, 2693) (up to 0.01 allowed) 2026-02-21T11:09:31.7615560Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T11:09:31.7617482Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:09:31.7617721Z 2026-02-21T11:09:31.8516673Z [214s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 1024, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['', 'first', '', ''], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, True, True], range_num_stages=[0, 0, 2], range_unroll_factors=[0, 2, 0], range_warp_specializes=[None, False, None]) 2026-02-21T11:09:31.8517918Z Tensor-likes are not close! 2026-02-21T11:09:31.8522730Z 2026-02-21T11:09:31.8524335Z Mismatched elements: 9 / 805306368 (0.0%) 2026-02-21T11:09:31.8524757Z Greatest absolute difference: 0.0234375 at index (171338, 1559) (up to 0.01 allowed) 2026-02-21T11:09:31.8525165Z Greatest relative difference: 0.302734375 at index (207681, 2025) (up to 0.01 allowed) 2026-02-21T11:09:31.8525654Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:09:31.8525840Z 2026-02-21T11:09:32.0042756Z [214s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 2048, 512], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['', 'first', '', ''], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, True, True], range_num_stages=[0, 0, 2], range_unroll_factors=[0, 2, 1], range_warp_specializes=[None, False, None]) 2026-02-21T11:09:32.0044278Z Tensor-likes are not close! 2026-02-21T11:09:32.0047465Z 2026-02-21T11:09:32.0050409Z Mismatched elements: 618298450 / 805306368 (76.8%) 2026-02-21T11:09:32.0054413Z Greatest absolute difference: 2.625 at index (1924, 2693) (up to 0.01 allowed) 2026-02-21T11:09:32.0055514Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T11:09:32.0055896Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:09:32.0056109Z 2026-02-21T11:09:32.1364267Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 14.9 configs/s 2026-02-21T11:09:46.5590901Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 381/381 26.1 configs/s 2026-02-21T11:09:46.8484830Z [229s] Generation 6 complete: 2026-02-21T11:09:46.8485679Z error=5 2026-02-21T11:09:46.8485875Z ok=89 2026-02-21T11:09:46.8486132Z min=0.6646 2026-02-21T11:09:46.8486317Z mid=0.8550 2026-02-21T11:09:46.8486538Z max=3.3146 2026-02-21T11:09:46.8486731Z best={'block_sizes': [8, 128, 1024], 2026-02-21T11:09:46.8487099Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T11:09:46.8487494Z 'load_eviction_policies': ['', '', 'last', 'last'], 2026-02-21T11:09:46.8487747Z 'num_stages': 4, 2026-02-21T11:09:46.8487996Z 'num_warps': 4, 2026-02-21T11:09:46.8488200Z 'pid_type': 'flat', 2026-02-21T11:09:46.8488442Z 'range_flattens': [None, False, False], 2026-02-21T11:09:46.8488701Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:09:46.8488970Z 'range_num_stages': [0, 3, 0], 2026-02-21T11:09:46.8489222Z 'range_unroll_factors': [0, 3, 0], 2026-02-21T11:09:46.8489484Z 'range_warp_specializes': [None, None, False]} 2026-02-21T11:09:46.8543135Z [229s] Fitting surrogate: 684 points, 684 targets 2026-02-21T11:09:48.1332515Z [230s] Generation 7 starting: 86 neighbors, 5 active search path(s) 2026-02-21T11:09:53.8226018Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 49.1 configs/s 2026-02-21T11:09:59.2188131Z [241s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['', 'first', '', ''], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True, False], range_multi_buffers=[None, None, True], range_num_stages=[0, 0, 2], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, False, None]) 
2026-02-21T11:09:59.2189798Z Tensor-likes are not close! 2026-02-21T11:09:59.2194192Z 2026-02-21T11:09:59.2197054Z Mismatched elements: 618298450 / 805306368 (76.8%) 2026-02-21T11:09:59.2197407Z Greatest absolute difference: 2.625 at index (1924, 2693) (up to 0.01 allowed) 2026-02-21T11:09:59.2197947Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T11:09:59.2198319Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:09:59.2198539Z 2026-02-21T11:09:59.3793231Z [241s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['last', 'first', '', ''], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True, False], range_multi_buffers=[None, None, True], range_num_stages=[0, 0, 2], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, False, None]) 2026-02-21T11:09:59.3794452Z Tensor-likes are not close! 2026-02-21T11:09:59.3794591Z 2026-02-21T11:09:59.3794754Z Mismatched elements: 618298450 / 805306368 (76.8%) 2026-02-21T11:09:59.3795232Z Greatest absolute difference: 2.625 at index (1924, 2693) (up to 0.01 allowed) 2026-02-21T11:09:59.3795691Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T11:09:59.3796045Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:09:59.3796500Z 2026-02-21T11:09:59.9300936Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 14.7 configs/s 2026-02-21T11:10:14.7363323Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━ 381/381 25.5 configs/s 2026-02-21T11:10:15.0338322Z [257s] Generation 7 complete: 2026-02-21T11:10:15.0342632Z error=2 2026-02-21T11:10:15.0344229Z ok=89 2026-02-21T11:10:15.0344551Z min=0.6226 2026-02-21T11:10:15.0349607Z mid=0.8243 2026-02-21T11:10:15.0353257Z max=5.5383 2026-02-21T11:10:15.0355328Z best={'block_sizes': [8, 128, 1024], 2026-02-21T11:10:15.0355758Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T11:10:15.0356253Z 'load_eviction_policies': ['', '', 'last', 'last'], 2026-02-21T11:10:15.0356509Z 'num_stages': 4, 2026-02-21T11:10:15.0356733Z 'num_warps': 4, 2026-02-21T11:10:15.0356957Z 'pid_type': 'flat', 2026-02-21T11:10:15.0357508Z 'range_flattens': [None, False, False], 2026-02-21T11:10:15.0357807Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:10:15.0358109Z 'range_num_stages': [0, 3, 0], 2026-02-21T11:10:15.0358337Z 'range_unroll_factors': [0, 3, 0], 2026-02-21T11:10:15.0358618Z 'range_warp_specializes': [None, None, True]} 2026-02-21T11:10:15.0378406Z [257s] Fitting surrogate: 775 points, 775 targets 2026-02-21T11:10:16.2939025Z [258s] Generation 8 starting: 86 neighbors, 5 active search path(s) 2026-02-21T11:10:22.5007218Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88/88 51.5 configs/s 2026-02-21T11:10:28.1253040Z [270s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 2048, 256], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['', 'first', '', ''], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, None, True], range_num_stages=[0, 0, 2], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, False, None]) 
2026-02-21T11:10:28.1254221Z Tensor-likes are not close! 2026-02-21T11:10:28.1258138Z 2026-02-21T11:10:28.1261734Z Mismatched elements: 618298450 / 805306368 (76.8%) 2026-02-21T11:10:28.1265781Z Greatest absolute difference: 2.625 at index (1924, 2693) (up to 0.01 allowed) 2026-02-21T11:10:28.1266571Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T11:10:28.1267045Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:10:28.1267237Z 2026-02-21T11:10:28.5844212Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 88/88 14.5 configs/s 2026-02-21T11:10:40.8984275Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━ 383/383 30.7 configs/s 2026-02-21T11:10:41.1815249Z [283s] Generation 8 complete: 2026-02-21T11:10:41.1819737Z error=1 2026-02-21T11:10:41.1824095Z ok=90 2026-02-21T11:10:41.1825602Z min=0.6329 2026-02-21T11:10:41.1825965Z mid=0.8152 2026-02-21T11:10:41.1826146Z max=7.8193 2026-02-21T11:10:41.1826398Z best={'block_sizes': [2, 512, 256], 2026-02-21T11:10:41.1826750Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:10:41.1827239Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T11:10:41.1827466Z 'num_stages': 3, 2026-02-21T11:10:41.1827687Z 'num_warps': 1, 2026-02-21T11:10:41.1827900Z 'pid_type': 'flat', 2026-02-21T11:10:41.1828105Z 'range_flattens': [None, None, False], 2026-02-21T11:10:41.1828405Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:10:41.1828647Z 'range_num_stages': [0, 0, 1], 2026-02-21T11:10:41.1828890Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:10:41.1829126Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:10:41.1878913Z [283s] Fitting surrogate: 866 points, 866 targets 2026-02-21T11:10:42.4087773Z [284s] Generation 9 starting: 82 neighbors, 5 active search path(s) 2026-02-21T11:10:48.0972550Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 17.0 configs/s 
2026-02-21T11:10:52.0401388Z [294s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[8, 2048, 128], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'first', 'first', 'last'], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, False, True], range_multi_buffers=[None, True, False], range_num_stages=[0, 3, 4], range_unroll_factors=[0, 1, 1], range_warp_specializes=[None, None, False]) 2026-02-21T11:10:52.0403099Z Tensor-likes are not close! 2026-02-21T11:10:52.0406923Z 2026-02-21T11:10:52.0409910Z Mismatched elements: 616793150 / 805306368 (76.6%) 2026-02-21T11:10:52.0413790Z Greatest absolute difference: 2.625 at index (1924, 2693) (up to 0.01 allowed) 2026-02-21T11:10:52.0414712Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T11:10:52.0415144Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:10:52.0415347Z 2026-02-21T11:10:54.0526119Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 14.2 configs/s 2026-02-21T11:11:07.4596308Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━ 383/383 28.3 configs/s 2026-02-21T11:11:07.7475574Z [310s] Generation 9 complete: 2026-02-21T11:11:07.7476792Z error=1 2026-02-21T11:11:07.7477009Z ok=86 2026-02-21T11:11:07.7477323Z min=0.6092 2026-02-21T11:11:07.7477517Z mid=0.8540 2026-02-21T11:11:07.7477716Z max=7.2023 2026-02-21T11:11:07.7477960Z best={'block_sizes': [2, 512, 512], 2026-02-21T11:11:07.7478352Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:11:07.7478663Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T11:11:07.7478929Z 'num_stages': 3, 2026-02-21T11:11:07.7479176Z 'num_warps': 1, 2026-02-21T11:11:07.7479362Z 'pid_type': 'flat', 2026-02-21T11:11:07.7479592Z 'range_flattens': [None, None, False], 2026-02-21T11:11:07.7479869Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:11:07.7480135Z 
'range_num_stages': [0, 0, 1], 2026-02-21T11:11:07.7480397Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:11:07.7480716Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:11:07.7520331Z [310s] Fitting surrogate: 953 points, 953 targets 2026-02-21T11:11:08.7875944Z [311s] Generation 10 starting: 68 neighbors, 4 active search path(s) 2026-02-21T11:11:14.0359305Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 35.4 configs/s 2026-02-21T11:11:17.1849484Z [319s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[8, 2048, 128], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'first', 'last', 'last'], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, False, True], range_multi_buffers=[None, True, False], range_num_stages=[0, 3, 4], range_unroll_factors=[0, 1, 0], range_warp_specializes=[None, False, False]) 2026-02-21T11:11:17.1850764Z Tensor-likes are not close! 2026-02-21T11:11:17.1854983Z 2026-02-21T11:11:17.1857547Z Mismatched elements: 618298450 / 805306368 (76.8%) 2026-02-21T11:11:17.1858210Z Greatest absolute difference: 2.625 at index (1924, 2693) (up to 0.01 allowed) 2026-02-21T11:11:17.1858637Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T11:11:17.1859101Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:11:17.1859291Z 2026-02-21T11:11:18.8454559Z [321s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 1024, 512], indexing=['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['', 'first', '', ''], num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, None, True], range_num_stages=[0, 0, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, False, None]) 2026-02-21T11:11:18.8455733Z Tensor-likes are not close! 
2026-02-21T11:11:18.8455910Z 2026-02-21T11:11:18.8456015Z Mismatched elements: 9 / 805306368 (0.0%) 2026-02-21T11:11:18.8456373Z Greatest absolute difference: 0.0234375 at index (171338, 1559) (up to 0.01 allowed) 2026-02-21T11:11:18.8456848Z Greatest relative difference: 0.302734375 at index (207681, 2025) (up to 0.01 allowed) 2026-02-21T11:11:18.8457245Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:11:18.8457445Z 2026-02-21T11:11:18.8468659Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 70/70 14.7 configs/s 2026-02-21T11:11:29.1890790Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━ 383/383 36.4 configs/s 2026-02-21T11:11:29.4487979Z [331s] Generation 10 complete: 2026-02-21T11:11:29.4489694Z error=2 2026-02-21T11:11:29.4490009Z ok=70 2026-02-21T11:11:29.4490242Z min=0.6175 2026-02-21T11:11:29.4490418Z mid=0.8478 2026-02-21T11:11:29.4490647Z max=7.2761 2026-02-21T11:11:29.4490865Z best={'block_sizes': [2, 512, 512], 2026-02-21T11:11:29.4491172Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:11:29.4491834Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T11:11:29.4492173Z 'num_stages': 3, 2026-02-21T11:11:29.4492418Z 'num_warps': 1, 2026-02-21T11:11:29.4492628Z 'pid_type': 'flat', 2026-02-21T11:11:29.4492885Z 'range_flattens': [None, None, False], 2026-02-21T11:11:29.4493168Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:11:29.4493417Z 'range_num_stages': [0, 0, 0], 2026-02-21T11:11:29.4493681Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:11:29.4493926Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:11:29.4547994Z [331s] Fitting surrogate: 1025 points, 1025 targets 2026-02-21T11:11:30.3253402Z [332s] Generation 11 starting: 51 neighbors, 3 active search path(s) 2026-02-21T11:11:34.1183967Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 14.8 configs/s 2026-02-21T11:11:37.4276736Z [339s] Skipping config with accuracy 
mismatch: helion.Config(block_sizes=[2, 1024, 512], indexing=['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['', 'first', '', ''], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, True, True], range_num_stages=[0, 1, 0], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, False, None]) 2026-02-21T11:11:37.4277916Z Tensor-likes are not close! 2026-02-21T11:11:37.4278072Z 2026-02-21T11:11:37.4278173Z Mismatched elements: 9 / 805306368 (0.0%) 2026-02-21T11:11:37.4278558Z Greatest absolute difference: 0.0234375 at index (171338, 1559) (up to 0.01 allowed) 2026-02-21T11:11:37.4278953Z Greatest relative difference: 0.302734375 at index (207681, 2025) (up to 0.01 allowed) 2026-02-21T11:11:37.4279403Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:11:37.4279592Z 2026-02-21T11:11:37.7070098Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 52/52 14.6 configs/s 2026-02-21T11:11:45.6773468Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━━ 383/383 47.0 configs/s 2026-02-21T11:11:45.9249463Z [348s] Generation 11 complete: 2026-02-21T11:11:45.9252644Z error=1 2026-02-21T11:11:45.9255150Z ok=53 2026-02-21T11:11:45.9259693Z min=0.6074 2026-02-21T11:11:45.9261371Z mid=0.8571 2026-02-21T11:11:45.9261591Z max=3.7121 2026-02-21T11:11:45.9261769Z best={'block_sizes': [2, 512, 512], 2026-02-21T11:11:45.9262301Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:11:45.9262615Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T11:11:45.9262878Z 'num_stages': 3, 2026-02-21T11:11:45.9263109Z 'num_warps': 1, 2026-02-21T11:11:45.9263333Z 'pid_type': 'flat', 2026-02-21T11:11:45.9263559Z 'range_flattens': [None, None, False], 2026-02-21T11:11:45.9263842Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:11:45.9264123Z 'range_num_stages': [0, 0, 0], 2026-02-21T11:11:45.9264339Z 'range_unroll_factors': 
[0, 1, 2], 2026-02-21T11:11:45.9264638Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:11:45.9310046Z [348s] Fitting surrogate: 1079 points, 1079 targets 2026-02-21T11:11:46.7518368Z [349s] Generation 12 starting: 50 neighbors, 3 active search path(s) 2026-02-21T11:11:50.7425863Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 13.3 configs/s 2026-02-21T11:11:54.3156391Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 51/51 14.4 configs/s 2026-02-21T11:12:04.4198773Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━ 383/383 39.3 configs/s 2026-02-21T11:12:04.6774110Z [366s] Generation 12 complete: 2026-02-21T11:12:04.6776097Z ok=53 2026-02-21T11:12:04.6776370Z min=0.6196 2026-02-21T11:12:04.6780440Z mid=0.7926 2026-02-21T11:12:04.6784354Z max=10.8677 2026-02-21T11:12:04.6788576Z best={'block_sizes': [2, 512, 512], 2026-02-21T11:12:04.6790062Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:12:04.6790481Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T11:12:04.6795235Z 'num_stages': 3, 2026-02-21T11:12:04.6795569Z 'num_warps': 1, 2026-02-21T11:12:04.6795792Z 'pid_type': 'flat', 2026-02-21T11:12:04.6796069Z 'range_flattens': [None, None, False], 2026-02-21T11:12:04.6799955Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:12:04.6803930Z 'range_num_stages': [0, 0, 1], 2026-02-21T11:12:04.6808425Z 'range_unroll_factors': [0, 1, 2], 2026-02-21T11:12:04.6811545Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:12:04.6821354Z [366s] Fitting surrogate: 1132 points, 1132 targets 2026-02-21T11:12:05.4741129Z [367s] Generation 13 starting: 44 neighbors, 3 active search path(s) 2026-02-21T11:12:08.6361845Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 45/45 24.5 configs/s 2026-02-21T11:12:11.5377251Z [373s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 1024, 512], indexing=['pointer', 'pointer', 'pointer', 'pointer', 
'tensor_descriptor'], load_eviction_policies=['', 'first', '', ''], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, True, True], range_num_stages=[0, 0, 1], range_unroll_factors=[0, 1, 2], range_warp_specializes=[None, False, None]) 2026-02-21T11:12:11.5378390Z Tensor-likes are not close! 2026-02-21T11:12:11.5383397Z 2026-02-21T11:12:11.5385597Z Mismatched elements: 9 / 805306368 (0.0%) 2026-02-21T11:12:11.5385958Z Greatest absolute difference: 0.0234375 at index (171338, 1559) (up to 0.01 allowed) 2026-02-21T11:12:11.5386374Z Greatest relative difference: 0.302734375 at index (207681, 2025) (up to 0.01 allowed) 2026-02-21T11:12:11.5386713Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:12:11.5386919Z 2026-02-21T11:12:11.6760388Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 45/45 15.0 configs/s 2026-02-21T11:12:19.8617194Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━ 383/383 45.8 configs/s 2026-02-21T11:12:20.1124150Z [382s] Generation 13 complete: 2026-02-21T11:12:20.1128689Z error=1 2026-02-21T11:12:20.1132686Z ok=46 2026-02-21T11:12:20.1134328Z min=0.6155 2026-02-21T11:12:20.1134488Z mid=0.7988 2026-02-21T11:12:20.1134623Z max=4.1463 2026-02-21T11:12:20.1134792Z best={'block_sizes': [2, 512, 512], 2026-02-21T11:12:20.1135489Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:12:20.1135756Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T11:12:20.1135966Z 'num_stages': 3, 2026-02-21T11:12:20.1136109Z 'num_warps': 1, 2026-02-21T11:12:20.1136264Z 'pid_type': 'flat', 2026-02-21T11:12:20.1136431Z 'range_flattens': [None, None, False], 2026-02-21T11:12:20.1136624Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:12:20.1136826Z 'range_num_stages': [0, 0, 1], 2026-02-21T11:12:20.1136996Z 'range_unroll_factors': [0, 1, 2], 2026-02-21T11:12:20.1193174Z 'range_warp_specializes': [None, False, None]} 
2026-02-21T11:12:20.1193443Z [382s] Fitting surrogate: 1179 points, 1179 targets 2026-02-21T11:12:20.7110772Z [382s] Generation 14 starting: 28 neighbors, 2 active search path(s) 2026-02-21T11:12:22.9631145Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 29/29 21.8 configs/s 2026-02-21T11:12:24.7723871Z [387s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 1024, 512], indexing=['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['', 'first', '', ''], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, False, False], range_multi_buffers=[None, True, True], range_num_stages=[0, 0, 0], range_unroll_factors=[0, 1, 2], range_warp_specializes=[None, False, None]) 2026-02-21T11:12:24.7724964Z Tensor-likes are not close! 2026-02-21T11:12:24.7730221Z 2026-02-21T11:12:24.7732016Z Mismatched elements: 9 / 805306368 (0.0%) 2026-02-21T11:12:24.7732359Z Greatest absolute difference: 0.0234375 at index (171338, 1559) (up to 0.01 allowed) 2026-02-21T11:12:24.7732720Z Greatest relative difference: 0.302734375 at index (207681, 2025) (up to 0.01 allowed) 2026-02-21T11:12:24.7733045Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:12:24.7733210Z 2026-02-21T11:12:24.9813500Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 29/29 14.6 configs/s 2026-02-21T11:12:28.9746356Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━━ 383/383 91.8 configs/s 2026-02-21T11:12:29.1922518Z [391s] Generation 14 complete: 2026-02-21T11:12:29.1926497Z error=1 2026-02-21T11:12:29.1928356Z ok=29 2026-02-21T11:12:29.1928514Z min=0.6164 2026-02-21T11:12:29.1928656Z mid=0.8468 2026-02-21T11:12:29.1928773Z max=3.7510 2026-02-21T11:12:29.1928920Z best={'block_sizes': [2, 512, 512], 2026-02-21T11:12:29.1929172Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:12:29.1929438Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T11:12:29.1929630Z 'num_stages': 3, 2026-02-21T11:12:29.1929776Z 'num_warps': 1, 2026-02-21T11:12:29.1929923Z 'pid_type': 'flat', 2026-02-21T11:12:29.1930081Z 'range_flattens': [None, None, False], 2026-02-21T11:12:29.1930284Z 'range_multi_buffers': [None, True, False], 2026-02-21T11:12:29.1930468Z 'range_num_stages': [0, 1, 1], 2026-02-21T11:12:29.1930661Z 'range_unroll_factors': [0, 1, 2], 2026-02-21T11:12:29.1930863Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:12:29.1977836Z [391s] Fitting surrogate: 1209 points, 1209 targets 2026-02-21T11:12:29.8198926Z [392s] Generation 15 starting: 31 neighbors, 2 active search path(s) 2026-02-21T11:12:32.0662289Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 42.6 configs/s 2026-02-21T11:12:34.2778539Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 32/32 14.7 configs/s 2026-02-21T11:12:40.4716239Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━ 383/383 60.1 configs/s 2026-02-21T11:12:40.7114629Z [402s] Generation 15 complete: 2026-02-21T11:12:40.7118746Z ok=33 2026-02-21T11:12:40.7120578Z min=0.6317 2026-02-21T11:12:40.7120748Z mid=0.8141 2026-02-21T11:12:40.7120872Z max=3.4729 2026-02-21T11:12:40.7121024Z best={'block_sizes': 
[2, 512, 512], 2026-02-21T11:12:40.7121290Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:12:40.7121593Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T11:12:40.7122533Z 'num_stages': 3, 2026-02-21T11:12:40.7122692Z 'num_warps': 1, 2026-02-21T11:12:40.7122838Z 'pid_type': 'flat', 2026-02-21T11:12:40.7123008Z 'range_flattens': [None, None, False], 2026-02-21T11:12:40.7123202Z 'range_multi_buffers': [None, True, False], 2026-02-21T11:12:40.7123401Z 'range_num_stages': [0, 1, 1], 2026-02-21T11:12:40.7123581Z 'range_unroll_factors': [0, 1, 2], 2026-02-21T11:12:40.7123773Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:12:40.7169193Z [402s] Fitting surrogate: 1242 points, 1242 targets 2026-02-21T11:12:41.1672223Z [403s] Generation 16 starting: 16 neighbors, 1 active search path(s) 2026-02-21T11:12:42.5171497Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 38.5 configs/s 2026-02-21T11:12:42.5403402Z [404s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 1024, 256], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', ''], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, False, True], range_num_stages=[0, 3, 3], range_unroll_factors=[0, 2, 0], range_warp_specializes=[None, None, False]) 2026-02-21T11:12:42.5404535Z Tensor-likes are not close! 2026-02-21T11:12:42.5408566Z 2026-02-21T11:12:42.5412959Z Mismatched elements: 9 / 805306368 (0.0%) 2026-02-21T11:12:42.5414438Z Greatest absolute difference: 0.0234375 at index (171338, 1559) (up to 0.01 allowed) 2026-02-21T11:12:42.5414841Z Greatest relative difference: 0.302734375 at index (207681, 2025) (up to 0.01 allowed) 2026-02-21T11:12:42.5415181Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:12:42.5415354Z 2026-02-21T11:12:42.9863667Z [405s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 2048, 256], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, True, True], range_num_stages=[0, 3, 3], range_unroll_factors=[0, 1, 0], range_warp_specializes=[None, None, False]) 2026-02-21T11:12:42.9864763Z Tensor-likes are not close! 2026-02-21T11:12:42.9864911Z 2026-02-21T11:12:42.9869012Z Mismatched elements: 618298450 / 805306368 (76.8%) 2026-02-21T11:12:42.9869407Z Greatest absolute difference: 2.625 at index (1924, 2693) (up to 0.01 allowed) 2026-02-21T11:12:42.9869785Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T11:12:42.9873698Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:12:42.9876589Z 2026-02-21T11:12:43.0099712Z [405s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 2048, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['last', 'first', 'last', 'last'], num_stages=7, num_warps=4, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, False, True], range_num_stages=[0, 3, 3], range_unroll_factors=[0, 1, 0], range_warp_specializes=[None, None, False]) 2026-02-21T11:12:43.0100790Z Tensor-likes are not close! 2026-02-21T11:12:43.0100944Z 2026-02-21T11:12:43.0101190Z Mismatched elements: 616793150 / 805306368 (76.6%) 2026-02-21T11:12:43.0101493Z Greatest absolute difference: 2.625 at index (1924, 2693) (up to 0.01 allowed) 2026-02-21T11:12:43.0105574Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T11:12:43.0108781Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:12:43.0111675Z 2026-02-21T11:12:43.1024078Z [405s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 2048, 256], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, True, True], range_num_stages=[0, 3, 3], range_unroll_factors=[0, 2, 0], range_warp_specializes=[None, None, False]) 2026-02-21T11:12:43.1025394Z Tensor-likes are not close! 2026-02-21T11:12:43.1029583Z 2026-02-21T11:12:43.1034250Z Mismatched elements: 618298450 / 805306368 (76.8%) 2026-02-21T11:12:43.1034649Z Greatest absolute difference: 2.625 at index (1924, 2693) (up to 0.01 allowed) 2026-02-21T11:12:43.1035004Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T11:12:43.1039429Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:12:43.1041246Z 2026-02-21T11:12:43.5402737Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 16.3 configs/s 2026-02-21T11:12:45.6989053Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━ 383/383 160.9 configs/s 2026-02-21T11:12:45.9360703Z [408s] Generation 16 complete: 2026-02-21T11:12:45.9365011Z error=4 2026-02-21T11:12:45.9368861Z ok=14 2026-02-21T11:12:45.9373226Z min=0.6288 2026-02-21T11:12:45.9377892Z mid=0.9379 2026-02-21T11:12:45.9380682Z max=2.8120 2026-02-21T11:12:45.9385689Z best={'block_sizes': [2, 512, 512], 2026-02-21T11:12:45.9387256Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:12:45.9387578Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T11:12:45.9391707Z 'num_stages': 3, 2026-02-21T11:12:45.9394816Z 'num_warps': 1, 2026-02-21T11:12:45.9399398Z 'pid_type': 'flat', 2026-02-21T11:12:45.9403855Z 'range_flattens': [None, None, False], 2026-02-21T11:12:45.9405440Z 'range_multi_buffers': [None, True, False], 
2026-02-21T11:12:45.9405666Z 'range_num_stages': [0, 1, 1], 2026-02-21T11:12:45.9405848Z 'range_unroll_factors': [0, 1, 2], 2026-02-21T11:12:45.9406048Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:12:45.9406361Z [408s] Fitting surrogate: 1260 points, 1260 targets 2026-02-21T11:12:46.3938053Z [408s] Generation 17 starting: 17 neighbors, 1 active search path(s) 2026-02-21T11:12:48.2096028Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 14.3 configs/s 2026-02-21T11:12:49.5074789Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 14.2 configs/s 2026-02-21T11:12:51.5852608Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━ 383/383 167.0 configs/s 2026-02-21T11:12:51.8108020Z [414s] Generation 17 complete: 2026-02-21T11:12:51.8112324Z ok=19 2026-02-21T11:12:51.8113990Z min=0.6155 2026-02-21T11:12:51.8114192Z mid=0.9984 2026-02-21T11:12:51.8114329Z max=2.1770 2026-02-21T11:12:51.8118811Z best={'block_sizes': [2, 512, 512], 2026-02-21T11:12:51.8122830Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:12:51.8126580Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T11:12:51.8126903Z 'num_stages': 3, 2026-02-21T11:12:51.8127084Z 'num_warps': 1, 2026-02-21T11:12:51.8127268Z 'pid_type': 'flat', 2026-02-21T11:12:51.8127451Z 'range_flattens': [None, None, False], 2026-02-21T11:12:51.8127679Z 'range_multi_buffers': [None, True, False], 2026-02-21T11:12:51.8127883Z 'range_num_stages': [0, 1, 1], 2026-02-21T11:12:51.8128066Z 'range_unroll_factors': [0, 1, 2], 2026-02-21T11:12:51.8128269Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:12:51.8154301Z [414s] Fitting surrogate: 1279 points, 1279 targets 2026-02-21T11:12:52.2827402Z [414s] Generation 18 starting: 16 neighbors, 1 active search path(s) 2026-02-21T11:12:54.3142467Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 9.0 configs/s 2026-02-21T11:12:55.1817400Z [417s] Skipping config with 
accuracy mismatch: helion.Config(block_sizes=[4, 1024, 1024], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, True, True], range_num_stages=[0, 3, 3], range_unroll_factors=[0, 2, 1], range_warp_specializes=[None, None, False]) 2026-02-21T11:12:55.1818459Z Tensor-likes are not close! 2026-02-21T11:12:55.1818620Z 2026-02-21T11:12:55.1823318Z Mismatched elements: 9 / 805306368 (0.0%) 2026-02-21T11:12:55.1824824Z Greatest absolute difference: 0.0234375 at index (171338, 1559) (up to 0.01 allowed) 2026-02-21T11:12:55.1825228Z Greatest relative difference: 0.302734375 at index (207681, 2025) (up to 0.01 allowed) 2026-02-21T11:12:55.1825543Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:12:55.1825707Z 2026-02-21T11:12:55.2032774Z [417s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 1024, 512], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, False, True], range_num_stages=[0, 3, 3], range_unroll_factors=[0, 2, 1], range_warp_specializes=[None, None, False]) 2026-02-21T11:12:55.2033842Z Tensor-likes are not close! 2026-02-21T11:12:55.2039631Z 2026-02-21T11:12:55.2043352Z Mismatched elements: 9 / 805306368 (0.0%) 2026-02-21T11:12:55.2047287Z Greatest absolute difference: 0.0234375 at index (171338, 1559) (up to 0.01 allowed) 2026-02-21T11:12:55.2047747Z Greatest relative difference: 0.302734375 at index (207681, 2025) (up to 0.01 allowed) 2026-02-21T11:12:55.2048084Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:12:55.2052076Z 2026-02-21T11:12:55.4148776Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 16.0 configs/s 2026-02-21T11:12:58.0817848Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━ 383/383 133.8 configs/s 2026-02-21T11:12:58.3068356Z [420s] Generation 18 complete: 2026-02-21T11:12:58.3072725Z error=2 2026-02-21T11:12:58.3076610Z ok=16 2026-02-21T11:12:58.3080597Z min=0.6134 2026-02-21T11:12:58.3084614Z mid=0.8366 2026-02-21T11:12:58.3089007Z max=1.7388 2026-02-21T11:12:58.3094099Z best={'block_sizes': [2, 512, 512], 2026-02-21T11:12:58.3094467Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:12:58.3094773Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T11:12:58.3098643Z 'num_stages': 3, 2026-02-21T11:12:58.3102531Z 'num_warps': 1, 2026-02-21T11:12:58.3106337Z 'pid_type': 'flat', 2026-02-21T11:12:58.3110133Z 'range_flattens': [None, None, False], 2026-02-21T11:12:58.3114673Z 'range_multi_buffers': [None, True, False], 2026-02-21T11:12:58.3119642Z 'range_num_stages': [0, 1, 1], 2026-02-21T11:12:58.3121042Z 'range_unroll_factors': [0, 1, 2], 2026-02-21T11:12:58.3121305Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:12:58.3121607Z [420s] Fitting surrogate: 1297 points, 1297 targets 2026-02-21T11:12:58.7312420Z [420s] Generation 19 starting: 17 neighbors, 1 active search path(s) 2026-02-21T11:13:00.8322576Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 34.3 configs/s 2026-02-21T11:13:00.8555328Z [423s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 2048, 512], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['last', 'first', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, None, True], range_multi_buffers=[None, True, True], range_num_stages=[0, 3, 3], range_unroll_factors=[0, 2, 1], range_warp_specializes=[None, None, False]) 
2026-02-21T11:13:00.8556938Z Tensor-likes are not close! 2026-02-21T11:13:00.8561470Z 2026-02-21T11:13:00.8566015Z Mismatched elements: 618298450 / 805306368 (76.8%) 2026-02-21T11:13:00.8570279Z Greatest absolute difference: 2.625 at index (1924, 2693) (up to 0.01 allowed) 2026-02-21T11:13:00.8570696Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T11:13:00.8574895Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:13:00.8576710Z 2026-02-21T11:13:02.0455945Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 14.4 configs/s 2026-02-21T11:13:04.0536009Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━━ 383/383 174.8 configs/s 2026-02-21T11:13:04.2729042Z [426s] Generation 19 complete: 2026-02-21T11:13:04.2733475Z error=1 2026-02-21T11:13:04.2737910Z ok=18 2026-02-21T11:13:04.2742090Z min=0.6093 2026-02-21T11:13:04.2745668Z mid=0.8837 2026-02-21T11:13:04.2749052Z max=2.6418 2026-02-21T11:13:04.2752115Z best={'block_sizes': [2, 512, 512], 2026-02-21T11:13:04.2752422Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:13:04.2752685Z 'load_eviction_policies': ['', 'first', '', ''], 2026-02-21T11:13:04.2752881Z 'num_stages': 3, 2026-02-21T11:13:04.2753017Z 'num_warps': 1, 2026-02-21T11:13:04.2753164Z 'pid_type': 'flat', 2026-02-21T11:13:04.2753320Z 'range_flattens': [None, None, False], 2026-02-21T11:13:04.2753524Z 'range_multi_buffers': [None, True, False], 2026-02-21T11:13:04.2753718Z 'range_num_stages': [0, 1, 1], 2026-02-21T11:13:04.2753881Z 'range_unroll_factors': [0, 1, 2], 2026-02-21T11:13:04.2754094Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:13:04.2778420Z [426s] Fitting surrogate: 1316 points, 1316 targets 2026-02-21T11:13:04.7281025Z [426s] Generation 20 starting: 15 neighbors, 1 active search path(s) 2026-02-21T11:13:05.9597310Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 32.4 configs/s 
2026-02-21T11:13:06.1236903Z [428s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 512], indexing=['pointer', 'tensor_descriptor', 'pointer', 'pointer', 'pointer'], load_eviction_policies=['last', 'first', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, None, True], range_multi_buffers=[None, True, True], range_num_stages=[0, 3, 3], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, None]) 2026-02-21T11:13:06.1238076Z Tensor-likes are not close! 2026-02-21T11:13:06.1238220Z 2026-02-21T11:13:06.1238444Z Mismatched elements: 618298450 / 805306368 (76.8%) 2026-02-21T11:13:06.1238768Z Greatest absolute difference: 2.625 at index (1924, 2693) (up to 0.01 allowed) 2026-02-21T11:13:06.1243424Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T11:13:06.1244778Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:13:06.1244970Z 2026-02-21T11:13:06.4245409Z [428s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, None, True], range_multi_buffers=[None, True, True], range_num_stages=[0, 3, 3], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, False]) 2026-02-21T11:13:06.4246541Z Tensor-likes are not close! 2026-02-21T11:13:06.4246688Z 2026-02-21T11:13:06.4246909Z Mismatched elements: 618298450 / 805306368 (76.8%) 2026-02-21T11:13:06.4247221Z Greatest absolute difference: 2.625 at index (1924, 2693) (up to 0.01 allowed) 2026-02-21T11:13:06.4247568Z Greatest relative difference: inf at index (14949, 2261) (up to 0.01 allowed) 2026-02-21T11:13:06.4248152Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:13:06.4252297Z 2026-02-21T11:13:06.8998633Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 16.7 configs/s 2026-02-21T11:13:09.6990476Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━━ 415/415 139.7 configs/s 2026-02-21T11:13:09.8976642Z [432s] Generation 20 complete: 2026-02-21T11:13:09.8980904Z error=2 2026-02-21T11:13:09.8982524Z ok=15 2026-02-21T11:13:09.8983037Z min=0.5969 2026-02-21T11:13:09.8983230Z mid=0.6665 2026-02-21T11:13:09.8983375Z max=1.2676 2026-02-21T11:13:09.8983534Z best={'block_sizes': [1, 512, 1024], 2026-02-21T11:13:09.8983868Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T11:13:09.8984199Z 'load_eviction_policies': ['last', 'first', 'last', 'last'], 2026-02-21T11:13:09.8984425Z 'num_stages': 7, 2026-02-21T11:13:09.8984849Z 'num_warps': 1, 2026-02-21T11:13:09.8985015Z 'pid_type': 'flat', 2026-02-21T11:13:09.8985170Z 'range_flattens': [None, None, True], 2026-02-21T11:13:09.8985369Z 'range_multi_buffers': [None, True, True], 2026-02-21T11:13:09.8985561Z 'range_num_stages': [0, 3, 3], 2026-02-21T11:13:09.8985727Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:13:09.8985923Z 'range_warp_specializes': [None, None, False]} 2026-02-21T11:13:09.9031504Z [432s] Fitting surrogate: 1333 points, 1333 targets 2026-02-21T11:13:10.2022333Z [432s] Autotuning complete in 432.5s after searching 1289 configs. 
2026-02-21T11:13:10.2026789Z One can hardcode the best config and skip autotuning with: 2026-02-21T11:13:10.2031365Z @helion.kernel(config=helion.Config(block_sizes=[1, 512, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, None, True], range_multi_buffers=[None, True, True], range_num_stages=[0, 3, 3], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, False]), static_shapes=True) 2026-02-21T11:13:10.2032388Z 2026-02-21T11:13:10.2032650Z [432s] Code of selected kernel: /tmp/torchinductor_root/kw/ckwtmbhn5eksnhebayfggs744n2jl7hszrdrao5irljshygk47cj.py 2026-02-21T11:13:10.2370643Z from __future__ import annotations 2026-02-21T11:13:10.2372315Z 2026-02-21T11:13:10.2372541Z import torch 2026-02-21T11:13:10.2372743Z import triton 2026-02-21T11:13:10.2377504Z import triton.language as tl 2026-02-21T11:13:10.2379624Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T11:13:10.2379944Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T11:13:10.2380119Z 2026-02-21T11:13:10.2380189Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T11:13:10.2380372Z _BLOCK_SIZE_1 = tl.constexpr(512) 2026-02-21T11:13:10.2380557Z _BLOCK_SIZE_2 = tl.constexpr(1024) 2026-02-21T11:13:10.2380670Z 2026-02-21T11:13:10.2380738Z @triton.jit 2026-02-21T11:13:10.2380904Z def _helion_welford(x, weight, bias, out, eps): 2026-02-21T11:13:10.2381126Z # src[welford.py:45]: for tile_m in hl.tile(m): 2026-02-21T11:13:10.2381334Z pid_0 = tl.program_id(0) 2026-02-21T11:13:10.2381491Z offset_0 = pid_0 2026-02-21T11:13:10.2381669Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T11:13:10.2382029Z # src[welford.py:46]: acc_cnt = torch.zeros_like(x[tile_m, 0], dtype=torch.float32) 2026-02-21T11:13:10.2382330Z acc_cnt = tl.full([_BLOCK_SIZE_0], 0, tl.float32) 2026-02-21T11:13:10.2382578Z 
# src[welford.py:47]: acc_mean = torch.zeros_like(acc_cnt) 2026-02-21T11:13:10.2382813Z acc_mean = tl.full([_BLOCK_SIZE_0], 0, tl.float32) 2026-02-21T11:13:10.2383057Z # src[welford.py:48]: acc_m2 = torch.zeros_like(acc_cnt) 2026-02-21T11:13:10.2383285Z acc_m2 = tl.full([_BLOCK_SIZE_0], 0, tl.float32) 2026-02-21T11:13:10.2383508Z # src[welford.py:50]: for tile_n in hl.tile(n): 2026-02-21T11:13:10.2383728Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:13:10.2384191Z # src[welford.py:52]: Tn = chunk.size(-1) 2026-02-21T11:13:10.2384392Z # src[welford.py:50-63]: ... 2026-02-21T11:13:10.2384718Z for offset_1 in tl.range(0, 3072, _BLOCK_SIZE_1, loop_unroll_factor=2, num_stages=1, disallow_acc_multi_buffer=False): 2026-02-21T11:13:10.2385111Z indices_1 = offset_1 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T11:13:10.2385340Z acc_mean_copy = acc_mean 2026-02-21T11:13:10.2385511Z acc_cnt_copy = acc_cnt 2026-02-21T11:13:10.2385670Z acc_m2_copy = acc_m2 2026-02-21T11:13:10.2385839Z acc_mean_copy_0 = acc_mean_copy 2026-02-21T11:13:10.2386015Z acc_cnt_copy_0 = acc_cnt_copy 2026-02-21T11:13:10.2386225Z acc_m2_copy_0 = acc_m2_copy 2026-02-21T11:13:10.2386422Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:13:10.2386821Z chunk = tl.load(x + (indices_0[:, None] * 3072 + indices_1[None, :] * 1), None, eviction_policy='evict_last') 2026-02-21T11:13:10.2387173Z # src[welford.py:53]: sum_x = torch.sum(chunk, dim=-1) 2026-02-21T11:13:10.2387408Z sum_x = tl.cast(tl.sum(chunk, 1), tl.bfloat16) 2026-02-21T11:13:10.2387645Z # src[welford.py:54]: sum_x2 = torch.sum(chunk * chunk, dim=-1) 2026-02-21T11:13:10.2387877Z v_0 = chunk * chunk 2026-02-21T11:13:10.2388051Z sum_x2 = tl.cast(tl.sum(v_0, 1), tl.bfloat16) 2026-02-21T11:13:10.2388267Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:13:10.2388460Z _BLOCK_SIZE_1_ = _BLOCK_SIZE_1 2026-02-21T11:13:10.2388651Z # src[welford.py:55]: mean_c = sum_x / Tn 2026-02-21T11:13:10.2388855Z v_1 = 
tl.cast(_BLOCK_SIZE_1_, tl.bfloat16) 2026-02-21T11:13:10.2389036Z v_2 = sum_x / v_1 2026-02-21T11:13:10.2389234Z # src[welford.py:56]: m2_c = sum_x2 - (sum_x * sum_x) / Tn 2026-02-21T11:13:10.2389437Z v_3 = sum_x * sum_x 2026-02-21T11:13:10.2389612Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:13:10.2389802Z _BLOCK_SIZE_1__1 = _BLOCK_SIZE_1 2026-02-21T11:13:10.2390018Z # src[welford.py:56]: m2_c = sum_x2 - (sum_x * sum_x) / Tn 2026-02-21T11:13:10.2390248Z v_4 = tl.cast(_BLOCK_SIZE_1__1, tl.bfloat16) 2026-02-21T11:13:10.2390428Z v_5 = v_3 / v_4 2026-02-21T11:13:10.2390578Z v_6 = sum_x2 - v_5 2026-02-21T11:13:10.2390746Z # src[welford.py:58]: delta = mean_c - acc_mean 2026-02-21T11:13:10.2390944Z v_7 = tl.cast(v_2, tl.float32) 2026-02-21T11:13:10.2391115Z v_8 = v_7 - acc_mean_copy_0 2026-02-21T11:13:10.2391305Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:13:10.2391493Z _BLOCK_SIZE_1__2 = _BLOCK_SIZE_1 2026-02-21T11:13:10.2391687Z # src[welford.py:59]: new_cnt = acc_cnt + Tn 2026-02-21T11:13:10.2391936Z v_9 = tl.cast(_BLOCK_SIZE_1__2, tl.float32) 2026-02-21T11:13:10.2392125Z acc_cnt = acc_cnt_copy_0 + v_9 2026-02-21T11:13:10.2392354Z # src[welford.py:60]: new_mean = acc_mean + delta * (Tn / new_cnt) 2026-02-21T11:13:10.2392584Z v_11 = tl.full([], 1, tl.int32) 2026-02-21T11:13:10.2392764Z v_12 = v_11 / acc_cnt 2026-02-21T11:13:10.2392940Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:13:10.2393160Z _BLOCK_SIZE_1__3 = _BLOCK_SIZE_1 2026-02-21T11:13:10.2393383Z # src[welford.py:60]: new_mean = acc_mean + delta * (Tn / new_cnt) 2026-02-21T11:13:10.2393642Z v_13 = tl.cast(_BLOCK_SIZE_1__3, tl.float32) 2026-02-21T11:13:10.2393829Z v_14 = v_12 * v_13 2026-02-21T11:13:10.2393972Z v_15 = v_8 * v_14 2026-02-21T11:13:10.2394133Z acc_mean = acc_mean_copy_0 + v_15 2026-02-21T11:13:10.2394384Z # src[welford.py:61]: new_m2 = acc_m2 + m2_c + delta * delta * (acc_cnt * Tn / new_cnt) 2026-02-21T11:13:10.2394650Z v_17 = tl.cast(v_6, 
tl.float32) 2026-02-21T11:13:10.2394819Z v_18 = acc_m2_copy_0 + v_17 2026-02-21T11:13:10.2394989Z v_19 = v_8 * v_8 2026-02-21T11:13:10.2395153Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:13:10.2395454Z _BLOCK_SIZE_1__4 = _BLOCK_SIZE_1 2026-02-21T11:13:10.2395707Z # src[welford.py:61]: new_m2 = acc_m2 + m2_c + delta * delta * (acc_cnt * Tn / new_cnt) 2026-02-21T11:13:10.2395974Z v_20 = tl.cast(_BLOCK_SIZE_1__4, tl.float32) 2026-02-21T11:13:10.2396168Z v_21 = acc_cnt_copy_0 * v_20 2026-02-21T11:13:10.2396332Z v_22 = v_21 / acc_cnt 2026-02-21T11:13:10.2396489Z v_23 = v_19 * v_22 2026-02-21T11:13:10.2396641Z acc_m2 = v_18 + v_23 2026-02-21T11:13:10.2396869Z # src[welford.py:65]: rstd_tile = torch.rsqrt(acc_m2 / acc_cnt + eps) 2026-02-21T11:13:10.2397111Z v_25 = acc_m2 / acc_cnt 2026-02-21T11:13:10.2397262Z v_26 = v_25 + eps 2026-02-21T11:13:10.2397428Z v_27 = libdevice.rsqrt(v_26) 2026-02-21T11:13:10.2397623Z # src[welford.py:66]: mean_col = acc_mean[:, None] 2026-02-21T11:13:10.2397894Z mean_col = acc_mean[:, None] 2026-02-21T11:13:10.2398090Z # src[welford.py:67]: rstd_col = rstd_tile[:, None] 2026-02-21T11:13:10.2398300Z rstd_col = v_27[:, None] 2026-02-21T11:13:10.2398480Z # src[welford.py:69]: for tile_n in hl.tile(n): 2026-02-21T11:13:10.2398713Z # src[welford.py:70]: xi_chuck = x[tile_m, tile_n] 2026-02-21T11:13:10.2398970Z # src[welford.py:71]: w_chuck = weight[tile_n][None, :] 2026-02-21T11:13:10.2399189Z # src[welford.py:69-77]: ... 
2026-02-21T11:13:10.2399612Z for offset_2 in tl.range(0, 3072, _BLOCK_SIZE_2, loop_unroll_factor=2, warp_specialize=False, num_stages=1, disallow_acc_multi_buffer=False, flatten=True): 2026-02-21T11:13:10.2400093Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_2).to(tl.int32) 2026-02-21T11:13:10.2400334Z mean_col_copy = mean_col 2026-02-21T11:13:10.2400497Z rstd_col_copy = rstd_col 2026-02-21T11:13:10.2400675Z mean_col_copy_0 = mean_col_copy 2026-02-21T11:13:10.2400863Z rstd_col_copy_0 = rstd_col_copy 2026-02-21T11:13:10.2401068Z # src[welford.py:70]: xi_chuck = x[tile_m, tile_n] 2026-02-21T11:13:10.2401429Z xi_chuck = tl.load(x + (indices_0[:, None] * 3072 + indices_2[None, :] * 1), None, eviction_policy='evict_first') 2026-02-21T11:13:10.2401790Z # src[welford.py:71]: w_chuck = weight[tile_n][None, :] 2026-02-21T11:13:10.2402130Z load_1 = tl.load(weight + indices_2 * 1, None, eviction_policy='evict_last') 2026-02-21T11:13:10.2402397Z w_chuck = load_1[None, :] 2026-02-21T11:13:10.2402613Z # src[welford.py:72]: b_chuck = bias[tile_n][None, :] 2026-02-21T11:13:10.2402906Z load_2 = tl.load(bias + indices_2 * 1, None, eviction_policy='evict_last') 2026-02-21T11:13:10.2403163Z b_chuck = load_2[None, :] 2026-02-21T11:13:10.2403383Z # src[welford.py:74]: y = (xi_chuck - mean_col) * rstd_col 2026-02-21T11:13:10.2403615Z v_28 = tl.cast(xi_chuck, tl.float32) 2026-02-21T11:13:10.2403819Z v_29 = v_28 - mean_col_copy_0 2026-02-21T11:13:10.2404005Z v_30 = v_29 * rstd_col_copy_0 2026-02-21T11:13:10.2404219Z # src[welford.py:75]: y = y * w_chuck + b_chuck 2026-02-21T11:13:10.2404444Z v_31 = tl.cast(w_chuck, tl.float32) 2026-02-21T11:13:10.2404627Z v_32 = v_30 * v_31 2026-02-21T11:13:10.2404799Z v_33 = tl.cast(b_chuck, tl.float32) 2026-02-21T11:13:10.2404979Z v_34 = v_32 + v_33 2026-02-21T11:13:10.2405180Z # src[welford.py:77]: out[tile_m, tile_n] = y.to(x.dtype) 2026-02-21T11:13:10.2405437Z v_35 = tl.cast(v_34, tl.bfloat16) 2026-02-21T11:13:10.2405683Z tl.store(out + 
(indices_0[:, None] * 3072 + indices_2[None, :] * 1), v_35, None) 2026-02-21T11:13:10.2405876Z 2026-02-21T11:13:10.2406116Z def welford(weight: torch.Tensor, bias: torch.Tensor, x: torch.Tensor, eps: float=1e-05, *, _launcher=_default_launcher): 2026-02-21T11:13:10.2406441Z """ 2026-02-21T11:13:10.2406628Z Applies LayerNorm using Welford's algorithm for mean/variance. 2026-02-21T11:13:10.2406846Z Args: 2026-02-21T11:13:10.2406991Z weight: weight tensor of shape [N] 2026-02-21T11:13:10.2407239Z bias: bias tensor of shape [N] 2026-02-21T11:13:10.2407429Z x: input tensor of shape [M, N] 2026-02-21T11:13:10.2407599Z Returns: 2026-02-21T11:13:10.2407749Z Output tensor of shape [M, N] 2026-02-21T11:13:10.2407912Z """ 2026-02-21T11:13:10.2408056Z # src[welford.py:41]: m, n = x.size() 2026-02-21T11:13:10.2408239Z m, n = x.size() 2026-02-21T11:13:10.2408456Z # src[welford.py:43]: out = torch.empty([m, n], dtype=x.dtype, device=x.device) 2026-02-21T11:13:10.2408758Z out = torch.empty([m, n], dtype=x.dtype, device=x.device) 2026-02-21T11:13:10.2408995Z # src[welford.py:45]: for tile_m in hl.tile(m): 2026-02-21T11:13:10.2409283Z # src[welford.py:46]: acc_cnt = torch.zeros_like(x[tile_m, 0], dtype=torch.float32) 2026-02-21T11:13:10.2409592Z # src[welford.py:47]: acc_mean = torch.zeros_like(acc_cnt) 2026-02-21T11:13:10.2409877Z # src[welford.py:45-77]: ... 
2026-02-21T11:13:10.2410148Z _launcher(_helion_welford, (262144,), x, weight, bias, out, eps, num_warps=1, num_stages=7) 2026-02-21T11:13:10.2410429Z # src[welford.py:78]: return out 2026-02-21T11:13:10.2410596Z return out 2026-02-21T11:13:11.6071090Z WARNING:tritonbench.utils.triton_op:Completed input ID 4: 2026-02-21T11:13:11.6073022Z x_val 2026-02-21T11:13:11.6073259Z ------- 2026-02-21T11:13:11.6073410Z 3072 2026-02-21T11:13:11.6099045Z 2026-02-21T11:13:11.6099506Z 50%|█████ | 3/6 [19:07<19:26, 388.77s/it]WARNING:tritonbench.utils.triton_op:Running input ID 5: 2026-02-21T11:13:11.6099916Z x_val 2026-02-21T11:13:11.6100081Z ------- 2026-02-21T11:13:11.6100220Z 4096 2026-02-21T11:13:11.6111194Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for eager_layer_norm 2026-02-21T11:13:12.4470620Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_welford 2026-02-21T11:13:13.7544789Z INFO:tritonbench.utils.triton_op:Took 2.56ms to get benchmark function for torch_compile_welford 2026-02-21T11:13:31.7782520Z WARNING:__main__:Input tensor metadata: 2026-02-21T11:13:31.7787165Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T11:13:31.7788590Z 'dtype': 'torch.bfloat16', 2026-02-21T11:13:31.7788823Z 'shape': (4096,), 2026-02-21T11:13:31.7788994Z 'stride': (1,)}, 2026-02-21T11:13:31.7789164Z { 'device': 'cuda:0', 2026-02-21T11:13:31.7789334Z 'dtype': 'torch.bfloat16', 2026-02-21T11:13:31.7789604Z 'shape': (4096,), 2026-02-21T11:13:31.7793810Z 'stride': (1,)}, 2026-02-21T11:13:31.7798443Z { 'device': 'cuda:0', 2026-02-21T11:13:31.7801817Z 'dtype': 'torch.bfloat16', 2026-02-21T11:13:31.7803834Z 'shape': (262144, 4096), 2026-02-21T11:13:31.7804059Z 'stride': (4096, 1)}), 2026-02-21T11:13:31.7804227Z 'kwargs': {}} 2026-02-21T11:13:31.7804595Z INFO:tritonbench.utils.triton_op:Took 2.87ms to get benchmark function for helion_welford 2026-02-21T11:13:32.0646587Z [0s] Autotune random seed: 2144717750 2026-02-21T11:13:32.1069226Z [0s] 
Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T11:14:08.1384328Z [36s] Timeout after 30s compiling Config(block_sizes=[8192, 128, 1], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first', 'first', ''], maxnreg=64, num_sm_multiplier=2, num_stages=7, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, None, True], range_multi_buffers=[None, None, True], range_num_stages=[0, 4, 0], range_unroll_factors=[3, 3, 3], range_warp_specializes=[None, None, None]) 2026-02-21T11:14:08.2233695Z [36s] Timeout after 30s compiling Config(block_sizes=[16384, 32, 1], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first', 'first', ''], num_stages=4, num_warps=32, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, None, None], range_num_stages=[0, 2, 1], range_unroll_factors=[0, 4, 0], range_warp_specializes=[None, None, None]) 2026-02-21T11:14:08.2251665Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s 2026-02-21T11:14:45.4327484Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 3.0 configs/s 2026-02-21T11:14:45.4342106Z [73s] Adaptive compile timeout: 30s (90% percentile=4.8s, bounds=[30.0s, 30s]) 2026-02-21T11:14:45.7556657Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 148/148 200.5 configs/s 2026-02-21T11:14:46.1950927Z [74s] Initial random population of 100, 5 starting points: 2026-02-21T11:14:46.1955194Z error=8 2026-02-21T11:14:46.1959576Z timeout=2 2026-02-21T11:14:46.1963866Z ok=90 2026-02-21T11:14:46.1968392Z min=1.5001 2026-02-21T11:14:46.1969958Z mid=16.7393 2026-02-21T11:14:46.1970160Z max=435.0536 2026-02-21T11:14:46.1975970Z best={'block_sizes': [4, 512, 64], 2026-02-21T11:14:46.1980181Z 'indexing': ['pointer', 
'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:14:46.1983423Z 'tensor_descriptor'], 2026-02-21T11:14:46.1983716Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:14:46.1983925Z 'num_stages': 7, 2026-02-21T11:14:46.1984074Z 'num_warps': 8, 2026-02-21T11:14:46.1984215Z 'pid_type': 'flat', 2026-02-21T11:14:46.1984383Z 'range_flattens': [None, True, False], 2026-02-21T11:14:46.1984590Z 'range_multi_buffers': [None, False, None], 2026-02-21T11:14:46.1984774Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:14:46.1984946Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:14:46.1985136Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:14:46.1985356Z [74s] Fitting surrogate: 100 points, 100 targets 2026-02-21T11:14:47.4738469Z [75s] Generation 1 starting: 95 neighbors, 5 active search path(s) 2026-02-21T11:14:54.5300568Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99/99 23.0 configs/s 2026-02-21T11:15:03.7573251Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 99/99 10.7 configs/s 2026-02-21T11:15:09.9059400Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━ 194/194 30.1 configs/s 2026-02-21T11:15:10.2795835Z [98s] Generation 1 complete: 2026-02-21T11:15:10.2800214Z ok=101 2026-02-21T11:15:10.2801629Z min=1.0711 2026-02-21T11:15:10.2801792Z mid=2.3141 2026-02-21T11:15:10.2802070Z max=27.5446 2026-02-21T11:15:10.2802218Z best={'block_sizes': [32, 128, 128], 2026-02-21T11:15:10.2802507Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T11:15:10.2802819Z 'load_eviction_policies': ['first', '', '', 'last'], 2026-02-21T11:15:10.2803017Z 'maxnreg': 64, 2026-02-21T11:15:10.2803158Z 'num_sm_multiplier': 16, 2026-02-21T11:15:10.2803318Z 'num_stages': 7, 2026-02-21T11:15:10.2803451Z 'num_warps': 8, 2026-02-21T11:15:10.2803628Z 'pid_type': 'persistent_blocked', 2026-02-21T11:15:10.2803811Z 'range_flattens': [False, False, False], 
2026-02-21T11:15:10.2804053Z 'range_multi_buffers': [False, False, False], 2026-02-21T11:15:10.2804241Z 'range_num_stages': [2, 3, 4], 2026-02-21T11:15:10.2804412Z 'range_unroll_factors': [0, 1, 0], 2026-02-21T11:15:10.2804595Z 'range_warp_specializes': [True, None, None]} 2026-02-21T11:15:10.2816333Z [98s] Fitting surrogate: 201 points, 201 targets 2026-02-21T11:15:11.4456334Z [99s] Generation 2 starting: 90 neighbors, 5 active search path(s) 2026-02-21T11:15:19.1726976Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/95 17.5 configs/s 2026-02-21T11:15:19.5018797Z [107s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 2048, 256], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=7, num_warps=2, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, False, None], range_num_stages=[0, 4, 0], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, None]) 2026-02-21T11:15:19.5020368Z Tensor-likes are not close! 2026-02-21T11:15:19.5020486Z 2026-02-21T11:15:19.5020571Z Mismatched elements: 76 / 1073741824 (0.0%) 2026-02-21T11:15:19.5025701Z Greatest absolute difference: 0.03125 at index (22909, 2963) (up to 0.01 allowed) 2026-02-21T11:15:19.5028496Z Greatest relative difference: 4.65625 at index (188830, 1069) (up to 0.01 allowed) 2026-02-21T11:15:19.5028889Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:15:19.5033054Z 2026-02-21T11:15:22.8514344Z [110s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[16, 2048, 256], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', '', '', 'last'], num_stages=4, num_warps=32, pid_type='flat', range_flattens=[None, False, False], range_multi_buffers=[None, False, True], range_num_stages=[0, 3, 1], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, True, None]) 2026-02-21T11:15:22.8515651Z Tensor-likes are not close! 2026-02-21T11:15:22.8520104Z 2026-02-21T11:15:22.8522012Z Mismatched elements: 7 / 1073741824 (0.0%) 2026-02-21T11:15:22.8522403Z Greatest absolute difference: 0.0234375 at index (52631, 267) (up to 0.01 allowed) 2026-02-21T11:15:22.8522804Z Greatest relative difference: 0.546875 at index (184637, 3489) (up to 0.01 allowed) 2026-02-21T11:15:22.8526751Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:15:22.8530360Z 2026-02-21T11:15:23.1042833Z [110s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[16, 2048, 256], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', '', '', 'last'], maxnreg=256, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False, False], range_multi_buffers=[False, None, True], range_num_stages=[1, 3, 0], range_unroll_factors=[1, 3, 0], range_warp_specializes=[True, None, None]) 2026-02-21T11:15:23.1043956Z Tensor-likes are not close! 2026-02-21T11:15:23.1047363Z 2026-02-21T11:15:23.1051949Z Mismatched elements: 7 / 1073741824 (0.0%) 2026-02-21T11:15:23.1055903Z Greatest absolute difference: 0.0234375 at index (52631, 267) (up to 0.01 allowed) 2026-02-21T11:15:23.1057859Z Greatest relative difference: 0.546875 at index (184637, 3489) (up to 0.01 allowed) 2026-02-21T11:15:23.1058253Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:15:23.1062502Z 2026-02-21T11:15:26.1141225Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 95/95 13.7 configs/s 2026-02-21T11:15:42.7296319Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━ 214/214 12.7 configs/s 2026-02-21T11:15:43.1186725Z [131s] Generation 2 complete: 2026-02-21T11:15:43.1188283Z error=3 2026-02-21T11:15:43.1188427Z ok=93 2026-02-21T11:15:43.1188559Z min=0.9368 2026-02-21T11:15:43.1188684Z mid=1.3453 2026-02-21T11:15:43.1188833Z max=3.7090 2026-02-21T11:15:43.1188976Z best={'block_sizes': [2, 512, 256], 2026-02-21T11:15:43.1189278Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:15:43.1189572Z 'tensor_descriptor'], 2026-02-21T11:15:43.1189777Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:15:43.1189983Z 'num_stages': 7, 2026-02-21T11:15:43.1190115Z 'num_warps': 1, 2026-02-21T11:15:43.1190257Z 'pid_type': 'flat', 2026-02-21T11:15:43.1190407Z 'range_flattens': [None, True, False], 2026-02-21T11:15:43.1190605Z 'range_multi_buffers': [None, False, None], 2026-02-21T11:15:43.1190798Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:15:43.1190960Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:15:43.1191152Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:15:43.1208970Z [131s] Fitting surrogate: 297 points, 297 targets 2026-02-21T11:15:44.3914563Z [132s] Generation 3 starting: 90 neighbors, 5 active search path(s) 2026-02-21T11:15:52.5275387Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91/91 6.2 configs/s 2026-02-21T11:15:53.3730477Z [141s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 1024, 256], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, False], range_multi_buffers=[None, False, False], range_num_stages=[0, 4, 
1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, False, None]) 2026-02-21T11:15:53.3732065Z Tensor-likes are not close! 2026-02-21T11:15:53.3732198Z 2026-02-21T11:15:53.3732280Z Mismatched elements: 1 / 1073741824 (0.0%) 2026-02-21T11:15:53.3732565Z Greatest absolute difference: 0.010498046875 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:15:53.3732940Z Greatest relative difference: 0.2890625 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:15:53.3733578Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:15:53.3733761Z 2026-02-21T11:15:53.5304316Z [141s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[2, 1024, 256], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, False], range_multi_buffers=[None, False, None], range_num_stages=[0, 4, 0], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, False, None]) 2026-02-21T11:15:53.5305415Z Tensor-likes are not close! 2026-02-21T11:15:53.5305533Z 2026-02-21T11:15:53.5305627Z Mismatched elements: 1 / 1073741824 (0.0%) 2026-02-21T11:15:53.5305906Z Greatest absolute difference: 0.010498046875 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:15:53.5306263Z Greatest relative difference: 0.2890625 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:15:53.5306576Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:15:53.5306746Z 2026-02-21T11:15:59.3801500Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 91/91 13.3 configs/s 2026-02-21T11:16:11.3804995Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━ 253/253 20.7 configs/s 2026-02-21T11:16:11.7403316Z [159s] Generation 3 complete: 2026-02-21T11:16:11.7407276Z error=2 2026-02-21T11:16:11.7408622Z ok=94 2026-02-21T11:16:11.7408777Z min=0.8151 2026-02-21T11:16:11.7408912Z mid=1.2718 2026-02-21T11:16:11.7409027Z max=7.2483 2026-02-21T11:16:11.7409169Z best={'block_sizes': [1, 512, 512], 2026-02-21T11:16:11.7409463Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:16:11.7409758Z 'tensor_descriptor'], 2026-02-21T11:16:11.7409973Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:16:11.7410172Z 'num_stages': 7, 2026-02-21T11:16:11.7410315Z 'num_warps': 1, 2026-02-21T11:16:11.7410449Z 'pid_type': 'flat', 2026-02-21T11:16:11.7410633Z 'range_flattens': [None, True, None], 2026-02-21T11:16:11.7410836Z 'range_multi_buffers': [None, False, None], 2026-02-21T11:16:11.7411026Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:16:11.7411186Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:16:11.7411383Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:16:11.7437132Z [159s] Fitting surrogate: 393 points, 393 targets 2026-02-21T11:16:12.8516013Z [160s] Generation 4 starting: 82 neighbors, 5 active search path(s) 2026-02-21T11:16:21.1168630Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 4.2 configs/s 2026-02-21T11:16:27.4514608Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 13.3 configs/s 2026-02-21T11:16:39.3526579Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 263/263 22.9 configs/s 2026-02-21T11:16:39.7001347Z [187s] Generation 4 complete: 2026-02-21T11:16:39.7005865Z error=1 2026-02-21T11:16:39.7007748Z ok=87 2026-02-21T11:16:39.7007906Z min=0.8203 
2026-02-21T11:16:39.7008063Z mid=1.2770 2026-02-21T11:16:39.7008540Z max=6.2208 2026-02-21T11:16:39.7008686Z best={'block_sizes': [1, 512, 2048], 2026-02-21T11:16:39.7008975Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:16:39.7009270Z 'tensor_descriptor'], 2026-02-21T11:16:39.7009477Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:16:39.7009686Z 'num_stages': 7, 2026-02-21T11:16:39.7009823Z 'num_warps': 1, 2026-02-21T11:16:39.7009966Z 'pid_type': 'flat', 2026-02-21T11:16:39.7010126Z 'range_flattens': [None, True, False], 2026-02-21T11:16:39.7010313Z 'range_multi_buffers': [None, False, None], 2026-02-21T11:16:39.7010506Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:16:39.7010668Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:16:39.7010859Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:16:39.7029328Z [187s] Fitting surrogate: 481 points, 481 targets 2026-02-21T11:16:40.8187171Z [188s] Generation 5 starting: 79 neighbors, 5 active search path(s) 2026-02-21T11:16:46.9677734Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81/81 21.9 configs/s 2026-02-21T11:16:53.1140905Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 81/81 13.2 configs/s 2026-02-21T11:16:59.9517293Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━ 278/278 39.3 configs/s 2026-02-21T11:17:00.2552778Z [208s] Generation 5 complete: 2026-02-21T11:17:00.2557214Z ok=85 2026-02-21T11:17:00.2558721Z min=0.8254 2026-02-21T11:17:00.2558875Z mid=1.2636 2026-02-21T11:17:00.2559009Z max=5.1702 2026-02-21T11:17:00.2559147Z best={'block_sizes': [1, 512, 2048], 2026-02-21T11:17:00.2559449Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:17:00.2559735Z 'tensor_descriptor'], 2026-02-21T11:17:00.2559956Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:17:00.2560152Z 'num_stages': 7, 2026-02-21T11:17:00.2560299Z 
'num_warps': 1, 2026-02-21T11:17:00.2560463Z 'pid_type': 'flat', 2026-02-21T11:17:00.2560639Z 'range_flattens': [None, True, True], 2026-02-21T11:17:00.2560833Z 'range_multi_buffers': [None, False, None], 2026-02-21T11:17:00.2561017Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:17:00.2561187Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:17:00.2561374Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:17:00.2591180Z [208s] Fitting surrogate: 566 points, 566 targets 2026-02-21T11:17:01.3352431Z [209s] Generation 6 starting: 70 neighbors, 4 active search path(s) 2026-02-21T11:17:07.2339098Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 12.8 configs/s 2026-02-21T11:17:07.3366275Z [215s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 1024, 2048], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, None], range_multi_buffers=[None, False, None], range_num_stages=[0, 4, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, False, None]) 2026-02-21T11:17:07.3367802Z Tensor-likes are not close! 2026-02-21T11:17:07.3367915Z 2026-02-21T11:17:07.3367992Z Mismatched elements: 1 / 1073741824 (0.0%) 2026-02-21T11:17:07.3368277Z Greatest absolute difference: 0.010498046875 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:17:07.3368641Z Greatest relative difference: 0.2890625 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:17:07.3368945Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:17:07.3369104Z 2026-02-21T11:17:07.5979723Z [215s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 2048], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, False, None], range_num_stages=[0, 4, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, False, None]) 2026-02-21T11:17:07.5980869Z Tensor-likes are not close! 2026-02-21T11:17:07.5982821Z 2026-02-21T11:17:07.5983034Z Mismatched elements: 10852 / 1073741824 (0.0%) 2026-02-21T11:17:07.5983397Z Greatest absolute difference: 0.0625 at index (4867, 4010) (up to 0.01 allowed) 2026-02-21T11:17:07.5987607Z Greatest relative difference: 964.0 at index (208713, 2361) (up to 0.01 allowed) 2026-02-21T11:17:07.5990807Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:17:07.5995037Z 2026-02-21T11:17:07.9336125Z [215s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 2048], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, False, None], range_num_stages=[0, 4, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, False, None]) 2026-02-21T11:17:07.9337655Z Tensor-likes are not close! 
2026-02-21T11:17:07.9337776Z 2026-02-21T11:17:07.9337890Z Mismatched elements: 10852 / 1073741824 (0.0%) 2026-02-21T11:17:07.9338185Z Greatest absolute difference: 0.0625 at index (4867, 4010) (up to 0.01 allowed) 2026-02-21T11:17:07.9338540Z Greatest relative difference: 964.0 at index (208713, 2361) (up to 0.01 allowed) 2026-02-21T11:17:07.9342417Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:17:07.9346142Z 2026-02-21T11:17:08.1722374Z [216s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 2048], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, False, False], range_num_stages=[0, 4, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, False, None]) 2026-02-21T11:17:08.1723607Z Tensor-likes are not close! 2026-02-21T11:17:08.1725870Z 2026-02-21T11:17:08.1726059Z Mismatched elements: 10852 / 1073741824 (0.0%) 2026-02-21T11:17:08.1726430Z Greatest absolute difference: 0.0625 at index (4867, 4010) (up to 0.01 allowed) 2026-02-21T11:17:08.1730436Z Greatest relative difference: 964.0 at index (208713, 2361) (up to 0.01 allowed) 2026-02-21T11:17:08.1735450Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:17:08.1736709Z 2026-02-21T11:17:12.5690500Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 72/72 13.6 configs/s 2026-02-21T11:17:21.7148650Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 278/278 29.6 configs/s 2026-02-21T11:17:22.0392777Z [229s] Generation 6 complete: 2026-02-21T11:17:22.0394628Z error=4 2026-02-21T11:17:22.0394783Z ok=71 2026-02-21T11:17:22.0394914Z min=0.7997 2026-02-21T11:17:22.0395037Z mid=1.2196 2026-02-21T11:17:22.0395194Z max=5.6269 2026-02-21T11:17:22.0395765Z best={'block_sizes': [1, 512, 2048], 2026-02-21T11:17:22.0396066Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:17:22.0396348Z 'tensor_descriptor'], 2026-02-21T11:17:22.0396558Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:17:22.0396757Z 'num_stages': 6, 2026-02-21T11:17:22.0396899Z 'num_warps': 1, 2026-02-21T11:17:22.0397035Z 'pid_type': 'flat', 2026-02-21T11:17:22.0397196Z 'range_flattens': [None, True, True], 2026-02-21T11:17:22.0397392Z 'range_multi_buffers': [None, False, None], 2026-02-21T11:17:22.0397573Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:17:22.0397744Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:17:22.0397931Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:17:22.0429976Z [229s] Fitting surrogate: 641 points, 641 targets 2026-02-21T11:17:23.0571293Z [230s] Generation 7 starting: 65 neighbors, 4 active search path(s) 2026-02-21T11:17:29.6388623Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66/66 11.9 configs/s 2026-02-21T11:17:30.3037503Z [238s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 1024, 4096], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=6, num_warps=1, pid_type='flat', range_flattens=[None, True, False], range_multi_buffers=[None, False, None], 
range_num_stages=[0, 4, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, False, None]) 2026-02-21T11:17:30.3038608Z Tensor-likes are not close! 2026-02-21T11:17:30.3038730Z 2026-02-21T11:17:30.3038812Z Mismatched elements: 1 / 1073741824 (0.0%) 2026-02-21T11:17:30.3039114Z Greatest absolute difference: 0.010498046875 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:17:30.3039482Z Greatest relative difference: 0.2890625 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:17:30.3039823Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:17:30.3039994Z 2026-02-21T11:17:34.5909645Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 66/66 13.4 configs/s 2026-02-21T11:17:41.5246409Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━ 291/291 40.5 configs/s 2026-02-21T11:17:41.8225302Z [249s] Generation 7 complete: 2026-02-21T11:17:41.8228522Z error=1 2026-02-21T11:17:41.8232331Z ok=68 2026-02-21T11:17:41.8237418Z min=0.8344 2026-02-21T11:17:41.8242324Z mid=1.2389 2026-02-21T11:17:41.8245408Z max=18.4893 2026-02-21T11:17:41.8249887Z best={'block_sizes': [1, 512, 2048], 2026-02-21T11:17:41.8253261Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:17:41.8256315Z 'tensor_descriptor'], 2026-02-21T11:17:41.8256620Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:17:41.8260600Z 'num_stages': 6, 2026-02-21T11:17:41.8265090Z 'num_warps': 1, 2026-02-21T11:17:41.8269122Z 'pid_type': 'flat', 2026-02-21T11:17:41.8272881Z 'range_flattens': [None, True, True], 2026-02-21T11:17:41.8277259Z 'range_multi_buffers': [None, False, None], 2026-02-21T11:17:41.8277544Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:17:41.8281583Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:17:41.8284920Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:17:41.8289298Z [249s] Fitting surrogate: 710 points, 710 targets 2026-02-21T11:17:42.9131461Z 
[250s] Generation 8 starting: 68 neighbors, 4 active search path(s) 2026-02-21T11:17:52.2408530Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 3.2 configs/s 2026-02-21T11:17:57.5607124Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 13.2 configs/s 2026-02-21T11:18:03.3141254Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━ 291/291 48.6 configs/s 2026-02-21T11:18:03.5948217Z [271s] Generation 8 complete: 2026-02-21T11:18:03.5951237Z ok=72 2026-02-21T11:18:03.5955668Z min=0.8080 2026-02-21T11:18:03.5957111Z mid=1.2933 2026-02-21T11:18:03.5957645Z max=8.7808 2026-02-21T11:18:03.5957785Z best={'block_sizes': [1, 512, 2048], 2026-02-21T11:18:03.5958089Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 2026-02-21T11:18:03.5958383Z 'tensor_descriptor'], 2026-02-21T11:18:03.5958590Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:18:03.5958807Z 'num_stages': 6, 2026-02-21T11:18:03.5958950Z 'num_warps': 1, 2026-02-21T11:18:03.5959094Z 'pid_type': 'flat', 2026-02-21T11:18:03.5959249Z 'range_flattens': [None, True, True], 2026-02-21T11:18:03.5959446Z 'range_multi_buffers': [None, False, None], 2026-02-21T11:18:03.5959627Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:18:03.5959800Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:18:03.5959988Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:18:03.5991795Z [271s] Fitting surrogate: 782 points, 782 targets 2026-02-21T11:18:04.4648757Z [272s] Generation 9 starting: 52 neighbors, 3 active search path(s) 2026-02-21T11:18:09.2108222Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 10.8 configs/s 2026-02-21T11:18:09.5968808Z [277s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 1024, 4096], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=6, 
num_warps=1, pid_type='flat', range_flattens=[None, True, False], range_multi_buffers=[None, False, None], range_num_stages=[0, 4, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, False, None]) 2026-02-21T11:18:09.5969952Z Tensor-likes are not close! 2026-02-21T11:18:09.5970069Z 2026-02-21T11:18:09.5970151Z Mismatched elements: 1 / 1073741824 (0.0%) 2026-02-21T11:18:09.5970432Z Greatest absolute difference: 0.010498046875 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:18:09.5970820Z Greatest relative difference: 0.2890625 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:18:09.5971136Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:18:09.5971302Z 2026-02-21T11:18:10.1215362Z [278s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 4096], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=6, num_warps=1, pid_type='flat', range_flattens=[None, True, None], range_multi_buffers=[None, False, None], range_num_stages=[0, 4, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, False, None]) 2026-02-21T11:18:10.1216485Z Tensor-likes are not close! 2026-02-21T11:18:10.1216613Z 2026-02-21T11:18:10.1216748Z Mismatched elements: 10852 / 1073741824 (0.0%) 2026-02-21T11:18:10.1217086Z Greatest absolute difference: 0.0625 at index (4867, 4010) (up to 0.01 allowed) 2026-02-21T11:18:10.1222447Z Greatest relative difference: 964.0 at index (208713, 2361) (up to 0.01 allowed) 2026-02-21T11:18:10.1226828Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:18:10.1227996Z 2026-02-21T11:18:13.1409094Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 54/54 13.8 configs/s 2026-02-21T11:18:17.3813262Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━ 291/291 65.0 configs/s 2026-02-21T11:18:17.6628418Z [285s] Generation 9 complete: 2026-02-21T11:18:17.6632833Z error=2 2026-02-21T11:18:17.6637553Z ok=53 2026-02-21T11:18:17.6639479Z min=0.8808 2026-02-21T11:18:17.6639638Z mid=1.2519 2026-02-21T11:18:17.6639768Z max=3.7725 2026-02-21T11:18:17.6639903Z best={'block_sizes': [1, 512, 2048], 2026-02-21T11:18:17.6640206Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 2026-02-21T11:18:17.6640491Z 'tensor_descriptor'], 2026-02-21T11:18:17.6640711Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:18:17.6640914Z 'num_stages': 6, 2026-02-21T11:18:17.6641089Z 'num_warps': 1, 2026-02-21T11:18:17.6641231Z 'pid_type': 'flat', 2026-02-21T11:18:17.6641826Z 'range_flattens': [None, True, True], 2026-02-21T11:18:17.6642127Z 'range_multi_buffers': [None, False, True], 2026-02-21T11:18:17.6642320Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:18:17.6642504Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:18:17.6642700Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:18:17.6670472Z [285s] Fitting surrogate: 837 points, 837 targets 2026-02-21T11:18:18.3340442Z [286s] Generation 10 starting: 33 neighbors, 2 active search path(s) 2026-02-21T11:18:21.9543445Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 10.4 configs/s 2026-02-21T11:18:24.5106698Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 34/34 13.4 configs/s 2026-02-21T11:18:30.3539235Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━ 291/291 47.9 configs/s 2026-02-21T11:18:30.6514449Z [298s] Generation 10 complete: 2026-02-21T11:18:30.6518305Z ok=35 2026-02-21T11:18:30.6523087Z min=0.8140 2026-02-21T11:18:30.6526500Z mid=1.1818 
2026-02-21T11:18:30.6530795Z max=3.4416 2026-02-21T11:18:30.6535227Z best={'block_sizes': [1, 512, 2048], 2026-02-21T11:18:30.6539093Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:18:30.6539501Z 'tensor_descriptor'], 2026-02-21T11:18:30.6539758Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:18:30.6539970Z 'num_stages': 6, 2026-02-21T11:18:30.6540117Z 'num_warps': 1, 2026-02-21T11:18:30.6540264Z 'pid_type': 'flat', 2026-02-21T11:18:30.6540427Z 'range_flattens': [None, True, True], 2026-02-21T11:18:30.6540617Z 'range_multi_buffers': [None, False, True], 2026-02-21T11:18:30.6540811Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:18:30.6540981Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:18:30.6541171Z 'range_warp_specializes': [None, False, None]} 2026-02-21T11:18:30.6562717Z [298s] Fitting surrogate: 872 points, 872 targets 2026-02-21T11:18:31.2989816Z [299s] Generation 11 starting: 32 neighbors, 2 active search path(s) 2026-02-21T11:18:36.0089155Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 4.5 configs/s 2026-02-21T11:18:38.5297213Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 34/34 13.6 configs/s 2026-02-21T11:18:43.7072186Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━━ 291/291 53.6 configs/s 2026-02-21T11:18:43.9973285Z [311s] Generation 11 complete: 2026-02-21T11:18:43.9977563Z ok=35 2026-02-21T11:18:43.9982140Z min=0.8550 2026-02-21T11:18:43.9983504Z mid=1.0670 2026-02-21T11:18:43.9983675Z max=5.0740 2026-02-21T11:18:43.9983823Z best={'block_sizes': [1, 512, 2048], 2026-02-21T11:18:43.9984161Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:18:43.9984477Z 'tensor_descriptor'], 2026-02-21T11:18:43.9984687Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:18:43.9984923Z 'num_stages': 6, 2026-02-21T11:18:43.9985065Z 'num_warps': 1, 
2026-02-21T11:18:43.9985228Z 'pid_type': 'flat', 2026-02-21T11:18:43.9985383Z 'range_flattens': [None, True, True], 2026-02-21T11:18:43.9985584Z 'range_multi_buffers': [None, False, True], 2026-02-21T11:18:43.9985770Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:18:43.9985942Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:18:43.9986130Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:18:44.0017428Z [311s] Fitting surrogate: 907 points, 907 targets 2026-02-21T11:18:44.6184587Z [312s] Generation 12 starting: 30 neighbors, 2 active search path(s) 2026-02-21T11:18:48.6570195Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 10.0 configs/s 2026-02-21T11:18:48.6860180Z [316s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 1024, 1024], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=6, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, False, True], range_num_stages=[0, 4, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, None]) 2026-02-21T11:18:48.6861770Z Tensor-likes are not close! 2026-02-21T11:18:48.6865826Z 2026-02-21T11:18:48.6869293Z Mismatched elements: 1 / 1073741824 (0.0%) 2026-02-21T11:18:48.6873893Z Greatest absolute difference: 0.010498046875 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:18:48.6878183Z Greatest relative difference: 0.2890625 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:18:48.6882020Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:18:48.6882297Z 2026-02-21T11:18:49.6466119Z [317s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 1024, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=6, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, False, False], range_num_stages=[0, 4, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, None]) 2026-02-21T11:18:49.6467416Z Tensor-likes are not close! 2026-02-21T11:18:49.6472257Z 2026-02-21T11:18:49.6476826Z Mismatched elements: 1 / 1073741824 (0.0%) 2026-02-21T11:18:49.6481224Z Greatest absolute difference: 0.010498046875 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:18:49.6485607Z Greatest relative difference: 0.2890625 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:18:49.6489356Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:18:49.6490554Z 2026-02-21T11:18:50.1952307Z [318s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[8, 2048, 256], indexing=['pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', '', '', ''], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, True, False], range_multi_buffers=[None, False, False], range_num_stages=[0, 3, 4], range_unroll_factors=[0, 1, 1], range_warp_specializes=[None, False, False]) 2026-02-21T11:18:50.1953361Z Tensor-likes are not close! 
2026-02-21T11:18:50.1953476Z 2026-02-21T11:18:50.1953559Z Mismatched elements: 13 / 1073741824 (0.0%) 2026-02-21T11:18:50.1953831Z Greatest absolute difference: 0.01953125 at index (112537, 3028) (up to 0.01 allowed) 2026-02-21T11:18:50.1954191Z Greatest relative difference: 0.8203125 at index (160612, 3987) (up to 0.01 allowed) 2026-02-21T11:18:50.1954497Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:18:50.1954665Z 2026-02-21T11:18:50.8027506Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 31/31 14.2 configs/s 2026-02-21T11:18:55.2540305Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━ 291/291 61.8 configs/s 2026-02-21T11:18:55.5343120Z [323s] Generation 12 complete: 2026-02-21T11:18:55.5346982Z error=3 2026-02-21T11:18:55.5348400Z ok=29 2026-02-21T11:18:55.5348565Z min=0.8591 2026-02-21T11:18:55.5348716Z mid=1.0629 2026-02-21T11:18:55.5348839Z max=2.7731 2026-02-21T11:18:55.5348988Z best={'block_sizes': [1, 512, 2048], 2026-02-21T11:18:55.5349306Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:18:55.5349612Z 'tensor_descriptor'], 2026-02-21T11:18:55.5349832Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:18:55.5350034Z 'num_stages': 7, 2026-02-21T11:18:55.5350177Z 'num_warps': 1, 2026-02-21T11:18:55.5350314Z 'pid_type': 'flat', 2026-02-21T11:18:55.5350476Z 'range_flattens': [None, True, True], 2026-02-21T11:18:55.5350667Z 'range_multi_buffers': [None, False, True], 2026-02-21T11:18:55.5350859Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:18:55.5351028Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:18:55.5351215Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:18:55.5385572Z [323s] Fitting surrogate: 939 points, 939 targets 2026-02-21T11:18:56.1650904Z [324s] Generation 13 starting: 30 neighbors, 2 active search path(s) 2026-02-21T11:19:00.4232460Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 
31/31 5.5 configs/s 2026-02-21T11:19:01.6999890Z [329s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[8, 2048, 512], indexing=['pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'last', 'last', ''], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, False, False], range_multi_buffers=[None, False, False], range_num_stages=[0, 4, 4], range_unroll_factors=[0, 1, 1], range_warp_specializes=[None, None, False]) 2026-02-21T11:19:01.7001013Z Tensor-likes are not close! 2026-02-21T11:19:01.7001158Z 2026-02-21T11:19:01.7001364Z Mismatched elements: 10852 / 1073741824 (0.0%) 2026-02-21T11:19:01.7001674Z Greatest absolute difference: 0.0625 at index (4867, 4010) (up to 0.01 allowed) 2026-02-21T11:19:01.7006210Z Greatest relative difference: 964.0 at index (208713, 2361) (up to 0.01 allowed) 2026-02-21T11:19:01.7010221Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:19:01.7013173Z 2026-02-21T11:19:02.7947824Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 31/31 13.2 configs/s 2026-02-21T11:19:08.3465859Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━ 291/291 50.2 configs/s 2026-02-21T11:19:08.6431476Z [336s] Generation 13 complete: 2026-02-21T11:19:08.6435727Z error=1 2026-02-21T11:19:08.6439634Z ok=31 2026-02-21T11:19:08.6444025Z min=0.8592 2026-02-21T11:19:08.6447118Z mid=1.0752 2026-02-21T11:19:08.6450368Z max=7.2704 2026-02-21T11:19:08.6454713Z best={'block_sizes': [1, 512, 2048], 2026-02-21T11:19:08.6456252Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:19:08.6456587Z 'tensor_descriptor'], 2026-02-21T11:19:08.6456813Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:19:08.6457036Z 'num_stages': 7, 2026-02-21T11:19:08.6457197Z 'num_warps': 1, 2026-02-21T11:19:08.6457339Z 'pid_type': 'flat', 2026-02-21T11:19:08.6457504Z 'range_flattens': [None, 
True, True], 2026-02-21T11:19:08.6457698Z 'range_multi_buffers': [None, False, True], 2026-02-21T11:19:08.6457893Z 'range_num_stages': [0, 4, 2], 2026-02-21T11:19:08.6458061Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:19:08.6458247Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:19:08.6475682Z [336s] Fitting surrogate: 971 points, 971 targets 2026-02-21T11:19:09.3049312Z [337s] Generation 14 starting: 33 neighbors, 2 active search path(s) 2026-02-21T11:19:12.2599136Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 21.4 configs/s 2026-02-21T11:19:13.2888077Z [341s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 1024, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, False], range_multi_buffers=[None, False, True], range_num_stages=[0, 4, 2], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, None]) 2026-02-21T11:19:13.2889755Z Tensor-likes are not close! 2026-02-21T11:19:13.2889871Z 2026-02-21T11:19:13.2889948Z Mismatched elements: 1 / 1073741824 (0.0%) 2026-02-21T11:19:13.2890233Z Greatest absolute difference: 0.010498046875 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:19:13.2890584Z Greatest relative difference: 0.2890625 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:19:13.2890892Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:19:13.2891047Z 2026-02-21T11:19:14.7659150Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 34/34 13.7 configs/s 2026-02-21T11:19:20.5065904Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━━ 291/291 48.7 configs/s 2026-02-21T11:19:20.7921966Z [348s] Generation 14 complete: 2026-02-21T11:19:20.7926051Z error=1 2026-02-21T11:19:20.7930370Z ok=34 2026-02-21T11:19:20.7932019Z min=0.8817 2026-02-21T11:19:20.7932173Z mid=1.0784 2026-02-21T11:19:20.7932299Z max=5.0729 2026-02-21T11:19:20.7932435Z best={'block_sizes': [1, 512, 2048], 2026-02-21T11:19:20.7932751Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:19:20.7933054Z 'tensor_descriptor'], 2026-02-21T11:19:20.7933268Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:19:20.7933474Z 'num_stages': 7, 2026-02-21T11:19:20.7933610Z 'num_warps': 1, 2026-02-21T11:19:20.7933753Z 'pid_type': 'flat', 2026-02-21T11:19:20.7933909Z 'range_flattens': [None, True, True], 2026-02-21T11:19:20.7934105Z 'range_multi_buffers': [None, None, True], 2026-02-21T11:19:20.7934287Z 'range_num_stages': [0, 4, 2], 2026-02-21T11:19:20.7934455Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:19:20.7934655Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:19:20.7971641Z [348s] Fitting surrogate: 1006 points, 1006 targets 2026-02-21T11:19:21.4302425Z [349s] Generation 15 starting: 32 neighbors, 2 active search path(s) 2026-02-21T11:19:24.1850198Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 23.2 configs/s 2026-02-21T11:19:26.5285425Z [354s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[8, 2048, 512], indexing=['pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'last', '', ''], num_stages=8, num_warps=2, pid_type='flat', range_flattens=[None, True, False], range_multi_buffers=[None, True, False], range_num_stages=[0, 4, 4], 
range_unroll_factors=[0, 1, 1], range_warp_specializes=[None, False, True]) 2026-02-21T11:19:26.5286609Z Tensor-likes are not close! 2026-02-21T11:19:26.5286737Z 2026-02-21T11:19:26.5286853Z Mismatched elements: 13 / 1073741824 (0.0%) 2026-02-21T11:19:26.5287269Z Greatest absolute difference: 0.01953125 at index (112537, 3028) (up to 0.01 allowed) 2026-02-21T11:19:26.5287696Z Greatest relative difference: 0.8203125 at index (160612, 3987) (up to 0.01 allowed) 2026-02-21T11:19:26.5292791Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:19:26.5297962Z 2026-02-21T11:19:26.5298497Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 33/33 14.3 configs/s 2026-02-21T11:19:32.2256986Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━ 291/291 48.9 configs/s 2026-02-21T11:19:32.5149176Z [360s] Generation 15 complete: 2026-02-21T11:19:32.5151040Z error=1 2026-02-21T11:19:32.5151183Z ok=33 2026-02-21T11:19:32.5151313Z min=0.8858 2026-02-21T11:19:32.5151438Z mid=1.0650 2026-02-21T11:19:32.5151563Z max=3.1016 2026-02-21T11:19:32.5151695Z best={'block_sizes': [1, 512, 2048], 2026-02-21T11:19:32.5152184Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:19:32.5152499Z 'tensor_descriptor'], 2026-02-21T11:19:32.5152744Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:19:32.5153238Z 'num_stages': 7, 2026-02-21T11:19:32.5153379Z 'num_warps': 1, 2026-02-21T11:19:32.5153526Z 'pid_type': 'flat', 2026-02-21T11:19:32.5153681Z 'range_flattens': [None, True, True], 2026-02-21T11:19:32.5153878Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:19:32.5154062Z 'range_num_stages': [0, 4, 2], 2026-02-21T11:19:32.5154232Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:19:32.5154416Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:19:32.5204370Z [360s] Fitting surrogate: 1040 points, 1040 targets 2026-02-21T11:19:33.1307277Z [361s] Generation 16 
starting: 31 neighbors, 2 active search path(s) 2026-02-21T11:19:35.8458144Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 18.9 configs/s 2026-02-21T11:19:36.4519110Z [364s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 1024, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, None, False], range_num_stages=[0, 4, 2], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, None]) 2026-02-21T11:19:36.4520329Z Tensor-likes are not close! 2026-02-21T11:19:36.4524488Z 2026-02-21T11:19:36.4528502Z Mismatched elements: 1 / 1073741824 (0.0%) 2026-02-21T11:19:36.4532603Z Greatest absolute difference: 0.010498046875 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:19:36.4534001Z Greatest relative difference: 0.2890625 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:19:36.4534353Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:19:36.4534517Z 2026-02-21T11:19:38.1739525Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 32/32 13.9 configs/s 2026-02-21T11:19:41.9670320Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━ 291/291 71.7 configs/s 2026-02-21T11:19:42.2385340Z [370s] Generation 16 complete: 2026-02-21T11:19:42.2389286Z error=1 2026-02-21T11:19:42.2389479Z ok=32 2026-02-21T11:19:42.2393695Z min=0.8672 2026-02-21T11:19:42.2397596Z mid=1.1140 2026-02-21T11:19:42.2401473Z max=3.6987 2026-02-21T11:19:42.2402691Z best={'block_sizes': [1, 512, 2048], 2026-02-21T11:19:42.2403026Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:19:42.2403340Z 'tensor_descriptor'], 2026-02-21T11:19:42.2403539Z 'load_eviction_policies': ['', '', 'last', 'last'], 2026-02-21T11:19:42.2403739Z 'num_stages': 7, 2026-02-21T11:19:42.2403876Z 'num_warps': 1, 2026-02-21T11:19:42.2404020Z 'pid_type': 'flat', 2026-02-21T11:19:42.2404173Z 'range_flattens': [None, True, True], 2026-02-21T11:19:42.2404434Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:19:42.2404648Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:19:42.2404866Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:19:42.2405088Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:19:42.2434304Z [370s] Fitting surrogate: 1073 points, 1073 targets 2026-02-21T11:19:42.8826490Z [370s] Generation 17 starting: 33 neighbors, 2 active search path(s) 2026-02-21T11:19:45.7589757Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 13.9 configs/s 2026-02-21T11:19:46.0052304Z [373s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', '', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, None, True], 
range_num_stages=[0, 4, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, None]) 2026-02-21T11:19:46.0053546Z Tensor-likes are not close! 2026-02-21T11:19:46.0056309Z 2026-02-21T11:19:46.0056540Z Mismatched elements: 10852 / 1073741824 (0.0%) 2026-02-21T11:19:46.0056892Z Greatest absolute difference: 0.0625 at index (4867, 4010) (up to 0.01 allowed) 2026-02-21T11:19:46.0058941Z Greatest relative difference: 964.0 at index (208713, 2361) (up to 0.01 allowed) 2026-02-21T11:19:46.0059244Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:19:46.0059412Z 2026-02-21T11:19:48.2103408Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 34/34 14.0 configs/s 2026-02-21T11:19:52.3168307Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━━ 291/291 66.9 configs/s 2026-02-21T11:19:52.5902443Z [380s] Generation 17 complete: 2026-02-21T11:19:52.5903956Z error=1 2026-02-21T11:19:52.5904117Z ok=34 2026-02-21T11:19:52.5904241Z min=0.8693 2026-02-21T11:19:52.5904374Z mid=1.1377 2026-02-21T11:19:52.5904494Z max=2.0788 2026-02-21T11:19:52.5904639Z best={'block_sizes': [1, 512, 2048], 2026-02-21T11:19:52.5905005Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:19:52.5905663Z 'tensor_descriptor'], 2026-02-21T11:19:52.5906766Z 'load_eviction_policies': ['', '', 'last', 'last'], 2026-02-21T11:19:52.5907007Z 'num_stages': 7, 2026-02-21T11:19:52.5911419Z 'num_warps': 1, 2026-02-21T11:19:52.5915864Z 'pid_type': 'flat', 2026-02-21T11:19:52.5917200Z 'range_flattens': [None, True, True], 2026-02-21T11:19:52.5917437Z 'range_multi_buffers': [None, False, None], 2026-02-21T11:19:52.5917651Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:19:52.5917837Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:19:52.5918037Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:19:52.5946206Z [380s] Fitting surrogate: 1108 points, 1108 targets 2026-02-21T11:19:53.2301330Z [381s] Generation 18 
starting: 31 neighbors, 2 active search path(s) 2026-02-21T11:19:56.0186280Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 14.1 configs/s 2026-02-21T11:19:56.4167141Z [384s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 1024, 2048], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', '', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, False, None], range_num_stages=[0, 4, 1], range_unroll_factors=[0, 1, 2], range_warp_specializes=[None, None, None]) 2026-02-21T11:19:56.4168288Z Tensor-likes are not close! 2026-02-21T11:19:56.4172753Z 2026-02-21T11:19:56.4176080Z Mismatched elements: 1 / 1073741824 (0.0%) 2026-02-21T11:19:56.4179483Z Greatest absolute difference: 0.010498046875 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:19:56.4183873Z Greatest relative difference: 0.2890625 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:19:56.4185850Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:19:56.4186045Z 2026-02-21T11:19:58.3266348Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 32/32 14.1 configs/s 2026-02-21T11:20:02.2594594Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━ 291/291 69.7 configs/s 2026-02-21T11:20:02.5411760Z [390s] Generation 18 complete: 2026-02-21T11:20:02.5416706Z error=1 2026-02-21T11:20:02.5418294Z ok=32 2026-02-21T11:20:02.5418491Z min=0.8100 2026-02-21T11:20:02.5423438Z mid=1.1438 2026-02-21T11:20:02.5427078Z max=3.9905 2026-02-21T11:20:02.5430984Z best={'block_sizes': [1, 512, 2048], 2026-02-21T11:20:02.5434571Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:20:02.5435686Z 'tensor_descriptor'], 2026-02-21T11:20:02.5435905Z 'load_eviction_policies': ['', '', 'last', 'last'], 2026-02-21T11:20:02.5436100Z 'num_stages': 7, 2026-02-21T11:20:02.5436247Z 'num_warps': 1, 2026-02-21T11:20:02.5436384Z 'pid_type': 'flat', 2026-02-21T11:20:02.5436544Z 'range_flattens': [None, True, True], 2026-02-21T11:20:02.5436741Z 'range_multi_buffers': [None, True, None], 2026-02-21T11:20:02.5436931Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:20:02.5437115Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:20:02.5437590Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:20:02.5452167Z [390s] Fitting surrogate: 1141 points, 1141 targets 2026-02-21T11:20:03.1918159Z [391s] Generation 19 starting: 32 neighbors, 2 active search path(s) 2026-02-21T11:20:06.6225746Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 36.2 configs/s 2026-02-21T11:20:06.7186543Z [394s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 1024, 2048], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, True, None], range_num_stages=[0, 
4, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, None]) 2026-02-21T11:20:06.7187791Z Tensor-likes are not close! 2026-02-21T11:20:06.7191462Z 2026-02-21T11:20:06.7196507Z Mismatched elements: 1 / 1073741824 (0.0%) 2026-02-21T11:20:06.7200495Z Greatest absolute difference: 0.010498046875 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:20:06.7204833Z Greatest relative difference: 0.2890625 at index (29446, 3489) (up to 0.01 allowed) 2026-02-21T11:20:06.7206175Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:20:06.7206362Z 2026-02-21T11:20:07.4467750Z [395s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', '', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, True, None], range_num_stages=[0, 4, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, None]) 2026-02-21T11:20:07.4468888Z Tensor-likes are not close! 2026-02-21T11:20:07.4473487Z 2026-02-21T11:20:07.4475562Z Mismatched elements: 10852 / 1073741824 (0.0%) 2026-02-21T11:20:07.4475888Z Greatest absolute difference: 0.0625 at index (4867, 4010) (up to 0.01 allowed) 2026-02-21T11:20:07.4476227Z Greatest relative difference: 964.0 at index (208713, 2361) (up to 0.01 allowed) 2026-02-21T11:20:07.4476534Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:20:07.4476696Z 2026-02-21T11:20:08.9531335Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 33/33 13.9 configs/s 2026-02-21T11:20:12.8147631Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━━━ 295/295 72.0 configs/s 2026-02-21T11:20:13.0863856Z [400s] Generation 19 complete: 2026-02-21T11:20:13.0866922Z error=2 2026-02-21T11:20:13.0871283Z ok=32 2026-02-21T11:20:13.0873245Z min=0.8150 2026-02-21T11:20:13.0873445Z mid=1.1243 2026-02-21T11:20:13.0877440Z max=1.7940 2026-02-21T11:20:13.0880670Z best={'block_sizes': [1, 512, 2048], 2026-02-21T11:20:13.0885368Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:20:13.0886957Z 'tensor_descriptor'], 2026-02-21T11:20:13.0887231Z 'load_eviction_policies': ['', '', 'last', 'last'], 2026-02-21T11:20:13.0887438Z 'num_stages': 7, 2026-02-21T11:20:13.0887578Z 'num_warps': 1, 2026-02-21T11:20:13.0887723Z 'pid_type': 'flat', 2026-02-21T11:20:13.0887878Z 'range_flattens': [None, True, True], 2026-02-21T11:20:13.0888074Z 'range_multi_buffers': [None, True, None], 2026-02-21T11:20:13.0888267Z 'range_num_stages': [0, 4, 2], 2026-02-21T11:20:13.0892907Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:20:13.0896776Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:20:13.0914848Z [400s] Fitting surrogate: 1175 points, 1175 targets 2026-02-21T11:20:13.7529788Z [401s] Generation 20 starting: 31 neighbors, 2 active search path(s) 2026-02-21T11:20:17.7609991Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 3.7 configs/s 2026-02-21T11:20:18.6374385Z [406s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 2048], indexing=['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', '', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, True, None], range_num_stages=[0, 4, 2], 
range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, None]) 2026-02-21T11:20:18.6376036Z Tensor-likes are not close! 2026-02-21T11:20:18.6380741Z 2026-02-21T11:20:18.6385289Z Mismatched elements: 10852 / 1073741824 (0.0%) 2026-02-21T11:20:18.6388182Z Greatest absolute difference: 0.0625 at index (4867, 4010) (up to 0.01 allowed) 2026-02-21T11:20:18.6388590Z Greatest relative difference: 964.0 at index (208713, 2361) (up to 0.01 allowed) 2026-02-21T11:20:18.6388912Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:20:18.6389097Z 2026-02-21T11:20:20.0825271Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 32/32 14.0 configs/s 2026-02-21T11:20:24.4889551Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━━━ 295/295 63.4 configs/s 2026-02-21T11:20:24.7720537Z [412s] Generation 20 complete: 2026-02-21T11:20:24.7724453Z error=1 2026-02-21T11:20:24.7728793Z ok=32 2026-02-21T11:20:24.7730456Z min=0.8429 2026-02-21T11:20:24.7730617Z mid=1.0854 2026-02-21T11:20:24.7730735Z max=4.8507 2026-02-21T11:20:24.7730929Z best={'block_sizes': [1, 512, 2048], 2026-02-21T11:20:24.7733490Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:20:24.7733814Z 'tensor_descriptor'], 2026-02-21T11:20:24.7734021Z 'load_eviction_policies': ['', '', 'last', 'last'], 2026-02-21T11:20:24.7734224Z 'num_stages': 6, 2026-02-21T11:20:24.7734363Z 'num_warps': 1, 2026-02-21T11:20:24.7734513Z 'pid_type': 'flat', 2026-02-21T11:20:24.7734670Z 'range_flattens': [None, True, True], 2026-02-21T11:20:24.7734869Z 'range_multi_buffers': [None, True, None], 2026-02-21T11:20:24.7735065Z 'range_num_stages': [0, 4, 2], 2026-02-21T11:20:24.7735242Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:20:24.7735447Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:20:24.7769845Z [412s] Fitting surrogate: 1208 points, 1208 targets 2026-02-21T11:20:25.0729100Z [412s] Autotuning complete in 413.0s after 
searching 1170 configs. 2026-02-21T11:20:25.0729416Z One can hardcode the best config and skip autotuning with: 2026-02-21T11:20:25.0730616Z @helion.kernel(config=helion.Config(block_sizes=[1, 512, 2048], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', '', 'last', 'last'], num_stages=6, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, True, None], range_num_stages=[0, 4, 2], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, None]), static_shapes=True) 2026-02-21T11:20:25.0732083Z 2026-02-21T11:20:25.0732370Z [412s] Code of selected kernel: /tmp/torchinductor_root/u7/cu7dqu3r7lma4lec452ndyeyhnslexa4lahv676vmevp2qhgz7if.py 2026-02-21T11:20:25.1075304Z from __future__ import annotations 2026-02-21T11:20:25.1079581Z 2026-02-21T11:20:25.1081311Z import torch 2026-02-21T11:20:25.1081480Z import triton 2026-02-21T11:20:25.1081638Z import triton.language as tl 2026-02-21T11:20:25.1081928Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T11:20:25.1082227Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T11:20:25.1082405Z 2026-02-21T11:20:25.1082473Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T11:20:25.1082655Z _BLOCK_SIZE_1 = tl.constexpr(512) 2026-02-21T11:20:25.1082827Z _BLOCK_SIZE_2 = tl.constexpr(2048) 2026-02-21T11:20:25.1082949Z 2026-02-21T11:20:25.1083006Z @triton.jit 2026-02-21T11:20:25.1083169Z def _helion_welford(x, weight, bias, out, eps): 2026-02-21T11:20:25.1083385Z # src[welford.py:45]: for tile_m in hl.tile(m): 2026-02-21T11:20:25.1083591Z pid_0 = tl.program_id(0) 2026-02-21T11:20:25.1083756Z offset_0 = pid_0 2026-02-21T11:20:25.1083932Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T11:20:25.1084453Z # src[welford.py:46]: acc_cnt = torch.zeros_like(x[tile_m, 0], dtype=torch.float32) 2026-02-21T11:20:25.1084749Z acc_cnt = tl.full([_BLOCK_SIZE_0], 0, 
tl.float32) 2026-02-21T11:20:25.1084993Z # src[welford.py:47]: acc_mean = torch.zeros_like(acc_cnt) 2026-02-21T11:20:25.1085242Z acc_mean = tl.full([_BLOCK_SIZE_0], 0, tl.float32) 2026-02-21T11:20:25.1085499Z # src[welford.py:48]: acc_m2 = torch.zeros_like(acc_cnt) 2026-02-21T11:20:25.1085723Z acc_m2 = tl.full([_BLOCK_SIZE_0], 0, tl.float32) 2026-02-21T11:20:25.1085934Z # src[welford.py:50]: for tile_n in hl.tile(n): 2026-02-21T11:20:25.1086146Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:20:25.1086363Z # src[welford.py:52]: Tn = chunk.size(-1) 2026-02-21T11:20:25.1086553Z # src[welford.py:50-63]: ... 2026-02-21T11:20:25.1086993Z for offset_1 in tl.range(0, 4096, _BLOCK_SIZE_1, loop_unroll_factor=2, num_stages=1, disallow_acc_multi_buffer=False, flatten=True): 2026-02-21T11:20:25.1087414Z indices_1 = offset_1 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T11:20:25.1087642Z acc_mean_copy = acc_mean 2026-02-21T11:20:25.1087813Z acc_cnt_copy = acc_cnt 2026-02-21T11:20:25.1087971Z acc_m2_copy = acc_m2 2026-02-21T11:20:25.1088139Z acc_mean_copy_0 = acc_mean_copy 2026-02-21T11:20:25.1088319Z acc_cnt_copy_0 = acc_cnt_copy 2026-02-21T11:20:25.1088497Z acc_m2_copy_0 = acc_m2_copy 2026-02-21T11:20:25.1088725Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:20:25.1088981Z chunk = tl.load(x + (indices_0[:, None] * 4096 + indices_1[None, :] * 1), None) 2026-02-21T11:20:25.1089272Z # src[welford.py:53]: sum_x = torch.sum(chunk, dim=-1) 2026-02-21T11:20:25.1089508Z sum_x = tl.cast(tl.sum(chunk, 1), tl.bfloat16) 2026-02-21T11:20:25.1089753Z # src[welford.py:54]: sum_x2 = torch.sum(chunk * chunk, dim=-1) 2026-02-21T11:20:25.1089983Z v_0 = chunk * chunk 2026-02-21T11:20:25.1090159Z sum_x2 = tl.cast(tl.sum(v_0, 1), tl.bfloat16) 2026-02-21T11:20:25.1090376Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:20:25.1090573Z _BLOCK_SIZE_1_ = _BLOCK_SIZE_1 2026-02-21T11:20:25.1090763Z # src[welford.py:55]: mean_c = sum_x / Tn 
2026-02-21T11:20:25.1090957Z v_1 = tl.cast(_BLOCK_SIZE_1_, tl.bfloat16) 2026-02-21T11:20:25.1091144Z v_2 = sum_x / v_1 2026-02-21T11:20:25.1091338Z # src[welford.py:56]: m2_c = sum_x2 - (sum_x * sum_x) / Tn 2026-02-21T11:20:25.1091545Z v_3 = sum_x * sum_x 2026-02-21T11:20:25.1091719Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:20:25.1091952Z _BLOCK_SIZE_1__1 = _BLOCK_SIZE_1 2026-02-21T11:20:25.1092166Z # src[welford.py:56]: m2_c = sum_x2 - (sum_x * sum_x) / Tn 2026-02-21T11:20:25.1092389Z v_4 = tl.cast(_BLOCK_SIZE_1__1, tl.bfloat16) 2026-02-21T11:20:25.1092578Z v_5 = v_3 / v_4 2026-02-21T11:20:25.1092722Z v_6 = sum_x2 - v_5 2026-02-21T11:20:25.1092899Z # src[welford.py:58]: delta = mean_c - acc_mean 2026-02-21T11:20:25.1093099Z v_7 = tl.cast(v_2, tl.float32) 2026-02-21T11:20:25.1093271Z v_8 = v_7 - acc_mean_copy_0 2026-02-21T11:20:25.1093463Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:20:25.1093656Z _BLOCK_SIZE_1__2 = _BLOCK_SIZE_1 2026-02-21T11:20:25.1093855Z # src[welford.py:59]: new_cnt = acc_cnt + Tn 2026-02-21T11:20:25.1094057Z v_9 = tl.cast(_BLOCK_SIZE_1__2, tl.float32) 2026-02-21T11:20:25.1094257Z acc_cnt = acc_cnt_copy_0 + v_9 2026-02-21T11:20:25.1094486Z # src[welford.py:60]: new_mean = acc_mean + delta * (Tn / new_cnt) 2026-02-21T11:20:25.1094715Z v_11 = tl.full([], 1, tl.int32) 2026-02-21T11:20:25.1094892Z v_12 = v_11 / acc_cnt 2026-02-21T11:20:25.1095064Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:20:25.1095269Z _BLOCK_SIZE_1__3 = _BLOCK_SIZE_1 2026-02-21T11:20:25.1095565Z # src[welford.py:60]: new_mean = acc_mean + delta * (Tn / new_cnt) 2026-02-21T11:20:25.1095822Z v_13 = tl.cast(_BLOCK_SIZE_1__3, tl.float32) 2026-02-21T11:20:25.1096017Z v_14 = v_12 * v_13 2026-02-21T11:20:25.1096184Z v_15 = v_8 * v_14 2026-02-21T11:20:25.1096360Z acc_mean = acc_mean_copy_0 + v_15 2026-02-21T11:20:25.1096629Z # src[welford.py:61]: new_m2 = acc_m2 + m2_c + delta * delta * (acc_cnt * Tn / new_cnt) 
2026-02-21T11:20:25.1096913Z v_17 = tl.cast(v_6, tl.float32) 2026-02-21T11:20:25.1097094Z v_18 = acc_m2_copy_0 + v_17 2026-02-21T11:20:25.1097278Z v_19 = v_8 * v_8 2026-02-21T11:20:25.1097452Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:20:25.1097656Z _BLOCK_SIZE_1__4 = _BLOCK_SIZE_1 2026-02-21T11:20:25.1097914Z # src[welford.py:61]: new_m2 = acc_m2 + m2_c + delta * delta * (acc_cnt * Tn / new_cnt) 2026-02-21T11:20:25.1098324Z v_20 = tl.cast(_BLOCK_SIZE_1__4, tl.float32) 2026-02-21T11:20:25.1098527Z v_21 = acc_cnt_copy_0 * v_20 2026-02-21T11:20:25.1098692Z v_22 = v_21 / acc_cnt 2026-02-21T11:20:25.1098852Z v_23 = v_19 * v_22 2026-02-21T11:20:25.1098999Z acc_m2 = v_18 + v_23 2026-02-21T11:20:25.1099215Z # src[welford.py:65]: rstd_tile = torch.rsqrt(acc_m2 / acc_cnt + eps) 2026-02-21T11:20:25.1099441Z v_25 = acc_m2 / acc_cnt 2026-02-21T11:20:25.1099597Z v_26 = v_25 + eps 2026-02-21T11:20:25.1099754Z v_27 = libdevice.rsqrt(v_26) 2026-02-21T11:20:25.1099941Z # src[welford.py:66]: mean_col = acc_mean[:, None] 2026-02-21T11:20:25.1100145Z mean_col = acc_mean[:, None] 2026-02-21T11:20:25.1100328Z # src[welford.py:67]: rstd_col = rstd_tile[:, None] 2026-02-21T11:20:25.1100528Z rstd_col = v_27[:, None] 2026-02-21T11:20:25.1100699Z # src[welford.py:69]: for tile_n in hl.tile(n): 2026-02-21T11:20:25.1100923Z # src[welford.py:70]: xi_chuck = x[tile_m, tile_n] 2026-02-21T11:20:25.1101159Z # src[welford.py:71]: w_chuck = weight[tile_n][None, :] 2026-02-21T11:20:25.1101373Z # src[welford.py:69-77]: ... 
2026-02-21T11:20:25.1101647Z for offset_2 in tl.range(0, 4096, _BLOCK_SIZE_2, loop_unroll_factor=2, num_stages=1, flatten=True): 2026-02-21T11:20:25.1102011Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_2).to(tl.int32) 2026-02-21T11:20:25.1102237Z mean_col_copy = mean_col 2026-02-21T11:20:25.1102395Z rstd_col_copy = rstd_col 2026-02-21T11:20:25.1102564Z mean_col_copy_0 = mean_col_copy 2026-02-21T11:20:25.1102733Z rstd_col_copy_0 = rstd_col_copy 2026-02-21T11:20:25.1102934Z # src[welford.py:70]: xi_chuck = x[tile_m, tile_n] 2026-02-21T11:20:25.1103213Z xi_chuck = tl.load(x + (indices_0[:, None] * 4096 + indices_2[None, :] * 1), None) 2026-02-21T11:20:25.1103501Z # src[welford.py:71]: w_chuck = weight[tile_n][None, :] 2026-02-21T11:20:25.1103789Z load_1 = tl.load(weight + indices_2 * 1, None, eviction_policy='evict_last') 2026-02-21T11:20:25.1104041Z w_chuck = load_1[None, :] 2026-02-21T11:20:25.1104243Z # src[welford.py:72]: b_chuck = bias[tile_n][None, :] 2026-02-21T11:20:25.1104502Z load_2 = tl.load(bias + indices_2 * 1, None, eviction_policy='evict_last') 2026-02-21T11:20:25.1104753Z b_chuck = load_2[None, :] 2026-02-21T11:20:25.1104963Z # src[welford.py:74]: y = (xi_chuck - mean_col) * rstd_col 2026-02-21T11:20:25.1105182Z v_28 = tl.cast(xi_chuck, tl.float32) 2026-02-21T11:20:25.1105373Z v_29 = v_28 - mean_col_copy_0 2026-02-21T11:20:25.1105542Z v_30 = v_29 * rstd_col_copy_0 2026-02-21T11:20:25.1105735Z # src[welford.py:75]: y = y * w_chuck + b_chuck 2026-02-21T11:20:25.1105926Z v_31 = tl.cast(w_chuck, tl.float32) 2026-02-21T11:20:25.1106109Z v_32 = v_30 * v_31 2026-02-21T11:20:25.1106264Z v_33 = tl.cast(b_chuck, tl.float32) 2026-02-21T11:20:25.1106449Z v_34 = v_32 + v_33 2026-02-21T11:20:25.1106701Z # src[welford.py:77]: out[tile_m, tile_n] = y.to(x.dtype) 2026-02-21T11:20:25.1106916Z v_35 = tl.cast(v_34, tl.bfloat16) 2026-02-21T11:20:25.1107167Z tl.store(out + (indices_0[:, None] * 4096 + indices_2[None, :] * 1), v_35, None) 2026-02-21T11:20:25.1107360Z 
2026-02-21T11:20:25.1107589Z def welford(weight: torch.Tensor, bias: torch.Tensor, x: torch.Tensor, eps: float=1e-05, *, _launcher=_default_launcher): 2026-02-21T11:20:25.1107924Z """ 2026-02-21T11:20:25.1108103Z Applies LayerNorm using Welford's algorithm for mean/variance. 2026-02-21T11:20:25.1108325Z Args: 2026-02-21T11:20:25.1108468Z weight: weight tensor of shape [N] 2026-02-21T11:20:25.1108651Z bias: bias tensor of shape [N] 2026-02-21T11:20:25.1108833Z x: input tensor of shape [M, N] 2026-02-21T11:20:25.1108997Z Returns: 2026-02-21T11:20:25.1109142Z Output tensor of shape [M, N] 2026-02-21T11:20:25.1109302Z """ 2026-02-21T11:20:25.1109489Z # src[welford.py:41]: m, n = x.size() 2026-02-21T11:20:25.1109660Z m, n = x.size() 2026-02-21T11:20:25.1109881Z # src[welford.py:43]: out = torch.empty([m, n], dtype=x.dtype, device=x.device) 2026-02-21T11:20:25.1110182Z out = torch.empty([m, n], dtype=x.dtype, device=x.device) 2026-02-21T11:20:25.1110411Z # src[welford.py:45]: for tile_m in hl.tile(m): 2026-02-21T11:20:25.1110689Z # src[welford.py:46]: acc_cnt = torch.zeros_like(x[tile_m, 0], dtype=torch.float32) 2026-02-21T11:20:25.1110993Z # src[welford.py:47]: acc_mean = torch.zeros_like(acc_cnt) 2026-02-21T11:20:25.1111217Z # src[welford.py:45-77]: ... 
2026-02-21T11:20:25.1111481Z _launcher(_helion_welford, (262144,), x, weight, bias, out, eps, num_warps=1, num_stages=6) 2026-02-21T11:20:25.1111774Z # src[welford.py:78]: return out 2026-02-21T11:20:25.1111968Z return out 2026-02-21T11:20:26.4709122Z WARNING:tritonbench.utils.triton_op:Completed input ID 5: 2026-02-21T11:20:26.4713282Z x_val 2026-02-21T11:20:26.4714872Z ------- 2026-02-21T11:20:26.4715075Z 4096 2026-02-21T11:20:26.4719699Z 2026-02-21T11:20:26.4742573Z 67%|██████▋ | 4/6 [26:22<13:33, 406.97s/it]WARNING:tritonbench.utils.triton_op:Running input ID 7: 2026-02-21T11:20:26.4744163Z x_val 2026-02-21T11:20:26.4744312Z ------- 2026-02-21T11:20:26.4744443Z 6144 2026-02-21T11:20:26.4814735Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for eager_layer_norm 2026-02-21T11:20:27.2102206Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_welford 2026-02-21T11:20:28.5304194Z INFO:tritonbench.utils.triton_op:Took 2.12ms to get benchmark function for torch_compile_welford 2026-02-21T11:20:55.7366712Z WARNING:__main__:Input tensor metadata: 2026-02-21T11:20:55.7370258Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T11:20:55.7374526Z 'dtype': 'torch.bfloat16', 2026-02-21T11:20:55.7379017Z 'shape': (6144,), 2026-02-21T11:20:55.7380360Z 'stride': (1,)}, 2026-02-21T11:20:55.7380574Z { 'device': 'cuda:0', 2026-02-21T11:20:55.7380779Z 'dtype': 'torch.bfloat16', 2026-02-21T11:20:55.7380976Z 'shape': (6144,), 2026-02-21T11:20:55.7381139Z 'stride': (1,)}, 2026-02-21T11:20:55.7381306Z { 'device': 'cuda:0', 2026-02-21T11:20:55.7381475Z 'dtype': 'torch.bfloat16', 2026-02-21T11:20:55.7381666Z 'shape': (262144, 6144), 2026-02-21T11:20:55.7381836Z 'stride': (6144, 1)}), 2026-02-21T11:20:55.7382082Z 'kwargs': {}} 2026-02-21T11:20:55.7386304Z INFO:tritonbench.utils.triton_op:Took 2.36ms to get benchmark function for helion_welford 2026-02-21T11:20:56.0214854Z [0s] Autotune random seed: 2144717750 2026-02-21T11:20:56.0657428Z 
[0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T11:21:32.2621062Z [36s] Timeout after 30s compiling Config(block_sizes=[8192, 1, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', '', 'first', 'last'], num_sm_multiplier=128, num_stages=5, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True, False], range_multi_buffers=[False, True, False], range_num_stages=[0, 4, 0], range_unroll_factors=[3, 0, 0], range_warp_specializes=[False, None, None]) 2026-02-21T11:21:34.4070925Z [38s] Timeout after 30s compiling Config(block_sizes=[2048, 128, 1], indexing=['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', '', 'first', 'last'], num_sm_multiplier=128, num_stages=3, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False, True], range_multi_buffers=[None, None, False], range_num_stages=[1, 3, 3], range_unroll_factors=[0, 4, 4], range_warp_specializes=[True, None, None]) 2026-02-21T11:21:34.4090095Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.4 configs/s 2026-02-21T11:22:50.9414916Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 1.7 configs/s 2026-02-21T11:22:50.9429614Z [114s] Adaptive compile timeout: 30s (90% percentile=5.0s, bounds=[30.0s, 30s]) 2026-02-21T11:22:51.1215963Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 92.9 configs/s 2026-02-21T11:22:51.8971475Z [115s] Initial random population of 100, 5 starting points: 2026-02-21T11:22:51.8973229Z error=7 2026-02-21T11:22:51.8973387Z timeout=2 2026-02-21T11:22:51.8973511Z ok=91 2026-02-21T11:22:51.8973639Z min=2.5651 2026-02-21T11:22:51.8973761Z mid=29.4113 2026-02-21T11:22:51.8973891Z max=817.7162 2026-02-21T11:22:51.8974038Z best={'block_sizes': [16, 16, 16], 
2026-02-21T11:22:51.8974293Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:22:51.8974548Z 'load_eviction_policies': ['', '', '', ''], 2026-02-21T11:22:51.8974727Z 'num_stages': 1, 2026-02-21T11:22:51.8974869Z 'num_warps': 4, 2026-02-21T11:22:51.8975030Z 'pid_type': 'flat', 2026-02-21T11:22:51.8975198Z 'range_flattens': [None, None, None], 2026-02-21T11:22:51.8975385Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:22:51.8975572Z 'range_num_stages': [0, 0, 0], 2026-02-21T11:22:51.8975733Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:22:51.8975929Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:22:51.8991693Z [115s] Fitting surrogate: 100 points, 100 targets 2026-02-21T11:22:53.1004507Z [117s] Generation 1 starting: 97 neighbors, 5 active search path(s) 2026-02-21T11:23:11.0993962Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 102/102 1.2 configs/s 2026-02-21T11:23:15.0303572Z [138s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 2048, 64], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=7, num_warps=32, pid_type='flat', range_flattens=[None, True, False], range_multi_buffers=[None, False, None], range_num_stages=[0, 3, 1], range_unroll_factors=[0, 2, 0], range_warp_specializes=[None, None, True]) 2026-02-21T11:23:15.0305143Z Tensor-likes are not close! 2026-02-21T11:23:15.0309418Z 2026-02-21T11:23:15.0314035Z Mismatched elements: 4 / 1610612736 (0.0%) 2026-02-21T11:23:15.0318276Z Greatest absolute difference: 0.01141357421875 at index (56130, 3670) (up to 0.01 allowed) 2026-02-21T11:23:15.0322617Z Greatest relative difference: 6.625 at index (56130, 3670) (up to 0.01 allowed) 2026-02-21T11:23:15.0324176Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:23:15.0324380Z 2026-02-21T11:23:15.7388230Z [139s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[16, 2048, 128], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', '', '', 'last'], num_sm_multiplier=8, num_stages=4, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, False, False], range_multi_buffers=[False, None, True], range_num_stages=[1, 3, 0], range_unroll_factors=[1, 3, 0], range_warp_specializes=[True, None, None]) 2026-02-21T11:23:15.7394558Z Tensor-likes are not close! 2026-02-21T11:23:15.7398165Z 2026-02-21T11:23:15.7401715Z Mismatched elements: 4 / 1610612736 (0.0%) 2026-02-21T11:23:15.7405599Z Greatest absolute difference: 0.01141357421875 at index (56130, 3670) (up to 0.01 allowed) 2026-02-21T11:23:15.7409264Z Greatest relative difference: 6.625 at index (56130, 3670) (up to 0.01 allowed) 2026-02-21T11:23:15.7412821Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:23:15.7413931Z 2026-02-21T11:23:15.8655004Z [139s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[16, 2048, 128], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', '', 'last'], num_stages=4, num_warps=8, pid_type='flat', range_flattens=[None, False, False], range_multi_buffers=[None, None, None], range_num_stages=[0, 3, 0], range_unroll_factors=[0, 0, 0], range_warp_specializes=[None, True, None]) 2026-02-21T11:23:15.8656109Z Tensor-likes are not close! 
2026-02-21T11:23:15.8659779Z 2026-02-21T11:23:15.8661349Z Mismatched elements: 4 / 1610612736 (0.0%) 2026-02-21T11:23:15.8661682Z Greatest absolute difference: 0.01141357421875 at index (56130, 3670) (up to 0.01 allowed) 2026-02-21T11:23:15.8662208Z Greatest relative difference: 6.625 at index (56130, 3670) (up to 0.01 allowed) 2026-02-21T11:23:15.8662508Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:23:15.8662674Z 2026-02-21T11:23:16.5682864Z [140s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[16, 2048, 128], indexing=['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', '', '', 'last'], maxnreg=256, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, False, False], range_multi_buffers=[False, None, True], range_num_stages=[1, 3, 0], range_unroll_factors=[1, 3, 0], range_warp_specializes=[True, None, None]) 2026-02-21T11:23:16.5684044Z Tensor-likes are not close! 2026-02-21T11:23:16.5689004Z 2026-02-21T11:23:16.5691207Z Mismatched elements: 4 / 1610612736 (0.0%) 2026-02-21T11:23:16.5691600Z Greatest absolute difference: 0.01141357421875 at index (56130, 3670) (up to 0.01 allowed) 2026-02-21T11:23:16.5692037Z Greatest relative difference: 6.625 at index (56130, 3670) (up to 0.01 allowed) 2026-02-21T11:23:16.5692360Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:23:16.5692526Z 2026-02-21T11:23:23.5113185Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 102/102 8.2 configs/s 2026-02-21T11:23:33.3882214Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━ 105/105 10.1 configs/s 2026-02-21T11:23:34.0932151Z [158s] Generation 1 complete: 2026-02-21T11:23:34.0935806Z error=4 2026-02-21T11:23:34.0939049Z ok=99 2026-02-21T11:23:34.0942207Z min=1.9058 2026-02-21T11:23:34.0946127Z mid=3.6518 2026-02-21T11:23:34.0950448Z max=67.1662 2026-02-21T11:23:34.0954349Z best={'block_sizes': [16, 16, 32], 2026-02-21T11:23:34.0954705Z 'indexing': ['pointer', 'pointer', 'pointer', 'pointer', 'pointer'], 2026-02-21T11:23:34.0955017Z 'load_eviction_policies': ['', '', '', ''], 2026-02-21T11:23:34.0958290Z 'num_stages': 2, 2026-02-21T11:23:34.0962187Z 'num_warps': 2, 2026-02-21T11:23:34.0965258Z 'pid_type': 'flat', 2026-02-21T11:23:34.0970340Z 'range_flattens': [None, None, None], 2026-02-21T11:23:34.0974389Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:23:34.0976067Z 'range_num_stages': [0, 0, 0], 2026-02-21T11:23:34.0976286Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:23:34.0976504Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:23:34.0976802Z [158s] Fitting surrogate: 203 points, 203 targets 2026-02-21T11:23:35.4430258Z [159s] Generation 2 starting: 98 neighbors, 5 active search path(s) 2026-02-21T11:23:54.3683339Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 101/101 1.1 configs/s 2026-02-21T11:23:57.3523585Z [181s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[4, 2048, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=7, num_warps=2, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, None, None], range_num_stages=[0, 4, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, None]) 
2026-02-21T11:23:57.3524668Z Tensor-likes are not close! 2026-02-21T11:23:57.3524786Z 2026-02-21T11:23:57.3524871Z Mismatched elements: 12 / 1610612736 (0.0%) 2026-02-21T11:23:57.3525143Z Greatest absolute difference: 0.013671875 at index (5356, 1725) (up to 0.01 allowed) 2026-02-21T11:23:57.3525490Z Greatest relative difference: 2.75 at index (124616, 1960) (up to 0.01 allowed) 2026-02-21T11:23:57.3525784Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:23:57.3525965Z 2026-02-21T11:24:04.0269014Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 101/101 10.5 configs/s 2026-02-21T11:24:13.4912827Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━ 141/141 14.3 configs/s 2026-02-21T11:24:14.0312485Z [197s] Generation 2 complete: 2026-02-21T11:24:14.0316848Z error=1 2026-02-21T11:24:14.0320838Z ok=103 2026-02-21T11:24:14.0322292Z min=1.4603 2026-02-21T11:24:14.0322528Z mid=2.3490 2026-02-21T11:24:14.0322662Z max=19.5185 2026-02-21T11:24:14.0322837Z best={'block_sizes': [4, 512, 512], 2026-02-21T11:24:14.0323150Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:24:14.0323465Z 'tensor_descriptor'], 2026-02-21T11:24:14.0328074Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:24:14.0331824Z 'num_stages': 7, 2026-02-21T11:24:14.0333348Z 'num_warps': 2, 2026-02-21T11:24:14.0333544Z 'pid_type': 'flat', 2026-02-21T11:24:14.0333742Z 'range_flattens': [None, True, False], 2026-02-21T11:24:14.0333956Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:24:14.0334158Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:24:14.0334336Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:24:14.0334537Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:24:14.0338532Z [197s] Fitting surrogate: 307 points, 307 targets 2026-02-21T11:24:15.3376871Z [199s] Generation 3 starting: 94 neighbors, 5 active search path(s) 2026-02-21T11:24:28.8160622Z Generation 3: 
precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 96/96 2.4 configs/s 2026-02-21T11:24:37.1329428Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 96/96 11.6 configs/s 2026-02-21T11:24:53.4494082Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━━ 142/142 8.5 configs/s 2026-02-21T11:24:53.9852764Z [237s] Generation 3 complete: 2026-02-21T11:24:53.9854728Z ok=99 2026-02-21T11:24:53.9859870Z min=1.3839 2026-02-21T11:24:53.9864268Z mid=2.1065 2026-02-21T11:24:53.9868586Z max=26.2175 2026-02-21T11:24:53.9872218Z best={'block_sizes': [1, 512, 256], 2026-02-21T11:24:53.9872966Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 2026-02-21T11:24:53.9873291Z 'tensor_descriptor'], 2026-02-21T11:24:53.9873510Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:24:53.9873730Z 'num_stages': 7, 2026-02-21T11:24:53.9873883Z 'num_warps': 1, 2026-02-21T11:24:53.9874027Z 'pid_type': 'flat', 2026-02-21T11:24:53.9874194Z 'range_flattens': [None, True, False], 2026-02-21T11:24:53.9874392Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:24:53.9874590Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:24:53.9874758Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:24:53.9874963Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:24:53.9881614Z [237s] Fitting surrogate: 406 points, 406 targets 2026-02-21T11:24:55.1949698Z [239s] Generation 4 starting: 87 neighbors, 5 active search path(s) 2026-02-21T11:25:04.1396780Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88/88 7.4 configs/s 2026-02-21T11:25:05.3739843Z [249s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 256], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', '', 'last'], num_stages=7, num_warps=2, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, None, None], range_num_stages=[0, 
4, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, None]) 2026-02-21T11:25:05.3740942Z Tensor-likes are not close! 2026-02-21T11:25:05.3741059Z 2026-02-21T11:25:05.3741150Z Mismatched elements: 12 / 1610612736 (0.0%) 2026-02-21T11:25:05.3741429Z Greatest absolute difference: 0.013671875 at index (5356, 1725) (up to 0.01 allowed) 2026-02-21T11:25:05.3741774Z Greatest relative difference: 2.75 at index (124616, 1960) (up to 0.01 allowed) 2026-02-21T11:25:05.3742256Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:25:05.3742431Z 2026-02-21T11:25:06.0695469Z [250s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 256], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, None, False], range_multi_buffers=[None, None, None], range_num_stages=[0, 4, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, None]) 2026-02-21T11:25:06.0696580Z Tensor-likes are not close! 2026-02-21T11:25:06.0696729Z 2026-02-21T11:25:06.0696812Z Mismatched elements: 5353 / 1610612736 (0.0%) 2026-02-21T11:25:06.0697088Z Greatest absolute difference: 0.0625 at index (380, 5967) (up to 0.01 allowed) 2026-02-21T11:25:06.0697425Z Greatest relative difference: 241.0 at index (179901, 4672) (up to 0.01 allowed) 2026-02-21T11:25:06.0697728Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 
2026-02-21T11:25:06.0697903Z 2026-02-21T11:25:11.5052869Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 88/88 12.0 configs/s 2026-02-21T11:25:24.3388603Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 161/161 12.3 configs/s 2026-02-21T11:25:24.8440952Z [268s] Generation 4 complete: 2026-02-21T11:25:24.8442925Z error=2 2026-02-21T11:25:24.8443069Z ok=91 2026-02-21T11:25:24.8443198Z min=1.3763 2026-02-21T11:25:24.8443323Z mid=1.9881 2026-02-21T11:25:24.8443447Z max=4.2844 2026-02-21T11:25:24.8443578Z best={'block_sizes': [1, 512, 1024], 2026-02-21T11:25:24.8443905Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:25:24.8444247Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:25:24.8444452Z 'num_stages': 7, 2026-02-21T11:25:24.8444593Z 'num_warps': 1, 2026-02-21T11:25:24.8444729Z 'pid_type': 'flat', 2026-02-21T11:25:24.8444890Z 'range_flattens': [None, True, False], 2026-02-21T11:25:24.8445099Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:25:24.8445677Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:25:24.8445841Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:25:24.8446040Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:25:24.8471516Z [268s] Fitting surrogate: 499 points, 499 targets 2026-02-21T11:25:27.0307001Z [270s] Generation 5 starting: 91 neighbors, 5 active search path(s) 2026-02-21T11:25:43.9709737Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/95 1.2 configs/s 2026-02-21T11:25:52.4616066Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 95/95 11.2 configs/s 2026-02-21T11:26:01.7494717Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━ 171/171 17.7 configs/s 2026-02-21T11:26:02.2303537Z [306s] Generation 5 complete: 2026-02-21T11:26:02.2308562Z ok=97 2026-02-21T11:26:02.2312327Z min=1.3979 2026-02-21T11:26:02.2316161Z mid=1.9518 2026-02-21T11:26:02.2317793Z max=13.0960 
2026-02-21T11:26:02.2318013Z best={'block_sizes': [1, 1024, 512], 2026-02-21T11:26:02.2318778Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:26:02.2319202Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:26:02.2319450Z 'num_stages': 7, 2026-02-21T11:26:02.2319592Z 'num_warps': 1, 2026-02-21T11:26:02.2319750Z 'pid_type': 'flat', 2026-02-21T11:26:02.2319920Z 'range_flattens': [None, True, False], 2026-02-21T11:26:02.2320115Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:26:02.2320340Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:26:02.2320522Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:26:02.2320736Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:26:02.2338101Z [306s] Fitting surrogate: 596 points, 596 targets 2026-02-21T11:26:03.5256496Z [307s] Generation 6 starting: 86 neighbors, 5 active search path(s) 2026-02-21T11:26:11.6968454Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 14.5 configs/s 2026-02-21T11:26:14.4697918Z [318s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 4096, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, False], range_multi_buffers=[None, None, None], range_num_stages=[0, 4, 1], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, None]) 2026-02-21T11:26:14.4699140Z Tensor-likes are not close! 
2026-02-21T11:26:14.4699258Z 2026-02-21T11:26:14.4699348Z Mismatched elements: 1231933597 / 1610612736 (76.5%) 2026-02-21T11:26:14.4699637Z Greatest absolute difference: 2.75 at index (111127, 4705) (up to 0.01 allowed) 2026-02-21T11:26:14.4699971Z Greatest relative difference: inf at index (85082, 4395) (up to 0.01 allowed) 2026-02-21T11:26:14.4700264Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:26:14.4700423Z 2026-02-21T11:26:19.3499879Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 11.7 configs/s 2026-02-21T11:26:30.6798692Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 172/172 14.7 configs/s 2026-02-21T11:26:31.1612900Z [335s] Generation 6 complete: 2026-02-21T11:26:31.1617286Z error=1 2026-02-21T11:26:31.1621538Z ok=91 2026-02-21T11:26:31.1624773Z min=1.3655 2026-02-21T11:26:31.1629129Z mid=1.9927 2026-02-21T11:26:31.1630805Z max=9.9226 2026-02-21T11:26:31.1630998Z best={'block_sizes': [1, 1024, 1024], 2026-02-21T11:26:31.1631333Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:26:31.1631704Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:26:31.1631996Z 'num_stages': 8, 2026-02-21T11:26:31.1632161Z 'num_warps': 1, 2026-02-21T11:26:31.1632312Z 'pid_type': 'flat', 2026-02-21T11:26:31.1632495Z 'range_flattens': [None, True, False], 2026-02-21T11:26:31.1632713Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:26:31.1632931Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:26:31.1633114Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:26:31.1633624Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:26:31.1653380Z [335s] Fitting surrogate: 688 points, 688 targets 2026-02-21T11:26:32.4946927Z [336s] Generation 7 starting: 87 neighbors, 5 active search path(s) 2026-02-21T11:26:40.7503119Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 16.5 configs/s 2026-02-21T11:26:43.5649567Z [347s] 
Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, True, False], range_multi_buffers=[None, None, None], range_num_stages=[0, 4, 0], range_unroll_factors=[0, 2, 2], range_warp_specializes=[None, None, None]) 2026-02-21T11:26:43.5650739Z Tensor-likes are not close! 2026-02-21T11:26:43.5650898Z 2026-02-21T11:26:43.5651474Z Mismatched elements: 5353 / 1610612736 (0.0%) 2026-02-21T11:26:43.5651781Z Greatest absolute difference: 0.0625 at index (380, 5967) (up to 0.01 allowed) 2026-02-21T11:26:43.5652221Z Greatest relative difference: 241.0 at index (179901, 4672) (up to 0.01 allowed) 2026-02-21T11:26:43.5652948Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:26:43.5653152Z 2026-02-21T11:26:48.6213559Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 11.7 configs/s 2026-02-21T11:26:57.8430075Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━ 172/172 18.0 configs/s 2026-02-21T11:26:58.3148874Z [362s] Generation 7 complete: 2026-02-21T11:26:58.3153203Z error=1 2026-02-21T11:26:58.3154637Z ok=92 2026-02-21T11:26:58.3154788Z min=1.3583 2026-02-21T11:26:58.3154924Z mid=1.9722 2026-02-21T11:26:58.3155042Z max=14.2348 2026-02-21T11:26:58.3155187Z best={'block_sizes': [1, 1024, 1024], 2026-02-21T11:26:58.3155526Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:26:58.3155887Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:26:58.3156094Z 'num_stages': 8, 2026-02-21T11:26:58.3156228Z 'num_warps': 1, 2026-02-21T11:26:58.3156379Z 'pid_type': 'flat', 2026-02-21T11:26:58.3156533Z 'range_flattens': [None, None, False], 2026-02-21T11:26:58.3156728Z 'range_multi_buffers': [None, None, None], 
2026-02-21T11:26:58.3156912Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:26:58.3157087Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:26:58.3157280Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:26:58.3190020Z [362s] Fitting surrogate: 781 points, 781 targets 2026-02-21T11:26:59.7116396Z [363s] Generation 8 starting: 96 neighbors, 5 active search path(s) 2026-02-21T11:27:13.3889712Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 101/101 3.0 configs/s 2026-02-21T11:27:22.9144124Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 101/101 10.6 configs/s 2026-02-21T11:27:29.8651009Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━ 175/175 23.9 configs/s 2026-02-21T11:27:30.3151844Z [394s] Generation 8 complete: 2026-02-21T11:27:30.3153950Z ok=102 2026-02-21T11:27:30.3154126Z min=1.3261 2026-02-21T11:27:30.3154256Z mid=2.0050 2026-02-21T11:27:30.3154390Z max=34.1729 2026-02-21T11:27:30.3154544Z best={'block_sizes': [1, 1024, 1024], 2026-02-21T11:27:30.3154876Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:27:30.3155234Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:27:30.3155443Z 'num_stages': 7, 2026-02-21T11:27:30.3155589Z 'num_warps': 1, 2026-02-21T11:27:30.3155731Z 'pid_type': 'flat', 2026-02-21T11:27:30.3155899Z 'range_flattens': [None, None, False], 2026-02-21T11:27:30.3156094Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:27:30.3156293Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:27:30.3156471Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:27:30.3156682Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:27:30.3202785Z [394s] Fitting surrogate: 883 points, 883 targets 2026-02-21T11:27:31.6416604Z [395s] Generation 9 starting: 90 neighbors, 5 active search path(s) 2026-02-21T11:27:43.1778944Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94/94 3.4 configs/s 2026-02-21T11:27:52.0894799Z 
Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 94/94 10.6 configs/s 2026-02-21T11:28:01.0122243Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━ 175/175 18.8 configs/s 2026-02-21T11:28:01.4790007Z [425s] Generation 9 complete: 2026-02-21T11:28:01.4795088Z ok=95 2026-02-21T11:28:01.4799378Z min=1.3374 2026-02-21T11:28:01.4802476Z mid=2.2261 2026-02-21T11:28:01.4806965Z max=39.8940 2026-02-21T11:28:01.4810082Z best={'block_sizes': [1, 1024, 1024], 2026-02-21T11:28:01.4813458Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:28:01.4818251Z 'load_eviction_policies': ['last', 'last', 'last', 'last'], 2026-02-21T11:28:01.4822392Z 'num_stages': 7, 2026-02-21T11:28:01.4827335Z 'num_warps': 1, 2026-02-21T11:28:01.4828828Z 'pid_type': 'flat', 2026-02-21T11:28:01.4829037Z 'range_flattens': [None, None, None], 2026-02-21T11:28:01.4829238Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:28:01.4829438Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:28:01.4829604Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:28:01.4829804Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:28:01.4847043Z [425s] Fitting surrogate: 978 points, 978 targets 2026-02-21T11:28:02.8343913Z [426s] Generation 10 starting: 91 neighbors, 5 active search path(s) 2026-02-21T11:28:40.8488158Z [464s] Timeout after 30s compiling Config(block_sizes=[128, 64, 256], indexing=['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'last', 'last', 'first'], maxnreg=64, num_sm_multiplier=128, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, None, False], range_multi_buffers=[None, True, False], range_num_stages=[2, 1, 1], range_unroll_factors=[1, 3, 4], range_warp_specializes=[False, False, False]) 2026-02-21T11:28:40.8505346Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/95 0.4 configs/s 2026-02-21T11:28:49.9181102Z 
Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 95/95 10.5 configs/s 2026-02-21T11:28:54.9224039Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━ 186/186 34.4 configs/s 2026-02-21T11:28:55.3504165Z [479s] Generation 10 complete: 2026-02-21T11:28:55.3508500Z timeout=1 2026-02-21T11:28:55.3509971Z ok=95 2026-02-21T11:28:55.3510129Z min=1.3579 2026-02-21T11:28:55.3510267Z mid=2.1228 2026-02-21T11:28:55.3510389Z max=54.5300 2026-02-21T11:28:55.3510541Z best={'block_sizes': [1, 1024, 1024], 2026-02-21T11:28:55.3510873Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:28:55.3511247Z 'load_eviction_policies': ['last', 'last', 'last', 'last'], 2026-02-21T11:28:55.3511818Z 'num_stages': 7, 2026-02-21T11:28:55.3512036Z 'num_warps': 1, 2026-02-21T11:28:55.3512193Z 'pid_type': 'flat', 2026-02-21T11:28:55.3512354Z 'range_flattens': [None, None, None], 2026-02-21T11:28:55.3512562Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:28:55.3512746Z 'range_num_stages': [0, 4, 2], 2026-02-21T11:28:55.3512918Z 'range_unroll_factors': [0, 2, 2], 2026-02-21T11:28:55.3513115Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:28:55.3555293Z [479s] Fitting surrogate: 1074 points, 1074 targets 2026-02-21T11:28:56.7517646Z [480s] Generation 11 starting: 97 neighbors, 5 active search path(s) 2026-02-21T11:29:11.3428583Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 101/101 3.4 configs/s 2026-02-21T11:29:20.8489583Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━ 101/101 10.6 configs/s 2026-02-21T11:29:27.9436414Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━━ 186/186 25.0 configs/s 2026-02-21T11:29:28.3775686Z [512s] Generation 11 complete: 2026-02-21T11:29:28.3779521Z ok=102 2026-02-21T11:29:28.3783545Z min=1.2689 2026-02-21T11:29:28.3787923Z mid=2.1290 2026-02-21T11:29:28.3792284Z max=35.3434 2026-02-21T11:29:28.3796797Z best={'block_sizes': [1, 1024, 
1024], 2026-02-21T11:29:28.3797227Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:29:28.3797599Z 'load_eviction_policies': ['last', 'last', 'last', 'last'], 2026-02-21T11:29:28.3801331Z 'num_stages': 7, 2026-02-21T11:29:28.3805661Z 'num_warps': 1, 2026-02-21T11:29:28.3810560Z 'pid_type': 'flat', 2026-02-21T11:29:28.3814896Z 'range_flattens': [None, None, None], 2026-02-21T11:29:28.3816295Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:29:28.3816534Z 'range_num_stages': [0, 4, 2], 2026-02-21T11:29:28.3816710Z 'range_unroll_factors': [0, 2, 3], 2026-02-21T11:29:28.3816922Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:29:28.3836755Z [512s] Fitting surrogate: 1176 points, 1176 targets 2026-02-21T11:29:29.7081015Z [513s] Generation 12 starting: 95 neighbors, 5 active search path(s) 2026-02-21T11:29:44.7960325Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 100/100 2.2 configs/s 2026-02-21T11:29:55.0679858Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 100/100 9.7 configs/s 2026-02-21T11:29:59.6179262Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━ 186/186 37.7 configs/s 2026-02-21T11:30:00.0327370Z [543s] Generation 12 complete: 2026-02-21T11:30:00.0329318Z ok=101 2026-02-21T11:30:00.0329480Z min=1.3272 2026-02-21T11:30:00.0329616Z mid=2.1115 2026-02-21T11:30:00.0329734Z max=61.0171 2026-02-21T11:30:00.0329886Z best={'block_sizes': [1, 1024, 1024], 2026-02-21T11:30:00.0330205Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:30:00.0330548Z 'load_eviction_policies': ['last', 'last', 'last', 'last'], 2026-02-21T11:30:00.0330791Z 'num_stages': 7, 2026-02-21T11:30:00.0330940Z 'num_warps': 1, 2026-02-21T11:30:00.0331084Z 'pid_type': 'flat', 2026-02-21T11:30:00.0331238Z 'range_flattens': [None, True, None], 2026-02-21T11:30:00.0331433Z 'range_multi_buffers': [None, None, None], 
2026-02-21T11:30:00.0331612Z 'range_num_stages': [0, 4, 2], 2026-02-21T11:30:00.0331779Z 'range_unroll_factors': [0, 2, 3], 2026-02-21T11:30:00.0332222Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:30:00.0401216Z [543s] Fitting surrogate: 1277 points, 1277 targets 2026-02-21T11:30:01.3481066Z [545s] Generation 13 starting: 91 neighbors, 5 active search path(s) 2026-02-21T11:30:10.3800559Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 93/93 17.9 configs/s 2026-02-21T11:30:18.4437536Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 93/93 11.6 configs/s 2026-02-21T11:30:22.9733679Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━ 186/186 37.9 configs/s 2026-02-21T11:30:23.3862622Z [567s] Generation 13 complete: 2026-02-21T11:30:23.3866865Z ok=96 2026-02-21T11:30:23.3870873Z min=1.3122 2026-02-21T11:30:23.3875275Z mid=1.9974 2026-02-21T11:30:23.3876696Z max=6.7620 2026-02-21T11:30:23.3876886Z best={'block_sizes': [1, 1024, 1024], 2026-02-21T11:30:23.3877212Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:30:23.3877577Z 'load_eviction_policies': ['last', 'last', 'last', 'last'], 2026-02-21T11:30:23.3877794Z 'num_stages': 7, 2026-02-21T11:30:23.3877947Z 'num_warps': 1, 2026-02-21T11:30:23.3878083Z 'pid_type': 'flat', 2026-02-21T11:30:23.3878247Z 'range_flattens': [None, True, None], 2026-02-21T11:30:23.3878443Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:30:23.3878632Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:30:23.3878802Z 'range_unroll_factors': [0, 2, 3], 2026-02-21T11:30:23.3878992Z 'range_warp_specializes': [None, None, False]} 2026-02-21T11:30:23.3931003Z [567s] Fitting surrogate: 1373 points, 1373 targets 2026-02-21T11:30:24.5034081Z [568s] Generation 14 starting: 73 neighbors, 4 active search path(s) 2026-02-21T11:30:34.0228841Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 9.3 configs/s 
2026-02-21T11:30:40.7170289Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 76/76 11.4 configs/s 2026-02-21T11:30:45.7235316Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━━ 186/186 34.4 configs/s 2026-02-21T11:30:46.1512138Z [590s] Generation 14 complete: 2026-02-21T11:30:46.1515927Z ok=77 2026-02-21T11:30:46.1520316Z min=1.3184 2026-02-21T11:30:46.1525255Z mid=1.9989 2026-02-21T11:30:46.1529529Z max=8.5638 2026-02-21T11:30:46.1533574Z best={'block_sizes': [1, 1024, 1024], 2026-02-21T11:30:46.1536950Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:30:46.1540177Z 'load_eviction_policies': ['last', 'last', 'last', 'last'], 2026-02-21T11:30:46.1540476Z 'num_stages': 7, 2026-02-21T11:30:46.1544008Z 'num_warps': 1, 2026-02-21T11:30:46.1548922Z 'pid_type': 'flat', 2026-02-21T11:30:46.1552053Z 'range_flattens': [None, True, None], 2026-02-21T11:30:46.1555816Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:30:46.1556133Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:30:46.1556348Z 'range_unroll_factors': [0, 2, 3], 2026-02-21T11:30:46.1556563Z 'range_warp_specializes': [None, None, False]} 2026-02-21T11:30:46.1571474Z [590s] Fitting surrogate: 1450 points, 1450 targets 2026-02-21T11:30:47.1704271Z [591s] Generation 15 starting: 66 neighbors, 4 active search path(s) 2026-02-21T11:30:54.2519861Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 16.1 configs/s 2026-02-21T11:30:56.7781387Z [600s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'last', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, None], range_multi_buffers=[None, None, None], range_num_stages=[0, 4, 2], range_unroll_factors=[0, 2, 3], range_warp_specializes=[None, None, False]) 
2026-02-21T11:30:56.7783666Z Tensor-likes are not close! 2026-02-21T11:30:56.7783825Z 2026-02-21T11:30:56.7787917Z Mismatched elements: 5353 / 1610612736 (0.0%) 2026-02-21T11:30:56.7792436Z Greatest absolute difference: 0.0625 at index (380, 5967) (up to 0.01 allowed) 2026-02-21T11:30:56.7794111Z Greatest relative difference: 241.0 at index (179901, 4672) (up to 0.01 allowed) 2026-02-21T11:30:56.7794448Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:30:56.7794625Z 2026-02-21T11:31:00.6092840Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 70/70 11.2 configs/s 2026-02-21T11:31:04.3537059Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━ 186/186 45.0 configs/s 2026-02-21T11:31:04.7754049Z [608s] Generation 15 complete: 2026-02-21T11:31:04.7758426Z error=1 2026-02-21T11:31:04.7759709Z ok=70 2026-02-21T11:31:04.7759887Z min=1.3333 2026-02-21T11:31:04.7760027Z mid=1.9487 2026-02-21T11:31:04.7760464Z max=17.1510 2026-02-21T11:31:04.7760615Z best={'block_sizes': [1, 1024, 1024], 2026-02-21T11:31:04.7760942Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:31:04.7761314Z 'load_eviction_policies': ['last', 'last', 'last', 'last'], 2026-02-21T11:31:04.7761534Z 'num_stages': 7, 2026-02-21T11:31:04.7761673Z 'num_warps': 1, 2026-02-21T11:31:04.7761816Z 'pid_type': 'flat', 2026-02-21T11:31:04.7762078Z 'range_flattens': [None, True, None], 2026-02-21T11:31:04.7762282Z 'range_multi_buffers': [None, True, None], 2026-02-21T11:31:04.7762486Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:31:04.7762664Z 'range_unroll_factors': [0, 2, 3], 2026-02-21T11:31:04.7762866Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:31:04.7835870Z [608s] Fitting surrogate: 1521 points, 1521 targets 2026-02-21T11:31:05.8178089Z [609s] Generation 16 starting: 68 neighbors, 4 active search path(s) 2026-02-21T11:31:13.2033871Z Generation 16: precompiling 100% 
━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 8.9 configs/s 2026-02-21T11:31:19.4988562Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 72/72 11.5 configs/s 2026-02-21T11:31:24.3311095Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━ 186/186 35.7 configs/s 2026-02-21T11:31:24.7451453Z [628s] Generation 16 complete: 2026-02-21T11:31:24.7455839Z ok=73 2026-02-21T11:31:24.7459710Z min=1.3200 2026-02-21T11:31:24.7464154Z mid=1.9651 2026-02-21T11:31:24.7468518Z max=7.4859 2026-02-21T11:31:24.7472468Z best={'block_sizes': [1, 1024, 1024], 2026-02-21T11:31:24.7474122Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:31:24.7474504Z 'load_eviction_policies': ['last', 'last', 'last', 'last'], 2026-02-21T11:31:24.7474735Z 'num_stages': 7, 2026-02-21T11:31:24.7474882Z 'num_warps': 1, 2026-02-21T11:31:24.7475035Z 'pid_type': 'flat', 2026-02-21T11:31:24.7475218Z 'range_flattens': [None, True, None], 2026-02-21T11:31:24.7475437Z 'range_multi_buffers': [None, True, None], 2026-02-21T11:31:24.7475624Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:31:24.7475798Z 'range_unroll_factors': [0, 2, 3], 2026-02-21T11:31:24.7475987Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:31:24.7521183Z [628s] Fitting surrogate: 1594 points, 1594 targets 2026-02-21T11:31:25.6145065Z [629s] Generation 17 starting: 51 neighbors, 3 active search path(s) 2026-02-21T11:31:30.7617374Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 27.8 configs/s 2026-02-21T11:31:34.0440635Z [637s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'last', 'last', 'last'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, True, None], range_multi_buffers=[None, True, False], range_num_stages=[0, 4, 1], range_unroll_factors=[0, 2, 3], 
range_warp_specializes=[None, None, None]) 2026-02-21T11:31:34.0441781Z Tensor-likes are not close! 2026-02-21T11:31:34.0442089Z 2026-02-21T11:31:34.0442184Z Mismatched elements: 5353 / 1610612736 (0.0%) 2026-02-21T11:31:34.0442500Z Greatest absolute difference: 0.0625 at index (380, 5967) (up to 0.01 allowed) 2026-02-21T11:31:34.0442849Z Greatest relative difference: 241.0 at index (179901, 4672) (up to 0.01 allowed) 2026-02-21T11:31:34.0447849Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:31:34.0452152Z 2026-02-21T11:31:35.4688057Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 54/54 11.5 configs/s 2026-02-21T11:31:39.9567963Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━━ 186/186 38.3 configs/s 2026-02-21T11:31:40.3635120Z [644s] Generation 17 complete: 2026-02-21T11:31:40.3639009Z error=1 2026-02-21T11:31:40.3640966Z ok=54 2026-02-21T11:31:40.3641167Z min=1.3395 2026-02-21T11:31:40.3646500Z mid=1.8812 2026-02-21T11:31:40.3650883Z max=12.9864 2026-02-21T11:31:40.3656154Z best={'block_sizes': [1, 1024, 1024], 2026-02-21T11:31:40.3660246Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:31:40.3664706Z 'load_eviction_policies': ['last', 'last', 'last', 'last'], 2026-02-21T11:31:40.3666438Z 'num_stages': 7, 2026-02-21T11:31:40.3666624Z 'num_warps': 1, 2026-02-21T11:31:40.3666773Z 'pid_type': 'flat', 2026-02-21T11:31:40.3666947Z 'range_flattens': [None, True, None], 2026-02-21T11:31:40.3667145Z 'range_multi_buffers': [None, True, None], 2026-02-21T11:31:40.3667345Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:31:40.3667522Z 'range_unroll_factors': [0, 2, 3], 2026-02-21T11:31:40.3667715Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:31:40.3708083Z [644s] Fitting surrogate: 1649 points, 1649 targets 2026-02-21T11:31:41.1954186Z [645s] Generation 18 starting: 53 neighbors, 3 active search path(s) 2026-02-21T11:31:46.4118225Z Generation 18: 
precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 11.0 configs/s 2026-02-21T11:31:51.1095462Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 54/54 11.5 configs/s 2026-02-21T11:31:55.8896467Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━ 186/186 36.1 configs/s 2026-02-21T11:31:56.3081348Z [660s] Generation 18 complete: 2026-02-21T11:31:56.3085589Z ok=57 2026-02-21T11:31:56.3090707Z min=1.3608 2026-02-21T11:31:56.3095133Z mid=1.9118 2026-02-21T11:31:56.3099515Z max=7.9903 2026-02-21T11:31:56.3103448Z best={'block_sizes': [1, 1024, 1024], 2026-02-21T11:31:56.3108016Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:31:56.3108467Z 'load_eviction_policies': ['last', 'last', 'last', 'last'], 2026-02-21T11:31:56.3108705Z 'num_stages': 7, 2026-02-21T11:31:56.3112623Z 'num_warps': 1, 2026-02-21T11:31:56.3116578Z 'pid_type': 'flat', 2026-02-21T11:31:56.3119953Z 'range_flattens': [None, True, None], 2026-02-21T11:31:56.3123754Z 'range_multi_buffers': [None, True, None], 2026-02-21T11:31:56.3128091Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:31:56.3129704Z 'range_unroll_factors': [0, 2, 3], 2026-02-21T11:31:56.3129991Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:31:56.3158586Z [660s] Fitting surrogate: 1706 points, 1706 targets 2026-02-21T11:31:57.1838529Z [661s] Generation 19 starting: 51 neighbors, 3 active search path(s) 2026-02-21T11:32:03.3325494Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 15.0 configs/s 2026-02-21T11:32:07.8277839Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 51/51 11.6 configs/s 2026-02-21T11:32:11.4234694Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━━━ 186/186 46.9 configs/s 2026-02-21T11:32:11.8242147Z [675s] Generation 19 complete: 2026-02-21T11:32:11.8245206Z ok=55 2026-02-21T11:32:11.8249666Z min=1.3741 2026-02-21T11:32:11.8254243Z mid=1.8965 2026-02-21T11:32:11.8258473Z max=16.6840 
2026-02-21T11:32:11.8261763Z best={'block_sizes': [1, 1024, 1024], 2026-02-21T11:32:11.8265094Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:32:11.8269599Z 'load_eviction_policies': ['', 'last', 'last', 'last'], 2026-02-21T11:32:11.8272885Z 'num_stages': 8, 2026-02-21T11:32:11.8276854Z 'num_warps': 1, 2026-02-21T11:32:11.8277097Z 'pid_type': 'flat', 2026-02-21T11:32:11.8282135Z 'range_flattens': [None, True, True], 2026-02-21T11:32:11.8286601Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:32:11.8286887Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:32:11.8290876Z 'range_unroll_factors': [0, 2, 4], 2026-02-21T11:32:11.8294910Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:32:11.8323085Z [675s] Fitting surrogate: 1761 points, 1761 targets 2026-02-21T11:32:12.6876111Z [676s] Generation 20 starting: 50 neighbors, 3 active search path(s) 2026-02-21T11:32:17.9789305Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 13.3 configs/s 2026-02-21T11:32:20.8158228Z [684s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[1, 2048, 1024], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last', 'last', 'last'], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, None, None], range_num_stages=[0, 4, 2], range_unroll_factors=[0, 2, 4], range_warp_specializes=[None, None, None]) 2026-02-21T11:32:20.8159464Z Tensor-likes are not close! 
2026-02-21T11:32:20.8163465Z 2026-02-21T11:32:20.8167920Z Mismatched elements: 5353 / 1610612736 (0.0%) 2026-02-21T11:32:20.8173107Z Greatest absolute difference: 0.0625 at index (380, 5967) (up to 0.01 allowed) 2026-02-21T11:32:20.8174978Z Greatest relative difference: 241.0 at index (179901, 4672) (up to 0.01 allowed) 2026-02-21T11:32:20.8175331Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check. 2026-02-21T11:32:20.8175497Z 2026-02-21T11:32:22.4415281Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 51/51 11.5 configs/s 2026-02-21T11:32:25.4624431Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━━━ 186/186 54.8 configs/s 2026-02-21T11:32:25.8589015Z [689s] Generation 20 complete: 2026-02-21T11:32:25.8592784Z error=1 2026-02-21T11:32:25.8597268Z ok=52 2026-02-21T11:32:25.8601583Z min=1.3144 2026-02-21T11:32:25.8605993Z mid=1.9580 2026-02-21T11:32:25.8610285Z max=11.1094 2026-02-21T11:32:25.8611772Z best={'block_sizes': [1, 1024, 1024], 2026-02-21T11:32:25.8612196Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:32:25.8612539Z 'load_eviction_policies': ['last', 'last', 'last', 'last'], 2026-02-21T11:32:25.8612756Z 'num_stages': 8, 2026-02-21T11:32:25.8612903Z 'num_warps': 1, 2026-02-21T11:32:25.8613041Z 'pid_type': 'flat', 2026-02-21T11:32:25.8613206Z 'range_flattens': [None, True, True], 2026-02-21T11:32:25.8613394Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:32:25.8613603Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:32:25.8613775Z 'range_unroll_factors': [0, 2, 4], 2026-02-21T11:32:25.8613973Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:32:25.8664585Z [689s] Fitting surrogate: 1814 points, 1814 targets 2026-02-21T11:32:26.1972076Z [690s] Autotuning complete in 690.1s after searching 1772 configs. 
2026-02-21T11:32:26.1974283Z One can hardcode the best config and skip autotuning with: 2026-02-21T11:32:26.1975393Z @helion.kernel(config=helion.Config(block_sizes=[1, 1024, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'last', 'last', 'last'], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, True, True], range_multi_buffers=[None, None, None], range_num_stages=[0, 4, 1], range_unroll_factors=[0, 2, 4], range_warp_specializes=[None, None, None]), static_shapes=True) 2026-02-21T11:32:26.1976357Z 2026-02-21T11:32:26.1976625Z [690s] Code of selected kernel: /tmp/torchinductor_root/op/cop45m6rihjtqe4chj56licptm24myjqwwqvvfflvihmmcznzuq4.py 2026-02-21T11:32:26.2318281Z from __future__ import annotations 2026-02-21T11:32:26.2320266Z 2026-02-21T11:32:26.2320444Z import torch 2026-02-21T11:32:26.2320657Z import triton 2026-02-21T11:32:26.2325443Z import triton.language as tl 2026-02-21T11:32:26.2329883Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T11:32:26.2333801Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T11:32:26.2335399Z 2026-02-21T11:32:26.2335571Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T11:32:26.2335795Z _BLOCK_SIZE_1 = tl.constexpr(1024) 2026-02-21T11:32:26.2335996Z _BLOCK_SIZE_2 = tl.constexpr(1024) 2026-02-21T11:32:26.2336112Z 2026-02-21T11:32:26.2336172Z @triton.jit 2026-02-21T11:32:26.2336347Z def _helion_welford(x, weight, bias, out, eps): 2026-02-21T11:32:26.2336582Z # src[welford.py:45]: for tile_m in hl.tile(m): 2026-02-21T11:32:26.2336803Z pid_0 = tl.program_id(0) 2026-02-21T11:32:26.2337272Z offset_0 = pid_0 2026-02-21T11:32:26.2337464Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T11:32:26.2337759Z # src[welford.py:46]: acc_cnt = torch.zeros_like(x[tile_m, 0], dtype=torch.float32) 2026-02-21T11:32:26.2338052Z acc_cnt = tl.full([_BLOCK_SIZE_0], 0, tl.float32) 
2026-02-21T11:32:26.2338296Z # src[welford.py:47]: acc_mean = torch.zeros_like(acc_cnt) 2026-02-21T11:32:26.2338530Z acc_mean = tl.full([_BLOCK_SIZE_0], 0, tl.float32) 2026-02-21T11:32:26.2338765Z # src[welford.py:48]: acc_m2 = torch.zeros_like(acc_cnt) 2026-02-21T11:32:26.2338990Z acc_m2 = tl.full([_BLOCK_SIZE_0], 0, tl.float32) 2026-02-21T11:32:26.2339204Z # src[welford.py:50]: for tile_n in hl.tile(n): 2026-02-21T11:32:26.2339438Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:32:26.2339647Z # src[welford.py:52]: Tn = chunk.size(-1) 2026-02-21T11:32:26.2339845Z # src[welford.py:50-63]: ... 2026-02-21T11:32:26.2340128Z for offset_1 in tl.range(0, 6144, _BLOCK_SIZE_1, loop_unroll_factor=2, num_stages=1, flatten=True): 2026-02-21T11:32:26.2340482Z indices_1 = offset_1 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T11:32:26.2340742Z acc_mean_copy = acc_mean 2026-02-21T11:32:26.2340905Z acc_cnt_copy = acc_cnt 2026-02-21T11:32:26.2341068Z acc_m2_copy = acc_m2 2026-02-21T11:32:26.2341226Z acc_mean_copy_0 = acc_mean_copy 2026-02-21T11:32:26.2341408Z acc_cnt_copy_0 = acc_cnt_copy 2026-02-21T11:32:26.2341586Z acc_m2_copy_0 = acc_m2_copy 2026-02-21T11:32:26.2341773Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:32:26.2342196Z chunk = tl.load(x + (indices_0[:, None] * 6144 + indices_1[None, :] * 1), None, eviction_policy='evict_last') 2026-02-21T11:32:26.2342529Z # src[welford.py:53]: sum_x = torch.sum(chunk, dim=-1) 2026-02-21T11:32:26.2342769Z sum_x = tl.cast(tl.sum(chunk, 1), tl.bfloat16) 2026-02-21T11:32:26.2343006Z # src[welford.py:54]: sum_x2 = torch.sum(chunk * chunk, dim=-1) 2026-02-21T11:32:26.2343239Z v_0 = chunk * chunk 2026-02-21T11:32:26.2343420Z sum_x2 = tl.cast(tl.sum(v_0, 1), tl.bfloat16) 2026-02-21T11:32:26.2343625Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:32:26.2343829Z _BLOCK_SIZE_1_ = _BLOCK_SIZE_1 2026-02-21T11:32:26.2344016Z # src[welford.py:55]: mean_c = sum_x / Tn 2026-02-21T11:32:26.2344226Z 
v_1 = tl.cast(_BLOCK_SIZE_1_, tl.bfloat16) 2026-02-21T11:32:26.2344415Z v_2 = sum_x / v_1 2026-02-21T11:32:26.2344624Z # src[welford.py:56]: m2_c = sum_x2 - (sum_x * sum_x) / Tn 2026-02-21T11:32:26.2344831Z v_3 = sum_x * sum_x 2026-02-21T11:32:26.2345009Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:32:26.2345211Z _BLOCK_SIZE_1__1 = _BLOCK_SIZE_1 2026-02-21T11:32:26.2345420Z # src[welford.py:56]: m2_c = sum_x2 - (sum_x * sum_x) / Tn 2026-02-21T11:32:26.2345653Z v_4 = tl.cast(_BLOCK_SIZE_1__1, tl.bfloat16) 2026-02-21T11:32:26.2345832Z v_5 = v_3 / v_4 2026-02-21T11:32:26.2346067Z v_6 = sum_x2 - v_5 2026-02-21T11:32:26.2346240Z # src[welford.py:58]: delta = mean_c - acc_mean 2026-02-21T11:32:26.2346442Z v_7 = tl.cast(v_2, tl.float32) 2026-02-21T11:32:26.2346611Z v_8 = v_7 - acc_mean_copy_0 2026-02-21T11:32:26.2346802Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:32:26.2347003Z _BLOCK_SIZE_1__2 = _BLOCK_SIZE_1 2026-02-21T11:32:26.2347188Z # src[welford.py:59]: new_cnt = acc_cnt + Tn 2026-02-21T11:32:26.2347396Z v_9 = tl.cast(_BLOCK_SIZE_1__2, tl.float32) 2026-02-21T11:32:26.2347585Z acc_cnt = acc_cnt_copy_0 + v_9 2026-02-21T11:32:26.2347812Z # src[welford.py:60]: new_mean = acc_mean + delta * (Tn / new_cnt) 2026-02-21T11:32:26.2348039Z v_11 = tl.full([], 1, tl.int32) 2026-02-21T11:32:26.2348211Z v_12 = v_11 / acc_cnt 2026-02-21T11:32:26.2348459Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:32:26.2348651Z _BLOCK_SIZE_1__3 = _BLOCK_SIZE_1 2026-02-21T11:32:26.2348875Z # src[welford.py:60]: new_mean = acc_mean + delta * (Tn / new_cnt) 2026-02-21T11:32:26.2349108Z v_13 = tl.cast(_BLOCK_SIZE_1__3, tl.float32) 2026-02-21T11:32:26.2349291Z v_14 = v_12 * v_13 2026-02-21T11:32:26.2349433Z v_15 = v_8 * v_14 2026-02-21T11:32:26.2349594Z acc_mean = acc_mean_copy_0 + v_15 2026-02-21T11:32:26.2349846Z # src[welford.py:61]: new_m2 = acc_m2 + m2_c + delta * delta * (acc_cnt * Tn / new_cnt) 2026-02-21T11:32:26.2350105Z v_17 = 
tl.cast(v_6, tl.float32) 2026-02-21T11:32:26.2350277Z v_18 = acc_m2_copy_0 + v_17 2026-02-21T11:32:26.2350435Z v_19 = v_8 * v_8 2026-02-21T11:32:26.2350604Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:32:26.2350792Z _BLOCK_SIZE_1__4 = _BLOCK_SIZE_1 2026-02-21T11:32:26.2351042Z # src[welford.py:61]: new_m2 = acc_m2 + m2_c + delta * delta * (acc_cnt * Tn / new_cnt) 2026-02-21T11:32:26.2351307Z v_20 = tl.cast(_BLOCK_SIZE_1__4, tl.float32) 2026-02-21T11:32:26.2351497Z v_21 = acc_cnt_copy_0 * v_20 2026-02-21T11:32:26.2351669Z v_22 = v_21 / acc_cnt 2026-02-21T11:32:26.2351815Z v_23 = v_19 * v_22 2026-02-21T11:32:26.2352006Z acc_m2 = v_18 + v_23 2026-02-21T11:32:26.2352214Z # src[welford.py:65]: rstd_tile = torch.rsqrt(acc_m2 / acc_cnt + eps) 2026-02-21T11:32:26.2352446Z v_25 = acc_m2 / acc_cnt 2026-02-21T11:32:26.2352592Z v_26 = v_25 + eps 2026-02-21T11:32:26.2352744Z v_27 = libdevice.rsqrt(v_26) 2026-02-21T11:32:26.2352932Z # src[welford.py:66]: mean_col = acc_mean[:, None] 2026-02-21T11:32:26.2353135Z mean_col = acc_mean[:, None] 2026-02-21T11:32:26.2353323Z # src[welford.py:67]: rstd_col = rstd_tile[:, None] 2026-02-21T11:32:26.2353521Z rstd_col = v_27[:, None] 2026-02-21T11:32:26.2353706Z # src[welford.py:69]: for tile_n in hl.tile(n): 2026-02-21T11:32:26.2353933Z # src[welford.py:70]: xi_chuck = x[tile_m, tile_n] 2026-02-21T11:32:26.2354190Z # src[welford.py:71]: w_chuck = weight[tile_n][None, :] 2026-02-21T11:32:26.2354408Z # src[welford.py:69-77]: ... 
2026-02-21T11:32:26.2354697Z for offset_2 in tl.range(0, 6144, _BLOCK_SIZE_2, loop_unroll_factor=4, num_stages=1, flatten=True): 2026-02-21T11:32:26.2355034Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_2).to(tl.int32) 2026-02-21T11:32:26.2355262Z mean_col_copy = mean_col 2026-02-21T11:32:26.2355428Z rstd_col_copy = rstd_col 2026-02-21T11:32:26.2355592Z mean_col_copy_0 = mean_col_copy 2026-02-21T11:32:26.2355773Z rstd_col_copy_0 = rstd_col_copy 2026-02-21T11:32:26.2355968Z # src[welford.py:70]: xi_chuck = x[tile_m, tile_n] 2026-02-21T11:32:26.2356302Z xi_chuck = tl.load(x + (indices_0[:, None] * 6144 + indices_2[None, :] * 1), None, eviction_policy='evict_last') 2026-02-21T11:32:26.2356653Z # src[welford.py:71]: w_chuck = weight[tile_n][None, :] 2026-02-21T11:32:26.2356933Z load_1 = tl.load(weight + indices_2 * 1, None, eviction_policy='evict_last') 2026-02-21T11:32:26.2357256Z w_chuck = load_1[None, :] 2026-02-21T11:32:26.2357456Z # src[welford.py:72]: b_chuck = bias[tile_n][None, :] 2026-02-21T11:32:26.2357732Z load_2 = tl.load(bias + indices_2 * 1, None, eviction_policy='evict_last') 2026-02-21T11:32:26.2357972Z b_chuck = load_2[None, :] 2026-02-21T11:32:26.2358181Z # src[welford.py:74]: y = (xi_chuck - mean_col) * rstd_col 2026-02-21T11:32:26.2358403Z v_28 = tl.cast(xi_chuck, tl.float32) 2026-02-21T11:32:26.2358591Z v_29 = v_28 - mean_col_copy_0 2026-02-21T11:32:26.2358770Z v_30 = v_29 * rstd_col_copy_0 2026-02-21T11:32:26.2358959Z # src[welford.py:75]: y = y * w_chuck + b_chuck 2026-02-21T11:32:26.2359168Z v_31 = tl.cast(w_chuck, tl.float32) 2026-02-21T11:32:26.2359340Z v_32 = v_30 * v_31 2026-02-21T11:32:26.2359556Z v_33 = tl.cast(b_chuck, tl.float32) 2026-02-21T11:32:26.2359729Z v_34 = v_32 + v_33 2026-02-21T11:32:26.2359919Z # src[welford.py:77]: out[tile_m, tile_n] = y.to(x.dtype) 2026-02-21T11:32:26.2360129Z v_35 = tl.cast(v_34, tl.bfloat16) 2026-02-21T11:32:26.2360376Z tl.store(out + (indices_0[:, None] * 6144 + indices_2[None, :] * 1), v_35, None) 
2026-02-21T11:32:26.2360565Z 2026-02-21T11:32:26.2360797Z def welford(weight: torch.Tensor, bias: torch.Tensor, x: torch.Tensor, eps: float=1e-05, *, _launcher=_default_launcher): 2026-02-21T11:32:26.2361114Z """ 2026-02-21T11:32:26.2361296Z Applies LayerNorm using Welford's algorithm for mean/variance. 2026-02-21T11:32:26.2361511Z Args: 2026-02-21T11:32:26.2361652Z weight: weight tensor of shape [N] 2026-02-21T11:32:26.2361833Z bias: bias tensor of shape [N] 2026-02-21T11:32:26.2362038Z x: input tensor of shape [M, N] 2026-02-21T11:32:26.2362208Z Returns: 2026-02-21T11:32:26.2362343Z Output tensor of shape [M, N] 2026-02-21T11:32:26.2362509Z """ 2026-02-21T11:32:26.2362640Z # src[welford.py:41]: m, n = x.size() 2026-02-21T11:32:26.2362814Z m, n = x.size() 2026-02-21T11:32:26.2363026Z # src[welford.py:43]: out = torch.empty([m, n], dtype=x.dtype, device=x.device) 2026-02-21T11:32:26.2363316Z out = torch.empty([m, n], dtype=x.dtype, device=x.device) 2026-02-21T11:32:26.2363542Z # src[welford.py:45]: for tile_m in hl.tile(m): 2026-02-21T11:32:26.2363820Z # src[welford.py:46]: acc_cnt = torch.zeros_like(x[tile_m, 0], dtype=torch.float32) 2026-02-21T11:32:26.2364126Z # src[welford.py:47]: acc_mean = torch.zeros_like(acc_cnt) 2026-02-21T11:32:26.2364345Z # src[welford.py:45-77]: ... 
2026-02-21T11:32:26.2364614Z _launcher(_helion_welford, (262144,), x, weight, bias, out, eps, num_warps=1, num_stages=8) 2026-02-21T11:32:26.2364900Z # src[welford.py:78]: return out 2026-02-21T11:32:26.2365076Z return out 2026-02-21T11:32:27.4461331Z WARNING:tritonbench.utils.triton_op:Completed input ID 7: 2026-02-21T11:32:27.4461634Z x_val 2026-02-21T11:32:27.4467020Z ------- 2026-02-21T11:32:27.4468676Z 6144 2026-02-21T11:32:27.4468827Z 2026-02-21T11:32:27.4515595Z 83%|████████▎ | 5/6 [38:23<08:40, 520.20s/it]WARNING:tritonbench.utils.triton_op:Running input ID 9: 2026-02-21T11:32:27.4519664Z x_val 2026-02-21T11:32:27.4521110Z ------- 2026-02-21T11:32:27.4521276Z 8192 2026-02-21T11:32:27.4615358Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for eager_layer_norm 2026-02-21T11:32:28.1785629Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_welford 2026-02-21T11:32:29.4978385Z INFO:tritonbench.utils.triton_op:Took 2.27ms to get benchmark function for torch_compile_welford 2026-02-21T11:33:05.5012765Z WARNING:__main__:Input tensor metadata: 2026-02-21T11:33:05.5017263Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T11:33:05.5022378Z 'dtype': 'torch.bfloat16', 2026-02-21T11:33:05.5026410Z 'shape': (8192,), 2026-02-21T11:33:05.5030652Z 'stride': (1,)}, 2026-02-21T11:33:05.5034539Z { 'device': 'cuda:0', 2026-02-21T11:33:05.5036620Z 'dtype': 'torch.bfloat16', 2026-02-21T11:33:05.5036845Z 'shape': (8192,), 2026-02-21T11:33:05.5037025Z 'stride': (1,)}, 2026-02-21T11:33:05.5037191Z { 'device': 'cuda:0', 2026-02-21T11:33:05.5037379Z 'dtype': 'torch.bfloat16', 2026-02-21T11:33:05.5037569Z 'shape': (262144, 8192), 2026-02-21T11:33:05.5037743Z 'stride': (8192, 1)}), 2026-02-21T11:33:05.5037914Z 'kwargs': {}} 2026-02-21T11:33:05.5043074Z INFO:tritonbench.utils.triton_op:Took 3.33ms to get benchmark function for helion_welford 2026-02-21T11:33:05.7864878Z [0s] Autotune random seed: 2144717750 2026-02-21T11:33:05.8333021Z 
[0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T11:33:43.2819156Z [37s] Timeout after 30s compiling Config(block_sizes=[8192, 1, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', '', 'first', 'last'], num_sm_multiplier=128, num_stages=5, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True, False], range_multi_buffers=[False, True, False], range_num_stages=[0, 4, 0], range_unroll_factors=[3, 0, 0], range_warp_specializes=[False, None, None]) 2026-02-21T11:33:46.3851761Z [40s] Timeout after 30s compiling Config(block_sizes=[2048, 128, 1], indexing=['pointer', 'tensor_descriptor', 'pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', '', 'first', 'last'], num_sm_multiplier=128, num_stages=3, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False, True], range_multi_buffers=[None, None, False], range_num_stages=[1, 3, 3], range_unroll_factors=[0, 4, 4], range_warp_specializes=[True, None, None]) 2026-02-21T11:33:46.3869204Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.2 configs/s 2026-02-21T11:35:17.8342951Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 1.4 configs/s 2026-02-21T11:35:17.8361566Z [132s] Adaptive compile timeout: 30s (90% percentile=7.3s, bounds=[30.0s, 30s]) 2026-02-21T11:35:18.6247828Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 33.4 configs/s 2026-02-21T11:35:19.7224070Z [133s] Initial random population of 100, 5 starting points: 2026-02-21T11:35:19.7228399Z error=5 2026-02-21T11:35:19.7232775Z timeout=2 2026-02-21T11:35:19.7236676Z ok=93 2026-02-21T11:35:19.7238256Z min=3.3996 2026-02-21T11:35:19.7238469Z mid=35.6332 2026-02-21T11:35:19.7243969Z max=934.9960 2026-02-21T11:35:19.7245989Z best={'block_sizes': [16, 512, 128], 
2026-02-21T11:35:19.7250644Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:35:19.7254567Z 'load_eviction_policies': ['', '', '', 'last'], 2026-02-21T11:35:19.7258163Z 'num_stages': 4, 2026-02-21T11:35:19.7259601Z 'num_warps': 32, 2026-02-21T11:35:19.7259810Z 'pid_type': 'flat', 2026-02-21T11:35:19.7259991Z 'range_flattens': [None, False, False], 2026-02-21T11:35:19.7260197Z 'range_multi_buffers': [None, None, True], 2026-02-21T11:35:19.7260385Z 'range_num_stages': [0, 3, 0], 2026-02-21T11:35:19.7260561Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:35:19.7260754Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:35:19.7261054Z [133s] Fitting surrogate: 100 points, 100 targets 2026-02-21T11:35:20.9430124Z [135s] Generation 1 starting: 91 neighbors, 5 active search path(s) 2026-02-21T11:35:31.4333715Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/95 16.5 configs/s 2026-02-21T11:35:42.7320564Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 95/95 8.4 configs/s 2026-02-21T11:35:46.1608207Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━━━ 94/94 22.8 configs/s 2026-02-21T11:35:46.9202311Z [161s] Generation 1 complete: 2026-02-21T11:35:46.9205423Z error=2 2026-02-21T11:35:46.9209904Z ok=95 2026-02-21T11:35:46.9213914Z min=2.1740 2026-02-21T11:35:46.9218423Z mid=4.1860 2026-02-21T11:35:46.9219955Z max=43.9982 2026-02-21T11:35:46.9220137Z best={'block_sizes': [32, 128, 128], 2026-02-21T11:35:46.9220429Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:35:46.9220731Z 'load_eviction_policies': ['', '', '', 'last'], 2026-02-21T11:35:46.9220924Z 'num_stages': 4, 2026-02-21T11:35:46.9221058Z 'num_warps': 8, 2026-02-21T11:35:46.9221201Z 'pid_type': 'flat', 2026-02-21T11:35:46.9221358Z 'range_flattens': [None, False, False], 2026-02-21T11:35:46.9221555Z 'range_multi_buffers': [None, None, True], 
2026-02-21T11:35:46.9221742Z 'range_num_stages': [0, 3, 0], 2026-02-21T11:35:46.9222062Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:35:46.9222258Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:35:46.9222563Z [161s] Fitting surrogate: 197 points, 197 targets 2026-02-21T11:35:48.1882464Z [162s] Generation 2 starting: 92 neighbors, 5 active search path(s) 2026-02-21T11:36:01.8270658Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 96/96 2.5 configs/s 2026-02-21T11:36:11.0922533Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 96/96 10.4 configs/s 2026-02-21T11:36:26.5761073Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━ 94/94 5.9 configs/s 2026-02-21T11:36:27.3314461Z [201s] Generation 2 complete: 2026-02-21T11:36:27.3318343Z error=9 2026-02-21T11:36:27.3322234Z ok=89 2026-02-21T11:36:27.3324241Z min=2.2052 2026-02-21T11:36:27.3324447Z mid=2.9496 2026-02-21T11:36:27.3329171Z max=43.2548 2026-02-21T11:36:27.3333721Z best={'block_sizes': [32, 128, 128], 2026-02-21T11:36:27.3337786Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:36:27.3341833Z 'load_eviction_policies': ['', '', '', 'last'], 2026-02-21T11:36:27.3346701Z 'num_stages': 4, 2026-02-21T11:36:27.3348181Z 'num_warps': 8, 2026-02-21T11:36:27.3348378Z 'pid_type': 'flat', 2026-02-21T11:36:27.3348574Z 'range_flattens': [None, False, False], 2026-02-21T11:36:27.3348778Z 'range_multi_buffers': [None, None, True], 2026-02-21T11:36:27.3348973Z 'range_num_stages': [0, 3, 0], 2026-02-21T11:36:27.3349142Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:36:27.3349341Z 'range_warp_specializes': [None, True, None]} 2026-02-21T11:36:27.3349635Z [201s] Fitting surrogate: 295 points, 295 targets 2026-02-21T11:36:28.6301547Z [202s] Generation 3 starting: 94 neighbors, 5 active search path(s) 2026-02-21T11:36:38.5415654Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94/94 14.1 configs/s 
2026-02-21T11:36:46.7824434Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 94/94 11.4 configs/s 2026-02-21T11:37:06.2417372Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━ 98/98 5.0 configs/s 2026-02-21T11:37:06.9487453Z [241s] Generation 3 complete: 2026-02-21T11:37:06.9489542Z error=9 2026-02-21T11:37:06.9489721Z ok=91 2026-02-21T11:37:06.9489852Z min=2.0376 2026-02-21T11:37:06.9489998Z mid=2.6414 2026-02-21T11:37:06.9490116Z max=23.2028 2026-02-21T11:37:06.9490258Z best={'block_sizes': [8, 32, 256], 2026-02-21T11:37:06.9490520Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T11:37:06.9490797Z 'load_eviction_policies': ['', '', '', ''], 2026-02-21T11:37:06.9490978Z 'num_stages': 1, 2026-02-21T11:37:06.9491122Z 'num_warps': 1, 2026-02-21T11:37:06.9491259Z 'pid_type': 'flat', 2026-02-21T11:37:06.9491423Z 'range_flattens': [None, None, None], 2026-02-21T11:37:06.9491617Z 'range_multi_buffers': [None, None, None], 2026-02-21T11:37:06.9491815Z 'range_num_stages': [0, 0, 0], 2026-02-21T11:37:06.9492141Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:37:06.9492344Z 'range_warp_specializes': [None, None, False]} 2026-02-21T11:37:06.9515212Z [241s] Fitting surrogate: 395 points, 395 targets 2026-02-21T11:37:08.2216711Z [242s] Generation 4 starting: 90 neighbors, 5 active search path(s) 2026-02-21T11:37:17.5050960Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91/91 23.2 configs/s 2026-02-21T11:37:26.2698035Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 91/91 10.4 configs/s 2026-02-21T11:37:43.6567474Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━ 99/99 5.6 configs/s 2026-02-21T11:37:44.3601689Z [278s] Generation 4 complete: 2026-02-21T11:37:44.3606630Z error=1 2026-02-21T11:37:44.3610323Z ok=94 2026-02-21T11:37:44.3614729Z min=2.0256 2026-02-21T11:37:44.3618648Z mid=2.6800 2026-02-21T11:37:44.3619943Z max=19.5681 2026-02-21T11:37:44.3620129Z 
best={'block_sizes': [8, 128, 512], 2026-02-21T11:37:44.3620419Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:37:44.3620752Z 'load_eviction_policies': ['', '', '', 'last'], 2026-02-21T11:37:44.3620941Z 'num_stages': 5, 2026-02-21T11:37:44.3621088Z 'num_warps': 2, 2026-02-21T11:37:44.3621241Z 'pid_type': 'flat', 2026-02-21T11:37:44.3621761Z 'range_flattens': [None, False, False], 2026-02-21T11:37:44.3622076Z 'range_multi_buffers': [None, None, True], 2026-02-21T11:37:44.3622267Z 'range_num_stages': [0, 4, 0], 2026-02-21T11:37:44.3622445Z 'range_unroll_factors': [0, 3, 0], 2026-02-21T11:37:44.3622632Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:37:44.3635454Z [278s] Fitting surrogate: 490 points, 490 targets 2026-02-21T11:37:45.4414122Z [279s] Generation 5 starting: 79 neighbors, 5 active search path(s) 2026-02-21T11:37:53.6098458Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80/80 31.5 configs/s 2026-02-21T11:38:01.1115182Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 80/80 10.7 configs/s 2026-02-21T11:38:16.5307752Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━━ 101/101 6.4 configs/s 2026-02-21T11:38:17.2124507Z [311s] Generation 5 complete: 2026-02-21T11:38:17.2128378Z error=2 2026-02-21T11:38:17.2132192Z ok=82 2026-02-21T11:38:17.2134212Z min=2.0358 2026-02-21T11:38:17.2134392Z mid=2.5856 2026-02-21T11:38:17.2134535Z max=17.9359 2026-02-21T11:38:17.2134674Z best={'block_sizes': [8, 32, 256], 2026-02-21T11:38:17.2134946Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T11:38:17.2135213Z 'load_eviction_policies': ['', '', '', ''], 2026-02-21T11:38:17.2135399Z 'num_stages': 1, 2026-02-21T11:38:17.2135536Z 'num_warps': 1, 2026-02-21T11:38:17.2135678Z 'pid_type': 'flat', 2026-02-21T11:38:17.2135843Z 'range_flattens': [None, None, None], 2026-02-21T11:38:17.2136029Z 'range_multi_buffers': [None, None, None], 
2026-02-21T11:38:17.2136219Z 'range_num_stages': [0, 0, 0], 2026-02-21T11:38:17.2136384Z 'range_unroll_factors': [0, 0, 0], 2026-02-21T11:38:17.2136584Z 'range_warp_specializes': [None, None, False]} 2026-02-21T11:38:17.2162061Z [311s] Fitting surrogate: 574 points, 574 targets 2026-02-21T11:38:18.2419693Z [312s] Generation 6 starting: 66 neighbors, 4 active search path(s) 2026-02-21T11:38:27.4101044Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 6.2 configs/s 2026-02-21T11:38:33.9466966Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 10.4 configs/s 2026-02-21T11:38:49.2785773Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━━ 102/102 6.5 configs/s 2026-02-21T11:38:49.9594424Z [344s] Generation 6 complete: 2026-02-21T11:38:49.9598954Z ok=71 2026-02-21T11:38:49.9603260Z min=2.0301 2026-02-21T11:38:49.9605370Z mid=2.5840 2026-02-21T11:38:49.9605567Z max=29.7227 2026-02-21T11:38:49.9609632Z best={'block_sizes': [8, 128, 512], 2026-02-21T11:38:49.9614258Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:38:49.9618562Z 'load_eviction_policies': ['', '', '', 'last'], 2026-02-21T11:38:49.9620183Z 'num_stages': 5, 2026-02-21T11:38:49.9620379Z 'num_warps': 2, 2026-02-21T11:38:49.9620527Z 'pid_type': 'flat', 2026-02-21T11:38:49.9620702Z 'range_flattens': [None, False, False], 2026-02-21T11:38:49.9621230Z 'range_multi_buffers': [None, None, True], 2026-02-21T11:38:49.9621459Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:38:49.9621631Z 'range_unroll_factors': [0, 3, 0], 2026-02-21T11:38:49.9621823Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:38:49.9626460Z [344s] Fitting surrogate: 645 points, 645 targets 2026-02-21T11:38:50.8370369Z [345s] Generation 7 starting: 51 neighbors, 3 active search path(s) 2026-02-21T11:38:56.5898813Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 12.4 configs/s 2026-02-21T11:39:01.6995491Z Generation 7: exploring 
neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 10.2 configs/s 2026-02-21T11:39:12.0055881Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━━ 102/102 9.4 configs/s 2026-02-21T11:39:12.6808168Z [366s] Generation 7 complete: 2026-02-21T11:39:12.6812560Z ok=55 2026-02-21T11:39:12.6814561Z min=2.0551 2026-02-21T11:39:12.6814730Z mid=2.6092 2026-02-21T11:39:12.6814868Z max=9.2201 2026-02-21T11:39:12.6815025Z best={'block_sizes': [8, 128, 512], 2026-02-21T11:39:12.6815359Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:39:12.6815699Z 'load_eviction_policies': ['', '', '', 'last'], 2026-02-21T11:39:12.6815893Z 'num_stages': 5, 2026-02-21T11:39:12.6816046Z 'num_warps': 2, 2026-02-21T11:39:12.6816189Z 'pid_type': 'flat', 2026-02-21T11:39:12.6816359Z 'range_flattens': [None, False, False], 2026-02-21T11:39:12.6816555Z 'range_multi_buffers': [None, None, True], 2026-02-21T11:39:12.6816756Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:39:12.6816937Z 'range_unroll_factors': [0, 3, 0], 2026-02-21T11:39:12.6817133Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:39:12.6848454Z [366s] Fitting surrogate: 700 points, 700 targets 2026-02-21T11:39:13.4290538Z [367s] Generation 8 starting: 39 neighbors, 2 active search path(s) 2026-02-21T11:39:18.1708719Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40/40 10.4 configs/s 2026-02-21T11:39:21.9549819Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 40/40 10.6 configs/s 2026-02-21T11:39:30.8347781Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━ 102/102 10.8 configs/s 2026-02-21T11:39:31.5105766Z [385s] Generation 8 complete: 2026-02-21T11:39:31.5110165Z ok=42 2026-02-21T11:39:31.5114227Z min=2.0234 2026-02-21T11:39:31.5118078Z mid=2.5304 2026-02-21T11:39:31.5119542Z max=6.2253 2026-02-21T11:39:31.5119762Z best={'block_sizes': [8, 128, 512], 2026-02-21T11:39:31.5120125Z 'indexing': ['pointer', 'pointer', 'pointer', 
'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:39:31.5120485Z 'load_eviction_policies': ['', '', '', 'last'], 2026-02-21T11:39:31.5120711Z 'num_stages': 5, 2026-02-21T11:39:31.5120853Z 'num_warps': 2, 2026-02-21T11:39:31.5121005Z 'pid_type': 'flat', 2026-02-21T11:39:31.5121188Z 'range_flattens': [None, False, False], 2026-02-21T11:39:31.5121400Z 'range_multi_buffers': [None, None, True], 2026-02-21T11:39:31.5121598Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:39:31.5121794Z 'range_unroll_factors': [0, 3, 0], 2026-02-21T11:39:31.5122549Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:39:31.5141077Z [385s] Fitting surrogate: 742 points, 742 targets 2026-02-21T11:39:32.1712169Z [386s] Generation 9 starting: 33 neighbors, 2 active search path(s) 2026-02-21T11:39:36.2319403Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 7.2 configs/s 2026-02-21T11:39:39.7312270Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 34/34 9.7 configs/s 2026-02-21T11:39:45.4026017Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━ 102/102 16.3 configs/s 2026-02-21T11:39:46.0658407Z [400s] Generation 9 complete: 2026-02-21T11:39:46.0662745Z ok=36 2026-02-21T11:39:46.0666561Z min=2.0490 2026-02-21T11:39:46.0670463Z mid=2.6952 2026-02-21T11:39:46.0675573Z max=17.1500 2026-02-21T11:39:46.0679983Z best={'block_sizes': [8, 128, 512], 2026-02-21T11:39:46.0684549Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:39:46.0685261Z 'load_eviction_policies': ['', '', '', 'last'], 2026-02-21T11:39:46.0685523Z 'num_stages': 5, 2026-02-21T11:39:46.0690836Z 'num_warps': 2, 2026-02-21T11:39:46.0695355Z 'pid_type': 'flat', 2026-02-21T11:39:46.0699995Z 'range_flattens': [None, False, False], 2026-02-21T11:39:46.0702025Z 'range_multi_buffers': [None, None, True], 2026-02-21T11:39:46.0702312Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:39:46.0706946Z 'range_unroll_factors': [0, 3, 0], 
2026-02-21T11:39:46.0711460Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:39:46.0713496Z [400s] Fitting surrogate: 778 points, 778 targets 2026-02-21T11:39:46.7965402Z [400s] Generation 10 starting: 38 neighbors, 2 active search path(s) 2026-02-21T11:39:51.2141355Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 5.1 configs/s 2026-02-21T11:39:54.8927431Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 39/39 10.6 configs/s 2026-02-21T11:40:03.1690322Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━ 102/102 11.6 configs/s 2026-02-21T11:40:03.8339019Z [418s] Generation 10 complete: 2026-02-21T11:40:03.8340877Z ok=41 2026-02-21T11:40:03.8341035Z min=2.0419 2026-02-21T11:40:03.8341170Z mid=2.5524 2026-02-21T11:40:03.8341297Z max=4.7339 2026-02-21T11:40:03.8341430Z best={'block_sizes': [8, 128, 512], 2026-02-21T11:40:03.8341725Z 'indexing': ['pointer', 'pointer', 'pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T11:40:03.8342079Z 'load_eviction_policies': ['', '', '', 'last'], 2026-02-21T11:40:03.8342276Z 'num_stages': 5, 2026-02-21T11:40:03.8342414Z 'num_warps': 2, 2026-02-21T11:40:03.8342561Z 'pid_type': 'flat', 2026-02-21T11:40:03.8342718Z 'range_flattens': [None, False, False], 2026-02-21T11:40:03.8342922Z 'range_multi_buffers': [None, None, True], 2026-02-21T11:40:03.8343118Z 'range_num_stages': [0, 4, 1], 2026-02-21T11:40:03.8343287Z 'range_unroll_factors': [0, 3, 0], 2026-02-21T11:40:03.8343512Z 'range_warp_specializes': [None, None, None]} 2026-02-21T11:40:03.8377206Z [418s] Fitting surrogate: 819 points, 819 targets 2026-02-21T11:40:04.5319050Z [418s] Generation 11 starting: 38 neighbors, 2 active search path(s) 2026-02-21T11:40:08.9988511Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 40/40 18.8 configs/s 2026-02-21T11:40:12.7815260Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 40/40 10.6 configs/s 2026-02-21T11:40:20.7267309Z Generation 11: verifying top configs 
100% ━━━━━━━━━━━━━━━ 102/102 12.0 configs/s 2026-02-21T11:40:21.4152972Z [435s] Generation 11 complete: 2026-02-21T11:40:21.4157451Z ok=41 2026-02-21T11:40:21.4161733Z min=2.0192 2026-02-21T11:40:21.4163902Z mid=2.7142 2026-02-21T11:40:21.4164060Z max=6.2956 2026-02-21T11:40:21.4164214Z best={'block_sizes': [32, 64, 128], 2026-02-21T11:40:21.4164502Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T11:40:21.4164823Z 'load_eviction_policies': ['last', '', 'first', 'first'], 2026-02-21T11:40:21.4165036Z 'num_stages': 8, 2026-02-21T11:40:21.4165213Z 'num_warps': 8, 2026-02-21T11:40:21.4165634Z 'pid_type': 'flat', 2026-02-21T11:40:21.4165805Z 'range_flattens': [None, None, True], 2026-02-21T11:40:21.4166013Z 'range_multi_buffers': [None, None, False], 2026-02-21T11:40:21.4166203Z 'range_num_stages': [0, 0, 0], 2026-02-21T11:40:21.4166378Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T11:40:21.4166574Z 'range_warp_specializes': [None, None, False]} 2026-02-21T11:40:21.4194168Z [435s] Fitting surrogate: 860 points, 860 targets 2026-02-21T11:40:22.1607314Z [436s] Generation 12 starting: 38 neighbors, 2 active search path(s) 2026-02-21T11:40:26.8701010Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 10.1 configs/s 2026-02-21T11:40:30.6196527Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 39/39 10.4 configs/s 2026-02-21T11:40:38.4200337Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━ 102/102 12.2 configs/s 2026-02-21T11:40:39.0876301Z [453s] Generation 12 complete: 2026-02-21T11:40:39.0877613Z ok=40 2026-02-21T11:40:39.0878135Z min=2.0295 2026-02-21T11:40:39.0878315Z mid=2.5344 2026-02-21T11:40:39.0878444Z max=6.5505 2026-02-21T11:40:39.0878602Z best={'block_sizes': [32, 64, 128], 2026-02-21T11:40:39.0878897Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T11:40:39.0879206Z 'load_eviction_policies': ['last', '', 'first', 'first'], 
2026-02-21T11:40:39.0879425Z 'num_stages': 8, 2026-02-21T11:40:39.0879572Z 'num_warps': 8, 2026-02-21T11:40:39.0879723Z 'pid_type': 'flat', 2026-02-21T11:40:39.0879887Z 'range_flattens': [None, None, True], 2026-02-21T11:40:39.0880095Z 'range_multi_buffers': [None, None, False], 2026-02-21T11:40:39.0880290Z 'range_num_stages': [0, 0, 0], 2026-02-21T11:40:39.0880470Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T11:40:39.0880677Z 'range_warp_specializes': [None, None, False]} 2026-02-21T11:40:39.0915275Z [453s] Fitting surrogate: 900 points, 900 targets 2026-02-21T11:40:39.5728253Z [453s] Generation 13 starting: 17 neighbors, 1 active search path(s) 2026-02-21T11:40:41.6255745Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 19.6 configs/s 2026-02-21T11:40:43.3232171Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 10.0 configs/s 2026-02-21T11:40:46.3708226Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━ 102/102 27.8 configs/s 2026-02-21T11:40:47.0410262Z [461s] Generation 13 complete: 2026-02-21T11:40:47.0414653Z ok=19 2026-02-21T11:40:47.0417804Z min=2.0849 2026-02-21T11:40:47.0421629Z mid=2.5549 2026-02-21T11:40:47.0424886Z max=12.3618 2026-02-21T11:40:47.0429228Z best={'block_sizes': [32, 64, 128], 2026-02-21T11:40:47.0430610Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T11:40:47.0430942Z 'load_eviction_policies': ['last', '', 'first', 'first'], 2026-02-21T11:40:47.0431148Z 'num_stages': 8, 2026-02-21T11:40:47.0431297Z 'num_warps': 8, 2026-02-21T11:40:47.0431433Z 'pid_type': 'flat', 2026-02-21T11:40:47.0431625Z 'range_flattens': [None, None, True], 2026-02-21T11:40:47.0431820Z 'range_multi_buffers': [None, None, False], 2026-02-21T11:40:47.0432357Z 'range_num_stages': [0, 0, 0], 2026-02-21T11:40:47.0432533Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T11:40:47.0432726Z 'range_warp_specializes': [None, None, False]} 2026-02-21T11:40:47.0444614Z [461s] Fitting surrogate: 
919 points, 919 targets 2026-02-21T11:40:47.5105631Z [461s] Generation 14 starting: 17 neighbors, 1 active search path(s) 2026-02-21T11:40:49.6259821Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 10.6 configs/s 2026-02-21T11:40:51.3787153Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 10.3 configs/s 2026-02-21T11:40:54.1981366Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━━ 102/102 29.8 configs/s 2026-02-21T11:40:54.8687806Z [469s] Generation 14 complete: 2026-02-21T11:40:54.8692305Z ok=19 2026-02-21T11:40:54.8696476Z min=2.0328 2026-02-21T11:40:54.8700284Z mid=2.6439 2026-02-21T11:40:54.8704876Z max=9.8273 2026-02-21T11:40:54.8709545Z best={'block_sizes': [32, 64, 128], 2026-02-21T11:40:54.8710701Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T11:40:54.8711023Z 'load_eviction_policies': ['last', '', 'first', 'first'], 2026-02-21T11:40:54.8711238Z 'num_stages': 8, 2026-02-21T11:40:54.8711379Z 'num_warps': 8, 2026-02-21T11:40:54.8711527Z 'pid_type': 'flat', 2026-02-21T11:40:54.8711688Z 'range_flattens': [None, None, True], 2026-02-21T11:40:54.8712057Z 'range_multi_buffers': [None, None, False], 2026-02-21T11:40:54.8712257Z 'range_num_stages': [0, 0, 0], 2026-02-21T11:40:54.8712425Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T11:40:54.8712629Z 'range_warp_specializes': [None, None, False]} 2026-02-21T11:40:54.8721951Z [469s] Fitting surrogate: 938 points, 938 targets 2026-02-21T11:40:55.3458748Z [469s] Generation 15 starting: 17 neighbors, 1 active search path(s) 2026-02-21T11:40:57.4687309Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 17.7 configs/s 2026-02-21T11:40:59.2578077Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 10.1 configs/s 2026-02-21T11:41:02.0239097Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━ 102/102 30.4 configs/s 2026-02-21T11:41:02.6806906Z [476s] Generation 15 complete: 
2026-02-21T11:41:02.6811314Z ok=19 2026-02-21T11:41:02.6813006Z min=2.0490 2026-02-21T11:41:02.6813234Z mid=2.5968 2026-02-21T11:41:02.6817986Z max=8.5279 2026-02-21T11:41:02.6823155Z best={'block_sizes': [32, 64, 128], 2026-02-21T11:41:02.6827622Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T11:41:02.6831794Z 'load_eviction_policies': ['last', '', 'first', 'first'], 2026-02-21T11:41:02.6833290Z 'num_stages': 8, 2026-02-21T11:41:02.6833517Z 'num_warps': 8, 2026-02-21T11:41:02.6833720Z 'pid_type': 'flat', 2026-02-21T11:41:02.6833913Z 'range_flattens': [None, None, True], 2026-02-21T11:41:02.6834136Z 'range_multi_buffers': [None, None, False], 2026-02-21T11:41:02.6834370Z 'range_num_stages': [0, 0, 0], 2026-02-21T11:41:02.6838697Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T11:41:02.6840907Z 'range_warp_specializes': [None, None, False]} 2026-02-21T11:41:02.6845805Z [476s] Fitting surrogate: 957 points, 957 targets 2026-02-21T11:41:03.1167919Z [477s] Generation 16 starting: 15 neighbors, 1 active search path(s) 2026-02-21T11:41:04.9315157Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 20.0 configs/s 2026-02-21T11:41:06.4726309Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 10.4 configs/s 2026-02-21T11:41:09.2821720Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━ 102/102 29.9 configs/s 2026-02-21T11:41:09.9545417Z [484s] Generation 16 complete: 2026-02-21T11:41:09.9549792Z ok=17 2026-02-21T11:41:09.9551441Z min=2.0336 2026-02-21T11:41:09.9551643Z mid=2.5257 2026-02-21T11:41:09.9556249Z max=9.4132 2026-02-21T11:41:09.9560601Z best={'block_sizes': [32, 64, 128], 2026-02-21T11:41:09.9565189Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T11:41:09.9567165Z 'load_eviction_policies': ['last', '', 'first', 'first'], 2026-02-21T11:41:09.9567454Z 'num_stages': 8, 2026-02-21T11:41:09.9571494Z 'num_warps': 8, 2026-02-21T11:41:09.9575692Z 
'pid_type': 'flat', 2026-02-21T11:41:09.9579469Z 'range_flattens': [None, None, True], 2026-02-21T11:41:09.9583473Z 'range_multi_buffers': [None, None, False], 2026-02-21T11:41:09.9585393Z 'range_num_stages': [0, 0, 0], 2026-02-21T11:41:09.9585614Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T11:41:09.9585822Z 'range_warp_specializes': [None, None, False]} 2026-02-21T11:41:09.9586119Z [484s] Fitting surrogate: 974 points, 974 targets 2026-02-21T11:41:10.4232175Z [484s] Generation 17 starting: 16 neighbors, 1 active search path(s) 2026-02-21T11:41:14.0510362Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 18.1 configs/s 2026-02-21T11:41:15.7150128Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 10.2 configs/s 2026-02-21T11:41:18.5259465Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━━ 102/102 29.9 configs/s 2026-02-21T11:41:19.1932640Z [493s] Generation 17 complete: 2026-02-21T11:41:19.1934792Z ok=18 2026-02-21T11:41:19.1935028Z min=2.0459 2026-02-21T11:41:19.1939057Z mid=2.5620 2026-02-21T11:41:19.1942430Z max=7.7292 2026-02-21T11:41:19.1946754Z best={'block_sizes': [32, 64, 128], 2026-02-21T11:41:19.1948472Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T11:41:19.1948854Z 'load_eviction_policies': ['last', '', 'first', 'first'], 2026-02-21T11:41:19.1953408Z 'num_stages': 8, 2026-02-21T11:41:19.1958037Z 'num_warps': 8, 2026-02-21T11:41:19.1961777Z 'pid_type': 'flat', 2026-02-21T11:41:19.1965573Z 'range_flattens': [None, None, True], 2026-02-21T11:41:19.1970272Z 'range_multi_buffers': [None, None, False], 2026-02-21T11:41:19.1975319Z 'range_num_stages': [0, 0, 0], 2026-02-21T11:41:19.1979879Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T11:41:19.1984516Z 'range_warp_specializes': [None, None, False]} 2026-02-21T11:41:19.1984866Z [493s] Fitting surrogate: 992 points, 992 targets 2026-02-21T11:41:19.6801556Z [493s] Generation 18 starting: 16 neighbors, 1 active search path(s) 
2026-02-21T11:41:21.7708743Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 27.3 configs/s 2026-02-21T11:41:23.3877372Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 10.6 configs/s 2026-02-21T11:41:26.4902633Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━ 102/102 27.3 configs/s 2026-02-21T11:41:27.1676773Z [501s] Generation 18 complete: 2026-02-21T11:41:27.1680573Z ok=18 2026-02-21T11:41:27.1684429Z min=2.0290 2026-02-21T11:41:27.1688251Z mid=2.8560 2026-02-21T11:41:27.1690140Z max=4.6414 2026-02-21T11:41:27.1690356Z best={'block_sizes': [32, 64, 128], 2026-02-21T11:41:27.1690691Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T11:41:27.1691056Z 'load_eviction_policies': ['last', '', 'first', 'first'], 2026-02-21T11:41:27.1691324Z 'num_stages': 8, 2026-02-21T11:41:27.1691480Z 'num_warps': 8, 2026-02-21T11:41:27.1691657Z 'pid_type': 'flat', 2026-02-21T11:41:27.1691816Z 'range_flattens': [None, None, True], 2026-02-21T11:41:27.1692094Z 'range_multi_buffers': [None, None, False], 2026-02-21T11:41:27.1692282Z 'range_num_stages': [0, 0, 0], 2026-02-21T11:41:27.1692459Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T11:41:27.1692653Z 'range_warp_specializes': [None, None, False]} 2026-02-21T11:41:27.1718650Z [501s] Fitting surrogate: 1010 points, 1010 targets 2026-02-21T11:41:27.6532615Z [501s] Generation 19 starting: 17 neighbors, 1 active search path(s) 2026-02-21T11:41:29.7250130Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 20.4 configs/s 2026-02-21T11:41:31.3818453Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 10.3 configs/s 2026-02-21T11:41:35.2237762Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━━━ 102/102 23.0 configs/s 2026-02-21T11:41:35.8890573Z [510s] Generation 19 complete: 2026-02-21T11:41:35.8894979Z ok=19 2026-02-21T11:41:35.8899332Z min=2.0378 2026-02-21T11:41:35.8900914Z mid=2.3102 2026-02-21T11:41:35.8901134Z 
max=8.2708 2026-02-21T11:41:35.8905196Z best={'block_sizes': [32, 64, 128], 2026-02-21T11:41:35.8909237Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T11:41:35.8913646Z 'load_eviction_policies': ['last', '', 'first', 'first'], 2026-02-21T11:41:35.8915116Z 'num_stages': 8, 2026-02-21T11:41:35.8915359Z 'num_warps': 8, 2026-02-21T11:41:35.8919617Z 'pid_type': 'flat', 2026-02-21T11:41:35.8923825Z 'range_flattens': [None, None, True], 2026-02-21T11:41:35.8927490Z 'range_multi_buffers': [None, None, False], 2026-02-21T11:41:35.8931804Z 'range_num_stages': [0, 0, 0], 2026-02-21T11:41:35.8933221Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T11:41:35.8933474Z 'range_warp_specializes': [None, None, False]} 2026-02-21T11:41:35.8937773Z [510s] Fitting surrogate: 1029 points, 1029 targets 2026-02-21T11:41:36.2592037Z [510s] Generation 20 starting: 13 neighbors, 1 active search path(s) 2026-02-21T11:41:38.1097216Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 11.0 configs/s 2026-02-21T11:41:39.3185552Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 13/13 10.8 configs/s 2026-02-21T11:41:42.4050783Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━━━ 102/102 27.5 configs/s 2026-02-21T11:41:43.0850539Z [517s] Generation 20 complete: 2026-02-21T11:41:43.0854325Z ok=15 2026-02-21T11:41:43.0855835Z min=2.0397 2026-02-21T11:41:43.0856000Z mid=2.1667 2026-02-21T11:41:43.0856121Z max=3.6210 2026-02-21T11:41:43.0856266Z best={'block_sizes': [32, 64, 128], 2026-02-21T11:41:43.0856544Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T11:41:43.0856847Z 'load_eviction_policies': ['last', '', 'first', 'first'], 2026-02-21T11:41:43.0857058Z 'num_stages': 8, 2026-02-21T11:41:43.0857197Z 'num_warps': 8, 2026-02-21T11:41:43.0857362Z 'pid_type': 'flat', 2026-02-21T11:41:43.0857534Z 'range_flattens': [None, None, True], 2026-02-21T11:41:43.0857730Z 'range_multi_buffers': 
[None, None, False], 2026-02-21T11:41:43.0857913Z 'range_num_stages': [0, 0, 0], 2026-02-21T11:41:43.0858082Z 'range_unroll_factors': [0, 4, 1], 2026-02-21T11:41:43.0858266Z 'range_warp_specializes': [None, None, False]} 2026-02-21T11:41:43.0892413Z [517s] Fitting surrogate: 1044 points, 1044 targets 2026-02-21T11:41:43.3879493Z [517s] Autotuning complete in 517.6s after searching 999 configs. 2026-02-21T11:41:43.3881353Z One can hardcode the best config and skip autotuning with: 2026-02-21T11:41:43.3882483Z @helion.kernel(config=helion.Config(block_sizes=[32, 64, 128], indexing=['pointer', 'pointer', 'tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', '', 'first', 'first'], num_stages=8, num_warps=8, pid_type='flat', range_flattens=[None, None, True], range_multi_buffers=[None, None, False], range_num_stages=[0, 0, 0], range_unroll_factors=[0, 4, 1], range_warp_specializes=[None, None, False]), static_shapes=True) 2026-02-21T11:41:43.3883442Z 2026-02-21T11:41:43.3883706Z [517s] Code of selected kernel: /tmp/torchinductor_root/gr/cgrpvawa7moqtrrsboqpvhjbscmvrsbj6nrx6g77b27w53ki3qpi.py 2026-02-21T11:41:43.4219483Z from __future__ import annotations 2026-02-21T11:41:43.4219740Z 2026-02-21T11:41:43.4219876Z import torch 2026-02-21T11:41:43.4220012Z import triton 2026-02-21T11:41:43.4220216Z import triton.language as tl 2026-02-21T11:41:43.4220465Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T11:41:43.4220804Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T11:41:43.4220984Z 2026-02-21T11:41:43.4221171Z _BLOCK_SIZE_0 = tl.constexpr(32) 2026-02-21T11:41:43.4221380Z _BLOCK_SIZE_1 = tl.constexpr(64) 2026-02-21T11:41:43.4221550Z _BLOCK_SIZE_2 = tl.constexpr(128) 2026-02-21T11:41:43.4221663Z 2026-02-21T11:41:43.4221728Z @triton.jit 2026-02-21T11:41:43.4221957Z def _helion_welford(x, weight, bias, out, eps): 2026-02-21T11:41:43.4222195Z # src[welford.py:45]: for tile_m in hl.tile(m): 
2026-02-21T11:41:43.4222710Z pid_0 = tl.program_id(0).to(tl.int64) 2026-02-21T11:41:43.4222921Z offset_0 = pid_0 * _BLOCK_SIZE_0 2026-02-21T11:41:43.4223162Z indices_0 = (offset_0 + tl.arange(0, _BLOCK_SIZE_0)).to(tl.int64) 2026-02-21T11:41:43.4223500Z # src[welford.py:46]: acc_cnt = torch.zeros_like(x[tile_m, 0], dtype=torch.float32) 2026-02-21T11:41:43.4223818Z acc_cnt = tl.full([_BLOCK_SIZE_0], 0, tl.float32) 2026-02-21T11:41:43.4224065Z # src[welford.py:47]: acc_mean = torch.zeros_like(acc_cnt) 2026-02-21T11:41:43.4224326Z acc_mean = tl.full([_BLOCK_SIZE_0], 0, tl.float32) 2026-02-21T11:41:43.4224563Z # src[welford.py:48]: acc_m2 = torch.zeros_like(acc_cnt) 2026-02-21T11:41:43.4224804Z acc_m2 = tl.full([_BLOCK_SIZE_0], 0, tl.float32) 2026-02-21T11:41:43.4225019Z # src[welford.py:50]: for tile_n in hl.tile(n): 2026-02-21T11:41:43.4225414Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:41:43.4225639Z # src[welford.py:52]: Tn = chunk.size(-1) 2026-02-21T11:41:43.4225831Z # src[welford.py:50-63]: ... 
2026-02-21T11:41:43.4226067Z for offset_1 in tl.range(0, 8192, _BLOCK_SIZE_1, loop_unroll_factor=4): 2026-02-21T11:41:43.4226351Z indices_1 = offset_1 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int64) 2026-02-21T11:41:43.4226588Z acc_mean_copy = acc_mean 2026-02-21T11:41:43.4226752Z acc_cnt_copy = acc_cnt 2026-02-21T11:41:43.4226916Z acc_m2_copy = acc_m2 2026-02-21T11:41:43.4227077Z acc_mean_copy_0 = acc_mean_copy 2026-02-21T11:41:43.4227260Z acc_cnt_copy_0 = acc_cnt_copy 2026-02-21T11:41:43.4227437Z acc_m2_copy_0 = acc_m2_copy 2026-02-21T11:41:43.4227630Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:41:43.4227967Z chunk = tl.load(x + (indices_0[:, None] * 8192 + indices_1[None, :] * 1), None, eviction_policy='evict_last') 2026-02-21T11:41:43.4228305Z # src[welford.py:53]: sum_x = torch.sum(chunk, dim=-1) 2026-02-21T11:41:43.4228540Z sum_x = tl.cast(tl.sum(chunk, 1), tl.bfloat16) 2026-02-21T11:41:43.4228785Z # src[welford.py:54]: sum_x2 = torch.sum(chunk * chunk, dim=-1) 2026-02-21T11:41:43.4229008Z v_0 = chunk * chunk 2026-02-21T11:41:43.4229184Z sum_x2 = tl.cast(tl.sum(v_0, 1), tl.bfloat16) 2026-02-21T11:41:43.4229387Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:41:43.4229586Z _BLOCK_SIZE_1_ = _BLOCK_SIZE_1 2026-02-21T11:41:43.4229766Z # src[welford.py:55]: mean_c = sum_x / Tn 2026-02-21T11:41:43.4229971Z v_1 = tl.cast(_BLOCK_SIZE_1_, tl.bfloat16) 2026-02-21T11:41:43.4230154Z v_2 = sum_x / v_1 2026-02-21T11:41:43.4230350Z # src[welford.py:56]: m2_c = sum_x2 - (sum_x * sum_x) / Tn 2026-02-21T11:41:43.4230564Z v_3 = sum_x * sum_x 2026-02-21T11:41:43.4230735Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:41:43.4230936Z _BLOCK_SIZE_1__1 = _BLOCK_SIZE_1 2026-02-21T11:41:43.4231149Z # src[welford.py:56]: m2_c = sum_x2 - (sum_x * sum_x) / Tn 2026-02-21T11:41:43.4231381Z v_4 = tl.cast(_BLOCK_SIZE_1__1, tl.bfloat16) 2026-02-21T11:41:43.4231562Z v_5 = v_3 / v_4 2026-02-21T11:41:43.4231714Z v_6 = sum_x2 - v_5 
2026-02-21T11:41:43.4231924Z # src[welford.py:58]: delta = mean_c - acc_mean 2026-02-21T11:41:43.4232126Z v_7 = tl.cast(v_2, tl.float32) 2026-02-21T11:41:43.4232307Z v_8 = v_7 - acc_mean_copy_0 2026-02-21T11:41:43.4232494Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:41:43.4232700Z _BLOCK_SIZE_1__2 = _BLOCK_SIZE_1 2026-02-21T11:41:43.4232894Z # src[welford.py:59]: new_cnt = acc_cnt + Tn 2026-02-21T11:41:43.4233110Z v_9 = tl.cast(_BLOCK_SIZE_1__2, tl.float32) 2026-02-21T11:41:43.4233308Z acc_cnt = acc_cnt_copy_0 + v_9 2026-02-21T11:41:43.4233546Z # src[welford.py:60]: new_mean = acc_mean + delta * (Tn / new_cnt) 2026-02-21T11:41:43.4233785Z v_11 = tl.full([], 1, tl.int32) 2026-02-21T11:41:43.4234041Z v_12 = v_11 / acc_cnt 2026-02-21T11:41:43.4234237Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:41:43.4234442Z _BLOCK_SIZE_1__3 = _BLOCK_SIZE_1 2026-02-21T11:41:43.4234683Z # src[welford.py:60]: new_mean = acc_mean + delta * (Tn / new_cnt) 2026-02-21T11:41:43.4234941Z v_13 = tl.cast(_BLOCK_SIZE_1__3, tl.float32) 2026-02-21T11:41:43.4235143Z v_14 = v_12 * v_13 2026-02-21T11:41:43.4235302Z v_15 = v_8 * v_14 2026-02-21T11:41:43.4235477Z acc_mean = acc_mean_copy_0 + v_15 2026-02-21T11:41:43.4235759Z # src[welford.py:61]: new_m2 = acc_m2 + m2_c + delta * delta * (acc_cnt * Tn / new_cnt) 2026-02-21T11:41:43.4236039Z v_17 = tl.cast(v_6, tl.float32) 2026-02-21T11:41:43.4236231Z v_18 = acc_m2_copy_0 + v_17 2026-02-21T11:41:43.4236401Z v_19 = v_8 * v_8 2026-02-21T11:41:43.4236652Z # src[welford.py:51]: chunk = x[tile_m, tile_n] 2026-02-21T11:41:43.4236855Z _BLOCK_SIZE_1__4 = _BLOCK_SIZE_1 2026-02-21T11:41:43.4237116Z # src[welford.py:61]: new_m2 = acc_m2 + m2_c + delta * delta * (acc_cnt * Tn / new_cnt) 2026-02-21T11:41:43.4237393Z v_20 = tl.cast(_BLOCK_SIZE_1__4, tl.float32) 2026-02-21T11:41:43.4237594Z v_21 = acc_cnt_copy_0 * v_20 2026-02-21T11:41:43.4237772Z v_22 = v_21 / acc_cnt 2026-02-21T11:41:43.4237926Z v_23 = v_19 * v_22 
2026-02-21T11:41:43.4238087Z acc_m2 = v_18 + v_23 2026-02-21T11:41:43.4238301Z # src[welford.py:65]: rstd_tile = torch.rsqrt(acc_m2 / acc_cnt + eps) 2026-02-21T11:41:43.4238541Z v_25 = acc_m2 / acc_cnt 2026-02-21T11:41:43.4238694Z v_26 = v_25 + eps 2026-02-21T11:41:43.4238857Z v_27 = libdevice.rsqrt(v_26) 2026-02-21T11:41:43.4239052Z # src[welford.py:66]: mean_col = acc_mean[:, None] 2026-02-21T11:41:43.4239264Z mean_col = acc_mean[:, None] 2026-02-21T11:41:43.4239467Z # src[welford.py:67]: rstd_col = rstd_tile[:, None] 2026-02-21T11:41:43.4239666Z rstd_col = v_27[:, None] 2026-02-21T11:41:43.4239856Z # src[welford.py:69]: for tile_n in hl.tile(n): 2026-02-21T11:41:43.4240082Z # src[welford.py:70]: xi_chuck = x[tile_m, tile_n] 2026-02-21T11:41:43.4240337Z # src[welford.py:71]: w_chuck = weight[tile_n][None, :] 2026-02-21T11:41:43.4240556Z # src[welford.py:69-77]: ... 2026-02-21T11:41:43.4240945Z for offset_2 in tl.range(0, 8192, _BLOCK_SIZE_2, loop_unroll_factor=1, warp_specialize=False, disallow_acc_multi_buffer=True, flatten=True): 2026-02-21T11:41:43.4241395Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_2).to(tl.int64) 2026-02-21T11:41:43.4241625Z mean_col_copy = mean_col 2026-02-21T11:41:43.4241800Z rstd_col_copy = rstd_col 2026-02-21T11:41:43.4242009Z mean_col_copy_0 = mean_col_copy 2026-02-21T11:41:43.4242200Z rstd_col_copy_0 = rstd_col_copy 2026-02-21T11:41:43.4242404Z # src[welford.py:70]: xi_chuck = x[tile_m, tile_n] 2026-02-21T11:41:43.4242716Z xi_chuck = tl.load(x + (indices_0[:, None] * 8192 + indices_2[None, :] * 1), None) 2026-02-21T11:41:43.4243030Z # src[welford.py:71]: w_chuck = weight[tile_n][None, :] 2026-02-21T11:41:43.4243321Z load_1 = tl.load(weight + indices_2 * 1, None, eviction_policy='evict_first') 2026-02-21T11:41:43.4243594Z w_chuck = load_1[None, :] 2026-02-21T11:41:43.4243800Z # src[welford.py:72]: b_chuck = bias[tile_n][None, :] 2026-02-21T11:41:43.4244095Z load_2 = tl.load(bias + indices_2 * 1, None, eviction_policy='evict_first') 
2026-02-21T11:41:43.4244375Z b_chuck = load_2[None, :] 2026-02-21T11:41:43.4244577Z # src[welford.py:74]: y = (xi_chuck - mean_col) * rstd_col 2026-02-21T11:41:43.4244802Z v_28 = tl.cast(xi_chuck, tl.float32) 2026-02-21T11:41:43.4244990Z v_29 = v_28 - mean_col_copy_0 2026-02-21T11:41:43.4245174Z v_30 = v_29 * rstd_col_copy_0 2026-02-21T11:41:43.4245367Z # src[welford.py:75]: y = y * w_chuck + b_chuck 2026-02-21T11:41:43.4245628Z v_31 = tl.cast(w_chuck, tl.float32) 2026-02-21T11:41:43.4245798Z v_32 = v_30 * v_31 2026-02-21T11:41:43.4245957Z v_33 = tl.cast(b_chuck, tl.float32) 2026-02-21T11:41:43.4246132Z v_34 = v_32 + v_33 2026-02-21T11:41:43.4246312Z # src[welford.py:77]: out[tile_m, tile_n] = y.to(x.dtype) 2026-02-21T11:41:43.4246528Z v_35 = tl.cast(v_34, tl.bfloat16) 2026-02-21T11:41:43.4246766Z tl.store(out + (indices_0[:, None] * 8192 + indices_2[None, :] * 1), v_35, None) 2026-02-21T11:41:43.4246964Z 2026-02-21T11:41:43.4247190Z def welford(weight: torch.Tensor, bias: torch.Tensor, x: torch.Tensor, eps: float=1e-05, *, _launcher=_default_launcher): 2026-02-21T11:41:43.4247527Z """ 2026-02-21T11:41:43.4247705Z Applies LayerNorm using Welford's algorithm for mean/variance. 
2026-02-21T11:41:43.4247932Z Args: 2026-02-21T11:41:43.4248069Z weight: weight tensor of shape [N] 2026-02-21T11:41:43.4248321Z bias: bias tensor of shape [N] 2026-02-21T11:41:43.4248500Z x: input tensor of shape [M, N] 2026-02-21T11:41:43.4248672Z Returns: 2026-02-21T11:41:43.4248804Z Output tensor of shape [M, N] 2026-02-21T11:41:43.4248969Z """ 2026-02-21T11:41:43.4249096Z # src[welford.py:41]: m, n = x.size() 2026-02-21T11:41:43.4249273Z m, n = x.size() 2026-02-21T11:41:43.4249489Z # src[welford.py:43]: out = torch.empty([m, n], dtype=x.dtype, device=x.device) 2026-02-21T11:41:43.4249775Z out = torch.empty([m, n], dtype=x.dtype, device=x.device) 2026-02-21T11:41:43.4250006Z # src[welford.py:45]: for tile_m in hl.tile(m): 2026-02-21T11:41:43.4250189Z _BLOCK_SIZE_0 = 32 2026-02-21T11:41:43.4250360Z # src[welford.py:45]: for tile_m in hl.tile(m): 2026-02-21T11:41:43.4250629Z # src[welford.py:46]: acc_cnt = torch.zeros_like(x[tile_m, 0], dtype=torch.float32) 2026-02-21T11:41:43.4250936Z # src[welford.py:47]: acc_mean = torch.zeros_like(acc_cnt) 2026-02-21T11:41:43.4251163Z # src[welford.py:45-77]: ... 
2026-02-21T11:41:43.4251485Z _launcher(_helion_welford, (triton.cdiv(262144, _BLOCK_SIZE_0),), x, weight, bias, out, eps, num_warps=8, num_stages=8) 2026-02-21T11:41:43.4251828Z # src[welford.py:78]: return out 2026-02-21T11:41:43.4252029Z return out 2026-02-21T11:41:44.6981636Z WARNING:tritonbench.utils.triton_op:Completed input ID 9: 2026-02-21T11:41:44.6983657Z x_val 2026-02-21T11:41:44.6983811Z ------- 2026-02-21T11:41:44.6983930Z 8192 2026-02-21T11:41:44.6983999Z 2026-02-21T11:41:44.6984232Z 100%|██████████| 6/6 [47:40<00:00, 532.80s/it] 2026-02-21T11:41:44.6984476Z 100%|██████████| 6/6 [47:40<00:00, 476.80s/it] 2026-02-21T11:41:44.6996722Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmpwdbicg0y.csv 2026-02-21T11:41:47.7478193Z x_val triton_welford-speedup triton_welford-accuracy torch_compile_welford-speedup torch_compile_welford-accuracy helion_welford-speedup helion_welford-accuracy 2026-02-21T11:41:47.7478969Z ------- ------------------------ ------------------------- ------------------------------- -------------------------------- ------------------------ ------------------------- 2026-02-21T11:41:47.7479485Z 1024 0.728001 1 0.566498 1 3.39666 1 2026-02-21T11:41:47.7479918Z 2048 0.750958 1 0.412089 1 2.4319 1 2026-02-21T11:41:47.7480323Z 3072 0.808927 1 0.378932 1 2.15576 1 2026-02-21T11:41:47.7480737Z 4096 0.825604 1 0.357404 1 1.93076 1 2026-02-21T11:41:47.7481140Z 6144 0.864024 1 0.331297 1 1.72296 1 2026-02-21T11:41:47.7482188Z 8192 0.876144 1 0.320894 1 1.31407 1 2026-02-21T11:41:47.7482634Z average 0.808943 1 0.394519 1 2.15868 1 2026-02-21T11:41:52.9360456Z ✅ Completed benchmark for kernel: welford 2026-02-21T11:41:52.9373039Z [ 2026-02-21T11:41:52.9377016Z { 2026-02-21T11:41:52.9380413Z "benchmark": { 2026-02-21T11:41:52.9384533Z "name": "Helion Benchmark", 2026-02-21T11:41:52.9388510Z "extra_info": { 2026-02-21T11:41:52.9392612Z "device": "NVIDIA B200" 2026-02-21T11:41:52.9396037Z } 2026-02-21T11:41:52.9396194Z }, 
2026-02-21T11:41:52.9396344Z "model": { 2026-02-21T11:41:52.9396491Z "name": "welford" 2026-02-21T11:41:52.9396646Z }, 2026-02-21T11:41:52.9396766Z "metric": { 2026-02-21T11:41:52.9396919Z "name": "triton_speedup", 2026-02-21T11:41:52.9397088Z "benchmark_values": [ 2026-02-21T11:41:52.9397252Z 0.7280009070676684, 2026-02-21T11:41:52.9397401Z 0.7509582228521058, 2026-02-21T11:41:52.9397537Z 0.8089269964565614, 2026-02-21T11:41:52.9397682Z 0.8256041742746476, 2026-02-21T11:41:52.9397819Z 0.8640243836633036, 2026-02-21T11:41:52.9397965Z 0.876144041903765 2026-02-21T11:41:52.9398105Z ] 2026-02-21T11:41:52.9398229Z }, 2026-02-21T11:41:52.9398349Z "shape": [ 2026-02-21T11:41:52.9398483Z "1024", 2026-02-21T11:41:52.9398599Z "2048", 2026-02-21T11:41:52.9398722Z "3072", 2026-02-21T11:41:52.9398842Z "4096", 2026-02-21T11:41:52.9398954Z "6144", 2026-02-21T11:41:52.9399078Z "8192" 2026-02-21T11:41:52.9399199Z ] 2026-02-21T11:41:52.9399319Z }, 2026-02-21T11:41:52.9399430Z { 2026-02-21T11:41:52.9399552Z "benchmark": { 2026-02-21T11:41:52.9399693Z "name": "Helion Benchmark", 2026-02-21T11:41:52.9399864Z "extra_info": { 2026-02-21T11:41:52.9400008Z "device": "NVIDIA B200" 2026-02-21T11:41:52.9400165Z } 2026-02-21T11:41:52.9400276Z }, 2026-02-21T11:41:52.9400440Z "model": { 2026-02-21T11:41:52.9400574Z "name": "welford" 2026-02-21T11:41:52.9400717Z }, 2026-02-21T11:41:52.9400829Z "metric": { 2026-02-21T11:41:52.9400973Z "name": "triton_accuracy", 2026-02-21T11:41:52.9401133Z "benchmark_values": [ 2026-02-21T11:41:52.9401284Z 1.0, 2026-02-21T11:41:52.9401400Z 1.0, 2026-02-21T11:41:52.9401519Z 1.0, 2026-02-21T11:41:52.9401631Z 1.0, 2026-02-21T11:41:52.9401755Z 1.0, 2026-02-21T11:41:52.9402083Z 1.0 2026-02-21T11:41:52.9402215Z ] 2026-02-21T11:41:52.9402335Z }, 2026-02-21T11:41:52.9402464Z "shape": [ 2026-02-21T11:41:52.9402595Z "1024", 2026-02-21T11:41:52.9402715Z "2048", 2026-02-21T11:41:52.9402840Z "3072", 2026-02-21T11:41:52.9402954Z "4096", 2026-02-21T11:41:52.9403079Z "6144", 
2026-02-21T11:41:52.9403197Z "8192" 2026-02-21T11:41:52.9403318Z ] 2026-02-21T11:41:52.9403431Z }, 2026-02-21T11:41:52.9403547Z { 2026-02-21T11:41:52.9403662Z "benchmark": { 2026-02-21T11:41:52.9403809Z "name": "Helion Benchmark", 2026-02-21T11:41:52.9403970Z "extra_info": { 2026-02-21T11:41:52.9404121Z "device": "NVIDIA B200" 2026-02-21T11:41:52.9404272Z } 2026-02-21T11:41:52.9404382Z }, 2026-02-21T11:41:52.9404500Z "model": { 2026-02-21T11:41:52.9404624Z "name": "welford" 2026-02-21T11:41:52.9404770Z }, 2026-02-21T11:41:52.9404886Z "metric": { 2026-02-21T11:41:52.9405040Z "name": "torch_compile_speedup", 2026-02-21T11:41:52.9405220Z "benchmark_values": [ 2026-02-21T11:41:52.9405373Z 0.566498472330231, 2026-02-21T11:41:52.9405700Z 0.4120890595248057, 2026-02-21T11:41:52.9405860Z 0.3789321492326726, 2026-02-21T11:41:52.9406009Z 0.35740374141045356, 2026-02-21T11:41:52.9406169Z 0.33129689632654763, 2026-02-21T11:41:52.9406327Z 0.3208938493278108 2026-02-21T11:41:52.9406463Z ] 2026-02-21T11:41:52.9406583Z }, 2026-02-21T11:41:52.9406695Z "shape": [ 2026-02-21T11:41:52.9406823Z "1024", 2026-02-21T11:41:52.9406940Z "2048", 2026-02-21T11:41:52.9407063Z "3072", 2026-02-21T11:41:52.9407178Z "4096", 2026-02-21T11:41:52.9407297Z "6144", 2026-02-21T11:41:52.9407411Z "8192" 2026-02-21T11:41:52.9407532Z ] 2026-02-21T11:41:52.9407642Z }, 2026-02-21T11:41:52.9407759Z { 2026-02-21T11:41:52.9407881Z "benchmark": { 2026-02-21T11:41:52.9408019Z "name": "Helion Benchmark", 2026-02-21T11:41:52.9408183Z "extra_info": { 2026-02-21T11:41:52.9408413Z "device": "NVIDIA B200" 2026-02-21T11:41:52.9408571Z } 2026-02-21T11:41:52.9408680Z }, 2026-02-21T11:41:52.9408801Z "model": { 2026-02-21T11:41:52.9408925Z "name": "welford" 2026-02-21T11:41:52.9409065Z }, 2026-02-21T11:41:52.9409177Z "metric": { 2026-02-21T11:41:52.9409323Z "name": "torch_compile_accuracy", 2026-02-21T11:41:52.9409500Z "benchmark_values": [ 2026-02-21T11:41:52.9409648Z 1.0, 2026-02-21T11:41:52.9409771Z 1.0, 
2026-02-21T11:41:52.9409885Z 1.0, 2026-02-21T11:41:52.9410007Z 1.0, 2026-02-21T11:41:52.9410121Z 1.0, 2026-02-21T11:41:52.9410242Z 1.0 2026-02-21T11:41:52.9410357Z ] 2026-02-21T11:41:52.9410472Z }, 2026-02-21T11:41:52.9410582Z "shape": [ 2026-02-21T11:41:52.9410709Z "1024", 2026-02-21T11:41:52.9410823Z "2048", 2026-02-21T11:41:52.9410945Z "3072", 2026-02-21T11:41:52.9411058Z "4096", 2026-02-21T11:41:52.9411179Z "6144", 2026-02-21T11:41:52.9411452Z "8192" 2026-02-21T11:41:52.9411590Z ] 2026-02-21T11:41:52.9411719Z }, 2026-02-21T11:41:52.9411832Z { 2026-02-21T11:41:52.9411998Z "benchmark": { 2026-02-21T11:41:52.9412138Z "name": "Helion Benchmark", 2026-02-21T11:41:52.9412307Z "extra_info": { 2026-02-21T11:41:52.9412447Z "device": "NVIDIA B200" 2026-02-21T11:41:52.9412601Z } 2026-02-21T11:41:52.9412755Z }, 2026-02-21T11:41:52.9412879Z "model": { 2026-02-21T11:41:52.9413027Z "name": "welford" 2026-02-21T11:41:52.9413168Z }, 2026-02-21T11:41:52.9413305Z "metric": { 2026-02-21T11:41:52.9413446Z "name": "helion_speedup", 2026-02-21T11:41:52.9413620Z "benchmark_values": [ 2026-02-21T11:41:52.9413775Z 3.3966559032920456, 2026-02-21T11:41:52.9413968Z 2.431898631410386, 2026-02-21T11:41:52.9414153Z 2.1557582017275916, 2026-02-21T11:41:52.9414301Z 1.9307607531858557, 2026-02-21T11:41:52.9414452Z 1.7229645946860972, 2026-02-21T11:41:52.9414607Z 1.3140709758687872 2026-02-21T11:41:52.9414750Z ] 2026-02-21T11:41:52.9414879Z }, 2026-02-21T11:41:52.9415000Z "shape": [ 2026-02-21T11:41:52.9415136Z "1024", 2026-02-21T11:41:52.9415257Z "2048", 2026-02-21T11:41:52.9415390Z "3072", 2026-02-21T11:41:52.9415508Z "4096", 2026-02-21T11:41:52.9415635Z "6144", 2026-02-21T11:41:52.9415753Z "8192" 2026-02-21T11:41:52.9415882Z ] 2026-02-21T11:41:52.9415999Z }, 2026-02-21T11:41:52.9416120Z { 2026-02-21T11:41:52.9416238Z "benchmark": { 2026-02-21T11:41:52.9416391Z "name": "Helion Benchmark", 2026-02-21T11:41:52.9416566Z "extra_info": { 2026-02-21T11:41:52.9416711Z "device": "NVIDIA B200" 
2026-02-21T11:41:52.9416869Z } 2026-02-21T11:41:52.9416983Z }, 2026-02-21T11:41:52.9417108Z "model": { 2026-02-21T11:41:52.9417234Z "name": "welford" 2026-02-21T11:41:52.9417379Z }, 2026-02-21T11:41:52.9417495Z "metric": { 2026-02-21T11:41:52.9417643Z "name": "helion_accuracy", 2026-02-21T11:41:52.9417809Z "benchmark_values": [ 2026-02-21T11:41:52.9418055Z 1.0, 2026-02-21T11:41:52.9418175Z 1.0, 2026-02-21T11:41:52.9418303Z 1.0, 2026-02-21T11:41:52.9418431Z 1.0, 2026-02-21T11:41:52.9418551Z 1.0, 2026-02-21T11:41:52.9418679Z 1.0 2026-02-21T11:41:52.9418799Z ] 2026-02-21T11:41:52.9418924Z }, 2026-02-21T11:41:52.9419045Z "shape": [ 2026-02-21T11:41:52.9419181Z "1024", 2026-02-21T11:41:52.9419302Z "2048", 2026-02-21T11:41:52.9419431Z "3072", 2026-02-21T11:41:52.9419548Z "4096", 2026-02-21T11:41:52.9419674Z "6144", 2026-02-21T11:41:52.9419791Z "8192" 2026-02-21T11:41:52.9419917Z ] 2026-02-21T11:41:52.9420035Z } 2026-02-21T11:41:52.9420173Z ] 2026-02-21T11:41:52.9483555Z ##[group]Run pytorch/test-infra/.github/actions/gather-benchmark-metadata@main 2026-02-21T11:41:52.9483835Z with: 2026-02-21T11:41:52.9484331Z github-token: *** 2026-02-21T11:41:52.9484490Z venv: .venv/bin/activate 2026-02-21T11:41:52.9484648Z schema-version: v3 2026-02-21T11:41:52.9484815Z env: 2026-02-21T11:41:52.9484949Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T11:41:52.9485157Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:52.9485399Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T11:41:52.9485645Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:52.9485859Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:52.9486079Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:52.9486445Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T11:41:52.9486851Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T11:41:52.9487062Z 
##[endgroup] 2026-02-21T11:41:52.9544143Z ##[group]Run set -eux 2026-02-21T11:41:52.9544320Z set -eux 2026-02-21T11:41:52.9544461Z  2026-02-21T11:41:52.9544612Z if [[ -z "${GITHUB_TOKEN}" ]]; then 2026-02-21T11:41:52.9544836Z  echo "Missing github-token input" 2026-02-21T11:41:52.9545024Z  exit 1 2026-02-21T11:41:52.9545151Z fi 2026-02-21T11:41:52.9546096Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T11:41:52.9546293Z env: 2026-02-21T11:41:52.9546442Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T11:41:52.9546646Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:52.9546901Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T11:41:52.9547154Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:52.9547364Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:52.9547584Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:52.9547939Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T11:41:52.9548319Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T11:41:52.9548669Z GITHUB_TOKEN: *** 2026-02-21T11:41:52.9548812Z ##[endgroup] 2026-02-21T11:41:53.0034600Z + [[ -z *** ]] 2026-02-21T11:41:53.0094135Z ##[group]Run pytorch/test-infra/.github/actions/get-workflow-job-id@main 2026-02-21T11:41:53.0094397Z with: 2026-02-21T11:41:53.0094663Z github-token: *** 2026-02-21T11:41:53.0094813Z env: 2026-02-21T11:41:53.0094953Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T11:41:53.0095168Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:53.0095422Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T11:41:53.0095668Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:53.0095895Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:53.0096131Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:53.0096495Z LD_LIBRARY_PATH: 
/__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T11:41:53.0096885Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T11:41:53.0097234Z ##[endgroup] 2026-02-21T11:41:53.0106274Z ##[group]Run set -eux 2026-02-21T11:41:53.0106439Z set -eux 2026-02-21T11:41:53.0106580Z  2026-02-21T11:41:53.0106867Z python3 "${GITHUB_ACTION_PATH}/../../scripts/get_workflow_job_id.py" "${GITHUB_RUN_ID}" "${RUNNER_NAME}" 2026-02-21T11:41:53.0107277Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T11:41:53.0107472Z env: 2026-02-21T11:41:53.0107605Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T11:41:53.0107804Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:53.0108043Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T11:41:53.0108417Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:53.0108638Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:53.0108845Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:53.0109204Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T11:41:53.0109611Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T11:41:53.0109939Z GITHUB_TOKEN: *** 2026-02-21T11:41:53.0110088Z ##[endgroup] 2026-02-21T11:41:53.0672610Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/get-workflow-job-id/../../scripts/get_workflow_job_id.py 22253280836 dgxb200-04-1004 2026-02-21T11:41:54.9791184Z setting job-id=64380329783 2026-02-21T11:41:54.9795725Z setting job-name=run-b200 (welford) / benchmark-cu130-welford-py3.12-b200 2026-02-21T11:41:54.9928844Z ##[group]Run set -eux 2026-02-21T11:41:54.9929023Z set -eux 2026-02-21T11:41:54.9929151Z  2026-02-21T11:41:54.9929320Z if [[ -n ".venv/bin/activate" ]]; then 2026-02-21T11:41:54.9929529Z  source ".venv/bin/activate" 2026-02-21T11:41:54.9929697Z fi 
2026-02-21T11:41:54.9929825Z  2026-02-21T11:41:54.9930043Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_metadata.py" \ 2026-02-21T11:41:54.9930355Z  --schema-version "${SCHEMA_VERSION}" \ 2026-02-21T11:41:54.9930552Z  --repo "${REPO}" \ 2026-02-21T11:41:54.9930732Z  --head-branch "${HEAD_BRANCH}" \ 2026-02-21T11:41:54.9930916Z  --head-sha "${HEAD_SHA}" \ 2026-02-21T11:41:54.9931112Z  --workflow-id "${WORKFLOW_RUN_ID}" \ 2026-02-21T11:41:54.9931318Z  --run-attempt "${RUN_ATTEMPT}" \ 2026-02-21T11:41:54.9931497Z  --job-id "${JOB_ID}" \ 2026-02-21T11:41:54.9931674Z  --job-name "${JOB_NAME}" 2026-02-21T11:41:54.9931969Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T11:41:54.9932163Z env: 2026-02-21T11:41:54.9932311Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T11:41:54.9932518Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:54.9932752Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T11:41:54.9932991Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:54.9933198Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:54.9933407Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:54.9933755Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T11:41:54.9934126Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T11:41:54.9934343Z SCHEMA_VERSION: v3 2026-02-21T11:41:54.9934484Z REPO: pytorch/helion 2026-02-21T11:41:54.9934641Z HEAD_BRANCH: refs/heads/main 2026-02-21T11:41:54.9934823Z HEAD_SHA: 874a7d0cadab18218a84ad3579d329dc95c51820 2026-02-21T11:41:54.9935026Z WORKFLOW_RUN_ID: 22253280836 2026-02-21T11:41:54.9935183Z RUN_ATTEMPT: 1 2026-02-21T11:41:54.9935316Z JOB_ID: 64380329783 2026-02-21T11:41:54.9935526Z JOB_NAME: run-b200 (welford) / benchmark-cu130-welford-py3.12-b200 2026-02-21T11:41:54.9935757Z ##[endgroup] 2026-02-21T11:41:55.0453894Z + [[ -n .venv/bin/activate 
]] 2026-02-21T11:41:55.0454132Z + source .venv/bin/activate 2026-02-21T11:41:55.0454552Z ++ '[' -z '' ']' 2026-02-21T11:41:55.0454694Z ++ '[' -n x ']' 2026-02-21T11:41:55.0454843Z ++ SCRIPT_PATH=.venv/bin/activate 2026-02-21T11:41:55.0455106Z ++ '[' .venv/bin/activate = /__w/_temp/017738e5-4709-46b9-b48f-3b17feb2d245.sh ']' 2026-02-21T11:41:55.0455376Z ++ deactivate nondestructive 2026-02-21T11:41:55.0455536Z ++ unset -f pydoc 2026-02-21T11:41:55.0455677Z ++ '[' -z '' ']' 2026-02-21T11:41:55.0455802Z ++ '[' -z '' ']' 2026-02-21T11:41:55.0455938Z ++ hash -r 2026-02-21T11:41:55.0456057Z ++ '[' -z '' ']' 2026-02-21T11:41:55.0456195Z ++ unset VIRTUAL_ENV 2026-02-21T11:41:55.0456342Z ++ unset VIRTUAL_ENV_PROMPT 2026-02-21T11:41:55.0456637Z ++ '[' '!' nondestructive = nondestructive ']' 2026-02-21T11:41:55.0456842Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv 2026-02-21T11:41:55.0457046Z ++ '[' linux-gnu = cygwin ']' 2026-02-21T11:41:55.0457215Z ++ '[' linux-gnu = msys ']' 2026-02-21T11:41:55.0457368Z ++ export VIRTUAL_ENV 2026-02-21T11:41:55.0457510Z ++ '[' -z '' ']' 2026-02-21T11:41:55.0457642Z ++ unset SCRIPT_PATH 2026-02-21T11:41:55.0458268Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T11:41:55.0459402Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T11:41:55.0460036Z ++ export PATH 2026-02-21T11:41:55.0460181Z ++ '[' xhelion '!=' x ']' 2026-02-21T11:41:55.0460345Z ++ VIRTUAL_ENV_PROMPT=helion 2026-02-21T11:41:55.0460509Z ++ export VIRTUAL_ENV_PROMPT 2026-02-21T11:41:55.0460664Z ++ 
'[' -z '' ']' 2026-02-21T11:41:55.0460787Z ++ '[' -z '' ']' 2026-02-21T11:41:55.0460923Z ++ _OLD_VIRTUAL_PS1= 2026-02-21T11:41:55.0461061Z ++ PS1='(helion) ' 2026-02-21T11:41:55.0461199Z ++ export PS1 2026-02-21T11:41:55.0461327Z ++ alias pydoc 2026-02-21T11:41:55.0461462Z ++ true 2026-02-21T11:41:55.0461580Z ++ hash -r 2026-02-21T11:41:55.0462585Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/gather-benchmark-metadata/../../scripts/benchmarks/gather_metadata.py --schema-version v3 --repo pytorch/helion --head-branch refs/heads/main --head-sha 874a7d0cadab18218a84ad3579d329dc95c51820 --workflow-id 22253280836 --run-attempt 1 --job-id 64380329783 --job-name 'run-b200 (welford) / benchmark-cu130-welford-py3.12-b200' 2026-02-21T11:41:55.0815770Z ##[group]Run pytorch/test-infra/.github/actions/gather-runners-info@main 2026-02-21T11:41:55.0816040Z with: 2026-02-21T11:41:55.0816195Z venv: .venv/bin/activate 2026-02-21T11:41:55.0816349Z env: 2026-02-21T11:41:55.0816495Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T11:41:55.0816695Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:55.0816953Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T11:41:55.0817189Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:55.0817409Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:55.0817620Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:55.0817979Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T11:41:55.0818366Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T11:41:55.0818577Z ##[endgroup] 2026-02-21T11:41:55.0827618Z ##[group]Run set -eux 2026-02-21T11:41:55.0827801Z set -eux 2026-02-21T11:41:55.0827946Z  2026-02-21T11:41:55.0828088Z if command -v nvidia-smi; then 2026-02-21T11:41:55.0828278Z  DEVICE_NAME=cuda 2026-02-21T11:41:55.0828437Z  nvidia-smi 2026-02-21T11:41:55.0828592Z 
elif command -v rocm-smi; then 2026-02-21T11:41:55.0828774Z  DEVICE_NAME=rocm 2026-02-21T11:41:55.0829015Z  rocm-smi 2026-02-21T11:41:55.0829170Z elif command -v hl-smi; then 2026-02-21T11:41:55.0829343Z  DEVICE_NAME=hpu 2026-02-21T11:41:55.0829492Z  hl-smi 2026-02-21T11:41:55.0829617Z else 2026-02-21T11:41:55.0829747Z  arch=$(uname -m) 2026-02-21T11:41:55.0829890Z  2026-02-21T11:41:55.0830010Z  case "$arch" in 2026-02-21T11:41:55.0830164Z  aarch64|arm64) 2026-02-21T11:41:55.0830316Z  DEVICE_NAME=arm64-cpu 2026-02-21T11:41:55.0830477Z  ;; 2026-02-21T11:41:55.0830597Z  *) 2026-02-21T11:41:55.0830726Z  DEVICE_NAME=cpu 2026-02-21T11:41:55.0830869Z  ;; 2026-02-21T11:41:55.0830997Z  esac 2026-02-21T11:41:55.0831127Z  lscpu 2026-02-21T11:41:55.0831253Z fi 2026-02-21T11:41:55.0831421Z echo "DEVICE_NAME=$DEVICE_NAME" >> $GITHUB_ENV 2026-02-21T11:41:55.0831709Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T11:41:55.0831956Z env: 2026-02-21T11:41:55.0832089Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T11:41:55.0832292Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:55.0832535Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T11:41:55.0832774Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:55.0832987Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:55.0833192Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:55.0833542Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T11:41:55.0833914Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T11:41:55.0834130Z ##[endgroup] 2026-02-21T11:41:55.1362666Z /usr/bin/nvidia-smi 2026-02-21T11:41:55.1363011Z + command -v nvidia-smi 2026-02-21T11:41:55.1363217Z + DEVICE_NAME=cuda 2026-02-21T11:41:55.1363375Z + nvidia-smi 2026-02-21T11:41:55.1514852Z Sat Feb 21 11:41:55 2026 2026-02-21T11:41:55.1515225Z 
+-----------------------------------------------------------------------------------------+ 2026-02-21T11:41:55.1515725Z | NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 | 2026-02-21T11:41:55.1516117Z +-----------------------------------------+------------------------+----------------------+ 2026-02-21T11:41:55.1516514Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2026-02-21T11:41:55.1517201Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2026-02-21T11:41:55.1517510Z | | | MIG M. | 2026-02-21T11:41:55.1517793Z |=========================================+========================+======================| 2026-02-21T11:41:55.1591200Z | 0 NVIDIA B200 Off | 00000000:43:00.0 Off | 0 | 2026-02-21T11:41:55.1591612Z | N/A 38C P0 160W / 750W | 0MiB / 183359MiB | 0% Default | 2026-02-21T11:41:55.1592121Z | | | Disabled | 2026-02-21T11:41:55.1592435Z +-----------------------------------------+------------------------+----------------------+ 2026-02-21T11:41:55.1592677Z 2026-02-21T11:41:55.1592888Z +-----------------------------------------------------------------------------------------+ 2026-02-21T11:41:55.1593182Z | Processes: | 2026-02-21T11:41:55.1593483Z | GPU GI CI PID Type Process name GPU Memory | 2026-02-21T11:41:55.1593752Z | ID ID Usage | 2026-02-21T11:41:55.1593991Z |=========================================================================================| 2026-02-21T11:41:55.1594465Z | No running processes found | 2026-02-21T11:41:55.1594768Z +-----------------------------------------------------------------------------------------+ 2026-02-21T11:41:55.1891838Z + echo DEVICE_NAME=cuda 2026-02-21T11:41:55.1924602Z ##[group]Run set -eux 2026-02-21T11:41:55.1924801Z set -eux 2026-02-21T11:41:55.1924935Z  2026-02-21T11:41:55.1925095Z if [[ "${DEVICE_NAME}" == "cuda" ]]; then 2026-02-21T11:41:55.1925318Z  # Return the same device name as PyTorch 2026-02-21T11:41:55.1925617Z  DEVICE_TYPE=$(nvidia-smi -i 0 
--query-gpu=name --format=csv,noheader) 2026-02-21T11:41:55.1925918Z elif [[ "${DEVICE_NAME}" == "rocm" ]]; then 2026-02-21T11:41:55.1926222Z  DEVICE_TYPE=$(rocminfo | grep "Marketing Name" | tail -n1 | awk -F':' '{print $2}' | xargs) 2026-02-21T11:41:55.1926535Z elif [[ "${DEVICE_NAME}" == "hpu" ]]; then 2026-02-21T11:41:55.1926877Z  DEVICE_TYPE="Intel Gaudi3 "$(hl-smi -q | grep "Product Name" | head -n 1 | awk -F ':' '{print $2}' | sed 's/^ *//') 2026-02-21T11:41:55.1927219Z elif [[ "${DEVICE_NAME}" == "cpu" ]]; then 2026-02-21T11:41:55.1927893Z  DEVICE_TYPE="$(lscpu | grep "Model name" | sed -E 's/.*Model name:[[:space:]]*//; s/Intel\(R\)//g; s/\(R\)//g; s/\(TM\)//g; s/CPU//g; s/Processor//g; s/[[:space:]]+/ /g; s/^ //; s/ $//; s/ /_/g')_$(awk -F: '/Core\(s\) per socket/ {c=$2} /Socket\(s\)/ {s=$2} END {gsub(/ /,"",c); gsub(/ /,"",s); printf "%sc", c*s}' < <(lscpu))" 2026-02-21T11:41:55.1928555Z elif [[ "${DEVICE_NAME}" == "arm64-cpu" ]]; then 2026-02-21T11:41:55.1928867Z  DEVICE_TYPE=$(lscpu | grep 'Vendor ID' | cut -f 2 -d ":" | awk '{$1=$1}1' | cut -f 2 -d " ") 2026-02-21T11:41:55.1929141Z fi 2026-02-21T11:41:55.1929308Z echo "DEVICE_TYPE=$DEVICE_TYPE" >> $GITHUB_ENV 2026-02-21T11:41:55.1929582Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T11:41:55.1929771Z env: 2026-02-21T11:41:55.1929912Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T11:41:55.1930101Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:55.1930339Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T11:41:55.1930568Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:55.1930780Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:55.1930988Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:55.1931405Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T11:41:55.1931791Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 
2026-02-21T11:41:55.1932033Z DEVICE_NAME: cuda 2026-02-21T11:41:55.1932182Z ##[endgroup] 2026-02-21T11:41:55.2421223Z + [[ cuda == \c\u\d\a ]] 2026-02-21T11:41:55.2424506Z ++ nvidia-smi -i 0 --query-gpu=name --format=csv,noheader 2026-02-21T11:41:55.2617477Z + DEVICE_TYPE='NVIDIA B200' 2026-02-21T11:41:55.2619452Z + echo 'DEVICE_TYPE=NVIDIA B200' 2026-02-21T11:41:55.2658013Z ##[group]Run set -eux 2026-02-21T11:41:55.2658180Z set -eux 2026-02-21T11:41:55.2658307Z  2026-02-21T11:41:55.2658457Z if [[ -n ".venv/bin/activate" ]]; then 2026-02-21T11:41:55.2658654Z  source ".venv/bin/activate" 2026-02-21T11:41:55.2658826Z fi 2026-02-21T11:41:55.2658943Z  2026-02-21T11:41:55.2659133Z python3 -mpip install psutil==7.0.0 nvidia-ml-py==13.580.82 2026-02-21T11:41:55.2659470Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_runners_info.py" 2026-02-21T11:41:55.2659842Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T11:41:55.2660039Z env: 2026-02-21T11:41:55.2660173Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T11:41:55.2660370Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:55.2660607Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T11:41:55.2660913Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:55.2661125Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:55.2661328Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:55.2661674Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T11:41:55.2662100Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T11:41:55.2662317Z DEVICE_NAME: cuda 2026-02-21T11:41:55.2662463Z DEVICE_TYPE: NVIDIA B200 2026-02-21T11:41:55.2662630Z ##[endgroup] 2026-02-21T11:41:55.3181376Z + [[ -n .venv/bin/activate ]] 2026-02-21T11:41:55.3181608Z + source .venv/bin/activate 2026-02-21T11:41:55.3181772Z ++ '[' -z '' ']' 2026-02-21T11:41:55.3181946Z ++ '[' 
-n x ']' 2026-02-21T11:41:55.3182094Z ++ SCRIPT_PATH=.venv/bin/activate 2026-02-21T11:41:55.3182355Z ++ '[' .venv/bin/activate = /__w/_temp/80f14291-cd93-40c2-ba29-7aa5132fefe9.sh ']' 2026-02-21T11:41:55.3182629Z ++ deactivate nondestructive 2026-02-21T11:41:55.3182790Z ++ unset -f pydoc 2026-02-21T11:41:55.3182931Z ++ '[' -z '' ']' 2026-02-21T11:41:55.3183058Z ++ '[' -z '' ']' 2026-02-21T11:41:55.3183194Z ++ hash -r 2026-02-21T11:41:55.3183316Z ++ '[' -z '' ']' 2026-02-21T11:41:55.3183455Z ++ unset VIRTUAL_ENV 2026-02-21T11:41:55.3183602Z ++ unset VIRTUAL_ENV_PROMPT 2026-02-21T11:41:55.3183784Z ++ '[' '!' nondestructive = nondestructive ']' 2026-02-21T11:41:55.3183985Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv 2026-02-21T11:41:55.3184165Z ++ '[' linux-gnu = cygwin ']' 2026-02-21T11:41:55.3191473Z ++ '[' linux-gnu = msys ']' 2026-02-21T11:41:55.3191646Z ++ export VIRTUAL_ENV 2026-02-21T11:41:55.3191798Z ++ '[' -z '' ']' 2026-02-21T11:41:55.3191965Z ++ unset SCRIPT_PATH 2026-02-21T11:41:55.3192615Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T11:41:55.3193801Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T11:41:55.3194461Z ++ export PATH 2026-02-21T11:41:55.3194608Z ++ '[' xhelion '!=' x ']' 2026-02-21T11:41:55.3194777Z ++ VIRTUAL_ENV_PROMPT=helion 2026-02-21T11:41:55.3194950Z ++ export VIRTUAL_ENV_PROMPT 2026-02-21T11:41:55.3195114Z ++ '[' -z '' ']' 2026-02-21T11:41:55.3195447Z ++ '[' -z '' ']' 2026-02-21T11:41:55.3195592Z ++ _OLD_VIRTUAL_PS1= 2026-02-21T11:41:55.3195733Z ++ 
PS1='(helion) ' 2026-02-21T11:41:55.3195879Z ++ export PS1 2026-02-21T11:41:55.3196011Z ++ alias pydoc 2026-02-21T11:41:55.3196153Z ++ true 2026-02-21T11:41:55.3196276Z ++ hash -r 2026-02-21T11:41:55.3196471Z + python3 -mpip install psutil==7.0.0 nvidia-ml-py==13.580.82 2026-02-21T11:41:55.9740207Z Collecting psutil==7.0.0 2026-02-21T11:41:56.1007242Z Downloading psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB) 2026-02-21T11:41:56.1811795Z Collecting nvidia-ml-py==13.580.82 2026-02-21T11:41:56.1847099Z Downloading nvidia_ml_py-13.580.82-py3-none-any.whl.metadata (9.6 kB) 2026-02-21T11:41:56.1922150Z Downloading psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (277 kB) 2026-02-21T11:41:56.2113206Z Downloading nvidia_ml_py-13.580.82-py3-none-any.whl (49 kB) 2026-02-21T11:41:56.2952468Z Installing collected packages: nvidia-ml-py, psutil 2026-02-21T11:41:56.2959114Z Attempting uninstall: nvidia-ml-py 2026-02-21T11:41:56.2977767Z Found existing installation: nvidia-ml-py 13.590.48 2026-02-21T11:41:56.2989419Z Uninstalling nvidia-ml-py-13.590.48: 2026-02-21T11:41:56.3651405Z Successfully uninstalled nvidia-ml-py-13.590.48 2026-02-21T11:41:56.4094301Z Attempting uninstall: psutil 2026-02-21T11:41:56.4125478Z Found existing installation: psutil 7.2.2 2026-02-21T11:41:56.4139563Z Uninstalling psutil-7.2.2: 2026-02-21T11:41:56.4146259Z Successfully uninstalled psutil-7.2.2 2026-02-21T11:41:56.5263165Z 2026-02-21T11:41:56.5297181Z Successfully installed nvidia-ml-py-13.580.82 psutil-7.0.0 2026-02-21T11:41:56.6490216Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/gather-runners-info/../../scripts/benchmarks/gather_runners_info.py 2026-02-21T11:41:58.3021672Z ##[group]Run pytorch/test-infra/.github/actions/gather-dependencies@main 2026-02-21T11:41:58.3021996Z with: 2026-02-21T11:41:58.3022146Z venv: 
.venv/bin/activate 2026-02-21T11:41:58.3022298Z env: 2026-02-21T11:41:58.3022441Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T11:41:58.3022652Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:58.3022910Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T11:41:58.3023173Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:58.3023402Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:58.3023620Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:58.3023989Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T11:41:58.3024373Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T11:41:58.3024584Z DEVICE_NAME: cuda 2026-02-21T11:41:58.3024755Z DEVICE_TYPE: NVIDIA B200 2026-02-21T11:41:58.3024906Z ##[endgroup] 2026-02-21T11:41:58.3033125Z ##[group]Run set -eux 2026-02-21T11:41:58.3033290Z set -eux 2026-02-21T11:41:58.3033439Z  2026-02-21T11:41:58.3033585Z # TODO (huydhn): Implement this part 2026-02-21T11:41:58.3033825Z echo "dependencies={}" >> "${GITHUB_OUTPUT}" 2026-02-21T11:41:58.3034113Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T11:41:58.3034311Z env: 2026-02-21T11:41:58.3034450Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T11:41:58.3034648Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:58.3034879Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T11:41:58.3035118Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:58.3035324Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:58.3035535Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:58.3035879Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T11:41:58.3036257Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T11:41:58.3036473Z DEVICE_NAME: cuda 
2026-02-21T11:41:58.3036612Z DEVICE_TYPE: NVIDIA B200 2026-02-21T11:41:58.3036769Z ##[endgroup] 2026-02-21T11:41:58.3572748Z + echo 'dependencies={}' 2026-02-21T11:41:58.3629853Z ##[group]Run actions/upload-artifact@v6 2026-02-21T11:41:58.3630075Z with: 2026-02-21T11:41:58.3630240Z name: benchmark-results-b200-welford 2026-02-21T11:41:58.3630435Z path: test/test-reports 2026-02-21T11:41:58.3630606Z if-no-files-found: warn 2026-02-21T11:41:58.3630762Z compression-level: 6 2026-02-21T11:41:58.3630914Z overwrite: false 2026-02-21T11:41:58.3631060Z include-hidden-files: false 2026-02-21T11:41:58.3631220Z env: 2026-02-21T11:41:58.3631348Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T11:41:58.3631554Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:58.3631796Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T11:41:58.3632090Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:58.3632306Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:58.3632515Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T11:41:58.3632894Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T11:41:58.3633403Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T11:41:58.3633628Z DEVICE_NAME: cuda 2026-02-21T11:41:58.3633773Z DEVICE_TYPE: NVIDIA B200 2026-02-21T11:41:58.3633933Z ##[endgroup] 2026-02-21T11:41:58.3636071Z ##[command]/usr/bin/docker exec 748244fffba051af5f5cff673877559f5b5ba31ac47e4c49ff2680b83d7c9486 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T11:41:58.5747212Z With the provided path, there will be 1 file uploaded 2026-02-21T11:41:58.5757279Z Artifact name is valid! 2026-02-21T11:41:58.5761956Z Root directory input is valid! 
2026-02-21T11:41:58.8217449Z Beginning upload of artifact content to blob storage 2026-02-21T11:41:59.1616227Z Uploaded bytes 605 2026-02-21T11:41:59.2474613Z Finished uploading artifact content to blob storage! 2026-02-21T11:41:59.2475110Z SHA256 digest of uploaded artifact zip is 739ac29c40f790e38086847a6de2ab600f6506880b2d052f95ea74140d7f917a 2026-02-21T11:41:59.2479652Z Finalizing artifact upload 2026-02-21T11:41:59.5213312Z Artifact benchmark-results-b200-welford.zip successfully finalized. Artifact ID 5601102523 2026-02-21T11:41:59.5213852Z Artifact benchmark-results-b200-welford has been successfully uploaded! Final size is 605 bytes. Artifact ID is 5601102523 2026-02-21T11:41:59.5214448Z Artifact download URL: https://github.com/pytorch/helion/actions/runs/22253280836/artifacts/5601102523 2026-02-21T11:41:59.5327994Z Post job cleanup. 2026-02-21T11:41:59.5331725Z ##[command]/usr/bin/docker exec 748244fffba051af5f5cff673877559f5b5ba31ac47e4c49ff2680b83d7c9486 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T11:41:59.7145597Z (node:205095) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead. 2026-02-21T11:41:59.7146091Z (Use `node --trace-deprecation ...` to show where the warning was created) 2026-02-21T11:41:59.7156657Z UV_PYTHON_INSTALL_DIR is already set to /github/home/.local/share/uv/python 2026-02-21T11:41:59.7256390Z Post job cleanup. 2026-02-21T11:41:59.7258881Z ##[command]/usr/bin/docker exec 748244fffba051af5f5cff673877559f5b5ba31ac47e4c49ff2680b83d7c9486 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T11:41:59.9118581Z Post job cleanup. 
2026-02-21T11:41:59.9121669Z ##[command]/usr/bin/docker exec 748244fffba051af5f5cff673877559f5b5ba31ac47e4c49ff2680b83d7c9486 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T11:42:00.0736976Z [command]/usr/bin/git version 2026-02-21T11:42:00.0764478Z git version 2.43.0 2026-02-21T11:42:00.0792924Z Temporarily overriding HOME='/__w/_temp/d7ac9304-1b15-4e50-bb72-3054cf24034b' before making global git config changes 2026-02-21T11:42:00.0795635Z Adding repository directory to the temporary git global config as a safe directory 2026-02-21T11:42:00.0799462Z [command]/usr/bin/git config --global --add safe.directory /__w/helion/helion 2026-02-21T11:42:00.0831544Z Removing SSH command configuration 2026-02-21T11:42:00.0833563Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2026-02-21T11:42:00.0859954Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2026-02-21T11:42:00.1085281Z Removing HTTP extra header 2026-02-21T11:42:00.1085700Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2026-02-21T11:42:00.1109246Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2026-02-21T11:42:00.1335742Z Removing includeIf entries pointing to credentials config files 2026-02-21T11:42:00.1337515Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2026-02-21T11:42:00.1358414Z includeif.gitdir:/__w/helion/helion/.git.path 2026-02-21T11:42:00.1358743Z includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path 2026-02-21T11:42:00.1359018Z includeif.gitdir:/github/workspace/.git.path 2026-02-21T11:42:00.1359552Z includeif.gitdir:/github/workspace/.git/worktrees/*.path 
2026-02-21T11:42:00.1362454Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/__w/helion/helion/.git.path 2026-02-21T11:42:00.1379916Z /__w/_temp/git-credentials-131e5f73-3f26-44a5-8995-5ab231ffa76d.config 2026-02-21T11:42:00.1387641Z [command]/usr/bin/git config --local --unset includeif.gitdir:/__w/helion/helion/.git.path /__w/_temp/git-credentials-131e5f73-3f26-44a5-8995-5ab231ffa76d.config 2026-02-21T11:42:00.1418426Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path 2026-02-21T11:42:00.1434113Z /__w/_temp/git-credentials-131e5f73-3f26-44a5-8995-5ab231ffa76d.config 2026-02-21T11:42:00.1442344Z [command]/usr/bin/git config --local --unset includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path /__w/_temp/git-credentials-131e5f73-3f26-44a5-8995-5ab231ffa76d.config 2026-02-21T11:42:00.1467662Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/github/workspace/.git.path 2026-02-21T11:42:00.1485902Z /github/runner_temp/git-credentials-131e5f73-3f26-44a5-8995-5ab231ffa76d.config 2026-02-21T11:42:00.1531178Z [command]/usr/bin/git config --local --unset includeif.gitdir:/github/workspace/.git.path /github/runner_temp/git-credentials-131e5f73-3f26-44a5-8995-5ab231ffa76d.config 2026-02-21T11:42:00.1559587Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/github/workspace/.git/worktrees/*.path 2026-02-21T11:42:00.1573767Z /github/runner_temp/git-credentials-131e5f73-3f26-44a5-8995-5ab231ffa76d.config 2026-02-21T11:42:00.1581381Z [command]/usr/bin/git config --local --unset includeif.gitdir:/github/workspace/.git/worktrees/*.path /github/runner_temp/git-credentials-131e5f73-3f26-44a5-8995-5ab231ffa76d.config 2026-02-21T11:42:00.1616449Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2026-02-21T11:42:00.1830704Z Removing credentials config 
'/__w/_temp/git-credentials-131e5f73-3f26-44a5-8995-5ab231ffa76d.config' 2026-02-21T11:42:00.1916847Z Stop and remove container: 1ba5cc7795af4ec8a97beebf24e9b59a_nvidiacuda1301develubuntu2404_bf8b79 2026-02-21T11:42:00.1920379Z ##[command]/usr/bin/docker rm --force 748244fffba051af5f5cff673877559f5b5ba31ac47e4c49ff2680b83d7c9486 2026-02-21T11:42:04.3373941Z 748244fffba051af5f5cff673877559f5b5ba31ac47e4c49ff2680b83d7c9486 2026-02-21T11:42:04.3407829Z Remove container network: github_network_bb902c0f1908469581a19a15aa9ed8d1 2026-02-21T11:42:04.3410647Z ##[command]/usr/bin/docker network rm github_network_bb902c0f1908469581a19a15aa9ed8d1 2026-02-21T11:42:04.7638738Z github_network_bb902c0f1908469581a19a15aa9ed8d1 2026-02-21T11:42:04.7690335Z Evaluate and set job outputs 2026-02-21T11:42:04.7695476Z Set output 'benchmark-metadata' 2026-02-21T11:42:04.7696859Z Set output 'runners-info' 2026-02-21T11:42:04.7697358Z Set output 'dependencies' 2026-02-21T11:42:04.7697736Z Cleaning up orphan processes