2026-02-21T08:04:43.4330387Z Current runner version: '2.331.0' 2026-02-21T08:04:43.4334570Z Runner name: 'dgxb200-03-1005' 2026-02-21T08:04:43.4335215Z Runner group name: 'default' 2026-02-21T08:04:43.4335863Z Machine name: '3565fcf04df8' 2026-02-21T08:04:43.4337861Z ##[group]GITHUB_TOKEN Permissions 2026-02-21T08:04:43.4339429Z Contents: read 2026-02-21T08:04:43.4339873Z Metadata: read 2026-02-21T08:04:43.4340331Z ##[endgroup] 2026-02-21T08:04:43.4341936Z Secret source: Actions 2026-02-21T08:04:43.4342519Z Prepare workflow directory 2026-02-21T08:04:43.4704674Z Prepare all required actions 2026-02-21T08:04:43.4733445Z Getting action download info 2026-02-21T08:04:43.9422916Z Download action repository 'actions/checkout@v6' (SHA:de0fac2e4500dabe0009e67214ff5f5447ce83dd) 2026-02-21T08:04:44.2973999Z Download action repository 'actions/setup-python@v6' (SHA:a309ff8b426b58ec0e2a45f0f869d46889d02405) 2026-02-21T08:04:44.6762885Z Download action repository 'astral-sh/setup-uv@v7' (SHA:eac588ad8def6316056a12d4907a9d4d84ff7a3b) 2026-02-21T08:04:45.0512751Z Download action repository 'pytorch/test-infra@main' (SHA:bb8f04ff3961233c844fde6533c7c6c5f0857909) 2026-02-21T08:04:45.7173150Z Download action repository 'actions/upload-artifact@v6' (SHA:b7c566a772e6b6bfb58ed0dc250532a479d7789f) 2026-02-21T08:04:46.2983497Z Getting action download info 2026-02-21T08:04:46.5280899Z Uses: pytorch/helion/.github/workflows/benchmark.yml@refs/heads/main (874a7d0cadab18218a84ad3579d329dc95c51820) 2026-02-21T08:04:46.5283613Z ##[group] Inputs 2026-02-21T08:04:46.5283916Z runner: linux.dgx.b200 2026-02-21T08:04:46.5284221Z python-version: 3.12 2026-02-21T08:04:46.5284532Z image: nvidia/cuda:13.0.1-devel-ubuntu24.04 2026-02-21T08:04:46.5284823Z runtime-version: cu130 2026-02-21T08:04:46.5285141Z container-options: --gpus all 2026-02-21T08:04:46.5285440Z alias: b200 2026-02-21T08:04:46.5285668Z kernels: softmax 2026-02-21T08:04:46.5285931Z env-vars: 2026-02-21T08:04:46.5286159Z custom-args: 
2026-02-21T08:04:46.5286661Z run_h100: true 2026-02-21T08:04:46.5286885Z run_b200: true 2026-02-21T08:04:46.5287169Z run_mi325x: true 2026-02-21T08:04:46.5287392Z ##[endgroup] 2026-02-21T08:04:46.5287731Z Complete job name: run-b200 (softmax) / benchmark-cu130-softmax-py3.12-b200 2026-02-21T08:04:46.5523868Z ##[group]Checking docker version 2026-02-21T08:04:46.5533540Z ##[command]/usr/bin/docker version --format '{{.Server.APIVersion}}' 2026-02-21T08:04:46.5708640Z '1.53' 2026-02-21T08:04:46.5726923Z Docker daemon API version: '1.53' 2026-02-21T08:04:46.5727378Z ##[command]/usr/bin/docker version --format '{{.Client.APIVersion}}' 2026-02-21T08:04:46.5878967Z '1.52' 2026-02-21T08:04:46.5898365Z Docker client API version: '1.52' 2026-02-21T08:04:46.5902811Z ##[endgroup] 2026-02-21T08:04:46.5904712Z ##[group]Clean up resources from previous jobs 2026-02-21T08:04:46.5908321Z ##[command]/usr/bin/docker ps --all --quiet --no-trunc --filter "label=0581a9" 2026-02-21T08:04:46.6020450Z ##[command]/usr/bin/docker network prune --force --filter "label=0581a9" 2026-02-21T08:04:46.6124176Z ##[endgroup] 2026-02-21T08:04:46.6124476Z ##[group]Create local container network 2026-02-21T08:04:46.6130773Z ##[command]/usr/bin/docker network create --label 0581a9 github_network_1dabb68bcff447bd84adae5308b06429 2026-02-21T08:04:46.6461789Z 09773e3a4e0ced1bc0281e806d312428394d5dada89c5653e3d75c24718b90b7 2026-02-21T08:04:46.6485178Z ##[endgroup] 2026-02-21T08:04:46.6503235Z ##[group]Starting job container 2026-02-21T08:04:46.6518915Z ##[command]/usr/bin/docker pull nvidia/cuda:13.0.1-devel-ubuntu24.04 2026-02-21T08:04:47.4473345Z 13.0.1-devel-ubuntu24.04: Pulling from nvidia/cuda 2026-02-21T08:04:47.7314469Z 1cd98a0b9132: Pulling fs layer 2026-02-21T08:04:47.7314863Z 76249c7cd503: Pulling fs layer 2026-02-21T08:04:47.7315297Z 8fb7ecb711ef: Pulling fs layer 2026-02-21T08:04:47.7315703Z afcf80b42416: Pulling fs layer 2026-02-21T08:04:47.7316079Z ab7341a40ee7: Pulling fs layer 
2026-02-21T08:04:47.7316552Z e93dd1223ff5: Pulling fs layer 2026-02-21T08:04:47.7316917Z 401d11fb2a09: Pulling fs layer 2026-02-21T08:04:47.7317558Z d7913b78456a: Pulling fs layer 2026-02-21T08:04:47.7317814Z eea924c2c8fb: Pulling fs layer 2026-02-21T08:04:47.7321300Z c03b8ec8dd33: Pulling fs layer 2026-02-21T08:04:47.7321624Z c20926c42231: Pulling fs layer 2026-02-21T08:04:47.8627992Z 1cd98a0b9132: Download complete 2026-02-21T08:04:47.9636776Z d7913b78456a: Download complete 2026-02-21T08:04:48.0629567Z afcf80b42416: Download complete 2026-02-21T08:04:48.0631477Z c20926c42231: Download complete 2026-02-21T08:04:48.1632844Z 8fb7ecb711ef: Download complete 2026-02-21T08:04:48.1639294Z c03b8ec8dd33: Download complete 2026-02-21T08:04:48.6630245Z 401d11fb2a09: Download complete 2026-02-21T08:04:50.2643586Z 76249c7cd503: Download complete 2026-02-21T08:04:51.7666598Z 76249c7cd503: Pull complete 2026-02-21T08:04:51.8629416Z ab7341a40ee7: Download complete 2026-02-21T08:04:53.2671300Z 401d11fb2a09: Pull complete 2026-02-21T08:04:58.3680934Z ab7341a40ee7: Pull complete 2026-02-21T08:04:58.4641215Z d7913b78456a: Pull complete 2026-02-21T08:04:58.4678875Z c03b8ec8dd33: Pull complete 2026-02-21T08:05:13.6637444Z eea924c2c8fb: Download complete 2026-02-21T08:05:26.8631342Z e93dd1223ff5: Download complete 2026-02-21T08:05:32.1632859Z afcf80b42416: Pull complete 2026-02-21T08:05:32.1633544Z c20926c42231: Pull complete 2026-02-21T08:05:32.1642947Z eea924c2c8fb: Pull complete 2026-02-21T08:05:32.2641263Z 8fb7ecb711ef: Pull complete 2026-02-21T08:06:21.2640991Z e93dd1223ff5: Pull complete 2026-02-21T08:06:21.6873083Z 1cd98a0b9132: Pull complete 2026-02-21T08:06:21.6875055Z Digest: sha256:7d2f6a8c2071d911524f95061a0db363e24d27aa51ec831fcccf9e76eb72bc92 2026-02-21T08:06:21.6875522Z Status: Downloaded newer image for nvidia/cuda:13.0.1-devel-ubuntu24.04 2026-02-21T08:06:21.6886807Z docker.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 2026-02-21T08:06:21.6963236Z ##[command]/usr/bin/docker 
create --name 6d984ead33f845ac9a028d8d082e23df_nvidiacuda1301develubuntu2404_d3efdf --label 0581a9 --workdir /__w/helion/helion --network github_network_1dabb68bcff447bd84adae5308b06429 --gpus all -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/charlie/_work":"/__w" -v "/home/charlie/externals":"/__e":ro -v "/home/charlie/_work/_temp":"/__w/_temp" -v "/home/charlie/_work/_actions":"/__w/_actions" -v "/home/charlie/_work/_tool":"/__w/_tool" -v "/home/charlie/_work/_temp/_github_home":"/github/home" -v "/home/charlie/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" nvidia/cuda:13.0.1-devel-ubuntu24.04 "-f" "/dev/null" 2026-02-21T08:06:21.9752053Z 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 2026-02-21T08:06:21.9775801Z ##[command]/usr/bin/docker start 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 2026-02-21T08:06:22.2510786Z 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 2026-02-21T08:06:22.2523678Z ##[command]/usr/bin/docker ps --all --filter id=2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 --filter status=running --no-trunc --format "{{.ID}} {{.Status}}" 2026-02-21T08:06:22.2682073Z 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 Up Less than a second 2026-02-21T08:06:22.2698152Z ##[command]/usr/bin/docker inspect --format "{{range .Config.Env}}{{println .}}{{end}}" 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 2026-02-21T08:06:22.2785856Z HOME=/github/home 2026-02-21T08:06:22.2787968Z GITHUB_ACTIONS=true 2026-02-21T08:06:22.2788345Z CI=true 2026-02-21T08:06:22.2788783Z PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:06:22.2789218Z NVARCH=x86_64 2026-02-21T08:06:22.2793746Z NVIDIA_REQUIRE_CUDA=cuda>=13.0 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 
brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571 brand=unknown,driver>=575,driver<576 brand=grid,driver>=575,driver<576 brand=tesla,driver>=575,driver<576 brand=nvidia,driver>=575,driver<576 brand=quadro,driver>=575,driver<576 brand=quadrortx,driver>=575,driver<576 brand=nvidiartx,driver>=575,driver<576 brand=vapps,driver>=575,driver<576 brand=vpc,driver>=575,driver<576 
brand=vcs,driver>=575,driver<576 brand=vws,driver>=575,driver<576 brand=cloudgaming,driver>=575,driver<576 2026-02-21T08:06:22.2799048Z NV_CUDA_CUDART_VERSION=13.0.88-1 2026-02-21T08:06:22.2799270Z CUDA_VERSION=13.0.1 2026-02-21T08:06:22.2799696Z LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:22.2800064Z NVIDIA_VISIBLE_DEVICES=all 2026-02-21T08:06:22.2800345Z NVIDIA_DRIVER_CAPABILITIES=compute,utility 2026-02-21T08:06:22.2800665Z NV_CUDA_LIB_VERSION=13.0.1-1 2026-02-21T08:06:22.2801013Z NV_NVTX_VERSION=13.0.85-1 2026-02-21T08:06:22.2801266Z NV_LIBNPP_VERSION=13.0.1.2-1 2026-02-21T08:06:22.2801531Z NV_LIBNPP_PACKAGE=libnpp-13-0=13.0.1.2-1 2026-02-21T08:06:22.2801834Z NV_LIBCUSPARSE_VERSION=12.6.3.3-1 2026-02-21T08:06:22.2802096Z NV_LIBCUBLAS_PACKAGE_NAME=libcublas-13-0 2026-02-21T08:06:22.2802395Z NV_LIBCUBLAS_VERSION=13.0.2.14-1 2026-02-21T08:06:22.2802635Z NV_LIBCUBLAS_PACKAGE=libcublas-13-0=13.0.2.14-1 2026-02-21T08:06:22.2802933Z NV_LIBNCCL_PACKAGE_NAME=libnccl2 2026-02-21T08:06:22.2803230Z NV_LIBNCCL_PACKAGE_VERSION=2.28.3-1 2026-02-21T08:06:22.2803456Z NCCL_VERSION=2.28.3-1 2026-02-21T08:06:22.2803718Z NV_LIBNCCL_PACKAGE=libnccl2=2.28.3-1+cuda13.0 2026-02-21T08:06:22.2803970Z NVIDIA_PRODUCT_NAME=CUDA 2026-02-21T08:06:22.2804235Z NV_CUDA_CUDART_DEV_VERSION=13.0.88-1 2026-02-21T08:06:22.2804480Z NV_NVML_DEV_VERSION=13.0.87-1 2026-02-21T08:06:22.2804736Z NV_LIBCUSPARSE_DEV_VERSION=12.6.3.3-1 2026-02-21T08:06:22.2804984Z NV_LIBNPP_DEV_VERSION=13.0.1.2-1 2026-02-21T08:06:22.2805271Z NV_LIBNPP_DEV_PACKAGE=libnpp-dev-13-0=13.0.1.2-1 2026-02-21T08:06:22.2805562Z NV_LIBCUBLAS_DEV_VERSION=13.0.2.14-1 2026-02-21T08:06:22.2805822Z NV_LIBCUBLAS_DEV_PACKAGE_NAME=libcublas-dev-13-0 2026-02-21T08:06:22.2806145Z NV_LIBCUBLAS_DEV_PACKAGE=libcublas-dev-13-0=13.0.2.14-1 2026-02-21T08:06:22.2806403Z NV_CUDA_NSIGHT_COMPUTE_VERSION=13.0.1-1 2026-02-21T08:06:22.2806767Z 
NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE=cuda-nsight-compute-13-0=13.0.1-1 2026-02-21T08:06:22.2807097Z NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev 2026-02-21T08:06:22.2807348Z NV_LIBNCCL_DEV_PACKAGE_VERSION=2.28.3-1 2026-02-21T08:06:22.2807679Z NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.28.3-1+cuda13.0 2026-02-21T08:06:22.2807944Z LIBRARY_PATH=/usr/local/cuda/lib64/stubs 2026-02-21T08:06:22.2813616Z ##[endgroup] 2026-02-21T08:06:22.2820614Z ##[group]Waiting for all services to be ready 2026-02-21T08:06:22.2821807Z ##[endgroup] 2026-02-21T08:06:22.2952669Z ##[group]Run echo "Detected NVIDIA image" 2026-02-21T08:06:22.2953022Z echo "Detected NVIDIA image" 2026-02-21T08:06:22.2953392Z nvidia-smi || echo "nvidia-smi not found" 2026-02-21T08:06:22.2955589Z shell: bash -l {0} 2026-02-21T08:06:22.2955895Z env: 2026-02-21T08:06:22.2956336Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:22.2956557Z ##[endgroup] 2026-02-21T08:06:22.3582319Z Detected NVIDIA image 2026-02-21T08:06:22.3857090Z Sat Feb 21 08:06:22 2026
2026-02-21T08:06:22.3857634Z +-----------------------------------------------------------------------------------------+
2026-02-21T08:06:22.3863223Z | NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
2026-02-21T08:06:22.3864253Z +-----------------------------------------+------------------------+----------------------+
2026-02-21T08:06:22.3864743Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2026-02-21T08:06:22.3865514Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2026-02-21T08:06:22.3865958Z |                                         |                        |               MIG M. |
2026-02-21T08:06:22.3866403Z |=========================================+========================+======================|
2026-02-21T08:06:22.3945836Z |   0  NVIDIA B200                    Off |   00000000:52:00.0 Off |                    0 |
2026-02-21T08:06:22.3947614Z | N/A   30C    P0            141W /  750W |       0MiB / 183359MiB |      0%      Default |
2026-02-21T08:06:22.3947969Z |                                         |                        |             Disabled |
2026-02-21T08:06:22.3948382Z +-----------------------------------------+------------------------+----------------------+
2026-02-21T08:06:22.3948683Z
2026-02-21T08:06:22.3953168Z +-----------------------------------------------------------------------------------------+
2026-02-21T08:06:22.3957642Z | Processes:                                                                              |
2026-02-21T08:06:22.3962431Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2026-02-21T08:06:22.3962886Z |        ID   ID                                                               Usage      |
2026-02-21T08:06:22.3963274Z |=========================================================================================|
2026-02-21T08:06:22.3963690Z |  No running processes found                                                             |
2026-02-21T08:06:22.3964096Z +-----------------------------------------------------------------------------------------+
2026-02-21T08:06:22.4328127Z ##[group]Run set -x 2026-02-21T08:06:22.4328352Z set -x 2026-02-21T08:06:22.4328586Z apt-get update 2026-02-21T08:06:22.4328797Z apt-get install -y git 2026-02-21T08:06:22.4329225Z shell: bash -l {0} 2026-02-21T08:06:22.4329409Z env: 2026-02-21T08:06:22.4329638Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:22.4329849Z ##[endgroup] 2026-02-21T08:06:22.4860916Z + apt-get update 2026-02-21T08:06:22.5413084Z Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64 InRelease [1581 B] 2026-02-21T08:06:22.6520420Z Get:2 http://security.ubuntu.com/ubuntu noble-security InRelease [126 kB] 2026-02-21T08:06:22.6566269Z Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64 Packages [1218 kB] 2026-02-21T08:06:22.8514636Z Get:4 http://archive.ubuntu.com/ubuntu noble InRelease [256 kB] 2026-02-21T08:06:23.0037773Z Get:5
http://security.ubuntu.com/ubuntu noble-security/multiverse amd64 Packages [34.8 kB] 2026-02-21T08:06:23.0874458Z Get:6 http://security.ubuntu.com/ubuntu noble-security/universe amd64 Packages [1207 kB] 2026-02-21T08:06:23.3744709Z Get:7 http://security.ubuntu.com/ubuntu noble-security/restricted amd64 Packages [3196 kB] 2026-02-21T08:06:23.5224596Z Get:8 http://security.ubuntu.com/ubuntu noble-security/main amd64 Packages [1857 kB] 2026-02-21T08:06:23.7832545Z Get:9 http://archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB] 2026-02-21T08:06:24.0146785Z Get:10 http://archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB] 2026-02-21T08:06:24.2510865Z Get:11 http://archive.ubuntu.com/ubuntu noble/multiverse amd64 Packages [331 kB] 2026-02-21T08:06:24.4162123Z Get:12 http://archive.ubuntu.com/ubuntu noble/universe amd64 Packages [19.3 MB] 2026-02-21T08:06:25.6637567Z Get:13 http://archive.ubuntu.com/ubuntu noble/main amd64 Packages [1808 kB] 2026-02-21T08:06:25.7251103Z Get:14 http://archive.ubuntu.com/ubuntu noble/restricted amd64 Packages [117 kB] 2026-02-21T08:06:25.7284419Z Get:15 http://archive.ubuntu.com/ubuntu noble-updates/restricted amd64 Packages [3381 kB] 2026-02-21T08:06:25.8719293Z Get:16 http://archive.ubuntu.com/ubuntu noble-updates/multiverse amd64 Packages [38.1 kB] 2026-02-21T08:06:25.8724485Z Get:17 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 Packages [2240 kB] 2026-02-21T08:06:25.9512578Z Get:18 http://archive.ubuntu.com/ubuntu noble-updates/universe amd64 Packages [2016 kB] 2026-02-21T08:06:26.0547296Z Get:19 http://archive.ubuntu.com/ubuntu noble-backports/universe amd64 Packages [34.6 kB] 2026-02-21T08:06:26.0564446Z Get:20 http://archive.ubuntu.com/ubuntu noble-backports/main amd64 Packages [49.5 kB] 2026-02-21T08:06:26.5851216Z Fetched 37.5 MB in 4s (9503 kB/s) 2026-02-21T08:06:27.8374442Z Reading package lists... 
2026-02-21T08:06:27.8477353Z W: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details. 2026-02-21T08:06:27.8484115Z + apt-get install -y git 2026-02-21T08:06:30.4025279Z Reading package lists... 2026-02-21T08:06:30.5209354Z Building dependency tree... 2026-02-21T08:06:30.5213906Z Reading state information... 2026-02-21T08:06:30.6599743Z The following additional packages will be installed: 2026-02-21T08:06:30.6604574Z git-man krb5-locales less libbrotli1 libbsd0 libcbor0.10 libcurl3t64-gnutls 2026-02-21T08:06:30.6605925Z libedit2 liberror-perl libexpat1 libfido2-1 libgssapi-krb5-2 libk5crypto3 2026-02-21T08:06:30.6606552Z libkeyutils1 libkrb5-3 libkrb5support0 libnghttp2-14 libpsl5t64 librtmp1 2026-02-21T08:06:30.6607095Z libssh-4 libx11-6 libx11-data libxau6 libxcb1 libxdmcp6 libxext6 libxmuu1 2026-02-21T08:06:30.6609386Z openssh-client publicsuffix xauth 2026-02-21T08:06:30.6614113Z Suggested packages: 2026-02-21T08:06:30.6614506Z gettext-base git-daemon-run | git-daemon-sysvinit git-doc git-email git-gui 2026-02-21T08:06:30.6615307Z gitk gitweb git-cvs git-mediawiki git-svn krb5-doc krb5-user keychain 2026-02-21T08:06:30.6615647Z libpam-ssh monkeysphere ssh-askpass 2026-02-21T08:06:31.2370091Z The following NEW packages will be installed: 2026-02-21T08:06:31.2371769Z git git-man krb5-locales less libbrotli1 libbsd0 libcbor0.10 2026-02-21T08:06:31.2372219Z libcurl3t64-gnutls libedit2 liberror-perl libexpat1 libfido2-1 2026-02-21T08:06:31.2372608Z libgssapi-krb5-2 libk5crypto3 libkeyutils1 libkrb5-3 libkrb5support0 2026-02-21T08:06:31.2373120Z libnghttp2-14 libpsl5t64 librtmp1 libssh-4 libx11-6 libx11-data libxau6 2026-02-21T08:06:31.2373821Z libxcb1 libxdmcp6 libxext6 libxmuu1 openssh-client publicsuffix xauth 2026-02-21T08:06:31.5313239Z 0 upgraded, 31 newly installed, 0 to remove and 86 not upgraded. 
2026-02-21T08:06:31.5318395Z Need to get 8886 kB of archives. 2026-02-21T08:06:31.5322615Z After this operation, 38.0 MB of additional disk space will be used. 2026-02-21T08:06:31.5323337Z Get:1 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 krb5-locales all 1.20.1-6ubuntu2.6 [14.8 kB] 2026-02-21T08:06:31.8108252Z Get:2 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 less amd64 590-2ubuntu2.1 [142 kB] 2026-02-21T08:06:32.1904785Z Get:3 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libbsd0 amd64 0.12.1-1build1.1 [41.2 kB] 2026-02-21T08:06:32.2454831Z Get:4 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libexpat1 amd64 2.6.1-2ubuntu0.4 [88.2 kB] 2026-02-21T08:06:32.3204953Z Get:5 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libkrb5support0 amd64 1.20.1-6ubuntu2.6 [34.4 kB] 2026-02-21T08:06:32.3438031Z Get:6 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libk5crypto3 amd64 1.20.1-6ubuntu2.6 [82.0 kB] 2026-02-21T08:06:32.3892113Z Get:7 http://archive.ubuntu.com/ubuntu noble/main amd64 libkeyutils1 amd64 1.6.3-3build1 [9490 B] 2026-02-21T08:06:32.3932339Z Get:8 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libkrb5-3 amd64 1.20.1-6ubuntu2.6 [348 kB] 2026-02-21T08:06:32.5175967Z Get:9 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libgssapi-krb5-2 amd64 1.20.1-6ubuntu2.6 [143 kB] 2026-02-21T08:06:32.5673852Z Get:10 http://archive.ubuntu.com/ubuntu noble/main amd64 libcbor0.10 amd64 0.10.2-1.2ubuntu2 [25.8 kB] 2026-02-21T08:06:32.5707357Z Get:11 http://archive.ubuntu.com/ubuntu noble/main amd64 libedit2 amd64 3.1-20230828-1build1 [97.6 kB] 2026-02-21T08:06:32.5989507Z Get:12 http://archive.ubuntu.com/ubuntu noble/main amd64 libfido2-1 amd64 1.14.0-1build3 [83.5 kB] 2026-02-21T08:06:32.6102267Z Get:13 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libnghttp2-14 amd64 1.59.0-1ubuntu0.2 [74.3 kB] 2026-02-21T08:06:32.6216675Z Get:14 http://archive.ubuntu.com/ubuntu noble/main 
amd64 libpsl5t64 amd64 0.21.2-1.1build1 [57.1 kB] 2026-02-21T08:06:32.6310502Z Get:15 http://archive.ubuntu.com/ubuntu noble/main amd64 libxau6 amd64 1:1.0.9-1build6 [7160 B] 2026-02-21T08:06:32.6326878Z Get:16 http://archive.ubuntu.com/ubuntu noble/main amd64 libxdmcp6 amd64 1:1.1.3-0ubuntu6 [10.3 kB] 2026-02-21T08:06:32.6345134Z Get:17 http://archive.ubuntu.com/ubuntu noble/main amd64 libxcb1 amd64 1.15-1ubuntu2 [47.7 kB] 2026-02-21T08:06:32.6561789Z Get:18 http://archive.ubuntu.com/ubuntu noble/main amd64 libx11-data all 2:1.8.7-1build1 [115 kB] 2026-02-21T08:06:32.7230046Z Get:19 http://archive.ubuntu.com/ubuntu noble/main amd64 libx11-6 amd64 2:1.8.7-1build1 [650 kB] 2026-02-21T08:06:32.7933455Z Get:20 http://archive.ubuntu.com/ubuntu noble/main amd64 libxext6 amd64 2:1.3.4-1build2 [30.4 kB] 2026-02-21T08:06:32.7997962Z Get:21 http://archive.ubuntu.com/ubuntu noble/main amd64 libxmuu1 amd64 2:1.1.3-3build2 [8958 B] 2026-02-21T08:06:32.8010852Z Get:22 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 openssh-client amd64 1:9.6p1-3ubuntu13.14 [906 kB] 2026-02-21T08:06:32.8721829Z Get:23 http://archive.ubuntu.com/ubuntu noble/main amd64 publicsuffix all 20231001.0357-0.1 [129 kB] 2026-02-21T08:06:32.8815329Z Get:24 http://archive.ubuntu.com/ubuntu noble/main amd64 xauth amd64 1:1.1.2-1build1 [25.6 kB] 2026-02-21T08:06:32.8842063Z Get:25 http://archive.ubuntu.com/ubuntu noble/main amd64 libbrotli1 amd64 1.1.0-2build2 [331 kB] 2026-02-21T08:06:32.9117431Z Get:26 http://archive.ubuntu.com/ubuntu noble/main amd64 librtmp1 amd64 2.4+20151223.gitfa8646d.1-2build7 [56.3 kB] 2026-02-21T08:06:32.9147126Z Get:27 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libssh-4 amd64 0.10.6-2ubuntu0.3 [190 kB] 2026-02-21T08:06:32.9301014Z Get:28 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libcurl3t64-gnutls amd64 8.5.0-2ubuntu10.6 [333 kB] 2026-02-21T08:06:32.9590769Z Get:29 http://archive.ubuntu.com/ubuntu noble/main amd64 liberror-perl all 0.17029-2 
[25.6 kB] 2026-02-21T08:06:32.9609320Z Get:30 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 git-man all 1:2.43.0-1ubuntu7.3 [1100 kB] 2026-02-21T08:06:33.0091168Z Get:31 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 git amd64 1:2.43.0-1ubuntu7.3 [3680 kB] 2026-02-21T08:06:33.2234064Z debconf: delaying package configuration, since apt-utils is not installed 2026-02-21T08:06:33.2455808Z Fetched 8886 kB in 2s (4721 kB/s) 2026-02-21T08:06:33.2950144Z Selecting previously unselected package krb5-locales. 2026-02-21T08:06:33.2965368Z (Reading database ... 2026-02-21T08:06:33.2967037Z (Reading database ... 5% 2026-02-21T08:06:33.2967696Z (Reading database ... 10% 2026-02-21T08:06:33.2972717Z (Reading database ... 15% 2026-02-21T08:06:33.2976267Z (Reading database ... 20% 2026-02-21T08:06:33.2978542Z (Reading database ... 25% 2026-02-21T08:06:33.2978816Z (Reading database ... 30% 2026-02-21T08:06:33.2979149Z (Reading database ... 35% 2026-02-21T08:06:33.2979396Z (Reading database ... 40% 2026-02-21T08:06:33.2979629Z (Reading database ... 45% 2026-02-21T08:06:33.2979816Z (Reading database ... 50% 2026-02-21T08:06:33.2980101Z (Reading database ... 55% 2026-02-21T08:06:33.2980300Z (Reading database ... 60% 2026-02-21T08:06:33.2980517Z (Reading database ... 65% 2026-02-21T08:06:33.2980800Z (Reading database ... 70% 2026-02-21T08:06:33.2984504Z (Reading database ... 75% 2026-02-21T08:06:33.2993421Z (Reading database ... 80% 2026-02-21T08:06:33.2994718Z (Reading database ... 85% 2026-02-21T08:06:33.3002685Z (Reading database ... 90% 2026-02-21T08:06:33.3006198Z (Reading database ... 95% 2026-02-21T08:06:33.3006508Z (Reading database ... 100% 2026-02-21T08:06:33.3006843Z (Reading database ... 15507 files and directories currently installed.) 2026-02-21T08:06:33.3015251Z Preparing to unpack .../00-krb5-locales_1.20.1-6ubuntu2.6_all.deb ... 2026-02-21T08:06:33.3042745Z Unpacking krb5-locales (1.20.1-6ubuntu2.6) ... 
2026-02-21T08:06:33.3346901Z Selecting previously unselected package less. 2026-02-21T08:06:33.3354978Z Preparing to unpack .../01-less_590-2ubuntu2.1_amd64.deb ... 2026-02-21T08:06:33.3407821Z Unpacking less (590-2ubuntu2.1) ... 2026-02-21T08:06:33.3734759Z Selecting previously unselected package libbsd0:amd64. 2026-02-21T08:06:33.3743101Z Preparing to unpack .../02-libbsd0_0.12.1-1build1.1_amd64.deb ... 2026-02-21T08:06:33.3798253Z Unpacking libbsd0:amd64 (0.12.1-1build1.1) ... 2026-02-21T08:06:33.4101278Z Selecting previously unselected package libexpat1:amd64. 2026-02-21T08:06:33.4109370Z Preparing to unpack .../03-libexpat1_2.6.1-2ubuntu0.4_amd64.deb ... 2026-02-21T08:06:33.4137113Z Unpacking libexpat1:amd64 (2.6.1-2ubuntu0.4) ... 2026-02-21T08:06:33.4454760Z Selecting previously unselected package libkrb5support0:amd64. 2026-02-21T08:06:33.4461515Z Preparing to unpack .../04-libkrb5support0_1.20.1-6ubuntu2.6_amd64.deb ... 2026-02-21T08:06:33.4490124Z Unpacking libkrb5support0:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:33.4776235Z Selecting previously unselected package libk5crypto3:amd64. 2026-02-21T08:06:33.4787864Z Preparing to unpack .../05-libk5crypto3_1.20.1-6ubuntu2.6_amd64.deb ... 2026-02-21T08:06:33.4801977Z Unpacking libk5crypto3:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:33.5066755Z Selecting previously unselected package libkeyutils1:amd64. 2026-02-21T08:06:33.5074483Z Preparing to unpack .../06-libkeyutils1_1.6.3-3build1_amd64.deb ... 2026-02-21T08:06:33.5107706Z Unpacking libkeyutils1:amd64 (1.6.3-3build1) ... 2026-02-21T08:06:33.5362953Z Selecting previously unselected package libkrb5-3:amd64. 2026-02-21T08:06:33.5372833Z Preparing to unpack .../07-libkrb5-3_1.20.1-6ubuntu2.6_amd64.deb ... 2026-02-21T08:06:33.5399591Z Unpacking libkrb5-3:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:33.5690021Z Selecting previously unselected package libgssapi-krb5-2:amd64. 
2026-02-21T08:06:33.5697394Z Preparing to unpack .../08-libgssapi-krb5-2_1.20.1-6ubuntu2.6_amd64.deb ... 2026-02-21T08:06:33.5723946Z Unpacking libgssapi-krb5-2:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:33.5992363Z Selecting previously unselected package libcbor0.10:amd64. 2026-02-21T08:06:33.5994395Z Preparing to unpack .../09-libcbor0.10_0.10.2-1.2ubuntu2_amd64.deb ... 2026-02-21T08:06:33.6257631Z Unpacking libcbor0.10:amd64 (0.10.2-1.2ubuntu2) ... 2026-02-21T08:06:33.6556233Z Selecting previously unselected package libedit2:amd64. 2026-02-21T08:06:33.6565019Z Preparing to unpack .../10-libedit2_3.1-20230828-1build1_amd64.deb ... 2026-02-21T08:06:33.6590375Z Unpacking libedit2:amd64 (3.1-20230828-1build1) ... 2026-02-21T08:06:33.7195695Z Selecting previously unselected package libfido2-1:amd64. 2026-02-21T08:06:33.7201436Z Preparing to unpack .../11-libfido2-1_1.14.0-1build3_amd64.deb ... 2026-02-21T08:06:33.7225794Z Unpacking libfido2-1:amd64 (1.14.0-1build3) ... 2026-02-21T08:06:33.7528116Z Selecting previously unselected package libnghttp2-14:amd64. 2026-02-21T08:06:33.7530117Z Preparing to unpack .../12-libnghttp2-14_1.59.0-1ubuntu0.2_amd64.deb ... 2026-02-21T08:06:33.7566505Z Unpacking libnghttp2-14:amd64 (1.59.0-1ubuntu0.2) ... 2026-02-21T08:06:33.7786740Z Selecting previously unselected package libpsl5t64:amd64. 2026-02-21T08:06:33.7796509Z Preparing to unpack .../13-libpsl5t64_0.21.2-1.1build1_amd64.deb ... 2026-02-21T08:06:33.7854942Z Unpacking libpsl5t64:amd64 (0.21.2-1.1build1) ... 2026-02-21T08:06:33.8087262Z Selecting previously unselected package libxau6:amd64. 2026-02-21T08:06:33.8089973Z Preparing to unpack .../14-libxau6_1%3a1.0.9-1build6_amd64.deb ... 2026-02-21T08:06:33.8130501Z Unpacking libxau6:amd64 (1:1.0.9-1build6) ... 2026-02-21T08:06:33.8342206Z Selecting previously unselected package libxdmcp6:amd64. 2026-02-21T08:06:33.8348559Z Preparing to unpack .../15-libxdmcp6_1%3a1.1.3-0ubuntu6_amd64.deb ... 
2026-02-21T08:06:33.8376657Z Unpacking libxdmcp6:amd64 (1:1.1.3-0ubuntu6) ... 2026-02-21T08:06:33.8615219Z Selecting previously unselected package libxcb1:amd64. 2026-02-21T08:06:33.8621751Z Preparing to unpack .../16-libxcb1_1.15-1ubuntu2_amd64.deb ... 2026-02-21T08:06:33.8644804Z Unpacking libxcb1:amd64 (1.15-1ubuntu2) ... 2026-02-21T08:06:33.8845020Z Selecting previously unselected package libx11-data. 2026-02-21T08:06:33.8851951Z Preparing to unpack .../17-libx11-data_2%3a1.8.7-1build1_all.deb ... 2026-02-21T08:06:33.8878280Z Unpacking libx11-data (2:1.8.7-1build1) ... 2026-02-21T08:06:33.9235910Z Selecting previously unselected package libx11-6:amd64. 2026-02-21T08:06:33.9245712Z Preparing to unpack .../18-libx11-6_2%3a1.8.7-1build1_amd64.deb ... 2026-02-21T08:06:33.9270274Z Unpacking libx11-6:amd64 (2:1.8.7-1build1) ... 2026-02-21T08:06:33.9554460Z Selecting previously unselected package libxext6:amd64. 2026-02-21T08:06:33.9560865Z Preparing to unpack .../19-libxext6_2%3a1.3.4-1build2_amd64.deb ... 2026-02-21T08:06:33.9584161Z Unpacking libxext6:amd64 (2:1.3.4-1build2) ... 2026-02-21T08:06:33.9803065Z Selecting previously unselected package libxmuu1:amd64. 2026-02-21T08:06:33.9810794Z Preparing to unpack .../20-libxmuu1_2%3a1.1.3-3build2_amd64.deb ... 2026-02-21T08:06:33.9835825Z Unpacking libxmuu1:amd64 (2:1.1.3-3build2) ... 2026-02-21T08:06:34.0064481Z Selecting previously unselected package openssh-client. 2026-02-21T08:06:34.0070141Z Preparing to unpack .../21-openssh-client_1%3a9.6p1-3ubuntu13.14_amd64.deb ... 2026-02-21T08:06:34.0145884Z Unpacking openssh-client (1:9.6p1-3ubuntu13.14) ... 2026-02-21T08:06:34.0493834Z Selecting previously unselected package publicsuffix. 2026-02-21T08:06:34.0498393Z Preparing to unpack .../22-publicsuffix_20231001.0357-0.1_all.deb ... 2026-02-21T08:06:34.0525429Z Unpacking publicsuffix (20231001.0357-0.1) ... 2026-02-21T08:06:34.0733548Z Selecting previously unselected package xauth. 
2026-02-21T08:06:34.0744304Z Preparing to unpack .../23-xauth_1%3a1.1.2-1build1_amd64.deb ... 2026-02-21T08:06:34.0766781Z Unpacking xauth (1:1.1.2-1build1) ... 2026-02-21T08:06:34.0984433Z Selecting previously unselected package libbrotli1:amd64. 2026-02-21T08:06:34.0990920Z Preparing to unpack .../24-libbrotli1_1.1.0-2build2_amd64.deb ... 2026-02-21T08:06:34.1013396Z Unpacking libbrotli1:amd64 (1.1.0-2build2) ... 2026-02-21T08:06:34.1261942Z Selecting previously unselected package librtmp1:amd64. 2026-02-21T08:06:34.1269567Z Preparing to unpack .../25-librtmp1_2.4+20151223.gitfa8646d.1-2build7_amd64.deb ... 2026-02-21T08:06:34.1298172Z Unpacking librtmp1:amd64 (2.4+20151223.gitfa8646d.1-2build7) ... 2026-02-21T08:06:34.1511598Z Selecting previously unselected package libssh-4:amd64. 2026-02-21T08:06:34.1516360Z Preparing to unpack .../26-libssh-4_0.10.6-2ubuntu0.3_amd64.deb ... 2026-02-21T08:06:34.1540738Z Unpacking libssh-4:amd64 (0.10.6-2ubuntu0.3) ... 2026-02-21T08:06:34.1785939Z Selecting previously unselected package libcurl3t64-gnutls:amd64. 2026-02-21T08:06:34.1794583Z Preparing to unpack .../27-libcurl3t64-gnutls_8.5.0-2ubuntu10.6_amd64.deb ... 2026-02-21T08:06:34.1821885Z Unpacking libcurl3t64-gnutls:amd64 (8.5.0-2ubuntu10.6) ... 2026-02-21T08:06:34.2033663Z Selecting previously unselected package liberror-perl. 2026-02-21T08:06:34.2040933Z Preparing to unpack .../28-liberror-perl_0.17029-2_all.deb ... 2026-02-21T08:06:34.2066117Z Unpacking liberror-perl (0.17029-2) ... 2026-02-21T08:06:34.2253247Z Selecting previously unselected package git-man. 2026-02-21T08:06:34.2259265Z Preparing to unpack .../29-git-man_1%3a2.43.0-1ubuntu7.3_all.deb ... 2026-02-21T08:06:34.2283957Z Unpacking git-man (1:2.43.0-1ubuntu7.3) ... 2026-02-21T08:06:34.2561865Z Selecting previously unselected package git. 2026-02-21T08:06:34.2567173Z Preparing to unpack .../30-git_1%3a2.43.0-1ubuntu7.3_amd64.deb ... 2026-02-21T08:06:34.2631609Z Unpacking git (1:2.43.0-1ubuntu7.3) ... 
2026-02-21T08:06:34.3648536Z Setting up libexpat1:amd64 (2.6.1-2ubuntu0.4) ... 2026-02-21T08:06:34.3719118Z Setting up libxau6:amd64 (1:1.0.9-1build6) ... 2026-02-21T08:06:34.3793785Z Setting up libkeyutils1:amd64 (1.6.3-3build1) ... 2026-02-21T08:06:34.3858996Z Setting up libcbor0.10:amd64 (0.10.2-1.2ubuntu2) ... 2026-02-21T08:06:34.3925276Z Setting up libbrotli1:amd64 (1.1.0-2build2) ... 2026-02-21T08:06:34.3979537Z Setting up libpsl5t64:amd64 (0.21.2-1.1build1) ... 2026-02-21T08:06:34.4040666Z Setting up libnghttp2-14:amd64 (1.59.0-1ubuntu0.2) ... 2026-02-21T08:06:34.4105087Z Setting up less (590-2ubuntu2.1) ... 2026-02-21T08:06:34.4239954Z Setting up krb5-locales (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:34.4302123Z Setting up libkrb5support0:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:34.4365314Z Setting up liberror-perl (0.17029-2) ... 2026-02-21T08:06:34.4421481Z Setting up libx11-data (2:1.8.7-1build1) ... 2026-02-21T08:06:34.4491497Z Setting up librtmp1:amd64 (2.4+20151223.gitfa8646d.1-2build7) ... 2026-02-21T08:06:34.4563613Z Setting up libk5crypto3:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:34.4617067Z Setting up git-man (1:2.43.0-1ubuntu7.3) ... 2026-02-21T08:06:34.4685074Z Setting up libkrb5-3:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:34.4755895Z Setting up libfido2-1:amd64 (1.14.0-1build3) ... 2026-02-21T08:06:34.4816637Z Setting up libbsd0:amd64 (0.12.1-1build1.1) ... 2026-02-21T08:06:34.4865614Z Setting up publicsuffix (20231001.0357-0.1) ... 2026-02-21T08:06:34.4937418Z Setting up libxdmcp6:amd64 (1:1.1.3-0ubuntu6) ... 2026-02-21T08:06:34.5000357Z Setting up libxcb1:amd64 (1.15-1ubuntu2) ... 2026-02-21T08:06:34.5075954Z Setting up libedit2:amd64 (3.1-20230828-1build1) ... 2026-02-21T08:06:34.5139528Z Setting up libgssapi-krb5-2:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:34.5224105Z Setting up libssh-4:amd64 (0.10.6-2ubuntu0.3) ... 2026-02-21T08:06:34.5304115Z Setting up libx11-6:amd64 (2:1.8.7-1build1) ... 
2026-02-21T08:06:34.5410460Z Setting up libxmuu1:amd64 (2:1.1.3-3build2) ... 2026-02-21T08:06:34.5472556Z Setting up openssh-client (1:9.6p1-3ubuntu13.14) ... 2026-02-21T08:06:34.6042327Z Setting up libcurl3t64-gnutls:amd64 (8.5.0-2ubuntu10.6) ... 2026-02-21T08:06:34.6109673Z Setting up libxext6:amd64 (2:1.3.4-1build2) ... 2026-02-21T08:06:34.6172824Z Setting up git (1:2.43.0-1ubuntu7.3) ... 2026-02-21T08:06:34.6282543Z Setting up xauth (1:1.1.2-1build1) ... 2026-02-21T08:06:34.6348759Z Processing triggers for libc-bin (2.39-0ubuntu8.5) ... 2026-02-21T08:06:34.6732841Z ##[group]Run actions/checkout@v6 2026-02-21T08:06:34.6733124Z with: 2026-02-21T08:06:34.6733450Z repository: pytorch/helion 2026-02-21T08:06:34.6733860Z token: *** 2026-02-21T08:06:34.6734051Z ssh-strict: true 2026-02-21T08:06:34.6734298Z ssh-user: git 2026-02-21T08:06:34.6734499Z persist-credentials: true 2026-02-21T08:06:34.6734748Z clean: true 2026-02-21T08:06:34.6734964Z sparse-checkout-cone-mode: true 2026-02-21T08:06:34.6735235Z fetch-depth: 1 2026-02-21T08:06:34.6735464Z fetch-tags: false 2026-02-21T08:06:34.6735660Z show-progress: true 2026-02-21T08:06:34.6736055Z lfs: false 2026-02-21T08:06:34.6736248Z submodules: false 2026-02-21T08:06:34.6736481Z set-safe-directory: true 2026-02-21T08:06:34.6736699Z env: 2026-02-21T08:06:34.6736922Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:34.6737126Z ##[endgroup] 2026-02-21T08:06:34.6769892Z ##[command]/usr/bin/docker exec 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:06:34.8403050Z Syncing repository: pytorch/helion 2026-02-21T08:06:34.8403994Z ##[group]Getting Git version info 2026-02-21T08:06:34.8404366Z Working directory is '/__w/helion/helion' 2026-02-21T08:06:34.8404787Z [command]/usr/bin/git version 2026-02-21T08:06:34.8410773Z git version 2.43.0 2026-02-21T08:06:34.8423985Z ##[endgroup] 2026-02-21T08:06:34.8440481Z Temporarily overriding 
HOME='/__w/_temp/4c216d0f-717e-4c76-a6ea-eec8e78f2ef5' before making global git config changes 2026-02-21T08:06:34.8441161Z Adding repository directory to the temporary git global config as a safe directory 2026-02-21T08:06:34.8441684Z [command]/usr/bin/git config --global --add safe.directory /__w/helion/helion 2026-02-21T08:06:34.8471455Z Deleting the contents of '/__w/helion/helion' 2026-02-21T08:06:34.8476914Z ##[group]Initializing the repository 2026-02-21T08:06:34.8482516Z [command]/usr/bin/git init /__w/helion/helion 2026-02-21T08:06:34.8506670Z hint: Using 'master' as the name for the initial branch. This default branch name 2026-02-21T08:06:34.8508261Z hint: is subject to change. To configure the initial branch name to use in all 2026-02-21T08:06:34.8508694Z hint: of your new repositories, which will suppress this warning, call: 2026-02-21T08:06:34.8508991Z hint: 2026-02-21T08:06:34.8509267Z hint: git config --global init.defaultBranch 2026-02-21T08:06:34.8509525Z hint: 2026-02-21T08:06:34.8509818Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and 2026-02-21T08:06:34.8510436Z hint: 'development'. 
The just-created branch can be renamed via this command: 2026-02-21T08:06:34.8510790Z hint: 2026-02-21T08:06:34.8511012Z hint: git branch -m 2026-02-21T08:06:34.8511299Z Initialized empty Git repository in /__w/helion/helion/.git/ 2026-02-21T08:06:34.8516754Z [command]/usr/bin/git remote add origin https://github.com/pytorch/helion 2026-02-21T08:06:34.8541332Z ##[endgroup] 2026-02-21T08:06:34.8541763Z ##[group]Disabling automatic garbage collection 2026-02-21T08:06:34.8544045Z [command]/usr/bin/git config --local gc.auto 0 2026-02-21T08:06:34.8568188Z ##[endgroup] 2026-02-21T08:06:34.8568540Z ##[group]Setting up auth 2026-02-21T08:06:34.8568817Z Removing SSH command configuration 2026-02-21T08:06:34.8576593Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2026-02-21T08:06:34.8600143Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2026-02-21T08:06:34.8833979Z Removing HTTP extra header 2026-02-21T08:06:34.8835436Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2026-02-21T08:06:34.8854841Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2026-02-21T08:06:34.9074576Z Removing includeIf entries pointing to credentials config files 2026-02-21T08:06:34.9077230Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2026-02-21T08:06:34.9101066Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2026-02-21T08:06:34.9328967Z [command]/usr/bin/git config --file /__w/_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config http.https://github.com/.extraheader 
AUTHORIZATION: basic *** 2026-02-21T08:06:34.9369445Z [command]/usr/bin/git config --local includeIf.gitdir:/__w/helion/helion/.git.path /__w/_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config 2026-02-21T08:06:34.9397004Z [command]/usr/bin/git config --local includeIf.gitdir:/__w/helion/helion/.git/worktrees/*.path /__w/_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config 2026-02-21T08:06:34.9417609Z [command]/usr/bin/git config --local includeIf.gitdir:/github/workspace/.git.path /github/runner_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config 2026-02-21T08:06:34.9442382Z [command]/usr/bin/git config --local includeIf.gitdir:/github/workspace/.git/worktrees/*.path /github/runner_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config 2026-02-21T08:06:34.9463874Z ##[endgroup] 2026-02-21T08:06:34.9464233Z ##[group]Fetching the repository 2026-02-21T08:06:34.9470358Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +874a7d0cadab18218a84ad3579d329dc95c51820:refs/remotes/origin/main 2026-02-21T08:06:35.4130214Z From https://github.com/pytorch/helion 2026-02-21T08:06:35.4130890Z * [new ref] 874a7d0cadab18218a84ad3579d329dc95c51820 -> origin/main 2026-02-21T08:06:35.4156029Z [command]/usr/bin/git branch --list --remote origin/main 2026-02-21T08:06:35.4178929Z origin/main 2026-02-21T08:06:35.4181012Z [command]/usr/bin/git rev-parse refs/remotes/origin/main 2026-02-21T08:06:35.4199272Z 874a7d0cadab18218a84ad3579d329dc95c51820 2026-02-21T08:06:35.4206368Z ##[endgroup] 2026-02-21T08:06:35.4206751Z ##[group]Determining the checkout info 2026-02-21T08:06:35.4207156Z ##[endgroup] 2026-02-21T08:06:35.4207474Z [command]/usr/bin/git sparse-checkout disable 2026-02-21T08:06:35.4241955Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig 2026-02-21T08:06:35.4260865Z ##[group]Checking out the ref 2026-02-21T08:06:35.4261395Z [command]/usr/bin/git 
checkout --progress --force -B main refs/remotes/origin/main 2026-02-21T08:06:35.4460235Z Switched to a new branch 'main' 2026-02-21T08:06:35.4463852Z branch 'main' set up to track 'origin/main'. 2026-02-21T08:06:35.4466429Z ##[endgroup] 2026-02-21T08:06:35.4493215Z [command]/usr/bin/git log -1 --format=%H 2026-02-21T08:06:35.4515240Z 874a7d0cadab18218a84ad3579d329dc95c51820 2026-02-21T08:06:35.4653425Z ##[group]Run actions/setup-python@v6 2026-02-21T08:06:35.4653645Z with: 2026-02-21T08:06:35.4653868Z python-version: 3.12 2026-02-21T08:06:35.4654126Z check-latest: false 2026-02-21T08:06:35.4654395Z token: *** 2026-02-21T08:06:35.4654625Z update-environment: true 2026-02-21T08:06:35.4654832Z allow-prereleases: false 2026-02-21T08:06:35.4655040Z freethreaded: false 2026-02-21T08:06:35.4655231Z env: 2026-02-21T08:06:35.4655436Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:35.4655653Z ##[endgroup] 2026-02-21T08:06:35.4659048Z ##[command]/usr/bin/docker exec 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:06:35.6652821Z ##[group]Installed versions 2026-02-21T08:06:35.6664839Z Version 3.12 was not found in the local cache 2026-02-21T08:06:36.4925738Z Version 3.12 is available for downloading 2026-02-21T08:06:36.4926399Z Download from "https://github.com/actions/python-versions/releases/download/3.12.12-18393146713/python-3.12.12-linux-24.04-x64.tar.gz" 2026-02-21T08:06:37.3676211Z Extract downloaded archive 2026-02-21T08:06:37.3801877Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /__w/_temp/76928f81-147e-4c44-8c08-6641cfb954e6 -f /__w/_temp/f4fc44dd-7761-48f1-9ebb-70b591183c16 2026-02-21T08:06:39.3035322Z Execute installation script 2026-02-21T08:06:39.3146433Z Check if Python hostedtoolcache folder exist... 2026-02-21T08:06:39.3146850Z Creating Python hostedtoolcache folder... 
2026-02-21T08:06:39.3154160Z Create Python 3.12.12 folder 2026-02-21T08:06:39.3168375Z Copy Python binaries to hostedtoolcache folder 2026-02-21T08:06:39.6927270Z Create additional symlinks (Required for the UsePythonVersion Azure Pipelines task and the setup-python GitHub Action) 2026-02-21T08:06:39.6965996Z Upgrading pip... 2026-02-21T08:06:41.0438233Z Looking in links: /tmp/tmpexo42m5v 2026-02-21T08:06:41.0440373Z Requirement already satisfied: pip in /__w/_tool/Python/3.12.12/x64/lib/python3.12/site-packages (25.0.1) 2026-02-21T08:06:41.0476502Z ##[error]WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning. 2026-02-21T08:06:41.6149159Z ##[error]WARNING: The directory '/github/home/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag. 
2026-02-21T08:06:41.7707157Z Collecting pip 2026-02-21T08:06:41.8095101Z Downloading pip-26.0.1-py3-none-any.whl.metadata (4.7 kB) 2026-02-21T08:06:41.8173989Z Downloading pip-26.0.1-py3-none-any.whl (1.8 MB) 2026-02-21T08:06:41.8461167Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 122.2 MB/s eta 0:00:00 2026-02-21T08:06:41.8553274Z Installing collected packages: pip 2026-02-21T08:06:41.8554829Z Attempting uninstall: pip 2026-02-21T08:06:41.8565713Z Found existing installation: pip 25.0.1 2026-02-21T08:06:41.8738364Z Uninstalling pip-25.0.1: 2026-02-21T08:06:41.8770138Z Successfully uninstalled pip-25.0.1 2026-02-21T08:06:42.4390583Z Successfully installed pip-26.0.1 2026-02-21T08:06:42.4869366Z Create complete file 2026-02-21T08:06:42.4906980Z Successfully set up CPython (3.12.12) 2026-02-21T08:06:42.4907506Z ##[endgroup] 2026-02-21T08:06:42.5115837Z ##[group]Run astral-sh/setup-uv@v7 2026-02-21T08:06:42.5116074Z with: 2026-02-21T08:06:42.5116312Z activate-environment: false 2026-02-21T08:06:42.5116584Z working-directory: /home/charlie/_work/helion/helion 2026-02-21T08:06:42.5117095Z github-token: *** 2026-02-21T08:06:42.5117345Z enable-cache: auto 2026-02-21T08:06:42.5117853Z cache-dependency-glob: **/*requirements*.txt **/*requirements*.in **/*constraints*.txt **/*constraints*.in **/pyproject.toml **/uv.lock **/*.py.lock 2026-02-21T08:06:42.5118357Z restore-cache: true 2026-02-21T08:06:42.5118566Z save-cache: true 2026-02-21T08:06:42.5118794Z prune-cache: true 2026-02-21T08:06:42.5119015Z cache-python: false 2026-02-21T08:06:42.5119243Z ignore-nothing-to-cache: false 2026-02-21T08:06:42.5119500Z ignore-empty-workdir: false 2026-02-21T08:06:42.5119709Z add-problem-matchers: true 2026-02-21T08:06:42.5119989Z resolution-strategy: highest 2026-02-21T08:06:42.5120197Z env: 2026-02-21T08:06:42.5120514Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:42.5120762Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:42.5121093Z PKG_CONFIG_PATH: 
/__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:42.5121388Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:42.5121740Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:42.5122032Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:42.5122522Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:42.5122961Z ##[endgroup] 2026-02-21T08:06:42.5129239Z ##[command]/usr/bin/docker exec 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:06:42.7393744Z (node:802) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead. 2026-02-21T08:06:42.7395801Z (Use `node --trace-deprecation ...` to show where the warning was created) 2026-02-21T08:06:42.7464934Z Trying to find version for uv in: /__w/helion/helion/uv.toml 2026-02-21T08:06:42.7465378Z Could not find file: /__w/helion/helion/uv.toml 2026-02-21T08:06:42.7465969Z Trying to find version for uv in: /__w/helion/helion/pyproject.toml 2026-02-21T08:06:42.7471631Z Could not determine uv version from uv.toml or pyproject.toml. Falling back to latest. 2026-02-21T08:06:42.7472310Z Getting latest version from GitHub API... 2026-02-21T08:06:43.0271787Z manifest-file not provided, reading from local file. 2026-02-21T08:06:43.0306776Z manifest-file does not contain version 0.10.4, arch x86_64, platform unknown-linux-gnu. Falling back to GitHub releases. 2026-02-21T08:06:43.0310021Z Downloading uv from "https://github.com/astral-sh/uv/releases/download/0.10.4/uv-x86_64-unknown-linux-gnu.tar.gz" ... 
2026-02-21T08:06:43.3415128Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /__w/_temp/52f4ea79-f6cf-4673-b3a4-42e3b22b5a32 -f /__w/_temp/9043aff8-f81b-4d13-ad8e-c41d78bd8e10 2026-02-21T08:06:43.7246483Z Added /github/home/.local/bin to the path 2026-02-21T08:06:43.7248715Z Added /__w/_tool/uv/0.10.4/x86_64 to the path 2026-02-21T08:06:43.7249136Z Set UV_PYTHON_INSTALL_DIR to /github/home/.local/share/uv/python 2026-02-21T08:06:43.7249525Z Added /github/home/.local/share/uv/python to the path 2026-02-21T08:06:43.7255736Z Successfully installed uv version 0.10.4 2026-02-21T08:06:43.8671140Z ##[group]Run uv venv --python 3.12 2026-02-21T08:06:43.8671476Z uv venv --python 3.12 2026-02-21T08:06:43.8671887Z shell: bash -l {0} 2026-02-21T08:06:43.8672110Z env: 2026-02-21T08:06:43.8672357Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:43.8672622Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:43.8672959Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:43.8673246Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:43.8673574Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:43.8673831Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:43.8674263Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:43.8674779Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:06:43.8675041Z ##[endgroup] 2026-02-21T08:06:43.9859704Z Using CPython 3.12.12 interpreter at: /__w/_tool/Python/3.12.12/x64/bin/python3.12 2026-02-21T08:06:43.9860558Z Creating virtual environment at: .venv 2026-02-21T08:06:43.9860937Z Activate with: source .venv/bin/activate 2026-02-21T08:06:43.9933966Z ##[group]Run source .venv/bin/activate 2026-02-21T08:06:43.9934291Z source .venv/bin/activate 2026-02-21T08:06:43.9934656Z uv pip install -U "torch==2.9.*" --index-url 
https://download.pytorch.org/whl/cu130 2026-02-21T08:06:43.9935108Z shell: bash -l {0} 2026-02-21T08:06:43.9935339Z env: 2026-02-21T08:06:43.9935519Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:43.9935802Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:43.9936102Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:43.9936410Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:43.9936666Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:43.9936986Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:43.9937417Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:43.9937946Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:06:43.9938205Z ##[endgroup] 2026-02-21T08:06:44.6154029Z Resolved 26 packages in 528ms 2026-02-21T08:06:44.6226361Z Downloading networkx (2.0MiB) 2026-02-21T08:06:44.6267518Z Downloading sympy (6.0MiB) 2026-02-21T08:06:44.6270212Z Downloading nvidia-cufft (204.2MiB) 2026-02-21T08:06:44.6307558Z Downloading nvidia-cuda-cupti (10.2MiB) 2026-02-21T08:06:44.6421845Z Downloading nvidia-cuda-runtime (2.1MiB) 2026-02-21T08:06:44.6500628Z Downloading nvidia-cusolver (184.5MiB) 2026-02-21T08:06:44.6505596Z Downloading nvidia-curand (56.8MiB) 2026-02-21T08:06:44.6507252Z Downloading nvidia-cufile (1.2MiB) 2026-02-21T08:06:44.6512194Z Downloading nvidia-nvjitlink (38.8MiB) 2026-02-21T08:06:44.6516450Z Downloading nvidia-cudnn-cu13 (332.4MiB) 2026-02-21T08:06:44.6518018Z Downloading nvidia-cusparse (133.8MiB) 2026-02-21T08:06:44.6518328Z Downloading triton (162.6MiB) 2026-02-21T08:06:44.6518587Z Downloading torch (584.2MiB) 2026-02-21T08:06:44.6600300Z Downloading nvidia-nvshmem-cu13 (57.6MiB) 2026-02-21T08:06:44.6670168Z Downloading nvidia-cusparselt-cu13 (162.0MiB) 2026-02-21T08:06:44.6794128Z Downloading nvidia-cuda-nvrtc (86.0MiB) 2026-02-21T08:06:44.6818942Z 
Downloading nvidia-nccl-cu13 (184.9MiB) 2026-02-21T08:06:44.6938861Z Downloading nvidia-cublas (400.0MiB) 2026-02-21T08:06:45.0604659Z Downloaded nvidia-cufile 2026-02-21T08:06:45.1960745Z Downloaded nvidia-cuda-runtime 2026-02-21T08:06:45.7926215Z Downloaded networkx 2026-02-21T08:06:46.2374903Z Downloaded nvidia-cuda-cupti 2026-02-21T08:06:47.5020805Z Downloaded sympy 2026-02-21T08:06:47.6670899Z Downloaded triton 2026-02-21T08:06:48.6856059Z Downloaded nvidia-nvjitlink 2026-02-21T08:06:49.6266982Z Downloaded nvidia-curand 2026-02-21T08:06:49.8723445Z Downloaded nvidia-nvshmem-cu13 2026-02-21T08:06:50.9419276Z Downloaded nvidia-cuda-nvrtc 2026-02-21T08:06:52.1602073Z Downloaded nvidia-cusparse 2026-02-21T08:06:52.2649528Z Downloaded nvidia-cufft 2026-02-21T08:06:52.6007897Z Downloaded nvidia-cusparselt-cu13 2026-02-21T08:06:52.7733060Z Downloaded nvidia-cusolver 2026-02-21T08:06:53.0037862Z Downloaded nvidia-nccl-cu13 2026-02-21T08:06:54.0746405Z Downloaded nvidia-cudnn-cu13 2026-02-21T08:06:54.5562613Z Downloaded nvidia-cublas 2026-02-21T08:07:00.3320083Z Downloaded torch 2026-02-21T08:07:00.3328180Z Prepared 26 packages in 15.71s 2026-02-21T08:07:00.3358578Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:07:00.3359259Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:07:00.3359855Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:07:01.9071449Z Installed 26 packages in 1.57s 2026-02-21T08:07:01.9073288Z + filelock==3.20.0 2026-02-21T08:07:01.9073766Z + fsspec==2025.12.0 2026-02-21T08:07:01.9073990Z + jinja2==3.1.6 2026-02-21T08:07:01.9074271Z + markupsafe==3.0.2 2026-02-21T08:07:01.9075081Z + mpmath==1.3.0 2026-02-21T08:07:01.9075309Z + networkx==3.6.1 2026-02-21T08:07:01.9080699Z + nvidia-cublas==13.0.0.19 2026-02-21T08:07:01.9084778Z + nvidia-cuda-cupti==13.0.48 2026-02-21T08:07:01.9086567Z + nvidia-cuda-nvrtc==13.0.48 2026-02-21T08:07:01.9086916Z + nvidia-cuda-runtime==13.0.48 2026-02-21T08:07:01.9087415Z + nvidia-cudnn-cu13==9.13.0.50 2026-02-21T08:07:01.9087713Z + nvidia-cufft==12.0.0.15 2026-02-21T08:07:01.9087951Z + nvidia-cufile==1.15.0.42 2026-02-21T08:07:01.9088321Z + nvidia-curand==10.4.0.35 2026-02-21T08:07:01.9088572Z + nvidia-cusolver==12.0.3.29 2026-02-21T08:07:01.9088863Z + nvidia-cusparse==12.6.2.49 2026-02-21T08:07:01.9089197Z + nvidia-cusparselt-cu13==0.8.0 2026-02-21T08:07:01.9089497Z + nvidia-nccl-cu13==2.27.7 2026-02-21T08:07:01.9089730Z + nvidia-nvjitlink==13.0.39 2026-02-21T08:07:01.9090065Z + nvidia-nvshmem-cu13==3.3.24 2026-02-21T08:07:01.9090363Z + nvidia-nvtx==13.0.39 2026-02-21T08:07:01.9090602Z + setuptools==70.2.0 2026-02-21T08:07:01.9090919Z + sympy==1.14.0 2026-02-21T08:07:01.9091141Z + torch==2.9.1+cu130 2026-02-21T08:07:01.9091404Z + triton==3.5.1 2026-02-21T08:07:01.9091736Z + typing-extensions==4.15.0 2026-02-21T08:07:01.9190058Z ##[group]Run source .venv/bin/activate 2026-02-21T08:07:01.9190365Z source .venv/bin/activate 2026-02-21T08:07:01.9190730Z SETUPTOOLS_SCM_PRETEND_VERSION="0.0.0" uv pip install -e .'[dev]' 2026-02-21T08:07:01.9191120Z python -c "import helion; print(helion.__name__)" 2026-02-21T08:07:01.9191686Z shell: bash -l {0} 2026-02-21T08:07:01.9191924Z env: 2026-02-21T08:07:01.9192131Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:07:01.9192651Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:01.9193089Z 
PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:07:01.9193542Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:01.9193816Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:01.9194113Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:01.9194671Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:07:01.9195116Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:07:01.9195459Z ##[endgroup] 2026-02-21T08:07:02.9027396Z Resolved 30 packages in 882ms 2026-02-21T08:07:02.9041978Z Building helion @ file:///__w/helion/helion 2026-02-21T08:07:02.9226899Z Downloading pygments (1.2MiB) 2026-02-21T08:07:02.9231896Z Downloading virtualenv (5.6MiB) 2026-02-21T08:07:02.9232229Z Downloading numpy (15.8MiB) 2026-02-21T08:07:02.9427522Z Downloading scikit-learn (8.5MiB) 2026-02-21T08:07:02.9449936Z Downloading scipy (33.4MiB) 2026-02-21T08:07:03.0455791Z Built helion @ file:///__w/helion/helion 2026-02-21T08:07:03.2235212Z Downloaded virtualenv 2026-02-21T08:07:03.2363005Z Downloaded pygments 2026-02-21T08:07:03.9617187Z Downloaded scikit-learn 2026-02-21T08:07:03.9621527Z Downloaded numpy 2026-02-21T08:07:04.4162150Z Downloaded scipy 2026-02-21T08:07:04.4170531Z Prepared 27 packages in 1.51s 2026-02-21T08:07:04.4174763Z Uninstalled 1 package in 0.37ms 2026-02-21T08:07:04.4184545Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:07:04.4186308Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:07:04.4186929Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:07:05.0852052Z Installed 29 packages in 667ms 2026-02-21T08:07:05.0852446Z + cfgv==3.5.0 2026-02-21T08:07:05.0857108Z + distlib==0.4.0 2026-02-21T08:07:05.0859015Z + expecttest==0.3.0 2026-02-21T08:07:05.0859369Z + filecheck==1.0.3 2026-02-21T08:07:05.0859753Z - filelock==3.20.0 2026-02-21T08:07:05.0859989Z + filelock==3.24.3 2026-02-21T08:07:05.0860266Z + helion==0.0.0 (from file:///__w/helion/helion) 2026-02-21T08:07:05.0860517Z + hypothesis==6.151.9 2026-02-21T08:07:05.0860810Z + identify==2.6.16 2026-02-21T08:07:05.0861000Z + iniconfig==2.3.0 2026-02-21T08:07:05.0861229Z + joblib==1.5.3 2026-02-21T08:07:05.0861681Z + markdown-it-py==4.0.0 2026-02-21T08:07:05.0861925Z + mdurl==0.1.2 2026-02-21T08:07:05.0862132Z + nodeenv==1.10.0 2026-02-21T08:07:05.0862378Z + numpy==2.4.2 2026-02-21T08:07:05.0862595Z + packaging==26.0 2026-02-21T08:07:05.0862795Z + platformdirs==4.9.2 2026-02-21T08:07:05.0863040Z + pluggy==1.6.0 2026-02-21T08:07:05.0863219Z + pre-commit==4.5.1 2026-02-21T08:07:05.0863438Z + psutil==7.2.2 2026-02-21T08:07:05.0863625Z + pygments==2.19.2 2026-02-21T08:07:05.0863852Z + pytest==9.0.2 2026-02-21T08:07:05.0864033Z + pytest-timeout==2.4.0 2026-02-21T08:07:05.0864274Z + pyyaml==6.0.3 2026-02-21T08:07:05.0864487Z + rich==14.3.3 2026-02-21T08:07:05.0864673Z + scikit-learn==1.8.0 2026-02-21T08:07:05.0864896Z + scipy==1.17.0 2026-02-21T08:07:05.0865101Z + sortedcontainers==2.4.0 2026-02-21T08:07:05.0865348Z + threadpoolctl==3.6.0 2026-02-21T08:07:05.0865537Z + virtualenv==20.38.0 2026-02-21T08:07:17.1532627Z helion 2026-02-21T08:07:18.0976091Z ##[group]Run set -x 2026-02-21T08:07:18.0976321Z set -x 2026-02-21T08:07:18.0976584Z source .venv/bin/activate 2026-02-21T08:07:18.0976859Z uv pip install pip 2026-02-21T08:07:18.0977098Z uv pip install quack-kernels --no-deps 2026-02-21T08:07:18.0977439Z mkdir -p benchmarks/ && pushd benchmarks/ 2026-02-21T08:07:18.0977766Z git clone https://github.com/pytorch-labs/tritonbench/ 
2026-02-21T08:07:18.0978080Z pushd tritonbench/ 2026-02-21T08:07:18.0978486Z git submodule update --init --recursive 2026-02-21T08:07:18.0978799Z uv pip install -r requirements.txt 2026-02-21T08:07:18.0979068Z python install.py --liger 2026-02-21T08:07:18.0979336Z uv pip install -e . --no-deps 2026-02-21T08:07:18.0979696Z popd 2026-02-21T08:07:18.0979882Z popd 2026-02-21T08:07:18.0980228Z shell: bash -l {0} 2026-02-21T08:07:18.0980423Z env: 2026-02-21T08:07:18.0980732Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:07:18.0981001Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:18.0981329Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:07:18.0981698Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:18.0981985Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:18.0982281Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:18.0982691Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:07:18.0983209Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:07:18.0983475Z ##[endgroup] 2026-02-21T08:07:21.8834694Z + source .venv/bin/activate 2026-02-21T08:07:21.8838986Z ++ '[' -z '' ']' 2026-02-21T08:07:21.8843061Z ++ '[' -n x ']' 2026-02-21T08:07:21.8846248Z ++ SCRIPT_PATH=.venv/bin/activate 2026-02-21T08:07:21.8847623Z ++ '[' .venv/bin/activate = /__w/_temp/043c16d0-a9f6-432c-a45e-a8ca6f064229.sh ']' 2026-02-21T08:07:21.8848041Z ++ deactivate nondestructive 2026-02-21T08:07:21.8848333Z ++ unset -f pydoc 2026-02-21T08:07:21.8848602Z ++ '[' -z '' ']' 2026-02-21T08:07:21.8848869Z ++ '[' -z '' ']' 2026-02-21T08:07:21.8849090Z ++ hash -r 2026-02-21T08:07:21.8849352Z ++ '[' -z '' ']' 2026-02-21T08:07:21.8849600Z ++ unset VIRTUAL_ENV 2026-02-21T08:07:21.8849882Z ++ unset VIRTUAL_ENV_PROMPT 2026-02-21T08:07:21.8850204Z ++ '[' '!' 
nondestructive = nondestructive ']' 2026-02-21T08:07:21.8850545Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv 2026-02-21T08:07:21.8850883Z ++ '[' linux-gnu = cygwin ']' 2026-02-21T08:07:21.8851141Z ++ '[' linux-gnu = msys ']' 2026-02-21T08:07:21.8851441Z ++ export VIRTUAL_ENV 2026-02-21T08:07:21.8851935Z ++ '[' -z '' ']' 2026-02-21T08:07:21.8852158Z ++ unset SCRIPT_PATH 2026-02-21T08:07:21.8852896Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:07:21.8854060Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:07:21.8854792Z ++ export PATH 2026-02-21T08:07:21.8855017Z ++ '[' xhelion '!=' x ']' 2026-02-21T08:07:21.8855236Z ++ VIRTUAL_ENV_PROMPT=helion 2026-02-21T08:07:21.8855495Z ++ export VIRTUAL_ENV_PROMPT 2026-02-21T08:07:21.8855705Z ++ '[' -z '' ']' 2026-02-21T08:07:21.8855908Z ++ '[' -z '' ']' 2026-02-21T08:07:21.8856077Z ++ _OLD_VIRTUAL_PS1= 2026-02-21T08:07:21.8856333Z ++ PS1='(helion) ' 2026-02-21T08:07:21.8856513Z ++ export PS1 2026-02-21T08:07:21.8857418Z ++ alias pydoc 2026-02-21T08:07:21.8857658Z ++ true 2026-02-21T08:07:21.8857826Z ++ hash -r 2026-02-21T08:07:21.8858340Z + uv pip install pip 2026-02-21T08:07:21.9416721Z Resolved 1 package in 49ms 2026-02-21T08:07:21.9477670Z Downloading pip (1.7MiB) 2026-02-21T08:07:22.0072209Z Downloaded pip 2026-02-21T08:07:22.0074227Z Prepared 1 package in 65ms 2026-02-21T08:07:22.0101765Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 
2026-02-21T08:07:22.0102337Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:07:22.0103356Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T08:07:22.3975199Z Installed 1 package in 389ms 2026-02-21T08:07:22.3977674Z + pip==26.0.1 2026-02-21T08:07:22.4006096Z + uv pip install quack-kernels --no-deps 2026-02-21T08:07:22.4979596Z Resolved 1 package in 91ms 2026-02-21T08:07:22.5476188Z Prepared 1 package in 49ms 2026-02-21T08:07:22.5514613Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:07:22.5515287Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:07:22.5515877Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T08:07:22.5527082Z Installed 1 package in 5ms 2026-02-21T08:07:22.5527575Z + quack-kernels==0.2.10 2026-02-21T08:07:22.5546929Z + mkdir -p benchmarks/ 2026-02-21T08:07:22.5556600Z + pushd benchmarks/ 2026-02-21T08:07:22.5556984Z + git clone https://github.com/pytorch-labs/tritonbench/ 2026-02-21T08:07:22.5557437Z /__w/helion/helion/benchmarks /__w/helion/helion 2026-02-21T08:07:22.5568638Z Cloning into 'tritonbench'... 
2026-02-21T08:07:25.4910170Z + pushd tritonbench/ 2026-02-21T08:07:25.4910668Z /__w/helion/helion/benchmarks/tritonbench /__w/helion/helion/benchmarks /__w/helion/helion 2026-02-21T08:07:25.4911726Z + git submodule update --init --recursive 2026-02-21T08:07:25.8009916Z Submodule 'submodules/ThunderKittens' (https://github.com/HazyResearch/ThunderKittens.git) registered for path 'submodules/ThunderKittens' 2026-02-21T08:07:25.8527979Z Submodule 'submodules/aiter' (https://github.com/ROCm/aiter.git) registered for path 'submodules/aiter' 2026-02-21T08:07:26.1946880Z Submodule 'submodules/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/cutlass' 2026-02-21T08:07:26.3212008Z Submodule 'submodules/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'submodules/flash-attention' 2026-02-21T08:07:26.3348336Z Submodule 'submodules/generative-recommenders' (https://github.com/facebookresearch/generative-recommenders.git) registered for path 'submodules/generative-recommenders' 2026-02-21T08:07:26.3373116Z Submodule 'submodules/xformers' (https://github.com/facebookresearch/xformers.git) registered for path 'submodules/xformers' 2026-02-21T08:07:26.3395269Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/ThunderKittens'... 2026-02-21T08:07:31.1777253Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/aiter'... 2026-02-21T08:07:48.5620461Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/cutlass'... 2026-02-21T08:07:54.5535122Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention'... 2026-02-21T08:07:58.2049748Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/generative-recommenders'... 2026-02-21T08:08:01.2146603Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers'... 
2026-02-21T08:08:05.9258402Z Submodule path 'submodules/ThunderKittens': checked out '25f7568450b412a1984a4f619fb28373df06fa1b' 2026-02-21T08:08:06.2188629Z Submodule path 'submodules/aiter': checked out '1f5b378dcc9d9b0bcd9456c8c767b7424a5e8190' 2026-02-21T08:08:06.4387316Z Submodule '3rdparty/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/aiter/3rdparty/composable_kernel' 2026-02-21T08:08:06.4413869Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/aiter/3rdparty/composable_kernel'... 2026-02-21T08:08:12.7430126Z Submodule path 'submodules/aiter/3rdparty/composable_kernel': checked out 'e31a7a4f29b371c32ea9daf9211b6ae1fed2fa40' 2026-02-21T08:08:13.2052710Z Submodule path 'submodules/cutlass': checked out 'ad7b2f5e84fcfa124cb02b91d5bd26d238c0459e' 2026-02-21T08:08:13.2792919Z Submodule path 'submodules/flash-attention': checked out '43375aab2893018dfb7950db1cfa623c14946ad6' 2026-02-21T08:08:13.2833520Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/flash-attention/csrc/composable_kernel' 2026-02-21T08:08:13.2865870Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/flash-attention/csrc/cutlass' 2026-02-21T08:08:13.2892847Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention/csrc/composable_kernel'... 2026-02-21T08:08:17.7259319Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention/csrc/cutlass'... 
2026-02-21T08:08:21.4016728Z Submodule path 'submodules/flash-attention/csrc/composable_kernel': checked out 'e8709c24f403173ad21a2da907d1347957e324fb' 2026-02-21T08:08:21.8907552Z Submodule path 'submodules/flash-attention/csrc/cutlass': checked out 'b1d6e2c9b334dfa811e4183dfbd02419249e4b52' 2026-02-21T08:08:21.9162486Z Submodule path 'submodules/generative-recommenders': checked out '88512dbd71b053226bc4ef8ec1630e3db53e55e5' 2026-02-21T08:08:21.9180384Z Submodule 'generative_recommenders/ops/cpp/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass' 2026-02-21T08:08:21.9204201Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass'... 2026-02-21T08:08:25.9087037Z Submodule path 'submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass': checked out 'dc4817921edda44a549197ff3a9dcf5df0636e7b' 2026-02-21T08:08:25.9682457Z Submodule path 'submodules/xformers': checked out '8fc8ec5a4d6498ff81c0c418b89bbaf133ae3a44' 2026-02-21T08:08:25.9693055Z Submodule 'third_party/composable_kernel_tiled' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/xformers/third_party/composable_kernel_tiled' 2026-02-21T08:08:25.9701976Z Submodule 'third_party/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/xformers/third_party/cutlass' 2026-02-21T08:08:25.9702788Z Submodule 'third_party/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'submodules/xformers/third_party/flash-attention' 2026-02-21T08:08:25.9725483Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/composable_kernel_tiled'... 2026-02-21T08:08:29.8770373Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/cutlass'... 
2026-02-21T08:08:33.5053746Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention'... 2026-02-21T08:08:34.4995515Z Submodule path 'submodules/xformers/third_party/composable_kernel_tiled': checked out '4f54fa30583704f34da2ac50372d524cae6bad7d' 2026-02-21T08:08:34.9312671Z Submodule path 'submodules/xformers/third_party/cutlass': checked out 'e9627ce55b42fd2599f58cd4396da9380954def0' 2026-02-21T08:08:34.9811196Z Submodule path 'submodules/xformers/third_party/flash-attention': checked out '979702c87a8713a8e0a5e9fee122b90d2ef13be5' 2026-02-21T08:08:34.9824880Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/xformers/third_party/flash-attention/csrc/composable_kernel' 2026-02-21T08:08:34.9829936Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/xformers/third_party/flash-attention/csrc/cutlass' 2026-02-21T08:08:34.9851749Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention/csrc/composable_kernel'... 2026-02-21T08:08:38.8817439Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention/csrc/cutlass'... 
2026-02-21T08:08:43.0156774Z Submodule path 'submodules/xformers/third_party/flash-attention/csrc/composable_kernel': checked out '888317e698e9803c62bd38568abc9e05d7709f33' 2026-02-21T08:08:43.4349597Z Submodule path 'submodules/xformers/third_party/flash-attention/csrc/cutlass': checked out 'c506e16788cb08416a4a57e11a9067beeee29420' 2026-02-21T08:08:43.4389697Z + uv pip install -r requirements.txt 2026-02-21T08:08:43.4461056Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:08:43.6143613Z Resolved 30 packages in 167ms 2026-02-21T08:08:43.6251161Z Downloading pillow (6.7MiB) 2026-02-21T08:08:43.6251404Z Downloading hf-xet (3.2MiB) 2026-02-21T08:08:43.6251795Z Downloading kiwisolver (1.4MiB) 2026-02-21T08:08:43.6301777Z Downloading tokenizers (3.0MiB) 2026-02-21T08:08:43.6307472Z Downloading fonttools (4.7MiB) 2026-02-21T08:08:43.6311237Z Downloading matplotlib (8.3MiB) 2026-02-21T08:08:43.6315828Z Downloading transformers (10.3MiB) 2026-02-21T08:08:43.7889914Z Downloaded kiwisolver 2026-02-21T08:08:43.8721030Z Downloaded tokenizers 2026-02-21T08:08:43.8762875Z Downloaded hf-xet 2026-02-21T08:08:44.0743288Z Downloaded pillow 2026-02-21T08:08:44.1199624Z Downloaded fonttools 2026-02-21T08:08:44.1812663Z Downloaded matplotlib 2026-02-21T08:08:45.2564729Z Downloaded transformers 2026-02-21T08:08:45.2576389Z Prepared 23 packages in 1.64s 2026-02-21T08:08:45.2612852Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:08:45.2613428Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:08:45.2614010Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:08:45.3284721Z Installed 23 packages in 70ms 2026-02-21T08:08:45.3289946Z + certifi==2026.1.4 2026-02-21T08:08:45.3293699Z + charset-normalizer==3.4.4 2026-02-21T08:08:45.3295879Z + contourpy==1.3.3 2026-02-21T08:08:45.3296066Z + cycler==0.12.1 2026-02-21T08:08:45.3296280Z + fonttools==4.61.1 2026-02-21T08:08:45.3296434Z + hf-xet==1.2.0 2026-02-21T08:08:45.3299770Z + huggingface-hub==0.36.2 2026-02-21T08:08:45.3303734Z + idna==3.11 2026-02-21T08:08:45.3305087Z + kiwisolver==1.4.9 2026-02-21T08:08:45.3305287Z + matplotlib==3.10.8 2026-02-21T08:08:45.3305449Z + nvidia-ml-py==13.590.48 2026-02-21T08:08:45.3305622Z + pillow==12.1.1 2026-02-21T08:08:45.3305780Z + pyparsing==3.3.2 2026-02-21T08:08:45.3305944Z + python-dateutil==2.9.0.post0 2026-02-21T08:08:45.3306117Z + regex==2026.2.19 2026-02-21T08:08:45.3306255Z + requests==2.32.5 2026-02-21T08:08:45.3306403Z + safetensors==0.7.0 2026-02-21T08:08:45.3306545Z + six==1.17.0 2026-02-21T08:08:45.3306688Z + tabulate==0.9.0 2026-02-21T08:08:45.3306841Z + tokenizers==0.21.4 2026-02-21T08:08:45.3306995Z + tqdm==4.67.3 2026-02-21T08:08:45.3307136Z + transformers==4.53.0 2026-02-21T08:08:45.3307293Z + urllib3==2.6.3 2026-02-21T08:08:45.3380694Z + python install.py --liger 2026-02-21T08:08:49.1621525Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:08:49.1648762Z Audited 6 packages in 3ms 2026-02-21T08:08:49.6447306Z INFO:__main__:[tritonbench] installing liger-kernels... 2026-02-21T08:08:49.6515562Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:08:49.7473039Z Resolved 1 package in 94ms 2026-02-21T08:08:49.8123299Z Prepared 1 package in 65ms 2026-02-21T08:08:49.8169996Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:08:49.8170570Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 
2026-02-21T08:08:49.8171469Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T08:08:49.9358516Z Installed 1 package in 123ms 2026-02-21T08:08:49.9358939Z + liger-kernel-nightly==0.7.0.dev20260219183429 2026-02-21T08:08:49.9396947Z INFO:__main__:[tritonbench] installation complete! 2026-02-21T08:08:50.4101482Z + uv pip install -e . --no-deps 2026-02-21T08:08:50.5533574Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:08:50.5570567Z Resolved 1 package in 2ms 2026-02-21T08:08:50.6112719Z Building tritonbench @ file:///__w/helion/helion/benchmarks/tritonbench 2026-02-21T08:08:51.8192524Z Built tritonbench @ file:///__w/helion/helion/benchmarks/tritonbench 2026-02-21T08:08:51.8207269Z Prepared 1 package in 1.26s 2026-02-21T08:08:51.8214554Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:08:51.8215097Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:08:51.8215557Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:08:51.8216227Z Installed 1 package in 0.54ms 2026-02-21T08:08:51.8216513Z + tritonbench==0.0.1 (from file:///__w/helion/helion/benchmarks/tritonbench) 2026-02-21T08:08:52.0803694Z /__w/helion/helion/benchmarks /__w/helion/helion 2026-02-21T08:08:52.0803962Z /__w/helion/helion 2026-02-21T08:08:52.0805158Z + popd 2026-02-21T08:08:52.0805293Z + popd 2026-02-21T08:08:52.0858715Z ##[group]Run rm -rf /tmp/torchinductor_*/ || true 2026-02-21T08:08:52.0859040Z rm -rf /tmp/torchinductor_*/ || true 2026-02-21T08:08:52.0859235Z  2026-02-21T08:08:52.0859388Z source .venv/bin/activate 2026-02-21T08:08:52.0859559Z  2026-02-21T08:08:52.0859731Z TEST_REPORTS_DIR=$(pwd)/test/test-reports 2026-02-21T08:08:52.0859948Z mkdir -p "$TEST_REPORTS_DIR" 2026-02-21T08:08:52.0860142Z echo "$TEST_REPORTS_DIR" 2026-02-21T08:08:52.0860307Z  2026-02-21T08:08:52.0860450Z KERNEL_LIST="softmax" 2026-02-21T08:08:52.0860646Z for kernel in ${KERNEL_LIST//,/ }; do 2026-02-21T08:08:52.0860871Z  echo "==========================================" 2026-02-21T08:08:52.0861121Z  echo "Running benchmark for kernel: $kernel" 2026-02-21T08:08:52.0861354Z  echo "==========================================" 2026-02-21T08:08:52.0861608Z  2026-02-21T08:08:52.0861860Z  # Get available implementations and baseline for this kernel 2026-02-21T08:08:52.0862260Z  KERNEL_INFO=$(python benchmarks/run.py --list-impls-for-benchmark-ci --op $kernel | grep "^$kernel:") 2026-02-21T08:08:52.0862659Z  IMPLS=$(echo "$KERNEL_INFO" | sed -n 's/.*impls=\([^ ]*\).*/\1/p') 2026-02-21T08:08:52.0862980Z  BASELINE=$(echo "$KERNEL_INFO" | sed -n 's/.*baseline=\([^ ]*\).*/\1/p') 2026-02-21T08:08:52.0863220Z  2026-02-21T08:08:52.0863360Z  if [[ -z "$IMPLS" ]]; then 2026-02-21T08:08:52.0863620Z  echo "Warning: No implementations found for kernel $kernel, skipping..." 
2026-02-21T08:08:52.0863889Z  continue 2026-02-21T08:08:52.0864029Z  fi 2026-02-21T08:08:52.0864179Z  if [[ -z "$BASELINE" ]]; then 2026-02-21T08:08:52.0864427Z  echo "Warning: No baseline found for kernel $kernel, skipping..." 2026-02-21T08:08:52.0864664Z  continue 2026-02-21T08:08:52.0864819Z  fi 2026-02-21T08:08:52.0864971Z  echo "Using baseline: $BASELINE" 2026-02-21T08:08:52.0865213Z  echo "Available implementations for $kernel: $IMPLS" 2026-02-21T08:08:52.0865423Z  2026-02-21T08:08:52.0865583Z  # Do autotuning but do not record the results 2026-02-21T08:08:52.0865791Z  python benchmarks/run.py \ 2026-02-21T08:08:52.0865978Z  --op $kernel \ 2026-02-21T08:08:52.0866161Z  --metrics speedup,accuracy \ 2026-02-21T08:08:52.0866375Z  --latency-measure-mode triton_do_bench \ 2026-02-21T08:08:52.0866586Z  --cudagraph \ 2026-02-21T08:08:52.0866747Z  --only $IMPLS \ 2026-02-21T08:08:52.0866946Z  --only-match-mode prefix-with-baseline \ 2026-02-21T08:08:52.0867150Z  --baseline $BASELINE \ 2026-02-21T08:08:52.0867327Z  --atol 1e-2 \ 2026-02-21T08:08:52.0867501Z  --rtol 1e-2 \ 2026-02-21T08:08:52.0867835Z  --input-sample-mode equally-spaced-k \ 2026-02-21T08:08:52.0868041Z  --keep-going \ 2026-02-21T08:08:52.0868192Z   2026-02-21T08:08:52.0868326Z  2026-02-21T08:08:52.0868451Z  # Relax the GPU 2026-02-21T08:08:52.0868612Z  sleep 2m 2026-02-21T08:08:52.0868746Z  2026-02-21T08:08:52.0868906Z  # Run again with cache and record results 2026-02-21T08:08:52.0869208Z  HELION_PRINT_OUTPUT_CODE=1 HELION_ASSERT_CACHE_HIT=1 python benchmarks/run.py \ 2026-02-21T08:08:52.0869487Z  --op $kernel \ 2026-02-21T08:08:52.0869672Z  --metrics speedup,accuracy \ 2026-02-21T08:08:52.0869890Z  --latency-measure-mode triton_do_bench \ 2026-02-21T08:08:52.0870092Z  --cudagraph \ 2026-02-21T08:08:52.0870243Z  --only $IMPLS \ 2026-02-21T08:08:52.0870552Z  --only-match-mode prefix-with-baseline \ 2026-02-21T08:08:52.0870760Z  --baseline $BASELINE \ 2026-02-21T08:08:52.0870935Z  --atol 1e-2 \ 
2026-02-21T08:08:52.0871091Z  --rtol 1e-2 \ 2026-02-21T08:08:52.0871272Z  --input-sample-mode equally-spaced-k \ 2026-02-21T08:08:52.0871513Z  --output "$TEST_REPORTS_DIR/helionbench.json" \ 2026-02-21T08:08:52.0871766Z  --append-to-output \ 2026-02-21T08:08:52.0871950Z  --keep-going \ 2026-02-21T08:08:52.0872102Z   2026-02-21T08:08:52.0872233Z  2026-02-21T08:08:52.0872409Z  echo "✅ Completed benchmark for kernel: $kernel" 2026-02-21T08:08:52.0872614Z done 2026-02-21T08:08:52.0872748Z  2026-02-21T08:08:52.0872918Z if [[ ! -s "$TEST_REPORTS_DIR/helionbench.json" ]]; then 2026-02-21T08:08:52.0873173Z  echo "❌ helionbench.json is missing or empty" 2026-02-21T08:08:52.0873367Z  exit 1 2026-02-21T08:08:52.0873512Z fi 2026-02-21T08:08:52.0873670Z cat "$TEST_REPORTS_DIR/helionbench.json" 2026-02-21T08:08:52.0873988Z shell: bash -l {0} 2026-02-21T08:08:52.0874125Z env: 2026-02-21T08:08:52.0874264Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:08:52.0874470Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:08:52.0874708Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:08:52.0874954Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:08:52.0875169Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:08:52.0875399Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:08:52.0875762Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:08:52.0876141Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:08:52.0876368Z ##[endgroup] 2026-02-21T08:08:52.1485435Z /__w/helion/helion/test/test-reports 2026-02-21T08:08:52.1485779Z ========================================== 2026-02-21T08:08:52.1486094Z Running benchmark for kernel: softmax 2026-02-21T08:08:52.1486374Z ========================================== 2026-02-21T08:08:56.9200510Z Using baseline: naive_softmax 2026-02-21T08:08:56.9205871Z Available implementations 
for softmax: helion_softmax_tritonbench,torch_compile_softmax,triton_softmax 2026-02-21T08:09:02.1568299Z Using num_inputs=20 for softmax 2026-02-21T08:09:02.2028533Z Running softmax benchmark with Helion implementation... 2026-02-21T08:09:02.2029231Z 2026-02-21T08:09:02.4426258Z Equally-spaced-k mode: Selected 20 equally spaced inputs (total available: 98) 2026-02-21T08:09:02.4430959Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 5, 10, 15, 20, 26, 31, 36, 41, 46, 51, 56, 61, 66, 71, 77, 82, 87, 92, 97] 2026-02-21T08:09:02.4435171Z 2026-02-21T08:09:02.4442277Z 0%| | 0/20 [00:00 {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:09:46.0217630Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:09:46.0217898Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:09:46.0218149Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:09:46.0218394Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:09:46.0218678Z %cst = arith.constant dense<256> : tensor<512x1xi32> 2026-02-21T08:09:46.0219025Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<512xf32> 2026-02-21T08:09:46.0219443Z %cst_1 = arith.constant dense<0xFF800000> : tensor<512xf32> 2026-02-21T08:09:46.0219774Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:09:46.0220023Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:09:46.0220266Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:09:46.0220518Z %c256_i64 = arith.constant 256 : i64 2026-02-21T08:09:46.0220763Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:09:46.0221205Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c256_i32], [%c256_i64, %c1_i64] : , > 2026-02-21T08:09:46.0221903Z %1 = tt.get_program_id x : i32 2026-02-21T08:09:46.0222143Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T08:09:46.0222389Z %3 = arith.minsi %2, %c8_i32 : i32 2026-02-21T08:09:46.0222648Z scf.for %arg2 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T08:09:46.0222936Z %4 = arith.muli %arg2, %c512_i32 : i32 
2026-02-21T08:09:46.0223262Z %5 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:09:46.0223638Z %6 = tt.splat %4 : i32 -> tensor<512xi32> 2026-02-21T08:09:46.0223909Z %7 = arith.addi %6, %5 : tensor<512xi32> 2026-02-21T08:09:46.0224175Z %c240_i32 = arith.constant 240 : i32 2026-02-21T08:09:46.0224423Z %c48_i32 = arith.constant 48 : i32 2026-02-21T08:09:46.0224931Z %8:2 = scf.for %arg3 = %c0_i32 to %c240_i32 step %c48_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<512xf32>, tensor<512xf32>) : i32 { 2026-02-21T08:09:46.0225485Z %59 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:09:46.0225846Z %60 = tt.splat %arg3 : i32 -> tensor<16xi32> 2026-02-21T08:09:46.0226121Z %61 = arith.addi %60, %59 : tensor<16xi32> 2026-02-21T08:09:46.0226475Z %62 = tt.expand_dims %7 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:09:46.0226850Z %63 = arith.muli %62, %cst : tensor<512x1xi32> 2026-02-21T08:09:46.0227207Z %64 = tt.expand_dims %61 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:09:46.0228110Z %65 = tt.broadcast %63 : tensor<512x1xi32> -> tensor<512x16xi32> 2026-02-21T08:09:46.0228479Z %66 = tt.broadcast %64 : tensor<1x16xi32> -> tensor<512x16xi32> 2026-02-21T08:09:46.0228809Z %67 = arith.addi %65, %66 : tensor<512x16xi32> 2026-02-21T08:09:46.0229132Z %68 = tt.splat %arg0 : !tt.ptr -> tensor<512x16x!tt.ptr> 2026-02-21T08:09:46.0229525Z %69 = tt.addptr %68, %67 : tensor<512x16x!tt.ptr>, tensor<512x16xi32> 2026-02-21T08:09:46.0229882Z %70 = tt.load %69 : tensor<512x16x!tt.ptr> 2026-02-21T08:09:46.0230202Z %71 = arith.extf %70 : tensor<512x16xf16> to tensor<512x16xf32> 2026-02-21T08:09:46.0230517Z %72 = "tt.reduce"(%71) <{axis = 1 : i32}> ({ 2026-02-21T08:09:46.0230775Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:46.0231031Z %150 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:09:46.0231287Z tt.reduce.return %150 : f32 2026-02-21T08:09:46.0231777Z }) : 
(tensor<512x16xf32>) -> tensor<512xf32> 2026-02-21T08:09:46.0232084Z %73 = arith.truncf %72 : tensor<512xf32> to tensor<512xf16> 2026-02-21T08:09:46.0232416Z %74 = arith.extf %73 : tensor<512xf16> to tensor<512xf32> 2026-02-21T08:09:46.0232725Z %75 = arith.cmpf ogt, %arg4, %74 : tensor<512xf32> 2026-02-21T08:09:46.0233037Z %76 = arith.cmpf une, %arg4, %arg4 : tensor<512xf32> 2026-02-21T08:09:46.0233330Z %77 = arith.ori %75, %76 : tensor<512xi1> 2026-02-21T08:09:46.0233641Z %78 = arith.select %77, %arg4, %74 : tensor<512xi1>, tensor<512xf32> 2026-02-21T08:09:46.0233989Z %79 = arith.subf %arg4, %78 : tensor<512xf32> 2026-02-21T08:09:46.0234474Z %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512xf32>) -> tensor<512xf32> 2026-02-21T08:09:46.0234970Z %81 = arith.mulf %arg5, %80 : tensor<512xf32> 2026-02-21T08:09:46.0235307Z %82 = tt.expand_dims %78 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32> 2026-02-21T08:09:46.0235706Z %83 = tt.broadcast %82 : tensor<512x1xf32> -> tensor<512x16xf32> 2026-02-21T08:09:46.0236028Z %84 = arith.subf %71, %83 : tensor<512x16xf32> 2026-02-21T08:09:46.0236515Z %85 = tt.extern_elementwise %84 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x16xf32>) -> tensor<512x16xf32> 2026-02-21T08:09:46.0237005Z %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({ 2026-02-21T08:09:46.0237252Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:46.0237497Z %150 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:46.0237737Z tt.reduce.return %150 : f32 2026-02-21T08:09:46.0237990Z }) : (tensor<512x16xf32>) -> tensor<512xf32> 2026-02-21T08:09:46.0238249Z %87 = arith.addf %81, %86 : tensor<512xf32> 2026-02-21T08:09:46.0238493Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T08:09:46.0238741Z %88 = arith.muli %c16_i32, %c1_i32_4 : i32 2026-02-21T08:09:46.0238986Z %89 = arith.addi %arg3, %88 : i32 2026-02-21T08:09:46.0239278Z %90 = tt.make_range {end = 16 : i32, start = 0 : i32} : 
tensor<16xi32> 2026-02-21T08:09:46.0239587Z %91 = tt.splat %89 : i32 -> tensor<16xi32> 2026-02-21T08:09:46.0239838Z %92 = arith.addi %91, %90 : tensor<16xi32> 2026-02-21T08:09:46.0240156Z %93 = tt.expand_dims %7 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:09:46.0240490Z %94 = arith.muli %93, %cst : tensor<512x1xi32> 2026-02-21T08:09:46.0240809Z %95 = tt.expand_dims %92 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:09:46.0241172Z %96 = tt.broadcast %94 : tensor<512x1xi32> -> tensor<512x16xi32> 2026-02-21T08:09:46.0241507Z %97 = tt.broadcast %95 : tensor<1x16xi32> -> tensor<512x16xi32> 2026-02-21T08:09:46.0241845Z %98 = arith.addi %96, %97 : tensor<512x16xi32> 2026-02-21T08:09:46.0242158Z %99 = tt.splat %arg0 : !tt.ptr -> tensor<512x16x!tt.ptr> 2026-02-21T08:09:46.0242621Z %100 = tt.addptr %99, %98 : tensor<512x16x!tt.ptr>, tensor<512x16xi32> 2026-02-21T08:09:46.0242960Z %101 = tt.load %100 : tensor<512x16x!tt.ptr> 2026-02-21T08:09:46.0243281Z %102 = arith.extf %101 : tensor<512x16xf16> to tensor<512x16xf32> 2026-02-21T08:09:46.0243586Z %103 = "tt.reduce"(%102) <{axis = 1 : i32}> ({ 2026-02-21T08:09:46.0243842Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:46.0244081Z %150 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:09:46.0244333Z tt.reduce.return %150 : f32 2026-02-21T08:09:46.0244589Z }) : (tensor<512x16xf32>) -> tensor<512xf32> 2026-02-21T08:09:46.0244882Z %104 = arith.truncf %103 : tensor<512xf32> to tensor<512xf16> 2026-02-21T08:09:46.0245221Z %105 = arith.extf %104 : tensor<512xf16> to tensor<512xf32> 2026-02-21T08:09:46.0245529Z %106 = arith.cmpf ogt, %78, %105 : tensor<512xf32> 2026-02-21T08:09:46.0245901Z %107 = arith.cmpf une, %78, %78 : tensor<512xf32> 2026-02-21T08:09:46.0246169Z %108 = arith.ori %106, %107 : tensor<512xi1> 2026-02-21T08:09:46.0246474Z %109 = arith.select %108, %78, %105 : tensor<512xi1>, tensor<512xf32> 2026-02-21T08:09:46.0246788Z %110 = arith.subf %78, %109 : tensor<512xf32> 
2026-02-21T08:09:46.0247253Z %111 = tt.extern_elementwise %110 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512xf32>) -> tensor<512xf32> 2026-02-21T08:09:46.0247718Z %112 = arith.mulf %87, %111 : tensor<512xf32> 2026-02-21T08:09:46.0248043Z %113 = tt.expand_dims %109 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32> 2026-02-21T08:09:46.0248431Z %114 = tt.broadcast %113 : tensor<512x1xf32> -> tensor<512x16xf32> 2026-02-21T08:09:46.0248751Z %115 = arith.subf %102, %114 : tensor<512x16xf32> 2026-02-21T08:09:46.0249228Z %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x16xf32>) -> tensor<512x16xf32> 2026-02-21T08:09:46.0249708Z %117 = "tt.reduce"(%116) <{axis = 1 : i32}> ({ 2026-02-21T08:09:46.0249947Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:46.0250179Z %150 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:46.0250414Z tt.reduce.return %150 : f32 2026-02-21T08:09:46.0250656Z }) : (tensor<512x16xf32>) -> tensor<512xf32> 2026-02-21T08:09:46.0250954Z %118 = arith.addf %112, %117 : tensor<512xf32> 2026-02-21T08:09:46.0251203Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:09:46.0251450Z %119 = arith.muli %c16_i32, %c2_i32 : i32 2026-02-21T08:09:46.0251743Z %120 = arith.addi %arg3, %119 : i32 2026-02-21T08:09:46.0252035Z %121 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:09:46.0252358Z %122 = tt.splat %120 : i32 -> tensor<16xi32> 2026-02-21T08:09:46.0252621Z %123 = arith.addi %122, %121 : tensor<16xi32> 2026-02-21T08:09:46.0252956Z %124 = tt.expand_dims %7 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:09:46.0253303Z %125 = arith.muli %124, %cst : tensor<512x1xi32> 2026-02-21T08:09:46.0253636Z %126 = tt.expand_dims %123 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:09:46.0254045Z %127 = tt.broadcast %125 : tensor<512x1xi32> -> tensor<512x16xi32> 2026-02-21T08:09:46.0254400Z %128 = tt.broadcast 
%126 : tensor<1x16xi32> -> tensor<512x16xi32> 2026-02-21T08:09:46.0254729Z %129 = arith.addi %127, %128 : tensor<512x16xi32> 2026-02-21T08:09:46.0255070Z %130 = tt.splat %arg0 : !tt.ptr -> tensor<512x16x!tt.ptr> 2026-02-21T08:09:46.0255436Z %131 = tt.addptr %130, %129 : tensor<512x16x!tt.ptr>, tensor<512x16xi32> 2026-02-21T08:09:46.0255751Z %132 = tt.load %131 : tensor<512x16x!tt.ptr> 2026-02-21T08:09:46.0256042Z %133 = arith.extf %132 : tensor<512x16xf16> to tensor<512x16xf32> 2026-02-21T08:09:46.0256334Z %134 = "tt.reduce"(%133) <{axis = 1 : i32}> ({ 2026-02-21T08:09:46.0256630Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:46.0256856Z %150 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:09:46.0257085Z tt.reduce.return %150 : f32 2026-02-21T08:09:46.0257334Z }) : (tensor<512x16xf32>) -> tensor<512xf32> 2026-02-21T08:09:46.0257667Z %135 = arith.truncf %134 : tensor<512xf32> to tensor<512xf16> 2026-02-21T08:09:46.0258010Z %136 = arith.extf %135 : tensor<512xf16> to tensor<512xf32> 2026-02-21T08:09:46.0258347Z %137 = arith.cmpf ogt, %109, %136 : tensor<512xf32> 2026-02-21T08:09:46.0258644Z %138 = arith.cmpf une, %109, %109 : tensor<512xf32> 2026-02-21T08:09:46.0258950Z %139 = arith.ori %137, %138 : tensor<512xi1> 2026-02-21T08:09:46.0259302Z %140 = arith.select %139, %109, %136 : tensor<512xi1>, tensor<512xf32> 2026-02-21T08:09:46.0259645Z %141 = arith.subf %109, %140 : tensor<512xf32> 2026-02-21T08:09:46.0260228Z %142 = tt.extern_elementwise %141 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512xf32>) -> tensor<512xf32> 2026-02-21T08:09:46.0260728Z %143 = arith.mulf %118, %142 : tensor<512xf32> 2026-02-21T08:09:46.0261040Z %144 = tt.expand_dims %140 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32> 2026-02-21T08:09:46.0261401Z %145 = tt.broadcast %144 : tensor<512x1xf32> -> tensor<512x16xf32> 2026-02-21T08:09:46.0261729Z %146 = arith.subf %133, %145 : tensor<512x16xf32> 2026-02-21T08:09:46.0262183Z %147 = tt.extern_elementwise 
%146 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x16xf32>) -> tensor<512x16xf32> 2026-02-21T08:09:46.0262636Z %148 = "tt.reduce"(%147) <{axis = 1 : i32}> ({ 2026-02-21T08:09:46.0262864Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:46.0267841Z %150 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:46.0268095Z tt.reduce.return %150 : f32 2026-02-21T08:09:46.0268370Z }) : (tensor<512x16xf32>) -> tensor<512xf32> 2026-02-21T08:09:46.0268632Z %149 = arith.addf %143, %148 : tensor<512xf32> 2026-02-21T08:09:46.0268916Z scf.yield %140, %149 : tensor<512xf32>, tensor<512xf32> 2026-02-21T08:09:46.0269159Z } 2026-02-21T08:09:46.0269392Z %9 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:09:46.0269705Z %10 = tt.splat %c240_i32 : i32 -> tensor<16xi32> 2026-02-21T08:09:46.0269952Z %11 = arith.addi %10, %9 : tensor<16xi32> 2026-02-21T08:09:46.0270265Z %12 = tt.expand_dims %7 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:09:46.0270589Z %13 = arith.muli %12, %cst : tensor<512x1xi32> 2026-02-21T08:09:46.0270899Z %14 = tt.expand_dims %11 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:09:46.0271245Z %15 = tt.broadcast %13 : tensor<512x1xi32> -> tensor<512x16xi32> 2026-02-21T08:09:46.0271642Z %16 = tt.broadcast %14 : tensor<1x16xi32> -> tensor<512x16xi32> 2026-02-21T08:09:46.0271939Z %17 = arith.addi %15, %16 : tensor<512x16xi32> 2026-02-21T08:09:46.0272225Z %18 = tt.splat %arg0 : !tt.ptr -> tensor<512x16x!tt.ptr> 2026-02-21T08:09:46.0272569Z %19 = tt.addptr %18, %17 : tensor<512x16x!tt.ptr>, tensor<512x16xi32> 2026-02-21T08:09:46.0272877Z %20 = tt.load %19 : tensor<512x16x!tt.ptr> 2026-02-21T08:09:46.0273162Z %21 = arith.extf %20 : tensor<512x16xf16> to tensor<512x16xf32> 2026-02-21T08:09:46.0273434Z %22 = "tt.reduce"(%21) <{axis = 1 : i32}> ({ 2026-02-21T08:09:46.0273748Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:09:46.0273977Z %59 = arith.maxnumf %arg3, %arg4 : f32 
2026-02-21T08:09:46.0274205Z tt.reduce.return %59 : f32 2026-02-21T08:09:46.0274436Z }) : (tensor<512x16xf32>) -> tensor<512xf32> 2026-02-21T08:09:46.0274714Z %23 = arith.truncf %22 : tensor<512xf32> to tensor<512xf16> 2026-02-21T08:09:46.0275022Z %24 = arith.extf %23 : tensor<512xf16> to tensor<512xf32> 2026-02-21T08:09:46.0275545Z %25 = arith.cmpf ogt, %8#0, %24 : tensor<512xf32> 2026-02-21T08:09:46.0275813Z %26 = arith.cmpf une, %8#0, %8#0 : tensor<512xf32> 2026-02-21T08:09:46.0276069Z %27 = arith.ori %25, %26 : tensor<512xi1> 2026-02-21T08:09:46.0276343Z %28 = arith.select %27, %8#0, %24 : tensor<512xi1>, tensor<512xf32> 2026-02-21T08:09:46.0276633Z %29 = arith.subf %8#0, %28 : tensor<512xf32> 2026-02-21T08:09:46.0277067Z %30 = tt.extern_elementwise %29 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512xf32>) -> tensor<512xf32> 2026-02-21T08:09:46.0277505Z %31 = arith.mulf %8#1, %30 : tensor<512xf32> 2026-02-21T08:09:46.0277800Z %32 = tt.expand_dims %28 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32> 2026-02-21T08:09:46.0278153Z %33 = tt.broadcast %32 : tensor<512x1xf32> -> tensor<512x16xf32> 2026-02-21T08:09:46.0278443Z %34 = arith.subf %21, %33 : tensor<512x16xf32> 2026-02-21T08:09:46.0278956Z %35 = tt.extern_elementwise %34 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x16xf32>) -> tensor<512x16xf32> 2026-02-21T08:09:46.0279401Z %36 = "tt.reduce"(%35) <{axis = 1 : i32}> ({ 2026-02-21T08:09:46.0279627Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:09:46.0279855Z %59 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:09:46.0280085Z tt.reduce.return %59 : f32 2026-02-21T08:09:46.0280316Z }) : (tensor<512x16xf32>) -> tensor<512xf32> 2026-02-21T08:09:46.0280570Z %37 = arith.addf %31, %36 : tensor<512xf32> 2026-02-21T08:09:46.0280809Z %c240_i32_2 = arith.constant 240 : i32 2026-02-21T08:09:46.0281056Z %c48_i32_3 = arith.constant 48 : i32 2026-02-21T08:09:46.0281334Z scf.for %arg3 = %c0_i32 to %c240_i32_2 
step %c48_i32_3 : i32 { 2026-02-21T08:09:46.0281712Z %59 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:09:46.0282011Z %60 = tt.splat %arg3 : i32 -> tensor<16xi32> 2026-02-21T08:09:46.0282299Z %61 = arith.addi %60, %59 : tensor<16xi32> 2026-02-21T08:09:46.0282752Z %62 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc> -> tensor<512x16xf16> 2026-02-21T08:09:46.0283287Z %63 = tt.expand_dims %28 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32> 2026-02-21T08:09:46.0283690Z %64 = arith.extf %62 : tensor<512x16xf16> to tensor<512x16xf32> 2026-02-21T08:09:46.0284012Z %65 = tt.broadcast %63 : tensor<512x1xf32> -> tensor<512x16xf32> 2026-02-21T08:09:46.0284314Z %66 = arith.subf %64, %65 : tensor<512x16xf32> 2026-02-21T08:09:46.0284767Z %67 = tt.extern_elementwise %66 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x16xf32>) -> tensor<512x16xf32> 2026-02-21T08:09:46.0285277Z %68 = tt.expand_dims %37 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32> 2026-02-21T08:09:46.0285651Z %69 = tt.broadcast %68 : tensor<512x1xf32> -> tensor<512x16xf32> 2026-02-21T08:09:46.0285948Z %70 = arith.divf %67, %69 : tensor<512x16xf32> 2026-02-21T08:09:46.0286247Z %71 = arith.truncf %70 : tensor<512x16xf32> to tensor<512x16xf16> 2026-02-21T08:09:46.0286603Z %72 = tt.expand_dims %7 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:09:46.0286939Z %73 = arith.muli %72, %cst : tensor<512x1xi32> 2026-02-21T08:09:46.0287255Z %74 = tt.expand_dims %61 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:09:46.0287594Z %75 = tt.broadcast %73 : tensor<512x1xi32> -> tensor<512x16xi32> 2026-02-21T08:09:46.0287903Z %76 = tt.broadcast %74 : tensor<1x16xi32> -> tensor<512x16xi32> 2026-02-21T08:09:46.0288174Z %77 = arith.addi %75, %76 : tensor<512x16xi32> 2026-02-21T08:09:46.0288454Z %78 = tt.splat %arg1 : !tt.ptr -> tensor<512x16x!tt.ptr> 2026-02-21T08:09:46.0288782Z %79 = tt.addptr %78, %77 : 
tensor<512x16x!tt.ptr>, tensor<512x16xi32> 2026-02-21T08:09:46.0289154Z tt.store %79, %71 : tensor<512x16x!tt.ptr> 2026-02-21T08:09:46.0289405Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T08:09:46.0289630Z %80 = arith.muli %c16_i32, %c1_i32_4 : i32 2026-02-21T08:09:46.0289866Z %81 = arith.addi %arg3, %80 : i32 2026-02-21T08:09:46.0290134Z %82 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:09:46.0290427Z %83 = tt.splat %81 : i32 -> tensor<16xi32> 2026-02-21T08:09:46.0290660Z %84 = arith.addi %83, %82 : tensor<16xi32> 2026-02-21T08:09:46.0291002Z %85 = tt.descriptor_load %0[%4, %81] : !tt.tensordesc> -> tensor<512x16xf16> 2026-02-21T08:09:46.0291404Z %86 = tt.expand_dims %28 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32> 2026-02-21T08:09:46.0291784Z %87 = arith.extf %85 : tensor<512x16xf16> to tensor<512x16xf32> 2026-02-21T08:09:46.0292143Z %88 = tt.broadcast %86 : tensor<512x1xf32> -> tensor<512x16xf32> 2026-02-21T08:09:46.0292416Z %89 = arith.subf %87, %88 : tensor<512x16xf32> 2026-02-21T08:09:46.0292849Z %90 = tt.extern_elementwise %89 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x16xf32>) -> tensor<512x16xf32> 2026-02-21T08:09:46.0293336Z %91 = tt.expand_dims %37 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32> 2026-02-21T08:09:46.0293669Z %92 = tt.broadcast %91 : tensor<512x1xf32> -> tensor<512x16xf32> 2026-02-21T08:09:46.0293946Z %93 = arith.divf %90, %92 : tensor<512x16xf32> 2026-02-21T08:09:46.0294214Z %94 = arith.truncf %93 : tensor<512x16xf32> to tensor<512x16xf16> 2026-02-21T08:09:46.0294549Z %95 = tt.expand_dims %7 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:09:46.0294851Z %96 = arith.muli %95, %cst : tensor<512x1xi32> 2026-02-21T08:09:46.0295149Z %97 = tt.expand_dims %84 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:09:46.0295494Z %98 = tt.broadcast %96 : tensor<512x1xi32> -> tensor<512x16xi32> 2026-02-21T08:09:46.0295802Z 
%99 = tt.broadcast %97 : tensor<1x16xi32> -> tensor<512x16xi32> 2026-02-21T08:09:46.0296096Z %100 = arith.addi %98, %99 : tensor<512x16xi32> 2026-02-21T08:09:46.0296503Z %101 = tt.splat %arg1 : !tt.ptr -> tensor<512x16x!tt.ptr> 2026-02-21T08:09:46.0296910Z %102 = tt.addptr %101, %100 : tensor<512x16x!tt.ptr>, tensor<512x16xi32> 2026-02-21T08:09:46.0297228Z tt.store %102, %94 : tensor<512x16x!tt.ptr> 2026-02-21T08:09:46.0297483Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:09:46.0297711Z %103 = arith.muli %c16_i32, %c2_i32 : i32 2026-02-21T08:09:46.0297953Z %104 = arith.addi %arg3, %103 : i32 2026-02-21T08:09:46.0298230Z %105 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:09:46.0298521Z %106 = tt.splat %104 : i32 -> tensor<16xi32> 2026-02-21T08:09:46.0298770Z %107 = arith.addi %106, %105 : tensor<16xi32> 2026-02-21T08:09:46.0299107Z %108 = tt.descriptor_load %0[%4, %104] : !tt.tensordesc> -> tensor<512x16xf16> 2026-02-21T08:09:46.0299519Z %109 = tt.expand_dims %28 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32> 2026-02-21T08:09:46.0299865Z %110 = arith.extf %108 : tensor<512x16xf16> to tensor<512x16xf32> 2026-02-21T08:09:46.0300186Z %111 = tt.broadcast %109 : tensor<512x1xf32> -> tensor<512x16xf32> 2026-02-21T08:09:46.0300484Z %112 = arith.subf %110, %111 : tensor<512x16xf32> 2026-02-21T08:09:46.0300921Z %113 = tt.extern_elementwise %112 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x16xf32>) -> tensor<512x16xf32> 2026-02-21T08:09:46.0301420Z %114 = tt.expand_dims %37 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32> 2026-02-21T08:09:46.0301827Z %115 = tt.broadcast %114 : tensor<512x1xf32> -> tensor<512x16xf32> 2026-02-21T08:09:46.0302128Z %116 = arith.divf %113, %115 : tensor<512x16xf32> 2026-02-21T08:09:46.0302498Z %117 = arith.truncf %116 : tensor<512x16xf32> to tensor<512x16xf16> 2026-02-21T08:09:46.0302832Z %118 = tt.expand_dims %7 {axis = 1 : i32} : tensor<512xi32> -> 
tensor<512x1xi32> 2026-02-21T08:09:46.0303148Z %119 = arith.muli %118, %cst : tensor<512x1xi32> 2026-02-21T08:09:46.0303444Z %120 = tt.expand_dims %107 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:09:46.0303792Z %121 = tt.broadcast %119 : tensor<512x1xi32> -> tensor<512x16xi32> 2026-02-21T08:09:46.0304110Z %122 = tt.broadcast %120 : tensor<1x16xi32> -> tensor<512x16xi32> 2026-02-21T08:09:46.0304397Z %123 = arith.addi %121, %122 : tensor<512x16xi32> 2026-02-21T08:09:46.0304691Z %124 = tt.splat %arg1 : !tt.ptr -> tensor<512x16x!tt.ptr> 2026-02-21T08:09:46.0305033Z %125 = tt.addptr %124, %123 : tensor<512x16x!tt.ptr>, tensor<512x16xi32> 2026-02-21T08:09:46.0305409Z tt.store %125, %117 : tensor<512x16x!tt.ptr> 2026-02-21T08:09:46.0305637Z } 2026-02-21T08:09:46.0305856Z %38 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:09:46.0306160Z %39 = tt.splat %c240_i32_2 : i32 -> tensor<16xi32> 2026-02-21T08:09:46.0306404Z %40 = arith.addi %39, %38 : tensor<16xi32> 2026-02-21T08:09:46.0306764Z %41 = tt.descriptor_load %0[%4, %c240_i32_2] : !tt.tensordesc> -> tensor<512x16xf16> 2026-02-21T08:09:46.0307176Z %42 = tt.expand_dims %28 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32> 2026-02-21T08:09:46.0307522Z %43 = arith.extf %41 : tensor<512x16xf16> to tensor<512x16xf32> 2026-02-21T08:09:46.0307827Z %44 = tt.broadcast %42 : tensor<512x1xf32> -> tensor<512x16xf32> 2026-02-21T08:09:46.0308112Z %45 = arith.subf %43, %44 : tensor<512x16xf32> 2026-02-21T08:09:46.0308549Z %46 = tt.extern_elementwise %45 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x16xf32>) -> tensor<512x16xf32> 2026-02-21T08:09:46.0309030Z %47 = tt.expand_dims %37 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32> 2026-02-21T08:09:46.0309373Z %48 = tt.broadcast %47 : tensor<512x1xf32> -> tensor<512x16xf32> 2026-02-21T08:09:46.0309646Z %49 = arith.divf %46, %48 : tensor<512x16xf32> 2026-02-21T08:09:46.0309932Z %50 = 
arith.truncf %49 : tensor<512x16xf32> to tensor<512x16xf16> 2026-02-21T08:09:46.0310268Z %51 = tt.expand_dims %7 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:09:46.0310572Z %52 = arith.muli %51, %cst : tensor<512x1xi32> 2026-02-21T08:09:46.0310873Z %53 = tt.expand_dims %40 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:09:46.0311200Z %54 = tt.broadcast %52 : tensor<512x1xi32> -> tensor<512x16xi32> 2026-02-21T08:09:46.0311516Z %55 = tt.broadcast %53 : tensor<1x16xi32> -> tensor<512x16xi32> 2026-02-21T08:09:46.0311825Z %56 = arith.addi %54, %55 : tensor<512x16xi32> 2026-02-21T08:09:46.0312118Z %57 = tt.splat %arg1 : !tt.ptr -> tensor<512x16x!tt.ptr> 2026-02-21T08:09:46.0312447Z %58 = tt.addptr %57, %56 : tensor<512x16x!tt.ptr>, tensor<512x16xi32> 2026-02-21T08:09:46.0312751Z tt.store %58, %50 : tensor<512x16x!tt.ptr> 2026-02-21T08:09:46.0312995Z } {tt.warp_specialize} 2026-02-21T08:09:46.0313182Z tt.return 2026-02-21T08:09:46.0313338Z } 2026-02-21T08:09:46.0313480Z } 2026-02-21T08:09:46.0313569Z 2026-02-21T08:09:46.0313628Z {-# 2026-02-21T08:09:46.0313795Z external_resources: { 2026-02-21T08:09:46.0313990Z mlir_reproducer: { 2026-02-21T08:09:46.0319192Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, 
tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:09:46.0324459Z disable_threading: false, 2026-02-21T08:09:46.0324670Z verify_each: true 2026-02-21T08:09:46.0324842Z } 2026-02-21T08:09:46.0324994Z } 2026-02-21T08:09:46.0325133Z #-} 2026-02-21T08:09:46.0325650Z /tmp/torchinductor_root/iq/ciqafosyvytmppnhx4u2hjiekjtwu3zcm2aprmnhs7mquxpoo5mi.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:09:46.0327101Z /tmp/torchinductor_root/iq/ciqafosyvytmppnhx4u2hjiekjtwu3zcm2aprmnhs7mquxpoo5mi.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:09:46.0328264Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:09:46.0329511Z Config: @helion.kernel(config=helion.Config(block_sizes=[512, 16], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', ''], maxnreg=64, num_sm_multiplier=2, num_stages=7, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:09:46.0330701Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:09:46.0331005Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:09:47.1901386Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 17.0 configs/s 2026-02-21T08:09:47.1912749Z [38s] Adaptive compile timeout: 30s (90% percentile=1.5s, bounds=[30.0s, 60s]) 2026-02-21T08:09:47.5438402Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 2863.1 configs/s 2026-02-21T08:09:47.5917655Z [39s] Initial random population of 100, 5 starting points: 2026-02-21T08:09:47.5922653Z error=8 2026-02-21T08:09:47.5927244Z ok=92 2026-02-21T08:09:47.5930615Z min=0.0082 2026-02-21T08:09:47.5935459Z mid=0.0533 2026-02-21T08:09:47.5937001Z max=2.5487 2026-02-21T08:09:47.5937225Z best={'block_sizes': [16, 256], 2026-02-21T08:09:47.5937506Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:09:47.5937773Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:09:47.5938074Z 'num_stages': 8, 2026-02-21T08:09:47.5938249Z 'num_warps': 4, 2026-02-21T08:09:47.5938430Z 'pid_type': 'flat', 2026-02-21T08:09:47.5938626Z 'range_flattens': [None, True], 2026-02-21T08:09:47.5942924Z 'range_multi_buffers': [None, True], 2026-02-21T08:09:47.5944393Z 'range_num_stages': [0, 1], 2026-02-21T08:09:47.5945206Z 'range_unroll_factors': [0, 0], 2026-02-21T08:09:47.5945443Z 'range_warp_specializes': [None, False]} 2026-02-21T08:09:47.5945830Z [39s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:09:48.9668151Z 
[40s] Generation 1 starting: 83 neighbors, 5 active search path(s) 2026-02-21T08:09:54.5760712Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 5.1 configs/s 2026-02-21T08:10:00.4278976Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 14.8 configs/s 2026-02-21T08:10:03.2060326Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 370.6 2026-02-21T08:10:03.2060864Z configs/s 2026-02-21T08:10:03.5255754Z [55s] Generation 1 complete: 2026-02-21T08:10:03.5256107Z ok=88 2026-02-21T08:10:03.5256278Z min=0.0061 2026-02-21T08:10:03.5256459Z mid=0.0082 2026-02-21T08:10:03.5256639Z max=0.0881 2026-02-21T08:10:03.5257388Z best={'block_sizes': [4, 256], 2026-02-21T08:10:03.5257681Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:10:03.5257954Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:10:03.5258214Z 'num_stages': 5, 2026-02-21T08:10:03.5258395Z 'num_warps': 4, 2026-02-21T08:10:03.5258616Z 'pid_type': 'flat', 2026-02-21T08:10:03.5258823Z 'range_flattens': [None, None], 2026-02-21T08:10:03.5259065Z 'range_multi_buffers': [None, None], 2026-02-21T08:10:03.5259286Z 'range_num_stages': [0, 1], 2026-02-21T08:10:03.5259480Z 'range_unroll_factors': [0, 3], 2026-02-21T08:10:03.5259688Z 'range_warp_specializes': [None, None]} 2026-02-21T08:10:03.5269917Z [55s] Fitting surrogate: 188 points, 188 targets 2026-02-21T08:10:04.7086052Z [56s] Generation 2 starting: 75 neighbors, 5 active search path(s) 2026-02-21T08:10:08.2766241Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 58.1 configs/s 2026-02-21T08:10:13.2355514Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 15.7 configs/s 2026-02-21T08:10:16.6636830Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 325.4 2026-02-21T08:10:16.6637415Z configs/s 2026-02-21T08:10:17.0079338Z [68s] Generation 2 complete: 2026-02-21T08:10:17.0079653Z ok=81 2026-02-21T08:10:17.0079867Z min=0.0061 2026-02-21T08:10:17.0080069Z 
mid=0.0082 2026-02-21T08:10:17.0080270Z max=0.0758 2026-02-21T08:10:17.0080484Z best={'block_sizes': [16, 256], 2026-02-21T08:10:17.0080823Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:10:17.0081173Z 'load_eviction_policies': ['', ''], 2026-02-21T08:10:17.0081484Z 'num_stages': 7, 2026-02-21T08:10:17.0082089Z 'num_warps': 16, 2026-02-21T08:10:17.0082318Z 'pid_type': 'flat', 2026-02-21T08:10:17.0082572Z 'range_flattens': [None, False], 2026-02-21T08:10:17.0082887Z 'range_multi_buffers': [None, True], 2026-02-21T08:10:17.0083205Z 'range_num_stages': [0, 4], 2026-02-21T08:10:17.0083488Z 'range_unroll_factors': [0, 4], 2026-02-21T08:10:17.0083864Z 'range_warp_specializes': [None, None]} 2026-02-21T08:10:17.0095423Z [68s] Fitting surrogate: 269 points, 269 targets 2026-02-21T08:10:17.9834503Z [69s] Generation 3 starting: 65 neighbors, 5 active search path(s) 2026-02-21T08:10:20.9207437Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 33.0 configs/s 2026-02-21T08:10:25.2986197Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 15.7 configs/s 2026-02-21T08:10:28.6212272Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 338.1 2026-02-21T08:10:28.6212735Z configs/s 2026-02-21T08:10:28.9493041Z [80s] Generation 3 complete: 2026-02-21T08:10:28.9497171Z ok=70 2026-02-21T08:10:28.9498578Z min=0.0061 2026-02-21T08:10:28.9498747Z mid=0.0082 2026-02-21T08:10:28.9498873Z max=0.0348 2026-02-21T08:10:28.9499022Z best={'block_sizes': [16, 256], 2026-02-21T08:10:28.9499283Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:10:28.9500002Z 'load_eviction_policies': ['', ''], 2026-02-21T08:10:28.9500186Z 'num_stages': 7, 2026-02-21T08:10:28.9500328Z 'num_warps': 16, 2026-02-21T08:10:28.9500478Z 'pid_type': 'flat', 2026-02-21T08:10:28.9500638Z 'range_flattens': [None, False], 2026-02-21T08:10:28.9500828Z 'range_multi_buffers': [None, True], 2026-02-21T08:10:28.9501012Z 'range_num_stages': [0, 4], 
2026-02-21T08:10:28.9501186Z 'range_unroll_factors': [0, 4], 2026-02-21T08:10:28.9501363Z 'range_warp_specializes': [None, None]} 2026-02-21T08:10:28.9507057Z [80s] Fitting surrogate: 339 points, 339 targets 2026-02-21T08:10:29.8300312Z [81s] Generation 4 starting: 63 neighbors, 5 active search path(s) 2026-02-21T08:10:31.9267621Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 74.8 configs/s 2026-02-21T08:10:35.8818445Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 63/63 16.1 configs/s 2026-02-21T08:10:39.3138749Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 315.3 2026-02-21T08:10:39.3139199Z configs/s 2026-02-21T08:10:39.6331138Z [91s] Generation 4 complete: 2026-02-21T08:10:39.6335387Z ok=68 2026-02-21T08:10:39.6337016Z min=0.0061 2026-02-21T08:10:39.6337220Z mid=0.0063 2026-02-21T08:10:39.6337355Z max=0.0102 2026-02-21T08:10:39.6342665Z best={'block_sizes': [16, 256], 2026-02-21T08:10:39.6346613Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:10:39.6351067Z 'load_eviction_policies': ['', ''], 2026-02-21T08:10:39.6355585Z 'num_stages': 7, 2026-02-21T08:10:39.6357511Z 'num_warps': 16, 2026-02-21T08:10:39.6357698Z 'pid_type': 'flat', 2026-02-21T08:10:39.6357874Z 'range_flattens': [None, False], 2026-02-21T08:10:39.6358062Z 'range_multi_buffers': [None, True], 2026-02-21T08:10:39.6358255Z 'range_num_stages': [0, 4], 2026-02-21T08:10:39.6358421Z 'range_unroll_factors': [0, 4], 2026-02-21T08:10:39.6358608Z 'range_warp_specializes': [None, None]} 2026-02-21T08:10:39.6358945Z [91s] Fitting surrogate: 407 points, 407 targets 2026-02-21T08:10:40.3712307Z [91s] Generation 5 starting: 50 neighbors, 4 active search path(s) 2026-02-21T08:10:42.2116544Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 41.6 configs/s 2026-02-21T08:10:45.4145060Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 16.1 configs/s 2026-02-21T08:10:48.0286974Z Generation 5: verifying top configs 100% 
━━━━━━━━━━━━━━ 1000/1000 392.4 2026-02-21T08:10:48.0287791Z configs/s 2026-02-21T08:10:48.3000593Z [99s] Generation 5 complete: 2026-02-21T08:10:48.3005537Z ok=55 2026-02-21T08:10:48.3009909Z min=0.0061 2026-02-21T08:10:48.3011337Z mid=0.0063 2026-02-21T08:10:48.3011505Z max=0.0164 2026-02-21T08:10:48.3011724Z best={'block_sizes': [16, 256], 2026-02-21T08:10:48.3011949Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:10:48.3012160Z 'load_eviction_policies': ['', ''], 2026-02-21T08:10:48.3012397Z 'num_stages': 7, 2026-02-21T08:10:48.3012538Z 'num_warps': 16, 2026-02-21T08:10:48.3012683Z 'pid_type': 'flat', 2026-02-21T08:10:48.3012847Z 'range_flattens': [None, False], 2026-02-21T08:10:48.3013027Z 'range_multi_buffers': [None, True], 2026-02-21T08:10:48.3013219Z 'range_num_stages': [0, 4], 2026-02-21T08:10:48.3013386Z 'range_unroll_factors': [0, 4], 2026-02-21T08:10:48.3013577Z 'range_warp_specializes': [None, None]} 2026-02-21T08:10:48.3016106Z [99s] Fitting surrogate: 462 points, 462 targets 2026-02-21T08:10:48.9941230Z [100s] Generation 6 starting: 34 neighbors, 3 active search path(s) 2026-02-21T08:10:50.2529555Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 40.1 configs/s 2026-02-21T08:10:52.4546710Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 16.2 configs/s 2026-02-21T08:10:54.2748913Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 561.7 2026-02-21T08:10:54.2750306Z configs/s 2026-02-21T08:10:54.4669893Z [106s] Generation 6 complete: 2026-02-21T08:10:54.4674240Z ok=38 2026-02-21T08:10:54.4678677Z min=0.0061 2026-02-21T08:10:54.4680649Z mid=0.0063 2026-02-21T08:10:54.4680808Z max=0.0102 2026-02-21T08:10:54.4680962Z best={'block_sizes': [16, 256], 2026-02-21T08:10:54.4681174Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:10:54.4681397Z 'load_eviction_policies': ['', ''], 2026-02-21T08:10:54.4681649Z 'num_stages': 7, 2026-02-21T08:10:54.4681804Z 'num_warps': 16, 
2026-02-21T08:10:54.4681951Z 'pid_type': 'flat', 2026-02-21T08:10:54.4682107Z 'range_flattens': [None, False], 2026-02-21T08:10:54.4682300Z 'range_multi_buffers': [None, True], 2026-02-21T08:10:54.4682482Z 'range_num_stages': [0, 4], 2026-02-21T08:10:54.4682655Z 'range_unroll_factors': [0, 4], 2026-02-21T08:10:54.4682834Z 'range_warp_specializes': [None, None]} 2026-02-21T08:10:54.4687900Z [106s] Fitting surrogate: 500 points, 500 targets 2026-02-21T08:10:54.7874654Z [106s] Generation 7 starting: 14 neighbors, 1 active search path(s) 2026-02-21T08:10:55.4748157Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 60.2 configs/s 2026-02-21T08:10:56.3649956Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 14/14 16.5 configs/s 2026-02-21T08:10:57.0908320Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1398.9 2026-02-21T08:10:57.0911932Z configs/s 2026-02-21T08:10:57.1738579Z [108s] Generation 7 complete: 2026-02-21T08:10:57.1741749Z ok=16 2026-02-21T08:10:57.1746173Z min=0.0061 2026-02-21T08:10:57.1748292Z mid=0.0061 2026-02-21T08:10:57.1748516Z max=0.0143 2026-02-21T08:10:57.1753752Z best={'block_sizes': [16, 256], 2026-02-21T08:10:57.1755754Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:10:57.1756092Z 'load_eviction_policies': ['', ''], 2026-02-21T08:10:57.1756329Z 'num_stages': 7, 2026-02-21T08:10:57.1756556Z 'num_warps': 16, 2026-02-21T08:10:57.1756756Z 'pid_type': 'flat', 2026-02-21T08:10:57.1761341Z 'range_flattens': [None, False], 2026-02-21T08:10:57.1765305Z 'range_multi_buffers': [None, True], 2026-02-21T08:10:57.1768772Z 'range_num_stages': [0, 4], 2026-02-21T08:10:57.1772212Z 'range_unroll_factors': [0, 4], 2026-02-21T08:10:57.1776106Z 'range_warp_specializes': [None, None]} 2026-02-21T08:10:57.1780172Z [108s] Fitting surrogate: 516 points, 516 targets 2026-02-21T08:10:57.4729528Z [109s] Autotuning complete in 109.1s after searching 494 configs. 
2026-02-21T08:10:57.4732285Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:10:57.4733413Z @helion.kernel(config=helion.Config(block_sizes=[16, 256], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['', ''], num_stages=7, num_warps=16, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T08:10:57.4734674Z 2026-02-21T08:10:57.4739502Z [109s] Code of selected kernel: /tmp/torchinductor_root/fe/cfe46z52drtjal4sey6zsroulfver3cr6rioiwvny46ikfnh4jrf.py 2026-02-21T08:10:57.9772526Z WARNING:tritonbench.utils.triton_op:Completed input ID 0: 2026-02-21T08:10:57.9776567Z (M, N) 2026-02-21T08:10:57.9778245Z ----------- 2026-02-21T08:10:57.9778436Z (4096, 256) 2026-02-21T08:10:57.9778526Z 2026-02-21T08:10:57.9779167Z 5%|▌ | 1/20 [01:55<36:35, 115.53s/it]WARNING:tritonbench.utils.triton_op:Running input ID 5: 2026-02-21T08:10:57.9781798Z (M, N) 2026-02-21T08:10:57.9781983Z ----------- 2026-02-21T08:10:57.9782146Z (4096, 896) 2026-02-21T08:10:57.9782541Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for naive_softmax 2026-02-21T08:10:59.5973158Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T08:11:01.0600344Z INFO:tritonbench.utils.triton_op:Took 2.50ms to get benchmark function for torch_compile_softmax 2026-02-21T08:11:02.8720238Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:11:02.8723834Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:11:02.8727796Z 'dtype': 'torch.float16', 2026-02-21T08:11:02.8731959Z 'shape': (4096, 896), 2026-02-21T08:11:02.8736511Z 'stride': (896, 1)},), 2026-02-21T08:11:02.8741659Z 'kwargs': {}} 2026-02-21T08:11:02.8746194Z INFO:tritonbench.utils.triton_op:Took 1.67ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T08:11:03.0470822Z [0s] Autotune random 
seed: 2134816249 2026-02-21T08:11:03.1823892Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:11:36.3408960Z [33s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'first'], num_stages=8, num_warps=32, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]) 2026-02-21T08:11:36.3427099Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s 2026-02-21T08:11:38.8565645Z module attributes {ttg.maxnreg = 64 : i32} { 2026-02-21T08:11:38.8566282Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:11:38.8570162Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:11:38.8574903Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:11:38.8576293Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:11:38.8576542Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:11:38.8576760Z %cst = arith.constant dense<896> : tensor<16x1xi32> 2026-02-21T08:11:38.8577105Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<16xf32> 2026-02-21T08:11:38.8582235Z %cst_1 = arith.constant dense<0xFF800000> : tensor<16xf32> 2026-02-21T08:11:38.8583661Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:11:38.8583898Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:11:38.8584103Z %c896_i32 = arith.constant 896 : i32 2026-02-21T08:11:38.8584285Z %c896_i64 = arith.constant 896 : i64 2026-02-21T08:11:38.8584471Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:11:38.8584794Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c896_i32], [%c896_i64, %c1_i64] : , > 2026-02-21T08:11:38.8585126Z %1 = tt.get_program_id x : i32 2026-02-21T08:11:38.8585303Z %2 = 
arith.addi %1, %c1_i32 : i32 2026-02-21T08:11:38.8585490Z %3 = arith.minsi %2, %c256_i32 : i32 2026-02-21T08:11:38.8585691Z scf.for %arg2 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T08:11:38.8585893Z %4 = arith.muli %arg2, %c16_i32 : i32 2026-02-21T08:11:38.8586131Z %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:11:38.8586394Z %6 = tt.splat %4 : i32 -> tensor<16xi32> 2026-02-21T08:11:38.8586941Z %7 = arith.addi %6, %5 : tensor<16xi32> 2026-02-21T08:11:38.8587131Z %c768_i32 = arith.constant 768 : i32 2026-02-21T08:11:38.8587318Z %c384_i32 = arith.constant 384 : i32 2026-02-21T08:11:38.8587688Z %8:2 = scf.for %arg3 = %c0_i32 to %c768_i32 step %c384_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<16xf32>, tensor<16xf32>) : i32 { 2026-02-21T08:11:38.8588157Z %50 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T08:11:38.8588498Z %51 = arith.extf %50 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T08:11:38.8588734Z %52 = "tt.reduce"(%51) <{axis = 1 : i32}> ({ 2026-02-21T08:11:38.8588937Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:38.8589126Z %108 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:11:38.8589327Z tt.reduce.return %108 : f32 2026-02-21T08:11:38.8589526Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T08:11:38.8589753Z %53 = arith.truncf %52 : tensor<16xf32> to tensor<16xf16> 2026-02-21T08:11:38.8590000Z %54 = arith.extf %53 : tensor<16xf16> to tensor<16xf32> 2026-02-21T08:11:38.8590227Z %55 = arith.cmpf ogt, %arg4, %54 : tensor<16xf32> 2026-02-21T08:11:38.8590461Z %56 = arith.cmpf une, %arg4, %arg4 : tensor<16xf32> 2026-02-21T08:11:38.8590676Z %57 = arith.ori %55, %56 : tensor<16xi1> 2026-02-21T08:11:38.8590916Z %58 = arith.select %57, %arg4, %54 : tensor<16xi1>, tensor<16xf32> 2026-02-21T08:11:38.8591163Z %59 = arith.subf %arg4, %58 : tensor<16xf32> 2026-02-21T08:11:38.8591522Z %60 = tt.extern_elementwise %59 {libname = "", libpath = "", pure = true, symbol = 
"__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T08:11:38.8592242Z %61 = arith.mulf %arg5, %60 : tensor<16xf32> 2026-02-21T08:11:38.8592507Z %62 = tt.expand_dims %58 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:11:38.8592973Z %63 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T08:11:38.8593309Z %64 = arith.subf %51, %63 : tensor<16x128xf32> 2026-02-21T08:11:38.8593689Z %65 = tt.extern_elementwise %64 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T08:11:38.8594055Z %66 = "tt.reduce"(%65) <{axis = 1 : i32}> ({ 2026-02-21T08:11:38.8594263Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:38.8594452Z %108 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:11:38.8594653Z tt.reduce.return %108 : f32 2026-02-21T08:11:38.8594843Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T08:11:38.8595054Z %67 = arith.addf %61, %66 : tensor<16xf32> 2026-02-21T08:11:38.8595251Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T08:11:38.8595441Z %68 = arith.muli %c128_i32, %c1_i32_4 : i32 2026-02-21T08:11:38.8595637Z %69 = arith.addi %arg3, %68 : i32 2026-02-21T08:11:38.8595917Z %70 = tt.descriptor_load %0[%4, %69] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T08:11:38.8596243Z %71 = arith.extf %70 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T08:11:38.8596472Z %72 = "tt.reduce"(%71) <{axis = 1 : i32}> ({ 2026-02-21T08:11:38.8596668Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:38.8596857Z %108 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:11:38.8597046Z tt.reduce.return %108 : f32 2026-02-21T08:11:38.8597236Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T08:11:38.8597488Z %73 = arith.truncf %72 : tensor<16xf32> to tensor<16xf16> 2026-02-21T08:11:38.8597745Z %74 = arith.extf %73 : tensor<16xf16> to tensor<16xf32> 2026-02-21T08:11:38.8597983Z %75 = arith.cmpf ogt, %58, %74 : tensor<16xf32> 2026-02-21T08:11:38.8598211Z %76 = arith.cmpf 
une, %58, %58 : tensor<16xf32> 2026-02-21T08:11:38.8598430Z %77 = arith.ori %75, %76 : tensor<16xi1> 2026-02-21T08:11:38.8598743Z %78 = arith.select %77, %58, %74 : tensor<16xi1>, tensor<16xf32> 2026-02-21T08:11:38.8598993Z %79 = arith.subf %58, %78 : tensor<16xf32> 2026-02-21T08:11:38.8599362Z %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T08:11:38.8599737Z %81 = arith.mulf %67, %80 : tensor<16xf32> 2026-02-21T08:11:38.8599996Z %82 = tt.expand_dims %78 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:11:38.8600308Z %83 = tt.broadcast %82 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T08:11:38.8600559Z %84 = arith.subf %71, %83 : tensor<16x128xf32> 2026-02-21T08:11:38.8600942Z %85 = tt.extern_elementwise %84 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T08:11:38.8601333Z %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({ 2026-02-21T08:11:38.8601572Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:38.8601767Z %108 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:11:38.8601956Z tt.reduce.return %108 : f32 2026-02-21T08:11:38.8602153Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T08:11:38.8602363Z %87 = arith.addf %81, %86 : tensor<16xf32> 2026-02-21T08:11:38.8602566Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:11:38.8602769Z %88 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T08:11:38.8602962Z %89 = arith.addi %arg3, %88 : i32 2026-02-21T08:11:38.8603250Z %90 = tt.descriptor_load %0[%4, %89] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T08:11:38.8603579Z %91 = arith.extf %90 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T08:11:38.8603825Z %92 = "tt.reduce"(%91) <{axis = 1 : i32}> ({ 2026-02-21T08:11:38.8604025Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:38.8604274Z %108 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:11:38.8604484Z tt.reduce.return %108 : 
f32 2026-02-21T08:11:38.8604672Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T08:11:38.8604911Z %93 = arith.truncf %92 : tensor<16xf32> to tensor<16xf16> 2026-02-21T08:11:38.8605165Z %94 = arith.extf %93 : tensor<16xf16> to tensor<16xf32> 2026-02-21T08:11:38.8605410Z %95 = arith.cmpf ogt, %78, %94 : tensor<16xf32> 2026-02-21T08:11:38.8605634Z %96 = arith.cmpf une, %78, %78 : tensor<16xf32> 2026-02-21T08:11:38.8605843Z %97 = arith.ori %95, %96 : tensor<16xi1> 2026-02-21T08:11:38.8606087Z %98 = arith.select %97, %78, %94 : tensor<16xi1>, tensor<16xf32> 2026-02-21T08:11:38.8606316Z %99 = arith.subf %78, %98 : tensor<16xf32> 2026-02-21T08:11:38.8606669Z %100 = tt.extern_elementwise %99 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T08:11:38.8607027Z %101 = arith.mulf %87, %100 : tensor<16xf32> 2026-02-21T08:11:38.8607286Z %102 = tt.expand_dims %98 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:11:38.8607588Z %103 = tt.broadcast %102 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T08:11:38.8607831Z %104 = arith.subf %91, %103 : tensor<16x128xf32> 2026-02-21T08:11:38.8608205Z %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T08:11:38.8608575Z %106 = "tt.reduce"(%105) <{axis = 1 : i32}> ({ 2026-02-21T08:11:38.8608771Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:38.8608955Z %108 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:11:38.8609139Z tt.reduce.return %108 : f32 2026-02-21T08:11:38.8609328Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T08:11:38.8609526Z %107 = arith.addf %101, %106 : tensor<16xf32> 2026-02-21T08:11:38.8609751Z scf.yield %98, %107 : tensor<16xf32>, tensor<16xf32> 2026-02-21T08:11:38.8610021Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:11:38.8610320Z %9 = tt.descriptor_load %0[%4, %c768_i32] : !tt.tensordesc> -> tensor<16x128xf16> 
2026-02-21T08:11:38.8610642Z %10 = arith.extf %9 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T08:11:38.8610877Z %11 = "tt.reduce"(%10) <{axis = 1 : i32}> ({ 2026-02-21T08:11:38.8611070Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:11:38.8611248Z %50 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:11:38.8611441Z tt.reduce.return %50 : f32 2026-02-21T08:11:38.8611660Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T08:11:38.8611886Z %12 = arith.truncf %11 : tensor<16xf32> to tensor<16xf16> 2026-02-21T08:11:38.8612123Z %13 = arith.extf %12 : tensor<16xf16> to tensor<16xf32> 2026-02-21T08:11:38.8612347Z %14 = arith.cmpf ogt, %8#0, %13 : tensor<16xf32> 2026-02-21T08:11:38.8612567Z %15 = arith.cmpf une, %8#0, %8#0 : tensor<16xf32> 2026-02-21T08:11:38.8612771Z %16 = arith.ori %14, %15 : tensor<16xi1> 2026-02-21T08:11:38.8613002Z %17 = arith.select %16, %8#0, %13 : tensor<16xi1>, tensor<16xf32> 2026-02-21T08:11:38.8613228Z %18 = arith.subf %8#0, %17 : tensor<16xf32> 2026-02-21T08:11:38.8613577Z %19 = tt.extern_elementwise %18 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T08:11:38.8613922Z %20 = arith.mulf %8#1, %19 : tensor<16xf32> 2026-02-21T08:11:38.8614171Z %21 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:11:38.8614464Z %22 = tt.broadcast %21 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T08:11:38.8614693Z %23 = arith.subf %10, %22 : tensor<16x128xf32> 2026-02-21T08:11:38.8615062Z %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T08:11:38.8615483Z %25 = "tt.reduce"(%24) <{axis = 1 : i32}> ({ 2026-02-21T08:11:38.8615684Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:11:38.8615867Z %50 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:11:38.8616051Z tt.reduce.return %50 : f32 2026-02-21T08:11:38.8616237Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 
2026-02-21T08:11:38.8616428Z %26 = arith.addf %20, %25 : tensor<16xf32> 2026-02-21T08:11:38.8616626Z %c768_i32_2 = arith.constant 768 : i32 2026-02-21T08:11:38.8616811Z %c384_i32_3 = arith.constant 384 : i32 2026-02-21T08:11:38.8617043Z scf.for %arg3 = %c0_i32 to %c768_i32_2 step %c384_i32_3 : i32 { 2026-02-21T08:11:38.8617316Z %50 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:11:38.8617583Z %51 = tt.splat %arg3 : i32 -> tensor<128xi32> 2026-02-21T08:11:38.8617796Z %52 = arith.addi %51, %50 : tensor<128xi32> 2026-02-21T08:11:38.8618044Z %53 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T08:11:38.8618309Z %54 = arith.muli %53, %cst : tensor<16x1xi32> 2026-02-21T08:11:38.8618559Z %55 = tt.expand_dims %52 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:11:38.8618852Z %56 = tt.broadcast %54 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T08:11:38.8619111Z %57 = tt.broadcast %55 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T08:11:38.8619353Z %58 = arith.addi %56, %57 : tensor<16x128xi32> 2026-02-21T08:11:38.8619595Z %59 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T08:11:38.8619873Z %60 = tt.addptr %59, %58 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T08:11:38.8620175Z %61 = tt.load %60 evictionPolicy = evict_last : tensor<16x128x!tt.ptr> 2026-02-21T08:11:38.8620475Z %62 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:11:38.8620765Z %63 = arith.extf %61 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T08:11:38.8621122Z %64 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T08:11:38.8621355Z %65 = arith.subf %63, %64 : tensor<16x128xf32> 2026-02-21T08:11:38.8621780Z %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T08:11:38.8622192Z %67 = tt.expand_dims %26 {axis = 1 : 
i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:11:38.8622486Z %68 = tt.broadcast %67 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T08:11:38.8622714Z %69 = arith.divf %66, %68 : tensor<16x128xf32> 2026-02-21T08:11:38.8622958Z %70 = arith.truncf %69 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T08:11:38.8623235Z %71 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T08:11:38.8623512Z %72 = tt.addptr %71, %58 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T08:11:38.8623778Z tt.store %72, %70 : tensor<16x128x!tt.ptr> 2026-02-21T08:11:38.8623981Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T08:11:38.8624179Z %73 = arith.muli %c128_i32, %c1_i32_4 : i32 2026-02-21T08:11:38.8624368Z %74 = arith.addi %arg3, %73 : i32 2026-02-21T08:11:38.8624605Z %75 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:11:38.8624860Z %76 = tt.splat %74 : i32 -> tensor<128xi32> 2026-02-21T08:11:38.8625055Z %77 = arith.addi %76, %75 : tensor<128xi32> 2026-02-21T08:11:38.8625305Z %78 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T08:11:38.8625560Z %79 = arith.muli %78, %cst : tensor<16x1xi32> 2026-02-21T08:11:38.8625817Z %80 = tt.expand_dims %77 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:11:38.8626109Z %81 = tt.broadcast %79 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T08:11:38.8626416Z %82 = tt.broadcast %80 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T08:11:38.8626660Z %83 = arith.addi %81, %82 : tensor<16x128xi32> 2026-02-21T08:11:38.8626891Z %84 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T08:11:38.8627175Z %85 = tt.addptr %84, %83 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T08:11:38.8627466Z %86 = tt.load %85 evictionPolicy = evict_last : tensor<16x128x!tt.ptr> 2026-02-21T08:11:38.8627772Z %87 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:11:38.8628056Z %88 = 
arith.extf %86 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T08:11:38.8628309Z %89 = tt.broadcast %87 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T08:11:38.8628543Z %90 = arith.subf %88, %89 : tensor<16x128xf32> 2026-02-21T08:11:38.8628904Z %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T08:11:38.8629313Z %92 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:11:38.8629597Z %93 = tt.broadcast %92 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T08:11:38.8629828Z %94 = arith.divf %91, %93 : tensor<16x128xf32> 2026-02-21T08:11:38.8630066Z %95 = arith.truncf %94 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T08:11:38.8630332Z %96 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T08:11:38.8630616Z %97 = tt.addptr %96, %83 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T08:11:38.8630869Z tt.store %97, %95 : tensor<16x128x!tt.ptr> 2026-02-21T08:11:38.8631082Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:11:38.8631279Z %98 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T08:11:38.8631467Z %99 = arith.addi %arg3, %98 : i32 2026-02-21T08:11:38.8631742Z %100 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:11:38.8632050Z %101 = tt.splat %99 : i32 -> tensor<128xi32> 2026-02-21T08:11:38.8632259Z %102 = arith.addi %101, %100 : tensor<128xi32> 2026-02-21T08:11:38.8632503Z %103 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T08:11:38.8632772Z %104 = arith.muli %103, %cst : tensor<16x1xi32> 2026-02-21T08:11:38.8633035Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:11:38.8633328Z %106 = tt.broadcast %104 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T08:11:38.8633601Z %107 = tt.broadcast %105 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T08:11:38.8633843Z %108 = 
arith.addi %106, %107 : tensor<16x128xi32> 2026-02-21T08:11:38.8634090Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T08:11:38.8634374Z %110 = tt.addptr %109, %108 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T08:11:38.8634689Z %111 = tt.load %110 evictionPolicy = evict_last : tensor<16x128x!tt.ptr> 2026-02-21T08:11:38.8635000Z %112 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:11:38.8635286Z %113 = arith.extf %111 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T08:11:38.8635557Z %114 = tt.broadcast %112 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T08:11:38.8635797Z %115 = arith.subf %113, %114 : tensor<16x128xf32> 2026-02-21T08:11:38.8636176Z %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T08:11:38.8636595Z %117 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:11:38.8636881Z %118 = tt.broadcast %117 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T08:11:38.8637172Z %119 = arith.divf %116, %118 : tensor<16x128xf32> 2026-02-21T08:11:38.8637418Z %120 = arith.truncf %119 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T08:11:38.8637696Z %121 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T08:11:38.8637975Z %122 = tt.addptr %121, %108 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T08:11:38.8638246Z tt.store %122, %120 : tensor<16x128x!tt.ptr> 2026-02-21T08:11:38.8638461Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:11:38.8638697Z %27 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:11:38.8638957Z %28 = tt.splat %c768_i32_2 : i32 -> tensor<128xi32> 2026-02-21T08:11:38.8639160Z %29 = arith.addi %28, %27 : tensor<128xi32> 2026-02-21T08:11:38.8639413Z %30 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T08:11:38.8639673Z %31 = 
arith.muli %30, %cst : tensor<16x1xi32> 2026-02-21T08:11:38.8639926Z %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:11:38.8640221Z %33 = tt.broadcast %31 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T08:11:38.8640479Z %34 = tt.broadcast %32 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T08:11:38.8640714Z %35 = arith.addi %33, %34 : tensor<16x128xi32> 2026-02-21T08:11:38.8640940Z %36 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T08:11:38.8641250Z %37 = tt.addptr %36, %35 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T08:11:38.8641587Z %38 = tt.load %37 evictionPolicy = evict_last : tensor<16x128x!tt.ptr> 2026-02-21T08:11:38.8641904Z %39 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:11:38.8642203Z %40 = arith.extf %38 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T08:11:38.8642466Z %41 = tt.broadcast %39 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T08:11:38.8642716Z %42 = arith.subf %40, %41 : tensor<16x128xf32> 2026-02-21T08:11:38.8643149Z %43 = tt.extern_elementwise %42 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T08:11:38.8643580Z %44 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:11:38.8643882Z %45 = tt.broadcast %44 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T08:11:38.8644122Z %46 = arith.divf %43, %45 : tensor<16x128xf32> 2026-02-21T08:11:38.8644374Z %47 = arith.truncf %46 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T08:11:38.8644652Z %48 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T08:11:38.8644958Z %49 = tt.addptr %48, %35 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T08:11:38.8645227Z tt.store %49, %47 : tensor<16x128x!tt.ptr> 2026-02-21T08:11:38.8645579Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, 
tt.warp_specialize} 2026-02-21T08:11:38.8645912Z tt.return 2026-02-21T08:11:38.8646042Z } 2026-02-21T08:11:38.8646175Z } 2026-02-21T08:11:38.8646245Z 2026-02-21T08:11:38.8646297Z {-# 2026-02-21T08:11:38.8646439Z external_resources: { 2026-02-21T08:11:38.8646600Z mlir_reproducer: { 2026-02-21T08:11:38.8651116Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 
max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:11:38.8655585Z disable_threading: false, 2026-02-21T08:11:38.8655758Z verify_each: true 2026-02-21T08:11:38.8655902Z } 2026-02-21T08:11:38.8656028Z } 2026-02-21T08:11:38.8656142Z #-} 2026-02-21T08:11:38.8656604Z /tmp/torchinductor_root/57/c57bheky2o2t4z7zwekod4ad533kma5u7rgiwiyx473bitht7jc2.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:11:38.8657800Z /tmp/torchinductor_root/57/c57bheky2o2t4z7zwekod4ad533kma5u7rgiwiyx473bitht7jc2.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:11:38.8658768Z [35s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:11:38.8659837Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'last'], maxnreg=64, num_sm_multiplier=4, num_stages=5, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[2, 3], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:11:38.8660852Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:11:38.8661106Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:11:40.5010852Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:11:40.5016399Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:11:40.5020437Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:11:40.5025047Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:11:40.5028974Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:11:40.5033017Z %cst = arith.constant dense<896> : tensor<32x1xi32> 2026-02-21T08:11:40.5033376Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<32xf32> 2026-02-21T08:11:40.5033648Z %cst_1 = arith.constant dense<0xFF800000> : tensor<32xf32> 2026-02-21T08:11:40.5033867Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:11:40.5034061Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:11:40.5034253Z %c896_i32 = arith.constant 896 : i32 2026-02-21T08:11:40.5034433Z %c896_i64 = arith.constant 896 : i64 2026-02-21T08:11:40.5034619Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:11:40.5034927Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c896_i32], [%c896_i64, %c1_i64] : , > 2026-02-21T08:11:40.5035252Z %1 = tt.get_program_id x : i32 2026-02-21T08:11:40.5035459Z scf.for %arg2 = %1 to %c128_i32 step %c9472_i32 : i32 { 2026-02-21T08:11:40.5035682Z %2 = arith.muli %arg2, %c32_i32 : i32 2026-02-21T08:11:40.5036214Z %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:11:40.5036480Z %4 = tt.splat %2 : i32 -> tensor<32xi32> 2026-02-21T08:11:40.5036681Z %5 = arith.addi %4, %3 : tensor<32xi32> 2026-02-21T08:11:40.5036868Z %c864_i32 = arith.constant 864 : i32 2026-02-21T08:11:40.5037062Z %c96_i32 = arith.constant 96 : i32 2026-02-21T08:11:40.5037414Z %6:2 = scf.for %arg3 = %c0_i32 to %c864_i32 step %c96_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<32xf32>, tensor<32xf32>) : i32 { 2026-02-21T08:11:40.5037871Z %47 = tt.descriptor_load %0[%2, %arg3] : 
!tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:11:40.5038205Z %48 = arith.extf %47 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:11:40.5038437Z %49 = "tt.reduce"(%48) <{axis = 1 : i32}> ({ 2026-02-21T08:11:40.5038639Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:40.5038836Z %105 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:11:40.5039049Z tt.reduce.return %105 : f32 2026-02-21T08:11:40.5039237Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:11:40.5039469Z %50 = arith.truncf %49 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:11:40.5039720Z %51 = arith.extf %50 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:11:40.5039953Z %52 = arith.cmpf ogt, %arg4, %51 : tensor<32xf32> 2026-02-21T08:11:40.5040182Z %53 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32> 2026-02-21T08:11:40.5040392Z %54 = arith.ori %52, %53 : tensor<32xi1> 2026-02-21T08:11:40.5040627Z %55 = arith.select %54, %arg4, %51 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:11:40.5040863Z %56 = arith.subf %arg4, %55 : tensor<32xf32> 2026-02-21T08:11:40.5041233Z %57 = tt.extern_elementwise %56 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:11:40.5041681Z %58 = arith.mulf %arg5, %57 : tensor<32xf32> 2026-02-21T08:11:40.5042082Z %59 = tt.expand_dims %55 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:11:40.5042376Z %60 = tt.broadcast %59 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:11:40.5042608Z %61 = arith.subf %48, %60 : tensor<32x32xf32> 2026-02-21T08:11:40.5042977Z %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:11:40.5043347Z %63 = "tt.reduce"(%62) <{axis = 1 : i32}> ({ 2026-02-21T08:11:40.5043535Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:40.5043724Z %105 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:11:40.5043910Z tt.reduce.return %105 : f32 2026-02-21T08:11:40.5044103Z }) : 
(tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:11:40.5044300Z %64 = arith.addf %58, %63 : tensor<32xf32> 2026-02-21T08:11:40.5044498Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:11:40.5044690Z %65 = arith.muli %c32_i32, %c1_i32 : i32 2026-02-21T08:11:40.5044888Z %66 = arith.addi %arg3, %65 : i32 2026-02-21T08:11:40.5045161Z %67 = tt.descriptor_load %0[%2, %66] : !tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:11:40.5045465Z %68 = arith.extf %67 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:11:40.5045695Z %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({ 2026-02-21T08:11:40.5045879Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:40.5046063Z %105 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:11:40.5046248Z tt.reduce.return %105 : f32 2026-02-21T08:11:40.5046433Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:11:40.5046657Z %70 = arith.truncf %69 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:11:40.5046890Z %71 = arith.extf %70 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:11:40.5047120Z %72 = arith.cmpf ogt, %55, %71 : tensor<32xf32> 2026-02-21T08:11:40.5047488Z %73 = arith.cmpf une, %55, %55 : tensor<32xf32> 2026-02-21T08:11:40.5047719Z %74 = arith.ori %72, %73 : tensor<32xi1> 2026-02-21T08:11:40.5047945Z %75 = arith.select %74, %55, %71 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:11:40.5048184Z %76 = arith.subf %55, %75 : tensor<32xf32> 2026-02-21T08:11:40.5048539Z %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:11:40.5048895Z %78 = arith.mulf %64, %77 : tensor<32xf32> 2026-02-21T08:11:40.5049152Z %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:11:40.5049440Z %80 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:11:40.5049680Z %81 = arith.subf %68, %80 : tensor<32x32xf32> 2026-02-21T08:11:40.5050103Z %82 = tt.extern_elementwise %81 {libname = "", libpath = "", pure = 
true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:11:40.5050525Z %83 = "tt.reduce"(%82) <{axis = 1 : i32}> ({ 2026-02-21T08:11:40.5050728Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:40.5050938Z %105 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:11:40.5051136Z tt.reduce.return %105 : f32 2026-02-21T08:11:40.5051323Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:11:40.5051564Z %84 = arith.addf %78, %83 : tensor<32xf32> 2026-02-21T08:11:40.5051761Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:11:40.5051967Z %85 = arith.muli %c32_i32, %c2_i32 : i32 2026-02-21T08:11:40.5052164Z %86 = arith.addi %arg3, %85 : i32 2026-02-21T08:11:40.5052433Z %87 = tt.descriptor_load %0[%2, %86] : !tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:11:40.5052753Z %88 = arith.extf %87 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:11:40.5052984Z %89 = "tt.reduce"(%88) <{axis = 1 : i32}> ({ 2026-02-21T08:11:40.5053259Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:40.5053447Z %105 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:11:40.5053645Z tt.reduce.return %105 : f32 2026-02-21T08:11:40.5053830Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:11:40.5054052Z %90 = arith.truncf %89 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:11:40.5054308Z %91 = arith.extf %90 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:11:40.5054530Z %92 = arith.cmpf ogt, %75, %91 : tensor<32xf32> 2026-02-21T08:11:40.5054746Z %93 = arith.cmpf une, %75, %75 : tensor<32xf32> 2026-02-21T08:11:40.5054943Z %94 = arith.ori %92, %93 : tensor<32xi1> 2026-02-21T08:11:40.5055172Z %95 = arith.select %94, %75, %91 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:11:40.5055403Z %96 = arith.subf %75, %95 : tensor<32xf32> 2026-02-21T08:11:40.5055751Z %97 = tt.extern_elementwise %96 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:11:40.5056107Z %98 = arith.mulf %84, %97 : tensor<32xf32> 
2026-02-21T08:11:40.5056352Z %99 = tt.expand_dims %95 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:11:40.5056649Z %100 = tt.broadcast %99 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:11:40.5056883Z %101 = arith.subf %88, %100 : tensor<32x32xf32> 2026-02-21T08:11:40.5057251Z %102 = tt.extern_elementwise %101 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:11:40.5057626Z %103 = "tt.reduce"(%102) <{axis = 1 : i32}> ({ 2026-02-21T08:11:40.5057815Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:40.5057996Z %105 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:11:40.5058175Z tt.reduce.return %105 : f32 2026-02-21T08:11:40.5058358Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:11:40.5058602Z %104 = arith.addf %98, %103 : tensor<32xf32> 2026-02-21T08:11:40.5058826Z scf.yield %95, %104 : tensor<32xf32>, tensor<32xf32> 2026-02-21T08:11:40.5059043Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:11:40.5059330Z %7 = tt.descriptor_load %0[%2, %c864_i32] : !tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:11:40.5059648Z %8 = arith.extf %7 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:11:40.5059870Z %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({ 2026-02-21T08:11:40.5060061Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:11:40.5060238Z %47 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:11:40.5060433Z tt.reduce.return %47 : f32 2026-02-21T08:11:40.5060621Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:11:40.5060835Z %10 = arith.truncf %9 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:11:40.5061079Z %11 = arith.extf %10 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:11:40.5061302Z %12 = arith.cmpf ogt, %6#0, %11 : tensor<32xf32> 2026-02-21T08:11:40.5061521Z %13 = arith.cmpf une, %6#0, %6#0 : tensor<32xf32> 2026-02-21T08:11:40.5061770Z %14 = arith.ori %12, %13 : tensor<32xi1> 2026-02-21T08:11:40.5062002Z %15 = arith.select %14, %6#0, %11 : 
tensor<32xi1>, tensor<32xf32> 2026-02-21T08:11:40.5062238Z %16 = arith.subf %6#0, %15 : tensor<32xf32> 2026-02-21T08:11:40.5062582Z %17 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:11:40.5062941Z %18 = arith.mulf %6#1, %17 : tensor<32xf32> 2026-02-21T08:11:40.5063188Z %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:11:40.5063482Z %20 = tt.broadcast %19 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:11:40.5063711Z %21 = arith.subf %8, %20 : tensor<32x32xf32> 2026-02-21T08:11:40.5064079Z %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:11:40.5064504Z %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({ 2026-02-21T08:11:40.5064693Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:11:40.5064875Z %47 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:11:40.5065056Z tt.reduce.return %47 : f32 2026-02-21T08:11:40.5065242Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:11:40.5065432Z %24 = arith.addf %18, %23 : tensor<32xf32> 2026-02-21T08:11:40.5065627Z %c864_i32_2 = arith.constant 864 : i32 2026-02-21T08:11:40.5065819Z %c96_i32_3 = arith.constant 96 : i32 2026-02-21T08:11:40.5066043Z scf.for %arg3 = %c0_i32 to %c864_i32_2 step %c96_i32_3 : i32 { 2026-02-21T08:11:40.5066291Z %47 = tt.splat %arg3 : i32 -> tensor<32xi32> 2026-02-21T08:11:40.5066491Z %48 = arith.addi %47, %3 : tensor<32xi32> 2026-02-21T08:11:40.5066775Z %49 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:11:40.5067049Z %50 = arith.muli %49, %cst : tensor<32x1xi32> 2026-02-21T08:11:40.5067300Z %51 = tt.expand_dims %48 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:11:40.5067593Z %52 = tt.broadcast %50 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:11:40.5067847Z %53 = tt.broadcast %51 : 
tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:11:40.5068086Z %54 = arith.addi %52, %53 : tensor<32x32xi32> 2026-02-21T08:11:40.5068330Z %55 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:11:40.5068608Z %56 = tt.addptr %55, %54 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:11:40.5068916Z %57 = tt.load %56 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T08:11:40.5069222Z %58 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:11:40.5069566Z %59 = arith.extf %57 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:11:40.5069828Z %60 = tt.broadcast %58 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:11:40.5070052Z %61 = arith.subf %59, %60 : tensor<32x32xf32> 2026-02-21T08:11:40.5070418Z %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:11:40.5070841Z %63 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:11:40.5071126Z %64 = tt.broadcast %63 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:11:40.5071352Z %65 = arith.divf %62, %64 : tensor<32x32xf32> 2026-02-21T08:11:40.5071664Z %66 = arith.truncf %65 : tensor<32x32xf32> to tensor<32x32xf16> 2026-02-21T08:11:40.5071945Z %67 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:11:40.5072227Z %68 = tt.addptr %67, %54 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:11:40.5072497Z tt.store %68, %66 : tensor<32x32x!tt.ptr> 2026-02-21T08:11:40.5072709Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:11:40.5072910Z %69 = arith.muli %c32_i32, %c1_i32 : i32 2026-02-21T08:11:40.5073133Z %70 = arith.addi %arg3, %69 : i32 2026-02-21T08:11:40.5073347Z %71 = tt.splat %70 : i32 -> tensor<32xi32> 2026-02-21T08:11:40.5073558Z %72 = arith.addi %71, %3 : tensor<32xi32> 2026-02-21T08:11:40.5073820Z %73 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 
2026-02-21T08:11:40.5074095Z %74 = arith.muli %73, %cst : tensor<32x1xi32> 2026-02-21T08:11:40.5074347Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:11:40.5074647Z %76 = tt.broadcast %74 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:11:40.5074911Z %77 = tt.broadcast %75 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:11:40.5075156Z %78 = arith.addi %76, %77 : tensor<32x32xi32> 2026-02-21T08:11:40.5075453Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:11:40.5075744Z %80 = tt.addptr %79, %78 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:11:40.5076063Z %81 = tt.load %80 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T08:11:40.5076379Z %82 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:11:40.5076676Z %83 = arith.extf %81 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:11:40.5076939Z %84 = tt.broadcast %82 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:11:40.5077185Z %85 = arith.subf %83, %84 : tensor<32x32xf32> 2026-02-21T08:11:40.5077569Z %86 = tt.extern_elementwise %85 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:11:40.5077995Z %87 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:11:40.5078301Z %88 = tt.broadcast %87 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:11:40.5078544Z %89 = arith.divf %86, %88 : tensor<32x32xf32> 2026-02-21T08:11:40.5078790Z %90 = arith.truncf %89 : tensor<32x32xf32> to tensor<32x32xf16> 2026-02-21T08:11:40.5079062Z %91 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:11:40.5079350Z %92 = tt.addptr %91, %78 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:11:40.5079618Z tt.store %92, %90 : tensor<32x32x!tt.ptr> 2026-02-21T08:11:40.5079825Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:11:40.5080031Z %93 = arith.muli %c32_i32, 
%c2_i32 : i32 2026-02-21T08:11:40.5080215Z %94 = arith.addi %arg3, %93 : i32 2026-02-21T08:11:40.5080409Z %95 = tt.splat %94 : i32 -> tensor<32xi32> 2026-02-21T08:11:40.5080604Z %96 = arith.addi %95, %3 : tensor<32xi32> 2026-02-21T08:11:40.5080903Z %97 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:11:40.5081164Z %98 = arith.muli %97, %cst : tensor<32x1xi32> 2026-02-21T08:11:40.5081404Z %99 = tt.expand_dims %96 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:11:40.5081727Z %100 = tt.broadcast %98 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:11:40.5081981Z %101 = tt.broadcast %99 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:11:40.5082227Z %102 = arith.addi %100, %101 : tensor<32x32xi32> 2026-02-21T08:11:40.5082465Z %103 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:11:40.5082756Z %104 = tt.addptr %103, %102 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:11:40.5083069Z %105 = tt.load %104 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T08:11:40.5083379Z %106 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:11:40.5083675Z %107 = arith.extf %105 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:11:40.5083938Z %108 = tt.broadcast %106 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:11:40.5084183Z %109 = arith.subf %107, %108 : tensor<32x32xf32> 2026-02-21T08:11:40.5084556Z %110 = tt.extern_elementwise %109 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:11:40.5084975Z %111 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:11:40.5085270Z %112 = tt.broadcast %111 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:11:40.5085507Z %113 = arith.divf %110, %112 : tensor<32x32xf32> 2026-02-21T08:11:40.5085749Z %114 = arith.truncf %113 : tensor<32x32xf32> to tensor<32x32xf16> 
2026-02-21T08:11:40.5086021Z %115 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:11:40.5086316Z %116 = tt.addptr %115, %102 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:11:40.5086665Z tt.store %116, %114 : tensor<32x32x!tt.ptr> 2026-02-21T08:11:40.5086876Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:11:40.5087091Z %25 = tt.splat %c864_i32_2 : i32 -> tensor<32xi32> 2026-02-21T08:11:40.5087296Z %26 = arith.addi %25, %3 : tensor<32xi32> 2026-02-21T08:11:40.5087546Z %27 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:11:40.5087806Z %28 = arith.muli %27, %cst : tensor<32x1xi32> 2026-02-21T08:11:40.5088053Z %29 = tt.expand_dims %26 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:11:40.5088343Z %30 = tt.broadcast %28 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:11:40.5088599Z %31 = tt.broadcast %29 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:11:40.5088832Z %32 = arith.addi %30, %31 : tensor<32x32xi32> 2026-02-21T08:11:40.5089064Z %33 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:11:40.5089346Z %34 = tt.addptr %33, %32 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:11:40.5089649Z %35 = tt.load %34 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T08:11:40.5089953Z %36 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:11:40.5090238Z %37 = arith.extf %35 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:11:40.5090490Z %38 = tt.broadcast %36 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:11:40.5090724Z %39 = arith.subf %37, %38 : tensor<32x32xf32> 2026-02-21T08:11:40.5091082Z %40 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:11:40.5091494Z %41 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:11:40.5091815Z %42 = tt.broadcast 
%41 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:11:40.5092087Z %43 = arith.divf %40, %42 : tensor<32x32xf32> 2026-02-21T08:11:40.5092324Z %44 = arith.truncf %43 : tensor<32x32xf32> to tensor<32x32xf16> 2026-02-21T08:11:40.5092578Z %45 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:11:40.5092849Z %46 = tt.addptr %45, %32 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:11:40.5093102Z tt.store %46, %44 : tensor<32x32x!tt.ptr> 2026-02-21T08:11:40.5093366Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:11:40.5093621Z tt.return 2026-02-21T08:11:40.5093749Z } 2026-02-21T08:11:40.5093879Z } 2026-02-21T08:11:40.5093948Z 2026-02-21T08:11:40.5093998Z {-# 2026-02-21T08:11:40.5094132Z external_resources: { 2026-02-21T08:11:40.5094286Z mlir_reproducer: { 2026-02-21T08:11:40.5098562Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 
region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:11:40.5103024Z disable_threading: false, 2026-02-21T08:11:40.5103218Z verify_each: true 2026-02-21T08:11:40.5103383Z } 2026-02-21T08:11:40.5103522Z } 2026-02-21T08:11:40.5103668Z #-} 2026-02-21T08:11:40.5104151Z /tmp/torchinductor_root/xw/cxwatlt3kybgikn2evimt44eih4znjmoz32sxtzwi3iwawmh5uqd.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:11:40.5105478Z /tmp/torchinductor_root/xw/cxwatlt3kybgikn2evimt44eih4znjmoz32sxtzwi3iwawmh5uqd.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:11:40.5106554Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:11:40.5107725Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=128, num_sm_multiplier=64, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[3, 3], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:11:40.5108736Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:11:40.5109050Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:11:42.2400119Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 17.1 configs/s 2026-02-21T08:11:42.2413956Z [39s] Adaptive compile timeout: 30s (90% percentile=1.5s, bounds=[30.0s, 30s]) 2026-02-21T08:11:42.6316946Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 2551.3 configs/s 2026-02-21T08:11:42.6772418Z [39s] Initial random population of 100, 5 starting points: 2026-02-21T08:11:42.6776730Z error=6 2026-02-21T08:11:42.6779854Z timeout=1 2026-02-21T08:11:42.6784390Z ok=93 2026-02-21T08:11:42.6786487Z min=0.0123 2026-02-21T08:11:42.6786684Z mid=0.1127 2026-02-21T08:11:42.6791358Z max=8.6245 2026-02-21T08:11:42.6793265Z best={'block_sizes': [1, 1024], 2026-02-21T08:11:42.6793538Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:11:42.6793770Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:11:42.6793956Z 'maxnreg': 32, 2026-02-21T08:11:42.6794142Z 'num_sm_multiplier': 64, 2026-02-21T08:11:42.6794325Z 'num_stages': 7, 2026-02-21T08:11:42.6794461Z 'num_warps': 4, 2026-02-21T08:11:42.6794621Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:11:42.6794810Z 'range_flattens': [None, True], 2026-02-21T08:11:42.6794989Z 'range_multi_buffers': [False, True], 2026-02-21T08:11:42.6795170Z 'range_num_stages': [1, 4], 2026-02-21T08:11:42.6795331Z 'range_unroll_factors': [1, 4], 
2026-02-21T08:11:42.6795511Z 'range_warp_specializes': [True, None]} 2026-02-21T08:11:42.6795720Z [39s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:11:43.6956862Z [40s] Generation 1 starting: 82 neighbors, 5 active search path(s) 2026-02-21T08:11:49.2056607Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 8.4 configs/s 2026-02-21T08:11:54.7385272Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 15.7 configs/s 2026-02-21T08:11:58.2010509Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 295.3 2026-02-21T08:11:58.2014649Z configs/s 2026-02-21T08:11:58.5297272Z [55s] Generation 1 complete: 2026-02-21T08:11:58.5299058Z ok=88 2026-02-21T08:11:58.5299215Z min=0.0102 2026-02-21T08:11:58.5299352Z mid=0.0142 2026-02-21T08:11:58.5299472Z max=0.1782 2026-02-21T08:11:58.5299618Z best={'block_sizes': [4, 1024], 2026-02-21T08:11:58.5299828Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:11:58.5300057Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:11:58.5300243Z 'num_stages': 5, 2026-02-21T08:11:58.5300380Z 'num_warps': 16, 2026-02-21T08:11:58.5300524Z 'pid_type': 'flat', 2026-02-21T08:11:58.5300676Z 'range_flattens': [None, True], 2026-02-21T08:11:58.5300857Z 'range_multi_buffers': [None, True], 2026-02-21T08:11:58.5301035Z 'range_num_stages': [0, 1], 2026-02-21T08:11:58.5301202Z 'range_unroll_factors': [0, 3], 2026-02-21T08:11:58.5301379Z 'range_warp_specializes': [None, False]} 2026-02-21T08:11:58.5310572Z [55s] Fitting surrogate: 188 points, 188 targets 2026-02-21T08:11:59.4406909Z [56s] Generation 2 starting: 62 neighbors, 5 active search path(s) 2026-02-21T08:12:02.7791200Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66/66 39.6 configs/s 2026-02-21T08:12:07.2205060Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 66/66 15.0 configs/s 2026-02-21T08:12:10.6700790Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 296.7 
2026-02-21T08:12:10.6702224Z configs/s 2026-02-21T08:12:11.0360477Z [67s] Generation 2 complete: 2026-02-21T08:12:11.0365386Z ok=68 2026-02-21T08:12:11.0370801Z min=0.0102 2026-02-21T08:12:11.0372339Z mid=0.0123 2026-02-21T08:12:11.0372521Z max=0.0185 2026-02-21T08:12:11.0372691Z best={'block_sizes': [4, 1024], 2026-02-21T08:12:11.0372939Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:12:11.0373178Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:12:11.0373417Z 'num_stages': 5, 2026-02-21T08:12:11.0373594Z 'num_warps': 16, 2026-02-21T08:12:11.0373731Z 'pid_type': 'flat', 2026-02-21T08:12:11.0373891Z 'range_flattens': [None, True], 2026-02-21T08:12:11.0374065Z 'range_multi_buffers': [None, True], 2026-02-21T08:12:11.0374251Z 'range_num_stages': [0, 1], 2026-02-21T08:12:11.0374412Z 'range_unroll_factors': [0, 3], 2026-02-21T08:12:11.0374598Z 'range_warp_specializes': [None, False]} 2026-02-21T08:12:11.0374917Z [67s] Fitting surrogate: 256 points, 256 targets 2026-02-21T08:12:12.0664199Z [68s] Generation 3 starting: 63 neighbors, 5 active search path(s) 2026-02-21T08:12:15.8301237Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64/64 27.4 configs/s 2026-02-21T08:12:19.8772019Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 64/64 16.0 configs/s 2026-02-21T08:12:23.4302414Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 288.2 2026-02-21T08:12:23.4307463Z configs/s 2026-02-21T08:12:23.7988683Z [80s] Generation 3 complete: 2026-02-21T08:12:23.7992795Z ok=69 2026-02-21T08:12:23.7994536Z min=0.0102 2026-02-21T08:12:23.7994743Z mid=0.0102 2026-02-21T08:12:23.7994912Z max=0.0173 2026-02-21T08:12:23.7995110Z best={'block_sizes': [1, 1024], 2026-02-21T08:12:23.7995482Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:12:23.7995827Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:12:23.7996052Z 'num_stages': 7, 2026-02-21T08:12:23.8000233Z 'num_warps': 2, 
2026-02-21T08:12:23.8001919Z 'pid_type': 'flat', 2026-02-21T08:12:23.8002176Z 'range_flattens': [None, True], 2026-02-21T08:12:23.8002413Z 'range_multi_buffers': [None, None], 2026-02-21T08:12:23.8002650Z 'range_num_stages': [0, 4], 2026-02-21T08:12:23.8002854Z 'range_unroll_factors': [0, 4], 2026-02-21T08:12:23.8003085Z 'range_warp_specializes': [None, None]} 2026-02-21T08:12:23.8003428Z [80s] Fitting surrogate: 325 points, 325 targets 2026-02-21T08:12:24.5978382Z [81s] Generation 4 starting: 45 neighbors, 4 active search path(s) 2026-02-21T08:12:27.3919729Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46/46 21.6 configs/s 2026-02-21T08:12:30.3279017Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 46/46 15.9 configs/s 2026-02-21T08:12:32.8506895Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 406.3 2026-02-21T08:12:32.8507355Z configs/s 2026-02-21T08:12:33.1169753Z [89s] Generation 4 complete: 2026-02-21T08:12:33.1171304Z ok=49 2026-02-21T08:12:33.1171487Z min=0.0091 2026-02-21T08:12:33.1171690Z mid=0.0102 2026-02-21T08:12:33.1171828Z max=0.0164 2026-02-21T08:12:33.1171979Z best={'block_sizes': [1, 1024], 2026-02-21T08:12:33.1172232Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:12:33.1172486Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:12:33.1172689Z 'num_stages': 6, 2026-02-21T08:12:33.1173318Z 'num_warps': 1, 2026-02-21T08:12:33.1173507Z 'pid_type': 'flat', 2026-02-21T08:12:33.1173676Z 'range_flattens': [None, True], 2026-02-21T08:12:33.1173878Z 'range_multi_buffers': [None, True], 2026-02-21T08:12:33.1174082Z 'range_num_stages': [0, 1], 2026-02-21T08:12:33.1174255Z 'range_unroll_factors': [0, 0], 2026-02-21T08:12:33.1174456Z 'range_warp_specializes': [None, False]} 2026-02-21T08:12:33.1184175Z [89s] Fitting surrogate: 374 points, 374 targets 2026-02-21T08:12:33.7321853Z [90s] Generation 5 starting: 32 neighbors, 3 active search path(s) 2026-02-21T08:12:35.5380836Z Generation 
5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 20.8 configs/s 2026-02-21T08:12:37.9089176Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 32/32 13.7 configs/s 2026-02-21T08:12:39.6233058Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 597.0 2026-02-21T08:12:39.6234361Z configs/s 2026-02-21T08:12:39.8109558Z [96s] Generation 5 complete: 2026-02-21T08:12:39.8113344Z ok=35 2026-02-21T08:12:39.8114850Z min=0.0084 2026-02-21T08:12:39.8115046Z mid=0.0101 2026-02-21T08:12:39.8115187Z max=0.0210 2026-02-21T08:12:39.8115402Z best={'block_sizes': [1, 1024], 2026-02-21T08:12:39.8115661Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:12:39.8115955Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:12:39.8116154Z 'num_stages': 6, 2026-02-21T08:12:39.8116314Z 'num_warps': 1, 2026-02-21T08:12:39.8116466Z 'pid_type': 'flat', 2026-02-21T08:12:39.8116644Z 'range_flattens': [None, True], 2026-02-21T08:12:39.8116859Z 'range_multi_buffers': [None, True], 2026-02-21T08:12:39.8117080Z 'range_num_stages': [0, 1], 2026-02-21T08:12:39.8117260Z 'range_unroll_factors': [0, 0], 2026-02-21T08:12:39.8117478Z 'range_warp_specializes': [None, False]} 2026-02-21T08:12:39.8124006Z [96s] Fitting surrogate: 409 points, 409 targets 2026-02-21T08:12:40.3376125Z [97s] Generation 6 starting: 19 neighbors, 3 active search path(s) 2026-02-21T08:12:41.6032748Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 16.2 configs/s 2026-02-21T08:12:42.9842866Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 19/19 14.1 configs/s 2026-02-21T08:12:44.0702716Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 969.9 2026-02-21T08:12:44.0703490Z configs/s 2026-02-21T08:12:44.1976198Z [101s] Generation 6 complete: 2026-02-21T08:12:44.1976696Z ok=22 2026-02-21T08:12:44.1977011Z min=0.0162 2026-02-21T08:12:44.1977325Z mid=0.0220 2026-02-21T08:12:44.1977621Z max=0.0307 2026-02-21T08:12:44.1977956Z 
best={'block_sizes': [1, 1024], 2026-02-21T08:12:44.1978580Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:12:44.1979197Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:12:44.1979634Z 'num_stages': 2, 2026-02-21T08:12:44.1979966Z 'num_warps': 2, 2026-02-21T08:12:44.1980945Z 'pid_type': 'flat', 2026-02-21T08:12:44.1981364Z 'range_flattens': [None, False], 2026-02-21T08:12:44.1982231Z 'range_multi_buffers': [None, None], 2026-02-21T08:12:44.1982612Z 'range_num_stages': [0, 1], 2026-02-21T08:12:44.1982972Z 'range_unroll_factors': [0, 3], 2026-02-21T08:12:44.1983352Z 'range_warp_specializes': [None, False]} 2026-02-21T08:12:44.2003598Z [101s] Fitting surrogate: 431 points, 431 targets 2026-02-21T08:12:45.0344745Z [101s] Generation 7 starting: 33 neighbors, 3 active search path(s) 2026-02-21T08:12:48.0355430Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 14.2 configs/s 2026-02-21T08:12:50.4011411Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 15.5 configs/s 2026-02-21T08:12:52.5968282Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 543.7 2026-02-21T08:12:52.5968846Z configs/s 2026-02-21T08:12:52.8083760Z [109s] Generation 7 complete: 2026-02-21T08:12:52.8084098Z ok=37 2026-02-21T08:12:52.8086274Z min=0.0082 2026-02-21T08:12:52.8086480Z mid=0.0102 2026-02-21T08:12:52.8086674Z max=0.0162 2026-02-21T08:12:52.8086901Z best={'block_sizes': [1, 1024], 2026-02-21T08:12:52.8087267Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:12:52.8087662Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:12:52.8087935Z 'num_stages': 2, 2026-02-21T08:12:52.8088159Z 'num_warps': 2, 2026-02-21T08:12:52.8088382Z 'pid_type': 'flat', 2026-02-21T08:12:52.8088636Z 'range_flattens': [None, False], 2026-02-21T08:12:52.8088907Z 'range_multi_buffers': [None, True], 2026-02-21T08:12:52.8089174Z 'range_num_stages': [0, 1], 2026-02-21T08:12:52.8089438Z 
'range_unroll_factors': [0, 3], 2026-02-21T08:12:52.8089717Z 'range_warp_specializes': [None, False]} 2026-02-21T08:12:52.8101070Z [109s] Fitting surrogate: 468 points, 468 targets 2026-02-21T08:12:53.4134885Z [110s] Generation 8 starting: 30 neighbors, 3 active search path(s) 2026-02-21T08:12:55.3615083Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 20.3 configs/s 2026-02-21T08:12:57.3505408Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 31/31 15.9 configs/s 2026-02-21T08:12:59.0702165Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 596.5 2026-02-21T08:12:59.0702645Z configs/s 2026-02-21T08:12:59.2695713Z [116s] Generation 8 complete: 2026-02-21T08:12:59.2696518Z ok=33 2026-02-21T08:12:59.2696712Z min=0.0100 2026-02-21T08:12:59.2696900Z mid=0.0102 2026-02-21T08:12:59.2697075Z max=0.0102 2026-02-21T08:12:59.2697279Z best={'block_sizes': [1, 1024], 2026-02-21T08:12:59.2697563Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:12:59.2697882Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:12:59.2698117Z 'num_stages': 2, 2026-02-21T08:12:59.2698286Z 'num_warps': 2, 2026-02-21T08:12:59.2698501Z 'pid_type': 'flat', 2026-02-21T08:12:59.2699229Z 'range_flattens': [None, False], 2026-02-21T08:12:59.2699440Z 'range_multi_buffers': [None, True], 2026-02-21T08:12:59.2699662Z 'range_num_stages': [0, 1], 2026-02-21T08:12:59.2699866Z 'range_unroll_factors': [0, 3], 2026-02-21T08:12:59.2700076Z 'range_warp_specializes': [None, False]} 2026-02-21T08:12:59.2715340Z [116s] Fitting surrogate: 501 points, 501 targets 2026-02-21T08:12:59.7270178Z [116s] Generation 9 starting: 12 neighbors, 1 active search path(s) 2026-02-21T08:13:00.5611631Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 39.0 configs/s 2026-02-21T08:13:01.3236121Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 12/12 16.7 configs/s 2026-02-21T08:13:02.0410882Z Generation 9: verifying top configs 
100% ━━━━━━━━━━━━━ 1000/1000 1412.9 2026-02-21T08:13:02.0411251Z configs/s 2026-02-21T08:13:02.1242592Z [118s] Generation 9 complete: 2026-02-21T08:13:02.1243097Z ok=14 2026-02-21T08:13:02.1243248Z min=0.0083 2026-02-21T08:13:02.1243401Z mid=0.0101 2026-02-21T08:13:02.1243522Z max=0.0102 2026-02-21T08:13:02.1243669Z best={'block_sizes': [1, 1024], 2026-02-21T08:13:02.1243906Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:13:02.1244174Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:13:02.1244357Z 'num_stages': 7, 2026-02-21T08:13:02.1244505Z 'num_warps': 2, 2026-02-21T08:13:02.1244642Z 'pid_type': 'flat', 2026-02-21T08:13:02.1244800Z 'range_flattens': [None, True], 2026-02-21T08:13:02.1244983Z 'range_multi_buffers': [None, True], 2026-02-21T08:13:02.1245162Z 'range_num_stages': [0, 4], 2026-02-21T08:13:02.1245332Z 'range_unroll_factors': [0, 3], 2026-02-21T08:13:02.1245508Z 'range_warp_specializes': [None, False]} 2026-02-21T08:13:02.1258634Z [118s] Fitting surrogate: 515 points, 515 targets 2026-02-21T08:13:02.4787357Z [119s] Generation 10 starting: 9 neighbors, 1 active search path(s) 2026-02-21T08:13:03.0100187Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9/9 43.9 configs/s 2026-02-21T08:13:03.5776050Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 9/9 17.2 configs/s 2026-02-21T08:13:04.3298258Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1967.0 2026-02-21T08:13:04.3298749Z configs/s 2026-02-21T08:13:04.3902879Z [121s] Generation 10 complete: 2026-02-21T08:13:04.3903206Z ok=10 2026-02-21T08:13:04.3907168Z min=0.0083 2026-02-21T08:13:04.3910363Z mid=0.0101 2026-02-21T08:13:04.3914384Z max=0.0123 2026-02-21T08:13:04.3918266Z best={'block_sizes': [1, 1024], 2026-02-21T08:13:04.3922452Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:13:04.3926380Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:13:04.3928117Z 'num_stages': 
7, 2026-02-21T08:13:04.3928303Z 'num_warps': 2, 2026-02-21T08:13:04.3928490Z 'pid_type': 'flat', 2026-02-21T08:13:04.3928676Z 'range_flattens': [None, True], 2026-02-21T08:13:04.3928860Z 'range_multi_buffers': [None, True], 2026-02-21T08:13:04.3929039Z 'range_num_stages': [0, 4], 2026-02-21T08:13:04.3929208Z 'range_unroll_factors': [0, 3], 2026-02-21T08:13:04.3929385Z 'range_warp_specializes': [None, False]} 2026-02-21T08:13:04.3929607Z [121s] Fitting surrogate: 525 points, 525 targets 2026-02-21T08:13:04.6764528Z [121s] Autotuning complete in 121.5s after searching 500 configs. 2026-02-21T08:13:04.6766777Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:13:04.6767688Z @helion.kernel(config=helion.Config(block_sizes=[1, 1024], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=7, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T08:13:04.6768540Z 2026-02-21T08:13:04.6769197Z [121s] Code of selected kernel: /tmp/torchinductor_root/xz/cxzmpdljubuu42pq6odv7xvbsl2kv7wub6tl52m7bjruwfhzfnkd.py 2026-02-21T08:13:05.3243409Z WARNING:tritonbench.utils.triton_op:Completed input ID 5: 2026-02-21T08:13:05.3248232Z (M, N) 2026-02-21T08:13:05.3252587Z ----------- 2026-02-21T08:13:05.3255944Z (4096, 896) 2026-02-21T08:13:05.3259819Z 2026-02-21T08:13:05.3263875Z 10%|█ | 2/20 [04:02<36:44, 122.48s/it]WARNING:tritonbench.utils.triton_op:Running input ID 10: 2026-02-21T08:13:05.3264739Z (M, N) 2026-02-21T08:13:05.3264875Z ------------ 2026-02-21T08:13:05.3265017Z (4096, 1536) 2026-02-21T08:13:05.3265270Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T08:13:06.7753415Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 
2026-02-21T08:13:08.0988840Z INFO:tritonbench.utils.triton_op:Took 2.13ms to get benchmark function for torch_compile_softmax 2026-02-21T08:13:12.0912767Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:13:12.0913563Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:13:12.0913798Z 'dtype': 'torch.float16', 2026-02-21T08:13:12.0914030Z 'shape': (4096, 1536), 2026-02-21T08:13:12.0914256Z 'stride': (1536, 1)},), 2026-02-21T08:13:12.0914473Z 'kwargs': {}} 2026-02-21T08:13:12.0924871Z INFO:tritonbench.utils.triton_op:Took 1.59ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T08:13:12.2809404Z [0s] Autotune random seed: 2134816249 2026-02-21T08:13:12.4398101Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:13:39.5073404Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s 2026-02-21T08:13:45.5260258Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 16.7 configs/s 2026-02-21T08:13:45.5269607Z [33s] Adaptive compile timeout: 30s (90% percentile=2.0s, bounds=[30.0s, 30s]) 2026-02-21T08:13:46.0042615Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 2089.8 configs/s 2026-02-21T08:13:46.0567745Z [33s] Initial random population of 100, 5 starting points: 2026-02-21T08:13:46.0572332Z error=5 2026-02-21T08:13:46.0577699Z ok=95 2026-02-21T08:13:46.0579726Z min=0.0164 2026-02-21T08:13:46.0579880Z mid=0.1618 2026-02-21T08:13:46.0580015Z max=14.7754 2026-02-21T08:13:46.0580162Z best={'block_sizes': [4, 2048], 2026-02-21T08:13:46.0580419Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:13:46.0580682Z 'load_eviction_policies': ['', 'first'], 2026-02-21T08:13:46.0580867Z 'num_stages': 8, 2026-02-21T08:13:46.0581017Z 'num_warps': 4, 2026-02-21T08:13:46.0581159Z 'pid_type': 'flat', 2026-02-21T08:13:46.0581321Z 'range_flattens': [None, False], 2026-02-21T08:13:46.0581501Z 
'range_multi_buffers': [None, False], 2026-02-21T08:13:46.0581765Z 'range_num_stages': [0, 0], 2026-02-21T08:13:46.0581963Z 'range_unroll_factors': [0, 0], 2026-02-21T08:13:46.0582167Z 'range_warp_specializes': [None, True]} 2026-02-21T08:13:46.0582370Z [33s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:13:47.0271020Z [34s] Generation 1 starting: 78 neighbors, 5 active search path(s) 2026-02-21T08:13:53.5241119Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 4.8 configs/s 2026-02-21T08:13:58.7805263Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 82/82 15.7 configs/s 2026-02-21T08:14:02.4061379Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 281.4 2026-02-21T08:14:02.4062421Z configs/s 2026-02-21T08:14:02.7388559Z [50s] Generation 1 complete: 2026-02-21T08:14:02.7393988Z ok=84 2026-02-21T08:14:02.7399512Z min=0.0123 2026-02-21T08:14:02.7400879Z mid=0.0164 2026-02-21T08:14:02.7401056Z max=0.2846 2026-02-21T08:14:02.7401216Z best={'block_sizes': [2, 2048], 2026-02-21T08:14:02.7401528Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:14:02.7405056Z 'load_eviction_policies': ['', ''], 2026-02-21T08:14:02.7405243Z 'num_stages': 8, 2026-02-21T08:14:02.7405403Z 'num_warps': 4, 2026-02-21T08:14:02.7405557Z 'pid_type': 'flat', 2026-02-21T08:14:02.7405719Z 'range_flattens': [None, False], 2026-02-21T08:14:02.7405922Z 'range_multi_buffers': [None, False], 2026-02-21T08:14:02.7406120Z 'range_num_stages': [0, 0], 2026-02-21T08:14:02.7406310Z 'range_unroll_factors': [0, 0], 2026-02-21T08:14:02.7406502Z 'range_warp_specializes': [None, True]} 2026-02-21T08:14:02.7406741Z [50s] Fitting surrogate: 184 points, 184 targets 2026-02-21T08:14:03.8094304Z [51s] Generation 2 starting: 66 neighbors, 5 active search path(s) 2026-02-21T08:14:06.8648337Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 35.6 configs/s 2026-02-21T08:14:11.3347204Z Generation 2: exploring neighbors 
100% ━━━━━━━━━━━━━━━━━━━━ 69/69 15.6 configs/s 2026-02-21T08:14:14.6533754Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 307.9 2026-02-21T08:14:14.6538138Z configs/s 2026-02-21T08:14:14.9416114Z [62s] Generation 2 complete: 2026-02-21T08:14:14.9420387Z ok=71 2026-02-21T08:14:14.9424948Z min=0.0123 2026-02-21T08:14:14.9429400Z mid=0.0163 2026-02-21T08:14:14.9431772Z max=0.0350 2026-02-21T08:14:14.9431977Z best={'block_sizes': [2, 2048], 2026-02-21T08:14:14.9432231Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:14:14.9432500Z 'load_eviction_policies': ['', ''], 2026-02-21T08:14:14.9432678Z 'num_stages': 8, 2026-02-21T08:14:14.9432828Z 'num_warps': 4, 2026-02-21T08:14:14.9432980Z 'pid_type': 'flat', 2026-02-21T08:14:14.9433138Z 'range_flattens': [None, False], 2026-02-21T08:14:14.9433328Z 'range_multi_buffers': [None, False], 2026-02-21T08:14:14.9433514Z 'range_num_stages': [0, 0], 2026-02-21T08:14:14.9433688Z 'range_unroll_factors': [0, 0], 2026-02-21T08:14:14.9433892Z 'range_warp_specializes': [None, True]} 2026-02-21T08:14:14.9434130Z [62s] Fitting surrogate: 255 points, 255 targets 2026-02-21T08:14:15.8447379Z [63s] Generation 3 starting: 63 neighbors, 5 active search path(s) 2026-02-21T08:14:18.9944220Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 23.1 configs/s 2026-02-21T08:14:23.0364030Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 16.2 configs/s 2026-02-21T08:14:26.2615735Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 337.4 2026-02-21T08:14:26.2616076Z configs/s 2026-02-21T08:14:26.5552660Z [74s] Generation 3 complete: 2026-02-21T08:14:26.5554476Z ok=69 2026-02-21T08:14:26.5554655Z min=0.0123 2026-02-21T08:14:26.5554812Z mid=0.0143 2026-02-21T08:14:26.5554958Z max=0.0410 2026-02-21T08:14:26.5555110Z best={'block_sizes': [1, 2048], 2026-02-21T08:14:26.5555407Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 
2026-02-21T08:14:26.5555697Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:14:26.5555903Z 'num_stages': 4, 2026-02-21T08:14:26.5556050Z 'num_warps': 1, 2026-02-21T08:14:26.5556203Z 'pid_type': 'flat', 2026-02-21T08:14:26.5556365Z 'range_flattens': [None, None], 2026-02-21T08:14:26.5556553Z 'range_multi_buffers': [None, False], 2026-02-21T08:14:26.5556743Z 'range_num_stages': [0, 4], 2026-02-21T08:14:26.5556919Z 'range_unroll_factors': [0, 0], 2026-02-21T08:14:26.5557111Z 'range_warp_specializes': [None, True]} 2026-02-21T08:14:26.5570340Z [74s] Fitting surrogate: 324 points, 324 targets 2026-02-21T08:14:27.2677349Z [74s] Generation 4 starting: 46 neighbors, 4 active search path(s) 2026-02-21T08:14:29.6879696Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47/47 19.5 configs/s 2026-02-21T08:14:32.5622022Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 47/47 16.6 configs/s 2026-02-21T08:14:35.0766098Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 444.5 2026-02-21T08:14:35.0767139Z configs/s 2026-02-21T08:14:35.2956913Z [82s] Generation 4 complete: 2026-02-21T08:14:35.2962107Z ok=50 2026-02-21T08:14:35.2966696Z min=0.0102 2026-02-21T08:14:35.2971272Z mid=0.0123 2026-02-21T08:14:35.2975777Z max=0.0368 2026-02-21T08:14:35.2977905Z best={'block_sizes': [1, 2048], 2026-02-21T08:14:35.2978219Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:14:35.2978504Z 'load_eviction_policies': ['', ''], 2026-02-21T08:14:35.2978751Z 'num_stages': 5, 2026-02-21T08:14:35.2978905Z 'num_warps': 1, 2026-02-21T08:14:35.2982321Z 'pid_type': 'flat', 2026-02-21T08:14:35.2986886Z 'range_flattens': [None, True], 2026-02-21T08:14:35.2987202Z 'range_multi_buffers': [None, True], 2026-02-21T08:14:35.2987439Z 'range_num_stages': [0, 1], 2026-02-21T08:14:35.2991440Z 'range_unroll_factors': [0, 2], 2026-02-21T08:14:35.2996922Z 'range_warp_specializes': [None, False]} 2026-02-21T08:14:35.3000991Z [82s] 
Fitting surrogate: 374 points, 374 targets 2026-02-21T08:14:35.9349993Z [83s] Generation 5 starting: 39 neighbors, 4 active search path(s) 2026-02-21T08:14:39.8759092Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40/40 6.1 configs/s 2026-02-21T08:14:42.3402799Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 40/40 16.5 configs/s 2026-02-21T08:14:44.5064991Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 471.3 2026-02-21T08:14:44.5065489Z configs/s 2026-02-21T08:14:44.7074250Z [92s] Generation 5 complete: 2026-02-21T08:14:44.7078535Z ok=43 2026-02-21T08:14:44.7082928Z min=0.0102 2026-02-21T08:14:44.7087444Z mid=0.0123 2026-02-21T08:14:44.7091994Z max=0.1085 2026-02-21T08:14:44.7093577Z best={'block_sizes': [1, 2048], 2026-02-21T08:14:44.7093918Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:14:44.7097518Z 'load_eviction_policies': ['', ''], 2026-02-21T08:14:44.7101068Z 'num_stages': 5, 2026-02-21T08:14:44.7102478Z 'num_warps': 1, 2026-02-21T08:14:44.7102674Z 'pid_type': 'flat', 2026-02-21T08:14:44.7102843Z 'range_flattens': [None, True], 2026-02-21T08:14:44.7103040Z 'range_multi_buffers': [None, True], 2026-02-21T08:14:44.7103237Z 'range_num_stages': [0, 1], 2026-02-21T08:14:44.7103405Z 'range_unroll_factors': [0, 2], 2026-02-21T08:14:44.7103596Z 'range_warp_specializes': [None, False]} 2026-02-21T08:14:44.7103894Z [92s] Fitting surrogate: 417 points, 417 targets 2026-02-21T08:14:45.3127059Z [92s] Generation 6 starting: 35 neighbors, 4 active search path(s) 2026-02-21T08:14:47.4359882Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38/38 22.3 configs/s 2026-02-21T08:14:49.7991395Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 38/38 16.4 configs/s 2026-02-21T08:14:52.1101027Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 487.5 2026-02-21T08:14:52.1105476Z configs/s 2026-02-21T08:14:52.3176610Z [99s] Generation 6 complete: 
2026-02-21T08:14:52.3178195Z ok=39 2026-02-21T08:14:52.3178367Z min=0.0102 2026-02-21T08:14:52.3178506Z mid=0.0123 2026-02-21T08:14:52.3178625Z max=0.0123 2026-02-21T08:14:52.3178771Z best={'block_sizes': [1, 2048], 2026-02-21T08:14:52.3179037Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:14:52.3179314Z 'load_eviction_policies': ['', ''], 2026-02-21T08:14:52.3179544Z 'num_stages': 5, 2026-02-21T08:14:52.3179693Z 'num_warps': 1, 2026-02-21T08:14:52.3179916Z 'pid_type': 'flat', 2026-02-21T08:14:52.3181374Z 'range_flattens': [None, True], 2026-02-21T08:14:52.3181648Z 'range_multi_buffers': [None, True], 2026-02-21T08:14:52.3181899Z 'range_num_stages': [0, 1], 2026-02-21T08:14:52.3182096Z 'range_unroll_factors': [0, 2], 2026-02-21T08:14:52.3182284Z 'range_warp_specializes': [None, False]} 2026-02-21T08:14:52.3189887Z [99s] Fitting surrogate: 456 points, 456 targets 2026-02-21T08:14:52.5824418Z [100s] Generation 7 starting: 10 neighbors, 1 active search path(s) 2026-02-21T08:14:53.5044188Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 18.3 configs/s 2026-02-21T08:14:54.1233762Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 17.4 configs/s 2026-02-21T08:14:54.7594941Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1587.9 2026-02-21T08:14:54.7599035Z configs/s 2026-02-21T08:14:54.8258562Z [102s] Generation 7 complete: 2026-02-21T08:14:54.8262837Z ok=12 2026-02-21T08:14:54.8264380Z min=0.0102 2026-02-21T08:14:54.8264549Z mid=0.0123 2026-02-21T08:14:54.8264687Z max=0.0123 2026-02-21T08:14:54.8264831Z best={'block_sizes': [1, 2048], 2026-02-21T08:14:54.8265110Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:14:54.8265438Z 'load_eviction_policies': ['', ''], 2026-02-21T08:14:54.8265641Z 'num_stages': 5, 2026-02-21T08:14:54.8265793Z 'num_warps': 1, 2026-02-21T08:14:54.8265937Z 'pid_type': 'flat', 2026-02-21T08:14:54.8266104Z 
'range_flattens': [None, True], 2026-02-21T08:14:54.8266284Z 'range_multi_buffers': [None, True], 2026-02-21T08:14:54.8266476Z 'range_num_stages': [0, 1], 2026-02-21T08:14:54.8266643Z 'range_unroll_factors': [0, 2], 2026-02-21T08:14:54.8266829Z 'range_warp_specializes': [None, False]} 2026-02-21T08:14:54.8280695Z [102s] Fitting surrogate: 468 points, 468 targets 2026-02-21T08:14:54.9988720Z [102s] Autotuning complete in 102.6s after searching 451 configs. 2026-02-21T08:14:54.9990428Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:14:54.9991413Z @helion.kernel(config=helion.Config(block_sizes=[1, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', ''], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T08:14:54.9992534Z 2026-02-21T08:14:54.9992790Z [102s] Code of selected kernel: /tmp/torchinductor_root/aw/caw6zyyh5d37sapqzmdiy3gwzprk2nqshcqamsu2nqflpi4266ja.py 2026-02-21T08:14:55.8270005Z WARNING:tritonbench.utils.triton_op:Completed input ID 10: 2026-02-21T08:14:55.8273310Z (M, N) 2026-02-21T08:14:55.8279730Z ------------ 2026-02-21T08:14:55.8280048Z (4096, 1536) 2026-02-21T08:14:55.8280214Z 2026-02-21T08:14:55.8280955Z 15%|█▌ | 3/20 [05:53<33:09, 117.01s/it]WARNING:tritonbench.utils.triton_op:Running input ID 15: 2026-02-21T08:14:55.8281358Z (M, N) 2026-02-21T08:14:55.8285351Z ------------ 2026-02-21T08:14:55.8287345Z (4096, 2176) 2026-02-21T08:14:55.8287681Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T08:14:57.2528171Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T08:14:58.5048946Z INFO:tritonbench.utils.triton_op:Took 2.33ms to get benchmark function for torch_compile_softmax 
2026-02-21T08:15:01.6170684Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:15:01.6174828Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:15:01.6179001Z 'dtype': 'torch.float16', 2026-02-21T08:15:01.6182902Z 'shape': (4096, 2176), 2026-02-21T08:15:01.6187395Z 'stride': (2176, 1)},), 2026-02-21T08:15:01.6191495Z 'kwargs': {}} 2026-02-21T08:15:01.6196335Z INFO:tritonbench.utils.triton_op:Took 1.95ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T08:15:01.7952460Z [0s] Autotune random seed: 2134816249 2026-02-21T08:15:01.9370186Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:15:34.6720575Z [32s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None]) 2026-02-21T08:15:34.7888334Z [32s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False]) 2026-02-21T08:15:34.7910174Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s 2026-02-21T08:15:40.8088256Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 16.6 configs/s 2026-02-21T08:15:40.8100072Z [38s] Adaptive compile timeout: 30s (90% percentile=2.8s, bounds=[30.0s, 30s]) 2026-02-21T08:15:41.5405377Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1370.0 configs/s 
2026-02-21T08:15:41.6113533Z [39s] Initial random population of 100, 5 starting points: 2026-02-21T08:15:41.6117150Z error=5 2026-02-21T08:15:41.6119211Z timeout=2 2026-02-21T08:15:41.6119421Z ok=93 2026-02-21T08:15:41.6124819Z min=0.0224 2026-02-21T08:15:41.6126923Z mid=0.2211 2026-02-21T08:15:41.6127093Z max=14.6094 2026-02-21T08:15:41.6127254Z best={'block_sizes': [2, 1024], 2026-02-21T08:15:41.6127526Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:15:41.6127813Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:15:41.6128010Z 'num_sm_multiplier': 64, 2026-02-21T08:15:41.6128175Z 'num_stages': 5, 2026-02-21T08:15:41.6128313Z 'num_warps': 1, 2026-02-21T08:15:41.6128473Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:15:41.6128661Z 'range_flattens': [True, True], 2026-02-21T08:15:41.6128874Z 'range_multi_buffers': [False, None], 2026-02-21T08:15:41.6129077Z 'range_num_stages': [3, 1], 2026-02-21T08:15:41.6129240Z 'range_unroll_factors': [0, 2], 2026-02-21T08:15:41.6129421Z 'range_warp_specializes': [True, None]} 2026-02-21T08:15:41.6129705Z [39s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:15:42.7867990Z [40s] Generation 1 starting: 88 neighbors, 5 active search path(s) 2026-02-21T08:15:47.3939634Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 22.9 configs/s 2026-02-21T08:15:53.0084660Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 16.5 configs/s 2026-02-21T08:15:56.2509728Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 314.0 2026-02-21T08:15:56.2513472Z configs/s 2026-02-21T08:15:56.5241490Z [54s] Generation 1 complete: 2026-02-21T08:15:56.5245412Z ok=94 2026-02-21T08:15:56.5250422Z min=0.0164 2026-02-21T08:15:56.5252592Z mid=0.0245 2026-02-21T08:15:56.5252852Z max=0.1107 2026-02-21T08:15:56.5253430Z best={'block_sizes': [1, 4096], 2026-02-21T08:15:56.5257540Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 
2026-02-21T08:15:56.5262018Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:15:56.5266297Z 'maxnreg': 256, 2026-02-21T08:15:56.5269512Z 'num_sm_multiplier': 16, 2026-02-21T08:15:56.5271683Z 'num_stages': 5, 2026-02-21T08:15:56.5271955Z 'num_warps': 4, 2026-02-21T08:15:56.5272132Z 'pid_type': 'persistent_blocked', 2026-02-21T08:15:56.5274804Z 'range_flattens': [None, False], 2026-02-21T08:15:56.5275102Z 'range_multi_buffers': [None, True], 2026-02-21T08:15:56.5279201Z 'range_num_stages': [3, 4], 2026-02-21T08:15:56.5282405Z 'range_unroll_factors': [1, 0], 2026-02-21T08:15:56.5286810Z 'range_warp_specializes': [None, True]} 2026-02-21T08:15:56.5290015Z [54s] Fitting surrogate: 194 points, 194 targets 2026-02-21T08:15:57.6697582Z [55s] Generation 2 starting: 82 neighbors, 5 active search path(s) 2026-02-21T08:16:09.6692124Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 1.8 configs/s 2026-02-21T08:16:14.8908864Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 16.4 configs/s 2026-02-21T08:16:18.5806296Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 276.4 2026-02-21T08:16:18.5807612Z configs/s 2026-02-21T08:16:18.8890107Z [76s] Generation 2 complete: 2026-02-21T08:16:18.8893899Z ok=88 2026-02-21T08:16:18.8898407Z min=0.0143 2026-02-21T08:16:18.8902545Z mid=0.0205 2026-02-21T08:16:18.8907341Z max=0.0777 2026-02-21T08:16:18.8911815Z best={'block_sizes': [1, 4096], 2026-02-21T08:16:18.8913210Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:16:18.8913481Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:16:18.8913686Z 'maxnreg': 256, 2026-02-21T08:16:18.8913839Z 'num_sm_multiplier': 16, 2026-02-21T08:16:18.8914010Z 'num_stages': 5, 2026-02-21T08:16:18.8914181Z 'num_warps': 1, 2026-02-21T08:16:18.8914352Z 'pid_type': 'persistent_blocked', 2026-02-21T08:16:18.8914539Z 'range_flattens': [None, False], 2026-02-21T08:16:18.8914716Z 'range_multi_buffers': [None, True], 
2026-02-21T08:16:18.8914902Z 'range_num_stages': [3, 3], 2026-02-21T08:16:18.8915067Z 'range_unroll_factors': [1, 0], 2026-02-21T08:16:18.8915252Z 'range_warp_specializes': [None, False]} 2026-02-21T08:16:18.8915583Z [76s] Fitting surrogate: 282 points, 282 targets 2026-02-21T08:16:20.0589111Z [78s] Generation 3 starting: 83 neighbors, 5 active search path(s) 2026-02-21T08:16:24.8309374Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 19.4 configs/s 2026-02-21T08:16:29.9557595Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 16.5 configs/s 2026-02-21T08:16:34.0345848Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 250.3 2026-02-21T08:16:34.0347021Z configs/s 2026-02-21T08:16:34.3952211Z [92s] Generation 3 complete: 2026-02-21T08:16:34.3954070Z error=2 2026-02-21T08:16:34.3954229Z ok=87 2026-02-21T08:16:34.3954360Z min=0.0143 2026-02-21T08:16:34.3954495Z mid=0.0184 2026-02-21T08:16:34.3954615Z max=0.0532 2026-02-21T08:16:34.3954759Z best={'block_sizes': [1, 4096], 2026-02-21T08:16:34.3954984Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:16:34.3955225Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:16:34.3955410Z 'maxnreg': 256, 2026-02-21T08:16:34.3955562Z 'num_sm_multiplier': 16, 2026-02-21T08:16:34.3955719Z 'num_stages': 5, 2026-02-21T08:16:34.3955863Z 'num_warps': 1, 2026-02-21T08:16:34.3956020Z 'pid_type': 'persistent_blocked', 2026-02-21T08:16:34.3956199Z 'range_flattens': [None, False], 2026-02-21T08:16:34.3956382Z 'range_multi_buffers': [None, True], 2026-02-21T08:16:34.3956562Z 'range_num_stages': [3, 4], 2026-02-21T08:16:34.3956743Z 'range_unroll_factors': [1, 0], 2026-02-21T08:16:34.3960321Z 'range_warp_specializes': [None, False]} 2026-02-21T08:16:34.3971106Z [92s] Fitting surrogate: 371 points, 371 targets 2026-02-21T08:16:35.2697707Z [93s] Generation 4 starting: 60 neighbors, 4 active search path(s) 2026-02-21T08:16:38.5768566Z Generation 4: precompiling 100% 
━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 29.4 configs/s 2026-02-21T08:16:42.4252485Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 63/63 16.6 configs/s 2026-02-21T08:16:45.9549133Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 289.2 2026-02-21T08:16:45.9552832Z configs/s 2026-02-21T08:16:46.2796060Z [104s] Generation 4 complete: 2026-02-21T08:16:46.2800147Z ok=65 2026-02-21T08:16:46.2802303Z min=0.0143 2026-02-21T08:16:46.2802465Z mid=0.0144 2026-02-21T08:16:46.2802600Z max=0.0266 2026-02-21T08:16:46.2802742Z best={'block_sizes': [1, 4096], 2026-02-21T08:16:46.2803026Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:16:46.2803825Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:16:46.2804033Z 'num_stages': 5, 2026-02-21T08:16:46.2804186Z 'num_warps': 4, 2026-02-21T08:16:46.2804330Z 'pid_type': 'flat', 2026-02-21T08:16:46.2804496Z 'range_flattens': [None, True], 2026-02-21T08:16:46.2804678Z 'range_multi_buffers': [None, True], 2026-02-21T08:16:46.2804873Z 'range_num_stages': [0, 1], 2026-02-21T08:16:46.2805044Z 'range_unroll_factors': [0, 0], 2026-02-21T08:16:46.2805231Z 'range_warp_specializes': [None, True]} 2026-02-21T08:16:46.2814884Z [104s] Fitting surrogate: 436 points, 436 targets 2026-02-21T08:16:47.0156973Z [105s] Generation 5 starting: 52 neighbors, 4 active search path(s) 2026-02-21T08:16:50.6182724Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 11.9 configs/s 2026-02-21T08:16:53.9282918Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 54/54 16.5 configs/s 2026-02-21T08:16:57.1678030Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 351.7 2026-02-21T08:16:57.1678407Z configs/s 2026-02-21T08:16:57.4452495Z [115s] Generation 5 complete: 2026-02-21T08:16:57.4456546Z ok=56 2026-02-21T08:16:57.4460586Z min=0.0123 2026-02-21T08:16:57.4462628Z mid=0.0143 2026-02-21T08:16:57.4462863Z max=0.0225 2026-02-21T08:16:57.4463034Z 
best={'block_sizes': [1, 4096], 2026-02-21T08:16:57.4463317Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:16:57.4468405Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:16:57.4472235Z 'num_stages': 4, 2026-02-21T08:16:57.4475340Z 'num_warps': 1, 2026-02-21T08:16:57.4479284Z 'pid_type': 'flat', 2026-02-21T08:16:57.4483145Z 'range_flattens': [None, False], 2026-02-21T08:16:57.4487494Z 'range_multi_buffers': [None, False], 2026-02-21T08:16:57.4488992Z 'range_num_stages': [0, 4], 2026-02-21T08:16:57.4489219Z 'range_unroll_factors': [0, 3], 2026-02-21T08:16:57.4489468Z 'range_warp_specializes': [None, None]} 2026-02-21T08:16:57.4489790Z [115s] Fitting surrogate: 492 points, 492 targets 2026-02-21T08:16:58.1185695Z [116s] Generation 6 starting: 39 neighbors, 4 active search path(s) 2026-02-21T08:17:00.7963710Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 17.9 configs/s 2026-02-21T08:17:03.3880631Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 42/42 16.5 configs/s 2026-02-21T08:17:05.6458904Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 451.7 2026-02-21T08:17:05.6459312Z configs/s 2026-02-21T08:17:05.8611297Z [123s] Generation 6 complete: 2026-02-21T08:17:05.8611931Z ok=44 2026-02-21T08:17:05.8612086Z min=0.0123 2026-02-21T08:17:05.8612219Z mid=0.0143 2026-02-21T08:17:05.8612346Z max=0.0266 2026-02-21T08:17:05.8612482Z best={'block_sizes': [1, 4096], 2026-02-21T08:17:05.8612739Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:17:05.8613425Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:17:05.8613789Z 'num_stages': 5, 2026-02-21T08:17:05.8613929Z 'num_warps': 1, 2026-02-21T08:17:05.8614080Z 'pid_type': 'flat', 2026-02-21T08:17:05.8614243Z 'range_flattens': [None, False], 2026-02-21T08:17:05.8614419Z 'range_multi_buffers': [None, True], 2026-02-21T08:17:05.8614604Z 'range_num_stages': [0, 1], 
2026-02-21T08:17:05.8614766Z 'range_unroll_factors': [0, 1], 2026-02-21T08:17:05.8614948Z 'range_warp_specializes': [None, False]} 2026-02-21T08:17:05.8628861Z [123s] Fitting surrogate: 536 points, 536 targets 2026-02-21T08:17:06.4232020Z [124s] Generation 7 starting: 29 neighbors, 3 active search path(s) 2026-02-21T08:17:09.4192934Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 5.9 configs/s 2026-02-21T08:17:11.3207513Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 31/31 16.7 configs/s 2026-02-21T08:17:13.3667319Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 569.4 2026-02-21T08:17:13.3670978Z configs/s 2026-02-21T08:17:13.5383013Z [131s] Generation 7 complete: 2026-02-21T08:17:13.5384669Z ok=33 2026-02-21T08:17:13.5384838Z min=0.0123 2026-02-21T08:17:13.5384981Z mid=0.0125 2026-02-21T08:17:13.5385110Z max=0.0225 2026-02-21T08:17:13.5385270Z best={'block_sizes': [1, 4096], 2026-02-21T08:17:13.5385544Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:17:13.5385843Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:17:13.5386028Z 'num_stages': 5, 2026-02-21T08:17:13.5386173Z 'num_warps': 1, 2026-02-21T08:17:13.5386315Z 'pid_type': 'flat', 2026-02-21T08:17:13.5386481Z 'range_flattens': [None, False], 2026-02-21T08:17:13.5386667Z 'range_multi_buffers': [None, None], 2026-02-21T08:17:13.5386850Z 'range_num_stages': [0, 1], 2026-02-21T08:17:13.5387021Z 'range_unroll_factors': [0, 1], 2026-02-21T08:17:13.5387222Z 'range_warp_specializes': [None, False]} 2026-02-21T08:17:13.5399319Z [131s] Fitting surrogate: 569 points, 569 targets 2026-02-21T08:17:14.1907745Z [132s] Generation 8 starting: 30 neighbors, 3 active search path(s) 2026-02-21T08:17:15.7871113Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 35.7 configs/s 2026-02-21T08:17:17.6247742Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 30/30 16.7 configs/s 
2026-02-21T08:17:19.3435087Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 592.3 2026-02-21T08:17:19.3435516Z configs/s 2026-02-21T08:17:19.5022492Z [137s] Generation 8 complete: 2026-02-21T08:17:19.5024361Z ok=33 2026-02-21T08:17:19.5024576Z min=0.0123 2026-02-21T08:17:19.5024742Z mid=0.0142 2026-02-21T08:17:19.5024909Z max=0.0247 2026-02-21T08:17:19.5025099Z best={'block_sizes': [1, 4096], 2026-02-21T08:17:19.5025369Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:17:19.5025690Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:17:19.5026265Z 'num_stages': 4, 2026-02-21T08:17:19.5026433Z 'num_warps': 1, 2026-02-21T08:17:19.5026598Z 'pid_type': 'flat', 2026-02-21T08:17:19.5026780Z 'range_flattens': [None, False], 2026-02-21T08:17:19.5026989Z 'range_multi_buffers': [None, None], 2026-02-21T08:17:19.5027202Z 'range_num_stages': [0, 4], 2026-02-21T08:17:19.5027394Z 'range_unroll_factors': [0, 3], 2026-02-21T08:17:19.5027607Z 'range_warp_specializes': [None, None]} 2026-02-21T08:17:19.5042212Z [137s] Fitting surrogate: 602 points, 602 targets 2026-02-21T08:17:19.9696302Z [138s] Generation 9 starting: 14 neighbors, 2 active search path(s) 2026-02-21T08:17:20.9959111Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 24.9 configs/s 2026-02-21T08:17:21.9159353Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 15/15 17.1 configs/s 2026-02-21T08:17:22.7386720Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1230.8 2026-02-21T08:17:22.7391043Z configs/s 2026-02-21T08:17:22.8176803Z [140s] Generation 9 complete: 2026-02-21T08:17:22.8181802Z ok=17 2026-02-21T08:17:22.8185643Z min=0.0123 2026-02-21T08:17:22.8190045Z mid=0.0123 2026-02-21T08:17:22.8193928Z max=0.0267 2026-02-21T08:17:22.8195907Z best={'block_sizes': [1, 4096], 2026-02-21T08:17:22.8196213Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:17:22.8196484Z 
'load_eviction_policies': ['last', 'last'], 2026-02-21T08:17:22.8196680Z 'num_stages': 5, 2026-02-21T08:17:22.8196821Z 'num_warps': 1, 2026-02-21T08:17:22.8196973Z 'pid_type': 'flat', 2026-02-21T08:17:22.8197130Z 'range_flattens': [None, False], 2026-02-21T08:17:22.8197317Z 'range_multi_buffers': [None, None], 2026-02-21T08:17:22.8197496Z 'range_num_stages': [0, 1], 2026-02-21T08:17:22.8197667Z 'range_unroll_factors': [0, 1], 2026-02-21T08:17:22.8197852Z 'range_warp_specializes': [None, False]} 2026-02-21T08:17:22.8202029Z [140s] Fitting surrogate: 619 points, 619 targets 2026-02-21T08:17:23.3387999Z [141s] Generation 10 starting: 18 neighbors, 2 active search path(s) 2026-02-21T08:17:25.9632741Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 6.6 configs/s 2026-02-21T08:17:27.0423558Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 17.4 configs/s 2026-02-21T08:17:28.4955936Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 966.0 2026-02-21T08:17:28.4960406Z configs/s 2026-02-21T08:17:28.6059138Z [146s] Generation 10 complete: 2026-02-21T08:17:28.6060878Z ok=20 2026-02-21T08:17:28.6061105Z min=0.0123 2026-02-21T08:17:28.6061303Z mid=0.0124 2026-02-21T08:17:28.6061494Z max=0.0204 2026-02-21T08:17:28.6061796Z best={'block_sizes': [1, 4096], 2026-02-21T08:17:28.6062087Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:17:28.6062435Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:17:28.6062698Z 'num_stages': 5, 2026-02-21T08:17:28.6065747Z 'num_warps': 1, 2026-02-21T08:17:28.6069166Z 'pid_type': 'flat', 2026-02-21T08:17:28.6074804Z 'range_flattens': [None, False], 2026-02-21T08:17:28.6076913Z 'range_multi_buffers': [None, None], 2026-02-21T08:17:28.6077181Z 'range_num_stages': [0, 1], 2026-02-21T08:17:28.6077391Z 'range_unroll_factors': [0, 1], 2026-02-21T08:17:28.6077618Z 'range_warp_specializes': [None, False]} 2026-02-21T08:17:28.6077962Z [146s] Fitting surrogate: 639 points, 639 
targets 2026-02-21T08:17:29.1957036Z [147s] Generation 11 starting: 19 neighbors, 2 active search path(s) 2026-02-21T08:17:32.5379400Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 2.0 configs/s 2026-02-21T08:17:33.7126781Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 16.8 configs/s 2026-02-21T08:17:34.8153116Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 924.4 2026-02-21T08:17:34.8157695Z configs/s 2026-02-21T08:17:34.9376779Z [153s] Generation 11 complete: 2026-02-21T08:17:34.9380949Z ok=21 2026-02-21T08:17:34.9384270Z min=0.0123 2026-02-21T08:17:34.9388851Z mid=0.0124 2026-02-21T08:17:34.9392779Z max=0.0204 2026-02-21T08:17:34.9398117Z best={'block_sizes': [1, 4096], 2026-02-21T08:17:34.9401931Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:17:34.9402342Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:17:34.9406189Z 'num_stages': 5, 2026-02-21T08:17:34.9409865Z 'num_warps': 1, 2026-02-21T08:17:34.9413560Z 'pid_type': 'flat', 2026-02-21T08:17:34.9417697Z 'range_flattens': [None, False], 2026-02-21T08:17:34.9421531Z 'range_multi_buffers': [None, None], 2026-02-21T08:17:34.9423127Z 'range_num_stages': [0, 1], 2026-02-21T08:17:34.9423373Z 'range_unroll_factors': [0, 2], 2026-02-21T08:17:34.9423590Z 'range_warp_specializes': [None, False]} 2026-02-21T08:17:34.9424467Z [153s] Fitting surrogate: 660 points, 660 targets 2026-02-21T08:17:35.5002414Z [153s] Generation 12 starting: 17 neighbors, 2 active search path(s) 2026-02-21T08:17:37.5054589Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 6.3 configs/s 2026-02-21T08:17:38.5692148Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 16.7 configs/s 2026-02-21T08:17:39.6249799Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 964.4 2026-02-21T08:17:39.6250250Z configs/s 2026-02-21T08:17:39.7421194Z [157s] Generation 12 complete: 2026-02-21T08:17:39.7421476Z ok=19 
2026-02-21T08:17:39.7422133Z min=0.0123 2026-02-21T08:17:39.7422288Z mid=0.0124 2026-02-21T08:17:39.7427022Z max=0.0164 2026-02-21T08:17:39.7431806Z best={'block_sizes': [1, 4096], 2026-02-21T08:17:39.7433497Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:17:39.7433890Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:17:39.7434220Z 'num_stages': 5, 2026-02-21T08:17:39.7434419Z 'num_warps': 1, 2026-02-21T08:17:39.7434610Z 'pid_type': 'flat', 2026-02-21T08:17:39.7434838Z 'range_flattens': [None, False], 2026-02-21T08:17:39.7435086Z 'range_multi_buffers': [None, False], 2026-02-21T08:17:39.7435326Z 'range_num_stages': [0, 1], 2026-02-21T08:17:39.7435568Z 'range_unroll_factors': [0, 2], 2026-02-21T08:17:39.7435816Z 'range_warp_specializes': [None, False]} 2026-02-21T08:17:39.7445030Z [157s] Fitting surrogate: 679 points, 679 targets 2026-02-21T08:17:40.1865023Z [158s] Generation 13 starting: 11 neighbors, 1 active search path(s) 2026-02-21T08:17:43.2242296Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 1.8 configs/s 2026-02-21T08:17:43.9083569Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 17.2 configs/s 2026-02-21T08:17:44.5091156Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1682.5 2026-02-21T08:17:44.5092150Z configs/s 2026-02-21T08:17:44.5786172Z [162s] Generation 13 complete: 2026-02-21T08:17:44.5790566Z ok=12 2026-02-21T08:17:44.5795175Z min=0.0123 2026-02-21T08:17:44.5799588Z mid=0.0123 2026-02-21T08:17:44.5804282Z max=0.0204 2026-02-21T08:17:44.5808913Z best={'block_sizes': [1, 4096], 2026-02-21T08:17:44.5813828Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:17:44.5814249Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:17:44.5814523Z 'num_stages': 5, 2026-02-21T08:17:44.5818930Z 'num_warps': 1, 2026-02-21T08:17:44.5823649Z 'pid_type': 'flat', 2026-02-21T08:17:44.5825087Z 'range_flattens': [None, False], 
2026-02-21T08:17:44.5825347Z 'range_multi_buffers': [None, False], 2026-02-21T08:17:44.5825565Z 'range_num_stages': [0, 1], 2026-02-21T08:17:44.5825765Z 'range_unroll_factors': [0, 2], 2026-02-21T08:17:44.5825966Z 'range_warp_specializes': [None, False]} 2026-02-21T08:17:44.5826336Z [162s] Fitting surrogate: 691 points, 691 targets 2026-02-21T08:17:44.8930976Z [162s] Autotuning complete in 163.0s after searching 660 configs. 2026-02-21T08:17:44.8931379Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:17:44.8932758Z @helion.kernel(config=helion.Config(block_sizes=[1, 4096], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T08:17:44.8933720Z 2026-02-21T08:17:44.8934004Z [162s] Code of selected kernel: /tmp/torchinductor_root/ds/cdsvhtpe65d2cr6qxifitp7ecvoikhbcqzhfyd7orochwnmaoojc.py 2026-02-21T08:17:45.7236453Z WARNING:tritonbench.utils.triton_op:Completed input ID 15: 2026-02-21T08:17:45.7240600Z (M, N) 2026-02-21T08:17:45.7242680Z ------------ 2026-02-21T08:17:45.7242920Z (4096, 2176) 2026-02-21T08:17:45.7243556Z 2026-02-21T08:17:45.7244244Z 20%|██ | 4/20 [08:43<36:46, 137.89s/it]WARNING:tritonbench.utils.triton_op:Running input ID 20: 2026-02-21T08:17:45.7249184Z (M, N) 2026-02-21T08:17:45.7250582Z ------------ 2026-02-21T08:17:45.7250793Z (4096, 2816) 2026-02-21T08:17:45.7251169Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T08:17:47.1232461Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T08:17:48.6522774Z INFO:tritonbench.utils.triton_op:Took 2.57ms to get benchmark function for torch_compile_softmax 2026-02-21T08:17:52.2940321Z 
WARNING:__main__:Input tensor metadata: 2026-02-21T08:17:52.2943921Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:17:52.2948598Z 'dtype': 'torch.float16', 2026-02-21T08:17:52.2953902Z 'shape': (4096, 2816), 2026-02-21T08:17:52.2958388Z 'stride': (2816, 1)},), 2026-02-21T08:17:52.2961123Z 'kwargs': {}} 2026-02-21T08:17:52.2966194Z INFO:tritonbench.utils.triton_op:Took 2.78ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T08:17:52.5005468Z [0s] Autotune random seed: 2134816249 2026-02-21T08:17:52.6670658Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:18:26.9070732Z [34s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None]) 2026-02-21T08:18:27.1778009Z [34s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False]) 2026-02-21T08:18:27.1799086Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s 2026-02-21T08:18:27.3766059Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:18:27.3766630Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:18:27.3767158Z %cst = arith.constant dense<0.000000e+00> : tensor<8x512xf16> 
2026-02-21T08:18:27.3767412Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:18:27.3767624Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:18:27.3767832Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:18:27.3768069Z %cst_0 = arith.constant dense<2816> : tensor<8x1xi32> 2026-02-21T08:18:27.3768386Z %cst_1 = arith.constant dense<0.000000e+00> : tensor<8x512xf32> 2026-02-21T08:18:27.3769222Z %cst_2 = arith.constant dense<0xFC00> : tensor<8x512xf16> 2026-02-21T08:18:27.3769496Z %cst_3 = arith.constant dense<2816> : tensor<512xi32> 2026-02-21T08:18:27.3769756Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32> 2026-02-21T08:18:27.3770032Z %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32> 2026-02-21T08:18:27.3770264Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:18:27.3770477Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:18:27.3770677Z %c2816_i32 = arith.constant 2816 : i32 2026-02-21T08:18:27.3770869Z %c2816_i64 = arith.constant 2816 : i64 2026-02-21T08:18:27.3771066Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:18:27.3771401Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c2816_i32], [%c2816_i64, %c1_i64] : , > 2026-02-21T08:18:27.3772027Z %1 = tt.get_program_id x : i32 2026-02-21T08:18:27.3772411Z scf.for %arg2 = %1 to %c512_i32 step %c9472_i32 : i32 { 2026-02-21T08:18:27.3772661Z %2 = arith.muli %arg2, %c8_i32 : i32 2026-02-21T08:18:27.3773058Z %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:18:27.3773329Z %4 = tt.splat %2 : i32 -> tensor<8xi32> 2026-02-21T08:18:27.3773527Z %5 = arith.addi %4, %3 : tensor<8xi32> 2026-02-21T08:18:27.3773714Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:18:27.3773912Z %c2048_i32_6 = arith.constant 2048 : i32 2026-02-21T08:18:27.3774195Z %6 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:18:27.3774795Z %7 = tt.splat %c0_i32 : i32 -> tensor<512xi32> 2026-02-21T08:18:27.3775039Z %8 = arith.addi %7, %6 : tensor<512xi32> 
2026-02-21T08:18:27.3775313Z %9 = arith.cmpi slt, %8, %cst_3 : tensor<512xi32> 2026-02-21T08:18:27.3775644Z %10 = tt.descriptor_load %0[%2, %c0_i32] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:18:27.3776035Z %11 = tt.expand_dims %9 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:18:27.3776350Z %12 = tt.broadcast %11 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:18:27.3776649Z %13 = arith.select %12, %10, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T08:18:27.3776950Z %14 = arith.extf %13 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:18:27.3777196Z %15 = "tt.reduce"(%14) <{axis = 1 : i32}> ({ 2026-02-21T08:18:27.3777411Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:18:27.3777609Z %231 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:18:27.3777822Z tt.reduce.return %231 : f32 2026-02-21T08:18:27.3778014Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:18:27.3778254Z %16 = arith.truncf %15 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:18:27.3778516Z %17 = arith.extf %16 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:18:27.3778762Z %18 = arith.cmpf ogt, %cst_5, %17 : tensor<8xf32> 2026-02-21T08:18:27.3779013Z %19 = arith.cmpf une, %cst_5, %cst_5 : tensor<8xf32> 2026-02-21T08:18:27.3779240Z %20 = arith.ori %18, %19 : tensor<8xi1> 2026-02-21T08:18:27.3779488Z %21 = arith.select %20, %cst_5, %17 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:18:27.3779739Z %22 = arith.subf %cst_5, %21 : tensor<8xf32> 2026-02-21T08:18:27.3780140Z %23 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:18:27.3780532Z %24 = arith.mulf %cst_4, %23 : tensor<8xf32> 2026-02-21T08:18:27.3780794Z %25 = tt.expand_dims %21 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:18:27.3781114Z %26 = arith.extf %10 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:18:27.3781380Z %27 = tt.broadcast %25 : tensor<8x1xf32> -> tensor<8x512xf32> 
2026-02-21T08:18:27.3781667Z %28 = arith.subf %26, %27 : tensor<8x512xf32> 2026-02-21T08:18:27.3782054Z %29 = tt.extern_elementwise %28 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:18:27.3782581Z %30 = arith.select %12, %29, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T08:18:27.3782856Z %31 = "tt.reduce"(%30) <{axis = 1 : i32}> ({ 2026-02-21T08:18:27.3783054Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:18:27.3783249Z %231 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:18:27.3783445Z tt.reduce.return %231 : f32 2026-02-21T08:18:27.3783646Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:18:27.3783848Z %32 = arith.addf %24, %31 : tensor<8xf32> 2026-02-21T08:18:27.3784059Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:18:27.3784262Z %33 = arith.muli %c512_i32, %c1_i32 : i32 2026-02-21T08:18:27.3784460Z %34 = arith.addi %c0_i32, %33 : i32 2026-02-21T08:18:27.3784713Z %35 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:18:27.3785046Z %36 = tt.splat %34 : i32 -> tensor<512xi32> 2026-02-21T08:18:27.3785269Z %37 = arith.addi %36, %35 : tensor<512xi32> 2026-02-21T08:18:27.3785495Z %38 = arith.cmpi slt, %37, %cst_3 : tensor<512xi32> 2026-02-21T08:18:27.3785869Z %39 = tt.descriptor_load %0[%2, %34] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:18:27.3786238Z %40 = tt.expand_dims %38 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:18:27.3786543Z %41 = tt.broadcast %40 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:18:27.3786839Z %42 = arith.select %41, %39, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T08:18:27.3787136Z %43 = arith.extf %42 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:18:27.3787390Z %44 = "tt.reduce"(%43) <{axis = 1 : i32}> ({ 2026-02-21T08:18:27.3787597Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:18:27.3787806Z %231 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:18:27.3788022Z 
tt.reduce.return %231 : f32 2026-02-21T08:18:27.3788224Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:18:27.3788469Z %45 = arith.truncf %44 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:18:27.3788721Z %46 = arith.extf %45 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:18:27.3788968Z %47 = arith.cmpf ogt, %21, %46 : tensor<8xf32> 2026-02-21T08:18:27.3789191Z %48 = arith.cmpf une, %21, %21 : tensor<8xf32> 2026-02-21T08:18:27.3789409Z %49 = arith.ori %47, %48 : tensor<8xi1> 2026-02-21T08:18:27.3789649Z %50 = arith.select %49, %21, %46 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:18:27.3789889Z %51 = arith.subf %21, %50 : tensor<8xf32> 2026-02-21T08:18:27.3790262Z %52 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:18:27.3790634Z %53 = arith.mulf %32, %52 : tensor<8xf32> 2026-02-21T08:18:27.3790895Z %54 = tt.expand_dims %50 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:18:27.3791196Z %55 = arith.extf %39 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:18:27.3791469Z %56 = tt.broadcast %54 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:18:27.3791779Z %57 = arith.subf %55, %56 : tensor<8x512xf32> 2026-02-21T08:18:27.3792161Z %58 = tt.extern_elementwise %57 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:18:27.3792606Z %59 = arith.select %41, %58, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T08:18:27.3792867Z %60 = "tt.reduce"(%59) <{axis = 1 : i32}> ({ 2026-02-21T08:18:27.3793072Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:18:27.3793259Z %231 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:18:27.3793464Z tt.reduce.return %231 : f32 2026-02-21T08:18:27.3793662Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:18:27.3793865Z %61 = arith.addf %53, %60 : tensor<8xf32> 2026-02-21T08:18:27.3794075Z %c2_i32 = arith.constant 2 : i32 
2026-02-21T08:18:27.3794337Z %62 = arith.muli %c512_i32, %c2_i32 : i32 2026-02-21T08:18:27.3794538Z %63 = arith.addi %c0_i32, %62 : i32 2026-02-21T08:18:27.3794779Z %64 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:18:27.3795049Z %65 = tt.splat %63 : i32 -> tensor<512xi32> 2026-02-21T08:18:27.3795265Z %66 = arith.addi %65, %64 : tensor<512xi32> 2026-02-21T08:18:27.3795492Z %67 = arith.cmpi slt, %66, %cst_3 : tensor<512xi32> 2026-02-21T08:18:27.3795829Z %68 = tt.descriptor_load %0[%2, %63] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:18:27.3796239Z %69 = tt.expand_dims %67 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:18:27.3796543Z %70 = tt.broadcast %69 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:18:27.3796837Z %71 = arith.select %70, %68, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T08:18:27.3797183Z %72 = arith.extf %71 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:18:27.3797423Z %73 = "tt.reduce"(%72) <{axis = 1 : i32}> ({ 2026-02-21T08:18:27.3797629Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:18:27.3797817Z %231 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:18:27.3798021Z tt.reduce.return %231 : f32 2026-02-21T08:18:27.3798210Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:18:27.3798445Z %74 = arith.truncf %73 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:18:27.3798702Z %75 = arith.extf %74 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:18:27.3798939Z %76 = arith.cmpf ogt, %50, %75 : tensor<8xf32> 2026-02-21T08:18:27.3799169Z %77 = arith.cmpf une, %50, %50 : tensor<8xf32> 2026-02-21T08:18:27.3799380Z %78 = arith.ori %76, %77 : tensor<8xi1> 2026-02-21T08:18:27.3799624Z %79 = arith.select %78, %50, %75 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:18:27.3799861Z %80 = arith.subf %50, %79 : tensor<8xf32> 2026-02-21T08:18:27.3800237Z %81 = tt.extern_elementwise %80 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> 
tensor<8xf32> 2026-02-21T08:18:27.3800612Z %82 = arith.mulf %61, %81 : tensor<8xf32> 2026-02-21T08:18:27.3800865Z %83 = tt.expand_dims %79 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:18:27.3801164Z %84 = arith.extf %68 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:18:27.3801427Z %85 = tt.broadcast %83 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:18:27.3801723Z %86 = arith.subf %84, %85 : tensor<8x512xf32> 2026-02-21T08:18:27.3802107Z %87 = tt.extern_elementwise %86 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:18:27.3802547Z %88 = arith.select %70, %87, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T08:18:27.3802817Z %89 = "tt.reduce"(%88) <{axis = 1 : i32}> ({ 2026-02-21T08:18:27.3803063Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:18:27.3803262Z %231 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:18:27.3803458Z tt.reduce.return %231 : f32 2026-02-21T08:18:27.3803657Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:18:27.3803861Z %90 = arith.addf %82, %89 : tensor<8xf32> 2026-02-21T08:18:27.3804066Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:18:27.3804268Z %91 = arith.muli %c512_i32, %c3_i32 : i32 2026-02-21T08:18:27.3804469Z %92 = arith.addi %c0_i32, %91 : i32 2026-02-21T08:18:27.3804720Z %93 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:18:27.3804987Z %94 = tt.splat %92 : i32 -> tensor<512xi32> 2026-02-21T08:18:27.3805203Z %95 = arith.addi %94, %93 : tensor<512xi32> 2026-02-21T08:18:27.3805431Z %96 = arith.cmpi slt, %95, %cst_3 : tensor<512xi32> 2026-02-21T08:18:27.3805752Z %97 = tt.descriptor_load %0[%2, %92] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:18:27.3806180Z %98 = tt.expand_dims %96 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:18:27.3806482Z %99 = tt.broadcast %98 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:18:27.3806775Z %100 = arith.select 
%99, %97, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T08:18:27.3807071Z %101 = arith.extf %100 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:18:27.3807323Z %102 = "tt.reduce"(%101) <{axis = 1 : i32}> ({ 2026-02-21T08:18:27.3807524Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:18:27.3807716Z %231 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:18:27.3807920Z tt.reduce.return %231 : f32 2026-02-21T08:18:27.3808107Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:18:27.3808347Z %103 = arith.truncf %102 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:18:27.3808607Z %104 = arith.extf %103 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:18:27.3808911Z %105 = arith.cmpf ogt, %79, %104 : tensor<8xf32> 2026-02-21T08:18:27.3809141Z %106 = arith.cmpf une, %79, %79 : tensor<8xf32> 2026-02-21T08:18:27.3809365Z %107 = arith.ori %105, %106 : tensor<8xi1> 2026-02-21T08:18:27.3809612Z %108 = arith.select %107, %79, %104 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:18:27.3809866Z %109 = arith.subf %79, %108 : tensor<8xf32> 2026-02-21T08:18:27.3810250Z %110 = tt.extern_elementwise %109 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:18:27.3810624Z %111 = arith.mulf %90, %110 : tensor<8xf32> 2026-02-21T08:18:27.3810893Z %112 = tt.expand_dims %108 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:18:27.3811201Z %113 = arith.extf %97 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:18:27.3811483Z %114 = tt.broadcast %112 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:18:27.3811776Z %115 = arith.subf %113, %114 : tensor<8x512xf32> 2026-02-21T08:18:27.3812172Z %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:18:27.3812621Z %117 = arith.select %99, %116, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T08:18:27.3812899Z %118 = "tt.reduce"(%117) <{axis = 1 : 
i32}> ({ 2026-02-21T08:18:27.3813114Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:18:27.3813303Z %231 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:18:27.3813504Z tt.reduce.return %231 : f32 2026-02-21T08:18:27.3813699Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:18:27.3813904Z %119 = arith.addf %111, %118 : tensor<8xf32> 2026-02-21T08:18:27.3814313Z %120:2 = scf.for %arg3 = %c2048_i32 to %c2816_i32 step %c512_i32 iter_args(%arg4 = %108, %arg5 = %119) -> (tensor<8xf32>, tensor<8xf32>) : i32 { 2026-02-21T08:18:27.3814762Z %231 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:18:27.3815053Z %232 = tt.splat %arg3 : i32 -> tensor<512xi32> 2026-02-21T08:18:27.3815261Z %233 = arith.addi %232, %231 : tensor<512xi32> 2026-02-21T08:18:27.3815484Z %234 = arith.cmpi slt, %233, %cst_3 : tensor<512xi32> 2026-02-21T08:18:27.3815809Z %235 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:18:27.3816173Z %236 = tt.expand_dims %234 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:18:27.3816487Z %237 = tt.broadcast %236 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:18:27.3816777Z %238 = arith.select %237, %235, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T08:18:27.3817082Z %239 = arith.extf %238 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:18:27.3817333Z %240 = "tt.reduce"(%239) <{axis = 1 : i32}> ({ 2026-02-21T08:18:27.3817532Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:18:27.3817734Z %258 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:18:27.3818028Z tt.reduce.return %258 : f32 2026-02-21T08:18:27.3818235Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:18:27.3818471Z %241 = arith.truncf %240 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:18:27.3818737Z %242 = arith.extf %241 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:18:27.3818985Z %243 = arith.cmpf ogt, %arg4, %242 : tensor<8xf32> 2026-02-21T08:18:27.3819235Z %244 = arith.cmpf une, 
%arg4, %arg4 : tensor<8xf32> 2026-02-21T08:18:27.3819472Z %245 = arith.ori %243, %244 : tensor<8xi1> 2026-02-21T08:18:27.3819726Z %246 = arith.select %245, %arg4, %242 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:18:27.3819995Z %247 = arith.subf %arg4, %246 : tensor<8xf32> 2026-02-21T08:18:27.3820382Z %248 = tt.extern_elementwise %247 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:18:27.3820828Z %249 = arith.mulf %arg5, %248 : tensor<8xf32> 2026-02-21T08:18:27.3821105Z %250 = tt.expand_dims %246 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:18:27.3821408Z %251 = arith.extf %235 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:18:27.3821717Z %252 = tt.broadcast %250 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:18:27.3821969Z %253 = arith.subf %251, %252 : tensor<8x512xf32> 2026-02-21T08:18:27.3822363Z %254 = tt.extern_elementwise %253 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:18:27.3822798Z %255 = arith.select %237, %254, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T08:18:27.3823071Z %256 = "tt.reduce"(%255) <{axis = 1 : i32}> ({ 2026-02-21T08:18:27.3823276Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:18:27.3823462Z %258 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:18:27.3823664Z tt.reduce.return %258 : f32 2026-02-21T08:18:27.3823859Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:18:27.3824086Z %257 = arith.addf %249, %256 : tensor<8xf32> 2026-02-21T08:18:27.3824298Z scf.yield %246, %257 : tensor<8xf32>, tensor<8xf32> 2026-02-21T08:18:27.3824547Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:18:27.3824775Z %c2048_i32_7 = arith.constant 2048 : i32 2026-02-21T08:18:27.3824965Z %c2048_i32_8 = arith.constant 2048 : i32 2026-02-21T08:18:27.3825203Z %121 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 
2026-02-21T08:18:27.3825457Z %122 = tt.splat %c0_i32 : i32 -> tensor<512xi32> 2026-02-21T08:18:27.3825674Z %123 = arith.addi %122, %121 : tensor<512xi32> 2026-02-21T08:18:27.3825889Z %124 = arith.cmpi slt, %123, %cst_3 : tensor<512xi32> 2026-02-21T08:18:27.3826189Z %125 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:18:27.3826459Z %126 = arith.muli %125, %cst_0 : tensor<8x1xi32> 2026-02-21T08:18:27.3826718Z %127 = tt.expand_dims %123 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:18:27.3827019Z %128 = tt.broadcast %126 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:18:27.3827278Z %129 = tt.broadcast %127 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:18:27.3827517Z %130 = arith.addi %128, %129 : tensor<8x512xi32> 2026-02-21T08:18:27.3827754Z %131 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3828039Z %132 = tt.addptr %131, %130 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:18:27.3828349Z %133 = tt.expand_dims %124 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:18:27.3828638Z %134 = tt.broadcast %133 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:18:27.3828944Z %135 = tt.load %132, %134, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3829282Z %136 = tt.expand_dims %120#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:18:27.3829636Z %137 = arith.extf %135 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:18:27.3829896Z %138 = tt.broadcast %136 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:18:27.3830126Z %139 = arith.subf %137, %138 : tensor<8x512xf32> 2026-02-21T08:18:27.3830497Z %140 = tt.extern_elementwise %139 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:18:27.3830910Z %141 = tt.expand_dims %120#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:18:27.3831200Z %142 = 
tt.broadcast %141 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:18:27.3831429Z %143 = arith.divf %140, %142 : tensor<8x512xf32> 2026-02-21T08:18:27.3831713Z %144 = arith.truncf %143 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:18:27.3831992Z %145 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3832325Z %146 = tt.addptr %145, %130 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:18:27.3832602Z tt.store %146, %144, %134 : tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3832814Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T08:18:27.3833011Z %147 = arith.muli %c512_i32, %c1_i32_9 : i32 2026-02-21T08:18:27.3833201Z %148 = arith.addi %c0_i32, %147 : i32 2026-02-21T08:18:27.3833437Z %149 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:18:27.3833694Z %150 = tt.splat %148 : i32 -> tensor<512xi32> 2026-02-21T08:18:27.3833899Z %151 = arith.addi %150, %149 : tensor<512xi32> 2026-02-21T08:18:27.3834120Z %152 = arith.cmpi slt, %151, %cst_3 : tensor<512xi32> 2026-02-21T08:18:27.3834379Z %153 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:18:27.3834639Z %154 = arith.muli %153, %cst_0 : tensor<8x1xi32> 2026-02-21T08:18:27.3834899Z %155 = tt.expand_dims %151 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:18:27.3835202Z %156 = tt.broadcast %154 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:18:27.3835469Z %157 = tt.broadcast %155 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:18:27.3835702Z %158 = arith.addi %156, %157 : tensor<8x512xi32> 2026-02-21T08:18:27.3835942Z %159 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3836217Z %160 = tt.addptr %159, %158 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:18:27.3836525Z %161 = tt.expand_dims %152 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:18:27.3836818Z %162 = tt.broadcast %161 : tensor<1x512xi1> -> tensor<8x512xi1> 
2026-02-21T08:18:27.3837122Z %163 = tt.load %160, %162, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3837461Z %164 = tt.expand_dims %120#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:18:27.3837747Z %165 = arith.extf %163 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:18:27.3838037Z %166 = tt.broadcast %164 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:18:27.3838265Z %167 = arith.subf %165, %166 : tensor<8x512xf32> 2026-02-21T08:18:27.3838652Z %168 = tt.extern_elementwise %167 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:18:27.3839072Z %169 = tt.expand_dims %120#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:18:27.3839350Z %170 = tt.broadcast %169 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:18:27.3839590Z %171 = arith.divf %168, %170 : tensor<8x512xf32> 2026-02-21T08:18:27.3839831Z %172 = arith.truncf %171 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:18:27.3840112Z %173 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3840408Z %174 = tt.addptr %173, %158 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:18:27.3840748Z tt.store %174, %172, %162 : tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3840965Z %c2_i32_10 = arith.constant 2 : i32 2026-02-21T08:18:27.3841159Z %175 = arith.muli %c512_i32, %c2_i32_10 : i32 2026-02-21T08:18:27.3841446Z %176 = arith.addi %c0_i32, %175 : i32 2026-02-21T08:18:27.3841712Z %177 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:18:27.3841969Z %178 = tt.splat %176 : i32 -> tensor<512xi32> 2026-02-21T08:18:27.3842179Z %179 = arith.addi %178, %177 : tensor<512xi32> 2026-02-21T08:18:27.3842397Z %180 = arith.cmpi slt, %179, %cst_3 : tensor<512xi32> 2026-02-21T08:18:27.3842664Z %181 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:18:27.3842922Z %182 = arith.muli 
%181, %cst_0 : tensor<8x1xi32> 2026-02-21T08:18:27.3843252Z %183 = tt.expand_dims %179 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:18:27.3843542Z %184 = tt.broadcast %182 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:18:27.3843809Z %185 = tt.broadcast %183 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:18:27.3844049Z %186 = arith.addi %184, %185 : tensor<8x512xi32> 2026-02-21T08:18:27.3844285Z %187 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3844567Z %188 = tt.addptr %187, %186 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:18:27.3844862Z %189 = tt.expand_dims %180 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:18:27.3845153Z %190 = tt.broadcast %189 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:18:27.3845448Z %191 = tt.load %188, %190, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3845776Z %192 = tt.expand_dims %120#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:18:27.3846063Z %193 = arith.extf %191 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:18:27.3846320Z %194 = tt.broadcast %192 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:18:27.3846556Z %195 = arith.subf %193, %194 : tensor<8x512xf32> 2026-02-21T08:18:27.3846922Z %196 = tt.extern_elementwise %195 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:18:27.3847335Z %197 = tt.expand_dims %120#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:18:27.3847619Z %198 = tt.broadcast %197 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:18:27.3847847Z %199 = arith.divf %196, %198 : tensor<8x512xf32> 2026-02-21T08:18:27.3848085Z %200 = arith.truncf %199 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:18:27.3848347Z %201 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3848625Z %202 = tt.addptr %201, %186 : tensor<8x512x!tt.ptr>, 
tensor<8x512xi32> 2026-02-21T08:18:27.3848896Z tt.store %202, %200, %190 : tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3849106Z %c3_i32_11 = arith.constant 3 : i32 2026-02-21T08:18:27.3849305Z %203 = arith.muli %c512_i32, %c3_i32_11 : i32 2026-02-21T08:18:27.3849497Z %204 = arith.addi %c0_i32, %203 : i32 2026-02-21T08:18:27.3849731Z %205 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:18:27.3849973Z %206 = tt.splat %204 : i32 -> tensor<512xi32> 2026-02-21T08:18:27.3850183Z %207 = arith.addi %206, %205 : tensor<512xi32> 2026-02-21T08:18:27.3850397Z %208 = arith.cmpi slt, %207, %cst_3 : tensor<512xi32> 2026-02-21T08:18:27.3850666Z %209 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:18:27.3850946Z %210 = arith.muli %209, %cst_0 : tensor<8x1xi32> 2026-02-21T08:18:27.3851204Z %211 = tt.expand_dims %207 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:18:27.3851501Z %212 = tt.broadcast %210 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:18:27.3851845Z %213 = tt.broadcast %211 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:18:27.3852085Z %214 = arith.addi %212, %213 : tensor<8x512xi32> 2026-02-21T08:18:27.3852327Z %215 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3852604Z %216 = tt.addptr %215, %214 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:18:27.3852910Z %217 = tt.expand_dims %208 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:18:27.3853199Z %218 = tt.broadcast %217 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:18:27.3853499Z %219 = tt.load %216, %218, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3853822Z %220 = tt.expand_dims %120#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:18:27.3854111Z %221 = arith.extf %219 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:18:27.3854440Z %222 = tt.broadcast %220 : tensor<8x1xf32> -> 
tensor<8x512xf32> 2026-02-21T08:18:27.3854674Z %223 = arith.subf %221, %222 : tensor<8x512xf32> 2026-02-21T08:18:27.3855050Z %224 = tt.extern_elementwise %223 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:18:27.3855461Z %225 = tt.expand_dims %120#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:18:27.3855749Z %226 = tt.broadcast %225 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:18:27.3855985Z %227 = arith.divf %224, %226 : tensor<8x512xf32> 2026-02-21T08:18:27.3856217Z %228 = arith.truncf %227 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:18:27.3856487Z %229 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3856764Z %230 = tt.addptr %229, %214 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:18:27.3857040Z tt.store %230, %228, %218 : tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3857293Z scf.for %arg3 = %c2048_i32_7 to %c2816_i32 step %c512_i32 : i32 { 2026-02-21T08:18:27.3857579Z %231 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:18:27.3857838Z %232 = tt.splat %arg3 : i32 -> tensor<512xi32> 2026-02-21T08:18:27.3858045Z %233 = arith.addi %232, %231 : tensor<512xi32> 2026-02-21T08:18:27.3858268Z %234 = arith.cmpi slt, %233, %cst_3 : tensor<512xi32> 2026-02-21T08:18:27.3858529Z %235 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:18:27.3858797Z %236 = arith.muli %235, %cst_0 : tensor<8x1xi32> 2026-02-21T08:18:27.3859060Z %237 = tt.expand_dims %233 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:18:27.3859357Z %238 = tt.broadcast %236 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:18:27.3859663Z %239 = tt.broadcast %237 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:18:27.3859907Z %240 = arith.addi %238, %239 : tensor<8x512xi32> 2026-02-21T08:18:27.3860155Z %241 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 
2026-02-21T08:18:27.3860458Z %242 = tt.addptr %241, %240 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:18:27.3860786Z %243 = tt.expand_dims %234 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:18:27.3861094Z %244 = tt.broadcast %243 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:18:27.3861408Z %245 = tt.load %242, %244, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3861779Z %246 = tt.expand_dims %120#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:18:27.3862078Z %247 = arith.extf %245 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:18:27.3862352Z %248 = tt.broadcast %246 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:18:27.3862600Z %249 = arith.subf %247, %248 : tensor<8x512xf32> 2026-02-21T08:18:27.3863101Z %250 = tt.extern_elementwise %249 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:18:27.3863565Z %251 = tt.expand_dims %120#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:18:27.3863862Z %252 = tt.broadcast %251 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:18:27.3864116Z %253 = arith.divf %250, %252 : tensor<8x512xf32> 2026-02-21T08:18:27.3864382Z %254 = arith.truncf %253 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:18:27.3864674Z %255 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3864980Z %256 = tt.addptr %255, %240 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:18:27.3865264Z tt.store %256, %254, %244 : tensor<8x512x!tt.ptr> 2026-02-21T08:18:27.3865529Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:18:27.3865897Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:18:27.3866171Z tt.return 2026-02-21T08:18:27.3866307Z } 2026-02-21T08:18:27.3866443Z } 2026-02-21T08:18:27.3866520Z 2026-02-21T08:18:27.3866582Z {-# 2026-02-21T08:18:27.3866717Z 
external_resources: { 2026-02-21T08:18:27.3866896Z mlir_reproducer: { 2026-02-21T08:18:27.3871261Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:18:27.3875741Z disable_threading: false, 2026-02-21T08:18:27.3875913Z verify_each: true 2026-02-21T08:18:27.3876056Z } 
2026-02-21T08:18:27.3876179Z } 2026-02-21T08:18:27.3876290Z #-} 2026-02-21T08:18:27.3876715Z /tmp/torchinductor_root/bw/cbwqrdl2s7tkwz2cqoupn5wyjcm3wl4ekwbt6r3xubd4zarcf4r3.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:18:27.3877919Z /tmp/torchinductor_root/bw/cbwqrdl2s7tkwz2cqoupn5wyjcm3wl4ekwbt6r3xubd4zarcf4r3.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:18:27.3878894Z [34s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:18:27.3880060Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 512], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'last'], maxnreg=32, num_sm_multiplier=64, num_stages=7, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[4, 4], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:18:27.3881055Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:18:27.3881312Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:18:30.0433361Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:18:30.0435221Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:18:30.0435715Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:18:30.0436343Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:18:30.0436570Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:18:30.0436759Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:18:30.0436974Z %cst = arith.constant dense<2816> : tensor<32x1xi32> 2026-02-21T08:18:30.0437246Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<32xf32> 2026-02-21T08:18:30.0437521Z %cst_1 = arith.constant dense<0xFF800000> : tensor<32xf32> 2026-02-21T08:18:30.0437748Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:18:30.0437950Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:18:30.0438145Z %c2816_i32 = arith.constant 2816 : i32 2026-02-21T08:18:30.0438342Z %c2816_i64 = arith.constant 2816 : i64 2026-02-21T08:18:30.0438529Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:18:30.0438860Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c2816_i32], [%c2816_i64, %c1_i64] : , > 2026-02-21T08:18:30.0439365Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c2816_i32], [%c2816_i64, %c1_i64] : , > 2026-02-21T08:18:30.0439675Z %2 = tt.get_program_id x : i32 2026-02-21T08:18:30.0439859Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:18:30.0440037Z %4 = arith.minsi %3, %c128_i32 : i32 2026-02-21T08:18:30.0440238Z scf.for %arg2 = %2 to %4 step %c1_i32 : i32 { 2026-02-21T08:18:30.0440531Z %5 = arith.muli %arg2, %c32_i32 : i32 2026-02-21T08:18:30.0445671Z %6 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:18:30.0447641Z %7 = tt.splat %5 : i32 -> tensor<32xi32> 2026-02-21T08:18:30.0447893Z %8 = arith.addi %7, %6 : tensor<32xi32> 2026-02-21T08:18:30.0448125Z %c2808_i32 = arith.constant 2808 : i32 
2026-02-21T08:18:30.0448335Z %c24_i32 = arith.constant 24 : i32 2026-02-21T08:18:30.0448721Z %9:2 = scf.for %arg3 = %c0_i32 to %c2808_i32 step %c24_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<32xf32>, tensor<32xf32>) : i32 { 2026-02-21T08:18:30.0449139Z %49 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:18:30.0449406Z %50 = tt.splat %arg3 : i32 -> tensor<8xi32> 2026-02-21T08:18:30.0449610Z %51 = arith.addi %50, %49 : tensor<8xi32> 2026-02-21T08:18:30.0449869Z %52 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:18:30.0450132Z %53 = arith.muli %52, %cst : tensor<32x1xi32> 2026-02-21T08:18:30.0450388Z %54 = tt.expand_dims %51 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:18:30.0450675Z %55 = tt.broadcast %53 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:18:30.0450928Z %56 = tt.broadcast %54 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:18:30.0451165Z %57 = arith.addi %55, %56 : tensor<32x8xi32> 2026-02-21T08:18:30.0451405Z %58 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:18:30.0451794Z %59 = tt.addptr %58, %57 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:18:30.0452385Z %60 = tt.load %59 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 2026-02-21T08:18:30.0452716Z %61 = arith.extf %60 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:18:30.0452945Z %62 = "tt.reduce"(%61) <{axis = 1 : i32}> ({ 2026-02-21T08:18:30.0453146Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:18:30.0453338Z %140 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:18:30.0453536Z tt.reduce.return %140 : f32 2026-02-21T08:18:30.0453725Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:18:30.0453953Z %63 = arith.truncf %62 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:18:30.0454204Z %64 = arith.extf %63 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:18:30.0454436Z %65 = arith.cmpf ogt, %arg4, %64 : tensor<32xf32> 2026-02-21T08:18:30.0454669Z 
%66 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32> 2026-02-21T08:18:30.0454960Z %67 = arith.ori %65, %66 : tensor<32xi1> 2026-02-21T08:18:30.0455204Z %68 = arith.select %67, %arg4, %64 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:18:30.0455445Z %69 = arith.subf %arg4, %68 : tensor<32xf32> 2026-02-21T08:18:30.0455810Z %70 = tt.extern_elementwise %69 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:18:30.0456174Z %71 = arith.mulf %arg5, %70 : tensor<32xf32> 2026-02-21T08:18:30.0456423Z %72 = tt.expand_dims %68 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:30.0456716Z %73 = tt.broadcast %72 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:18:30.0456940Z %74 = arith.subf %61, %73 : tensor<32x8xf32> 2026-02-21T08:18:30.0457294Z %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:18:30.0457653Z %76 = "tt.reduce"(%75) <{axis = 1 : i32}> ({ 2026-02-21T08:18:30.0457848Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:18:30.0458036Z %140 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:18:30.0458223Z tt.reduce.return %140 : f32 2026-02-21T08:18:30.0458412Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:18:30.0458605Z %77 = arith.addf %71, %76 : tensor<32xf32> 2026-02-21T08:18:30.0458806Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T08:18:30.0458992Z %78 = arith.muli %c8_i32, %c1_i32_4 : i32 2026-02-21T08:18:30.0459191Z %79 = arith.addi %arg3, %78 : i32 2026-02-21T08:18:30.0459418Z %80 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:18:30.0459657Z %81 = tt.splat %79 : i32 -> tensor<8xi32> 2026-02-21T08:18:30.0459859Z %82 = arith.addi %81, %80 : tensor<8xi32> 2026-02-21T08:18:30.0460103Z %83 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:18:30.0460375Z %84 = arith.muli %83, %cst : tensor<32x1xi32> 
2026-02-21T08:18:30.0460621Z %85 = tt.expand_dims %82 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:18:30.0460906Z %86 = tt.broadcast %84 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:18:30.0461163Z %87 = tt.broadcast %85 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:18:30.0461384Z %88 = arith.addi %86, %87 : tensor<32x8xi32> 2026-02-21T08:18:30.0461665Z %89 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:18:30.0461930Z %90 = tt.addptr %89, %88 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:18:30.0462227Z %91 = tt.load %90 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 2026-02-21T08:18:30.0462510Z %92 = arith.extf %91 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:18:30.0462732Z %93 = "tt.reduce"(%92) <{axis = 1 : i32}> ({ 2026-02-21T08:18:30.0462924Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:18:30.0463109Z %140 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:18:30.0463387Z tt.reduce.return %140 : f32 2026-02-21T08:18:30.0463569Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:18:30.0463790Z %94 = arith.truncf %93 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:18:30.0464027Z %95 = arith.extf %94 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:18:30.0464259Z %96 = arith.cmpf ogt, %68, %95 : tensor<32xf32> 2026-02-21T08:18:30.0464474Z %97 = arith.cmpf une, %68, %68 : tensor<32xf32> 2026-02-21T08:18:30.0464672Z %98 = arith.ori %96, %97 : tensor<32xi1> 2026-02-21T08:18:30.0464904Z %99 = arith.select %98, %68, %95 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:18:30.0465136Z %100 = arith.subf %68, %99 : tensor<32xf32> 2026-02-21T08:18:30.0465506Z %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:18:30.0465937Z %102 = arith.mulf %77, %101 : tensor<32xf32> 2026-02-21T08:18:30.0466190Z %103 = tt.expand_dims %99 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:30.0466484Z 
%104 = tt.broadcast %103 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:18:30.0466716Z %105 = arith.subf %92, %104 : tensor<32x8xf32> 2026-02-21T08:18:30.0467085Z %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:18:30.0467447Z %107 = "tt.reduce"(%106) <{axis = 1 : i32}> ({ 2026-02-21T08:18:30.0467644Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:18:30.0467829Z %140 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:18:30.0468014Z tt.reduce.return %140 : f32 2026-02-21T08:18:30.0468203Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:18:30.0468401Z %108 = arith.addf %102, %107 : tensor<32xf32> 2026-02-21T08:18:30.0468606Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:18:30.0468794Z %109 = arith.muli %c8_i32, %c2_i32 : i32 2026-02-21T08:18:30.0468989Z %110 = arith.addi %arg3, %109 : i32 2026-02-21T08:18:30.0469213Z %111 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:18:30.0469470Z %112 = tt.splat %110 : i32 -> tensor<8xi32> 2026-02-21T08:18:30.0469677Z %113 = arith.addi %112, %111 : tensor<8xi32> 2026-02-21T08:18:30.0469923Z %114 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:18:30.0470194Z %115 = arith.muli %114, %cst : tensor<32x1xi32> 2026-02-21T08:18:30.0470440Z %116 = tt.expand_dims %113 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:18:30.0470731Z %117 = tt.broadcast %115 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:18:30.0470997Z %118 = tt.broadcast %116 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:18:30.0471231Z %119 = arith.addi %117, %118 : tensor<32x8xi32> 2026-02-21T08:18:30.0471476Z %120 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:18:30.0471789Z %121 = tt.addptr %120, %119 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:18:30.0472099Z %122 = tt.load %121 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 
2026-02-21T08:18:30.0472388Z %123 = arith.extf %122 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:18:30.0472626Z %124 = "tt.reduce"(%123) <{axis = 1 : i32}> ({ 2026-02-21T08:18:30.0472826Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:18:30.0473006Z %140 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:18:30.0473201Z tt.reduce.return %140 : f32 2026-02-21T08:18:30.0473381Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:18:30.0473607Z %125 = arith.truncf %124 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:18:30.0473857Z %126 = arith.extf %125 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:18:30.0474107Z %127 = arith.cmpf ogt, %99, %126 : tensor<32xf32> 2026-02-21T08:18:30.0474406Z %128 = arith.cmpf une, %99, %99 : tensor<32xf32> 2026-02-21T08:18:30.0474612Z %129 = arith.ori %127, %128 : tensor<32xi1> 2026-02-21T08:18:30.0474851Z %130 = arith.select %129, %99, %126 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:18:30.0475088Z %131 = arith.subf %99, %130 : tensor<32xf32> 2026-02-21T08:18:30.0475457Z %132 = tt.extern_elementwise %131 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:18:30.0475817Z %133 = arith.mulf %108, %132 : tensor<32xf32> 2026-02-21T08:18:30.0476078Z %134 = tt.expand_dims %130 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:30.0476389Z %135 = tt.broadcast %134 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:18:30.0476663Z %136 = arith.subf %123, %135 : tensor<32x8xf32> 2026-02-21T08:18:30.0477150Z %137 = tt.extern_elementwise %136 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:18:30.0477570Z %138 = "tt.reduce"(%137) <{axis = 1 : i32}> ({ 2026-02-21T08:18:30.0477804Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:18:30.0478014Z %140 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:18:30.0478246Z tt.reduce.return %140 : f32 2026-02-21T08:18:30.0478480Z }) : (tensor<32x8xf32>) 
-> tensor<32xf32> 2026-02-21T08:18:30.0478695Z %139 = arith.addf %133, %138 : tensor<32xf32> 2026-02-21T08:18:30.0478967Z scf.yield %130, %139 : tensor<32xf32>, tensor<32xf32> 2026-02-21T08:18:30.0479230Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:18:30.0479551Z %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:18:30.0479824Z %11 = tt.splat %c2808_i32 : i32 -> tensor<8xi32> 2026-02-21T08:18:30.0480067Z %12 = arith.addi %11, %10 : tensor<8xi32> 2026-02-21T08:18:30.0480356Z %13 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:18:30.0480654Z %14 = arith.muli %13, %cst : tensor<32x1xi32> 2026-02-21T08:18:30.0480951Z %15 = tt.expand_dims %12 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:18:30.0481294Z %16 = tt.broadcast %14 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:18:30.0482883Z %17 = tt.broadcast %15 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:18:30.0483132Z %18 = arith.addi %16, %17 : tensor<32x8xi32> 2026-02-21T08:18:30.0483383Z %19 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:18:30.0483682Z %20 = tt.addptr %19, %18 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:18:30.0483999Z %21 = tt.load %20 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 2026-02-21T08:18:30.0484313Z %22 = arith.extf %21 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:18:30.0484568Z %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({ 2026-02-21T08:18:30.0484780Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:18:30.0484972Z %49 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:18:30.0485179Z tt.reduce.return %49 : f32 2026-02-21T08:18:30.0485379Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:18:30.0485605Z %24 = arith.truncf %23 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:18:30.0485861Z %25 = arith.extf %24 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:18:30.0486133Z %26 = arith.cmpf ogt, %9#0, %25 : tensor<32xf32> 
2026-02-21T08:18:30.0486350Z %27 = arith.cmpf une, %9#0, %9#0 : tensor<32xf32> 2026-02-21T08:18:30.0486554Z %28 = arith.ori %26, %27 : tensor<32xi1> 2026-02-21T08:18:30.0486793Z %29 = arith.select %28, %9#0, %25 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:18:30.0487033Z %30 = arith.subf %9#0, %29 : tensor<32xf32> 2026-02-21T08:18:30.0487379Z %31 = tt.extern_elementwise %30 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:18:30.0487847Z %32 = arith.mulf %9#1, %31 : tensor<32xf32> 2026-02-21T08:18:30.0488096Z %33 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:30.0488393Z %34 = tt.broadcast %33 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:18:30.0488622Z %35 = arith.subf %22, %34 : tensor<32x8xf32> 2026-02-21T08:18:30.0488982Z %36 = tt.extern_elementwise %35 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:18:30.0489346Z %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({ 2026-02-21T08:18:30.0489535Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:18:30.0489716Z %49 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:18:30.0489900Z tt.reduce.return %49 : f32 2026-02-21T08:18:30.0490090Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:18:30.0490342Z %38 = arith.addf %32, %37 : tensor<32xf32> 2026-02-21T08:18:30.0490546Z %c2808_i32_2 = arith.constant 2808 : i32 2026-02-21T08:18:30.0490740Z %c24_i32_3 = arith.constant 24 : i32 2026-02-21T08:18:30.0490961Z scf.for %arg3 = %c0_i32 to %c2808_i32_2 step %c24_i32_3 : i32 { 2026-02-21T08:18:30.0491366Z %49 = tt.descriptor_load %0[%5, %arg3] : !tt.tensordesc> -> tensor<32x8xf16> 2026-02-21T08:18:30.0491801Z %50 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:30.0492109Z %51 = arith.extf %49 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:18:30.0492360Z %52 = tt.broadcast %50 : tensor<32x1xf32> -> 
tensor<32x8xf32> 2026-02-21T08:18:30.0492595Z %53 = arith.subf %51, %52 : tensor<32x8xf32> 2026-02-21T08:18:30.0492960Z %54 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:18:30.0493368Z %55 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:30.0493658Z %56 = tt.broadcast %55 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:18:30.0493883Z %57 = arith.divf %54, %56 : tensor<32x8xf32> 2026-02-21T08:18:30.0494116Z %58 = arith.truncf %57 : tensor<32x8xf32> to tensor<32x8xf16> 2026-02-21T08:18:30.0494427Z tt.descriptor_store %1[%5, %arg3], %58 : !tt.tensordesc>, tensor<32x8xf16> 2026-02-21T08:18:30.0494708Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T08:18:30.0494907Z %59 = arith.muli %c8_i32, %c1_i32_4 : i32 2026-02-21T08:18:30.0495095Z %60 = arith.addi %arg3, %59 : i32 2026-02-21T08:18:30.0495360Z %61 = tt.descriptor_load %0[%5, %60] : !tt.tensordesc> -> tensor<32x8xf16> 2026-02-21T08:18:30.0495681Z %62 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:30.0495965Z %63 = arith.extf %61 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:18:30.0496214Z %64 = tt.broadcast %62 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:18:30.0496437Z %65 = arith.subf %63, %64 : tensor<32x8xf32> 2026-02-21T08:18:30.0496797Z %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:18:30.0497195Z %67 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:30.0497479Z %68 = tt.broadcast %67 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:18:30.0497706Z %69 = arith.divf %66, %68 : tensor<32x8xf32> 2026-02-21T08:18:30.0497925Z %70 = arith.truncf %69 : tensor<32x8xf32> to tensor<32x8xf16> 2026-02-21T08:18:30.0498229Z tt.descriptor_store %1[%5, %60], %70 : 
!tt.tensordesc>, tensor<32x8xf16> 2026-02-21T08:18:30.0498497Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:18:30.0498692Z %71 = arith.muli %c8_i32, %c2_i32 : i32 2026-02-21T08:18:30.0498927Z %72 = arith.addi %arg3, %71 : i32 2026-02-21T08:18:30.0499191Z %73 = tt.descriptor_load %0[%5, %72] : !tt.tensordesc> -> tensor<32x8xf16> 2026-02-21T08:18:30.0499518Z %74 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:30.0499792Z %75 = arith.extf %73 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:18:30.0500041Z %76 = tt.broadcast %74 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:18:30.0500265Z %77 = arith.subf %75, %76 : tensor<32x8xf32> 2026-02-21T08:18:30.0500627Z %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:18:30.0501058Z %79 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:30.0501333Z %80 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:18:30.0501637Z %81 = arith.divf %78, %80 : tensor<32x8xf32> 2026-02-21T08:18:30.0501857Z %82 = arith.truncf %81 : tensor<32x8xf32> to tensor<32x8xf16> 2026-02-21T08:18:30.0502156Z tt.descriptor_store %1[%5, %72], %82 : !tt.tensordesc>, tensor<32x8xf16> 2026-02-21T08:18:30.0502456Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:18:30.0502773Z %39 = tt.descriptor_load %0[%5, %c2808_i32_2] : !tt.tensordesc> -> tensor<32x8xf16> 2026-02-21T08:18:30.0503117Z %40 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:30.0503387Z %41 = arith.extf %39 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:18:30.0503635Z %42 = tt.broadcast %40 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:18:30.0503856Z %43 = arith.subf %41, %42 : tensor<32x8xf32> 2026-02-21T08:18:30.0504213Z %44 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = 
"__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:18:30.0504618Z %45 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:30.0504888Z %46 = tt.broadcast %45 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:18:30.0505118Z %47 = arith.divf %44, %46 : tensor<32x8xf32> 2026-02-21T08:18:30.0505339Z %48 = arith.truncf %47 : tensor<32x8xf32> to tensor<32x8xf16> 2026-02-21T08:18:30.0505661Z tt.descriptor_store %1[%5, %c2808_i32_2], %48 : !tt.tensordesc>, tensor<32x8xf16> 2026-02-21T08:18:30.0505963Z } {tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:18:30.0506163Z tt.return 2026-02-21T08:18:30.0506295Z } 2026-02-21T08:18:30.0506415Z } 2026-02-21T08:18:30.0506481Z 2026-02-21T08:18:30.0506541Z {-# 2026-02-21T08:18:30.0506670Z external_resources: { 2026-02-21T08:18:30.0506831Z mlir_reproducer: { 2026-02-21T08:18:30.0511127Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 
max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:18:30.0515759Z disable_threading: false, 2026-02-21T08:18:30.0515936Z verify_each: true 2026-02-21T08:18:30.0516080Z } 2026-02-21T08:18:30.0516211Z } 2026-02-21T08:18:30.0516328Z #-} 2026-02-21T08:18:30.0516821Z /tmp/torchinductor_root/e4/ce4plioxhs4dkgdsuiqbo5taxqthwvps2cnco56uv7pliohm3mam.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:18:30.0518022Z /tmp/torchinductor_root/e4/ce4plioxhs4dkgdsuiqbo5taxqthwvps2cnco56uv7pliohm3mam.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:18:30.0518996Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:18:30.0520095Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 8], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[2, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:18:30.0521127Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:18:30.0521391Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:18:31.6545041Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:18:31.6546527Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:18:31.6547022Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:18:31.6547221Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:18:31.6547421Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:18:31.6547643Z %cst = arith.constant dense<2816> : tensor<32x1xi32> 2026-02-21T08:18:31.6547904Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<32xf32> 2026-02-21T08:18:31.6548161Z %cst_1 = arith.constant dense<0xFF800000> : tensor<32xf32> 2026-02-21T08:18:31.6548422Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:18:31.6548627Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:18:31.6548812Z %c2816_i32 = arith.constant 2816 : i32 2026-02-21T08:18:31.6548998Z %c2816_i64 = arith.constant 2816 : i64 2026-02-21T08:18:31.6549180Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:18:31.6549503Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c2816_i32], [%c2816_i64, %c1_i64] : , > 2026-02-21T08:18:31.6549828Z %1 = tt.get_program_id x : i32 2026-02-21T08:18:31.6550068Z scf.for %arg2 = %1 to %c128_i32 step %c9472_i32 : i32 { 2026-02-21T08:18:31.6550328Z %2 = 
arith.muli %arg2, %c32_i32 : i32 2026-02-21T08:18:31.6550557Z %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:18:31.6550811Z %4 = tt.splat %2 : i32 -> tensor<32xi32> 2026-02-21T08:18:31.6551011Z %5 = arith.addi %4, %3 : tensor<32xi32> 2026-02-21T08:18:31.6551196Z %c2784_i32 = arith.constant 2784 : i32 2026-02-21T08:18:31.6551393Z %c96_i32 = arith.constant 96 : i32 2026-02-21T08:18:31.6552296Z %6:2 = scf.for %arg3 = %c0_i32 to %c2784_i32 step %c96_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<32xf32>, tensor<32xf32>) : i32 { 2026-02-21T08:18:31.6552762Z %47 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:18:31.6553086Z %48 = arith.extf %47 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:18:31.6553334Z %49 = "tt.reduce"(%48) <{axis = 1 : i32}> ({ 2026-02-21T08:18:31.6553549Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:18:31.6553739Z %105 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:18:31.6553938Z tt.reduce.return %105 : f32 2026-02-21T08:18:31.6554121Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:18:31.6554350Z %50 = arith.truncf %49 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:18:31.6554587Z %51 = arith.extf %50 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:18:31.6554929Z %52 = arith.cmpf ogt, %arg4, %51 : tensor<32xf32> 2026-02-21T08:18:31.6555170Z %53 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32> 2026-02-21T08:18:31.6555386Z %54 = arith.ori %52, %53 : tensor<32xi1> 2026-02-21T08:18:31.6555631Z %55 = arith.select %54, %arg4, %51 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:18:31.6555876Z %56 = arith.subf %arg4, %55 : tensor<32xf32> 2026-02-21T08:18:31.6556246Z %57 = tt.extern_elementwise %56 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:18:31.6556656Z %58 = arith.mulf %arg5, %57 : tensor<32xf32> 2026-02-21T08:18:31.6556918Z %59 = tt.expand_dims %55 {axis = 1 : i32} : 
tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:31.6557223Z %60 = tt.broadcast %59 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:18:31.6557471Z %61 = arith.subf %48, %60 : tensor<32x32xf32> 2026-02-21T08:18:31.6557865Z %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:18:31.6558240Z %63 = "tt.reduce"(%62) <{axis = 1 : i32}> ({ 2026-02-21T08:18:31.6558451Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:18:31.6558649Z %105 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:18:31.6558844Z tt.reduce.return %105 : f32 2026-02-21T08:18:31.6559041Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:18:31.6559243Z %64 = arith.addf %58, %63 : tensor<32xf32> 2026-02-21T08:18:31.6559444Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:18:31.6559638Z %65 = arith.muli %c32_i32, %c1_i32 : i32 2026-02-21T08:18:31.6559839Z %66 = arith.addi %arg3, %65 : i32 2026-02-21T08:18:31.6560118Z %67 = tt.descriptor_load %0[%2, %66] : !tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:18:31.6560449Z %68 = arith.extf %67 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:18:31.6560692Z %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({ 2026-02-21T08:18:31.6560884Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:18:31.6561084Z %105 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:18:31.6561280Z tt.reduce.return %105 : f32 2026-02-21T08:18:31.6561474Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:18:31.6561755Z %70 = arith.truncf %69 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:18:31.6562012Z %71 = arith.extf %70 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:18:31.6562259Z %72 = arith.cmpf ogt, %55, %71 : tensor<32xf32> 2026-02-21T08:18:31.6562481Z %73 = arith.cmpf une, %55, %55 : tensor<32xf32> 2026-02-21T08:18:31.6562699Z %74 = arith.ori %72, %73 : tensor<32xi1> 2026-02-21T08:18:31.6562934Z %75 = arith.select %74, %55, %71 : tensor<32xi1>, tensor<32xf32> 
2026-02-21T08:18:31.6563182Z %76 = arith.subf %55, %75 : tensor<32xf32> 2026-02-21T08:18:31.6563553Z %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:18:31.6564018Z %78 = arith.mulf %64, %77 : tensor<32xf32> 2026-02-21T08:18:31.6564284Z %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:31.6564588Z %80 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:18:31.6564841Z %81 = arith.subf %68, %80 : tensor<32x32xf32> 2026-02-21T08:18:31.6565217Z %82 = tt.extern_elementwise %81 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:18:31.6565601Z %83 = "tt.reduce"(%82) <{axis = 1 : i32}> ({ 2026-02-21T08:18:31.6565808Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:18:31.6565999Z %105 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:18:31.6566212Z tt.reduce.return %105 : f32 2026-02-21T08:18:31.6566470Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:18:31.6566683Z %84 = arith.addf %78, %83 : tensor<32xf32> 2026-02-21T08:18:31.6566878Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:18:31.6567076Z %85 = arith.muli %c32_i32, %c2_i32 : i32 2026-02-21T08:18:31.6567274Z %86 = arith.addi %arg3, %85 : i32 2026-02-21T08:18:31.6567558Z %87 = tt.descriptor_load %0[%2, %86] : !tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:18:31.6567869Z %88 = arith.extf %87 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:18:31.6568091Z %89 = "tt.reduce"(%88) <{axis = 1 : i32}> ({ 2026-02-21T08:18:31.6568282Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:18:31.6568464Z %105 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:18:31.6568658Z tt.reduce.return %105 : f32 2026-02-21T08:18:31.6568837Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:18:31.6569064Z %90 = arith.truncf %89 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:18:31.6569310Z %91 = 
arith.extf %90 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:18:31.6569532Z %92 = arith.cmpf ogt, %75, %91 : tensor<32xf32> 2026-02-21T08:18:31.6569748Z %93 = arith.cmpf une, %75, %75 : tensor<32xf32> 2026-02-21T08:18:31.6569944Z %94 = arith.ori %92, %93 : tensor<32xi1> 2026-02-21T08:18:31.6570173Z %95 = arith.select %94, %75, %91 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:18:31.6570396Z %96 = arith.subf %75, %95 : tensor<32xf32> 2026-02-21T08:18:31.6570742Z %97 = tt.extern_elementwise %96 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:18:31.6571094Z %98 = arith.mulf %84, %97 : tensor<32xf32> 2026-02-21T08:18:31.6571336Z %99 = tt.expand_dims %95 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:31.6571676Z %100 = tt.broadcast %99 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:18:31.6571917Z %101 = arith.subf %88, %100 : tensor<32x32xf32> 2026-02-21T08:18:31.6572284Z %102 = tt.extern_elementwise %101 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:18:31.6572651Z %103 = "tt.reduce"(%102) <{axis = 1 : i32}> ({ 2026-02-21T08:18:31.6572840Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:18:31.6573025Z %105 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:18:31.6573208Z tt.reduce.return %105 : f32 2026-02-21T08:18:31.6573394Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:18:31.6573591Z %104 = arith.addf %98, %103 : tensor<32xf32> 2026-02-21T08:18:31.6573810Z scf.yield %95, %104 : tensor<32xf32>, tensor<32xf32> 2026-02-21T08:18:31.6574027Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:18:31.6574327Z %7 = tt.descriptor_load %0[%2, %c2784_i32] : !tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:18:31.6574710Z %8 = arith.extf %7 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:18:31.6574925Z %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({ 2026-02-21T08:18:31.6575116Z ^bb0(%arg3: f32, 
%arg4: f32): 2026-02-21T08:18:31.6575296Z %47 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:18:31.6575489Z tt.reduce.return %47 : f32 2026-02-21T08:18:31.6575670Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:18:31.6575892Z %10 = arith.truncf %9 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:18:31.6576133Z %11 = arith.extf %10 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:18:31.6576350Z %12 = arith.cmpf ogt, %6#0, %11 : tensor<32xf32> 2026-02-21T08:18:31.6576569Z %13 = arith.cmpf une, %6#0, %6#0 : tensor<32xf32> 2026-02-21T08:18:31.6576770Z %14 = arith.ori %12, %13 : tensor<32xi1> 2026-02-21T08:18:31.6577009Z %15 = arith.select %14, %6#0, %11 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:18:31.6577300Z %16 = arith.subf %6#0, %15 : tensor<32xf32> 2026-02-21T08:18:31.6577661Z %17 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:18:31.6578018Z %18 = arith.mulf %6#1, %17 : tensor<32xf32> 2026-02-21T08:18:31.6578262Z %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:31.6578551Z %20 = tt.broadcast %19 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:18:31.6578776Z %21 = arith.subf %8, %20 : tensor<32x32xf32> 2026-02-21T08:18:31.6579132Z %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:18:31.6579488Z %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({ 2026-02-21T08:18:31.6579677Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:18:31.6579858Z %47 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:18:31.6580038Z tt.reduce.return %47 : f32 2026-02-21T08:18:31.6580233Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:18:31.6580425Z %24 = arith.addf %18, %23 : tensor<32xf32> 2026-02-21T08:18:31.6580627Z %c2784_i32_2 = arith.constant 2784 : i32 2026-02-21T08:18:31.6580814Z %c96_i32_3 = arith.constant 96 : i32 
2026-02-21T08:18:31.6581047Z scf.for %arg3 = %c0_i32 to %c2784_i32_2 step %c96_i32_3 : i32 { 2026-02-21T08:18:31.6581293Z %47 = tt.splat %arg3 : i32 -> tensor<32xi32> 2026-02-21T08:18:31.6581491Z %48 = arith.addi %47, %3 : tensor<32xi32> 2026-02-21T08:18:31.6581783Z %49 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:18:31.6582045Z %50 = arith.muli %49, %cst : tensor<32x1xi32> 2026-02-21T08:18:31.6582300Z %51 = tt.expand_dims %48 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:18:31.6582581Z %52 = tt.broadcast %50 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:18:31.6582871Z %53 = tt.broadcast %51 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:18:31.6583108Z %54 = arith.addi %52, %53 : tensor<32x32xi32> 2026-02-21T08:18:31.6583339Z %55 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:18:31.6583617Z %56 = tt.addptr %55, %54 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:18:31.6583917Z %57 = tt.load %56 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T08:18:31.6584259Z %58 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:31.6584549Z %59 = arith.extf %57 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:18:31.6584818Z %60 = tt.broadcast %58 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:18:31.6585053Z %61 = arith.subf %59, %60 : tensor<32x32xf32> 2026-02-21T08:18:31.6585421Z %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:18:31.6585877Z %63 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:31.6586158Z %64 = tt.broadcast %63 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:18:31.6586392Z %65 = arith.divf %62, %64 : tensor<32x32xf32> 2026-02-21T08:18:31.6586619Z %66 = arith.truncf %65 : tensor<32x32xf32> to tensor<32x32xf16> 2026-02-21T08:18:31.6586887Z 
%67 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:18:31.6587155Z %68 = tt.addptr %67, %54 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:18:31.6587411Z tt.store %68, %66 : tensor<32x32x!tt.ptr> 2026-02-21T08:18:31.6587612Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:18:31.6587805Z %69 = arith.muli %c32_i32, %c1_i32 : i32 2026-02-21T08:18:31.6587996Z %70 = arith.addi %arg3, %69 : i32 2026-02-21T08:18:31.6588184Z %71 = tt.splat %70 : i32 -> tensor<32xi32> 2026-02-21T08:18:31.6588466Z %72 = arith.addi %71, %3 : tensor<32xi32> 2026-02-21T08:18:31.6588709Z %73 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:18:31.6588967Z %74 = arith.muli %73, %cst : tensor<32x1xi32> 2026-02-21T08:18:31.6589208Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:18:31.6589490Z %76 = tt.broadcast %74 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:18:31.6589744Z %77 = tt.broadcast %75 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:18:31.6589965Z %78 = arith.addi %76, %77 : tensor<32x32xi32> 2026-02-21T08:18:31.6590201Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:18:31.6590468Z %80 = tt.addptr %79, %78 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:18:31.6590768Z %81 = tt.load %80 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T08:18:31.6591067Z %82 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:31.6591352Z %83 = arith.extf %81 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:18:31.6591649Z %84 = tt.broadcast %82 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:18:31.6591874Z %85 = arith.subf %83, %84 : tensor<32x32xf32> 2026-02-21T08:18:31.6592241Z %86 = tt.extern_elementwise %85 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:18:31.6592645Z %87 = tt.expand_dims %24 {axis = 1 : i32} : 
tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:31.6592928Z %88 = tt.broadcast %87 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:18:31.6593161Z %89 = arith.divf %86, %88 : tensor<32x32xf32> 2026-02-21T08:18:31.6593387Z %90 = arith.truncf %89 : tensor<32x32xf32> to tensor<32x32xf16> 2026-02-21T08:18:31.6593659Z %91 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:18:31.6593932Z %92 = tt.addptr %91, %78 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:18:31.6594190Z tt.store %92, %90 : tensor<32x32x!tt.ptr> 2026-02-21T08:18:31.6594387Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:18:31.6594578Z %93 = arith.muli %c32_i32, %c2_i32 : i32 2026-02-21T08:18:31.6594770Z %94 = arith.addi %arg3, %93 : i32 2026-02-21T08:18:31.6594957Z %95 = tt.splat %94 : i32 -> tensor<32xi32> 2026-02-21T08:18:31.6595159Z %96 = arith.addi %95, %3 : tensor<32xi32> 2026-02-21T08:18:31.6595400Z %97 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:18:31.6595663Z %98 = arith.muli %97, %cst : tensor<32x1xi32> 2026-02-21T08:18:31.6595905Z %99 = tt.expand_dims %96 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:18:31.6596196Z %100 = tt.broadcast %98 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:18:31.6596467Z %101 = tt.broadcast %99 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:18:31.6596765Z %102 = arith.addi %100, %101 : tensor<32x32xi32> 2026-02-21T08:18:31.6597016Z %103 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:18:31.6597302Z %104 = tt.addptr %103, %102 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:18:31.6597617Z %105 = tt.load %104 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T08:18:31.6597935Z %106 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:31.6598219Z %107 = arith.extf %105 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:18:31.6598486Z %108 = tt.broadcast %106 : 
tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:18:31.6598723Z %109 = arith.subf %107, %108 : tensor<32x32xf32> 2026-02-21T08:18:31.6599176Z %110 = tt.extern_elementwise %109 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:18:31.6599597Z %111 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:31.6599903Z %112 = tt.broadcast %111 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:18:31.6600160Z %113 = arith.divf %110, %112 : tensor<32x32xf32> 2026-02-21T08:18:31.6600409Z %114 = arith.truncf %113 : tensor<32x32xf32> to tensor<32x32xf16> 2026-02-21T08:18:31.6600698Z %115 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:18:31.6600993Z %116 = tt.addptr %115, %102 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:18:31.6601273Z tt.store %116, %114 : tensor<32x32x!tt.ptr> 2026-02-21T08:18:31.6601489Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:18:31.6601763Z %25 = tt.splat %c2784_i32_2 : i32 -> tensor<32xi32> 2026-02-21T08:18:31.6602117Z %26 = arith.addi %25, %3 : tensor<32xi32> 2026-02-21T08:18:31.6602445Z %27 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:18:31.6602902Z %28 = arith.muli %27, %cst : tensor<32x1xi32> 2026-02-21T08:18:31.6603251Z %29 = tt.expand_dims %26 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:18:31.6603612Z %30 = tt.broadcast %28 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:18:31.6603995Z %31 = tt.broadcast %29 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:18:31.6604285Z %32 = arith.addi %30, %31 : tensor<32x32xi32> 2026-02-21T08:18:31.6604631Z %33 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:18:31.6604972Z %34 = tt.addptr %33, %32 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:18:31.6605378Z %35 = tt.load %34 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 
2026-02-21T08:18:31.6605798Z %36 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:31.6606173Z %37 = arith.extf %35 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:18:31.6606532Z %38 = tt.broadcast %36 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:18:31.6606844Z %39 = arith.subf %37, %38 : tensor<32x32xf32> 2026-02-21T08:18:31.6607327Z %40 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:18:31.6607853Z %41 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:18:31.6608216Z %42 = tt.broadcast %41 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:18:31.6608542Z %43 = arith.divf %40, %42 : tensor<32x32xf32> 2026-02-21T08:18:31.6608837Z %44 = arith.truncf %43 : tensor<32x32xf32> to tensor<32x32xf16> 2026-02-21T08:18:31.6609204Z %45 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:18:31.6609586Z %46 = tt.addptr %45, %32 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:18:31.6609973Z tt.store %46, %44 : tensor<32x32x!tt.ptr> 2026-02-21T08:18:31.6610321Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:18:31.6610604Z tt.return 2026-02-21T08:18:31.6610853Z } 2026-02-21T08:18:31.6611022Z } 2026-02-21T08:18:31.6611147Z 2026-02-21T08:18:31.6611206Z {-# 2026-02-21T08:18:31.6611459Z external_resources: { 2026-02-21T08:18:31.6611702Z mlir_reproducer: { 2026-02-21T08:18:31.6616076Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, 
triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:18:31.6620667Z disable_threading: false, 2026-02-21T08:18:31.6621026Z verify_each: true 2026-02-21T08:18:31.6621250Z } 2026-02-21T08:18:31.6621455Z } 2026-02-21T08:18:31.6621677Z #-} 2026-02-21T08:18:31.6622211Z /tmp/torchinductor_root/sc/csc5amcls62xcw2ccldevip5lr6kn5mvhagnayklxtlwdzf5nnbl.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:18:31.6623537Z /tmp/torchinductor_root/sc/csc5amcls62xcw2ccldevip5lr6kn5mvhagnayklxtlwdzf5nnbl.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` 
on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:18:31.6624603Z [38s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:18:31.6625782Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=128, num_sm_multiplier=64, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[3, 3], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:18:31.6626866Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:18:31.6627181Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:18:33.5733708Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 15.8 configs/s 2026-02-21T08:18:33.5744743Z [40s] Adaptive compile timeout: 30s (90% percentile=4.3s, bounds=[30.0s, 30s]) 2026-02-21T08:18:34.3689999Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1883.4 configs/s 2026-02-21T08:18:34.4224982Z [41s] Initial random population of 100, 5 starting points: 2026-02-21T08:18:34.4226952Z error=7 2026-02-21T08:18:34.4227295Z timeout=2 2026-02-21T08:18:34.4227473Z ok=91 2026-02-21T08:18:34.4227842Z min=0.0225 2026-02-21T08:18:34.4228029Z mid=0.2968 2026-02-21T08:18:34.4228226Z max=18.9768 2026-02-21T08:18:34.4228485Z best={'block_sizes': [2, 1024], 2026-02-21T08:18:34.4228876Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:18:34.4229220Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:18:34.4229541Z 'num_sm_multiplier': 64, 2026-02-21T08:18:34.4229795Z 'num_stages': 5, 2026-02-21T08:18:34.4229994Z 'num_warps': 1, 2026-02-21T08:18:34.4230277Z 'pid_type': 'persistent_interleaved', 
2026-02-21T08:18:34.4230535Z 'range_flattens': [True, True], 2026-02-21T08:18:34.4230810Z 'range_multi_buffers': [False, None], 2026-02-21T08:18:34.4231059Z 'range_num_stages': [3, 1], 2026-02-21T08:18:34.4231887Z 'range_unroll_factors': [0, 2], 2026-02-21T08:18:34.4232176Z 'range_warp_specializes': [True, None]} 2026-02-21T08:18:34.4237790Z [41s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:18:35.4987847Z [42s] Generation 1 starting: 86 neighbors, 5 active search path(s) 2026-02-21T08:18:54.1295413Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 1.1 configs/s 2026-02-21T08:18:59.5436465Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 16.8 configs/s 2026-02-21T08:19:00.5146816Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1040.8 2026-02-21T08:19:00.5148897Z configs/s 2026-02-21T08:19:00.6009917Z [67s] Generation 1 complete: 2026-02-21T08:19:00.6014428Z error=1 2026-02-21T08:19:00.6016540Z ok=90 2026-02-21T08:19:00.6016779Z min=0.0143 2026-02-21T08:19:00.6017058Z mid=0.0245 2026-02-21T08:19:00.6017394Z max=0.1863 2026-02-21T08:19:00.6019729Z best={'block_sizes': [1, 4096], 2026-02-21T08:19:00.6020042Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:19:00.6020441Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:19:00.6020706Z 'num_stages': 5, 2026-02-21T08:19:00.6020934Z 'num_warps': 4, 2026-02-21T08:19:00.6021188Z 'pid_type': 'flat', 2026-02-21T08:19:00.6021443Z 'range_flattens': [None, False], 2026-02-21T08:19:00.6021912Z 'range_multi_buffers': [None, True], 2026-02-21T08:19:00.6022207Z 'range_num_stages': [0, 4], 2026-02-21T08:19:00.6022476Z 'range_unroll_factors': [0, 0], 2026-02-21T08:19:00.6022727Z 'range_warp_specializes': [None, False]} 2026-02-21T08:19:00.6023125Z [67s] Fitting surrogate: 191 points, 191 targets 2026-02-21T08:19:01.6616005Z [68s] Generation 2 starting: 72 neighbors, 5 active search path(s) 2026-02-21T08:19:10.9697247Z Generation 2: 
precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 2.4 configs/s 2026-02-21T08:19:15.5610753Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 16.5 configs/s 2026-02-21T08:19:18.0489650Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 462.0 2026-02-21T08:19:18.0491109Z configs/s 2026-02-21T08:19:18.2465112Z [85s] Generation 2 complete: 2026-02-21T08:19:18.2469081Z ok=78 2026-02-21T08:19:18.2471385Z min=0.0143 2026-02-21T08:19:18.2472148Z mid=0.0225 2026-02-21T08:19:18.2476601Z max=0.8029 2026-02-21T08:19:18.2481166Z best={'block_sizes': [1, 4096], 2026-02-21T08:19:18.2482413Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:19:18.2482771Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:19:18.2483033Z 'num_stages': 5, 2026-02-21T08:19:18.2483352Z 'num_warps': 4, 2026-02-21T08:19:18.2483597Z 'pid_type': 'flat', 2026-02-21T08:19:18.2483811Z 'range_flattens': [None, False], 2026-02-21T08:19:18.2484100Z 'range_multi_buffers': [None, True], 2026-02-21T08:19:18.2484348Z 'range_num_stages': [0, 4], 2026-02-21T08:19:18.2485011Z 'range_unroll_factors': [0, 0], 2026-02-21T08:19:18.2485266Z 'range_warp_specializes': [None, False]} 2026-02-21T08:19:18.2485601Z [85s] Fitting surrogate: 269 points, 269 targets 2026-02-21T08:19:19.2855660Z [86s] Generation 3 starting: 72 neighbors, 5 active search path(s) 2026-02-21T08:19:27.1098643Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 4.1 configs/s 2026-02-21T08:19:31.5710621Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 17.0 configs/s 2026-02-21T08:19:34.2592021Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 379.2 2026-02-21T08:19:34.2596329Z configs/s 2026-02-21T08:19:34.4994627Z [101s] Generation 3 complete: 2026-02-21T08:19:34.4998385Z error=3 2026-02-21T08:19:34.5000099Z ok=75 2026-02-21T08:19:34.5000354Z min=0.0143 2026-02-21T08:19:34.5000553Z mid=0.0205 2026-02-21T08:19:34.5000816Z max=0.0880 
2026-02-21T08:19:34.5001033Z best={'block_sizes': [1, 4096], 2026-02-21T08:19:34.5001373Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:19:34.5001837Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:19:34.5004482Z 'num_stages': 5, 2026-02-21T08:19:34.5008153Z 'num_warps': 2, 2026-02-21T08:19:34.5009394Z 'pid_type': 'flat', 2026-02-21T08:19:34.5009686Z 'range_flattens': [None, True], 2026-02-21T08:19:34.5009947Z 'range_multi_buffers': [None, None], 2026-02-21T08:19:34.5010236Z 'range_num_stages': [0, 4], 2026-02-21T08:19:34.5010499Z 'range_unroll_factors': [0, 0], 2026-02-21T08:19:34.5010735Z 'range_warp_specializes': [None, None]} 2026-02-21T08:19:34.5012652Z [101s] Fitting surrogate: 347 points, 347 targets 2026-02-21T08:19:35.3079479Z [102s] Generation 4 starting: 59 neighbors, 4 active search path(s) 2026-02-21T08:19:42.2967966Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 2.9 configs/s 2026-02-21T08:19:46.0653510Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 61/61 16.4 configs/s 2026-02-21T08:19:48.7213186Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 384.0 2026-02-21T08:19:48.7214118Z configs/s 2026-02-21T08:19:48.9632115Z [116s] Generation 4 complete: 2026-02-21T08:19:48.9633123Z ok=64 2026-02-21T08:19:48.9633314Z min=0.0143 2026-02-21T08:19:48.9633604Z mid=0.0164 2026-02-21T08:19:48.9638172Z max=0.4035 2026-02-21T08:19:48.9639573Z best={'block_sizes': [1, 4096], 2026-02-21T08:19:48.9639964Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:19:48.9640367Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:19:48.9640605Z 'num_stages': 5, 2026-02-21T08:19:48.9640836Z 'num_warps': 2, 2026-02-21T08:19:48.9641095Z 'pid_type': 'flat', 2026-02-21T08:19:48.9641302Z 'range_flattens': [None, True], 2026-02-21T08:19:48.9641626Z 'range_multi_buffers': [None, None], 2026-02-21T08:19:48.9641877Z 'range_num_stages': [0, 4], 2026-02-21T08:19:48.9642167Z 
'range_unroll_factors': [0, 0], 2026-02-21T08:19:48.9642922Z 'range_warp_specializes': [None, None]} 2026-02-21T08:19:48.9645573Z [116s] Fitting surrogate: 411 points, 411 targets 2026-02-21T08:19:49.7529893Z [117s] Generation 5 starting: 53 neighbors, 4 active search path(s) 2026-02-21T08:19:53.0857187Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 19.0 configs/s 2026-02-21T08:19:56.4285347Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 54/54 16.4 configs/s 2026-02-21T08:19:59.6035465Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 364.2 2026-02-21T08:19:59.6035945Z configs/s 2026-02-21T08:19:59.8640688Z [127s] Generation 5 complete: 2026-02-21T08:19:59.8642998Z ok=58 2026-02-21T08:19:59.8644517Z min=0.0143 2026-02-21T08:19:59.8647605Z mid=0.0164 2026-02-21T08:19:59.8648906Z max=0.1352 2026-02-21T08:19:59.8649109Z best={'block_sizes': [1, 4096], 2026-02-21T08:19:59.8649972Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:19:59.8650345Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:19:59.8650627Z 'num_stages': 5, 2026-02-21T08:19:59.8650824Z 'num_warps': 2, 2026-02-21T08:19:59.8651076Z 'pid_type': 'flat', 2026-02-21T08:19:59.8651328Z 'range_flattens': [None, True], 2026-02-21T08:19:59.8651737Z 'range_multi_buffers': [None, None], 2026-02-21T08:19:59.8652031Z 'range_num_stages': [0, 4], 2026-02-21T08:19:59.8652260Z 'range_unroll_factors': [0, 0], 2026-02-21T08:19:59.8652523Z 'range_warp_specializes': [None, None]} 2026-02-21T08:19:59.8657014Z [127s] Fitting surrogate: 469 points, 469 targets 2026-02-21T08:20:00.7031273Z [128s] Generation 6 starting: 53 neighbors, 4 active search path(s) 2026-02-21T08:20:03.6680864Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 25.0 configs/s 2026-02-21T08:20:06.9515742Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 16.3 configs/s 2026-02-21T08:20:10.0670219Z Generation 6: verifying top configs 100% 
━━━━━━━━━━━━━━ 1000/1000 327.7 2026-02-21T08:20:10.0671227Z configs/s 2026-02-21T08:20:10.3446711Z [137s] Generation 6 complete: 2026-02-21T08:20:10.3450797Z ok=58 2026-02-21T08:20:10.3454484Z min=0.0143 2026-02-21T08:20:10.3456019Z mid=0.0144 2026-02-21T08:20:10.3456383Z max=0.0235 2026-02-21T08:20:10.3456621Z best={'block_sizes': [1, 4096], 2026-02-21T08:20:10.3456915Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:20:10.3457269Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:20:10.3457502Z 'num_stages': 5, 2026-02-21T08:20:10.3457728Z 'num_warps': 2, 2026-02-21T08:20:10.3457978Z 'pid_type': 'flat', 2026-02-21T08:20:10.3458199Z 'range_flattens': [None, True], 2026-02-21T08:20:10.3458463Z 'range_multi_buffers': [None, None], 2026-02-21T08:20:10.3458714Z 'range_num_stages': [0, 4], 2026-02-21T08:20:10.3458991Z 'range_unroll_factors': [0, 0], 2026-02-21T08:20:10.3459230Z 'range_warp_specializes': [None, None]} 2026-02-21T08:20:10.3466119Z [137s] Fitting surrogate: 527 points, 527 targets 2026-02-21T08:20:10.9062413Z [138s] Generation 7 starting: 25 neighbors, 2 active search path(s) 2026-02-21T08:20:12.3999008Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26/26 40.7 configs/s 2026-02-21T08:20:14.0097145Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 26/26 16.6 configs/s 2026-02-21T08:20:15.8592604Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 642.9 2026-02-21T08:20:15.8597406Z configs/s 2026-02-21T08:20:16.0006370Z [143s] Generation 7 complete: 2026-02-21T08:20:16.0011455Z ok=28 2026-02-21T08:20:16.0015853Z min=0.0143 2026-02-21T08:20:16.0020638Z mid=0.0145 2026-02-21T08:20:16.0024324Z max=0.0266 2026-02-21T08:20:16.0026403Z best={'block_sizes': [1, 4096], 2026-02-21T08:20:16.0026880Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:20:16.0027622Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:20:16.0027866Z 'num_stages': 2, 
2026-02-21T08:20:16.0028128Z 'num_warps': 2, 2026-02-21T08:20:16.0028342Z 'pid_type': 'flat', 2026-02-21T08:20:16.0028595Z 'range_flattens': [None, False], 2026-02-21T08:20:16.0028836Z 'range_multi_buffers': [None, False], 2026-02-21T08:20:16.0029122Z 'range_num_stages': [0, 1], 2026-02-21T08:20:16.0029373Z 'range_unroll_factors': [0, 0], 2026-02-21T08:20:16.0029607Z 'range_warp_specializes': [None, None]} 2026-02-21T08:20:16.0029927Z [143s] Fitting surrogate: 555 points, 555 targets 2026-02-21T08:20:16.5049969Z [143s] Generation 8 starting: 22 neighbors, 2 active search path(s) 2026-02-21T08:20:17.9437924Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 36.5 configs/s 2026-02-21T08:20:19.3627837Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 16.7 configs/s 2026-02-21T08:20:20.7143578Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 750.3 2026-02-21T08:20:20.7145018Z configs/s 2026-02-21T08:20:20.8285040Z [148s] Generation 8 complete: 2026-02-21T08:20:20.8286882Z ok=24 2026-02-21T08:20:20.8287198Z min=0.0143 2026-02-21T08:20:20.8287546Z mid=0.0164 2026-02-21T08:20:20.8287799Z max=0.0286 2026-02-21T08:20:20.8288021Z best={'block_sizes': [1, 4096], 2026-02-21T08:20:20.8288412Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:20:20.8288753Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:20:20.8289091Z 'num_stages': 2, 2026-02-21T08:20:20.8289419Z 'num_warps': 2, 2026-02-21T08:20:20.8289611Z 'pid_type': 'flat', 2026-02-21T08:20:20.8289872Z 'range_flattens': [None, False], 2026-02-21T08:20:20.8290112Z 'range_multi_buffers': [None, False], 2026-02-21T08:20:20.8290418Z 'range_num_stages': [0, 1], 2026-02-21T08:20:20.8290648Z 'range_unroll_factors': [0, 0], 2026-02-21T08:20:20.8290937Z 'range_warp_specializes': [None, None]} 2026-02-21T08:20:20.8302085Z [148s] Fitting surrogate: 579 points, 579 targets 2026-02-21T08:20:21.2609616Z [148s] Generation 9 starting: 16 neighbors, 1 active 
search path(s) 2026-02-21T08:20:22.3639381Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 21.6 configs/s 2026-02-21T08:20:23.4105094Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 16.9 configs/s 2026-02-21T08:20:24.4027958Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1019.1 2026-02-21T08:20:24.4032257Z configs/s 2026-02-21T08:20:24.4950213Z [151s] Generation 9 complete: 2026-02-21T08:20:24.4954519Z ok=18 2026-02-21T08:20:24.4956054Z min=0.0143 2026-02-21T08:20:24.4956329Z mid=0.0144 2026-02-21T08:20:24.4956541Z max=0.0235 2026-02-21T08:20:24.4956766Z best={'block_sizes': [1, 4096], 2026-02-21T08:20:24.4957322Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:20:24.4959213Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:20:24.4959543Z 'num_stages': 2, 2026-02-21T08:20:24.4964211Z 'num_warps': 2, 2026-02-21T08:20:24.4968474Z 'pid_type': 'flat', 2026-02-21T08:20:24.4972267Z 'range_flattens': [None, False], 2026-02-21T08:20:24.4972610Z 'range_multi_buffers': [None, False], 2026-02-21T08:20:24.4972887Z 'range_num_stages': [0, 1], 2026-02-21T08:20:24.4977056Z 'range_unroll_factors': [0, 0], 2026-02-21T08:20:24.4979117Z 'range_warp_specializes': [None, None]} 2026-02-21T08:20:24.4979417Z [151s] Fitting surrogate: 597 points, 597 targets 2026-02-21T08:20:24.9466902Z [152s] Generation 10 starting: 15 neighbors, 1 active search path(s) 2026-02-21T08:20:29.4577990Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.7 configs/s 2026-02-21T08:20:30.4466501Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.9 configs/s 2026-02-21T08:20:31.3312834Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1141.0 2026-02-21T08:20:31.3314453Z configs/s 2026-02-21T08:20:31.4171337Z [158s] Generation 10 complete: 2026-02-21T08:20:31.4175681Z ok=17 2026-02-21T08:20:31.4177185Z min=0.0143 2026-02-21T08:20:31.4177436Z mid=0.0164 
2026-02-21T08:20:31.4177646Z max=0.0328 2026-02-21T08:20:31.4177825Z best={'block_sizes': [1, 4096], 2026-02-21T08:20:31.4178200Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:20:31.4178537Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:20:31.4178758Z 'num_stages': 2, 2026-02-21T08:20:31.4179030Z 'num_warps': 2, 2026-02-21T08:20:31.4179223Z 'pid_type': 'flat', 2026-02-21T08:20:31.4179446Z 'range_flattens': [None, False], 2026-02-21T08:20:31.4179714Z 'range_multi_buffers': [None, False], 2026-02-21T08:20:31.4179983Z 'range_num_stages': [0, 1], 2026-02-21T08:20:31.4180200Z 'range_unroll_factors': [0, 0], 2026-02-21T08:20:31.4180505Z 'range_warp_specializes': [None, None]} 2026-02-21T08:20:31.4190411Z [158s] Fitting surrogate: 614 points, 614 targets 2026-02-21T08:20:31.6817565Z [159s] Autotuning complete in 159.0s after searching 588 configs. 2026-02-21T08:20:31.6818115Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:20:31.6819275Z @helion.kernel(config=helion.Config(block_sizes=[1, 4096], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T08:20:31.6820127Z 2026-02-21T08:20:31.6820407Z [159s] Code of selected kernel: /tmp/torchinductor_root/tv/ctvw6nvwxfmleqem5ps4ftlh564h4auc6z3vbynzdex7px3ify2o.py 2026-02-21T08:20:32.4457353Z WARNING:tritonbench.utils.triton_op:Completed input ID 20: 2026-02-21T08:20:32.4462289Z (M, N) 2026-02-21T08:20:32.4464600Z ------------ 2026-02-21T08:20:32.4466865Z (4096, 2816) 2026-02-21T08:20:32.4467242Z 2026-02-21T08:20:32.4472826Z 25%|██▌ | 5/20 [11:30<37:04, 148.29s/it]WARNING:tritonbench.utils.triton_op:Running input ID 26: 2026-02-21T08:20:32.4473545Z (M, N) 2026-02-21T08:20:32.4473847Z ------------ 
2026-02-21T08:20:32.4474075Z (4096, 3584) 2026-02-21T08:20:32.4474445Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T08:20:33.8102030Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T08:20:35.3632561Z INFO:tritonbench.utils.triton_op:Took 2.37ms to get benchmark function for torch_compile_softmax 2026-02-21T08:20:38.6476619Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:20:38.6478106Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:20:38.6478354Z 'dtype': 'torch.float16', 2026-02-21T08:20:38.6478658Z 'shape': (4096, 3584), 2026-02-21T08:20:38.6478902Z 'stride': (3584, 1)},), 2026-02-21T08:20:38.6479237Z 'kwargs': {}} 2026-02-21T08:20:38.6495892Z INFO:tritonbench.utils.triton_op:Took 2.47ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T08:20:38.8265286Z [0s] Autotune random seed: 2134816249 2026-02-21T08:20:38.9705959Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:21:12.9628131Z [33s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None]) 2026-02-21T08:21:13.2226530Z [34s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False]) 2026-02-21T08:21:13.2241942Z Initial 
population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.9 configs/s 2026-02-21T08:21:13.3914693Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:21:13.3915451Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:21:13.3916241Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:21:13.3916555Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:21:13.3916819Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:21:13.3917124Z %cst = arith.constant dense<3584> : tensor<8x1xi32> 2026-02-21T08:21:13.3917461Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<8xf32> 2026-02-21T08:21:13.3917824Z %cst_1 = arith.constant dense<0xFF800000> : tensor<8xf32> 2026-02-21T08:21:13.3918136Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:21:13.3918410Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:21:13.3918717Z %c3584_i32 = arith.constant 3584 : i32 2026-02-21T08:21:13.3918942Z %c3584_i64 = arith.constant 3584 : i64 2026-02-21T08:21:13.3919199Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:21:13.3919634Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c3584_i32], [%c3584_i64, %c1_i64] : , > 2026-02-21T08:21:13.3920007Z %1 = tt.get_program_id x : i32 2026-02-21T08:21:13.3920282Z scf.for %arg2 = %1 to %c512_i32 step %c9472_i32 : i32 { 2026-02-21T08:21:13.3920590Z %2 = arith.muli %arg2, %c8_i32 : i32 2026-02-21T08:21:13.3920893Z %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:21:13.3921185Z %4 = tt.splat %2 : i32 -> tensor<8xi32> 2026-02-21T08:21:13.3921617Z %5 = arith.addi %4, %3 : tensor<8xi32> 2026-02-21T08:21:13.3921899Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:21:13.3922272Z %c2048_i32_2 = arith.constant 2048 : i32 2026-02-21T08:21:13.3922690Z %6 = tt.descriptor_load %0[%2, %c0_i32] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:21:13.3923061Z %7 = arith.extf %6 : tensor<8x512xf16> 
to tensor<8x512xf32> 2026-02-21T08:21:13.3923372Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:21:13.3923629Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:21:13.3923892Z %183 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:21:13.3924165Z tt.reduce.return %183 : f32 2026-02-21T08:21:13.3924410Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:21:13.3924702Z %9 = arith.truncf %8 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:21:13.3924972Z %10 = arith.extf %9 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:21:13.3925411Z %11 = arith.cmpf ogt, %cst_1, %10 : tensor<8xf32> 2026-02-21T08:21:13.3925691Z %12 = arith.cmpf une, %cst_1, %cst_1 : tensor<8xf32> 2026-02-21T08:21:13.3926337Z %13 = arith.ori %11, %12 : tensor<8xi1> 2026-02-21T08:21:13.3926654Z %14 = arith.select %13, %cst_1, %10 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:21:13.3926944Z %15 = arith.subf %cst_1, %14 : tensor<8xf32> 2026-02-21T08:21:13.3927405Z %16 = tt.extern_elementwise %15 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:21:13.3927819Z %17 = arith.mulf %cst_0, %16 : tensor<8xf32> 2026-02-21T08:21:13.3928179Z %18 = tt.expand_dims %14 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:21:13.3928554Z %19 = tt.broadcast %18 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:21:13.3928829Z %20 = arith.subf %7, %19 : tensor<8x512xf32> 2026-02-21T08:21:13.3929380Z %21 = tt.extern_elementwise %20 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:21:13.3929788Z %22 = "tt.reduce"(%21) <{axis = 1 : i32}> ({ 2026-02-21T08:21:13.3930066Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:21:13.3930670Z %183 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:21:13.3930972Z tt.reduce.return %183 : f32 2026-02-21T08:21:13.3931241Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:21:13.3931484Z %23 = arith.addf %17, %22 : tensor<8xf32> 
2026-02-21T08:21:13.3931836Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:21:13.3932070Z %24 = arith.muli %c512_i32, %c1_i32 : i32 2026-02-21T08:21:13.3932361Z %25 = arith.addi %c0_i32, %24 : i32 2026-02-21T08:21:13.3932747Z %26 = tt.descriptor_load %0[%2, %25] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:21:13.3933118Z %27 = arith.extf %26 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:21:13.3933413Z %28 = "tt.reduce"(%27) <{axis = 1 : i32}> ({ 2026-02-21T08:21:13.3933674Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:21:13.3933932Z %183 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:21:13.3934166Z tt.reduce.return %183 : f32 2026-02-21T08:21:13.3934439Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:21:13.3934742Z %29 = arith.truncf %28 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:21:13.3935011Z %30 = arith.extf %29 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:21:13.3935341Z %31 = arith.cmpf ogt, %14, %30 : tensor<8xf32> 2026-02-21T08:21:13.3935599Z %32 = arith.cmpf une, %14, %14 : tensor<8xf32> 2026-02-21T08:21:13.3935865Z %33 = arith.ori %31, %32 : tensor<8xi1> 2026-02-21T08:21:13.3936165Z %34 = arith.select %33, %14, %30 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:21:13.3936464Z %35 = arith.subf %14, %34 : tensor<8xf32> 2026-02-21T08:21:13.3936872Z %36 = tt.extern_elementwise %35 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:21:13.3937297Z %37 = arith.mulf %23, %36 : tensor<8xf32> 2026-02-21T08:21:13.3937625Z %38 = tt.expand_dims %34 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:21:13.3937948Z %39 = tt.broadcast %38 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:21:13.3938268Z %40 = arith.subf %27, %39 : tensor<8x512xf32> 2026-02-21T08:21:13.3938707Z %41 = tt.extern_elementwise %40 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:21:13.3939114Z %42 = "tt.reduce"(%41) 
<{axis = 1 : i32}> ({ 2026-02-21T08:21:13.3939386Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:21:13.3939615Z %183 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:21:13.3939875Z tt.reduce.return %183 : f32 2026-02-21T08:21:13.3940102Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:21:13.3940386Z %43 = arith.addf %37, %42 : tensor<8xf32> 2026-02-21T08:21:13.3940651Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:21:13.3940954Z %44 = arith.muli %c512_i32, %c2_i32 : i32 2026-02-21T08:21:13.3941233Z %45 = arith.addi %c0_i32, %44 : i32 2026-02-21T08:21:13.3941580Z %46 = tt.descriptor_load %0[%2, %45] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:21:13.3941970Z %47 = arith.extf %46 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:21:13.3942289Z %48 = "tt.reduce"(%47) <{axis = 1 : i32}> ({ 2026-02-21T08:21:13.3942520Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:21:13.3942783Z %183 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:21:13.3943026Z tt.reduce.return %183 : f32 2026-02-21T08:21:13.3943281Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:21:13.3943527Z %49 = arith.truncf %48 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:21:13.3943875Z %50 = arith.extf %49 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:21:13.3944168Z %51 = arith.cmpf ogt, %34, %50 : tensor<8xf32> 2026-02-21T08:21:13.3944476Z %52 = arith.cmpf une, %34, %34 : tensor<8xf32> 2026-02-21T08:21:13.3944796Z %53 = arith.ori %51, %52 : tensor<8xi1> 2026-02-21T08:21:13.3945063Z %54 = arith.select %53, %34, %50 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:21:13.3945525Z %55 = arith.subf %34, %54 : tensor<8xf32> 2026-02-21T08:21:13.3945939Z %56 = tt.extern_elementwise %55 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:21:13.3946362Z %57 = arith.mulf %43, %56 : tensor<8xf32> 2026-02-21T08:21:13.3946679Z %58 = tt.expand_dims %54 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:21:13.3947035Z %59 = 
tt.broadcast %58 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:21:13.3947346Z %60 = arith.subf %47, %59 : tensor<8x512xf32> 2026-02-21T08:21:13.3947750Z %61 = tt.extern_elementwise %60 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:21:13.3948205Z %62 = "tt.reduce"(%61) <{axis = 1 : i32}> ({ 2026-02-21T08:21:13.3948479Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:21:13.3948699Z %183 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:21:13.3948975Z tt.reduce.return %183 : f32 2026-02-21T08:21:13.3949205Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:21:13.3949473Z %63 = arith.addf %57, %62 : tensor<8xf32> 2026-02-21T08:21:13.3949709Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:21:13.3949992Z %64 = arith.muli %c512_i32, %c3_i32 : i32 2026-02-21T08:21:13.3950252Z %65 = arith.addi %c0_i32, %64 : i32 2026-02-21T08:21:13.3950600Z %66 = tt.descriptor_load %0[%2, %65] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:21:13.3950988Z %67 = arith.extf %66 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:21:13.3951252Z %68 = "tt.reduce"(%67) <{axis = 1 : i32}> ({ 2026-02-21T08:21:13.3951578Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:21:13.3951813Z %183 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:21:13.3952073Z tt.reduce.return %183 : f32 2026-02-21T08:21:13.4026259Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:21:13.4026708Z %69 = arith.truncf %68 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:21:13.4026966Z %70 = arith.extf %69 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:21:13.4027206Z %71 = arith.cmpf ogt, %54, %70 : tensor<8xf32> 2026-02-21T08:21:13.4027415Z %72 = arith.cmpf une, %54, %54 : tensor<8xf32> 2026-02-21T08:21:13.4027629Z %73 = arith.ori %71, %72 : tensor<8xi1> 2026-02-21T08:21:13.4027868Z %74 = arith.select %73, %54, %70 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:21:13.4028102Z %75 = arith.subf %54, %74 : tensor<8xf32> 
2026-02-21T08:21:13.4028465Z %76 = tt.extern_elementwise %75 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:21:13.4028835Z %77 = arith.mulf %63, %76 : tensor<8xf32> 2026-02-21T08:21:13.4029319Z %78 = tt.expand_dims %74 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:21:13.4029618Z %79 = tt.broadcast %78 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:21:13.4029855Z %80 = arith.subf %67, %79 : tensor<8x512xf32> 2026-02-21T08:21:13.4030267Z %81 = tt.extern_elementwise %80 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:21:13.4030639Z %82 = "tt.reduce"(%81) <{axis = 1 : i32}> ({ 2026-02-21T08:21:13.4030830Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:21:13.4031022Z %183 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:21:13.4031206Z tt.reduce.return %183 : f32 2026-02-21T08:21:13.4031395Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:21:13.4031630Z %83 = arith.addf %77, %82 : tensor<8xf32> 2026-02-21T08:21:13.4032082Z %84:2 = scf.for %arg3 = %c2048_i32 to %c3584_i32 step %c512_i32 iter_args(%arg4 = %74, %arg5 = %83) -> (tensor<8xf32>, tensor<8xf32>) : i32 { 2026-02-21T08:21:13.4032544Z %183 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:21:13.4032866Z %184 = arith.extf %183 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:21:13.4033116Z %185 = "tt.reduce"(%184) <{axis = 1 : i32}> ({ 2026-02-21T08:21:13.4033322Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:21:13.4033511Z %201 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:21:13.4033713Z tt.reduce.return %201 : f32 2026-02-21T08:21:13.4033899Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:21:13.4034133Z %186 = arith.truncf %185 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:21:13.4034379Z %187 = arith.extf %186 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:21:13.4034623Z %188 = arith.cmpf ogt, 
%arg4, %187 : tensor<8xf32> 2026-02-21T08:21:13.4034852Z %189 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32> 2026-02-21T08:21:13.4035076Z %190 = arith.ori %188, %189 : tensor<8xi1> 2026-02-21T08:21:13.4035319Z %191 = arith.select %190, %arg4, %187 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:21:13.4035564Z %192 = arith.subf %arg4, %191 : tensor<8xf32> 2026-02-21T08:21:13.4035927Z %193 = tt.extern_elementwise %192 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:21:13.4036288Z %194 = arith.mulf %arg5, %193 : tensor<8xf32> 2026-02-21T08:21:13.4036552Z %195 = tt.expand_dims %191 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:21:13.4036847Z %196 = tt.broadcast %195 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:21:13.4037089Z %197 = arith.subf %184, %196 : tensor<8x512xf32> 2026-02-21T08:21:13.4037463Z %198 = tt.extern_elementwise %197 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:21:13.4037828Z %199 = "tt.reduce"(%198) <{axis = 1 : i32}> ({ 2026-02-21T08:21:13.4038024Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:21:13.4038205Z %201 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:21:13.4038398Z tt.reduce.return %201 : f32 2026-02-21T08:21:13.4038588Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:21:13.4038784Z %200 = arith.addf %194, %199 : tensor<8xf32> 2026-02-21T08:21:13.4039007Z scf.yield %191, %200 : tensor<8xf32>, tensor<8xf32> 2026-02-21T08:21:13.4039243Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:21:13.4039474Z %c2048_i32_3 = arith.constant 2048 : i32 2026-02-21T08:21:13.4039659Z %c2048_i32_4 = arith.constant 2048 : i32 2026-02-21T08:21:13.4039899Z %85 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:21:13.4040159Z %86 = tt.splat %c0_i32 : i32 -> tensor<512xi32> 2026-02-21T08:21:13.4040365Z %87 = arith.addi %86, %85 : 
tensor<512xi32> 2026-02-21T08:21:13.4040678Z %88 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:21:13.4040926Z %89 = arith.muli %88, %cst : tensor<8x1xi32> 2026-02-21T08:21:13.4041194Z %90 = tt.expand_dims %87 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:21:13.4041488Z %91 = tt.broadcast %89 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:21:13.4041800Z %92 = tt.broadcast %90 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:21:13.4042053Z %93 = arith.addi %91, %92 : tensor<8x512xi32> 2026-02-21T08:21:13.4042298Z %94 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4042597Z %95 = tt.addptr %94, %93 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:21:13.4042908Z %96 = tt.load %95 evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4043310Z %97 = tt.expand_dims %84#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:21:13.4043611Z %98 = arith.extf %96 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:21:13.4043876Z %99 = tt.broadcast %97 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:21:13.4044120Z %100 = arith.subf %98, %99 : tensor<8x512xf32> 2026-02-21T08:21:13.4044508Z %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:21:13.4044947Z %102 = tt.expand_dims %84#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:21:13.4045247Z %103 = tt.broadcast %102 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:21:13.4045502Z %104 = arith.divf %101, %103 : tensor<8x512xf32> 2026-02-21T08:21:13.4045756Z %105 = arith.truncf %104 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:21:13.4046044Z %106 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4046350Z %107 = tt.addptr %106, %93 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:21:13.4046622Z tt.store %107, %105 : 
tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4046847Z %c1_i32_5 = arith.constant 1 : i32 2026-02-21T08:21:13.4047048Z %108 = arith.muli %c512_i32, %c1_i32_5 : i32 2026-02-21T08:21:13.4047261Z %109 = arith.addi %c0_i32, %108 : i32 2026-02-21T08:21:13.4047515Z %110 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:21:13.4047775Z %111 = tt.splat %109 : i32 -> tensor<512xi32> 2026-02-21T08:21:13.4048000Z %112 = arith.addi %111, %110 : tensor<512xi32> 2026-02-21T08:21:13.4048258Z %113 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:21:13.4048529Z %114 = arith.muli %113, %cst : tensor<8x1xi32> 2026-02-21T08:21:13.4048801Z %115 = tt.expand_dims %112 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:21:13.4049115Z %116 = tt.broadcast %114 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:21:13.4049403Z %117 = tt.broadcast %115 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:21:13.4049639Z %118 = arith.addi %116, %117 : tensor<8x512xi32> 2026-02-21T08:21:13.4049878Z %119 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4050153Z %120 = tt.addptr %119, %118 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:21:13.4050459Z %121 = tt.load %120 evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4050758Z %122 = tt.expand_dims %84#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:21:13.4051047Z %123 = arith.extf %121 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:21:13.4051312Z %124 = tt.broadcast %122 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:21:13.4051588Z %125 = arith.subf %123, %124 : tensor<8x512xf32> 2026-02-21T08:21:13.4051968Z %126 = tt.extern_elementwise %125 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:21:13.4052437Z %127 = tt.expand_dims %84#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 
2026-02-21T08:21:13.4052725Z %128 = tt.broadcast %127 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:21:13.4052965Z %129 = arith.divf %126, %128 : tensor<8x512xf32> 2026-02-21T08:21:13.4053200Z %130 = arith.truncf %129 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:21:13.4053476Z %131 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4053750Z %132 = tt.addptr %131, %118 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:21:13.4054019Z tt.store %132, %130 : tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4054219Z %c2_i32_6 = arith.constant 2 : i32 2026-02-21T08:21:13.4054415Z %133 = arith.muli %c512_i32, %c2_i32_6 : i32 2026-02-21T08:21:13.4054614Z %134 = arith.addi %c0_i32, %133 : i32 2026-02-21T08:21:13.4054901Z %135 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:21:13.4055159Z %136 = tt.splat %134 : i32 -> tensor<512xi32> 2026-02-21T08:21:13.4055363Z %137 = arith.addi %136, %135 : tensor<512xi32> 2026-02-21T08:21:13.4055616Z %138 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:21:13.4055870Z %139 = arith.muli %138, %cst : tensor<8x1xi32> 2026-02-21T08:21:13.4056137Z %140 = tt.expand_dims %137 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:21:13.4056434Z %141 = tt.broadcast %139 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:21:13.4056697Z %142 = tt.broadcast %140 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:21:13.4056940Z %143 = arith.addi %141, %142 : tensor<8x512xi32> 2026-02-21T08:21:13.4057174Z %144 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4057457Z %145 = tt.addptr %144, %143 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:21:13.4057759Z %146 = tt.load %145 evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4058071Z %147 = tt.expand_dims %84#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:21:13.4058358Z %148 = arith.extf %146 : 
tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:21:13.4058612Z %149 = tt.broadcast %147 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:21:13.4058852Z %150 = arith.subf %148, %149 : tensor<8x512xf32> 2026-02-21T08:21:13.4059217Z %151 = tt.extern_elementwise %150 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:21:13.4059634Z %152 = tt.expand_dims %84#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:21:13.4059925Z %153 = tt.broadcast %152 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:21:13.4060155Z %154 = arith.divf %151, %153 : tensor<8x512xf32> 2026-02-21T08:21:13.4060393Z %155 = arith.truncf %154 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:21:13.4060656Z %156 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4060940Z %157 = tt.addptr %156, %143 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:21:13.4061191Z tt.store %157, %155 : tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4061398Z %c3_i32_7 = arith.constant 3 : i32 2026-02-21T08:21:13.4061617Z %158 = arith.muli %c512_i32, %c3_i32_7 : i32 2026-02-21T08:21:13.4061808Z %159 = arith.addi %c0_i32, %158 : i32 2026-02-21T08:21:13.4062049Z %160 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:21:13.4062301Z %161 = tt.splat %159 : i32 -> tensor<512xi32> 2026-02-21T08:21:13.4062513Z %162 = arith.addi %161, %160 : tensor<512xi32> 2026-02-21T08:21:13.4062757Z %163 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:21:13.4063020Z %164 = arith.muli %163, %cst : tensor<8x1xi32> 2026-02-21T08:21:13.4063346Z %165 = tt.expand_dims %162 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:21:13.4063630Z %166 = tt.broadcast %164 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:21:13.4063893Z %167 = tt.broadcast %165 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:21:13.4064131Z %168 = 
arith.addi %166, %167 : tensor<8x512xi32> 2026-02-21T08:21:13.4064358Z %169 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4064636Z %170 = tt.addptr %169, %168 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:21:13.4064930Z %171 = tt.load %170 evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4065233Z %172 = tt.expand_dims %84#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:21:13.4065513Z %173 = arith.extf %171 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:21:13.4065807Z %174 = tt.broadcast %172 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:21:13.4066046Z %175 = arith.subf %173, %174 : tensor<8x512xf32> 2026-02-21T08:21:13.4066403Z %176 = tt.extern_elementwise %175 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:21:13.4066814Z %177 = tt.expand_dims %84#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:21:13.4067098Z %178 = tt.broadcast %177 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:21:13.4067323Z %179 = arith.divf %176, %178 : tensor<8x512xf32> 2026-02-21T08:21:13.4067561Z %180 = arith.truncf %179 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:21:13.4067821Z %181 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4068098Z %182 = tt.addptr %181, %168 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:21:13.4068352Z tt.store %182, %180 : tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4068598Z scf.for %arg3 = %c2048_i32_3 to %c3584_i32 step %c512_i32 : i32 { 2026-02-21T08:21:13.4068880Z %183 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:21:13.4069126Z %184 = tt.splat %arg3 : i32 -> tensor<512xi32> 2026-02-21T08:21:13.4069334Z %185 = arith.addi %184, %183 : tensor<512xi32> 2026-02-21T08:21:13.4069575Z %186 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:21:13.4069832Z %187 = 
arith.muli %186, %cst : tensor<8x1xi32> 2026-02-21T08:21:13.4070095Z %188 = tt.expand_dims %185 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:21:13.4070386Z %189 = tt.broadcast %187 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:21:13.4070654Z %190 = tt.broadcast %188 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:21:13.4070890Z %191 = arith.addi %189, %190 : tensor<8x512xi32> 2026-02-21T08:21:13.4071133Z %192 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4071415Z %193 = tt.addptr %192, %191 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:21:13.4071883Z %194 = tt.load %193 evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4072203Z %195 = tt.expand_dims %84#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:21:13.4072490Z %196 = arith.extf %194 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:21:13.4072757Z %197 = tt.broadcast %195 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:21:13.4072991Z %198 = arith.subf %196, %197 : tensor<8x512xf32> 2026-02-21T08:21:13.4073371Z %199 = tt.extern_elementwise %198 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:21:13.4073799Z %200 = tt.expand_dims %84#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:21:13.4074144Z %201 = tt.broadcast %200 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:21:13.4074386Z %202 = arith.divf %199, %201 : tensor<8x512xf32> 2026-02-21T08:21:13.4074622Z %203 = arith.truncf %202 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:21:13.4074894Z %204 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4075173Z %205 = tt.addptr %204, %191 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:21:13.4075439Z tt.store %205, %203 : tensor<8x512x!tt.ptr> 2026-02-21T08:21:13.4075679Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:21:13.4075965Z } 
{tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:21:13.4076224Z tt.return 2026-02-21T08:21:13.4076358Z } 2026-02-21T08:21:13.4076493Z } 2026-02-21T08:21:13.4076561Z 2026-02-21T08:21:13.4076611Z {-# 2026-02-21T08:21:13.4076745Z external_resources: { 2026-02-21T08:21:13.4076950Z mlir_reproducer: { 2026-02-21T08:21:13.4081212Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, 
triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:21:13.4085647Z disable_threading: false, 2026-02-21T08:21:13.4085824Z verify_each: true 2026-02-21T08:21:13.4085990Z } 2026-02-21T08:21:13.4086138Z } 2026-02-21T08:21:13.4086291Z #-} 2026-02-21T08:21:13.4086849Z /tmp/torchinductor_root/gh/cghtk5hmlnh2ptfqo2is6rzrnewaeoxnr36ykwemcetkbrwezaxq.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:21:13.4088270Z /tmp/torchinductor_root/gh/cghtk5hmlnh2ptfqo2is6rzrnewaeoxnr36ykwemcetkbrwezaxq.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:21:13.4089385Z [34s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:21:13.4090605Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 512], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'last'], maxnreg=32, num_sm_multiplier=64, num_stages=7, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[4, 4], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:21:13.4091785Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:21:13.4092089Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:21:16.5269678Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:21:16.5274738Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:21:16.5275716Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:21:16.5275925Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:21:16.5276161Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:21:16.5279266Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:21:16.5279545Z %cst = arith.constant dense<3584> : tensor<32x1xi32> 2026-02-21T08:21:16.5280151Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<32xf32> 2026-02-21T08:21:16.5280452Z %cst_1 = arith.constant dense<0xFF800000> : tensor<32xf32> 2026-02-21T08:21:16.5280674Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:21:16.5280866Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:21:16.5281047Z %c3584_i32 = arith.constant 3584 : i32 2026-02-21T08:21:16.5281231Z %c3584_i64 = arith.constant 3584 : i64 2026-02-21T08:21:16.5281407Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:21:16.5282220Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c3584_i32], [%c3584_i64, %c1_i64] : , > 2026-02-21T08:21:16.5282664Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c3584_i32], [%c3584_i64, %c1_i64] : , > 2026-02-21T08:21:16.5282980Z %2 = tt.get_program_id x : i32 2026-02-21T08:21:16.5283177Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:21:16.5283360Z %4 = arith.minsi %3, %c128_i32 : i32 2026-02-21T08:21:16.5283575Z scf.for %arg2 = %2 to %4 step %c1_i32 : i32 { 2026-02-21T08:21:16.5283784Z %5 = arith.muli %arg2, %c32_i32 : i32 2026-02-21T08:21:16.5284021Z %6 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:21:16.5284279Z %7 = tt.splat %5 : i32 -> tensor<32xi32> 2026-02-21T08:21:16.5284468Z %8 = arith.addi %7, %6 : tensor<32xi32> 2026-02-21T08:21:16.5284659Z %c3576_i32 = arith.constant 3576 : i32 
2026-02-21T08:21:16.5284843Z %c24_i32 = arith.constant 24 : i32 2026-02-21T08:21:16.5285209Z %9:2 = scf.for %arg3 = %c0_i32 to %c3576_i32 step %c24_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<32xf32>, tensor<32xf32>) : i32 { 2026-02-21T08:21:16.5285614Z %49 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:21:16.5285867Z %50 = tt.splat %arg3 : i32 -> tensor<8xi32> 2026-02-21T08:21:16.5286076Z %51 = arith.addi %50, %49 : tensor<8xi32> 2026-02-21T08:21:16.5286325Z %52 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:21:16.5286595Z %53 = arith.muli %52, %cst : tensor<32x1xi32> 2026-02-21T08:21:16.5286843Z %54 = tt.expand_dims %51 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:21:16.5287128Z %55 = tt.broadcast %53 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:21:16.5287389Z %56 = tt.broadcast %54 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:21:16.5287615Z %57 = arith.addi %55, %56 : tensor<32x8xi32> 2026-02-21T08:21:16.5287896Z %58 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:21:16.5288171Z %59 = tt.addptr %58, %57 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:21:16.5288460Z %60 = tt.load %59 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 2026-02-21T08:21:16.5288743Z %61 = arith.extf %60 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:21:16.5288973Z %62 = "tt.reduce"(%61) <{axis = 1 : i32}> ({ 2026-02-21T08:21:16.5289165Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:21:16.5289520Z %140 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:21:16.5289716Z tt.reduce.return %140 : f32 2026-02-21T08:21:16.5289909Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:21:16.5290130Z %63 = arith.truncf %62 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:21:16.5290379Z %64 = arith.extf %63 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:21:16.5290607Z %65 = arith.cmpf ogt, %arg4, %64 : tensor<32xf32> 2026-02-21T08:21:16.5290838Z 
%66 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32> 2026-02-21T08:21:16.5291054Z %67 = arith.ori %65, %66 : tensor<32xi1> 2026-02-21T08:21:16.5291285Z %68 = arith.select %67, %arg4, %64 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:21:16.5291531Z %69 = arith.subf %arg4, %68 : tensor<32xf32> 2026-02-21T08:21:16.5292036Z %70 = tt.extern_elementwise %69 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:21:16.5292410Z %71 = arith.mulf %arg5, %70 : tensor<32xf32> 2026-02-21T08:21:16.5292674Z %72 = tt.expand_dims %68 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:16.5292961Z %73 = tt.broadcast %72 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:21:16.5293189Z %74 = arith.subf %61, %73 : tensor<32x8xf32> 2026-02-21T08:21:16.5293540Z %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:21:16.5293906Z %76 = "tt.reduce"(%75) <{axis = 1 : i32}> ({ 2026-02-21T08:21:16.5294099Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:21:16.5294296Z %140 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:21:16.5294498Z tt.reduce.return %140 : f32 2026-02-21T08:21:16.5294682Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:21:16.5294886Z %77 = arith.addf %71, %76 : tensor<32xf32> 2026-02-21T08:21:16.5295080Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T08:21:16.5295274Z %78 = arith.muli %c8_i32, %c1_i32_4 : i32 2026-02-21T08:21:16.5295461Z %79 = arith.addi %arg3, %78 : i32 2026-02-21T08:21:16.5295688Z %80 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:21:16.5295925Z %81 = tt.splat %79 : i32 -> tensor<8xi32> 2026-02-21T08:21:16.5296121Z %82 = arith.addi %81, %80 : tensor<8xi32> 2026-02-21T08:21:16.5296371Z %83 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:21:16.5296626Z %84 = arith.muli %83, %cst : tensor<32x1xi32> 
2026-02-21T08:21:16.5296878Z %85 = tt.expand_dims %82 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:21:16.5297150Z %86 = tt.broadcast %84 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:21:16.5297407Z %87 = tt.broadcast %85 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:21:16.5297644Z %88 = arith.addi %86, %87 : tensor<32x8xi32> 2026-02-21T08:21:16.5297873Z %89 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:21:16.5298142Z %90 = tt.addptr %89, %88 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:21:16.5298425Z %91 = tt.load %90 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 2026-02-21T08:21:16.5298708Z %92 = arith.extf %91 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:21:16.5298923Z %93 = "tt.reduce"(%92) <{axis = 1 : i32}> ({ 2026-02-21T08:21:16.5299114Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:21:16.5299301Z %140 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:21:16.5299486Z tt.reduce.return %140 : f32 2026-02-21T08:21:16.5299671Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:21:16.5299886Z %94 = arith.truncf %93 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:21:16.5300130Z %95 = arith.extf %94 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:21:16.5300419Z %96 = arith.cmpf ogt, %68, %95 : tensor<32xf32> 2026-02-21T08:21:16.5300639Z %97 = arith.cmpf une, %68, %68 : tensor<32xf32> 2026-02-21T08:21:16.5300836Z %98 = arith.ori %96, %97 : tensor<32xi1> 2026-02-21T08:21:16.5301064Z %99 = arith.select %98, %68, %95 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:21:16.5301304Z %100 = arith.subf %68, %99 : tensor<32xf32> 2026-02-21T08:21:16.5301696Z %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:21:16.5302068Z %102 = arith.mulf %77, %101 : tensor<32xf32> 2026-02-21T08:21:16.5302318Z %103 = tt.expand_dims %99 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:16.5302612Z 
%104 = tt.broadcast %103 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:21:16.5302852Z %105 = arith.subf %92, %104 : tensor<32x8xf32> 2026-02-21T08:21:16.5303278Z %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:21:16.5303656Z %107 = "tt.reduce"(%106) <{axis = 1 : i32}> ({ 2026-02-21T08:21:16.5303850Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:21:16.5304046Z %140 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:21:16.5304240Z tt.reduce.return %140 : f32 2026-02-21T08:21:16.5304439Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:21:16.5304653Z %108 = arith.addf %102, %107 : tensor<32xf32> 2026-02-21T08:21:16.5304858Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:21:16.5305060Z %109 = arith.muli %c8_i32, %c2_i32 : i32 2026-02-21T08:21:16.5305251Z %110 = arith.addi %arg3, %109 : i32 2026-02-21T08:21:16.5305493Z %111 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:21:16.5305746Z %112 = tt.splat %110 : i32 -> tensor<8xi32> 2026-02-21T08:21:16.5305961Z %113 = arith.addi %112, %111 : tensor<8xi32> 2026-02-21T08:21:16.5306222Z %114 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:21:16.5306494Z %115 = arith.muli %114, %cst : tensor<32x1xi32> 2026-02-21T08:21:16.5306760Z %116 = tt.expand_dims %113 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:21:16.5307057Z %117 = tt.broadcast %115 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:21:16.5307333Z %118 = tt.broadcast %116 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:21:16.5307574Z %119 = arith.addi %117, %118 : tensor<32x8xi32> 2026-02-21T08:21:16.5307823Z %120 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:21:16.5308117Z %121 = tt.addptr %120, %119 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:21:16.5308426Z %122 = tt.load %121 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 
2026-02-21T08:21:16.5308733Z %123 = arith.extf %122 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:21:16.5308968Z %124 = "tt.reduce"(%123) <{axis = 1 : i32}> ({ 2026-02-21T08:21:16.5309170Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:21:16.5309357Z %140 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:21:16.5309559Z tt.reduce.return %140 : f32 2026-02-21T08:21:16.5309754Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:21:16.5309982Z %125 = arith.truncf %124 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:21:16.5310243Z %126 = arith.extf %125 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:21:16.5310480Z %127 = arith.cmpf ogt, %99, %126 : tensor<32xf32> 2026-02-21T08:21:16.5310710Z %128 = arith.cmpf une, %99, %99 : tensor<32xf32> 2026-02-21T08:21:16.5310917Z %129 = arith.ori %127, %128 : tensor<32xi1> 2026-02-21T08:21:16.5311165Z %130 = arith.select %129, %99, %126 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:21:16.5311466Z %131 = arith.subf %99, %130 : tensor<32xf32> 2026-02-21T08:21:16.5311858Z %132 = tt.extern_elementwise %131 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:21:16.5312225Z %133 = arith.mulf %108, %132 : tensor<32xf32> 2026-02-21T08:21:16.5312480Z %134 = tt.expand_dims %130 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:16.5312779Z %135 = tt.broadcast %134 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:21:16.5313015Z %136 = arith.subf %123, %135 : tensor<32x8xf32> 2026-02-21T08:21:16.5313378Z %137 = tt.extern_elementwise %136 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:21:16.5313747Z %138 = "tt.reduce"(%137) <{axis = 1 : i32}> ({ 2026-02-21T08:21:16.5313935Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:21:16.5314177Z %140 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:21:16.5314366Z tt.reduce.return %140 : f32 2026-02-21T08:21:16.5314554Z }) : (tensor<32x8xf32>) 
-> tensor<32xf32> 2026-02-21T08:21:16.5314747Z %139 = arith.addf %133, %138 : tensor<32xf32> 2026-02-21T08:21:16.5314969Z scf.yield %130, %139 : tensor<32xf32>, tensor<32xf32> 2026-02-21T08:21:16.5315217Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:21:16.5315475Z %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:21:16.5315728Z %11 = tt.splat %c3576_i32 : i32 -> tensor<8xi32> 2026-02-21T08:21:16.5315927Z %12 = arith.addi %11, %10 : tensor<8xi32> 2026-02-21T08:21:16.5316175Z %13 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:21:16.5316429Z %14 = arith.muli %13, %cst : tensor<32x1xi32> 2026-02-21T08:21:16.5316681Z %15 = tt.expand_dims %12 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:21:16.5316972Z %16 = tt.broadcast %14 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:21:16.5317228Z %17 = tt.broadcast %15 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:21:16.5317464Z %18 = arith.addi %16, %17 : tensor<32x8xi32> 2026-02-21T08:21:16.5317695Z %19 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:21:16.5317968Z %20 = tt.addptr %19, %18 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:21:16.5318260Z %21 = tt.load %20 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 2026-02-21T08:21:16.5318533Z %22 = arith.extf %21 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:21:16.5318757Z %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({ 2026-02-21T08:21:16.5318946Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:21:16.5319134Z %49 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:21:16.5319318Z tt.reduce.return %49 : f32 2026-02-21T08:21:16.5319509Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:21:16.5319725Z %24 = arith.truncf %23 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:21:16.5319967Z %25 = arith.extf %24 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:21:16.5320192Z %26 = arith.cmpf ogt, %9#0, %25 : tensor<32xf32> 
2026-02-21T08:21:16.5320404Z %27 = arith.cmpf une, %9#0, %9#0 : tensor<32xf32> 2026-02-21T08:21:16.5320611Z %28 = arith.ori %26, %27 : tensor<32xi1> 2026-02-21T08:21:16.5320835Z %29 = arith.select %28, %9#0, %25 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:21:16.5321070Z %30 = arith.subf %9#0, %29 : tensor<32xf32> 2026-02-21T08:21:16.5321415Z %31 = tt.extern_elementwise %30 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:21:16.5321812Z %32 = arith.mulf %9#1, %31 : tensor<32xf32> 2026-02-21T08:21:16.5322083Z %33 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:16.5322376Z %34 = tt.broadcast %33 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:21:16.5322717Z %35 = arith.subf %22, %34 : tensor<32x8xf32> 2026-02-21T08:21:16.5323081Z %36 = tt.extern_elementwise %35 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:21:16.5323460Z %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({ 2026-02-21T08:21:16.5323669Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:21:16.5323855Z %49 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:21:16.5324077Z tt.reduce.return %49 : f32 2026-02-21T08:21:16.5324286Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:21:16.5324508Z %38 = arith.addf %32, %37 : tensor<32xf32> 2026-02-21T08:21:16.5324718Z %c3576_i32_2 = arith.constant 3576 : i32 2026-02-21T08:21:16.5324925Z %c24_i32_3 = arith.constant 24 : i32 2026-02-21T08:21:16.5325164Z scf.for %arg3 = %c0_i32 to %c3576_i32_2 step %c24_i32_3 : i32 { 2026-02-21T08:21:16.5325566Z %49 = tt.descriptor_load %0[%5, %arg3] : !tt.tensordesc> -> tensor<32x8xf16> 2026-02-21T08:21:16.5325921Z %50 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:16.5326217Z %51 = arith.extf %49 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:21:16.5326481Z %52 = tt.broadcast %50 : tensor<32x1xf32> -> 
tensor<32x8xf32> 2026-02-21T08:21:16.5326715Z %53 = arith.subf %51, %52 : tensor<32x8xf32> 2026-02-21T08:21:16.5327094Z %54 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:21:16.5327513Z %55 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:16.5327813Z %56 = tt.broadcast %55 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:21:16.5328046Z %57 = arith.divf %54, %56 : tensor<32x8xf32> 2026-02-21T08:21:16.5328287Z %58 = arith.truncf %57 : tensor<32x8xf32> to tensor<32x8xf16> 2026-02-21T08:21:16.5328617Z tt.descriptor_store %1[%5, %arg3], %58 : !tt.tensordesc>, tensor<32x8xf16> 2026-02-21T08:21:16.5328907Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T08:21:16.5329113Z %59 = arith.muli %c8_i32, %c1_i32_4 : i32 2026-02-21T08:21:16.5329311Z %60 = arith.addi %arg3, %59 : i32 2026-02-21T08:21:16.5329593Z %61 = tt.descriptor_load %0[%5, %60] : !tt.tensordesc> -> tensor<32x8xf16> 2026-02-21T08:21:16.5329947Z %62 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:16.5330219Z %63 = arith.extf %61 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:21:16.5330472Z %64 = tt.broadcast %62 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:21:16.5330693Z %65 = arith.subf %63, %64 : tensor<32x8xf32> 2026-02-21T08:21:16.5331050Z %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:21:16.5331449Z %67 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:16.5331769Z %68 = tt.broadcast %67 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:21:16.5331996Z %69 = arith.divf %66, %68 : tensor<32x8xf32> 2026-02-21T08:21:16.5332216Z %70 = arith.truncf %69 : tensor<32x8xf32> to tensor<32x8xf16> 2026-02-21T08:21:16.5332520Z tt.descriptor_store %1[%5, %60], %70 : 
!tt.tensordesc>, tensor<32x8xf16> 2026-02-21T08:21:16.5332791Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:21:16.5332982Z %71 = arith.muli %c8_i32, %c2_i32 : i32 2026-02-21T08:21:16.5333167Z %72 = arith.addi %arg3, %71 : i32 2026-02-21T08:21:16.5333429Z %73 = tt.descriptor_load %0[%5, %72] : !tt.tensordesc> -> tensor<32x8xf16> 2026-02-21T08:21:16.5333763Z %74 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:16.5334090Z %75 = arith.extf %73 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:21:16.5334338Z %76 = tt.broadcast %74 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:21:16.5334561Z %77 = arith.subf %75, %76 : tensor<32x8xf32> 2026-02-21T08:21:16.5334917Z %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:21:16.5335317Z %79 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:16.5335590Z %80 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:21:16.5335819Z %81 = arith.divf %78, %80 : tensor<32x8xf32> 2026-02-21T08:21:16.5336041Z %82 = arith.truncf %81 : tensor<32x8xf32> to tensor<32x8xf16> 2026-02-21T08:21:16.5336339Z tt.descriptor_store %1[%5, %72], %82 : !tt.tensordesc>, tensor<32x8xf16> 2026-02-21T08:21:16.5336702Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:21:16.5337034Z %39 = tt.descriptor_load %0[%5, %c3576_i32_2] : !tt.tensordesc> -> tensor<32x8xf16> 2026-02-21T08:21:16.5337389Z %40 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:16.5337670Z %41 = arith.extf %39 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:21:16.5337932Z %42 = tt.broadcast %40 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:21:16.5338160Z %43 = arith.subf %41, %42 : tensor<32x8xf32> 2026-02-21T08:21:16.5338516Z %44 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = 
"__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:21:16.5338926Z %45 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:16.5339207Z %46 = tt.broadcast %45 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:21:16.5339448Z %47 = arith.divf %44, %46 : tensor<32x8xf32> 2026-02-21T08:21:16.5339672Z %48 = arith.truncf %47 : tensor<32x8xf32> to tensor<32x8xf16> 2026-02-21T08:21:16.5339998Z tt.descriptor_store %1[%5, %c3576_i32_2], %48 : !tt.tensordesc>, tensor<32x8xf16> 2026-02-21T08:21:16.5340302Z } {tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:21:16.5340504Z tt.return 2026-02-21T08:21:16.5340638Z } 2026-02-21T08:21:16.5340758Z } 2026-02-21T08:21:16.5340827Z 2026-02-21T08:21:16.5340884Z {-# 2026-02-21T08:21:16.5341012Z external_resources: { 2026-02-21T08:21:16.5341177Z mlir_reproducer: { 2026-02-21T08:21:16.5345600Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 
max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:21:16.5350104Z disable_threading: false, 2026-02-21T08:21:16.5350270Z verify_each: true 2026-02-21T08:21:16.5350422Z } 2026-02-21T08:21:16.5350536Z } 2026-02-21T08:21:16.5350652Z #-} 2026-02-21T08:21:16.5351081Z /tmp/torchinductor_root/67/c6752cyk4ba67x52vcddptv4uiempdrdll4dxehbdy4w7nhlfag6.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:21:16.5352351Z /tmp/torchinductor_root/67/c6752cyk4ba67x52vcddptv4uiempdrdll4dxehbdy4w7nhlfag6.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:21:16.5353338Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:21:16.5354435Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 8], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[2, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:21:16.5355425Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:21:16.5355683Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:21:18.2043372Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:21:18.2048600Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:21:18.2050368Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:21:18.2050659Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:21:18.2057245Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:21:18.2059189Z %cst = arith.constant dense<3584> : tensor<32x1xi32> 2026-02-21T08:21:18.2059480Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<32xf32> 2026-02-21T08:21:18.2059733Z %cst_1 = arith.constant dense<0xFF800000> : tensor<32xf32> 2026-02-21T08:21:18.2059963Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:21:18.2060158Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:21:18.2060345Z %c3584_i32 = arith.constant 3584 : i32 2026-02-21T08:21:18.2060529Z %c3584_i64 = arith.constant 3584 : i64 2026-02-21T08:21:18.2060708Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:21:18.2061045Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c3584_i32], [%c3584_i64, %c1_i64] : , > 2026-02-21T08:21:18.2061373Z %1 = tt.get_program_id x : i32 2026-02-21T08:21:18.2061867Z scf.for %arg2 = %1 to %c128_i32 step %c9472_i32 : i32 { 2026-02-21T08:21:18.2062093Z %2 = 
arith.muli %arg2, %c32_i32 : i32 2026-02-21T08:21:18.2062337Z %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:21:18.2062598Z %4 = tt.splat %2 : i32 -> tensor<32xi32> 2026-02-21T08:21:18.2062791Z %5 = arith.addi %4, %3 : tensor<32xi32> 2026-02-21T08:21:18.2062991Z %c3552_i32 = arith.constant 3552 : i32 2026-02-21T08:21:18.2063174Z %c96_i32 = arith.constant 96 : i32 2026-02-21T08:21:18.2063547Z %6:2 = scf.for %arg3 = %c0_i32 to %c3552_i32 step %c96_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<32xf32>, tensor<32xf32>) : i32 { 2026-02-21T08:21:18.2064012Z %47 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:21:18.2064674Z %48 = arith.extf %47 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:21:18.2064917Z %49 = "tt.reduce"(%48) <{axis = 1 : i32}> ({ 2026-02-21T08:21:18.2065112Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:21:18.2065310Z %105 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:21:18.2065501Z tt.reduce.return %105 : f32 2026-02-21T08:21:18.2065691Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:21:18.2065913Z %50 = arith.truncf %49 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:21:18.2066165Z %51 = arith.extf %50 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:21:18.2066404Z %52 = arith.cmpf ogt, %arg4, %51 : tensor<32xf32> 2026-02-21T08:21:18.2066652Z %53 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32> 2026-02-21T08:21:18.2066884Z %54 = arith.ori %52, %53 : tensor<32xi1> 2026-02-21T08:21:18.2067225Z %55 = arith.select %54, %arg4, %51 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:21:18.2067493Z %56 = arith.subf %arg4, %55 : tensor<32xf32> 2026-02-21T08:21:18.2067870Z %57 = tt.extern_elementwise %56 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:21:18.2068309Z %58 = arith.mulf %arg5, %57 : tensor<32xf32> 2026-02-21T08:21:18.2068578Z %59 = tt.expand_dims %55 {axis = 1 : i32} : 
tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:18.2068891Z %60 = tt.broadcast %59 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:21:18.2069144Z %61 = arith.subf %48, %60 : tensor<32x32xf32> 2026-02-21T08:21:18.2069527Z %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:21:18.2069912Z %63 = "tt.reduce"(%62) <{axis = 1 : i32}> ({ 2026-02-21T08:21:18.2070109Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:21:18.2070305Z %105 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:21:18.2070502Z tt.reduce.return %105 : f32 2026-02-21T08:21:18.2070707Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:21:18.2070913Z %64 = arith.addf %58, %63 : tensor<32xf32> 2026-02-21T08:21:18.2071122Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:21:18.2071318Z %65 = arith.muli %c32_i32, %c1_i32 : i32 2026-02-21T08:21:18.2071516Z %66 = arith.addi %arg3, %65 : i32 2026-02-21T08:21:18.2071851Z %67 = tt.descriptor_load %0[%2, %66] : !tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:21:18.2072172Z %68 = arith.extf %67 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:21:18.2072417Z %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({ 2026-02-21T08:21:18.2072610Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:21:18.2072807Z %105 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:21:18.2073004Z tt.reduce.return %105 : f32 2026-02-21T08:21:18.2073209Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:21:18.2073445Z %70 = arith.truncf %69 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:21:18.2073695Z %71 = arith.extf %70 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:21:18.2073940Z %72 = arith.cmpf ogt, %55, %71 : tensor<32xf32> 2026-02-21T08:21:18.2074162Z %73 = arith.cmpf une, %55, %55 : tensor<32xf32> 2026-02-21T08:21:18.2074378Z %74 = arith.ori %72, %73 : tensor<32xi1> 2026-02-21T08:21:18.2074616Z %75 = arith.select %74, %55, %71 : tensor<32xi1>, tensor<32xf32> 
2026-02-21T08:21:18.2074863Z %76 = arith.subf %55, %75 : tensor<32xf32> 2026-02-21T08:21:18.2075237Z %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:21:18.2075583Z %78 = arith.mulf %64, %77 : tensor<32xf32> 2026-02-21T08:21:18.2075833Z %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:18.2076198Z %80 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:21:18.2076431Z %81 = arith.subf %68, %80 : tensor<32x32xf32> 2026-02-21T08:21:18.2076783Z %82 = tt.extern_elementwise %81 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:21:18.2077138Z %83 = "tt.reduce"(%82) <{axis = 1 : i32}> ({ 2026-02-21T08:21:18.2077330Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:21:18.2077505Z %105 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:21:18.2077691Z tt.reduce.return %105 : f32 2026-02-21T08:21:18.2077870Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:21:18.2078067Z %84 = arith.addf %78, %83 : tensor<32xf32> 2026-02-21T08:21:18.2078254Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:21:18.2078446Z %85 = arith.muli %c32_i32, %c2_i32 : i32 2026-02-21T08:21:18.2078698Z %86 = arith.addi %arg3, %85 : i32 2026-02-21T08:21:18.2078967Z %87 = tt.descriptor_load %0[%2, %86] : !tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:21:18.2079280Z %88 = arith.extf %87 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:21:18.2079505Z %89 = "tt.reduce"(%88) <{axis = 1 : i32}> ({ 2026-02-21T08:21:18.2079695Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:21:18.2079875Z %105 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:21:18.2080070Z tt.reduce.return %105 : f32 2026-02-21T08:21:18.2080255Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:21:18.2080473Z %90 = arith.truncf %89 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:21:18.2080721Z %91 = 
arith.extf %90 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:21:18.2080944Z %92 = arith.cmpf ogt, %75, %91 : tensor<32xf32> 2026-02-21T08:21:18.2081167Z %93 = arith.cmpf une, %75, %75 : tensor<32xf32> 2026-02-21T08:21:18.2081372Z %94 = arith.ori %92, %93 : tensor<32xi1> 2026-02-21T08:21:18.2081655Z %95 = arith.select %94, %75, %91 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:21:18.2081889Z %96 = arith.subf %75, %95 : tensor<32xf32> 2026-02-21T08:21:18.2082236Z %97 = tt.extern_elementwise %96 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:21:18.2082595Z %98 = arith.mulf %84, %97 : tensor<32xf32> 2026-02-21T08:21:18.2082844Z %99 = tt.expand_dims %95 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:18.2083145Z %100 = tt.broadcast %99 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:21:18.2083385Z %101 = arith.subf %88, %100 : tensor<32x32xf32> 2026-02-21T08:21:18.2083765Z %102 = tt.extern_elementwise %101 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:21:18.2084149Z %103 = "tt.reduce"(%102) <{axis = 1 : i32}> ({ 2026-02-21T08:21:18.2084344Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:21:18.2084530Z %105 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:21:18.2084715Z tt.reduce.return %105 : f32 2026-02-21T08:21:18.2084905Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:21:18.2085104Z %104 = arith.addf %98, %103 : tensor<32xf32> 2026-02-21T08:21:18.2085350Z scf.yield %95, %104 : tensor<32xf32>, tensor<32xf32> 2026-02-21T08:21:18.2085568Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:21:18.2085866Z %7 = tt.descriptor_load %0[%2, %c3552_i32] : !tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:21:18.2086181Z %8 = arith.extf %7 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:21:18.2086408Z %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({ 2026-02-21T08:21:18.2086598Z ^bb0(%arg3: f32, 
%arg4: f32): 2026-02-21T08:21:18.2086781Z %47 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:21:18.2086976Z tt.reduce.return %47 : f32 2026-02-21T08:21:18.2087217Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:21:18.2087442Z %10 = arith.truncf %9 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:21:18.2087678Z %11 = arith.extf %10 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:21:18.2087908Z %12 = arith.cmpf ogt, %6#0, %11 : tensor<32xf32> 2026-02-21T08:21:18.2088124Z %13 = arith.cmpf une, %6#0, %6#0 : tensor<32xf32> 2026-02-21T08:21:18.2088325Z %14 = arith.ori %12, %13 : tensor<32xi1> 2026-02-21T08:21:18.2088555Z %15 = arith.select %14, %6#0, %11 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:21:18.2088786Z %16 = arith.subf %6#0, %15 : tensor<32xf32> 2026-02-21T08:21:18.2089137Z %17 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:21:18.2089488Z %18 = arith.mulf %6#1, %17 : tensor<32xf32> 2026-02-21T08:21:18.2089801Z %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:18.2090099Z %20 = tt.broadcast %19 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:21:18.2090321Z %21 = arith.subf %8, %20 : tensor<32x32xf32> 2026-02-21T08:21:18.2090677Z %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:21:18.2091028Z %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({ 2026-02-21T08:21:18.2091222Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:21:18.2091396Z %47 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:21:18.2091613Z tt.reduce.return %47 : f32 2026-02-21T08:21:18.2091800Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:21:18.2091989Z %24 = arith.addf %18, %23 : tensor<32xf32> 2026-02-21T08:21:18.2092184Z %c3552_i32_2 = arith.constant 3552 : i32 2026-02-21T08:21:18.2092367Z %c96_i32_3 = arith.constant 96 : i32 
2026-02-21T08:21:18.2092600Z scf.for %arg3 = %c0_i32 to %c3552_i32_2 step %c96_i32_3 : i32 { 2026-02-21T08:21:18.2092839Z %47 = tt.splat %arg3 : i32 -> tensor<32xi32> 2026-02-21T08:21:18.2093043Z %48 = arith.addi %47, %3 : tensor<32xi32> 2026-02-21T08:21:18.2093295Z %49 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:21:18.2093555Z %50 = arith.muli %49, %cst : tensor<32x1xi32> 2026-02-21T08:21:18.2093812Z %51 = tt.expand_dims %48 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:21:18.2094092Z %52 = tt.broadcast %50 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:21:18.2094351Z %53 = tt.broadcast %51 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:21:18.2094581Z %54 = arith.addi %52, %53 : tensor<32x32xi32> 2026-02-21T08:21:18.2094828Z %55 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:21:18.2095110Z %56 = tt.addptr %55, %54 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:21:18.2095410Z %57 = tt.load %56 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T08:21:18.2095720Z %58 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:18.2095995Z %59 = arith.extf %57 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:21:18.2096252Z %60 = tt.broadcast %58 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:21:18.2096476Z %61 = arith.subf %59, %60 : tensor<32x32xf32> 2026-02-21T08:21:18.2096843Z %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:21:18.2097256Z %63 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:18.2097536Z %64 = tt.broadcast %63 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:21:18.2097771Z %65 = arith.divf %62, %64 : tensor<32x32xf32> 2026-02-21T08:21:18.2098002Z %66 = arith.truncf %65 : tensor<32x32xf32> to tensor<32x32xf16> 2026-02-21T08:21:18.2098332Z 
%67 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:21:18.2098605Z %68 = tt.addptr %67, %54 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:21:18.2098853Z tt.store %68, %66 : tensor<32x32x!tt.ptr> 2026-02-21T08:21:18.2099061Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:21:18.2099245Z %69 = arith.muli %c32_i32, %c1_i32 : i32 2026-02-21T08:21:18.2099438Z %70 = arith.addi %arg3, %69 : i32 2026-02-21T08:21:18.2099625Z %71 = tt.splat %70 : i32 -> tensor<32xi32> 2026-02-21T08:21:18.2099827Z %72 = arith.addi %71, %3 : tensor<32xi32> 2026-02-21T08:21:18.2100074Z %73 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:21:18.2100329Z %74 = arith.muli %73, %cst : tensor<32x1xi32> 2026-02-21T08:21:18.2100657Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:21:18.2100939Z %76 = tt.broadcast %74 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:21:18.2101198Z %77 = tt.broadcast %75 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:21:18.2101424Z %78 = arith.addi %76, %77 : tensor<32x32xi32> 2026-02-21T08:21:18.2101687Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:21:18.2101965Z %80 = tt.addptr %79, %78 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:21:18.2102259Z %81 = tt.load %80 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T08:21:18.2102572Z %82 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:18.2102853Z %83 = arith.extf %81 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:21:18.2103114Z %84 = tt.broadcast %82 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:21:18.2103342Z %85 = arith.subf %83, %84 : tensor<32x32xf32> 2026-02-21T08:21:18.2103714Z %86 = tt.extern_elementwise %85 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:21:18.2104128Z %87 = tt.expand_dims %24 {axis = 1 : i32} : 
tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:18.2104405Z %88 = tt.broadcast %87 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:21:18.2104636Z %89 = arith.divf %86, %88 : tensor<32x32xf32> 2026-02-21T08:21:18.2104864Z %90 = arith.truncf %89 : tensor<32x32xf32> to tensor<32x32xf16> 2026-02-21T08:21:18.2105131Z %91 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:21:18.2105407Z %92 = tt.addptr %91, %78 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:21:18.2105652Z tt.store %92, %90 : tensor<32x32x!tt.ptr> 2026-02-21T08:21:18.2105858Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:21:18.2106045Z %93 = arith.muli %c32_i32, %c2_i32 : i32 2026-02-21T08:21:18.2106238Z %94 = arith.addi %arg3, %93 : i32 2026-02-21T08:21:18.2106424Z %95 = tt.splat %94 : i32 -> tensor<32xi32> 2026-02-21T08:21:18.2106624Z %96 = arith.addi %95, %3 : tensor<32xi32> 2026-02-21T08:21:18.2106872Z %97 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:21:18.2107124Z %98 = arith.muli %97, %cst : tensor<32x1xi32> 2026-02-21T08:21:18.2107373Z %99 = tt.expand_dims %96 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:21:18.2107656Z %100 = tt.broadcast %98 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:21:18.2107923Z %101 = tt.broadcast %99 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:21:18.2108160Z %102 = arith.addi %100, %101 : tensor<32x32xi32> 2026-02-21T08:21:18.2108405Z %103 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:21:18.2108698Z %104 = tt.addptr %103, %102 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:21:18.2109073Z %105 = tt.load %104 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T08:21:18.2109390Z %106 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:18.2109688Z %107 = arith.extf %105 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:21:18.2109974Z %108 = tt.broadcast %106 : 
tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:21:18.2110235Z %109 = arith.subf %107, %108 : tensor<32x32xf32> 2026-02-21T08:21:18.2110627Z %110 = tt.extern_elementwise %109 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:21:18.2111086Z %111 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:18.2111396Z %112 = tt.broadcast %111 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:21:18.2111770Z %113 = arith.divf %110, %112 : tensor<32x32xf32> 2026-02-21T08:21:18.2112023Z %114 = arith.truncf %113 : tensor<32x32xf32> to tensor<32x32xf16> 2026-02-21T08:21:18.2112311Z %115 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:21:18.2112613Z %116 = tt.addptr %115, %102 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:21:18.2112884Z tt.store %116, %114 : tensor<32x32x!tt.ptr> 2026-02-21T08:21:18.2113110Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:21:18.2113326Z %25 = tt.splat %c3552_i32_2 : i32 -> tensor<32xi32> 2026-02-21T08:21:18.2113550Z %26 = arith.addi %25, %3 : tensor<32xi32> 2026-02-21T08:21:18.2113800Z %27 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:21:18.2114074Z %28 = arith.muli %27, %cst : tensor<32x1xi32> 2026-02-21T08:21:18.2114340Z %29 = tt.expand_dims %26 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:21:18.2114639Z %30 = tt.broadcast %28 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:21:18.2114921Z %31 = tt.broadcast %29 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:21:18.2115162Z %32 = arith.addi %30, %31 : tensor<32x32xi32> 2026-02-21T08:21:18.2115408Z %33 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:21:18.2115686Z %34 = tt.addptr %33, %32 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:21:18.2116000Z %35 = tt.load %34 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 
2026-02-21T08:21:18.2116323Z %36 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:18.2116613Z %37 = arith.extf %35 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:21:18.2116884Z %38 = tt.broadcast %36 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:21:18.2117117Z %39 = arith.subf %37, %38 : tensor<32x32xf32> 2026-02-21T08:21:18.2117500Z %40 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:21:18.2117946Z %41 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:21:18.2118219Z %42 = tt.broadcast %41 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:21:18.2118447Z %43 = arith.divf %40, %42 : tensor<32x32xf32> 2026-02-21T08:21:18.2118672Z %44 = arith.truncf %43 : tensor<32x32xf32> to tensor<32x32xf16> 2026-02-21T08:21:18.2118936Z %45 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:21:18.2119196Z %46 = tt.addptr %45, %32 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:21:18.2119447Z tt.store %46, %44 : tensor<32x32x!tt.ptr> 2026-02-21T08:21:18.2119719Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:21:18.2119964Z tt.return 2026-02-21T08:21:18.2120098Z } 2026-02-21T08:21:18.2120215Z } 2026-02-21T08:21:18.2120291Z 2026-02-21T08:21:18.2120398Z {-# 2026-02-21T08:21:18.2120526Z external_resources: { 2026-02-21T08:21:18.2120687Z mlir_reproducer: { 2026-02-21T08:21:18.2125105Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, 
triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:21:18.2129591Z disable_threading: false, 2026-02-21T08:21:18.2129774Z verify_each: true 2026-02-21T08:21:18.2129920Z } 2026-02-21T08:21:18.2130045Z } 2026-02-21T08:21:18.2130160Z #-} 2026-02-21T08:21:18.2130608Z /tmp/torchinductor_root/3k/c3kroivlf44f54gqsibocybymp3h5kiwuog4cw4ukazisoehdqy2.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:21:18.2131920Z /tmp/torchinductor_root/3k/c3kroivlf44f54gqsibocybymp3h5kiwuog4cw4ukazisoehdqy2.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` 
on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:21:18.2132924Z [39s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:21:18.2134037Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=128, num_sm_multiplier=64, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[3, 3], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:21:18.2135060Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:21:18.2135319Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:21:20.2480762Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.3 configs/s 2026-02-21T08:21:20.2492848Z [41s] Adaptive compile timeout: 30s (90% percentile=4.6s, bounds=[30.0s, 30s]) 2026-02-21T08:21:20.8085590Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1777.5 configs/s 2026-02-21T08:21:20.8652777Z [41s] Initial random population of 100, 5 starting points: 2026-02-21T08:21:20.8654571Z error=7 2026-02-21T08:21:20.8654782Z timeout=2 2026-02-21T08:21:20.8659879Z ok=91 2026-02-21T08:21:20.8664428Z min=0.0246 2026-02-21T08:21:20.8666426Z mid=0.3922 2026-02-21T08:21:20.8667113Z max=23.9688 2026-02-21T08:21:20.8667255Z best={'block_sizes': [1, 4096], 2026-02-21T08:21:20.8667490Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:21:20.8667733Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:21:20.8667922Z 'num_sm_multiplier': 16, 2026-02-21T08:21:20.8668082Z 'num_stages': 5, 2026-02-21T08:21:20.8668217Z 'num_warps': 16, 2026-02-21T08:21:20.8668371Z 'pid_type': 'persistent_blocked', 2026-02-21T08:21:20.8668553Z 
'range_flattens': [None, False], 2026-02-21T08:21:20.8668735Z 'range_multi_buffers': [None, True], 2026-02-21T08:21:20.8668914Z 'range_num_stages': [3, 4], 2026-02-21T08:21:20.8669085Z 'range_unroll_factors': [1, 0], 2026-02-21T08:21:20.8669267Z 'range_warp_specializes': [None, False]} 2026-02-21T08:21:20.8669484Z [41s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:21:22.0248003Z [43s] Generation 1 starting: 84 neighbors, 5 active search path(s) 2026-02-21T08:21:28.2440866Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 9.4 configs/s 2026-02-21T08:21:33.5669319Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 16.9 configs/s 2026-02-21T08:21:36.6359505Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 331.3 2026-02-21T08:21:36.6364442Z configs/s 2026-02-21T08:21:36.8619628Z [57s] Generation 1 complete: 2026-02-21T08:21:36.8621401Z error=1 2026-02-21T08:21:36.8621838Z ok=89 2026-02-21T08:21:36.8621989Z min=0.0184 2026-02-21T08:21:36.8622118Z mid=0.0267 2026-02-21T08:21:36.8622272Z max=0.2519 2026-02-21T08:21:36.8622413Z best={'block_sizes': [1, 4096], 2026-02-21T08:21:36.8622653Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:21:36.8622895Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:21:36.8623093Z 'num_stages': 5, 2026-02-21T08:21:36.8623235Z 'num_warps': 4, 2026-02-21T08:21:36.8623384Z 'pid_type': 'flat', 2026-02-21T08:21:36.8623581Z 'range_flattens': [None, False], 2026-02-21T08:21:36.8623783Z 'range_multi_buffers': [None, True], 2026-02-21T08:21:36.8623976Z 'range_num_stages': [0, 4], 2026-02-21T08:21:36.8624144Z 'range_unroll_factors': [0, 0], 2026-02-21T08:21:36.8624334Z 'range_warp_specializes': [None, False]} 2026-02-21T08:21:36.8636088Z [57s] Fitting surrogate: 190 points, 190 targets 2026-02-21T08:21:37.7506172Z [58s] Generation 2 starting: 71 neighbors, 5 active search path(s) 2026-02-21T08:21:46.3460833Z Generation 2: precompiling 100% 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 3.4 configs/s 2026-02-21T08:21:50.8722069Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.5 configs/s 2026-02-21T08:21:54.8270271Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 257.5 2026-02-21T08:21:54.8271855Z configs/s 2026-02-21T08:21:55.1403182Z [76s] Generation 2 complete: 2026-02-21T08:21:55.1407497Z ok=77 2026-02-21T08:21:55.1410937Z min=0.0184 2026-02-21T08:21:55.1411187Z mid=0.0246 2026-02-21T08:21:55.1411337Z max=0.6145 2026-02-21T08:21:55.1411507Z best={'block_sizes': [1, 4096], 2026-02-21T08:21:55.1411853Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:21:55.1412141Z 'load_eviction_policies': ['', ''], 2026-02-21T08:21:55.1412338Z 'num_stages': 7, 2026-02-21T08:21:55.1412490Z 'num_warps': 2, 2026-02-21T08:21:55.1412630Z 'pid_type': 'flat', 2026-02-21T08:21:55.1412796Z 'range_flattens': [None, False], 2026-02-21T08:21:55.1412975Z 'range_multi_buffers': [None, True], 2026-02-21T08:21:55.1413168Z 'range_num_stages': [0, 4], 2026-02-21T08:21:55.1413333Z 'range_unroll_factors': [0, 0], 2026-02-21T08:21:55.1413523Z 'range_warp_specializes': [None, True]} 2026-02-21T08:21:55.1417892Z [76s] Fitting surrogate: 267 points, 267 targets 2026-02-21T08:21:56.0279861Z [77s] Generation 3 starting: 63 neighbors, 5 active search path(s) 2026-02-21T08:22:01.8609225Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 4.8 configs/s 2026-02-21T08:22:05.7724958Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 16.8 configs/s 2026-02-21T08:22:09.8410593Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 278.6 2026-02-21T08:22:09.8411907Z configs/s 2026-02-21T08:22:10.1412901Z [91s] Generation 3 complete: 2026-02-21T08:22:10.1414818Z error=1 2026-02-21T08:22:10.1414998Z ok=67 2026-02-21T08:22:10.1415161Z min=0.0184 2026-02-21T08:22:10.1415324Z mid=0.0225 2026-02-21T08:22:10.1415549Z max=0.2908 2026-02-21T08:22:10.1415730Z 
best={'block_sizes': [1, 4096], 2026-02-21T08:22:10.1415980Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:22:10.1419439Z 'load_eviction_policies': ['', ''], 2026-02-21T08:22:10.1423967Z 'num_stages': 7, 2026-02-21T08:22:10.1425493Z 'num_warps': 2, 2026-02-21T08:22:10.1425696Z 'pid_type': 'flat', 2026-02-21T08:22:10.1426269Z 'range_flattens': [None, False], 2026-02-21T08:22:10.1426501Z 'range_multi_buffers': [None, False], 2026-02-21T08:22:10.1426693Z 'range_num_stages': [0, 4], 2026-02-21T08:22:10.1426875Z 'range_unroll_factors': [0, 0], 2026-02-21T08:22:10.1427059Z 'range_warp_specializes': [None, True]} 2026-02-21T08:22:10.1427365Z [91s] Fitting surrogate: 335 points, 335 targets 2026-02-21T08:22:10.8721365Z [91s] Generation 4 starting: 46 neighbors, 4 active search path(s) 2026-02-21T08:22:13.7518632Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48/48 19.1 configs/s 2026-02-21T08:22:16.6795118Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 48/48 16.6 configs/s 2026-02-21T08:22:19.5456580Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 355.4 2026-02-21T08:22:19.5458064Z configs/s 2026-02-21T08:22:19.7767332Z [100s] Generation 4 complete: 2026-02-21T08:22:19.7771882Z ok=50 2026-02-21T08:22:19.7775319Z min=0.0184 2026-02-21T08:22:19.7776877Z mid=0.0184 2026-02-21T08:22:19.7777055Z max=0.0307 2026-02-21T08:22:19.7777203Z best={'block_sizes': [1, 4096], 2026-02-21T08:22:19.7777457Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:22:19.7777698Z 'load_eviction_policies': ['', ''], 2026-02-21T08:22:19.7777896Z 'num_stages': 7, 2026-02-21T08:22:19.7778128Z 'num_warps': 2, 2026-02-21T08:22:19.7782044Z 'pid_type': 'flat', 2026-02-21T08:22:19.7786036Z 'range_flattens': [None, False], 2026-02-21T08:22:19.7788078Z 'range_multi_buffers': [None, False], 2026-02-21T08:22:19.7788305Z 'range_num_stages': [0, 4], 2026-02-21T08:22:19.7788548Z 'range_unroll_factors': [0, 0], 
2026-02-21T08:22:19.7788739Z 'range_warp_specializes': [None, True]} 2026-02-21T08:22:19.7793640Z [100s] Fitting surrogate: 385 points, 385 targets 2026-02-21T08:22:20.4991306Z [101s] Generation 5 starting: 43 neighbors, 4 active search path(s) 2026-02-21T08:22:23.1029725Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44/44 31.4 configs/s 2026-02-21T08:22:25.7390805Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 44/44 17.0 configs/s 2026-02-21T08:22:28.4076205Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 381.9 2026-02-21T08:22:28.4080078Z configs/s 2026-02-21T08:22:28.6339891Z [109s] Generation 5 complete: 2026-02-21T08:22:28.6343700Z ok=47 2026-02-21T08:22:28.6345339Z min=0.0164 2026-02-21T08:22:28.6345561Z mid=0.0184 2026-02-21T08:22:28.6350239Z max=0.0247 2026-02-21T08:22:28.6352358Z best={'block_sizes': [1, 4096], 2026-02-21T08:22:28.6352652Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:22:28.6352932Z 'load_eviction_policies': ['', ''], 2026-02-21T08:22:28.6353188Z 'num_stages': 6, 2026-02-21T08:22:28.6353339Z 'num_warps': 2, 2026-02-21T08:22:28.6357931Z 'pid_type': 'flat', 2026-02-21T08:22:28.6362538Z 'range_flattens': [None, False], 2026-02-21T08:22:28.6366988Z 'range_multi_buffers': [None, None], 2026-02-21T08:22:28.6368753Z 'range_num_stages': [0, 4], 2026-02-21T08:22:28.6368930Z 'range_unroll_factors': [0, 4], 2026-02-21T08:22:28.6369149Z 'range_warp_specializes': [None, None]} 2026-02-21T08:22:28.6374060Z [109s] Fitting surrogate: 432 points, 432 targets 2026-02-21T08:22:29.0223824Z [110s] Generation 6 starting: 20 neighbors, 2 active search path(s) 2026-02-21T08:22:30.4451997Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 27.5 configs/s 2026-02-21T08:22:31.7356455Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 16.8 configs/s 2026-02-21T08:22:32.8779733Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 886.4 
2026-02-21T08:22:32.8781034Z configs/s 2026-02-21T08:22:32.9877378Z [114s] Generation 6 complete: 2026-02-21T08:22:32.9881516Z ok=22 2026-02-21T08:22:32.9883077Z min=0.0164 2026-02-21T08:22:32.9883247Z mid=0.0184 2026-02-21T08:22:32.9884041Z max=0.0246 2026-02-21T08:22:32.9888673Z best={'block_sizes': [1, 4096], 2026-02-21T08:22:32.9890927Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:22:32.9891221Z 'load_eviction_policies': ['', ''], 2026-02-21T08:22:32.9891416Z 'num_stages': 6, 2026-02-21T08:22:32.9891636Z 'num_warps': 2, 2026-02-21T08:22:32.9891810Z 'pid_type': 'flat', 2026-02-21T08:22:32.9891989Z 'range_flattens': [None, False], 2026-02-21T08:22:32.9892176Z 'range_multi_buffers': [None, None], 2026-02-21T08:22:32.9892379Z 'range_num_stages': [0, 3], 2026-02-21T08:22:32.9892551Z 'range_unroll_factors': [0, 4], 2026-02-21T08:22:32.9892740Z 'range_warp_specializes': [None, None]} 2026-02-21T08:22:32.9897243Z [114s] Fitting surrogate: 454 points, 454 targets 2026-02-21T08:22:33.3251212Z [114s] Generation 7 starting: 19 neighbors, 2 active search path(s) 2026-02-21T08:22:35.0147184Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 13.8 configs/s 2026-02-21T08:22:36.1666289Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 19/19 17.2 configs/s 2026-02-21T08:22:37.3092886Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 886.2 2026-02-21T08:22:37.3094788Z configs/s 2026-02-21T08:22:37.4181117Z [118s] Generation 7 complete: 2026-02-21T08:22:37.4182995Z ok=21 2026-02-21T08:22:37.4183164Z min=0.0164 2026-02-21T08:22:37.4183304Z mid=0.0184 2026-02-21T08:22:37.4183426Z max=0.0287 2026-02-21T08:22:37.4183567Z best={'block_sizes': [1, 4096], 2026-02-21T08:22:37.4183817Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:22:37.4184070Z 'load_eviction_policies': ['', ''], 2026-02-21T08:22:37.4184243Z 'num_stages': 6, 2026-02-21T08:22:37.4184386Z 'num_warps': 2, 
2026-02-21T08:22:37.4184525Z 'pid_type': 'flat', 2026-02-21T08:22:37.4184686Z 'range_flattens': [None, False], 2026-02-21T08:22:37.4184863Z 'range_multi_buffers': [None, None], 2026-02-21T08:22:37.4185519Z 'range_num_stages': [0, 3], 2026-02-21T08:22:37.4185693Z 'range_unroll_factors': [0, 4], 2026-02-21T08:22:37.4185869Z 'range_warp_specializes': [None, None]} 2026-02-21T08:22:37.4199173Z [118s] Fitting surrogate: 475 points, 475 targets 2026-02-21T08:22:37.8119764Z [118s] Generation 8 starting: 17 neighbors, 2 active search path(s) 2026-02-21T08:22:40.2890453Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 2.1 configs/s 2026-02-21T08:22:41.3183220Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 17.3 configs/s 2026-02-21T08:22:42.7674413Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 944.0 2026-02-21T08:22:42.7678823Z configs/s 2026-02-21T08:22:42.8633417Z [123s] Generation 8 complete: 2026-02-21T08:22:42.8635273Z ok=19 2026-02-21T08:22:42.8635505Z min=0.0164 2026-02-21T08:22:42.8635667Z mid=0.0184 2026-02-21T08:22:42.8635835Z max=0.0307 2026-02-21T08:22:42.8636530Z best={'block_sizes': [1, 4096], 2026-02-21T08:22:42.8636794Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:22:42.8637056Z 'load_eviction_policies': ['', ''], 2026-02-21T08:22:42.8637253Z 'num_stages': 6, 2026-02-21T08:22:42.8637418Z 'num_warps': 2, 2026-02-21T08:22:42.8637586Z 'pid_type': 'flat', 2026-02-21T08:22:42.8637781Z 'range_flattens': [None, False], 2026-02-21T08:22:42.8637971Z 'range_multi_buffers': [None, None], 2026-02-21T08:22:42.8638172Z 'range_num_stages': [0, 4], 2026-02-21T08:22:42.8638349Z 'range_unroll_factors': [0, 4], 2026-02-21T08:22:42.8638546Z 'range_warp_specializes': [None, None]} 2026-02-21T08:22:42.8645073Z [123s] Fitting surrogate: 494 points, 494 targets 2026-02-21T08:22:43.0346498Z [124s] Autotuning complete in 124.1s after searching 475 configs. 
2026-02-21T08:22:43.0348188Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:22:43.0349102Z @helion.kernel(config=helion.Config(block_sizes=[1, 4096], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', ''], num_stages=6, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T08:22:43.0349903Z 2026-02-21T08:22:43.0350146Z [124s] Code of selected kernel: /tmp/torchinductor_root/jz/cjz33fcpu2l4q3itrjabq4fkakv5igzr4fmx6v2h6g5idtokg7x5.py 2026-02-21T08:22:43.9590754Z WARNING:tritonbench.utils.triton_op:Completed input ID 26: 2026-02-21T08:22:43.9594823Z (M, N) 2026-02-21T08:22:43.9596325Z ------------ 2026-02-21T08:22:43.9596541Z (4096, 3584) 2026-02-21T08:22:43.9596688Z 2026-02-21T08:22:43.9597213Z 30%|███ | 6/20 [13:41<33:16, 142.58s/it]WARNING:tritonbench.utils.triton_op:Running input ID 31: 2026-02-21T08:22:43.9598749Z (M, N) 2026-02-21T08:22:43.9598924Z ------------ 2026-02-21T08:22:43.9599071Z (4096, 4224) 2026-02-21T08:22:43.9603634Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T08:22:45.2386709Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T08:22:46.4849516Z INFO:tritonbench.utils.triton_op:Took 2.18ms to get benchmark function for torch_compile_softmax 2026-02-21T08:22:49.9077533Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:22:49.9081960Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:22:49.9086102Z 'dtype': 'torch.float16', 2026-02-21T08:22:49.9087692Z 'shape': (4096, 4224), 2026-02-21T08:22:49.9087957Z 'stride': (4224, 1)},), 2026-02-21T08:22:49.9092751Z 'kwargs': {}} 2026-02-21T08:22:49.9097294Z INFO:tritonbench.utils.triton_op:Took 1.77ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T08:22:50.0855943Z 
[0s] Autotune random seed: 2134816249 2026-02-21T08:22:50.2288159Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:23:23.7029818Z [33s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None]) 2026-02-21T08:23:23.9431151Z [33s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False]) 2026-02-21T08:23:24.0934471Z [33s] Timeout after 30s compiling Config(block_sizes=[256, 4096], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=128, num_sm_multiplier=128, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 2], range_unroll_factors=[3, 0], range_warp_specializes=[None, True]) 2026-02-21T08:23:24.0950774Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s 2026-02-21T08:23:24.2833541Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:23:24.2836044Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:23:24.2836604Z %cst = arith.constant dense<0.000000e+00> : tensor<8x512xf16> 
2026-02-21T08:23:24.2841498Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:23:24.2845995Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:23:24.2851299Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:23:24.2855886Z %cst_0 = arith.constant dense<4224> : tensor<8x1xi32> 2026-02-21T08:23:24.2857451Z %cst_1 = arith.constant dense<0.000000e+00> : tensor<8x512xf32> 2026-02-21T08:23:24.2857817Z %cst_2 = arith.constant dense<0xFC00> : tensor<8x512xf16> 2026-02-21T08:23:24.2863984Z %cst_3 = arith.constant dense<4224> : tensor<512xi32> 2026-02-21T08:23:24.2868531Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32> 2026-02-21T08:23:24.2870678Z %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32> 2026-02-21T08:23:24.2870985Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:23:24.2875767Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:23:24.2880329Z %c4224_i32 = arith.constant 4224 : i32 2026-02-21T08:23:24.2884367Z %c4224_i64 = arith.constant 4224 : i64 2026-02-21T08:23:24.2888808Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:23:24.2892689Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4224_i32], [%c4224_i64, %c1_i64] : , > 2026-02-21T08:23:24.2897122Z %1 = tt.get_program_id x : i32 2026-02-21T08:23:24.2899007Z scf.for %arg2 = %1 to %c512_i32 step %c9472_i32 : i32 { 2026-02-21T08:23:24.2899280Z %2 = arith.muli %arg2, %c8_i32 : i32 2026-02-21T08:23:24.2899516Z %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:23:24.2899777Z %4 = tt.splat %2 : i32 -> tensor<8xi32> 2026-02-21T08:23:24.2899976Z %5 = arith.addi %4, %3 : tensor<8xi32> 2026-02-21T08:23:24.2900177Z %c4096_i32_6 = arith.constant 4096 : i32 2026-02-21T08:23:24.2900363Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:23:24.2900741Z %6:2 = scf.for %arg3 = %c0_i32 to %c4096_i32_6 step %c2048_i32 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<8xf32>, tensor<8xf32>) : i32 { 2026-02-21T08:23:24.2901156Z %60 = tt.make_range {end = 512 : i32, start 
= 0 : i32} : tensor<512xi32> 2026-02-21T08:23:24.2901416Z %61 = tt.splat %arg3 : i32 -> tensor<512xi32> 2026-02-21T08:23:24.2901706Z %62 = arith.addi %61, %60 : tensor<512xi32> 2026-02-21T08:23:24.2902235Z %63 = arith.cmpi slt, %62, %cst_3 : tensor<512xi32> 2026-02-21T08:23:24.2902555Z %64 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:23:24.2902909Z %65 = tt.expand_dims %63 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:23:24.2903214Z %66 = tt.broadcast %65 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:23:24.2903503Z %67 = arith.select %66, %64, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T08:23:24.2903783Z %68 = arith.extf %67 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:23:24.2904023Z %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({ 2026-02-21T08:23:24.2904216Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:23:24.2904411Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:23:24.2904604Z tt.reduce.return %174 : f32 2026-02-21T08:23:24.2904865Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:23:24.2905106Z %70 = arith.truncf %69 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:23:24.2905357Z %71 = arith.extf %70 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:23:24.2905607Z %72 = arith.cmpf ogt, %arg4, %71 : tensor<8xf32> 2026-02-21T08:23:24.2905833Z %73 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32> 2026-02-21T08:23:24.2906132Z %74 = arith.ori %72, %73 : tensor<8xi1> 2026-02-21T08:23:24.2906360Z %75 = arith.select %74, %arg4, %71 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:23:24.2906605Z %76 = arith.subf %arg4, %75 : tensor<8xf32> 2026-02-21T08:23:24.2906970Z %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:23:24.2907327Z %78 = arith.mulf %arg5, %77 : tensor<8xf32> 2026-02-21T08:23:24.2907586Z %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 
2026-02-21T08:23:24.2907871Z %80 = arith.extf %64 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:23:24.2908134Z %81 = tt.broadcast %79 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:23:24.2908373Z %82 = arith.subf %80, %81 : tensor<8x512xf32> 2026-02-21T08:23:24.2908736Z %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:23:24.2909147Z %84 = arith.select %66, %83, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T08:23:24.2909401Z %85 = "tt.reduce"(%84) <{axis = 1 : i32}> ({ 2026-02-21T08:23:24.2909600Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:23:24.2909782Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:23:24.2909980Z tt.reduce.return %174 : f32 2026-02-21T08:23:24.2910177Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:23:24.2910369Z %86 = arith.addf %78, %85 : tensor<8xf32> 2026-02-21T08:23:24.2910568Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:23:24.2910752Z %87 = arith.muli %c512_i32, %c1_i32 : i32 2026-02-21T08:23:24.2910942Z %88 = arith.addi %arg3, %87 : i32 2026-02-21T08:23:24.2911173Z %89 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:23:24.2911432Z %90 = tt.splat %88 : i32 -> tensor<512xi32> 2026-02-21T08:23:24.2911693Z %91 = arith.addi %90, %89 : tensor<512xi32> 2026-02-21T08:23:24.2911903Z %92 = arith.cmpi slt, %91, %cst_3 : tensor<512xi32> 2026-02-21T08:23:24.2912203Z %93 = tt.descriptor_load %0[%2, %88] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:23:24.2912534Z %94 = tt.expand_dims %92 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:23:24.2912822Z %95 = tt.broadcast %94 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:23:24.2913090Z %96 = arith.select %95, %93, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T08:23:24.2913447Z %97 = arith.extf %96 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:23:24.2913677Z %98 = 
"tt.reduce"(%97) <{axis = 1 : i32}> ({ 2026-02-21T08:23:24.2913864Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:23:24.2914047Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:23:24.2914233Z tt.reduce.return %174 : f32 2026-02-21T08:23:24.2914420Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:23:24.2914634Z %99 = arith.truncf %98 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:23:24.2914886Z %100 = arith.extf %99 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:23:24.2915125Z %101 = arith.cmpf ogt, %75, %100 : tensor<8xf32> 2026-02-21T08:23:24.2915341Z %102 = arith.cmpf une, %75, %75 : tensor<8xf32> 2026-02-21T08:23:24.2915555Z %103 = arith.ori %101, %102 : tensor<8xi1> 2026-02-21T08:23:24.2915838Z %104 = arith.select %103, %75, %100 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:23:24.2916082Z %105 = arith.subf %75, %104 : tensor<8xf32> 2026-02-21T08:23:24.2916453Z %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:23:24.2916830Z %107 = arith.mulf %86, %106 : tensor<8xf32> 2026-02-21T08:23:24.2917095Z %108 = tt.expand_dims %104 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:24.2917397Z %109 = arith.extf %93 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:23:24.2917676Z %110 = tt.broadcast %108 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:23:24.2917924Z %111 = arith.subf %109, %110 : tensor<8x512xf32> 2026-02-21T08:23:24.2918319Z %112 = tt.extern_elementwise %111 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:23:24.2918756Z %113 = arith.select %95, %112, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T08:23:24.2919024Z %114 = "tt.reduce"(%113) <{axis = 1 : i32}> ({ 2026-02-21T08:23:24.2919231Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:23:24.2919413Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:23:24.2919613Z tt.reduce.return %174 : f32 
2026-02-21T08:23:24.2919803Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:23:24.2920015Z %115 = arith.addf %107, %114 : tensor<8xf32> 2026-02-21T08:23:24.2920216Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:23:24.2920418Z %116 = arith.muli %c512_i32, %c2_i32 : i32 2026-02-21T08:23:24.2920638Z %117 = arith.addi %arg3, %116 : i32 2026-02-21T08:23:24.2920882Z %118 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:23:24.2921153Z %119 = tt.splat %117 : i32 -> tensor<512xi32> 2026-02-21T08:23:24.2921367Z %120 = arith.addi %119, %118 : tensor<512xi32> 2026-02-21T08:23:24.2921645Z %121 = arith.cmpi slt, %120, %cst_3 : tensor<512xi32> 2026-02-21T08:23:24.2921968Z %122 = tt.descriptor_load %0[%2, %117] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:23:24.2922342Z %123 = tt.expand_dims %121 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:23:24.2922667Z %124 = tt.broadcast %123 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:23:24.2922957Z %125 = arith.select %124, %122, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T08:23:24.2923261Z %126 = arith.extf %125 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:23:24.2923504Z %127 = "tt.reduce"(%126) <{axis = 1 : i32}> ({ 2026-02-21T08:23:24.2923709Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:23:24.2923897Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:23:24.2924105Z tt.reduce.return %174 : f32 2026-02-21T08:23:24.2924307Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:23:24.2924540Z %128 = arith.truncf %127 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:23:24.2924870Z %129 = arith.extf %128 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:23:24.2925116Z %130 = arith.cmpf ogt, %104, %129 : tensor<8xf32> 2026-02-21T08:23:24.2925337Z %131 = arith.cmpf une, %104, %104 : tensor<8xf32> 2026-02-21T08:23:24.2925543Z %132 = arith.ori %130, %131 : tensor<8xi1> 2026-02-21T08:23:24.2925781Z %133 = arith.select %132, %104, %129 : 
tensor<8xi1>, tensor<8xf32> 2026-02-21T08:23:24.2926026Z %134 = arith.subf %104, %133 : tensor<8xf32> 2026-02-21T08:23:24.2926376Z %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:23:24.2926749Z %136 = arith.mulf %115, %135 : tensor<8xf32> 2026-02-21T08:23:24.2926998Z %137 = tt.expand_dims %133 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:24.2927332Z %138 = arith.extf %122 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:23:24.2927603Z %139 = tt.broadcast %137 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:23:24.2927844Z %140 = arith.subf %138, %139 : tensor<8x512xf32> 2026-02-21T08:23:24.2928240Z %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:23:24.2928654Z %142 = arith.select %124, %141, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T08:23:24.2928909Z %143 = "tt.reduce"(%142) <{axis = 1 : i32}> ({ 2026-02-21T08:23:24.2929106Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:23:24.2929282Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:23:24.2929471Z tt.reduce.return %174 : f32 2026-02-21T08:23:24.2929651Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:23:24.2929851Z %144 = arith.addf %136, %143 : tensor<8xf32> 2026-02-21T08:23:24.2930043Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:23:24.2930236Z %145 = arith.muli %c512_i32, %c3_i32 : i32 2026-02-21T08:23:24.2930431Z %146 = arith.addi %arg3, %145 : i32 2026-02-21T08:23:24.2930660Z %147 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:23:24.2930920Z %148 = tt.splat %146 : i32 -> tensor<512xi32> 2026-02-21T08:23:24.2931126Z %149 = arith.addi %148, %147 : tensor<512xi32> 2026-02-21T08:23:24.2931348Z %150 = arith.cmpi slt, %149, %cst_3 : tensor<512xi32> 2026-02-21T08:23:24.2931676Z %151 = tt.descriptor_load %0[%2, %146] : 
!tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:23:24.2932024Z %152 = tt.expand_dims %150 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:23:24.2932326Z %153 = tt.broadcast %152 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:23:24.2932604Z %154 = arith.select %153, %151, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T08:23:24.2932893Z %155 = arith.extf %154 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:23:24.2933125Z %156 = "tt.reduce"(%155) <{axis = 1 : i32}> ({ 2026-02-21T08:23:24.2933324Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:23:24.2933512Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:23:24.2933697Z tt.reduce.return %174 : f32 2026-02-21T08:23:24.2933884Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:23:24.2934106Z %157 = arith.truncf %156 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:23:24.2934355Z %158 = arith.extf %157 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:23:24.2934582Z %159 = arith.cmpf ogt, %133, %158 : tensor<8xf32> 2026-02-21T08:23:24.2934801Z %160 = arith.cmpf une, %133, %133 : tensor<8xf32> 2026-02-21T08:23:24.2935001Z %161 = arith.ori %159, %160 : tensor<8xi1> 2026-02-21T08:23:24.2935238Z %162 = arith.select %161, %133, %158 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:23:24.2935486Z %163 = arith.subf %133, %162 : tensor<8xf32> 2026-02-21T08:23:24.2935924Z %164 = tt.extern_elementwise %163 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:23:24.2936283Z %165 = arith.mulf %144, %164 : tensor<8xf32> 2026-02-21T08:23:24.2936529Z %166 = tt.expand_dims %162 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:24.2936824Z %167 = arith.extf %151 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:23:24.2937086Z %168 = tt.broadcast %166 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:23:24.2937323Z %169 = arith.subf %167, %168 : tensor<8x512xf32> 2026-02-21T08:23:24.2937692Z %170 = 
tt.extern_elementwise %169 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:23:24.2938097Z %171 = arith.select %153, %170, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T08:23:24.2938405Z %172 = "tt.reduce"(%171) <{axis = 1 : i32}> ({ 2026-02-21T08:23:24.2938598Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:23:24.2938783Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:23:24.2938975Z tt.reduce.return %174 : f32 2026-02-21T08:23:24.2939155Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:23:24.2939358Z %173 = arith.addf %165, %172 : tensor<8xf32> 2026-02-21T08:23:24.2939572Z scf.yield %162, %173 : tensor<8xf32>, tensor<8xf32> 2026-02-21T08:23:24.2939822Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:23:24.2940085Z %7 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:23:24.2940347Z %8 = tt.splat %c4096_i32_6 : i32 -> tensor<512xi32> 2026-02-21T08:23:24.2940562Z %9 = arith.addi %8, %7 : tensor<512xi32> 2026-02-21T08:23:24.2940768Z %10 = arith.cmpi slt, %9, %cst_3 : tensor<512xi32> 2026-02-21T08:23:24.2941078Z %11 = tt.descriptor_load %0[%2, %c4096_i32_6] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:23:24.2941425Z %12 = tt.expand_dims %10 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:23:24.2941747Z %13 = tt.broadcast %12 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:23:24.2942016Z %14 = arith.select %13, %11, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T08:23:24.2942289Z %15 = arith.extf %14 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:23:24.2942520Z %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({ 2026-02-21T08:23:24.2942711Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:23:24.2942899Z %60 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:23:24.2943086Z tt.reduce.return %60 : f32 2026-02-21T08:23:24.2943279Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 
2026-02-21T08:23:24.2943491Z %17 = arith.truncf %16 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:23:24.2943734Z %18 = arith.extf %17 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:23:24.2943964Z %19 = arith.cmpf ogt, %6#0, %18 : tensor<8xf32> 2026-02-21T08:23:24.2944174Z %20 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32> 2026-02-21T08:23:24.2944379Z %21 = arith.ori %19, %20 : tensor<8xi1> 2026-02-21T08:23:24.2944597Z %22 = arith.select %21, %6#0, %18 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:23:24.2944829Z %23 = arith.subf %6#0, %22 : tensor<8xf32> 2026-02-21T08:23:24.2945179Z %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:23:24.2945543Z %25 = arith.mulf %6#1, %24 : tensor<8xf32> 2026-02-21T08:23:24.2945792Z %26 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:24.2946069Z %27 = arith.extf %11 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:23:24.2946324Z %28 = tt.broadcast %26 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:23:24.2946552Z %29 = arith.subf %27, %28 : tensor<8x512xf32> 2026-02-21T08:23:24.2946986Z %30 = tt.extern_elementwise %29 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:23:24.2947388Z %31 = arith.select %13, %30, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T08:23:24.2947631Z %32 = "tt.reduce"(%31) <{axis = 1 : i32}> ({ 2026-02-21T08:23:24.2947825Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:23:24.2947996Z %60 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:23:24.2948182Z tt.reduce.return %60 : f32 2026-02-21T08:23:24.2948361Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:23:24.2948556Z %33 = arith.addf %25, %32 : tensor<8xf32> 2026-02-21T08:23:24.2948746Z %c4096_i32_7 = arith.constant 4096 : i32 2026-02-21T08:23:24.2948942Z %c2048_i32_8 = arith.constant 2048 : i32 2026-02-21T08:23:24.2949173Z scf.for 
%arg3 = %c0_i32 to %c4096_i32_7 step %c2048_i32_8 : i32 { 2026-02-21T08:23:24.2949498Z %60 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:23:24.2949763Z %61 = tt.splat %arg3 : i32 -> tensor<512xi32> 2026-02-21T08:23:24.2949966Z %62 = arith.addi %61, %60 : tensor<512xi32> 2026-02-21T08:23:24.2950183Z %63 = arith.cmpi slt, %62, %cst_3 : tensor<512xi32> 2026-02-21T08:23:24.2950439Z %64 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:23:24.2950702Z %65 = arith.muli %64, %cst_0 : tensor<8x1xi32> 2026-02-21T08:23:24.2950962Z %66 = tt.expand_dims %62 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:23:24.2951248Z %67 = tt.broadcast %65 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:23:24.2951511Z %68 = tt.broadcast %66 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:23:24.2951783Z %69 = arith.addi %67, %68 : tensor<8x512xi32> 2026-02-21T08:23:24.2952031Z %70 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2952316Z %71 = tt.addptr %70, %69 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:23:24.2952619Z %72 = tt.expand_dims %63 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:23:24.2952917Z %73 = tt.broadcast %72 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:23:24.2953214Z %74 = tt.load %71, %73, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2953540Z %75 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:24.2953819Z %76 = arith.extf %74 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:23:24.2954078Z %77 = tt.broadcast %75 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:23:24.2954310Z %78 = arith.subf %76, %77 : tensor<8x512xf32> 2026-02-21T08:23:24.2954676Z %79 = tt.extern_elementwise %78 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:23:24.2955102Z %80 = 
tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:24.2955379Z %81 = tt.broadcast %80 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:23:24.2955610Z %82 = arith.divf %79, %81 : tensor<8x512xf32> 2026-02-21T08:23:24.2955837Z %83 = arith.truncf %82 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:23:24.2956106Z %84 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2956386Z %85 = tt.addptr %84, %69 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:23:24.2956644Z tt.store %85, %83, %73 : tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2956859Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:23:24.2957049Z %86 = arith.muli %c512_i32, %c1_i32 : i32 2026-02-21T08:23:24.2957246Z %87 = arith.addi %arg3, %86 : i32 2026-02-21T08:23:24.2957474Z %88 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:23:24.2957777Z %89 = tt.splat %87 : i32 -> tensor<512xi32> 2026-02-21T08:23:24.2957981Z %90 = arith.addi %89, %88 : tensor<512xi32> 2026-02-21T08:23:24.2958192Z %91 = arith.cmpi slt, %90, %cst_3 : tensor<512xi32> 2026-02-21T08:23:24.2958456Z %92 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:23:24.2958713Z %93 = arith.muli %92, %cst_0 : tensor<8x1xi32> 2026-02-21T08:23:24.2958974Z %94 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:23:24.2959261Z %95 = tt.broadcast %93 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:23:24.2959542Z %96 = tt.broadcast %94 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:23:24.2959792Z %97 = arith.addi %95, %96 : tensor<8x512xi32> 2026-02-21T08:23:24.2960032Z %98 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2960367Z %99 = tt.addptr %98, %97 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:23:24.2960677Z %100 = tt.expand_dims %91 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:23:24.2960988Z %101 = 
tt.broadcast %100 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:23:24.2961316Z %102 = tt.load %99, %101, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2961706Z %103 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:24.2962012Z %104 = arith.extf %102 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:23:24.2962284Z %105 = tt.broadcast %103 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:23:24.2962542Z %106 = arith.subf %104, %105 : tensor<8x512xf32> 2026-02-21T08:23:24.2962930Z %107 = tt.extern_elementwise %106 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:23:24.2963377Z %108 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:24.2963680Z %109 = tt.broadcast %108 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:23:24.2963930Z %110 = arith.divf %107, %109 : tensor<8x512xf32> 2026-02-21T08:23:24.2964185Z %111 = arith.truncf %110 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:23:24.2964471Z %112 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2964776Z %113 = tt.addptr %112, %97 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:23:24.2965065Z tt.store %113, %111, %101 : tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2965288Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:23:24.2965493Z %114 = arith.muli %c512_i32, %c2_i32 : i32 2026-02-21T08:23:24.2965695Z %115 = arith.addi %arg3, %114 : i32 2026-02-21T08:23:24.2965949Z %116 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:23:24.2966224Z %117 = tt.splat %115 : i32 -> tensor<512xi32> 2026-02-21T08:23:24.2966452Z %118 = arith.addi %117, %116 : tensor<512xi32> 2026-02-21T08:23:24.2966695Z %119 = arith.cmpi slt, %118, %cst_3 : tensor<512xi32> 2026-02-21T08:23:24.2966976Z %120 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 
2026-02-21T08:23:24.2967245Z %121 = arith.muli %120, %cst_0 : tensor<8x1xi32> 2026-02-21T08:23:24.2967508Z %122 = tt.expand_dims %118 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:23:24.2967808Z %123 = tt.broadcast %121 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:23:24.2968074Z %124 = tt.broadcast %122 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:23:24.2968321Z %125 = arith.addi %123, %124 : tensor<8x512xi32> 2026-02-21T08:23:24.2968566Z %126 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2968851Z %127 = tt.addptr %126, %125 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:23:24.2969208Z %128 = tt.expand_dims %119 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:23:24.2969494Z %129 = tt.broadcast %128 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:23:24.2969807Z %130 = tt.load %127, %129, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2970138Z %131 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:24.2970417Z %132 = arith.extf %130 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:23:24.2970678Z %133 = tt.broadcast %131 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:23:24.2970915Z %134 = arith.subf %132, %133 : tensor<8x512xf32> 2026-02-21T08:23:24.2971294Z %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:23:24.2971805Z %136 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:24.2972096Z %137 = tt.broadcast %136 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:23:24.2972340Z %138 = arith.divf %135, %137 : tensor<8x512xf32> 2026-02-21T08:23:24.2972577Z %139 = arith.truncf %138 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:23:24.2972851Z %140 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2973131Z %141 = 
tt.addptr %140, %125 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:23:24.2973405Z tt.store %141, %139, %129 : tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2973623Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:23:24.2973814Z %142 = arith.muli %c512_i32, %c3_i32 : i32 2026-02-21T08:23:24.2974009Z %143 = arith.addi %arg3, %142 : i32 2026-02-21T08:23:24.2974238Z %144 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:23:24.2974500Z %145 = tt.splat %143 : i32 -> tensor<512xi32> 2026-02-21T08:23:24.2974703Z %146 = arith.addi %145, %144 : tensor<512xi32> 2026-02-21T08:23:24.2974925Z %147 = arith.cmpi slt, %146, %cst_3 : tensor<512xi32> 2026-02-21T08:23:24.2975191Z %148 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:23:24.2975449Z %149 = arith.muli %148, %cst_0 : tensor<8x1xi32> 2026-02-21T08:23:24.2975714Z %150 = tt.expand_dims %146 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:23:24.2976006Z %151 = tt.broadcast %149 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:23:24.2976273Z %152 = tt.broadcast %150 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:23:24.2976513Z %153 = arith.addi %151, %152 : tensor<8x512xi32> 2026-02-21T08:23:24.2976755Z %154 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2977047Z %155 = tt.addptr %154, %153 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:23:24.2977349Z %156 = tt.expand_dims %147 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:23:24.2977644Z %157 = tt.broadcast %156 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:23:24.2977948Z %158 = tt.load %155, %157, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2978279Z %159 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:24.2978571Z %160 = arith.extf %158 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:23:24.2978829Z %161 = tt.broadcast %159 : 
tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:23:24.2979070Z %162 = arith.subf %160, %161 : tensor<8x512xf32> 2026-02-21T08:23:24.2979442Z %163 = tt.extern_elementwise %162 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:23:24.2979869Z %164 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:24.2980200Z %165 = tt.broadcast %164 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:23:24.2980455Z %166 = arith.divf %163, %165 : tensor<8x512xf32> 2026-02-21T08:23:24.2980696Z %167 = arith.truncf %166 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:23:24.2980964Z %168 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2981247Z %169 = tt.addptr %168, %153 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:23:24.2981511Z tt.store %169, %167, %157 : tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2981804Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:23:24.2982070Z %34 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:23:24.2982333Z %35 = tt.splat %c4096_i32_7 : i32 -> tensor<512xi32> 2026-02-21T08:23:24.2982599Z %36 = arith.addi %35, %34 : tensor<512xi32> 2026-02-21T08:23:24.2982813Z %37 = arith.cmpi slt, %36, %cst_3 : tensor<512xi32> 2026-02-21T08:23:24.2983075Z %38 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:23:24.2983332Z %39 = arith.muli %38, %cst_0 : tensor<8x1xi32> 2026-02-21T08:23:24.2983591Z %40 = tt.expand_dims %36 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:23:24.2983883Z %41 = tt.broadcast %39 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:23:24.2984139Z %42 = tt.broadcast %40 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:23:24.2984379Z %43 = arith.addi %41, %42 : tensor<8x512xi32> 2026-02-21T08:23:24.2984612Z %44 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 
2026-02-21T08:23:24.2984887Z %45 = tt.addptr %44, %43 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:23:24.2985173Z %46 = tt.expand_dims %37 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:23:24.2985459Z %47 = tt.broadcast %46 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:23:24.2985756Z %48 = tt.load %45, %47, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2986071Z %49 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:24.2986353Z %50 = arith.extf %48 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:23:24.2986599Z %51 = tt.broadcast %49 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:23:24.2986830Z %52 = arith.subf %50, %51 : tensor<8x512xf32> 2026-02-21T08:23:24.2987185Z %53 = tt.extern_elementwise %52 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:23:24.2987593Z %54 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:24.2987872Z %55 = tt.broadcast %54 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:23:24.2988096Z %56 = arith.divf %53, %55 : tensor<8x512xf32> 2026-02-21T08:23:24.2988329Z %57 = arith.truncf %56 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:23:24.2988583Z %58 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2988855Z %59 = tt.addptr %58, %43 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:23:24.2989112Z tt.store %59, %57, %47 : tensor<8x512x!tt.ptr> 2026-02-21T08:23:24.2989388Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:23:24.2989644Z tt.return 2026-02-21T08:23:24.2989770Z } 2026-02-21T08:23:24.2989898Z } 2026-02-21T08:23:24.2989965Z 2026-02-21T08:23:24.2990015Z {-# 2026-02-21T08:23:24.2990151Z external_resources: { 2026-02-21T08:23:24.2990305Z mlir_reproducer: { 2026-02-21T08:23:24.2994683Z pipeline: 
"builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:23:24.2999093Z disable_threading: false, 2026-02-21T08:23:24.2999268Z verify_each: true 2026-02-21T08:23:24.2999411Z } 2026-02-21T08:23:24.2999538Z } 2026-02-21T08:23:24.2999648Z #-} 2026-02-21T08:23:24.3000077Z 
/tmp/torchinductor_root/ay/cay7nz7nggw5j73svyjfng2qcf2be64lex7hhfxux7fbsi3w3ldy.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:23:24.3001252Z /tmp/torchinductor_root/ay/cay7nz7nggw5j73svyjfng2qcf2be64lex7hhfxux7fbsi3w3ldy.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:23:24.3002271Z [34s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:23:24.3003368Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 512], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'last'], maxnreg=32, num_sm_multiplier=64, num_stages=7, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[4, 4], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:23:24.3004341Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:23:24.3004609Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:23:30.5339278Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 15.6 configs/s 2026-02-21T08:23:30.5347885Z [40s] Adaptive compile timeout: 30s (90% percentile=4.5s, bounds=[30.0s, 30s]) 2026-02-21T08:23:30.9958592Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 2123.1 configs/s 2026-02-21T08:23:31.0412114Z [40s] Initial random population of 100, 5 starting points: 2026-02-21T08:23:31.0416617Z error=6 2026-02-21T08:23:31.0418622Z timeout=3 2026-02-21T08:23:31.0423776Z ok=91 2026-02-21T08:23:31.0425361Z min=0.0307 2026-02-21T08:23:31.0425523Z mid=0.4117 2026-02-21T08:23:31.0425649Z max=28.4201 2026-02-21T08:23:31.0425797Z best={'block_sizes': [1, 8192], 2026-02-21T08:23:31.0426026Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:23:31.0426270Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:23:31.0426450Z 'maxnreg': 32, 2026-02-21T08:23:31.0426600Z 'num_sm_multiplier': 64, 2026-02-21T08:23:31.0427130Z 'num_stages': 7, 2026-02-21T08:23:31.0427269Z 'num_warps': 4, 2026-02-21T08:23:31.0431938Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:23:31.0435592Z 'range_flattens': [None, True], 2026-02-21T08:23:31.0435895Z 'range_multi_buffers': [False, True], 2026-02-21T08:23:31.0436119Z 'range_num_stages': [1, 4], 2026-02-21T08:23:31.0441460Z 'range_unroll_factors': [1, 4], 2026-02-21T08:23:31.0443398Z 'range_warp_specializes': [True, None]} 2026-02-21T08:23:31.0443714Z [40s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:23:32.1619009Z [41s] Generation 1 starting: 84 neighbors, 5 active search path(s) 2026-02-21T08:23:38.9858839Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 7.0 configs/s 2026-02-21T08:23:41.0983236Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:23:41.0985887Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 
2026-02-21T08:23:41.0986462Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:23:41.0986665Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:23:41.0988402Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:23:41.0988632Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:23:41.0988902Z %cst = arith.constant dense<0.000000e+00> : tensor<8x1024xf32> 2026-02-21T08:23:41.0989208Z %cst_0 = arith.constant dense<0xFC00> : tensor<8x1024xf16> 2026-02-21T08:23:41.0994073Z %cst_1 = arith.constant dense<4224> : tensor<1024xi32> 2026-02-21T08:23:41.0999140Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<8xf32> 2026-02-21T08:23:41.1003746Z %cst_3 = arith.constant dense<0xFF800000> : tensor<8xf32> 2026-02-21T08:23:41.1005564Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:23:41.1005765Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:23:41.1005968Z %c4224_i32 = arith.constant 4224 : i32 2026-02-21T08:23:41.1006170Z %c4224_i64 = arith.constant 4224 : i64 2026-02-21T08:23:41.1006365Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:23:41.1006688Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4224_i32], [%c4224_i64, %c1_i64] : , > 2026-02-21T08:23:41.1007122Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4224_i32], [%c4224_i64, %c1_i64] : , > 2026-02-21T08:23:41.1007475Z %2 = tt.get_program_id x : i32 2026-02-21T08:23:41.1007696Z scf.for %arg2 = %2 to %c512_i32 step %c9472_i32 : i32 { 2026-02-21T08:23:41.1007914Z %3 = arith.muli %arg2, %c8_i32 : i32 2026-02-21T08:23:41.1008112Z %c4096_i32_4 = arith.constant 4096 : i32 2026-02-21T08:23:41.1008308Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:23:41.1008683Z %4:2 = scf.for %arg3 = %c0_i32 to %c4096_i32_4 step %c2048_i32 iter_args(%arg4 = %cst_3, %arg5 = %cst_2) -> (tensor<8xf32>, tensor<8xf32>) : i32 { 2026-02-21T08:23:41.1009118Z %42 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T08:23:41.1009386Z %43 = tt.splat %arg3 : i32 -> tensor<1024xi32> 
2026-02-21T08:23:41.1009605Z %44 = arith.addi %43, %42 : tensor<1024xi32> 2026-02-21T08:23:41.1009829Z %45 = arith.cmpi slt, %44, %cst_1 : tensor<1024xi32> 2026-02-21T08:23:41.1010132Z %46 = tt.descriptor_load %0[%3, %arg3] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T08:23:41.1010486Z %47 = tt.expand_dims %45 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T08:23:41.1010781Z %48 = tt.broadcast %47 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T08:23:41.1011064Z %49 = arith.select %48, %46, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T08:23:41.1011337Z %50 = arith.extf %49 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T08:23:41.1011634Z %51 = "tt.reduce"(%50) <{axis = 1 : i32}> ({ 2026-02-21T08:23:41.1011840Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:23:41.1012037Z %98 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:23:41.1012577Z tt.reduce.return %98 : f32 2026-02-21T08:23:41.1012767Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T08:23:41.1012999Z %52 = arith.truncf %51 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:23:41.1013240Z %53 = arith.extf %52 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:23:41.1013475Z %54 = arith.cmpf ogt, %arg4, %53 : tensor<8xf32> 2026-02-21T08:23:41.1013706Z %55 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32> 2026-02-21T08:23:41.1013915Z %56 = arith.ori %54, %55 : tensor<8xi1> 2026-02-21T08:23:41.1014150Z %57 = arith.select %56, %arg4, %53 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:23:41.1014382Z %58 = arith.subf %arg4, %57 : tensor<8xf32> 2026-02-21T08:23:41.1014752Z %59 = tt.extern_elementwise %58 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:23:41.1015213Z %60 = arith.mulf %arg5, %59 : tensor<8xf32> 2026-02-21T08:23:41.1015479Z %61 = tt.expand_dims %57 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:41.1015770Z %62 = arith.extf %46 : tensor<8x1024xf16> to tensor<8x1024xf32> 
2026-02-21T08:23:41.1016051Z %63 = tt.broadcast %61 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T08:23:41.1016309Z %64 = arith.subf %62, %63 : tensor<8x1024xf32> 2026-02-21T08:23:41.1016697Z %65 = tt.extern_elementwise %64 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T08:23:41.1017142Z %66 = arith.select %48, %65, %cst : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T08:23:41.1017422Z %67 = "tt.reduce"(%66) <{axis = 1 : i32}> ({ 2026-02-21T08:23:41.1017628Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:23:41.1017838Z %98 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:23:41.1018037Z tt.reduce.return %98 : f32 2026-02-21T08:23:41.1018259Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T08:23:41.1018469Z %68 = arith.addf %60, %67 : tensor<8xf32> 2026-02-21T08:23:41.1018677Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:23:41.1018871Z %69 = arith.muli %c1024_i32, %c1_i32 : i32 2026-02-21T08:23:41.1019075Z %70 = arith.addi %arg3, %69 : i32 2026-02-21T08:23:41.1019326Z %71 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T08:23:41.1019588Z %72 = tt.splat %70 : i32 -> tensor<1024xi32> 2026-02-21T08:23:41.1019802Z %73 = arith.addi %72, %71 : tensor<1024xi32> 2026-02-21T08:23:41.1020023Z %74 = arith.cmpi slt, %73, %cst_1 : tensor<1024xi32> 2026-02-21T08:23:41.1020338Z %75 = tt.descriptor_load %0[%3, %70] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T08:23:41.1020691Z %76 = tt.expand_dims %74 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T08:23:41.1021002Z %77 = tt.broadcast %76 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T08:23:41.1021296Z %78 = arith.select %77, %75, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T08:23:41.1021640Z %79 = arith.extf %78 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T08:23:41.1021887Z %80 = "tt.reduce"(%79) <{axis = 1 : i32}> ({ 2026-02-21T08:23:41.1022083Z ^bb0(%arg6: f32, 
%arg7: f32): 2026-02-21T08:23:41.1022286Z %98 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:23:41.1022484Z tt.reduce.return %98 : f32 2026-02-21T08:23:41.1022687Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T08:23:41.1022927Z %81 = arith.truncf %80 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:23:41.1023181Z %82 = arith.extf %81 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:23:41.1023430Z %83 = arith.cmpf ogt, %57, %82 : tensor<8xf32> 2026-02-21T08:23:41.1023638Z %84 = arith.cmpf une, %57, %57 : tensor<8xf32> 2026-02-21T08:23:41.1023846Z %85 = arith.ori %83, %84 : tensor<8xi1> 2026-02-21T08:23:41.1024141Z %86 = arith.select %85, %57, %82 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:23:41.1024370Z %87 = arith.subf %57, %86 : tensor<8xf32> 2026-02-21T08:23:41.1024724Z %88 = tt.extern_elementwise %87 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:23:41.1025071Z %89 = arith.mulf %68, %88 : tensor<8xf32> 2026-02-21T08:23:41.1025316Z %90 = tt.expand_dims %86 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:41.1025594Z %91 = arith.extf %75 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T08:23:41.1025853Z %92 = tt.broadcast %90 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T08:23:41.1026085Z %93 = arith.subf %91, %92 : tensor<8x1024xf32> 2026-02-21T08:23:41.1026502Z %94 = tt.extern_elementwise %93 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T08:23:41.1026918Z %95 = arith.select %77, %94, %cst : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T08:23:41.1027164Z %96 = "tt.reduce"(%95) <{axis = 1 : i32}> ({ 2026-02-21T08:23:41.1027359Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:23:41.1027536Z %98 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:23:41.1027728Z tt.reduce.return %98 : f32 2026-02-21T08:23:41.1027920Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T08:23:41.1028113Z %97 = 
arith.addf %89, %96 : tensor<8xf32> 2026-02-21T08:23:41.1028354Z scf.yield %86, %97 : tensor<8xf32>, tensor<8xf32> 2026-02-21T08:23:41.1028573Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:23:41.1028828Z %5 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T08:23:41.1029087Z %6 = tt.splat %c4096_i32_4 : i32 -> tensor<1024xi32> 2026-02-21T08:23:41.1029304Z %7 = arith.addi %6, %5 : tensor<1024xi32> 2026-02-21T08:23:41.1029514Z %8 = arith.cmpi slt, %7, %cst_1 : tensor<1024xi32> 2026-02-21T08:23:41.1029832Z %9 = tt.descriptor_load %0[%3, %c4096_i32_4] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T08:23:41.1030194Z %10 = tt.expand_dims %8 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T08:23:41.1030484Z %11 = tt.broadcast %10 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T08:23:41.1030761Z %12 = arith.select %11, %9, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T08:23:41.1031031Z %13 = arith.extf %12 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T08:23:41.1031263Z %14 = "tt.reduce"(%13) <{axis = 1 : i32}> ({ 2026-02-21T08:23:41.1031453Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:23:41.1031659Z %42 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:23:41.1031855Z tt.reduce.return %42 : f32 2026-02-21T08:23:41.1032035Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T08:23:41.1032263Z %15 = arith.truncf %14 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:23:41.1032493Z %16 = arith.extf %15 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:23:41.1032719Z %17 = arith.cmpf ogt, %4#0, %16 : tensor<8xf32> 2026-02-21T08:23:41.1032931Z %18 = arith.cmpf une, %4#0, %4#0 : tensor<8xf32> 2026-02-21T08:23:41.1033136Z %19 = arith.ori %17, %18 : tensor<8xi1> 2026-02-21T08:23:41.1033360Z %20 = arith.select %19, %4#0, %16 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:23:41.1033584Z %21 = arith.subf %4#0, %20 : tensor<8xf32> 2026-02-21T08:23:41.1033937Z %22 = tt.extern_elementwise %21 {libname = "", 
libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:23:41.1034282Z %23 = arith.mulf %4#1, %22 : tensor<8xf32> 2026-02-21T08:23:41.1034531Z %24 = tt.expand_dims %20 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:41.1034814Z %25 = arith.extf %9 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T08:23:41.1035118Z %26 = tt.broadcast %24 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T08:23:41.1035351Z %27 = arith.subf %25, %26 : tensor<8x1024xf32> 2026-02-21T08:23:41.1035709Z %28 = tt.extern_elementwise %27 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T08:23:41.1036118Z %29 = arith.select %11, %28, %cst : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T08:23:41.1036359Z %30 = "tt.reduce"(%29) <{axis = 1 : i32}> ({ 2026-02-21T08:23:41.1036550Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:23:41.1036731Z %42 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:23:41.1036914Z tt.reduce.return %42 : f32 2026-02-21T08:23:41.1037100Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T08:23:41.1037290Z %31 = arith.addf %23, %30 : tensor<8xf32> 2026-02-21T08:23:41.1037487Z %c4096_i32_5 = arith.constant 4096 : i32 2026-02-21T08:23:41.1037720Z %c2048_i32_6 = arith.constant 2048 : i32 2026-02-21T08:23:41.1037961Z scf.for %arg3 = %c0_i32 to %c4096_i32_5 step %c2048_i32_6 : i32 { 2026-02-21T08:23:41.1038292Z %42 = tt.descriptor_load %0[%3, %arg3] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T08:23:41.1038627Z %43 = tt.expand_dims %20 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:41.1038918Z %44 = arith.extf %42 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T08:23:41.1039168Z %45 = tt.broadcast %43 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T08:23:41.1039406Z %46 = arith.subf %44, %45 : tensor<8x1024xf32> 2026-02-21T08:23:41.1039770Z %47 = tt.extern_elementwise %46 {libname = "", libpath = "", pure = 
true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T08:23:41.1040182Z %48 = tt.expand_dims %31 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:41.1040470Z %49 = tt.broadcast %48 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T08:23:41.1040704Z %50 = arith.divf %47, %49 : tensor<8x1024xf32> 2026-02-21T08:23:41.1040946Z %51 = arith.truncf %50 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T08:23:41.1041265Z tt.descriptor_store %1[%3, %arg3], %51 : !tt.tensordesc>, tensor<8x1024xf16> 2026-02-21T08:23:41.1041581Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:23:41.1041780Z %52 = arith.muli %c1024_i32, %c1_i32 : i32 2026-02-21T08:23:41.1041971Z %53 = arith.addi %arg3, %52 : i32 2026-02-21T08:23:41.1042246Z %54 = tt.descriptor_load %0[%3, %53] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T08:23:41.1042576Z %55 = tt.expand_dims %20 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:41.1042863Z %56 = arith.extf %54 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T08:23:41.1043114Z %57 = tt.broadcast %55 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T08:23:41.1043354Z %58 = arith.subf %56, %57 : tensor<8x1024xf32> 2026-02-21T08:23:41.1043728Z %59 = tt.extern_elementwise %58 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T08:23:41.1044138Z %60 = tt.expand_dims %31 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:41.1044423Z %61 = tt.broadcast %60 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T08:23:41.1044651Z %62 = arith.divf %59, %61 : tensor<8x1024xf32> 2026-02-21T08:23:41.1044891Z %63 = arith.truncf %62 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T08:23:41.1045206Z tt.descriptor_store %1[%3, %53], %63 : !tt.tensordesc>, tensor<8x1024xf16> 2026-02-21T08:23:41.1045488Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:23:41.1045794Z %32 = tt.descriptor_load %0[%3, 
%c4096_i32_5] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T08:23:41.1046222Z %33 = tt.expand_dims %20 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:41.1046522Z %34 = arith.extf %32 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T08:23:41.1046777Z %35 = tt.broadcast %33 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T08:23:41.1047018Z %36 = arith.subf %34, %35 : tensor<8x1024xf32> 2026-02-21T08:23:41.1047393Z %37 = tt.extern_elementwise %36 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T08:23:41.1047808Z %38 = tt.expand_dims %31 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:23:41.1048098Z %39 = tt.broadcast %38 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T08:23:41.1048331Z %40 = arith.divf %37, %39 : tensor<8x1024xf32> 2026-02-21T08:23:41.1048576Z %41 = arith.truncf %40 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T08:23:41.1048963Z tt.descriptor_store %1[%3, %c4096_i32_5], %41 : !tt.tensordesc>, tensor<8x1024xf16> 2026-02-21T08:23:41.1049334Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:23:41.1049600Z tt.return 2026-02-21T08:23:41.1049730Z } 2026-02-21T08:23:41.1049862Z } 2026-02-21T08:23:41.1049934Z 2026-02-21T08:23:41.1049986Z {-# 2026-02-21T08:23:41.1050127Z external_resources: { 2026-02-21T08:23:41.1050291Z mlir_reproducer: { 2026-02-21T08:23:41.1054755Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ 
max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:23:41.1059312Z disable_threading: false, 2026-02-21T08:23:41.1059487Z verify_each: true 2026-02-21T08:23:41.1059631Z } 2026-02-21T08:23:41.1059759Z } 2026-02-21T08:23:41.1059875Z #-} 2026-02-21T08:23:41.1060311Z /tmp/torchinductor_root/7t/c7toaxgqpp4rbcfrjffndnwgmleknkeskrx3lz6pdu55gkgvczik.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:23:41.1061490Z /tmp/torchinductor_root/7t/c7toaxgqpp4rbcfrjffndnwgmleknkeskrx3lz6pdu55gkgvczik.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with 
Triton project.` 2026-02-21T08:23:41.1062549Z [50s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:23:41.1063660Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', ''], maxnreg=32, num_sm_multiplier=64, num_stages=5, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:23:41.1064653Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:23:41.1064906Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:23:44.0956861Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 17.0 configs/s 2026-02-21T08:23:48.1268576Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 272.8 2026-02-21T08:23:48.1273513Z configs/s 2026-02-21T08:23:48.3921986Z [58s] Generation 1 complete: 2026-02-21T08:23:48.3926699Z error=2 2026-02-21T08:23:48.3931897Z ok=88 2026-02-21T08:23:48.3933962Z min=0.0266 2026-02-21T08:23:48.3934119Z mid=0.0389 2026-02-21T08:23:48.3934249Z max=0.1720 2026-02-21T08:23:48.3934387Z best={'block_sizes': [1, 8192], 2026-02-21T08:23:48.3934626Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:23:48.3934862Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:23:48.3935054Z 'num_stages': 7, 2026-02-21T08:23:48.3935202Z 'num_warps': 4, 2026-02-21T08:23:48.3935339Z 'pid_type': 'flat', 2026-02-21T08:23:48.3935499Z 'range_flattens': [None, True], 2026-02-21T08:23:48.3935674Z 'range_multi_buffers': [None, True], 2026-02-21T08:23:48.3935861Z 'range_num_stages': [0, 4], 2026-02-21T08:23:48.3936022Z 'range_unroll_factors': [0, 0], 2026-02-21T08:23:48.3936202Z 'range_warp_specializes': [None, True]} 2026-02-21T08:23:48.3936431Z [58s] 
Fitting surrogate: 190 points, 190 targets 2026-02-21T08:23:49.4434000Z [59s] Generation 2 starting: 76 neighbors, 5 active search path(s) 2026-02-21T08:24:16.7875983Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 0.7 configs/s 2026-02-21T08:24:21.2249583Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:24:21.2254059Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:24:21.2255567Z %cst = arith.constant dense<0.000000e+00> : tensor<16x256xf16> 2026-02-21T08:24:21.2255857Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:24:21.2256060Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:24:21.2256246Z %c592_i32 = arith.constant 592 : i32 2026-02-21T08:24:21.2256468Z %cst_0 = arith.constant dense<4224> : tensor<16x1xi32> 2026-02-21T08:24:21.2256756Z %cst_1 = arith.constant dense<0.000000e+00> : tensor<16x256xf32> 2026-02-21T08:24:21.2257050Z %cst_2 = arith.constant dense<0xFC00> : tensor<16x256xf16> 2026-02-21T08:24:21.2257294Z %cst_3 = arith.constant dense<4224> : tensor<256xi32> 2026-02-21T08:24:21.2257547Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<16xf32> 2026-02-21T08:24:21.2257795Z %cst_5 = arith.constant dense<0xFF800000> : tensor<16xf32> 2026-02-21T08:24:21.2258027Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:24:21.2258223Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:24:21.2258408Z %c4224_i32 = arith.constant 4224 : i32 2026-02-21T08:24:21.2258598Z %c4224_i64 = arith.constant 4224 : i64 2026-02-21T08:24:21.2258776Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:24:21.2259103Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4224_i32], [%c4224_i64, %c1_i64] : , > 2026-02-21T08:24:21.2259427Z %1 = tt.get_program_id x : i32 2026-02-21T08:24:21.2259642Z scf.for %arg2 = %1 to %c256_i32 step %c592_i32 : i32 { 2026-02-21T08:24:21.2260231Z %2 = arith.muli %arg2, %c16_i32 : i32 
2026-02-21T08:24:21.2260469Z %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:24:21.2260724Z %4 = tt.splat %2 : i32 -> tensor<16xi32> 2026-02-21T08:24:21.2260917Z %5 = arith.addi %4, %3 : tensor<16xi32> 2026-02-21T08:24:21.2261114Z %c4096_i32_6 = arith.constant 4096 : i32 2026-02-21T08:24:21.2261300Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:24:21.2262015Z %6:2 = scf.for %arg3 = %c0_i32 to %c4096_i32_6 step %c1024_i32 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<16xf32>, tensor<16xf32>) : i32 { 2026-02-21T08:24:21.2262448Z %60 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:24:21.2262707Z %61 = tt.splat %arg3 : i32 -> tensor<256xi32> 2026-02-21T08:24:21.2262924Z %62 = arith.addi %61, %60 : tensor<256xi32> 2026-02-21T08:24:21.2263144Z %63 = arith.cmpi slt, %62, %cst_3 : tensor<256xi32> 2026-02-21T08:24:21.2263630Z %64 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<16x256xf16> 2026-02-21T08:24:21.2263996Z %65 = tt.expand_dims %63 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:24:21.2264293Z %66 = tt.broadcast %65 : tensor<1x256xi1> -> tensor<16x256xi1> 2026-02-21T08:24:21.2264582Z %67 = arith.select %66, %64, %cst_2 : tensor<16x256xi1>, tensor<16x256xf16> 2026-02-21T08:24:21.2264863Z %68 = arith.extf %67 : tensor<16x256xf16> to tensor<16x256xf32> 2026-02-21T08:24:21.2265102Z %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({ 2026-02-21T08:24:21.2265293Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:21.2265488Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:24:21.2265698Z tt.reduce.return %174 : f32 2026-02-21T08:24:21.2265886Z }) : (tensor<16x256xf32>) -> tensor<16xf32> 2026-02-21T08:24:21.2266162Z %70 = arith.truncf %69 : tensor<16xf32> to tensor<16xf16> 2026-02-21T08:24:21.2266413Z %71 = arith.extf %70 : tensor<16xf16> to tensor<16xf32> 2026-02-21T08:24:21.2266641Z %72 = arith.cmpf ogt, %arg4, %71 : tensor<16xf32> 
2026-02-21T08:24:21.2266871Z %73 = arith.cmpf une, %arg4, %arg4 : tensor<16xf32> 2026-02-21T08:24:21.2267081Z %74 = arith.ori %72, %73 : tensor<16xi1> 2026-02-21T08:24:21.2267321Z %75 = arith.select %74, %arg4, %71 : tensor<16xi1>, tensor<16xf32> 2026-02-21T08:24:21.2267568Z %76 = arith.subf %arg4, %75 : tensor<16xf32> 2026-02-21T08:24:21.2267935Z %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T08:24:21.2268309Z %78 = arith.mulf %arg5, %77 : tensor<16xf32> 2026-02-21T08:24:21.2268558Z %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:24:21.2268857Z %80 = arith.extf %64 : tensor<16x256xf16> to tensor<16x256xf32> 2026-02-21T08:24:21.2269156Z %81 = tt.broadcast %79 : tensor<16x1xf32> -> tensor<16x256xf32> 2026-02-21T08:24:21.2269391Z %82 = arith.subf %80, %81 : tensor<16x256xf32> 2026-02-21T08:24:21.2269759Z %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32> 2026-02-21T08:24:21.2270169Z %84 = arith.select %66, %83, %cst_1 : tensor<16x256xi1>, tensor<16x256xf32> 2026-02-21T08:24:21.2270422Z %85 = "tt.reduce"(%84) <{axis = 1 : i32}> ({ 2026-02-21T08:24:21.2270612Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:21.2270801Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:24:21.2270995Z tt.reduce.return %174 : f32 2026-02-21T08:24:21.2271180Z }) : (tensor<16x256xf32>) -> tensor<16xf32> 2026-02-21T08:24:21.2271382Z %86 = arith.addf %78, %85 : tensor<16xf32> 2026-02-21T08:24:21.2271618Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:24:21.2271819Z %87 = arith.muli %c256_i32, %c1_i32 : i32 2026-02-21T08:24:21.2272099Z %88 = arith.addi %arg3, %87 : i32 2026-02-21T08:24:21.2272334Z %89 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:24:21.2272589Z %90 = tt.splat %88 : i32 -> tensor<256xi32> 
2026-02-21T08:24:21.2272782Z %91 = arith.addi %90, %89 : tensor<256xi32> 2026-02-21T08:24:21.2272998Z %92 = arith.cmpi slt, %91, %cst_3 : tensor<256xi32> 2026-02-21T08:24:21.2273295Z %93 = tt.descriptor_load %0[%2, %88] : !tt.tensordesc> -> tensor<16x256xf16> 2026-02-21T08:24:21.2273640Z %94 = tt.expand_dims %92 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:24:21.2273925Z %95 = tt.broadcast %94 : tensor<1x256xi1> -> tensor<16x256xi1> 2026-02-21T08:24:21.2274198Z %96 = arith.select %95, %93, %cst_2 : tensor<16x256xi1>, tensor<16x256xf16> 2026-02-21T08:24:21.2274543Z %97 = arith.extf %96 : tensor<16x256xf16> to tensor<16x256xf32> 2026-02-21T08:24:21.2274778Z %98 = "tt.reduce"(%97) <{axis = 1 : i32}> ({ 2026-02-21T08:24:21.2274969Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:21.2275151Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:24:21.2275344Z tt.reduce.return %174 : f32 2026-02-21T08:24:21.2275525Z }) : (tensor<16x256xf32>) -> tensor<16xf32> 2026-02-21T08:24:21.2275750Z %99 = arith.truncf %98 : tensor<16xf32> to tensor<16xf16> 2026-02-21T08:24:21.2275993Z %100 = arith.extf %99 : tensor<16xf16> to tensor<16xf32> 2026-02-21T08:24:21.2276221Z %101 = arith.cmpf ogt, %75, %100 : tensor<16xf32> 2026-02-21T08:24:21.2276443Z %102 = arith.cmpf une, %75, %75 : tensor<16xf32> 2026-02-21T08:24:21.2276647Z %103 = arith.ori %101, %102 : tensor<16xi1> 2026-02-21T08:24:21.2276883Z %104 = arith.select %103, %75, %100 : tensor<16xi1>, tensor<16xf32> 2026-02-21T08:24:21.2277126Z %105 = arith.subf %75, %104 : tensor<16xf32> 2026-02-21T08:24:21.2277514Z %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T08:24:21.2277907Z %107 = arith.mulf %86, %106 : tensor<16xf32> 2026-02-21T08:24:21.2278168Z %108 = tt.expand_dims %104 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:24:21.2278496Z %109 = arith.extf %93 : tensor<16x256xf16> to 
tensor<16x256xf32> 2026-02-21T08:24:21.2278767Z %110 = tt.broadcast %108 : tensor<16x1xf32> -> tensor<16x256xf32> 2026-02-21T08:24:21.2279028Z %111 = arith.subf %109, %110 : tensor<16x256xf32> 2026-02-21T08:24:21.2279428Z %112 = tt.extern_elementwise %111 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32> 2026-02-21T08:24:21.2279866Z %113 = arith.select %95, %112, %cst_1 : tensor<16x256xi1>, tensor<16x256xf32> 2026-02-21T08:24:21.2280143Z %114 = "tt.reduce"(%113) <{axis = 1 : i32}> ({ 2026-02-21T08:24:21.2280341Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:21.2280530Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:24:21.2280727Z tt.reduce.return %174 : f32 2026-02-21T08:24:21.2280920Z }) : (tensor<16x256xf32>) -> tensor<16xf32> 2026-02-21T08:24:21.2281130Z %115 = arith.addf %107, %114 : tensor<16xf32> 2026-02-21T08:24:21.2281324Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:24:21.2281519Z %116 = arith.muli %c256_i32, %c2_i32 : i32 2026-02-21T08:24:21.2281753Z %117 = arith.addi %arg3, %116 : i32 2026-02-21T08:24:21.2281992Z %118 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:24:21.2282242Z %119 = tt.splat %117 : i32 -> tensor<256xi32> 2026-02-21T08:24:21.2282454Z %120 = arith.addi %119, %118 : tensor<256xi32> 2026-02-21T08:24:21.2282669Z %121 = arith.cmpi slt, %120, %cst_3 : tensor<256xi32> 2026-02-21T08:24:21.2282987Z %122 = tt.descriptor_load %0[%2, %117] : !tt.tensordesc> -> tensor<16x256xf16> 2026-02-21T08:24:21.2283404Z %123 = tt.expand_dims %121 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:24:21.2283698Z %124 = tt.broadcast %123 : tensor<1x256xi1> -> tensor<16x256xi1> 2026-02-21T08:24:21.2283982Z %125 = arith.select %124, %122, %cst_2 : tensor<16x256xi1>, tensor<16x256xf16> 2026-02-21T08:24:21.2284267Z %126 = arith.extf %125 : tensor<16x256xf16> to tensor<16x256xf32> 2026-02-21T08:24:21.2284510Z %127 = "tt.reduce"(%126) <{axis = 1 : 
i32}> ({ 2026-02-21T08:24:21.2284705Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:21.2284884Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:24:21.2285078Z tt.reduce.return %174 : f32 2026-02-21T08:24:21.2285260Z }) : (tensor<16x256xf32>) -> tensor<16xf32> 2026-02-21T08:24:21.2285493Z %128 = arith.truncf %127 : tensor<16xf32> to tensor<16xf16> 2026-02-21T08:24:21.2285793Z %129 = arith.extf %128 : tensor<16xf16> to tensor<16xf32> 2026-02-21T08:24:21.2286034Z %130 = arith.cmpf ogt, %104, %129 : tensor<16xf32> 2026-02-21T08:24:21.2286256Z %131 = arith.cmpf une, %104, %104 : tensor<16xf32> 2026-02-21T08:24:21.2286459Z %132 = arith.ori %130, %131 : tensor<16xi1> 2026-02-21T08:24:21.2286698Z %133 = arith.select %132, %104, %129 : tensor<16xi1>, tensor<16xf32> 2026-02-21T08:24:21.2286940Z %134 = arith.subf %104, %133 : tensor<16xf32> 2026-02-21T08:24:21.2287299Z %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T08:24:21.2287661Z %136 = arith.mulf %115, %135 : tensor<16xf32> 2026-02-21T08:24:21.2287920Z %137 = tt.expand_dims %133 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:24:21.2288214Z %138 = arith.extf %122 : tensor<16x256xf16> to tensor<16x256xf32> 2026-02-21T08:24:21.2288480Z %139 = tt.broadcast %137 : tensor<16x1xf32> -> tensor<16x256xf32> 2026-02-21T08:24:21.2288734Z %140 = arith.subf %138, %139 : tensor<16x256xf32> 2026-02-21T08:24:21.2289099Z %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32> 2026-02-21T08:24:21.2289520Z %142 = arith.select %124, %141, %cst_1 : tensor<16x256xi1>, tensor<16x256xf32> 2026-02-21T08:24:21.2289782Z %143 = "tt.reduce"(%142) <{axis = 1 : i32}> ({ 2026-02-21T08:24:21.2289970Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:21.2290152Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:24:21.2290336Z tt.reduce.return %174 
: f32 2026-02-21T08:24:21.2290523Z }) : (tensor<16x256xf32>) -> tensor<16xf32> 2026-02-21T08:24:21.2290717Z %144 = arith.addf %136, %143 : tensor<16xf32> 2026-02-21T08:24:21.2290915Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:24:21.2291104Z %145 = arith.muli %c256_i32, %c3_i32 : i32 2026-02-21T08:24:21.2291306Z %146 = arith.addi %arg3, %145 : i32 2026-02-21T08:24:21.2291581Z %147 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:24:21.2291835Z %148 = tt.splat %146 : i32 -> tensor<256xi32> 2026-02-21T08:24:21.2292043Z %149 = arith.addi %148, %147 : tensor<256xi32> 2026-02-21T08:24:21.2292259Z %150 = arith.cmpi slt, %149, %cst_3 : tensor<256xi32> 2026-02-21T08:24:21.2292572Z %151 = tt.descriptor_load %0[%2, %146] : !tt.tensordesc> -> tensor<16x256xf16> 2026-02-21T08:24:21.2292916Z %152 = tt.expand_dims %150 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:24:21.2293222Z %153 = tt.broadcast %152 : tensor<1x256xi1> -> tensor<16x256xi1> 2026-02-21T08:24:21.2293516Z %154 = arith.select %153, %151, %cst_2 : tensor<16x256xi1>, tensor<16x256xf16> 2026-02-21T08:24:21.2293811Z %155 = arith.extf %154 : tensor<16x256xf16> to tensor<16x256xf32> 2026-02-21T08:24:21.2294153Z %156 = "tt.reduce"(%155) <{axis = 1 : i32}> ({ 2026-02-21T08:24:21.2294340Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:21.2294528Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:24:21.2294715Z tt.reduce.return %174 : f32 2026-02-21T08:24:21.2294903Z }) : (tensor<16x256xf32>) -> tensor<16xf32> 2026-02-21T08:24:21.2295137Z %157 = arith.truncf %156 : tensor<16xf32> to tensor<16xf16> 2026-02-21T08:24:21.2295387Z %158 = arith.extf %157 : tensor<16xf16> to tensor<16xf32> 2026-02-21T08:24:21.2295629Z %159 = arith.cmpf ogt, %133, %158 : tensor<16xf32> 2026-02-21T08:24:21.2295845Z %160 = arith.cmpf une, %133, %133 : tensor<16xf32> 2026-02-21T08:24:21.2296058Z %161 = arith.ori %159, %160 : tensor<16xi1> 2026-02-21T08:24:21.2296288Z %162 = arith.select 
%161, %133, %158 : tensor<16xi1>, tensor<16xf32> 2026-02-21T08:24:21.2296595Z %163 = arith.subf %133, %162 : tensor<16xf32> 2026-02-21T08:24:21.2296963Z %164 = tt.extern_elementwise %163 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T08:24:21.2297318Z %165 = arith.mulf %144, %164 : tensor<16xf32> 2026-02-21T08:24:21.2297575Z %166 = tt.expand_dims %162 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:24:21.2297865Z %167 = arith.extf %151 : tensor<16x256xf16> to tensor<16x256xf32> 2026-02-21T08:24:21.2298136Z %168 = tt.broadcast %166 : tensor<16x1xf32> -> tensor<16x256xf32> 2026-02-21T08:24:21.2298390Z %169 = arith.subf %167, %168 : tensor<16x256xf32> 2026-02-21T08:24:21.2298776Z %170 = tt.extern_elementwise %169 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32> 2026-02-21T08:24:21.2299213Z %171 = arith.select %153, %170, %cst_1 : tensor<16x256xi1>, tensor<16x256xf32> 2026-02-21T08:24:21.2299484Z %172 = "tt.reduce"(%171) <{axis = 1 : i32}> ({ 2026-02-21T08:24:21.2299690Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:21.2299875Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:24:21.2300076Z tt.reduce.return %174 : f32 2026-02-21T08:24:21.2300271Z }) : (tensor<16x256xf32>) -> tensor<16xf32> 2026-02-21T08:24:21.2300477Z %173 = arith.addf %165, %172 : tensor<16xf32> 2026-02-21T08:24:21.2300706Z scf.yield %162, %173 : tensor<16xf32>, tensor<16xf32> 2026-02-21T08:24:21.2300928Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:24:21.2301179Z %7 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:24:21.2301446Z %8 = tt.splat %c4096_i32_6 : i32 -> tensor<256xi32> 2026-02-21T08:24:21.2301696Z %9 = arith.addi %8, %7 : tensor<256xi32> 2026-02-21T08:24:21.2301918Z %10 = arith.cmpi slt, %9, %cst_3 : tensor<256xi32> 2026-02-21T08:24:21.2302247Z %11 = tt.descriptor_load %0[%2, %c4096_i32_6] : 
!tt.tensordesc> -> tensor<16x256xf16> 2026-02-21T08:24:21.2302624Z %12 = tt.expand_dims %10 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:24:21.2302920Z %13 = tt.broadcast %12 : tensor<1x256xi1> -> tensor<16x256xi1> 2026-02-21T08:24:21.2303206Z %14 = arith.select %13, %11, %cst_2 : tensor<16x256xi1>, tensor<16x256xf16> 2026-02-21T08:24:21.2303488Z %15 = arith.extf %14 : tensor<16x256xf16> to tensor<16x256xf32> 2026-02-21T08:24:21.2303733Z %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({ 2026-02-21T08:24:21.2303942Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:24:21.2304130Z %60 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:24:21.2304336Z tt.reduce.return %60 : f32 2026-02-21T08:24:21.2304527Z }) : (tensor<16x256xf32>) -> tensor<16xf32> 2026-02-21T08:24:21.2304764Z %17 = arith.truncf %16 : tensor<16xf32> to tensor<16xf16> 2026-02-21T08:24:21.2305016Z %18 = arith.extf %17 : tensor<16xf16> to tensor<16xf32> 2026-02-21T08:24:21.2305316Z %19 = arith.cmpf ogt, %6#0, %18 : tensor<16xf32> 2026-02-21T08:24:21.2305545Z %20 = arith.cmpf une, %6#0, %6#0 : tensor<16xf32> 2026-02-21T08:24:21.2305754Z %21 = arith.ori %19, %20 : tensor<16xi1> 2026-02-21T08:24:21.2306002Z %22 = arith.select %21, %6#0, %18 : tensor<16xi1>, tensor<16xf32> 2026-02-21T08:24:21.2306249Z %23 = arith.subf %6#0, %22 : tensor<16xf32> 2026-02-21T08:24:21.2306628Z %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T08:24:21.2306976Z %25 = arith.mulf %6#1, %24 : tensor<16xf32> 2026-02-21T08:24:21.2307227Z %26 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:24:21.2307514Z %27 = arith.extf %11 : tensor<16x256xf16> to tensor<16x256xf32> 2026-02-21T08:24:21.2307768Z %28 = tt.broadcast %26 : tensor<16x1xf32> -> tensor<16x256xf32> 2026-02-21T08:24:21.2308052Z %29 = arith.subf %27, %28 : tensor<16x256xf32> 2026-02-21T08:24:21.2308410Z %30 = tt.extern_elementwise %29 
{libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32> 2026-02-21T08:24:21.2308810Z %31 = arith.select %13, %30, %cst_1 : tensor<16x256xi1>, tensor<16x256xf32> 2026-02-21T08:24:21.2309059Z %32 = "tt.reduce"(%31) <{axis = 1 : i32}> ({ 2026-02-21T08:24:21.2309244Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:24:21.2309426Z %60 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:24:21.2309607Z tt.reduce.return %60 : f32 2026-02-21T08:24:21.2309791Z }) : (tensor<16x256xf32>) -> tensor<16xf32> 2026-02-21T08:24:21.2309981Z %33 = arith.addf %25, %32 : tensor<16xf32> 2026-02-21T08:24:21.2310180Z %c4096_i32_7 = arith.constant 4096 : i32 2026-02-21T08:24:21.2310367Z %c1024_i32_8 = arith.constant 1024 : i32 2026-02-21T08:24:21.2310604Z scf.for %arg3 = %c0_i32 to %c4096_i32_7 step %c1024_i32_8 : i32 { 2026-02-21T08:24:21.2310889Z %60 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:24:21.2311139Z %61 = tt.splat %arg3 : i32 -> tensor<256xi32> 2026-02-21T08:24:21.2311348Z %62 = arith.addi %61, %60 : tensor<256xi32> 2026-02-21T08:24:21.2311598Z %63 = arith.cmpi slt, %62, %cst_3 : tensor<256xi32> 2026-02-21T08:24:21.2311870Z %64 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T08:24:21.2312131Z %65 = arith.muli %64, %cst_0 : tensor<16x1xi32> 2026-02-21T08:24:21.2312395Z %66 = tt.expand_dims %62 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T08:24:21.2312689Z %67 = tt.broadcast %65 : tensor<16x1xi32> -> tensor<16x256xi32> 2026-02-21T08:24:21.2312947Z %68 = tt.broadcast %66 : tensor<1x256xi32> -> tensor<16x256xi32> 2026-02-21T08:24:21.2313185Z %69 = arith.addi %67, %68 : tensor<16x256xi32> 2026-02-21T08:24:21.2313423Z %70 = tt.splat %arg0 : !tt.ptr -> tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2313709Z %71 = tt.addptr %70, %69 : tensor<16x256x!tt.ptr>, tensor<16x256xi32> 2026-02-21T08:24:21.2314008Z %72 = tt.expand_dims %63 {axis = 0 : i32} : 
tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:24:21.2314287Z %73 = tt.broadcast %72 : tensor<1x256xi1> -> tensor<16x256xi1> 2026-02-21T08:24:21.2314589Z %74 = tt.load %71, %73, %cst evictionPolicy = evict_first : tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2314912Z %75 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:24:21.2315199Z %76 = arith.extf %74 : tensor<16x256xf16> to tensor<16x256xf32> 2026-02-21T08:24:21.2315455Z %77 = tt.broadcast %75 : tensor<16x1xf32> -> tensor<16x256xf32> 2026-02-21T08:24:21.2315694Z %78 = arith.subf %76, %77 : tensor<16x256xf32> 2026-02-21T08:24:21.2316065Z %79 = tt.extern_elementwise %78 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32> 2026-02-21T08:24:21.2316518Z %80 = tt.expand_dims %33 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:24:21.2316805Z %81 = tt.broadcast %80 : tensor<16x1xf32> -> tensor<16x256xf32> 2026-02-21T08:24:21.2317034Z %82 = arith.divf %79, %81 : tensor<16x256xf32> 2026-02-21T08:24:21.2317270Z %83 = arith.truncf %82 : tensor<16x256xf32> to tensor<16x256xf16> 2026-02-21T08:24:21.2317546Z %84 = tt.splat %arg1 : !tt.ptr -> tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2317819Z %85 = tt.addptr %84, %69 : tensor<16x256x!tt.ptr>, tensor<16x256xi32> 2026-02-21T08:24:21.2318086Z tt.store %85, %83, %73 : tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2318299Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:24:21.2318495Z %86 = arith.muli %c256_i32, %c1_i32 : i32 2026-02-21T08:24:21.2318722Z %87 = arith.addi %arg3, %86 : i32 2026-02-21T08:24:21.2319011Z %88 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:24:21.2319260Z %89 = tt.splat %87 : i32 -> tensor<256xi32> 2026-02-21T08:24:21.2319479Z %90 = arith.addi %89, %88 : tensor<256xi32> 2026-02-21T08:24:21.2319696Z %91 = arith.cmpi slt, %90, %cst_3 : tensor<256xi32> 2026-02-21T08:24:21.2319970Z %92 = tt.expand_dims %5 {axis = 
1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T08:24:21.2320237Z %93 = arith.muli %92, %cst_0 : tensor<16x1xi32> 2026-02-21T08:24:21.2320502Z %94 = tt.expand_dims %90 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T08:24:21.2320789Z %95 = tt.broadcast %93 : tensor<16x1xi32> -> tensor<16x256xi32> 2026-02-21T08:24:21.2321042Z %96 = tt.broadcast %94 : tensor<1x256xi32> -> tensor<16x256xi32> 2026-02-21T08:24:21.2321280Z %97 = arith.addi %95, %96 : tensor<16x256xi32> 2026-02-21T08:24:21.2321513Z %98 = tt.splat %arg0 : !tt.ptr -> tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2321834Z %99 = tt.addptr %98, %97 : tensor<16x256x!tt.ptr>, tensor<16x256xi32> 2026-02-21T08:24:21.2322128Z %100 = tt.expand_dims %91 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:24:21.2322427Z %101 = tt.broadcast %100 : tensor<1x256xi1> -> tensor<16x256xi1> 2026-02-21T08:24:21.2322741Z %102 = tt.load %99, %101, %cst evictionPolicy = evict_first : tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2323071Z %103 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:24:21.2323366Z %104 = arith.extf %102 : tensor<16x256xf16> to tensor<16x256xf32> 2026-02-21T08:24:21.2323632Z %105 = tt.broadcast %103 : tensor<16x1xf32> -> tensor<16x256xf32> 2026-02-21T08:24:21.2323884Z %106 = arith.subf %104, %105 : tensor<16x256xf32> 2026-02-21T08:24:21.2324265Z %107 = tt.extern_elementwise %106 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32> 2026-02-21T08:24:21.2324690Z %108 = tt.expand_dims %33 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:24:21.2324988Z %109 = tt.broadcast %108 : tensor<16x1xf32> -> tensor<16x256xf32> 2026-02-21T08:24:21.2325230Z %110 = arith.divf %107, %109 : tensor<16x256xf32> 2026-02-21T08:24:21.2325477Z %111 = arith.truncf %110 : tensor<16x256xf32> to tensor<16x256xf16> 2026-02-21T08:24:21.2325750Z %112 = tt.splat %arg1 : !tt.ptr -> 
tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2326040Z %113 = tt.addptr %112, %97 : tensor<16x256x!tt.ptr>, tensor<16x256xi32> 2026-02-21T08:24:21.2326311Z tt.store %113, %111, %101 : tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2326522Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:24:21.2326714Z %114 = arith.muli %c256_i32, %c2_i32 : i32 2026-02-21T08:24:21.2326902Z %115 = arith.addi %arg3, %114 : i32 2026-02-21T08:24:21.2327141Z %116 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:24:21.2327440Z %117 = tt.splat %115 : i32 -> tensor<256xi32> 2026-02-21T08:24:21.2327647Z %118 = arith.addi %117, %116 : tensor<256xi32> 2026-02-21T08:24:21.2327868Z %119 = arith.cmpi slt, %118, %cst_3 : tensor<256xi32> 2026-02-21T08:24:21.2328125Z %120 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T08:24:21.2328396Z %121 = arith.muli %120, %cst_0 : tensor<16x1xi32> 2026-02-21T08:24:21.2328653Z %122 = tt.expand_dims %118 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T08:24:21.2328950Z %123 = tt.broadcast %121 : tensor<16x1xi32> -> tensor<16x256xi32> 2026-02-21T08:24:21.2329222Z %124 = tt.broadcast %122 : tensor<1x256xi32> -> tensor<16x256xi32> 2026-02-21T08:24:21.2329461Z %125 = arith.addi %123, %124 : tensor<16x256xi32> 2026-02-21T08:24:21.2329764Z %126 = tt.splat %arg0 : !tt.ptr -> tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2330053Z %127 = tt.addptr %126, %125 : tensor<16x256x!tt.ptr>, tensor<16x256xi32> 2026-02-21T08:24:21.2330362Z %128 = tt.expand_dims %119 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:24:21.2330652Z %129 = tt.broadcast %128 : tensor<1x256xi1> -> tensor<16x256xi1> 2026-02-21T08:24:21.2330970Z %130 = tt.load %127, %129, %cst evictionPolicy = evict_first : tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2331311Z %131 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:24:21.2331645Z %132 = arith.extf %130 : tensor<16x256xf16> 
to tensor<16x256xf32> 2026-02-21T08:24:21.2331920Z %133 = tt.broadcast %131 : tensor<16x1xf32> -> tensor<16x256xf32> 2026-02-21T08:24:21.2332160Z %134 = arith.subf %132, %133 : tensor<16x256xf32> 2026-02-21T08:24:21.2332543Z %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32> 2026-02-21T08:24:21.2332973Z %136 = tt.expand_dims %33 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:24:21.2333267Z %137 = tt.broadcast %136 : tensor<16x1xf32> -> tensor<16x256xf32> 2026-02-21T08:24:21.2333514Z %138 = arith.divf %135, %137 : tensor<16x256xf32> 2026-02-21T08:24:21.2333754Z %139 = arith.truncf %138 : tensor<16x256xf32> to tensor<16x256xf16> 2026-02-21T08:24:21.2334030Z %140 = tt.splat %arg1 : !tt.ptr -> tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2334309Z %141 = tt.addptr %140, %125 : tensor<16x256x!tt.ptr>, tensor<16x256xi32> 2026-02-21T08:24:21.2334586Z tt.store %141, %139, %129 : tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2334805Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:24:21.2334993Z %142 = arith.muli %c256_i32, %c3_i32 : i32 2026-02-21T08:24:21.2335191Z %143 = arith.addi %arg3, %142 : i32 2026-02-21T08:24:21.2335420Z %144 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:24:21.2335678Z %145 = tt.splat %143 : i32 -> tensor<256xi32> 2026-02-21T08:24:21.2335881Z %146 = arith.addi %145, %144 : tensor<256xi32> 2026-02-21T08:24:21.2336101Z %147 = arith.cmpi slt, %146, %cst_3 : tensor<256xi32> 2026-02-21T08:24:21.2336366Z %148 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T08:24:21.2336627Z %149 = arith.muli %148, %cst_0 : tensor<16x1xi32> 2026-02-21T08:24:21.2336888Z %150 = tt.expand_dims %146 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T08:24:21.2337180Z %151 = tt.broadcast %149 : tensor<16x1xi32> -> tensor<16x256xi32> 2026-02-21T08:24:21.2337450Z %152 = tt.broadcast 
%150 : tensor<1x256xi32> -> tensor<16x256xi32> 2026-02-21T08:24:21.2337696Z %153 = arith.addi %151, %152 : tensor<16x256xi32> 2026-02-21T08:24:21.2337935Z %154 = tt.splat %arg0 : !tt.ptr -> tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2338302Z %155 = tt.addptr %154, %153 : tensor<16x256x!tt.ptr>, tensor<16x256xi32> 2026-02-21T08:24:21.2338602Z %156 = tt.expand_dims %147 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:24:21.2338894Z %157 = tt.broadcast %156 : tensor<1x256xi1> -> tensor<16x256xi1> 2026-02-21T08:24:21.2339202Z %158 = tt.load %155, %157, %cst evictionPolicy = evict_first : tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2339539Z %159 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:24:21.2339835Z %160 = arith.extf %158 : tensor<16x256xf16> to tensor<16x256xf32> 2026-02-21T08:24:21.2340109Z %161 = tt.broadcast %159 : tensor<16x1xf32> -> tensor<16x256xf32> 2026-02-21T08:24:21.2340368Z %162 = arith.subf %160, %161 : tensor<16x256xf32> 2026-02-21T08:24:21.2340809Z %163 = tt.extern_elementwise %162 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32> 2026-02-21T08:24:21.2341249Z %164 = tt.expand_dims %33 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:24:21.2341604Z %165 = tt.broadcast %164 : tensor<16x1xf32> -> tensor<16x256xf32> 2026-02-21T08:24:21.2341858Z %166 = arith.divf %163, %165 : tensor<16x256xf32> 2026-02-21T08:24:21.2342114Z %167 = arith.truncf %166 : tensor<16x256xf32> to tensor<16x256xf16> 2026-02-21T08:24:21.2342397Z %168 = tt.splat %arg1 : !tt.ptr -> tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2342699Z %169 = tt.addptr %168, %153 : tensor<16x256x!tt.ptr>, tensor<16x256xi32> 2026-02-21T08:24:21.2342981Z tt.store %169, %167, %157 : tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2343220Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:24:21.2343474Z %34 = tt.make_range {end = 256 : i32, start = 0 : i32} : 
tensor<256xi32> 2026-02-21T08:24:21.2343745Z %35 = tt.splat %c4096_i32_7 : i32 -> tensor<256xi32> 2026-02-21T08:24:21.2343972Z %36 = arith.addi %35, %34 : tensor<256xi32> 2026-02-21T08:24:21.2344186Z %37 = arith.cmpi slt, %36, %cst_3 : tensor<256xi32> 2026-02-21T08:24:21.2344458Z %38 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T08:24:21.2344725Z %39 = arith.muli %38, %cst_0 : tensor<16x1xi32> 2026-02-21T08:24:21.2344994Z %40 = tt.expand_dims %36 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T08:24:21.2345304Z %41 = tt.broadcast %39 : tensor<16x1xi32> -> tensor<16x256xi32> 2026-02-21T08:24:21.2345578Z %42 = tt.broadcast %40 : tensor<1x256xi32> -> tensor<16x256xi32> 2026-02-21T08:24:21.2345826Z %43 = arith.addi %41, %42 : tensor<16x256xi32> 2026-02-21T08:24:21.2346073Z %44 = tt.splat %arg0 : !tt.ptr -> tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2346365Z %45 = tt.addptr %44, %43 : tensor<16x256x!tt.ptr>, tensor<16x256xi32> 2026-02-21T08:24:21.2346688Z %46 = tt.expand_dims %37 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:24:21.2346984Z %47 = tt.broadcast %46 : tensor<1x256xi1> -> tensor<16x256xi1> 2026-02-21T08:24:21.2347302Z %48 = tt.load %45, %47, %cst evictionPolicy = evict_first : tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2347634Z %49 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:24:21.2347933Z %50 = arith.extf %48 : tensor<16x256xf16> to tensor<16x256xf32> 2026-02-21T08:24:21.2348185Z %51 = tt.broadcast %49 : tensor<16x1xf32> -> tensor<16x256xf32> 2026-02-21T08:24:21.2348422Z %52 = arith.subf %50, %51 : tensor<16x256xf32> 2026-02-21T08:24:21.2348789Z %53 = tt.extern_elementwise %52 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32> 2026-02-21T08:24:21.2349189Z %54 = tt.expand_dims %33 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T08:24:21.2349524Z %55 = 
tt.broadcast %54 : tensor<16x1xf32> -> tensor<16x256xf32> 2026-02-21T08:24:21.2349753Z %56 = arith.divf %53, %55 : tensor<16x256xf32> 2026-02-21T08:24:21.2349987Z %57 = arith.truncf %56 : tensor<16x256xf32> to tensor<16x256xf16> 2026-02-21T08:24:21.2350255Z %58 = tt.splat %arg1 : !tt.ptr -> tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2350524Z %59 = tt.addptr %58, %43 : tensor<16x256x!tt.ptr>, tensor<16x256xi32> 2026-02-21T08:24:21.2350789Z tt.store %59, %57, %47 : tensor<16x256x!tt.ptr> 2026-02-21T08:24:21.2350997Z } {tt.warp_specialize} 2026-02-21T08:24:21.2351159Z tt.return 2026-02-21T08:24:21.2351284Z } 2026-02-21T08:24:21.2351409Z } 2026-02-21T08:24:21.2351477Z 2026-02-21T08:24:21.2351526Z {-# 2026-02-21T08:24:21.2351707Z external_resources: { 2026-02-21T08:24:21.2351870Z mlir_reproducer: { 2026-02-21T08:24:21.2356240Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false 
top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:24:21.2360615Z disable_threading: false, 2026-02-21T08:24:21.2360786Z verify_each: true 2026-02-21T08:24:21.2360930Z } 2026-02-21T08:24:21.2361056Z } 2026-02-21T08:24:21.2361168Z #-} 2026-02-21T08:24:21.2361636Z /tmp/torchinductor_root/y6/cy6zcomat6qj4f462goe36wtmj6q2ss5acr34lcigmttspyclgjf.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:24:21.2362833Z /tmp/torchinductor_root/y6/cy6zcomat6qj4f462goe36wtmj6q2ss5acr34lcigmttspyclgjf.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:24:21.2363797Z [91s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:24:21.2364859Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 256], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=32, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:24:21.2365817Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:24:21.2366120Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:24:21.5928923Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 16.6 configs/s 2026-02-21T08:24:25.0448346Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 294.1 2026-02-21T08:24:25.3202480Z [95s] Generation 2 complete: 2026-02-21T08:24:25.3202808Z configs/s 2026-02-21T08:24:25.3203739Z error=1 2026-02-21T08:24:25.3203892Z ok=80 2026-02-21T08:24:25.3204026Z min=0.0266 2026-02-21T08:24:25.3204173Z mid=0.0389 2026-02-21T08:24:25.3204305Z max=0.2602 2026-02-21T08:24:25.3204467Z best={'block_sizes': [1, 8192], 2026-02-21T08:24:25.3204769Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:24:25.3209613Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:24:25.3213474Z 'num_stages': 7, 2026-02-21T08:24:25.3218473Z 'num_warps': 4, 2026-02-21T08:24:25.3219514Z 'pid_type': 'flat', 2026-02-21T08:24:25.3219737Z 'range_flattens': [None, True], 2026-02-21T08:24:25.3219964Z 'range_multi_buffers': [None, True], 2026-02-21T08:24:25.3220183Z 'range_num_stages': [0, 4], 2026-02-21T08:24:25.3220389Z 'range_unroll_factors': [0, 0], 2026-02-21T08:24:25.3220612Z 'range_warp_specializes': [None, True]} 2026-02-21T08:24:25.3224953Z [95s] Fitting surrogate: 271 points, 271 targets 2026-02-21T08:24:26.4773610Z [96s] Generation 3 starting: 73 neighbors, 5 active search path(s) 
2026-02-21T08:25:00.3980243Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 0.9 configs/s 2026-02-21T08:25:05.1009789Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 16.3 configs/s 2026-02-21T08:25:07.6613752Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 489.7 2026-02-21T08:25:07.6615136Z configs/s 2026-02-21T08:25:07.8429670Z [137s] Generation 3 complete: 2026-02-21T08:25:07.8434996Z ok=79 2026-02-21T08:25:07.8440518Z min=0.0205 2026-02-21T08:25:07.8442627Z mid=0.0348 2026-02-21T08:25:07.8442794Z max=0.2365 2026-02-21T08:25:07.8442954Z best={'block_sizes': [1, 8192], 2026-02-21T08:25:07.8443234Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:25:07.8443509Z 'load_eviction_policies': ['', ''], 2026-02-21T08:25:07.8443711Z 'num_sm_multiplier': 32, 2026-02-21T08:25:07.8443879Z 'num_stages': 6, 2026-02-21T08:25:07.8444057Z 'num_warps': 1, 2026-02-21T08:25:07.8444233Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:25:07.8444435Z 'range_flattens': [True, True], 2026-02-21T08:25:07.8444629Z 'range_multi_buffers': [False, None], 2026-02-21T08:25:07.8444819Z 'range_num_stages': [3, 1], 2026-02-21T08:25:07.8445003Z 'range_unroll_factors': [0, 2], 2026-02-21T08:25:07.8445194Z 'range_warp_specializes': [True, None]} 2026-02-21T08:25:07.8449747Z [137s] Fitting surrogate: 350 points, 350 targets 2026-02-21T08:25:09.0011020Z [138s] Generation 4 starting: 77 neighbors, 5 active search path(s) 2026-02-21T08:25:17.0341234Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80/80 10.1 configs/s 2026-02-21T08:25:21.0924122Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:25:21.0929397Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:25:21.0930081Z %cst = arith.constant dense<0.000000e+00> : tensor<8x512xf16> 2026-02-21T08:25:21.0930408Z 
%c512_i32 = arith.constant 512 : i32 2026-02-21T08:25:21.0930637Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:25:21.0930848Z %c592_i32 = arith.constant 592 : i32 2026-02-21T08:25:21.0931116Z %cst_0 = arith.constant dense<4224> : tensor<8x1xi32> 2026-02-21T08:25:21.0931435Z %cst_1 = arith.constant dense<0.000000e+00> : tensor<8x512xf32> 2026-02-21T08:25:21.0932063Z %cst_2 = arith.constant dense<0xFC00> : tensor<8x512xf16> 2026-02-21T08:25:21.0932824Z %cst_3 = arith.constant dense<4224> : tensor<512xi32> 2026-02-21T08:25:21.0933123Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32> 2026-02-21T08:25:21.0933436Z %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32> 2026-02-21T08:25:21.0933701Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:25:21.0933939Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:25:21.0934159Z %c4224_i32 = arith.constant 4224 : i32 2026-02-21T08:25:21.0934378Z %c4224_i64 = arith.constant 4224 : i64 2026-02-21T08:25:21.0934583Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:25:21.0934971Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4224_i32], [%c4224_i64, %c1_i64] : , > 2026-02-21T08:25:21.0935375Z %1 = tt.get_program_id x : i32 2026-02-21T08:25:21.0935629Z scf.for %arg2 = %1 to %c512_i32 step %c592_i32 : i32 { 2026-02-21T08:25:21.0936067Z %2 = arith.muli %arg2, %c8_i32 : i32 2026-02-21T08:25:21.0936366Z %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:25:21.0936684Z %4 = tt.splat %2 : i32 -> tensor<8xi32> 2026-02-21T08:25:21.0936920Z %5 = arith.addi %4, %3 : tensor<8xi32> 2026-02-21T08:25:21.0937160Z %c4096_i32_6 = arith.constant 4096 : i32 2026-02-21T08:25:21.0937390Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:25:21.0937871Z %6:2 = scf.for %arg3 = %c0_i32 to %c4096_i32_6 step %c2048_i32 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<8xf32>, tensor<8xf32>) : i32 { 2026-02-21T08:25:21.0938409Z %60 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 
2026-02-21T08:25:21.0938734Z %61 = tt.splat %arg3 : i32 -> tensor<512xi32> 2026-02-21T08:25:21.0939005Z %62 = arith.addi %61, %60 : tensor<512xi32> 2026-02-21T08:25:21.0939274Z %63 = arith.cmpi slt, %62, %cst_3 : tensor<512xi32> 2026-02-21T08:25:21.0939666Z %64 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:25:21.0940105Z %65 = tt.expand_dims %63 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:21.0940466Z %66 = tt.broadcast %65 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:25:21.0940819Z %67 = arith.select %66, %64, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T08:25:21.0941162Z %68 = arith.extf %67 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:25:21.0941457Z %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({ 2026-02-21T08:25:21.0941785Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:21.0942027Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:25:21.0942271Z tt.reduce.return %174 : f32 2026-02-21T08:25:21.0942501Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:25:21.0942776Z %70 = arith.truncf %69 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:25:21.0943070Z %71 = arith.extf %70 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:25:21.0943364Z %72 = arith.cmpf ogt, %arg4, %71 : tensor<8xf32> 2026-02-21T08:25:21.0943639Z %73 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32> 2026-02-21T08:25:21.0943910Z %74 = arith.ori %72, %73 : tensor<8xi1> 2026-02-21T08:25:21.0944201Z %75 = arith.select %74, %arg4, %71 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:25:21.0944487Z %76 = arith.subf %arg4, %75 : tensor<8xf32> 2026-02-21T08:25:21.0944937Z %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:25:21.0945374Z %78 = arith.mulf %arg5, %77 : tensor<8xf32> 2026-02-21T08:25:21.0945684Z %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:21.0946029Z %80 = 
arith.extf %64 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:25:21.0946346Z %81 = tt.broadcast %79 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:25:21.0946742Z %82 = arith.subf %80, %81 : tensor<8x512xf32> 2026-02-21T08:25:21.0947184Z %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:25:21.0947694Z %84 = arith.select %66, %83, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T08:25:21.0948004Z %85 = "tt.reduce"(%84) <{axis = 1 : i32}> ({ 2026-02-21T08:25:21.0948251Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:21.0948486Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:25:21.0948718Z tt.reduce.return %174 : f32 2026-02-21T08:25:21.0948954Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:25:21.0949190Z %86 = arith.addf %78, %85 : tensor<8xf32> 2026-02-21T08:25:21.0949432Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:25:21.0949658Z %87 = arith.muli %c512_i32, %c1_i32 : i32 2026-02-21T08:25:21.0950017Z %88 = arith.addi %arg3, %87 : i32 2026-02-21T08:25:21.0950311Z %89 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:21.0950624Z %90 = tt.splat %88 : i32 -> tensor<512xi32> 2026-02-21T08:25:21.0950875Z %91 = arith.addi %90, %89 : tensor<512xi32> 2026-02-21T08:25:21.0951133Z %92 = arith.cmpi slt, %91, %cst_3 : tensor<512xi32> 2026-02-21T08:25:21.0951504Z %93 = tt.descriptor_load %0[%2, %88] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:25:21.0951974Z %94 = tt.expand_dims %92 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:21.0952341Z %95 = tt.broadcast %94 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:25:21.0952682Z %96 = arith.select %95, %93, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T08:25:21.0953017Z %97 = arith.extf %96 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:25:21.0953304Z %98 = "tt.reduce"(%97) <{axis = 1 : i32}> ({ 
2026-02-21T08:25:21.0953543Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:21.0953779Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:25:21.0954017Z tt.reduce.return %174 : f32 2026-02-21T08:25:21.0954259Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:25:21.0954527Z %99 = arith.truncf %98 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:25:21.0954834Z %100 = arith.extf %99 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:25:21.0955124Z %101 = arith.cmpf ogt, %75, %100 : tensor<8xf32> 2026-02-21T08:25:21.0955390Z %102 = arith.cmpf une, %75, %75 : tensor<8xf32> 2026-02-21T08:25:21.0955652Z %103 = arith.ori %101, %102 : tensor<8xi1> 2026-02-21T08:25:21.0955940Z %104 = arith.select %103, %75, %100 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:25:21.0956234Z %105 = arith.subf %75, %104 : tensor<8xf32> 2026-02-21T08:25:21.0956682Z %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:25:21.0957130Z %107 = arith.mulf %86, %106 : tensor<8xf32> 2026-02-21T08:25:21.0957443Z %108 = tt.expand_dims %104 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:21.0957800Z %109 = arith.extf %93 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:25:21.0958132Z %110 = tt.broadcast %108 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:25:21.0958426Z %111 = arith.subf %109, %110 : tensor<8x512xf32> 2026-02-21T08:25:21.0958892Z %112 = tt.extern_elementwise %111 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:25:21.0959408Z %113 = arith.select %95, %112, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T08:25:21.0959721Z %114 = "tt.reduce"(%113) <{axis = 1 : i32}> ({ 2026-02-21T08:25:21.0959967Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:21.0960189Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:25:21.0960510Z tt.reduce.return %174 : f32 2026-02-21T08:25:21.0960732Z }) : 
(tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:25:21.0960981Z %115 = arith.addf %107, %114 : tensor<8xf32> 2026-02-21T08:25:21.0961225Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:25:21.0961457Z %116 = arith.muli %c512_i32, %c2_i32 : i32 2026-02-21T08:25:21.0961770Z %117 = arith.addi %arg3, %116 : i32 2026-02-21T08:25:21.0962057Z %118 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:21.0962445Z %119 = tt.splat %117 : i32 -> tensor<512xi32> 2026-02-21T08:25:21.0962698Z %120 = arith.addi %119, %118 : tensor<512xi32> 2026-02-21T08:25:21.0962972Z %121 = arith.cmpi slt, %120, %cst_3 : tensor<512xi32> 2026-02-21T08:25:21.0963343Z %122 = tt.descriptor_load %0[%2, %117] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:25:21.0963853Z %123 = tt.expand_dims %121 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:21.0964239Z %124 = tt.broadcast %123 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:25:21.0964598Z %125 = arith.select %124, %122, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T08:25:21.0964966Z %126 = arith.extf %125 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:25:21.0965262Z %127 = "tt.reduce"(%126) <{axis = 1 : i32}> ({ 2026-02-21T08:25:21.0965516Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:21.0965757Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:25:21.0966001Z tt.reduce.return %174 : f32 2026-02-21T08:25:21.0966243Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:25:21.0966520Z %128 = arith.truncf %127 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:25:21.0966836Z %129 = arith.extf %128 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:25:21.0967130Z %130 = arith.cmpf ogt, %104, %129 : tensor<8xf32> 2026-02-21T08:25:21.0967415Z %131 = arith.cmpf une, %104, %104 : tensor<8xf32> 2026-02-21T08:25:21.0967682Z %132 = arith.ori %130, %131 : tensor<8xi1> 2026-02-21T08:25:21.0967979Z %133 = arith.select %132, %104, %129 : tensor<8xi1>, tensor<8xf32> 
2026-02-21T08:25:21.0968285Z %134 = arith.subf %104, %133 : tensor<8xf32> 2026-02-21T08:25:21.0968730Z %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:25:21.0969198Z %136 = arith.mulf %115, %135 : tensor<8xf32> 2026-02-21T08:25:21.0969516Z %137 = tt.expand_dims %133 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:21.0969895Z %138 = arith.extf %122 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:25:21.0970231Z %139 = tt.broadcast %137 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:25:21.0970540Z %140 = arith.subf %138, %139 : tensor<8x512xf32> 2026-02-21T08:25:21.0971012Z %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:25:21.0971581Z %142 = arith.select %124, %141, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T08:25:21.0971920Z %143 = "tt.reduce"(%142) <{axis = 1 : i32}> ({ 2026-02-21T08:25:21.0972151Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:21.0972360Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:25:21.0972584Z tt.reduce.return %174 : f32 2026-02-21T08:25:21.0972798Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:25:21.0973039Z %144 = arith.addf %136, %143 : tensor<8xf32> 2026-02-21T08:25:21.0973266Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:25:21.0973490Z %145 = arith.muli %c512_i32, %c3_i32 : i32 2026-02-21T08:25:21.0973714Z %146 = arith.addi %arg3, %145 : i32 2026-02-21T08:25:21.0973995Z %147 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:21.0974410Z %148 = tt.splat %146 : i32 -> tensor<512xi32> 2026-02-21T08:25:21.0974653Z %149 = arith.addi %148, %147 : tensor<512xi32> 2026-02-21T08:25:21.0974920Z %150 = arith.cmpi slt, %149, %cst_3 : tensor<512xi32> 2026-02-21T08:25:21.0975270Z %151 = tt.descriptor_load %0[%2, %146] : !tt.tensordesc> -> 
tensor<8x512xf16> 2026-02-21T08:25:21.0975690Z %152 = tt.expand_dims %150 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:21.0976045Z %153 = tt.broadcast %152 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:25:21.0976372Z %154 = arith.select %153, %151, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T08:25:21.0976711Z %155 = arith.extf %154 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:25:21.0976982Z %156 = "tt.reduce"(%155) <{axis = 1 : i32}> ({ 2026-02-21T08:25:21.0977276Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:21.0977494Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:25:21.0977719Z tt.reduce.return %174 : f32 2026-02-21T08:25:21.0977936Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:25:21.0978191Z %157 = arith.truncf %156 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:25:21.0978483Z %158 = arith.extf %157 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:25:21.0978749Z %159 = arith.cmpf ogt, %133, %158 : tensor<8xf32> 2026-02-21T08:25:21.0979012Z %160 = arith.cmpf une, %133, %133 : tensor<8xf32> 2026-02-21T08:25:21.0979250Z %161 = arith.ori %159, %160 : tensor<8xi1> 2026-02-21T08:25:21.0979532Z %162 = arith.select %161, %133, %158 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:25:21.0979816Z %163 = arith.subf %133, %162 : tensor<8xf32> 2026-02-21T08:25:21.0980244Z %164 = tt.extern_elementwise %163 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:25:21.0980679Z %165 = arith.mulf %144, %164 : tensor<8xf32> 2026-02-21T08:25:21.0980967Z %166 = tt.expand_dims %162 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:21.0981314Z %167 = arith.extf %151 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:25:21.0981658Z %168 = tt.broadcast %166 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:25:21.0981944Z %169 = arith.subf %167, %168 : tensor<8x512xf32> 2026-02-21T08:25:21.0982385Z %170 = tt.extern_elementwise %169 
{libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:25:21.0982911Z %171 = arith.select %153, %170, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T08:25:21.0983213Z %172 = "tt.reduce"(%171) <{axis = 1 : i32}> ({ 2026-02-21T08:25:21.0983432Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:21.0983651Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:25:21.0983874Z tt.reduce.return %174 : f32 2026-02-21T08:25:21.0984095Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:25:21.0984334Z %173 = arith.addf %165, %172 : tensor<8xf32> 2026-02-21T08:25:21.0984591Z scf.yield %162, %173 : tensor<8xf32>, tensor<8xf32> 2026-02-21T08:25:21.0984880Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:25:21.0985155Z %7 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:21.0985463Z %8 = tt.splat %c4096_i32_6 : i32 -> tensor<512xi32> 2026-02-21T08:25:21.0985700Z %9 = arith.addi %8, %7 : tensor<512xi32> 2026-02-21T08:25:21.0985931Z %10 = arith.cmpi slt, %9, %cst_3 : tensor<512xi32> 2026-02-21T08:25:21.0986271Z %11 = tt.descriptor_load %0[%2, %c4096_i32_6] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:25:21.0986648Z %12 = tt.expand_dims %10 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:21.0986956Z %13 = tt.broadcast %12 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:25:21.0987310Z %14 = arith.select %13, %11, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T08:25:21.0987611Z %15 = arith.extf %14 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:25:21.0987864Z %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({ 2026-02-21T08:25:21.0988071Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:25:21.0988271Z %60 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:25:21.0988474Z tt.reduce.return %60 : f32 2026-02-21T08:25:21.0988682Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:25:21.0988914Z %17 = arith.truncf %16 : 
tensor<8xf32> to tensor<8xf16> 2026-02-21T08:25:21.0989175Z %18 = arith.extf %17 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:25:21.0989413Z %19 = arith.cmpf ogt, %6#0, %18 : tensor<8xf32> 2026-02-21T08:25:21.0989650Z %20 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32> 2026-02-21T08:25:21.0989930Z %21 = arith.ori %19, %20 : tensor<8xi1> 2026-02-21T08:25:21.0990169Z %22 = arith.select %21, %6#0, %18 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:25:21.0990416Z %23 = arith.subf %6#0, %22 : tensor<8xf32> 2026-02-21T08:25:21.0990789Z %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:25:21.0991178Z %25 = arith.mulf %6#1, %24 : tensor<8xf32> 2026-02-21T08:25:21.0991436Z %26 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:21.0991797Z %27 = arith.extf %11 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:25:21.0992071Z %28 = tt.broadcast %26 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:25:21.0992315Z %29 = arith.subf %27, %28 : tensor<8x512xf32> 2026-02-21T08:25:21.0992702Z %30 = tt.extern_elementwise %29 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:25:21.0993136Z %31 = arith.select %13, %30, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T08:25:21.0993407Z %32 = "tt.reduce"(%31) <{axis = 1 : i32}> ({ 2026-02-21T08:25:21.0993616Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:25:21.0993804Z %60 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:25:21.0994009Z tt.reduce.return %60 : f32 2026-02-21T08:25:21.0994205Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:25:21.0994421Z %33 = arith.addf %25, %32 : tensor<8xf32> 2026-02-21T08:25:21.0994627Z %c4096_i32_7 = arith.constant 4096 : i32 2026-02-21T08:25:21.0994841Z %c2048_i32_8 = arith.constant 2048 : i32 2026-02-21T08:25:21.0995090Z scf.for %arg3 = %c0_i32 to %c4096_i32_7 step %c2048_i32_8 : i32 { 
2026-02-21T08:25:21.0995398Z %60 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:21.0995676Z %61 = tt.splat %arg3 : i32 -> tensor<512xi32> 2026-02-21T08:25:21.0995898Z %62 = arith.addi %61, %60 : tensor<512xi32> 2026-02-21T08:25:21.0996137Z %63 = arith.cmpi slt, %62, %cst_3 : tensor<512xi32> 2026-02-21T08:25:21.0996415Z %64 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:25:21.0996699Z %65 = arith.muli %64, %cst_0 : tensor<8x1xi32> 2026-02-21T08:25:21.0996974Z %66 = tt.expand_dims %62 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:25:21.0997298Z %67 = tt.broadcast %65 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:25:21.0997585Z %68 = tt.broadcast %66 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:25:21.0997842Z %69 = arith.addi %67, %68 : tensor<8x512xi32> 2026-02-21T08:25:21.0998110Z %70 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.0998409Z %71 = tt.addptr %70, %69 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:25:21.0998733Z %72 = tt.expand_dims %63 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:21.0999105Z %73 = tt.broadcast %72 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:25:21.0999425Z %74 = tt.load %71, %73, %cst evictionPolicy = evict_first : tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.0999776Z %75 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:21.1000076Z %76 = arith.extf %74 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:25:21.1000352Z %77 = tt.broadcast %75 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:25:21.1000600Z %78 = arith.subf %76, %77 : tensor<8x512xf32> 2026-02-21T08:25:21.1001001Z %79 = tt.extern_elementwise %78 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:25:21.1001446Z %80 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> 
tensor<8x1xf32> 2026-02-21T08:25:21.1001849Z %81 = tt.broadcast %80 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:25:21.1002107Z %82 = arith.divf %79, %81 : tensor<8x512xf32> 2026-02-21T08:25:21.1002355Z %83 = arith.truncf %82 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:25:21.1002642Z %84 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.1002940Z %85 = tt.addptr %84, %69 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:25:21.1003217Z tt.store %85, %83, %73 : tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.1003450Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:25:21.1003654Z %86 = arith.muli %c512_i32, %c1_i32 : i32 2026-02-21T08:25:21.1003869Z %87 = arith.addi %arg3, %86 : i32 2026-02-21T08:25:21.1004116Z %88 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:21.1004385Z %89 = tt.splat %87 : i32 -> tensor<512xi32> 2026-02-21T08:25:21.1004602Z %90 = arith.addi %89, %88 : tensor<512xi32> 2026-02-21T08:25:21.1004830Z %91 = arith.cmpi slt, %90, %cst_3 : tensor<512xi32> 2026-02-21T08:25:21.1005110Z %92 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:25:21.1005388Z %93 = arith.muli %92, %cst_0 : tensor<8x1xi32> 2026-02-21T08:25:21.1005670Z %94 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:25:21.1005975Z %95 = tt.broadcast %93 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:25:21.1006255Z %96 = tt.broadcast %94 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:25:21.1006509Z %97 = arith.addi %95, %96 : tensor<8x512xi32> 2026-02-21T08:25:21.1006753Z %98 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.1007047Z %99 = tt.addptr %98, %97 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:25:21.1007361Z %100 = tt.expand_dims %91 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:21.1007683Z %101 = tt.broadcast %100 : tensor<1x512xi1> -> tensor<8x512xi1> 
2026-02-21T08:25:21.1008014Z %102 = tt.load %99, %101, %cst evictionPolicy = evict_first : tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.1008374Z %103 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:21.1008686Z %104 = arith.extf %102 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:25:21.1008965Z %105 = tt.broadcast %103 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:25:21.1009231Z %106 = arith.subf %104, %105 : tensor<8x512xf32> 2026-02-21T08:25:21.1009635Z %107 = tt.extern_elementwise %106 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:25:21.1010084Z %108 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:21.1010391Z %109 = tt.broadcast %108 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:25:21.1010650Z %110 = arith.divf %107, %109 : tensor<8x512xf32> 2026-02-21T08:25:21.1010979Z %111 = arith.truncf %110 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:25:21.1011269Z %112 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.1011633Z %113 = tt.addptr %112, %97 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:25:21.1011926Z tt.store %113, %111, %101 : tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.1012175Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:25:21.1012388Z %114 = arith.muli %c512_i32, %c2_i32 : i32 2026-02-21T08:25:21.1012603Z %115 = arith.addi %arg3, %114 : i32 2026-02-21T08:25:21.1012879Z %116 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:21.1013156Z %117 = tt.splat %115 : i32 -> tensor<512xi32> 2026-02-21T08:25:21.1013385Z %118 = arith.addi %117, %116 : tensor<512xi32> 2026-02-21T08:25:21.1013681Z %119 = arith.cmpi slt, %118, %cst_3 : tensor<512xi32> 2026-02-21T08:25:21.1013976Z %120 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:25:21.1014266Z %121 = arith.muli %120, %cst_0 : 
tensor<8x1xi32> 2026-02-21T08:25:21.1014549Z %122 = tt.expand_dims %118 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:25:21.1014875Z %123 = tt.broadcast %121 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:25:21.1015165Z %124 = tt.broadcast %122 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:25:21.1015433Z %125 = arith.addi %123, %124 : tensor<8x512xi32> 2026-02-21T08:25:21.1015689Z %126 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.1016000Z %127 = tt.addptr %126, %125 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:25:21.1016335Z %128 = tt.expand_dims %119 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:21.1016648Z %129 = tt.broadcast %128 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:25:21.1016991Z %130 = tt.load %127, %129, %cst evictionPolicy = evict_first : tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.1017343Z %131 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:21.1017652Z %132 = arith.extf %130 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:25:21.1017936Z %133 = tt.broadcast %131 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:25:21.1018188Z %134 = arith.subf %132, %133 : tensor<8x512xf32> 2026-02-21T08:25:21.1018595Z %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:25:21.1019038Z %136 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:21.1019350Z %137 = tt.broadcast %136 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:25:21.1019617Z %138 = arith.divf %135, %137 : tensor<8x512xf32> 2026-02-21T08:25:21.1019871Z %139 = arith.truncf %138 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:25:21.1020165Z %140 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.1020466Z %141 = tt.addptr %140, %125 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 
2026-02-21T08:25:21.1020758Z tt.store %141, %139, %129 : tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.1020981Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:25:21.1021192Z %142 = arith.muli %c512_i32, %c3_i32 : i32 2026-02-21T08:25:21.1021403Z %143 = arith.addi %arg3, %142 : i32 2026-02-21T08:25:21.1021702Z %144 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:21.1021981Z %145 = tt.splat %143 : i32 -> tensor<512xi32> 2026-02-21T08:25:21.1022199Z %146 = arith.addi %145, %144 : tensor<512xi32> 2026-02-21T08:25:21.1022445Z %147 = arith.cmpi slt, %146, %cst_3 : tensor<512xi32> 2026-02-21T08:25:21.1022830Z %148 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:25:21.1023118Z %149 = arith.muli %148, %cst_0 : tensor<8x1xi32> 2026-02-21T08:25:21.1023405Z %150 = tt.expand_dims %146 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:25:21.1023723Z %151 = tt.broadcast %149 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:25:21.1024011Z %152 = tt.broadcast %150 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:25:21.1024268Z %153 = arith.addi %151, %152 : tensor<8x512xi32> 2026-02-21T08:25:21.1024535Z %154 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.1024842Z %155 = tt.addptr %154, %153 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:25:21.1025175Z %156 = tt.expand_dims %147 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:21.1025566Z %157 = tt.broadcast %156 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:25:21.1025899Z %158 = tt.load %155, %157, %cst evictionPolicy = evict_first : tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.1026256Z %159 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:21.1026556Z %160 = arith.extf %158 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:25:21.1026846Z %161 = tt.broadcast %159 : tensor<8x1xf32> -> tensor<8x512xf32> 
2026-02-21T08:25:21.1027112Z %162 = arith.subf %160, %161 : tensor<8x512xf32> 2026-02-21T08:25:21.1027512Z %163 = tt.extern_elementwise %162 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:25:21.1027963Z %164 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:21.1028261Z %165 = tt.broadcast %164 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:25:21.1028531Z %166 = arith.divf %163, %165 : tensor<8x512xf32> 2026-02-21T08:25:21.1028785Z %167 = arith.truncf %166 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:25:21.1029083Z %168 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.1029398Z %169 = tt.addptr %168, %153 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:25:21.1029692Z tt.store %169, %167, %157 : tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.1029934Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:25:21.1030188Z %34 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:21.1030475Z %35 = tt.splat %c4096_i32_7 : i32 -> tensor<512xi32> 2026-02-21T08:25:21.1030707Z %36 = arith.addi %35, %34 : tensor<512xi32> 2026-02-21T08:25:21.1030934Z %37 = arith.cmpi slt, %36, %cst_3 : tensor<512xi32> 2026-02-21T08:25:21.1031216Z %38 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:25:21.1031487Z %39 = arith.muli %38, %cst_0 : tensor<8x1xi32> 2026-02-21T08:25:21.1031807Z %40 = tt.expand_dims %36 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:25:21.1032119Z %41 = tt.broadcast %39 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:25:21.1032398Z %42 = tt.broadcast %40 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:25:21.1032652Z %43 = arith.addi %41, %42 : tensor<8x512xi32> 2026-02-21T08:25:21.1032899Z %44 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.1033195Z %45 = tt.addptr %44, %43 : 
tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:25:21.1033506Z %46 = tt.expand_dims %37 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:21.1033815Z %47 = tt.broadcast %46 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T08:25:21.1034122Z %48 = tt.load %45, %47, %cst evictionPolicy = evict_first : tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.1034470Z %49 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:21.1034826Z %50 = arith.extf %48 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:25:21.1035091Z %51 = tt.broadcast %49 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:25:21.1035336Z %52 = arith.subf %50, %51 : tensor<8x512xf32> 2026-02-21T08:25:21.1035713Z %53 = tt.extern_elementwise %52 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:25:21.1036153Z %54 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:21.1036448Z %55 = tt.broadcast %54 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:25:21.1036687Z %56 = arith.divf %53, %55 : tensor<8x512xf32> 2026-02-21T08:25:21.1036935Z %57 = arith.truncf %56 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:25:21.1037208Z %58 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.1037554Z %59 = tt.addptr %58, %43 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:25:21.1037827Z tt.store %59, %57, %47 : tensor<8x512x!tt.ptr> 2026-02-21T08:25:21.1038085Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 2026-02-21T08:25:21.1038305Z tt.return 2026-02-21T08:25:21.1038442Z } 2026-02-21T08:25:21.1038582Z } 2026-02-21T08:25:21.1038658Z 2026-02-21T08:25:21.1038712Z {-# 2026-02-21T08:25:21.1038855Z external_resources: { 2026-02-21T08:25:21.1039025Z mlir_reproducer: { 2026-02-21T08:25:21.1043733Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 
threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:25:21.1048674Z disable_threading: false, 2026-02-21T08:25:21.1048864Z verify_each: true 2026-02-21T08:25:21.1049032Z } 2026-02-21T08:25:21.1049161Z } 2026-02-21T08:25:21.1049311Z #-} 2026-02-21T08:25:21.1049771Z /tmp/torchinductor_root/jk/cjkllnsh5e55uktu24aq3faymfp2bujo2g4hnlxlayzgzmxzeasa.py:20:0: error: Failures have been detected 
while processing an MLIR pass pipeline 2026-02-21T08:25:21.1051067Z /tmp/torchinductor_root/jk/cjkllnsh5e55uktu24aq3faymfp2bujo2g4hnlxlayzgzmxzeasa.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:25:21.1052250Z [150s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:25:21.1053411Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 512], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=32, num_sm_multiplier=4, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:25:21.1054453Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:25:21.1054747Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:25:21.9434546Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 80/80 16.4 configs/s 2026-02-21T08:25:23.8270452Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 538.9 2026-02-21T08:25:23.8270967Z configs/s 2026-02-21T08:25:23.9905395Z [153s] Generation 4 complete: 2026-02-21T08:25:23.9907129Z error=1 2026-02-21T08:25:23.9907311Z ok=81 2026-02-21T08:25:23.9907472Z min=0.0204 2026-02-21T08:25:23.9907629Z mid=0.0368 2026-02-21T08:25:23.9907782Z max=0.3369 2026-02-21T08:25:23.9907955Z best={'block_sizes': [1, 8192], 2026-02-21T08:25:23.9908240Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:25:23.9908546Z 'load_eviction_policies': ['', ''], 2026-02-21T08:25:23.9908744Z 'num_sm_multiplier': 32, 2026-02-21T08:25:23.9908914Z 'num_stages': 5, 2026-02-21T08:25:23.9909063Z 'num_warps': 1, 2026-02-21T08:25:23.9909246Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:25:23.9909469Z 'range_flattens': [True, True], 2026-02-21T08:25:23.9909665Z 'range_multi_buffers': [False, None], 2026-02-21T08:25:23.9909874Z 'range_num_stages': [3, 0], 2026-02-21T08:25:23.9910111Z 'range_unroll_factors': [0, 2], 2026-02-21T08:25:23.9910332Z 'range_warp_specializes': [True, None]} 2026-02-21T08:25:23.9927257Z [153s] Fitting surrogate: 432 points, 432 targets 2026-02-21T08:25:24.9571192Z [154s] Generation 5 starting: 72 neighbors, 5 active search path(s) 2026-02-21T08:25:52.6549731Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 0.8 configs/s 2026-02-21T08:25:56.1437607Z module { 2026-02-21T08:25:56.1442754Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:25:56.1447324Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:25:56.1452016Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:25:56.1454227Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:25:56.1454507Z %c1_i32 = 
arith.constant 1 : i32 2026-02-21T08:25:56.1460412Z %cst = arith.constant dense<4224> : tensor<8x1xi32> 2026-02-21T08:25:56.1465214Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<8xf32> 2026-02-21T08:25:56.1467460Z %cst_1 = arith.constant dense<0xFF800000> : tensor<8xf32> 2026-02-21T08:25:56.1467775Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:25:56.1472525Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:25:56.1477070Z %c4224_i32 = arith.constant 4224 : i32 2026-02-21T08:25:56.1478697Z %c4224_i64 = arith.constant 4224 : i64 2026-02-21T08:25:56.1478977Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:25:56.1483782Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4224_i32], [%c4224_i64, %c1_i64] : , > 2026-02-21T08:25:56.1487513Z %1 = tt.get_program_id x : i32 2026-02-21T08:25:56.1488935Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T08:25:56.1489169Z %3 = arith.minsi %2, %c512_i32 : i32 2026-02-21T08:25:56.1489386Z scf.for %arg2 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T08:25:56.1489587Z %4 = arith.muli %arg2, %c8_i32 : i32 2026-02-21T08:25:56.1489844Z %5 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:25:56.1490478Z %6 = tt.splat %4 : i32 -> tensor<8xi32> 2026-02-21T08:25:56.1490673Z %7 = arith.addi %6, %5 : tensor<8xi32> 2026-02-21T08:25:56.1490873Z %c4096_i32_2 = arith.constant 4096 : i32 2026-02-21T08:25:56.1491061Z %c512_i32_3 = arith.constant 512 : i32 2026-02-21T08:25:56.1491443Z %8:2 = scf.for %arg3 = %c0_i32 to %c4096_i32_2 step %c512_i32_3 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<8xf32>, tensor<8xf32>) : i32 { 2026-02-21T08:25:56.1491977Z %50 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc> -> tensor<8x128xf16> 2026-02-21T08:25:56.1492309Z %51 = arith.extf %50 : tensor<8x128xf16> to tensor<8x128xf32> 2026-02-21T08:25:56.1492551Z %52 = "tt.reduce"(%51) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.1492819Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:56.1493017Z %128 = arith.maxnumf %arg6, 
%arg7 : f32 2026-02-21T08:25:56.1493379Z tt.reduce.return %128 : f32 2026-02-21T08:25:56.1493579Z }) : (tensor<8x128xf32>) -> tensor<8xf32> 2026-02-21T08:25:56.1494101Z %53 = arith.truncf %52 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:25:56.1494370Z %54 = arith.extf %53 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:25:56.1498524Z %55 = arith.cmpf ogt, %arg4, %54 : tensor<8xf32> 2026-02-21T08:25:56.1501364Z %56 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32> 2026-02-21T08:25:56.1501768Z %57 = arith.ori %55, %56 : tensor<8xi1> 2026-02-21T08:25:56.1502047Z %58 = arith.select %57, %arg4, %54 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:25:56.1502302Z %59 = arith.subf %arg4, %58 : tensor<8xf32> 2026-02-21T08:25:56.1502697Z %60 = tt.extern_elementwise %59 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:25:56.1503080Z %61 = arith.mulf %arg5, %60 : tensor<8xf32> 2026-02-21T08:25:56.1503356Z %62 = tt.expand_dims %58 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:56.1503680Z %63 = tt.broadcast %62 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T08:25:56.1503924Z %64 = arith.subf %51, %63 : tensor<8x128xf32> 2026-02-21T08:25:56.1504310Z %65 = tt.extern_elementwise %64 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32> 2026-02-21T08:25:56.1504695Z %66 = "tt.reduce"(%65) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.1504910Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:56.1505111Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:25:56.1505310Z tt.reduce.return %128 : f32 2026-02-21T08:25:56.1505511Z }) : (tensor<8x128xf32>) -> tensor<8xf32> 2026-02-21T08:25:56.1505716Z %67 = arith.addf %61, %66 : tensor<8xf32> 2026-02-21T08:25:56.1505930Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T08:25:56.1506126Z %68 = arith.muli %c128_i32, %c1_i32_6 : i32 2026-02-21T08:25:56.1506331Z %69 = arith.addi %arg3, %68 : i32 
2026-02-21T08:25:56.1508875Z %70 = tt.descriptor_load %0[%4, %69] : !tt.tensordesc> -> tensor<8x128xf16> 2026-02-21T08:25:56.1509194Z %71 = arith.extf %70 : tensor<8x128xf16> to tensor<8x128xf32> 2026-02-21T08:25:56.1509460Z %72 = "tt.reduce"(%71) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.1509665Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:56.1509852Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:25:56.1510062Z tt.reduce.return %128 : f32 2026-02-21T08:25:56.1510247Z }) : (tensor<8x128xf32>) -> tensor<8xf32> 2026-02-21T08:25:56.1510481Z %73 = arith.truncf %72 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:25:56.1510725Z %74 = arith.extf %73 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:25:56.1510964Z %75 = arith.cmpf ogt, %58, %74 : tensor<8xf32> 2026-02-21T08:25:56.1511182Z %76 = arith.cmpf une, %58, %58 : tensor<8xf32> 2026-02-21T08:25:56.1511644Z %77 = arith.ori %75, %76 : tensor<8xi1> 2026-02-21T08:25:56.1511879Z %78 = arith.select %77, %58, %74 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:25:56.1512105Z %79 = arith.subf %58, %78 : tensor<8xf32> 2026-02-21T08:25:56.1512458Z %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:25:56.1512805Z %81 = arith.mulf %67, %80 : tensor<8xf32> 2026-02-21T08:25:56.1513055Z %82 = tt.expand_dims %78 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:56.1513344Z %83 = tt.broadcast %82 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T08:25:56.1513575Z %84 = arith.subf %71, %83 : tensor<8x128xf32> 2026-02-21T08:25:56.1513938Z %85 = tt.extern_elementwise %84 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32> 2026-02-21T08:25:56.1514370Z %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.1514570Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:56.1514748Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:25:56.1514967Z tt.reduce.return %128 
: f32 2026-02-21T08:25:56.1515161Z }) : (tensor<8x128xf32>) -> tensor<8xf32> 2026-02-21T08:25:56.1515356Z %87 = arith.addf %81, %86 : tensor<8xf32> 2026-02-21T08:25:56.1515549Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:25:56.1515732Z %88 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T08:25:56.1515924Z %89 = arith.addi %arg3, %88 : i32 2026-02-21T08:25:56.1516189Z %90 = tt.descriptor_load %0[%4, %89] : !tt.tensordesc> -> tensor<8x128xf16> 2026-02-21T08:25:56.1516497Z %91 = arith.extf %90 : tensor<8x128xf16> to tensor<8x128xf32> 2026-02-21T08:25:56.1516724Z %92 = "tt.reduce"(%91) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.1516909Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:56.1517094Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:25:56.1517276Z tt.reduce.return %128 : f32 2026-02-21T08:25:56.1517458Z }) : (tensor<8x128xf32>) -> tensor<8xf32> 2026-02-21T08:25:56.1517671Z %93 = arith.truncf %92 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:25:56.1517915Z %94 = arith.extf %93 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:25:56.1518138Z %95 = arith.cmpf ogt, %78, %94 : tensor<8xf32> 2026-02-21T08:25:56.1518341Z %96 = arith.cmpf une, %78, %78 : tensor<8xf32> 2026-02-21T08:25:56.1518545Z %97 = arith.ori %95, %96 : tensor<8xi1> 2026-02-21T08:25:56.1518763Z %98 = arith.select %97, %78, %94 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:25:56.1518991Z %99 = arith.subf %78, %98 : tensor<8xf32> 2026-02-21T08:25:56.1519333Z %100 = tt.extern_elementwise %99 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:25:56.1519701Z %101 = arith.mulf %87, %100 : tensor<8xf32> 2026-02-21T08:25:56.1519963Z %102 = tt.expand_dims %98 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:56.1520253Z %103 = tt.broadcast %102 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T08:25:56.1520501Z %104 = arith.subf %91, %103 : tensor<8x128xf32> 2026-02-21T08:25:56.1520868Z %105 = tt.extern_elementwise %104 
{libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32> 2026-02-21T08:25:56.1521246Z %106 = "tt.reduce"(%105) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.1521436Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:56.1521653Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:25:56.1521847Z tt.reduce.return %128 : f32 2026-02-21T08:25:56.1522026Z }) : (tensor<8x128xf32>) -> tensor<8xf32> 2026-02-21T08:25:56.1522230Z %107 = arith.addf %101, %106 : tensor<8xf32> 2026-02-21T08:25:56.1522426Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:25:56.1522686Z %108 = arith.muli %c128_i32, %c3_i32 : i32 2026-02-21T08:25:56.1522875Z %109 = arith.addi %arg3, %108 : i32 2026-02-21T08:25:56.1523156Z %110 = tt.descriptor_load %0[%4, %109] : !tt.tensordesc> -> tensor<8x128xf16> 2026-02-21T08:25:56.1523480Z %111 = arith.extf %110 : tensor<8x128xf16> to tensor<8x128xf32> 2026-02-21T08:25:56.1523708Z %112 = "tt.reduce"(%111) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.1523904Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:56.1524082Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:25:56.1524274Z tt.reduce.return %128 : f32 2026-02-21T08:25:56.1524453Z }) : (tensor<8x128xf32>) -> tensor<8xf32> 2026-02-21T08:25:56.1524680Z %113 = arith.truncf %112 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:25:56.1524929Z %114 = arith.extf %113 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:25:56.1525204Z %115 = arith.cmpf ogt, %98, %114 : tensor<8xf32> 2026-02-21T08:25:56.1525429Z %116 = arith.cmpf une, %98, %98 : tensor<8xf32> 2026-02-21T08:25:56.1525672Z %117 = arith.ori %115, %116 : tensor<8xi1> 2026-02-21T08:25:56.1525901Z %118 = arith.select %117, %98, %114 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:25:56.1526144Z %119 = arith.subf %98, %118 : tensor<8xf32> 2026-02-21T08:25:56.1526504Z %120 = tt.extern_elementwise %119 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:25:56.1526858Z 
%121 = arith.mulf %107, %120 : tensor<8xf32> 2026-02-21T08:25:56.1527118Z %122 = tt.expand_dims %118 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:56.1527409Z %123 = tt.broadcast %122 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T08:25:56.1527656Z %124 = arith.subf %111, %123 : tensor<8x128xf32> 2026-02-21T08:25:56.1528016Z %125 = tt.extern_elementwise %124 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32> 2026-02-21T08:25:56.1528387Z %126 = "tt.reduce"(%125) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.1528586Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:56.1528764Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:25:56.1528957Z tt.reduce.return %128 : f32 2026-02-21T08:25:56.1529136Z }) : (tensor<8x128xf32>) -> tensor<8xf32> 2026-02-21T08:25:56.1529335Z %127 = arith.addf %121, %126 : tensor<8xf32> 2026-02-21T08:25:56.1529548Z scf.yield %118, %127 : tensor<8xf32>, tensor<8xf32> 2026-02-21T08:25:56.1529785Z } {tt.num_stages = 1 : i32} 2026-02-21T08:25:56.1530075Z %9 = tt.descriptor_load %0[%4, %c4096_i32_2] : !tt.tensordesc> -> tensor<8x128xf16> 2026-02-21T08:25:56.1530407Z %10 = arith.extf %9 : tensor<8x128xf16> to tensor<8x128xf32> 2026-02-21T08:25:56.1530632Z %11 = "tt.reduce"(%10) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.1530825Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:25:56.1531003Z %50 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:25:56.1531198Z tt.reduce.return %50 : f32 2026-02-21T08:25:56.1531377Z }) : (tensor<8x128xf32>) -> tensor<8xf32> 2026-02-21T08:25:56.1531638Z %12 = arith.truncf %11 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:25:56.1531872Z %13 = arith.extf %12 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:25:56.1532099Z %14 = arith.cmpf ogt, %8#0, %13 : tensor<8xf32> 2026-02-21T08:25:56.1532314Z %15 = arith.cmpf une, %8#0, %8#0 : tensor<8xf32> 2026-02-21T08:25:56.1532511Z %16 = arith.ori %14, %15 : tensor<8xi1> 2026-02-21T08:25:56.1532740Z %17 = 
arith.select %16, %8#0, %13 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:25:56.1532966Z %18 = arith.subf %8#0, %17 : tensor<8xf32> 2026-02-21T08:25:56.1533318Z %19 = tt.extern_elementwise %18 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:25:56.1533720Z %20 = arith.mulf %8#1, %19 : tensor<8xf32> 2026-02-21T08:25:56.1533968Z %21 = tt.expand_dims %17 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:56.1534253Z %22 = tt.broadcast %21 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T08:25:56.1534478Z %23 = arith.subf %10, %22 : tensor<8x128xf32> 2026-02-21T08:25:56.1534846Z %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32> 2026-02-21T08:25:56.1535199Z %25 = "tt.reduce"(%24) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.1535394Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:25:56.1535568Z %50 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:25:56.1535758Z tt.reduce.return %50 : f32 2026-02-21T08:25:56.1535944Z }) : (tensor<8x128xf32>) -> tensor<8xf32> 2026-02-21T08:25:56.1536134Z %26 = arith.addf %20, %25 : tensor<8xf32> 2026-02-21T08:25:56.1536395Z %c4096_i32_4 = arith.constant 4096 : i32 2026-02-21T08:25:56.1536589Z %c512_i32_5 = arith.constant 512 : i32 2026-02-21T08:25:56.1536828Z scf.for %arg3 = %c0_i32 to %c4096_i32_4 step %c512_i32_5 : i32 { 2026-02-21T08:25:56.1537108Z %50 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:25:56.1537377Z %51 = tt.splat %arg3 : i32 -> tensor<128xi32> 2026-02-21T08:25:56.1537588Z %52 = arith.addi %51, %50 : tensor<128xi32> 2026-02-21T08:25:56.1537832Z %53 = tt.expand_dims %7 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:25:56.1538100Z %54 = arith.muli %53, %cst : tensor<8x1xi32> 2026-02-21T08:25:56.1538358Z %55 = tt.expand_dims %52 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:25:56.1538658Z 
%56 = tt.broadcast %54 : tensor<8x1xi32> -> tensor<8x128xi32> 2026-02-21T08:25:56.1538921Z %57 = tt.broadcast %55 : tensor<1x128xi32> -> tensor<8x128xi32> 2026-02-21T08:25:56.1539167Z %58 = arith.addi %56, %57 : tensor<8x128xi32> 2026-02-21T08:25:56.1539413Z %59 = tt.splat %arg0 : !tt.ptr -> tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1539692Z %60 = tt.addptr %59, %58 : tensor<8x128x!tt.ptr>, tensor<8x128xi32> 2026-02-21T08:25:56.1539997Z %61 = tt.load %60 evictionPolicy = evict_first : tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1540305Z %62 = tt.expand_dims %17 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:56.1540598Z %63 = arith.extf %61 : tensor<8x128xf16> to tensor<8x128xf32> 2026-02-21T08:25:56.1540873Z %64 = tt.broadcast %62 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T08:25:56.1541115Z %65 = arith.subf %63, %64 : tensor<8x128xf32> 2026-02-21T08:25:56.1541510Z %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32> 2026-02-21T08:25:56.1541979Z %67 = tt.expand_dims %26 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:56.1542280Z %68 = tt.broadcast %67 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T08:25:56.1542518Z %69 = arith.divf %66, %68 : tensor<8x128xf32> 2026-02-21T08:25:56.1542766Z %70 = arith.truncf %69 : tensor<8x128xf32> to tensor<8x128xf16> 2026-02-21T08:25:56.1543044Z %71 = tt.splat %arg1 : !tt.ptr -> tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1543327Z %72 = tt.addptr %71, %58 : tensor<8x128x!tt.ptr>, tensor<8x128xi32> 2026-02-21T08:25:56.1543592Z tt.store %72, %70 : tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1543807Z %c1_i32_6 = arith.constant 1 : i32 2026-02-21T08:25:56.1544014Z %73 = arith.muli %c128_i32, %c1_i32_6 : i32 2026-02-21T08:25:56.1544212Z %74 = arith.addi %arg3, %73 : i32 2026-02-21T08:25:56.1544459Z %75 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:25:56.1544721Z %76 = 
tt.splat %74 : i32 -> tensor<128xi32> 2026-02-21T08:25:56.1544979Z %77 = arith.addi %76, %75 : tensor<128xi32> 2026-02-21T08:25:56.1545235Z %78 = tt.expand_dims %7 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:25:56.1545501Z %79 = arith.muli %78, %cst : tensor<8x1xi32> 2026-02-21T08:25:56.1545772Z %80 = tt.expand_dims %77 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:25:56.1546069Z %81 = tt.broadcast %79 : tensor<8x1xi32> -> tensor<8x128xi32> 2026-02-21T08:25:56.1546343Z %82 = tt.broadcast %80 : tensor<1x128xi32> -> tensor<8x128xi32> 2026-02-21T08:25:56.1546588Z %83 = arith.addi %81, %82 : tensor<8x128xi32> 2026-02-21T08:25:56.1546829Z %84 = tt.splat %arg0 : !tt.ptr -> tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1547118Z %85 = tt.addptr %84, %83 : tensor<8x128x!tt.ptr>, tensor<8x128xi32> 2026-02-21T08:25:56.1547496Z %86 = tt.load %85 evictionPolicy = evict_first : tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1547823Z %87 = tt.expand_dims %17 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:56.1548119Z %88 = arith.extf %86 : tensor<8x128xf16> to tensor<8x128xf32> 2026-02-21T08:25:56.1548392Z %89 = tt.broadcast %87 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T08:25:56.1548625Z %90 = arith.subf %88, %89 : tensor<8x128xf32> 2026-02-21T08:25:56.1548978Z %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32> 2026-02-21T08:25:56.1549384Z %92 = tt.expand_dims %26 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:56.1549656Z %93 = tt.broadcast %92 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T08:25:56.1549887Z %94 = arith.divf %91, %93 : tensor<8x128xf32> 2026-02-21T08:25:56.1550120Z %95 = arith.truncf %94 : tensor<8x128xf32> to tensor<8x128xf16> 2026-02-21T08:25:56.1550384Z %96 = tt.splat %arg1 : !tt.ptr -> tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1550661Z %97 = tt.addptr %96, %83 : 
tensor<8x128x!tt.ptr>, tensor<8x128xi32> 2026-02-21T08:25:56.1550909Z tt.store %97, %95 : tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1551117Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:25:56.1551304Z %98 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T08:25:56.1551493Z %99 = arith.addi %arg3, %98 : i32 2026-02-21T08:25:56.1551789Z %100 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:25:56.1552039Z %101 = tt.splat %99 : i32 -> tensor<128xi32> 2026-02-21T08:25:56.1552248Z %102 = arith.addi %101, %100 : tensor<128xi32> 2026-02-21T08:25:56.1552493Z %103 = tt.expand_dims %7 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:25:56.1552753Z %104 = arith.muli %103, %cst : tensor<8x1xi32> 2026-02-21T08:25:56.1553018Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:25:56.1553320Z %106 = tt.broadcast %104 : tensor<8x1xi32> -> tensor<8x128xi32> 2026-02-21T08:25:56.1553591Z %107 = tt.broadcast %105 : tensor<1x128xi32> -> tensor<8x128xi32> 2026-02-21T08:25:56.1553834Z %108 = arith.addi %106, %107 : tensor<8x128xi32> 2026-02-21T08:25:56.1554079Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1554354Z %110 = tt.addptr %109, %108 : tensor<8x128x!tt.ptr>, tensor<8x128xi32> 2026-02-21T08:25:56.1554671Z %111 = tt.load %110 evictionPolicy = evict_first : tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1554988Z %112 = tt.expand_dims %17 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:56.1555270Z %113 = arith.extf %111 : tensor<8x128xf16> to tensor<8x128xf32> 2026-02-21T08:25:56.1555541Z %114 = tt.broadcast %112 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T08:25:56.1555790Z %115 = arith.subf %113, %114 : tensor<8x128xf32> 2026-02-21T08:25:56.1556229Z %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32> 2026-02-21T08:25:56.1556648Z %117 = 
tt.expand_dims %26 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:56.1556926Z %118 = tt.broadcast %117 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T08:25:56.1557167Z %119 = arith.divf %116, %118 : tensor<8x128xf32> 2026-02-21T08:25:56.1557402Z %120 = arith.truncf %119 : tensor<8x128xf32> to tensor<8x128xf16> 2026-02-21T08:25:56.1557675Z %121 = tt.splat %arg1 : !tt.ptr -> tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1557956Z %122 = tt.addptr %121, %108 : tensor<8x128x!tt.ptr>, tensor<8x128xi32> 2026-02-21T08:25:56.1558224Z tt.store %122, %120 : tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1558435Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:25:56.1558672Z %123 = arith.muli %c128_i32, %c3_i32 : i32 2026-02-21T08:25:56.1558874Z %124 = arith.addi %arg3, %123 : i32 2026-02-21T08:25:56.1559103Z %125 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:25:56.1559364Z %126 = tt.splat %124 : i32 -> tensor<128xi32> 2026-02-21T08:25:56.1559564Z %127 = arith.addi %126, %125 : tensor<128xi32> 2026-02-21T08:25:56.1559816Z %128 = tt.expand_dims %7 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:25:56.1560083Z %129 = arith.muli %128, %cst : tensor<8x1xi32> 2026-02-21T08:25:56.1560343Z %130 = tt.expand_dims %127 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:25:56.1560642Z %131 = tt.broadcast %129 : tensor<8x1xi32> -> tensor<8x128xi32> 2026-02-21T08:25:56.1560904Z %132 = tt.broadcast %130 : tensor<1x128xi32> -> tensor<8x128xi32> 2026-02-21T08:25:56.1561153Z %133 = arith.addi %131, %132 : tensor<8x128xi32> 2026-02-21T08:25:56.1561391Z %134 = tt.splat %arg0 : !tt.ptr -> tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1561715Z %135 = tt.addptr %134, %133 : tensor<8x128x!tt.ptr>, tensor<8x128xi32> 2026-02-21T08:25:56.1562025Z %136 = tt.load %135 evictionPolicy = evict_first : tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1562340Z %137 = tt.expand_dims %17 {axis = 1 : i32} : tensor<8xf32> -> 
tensor<8x1xf32> 2026-02-21T08:25:56.1562639Z %138 = arith.extf %136 : tensor<8x128xf16> to tensor<8x128xf32> 2026-02-21T08:25:56.1562907Z %139 = tt.broadcast %137 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T08:25:56.1563153Z %140 = arith.subf %138, %139 : tensor<8x128xf32> 2026-02-21T08:25:56.1563540Z %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32> 2026-02-21T08:25:56.1563963Z %142 = tt.expand_dims %26 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:56.1564264Z %143 = tt.broadcast %142 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T08:25:56.1564505Z %144 = arith.divf %141, %143 : tensor<8x128xf32> 2026-02-21T08:25:56.1564751Z %145 = arith.truncf %144 : tensor<8x128xf32> to tensor<8x128xf16> 2026-02-21T08:25:56.1565027Z %146 = tt.splat %arg1 : !tt.ptr -> tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1565323Z %147 = tt.addptr %146, %133 : tensor<8x128x!tt.ptr>, tensor<8x128xi32> 2026-02-21T08:25:56.1565595Z tt.store %147, %145 : tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1565807Z } {tt.num_stages = 1 : i32} 2026-02-21T08:25:56.1566043Z %27 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:25:56.1566312Z %28 = tt.splat %c4096_i32_4 : i32 -> tensor<128xi32> 2026-02-21T08:25:56.1566536Z %29 = arith.addi %28, %27 : tensor<128xi32> 2026-02-21T08:25:56.1566786Z %30 = tt.expand_dims %7 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:25:56.1567093Z %31 = arith.muli %30, %cst : tensor<8x1xi32> 2026-02-21T08:25:56.1567351Z %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:25:56.1567635Z %33 = tt.broadcast %31 : tensor<8x1xi32> -> tensor<8x128xi32> 2026-02-21T08:25:56.1567895Z %34 = tt.broadcast %32 : tensor<1x128xi32> -> tensor<8x128xi32> 2026-02-21T08:25:56.1568122Z %35 = arith.addi %33, %34 : tensor<8x128xi32> 2026-02-21T08:25:56.1568358Z %36 = tt.splat %arg0 : 
!tt.ptr -> tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1568625Z %37 = tt.addptr %36, %35 : tensor<8x128x!tt.ptr>, tensor<8x128xi32> 2026-02-21T08:25:56.1568926Z %38 = tt.load %37 evictionPolicy = evict_first : tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1569239Z %39 = tt.expand_dims %17 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:56.1569567Z %40 = arith.extf %38 : tensor<8x128xf16> to tensor<8x128xf32> 2026-02-21T08:25:56.1569827Z %41 = tt.broadcast %39 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T08:25:56.1570047Z %42 = arith.subf %40, %41 : tensor<8x128xf32> 2026-02-21T08:25:56.1570404Z %43 = tt.extern_elementwise %42 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32> 2026-02-21T08:25:56.1570805Z %44 = tt.expand_dims %26 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:25:56.1571074Z %45 = tt.broadcast %44 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T08:25:56.1571304Z %46 = arith.divf %43, %45 : tensor<8x128xf32> 2026-02-21T08:25:56.1571525Z %47 = arith.truncf %46 : tensor<8x128xf32> to tensor<8x128xf16> 2026-02-21T08:25:56.1571813Z %48 = tt.splat %arg1 : !tt.ptr -> tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1572073Z %49 = tt.addptr %48, %35 : tensor<8x128x!tt.ptr>, tensor<8x128xi32> 2026-02-21T08:25:56.1572330Z tt.store %49, %47 : tensor<8x128x!tt.ptr> 2026-02-21T08:25:56.1572589Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.warp_specialize} 2026-02-21T08:25:56.1572811Z tt.return 2026-02-21T08:25:56.1572946Z } 2026-02-21T08:25:56.1573068Z } 2026-02-21T08:25:56.1573143Z 2026-02-21T08:25:56.1573194Z {-# 2026-02-21T08:25:56.1573323Z external_resources: { 2026-02-21T08:25:56.1573487Z mlir_reproducer: { 2026-02-21T08:25:56.1577768Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, 
tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:25:56.1582298Z disable_threading: false, 2026-02-21T08:25:56.1582485Z verify_each: true 2026-02-21T08:25:56.1582633Z } 2026-02-21T08:25:56.1582767Z } 2026-02-21T08:25:56.1582879Z #-} 2026-02-21T08:25:56.1583316Z /tmp/torchinductor_root/pc/cpcb2g72dy7psfa7tvoq6f36rmz52skrwegzxl3uatputjkalgjh.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:25:56.1584552Z 
/tmp/torchinductor_root/pc/cpcb2g72dy7psfa7tvoq6f36rmz52skrwegzxl3uatputjkalgjh.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:25:56.1585611Z [185s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:25:56.1586746Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[False, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:25:56.1587752Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:25:56.1588019Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:25:56.4579139Z module { 2026-02-21T08:25:56.4586012Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:25:56.4590388Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:25:56.4594461Z %cst = arith.constant dense<0.000000e+00> : tensor<32x512xf16> 2026-02-21T08:25:56.4596874Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:25:56.4597162Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:25:56.4601912Z %c592_i32 = arith.constant 592 : i32 2026-02-21T08:25:56.4603607Z %cst_0 = arith.constant dense<4224> : tensor<32x1xi32> 2026-02-21T08:25:56.4603982Z %cst_1 = arith.constant dense<0.000000e+00> : tensor<32x512xf32> 2026-02-21T08:25:56.4609899Z %cst_2 = arith.constant dense<0xFC00> : tensor<32x512xf16> 2026-02-21T08:25:56.4614076Z %cst_3 = arith.constant dense<4224> : tensor<512xi32> 2026-02-21T08:25:56.4618656Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<32xf32> 2026-02-21T08:25:56.4622647Z %cst_5 = arith.constant dense<0xFF800000> : tensor<32xf32> 2026-02-21T08:25:56.4625865Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:25:56.4630368Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:25:56.4634232Z %c4224_i32 = arith.constant 4224 : i32 2026-02-21T08:25:56.4638808Z %c4224_i64 = arith.constant 4224 : i64 2026-02-21T08:25:56.4639131Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:25:56.4639497Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4224_i32], [%c4224_i64, %c1_i64] : , > 2026-02-21T08:25:56.4640110Z %1 = tt.get_program_id x : i32 2026-02-21T08:25:56.4645874Z scf.for %arg2 = %1 to %c128_i32 step %c592_i32 : i32 { 2026-02-21T08:25:56.4650001Z %2 = arith.muli %arg2, %c32_i32 : i32 2026-02-21T08:25:56.4654475Z %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:25:56.4659519Z %4 = tt.splat %2 : i32 -> tensor<32xi32> 2026-02-21T08:25:56.4664553Z %5 = arith.addi %4, %3 : 
tensor<32xi32> 2026-02-21T08:25:56.4669660Z %c4096_i32_6 = arith.constant 4096 : i32 2026-02-21T08:25:56.4674688Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:25:56.4678489Z %6:2 = scf.for %arg3 = %c0_i32 to %c4096_i32_6 step %c2048_i32 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<32xf32>, tensor<32xf32>) : i32 { 2026-02-21T08:25:56.4679292Z %60 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:56.4679592Z %61 = tt.splat %arg3 : i32 -> tensor<512xi32> 2026-02-21T08:25:56.4679810Z %62 = arith.addi %61, %60 : tensor<512xi32> 2026-02-21T08:25:56.4680050Z %63 = arith.cmpi slt, %62, %cst_3 : tensor<512xi32> 2026-02-21T08:25:56.4680376Z %64 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<32x512xf16> 2026-02-21T08:25:56.4680726Z %65 = tt.expand_dims %63 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:56.4681025Z %66 = tt.broadcast %65 : tensor<1x512xi1> -> tensor<32x512xi1> 2026-02-21T08:25:56.4681309Z %67 = arith.select %66, %64, %cst_2 : tensor<32x512xi1>, tensor<32x512xf16> 2026-02-21T08:25:56.4681667Z %68 = arith.extf %67 : tensor<32x512xf16> to tensor<32x512xf32> 2026-02-21T08:25:56.4681981Z %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.4682194Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:56.4682396Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:25:56.4682594Z tt.reduce.return %174 : f32 2026-02-21T08:25:56.4682791Z }) : (tensor<32x512xf32>) -> tensor<32xf32> 2026-02-21T08:25:56.4683016Z %70 = arith.truncf %69 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:25:56.4683265Z %71 = arith.extf %70 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:25:56.4683498Z %72 = arith.cmpf ogt, %arg4, %71 : tensor<32xf32> 2026-02-21T08:25:56.4683732Z %73 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32> 2026-02-21T08:25:56.4683949Z %74 = arith.ori %72, %73 : tensor<32xi1> 2026-02-21T08:25:56.4684184Z %75 = arith.select %74, %arg4, %71 : tensor<32xi1>, tensor<32xf32> 
2026-02-21T08:25:56.4684435Z %76 = arith.subf %arg4, %75 : tensor<32xf32> 2026-02-21T08:25:56.4684802Z %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:25:56.4685169Z %78 = arith.mulf %arg5, %77 : tensor<32xf32> 2026-02-21T08:25:56.4685421Z %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:25:56.4685726Z %80 = arith.extf %64 : tensor<32x512xf16> to tensor<32x512xf32> 2026-02-21T08:25:56.4685993Z %81 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x512xf32> 2026-02-21T08:25:56.4686226Z %82 = arith.subf %80, %81 : tensor<32x512xf32> 2026-02-21T08:25:56.4686597Z %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32> 2026-02-21T08:25:56.4686998Z %84 = arith.select %66, %83, %cst_1 : tensor<32x512xi1>, tensor<32x512xf32> 2026-02-21T08:25:56.4687254Z %85 = "tt.reduce"(%84) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.4687451Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:56.4687635Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:25:56.4687828Z tt.reduce.return %174 : f32 2026-02-21T08:25:56.4688013Z }) : (tensor<32x512xf32>) -> tensor<32xf32> 2026-02-21T08:25:56.4688217Z %86 = arith.addf %78, %85 : tensor<32xf32> 2026-02-21T08:25:56.4688410Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:25:56.4688609Z %87 = arith.muli %c512_i32, %c1_i32 : i32 2026-02-21T08:25:56.4688850Z %88 = arith.addi %arg3, %87 : i32 2026-02-21T08:25:56.4689088Z %89 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:56.4689347Z %90 = tt.splat %88 : i32 -> tensor<512xi32> 2026-02-21T08:25:56.4689541Z %91 = arith.addi %90, %89 : tensor<512xi32> 2026-02-21T08:25:56.4689759Z %92 = arith.cmpi slt, %91, %cst_3 : tensor<512xi32> 2026-02-21T08:25:56.4690056Z %93 = tt.descriptor_load %0[%2, %88] : !tt.tensordesc> -> tensor<32x512xf16> 
2026-02-21T08:25:56.4690482Z %94 = tt.expand_dims %92 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:56.4690780Z %95 = tt.broadcast %94 : tensor<1x512xi1> -> tensor<32x512xi1> 2026-02-21T08:25:56.4691050Z %96 = arith.select %95, %93, %cst_2 : tensor<32x512xi1>, tensor<32x512xf16> 2026-02-21T08:25:56.4691331Z %97 = arith.extf %96 : tensor<32x512xf16> to tensor<32x512xf32> 2026-02-21T08:25:56.4691592Z %98 = "tt.reduce"(%97) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.4691786Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:56.4691967Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:25:56.4692160Z tt.reduce.return %174 : f32 2026-02-21T08:25:56.4692339Z }) : (tensor<32x512xf32>) -> tensor<32xf32> 2026-02-21T08:25:56.4692562Z %99 = arith.truncf %98 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:25:56.4692807Z %100 = arith.extf %99 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:25:56.4693098Z %101 = arith.cmpf ogt, %75, %100 : tensor<32xf32> 2026-02-21T08:25:56.4693327Z %102 = arith.cmpf une, %75, %75 : tensor<32xf32> 2026-02-21T08:25:56.4693527Z %103 = arith.ori %101, %102 : tensor<32xi1> 2026-02-21T08:25:56.4693766Z %104 = arith.select %103, %75, %100 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:25:56.4694005Z %105 = arith.subf %75, %104 : tensor<32xf32> 2026-02-21T08:25:56.4694374Z %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:25:56.4694738Z %107 = arith.mulf %86, %106 : tensor<32xf32> 2026-02-21T08:25:56.4694992Z %108 = tt.expand_dims %104 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:25:56.4695293Z %109 = arith.extf %93 : tensor<32x512xf16> to tensor<32x512xf32> 2026-02-21T08:25:56.4695561Z %110 = tt.broadcast %108 : tensor<32x1xf32> -> tensor<32x512xf32> 2026-02-21T08:25:56.4695816Z %111 = arith.subf %109, %110 : tensor<32x512xf32> 2026-02-21T08:25:56.4696194Z %112 = tt.extern_elementwise %111 {libname = "", 
libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32> 2026-02-21T08:25:56.4696603Z %113 = arith.select %95, %112, %cst_1 : tensor<32x512xi1>, tensor<32x512xf32> 2026-02-21T08:25:56.4696864Z %114 = "tt.reduce"(%113) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.4697052Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:56.4697239Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:25:56.4697426Z tt.reduce.return %174 : f32 2026-02-21T08:25:56.4697620Z }) : (tensor<32x512xf32>) -> tensor<32xf32> 2026-02-21T08:25:56.4697828Z %115 = arith.addf %107, %114 : tensor<32xf32> 2026-02-21T08:25:56.4698020Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:25:56.4698213Z %116 = arith.muli %c512_i32, %c2_i32 : i32 2026-02-21T08:25:56.4698405Z %117 = arith.addi %arg3, %116 : i32 2026-02-21T08:25:56.4698642Z %118 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:56.4698891Z %119 = tt.splat %117 : i32 -> tensor<512xi32> 2026-02-21T08:25:56.4699102Z %120 = arith.addi %119, %118 : tensor<512xi32> 2026-02-21T08:25:56.4699334Z %121 = arith.cmpi slt, %120, %cst_3 : tensor<512xi32> 2026-02-21T08:25:56.4699652Z %122 = tt.descriptor_load %0[%2, %117] : !tt.tensordesc> -> tensor<32x512xf16> 2026-02-21T08:25:56.4700024Z %123 = tt.expand_dims %121 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:56.4700331Z %124 = tt.broadcast %123 : tensor<1x512xi1> -> tensor<32x512xi1> 2026-02-21T08:25:56.4700630Z %125 = arith.select %124, %122, %cst_2 : tensor<32x512xi1>, tensor<32x512xf16> 2026-02-21T08:25:56.4700931Z %126 = arith.extf %125 : tensor<32x512xf16> to tensor<32x512xf32> 2026-02-21T08:25:56.4701184Z %127 = "tt.reduce"(%126) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.4701449Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:56.4701691Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:25:56.4701897Z tt.reduce.return %174 : f32 2026-02-21T08:25:56.4702091Z }) : (tensor<32x512xf32>) -> tensor<32xf32> 
2026-02-21T08:25:56.4702335Z %128 = arith.truncf %127 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:25:56.4702596Z %129 = arith.extf %128 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:25:56.4702853Z %130 = arith.cmpf ogt, %104, %129 : tensor<32xf32> 2026-02-21T08:25:56.4703094Z %131 = arith.cmpf une, %104, %104 : tensor<32xf32> 2026-02-21T08:25:56.4703311Z %132 = arith.ori %130, %131 : tensor<32xi1> 2026-02-21T08:25:56.4703562Z %133 = arith.select %132, %104, %129 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:25:56.4703817Z %134 = arith.subf %104, %133 : tensor<32xf32> 2026-02-21T08:25:56.4704257Z %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:25:56.4704643Z %136 = arith.mulf %115, %135 : tensor<32xf32> 2026-02-21T08:25:56.4704905Z %137 = tt.expand_dims %133 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:25:56.4705222Z %138 = arith.extf %122 : tensor<32x512xf16> to tensor<32x512xf32> 2026-02-21T08:25:56.4705497Z %139 = tt.broadcast %137 : tensor<32x1xf32> -> tensor<32x512xf32> 2026-02-21T08:25:56.4705757Z %140 = arith.subf %138, %139 : tensor<32x512xf32> 2026-02-21T08:25:56.4706146Z %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32> 2026-02-21T08:25:56.4706642Z %142 = arith.select %124, %141, %cst_1 : tensor<32x512xi1>, tensor<32x512xf32> 2026-02-21T08:25:56.4706906Z %143 = "tt.reduce"(%142) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.4707097Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:56.4707285Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:25:56.4707468Z tt.reduce.return %174 : f32 2026-02-21T08:25:56.4707658Z }) : (tensor<32x512xf32>) -> tensor<32xf32> 2026-02-21T08:25:56.4707854Z %144 = arith.addf %136, %143 : tensor<32xf32> 2026-02-21T08:25:56.4708051Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:25:56.4708241Z %145 = arith.muli 
%c512_i32, %c3_i32 : i32 2026-02-21T08:25:56.4708432Z %146 = arith.addi %arg3, %145 : i32 2026-02-21T08:25:56.4708666Z %147 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:56.4708917Z %148 = tt.splat %146 : i32 -> tensor<512xi32> 2026-02-21T08:25:56.4709128Z %149 = arith.addi %148, %147 : tensor<512xi32> 2026-02-21T08:25:56.4709343Z %150 = arith.cmpi slt, %149, %cst_3 : tensor<512xi32> 2026-02-21T08:25:56.4709654Z %151 = tt.descriptor_load %0[%2, %146] : !tt.tensordesc> -> tensor<32x512xf16> 2026-02-21T08:25:56.4710010Z %152 = tt.expand_dims %150 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:56.4710304Z %153 = tt.broadcast %152 : tensor<1x512xi1> -> tensor<32x512xi1> 2026-02-21T08:25:56.4710594Z %154 = arith.select %153, %151, %cst_2 : tensor<32x512xi1>, tensor<32x512xf16> 2026-02-21T08:25:56.4710883Z %155 = arith.extf %154 : tensor<32x512xf16> to tensor<32x512xf32> 2026-02-21T08:25:56.4711123Z %156 = "tt.reduce"(%155) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.4711312Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:56.4711499Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:25:56.4711765Z tt.reduce.return %174 : f32 2026-02-21T08:25:56.4711944Z }) : (tensor<32x512xf32>) -> tensor<32xf32> 2026-02-21T08:25:56.4712176Z %157 = arith.truncf %156 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:25:56.4712424Z %158 = arith.extf %157 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:25:56.4712711Z %159 = arith.cmpf ogt, %133, %158 : tensor<32xf32> 2026-02-21T08:25:56.4712927Z %160 = arith.cmpf une, %133, %133 : tensor<32xf32> 2026-02-21T08:25:56.4713142Z %161 = arith.ori %159, %160 : tensor<32xi1> 2026-02-21T08:25:56.4713378Z %162 = arith.select %161, %133, %158 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:25:56.4713616Z %163 = arith.subf %133, %162 : tensor<32xf32> 2026-02-21T08:25:56.4713983Z %164 = tt.extern_elementwise %163 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : 
(tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:25:56.4714338Z %165 = arith.mulf %144, %164 : tensor<32xf32> 2026-02-21T08:25:56.4714596Z %166 = tt.expand_dims %162 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:25:56.4714891Z %167 = arith.extf %151 : tensor<32x512xf16> to tensor<32x512xf32> 2026-02-21T08:25:56.4715216Z %168 = tt.broadcast %166 : tensor<32x1xf32> -> tensor<32x512xf32> 2026-02-21T08:25:56.4715467Z %169 = arith.subf %167, %168 : tensor<32x512xf32> 2026-02-21T08:25:56.4715837Z %170 = tt.extern_elementwise %169 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32> 2026-02-21T08:25:56.4716265Z %171 = arith.select %153, %170, %cst_1 : tensor<32x512xi1>, tensor<32x512xf32> 2026-02-21T08:25:56.4716524Z %172 = "tt.reduce"(%171) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.4716722Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:25:56.4716907Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:25:56.4717092Z tt.reduce.return %174 : f32 2026-02-21T08:25:56.4717279Z }) : (tensor<32x512xf32>) -> tensor<32xf32> 2026-02-21T08:25:56.4717475Z %173 = arith.addf %165, %172 : tensor<32xf32> 2026-02-21T08:25:56.4717697Z scf.yield %162, %173 : tensor<32xf32>, tensor<32xf32> 2026-02-21T08:25:56.4717905Z } {tt.num_stages = 1 : i32} 2026-02-21T08:25:56.4718134Z %7 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:56.4718396Z %8 = tt.splat %c4096_i32_6 : i32 -> tensor<512xi32> 2026-02-21T08:25:56.4718600Z %9 = arith.addi %8, %7 : tensor<512xi32> 2026-02-21T08:25:56.4718813Z %10 = arith.cmpi slt, %9, %cst_3 : tensor<512xi32> 2026-02-21T08:25:56.4719125Z %11 = tt.descriptor_load %0[%2, %c4096_i32_6] : !tt.tensordesc> -> tensor<32x512xf16> 2026-02-21T08:25:56.4719486Z %12 = tt.expand_dims %10 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:56.4719765Z %13 = tt.broadcast %12 : tensor<1x512xi1> -> tensor<32x512xi1> 2026-02-21T08:25:56.4720043Z %14 = 
arith.select %13, %11, %cst_2 : tensor<32x512xi1>, tensor<32x512xf16> 2026-02-21T08:25:56.4720324Z %15 = arith.extf %14 : tensor<32x512xf16> to tensor<32x512xf32> 2026-02-21T08:25:56.4720555Z %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({ 2026-02-21T08:25:56.4720760Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:25:56.4720946Z %60 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:25:56.4721145Z tt.reduce.return %60 : f32 2026-02-21T08:25:56.4721334Z }) : (tensor<32x512xf32>) -> tensor<32xf32> 2026-02-21T08:25:56.4721612Z %17 = arith.truncf %16 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:25:56.4721866Z %18 = arith.extf %17 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:25:56.4722093Z %19 = arith.cmpf ogt, %6#0, %18 : tensor<32xf32> 2026-02-21T08:25:56.4722320Z %20 = arith.cmpf une, %6#0, %6#0 : tensor<32xf32> 2026-02-21T08:25:56.4722528Z %21 = arith.ori %19, %20 : tensor<32xi1> 2026-02-21T08:25:56.4722775Z %22 = arith.select %21, %6#0, %18 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:25:56.4723015Z %23 = arith.subf %6#0, %22 : tensor<32xf32> 2026-02-21T08:25:56.4723408Z %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:25:56.4723794Z %25 = arith.mulf %6#1, %24 : tensor<32xf32> 2026-02-21T08:25:56.4724122Z %26 = tt.expand_dims %22 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:25:56.4724410Z %27 = arith.extf %11 : tensor<32x512xf16> to tensor<32x512xf32> 2026-02-21T08:25:56.4724662Z %28 = tt.broadcast %26 : tensor<32x1xf32> -> tensor<32x512xf32> 2026-02-21T08:25:56.4724899Z %29 = arith.subf %27, %28 : tensor<32x512xf32> 2026-02-21T08:25:56.4725259Z %30 = tt.extern_elementwise %29 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32> 2026-02-21T08:25:56.4725667Z %31 = arith.select %13, %30, %cst_1 : tensor<32x512xi1>, tensor<32x512xf32> 2026-02-21T08:25:56.4725919Z %32 = "tt.reduce"(%31) <{axis 
= 1 : i32}> ({ 2026-02-21T08:25:56.4726105Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:25:56.4726290Z %60 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:25:56.4726529Z tt.reduce.return %60 : f32 2026-02-21T08:25:56.4726724Z }) : (tensor<32x512xf32>) -> tensor<32xf32> 2026-02-21T08:25:56.4726914Z %33 = arith.addf %25, %32 : tensor<32xf32> 2026-02-21T08:25:56.4727113Z %c4096_i32_7 = arith.constant 4096 : i32 2026-02-21T08:25:56.4727308Z %c2048_i32_8 = arith.constant 2048 : i32 2026-02-21T08:25:56.4727538Z scf.for %arg3 = %c0_i32 to %c4096_i32_7 step %c2048_i32_8 : i32 { 2026-02-21T08:25:56.4727822Z %60 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:56.4728075Z %61 = tt.splat %arg3 : i32 -> tensor<512xi32> 2026-02-21T08:25:56.4728282Z %62 = arith.addi %61, %60 : tensor<512xi32> 2026-02-21T08:25:56.4728490Z %63 = arith.cmpi slt, %62, %cst_3 : tensor<512xi32> 2026-02-21T08:25:56.4728756Z %64 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:25:56.4729026Z %65 = arith.muli %64, %cst_0 : tensor<32x1xi32> 2026-02-21T08:25:56.4729282Z %66 = tt.expand_dims %62 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:25:56.4729580Z %67 = tt.broadcast %65 : tensor<32x1xi32> -> tensor<32x512xi32> 2026-02-21T08:25:56.4729837Z %68 = tt.broadcast %66 : tensor<1x512xi32> -> tensor<32x512xi32> 2026-02-21T08:25:56.4730078Z %69 = arith.addi %67, %68 : tensor<32x512xi32> 2026-02-21T08:25:56.4730311Z %70 = tt.splat %arg0 : !tt.ptr -> tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4730596Z %71 = tt.addptr %70, %69 : tensor<32x512x!tt.ptr>, tensor<32x512xi32> 2026-02-21T08:25:56.4730891Z %72 = tt.expand_dims %63 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:56.4731170Z %73 = tt.broadcast %72 : tensor<1x512xi1> -> tensor<32x512xi1> 2026-02-21T08:25:56.4731473Z %74 = tt.load %71, %73, %cst evictionPolicy = evict_first : tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4731838Z %75 = 
tt.expand_dims %22 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:25:56.4732129Z %76 = arith.extf %74 : tensor<32x512xf16> to tensor<32x512xf32> 2026-02-21T08:25:56.4732392Z %77 = tt.broadcast %75 : tensor<32x1xf32> -> tensor<32x512xf32> 2026-02-21T08:25:56.4732629Z %78 = arith.subf %76, %77 : tensor<32x512xf32> 2026-02-21T08:25:56.4733001Z %79 = tt.extern_elementwise %78 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32> 2026-02-21T08:25:56.4733410Z %80 = tt.expand_dims %33 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:25:56.4733704Z %81 = tt.broadcast %80 : tensor<32x1xf32> -> tensor<32x512xf32> 2026-02-21T08:25:56.4733940Z %82 = arith.divf %79, %81 : tensor<32x512xf32> 2026-02-21T08:25:56.4734181Z %83 = arith.truncf %82 : tensor<32x512xf32> to tensor<32x512xf16> 2026-02-21T08:25:56.4734461Z %84 = tt.splat %arg1 : !tt.ptr -> tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4734745Z %85 = tt.addptr %84, %69 : tensor<32x512x!tt.ptr>, tensor<32x512xi32> 2026-02-21T08:25:56.4735089Z tt.store %85, %83, %73 : tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4735299Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:25:56.4735496Z %86 = arith.muli %c512_i32, %c1_i32 : i32 2026-02-21T08:25:56.4735684Z %87 = arith.addi %arg3, %86 : i32 2026-02-21T08:25:56.4735922Z %88 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:56.4736181Z %89 = tt.splat %87 : i32 -> tensor<512xi32> 2026-02-21T08:25:56.4736380Z %90 = arith.addi %89, %88 : tensor<512xi32> 2026-02-21T08:25:56.4736600Z %91 = arith.cmpi slt, %90, %cst_3 : tensor<512xi32> 2026-02-21T08:25:56.4736857Z %92 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:25:56.4737122Z %93 = arith.muli %92, %cst_0 : tensor<32x1xi32> 2026-02-21T08:25:56.4737433Z %94 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:25:56.4737720Z %95 = 
tt.broadcast %93 : tensor<32x1xi32> -> tensor<32x512xi32> 2026-02-21T08:25:56.4737983Z %96 = tt.broadcast %94 : tensor<1x512xi32> -> tensor<32x512xi32> 2026-02-21T08:25:56.4738210Z %97 = arith.addi %95, %96 : tensor<32x512xi32> 2026-02-21T08:25:56.4738448Z %98 = tt.splat %arg0 : !tt.ptr -> tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4738721Z %99 = tt.addptr %98, %97 : tensor<32x512x!tt.ptr>, tensor<32x512xi32> 2026-02-21T08:25:56.4739021Z %100 = tt.expand_dims %91 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:56.4739321Z %101 = tt.broadcast %100 : tensor<1x512xi1> -> tensor<32x512xi1> 2026-02-21T08:25:56.4739628Z %102 = tt.load %99, %101, %cst evictionPolicy = evict_first : tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4739967Z %103 = tt.expand_dims %22 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:25:56.4740259Z %104 = arith.extf %102 : tensor<32x512xf16> to tensor<32x512xf32> 2026-02-21T08:25:56.4740532Z %105 = tt.broadcast %103 : tensor<32x1xf32> -> tensor<32x512xf32> 2026-02-21T08:25:56.4740775Z %106 = arith.subf %104, %105 : tensor<32x512xf32> 2026-02-21T08:25:56.4741153Z %107 = tt.extern_elementwise %106 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32> 2026-02-21T08:25:56.4741588Z %108 = tt.expand_dims %33 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:25:56.4741879Z %109 = tt.broadcast %108 : tensor<32x1xf32> -> tensor<32x512xf32> 2026-02-21T08:25:56.4742129Z %110 = arith.divf %107, %109 : tensor<32x512xf32> 2026-02-21T08:25:56.4742370Z %111 = arith.truncf %110 : tensor<32x512xf32> to tensor<32x512xf16> 2026-02-21T08:25:56.4742655Z %112 = tt.splat %arg1 : !tt.ptr -> tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4742958Z %113 = tt.addptr %112, %97 : tensor<32x512x!tt.ptr>, tensor<32x512xi32> 2026-02-21T08:25:56.4743245Z tt.store %113, %111, %101 : tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4743476Z %c2_i32 = arith.constant 2 : i32 
2026-02-21T08:25:56.4743672Z %114 = arith.muli %c512_i32, %c2_i32 : i32 2026-02-21T08:25:56.4743880Z %115 = arith.addi %arg3, %114 : i32 2026-02-21T08:25:56.4744121Z %116 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:56.4744393Z %117 = tt.splat %115 : i32 -> tensor<512xi32> 2026-02-21T08:25:56.4744612Z %118 = arith.addi %117, %116 : tensor<512xi32> 2026-02-21T08:25:56.4744837Z %119 = arith.cmpi slt, %118, %cst_3 : tensor<512xi32> 2026-02-21T08:25:56.4745120Z %120 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:25:56.4745394Z %121 = arith.muli %120, %cst_0 : tensor<32x1xi32> 2026-02-21T08:25:56.4745678Z %122 = tt.expand_dims %118 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:25:56.4746038Z %123 = tt.broadcast %121 : tensor<32x1xi32> -> tensor<32x512xi32> 2026-02-21T08:25:56.4746330Z %124 = tt.broadcast %122 : tensor<1x512xi32> -> tensor<32x512xi32> 2026-02-21T08:25:56.4746595Z %125 = arith.addi %123, %124 : tensor<32x512xi32> 2026-02-21T08:25:56.4746849Z %126 = tt.splat %arg0 : !tt.ptr -> tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4747157Z %127 = tt.addptr %126, %125 : tensor<32x512x!tt.ptr>, tensor<32x512xi32> 2026-02-21T08:25:56.4747484Z %128 = tt.expand_dims %119 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:56.4747796Z %129 = tt.broadcast %128 : tensor<1x512xi1> -> tensor<32x512xi1> 2026-02-21T08:25:56.4748134Z %130 = tt.load %127, %129, %cst evictionPolicy = evict_first : tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4748485Z %131 = tt.expand_dims %22 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:25:56.4748849Z %132 = arith.extf %130 : tensor<32x512xf16> to tensor<32x512xf32> 2026-02-21T08:25:56.4749133Z %133 = tt.broadcast %131 : tensor<32x1xf32> -> tensor<32x512xf32> 2026-02-21T08:25:56.4749392Z %134 = arith.subf %132, %133 : tensor<32x512xf32> 2026-02-21T08:25:56.4749781Z %135 = tt.extern_elementwise %134 
{libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32> 2026-02-21T08:25:56.4750225Z %136 = tt.expand_dims %33 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:25:56.4750537Z %137 = tt.broadcast %136 : tensor<32x1xf32> -> tensor<32x512xf32> 2026-02-21T08:25:56.4750793Z %138 = arith.divf %135, %137 : tensor<32x512xf32> 2026-02-21T08:25:56.4751040Z %139 = arith.truncf %138 : tensor<32x512xf32> to tensor<32x512xf16> 2026-02-21T08:25:56.4751309Z %140 = tt.splat %arg1 : !tt.ptr -> tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4751633Z %141 = tt.addptr %140, %125 : tensor<32x512x!tt.ptr>, tensor<32x512xi32> 2026-02-21T08:25:56.4751915Z tt.store %141, %139, %129 : tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4752130Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:25:56.4752328Z %142 = arith.muli %c512_i32, %c3_i32 : i32 2026-02-21T08:25:56.4752520Z %143 = arith.addi %arg3, %142 : i32 2026-02-21T08:25:56.4752755Z %144 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:56.4753008Z %145 = tt.splat %143 : i32 -> tensor<512xi32> 2026-02-21T08:25:56.4753218Z %146 = arith.addi %145, %144 : tensor<512xi32> 2026-02-21T08:25:56.4753444Z %147 = arith.cmpi slt, %146, %cst_3 : tensor<512xi32> 2026-02-21T08:25:56.4753705Z %148 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:25:56.4753975Z %149 = arith.muli %148, %cst_0 : tensor<32x1xi32> 2026-02-21T08:25:56.4754235Z %150 = tt.expand_dims %146 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:25:56.4754534Z %151 = tt.broadcast %149 : tensor<32x1xi32> -> tensor<32x512xi32> 2026-02-21T08:25:56.4754801Z %152 = tt.broadcast %150 : tensor<1x512xi32> -> tensor<32x512xi32> 2026-02-21T08:25:56.4755053Z %153 = arith.addi %151, %152 : tensor<32x512xi32> 2026-02-21T08:25:56.4755302Z %154 = tt.splat %arg0 : !tt.ptr -> tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4755588Z %155 = tt.addptr 
%154, %153 : tensor<32x512x!tt.ptr>, tensor<32x512xi32> 2026-02-21T08:25:56.4755903Z %156 = tt.expand_dims %147 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:56.4756193Z %157 = tt.broadcast %156 : tensor<1x512xi1> -> tensor<32x512xi1> 2026-02-21T08:25:56.4756514Z %158 = tt.load %155, %157, %cst evictionPolicy = evict_first : tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4756855Z %159 = tt.expand_dims %22 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:25:56.4757190Z %160 = arith.extf %158 : tensor<32x512xf16> to tensor<32x512xf32> 2026-02-21T08:25:56.4757460Z %161 = tt.broadcast %159 : tensor<32x1xf32> -> tensor<32x512xf32> 2026-02-21T08:25:56.4757692Z %162 = arith.subf %160, %161 : tensor<32x512xf32> 2026-02-21T08:25:56.4758072Z %163 = tt.extern_elementwise %162 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32> 2026-02-21T08:25:56.4758498Z %164 = tt.expand_dims %33 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:25:56.4758789Z %165 = tt.broadcast %164 : tensor<32x1xf32> -> tensor<32x512xf32> 2026-02-21T08:25:56.4759031Z %166 = arith.divf %163, %165 : tensor<32x512xf32> 2026-02-21T08:25:56.4759266Z %167 = arith.truncf %166 : tensor<32x512xf32> to tensor<32x512xf16> 2026-02-21T08:25:56.4759543Z %168 = tt.splat %arg1 : !tt.ptr -> tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4759889Z %169 = tt.addptr %168, %153 : tensor<32x512x!tt.ptr>, tensor<32x512xi32> 2026-02-21T08:25:56.4760171Z tt.store %169, %167, %157 : tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4760390Z } {tt.num_stages = 1 : i32} 2026-02-21T08:25:56.4760610Z %34 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:25:56.4760874Z %35 = tt.splat %c4096_i32_7 : i32 -> tensor<512xi32> 2026-02-21T08:25:56.4761087Z %36 = arith.addi %35, %34 : tensor<512xi32> 2026-02-21T08:25:56.4761300Z %37 = arith.cmpi slt, %36, %cst_3 : tensor<512xi32> 2026-02-21T08:25:56.4761612Z 
%38 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:25:56.4761882Z %39 = arith.muli %38, %cst_0 : tensor<32x1xi32> 2026-02-21T08:25:56.4762137Z %40 = tt.expand_dims %36 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:25:56.4762419Z %41 = tt.broadcast %39 : tensor<32x1xi32> -> tensor<32x512xi32> 2026-02-21T08:25:56.4762695Z %42 = tt.broadcast %40 : tensor<1x512xi32> -> tensor<32x512xi32> 2026-02-21T08:25:56.4762929Z %43 = arith.addi %41, %42 : tensor<32x512xi32> 2026-02-21T08:25:56.4763170Z %44 = tt.splat %arg0 : !tt.ptr -> tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4763445Z %45 = tt.addptr %44, %43 : tensor<32x512x!tt.ptr>, tensor<32x512xi32> 2026-02-21T08:25:56.4763744Z %46 = tt.expand_dims %37 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T08:25:56.4764036Z %47 = tt.broadcast %46 : tensor<1x512xi1> -> tensor<32x512xi1> 2026-02-21T08:25:56.4764328Z %48 = tt.load %45, %47, %cst evictionPolicy = evict_first : tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4764653Z %49 = tt.expand_dims %22 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:25:56.4764928Z %50 = arith.extf %48 : tensor<32x512xf16> to tensor<32x512xf32> 2026-02-21T08:25:56.4765188Z %51 = tt.broadcast %49 : tensor<32x1xf32> -> tensor<32x512xf32> 2026-02-21T08:25:56.4765423Z %52 = arith.subf %50, %51 : tensor<32x512xf32> 2026-02-21T08:25:56.4765780Z %53 = tt.extern_elementwise %52 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32> 2026-02-21T08:25:56.4766186Z %54 = tt.expand_dims %33 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:25:56.4766460Z %55 = tt.broadcast %54 : tensor<32x1xf32> -> tensor<32x512xf32> 2026-02-21T08:25:56.4766695Z %56 = arith.divf %53, %55 : tensor<32x512xf32> 2026-02-21T08:25:56.4766923Z %57 = arith.truncf %56 : tensor<32x512xf32> to tensor<32x512xf16> 2026-02-21T08:25:56.4767191Z %58 = tt.splat %arg1 : !tt.ptr -> 
tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4767465Z %59 = tt.addptr %58, %43 : tensor<32x512x!tt.ptr>, tensor<32x512xi32> 2026-02-21T08:25:56.4767721Z tt.store %59, %57, %47 : tensor<32x512x!tt.ptr> 2026-02-21T08:25:56.4767944Z } {tt.flatten, tt.warp_specialize} 2026-02-21T08:25:56.4768170Z tt.return 2026-02-21T08:25:56.4768310Z } 2026-02-21T08:25:56.4768433Z } 2026-02-21T08:25:56.4768513Z 2026-02-21T08:25:56.4768565Z {-# 2026-02-21T08:25:56.4768706Z external_resources: { 2026-02-21T08:25:56.4768869Z mlir_reproducer: { 2026-02-21T08:25:56.4773359Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, 
tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:25:56.4777683Z disable_threading: false, 2026-02-21T08:25:56.4777856Z verify_each: true 2026-02-21T08:25:56.4777996Z } 2026-02-21T08:25:56.4778118Z } 2026-02-21T08:25:56.4778231Z #-} 2026-02-21T08:25:56.4778656Z /tmp/torchinductor_root/ey/ceyvajt67oz5df4svoktjonn76nenfn6xt77ywdibvxxhau3er2c.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:25:56.4782510Z /tmp/torchinductor_root/ey/ceyvajt67oz5df4svoktjonn76nenfn6xt77ywdibvxxhau3er2c.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:25:56.4783485Z [186s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:25:56.4784539Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 512], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=3, num_warps=32, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:25:56.4785542Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:25:56.4785807Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:25:57.3746739Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 16.2 configs/s 2026-02-21T08:25:59.0335009Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 611.7 2026-02-21T08:25:59.0339593Z configs/s 2026-02-21T08:25:59.1784393Z [188s] Generation 5 complete: 2026-02-21T08:25:59.1789294Z error=2 2026-02-21T08:25:59.1793683Z ok=75 2026-02-21T08:25:59.1795573Z min=0.0204 2026-02-21T08:25:59.1795736Z mid=0.0368 2026-02-21T08:25:59.1795879Z max=1.0691 2026-02-21T08:25:59.1796017Z best={'block_sizes': [1, 8192], 2026-02-21T08:25:59.1796277Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:25:59.1796535Z 'load_eviction_policies': ['', ''], 2026-02-21T08:25:59.1796718Z 'num_sm_multiplier': 32, 2026-02-21T08:25:59.1796883Z 'num_stages': 5, 2026-02-21T08:25:59.1797024Z 'num_warps': 1, 2026-02-21T08:25:59.1797183Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:25:59.1797372Z 'range_flattens': [True, True], 2026-02-21T08:25:59.1797549Z 'range_multi_buffers': [False, None], 2026-02-21T08:25:59.1797725Z 'range_num_stages': [3, 0], 2026-02-21T08:25:59.1797894Z 'range_unroll_factors': [0, 2], 2026-02-21T08:25:59.1798075Z 'range_warp_specializes': [True, None]} 2026-02-21T08:25:59.1807540Z [188s] Fitting surrogate: 509 points, 509 targets 2026-02-21T08:25:59.9929507Z [189s] Generation 6 starting: 47 neighbors, 4 active search path(s) 2026-02-21T08:26:05.1812132Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50/50 6.0 configs/s 2026-02-21T08:26:08.1834372Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 50/50 16.9 configs/s 2026-02-21T08:26:09.8789935Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 597.5 2026-02-21T08:26:09.8790317Z configs/s 2026-02-21T08:26:10.0097090Z [199s] Generation 6 complete: 2026-02-21T08:26:10.0098792Z ok=52 2026-02-21T08:26:10.0098964Z min=0.0204 2026-02-21T08:26:10.0099092Z mid=0.0307 2026-02-21T08:26:10.0099222Z max=0.0890 
2026-02-21T08:26:10.0099357Z best={'block_sizes': [1, 8192], 2026-02-21T08:26:10.0099614Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:26:10.0099869Z 'load_eviction_policies': ['', ''], 2026-02-21T08:26:10.0100046Z 'num_stages': 6, 2026-02-21T08:26:10.0100193Z 'num_warps': 1, 2026-02-21T08:26:10.0100361Z 'pid_type': 'flat', 2026-02-21T08:26:10.0100540Z 'range_flattens': [None, True], 2026-02-21T08:26:10.0100713Z 'range_multi_buffers': [None, None], 2026-02-21T08:26:10.0100897Z 'range_num_stages': [0, 4], 2026-02-21T08:26:10.0101059Z 'range_unroll_factors': [0, 0], 2026-02-21T08:26:10.0101243Z 'range_warp_specializes': [None, True]} 2026-02-21T08:26:10.0117813Z [199s] Fitting surrogate: 561 points, 561 targets 2026-02-21T08:26:10.6860679Z [200s] Generation 7 starting: 40 neighbors, 3 active search path(s) 2026-02-21T08:26:13.5418832Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41/41 16.4 configs/s 2026-02-21T08:26:16.0178707Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 16.9 configs/s 2026-02-21T08:26:17.3021736Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 786.4 2026-02-21T08:26:17.3022128Z configs/s 2026-02-21T08:26:17.4081108Z [207s] Generation 7 complete: 2026-02-21T08:26:17.4086191Z ok=44 2026-02-21T08:26:17.4090541Z min=0.0204 2026-02-21T08:26:17.4095530Z mid=0.0326 2026-02-21T08:26:17.4096901Z max=0.0891 2026-02-21T08:26:17.4097087Z best={'block_sizes': [1, 8192], 2026-02-21T08:26:17.4097339Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:26:17.4097600Z 'load_eviction_policies': ['', ''], 2026-02-21T08:26:17.4097784Z 'num_sm_multiplier': 32, 2026-02-21T08:26:17.4097950Z 'num_stages': 7, 2026-02-21T08:26:17.4098087Z 'num_warps': 1, 2026-02-21T08:26:17.4098339Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:26:17.4098541Z 'range_flattens': [None, True], 2026-02-21T08:26:17.4098717Z 'range_multi_buffers': [False, None], 
2026-02-21T08:26:17.4100314Z 'range_num_stages': [3, 1], 2026-02-21T08:26:17.4100534Z 'range_unroll_factors': [0, 2], 2026-02-21T08:26:17.4100782Z 'range_warp_specializes': [True, None]} 2026-02-21T08:26:17.4105452Z [207s] Fitting surrogate: 605 points, 605 targets 2026-02-21T08:26:17.6843509Z [207s] Autotuning complete in 207.5s after searching 585 configs. 2026-02-21T08:26:17.6843939Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:26:17.6844993Z @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', ''], num_sm_multiplier=32, num_stages=7, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:26:17.6845889Z 2026-02-21T08:26:17.6846141Z [207s] Code of selected kernel: /tmp/torchinductor_root/di/cdidzf2wwarfgduqe66h5bkfocvrxgivwwf2hmp4pjyehufbn26t.py 2026-02-21T08:26:18.6302495Z WARNING:tritonbench.utils.triton_op:Completed input ID 31: 2026-02-21T08:26:18.6302751Z (M, N) 2026-02-21T08:26:18.6302890Z ------------ 2026-02-21T08:26:18.6303023Z (4096, 4224) 2026-02-21T08:26:18.6303105Z 2026-02-21T08:26:18.6310116Z 35%|███▌ | 7/20 [17:16<35:59, 166.15s/it]WARNING:tritonbench.utils.triton_op:Running input ID 36: 2026-02-21T08:26:18.6310474Z (M, N) 2026-02-21T08:26:18.6310608Z ------------ 2026-02-21T08:26:18.6310759Z (4096, 4864) 2026-02-21T08:26:18.6314951Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T08:26:19.8997384Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T08:26:21.4220262Z INFO:tritonbench.utils.triton_op:Took 2.19ms to get benchmark function for torch_compile_softmax 2026-02-21T08:26:24.9112193Z WARNING:__main__:Input tensor metadata: 
2026-02-21T08:26:24.9113663Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:26:24.9113920Z 'dtype': 'torch.float16', 2026-02-21T08:26:24.9114179Z 'shape': (4096, 4864), 2026-02-21T08:26:24.9114384Z 'stride': (4864, 1)},), 2026-02-21T08:26:24.9118860Z 'kwargs': {}} 2026-02-21T08:26:24.9133659Z INFO:tritonbench.utils.triton_op:Took 2.33ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T08:26:25.0920435Z [0s] Autotune random seed: 2134816249 2026-02-21T08:26:25.2294253Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:26:58.9054532Z [33s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None]) 2026-02-21T08:26:59.1317273Z [33s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False]) 2026-02-21T08:26:59.2770522Z [34s] Timeout after 30s compiling Config(block_sizes=[256, 4096], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=128, num_sm_multiplier=128, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 2], range_unroll_factors=[3, 0], range_warp_specializes=[None, True]) 2026-02-21T08:26:59.2786611Z Initial population precompiling 
100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s 2026-02-21T08:27:05.7565579Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 15.5 configs/s 2026-02-21T08:27:05.7576247Z [40s] Adaptive compile timeout: 30s (90% percentile=5.2s, bounds=[30.0s, 30s]) 2026-02-21T08:27:06.2271517Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 2085.1 configs/s 2026-02-21T08:27:06.2730294Z [41s] Initial random population of 100, 5 starting points: 2026-02-21T08:27:06.2733414Z error=5 2026-02-21T08:27:06.2739190Z timeout=3 2026-02-21T08:27:06.2743820Z ok=92 2026-02-21T08:27:06.2745703Z min=0.0328 2026-02-21T08:27:06.2745862Z mid=0.4394 2026-02-21T08:27:06.2745998Z max=32.5366 2026-02-21T08:27:06.2746142Z best={'block_sizes': [1, 8192], 2026-02-21T08:27:06.2746381Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:27:06.2746620Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:27:06.2746800Z 'maxnreg': 32, 2026-02-21T08:27:06.2746954Z 'num_sm_multiplier': 64, 2026-02-21T08:27:06.2747109Z 'num_stages': 7, 2026-02-21T08:27:06.2747254Z 'num_warps': 4, 2026-02-21T08:27:06.2747409Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:27:06.2747609Z 'range_flattens': [None, True], 2026-02-21T08:27:06.2747784Z 'range_multi_buffers': [False, True], 2026-02-21T08:27:06.2747975Z 'range_num_stages': [1, 4], 2026-02-21T08:27:06.2748155Z 'range_unroll_factors': [1, 4], 2026-02-21T08:27:06.2748356Z 'range_warp_specializes': [True, None]} 2026-02-21T08:27:06.2748668Z [41s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:27:07.3784260Z [42s] Generation 1 starting: 85 neighbors, 5 active search path(s) 2026-02-21T08:27:20.0870688Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 1.3 configs/s 2026-02-21T08:27:21.3578074Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:27:21.3582979Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : 
i32}) attributes {noinline = false} { 2026-02-21T08:27:21.3583540Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:27:21.3584226Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:27:21.3584550Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:27:21.3584756Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:27:21.3585053Z %cst = arith.constant dense<0.000000e+00> : tensor<8x1024xf32> 2026-02-21T08:27:21.3585359Z %cst_0 = arith.constant dense<0xFC00> : tensor<8x1024xf16> 2026-02-21T08:27:21.3585611Z %cst_1 = arith.constant dense<4864> : tensor<1024xi32> 2026-02-21T08:27:21.3585867Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<8xf32> 2026-02-21T08:27:21.3586148Z %cst_3 = arith.constant dense<0xFF800000> : tensor<8xf32> 2026-02-21T08:27:21.3586370Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:27:21.3586547Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:27:21.3586735Z %c4864_i32 = arith.constant 4864 : i32 2026-02-21T08:27:21.3586915Z %c4864_i64 = arith.constant 4864 : i64 2026-02-21T08:27:21.3587089Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:27:21.3587410Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4864_i32], [%c4864_i64, %c1_i64] : , > 2026-02-21T08:27:21.3587869Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4864_i32], [%c4864_i64, %c1_i64] : , > 2026-02-21T08:27:21.3588522Z %2 = tt.get_program_id x : i32 2026-02-21T08:27:21.3588743Z scf.for %arg2 = %2 to %c512_i32 step %c9472_i32 : i32 { 2026-02-21T08:27:21.3588976Z %3 = arith.muli %arg2, %c8_i32 : i32 2026-02-21T08:27:21.3589184Z %c4096_i32_4 = arith.constant 4096 : i32 2026-02-21T08:27:21.3589382Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:27:21.3589786Z %4:2 = scf.for %arg3 = %c0_i32 to %c4096_i32_4 step %c2048_i32 iter_args(%arg4 = %cst_3, %arg5 = %cst_2) -> (tensor<8xf32>, tensor<8xf32>) : i32 { 2026-02-21T08:27:21.3590230Z %42 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T08:27:21.3590516Z %43 = tt.splat %arg3 : i32 -> 
tensor<1024xi32> 2026-02-21T08:27:21.3590747Z %44 = arith.addi %43, %42 : tensor<1024xi32> 2026-02-21T08:27:21.3590976Z %45 = arith.cmpi slt, %44, %cst_1 : tensor<1024xi32> 2026-02-21T08:27:21.3591459Z %46 = tt.descriptor_load %0[%3, %arg3] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T08:27:21.3593921Z %47 = tt.expand_dims %45 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T08:27:21.3594243Z %48 = tt.broadcast %47 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T08:27:21.3594536Z %49 = arith.select %48, %46, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T08:27:21.3594839Z %50 = arith.extf %49 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T08:27:21.3595100Z %51 = "tt.reduce"(%50) <{axis = 1 : i32}> ({ 2026-02-21T08:27:21.3595308Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:27:21.3595543Z %98 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:27:21.3595749Z tt.reduce.return %98 : f32 2026-02-21T08:27:21.3595957Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T08:27:21.3596192Z %52 = arith.truncf %51 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:27:21.3596455Z %53 = arith.extf %52 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:27:21.3596711Z %54 = arith.cmpf ogt, %arg4, %53 : tensor<8xf32> 2026-02-21T08:27:21.3596943Z %55 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32> 2026-02-21T08:27:21.3597170Z %56 = arith.ori %54, %55 : tensor<8xi1> 2026-02-21T08:27:21.3597407Z %57 = arith.select %56, %arg4, %53 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:27:21.3597660Z %58 = arith.subf %arg4, %57 : tensor<8xf32> 2026-02-21T08:27:21.3598036Z %59 = tt.extern_elementwise %58 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:27:21.3598404Z %60 = arith.mulf %arg5, %59 : tensor<8xf32> 2026-02-21T08:27:21.3598655Z %61 = tt.expand_dims %57 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:27:21.3598936Z %62 = arith.extf %46 : tensor<8x1024xf16> to 
tensor<8x1024xf32> 2026-02-21T08:27:21.3599200Z %63 = tt.broadcast %61 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T08:27:21.3599440Z %64 = arith.subf %62, %63 : tensor<8x1024xf32> 2026-02-21T08:27:21.3599811Z %65 = tt.extern_elementwise %64 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T08:27:21.3600224Z %66 = arith.select %48, %65, %cst : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T08:27:21.3600473Z %67 = "tt.reduce"(%66) <{axis = 1 : i32}> ({ 2026-02-21T08:27:21.3600669Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:27:21.3600850Z %98 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:27:21.3601044Z tt.reduce.return %98 : f32 2026-02-21T08:27:21.3601226Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T08:27:21.3601430Z %68 = arith.addf %60, %67 : tensor<8xf32> 2026-02-21T08:27:21.3601679Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:27:21.3601878Z %69 = arith.muli %c1024_i32, %c1_i32 : i32 2026-02-21T08:27:21.3602079Z %70 = arith.addi %arg3, %69 : i32 2026-02-21T08:27:21.3602440Z %71 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T08:27:21.3602707Z %72 = tt.splat %70 : i32 -> tensor<1024xi32> 2026-02-21T08:27:21.3602913Z %73 = arith.addi %72, %71 : tensor<1024xi32> 2026-02-21T08:27:21.3603136Z %74 = arith.cmpi slt, %73, %cst_1 : tensor<1024xi32> 2026-02-21T08:27:21.3603438Z %75 = tt.descriptor_load %0[%3, %70] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T08:27:21.3603794Z %76 = tt.expand_dims %74 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T08:27:21.3604097Z %77 = tt.broadcast %76 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T08:27:21.3604372Z %78 = arith.select %77, %75, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T08:27:21.3604655Z %79 = arith.extf %78 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T08:27:21.3604884Z %80 = "tt.reduce"(%79) <{axis = 1 : i32}> ({ 2026-02-21T08:27:21.3605157Z 
^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:27:21.3605342Z %98 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:27:21.3605538Z tt.reduce.return %98 : f32 2026-02-21T08:27:21.3605726Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T08:27:21.3605939Z %81 = arith.truncf %80 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:27:21.3606179Z %82 = arith.extf %81 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:27:21.3606399Z %83 = arith.cmpf ogt, %57, %82 : tensor<8xf32> 2026-02-21T08:27:21.3606614Z %84 = arith.cmpf une, %57, %57 : tensor<8xf32> 2026-02-21T08:27:21.3606809Z %85 = arith.ori %83, %84 : tensor<8xi1> 2026-02-21T08:27:21.3607047Z %86 = arith.select %85, %57, %82 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:27:21.3607280Z %87 = arith.subf %57, %86 : tensor<8xf32> 2026-02-21T08:27:21.3607631Z %88 = tt.extern_elementwise %87 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:27:21.3608005Z %89 = arith.mulf %68, %88 : tensor<8xf32> 2026-02-21T08:27:21.3608253Z %90 = tt.expand_dims %86 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:27:21.3608547Z %91 = arith.extf %75 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T08:27:21.3608805Z %92 = tt.broadcast %90 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T08:27:21.3609039Z %93 = arith.subf %91, %92 : tensor<8x1024xf32> 2026-02-21T08:27:21.3609404Z %94 = tt.extern_elementwise %93 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T08:27:21.3609807Z %95 = arith.select %77, %94, %cst : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T08:27:21.3610057Z %96 = "tt.reduce"(%95) <{axis = 1 : i32}> ({ 2026-02-21T08:27:21.3610242Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:27:21.3610428Z %98 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:27:21.3610620Z tt.reduce.return %98 : f32 2026-02-21T08:27:21.3610803Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 
2026-02-21T08:27:21.3611003Z %97 = arith.addf %89, %96 : tensor<8xf32> 2026-02-21T08:27:21.3611209Z scf.yield %86, %97 : tensor<8xf32>, tensor<8xf32> 2026-02-21T08:27:21.3611425Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:27:21.3611692Z %5 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T08:27:21.3611963Z %6 = tt.splat %c4096_i32_4 : i32 -> tensor<1024xi32> 2026-02-21T08:27:21.3612173Z %7 = arith.addi %6, %5 : tensor<1024xi32> 2026-02-21T08:27:21.3612388Z %8 = arith.cmpi slt, %7, %cst_1 : tensor<1024xi32> 2026-02-21T08:27:21.3612706Z %9 = tt.descriptor_load %0[%3, %c4096_i32_4] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T08:27:21.3613062Z %10 = tt.expand_dims %8 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T08:27:21.3613358Z %11 = tt.broadcast %10 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T08:27:21.3613699Z %12 = arith.select %11, %9, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T08:27:21.3613975Z %13 = arith.extf %12 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T08:27:21.3614205Z %14 = "tt.reduce"(%13) <{axis = 1 : i32}> ({ 2026-02-21T08:27:21.3614393Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:27:21.3614579Z %42 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:27:21.3614767Z tt.reduce.return %42 : f32 2026-02-21T08:27:21.3614953Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T08:27:21.3615165Z %15 = arith.truncf %14 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:27:21.3615406Z %16 = arith.extf %15 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:27:21.3615627Z %17 = arith.cmpf ogt, %4#0, %16 : tensor<8xf32> 2026-02-21T08:27:21.3615846Z %18 = arith.cmpf une, %4#0, %4#0 : tensor<8xf32> 2026-02-21T08:27:21.3616114Z %19 = arith.ori %17, %18 : tensor<8xi1> 2026-02-21T08:27:21.3616336Z %20 = arith.select %19, %4#0, %16 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:27:21.3616566Z %21 = arith.subf %4#0, %20 : tensor<8xf32> 2026-02-21T08:27:21.3616912Z %22 = 
tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:27:21.3617269Z %23 = arith.mulf %4#1, %22 : tensor<8xf32> 2026-02-21T08:27:21.3617514Z %24 = tt.expand_dims %20 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:27:21.3617790Z %25 = arith.extf %9 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T08:27:21.3618050Z %26 = tt.broadcast %24 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T08:27:21.3618283Z %27 = arith.subf %25, %26 : tensor<8x1024xf32> 2026-02-21T08:27:21.3618676Z %28 = tt.extern_elementwise %27 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T08:27:21.3619087Z %29 = arith.select %11, %28, %cst : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T08:27:21.3619329Z %30 = "tt.reduce"(%29) <{axis = 1 : i32}> ({ 2026-02-21T08:27:21.3619523Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:27:21.3619699Z %42 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:27:21.3619892Z tt.reduce.return %42 : f32 2026-02-21T08:27:21.3620083Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T08:27:21.3620280Z %31 = arith.addf %23, %30 : tensor<8xf32> 2026-02-21T08:27:21.3620478Z %c4096_i32_5 = arith.constant 4096 : i32 2026-02-21T08:27:21.3620669Z %c2048_i32_6 = arith.constant 2048 : i32 2026-02-21T08:27:21.3620915Z scf.for %arg3 = %c0_i32 to %c4096_i32_5 step %c2048_i32_6 : i32 { 2026-02-21T08:27:21.3621249Z %42 = tt.descriptor_load %0[%3, %arg3] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T08:27:21.3621650Z %43 = tt.expand_dims %20 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:27:21.3621946Z %44 = arith.extf %42 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T08:27:21.3622203Z %45 = tt.broadcast %43 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T08:27:21.3622440Z %46 = arith.subf %44, %45 : tensor<8x1024xf32> 2026-02-21T08:27:21.3622804Z %47 = tt.extern_elementwise 
%46 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T08:27:21.3623222Z %48 = tt.expand_dims %31 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:27:21.3623508Z %49 = tt.broadcast %48 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T08:27:21.3623736Z %50 = arith.divf %47, %49 : tensor<8x1024xf32> 2026-02-21T08:27:21.3623974Z %51 = arith.truncf %50 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T08:27:21.3624299Z tt.descriptor_store %1[%3, %arg3], %51 : !tt.tensordesc>, tensor<8x1024xf16> 2026-02-21T08:27:21.3624652Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:27:21.3624842Z %52 = arith.muli %c1024_i32, %c1_i32 : i32 2026-02-21T08:27:21.3625039Z %53 = arith.addi %arg3, %52 : i32 2026-02-21T08:27:21.3625314Z %54 = tt.descriptor_load %0[%3, %53] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T08:27:21.3625645Z %55 = tt.expand_dims %20 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:27:21.3625932Z %56 = arith.extf %54 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T08:27:21.3626184Z %57 = tt.broadcast %55 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T08:27:21.3626418Z %58 = arith.subf %56, %57 : tensor<8x1024xf32> 2026-02-21T08:27:21.3626782Z %59 = tt.extern_elementwise %58 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T08:27:21.3627276Z %60 = tt.expand_dims %31 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:27:21.3627565Z %61 = tt.broadcast %60 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T08:27:21.3627794Z %62 = arith.divf %59, %61 : tensor<8x1024xf32> 2026-02-21T08:27:21.3628031Z %63 = arith.truncf %62 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T08:27:21.3628341Z tt.descriptor_store %1[%3, %53], %63 : !tt.tensordesc>, tensor<8x1024xf16> 2026-02-21T08:27:21.3628628Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:27:21.3628931Z 
%32 = tt.descriptor_load %0[%3, %c4096_i32_5] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T08:27:21.3629274Z %33 = tt.expand_dims %20 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:27:21.3629561Z %34 = arith.extf %32 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T08:27:21.3629811Z %35 = tt.broadcast %33 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T08:27:21.3630049Z %36 = arith.subf %34, %35 : tensor<8x1024xf32> 2026-02-21T08:27:21.3630412Z %37 = tt.extern_elementwise %36 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T08:27:21.3630825Z %38 = tt.expand_dims %31 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:27:21.3631106Z %39 = tt.broadcast %38 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T08:27:21.3631333Z %40 = arith.divf %37, %39 : tensor<8x1024xf32> 2026-02-21T08:27:21.3631598Z %41 = arith.truncf %40 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T08:27:21.3631926Z tt.descriptor_store %1[%3, %c4096_i32_5], %41 : !tt.tensordesc>, tensor<8x1024xf16> 2026-02-21T08:27:21.3632297Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:27:21.3632553Z tt.return 2026-02-21T08:27:21.3632678Z } 2026-02-21T08:27:21.3632805Z } 2026-02-21T08:27:21.3632876Z 2026-02-21T08:27:21.3632926Z {-# 2026-02-21T08:27:21.3633065Z external_resources: { 2026-02-21T08:27:21.3633224Z mlir_reproducer: { 2026-02-21T08:27:21.3637715Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, 
tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:27:21.3642477Z disable_threading: false, 2026-02-21T08:27:21.3642650Z verify_each: true 2026-02-21T08:27:21.3642867Z } 2026-02-21T08:27:21.3642994Z } 2026-02-21T08:27:21.3643105Z #-} 2026-02-21T08:27:21.3643540Z /tmp/torchinductor_root/62/c62yhclaekpli4npmlpzx47qarjsmheeqhqutpmunqmhec3r3562.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:27:21.3644740Z /tmp/torchinductor_root/62/c62yhclaekpli4npmlpzx47qarjsmheeqhqutpmunqmhec3r3562.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, 
please share the reproducer above with Triton project.` 2026-02-21T08:27:21.3645714Z [56s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:27:21.3646839Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', ''], maxnreg=32, num_sm_multiplier=64, num_stages=5, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:27:21.3647844Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:27:21.3648098Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:27:25.4406640Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 16.8 configs/s 2026-02-21T08:27:29.5487647Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 246.9 2026-02-21T08:27:29.5488162Z configs/s 2026-02-21T08:27:29.8192315Z [64s] Generation 1 complete: 2026-02-21T08:27:29.8196710Z error=1 2026-02-21T08:27:29.8198132Z ok=90 2026-02-21T08:27:29.8198319Z min=0.0285 2026-02-21T08:27:29.8198456Z mid=0.0389 2026-02-21T08:27:29.8198599Z max=0.6842 2026-02-21T08:27:29.8198770Z best={'block_sizes': [1, 8192], 2026-02-21T08:27:29.8199020Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:27:29.8199263Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:27:29.8199444Z 'num_stages': 7, 2026-02-21T08:27:29.8199590Z 'num_warps': 4, 2026-02-21T08:27:29.8199729Z 'pid_type': 'flat', 2026-02-21T08:27:29.8199894Z 'range_flattens': [None, True], 2026-02-21T08:27:29.8200073Z 'range_multi_buffers': [None, None], 2026-02-21T08:27:29.8200265Z 'range_num_stages': [0, 4], 2026-02-21T08:27:29.8200431Z 'range_unroll_factors': [0, 0], 2026-02-21T08:27:29.8200633Z 'range_warp_specializes': [None, 
True]} 2026-02-21T08:27:29.8208041Z [64s] Fitting surrogate: 191 points, 191 targets 2026-02-21T08:27:30.7752839Z [65s] Generation 2 starting: 68 neighbors, 5 active search path(s) 2026-02-21T08:27:39.2376929Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 3.2 configs/s 2026-02-21T08:27:43.4099212Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 17.0 configs/s 2026-02-21T08:27:46.6335858Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 351.9 2026-02-21T08:27:46.6339792Z configs/s 2026-02-21T08:27:46.8432569Z [81s] Generation 2 complete: 2026-02-21T08:27:46.8437602Z error=1 2026-02-21T08:27:46.8439429Z ok=72 2026-02-21T08:27:46.8439596Z min=0.0245 2026-02-21T08:27:46.8439734Z mid=0.0328 2026-02-21T08:27:46.8439851Z max=0.2530 2026-02-21T08:27:46.8439995Z best={'block_sizes': [2, 8192], 2026-02-21T08:27:46.8440260Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:27:46.8440536Z 'load_eviction_policies': ['', ''], 2026-02-21T08:27:46.8440711Z 'num_stages': 4, 2026-02-21T08:27:46.8440858Z 'num_warps': 4, 2026-02-21T08:27:46.8441000Z 'pid_type': 'flat', 2026-02-21T08:27:46.8441165Z 'range_flattens': [None, None], 2026-02-21T08:27:46.8441342Z 'range_multi_buffers': [None, False], 2026-02-21T08:27:46.8442132Z 'range_num_stages': [0, 4], 2026-02-21T08:27:46.8442333Z 'range_unroll_factors': [0, 0], 2026-02-21T08:27:46.8442511Z 'range_warp_specializes': [None, True]} 2026-02-21T08:27:46.8448543Z [81s] Fitting surrogate: 264 points, 264 targets 2026-02-21T08:27:47.7545288Z [82s] Generation 3 starting: 60 neighbors, 5 active search path(s) 2026-02-21T08:27:54.5200987Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 4.6 configs/s 2026-02-21T08:27:58.2022007Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 61/61 16.8 configs/s 2026-02-21T08:28:01.4215721Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 315.3 
2026-02-21T08:28:01.4220202Z configs/s 2026-02-21T08:28:01.6546290Z [96s] Generation 3 complete: 2026-02-21T08:28:01.6550623Z ok=66 2026-02-21T08:28:01.6554634Z min=0.0205 2026-02-21T08:28:01.6558520Z mid=0.0287 2026-02-21T08:28:01.6560141Z max=1.0701 2026-02-21T08:28:01.6560395Z best={'block_sizes': [1, 8192], 2026-02-21T08:28:01.6565208Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:28:01.6569501Z 'load_eviction_policies': ['', ''], 2026-02-21T08:28:01.6574001Z 'num_stages': 5, 2026-02-21T08:28:01.6578404Z 'num_warps': 4, 2026-02-21T08:28:01.6582297Z 'pid_type': 'flat', 2026-02-21T08:28:01.6586303Z 'range_flattens': [None, True], 2026-02-21T08:28:01.6587955Z 'range_multi_buffers': [None, False], 2026-02-21T08:28:01.6588212Z 'range_num_stages': [0, 4], 2026-02-21T08:28:01.6588404Z 'range_unroll_factors': [0, 0], 2026-02-21T08:28:01.6588607Z 'range_warp_specializes': [None, True]} 2026-02-21T08:28:01.6588822Z [96s] Fitting surrogate: 330 points, 330 targets 2026-02-21T08:28:02.3506560Z [97s] Generation 4 starting: 47 neighbors, 4 active search path(s) 2026-02-21T08:28:27.9282882Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50/50 0.4 configs/s 2026-02-21T08:28:30.9909524Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 50/50 16.6 configs/s 2026-02-21T08:28:33.2987426Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 566.1 2026-02-21T08:28:33.2991258Z configs/s 2026-02-21T08:28:33.4554324Z [128s] Generation 4 complete: 2026-02-21T08:28:33.4558420Z ok=51 2026-02-21T08:28:33.4563426Z min=0.0204 2026-02-21T08:28:33.4565681Z mid=0.0287 2026-02-21T08:28:33.4565911Z max=0.1659 2026-02-21T08:28:33.4570081Z best={'block_sizes': [1, 8192], 2026-02-21T08:28:33.4571825Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:28:33.4572189Z 'load_eviction_policies': ['', ''], 2026-02-21T08:28:33.4576138Z 'num_stages': 5, 2026-02-21T08:28:33.4581226Z 
'num_warps': 4, 2026-02-21T08:28:33.4582693Z 'pid_type': 'flat', 2026-02-21T08:28:33.4582899Z 'range_flattens': [None, False], 2026-02-21T08:28:33.4583091Z 'range_multi_buffers': [None, False], 2026-02-21T08:28:33.4583285Z 'range_num_stages': [0, 1], 2026-02-21T08:28:33.4584115Z 'range_unroll_factors': [0, 2], 2026-02-21T08:28:33.4584306Z 'range_warp_specializes': [None, None]} 2026-02-21T08:28:33.4584593Z [128s] Fitting surrogate: 381 points, 381 targets 2026-02-21T08:28:34.2035630Z [128s] Generation 5 starting: 44 neighbors, 4 active search path(s) 2026-02-21T08:28:41.3889625Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47/47 2.9 configs/s 2026-02-21T08:28:44.2529695Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 47/47 16.7 configs/s 2026-02-21T08:28:45.9415510Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 600.1 2026-02-21T08:28:45.9419838Z configs/s 2026-02-21T08:28:46.0863050Z [140s] Generation 5 complete: 2026-02-21T08:28:46.0867242Z ok=49 2026-02-21T08:28:46.0872077Z min=0.0205 2026-02-21T08:28:46.0876238Z mid=0.0326 2026-02-21T08:28:46.0881136Z max=0.1065 2026-02-21T08:28:46.0881368Z best={'block_sizes': [1, 8192], 2026-02-21T08:28:46.0886933Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:28:46.0890627Z 'load_eviction_policies': ['', ''], 2026-02-21T08:28:46.0895147Z 'num_stages': 5, 2026-02-21T08:28:46.0895424Z 'num_warps': 4, 2026-02-21T08:28:46.0895636Z 'pid_type': 'flat', 2026-02-21T08:28:46.0895858Z 'range_flattens': [None, False], 2026-02-21T08:28:46.0896077Z 'range_multi_buffers': [None, False], 2026-02-21T08:28:46.0896274Z 'range_num_stages': [0, 1], 2026-02-21T08:28:46.0896466Z 'range_unroll_factors': [0, 2], 2026-02-21T08:28:46.0896655Z 'range_warp_specializes': [None, None]} 2026-02-21T08:28:46.0896899Z [140s] Fitting surrogate: 430 points, 430 targets 2026-02-21T08:28:46.6408450Z [141s] Generation 6 starting: 31 neighbors, 3 active search path(s) 
2026-02-21T08:28:50.6095873Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 4.7 configs/s 2026-02-21T08:28:52.5758976Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 32/32 16.6 configs/s 2026-02-21T08:28:54.4097380Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 553.0 2026-02-21T08:28:54.4101853Z configs/s 2026-02-21T08:28:54.5706697Z [149s] Generation 6 complete: 2026-02-21T08:28:54.5710908Z ok=35 2026-02-21T08:28:54.5712809Z min=0.0204 2026-02-21T08:28:54.5712977Z mid=0.0246 2026-02-21T08:28:54.5713100Z max=0.0635 2026-02-21T08:28:54.5713248Z best={'block_sizes': [1, 8192], 2026-02-21T08:28:54.5713592Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:28:54.5713891Z 'load_eviction_policies': ['', ''], 2026-02-21T08:28:54.5718601Z 'num_stages': 5, 2026-02-21T08:28:54.5722562Z 'num_warps': 4, 2026-02-21T08:28:54.5727266Z 'pid_type': 'flat', 2026-02-21T08:28:54.5727542Z 'range_flattens': [None, True], 2026-02-21T08:28:54.5727771Z 'range_multi_buffers': [None, False], 2026-02-21T08:28:54.5728016Z 'range_num_stages': [0, 1], 2026-02-21T08:28:54.5728602Z 'range_unroll_factors': [0, 3], 2026-02-21T08:28:54.5733022Z 'range_warp_specializes': [None, None]} 2026-02-21T08:28:54.5735149Z [149s] Fitting surrogate: 465 points, 465 targets 2026-02-21T08:28:55.0108739Z [149s] Generation 7 starting: 23 neighbors, 2 active search path(s) 2026-02-21T08:28:57.4394144Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 18.5 configs/s 2026-02-21T08:28:58.8552347Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 16.8 configs/s 2026-02-21T08:29:00.2635712Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 719.8 2026-02-21T08:29:00.2637053Z configs/s 2026-02-21T08:29:00.3931246Z [155s] Generation 7 complete: 2026-02-21T08:29:00.3933429Z ok=25 2026-02-21T08:29:00.3938735Z min=0.0204 2026-02-21T08:29:00.3943377Z mid=0.0206 2026-02-21T08:29:00.3948502Z 
max=0.0390 2026-02-21T08:29:00.3953463Z best={'block_sizes': [1, 8192], 2026-02-21T08:29:00.3956105Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:29:00.3956496Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:29:00.3960296Z 'num_stages': 6, 2026-02-21T08:29:00.3964824Z 'num_warps': 1, 2026-02-21T08:29:00.3966961Z 'pid_type': 'flat', 2026-02-21T08:29:00.3967232Z 'range_flattens': [None, True], 2026-02-21T08:29:00.3967443Z 'range_multi_buffers': [None, False], 2026-02-21T08:29:00.3972149Z 'range_num_stages': [0, 1], 2026-02-21T08:29:00.3974071Z 'range_unroll_factors': [0, 0], 2026-02-21T08:29:00.3974307Z 'range_warp_specializes': [None, True]} 2026-02-21T08:29:00.3974605Z [155s] Fitting surrogate: 490 points, 490 targets 2026-02-21T08:29:00.8072647Z [155s] Generation 8 starting: 19 neighbors, 2 active search path(s) 2026-02-21T08:29:02.9635080Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 5.7 configs/s 2026-02-21T08:29:04.1130959Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 19/19 17.2 configs/s 2026-02-21T08:29:06.0211869Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 756.4 2026-02-21T08:29:06.0212329Z configs/s 2026-02-21T08:29:06.1461143Z [160s] Generation 8 complete: 2026-02-21T08:29:06.1465988Z ok=21 2026-02-21T08:29:06.1471094Z min=0.0204 2026-02-21T08:29:06.1476428Z mid=0.0205 2026-02-21T08:29:06.1478991Z max=0.0287 2026-02-21T08:29:06.1479176Z best={'block_sizes': [1, 8192], 2026-02-21T08:29:06.1479440Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:29:06.1479692Z 'load_eviction_policies': ['', ''], 2026-02-21T08:29:06.1479875Z 'num_stages': 5, 2026-02-21T08:29:06.1480014Z 'num_warps': 4, 2026-02-21T08:29:06.1480162Z 'pid_type': 'flat', 2026-02-21T08:29:06.1480317Z 'range_flattens': [None, True], 2026-02-21T08:29:06.1480506Z 'range_multi_buffers': [None, False], 2026-02-21T08:29:06.1480690Z 'range_num_stages': [0, 1], 
2026-02-21T08:29:06.1480903Z 'range_unroll_factors': [0, 0], 2026-02-21T08:29:06.1481276Z 'range_warp_specializes': [None, True]} 2026-02-21T08:29:06.1499204Z [160s] Fitting surrogate: 511 points, 511 targets 2026-02-21T08:29:06.6647504Z [161s] Generation 9 starting: 23 neighbors, 2 active search path(s) 2026-02-21T08:29:09.1997694Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 11.4 configs/s 2026-02-21T08:29:10.5808958Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 17.2 configs/s 2026-02-21T08:29:12.0972149Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 668.0 2026-02-21T08:29:12.0975769Z configs/s 2026-02-21T08:29:12.2221363Z [166s] Generation 9 complete: 2026-02-21T08:29:12.2225867Z ok=25 2026-02-21T08:29:12.2230313Z min=0.0204 2026-02-21T08:29:12.2235318Z mid=0.0205 2026-02-21T08:29:12.2239948Z max=0.0307 2026-02-21T08:29:12.2244426Z best={'block_sizes': [1, 8192], 2026-02-21T08:29:12.2249113Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:29:12.2249854Z 'load_eviction_policies': ['', ''], 2026-02-21T08:29:12.2250056Z 'num_stages': 5, 2026-02-21T08:29:12.2250214Z 'num_warps': 2, 2026-02-21T08:29:12.2250365Z 'pid_type': 'flat', 2026-02-21T08:29:12.2250524Z 'range_flattens': [None, True], 2026-02-21T08:29:12.2250720Z 'range_multi_buffers': [None, None], 2026-02-21T08:29:12.2250907Z 'range_num_stages': [0, 1], 2026-02-21T08:29:12.2251082Z 'range_unroll_factors': [0, 0], 2026-02-21T08:29:12.2251266Z 'range_warp_specializes': [None, True]} 2026-02-21T08:29:12.2251480Z [166s] Fitting surrogate: 536 points, 536 targets 2026-02-21T08:29:12.7189685Z [167s] Generation 10 starting: 19 neighbors, 2 active search path(s) 2026-02-21T08:29:14.2767497Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 20.0 configs/s 2026-02-21T08:29:15.4198744Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 17.3 configs/s 2026-02-21T08:29:16.6312858Z Generation 10: 
verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 833.8 2026-02-21T08:29:16.6313285Z configs/s 2026-02-21T08:29:16.7350644Z [171s] Generation 10 complete: 2026-02-21T08:29:16.7355106Z ok=21 2026-02-21T08:29:16.7359580Z min=0.0205 2026-02-21T08:29:16.7364034Z mid=0.0205 2026-02-21T08:29:16.7365410Z max=0.0368 2026-02-21T08:29:16.7365596Z best={'block_sizes': [1, 8192], 2026-02-21T08:29:16.7365869Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:29:16.7366153Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:29:16.7366349Z 'num_stages': 8, 2026-02-21T08:29:16.7366502Z 'num_warps': 1, 2026-02-21T08:29:16.7366647Z 'pid_type': 'flat', 2026-02-21T08:29:16.7366816Z 'range_flattens': [None, None], 2026-02-21T08:29:16.7367001Z 'range_multi_buffers': [None, False], 2026-02-21T08:29:16.7367197Z 'range_num_stages': [0, 1], 2026-02-21T08:29:16.7367364Z 'range_unroll_factors': [0, 0], 2026-02-21T08:29:16.7367584Z 'range_warp_specializes': [None, True]} 2026-02-21T08:29:16.7369623Z [171s] Fitting surrogate: 557 points, 557 targets 2026-02-21T08:29:17.2456765Z [172s] Generation 11 starting: 18 neighbors, 2 active search path(s) 2026-02-21T08:29:18.5697645Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 12.4 configs/s 2026-02-21T08:29:19.6413021Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 17.5 configs/s 2026-02-21T08:29:20.7667064Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 897.9 2026-02-21T08:29:20.7671194Z configs/s 2026-02-21T08:29:20.8649958Z [175s] Generation 11 complete: 2026-02-21T08:29:20.8654475Z ok=20 2026-02-21T08:29:20.8658808Z min=0.0204 2026-02-21T08:29:20.8660394Z mid=0.0205 2026-02-21T08:29:20.8660580Z max=0.0369 2026-02-21T08:29:20.8660773Z best={'block_sizes': [1, 8192], 2026-02-21T08:29:20.8661075Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:29:20.8661349Z 'load_eviction_policies': ['', ''], 2026-02-21T08:29:20.8661620Z 
'num_stages': 5, 2026-02-21T08:29:20.8661775Z 'num_warps': 2, 2026-02-21T08:29:20.8661974Z 'pid_type': 'flat', 2026-02-21T08:29:20.8662159Z 'range_flattens': [None, True], 2026-02-21T08:29:20.8662359Z 'range_multi_buffers': [None, False], 2026-02-21T08:29:20.8662563Z 'range_num_stages': [0, 2], 2026-02-21T08:29:20.8662756Z 'range_unroll_factors': [0, 0], 2026-02-21T08:29:20.8662983Z 'range_warp_specializes': [None, True]} 2026-02-21T08:29:20.8670750Z [175s] Fitting surrogate: 577 points, 577 targets 2026-02-21T08:29:21.4056584Z [176s] Generation 12 starting: 21 neighbors, 2 active search path(s) 2026-02-21T08:29:23.0070725Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 11.2 configs/s 2026-02-21T08:29:24.2285520Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 21/21 17.9 configs/s 2026-02-21T08:29:25.6879189Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 694.3 2026-02-21T08:29:25.6880150Z configs/s 2026-02-21T08:29:25.8118174Z [180s] Generation 12 complete: 2026-02-21T08:29:25.8121990Z ok=23 2026-02-21T08:29:25.8123875Z min=0.0205 2026-02-21T08:29:25.8124062Z mid=0.0205 2026-02-21T08:29:25.8124195Z max=0.0286 2026-02-21T08:29:25.8124354Z best={'block_sizes': [1, 8192], 2026-02-21T08:29:25.8124613Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:29:25.8124872Z 'load_eviction_policies': ['', ''], 2026-02-21T08:29:25.8125061Z 'num_stages': 5, 2026-02-21T08:29:25.8125204Z 'num_warps': 1, 2026-02-21T08:29:25.8125356Z 'pid_type': 'flat', 2026-02-21T08:29:25.8125514Z 'range_flattens': [None, True], 2026-02-21T08:29:25.8125706Z 'range_multi_buffers': [None, False], 2026-02-21T08:29:25.8125892Z 'range_num_stages': [0, 2], 2026-02-21T08:29:25.8126070Z 'range_unroll_factors': [0, 0], 2026-02-21T08:29:25.8126581Z 'range_warp_specializes': [None, True]} 2026-02-21T08:29:25.8139407Z [180s] Fitting surrogate: 600 points, 600 targets 2026-02-21T08:29:26.1826752Z [180s] Generation 13 starting: 7 
neighbors, 1 active search path(s) 2026-02-21T08:29:26.9621527Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7/7 7.4 configs/s 2026-02-21T08:29:27.3775073Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 7/7 19.2 configs/s 2026-02-21T08:29:27.8807563Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1974.1 2026-02-21T08:29:27.8812010Z configs/s 2026-02-21T08:29:27.9300451Z [182s] Generation 13 complete: 2026-02-21T08:29:27.9305896Z ok=8 2026-02-21T08:29:27.9310539Z min=0.0204 2026-02-21T08:29:27.9312849Z mid=0.0205 2026-02-21T08:29:27.9313050Z max=0.0287 2026-02-21T08:29:27.9317414Z best={'block_sizes': [1, 8192], 2026-02-21T08:29:27.9321766Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:29:27.9323848Z 'load_eviction_policies': ['', ''], 2026-02-21T08:29:27.9324069Z 'num_stages': 5, 2026-02-21T08:29:27.9324229Z 'num_warps': 1, 2026-02-21T08:29:27.9324375Z 'pid_type': 'flat', 2026-02-21T08:29:27.9324546Z 'range_flattens': [None, True], 2026-02-21T08:29:27.9324733Z 'range_multi_buffers': [None, False], 2026-02-21T08:29:27.9324931Z 'range_num_stages': [0, 2], 2026-02-21T08:29:27.9325109Z 'range_unroll_factors': [0, 0], 2026-02-21T08:29:27.9325290Z 'range_warp_specializes': [None, True]} 2026-02-21T08:29:27.9325593Z [182s] Fitting surrogate: 608 points, 608 targets 2026-02-21T08:29:28.2980493Z [183s] Generation 14 starting: 9 neighbors, 1 active search path(s) 2026-02-21T08:29:29.0548796Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9/9 25.4 configs/s 2026-02-21T08:29:29.5948146Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 9/9 18.3 configs/s 2026-02-21T08:29:30.2258734Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1584.9 2026-02-21T08:29:30.2259867Z configs/s 2026-02-21T08:29:30.2903651Z [185s] Generation 14 complete: 2026-02-21T08:29:30.2909429Z ok=10 2026-02-21T08:29:30.2911085Z min=0.0205 2026-02-21T08:29:30.2911285Z 
mid=0.0205 2026-02-21T08:29:30.2916370Z max=0.0267 2026-02-21T08:29:30.2921613Z best={'block_sizes': [1, 8192], 2026-02-21T08:29:30.2925284Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:29:30.2929360Z 'load_eviction_policies': ['', ''], 2026-02-21T08:29:30.2933499Z 'num_stages': 5, 2026-02-21T08:29:30.2935099Z 'num_warps': 1, 2026-02-21T08:29:30.2935294Z 'pid_type': 'flat', 2026-02-21T08:29:30.2935461Z 'range_flattens': [None, True], 2026-02-21T08:29:30.2935654Z 'range_multi_buffers': [None, False], 2026-02-21T08:29:30.2935839Z 'range_num_stages': [0, 2], 2026-02-21T08:29:30.2936013Z 'range_unroll_factors': [0, 0], 2026-02-21T08:29:30.2936222Z 'range_warp_specializes': [None, True]} 2026-02-21T08:29:30.2936532Z [185s] Fitting surrogate: 618 points, 618 targets 2026-02-21T08:29:30.6456977Z [185s] Generation 15 starting: 5 neighbors, 1 active search path(s) 2026-02-21T08:29:31.3426657Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5/5 6.7 configs/s 2026-02-21T08:29:31.6408000Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 5/5 20.4 configs/s 2026-02-21T08:29:32.0064339Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2696.2 2026-02-21T08:29:32.0066035Z configs/s 2026-02-21T08:29:32.0472098Z [186s] Generation 15 complete: 2026-02-21T08:29:32.0476591Z ok=6 2026-02-21T08:29:32.0480432Z min=0.0205 2026-02-21T08:29:32.0481945Z mid=0.0205 2026-02-21T08:29:32.0482113Z max=0.0205 2026-02-21T08:29:32.0482257Z best={'block_sizes': [1, 8192], 2026-02-21T08:29:32.0482537Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:29:32.0483159Z 'load_eviction_policies': ['', ''], 2026-02-21T08:29:32.0483383Z 'num_stages': 4, 2026-02-21T08:29:32.0483529Z 'num_warps': 1, 2026-02-21T08:29:32.0483683Z 'pid_type': 'flat', 2026-02-21T08:29:32.0483839Z 'range_flattens': [None, True], 2026-02-21T08:29:32.0484027Z 'range_multi_buffers': [None, True], 
2026-02-21T08:29:32.0484221Z 'range_num_stages': [0, 2], 2026-02-21T08:29:32.0484386Z 'range_unroll_factors': [0, 0], 2026-02-21T08:29:32.0484581Z 'range_warp_specializes': [None, True]} 2026-02-21T08:29:32.0495435Z [186s] Fitting surrogate: 624 points, 624 targets 2026-02-21T08:29:32.4092329Z [187s] Generation 16 starting: 9 neighbors, 1 active search path(s) 2026-02-21T08:29:33.6464556Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9/9 17.3 configs/s 2026-02-21T08:29:34.1854314Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 9/9 18.3 configs/s 2026-02-21T08:29:34.8038621Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1616.8 2026-02-21T08:29:34.8042784Z configs/s 2026-02-21T08:29:34.8618405Z [189s] Generation 16 complete: 2026-02-21T08:29:34.8620037Z ok=10 2026-02-21T08:29:34.8620218Z min=0.0205 2026-02-21T08:29:34.8620356Z mid=0.0205 2026-02-21T08:29:34.8620547Z max=0.0205 2026-02-21T08:29:34.8620695Z best={'block_sizes': [1, 8192], 2026-02-21T08:29:34.8625533Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:29:34.8629844Z 'load_eviction_policies': ['', ''], 2026-02-21T08:29:34.8631181Z 'num_stages': 4, 2026-02-21T08:29:34.8631359Z 'num_warps': 1, 2026-02-21T08:29:34.8631518Z 'pid_type': 'flat', 2026-02-21T08:29:34.8631861Z 'range_flattens': [None, True], 2026-02-21T08:29:34.8632050Z 'range_multi_buffers': [None, True], 2026-02-21T08:29:34.8632236Z 'range_num_stages': [0, 2], 2026-02-21T08:29:34.8632411Z 'range_unroll_factors': [0, 0], 2026-02-21T08:29:34.8632588Z 'range_warp_specializes': [None, True]} 2026-02-21T08:29:34.8644332Z [189s] Fitting surrogate: 634 points, 634 targets 2026-02-21T08:29:35.1366618Z [189s] Autotuning complete in 189.9s after searching 599 configs. 
2026-02-21T08:29:35.1368197Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:29:35.1369147Z @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', ''], num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:29:35.1369960Z 2026-02-21T08:29:35.1370223Z [189s] Code of selected kernel: /tmp/torchinductor_root/eh/cehz4xnjf6rgufwtihsrd5tyojkz32iuatxzbunrtbcbt4wvl4e2.py 2026-02-21T08:29:36.1874272Z WARNING:tritonbench.utils.triton_op:Completed input ID 36: 2026-02-21T08:29:36.1879197Z (M, N) 2026-02-21T08:29:36.1880525Z ------------ 2026-02-21T08:29:36.1881028Z (4096, 4864) 2026-02-21T08:29:36.1881146Z 2026-02-21T08:29:36.1888749Z 40%|████ | 8/20 [20:33<35:13, 176.15s/it]WARNING:tritonbench.utils.triton_op:Running input ID 41: 2026-02-21T08:29:36.1892841Z (M, N) 2026-02-21T08:29:36.1894758Z ------------ 2026-02-21T08:29:36.1894934Z (4096, 5504) 2026-02-21T08:29:36.1895279Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T08:29:37.4282155Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T08:29:38.9433523Z INFO:tritonbench.utils.triton_op:Took 2.34ms to get benchmark function for torch_compile_softmax 2026-02-21T08:29:42.3567385Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:29:42.3570942Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:29:42.3574172Z 'dtype': 'torch.float16', 2026-02-21T08:29:42.3577979Z 'shape': (4096, 5504), 2026-02-21T08:29:42.3581331Z 'stride': (5504, 1)},), 2026-02-21T08:29:42.3581859Z 'kwargs': {}} 2026-02-21T08:29:42.3590456Z INFO:tritonbench.utils.triton_op:Took 2.51ms to get benchmark function for helion_softmax_tritonbench 
2026-02-21T08:29:42.5354538Z [0s] Autotune random seed: 2134816249 2026-02-21T08:29:42.6730945Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:30:16.6290577Z [33s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None]) 2026-02-21T08:30:16.9085997Z [34s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False]) 2026-02-21T08:30:17.0848109Z [34s] Timeout after 30s compiling Config(block_sizes=[256, 4096], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=128, num_sm_multiplier=128, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 2], range_unroll_factors=[3, 0], range_warp_specializes=[None, True]) 2026-02-21T08:30:17.0868463Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s 2026-02-21T08:30:20.0670555Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:30:20.0672843Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:30:20.0673819Z %c128_i32 = arith.constant 128 : i32 
2026-02-21T08:30:20.0674116Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:30:20.0674325Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:30:20.0678110Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:30:20.0682225Z %cst = arith.constant dense<5504> : tensor<32x1xi32> 2026-02-21T08:30:20.0687101Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<32xf32> 2026-02-21T08:30:20.0691880Z %cst_1 = arith.constant dense<0xFF800000> : tensor<32xf32> 2026-02-21T08:30:20.0692270Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:30:20.0695267Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:30:20.0695583Z %c5504_i32 = arith.constant 5504 : i32 2026-02-21T08:30:20.0695826Z %c5504_i64 = arith.constant 5504 : i64 2026-02-21T08:30:20.0703528Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:30:20.0705743Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c5504_i32], [%c5504_i64, %c1_i64] : , > 2026-02-21T08:30:20.0706345Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c5504_i32], [%c5504_i64, %c1_i64] : , > 2026-02-21T08:30:20.0706790Z %2 = tt.get_program_id x : i32 2026-02-21T08:30:20.0710642Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:30:20.0715324Z %4 = arith.minsi %3, %c128_i32 : i32 2026-02-21T08:30:20.0717373Z scf.for %arg2 = %2 to %4 step %c1_i32 : i32 { 2026-02-21T08:30:20.0717652Z %5 = arith.muli %arg2, %c32_i32 : i32 2026-02-21T08:30:20.0717927Z %6 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:30:20.0718227Z %7 = tt.splat %5 : i32 -> tensor<32xi32> 2026-02-21T08:30:20.0718444Z %8 = arith.addi %7, %6 : tensor<32xi32> 2026-02-21T08:30:20.0718673Z %c5496_i32 = arith.constant 5496 : i32 2026-02-21T08:30:20.0718886Z %c24_i32 = arith.constant 24 : i32 2026-02-21T08:30:20.0719336Z %9:2 = scf.for %arg3 = %c0_i32 to %c5496_i32 step %c24_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<32xf32>, tensor<32xf32>) : i32 { 2026-02-21T08:30:20.0719824Z %49 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 
2026-02-21T08:30:20.0720115Z %50 = tt.splat %arg3 : i32 -> tensor<8xi32> 2026-02-21T08:30:20.0720357Z %51 = arith.addi %50, %49 : tensor<8xi32> 2026-02-21T08:30:20.0720756Z %52 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:30:20.0721086Z %53 = arith.muli %52, %cst : tensor<32x1xi32> 2026-02-21T08:30:20.0724235Z %54 = tt.expand_dims %51 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:30:20.0728298Z %55 = tt.broadcast %53 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:30:20.0732192Z %56 = tt.broadcast %54 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:30:20.0736367Z %57 = arith.addi %55, %56 : tensor<32x8xi32> 2026-02-21T08:30:20.0739994Z %58 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:30:20.0743789Z %59 = tt.addptr %58, %57 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:30:20.0747502Z %60 = tt.load %59 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 2026-02-21T08:30:20.0747887Z %61 = arith.extf %60 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:30:20.0748154Z %62 = "tt.reduce"(%61) <{axis = 1 : i32}> ({ 2026-02-21T08:30:20.0748370Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:20.0748585Z %140 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:30:20.0748792Z tt.reduce.return %140 : f32 2026-02-21T08:30:20.0749005Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:30:20.0749244Z %63 = arith.truncf %62 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:30:20.0749514Z %64 = arith.extf %63 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:30:20.0753228Z %65 = arith.cmpf ogt, %arg4, %64 : tensor<32xf32> 2026-02-21T08:30:20.0756227Z %66 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32> 2026-02-21T08:30:20.0756830Z %67 = arith.ori %65, %66 : tensor<32xi1> 2026-02-21T08:30:20.0757105Z %68 = arith.select %67, %arg4, %64 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:30:20.0757374Z %69 = arith.subf %arg4, %68 : tensor<32xf32> 2026-02-21T08:30:20.0757821Z %70 = 
tt.extern_elementwise %69 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:30:20.0758213Z %71 = arith.mulf %arg5, %70 : tensor<32xf32> 2026-02-21T08:30:20.0758497Z %72 = tt.expand_dims %68 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:20.0758816Z %73 = tt.broadcast %72 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:30:20.0759067Z %74 = arith.subf %61, %73 : tensor<32x8xf32> 2026-02-21T08:30:20.0759548Z %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:30:20.0759945Z %76 = "tt.reduce"(%75) <{axis = 1 : i32}> ({ 2026-02-21T08:30:20.0760154Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:20.0760363Z %140 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:30:20.0760578Z tt.reduce.return %140 : f32 2026-02-21T08:30:20.0760780Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:30:20.0761001Z %77 = arith.addf %71, %76 : tensor<32xf32> 2026-02-21T08:30:20.0761211Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T08:30:20.0761425Z %78 = arith.muli %c8_i32, %c1_i32_4 : i32 2026-02-21T08:30:20.0761703Z %79 = arith.addi %arg3, %78 : i32 2026-02-21T08:30:20.0761957Z %80 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:30:20.0762230Z %81 = tt.splat %79 : i32 -> tensor<8xi32> 2026-02-21T08:30:20.0762439Z %82 = arith.addi %81, %80 : tensor<8xi32> 2026-02-21T08:30:20.0762720Z %83 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:30:20.0763009Z %84 = arith.muli %83, %cst : tensor<32x1xi32> 2026-02-21T08:30:20.0763287Z %85 = tt.expand_dims %82 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:30:20.0763595Z %86 = tt.broadcast %84 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:30:20.0763887Z %87 = tt.broadcast %85 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:30:20.0764147Z %88 = arith.addi %86, 
%87 : tensor<32x8xi32> 2026-02-21T08:30:20.0764404Z %89 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:30:20.0764706Z %90 = tt.addptr %89, %88 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:30:20.0765021Z %91 = tt.load %90 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 2026-02-21T08:30:20.0765330Z %92 = arith.extf %91 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:30:20.0765569Z %93 = "tt.reduce"(%92) <{axis = 1 : i32}> ({ 2026-02-21T08:30:20.0765783Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:20.0765987Z %140 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:30:20.0766190Z tt.reduce.return %140 : f32 2026-02-21T08:30:20.0766392Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:30:20.0766625Z %94 = arith.truncf %93 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:30:20.0766889Z %95 = arith.extf %94 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:30:20.0767139Z %96 = arith.cmpf ogt, %68, %95 : tensor<32xf32> 2026-02-21T08:30:20.0767372Z %97 = arith.cmpf une, %68, %68 : tensor<32xf32> 2026-02-21T08:30:20.0767592Z %98 = arith.ori %96, %97 : tensor<32xi1> 2026-02-21T08:30:20.0767833Z %99 = arith.select %98, %68, %95 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:30:20.0768088Z %100 = arith.subf %68, %99 : tensor<32xf32> 2026-02-21T08:30:20.0768480Z %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:30:20.0768952Z %102 = arith.mulf %77, %101 : tensor<32xf32> 2026-02-21T08:30:20.0769224Z %103 = tt.expand_dims %99 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:20.0769551Z %104 = tt.broadcast %103 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:30:20.0769814Z %105 = arith.subf %92, %104 : tensor<32x8xf32> 2026-02-21T08:30:20.0770201Z %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:30:20.0770604Z 
%107 = "tt.reduce"(%106) <{axis = 1 : i32}> ({ 2026-02-21T08:30:20.0770813Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:20.0771017Z %140 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:30:20.0771218Z tt.reduce.return %140 : f32 2026-02-21T08:30:20.0771418Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:30:20.0771751Z %108 = arith.addf %102, %107 : tensor<32xf32> 2026-02-21T08:30:20.0771966Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:30:20.0772194Z %109 = arith.muli %c8_i32, %c2_i32 : i32 2026-02-21T08:30:20.0772415Z %110 = arith.addi %arg3, %109 : i32 2026-02-21T08:30:20.0772684Z %111 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:30:20.0772969Z %112 = tt.splat %110 : i32 -> tensor<8xi32> 2026-02-21T08:30:20.0773211Z %113 = arith.addi %112, %111 : tensor<8xi32> 2026-02-21T08:30:20.0773504Z %114 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:30:20.0773797Z %115 = arith.muli %114, %cst : tensor<32x1xi32> 2026-02-21T08:30:20.0774112Z %116 = tt.expand_dims %113 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:30:20.0774449Z %117 = tt.broadcast %115 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:30:20.0774760Z %118 = tt.broadcast %116 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:30:20.0775035Z %119 = arith.addi %117, %118 : tensor<32x8xi32> 2026-02-21T08:30:20.0775320Z %120 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:30:20.0775654Z %121 = tt.addptr %120, %119 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:30:20.0776013Z %122 = tt.load %121 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 2026-02-21T08:30:20.0776364Z %123 = arith.extf %122 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:30:20.0776634Z %124 = "tt.reduce"(%123) <{axis = 1 : i32}> ({ 2026-02-21T08:30:20.0776861Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:20.0777069Z %140 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:30:20.0777298Z tt.reduce.return 
%140 : f32 2026-02-21T08:30:20.0777513Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:30:20.0777770Z %125 = arith.truncf %124 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:30:20.0778069Z %126 = arith.extf %125 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:30:20.0778341Z %127 = arith.cmpf ogt, %99, %126 : tensor<32xf32> 2026-02-21T08:30:20.0778631Z %128 = arith.cmpf une, %99, %99 : tensor<32xf32> 2026-02-21T08:30:20.0778876Z %129 = arith.ori %127, %128 : tensor<32xi1> 2026-02-21T08:30:20.0779148Z %130 = arith.select %129, %99, %126 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:30:20.0779433Z %131 = arith.subf %99, %130 : tensor<32xf32> 2026-02-21T08:30:20.0779842Z %132 = tt.extern_elementwise %131 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:30:20.0780240Z %133 = arith.mulf %108, %132 : tensor<32xf32> 2026-02-21T08:30:20.0780515Z %134 = tt.expand_dims %130 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:20.0780836Z %135 = tt.broadcast %134 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:30:20.0781164Z %136 = arith.subf %123, %135 : tensor<32x8xf32> 2026-02-21T08:30:20.0781606Z %137 = tt.extern_elementwise %136 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:30:20.0782038Z %138 = "tt.reduce"(%137) <{axis = 1 : i32}> ({ 2026-02-21T08:30:20.0782256Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:20.0782475Z %140 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:30:20.0782694Z tt.reduce.return %140 : f32 2026-02-21T08:30:20.0782912Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:30:20.0783149Z %139 = arith.addf %133, %138 : tensor<32xf32> 2026-02-21T08:30:20.0783383Z scf.yield %130, %139 : tensor<32xf32>, tensor<32xf32> 2026-02-21T08:30:20.0783657Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:30:20.0783942Z %10 = tt.make_range {end = 8 : i32, start = 0 : 
i32} : tensor<8xi32> 2026-02-21T08:30:20.0784311Z %11 = tt.splat %c5496_i32 : i32 -> tensor<8xi32> 2026-02-21T08:30:20.0784530Z %12 = arith.addi %11, %10 : tensor<8xi32> 2026-02-21T08:30:20.0784797Z %13 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:30:20.0785080Z %14 = arith.muli %13, %cst : tensor<32x1xi32> 2026-02-21T08:30:20.0785349Z %15 = tt.expand_dims %12 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:30:20.0785655Z %16 = tt.broadcast %14 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:30:20.0785923Z %17 = tt.broadcast %15 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:30:20.0786174Z %18 = arith.addi %16, %17 : tensor<32x8xi32> 2026-02-21T08:30:20.0786425Z %19 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:30:20.0786721Z %20 = tt.addptr %19, %18 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:30:20.0787043Z %21 = tt.load %20 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 2026-02-21T08:30:20.0787347Z %22 = arith.extf %21 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:30:20.0787590Z %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({ 2026-02-21T08:30:20.0787795Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:30:20.0787992Z %49 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:30:20.0788193Z tt.reduce.return %49 : f32 2026-02-21T08:30:20.0788398Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:30:20.0788639Z %24 = arith.truncf %23 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:30:20.0788894Z %25 = arith.extf %24 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:30:20.0789141Z %26 = arith.cmpf ogt, %9#0, %25 : tensor<32xf32> 2026-02-21T08:30:20.0789370Z %27 = arith.cmpf une, %9#0, %9#0 : tensor<32xf32> 2026-02-21T08:30:20.0789590Z %28 = arith.ori %26, %27 : tensor<32xi1> 2026-02-21T08:30:20.0789829Z %29 = arith.select %28, %9#0, %25 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:30:20.0790086Z %30 = arith.subf %9#0, %29 : tensor<32xf32> 
2026-02-21T08:30:20.0790477Z %31 = tt.extern_elementwise %30 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:30:20.0790855Z %32 = arith.mulf %9#1, %31 : tensor<32xf32> 2026-02-21T08:30:20.0791130Z %33 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:20.0791443Z %34 = tt.broadcast %33 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:30:20.0791751Z %35 = arith.subf %22, %34 : tensor<32x8xf32> 2026-02-21T08:30:20.0792164Z %36 = tt.extern_elementwise %35 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:30:20.0792584Z %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({ 2026-02-21T08:30:20.0792810Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:30:20.0793021Z %49 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:30:20.0793230Z tt.reduce.return %49 : f32 2026-02-21T08:30:20.0793485Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:30:20.0793700Z %38 = arith.addf %32, %37 : tensor<32xf32> 2026-02-21T08:30:20.0793904Z %c5496_i32_2 = arith.constant 5496 : i32 2026-02-21T08:30:20.0794113Z %c24_i32_3 = arith.constant 24 : i32 2026-02-21T08:30:20.0794361Z scf.for %arg3 = %c0_i32 to %c5496_i32_2 step %c24_i32_3 : i32 { 2026-02-21T08:30:20.0794703Z %49 = tt.descriptor_load %0[%5, %arg3] : !tt.tensordesc> -> tensor<32x8xf16> 2026-02-21T08:30:20.0795070Z %50 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:20.0795372Z %51 = arith.extf %49 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:30:20.0795651Z %52 = tt.broadcast %50 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:30:20.0795897Z %53 = arith.subf %51, %52 : tensor<32x8xf32> 2026-02-21T08:30:20.0796352Z %54 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:30:20.0796799Z %55 = tt.expand_dims %38 {axis = 1 : 
i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:20.0797099Z %56 = tt.broadcast %55 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:30:20.0797351Z %57 = arith.divf %54, %56 : tensor<32x8xf32> 2026-02-21T08:30:20.0797590Z %58 = arith.truncf %57 : tensor<32x8xf32> to tensor<32x8xf16> 2026-02-21T08:30:20.0797926Z tt.descriptor_store %1[%5, %arg3], %58 : !tt.tensordesc>, tensor<32x8xf16> 2026-02-21T08:30:20.0798236Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T08:30:20.0798441Z %59 = arith.muli %c8_i32, %c1_i32_4 : i32 2026-02-21T08:30:20.0798648Z %60 = arith.addi %arg3, %59 : i32 2026-02-21T08:30:20.0798932Z %61 = tt.descriptor_load %0[%5, %60] : !tt.tensordesc> -> tensor<32x8xf16> 2026-02-21T08:30:20.0799287Z %62 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:20.0799588Z %63 = arith.extf %61 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:30:20.0799855Z %64 = tt.broadcast %62 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:30:20.0800103Z %65 = arith.subf %63, %64 : tensor<32x8xf32> 2026-02-21T08:30:20.0800482Z %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:30:20.0800920Z %67 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:20.0801220Z %68 = tt.broadcast %67 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:30:20.0801483Z %69 = arith.divf %66, %68 : tensor<32x8xf32> 2026-02-21T08:30:20.0801793Z %70 = arith.truncf %69 : tensor<32x8xf32> to tensor<32x8xf16> 2026-02-21T08:30:20.0802141Z tt.descriptor_store %1[%5, %60], %70 : !tt.tensordesc>, tensor<32x8xf16> 2026-02-21T08:30:20.0802473Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:30:20.0802671Z %71 = arith.muli %c8_i32, %c2_i32 : i32 2026-02-21T08:30:20.0802877Z %72 = arith.addi %arg3, %71 : i32 2026-02-21T08:30:20.0803155Z %73 = tt.descriptor_load %0[%5, %72] : !tt.tensordesc> -> tensor<32x8xf16> 
2026-02-21T08:30:20.0803512Z %74 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:20.0803820Z %75 = arith.extf %73 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:30:20.0804106Z %76 = tt.broadcast %74 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:30:20.0804375Z %77 = arith.subf %75, %76 : tensor<32x8xf32> 2026-02-21T08:30:20.0804785Z %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:30:20.0805282Z %79 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:20.0805673Z %80 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:30:20.0805931Z %81 = arith.divf %78, %80 : tensor<32x8xf32> 2026-02-21T08:30:20.0806191Z %82 = arith.truncf %81 : tensor<32x8xf32> to tensor<32x8xf16> 2026-02-21T08:30:20.0806535Z tt.descriptor_store %1[%5, %72], %82 : !tt.tensordesc>, tensor<32x8xf16> 2026-02-21T08:30:20.0806907Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:30:20.0807276Z %39 = tt.descriptor_load %0[%5, %c5496_i32_2] : !tt.tensordesc> -> tensor<32x8xf16> 2026-02-21T08:30:20.0807706Z %40 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:20.0808034Z %41 = arith.extf %39 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:30:20.0808321Z %42 = tt.broadcast %40 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:30:20.0808647Z %43 = arith.subf %41, %42 : tensor<32x8xf32> 2026-02-21T08:30:20.0809056Z %44 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:30:20.0809542Z %45 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:20.0809886Z %46 = tt.broadcast %45 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:30:20.0810144Z %47 = arith.divf %44, %46 : tensor<32x8xf32> 
2026-02-21T08:30:20.0810409Z %48 = arith.truncf %47 : tensor<32x8xf32> to tensor<32x8xf16> 2026-02-21T08:30:20.0810772Z tt.descriptor_store %1[%5, %c5496_i32_2], %48 : !tt.tensordesc>, tensor<32x8xf16> 2026-02-21T08:30:20.0811143Z } {tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:30:20.0811365Z tt.return 2026-02-21T08:30:20.0811526Z } 2026-02-21T08:30:20.0811695Z } 2026-02-21T08:30:20.0811779Z 2026-02-21T08:30:20.0811833Z {-# 2026-02-21T08:30:20.0811980Z external_resources: { 2026-02-21T08:30:20.0812152Z mlir_reproducer: { 2026-02-21T08:30:20.0816860Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, 
tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:30:20.0821802Z disable_threading: false, 2026-02-21T08:30:20.0822001Z verify_each: true 2026-02-21T08:30:20.0822180Z } 2026-02-21T08:30:20.0822315Z } 2026-02-21T08:30:20.0822514Z #-} 2026-02-21T08:30:20.0823010Z /tmp/torchinductor_root/2b/c2bvaufoxps2wu5oj3tnplvh5juo7e4hyjyxdfl3yacmcu3dh5ru.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:30:20.0824300Z /tmp/torchinductor_root/2b/c2bvaufoxps2wu5oj3tnplvh5juo7e4hyjyxdfl3yacmcu3dh5ru.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:30:20.0825346Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:30:20.0826594Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 8], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[2, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:30:20.0827666Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:30:20.0827949Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:30:21.5886565Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:30:21.5891926Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:30:21.5894114Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:30:21.5894452Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:30:21.5898139Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:30:21.5898595Z %cst = arith.constant dense<5504> : tensor<32x1xi32> 2026-02-21T08:30:21.5899015Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<32xf32> 2026-02-21T08:30:21.5899467Z %cst_1 = arith.constant dense<0xFF800000> : tensor<32xf32> 2026-02-21T08:30:21.5899830Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:30:21.5900139Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:30:21.5900399Z %c5504_i32 = arith.constant 5504 : i32 2026-02-21T08:30:21.5900640Z %c5504_i64 = arith.constant 5504 : i64 2026-02-21T08:30:21.5900888Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:30:21.5901312Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c5504_i32], [%c5504_i64, %c1_i64] : , > 2026-02-21T08:30:21.5901848Z %1 = tt.get_program_id x : i32 2026-02-21T08:30:21.5902136Z scf.for %arg2 = %1 to %c128_i32 step %c9472_i32 : i32 { 2026-02-21T08:30:21.5902432Z %2 = arith.muli %arg2, %c32_i32 : i32 2026-02-21T08:30:21.5902738Z %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:30:21.5903108Z %4 = tt.splat %2 : i32 -> tensor<32xi32> 2026-02-21T08:30:21.5903393Z %5 = arith.addi %4, %3 : tensor<32xi32> 2026-02-21T08:30:21.5903653Z %c5472_i32 = arith.constant 5472 : i32 2026-02-21T08:30:21.5903888Z %c96_i32 = arith.constant 96 : i32 2026-02-21T08:30:21.5904371Z %6:2 = scf.for %arg3 = %c0_i32 to %c5472_i32 step %c96_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<32xf32>, tensor<32xf32>) : i32 { 2026-02-21T08:30:21.5905000Z %47 = tt.descriptor_load %0[%2, %arg3] : 
!tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:30:21.5905451Z %48 = arith.extf %47 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:30:21.5905767Z %49 = "tt.reduce"(%48) <{axis = 1 : i32}> ({ 2026-02-21T08:30:21.5906013Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:21.5906263Z %105 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:30:21.5906509Z tt.reduce.return %105 : f32 2026-02-21T08:30:21.5906754Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:30:21.5907053Z %50 = arith.truncf %49 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:30:21.5907816Z %51 = arith.extf %50 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:30:21.5908145Z %52 = arith.cmpf ogt, %arg4, %51 : tensor<32xf32> 2026-02-21T08:30:21.5908451Z %53 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32> 2026-02-21T08:30:21.5908756Z %54 = arith.ori %52, %53 : tensor<32xi1> 2026-02-21T08:30:21.5909037Z %55 = arith.select %54, %arg4, %51 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:30:21.5909334Z %56 = arith.subf %arg4, %55 : tensor<32xf32> 2026-02-21T08:30:21.5909833Z %57 = tt.extern_elementwise %56 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:30:21.5910302Z %58 = arith.mulf %arg5, %57 : tensor<32xf32> 2026-02-21T08:30:21.5910659Z %59 = tt.expand_dims %55 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:21.5911186Z %60 = tt.broadcast %59 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:30:21.5911500Z %61 = arith.subf %48, %60 : tensor<32x32xf32> 2026-02-21T08:30:21.5912032Z %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:30:21.5912477Z %63 = "tt.reduce"(%62) <{axis = 1 : i32}> ({ 2026-02-21T08:30:21.5912703Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:21.5912914Z %105 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:30:21.5913136Z tt.reduce.return %105 : f32 2026-02-21T08:30:21.5913349Z }) : 
(tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:30:21.5913582Z %64 = arith.addf %58, %63 : tensor<32xf32> 2026-02-21T08:30:21.5913801Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:30:21.5914024Z %65 = arith.muli %c32_i32, %c1_i32 : i32 2026-02-21T08:30:21.5914253Z %66 = arith.addi %arg3, %65 : i32 2026-02-21T08:30:21.5914608Z %67 = tt.descriptor_load %0[%2, %66] : !tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:30:21.5915013Z %68 = arith.extf %67 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:30:21.5915301Z %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({ 2026-02-21T08:30:21.5915539Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:21.5915767Z %105 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:30:21.5916009Z tt.reduce.return %105 : f32 2026-02-21T08:30:21.5916268Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:30:21.5916544Z %70 = arith.truncf %69 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:30:21.5916829Z %71 = arith.extf %70 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:30:21.5917097Z %72 = arith.cmpf ogt, %55, %71 : tensor<32xf32> 2026-02-21T08:30:21.5917340Z %73 = arith.cmpf une, %55, %55 : tensor<32xf32> 2026-02-21T08:30:21.5917578Z %74 = arith.ori %72, %73 : tensor<32xi1> 2026-02-21T08:30:21.5917841Z %75 = arith.select %74, %55, %71 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:30:21.5918118Z %76 = arith.subf %55, %75 : tensor<32xf32> 2026-02-21T08:30:21.5918522Z %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:30:21.5918980Z %78 = arith.mulf %64, %77 : tensor<32xf32> 2026-02-21T08:30:21.5919289Z %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:21.5919678Z %80 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:30:21.5919969Z %81 = arith.subf %68, %80 : tensor<32x32xf32> 2026-02-21T08:30:21.5920386Z %82 = tt.extern_elementwise %81 {libname = "", libpath = "", pure = 
true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:30:21.5920802Z %83 = "tt.reduce"(%82) <{axis = 1 : i32}> ({ 2026-02-21T08:30:21.5921026Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:21.5921240Z %105 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:30:21.5921597Z tt.reduce.return %105 : f32 2026-02-21T08:30:21.5921817Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:30:21.5922041Z %84 = arith.addf %78, %83 : tensor<32xf32> 2026-02-21T08:30:21.5922269Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:30:21.5922483Z %85 = arith.muli %c32_i32, %c2_i32 : i32 2026-02-21T08:30:21.5922704Z %86 = arith.addi %arg3, %85 : i32 2026-02-21T08:30:21.5923022Z %87 = tt.descriptor_load %0[%2, %86] : !tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:30:21.5923379Z %88 = arith.extf %87 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:30:21.5923646Z %89 = "tt.reduce"(%88) <{axis = 1 : i32}> ({ 2026-02-21T08:30:21.5923861Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:21.5924083Z %105 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:30:21.5924301Z tt.reduce.return %105 : f32 2026-02-21T08:30:21.5924597Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:30:21.5924852Z %90 = arith.truncf %89 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:30:21.5925137Z %91 = arith.extf %90 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:30:21.5925406Z %92 = arith.cmpf ogt, %75, %91 : tensor<32xf32> 2026-02-21T08:30:21.5925649Z %93 = arith.cmpf une, %75, %75 : tensor<32xf32> 2026-02-21T08:30:21.5925887Z %94 = arith.ori %92, %93 : tensor<32xi1> 2026-02-21T08:30:21.5926152Z %95 = arith.select %94, %75, %91 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:30:21.5926424Z %96 = arith.subf %75, %95 : tensor<32xf32> 2026-02-21T08:30:21.5926833Z %97 = tt.extern_elementwise %96 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:30:21.5927249Z %98 = arith.mulf %84, %97 : tensor<32xf32> 
2026-02-21T08:30:21.5927545Z %99 = tt.expand_dims %95 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:21.5927891Z %100 = tt.broadcast %99 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:30:21.5928178Z %101 = arith.subf %88, %100 : tensor<32x32xf32> 2026-02-21T08:30:21.5928605Z %102 = tt.extern_elementwise %101 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:30:21.5929048Z %103 = "tt.reduce"(%102) <{axis = 1 : i32}> ({ 2026-02-21T08:30:21.5929276Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:21.5929482Z %105 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:30:21.5929709Z tt.reduce.return %105 : f32 2026-02-21T08:30:21.5929922Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:30:21.5930163Z %104 = arith.addf %98, %103 : tensor<32xf32> 2026-02-21T08:30:21.5930420Z scf.yield %95, %104 : tensor<32xf32>, tensor<32xf32> 2026-02-21T08:30:21.5930681Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:30:21.5931032Z %7 = tt.descriptor_load %0[%2, %c5472_i32] : !tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:30:21.5931426Z %8 = arith.extf %7 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:30:21.5931748Z %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({ 2026-02-21T08:30:21.5931979Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:30:21.5932210Z %47 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:30:21.5932438Z tt.reduce.return %47 : f32 2026-02-21T08:30:21.5932674Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:30:21.5932943Z %10 = arith.truncf %9 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:30:21.5933243Z %11 = arith.extf %10 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:30:21.5933537Z %12 = arith.cmpf ogt, %6#0, %11 : tensor<32xf32> 2026-02-21T08:30:21.5933797Z %13 = arith.cmpf une, %6#0, %6#0 : tensor<32xf32> 2026-02-21T08:30:21.5934050Z %14 = arith.ori %12, %13 : tensor<32xi1> 2026-02-21T08:30:21.5934329Z %15 = arith.select %14, %6#0, %11 : 
tensor<32xi1>, tensor<32xf32> 2026-02-21T08:30:21.5934696Z %16 = arith.subf %6#0, %15 : tensor<32xf32> 2026-02-21T08:30:21.5935096Z %17 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:30:21.5935509Z %18 = arith.mulf %6#1, %17 : tensor<32xf32> 2026-02-21T08:30:21.5935801Z %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:21.5936134Z %20 = tt.broadcast %19 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:30:21.5936405Z %21 = arith.subf %8, %20 : tensor<32x32xf32> 2026-02-21T08:30:21.5936816Z %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:30:21.5937234Z %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({ 2026-02-21T08:30:21.5937455Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:30:21.5937719Z %47 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:30:21.5937943Z tt.reduce.return %47 : f32 2026-02-21T08:30:21.5938153Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:30:21.5938381Z %24 = arith.addf %18, %23 : tensor<32xf32> 2026-02-21T08:30:21.5938605Z %c5472_i32_2 = arith.constant 5472 : i32 2026-02-21T08:30:21.5938834Z %c96_i32_3 = arith.constant 96 : i32 2026-02-21T08:30:21.5939098Z scf.for %arg3 = %c0_i32 to %c5472_i32_2 step %c96_i32_3 : i32 { 2026-02-21T08:30:21.5939384Z %47 = tt.splat %arg3 : i32 -> tensor<32xi32> 2026-02-21T08:30:21.5939624Z %48 = arith.addi %47, %3 : tensor<32xi32> 2026-02-21T08:30:21.5939911Z %49 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:30:21.5940220Z %50 = arith.muli %49, %cst : tensor<32x1xi32> 2026-02-21T08:30:21.5940509Z %51 = tt.expand_dims %48 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:30:21.5940848Z %52 = tt.broadcast %50 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:30:21.5941156Z %53 = tt.broadcast %51 : 
tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:30:21.5941424Z %54 = arith.addi %52, %53 : tensor<32x32xi32> 2026-02-21T08:30:21.5941757Z %55 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:30:21.5942076Z %56 = tt.addptr %55, %54 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:30:21.5942424Z %57 = tt.load %56 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T08:30:21.5942775Z %58 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:21.5943105Z %59 = arith.extf %57 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:30:21.5943404Z %60 = tt.broadcast %58 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:30:21.5943671Z %61 = arith.subf %59, %60 : tensor<32x32xf32> 2026-02-21T08:30:21.5944093Z %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:30:21.5944571Z %63 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:21.5944905Z %64 = tt.broadcast %63 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:30:21.5945179Z %65 = arith.divf %62, %64 : tensor<32x32xf32> 2026-02-21T08:30:21.5945449Z %66 = arith.truncf %65 : tensor<32x32xf32> to tensor<32x32xf16> 2026-02-21T08:30:21.5945769Z %67 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:30:21.5946090Z %68 = tt.addptr %67, %54 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:30:21.5946390Z tt.store %68, %66 : tensor<32x32x!tt.ptr> 2026-02-21T08:30:21.5946619Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:30:21.5946843Z %69 = arith.muli %c32_i32, %c1_i32 : i32 2026-02-21T08:30:21.5947068Z %70 = arith.addi %arg3, %69 : i32 2026-02-21T08:30:21.5947400Z %71 = tt.splat %70 : i32 -> tensor<32xi32> 2026-02-21T08:30:21.5947633Z %72 = arith.addi %71, %3 : tensor<32xi32> 2026-02-21T08:30:21.5947914Z %73 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 
2026-02-21T08:30:21.5948218Z %74 = arith.muli %73, %cst : tensor<32x1xi32> 2026-02-21T08:30:21.5948509Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:30:21.5948843Z %76 = tt.broadcast %74 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:30:21.5949145Z %77 = tt.broadcast %75 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:30:21.5949409Z %78 = arith.addi %76, %77 : tensor<32x32xi32> 2026-02-21T08:30:21.5949688Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:30:21.5950001Z %80 = tt.addptr %79, %78 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:30:21.5950462Z %81 = tt.load %80 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T08:30:21.5950821Z %82 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:21.5951154Z %83 = arith.extf %81 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:30:21.5951452Z %84 = tt.broadcast %82 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:30:21.5951773Z %85 = arith.subf %83, %84 : tensor<32x32xf32> 2026-02-21T08:30:21.5952196Z %86 = tt.extern_elementwise %85 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:30:21.5952671Z %87 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:21.5953006Z %88 = tt.broadcast %87 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:30:21.5953277Z %89 = arith.divf %86, %88 : tensor<32x32xf32> 2026-02-21T08:30:21.5953547Z %90 = arith.truncf %89 : tensor<32x32xf32> to tensor<32x32xf16> 2026-02-21T08:30:21.5953861Z %91 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:30:21.5954177Z %92 = tt.addptr %91, %78 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:30:21.5954473Z tt.store %92, %90 : tensor<32x32x!tt.ptr> 2026-02-21T08:30:21.5954703Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:30:21.5954924Z %93 = arith.muli %c32_i32, 
%c2_i32 : i32 2026-02-21T08:30:21.5955146Z %94 = arith.addi %arg3, %93 : i32 2026-02-21T08:30:21.5955364Z %95 = tt.splat %94 : i32 -> tensor<32xi32> 2026-02-21T08:30:21.5955596Z %96 = arith.addi %95, %3 : tensor<32xi32> 2026-02-21T08:30:21.5955873Z %97 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:30:21.5956175Z %98 = arith.muli %97, %cst : tensor<32x1xi32> 2026-02-21T08:30:21.5956464Z %99 = tt.expand_dims %96 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:30:21.5956801Z %100 = tt.broadcast %98 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:30:21.5957111Z %101 = tt.broadcast %99 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:30:21.5957388Z %102 = arith.addi %100, %101 : tensor<32x32xi32> 2026-02-21T08:30:21.5957671Z %103 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:30:21.5958000Z %104 = tt.addptr %103, %102 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:30:21.5958365Z %105 = tt.load %104 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T08:30:21.5958730Z %106 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:21.5959078Z %107 = arith.extf %105 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:30:21.5959390Z %108 = tt.broadcast %106 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:30:21.5959676Z %109 = arith.subf %107, %108 : tensor<32x32xf32> 2026-02-21T08:30:21.5960197Z %110 = tt.extern_elementwise %109 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:30:21.5960698Z %111 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:21.5961049Z %112 = tt.broadcast %111 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:30:21.5961346Z %113 = arith.divf %110, %112 : tensor<32x32xf32> 2026-02-21T08:30:21.5961667Z %114 = arith.truncf %113 : tensor<32x32xf32> to tensor<32x32xf16> 
2026-02-21T08:30:21.5961999Z %115 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:30:21.5962330Z %116 = tt.addptr %115, %102 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:30:21.5962647Z tt.store %116, %114 : tensor<32x32x!tt.ptr> 2026-02-21T08:30:21.5962899Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:30:21.5963186Z %25 = tt.splat %c5472_i32_2 : i32 -> tensor<32xi32> 2026-02-21T08:30:21.5963424Z %26 = arith.addi %25, %3 : tensor<32xi32> 2026-02-21T08:30:21.5963684Z %27 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:30:21.5963972Z %28 = arith.muli %27, %cst : tensor<32x1xi32> 2026-02-21T08:30:21.5964242Z %29 = tt.expand_dims %26 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:30:21.5964556Z %30 = tt.broadcast %28 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:30:21.5964831Z %31 = tt.broadcast %29 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:30:21.5965084Z %32 = arith.addi %30, %31 : tensor<32x32xi32> 2026-02-21T08:30:21.5965337Z %33 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:30:21.5965627Z %34 = tt.addptr %33, %32 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:30:21.5965950Z %35 = tt.load %34 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T08:30:21.5966280Z %36 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:21.5966587Z %37 = arith.extf %35 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:30:21.5966855Z %38 = tt.broadcast %36 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:30:21.5967104Z %39 = arith.subf %37, %38 : tensor<32x32xf32> 2026-02-21T08:30:21.5967496Z %40 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:30:21.5967937Z %41 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:30:21.5968240Z %42 = 
tt.broadcast %41 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:30:21.5968480Z %43 = arith.divf %40, %42 : tensor<32x32xf32> 2026-02-21T08:30:21.5968727Z %44 = arith.truncf %43 : tensor<32x32xf32> to tensor<32x32xf16> 2026-02-21T08:30:21.5969017Z %45 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:30:21.5969305Z %46 = tt.addptr %45, %32 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:30:21.5969580Z tt.store %46, %44 : tensor<32x32x!tt.ptr> 2026-02-21T08:30:21.5969868Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:30:21.5970140Z tt.return 2026-02-21T08:30:21.5970272Z } 2026-02-21T08:30:21.5970405Z } 2026-02-21T08:30:21.5970478Z 2026-02-21T08:30:21.5970539Z {-# 2026-02-21T08:30:21.5970674Z external_resources: { 2026-02-21T08:30:21.5970848Z mlir_reproducer: { 2026-02-21T08:30:21.5975848Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 
region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:30:21.5981078Z disable_threading: false, 2026-02-21T08:30:21.5981271Z verify_each: true 2026-02-21T08:30:21.5981447Z } 2026-02-21T08:30:21.5981630Z } 2026-02-21T08:30:21.5981766Z #-} 2026-02-21T08:30:21.5982260Z /tmp/torchinductor_root/7j/c7jycigb7ntky37gbhleowgtonxpngp7phyc36yil7xkkkov35se.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:30:21.5983664Z /tmp/torchinductor_root/7j/c7jycigb7ntky37gbhleowgtonxpngp7phyc36yil7xkkkov35se.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:30:21.5984823Z [38s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:30:21.5986050Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=128, num_sm_multiplier=64, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[3, 3], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:30:21.5987161Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:30:21.5987456Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:30:23.8663056Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.8 configs/s 2026-02-21T08:30:23.8670805Z [41s] Adaptive compile timeout: 30s (90% percentile=5.4s, bounds=[30.0s, 30s]) 2026-02-21T08:30:24.4816397Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1602.7 configs/s 2026-02-21T08:30:24.5390249Z [41s] Initial random population of 100, 5 starting points: 2026-02-21T08:30:24.5394523Z error=8 2026-02-21T08:30:24.5399276Z timeout=3 2026-02-21T08:30:24.5400903Z ok=89 2026-02-21T08:30:24.5401140Z min=0.0428 2026-02-21T08:30:24.5406826Z mid=0.5037 2026-02-21T08:30:24.5408288Z max=37.0422 2026-02-21T08:30:24.5408510Z best={'block_sizes': [2, 1024], 2026-02-21T08:30:24.5408807Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:30:24.5409097Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:30:24.5409305Z 'num_sm_multiplier': 64, 2026-02-21T08:30:24.5409473Z 'num_stages': 5, 2026-02-21T08:30:24.5409611Z 'num_warps': 1, 2026-02-21T08:30:24.5409777Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:30:24.5409970Z 'range_flattens': [True, True], 2026-02-21T08:30:24.5410733Z 'range_multi_buffers': [False, None], 2026-02-21T08:30:24.5410930Z 'range_num_stages': [3, 1], 2026-02-21T08:30:24.5411106Z 'range_unroll_factors': [0, 2], 
2026-02-21T08:30:24.5411303Z 'range_warp_specializes': [True, None]} 2026-02-21T08:30:24.5411529Z [41s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:30:25.7564913Z [43s] Generation 1 starting: 85 neighbors, 5 active search path(s) 2026-02-21T08:30:55.2814611Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 0.4 configs/s 2026-02-21T08:31:00.7982873Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 16.4 configs/s 2026-02-21T08:31:06.2420535Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 211.6 2026-02-21T08:31:06.2422179Z configs/s 2026-02-21T08:31:06.5732666Z [83s] Generation 1 complete: 2026-02-21T08:31:06.5734490Z ok=91 2026-02-21T08:31:06.5734660Z min=0.0328 2026-02-21T08:31:06.5735330Z mid=0.0452 2026-02-21T08:31:06.5735487Z max=2.5037 2026-02-21T08:31:06.5735631Z best={'block_sizes': [1, 8192], 2026-02-21T08:31:06.5735865Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:31:06.5736094Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:31:06.5736278Z 'maxnreg': 128, 2026-02-21T08:31:06.5736422Z 'num_sm_multiplier': 64, 2026-02-21T08:31:06.5736581Z 'num_stages': 7, 2026-02-21T08:31:06.5736716Z 'num_warps': 2, 2026-02-21T08:31:06.5736873Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:31:06.5737062Z 'range_flattens': [None, True], 2026-02-21T08:31:06.5737240Z 'range_multi_buffers': [False, True], 2026-02-21T08:31:06.5737424Z 'range_num_stages': [1, 4], 2026-02-21T08:31:06.5737585Z 'range_unroll_factors': [0, 4], 2026-02-21T08:31:06.5737763Z 'range_warp_specializes': [True, None]} 2026-02-21T08:31:06.5749272Z [83s] Fitting surrogate: 191 points, 191 targets 2026-02-21T08:31:07.7229717Z [85s] Generation 2 starting: 79 neighbors, 5 active search path(s) 2026-02-21T08:31:16.6902904Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 15.9 configs/s 2026-02-21T08:31:21.6965485Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 83/83 16.7 configs/s 
2026-02-21T08:31:26.9741321Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 192.4 2026-02-21T08:31:26.9745360Z configs/s 2026-02-21T08:31:27.3045046Z [104s] Generation 2 complete: 2026-02-21T08:31:27.3049964Z ok=85 2026-02-21T08:31:27.3051776Z min=0.0307 2026-02-21T08:31:27.3051998Z mid=0.0369 2026-02-21T08:31:27.3057122Z max=0.1556 2026-02-21T08:31:27.3058736Z best={'block_sizes': [1, 8192], 2026-02-21T08:31:27.3059065Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:31:27.3059366Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:31:27.3059592Z 'num_sm_multiplier': 32, 2026-02-21T08:31:27.3059771Z 'num_stages': 5, 2026-02-21T08:31:27.3059965Z 'num_warps': 1, 2026-02-21T08:31:27.3060156Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:31:27.3060357Z 'range_flattens': [None, True], 2026-02-21T08:31:27.3060551Z 'range_multi_buffers': [False, None], 2026-02-21T08:31:27.3060738Z 'range_num_stages': [3, 1], 2026-02-21T08:31:27.3060919Z 'range_unroll_factors': [0, 2], 2026-02-21T08:31:27.3061101Z 'range_warp_specializes': [True, None]} 2026-02-21T08:31:27.3065080Z [104s] Fitting surrogate: 276 points, 276 targets 2026-02-21T08:31:28.3404201Z [105s] Generation 3 starting: 71 neighbors, 5 active search path(s) 2026-02-21T08:31:35.6376846Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 4.0 configs/s 2026-02-21T08:31:40.0829799Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.8 configs/s 2026-02-21T08:31:45.2341321Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 211.4 2026-02-21T08:31:45.2342347Z configs/s 2026-02-21T08:31:45.5908794Z [122s] Generation 3 complete: 2026-02-21T08:31:45.5913762Z ok=77 2026-02-21T08:31:45.5915915Z min=0.0246 2026-02-21T08:31:45.5916078Z mid=0.0348 2026-02-21T08:31:45.5916203Z max=0.1578 2026-02-21T08:31:45.5916349Z best={'block_sizes': [2, 8192], 2026-02-21T08:31:45.5916575Z 'indexing': ['tensor_descriptor', 
'pointer', 'pointer'], 2026-02-21T08:31:45.5916813Z 'load_eviction_policies': ['', ''], 2026-02-21T08:31:45.5916987Z 'maxnreg': 128, 2026-02-21T08:31:45.5917140Z 'num_sm_multiplier': 8, 2026-02-21T08:31:45.5917303Z 'num_stages': 6, 2026-02-21T08:31:45.5917440Z 'num_warps': 1, 2026-02-21T08:31:45.5917598Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:31:45.5917788Z 'range_flattens': [True, None], 2026-02-21T08:31:45.5917971Z 'range_multi_buffers': [None, None], 2026-02-21T08:31:45.5918151Z 'range_num_stages': [4, 2], 2026-02-21T08:31:45.5918321Z 'range_unroll_factors': [0, 1], 2026-02-21T08:31:45.5918498Z 'range_warp_specializes': [True, None]} 2026-02-21T08:31:45.5926820Z [122s] Fitting surrogate: 353 points, 353 targets 2026-02-21T08:31:46.6060882Z [123s] Generation 4 starting: 70 neighbors, 5 active search path(s) 2026-02-21T08:32:02.2878470Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 0.9 configs/s 2026-02-21T08:32:06.4161529Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 71/71 17.4 configs/s 2026-02-21T08:32:10.9468738Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 224.3 2026-02-21T08:32:10.9470065Z configs/s 2026-02-21T08:32:11.2883219Z [148s] Generation 4 complete: 2026-02-21T08:32:11.2887614Z error=3 2026-02-21T08:32:11.2888986Z ok=72 2026-02-21T08:32:11.2889148Z min=0.0245 2026-02-21T08:32:11.2889287Z mid=0.0307 2026-02-21T08:32:11.2889407Z max=0.1925 2026-02-21T08:32:11.2889555Z best={'block_sizes': [1, 8192], 2026-02-21T08:32:11.2889863Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:32:11.2890159Z 'load_eviction_policies': ['', ''], 2026-02-21T08:32:11.2890332Z 'num_stages': 4, 2026-02-21T08:32:11.2890477Z 'num_warps': 1, 2026-02-21T08:32:11.2890618Z 'pid_type': 'flat', 2026-02-21T08:32:11.2890779Z 'range_flattens': [None, True], 2026-02-21T08:32:11.2890960Z 'range_multi_buffers': [None, False], 2026-02-21T08:32:11.2891142Z 'range_num_stages': [0, 
3], 2026-02-21T08:32:11.2891310Z 'range_unroll_factors': [0, 0], 2026-02-21T08:32:11.2891485Z 'range_warp_specializes': [None, True]} 2026-02-21T08:32:11.2901143Z [148s] Fitting surrogate: 428 points, 428 targets 2026-02-21T08:32:12.7543577Z [150s] Generation 5 starting: 76 neighbors, 5 active search path(s) 2026-02-21T08:32:20.1498707Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 8.0 configs/s 2026-02-21T08:32:24.8694469Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 16.7 configs/s 2026-02-21T08:32:29.4952183Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 220.0 2026-02-21T08:32:29.4953169Z configs/s 2026-02-21T08:32:29.8448139Z [167s] Generation 5 complete: 2026-02-21T08:32:29.8452174Z ok=82 2026-02-21T08:32:29.8456084Z min=0.0244 2026-02-21T08:32:29.8460398Z mid=0.0266 2026-02-21T08:32:29.8463554Z max=0.1823 2026-02-21T08:32:29.8468084Z best={'block_sizes': [1, 8192], 2026-02-21T08:32:29.8472626Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:32:29.8476423Z 'load_eviction_policies': ['', ''], 2026-02-21T08:32:29.8476722Z 'num_stages': 4, 2026-02-21T08:32:29.8476916Z 'num_warps': 1, 2026-02-21T08:32:29.8477096Z 'pid_type': 'flat', 2026-02-21T08:32:29.8477275Z 'range_flattens': [None, True], 2026-02-21T08:32:29.8477470Z 'range_multi_buffers': [None, False], 2026-02-21T08:32:29.8477654Z 'range_num_stages': [0, 3], 2026-02-21T08:32:29.8477833Z 'range_unroll_factors': [0, 0], 2026-02-21T08:32:29.8482544Z 'range_warp_specializes': [None, True]} 2026-02-21T08:32:29.8486436Z [167s] Fitting surrogate: 510 points, 510 targets 2026-02-21T08:32:30.7372515Z [168s] Generation 6 starting: 50 neighbors, 4 active search path(s) 2026-02-21T08:32:36.3183609Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 7.3 configs/s 2026-02-21T08:32:39.4849510Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 16.6 configs/s 2026-02-21T08:32:43.6665510Z Generation 6: 
verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 280.4 2026-02-21T08:32:43.6666936Z configs/s 2026-02-21T08:32:43.9616359Z [181s] Generation 6 complete: 2026-02-21T08:32:43.9618323Z ok=55 2026-02-21T08:32:43.9618498Z min=0.0244 2026-02-21T08:32:43.9618628Z mid=0.0246 2026-02-21T08:32:43.9618757Z max=0.0572 2026-02-21T08:32:43.9618896Z best={'block_sizes': [1, 8192], 2026-02-21T08:32:43.9619133Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:32:43.9619752Z 'load_eviction_policies': ['', ''], 2026-02-21T08:32:43.9619963Z 'num_stages': 4, 2026-02-21T08:32:43.9620110Z 'num_warps': 1, 2026-02-21T08:32:43.9620249Z 'pid_type': 'flat', 2026-02-21T08:32:43.9620412Z 'range_flattens': [None, True], 2026-02-21T08:32:43.9620589Z 'range_multi_buffers': [None, False], 2026-02-21T08:32:43.9620781Z 'range_num_stages': [0, 3], 2026-02-21T08:32:43.9620945Z 'range_unroll_factors': [0, 0], 2026-02-21T08:32:43.9621127Z 'range_warp_specializes': [None, True]} 2026-02-21T08:32:43.9633503Z [181s] Fitting surrogate: 565 points, 565 targets 2026-02-21T08:32:44.8067460Z [182s] Generation 7 starting: 45 neighbors, 4 active search path(s) 2026-02-21T08:32:50.5239472Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46/46 5.8 configs/s 2026-02-21T08:32:53.3219162Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 46/46 16.7 configs/s 2026-02-21T08:32:56.7595097Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 295.7 2026-02-21T08:32:56.7596398Z configs/s 2026-02-21T08:32:57.0324596Z [194s] Generation 7 complete: 2026-02-21T08:32:57.0326189Z ok=50 2026-02-21T08:32:57.0326361Z min=0.0244 2026-02-21T08:32:57.0326559Z mid=0.0246 2026-02-21T08:32:57.0326697Z max=0.0513 2026-02-21T08:32:57.0330748Z best={'block_sizes': [1, 8192], 2026-02-21T08:32:57.0334389Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:32:57.0336143Z 'load_eviction_policies': ['', ''], 2026-02-21T08:32:57.0336385Z 'num_stages': 5, 
2026-02-21T08:32:57.0336536Z 'num_warps': 1, 2026-02-21T08:32:57.0336692Z 'pid_type': 'flat', 2026-02-21T08:32:57.0336850Z 'range_flattens': [None, None], 2026-02-21T08:32:57.0337039Z 'range_multi_buffers': [None, None], 2026-02-21T08:32:57.0337222Z 'range_num_stages': [0, 1], 2026-02-21T08:32:57.0337397Z 'range_unroll_factors': [0, 0], 2026-02-21T08:32:57.0337587Z 'range_warp_specializes': [None, True]} 2026-02-21T08:32:57.0347418Z [194s] Fitting surrogate: 615 points, 615 targets 2026-02-21T08:32:57.6281774Z [194s] Generation 8 starting: 28 neighbors, 3 active search path(s) 2026-02-21T08:33:00.7543139Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 28/28 9.7 configs/s 2026-02-21T08:33:02.4495697Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 28/28 17.0 configs/s 2026-02-21T08:33:04.5083604Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 492.7 2026-02-21T08:33:04.5084969Z configs/s 2026-02-21T08:33:04.6845462Z [202s] Generation 8 complete: 2026-02-21T08:33:04.6849031Z ok=31 2026-02-21T08:33:04.6850408Z min=0.0244 2026-02-21T08:33:04.6850568Z mid=0.0245 2026-02-21T08:33:04.6850701Z max=0.0307 2026-02-21T08:33:04.6850839Z best={'block_sizes': [1, 8192], 2026-02-21T08:33:04.6851101Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:33:04.6851360Z 'load_eviction_policies': ['', ''], 2026-02-21T08:33:04.6852360Z 'num_stages': 4, 2026-02-21T08:33:04.6852539Z 'num_warps': 2, 2026-02-21T08:33:04.6852682Z 'pid_type': 'flat', 2026-02-21T08:33:04.6852846Z 'range_flattens': [None, True], 2026-02-21T08:33:04.6853023Z 'range_multi_buffers': [None, False], 2026-02-21T08:33:04.6853214Z 'range_num_stages': [0, 2], 2026-02-21T08:33:04.6853377Z 'range_unroll_factors': [0, 0], 2026-02-21T08:33:04.6853662Z 'range_warp_specializes': [None, True]} 2026-02-21T08:33:04.6866490Z [202s] Fitting surrogate: 646 points, 646 targets 2026-02-21T08:33:05.2258683Z [202s] Generation 9 starting: 21 neighbors, 2 active 
search path(s) 2026-02-21T08:33:08.2644633Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 4.2 configs/s 2026-02-21T08:33:09.5995735Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 22/22 17.0 configs/s 2026-02-21T08:33:11.6419390Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 687.4 2026-02-21T08:33:11.6419943Z configs/s 2026-02-21T08:33:11.7663100Z [209s] Generation 9 complete: 2026-02-21T08:33:11.7663303Z ok=23 2026-02-21T08:33:11.7663454Z min=0.0245 2026-02-21T08:33:11.7663587Z mid=0.0246 2026-02-21T08:33:11.7663724Z max=0.0512 2026-02-21T08:33:11.7663861Z best={'block_sizes': [1, 8192], 2026-02-21T08:33:11.7664090Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:33:11.7664330Z 'load_eviction_policies': ['', ''], 2026-02-21T08:33:11.7664513Z 'num_stages': 4, 2026-02-21T08:33:11.7664665Z 'num_warps': 1, 2026-02-21T08:33:11.7664805Z 'pid_type': 'flat', 2026-02-21T08:33:11.7664972Z 'range_flattens': [None, True], 2026-02-21T08:33:11.7665151Z 'range_multi_buffers': [None, False], 2026-02-21T08:33:11.7665345Z 'range_num_stages': [0, 2], 2026-02-21T08:33:11.7665511Z 'range_unroll_factors': [0, 0], 2026-02-21T08:33:11.7682361Z 'range_warp_specializes': [None, True]} 2026-02-21T08:33:11.7682629Z [209s] Fitting surrogate: 669 points, 669 targets 2026-02-21T08:33:12.2673976Z [209s] Generation 10 starting: 17 neighbors, 2 active search path(s) 2026-02-21T08:33:13.7344625Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 15.7 configs/s 2026-02-21T08:33:14.7438428Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.6 configs/s 2026-02-21T08:33:15.9267565Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 852.6 2026-02-21T08:33:15.9271242Z configs/s 2026-02-21T08:33:16.0240162Z [213s] Generation 10 complete: 2026-02-21T08:33:16.0242102Z ok=19 2026-02-21T08:33:16.0242272Z min=0.0245 2026-02-21T08:33:16.0242410Z mid=0.0245 2026-02-21T08:33:16.0242529Z 
max=0.0430 2026-02-21T08:33:16.0242673Z best={'block_sizes': [1, 8192], 2026-02-21T08:33:16.0242901Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:33:16.0243138Z 'load_eviction_policies': ['', ''], 2026-02-21T08:33:16.0243315Z 'num_stages': 4, 2026-02-21T08:33:16.0243486Z 'num_warps': 1, 2026-02-21T08:33:16.0243970Z 'pid_type': 'flat', 2026-02-21T08:33:16.0244137Z 'range_flattens': [None, True], 2026-02-21T08:33:16.0244315Z 'range_multi_buffers': [None, False], 2026-02-21T08:33:16.0244506Z 'range_num_stages': [0, 2], 2026-02-21T08:33:16.0244678Z 'range_unroll_factors': [0, 0], 2026-02-21T08:33:16.0244855Z 'range_warp_specializes': [None, True]} 2026-02-21T08:33:16.0271209Z [213s] Fitting surrogate: 688 points, 688 targets 2026-02-21T08:33:16.4563121Z [213s] Generation 11 starting: 11 neighbors, 2 active search path(s) 2026-02-21T08:33:18.0750604Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 3.5 configs/s 2026-02-21T08:33:18.7197450Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 18.4 configs/s 2026-02-21T08:33:19.5849013Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1160.3 2026-02-21T08:33:19.5852701Z configs/s 2026-02-21T08:33:19.6596718Z [216s] Generation 11 complete: 2026-02-21T08:33:19.6600267Z ok=13 2026-02-21T08:33:19.6603509Z min=0.0245 2026-02-21T08:33:19.6607971Z mid=0.0245 2026-02-21T08:33:19.6611850Z max=0.0308 2026-02-21T08:33:19.6613283Z best={'block_sizes': [1, 8192], 2026-02-21T08:33:19.6613612Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:33:19.6613845Z 'load_eviction_policies': ['', ''], 2026-02-21T08:33:19.6618882Z 'num_stages': 4, 2026-02-21T08:33:19.6619143Z 'num_warps': 4, 2026-02-21T08:33:19.6619332Z 'pid_type': 'flat', 2026-02-21T08:33:19.6619519Z 'range_flattens': [None, True], 2026-02-21T08:33:19.6623504Z 'range_multi_buffers': [None, False], 2026-02-21T08:33:19.6627226Z 'range_num_stages': [0, 3], 2026-02-21T08:33:19.6632174Z 
'range_unroll_factors': [0, 0], 2026-02-21T08:33:19.6636593Z 'range_warp_specializes': [None, True]} 2026-02-21T08:33:19.6638620Z [216s] Fitting surrogate: 701 points, 701 targets 2026-02-21T08:33:20.0708949Z [217s] Generation 12 starting: 11 neighbors, 1 active search path(s) 2026-02-21T08:33:21.5947041Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 2.6 configs/s 2026-02-21T08:33:22.2405376Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 18.4 configs/s 2026-02-21T08:33:22.9648438Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1382.9 2026-02-21T08:33:22.9649635Z configs/s 2026-02-21T08:33:23.0350874Z [220s] Generation 12 complete: 2026-02-21T08:33:23.0356164Z ok=12 2026-02-21T08:33:23.0360691Z min=0.0245 2026-02-21T08:33:23.0362211Z mid=0.0245 2026-02-21T08:33:23.0362390Z max=0.0369 2026-02-21T08:33:23.0362596Z best={'block_sizes': [1, 8192], 2026-02-21T08:33:23.0362855Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:33:23.0365218Z 'load_eviction_policies': ['', ''], 2026-02-21T08:33:23.0365499Z 'num_stages': 4, 2026-02-21T08:33:23.0370603Z 'num_warps': 4, 2026-02-21T08:33:23.0372926Z 'pid_type': 'flat', 2026-02-21T08:33:23.0373204Z 'range_flattens': [None, True], 2026-02-21T08:33:23.0373421Z 'range_multi_buffers': [None, False], 2026-02-21T08:33:23.0377633Z 'range_num_stages': [0, 3], 2026-02-21T08:33:23.0377957Z 'range_unroll_factors': [0, 0], 2026-02-21T08:33:23.0378190Z 'range_warp_specializes': [None, True]} 2026-02-21T08:33:23.0382635Z [220s] Fitting surrogate: 713 points, 713 targets 2026-02-21T08:33:23.3240718Z [220s] Autotuning complete in 220.7s after searching 680 configs. 
2026-02-21T08:33:23.3245302Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:33:23.3247508Z @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', ''], num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:33:23.3248380Z 2026-02-21T08:33:23.3253946Z [220s] Code of selected kernel: /tmp/torchinductor_root/yl/cyl2mzswb3juxdo7sy6qfkfuw5h2iowrkcl3hdh555nmuczs77xm.py 2026-02-21T08:33:24.1308493Z WARNING:tritonbench.utils.triton_op:Completed input ID 41: 2026-02-21T08:33:24.1312616Z (M, N) 2026-02-21T08:33:24.1317748Z ------------ 2026-02-21T08:33:24.1319989Z (4096, 5504) 2026-02-21T08:33:24.1320128Z 2026-02-21T08:33:24.1320701Z 45%|████▌ | 9/20 [24:21<35:15, 192.34s/it]WARNING:tritonbench.utils.triton_op:Running input ID 46: 2026-02-21T08:33:24.1325033Z (M, N) 2026-02-21T08:33:24.1326724Z ------------ 2026-02-21T08:33:24.1326902Z (4096, 6144) 2026-02-21T08:33:24.1327181Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T08:33:25.3789876Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T08:33:26.8741449Z INFO:tritonbench.utils.triton_op:Took 2.36ms to get benchmark function for torch_compile_softmax 2026-02-21T08:33:32.0115784Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:33:32.0117113Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:33:32.0117346Z 'dtype': 'torch.float16', 2026-02-21T08:33:32.0123352Z 'shape': (4096, 6144), 2026-02-21T08:33:32.0124821Z 'stride': (6144, 1)},), 2026-02-21T08:33:32.0125043Z 'kwargs': {}} 2026-02-21T08:33:32.0136898Z INFO:tritonbench.utils.triton_op:Took 2.61ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T08:33:32.1929058Z 
[0s] Autotune random seed: 2134816249 2026-02-21T08:33:32.3307341Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:34:06.5704872Z [34s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None]) 2026-02-21T08:34:06.9115833Z [34s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False]) 2026-02-21T08:34:07.1523504Z [34s] Timeout after 30s compiling Config(block_sizes=[256, 4096], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=128, num_sm_multiplier=128, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 2], range_unroll_factors=[3, 0], range_warp_specializes=[None, True]) 2026-02-21T08:34:07.1544619Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s 2026-02-21T08:34:13.9626470Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.8 configs/s 2026-02-21T08:34:13.9635637Z [41s] Adaptive compile timeout: 30s (90% percentile=6.4s, bounds=[30.0s, 30s]) 2026-02-21T08:34:14.3677484Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 2405.1 configs/s 2026-02-21T08:34:14.4115229Z [42s] Initial random 
population of 100, 5 starting points: 2026-02-21T08:34:14.4116500Z error=6 2026-02-21T08:34:14.4116660Z timeout=3 2026-02-21T08:34:14.4116787Z ok=91 2026-02-21T08:34:14.4116917Z min=0.0348 2026-02-21T08:34:14.4117041Z mid=0.5385 2026-02-21T08:34:14.4117170Z max=41.0603 2026-02-21T08:34:14.4117317Z best={'block_sizes': [1, 8192], 2026-02-21T08:34:14.4117549Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:34:14.4117787Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:34:14.4117965Z 'maxnreg': 32, 2026-02-21T08:34:14.4118148Z 'num_sm_multiplier': 64, 2026-02-21T08:34:14.4118699Z 'num_stages': 7, 2026-02-21T08:34:14.4118850Z 'num_warps': 4, 2026-02-21T08:34:14.4119009Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:34:14.4119214Z 'range_flattens': [None, True], 2026-02-21T08:34:14.4119396Z 'range_multi_buffers': [False, True], 2026-02-21T08:34:14.4119595Z 'range_num_stages': [1, 4], 2026-02-21T08:34:14.4119775Z 'range_unroll_factors': [1, 4], 2026-02-21T08:34:14.4119953Z 'range_warp_specializes': [True, None]} 2026-02-21T08:34:14.4130874Z [42s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:34:15.5611286Z [43s] Generation 1 starting: 83 neighbors, 5 active search path(s) 2026-02-21T08:34:23.0337266Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 12.3 configs/s 2026-02-21T08:34:28.2389255Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 87/87 16.9 configs/s 2026-02-21T08:34:33.3714749Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 221.9 2026-02-21T08:34:33.3716842Z configs/s 2026-02-21T08:34:33.6693900Z [61s] Generation 1 complete: 2026-02-21T08:34:33.6698073Z ok=89 2026-02-21T08:34:33.6702032Z min=0.0307 2026-02-21T08:34:33.6704154Z mid=0.0409 2026-02-21T08:34:33.6704367Z max=2.1657 2026-02-21T08:34:33.6709305Z best={'block_sizes': [1, 8192], 2026-02-21T08:34:33.6711225Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:34:33.6711495Z 
'load_eviction_policies': ['', 'last'], 2026-02-21T08:34:33.6711940Z 'num_stages': 7, 2026-02-21T08:34:33.6712091Z 'num_warps': 4, 2026-02-21T08:34:33.6712229Z 'pid_type': 'flat', 2026-02-21T08:34:33.6712393Z 'range_flattens': [None, True], 2026-02-21T08:34:33.6712572Z 'range_multi_buffers': [None, None], 2026-02-21T08:34:33.6712762Z 'range_num_stages': [0, 4], 2026-02-21T08:34:33.6712925Z 'range_unroll_factors': [0, 0], 2026-02-21T08:34:33.6713109Z 'range_warp_specializes': [None, True]} 2026-02-21T08:34:33.6713359Z [61s] Fitting surrogate: 189 points, 189 targets 2026-02-21T08:34:34.5585348Z [62s] Generation 2 starting: 66 neighbors, 5 active search path(s) 2026-02-21T08:34:44.4898003Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 2.4 configs/s 2026-02-21T08:34:48.5769418Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 16.8 configs/s 2026-02-21T08:34:52.7458340Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 243.2 2026-02-21T08:34:52.7459719Z configs/s 2026-02-21T08:34:53.0448377Z [80s] Generation 2 complete: 2026-02-21T08:34:53.0451529Z ok=71 2026-02-21T08:34:53.0454889Z min=0.0287 2026-02-21T08:34:53.0459296Z mid=0.0368 2026-02-21T08:34:53.0460701Z max=0.5110 2026-02-21T08:34:53.0460911Z best={'block_sizes': [2, 8192], 2026-02-21T08:34:53.0461187Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:34:53.0461509Z 'load_eviction_policies': ['', ''], 2026-02-21T08:34:53.0462175Z 'num_stages': 4, 2026-02-21T08:34:53.0462329Z 'num_warps': 2, 2026-02-21T08:34:53.0462473Z 'pid_type': 'flat', 2026-02-21T08:34:53.0462638Z 'range_flattens': [None, False], 2026-02-21T08:34:53.0462827Z 'range_multi_buffers': [None, False], 2026-02-21T08:34:53.0463026Z 'range_num_stages': [0, 4], 2026-02-21T08:34:53.0463201Z 'range_unroll_factors': [0, 0], 2026-02-21T08:34:53.0463382Z 'range_warp_specializes': [None, True]} 2026-02-21T08:34:53.0463717Z [80s] Fitting surrogate: 260 points, 260 
targets 2026-02-21T08:34:53.8768935Z [81s] Generation 3 starting: 59 neighbors, 5 active search path(s) 2026-02-21T08:35:00.3396708Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 24.3 configs/s 2026-02-21T08:35:04.0781879Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 62/62 16.8 configs/s 2026-02-21T08:35:08.6769869Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 242.6 2026-02-21T08:35:08.6772582Z configs/s 2026-02-21T08:35:08.9585995Z [96s] Generation 3 complete: 2026-02-21T08:35:08.9587393Z ok=65 2026-02-21T08:35:08.9587595Z min=0.0246 2026-02-21T08:35:08.9587761Z mid=0.0328 2026-02-21T08:35:08.9587919Z max=0.0522 2026-02-21T08:35:08.9588096Z best={'block_sizes': [1, 8192], 2026-02-21T08:35:08.9588396Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:35:08.9588710Z 'load_eviction_policies': ['', ''], 2026-02-21T08:35:08.9588901Z 'num_stages': 4, 2026-02-21T08:35:08.9589074Z 'num_warps': 1, 2026-02-21T08:35:08.9589250Z 'pid_type': 'flat', 2026-02-21T08:35:08.9589424Z 'range_flattens': [None, None], 2026-02-21T08:35:08.9589620Z 'range_multi_buffers': [None, False], 2026-02-21T08:35:08.9589819Z 'range_num_stages': [0, 4], 2026-02-21T08:35:08.9589998Z 'range_unroll_factors': [0, 0], 2026-02-21T08:35:08.9591804Z 'range_warp_specializes': [None, True]} 2026-02-21T08:35:08.9599078Z [96s] Fitting surrogate: 325 points, 325 targets 2026-02-21T08:35:09.5562205Z [97s] Generation 4 starting: 38 neighbors, 3 active search path(s) 2026-02-21T08:35:15.4060892Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 2.2 configs/s 2026-02-21T08:35:17.7699134Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 39/39 16.8 configs/s 2026-02-21T08:35:20.2157974Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 414.5 2026-02-21T08:35:20.2161207Z configs/s 2026-02-21T08:35:20.4049369Z [108s] Generation 4 complete: 2026-02-21T08:35:20.4053714Z ok=41 
2026-02-21T08:35:20.4055147Z min=0.0265 2026-02-21T08:35:20.4055314Z mid=0.0328 2026-02-21T08:35:20.4055435Z max=0.1455 2026-02-21T08:35:20.4055583Z best={'block_sizes': [1, 8192], 2026-02-21T08:35:20.4055855Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:35:20.4056128Z 'load_eviction_policies': ['', ''], 2026-02-21T08:35:20.4056348Z 'num_stages': 4, 2026-02-21T08:35:20.4056503Z 'num_warps': 1, 2026-02-21T08:35:20.4056655Z 'pid_type': 'flat', 2026-02-21T08:35:20.4056808Z 'range_flattens': [None, None], 2026-02-21T08:35:20.4056990Z 'range_multi_buffers': [None, False], 2026-02-21T08:35:20.4057172Z 'range_num_stages': [0, 4], 2026-02-21T08:35:20.4057339Z 'range_unroll_factors': [0, 0], 2026-02-21T08:35:20.4057522Z 'range_warp_specializes': [None, True]} 2026-02-21T08:35:20.4068192Z [108s] Fitting surrogate: 366 points, 366 targets 2026-02-21T08:35:20.8142649Z [108s] Generation 5 starting: 17 neighbors, 2 active search path(s) 2026-02-21T08:35:22.9669009Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 12.5 configs/s 2026-02-21T08:35:24.1240096Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 19/19 17.1 configs/s 2026-02-21T08:35:25.4197815Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 777.4 2026-02-21T08:35:25.4199609Z configs/s 2026-02-21T08:35:25.5192874Z [113s] Generation 5 complete: 2026-02-21T08:35:25.5196740Z ok=20 2026-02-21T08:35:25.5199869Z min=0.0266 2026-02-21T08:35:25.5203721Z mid=0.0307 2026-02-21T08:35:25.5208163Z max=0.0492 2026-02-21T08:35:25.5212554Z best={'block_sizes': [1, 8192], 2026-02-21T08:35:25.5214148Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:35:25.5214455Z 'load_eviction_policies': ['', ''], 2026-02-21T08:35:25.5214633Z 'num_stages': 4, 2026-02-21T08:35:25.5214786Z 'num_warps': 1, 2026-02-21T08:35:25.5214928Z 'pid_type': 'flat', 2026-02-21T08:35:25.5215171Z 'range_flattens': [None, None], 
2026-02-21T08:35:25.5217353Z 'range_multi_buffers': [None, False], 2026-02-21T08:35:25.5217587Z 'range_num_stages': [0, 4], 2026-02-21T08:35:25.5217762Z 'range_unroll_factors': [0, 0], 2026-02-21T08:35:25.5217954Z 'range_warp_specializes': [None, True]} 2026-02-21T08:35:25.5218575Z [113s] Fitting surrogate: 386 points, 386 targets 2026-02-21T08:35:25.9678413Z [113s] Generation 6 starting: 21 neighbors, 2 active search path(s) 2026-02-21T08:35:27.8818751Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 15.7 configs/s 2026-02-21T08:35:29.1580428Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 17.0 configs/s 2026-02-21T08:35:30.7485850Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 635.7 2026-02-21T08:35:30.7489942Z configs/s 2026-02-21T08:35:30.8688654Z [118s] Generation 6 complete: 2026-02-21T08:35:30.8693053Z ok=24 2026-02-21T08:35:30.8695128Z min=0.0265 2026-02-21T08:35:30.8695305Z mid=0.0266 2026-02-21T08:35:30.8695439Z max=0.0471 2026-02-21T08:35:30.8695580Z best={'block_sizes': [1, 8192], 2026-02-21T08:35:30.8695868Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:35:30.8696161Z 'load_eviction_policies': ['', ''], 2026-02-21T08:35:30.8696381Z 'num_stages': 5, 2026-02-21T08:35:30.8696539Z 'num_warps': 4, 2026-02-21T08:35:30.8696693Z 'pid_type': 'flat', 2026-02-21T08:35:30.8696860Z 'range_flattens': [None, True], 2026-02-21T08:35:30.8697038Z 'range_multi_buffers': [None, True], 2026-02-21T08:35:30.8697234Z 'range_num_stages': [0, 2], 2026-02-21T08:35:30.8697404Z 'range_unroll_factors': [0, 2], 2026-02-21T08:35:30.8697592Z 'range_warp_specializes': [None, None]} 2026-02-21T08:35:30.8706167Z [118s] Fitting surrogate: 410 points, 410 targets 2026-02-21T08:35:31.2966701Z [118s] Generation 7 starting: 20 neighbors, 2 active search path(s) 2026-02-21T08:35:33.4670839Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 7.9 configs/s 
2026-02-21T08:35:34.8633508Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 15.5 configs/s 2026-02-21T08:35:36.8602077Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 657.9 2026-02-21T08:35:36.8603612Z configs/s 2026-02-21T08:35:36.9871804Z [124s] Generation 7 complete: 2026-02-21T08:35:36.9872954Z ok=22 2026-02-21T08:35:36.9873114Z min=0.0266 2026-02-21T08:35:36.9873256Z mid=0.0266 2026-02-21T08:35:36.9873376Z max=0.0369 2026-02-21T08:35:36.9873522Z best={'block_sizes': [1, 8192], 2026-02-21T08:35:36.9873769Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:35:36.9874027Z 'load_eviction_policies': ['', ''], 2026-02-21T08:35:36.9874204Z 'num_stages': 3, 2026-02-21T08:35:36.9874353Z 'num_warps': 1, 2026-02-21T08:35:36.9874492Z 'pid_type': 'flat', 2026-02-21T08:35:36.9874656Z 'range_flattens': [None, True], 2026-02-21T08:35:36.9874839Z 'range_multi_buffers': [None, False], 2026-02-21T08:35:36.9875019Z 'range_num_stages': [0, 3], 2026-02-21T08:35:36.9875187Z 'range_unroll_factors': [0, 0], 2026-02-21T08:35:36.9875361Z 'range_warp_specializes': [None, True]} 2026-02-21T08:35:36.9887752Z [124s] Fitting surrogate: 432 points, 432 targets 2026-02-21T08:35:37.2673091Z [124s] Generation 8 starting: 10 neighbors, 1 active search path(s) 2026-02-21T08:35:38.6410634Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 10.7 configs/s 2026-02-21T08:35:39.2549530Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 17.6 configs/s 2026-02-21T08:35:40.0787204Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1216.1 2026-02-21T08:35:40.0788597Z configs/s 2026-02-21T08:35:40.1514645Z [127s] Generation 8 complete: 2026-02-21T08:35:40.1519010Z ok=12 2026-02-21T08:35:40.1520413Z min=0.0266 2026-02-21T08:35:40.1520579Z mid=0.0266 2026-02-21T08:35:40.1520705Z max=0.0307 2026-02-21T08:35:40.1520852Z best={'block_sizes': [1, 8192], 2026-02-21T08:35:40.1521121Z 'indexing': 
['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:35:40.1521406Z 'load_eviction_policies': ['', ''], 2026-02-21T08:35:40.1521818Z 'num_stages': 5, 2026-02-21T08:35:40.1522328Z 'num_warps': 4, 2026-02-21T08:35:40.1522483Z 'pid_type': 'flat', 2026-02-21T08:35:40.1522656Z 'range_flattens': [None, None], 2026-02-21T08:35:40.1522839Z 'range_multi_buffers': [None, True], 2026-02-21T08:35:40.1523038Z 'range_num_stages': [0, 2], 2026-02-21T08:35:40.1523218Z 'range_unroll_factors': [0, 1], 2026-02-21T08:35:40.1523400Z 'range_warp_specializes': [None, None]} 2026-02-21T08:35:40.1540557Z [127s] Fitting surrogate: 444 points, 444 targets 2026-02-21T08:35:40.4575853Z [128s] Generation 9 starting: 10 neighbors, 1 active search path(s) 2026-02-21T08:35:41.7570980Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 13.6 configs/s 2026-02-21T08:35:42.3582072Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 18.0 configs/s 2026-02-21T08:35:42.8393018Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2057.0 2026-02-21T08:35:42.8397039Z configs/s 2026-02-21T08:35:42.8898361Z [130s] Generation 9 complete: 2026-02-21T08:35:42.8902931Z ok=11 2026-02-21T08:35:42.8905867Z min=0.0266 2026-02-21T08:35:42.8906044Z mid=0.0327 2026-02-21T08:35:42.8906175Z max=0.1455 2026-02-21T08:35:42.8906331Z best={'block_sizes': [1, 8192], 2026-02-21T08:35:42.8906616Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:35:42.8906917Z 'load_eviction_policies': ['', ''], 2026-02-21T08:35:42.8907105Z 'num_stages': 5, 2026-02-21T08:35:42.8907263Z 'num_warps': 4, 2026-02-21T08:35:42.8907423Z 'pid_type': 'flat', 2026-02-21T08:35:42.8907587Z 'range_flattens': [None, None], 2026-02-21T08:35:42.8907782Z 'range_multi_buffers': [None, True], 2026-02-21T08:35:42.8907976Z 'range_num_stages': [0, 2], 2026-02-21T08:35:42.8908157Z 'range_unroll_factors': [0, 1], 2026-02-21T08:35:42.8908346Z 
'range_warp_specializes': [None, None]} 2026-02-21T08:35:42.8924287Z [130s] Fitting surrogate: 455 points, 455 targets 2026-02-21T08:35:43.1810496Z [130s] Generation 10 starting: 7 neighbors, 1 active search path(s) 2026-02-21T08:35:44.2867005Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7/7 10.8 configs/s 2026-02-21T08:35:44.7139581Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 7/7 18.5 configs/s 2026-02-21T08:35:45.2619458Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1809.8 2026-02-21T08:35:45.2621206Z configs/s 2026-02-21T08:35:45.3135660Z [132s] Generation 10 complete: 2026-02-21T08:35:45.3139583Z ok=8 2026-02-21T08:35:45.3144119Z min=0.0266 2026-02-21T08:35:45.3145852Z mid=0.0266 2026-02-21T08:35:45.3146058Z max=0.0328 2026-02-21T08:35:45.3151493Z best={'block_sizes': [1, 8192], 2026-02-21T08:35:45.3155583Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:35:45.3159706Z 'load_eviction_policies': ['', ''], 2026-02-21T08:35:45.3163566Z 'num_stages': 5, 2026-02-21T08:35:45.3175717Z 'num_warps': 1, 2026-02-21T08:35:45.3176065Z 'pid_type': 'flat', 2026-02-21T08:35:45.3176239Z 'range_flattens': [None, None], 2026-02-21T08:35:45.3176442Z 'range_multi_buffers': [None, True], 2026-02-21T08:35:45.3176645Z 'range_num_stages': [0, 2], 2026-02-21T08:35:45.3176819Z 'range_unroll_factors': [0, 0], 2026-02-21T08:35:45.3177018Z 'range_warp_specializes': [None, True]} 2026-02-21T08:35:45.3177252Z [132s] Fitting surrogate: 463 points, 463 targets 2026-02-21T08:35:45.5785388Z [133s] Generation 11 starting: 9 neighbors, 1 active search path(s) 2026-02-21T08:35:47.2425671Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 4.6 configs/s 2026-02-21T08:35:47.8486348Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 10/10 17.9 configs/s 2026-02-21T08:35:48.5234409Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1480.7 
2026-02-21T08:35:48.5235268Z configs/s 2026-02-21T08:35:48.5866202Z [136s] Generation 11 complete: 2026-02-21T08:35:48.5870559Z ok=11 2026-02-21T08:35:48.5872130Z min=0.0266 2026-02-21T08:35:48.5872296Z mid=0.0266 2026-02-21T08:35:48.5872418Z max=0.0389 2026-02-21T08:35:48.5872562Z best={'block_sizes': [1, 8192], 2026-02-21T08:35:48.5872832Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:35:48.5873107Z 'load_eviction_policies': ['', ''], 2026-02-21T08:35:48.5873280Z 'num_stages': 5, 2026-02-21T08:35:48.5873425Z 'num_warps': 1, 2026-02-21T08:35:48.5873565Z 'pid_type': 'flat', 2026-02-21T08:35:48.5873723Z 'range_flattens': [None, None], 2026-02-21T08:35:48.5873896Z 'range_multi_buffers': [None, True], 2026-02-21T08:35:48.5874084Z 'range_num_stages': [0, 2], 2026-02-21T08:35:48.5874250Z 'range_unroll_factors': [0, 0], 2026-02-21T08:35:48.5874423Z 'range_warp_specializes': [None, True]} 2026-02-21T08:35:48.5884017Z [136s] Fitting surrogate: 474 points, 474 targets 2026-02-21T08:35:48.7678437Z [136s] Autotuning complete in 136.4s after searching 451 configs. 
2026-02-21T08:35:48.7678943Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:35:48.7679992Z @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', ''], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:35:48.7680882Z 2026-02-21T08:35:48.7681154Z [136s] Code of selected kernel: /tmp/torchinductor_root/34/c34nendb26c5gje5v6qbm6aunt2taswx5hcv7ajfpjlnjbo4gcdo.py 2026-02-21T08:35:49.7088697Z WARNING:tritonbench.utils.triton_op:Completed input ID 46: 2026-02-21T08:35:49.7090509Z (M, N) 2026-02-21T08:35:49.7090684Z ------------ 2026-02-21T08:35:49.7090827Z (4096, 6144) 2026-02-21T08:35:49.7090987Z 2026-02-21T08:35:49.7104613Z 50%|█████ | 10/20 [26:47<29:39, 177.90s/it]WARNING:tritonbench.utils.triton_op:Running input ID 51: 2026-02-21T08:35:49.7106154Z (M, N) 2026-02-21T08:35:49.7106329Z ------------ 2026-02-21T08:35:49.7106477Z (4096, 6784) 2026-02-21T08:35:49.7106851Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for naive_softmax 2026-02-21T08:35:50.9126277Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T08:35:52.4112509Z INFO:tritonbench.utils.triton_op:Took 2.50ms to get benchmark function for torch_compile_softmax 2026-02-21T08:35:55.9221188Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:35:55.9225492Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:35:55.9228718Z 'dtype': 'torch.float16', 2026-02-21T08:35:55.9231962Z 'shape': (4096, 6784), 2026-02-21T08:35:55.9236488Z 'stride': (6784, 1)},), 2026-02-21T08:35:55.9238348Z 'kwargs': {}} 2026-02-21T08:35:55.9243588Z INFO:tritonbench.utils.triton_op:Took 2.38ms to get benchmark function for helion_softmax_tritonbench 
2026-02-21T08:35:56.1011820Z [0s] Autotune random seed: 2134816249 2026-02-21T08:35:56.2402290Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:36:31.0616532Z [34s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None]) 2026-02-21T08:36:31.4522547Z [35s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False]) 2026-02-21T08:36:31.7120601Z [35s] Timeout after 30s compiling Config(block_sizes=[256, 4096], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=128, num_sm_multiplier=128, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 2], range_unroll_factors=[3, 0], range_warp_specializes=[None, True]) 2026-02-21T08:36:31.7137060Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s 2026-02-21T08:36:38.7049031Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.4 configs/s 2026-02-21T08:36:38.7058740Z [42s] Adaptive compile timeout: 30s (90% percentile=7.3s, bounds=[30.0s, 30s]) 2026-02-21T08:36:39.4884928Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1261.2 configs/s 
2026-02-21T08:36:39.5599454Z [43s] Initial random population of 100, 5 starting points: 2026-02-21T08:36:39.5601107Z error=6 2026-02-21T08:36:39.5601269Z timeout=3 2026-02-21T08:36:39.5601403Z ok=91 2026-02-21T08:36:39.5601525Z min=0.0492 2026-02-21T08:36:39.5601952Z mid=0.6514 2026-02-21T08:36:39.5602076Z max=45.4574 2026-02-21T08:36:39.5602230Z best={'block_sizes': [2, 1024], 2026-02-21T08:36:39.5602502Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:36:39.5602799Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:36:39.5602998Z 'num_sm_multiplier': 64, 2026-02-21T08:36:39.5603157Z 'num_stages': 5, 2026-02-21T08:36:39.5603303Z 'num_warps': 1, 2026-02-21T08:36:39.5603460Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:36:39.5603660Z 'range_flattens': [True, True], 2026-02-21T08:36:39.5603837Z 'range_multi_buffers': [False, None], 2026-02-21T08:36:39.5604032Z 'range_num_stages': [3, 1], 2026-02-21T08:36:39.5604199Z 'range_unroll_factors': [0, 2], 2026-02-21T08:36:39.5604403Z 'range_warp_specializes': [True, None]} 2026-02-21T08:36:39.5613414Z [43s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:36:40.6639704Z [44s] Generation 1 starting: 80 neighbors, 5 active search path(s) 2026-02-21T08:37:06.7963521Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 0.6 configs/s 2026-02-21T08:37:11.9294633Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 16.7 configs/s 2026-02-21T08:37:18.4073461Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 173.5 2026-02-21T08:37:18.4074365Z configs/s 2026-02-21T08:37:18.7708940Z [82s] Generation 1 complete: 2026-02-21T08:37:18.7713019Z ok=86 2026-02-21T08:37:18.7714419Z min=0.0410 2026-02-21T08:37:18.7714594Z mid=0.0512 2026-02-21T08:37:18.7714732Z max=2.4760 2026-02-21T08:37:18.7714877Z best={'block_sizes': [2, 4096], 2026-02-21T08:37:18.7715695Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 
2026-02-21T08:37:18.7716024Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:37:18.7716230Z 'num_sm_multiplier': 64, 2026-02-21T08:37:18.7716399Z 'num_stages': 5, 2026-02-21T08:37:18.7716551Z 'num_warps': 4, 2026-02-21T08:37:18.7716715Z 'pid_type': 'persistent_blocked', 2026-02-21T08:37:18.7716916Z 'range_flattens': [True, True], 2026-02-21T08:37:18.7717100Z 'range_multi_buffers': [None, None], 2026-02-21T08:37:18.7717298Z 'range_num_stages': [3, 1], 2026-02-21T08:37:18.7717478Z 'range_unroll_factors': [0, 2], 2026-02-21T08:37:18.7717663Z 'range_warp_specializes': [True, None]} 2026-02-21T08:37:18.7723677Z [82s] Fitting surrogate: 186 points, 186 targets 2026-02-21T08:37:19.7710214Z [83s] Generation 2 starting: 68 neighbors, 5 active search path(s) 2026-02-21T08:37:39.8361353Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 0.8 configs/s 2026-02-21T08:37:44.0770892Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 71/71 16.9 configs/s 2026-02-21T08:37:48.4698164Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 230.6 2026-02-21T08:37:48.4702317Z configs/s 2026-02-21T08:37:48.7441860Z [112s] Generation 2 complete: 2026-02-21T08:37:48.7446028Z ok=74 2026-02-21T08:37:48.7450427Z min=0.0328 2026-02-21T08:37:48.7455492Z mid=0.0450 2026-02-21T08:37:48.7457497Z max=0.4588 2026-02-21T08:37:48.7457677Z best={'block_sizes': [1, 8192], 2026-02-21T08:37:48.7457957Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:37:48.7458254Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:37:48.7458465Z 'num_stages': 4, 2026-02-21T08:37:48.7458616Z 'num_warps': 16, 2026-02-21T08:37:48.7458757Z 'pid_type': 'flat', 2026-02-21T08:37:48.7458922Z 'range_flattens': [None, False], 2026-02-21T08:37:48.7459105Z 'range_multi_buffers': [None, False], 2026-02-21T08:37:48.7459296Z 'range_num_stages': [0, 3], 2026-02-21T08:37:48.7459482Z 'range_unroll_factors': [0, 0], 2026-02-21T08:37:48.7459682Z 
'range_warp_specializes': [None, True]} 2026-02-21T08:37:48.7461403Z [112s] Fitting surrogate: 260 points, 260 targets 2026-02-21T08:37:49.5764443Z [113s] Generation 3 starting: 61 neighbors, 5 active search path(s) 2026-02-21T08:37:55.2811717Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 8.8 configs/s 2026-02-21T08:37:58.9934724Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 62/62 16.9 configs/s 2026-02-21T08:38:01.6359132Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 462.6 2026-02-21T08:38:01.6360539Z configs/s 2026-02-21T08:38:01.7915558Z [125s] Generation 3 complete: 2026-02-21T08:38:01.7919841Z ok=66 2026-02-21T08:38:01.7923597Z min=0.0266 2026-02-21T08:38:01.7928226Z mid=0.0410 2026-02-21T08:38:01.7930222Z max=0.0655 2026-02-21T08:38:01.7930425Z best={'block_sizes': [1, 8192], 2026-02-21T08:38:01.7930731Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:38:01.7931379Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:38:01.7931782Z 'num_stages': 4, 2026-02-21T08:38:01.7931951Z 'num_warps': 4, 2026-02-21T08:38:01.7932097Z 'pid_type': 'flat', 2026-02-21T08:38:01.7932268Z 'range_flattens': [None, False], 2026-02-21T08:38:01.7932448Z 'range_multi_buffers': [None, False], 2026-02-21T08:38:01.7932641Z 'range_num_stages': [0, 3], 2026-02-21T08:38:01.7932814Z 'range_unroll_factors': [0, 0], 2026-02-21T08:38:01.7933089Z 'range_warp_specializes': [None, True]} 2026-02-21T08:38:01.7933304Z [125s] Fitting surrogate: 326 points, 326 targets 2026-02-21T08:38:02.6800582Z [126s] Generation 4 starting: 58 neighbors, 5 active search path(s) 2026-02-21T08:38:08.6009109Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60/60 35.5 configs/s 2026-02-21T08:38:12.2119006Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 60/60 16.8 configs/s 2026-02-21T08:38:15.9497348Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 271.3 
2026-02-21T08:38:15.9501469Z configs/s 2026-02-21T08:38:16.1915016Z [139s] Generation 4 complete: 2026-02-21T08:38:16.1915285Z ok=63 2026-02-21T08:38:16.1915463Z min=0.0266 2026-02-21T08:38:16.1915604Z mid=0.0369 2026-02-21T08:38:16.1915770Z max=0.0746 2026-02-21T08:38:16.1915918Z best={'block_sizes': [1, 8192], 2026-02-21T08:38:16.1916222Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:38:16.1916566Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:38:16.1916770Z 'num_stages': 4, 2026-02-21T08:38:16.1916916Z 'num_warps': 1, 2026-02-21T08:38:16.1917056Z 'pid_type': 'flat', 2026-02-21T08:38:16.1917218Z 'range_flattens': [None, False], 2026-02-21T08:38:16.1917401Z 'range_multi_buffers': [None, False], 2026-02-21T08:38:16.1917584Z 'range_num_stages': [0, 3], 2026-02-21T08:38:16.1917773Z 'range_unroll_factors': [0, 0], 2026-02-21T08:38:16.1917963Z 'range_warp_specializes': [None, True]} 2026-02-21T08:38:16.1934882Z [139s] Fitting surrogate: 389 points, 389 targets 2026-02-21T08:38:16.7837837Z [140s] Generation 5 starting: 32 neighbors, 3 active search path(s) 2026-02-21T08:38:20.8386120Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 3.0 configs/s 2026-02-21T08:38:22.8952922Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 34/34 16.9 configs/s 2026-02-21T08:38:25.2256988Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 434.6 2026-02-21T08:38:25.2258690Z configs/s 2026-02-21T08:38:25.3995512Z [149s] Generation 5 complete: 2026-02-21T08:38:25.3997581Z ok=35 2026-02-21T08:38:25.3997799Z min=0.0266 2026-02-21T08:38:25.3998029Z mid=0.0285 2026-02-21T08:38:25.3998199Z max=0.0553 2026-02-21T08:38:25.3998372Z best={'block_sizes': [1, 8192], 2026-02-21T08:38:25.4002079Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:38:25.4006232Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:38:25.4007630Z 'num_stages': 4, 
2026-02-21T08:38:25.4007805Z 'num_warps': 1, 2026-02-21T08:38:25.4007965Z 'pid_type': 'flat', 2026-02-21T08:38:25.4008127Z 'range_flattens': [None, False], 2026-02-21T08:38:25.4008318Z 'range_multi_buffers': [None, False], 2026-02-21T08:38:25.4008510Z 'range_num_stages': [0, 3], 2026-02-21T08:38:25.4008678Z 'range_unroll_factors': [0, 0], 2026-02-21T08:38:25.4008865Z 'range_warp_specializes': [None, True]} 2026-02-21T08:38:25.4011440Z [149s] Fitting surrogate: 424 points, 424 targets 2026-02-21T08:38:25.7799645Z [149s] Generation 6 starting: 16 neighbors, 2 active search path(s) 2026-02-21T08:38:28.0391429Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 4.9 configs/s 2026-02-21T08:38:29.0022873Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 16/16 17.4 configs/s 2026-02-21T08:38:30.1717952Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 861.9 2026-02-21T08:38:30.1722404Z configs/s 2026-02-21T08:38:30.2616661Z [154s] Generation 6 complete: 2026-02-21T08:38:30.2620919Z ok=18 2026-02-21T08:38:30.2624915Z min=0.0266 2026-02-21T08:38:30.2629339Z mid=0.0266 2026-02-21T08:38:30.2633817Z max=0.0471 2026-02-21T08:38:30.2637844Z best={'block_sizes': [1, 8192], 2026-02-21T08:38:30.2642490Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:38:30.2642849Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:38:30.2647325Z 'num_stages': 5, 2026-02-21T08:38:30.2648883Z 'num_warps': 4, 2026-02-21T08:38:30.2649077Z 'pid_type': 'flat', 2026-02-21T08:38:30.2649262Z 'range_flattens': [None, False], 2026-02-21T08:38:30.2649466Z 'range_multi_buffers': [None, None], 2026-02-21T08:38:30.2649661Z 'range_num_stages': [0, 0], 2026-02-21T08:38:30.2649864Z 'range_unroll_factors': [0, 0], 2026-02-21T08:38:30.2650067Z 'range_warp_specializes': [None, True]} 2026-02-21T08:38:30.2650376Z [154s] Fitting surrogate: 442 points, 442 targets 2026-02-21T08:38:30.4986275Z [154s] Generation 7 starting: 6 
neighbors, 1 active search path(s) 2026-02-21T08:38:31.5266962Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6/6 3.9 configs/s 2026-02-21T08:38:31.8846988Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━━ 6/6 19.5 configs/s 2026-02-21T08:38:32.3648171Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2056.5 2026-02-21T08:38:32.3650049Z configs/s 2026-02-21T08:38:32.4121317Z [156s] Generation 7 complete: 2026-02-21T08:38:32.4122934Z ok=7 2026-02-21T08:38:32.4123095Z min=0.0266 2026-02-21T08:38:32.4123287Z mid=0.0266 2026-02-21T08:38:32.4123421Z max=0.0328 2026-02-21T08:38:32.4128435Z best={'block_sizes': [1, 8192], 2026-02-21T08:38:32.4130120Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:38:32.4130462Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:38:32.4130656Z 'num_stages': 5, 2026-02-21T08:38:32.4130806Z 'num_warps': 4, 2026-02-21T08:38:32.4130951Z 'pid_type': 'flat', 2026-02-21T08:38:32.4131116Z 'range_flattens': [None, False], 2026-02-21T08:38:32.4131294Z 'range_multi_buffers': [None, None], 2026-02-21T08:38:32.4131485Z 'range_num_stages': [0, 0], 2026-02-21T08:38:32.4131738Z 'range_unroll_factors': [0, 0], 2026-02-21T08:38:32.4131919Z 'range_warp_specializes': [None, True]} 2026-02-21T08:38:32.4140810Z [156s] Fitting surrogate: 449 points, 449 targets 2026-02-21T08:38:32.5831782Z [156s] Autotuning complete in 156.3s after searching 431 configs. 
2026-02-21T08:38:32.5832189Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:38:32.5833185Z @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=5, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:38:32.5836879Z 2026-02-21T08:38:32.5837287Z [156s] Code of selected kernel: /tmp/torchinductor_root/ia/cia5lcrlnyyqtlkltxwguhwcq43qd6izyhmsy7zouruem2pnjllu.py 2026-02-21T08:38:33.5729832Z WARNING:tritonbench.utils.triton_op:Completed input ID 51: 2026-02-21T08:38:33.5734322Z (M, N) 2026-02-21T08:38:33.5738812Z ------------ 2026-02-21T08:38:33.5743302Z (4096, 6784) 2026-02-21T08:38:33.5746975Z 2026-02-21T08:38:33.5751298Z 55%|█████▌ | 11/20 [29:31<26:02, 173.61s/it]WARNING:tritonbench.utils.triton_op:Running input ID 56: 2026-02-21T08:38:33.5755326Z (M, N) 2026-02-21T08:38:33.5755565Z ------------ 2026-02-21T08:38:33.5755768Z (4096, 7424) 2026-02-21T08:38:33.5756388Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T08:38:34.7448771Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T08:38:36.2672571Z INFO:tritonbench.utils.triton_op:Took 2.19ms to get benchmark function for torch_compile_softmax 2026-02-21T08:38:39.7742364Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:38:39.7744304Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:38:39.7744585Z 'dtype': 'torch.float16', 2026-02-21T08:38:39.7744794Z 'shape': (4096, 7424), 2026-02-21T08:38:39.7749107Z 'stride': (7424, 1)},), 2026-02-21T08:38:39.7752298Z 'kwargs': {}} 2026-02-21T08:38:39.7762672Z INFO:tritonbench.utils.triton_op:Took 2.17ms to get benchmark function for 
helion_softmax_tritonbench 2026-02-21T08:38:39.9508107Z [0s] Autotune random seed: 2134816249 2026-02-21T08:38:40.0884635Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:39:15.3859240Z [35s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None]) 2026-02-21T08:39:15.8440229Z [35s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False]) 2026-02-21T08:39:16.0989151Z [36s] Timeout after 30s compiling Config(block_sizes=[256, 4096], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=128, num_sm_multiplier=128, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 2], range_unroll_factors=[3, 0], range_warp_specializes=[None, True]) 2026-02-21T08:39:16.1007610Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s 2026-02-21T08:39:19.2154771Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:39:19.2157566Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:39:19.2158038Z %c128_i32 = arith.constant 
128 : i32 2026-02-21T08:39:19.2158237Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:39:19.2158416Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:39:19.2158595Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:39:19.2158822Z %cst = arith.constant dense<7424> : tensor<32x1xi32> 2026-02-21T08:39:19.2159464Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<32xf32> 2026-02-21T08:39:19.2159729Z %cst_1 = arith.constant dense<0xFF800000> : tensor<32xf32> 2026-02-21T08:39:19.2159944Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:39:19.2160135Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:39:19.2160319Z %c7424_i32 = arith.constant 7424 : i32 2026-02-21T08:39:19.2160504Z %c7424_i64 = arith.constant 7424 : i64 2026-02-21T08:39:19.2160682Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:39:19.2161003Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c7424_i32], [%c7424_i64, %c1_i64] : , > 2026-02-21T08:39:19.2161447Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7424_i32], [%c7424_i64, %c1_i64] : , > 2026-02-21T08:39:19.2161934Z %2 = tt.get_program_id x : i32 2026-02-21T08:39:19.2162124Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:39:19.2162456Z %4 = arith.minsi %3, %c128_i32 : i32 2026-02-21T08:39:19.2162679Z scf.for %arg2 = %2 to %4 step %c1_i32 : i32 { 2026-02-21T08:39:19.2162889Z %5 = arith.muli %arg2, %c32_i32 : i32 2026-02-21T08:39:19.2163184Z %6 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:39:19.2163438Z %7 = tt.splat %5 : i32 -> tensor<32xi32> 2026-02-21T08:39:19.2163646Z %8 = arith.addi %7, %6 : tensor<32xi32> 2026-02-21T08:39:19.2163834Z %c7416_i32 = arith.constant 7416 : i32 2026-02-21T08:39:19.2164024Z %c24_i32 = arith.constant 24 : i32 2026-02-21T08:39:19.2164382Z %9:2 = scf.for %arg3 = %c0_i32 to %c7416_i32 step %c24_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<32xf32>, tensor<32xf32>) : i32 { 2026-02-21T08:39:19.2164790Z %49 = tt.make_range {end = 8 : i32, start = 0 : i32} : 
tensor<8xi32> 2026-02-21T08:39:19.2165044Z %50 = tt.splat %arg3 : i32 -> tensor<8xi32> 2026-02-21T08:39:19.2165244Z %51 = arith.addi %50, %49 : tensor<8xi32> 2026-02-21T08:39:19.2165510Z %52 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:39:19.2165766Z %53 = arith.muli %52, %cst : tensor<32x1xi32> 2026-02-21T08:39:19.2166019Z %54 = tt.expand_dims %51 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:39:19.2166300Z %55 = tt.broadcast %53 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:39:19.2166561Z %56 = tt.broadcast %54 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:39:19.2166797Z %57 = arith.addi %55, %56 : tensor<32x8xi32> 2026-02-21T08:39:19.2167030Z %58 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:39:19.2167307Z %59 = tt.addptr %58, %57 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:39:19.2167594Z %60 = tt.load %59 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 2026-02-21T08:39:19.2167878Z %61 = arith.extf %60 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:39:19.2168109Z %62 = "tt.reduce"(%61) <{axis = 1 : i32}> ({ 2026-02-21T08:39:19.2168301Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:39:19.2168497Z %140 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:39:19.2168689Z tt.reduce.return %140 : f32 2026-02-21T08:39:19.2168877Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:39:19.2169094Z %63 = arith.truncf %62 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:39:19.2169334Z %64 = arith.extf %63 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:39:19.2169560Z %65 = arith.cmpf ogt, %arg4, %64 : tensor<32xf32> 2026-02-21T08:39:19.2169788Z %66 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32> 2026-02-21T08:39:19.2170005Z %67 = arith.ori %65, %66 : tensor<32xi1> 2026-02-21T08:39:19.2170231Z %68 = arith.select %67, %arg4, %64 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:39:19.2170474Z %69 = arith.subf %arg4, %68 : tensor<32xf32> 2026-02-21T08:39:19.2170831Z 
%70 = tt.extern_elementwise %69 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:39:19.2171281Z %71 = arith.mulf %arg5, %70 : tensor<32xf32> 2026-02-21T08:39:19.2171570Z %72 = tt.expand_dims %68 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:19.2171865Z %73 = tt.broadcast %72 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:39:19.2172113Z %74 = arith.subf %61, %73 : tensor<32x8xf32> 2026-02-21T08:39:19.2172464Z %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:39:19.2172827Z %76 = "tt.reduce"(%75) <{axis = 1 : i32}> ({ 2026-02-21T08:39:19.2173017Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:39:19.2173208Z %140 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:39:19.2173403Z tt.reduce.return %140 : f32 2026-02-21T08:39:19.2173649Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:39:19.2173855Z %77 = arith.addf %71, %76 : tensor<32xf32> 2026-02-21T08:39:19.2174046Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T08:39:19.2174243Z %78 = arith.muli %c8_i32, %c1_i32_4 : i32 2026-02-21T08:39:19.2174432Z %79 = arith.addi %arg3, %78 : i32 2026-02-21T08:39:19.2174666Z %80 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:39:19.2174909Z %81 = tt.splat %79 : i32 -> tensor<8xi32> 2026-02-21T08:39:19.2175108Z %82 = arith.addi %81, %80 : tensor<8xi32> 2026-02-21T08:39:19.2175387Z %83 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:39:19.2175646Z %84 = arith.muli %83, %cst : tensor<32x1xi32> 2026-02-21T08:39:19.2175896Z %85 = tt.expand_dims %82 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:39:19.2176170Z %86 = tt.broadcast %84 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:39:19.2176431Z %87 = tt.broadcast %85 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:39:19.2176665Z %88 = arith.addi 
%86, %87 : tensor<32x8xi32> 2026-02-21T08:39:19.2176896Z %89 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:39:19.2177167Z %90 = tt.addptr %89, %88 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:39:19.2177455Z %91 = tt.load %90 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 2026-02-21T08:39:19.2177741Z %92 = arith.extf %91 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:39:19.2177963Z %93 = "tt.reduce"(%92) <{axis = 1 : i32}> ({ 2026-02-21T08:39:19.2178157Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:39:19.2178342Z %140 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:39:19.2178530Z tt.reduce.return %140 : f32 2026-02-21T08:39:19.2178719Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:39:19.2178937Z %94 = arith.truncf %93 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:39:19.2179185Z %95 = arith.extf %94 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:39:19.2179411Z %96 = arith.cmpf ogt, %68, %95 : tensor<32xf32> 2026-02-21T08:39:19.2179631Z %97 = arith.cmpf une, %68, %68 : tensor<32xf32> 2026-02-21T08:39:19.2179830Z %98 = arith.ori %96, %97 : tensor<32xi1> 2026-02-21T08:39:19.2180063Z %99 = arith.select %98, %68, %95 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:39:19.2180302Z %100 = arith.subf %68, %99 : tensor<32xf32> 2026-02-21T08:39:19.2180663Z %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:39:19.2181027Z %102 = arith.mulf %77, %101 : tensor<32xf32> 2026-02-21T08:39:19.2181277Z %103 = tt.expand_dims %99 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:19.2181603Z %104 = tt.broadcast %103 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:39:19.2181909Z %105 = arith.subf %92, %104 : tensor<32x8xf32> 2026-02-21T08:39:19.2182263Z %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 
2026-02-21T08:39:19.2182627Z %107 = "tt.reduce"(%106) <{axis = 1 : i32}> ({ 2026-02-21T08:39:19.2182817Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:39:19.2183006Z %140 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:39:19.2183192Z tt.reduce.return %140 : f32 2026-02-21T08:39:19.2183380Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:39:19.2183585Z %108 = arith.addf %102, %107 : tensor<32xf32> 2026-02-21T08:39:19.2183780Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:39:19.2183972Z %109 = arith.muli %c8_i32, %c2_i32 : i32 2026-02-21T08:39:19.2184160Z %110 = arith.addi %arg3, %109 : i32 2026-02-21T08:39:19.2184463Z %111 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:39:19.2184714Z %112 = tt.splat %110 : i32 -> tensor<8xi32> 2026-02-21T08:39:19.2184922Z %113 = arith.addi %112, %111 : tensor<8xi32> 2026-02-21T08:39:19.2185179Z %114 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:39:19.2185446Z %115 = arith.muli %114, %cst : tensor<32x1xi32> 2026-02-21T08:39:19.2185703Z %116 = tt.expand_dims %113 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:39:19.2185990Z %117 = tt.broadcast %115 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:39:19.2186258Z %118 = tt.broadcast %116 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:39:19.2186490Z %119 = arith.addi %117, %118 : tensor<32x8xi32> 2026-02-21T08:39:19.2186734Z %120 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:39:19.2187024Z %121 = tt.addptr %120, %119 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:39:19.2187331Z %122 = tt.load %121 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 2026-02-21T08:39:19.2187638Z %123 = arith.extf %122 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:39:19.2187866Z %124 = "tt.reduce"(%123) <{axis = 1 : i32}> ({ 2026-02-21T08:39:19.2188064Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:39:19.2188244Z %140 = arith.maxnumf %arg6, %arg7 : f32 
2026-02-21T08:39:19.2188440Z tt.reduce.return %140 : f32 2026-02-21T08:39:19.2188628Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:39:19.2188849Z %125 = arith.truncf %124 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:39:19.2189100Z %126 = arith.extf %125 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:39:19.2189333Z %127 = arith.cmpf ogt, %99, %126 : tensor<32xf32> 2026-02-21T08:39:19.2189558Z %128 = arith.cmpf une, %99, %99 : tensor<32xf32> 2026-02-21T08:39:19.2189761Z %129 = arith.ori %127, %128 : tensor<32xi1> 2026-02-21T08:39:19.2190004Z %130 = arith.select %129, %99, %126 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:39:19.2190250Z %131 = arith.subf %99, %130 : tensor<32xf32> 2026-02-21T08:39:19.2190603Z %132 = tt.extern_elementwise %131 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:39:19.2190970Z %133 = arith.mulf %108, %132 : tensor<32xf32> 2026-02-21T08:39:19.2191224Z %134 = tt.expand_dims %130 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:19.2191521Z %135 = tt.broadcast %134 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:39:19.2191792Z %136 = arith.subf %123, %135 : tensor<32x8xf32> 2026-02-21T08:39:19.2192171Z %137 = tt.extern_elementwise %136 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:39:19.2192561Z %138 = "tt.reduce"(%137) <{axis = 1 : i32}> ({ 2026-02-21T08:39:19.2192762Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:39:19.2193025Z %140 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:39:19.2193220Z tt.reduce.return %140 : f32 2026-02-21T08:39:19.2193422Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:39:19.2193630Z %139 = arith.addf %133, %138 : tensor<32xf32> 2026-02-21T08:39:19.2193864Z scf.yield %130, %139 : tensor<32xf32>, tensor<32xf32> 2026-02-21T08:39:19.2194128Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:39:19.2194400Z %10 
= tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:39:19.2194661Z %11 = tt.splat %c7416_i32 : i32 -> tensor<8xi32> 2026-02-21T08:39:19.2194871Z %12 = arith.addi %11, %10 : tensor<8xi32> 2026-02-21T08:39:19.2195133Z %13 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:39:19.2195406Z %14 = arith.muli %13, %cst : tensor<32x1xi32> 2026-02-21T08:39:19.2195765Z %15 = tt.expand_dims %12 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:39:19.2196067Z %16 = tt.broadcast %14 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:39:19.2196329Z %17 = tt.broadcast %15 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:39:19.2196570Z %18 = arith.addi %16, %17 : tensor<32x8xi32> 2026-02-21T08:39:19.2196815Z %19 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:39:19.2197099Z %20 = tt.addptr %19, %18 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:39:19.2197407Z %21 = tt.load %20 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 2026-02-21T08:39:19.2197699Z %22 = arith.extf %21 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:39:19.2197965Z %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({ 2026-02-21T08:39:19.2198168Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:39:19.2198355Z %49 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:39:19.2198566Z tt.reduce.return %49 : f32 2026-02-21T08:39:19.2198759Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:39:19.2198994Z %24 = arith.truncf %23 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:39:19.2199244Z %25 = arith.extf %24 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:39:19.2199485Z %26 = arith.cmpf ogt, %9#0, %25 : tensor<32xf32> 2026-02-21T08:39:19.2199714Z %27 = arith.cmpf une, %9#0, %9#0 : tensor<32xf32> 2026-02-21T08:39:19.2199943Z %28 = arith.ori %26, %27 : tensor<32xi1> 2026-02-21T08:39:19.2200195Z %29 = arith.select %28, %9#0, %25 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:39:19.2200427Z %30 = arith.subf %9#0, 
%29 : tensor<32xf32> 2026-02-21T08:39:19.2200789Z %31 = tt.extern_elementwise %30 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:39:19.2201144Z %32 = arith.mulf %9#1, %31 : tensor<32xf32> 2026-02-21T08:39:19.2201401Z %33 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:19.2201718Z %34 = tt.broadcast %33 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:39:19.2201944Z %35 = arith.subf %22, %34 : tensor<32x8xf32> 2026-02-21T08:39:19.2202301Z %36 = tt.extern_elementwise %35 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:39:19.2202654Z %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({ 2026-02-21T08:39:19.2202848Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:39:19.2203023Z %49 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:39:19.2203217Z tt.reduce.return %49 : f32 2026-02-21T08:39:19.2203407Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:39:19.2203600Z %38 = arith.addf %32, %37 : tensor<32xf32> 2026-02-21T08:39:19.2203803Z %c7416_i32_2 = arith.constant 7416 : i32 2026-02-21T08:39:19.2203992Z %c24_i32_3 = arith.constant 24 : i32 2026-02-21T08:39:19.2204227Z scf.for %arg3 = %c0_i32 to %c7416_i32_2 step %c24_i32_3 : i32 { 2026-02-21T08:39:19.2204615Z %49 = tt.descriptor_load %0[%5, %arg3] : !tt.tensordesc> -> tensor<32x8xf16> 2026-02-21T08:39:19.2204966Z %50 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:19.2205259Z %51 = arith.extf %49 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:39:19.2205513Z %52 = tt.broadcast %50 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:39:19.2205751Z %53 = arith.subf %51, %52 : tensor<32x8xf32> 2026-02-21T08:39:19.2206114Z %54 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:39:19.2206524Z %55 = 
tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:19.2206808Z %56 = tt.broadcast %55 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:39:19.2207093Z %57 = arith.divf %54, %56 : tensor<32x8xf32> 2026-02-21T08:39:19.2207328Z %58 = arith.truncf %57 : tensor<32x8xf32> to tensor<32x8xf16> 2026-02-21T08:39:19.2207633Z tt.descriptor_store %1[%5, %arg3], %58 : !tt.tensordesc>, tensor<32x8xf16> 2026-02-21T08:39:19.2207919Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T08:39:19.2208109Z %59 = arith.muli %c8_i32, %c1_i32_4 : i32 2026-02-21T08:39:19.2208300Z %60 = arith.addi %arg3, %59 : i32 2026-02-21T08:39:19.2208560Z %61 = tt.descriptor_load %0[%5, %60] : !tt.tensordesc> -> tensor<32x8xf16> 2026-02-21T08:39:19.2208890Z %62 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:19.2209174Z %63 = arith.extf %61 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:39:19.2209418Z %64 = tt.broadcast %62 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:39:19.2209649Z %65 = arith.subf %63, %64 : tensor<32x8xf32> 2026-02-21T08:39:19.2210006Z %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:39:19.2210419Z %67 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:19.2210709Z %68 = tt.broadcast %67 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:39:19.2210931Z %69 = arith.divf %66, %68 : tensor<32x8xf32> 2026-02-21T08:39:19.2211160Z %70 = arith.truncf %69 : tensor<32x8xf32> to tensor<32x8xf16> 2026-02-21T08:39:19.2211456Z tt.descriptor_store %1[%5, %60], %70 : !tt.tensordesc>, tensor<32x8xf16> 2026-02-21T08:39:19.2211770Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:39:19.2211957Z %71 = arith.muli %c8_i32, %c2_i32 : i32 2026-02-21T08:39:19.2212149Z %72 = arith.addi %arg3, %71 : i32 2026-02-21T08:39:19.2212416Z %73 = tt.descriptor_load %0[%5, %72] : 
!tt.tensordesc> -> tensor<32x8xf16> 2026-02-21T08:39:19.2212744Z %74 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:19.2213027Z %75 = arith.extf %73 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:39:19.2213272Z %76 = tt.broadcast %74 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:39:19.2213505Z %77 = arith.subf %75, %76 : tensor<32x8xf32> 2026-02-21T08:39:19.2213863Z %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:39:19.2214263Z %79 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:19.2214550Z %80 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:39:19.2214777Z %81 = arith.divf %78, %80 : tensor<32x8xf32> 2026-02-21T08:39:19.2215016Z %82 = arith.truncf %81 : tensor<32x8xf32> to tensor<32x8xf16> 2026-02-21T08:39:19.2215315Z tt.descriptor_store %1[%5, %72], %82 : !tt.tensordesc>, tensor<32x8xf16> 2026-02-21T08:39:19.2215682Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:39:19.2216006Z %39 = tt.descriptor_load %0[%5, %c7416_i32_2] : !tt.tensordesc> -> tensor<32x8xf16> 2026-02-21T08:39:19.2216346Z %40 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:19.2216633Z %41 = arith.extf %39 : tensor<32x8xf16> to tensor<32x8xf32> 2026-02-21T08:39:19.2216875Z %42 = tt.broadcast %40 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:39:19.2217105Z %43 = arith.subf %41, %42 : tensor<32x8xf32> 2026-02-21T08:39:19.2217461Z %44 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:39:19.2217861Z %45 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:19.2218201Z %46 = tt.broadcast %45 : tensor<32x1xf32> -> tensor<32x8xf32> 2026-02-21T08:39:19.2218426Z %47 = 
arith.divf %44, %46 : tensor<32x8xf32> 2026-02-21T08:39:19.2218652Z %48 = arith.truncf %47 : tensor<32x8xf32> to tensor<32x8xf16> 2026-02-21T08:39:19.2218963Z tt.descriptor_store %1[%5, %c7416_i32_2], %48 : !tt.tensordesc>, tensor<32x8xf16> 2026-02-21T08:39:19.2219271Z } {tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:39:19.2219468Z tt.return 2026-02-21T08:39:19.2219595Z } 2026-02-21T08:39:19.2219723Z } 2026-02-21T08:39:19.2219791Z 2026-02-21T08:39:19.2219840Z {-# 2026-02-21T08:39:19.2219974Z external_resources: { 2026-02-21T08:39:19.2220130Z mlir_reproducer: { 2026-02-21T08:39:19.2224513Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, 
tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:39:19.2228939Z disable_threading: false, 2026-02-21T08:39:19.2229102Z verify_each: true 2026-02-21T08:39:19.2229250Z } 2026-02-21T08:39:19.2229370Z } 2026-02-21T08:39:19.2229480Z #-} 2026-02-21T08:39:19.2229900Z /tmp/torchinductor_root/2w/c2wcir4gm3rdqiowhzt2k5g2mukrxiva6udznsaoahozanh2bh7g.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:39:19.2231083Z /tmp/torchinductor_root/2w/c2wcir4gm3rdqiowhzt2k5g2mukrxiva6udznsaoahozanh2bh7g.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:39:19.2232134Z [39s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:39:19.2233222Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 8], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[2, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:39:19.2234201Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:39:19.2234460Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:39:20.7489287Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:39:20.7493852Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:39:20.7498883Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:39:20.7500930Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:39:20.7501189Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:39:20.7501423Z %cst = arith.constant dense<7424> : tensor<32x1xi32> 2026-02-21T08:39:20.7509285Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<32xf32> 2026-02-21T08:39:20.7509547Z %cst_1 = arith.constant dense<0xFF800000> : tensor<32xf32> 2026-02-21T08:39:20.7509775Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:39:20.7509970Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:39:20.7510151Z %c7424_i32 = arith.constant 7424 : i32 2026-02-21T08:39:20.7510336Z %c7424_i64 = arith.constant 7424 : i64 2026-02-21T08:39:20.7510528Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:39:20.7510859Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c7424_i32], [%c7424_i64, %c1_i64] : , > 2026-02-21T08:39:20.7511178Z %1 = tt.get_program_id x : i32 2026-02-21T08:39:20.7511395Z scf.for %arg2 = %1 to %c128_i32 step %c9472_i32 : i32 { 2026-02-21T08:39:20.7511684Z %2 = arith.muli %arg2, %c32_i32 : i32 2026-02-21T08:39:20.7511931Z %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:39:20.7512189Z %4 = tt.splat %2 : i32 -> tensor<32xi32> 2026-02-21T08:39:20.7512382Z %5 = arith.addi %4, %3 : tensor<32xi32> 2026-02-21T08:39:20.7512583Z %c7392_i32 = arith.constant 7392 : i32 2026-02-21T08:39:20.7512770Z %c96_i32 = arith.constant 96 : i32 2026-02-21T08:39:20.7513150Z %6:2 = scf.for %arg3 = %c0_i32 to %c7392_i32 step %c96_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<32xf32>, tensor<32xf32>) : i32 { 2026-02-21T08:39:20.7513616Z %47 = tt.descriptor_load %0[%2, %arg3] : 
!tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:39:20.7513952Z %48 = arith.extf %47 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:39:20.7514195Z %49 = "tt.reduce"(%48) <{axis = 1 : i32}> ({ 2026-02-21T08:39:20.7514389Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:39:20.7514580Z %105 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:39:20.7514771Z tt.reduce.return %105 : f32 2026-02-21T08:39:20.7514962Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:39:20.7515185Z %50 = arith.truncf %49 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:39:20.7515433Z %51 = arith.extf %50 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:39:20.7515673Z %52 = arith.cmpf ogt, %arg4, %51 : tensor<32xf32> 2026-02-21T08:39:20.7515900Z %53 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32> 2026-02-21T08:39:20.7516164Z %54 = arith.ori %52, %53 : tensor<32xi1> 2026-02-21T08:39:20.7516711Z %55 = arith.select %54, %arg4, %51 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:39:20.7516951Z %56 = arith.subf %arg4, %55 : tensor<32xf32> 2026-02-21T08:39:20.7517318Z %57 = tt.extern_elementwise %56 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:39:20.7517678Z %58 = arith.mulf %arg5, %57 : tensor<32xf32> 2026-02-21T08:39:20.7517930Z %59 = tt.expand_dims %55 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:20.7518226Z %60 = tt.broadcast %59 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:39:20.7518459Z %61 = arith.subf %48, %60 : tensor<32x32xf32> 2026-02-21T08:39:20.7518823Z %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:39:20.7519187Z %63 = "tt.reduce"(%62) <{axis = 1 : i32}> ({ 2026-02-21T08:39:20.7519447Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:39:20.7519642Z %105 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:39:20.7519830Z tt.reduce.return %105 : f32 2026-02-21T08:39:20.7520026Z }) : 
(tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:39:20.7520226Z %64 = arith.addf %58, %63 : tensor<32xf32> 2026-02-21T08:39:20.7520431Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:39:20.7520626Z %65 = arith.muli %c32_i32, %c1_i32 : i32 2026-02-21T08:39:20.7520833Z %66 = arith.addi %arg3, %65 : i32 2026-02-21T08:39:20.7521121Z %67 = tt.descriptor_load %0[%2, %66] : !tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:39:20.7521440Z %68 = arith.extf %67 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:39:20.7521715Z %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({ 2026-02-21T08:39:20.7521898Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:39:20.7522087Z %105 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:39:20.7522279Z tt.reduce.return %105 : f32 2026-02-21T08:39:20.7522467Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:39:20.7522693Z %70 = arith.truncf %69 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:39:20.7522930Z %71 = arith.extf %70 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:39:20.7523162Z %72 = arith.cmpf ogt, %55, %71 : tensor<32xf32> 2026-02-21T08:39:20.7523373Z %73 = arith.cmpf une, %55, %55 : tensor<32xf32> 2026-02-21T08:39:20.7523581Z %74 = arith.ori %72, %73 : tensor<32xi1> 2026-02-21T08:39:20.7523804Z %75 = arith.select %74, %55, %71 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:39:20.7524041Z %76 = arith.subf %55, %75 : tensor<32xf32> 2026-02-21T08:39:20.7524391Z %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:39:20.7524738Z %78 = arith.mulf %64, %77 : tensor<32xf32> 2026-02-21T08:39:20.7524998Z %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:20.7525282Z %80 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:39:20.7525519Z %81 = arith.subf %68, %80 : tensor<32x32xf32> 2026-02-21T08:39:20.7525883Z %82 = tt.extern_elementwise %81 {libname = "", libpath = "", pure = 
true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:39:20.7526238Z %83 = "tt.reduce"(%82) <{axis = 1 : i32}> ({ 2026-02-21T08:39:20.7526433Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:39:20.7526610Z %105 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:39:20.7526797Z tt.reduce.return %105 : f32 2026-02-21T08:39:20.7526976Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:39:20.7527172Z %84 = arith.addf %78, %83 : tensor<32xf32> 2026-02-21T08:39:20.7527356Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:39:20.7527551Z %85 = arith.muli %c32_i32, %c2_i32 : i32 2026-02-21T08:39:20.7527809Z %86 = arith.addi %arg3, %85 : i32 2026-02-21T08:39:20.7528073Z %87 = tt.descriptor_load %0[%2, %86] : !tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:39:20.7528385Z %88 = arith.extf %87 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:39:20.7528608Z %89 = "tt.reduce"(%88) <{axis = 1 : i32}> ({ 2026-02-21T08:39:20.7528795Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:39:20.7528973Z %105 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:39:20.7529167Z tt.reduce.return %105 : f32 2026-02-21T08:39:20.7529351Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:39:20.7529563Z %90 = arith.truncf %89 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:39:20.7529804Z %91 = arith.extf %90 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:39:20.7530027Z %92 = arith.cmpf ogt, %75, %91 : tensor<32xf32> 2026-02-21T08:39:20.7530288Z %93 = arith.cmpf une, %75, %75 : tensor<32xf32> 2026-02-21T08:39:20.7530493Z %94 = arith.ori %92, %93 : tensor<32xi1> 2026-02-21T08:39:20.7530727Z %95 = arith.select %94, %75, %91 : tensor<32xi1>, tensor<32xf32> 2026-02-21T08:39:20.7530962Z %96 = arith.subf %75, %95 : tensor<32xf32> 2026-02-21T08:39:20.7531306Z %97 = tt.extern_elementwise %96 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:39:20.7531699Z %98 = arith.mulf %84, %97 : tensor<32xf32> 
2026-02-21T08:39:20.7531950Z %99 = tt.expand_dims %95 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:20.7532252Z %100 = tt.broadcast %99 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:39:20.7532494Z %101 = arith.subf %88, %100 : tensor<32x32xf32> 2026-02-21T08:39:20.7532873Z %102 = tt.extern_elementwise %101 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:39:20.7533255Z %103 = "tt.reduce"(%102) <{axis = 1 : i32}> ({ 2026-02-21T08:39:20.7533447Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:39:20.7533631Z %105 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:39:20.7533816Z tt.reduce.return %105 : f32 2026-02-21T08:39:20.7534003Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:39:20.7534199Z %104 = arith.addf %98, %103 : tensor<32xf32> 2026-02-21T08:39:20.7534421Z scf.yield %95, %104 : tensor<32xf32>, tensor<32xf32> 2026-02-21T08:39:20.7534639Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:39:20.7534929Z %7 = tt.descriptor_load %0[%2, %c7392_i32] : !tt.tensordesc> -> tensor<32x32xf16> 2026-02-21T08:39:20.7535255Z %8 = arith.extf %7 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:39:20.7535475Z %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({ 2026-02-21T08:39:20.7535668Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:39:20.7535849Z %47 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:39:20.7536045Z tt.reduce.return %47 : f32 2026-02-21T08:39:20.7536234Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:39:20.7536449Z %10 = arith.truncf %9 : tensor<32xf32> to tensor<32xf16> 2026-02-21T08:39:20.7536691Z %11 = arith.extf %10 : tensor<32xf16> to tensor<32xf32> 2026-02-21T08:39:20.7536913Z %12 = arith.cmpf ogt, %6#0, %11 : tensor<32xf32> 2026-02-21T08:39:20.7537130Z %13 = arith.cmpf une, %6#0, %6#0 : tensor<32xf32> 2026-02-21T08:39:20.7537329Z %14 = arith.ori %12, %13 : tensor<32xi1> 2026-02-21T08:39:20.7537558Z %15 = arith.select %14, %6#0, %11 : 
tensor<32xi1>, tensor<32xf32> 2026-02-21T08:39:20.7537794Z %16 = arith.subf %6#0, %15 : tensor<32xf32> 2026-02-21T08:39:20.7538140Z %17 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32> 2026-02-21T08:39:20.7538496Z %18 = arith.mulf %6#1, %17 : tensor<32xf32> 2026-02-21T08:39:20.7538804Z %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:20.7539090Z %20 = tt.broadcast %19 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:39:20.7539325Z %21 = arith.subf %8, %20 : tensor<32x32xf32> 2026-02-21T08:39:20.7539702Z %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:39:20.7540079Z %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({ 2026-02-21T08:39:20.7540272Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:39:20.7540463Z %47 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:39:20.7540653Z tt.reduce.return %47 : f32 2026-02-21T08:39:20.7540849Z }) : (tensor<32x32xf32>) -> tensor<32xf32> 2026-02-21T08:39:20.7541048Z %24 = arith.addf %18, %23 : tensor<32xf32> 2026-02-21T08:39:20.7541257Z %c7392_i32_2 = arith.constant 7392 : i32 2026-02-21T08:39:20.7541520Z %c96_i32_3 = arith.constant 96 : i32 2026-02-21T08:39:20.7541795Z scf.for %arg3 = %c0_i32 to %c7392_i32_2 step %c96_i32_3 : i32 { 2026-02-21T08:39:20.7542057Z %47 = tt.splat %arg3 : i32 -> tensor<32xi32> 2026-02-21T08:39:20.7542267Z %48 = arith.addi %47, %3 : tensor<32xi32> 2026-02-21T08:39:20.7542530Z %49 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:39:20.7542805Z %50 = arith.muli %49, %cst : tensor<32x1xi32> 2026-02-21T08:39:20.7543074Z %51 = tt.expand_dims %48 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:39:20.7543377Z %52 = tt.broadcast %50 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:39:20.7543643Z %53 = tt.broadcast %51 : 
tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:39:20.7543890Z %54 = arith.addi %52, %53 : tensor<32x32xi32> 2026-02-21T08:39:20.7544137Z %55 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:39:20.7544440Z %56 = tt.addptr %55, %54 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:39:20.7544756Z %57 = tt.load %56 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T08:39:20.7545086Z %58 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:20.7545388Z %59 = arith.extf %57 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:39:20.7545650Z %60 = tt.broadcast %58 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:39:20.7545895Z %61 = arith.subf %59, %60 : tensor<32x32xf32> 2026-02-21T08:39:20.7546272Z %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:39:20.7546704Z %63 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:20.7547002Z %64 = tt.broadcast %63 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:39:20.7547247Z %65 = arith.divf %62, %64 : tensor<32x32xf32> 2026-02-21T08:39:20.7547479Z %66 = arith.truncf %65 : tensor<32x32xf32> to tensor<32x32xf16> 2026-02-21T08:39:20.7547739Z %67 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:39:20.7548016Z %68 = tt.addptr %67, %54 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:39:20.7548267Z tt.store %68, %66 : tensor<32x32x!tt.ptr> 2026-02-21T08:39:20.7548474Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:39:20.7548665Z %69 = arith.muli %c32_i32, %c1_i32 : i32 2026-02-21T08:39:20.7548851Z %70 = arith.addi %arg3, %69 : i32 2026-02-21T08:39:20.7549046Z %71 = tt.splat %70 : i32 -> tensor<32xi32> 2026-02-21T08:39:20.7549240Z %72 = arith.addi %71, %3 : tensor<32xi32> 2026-02-21T08:39:20.7549485Z %73 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 
2026-02-21T08:39:20.7549744Z %74 = arith.muli %73, %cst : tensor<32x1xi32> 2026-02-21T08:39:20.7550045Z %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:39:20.7550328Z %76 = tt.broadcast %74 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:39:20.7550577Z %77 = tt.broadcast %75 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:39:20.7550811Z %78 = arith.addi %76, %77 : tensor<32x32xi32> 2026-02-21T08:39:20.7551038Z %79 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:39:20.7551310Z %80 = tt.addptr %79, %78 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:39:20.7551628Z %81 = tt.load %80 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T08:39:20.7551935Z %82 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:20.7552218Z %83 = arith.extf %81 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:39:20.7552537Z %84 = tt.broadcast %82 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:39:20.7552771Z %85 = arith.subf %83, %84 : tensor<32x32xf32> 2026-02-21T08:39:20.7553123Z %86 = tt.extern_elementwise %85 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:39:20.7553535Z %87 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:20.7553819Z %88 = tt.broadcast %87 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:39:20.7554044Z %89 = arith.divf %86, %88 : tensor<32x32xf32> 2026-02-21T08:39:20.7554279Z %90 = arith.truncf %89 : tensor<32x32xf32> to tensor<32x32xf16> 2026-02-21T08:39:20.7554535Z %91 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:39:20.7554807Z %92 = tt.addptr %91, %78 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:39:20.7555049Z tt.store %92, %90 : tensor<32x32x!tt.ptr> 2026-02-21T08:39:20.7555259Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:39:20.7555456Z %93 = arith.muli %c32_i32, 
%c2_i32 : i32 2026-02-21T08:39:20.7555637Z %94 = arith.addi %arg3, %93 : i32 2026-02-21T08:39:20.7555828Z %95 = tt.splat %94 : i32 -> tensor<32xi32> 2026-02-21T08:39:20.7556021Z %96 = arith.addi %95, %3 : tensor<32xi32> 2026-02-21T08:39:20.7556264Z %97 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:39:20.7556514Z %98 = arith.muli %97, %cst : tensor<32x1xi32> 2026-02-21T08:39:20.7556762Z %99 = tt.expand_dims %96 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:39:20.7557050Z %100 = tt.broadcast %98 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:39:20.7557309Z %101 = tt.broadcast %99 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:39:20.7557560Z %102 = arith.addi %100, %101 : tensor<32x32xi32> 2026-02-21T08:39:20.7557816Z %103 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:39:20.7558110Z %104 = tt.addptr %103, %102 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:39:20.7558424Z %105 = tt.load %104 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T08:39:20.7558734Z %106 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:20.7559029Z %107 = arith.extf %105 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:39:20.7559284Z %108 = tt.broadcast %106 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:39:20.7559530Z %109 = arith.subf %107, %108 : tensor<32x32xf32> 2026-02-21T08:39:20.7559898Z %110 = tt.extern_elementwise %109 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:39:20.7560319Z %111 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:20.7560613Z %112 = tt.broadcast %111 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:39:20.7560903Z %113 = arith.divf %110, %112 : tensor<32x32xf32> 2026-02-21T08:39:20.7561150Z %114 = arith.truncf %113 : tensor<32x32xf32> to tensor<32x32xf16> 
2026-02-21T08:39:20.7561419Z %115 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:39:20.7561738Z %116 = tt.addptr %115, %102 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:39:20.7562003Z tt.store %116, %114 : tensor<32x32x!tt.ptr> 2026-02-21T08:39:20.7562213Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:39:20.7562431Z %25 = tt.splat %c7392_i32_2 : i32 -> tensor<32xi32> 2026-02-21T08:39:20.7562637Z %26 = arith.addi %25, %3 : tensor<32xi32> 2026-02-21T08:39:20.7562885Z %27 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:39:20.7563138Z %28 = arith.muli %27, %cst : tensor<32x1xi32> 2026-02-21T08:39:20.7563440Z %29 = tt.expand_dims %26 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:39:20.7563732Z %30 = tt.broadcast %28 : tensor<32x1xi32> -> tensor<32x32xi32> 2026-02-21T08:39:20.7563982Z %31 = tt.broadcast %29 : tensor<1x32xi32> -> tensor<32x32xi32> 2026-02-21T08:39:20.7564211Z %32 = arith.addi %30, %31 : tensor<32x32xi32> 2026-02-21T08:39:20.7564438Z %33 = tt.splat %arg0 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:39:20.7564714Z %34 = tt.addptr %33, %32 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:39:20.7565003Z %35 = tt.load %34 evictionPolicy = evict_first : tensor<32x32x!tt.ptr> 2026-02-21T08:39:20.7565311Z %36 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:20.7565595Z %37 = arith.extf %35 : tensor<32x32xf16> to tensor<32x32xf32> 2026-02-21T08:39:20.7565842Z %38 = tt.broadcast %36 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:39:20.7566074Z %39 = arith.subf %37, %38 : tensor<32x32xf32> 2026-02-21T08:39:20.7566429Z %40 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32> 2026-02-21T08:39:20.7566839Z %41 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32> 2026-02-21T08:39:20.7567117Z %42 = 
tt.broadcast %41 : tensor<32x1xf32> -> tensor<32x32xf32> 2026-02-21T08:39:20.7567337Z %43 = arith.divf %40, %42 : tensor<32x32xf32> 2026-02-21T08:39:20.7567565Z %44 = arith.truncf %43 : tensor<32x32xf32> to tensor<32x32xf16> 2026-02-21T08:39:20.7567820Z %45 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr> 2026-02-21T08:39:20.7568090Z %46 = tt.addptr %45, %32 : tensor<32x32x!tt.ptr>, tensor<32x32xi32> 2026-02-21T08:39:20.7568338Z tt.store %46, %44 : tensor<32x32x!tt.ptr> 2026-02-21T08:39:20.7568611Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:39:20.7568865Z tt.return 2026-02-21T08:39:20.7568992Z } 2026-02-21T08:39:20.7569115Z } 2026-02-21T08:39:20.7569185Z 2026-02-21T08:39:20.7569235Z {-# 2026-02-21T08:39:20.7569372Z external_resources: { 2026-02-21T08:39:20.7569527Z mlir_reproducer: { 2026-02-21T08:39:20.7573851Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 
region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:39:20.7578378Z disable_threading: false, 2026-02-21T08:39:20.7578562Z verify_each: true 2026-02-21T08:39:20.7578708Z } 2026-02-21T08:39:20.7578842Z } 2026-02-21T08:39:20.7578963Z #-} 2026-02-21T08:39:20.7579414Z /tmp/torchinductor_root/3o/c3ouqpfz652r7o4j7juljqku7fknds53vpazrynxeu6dtcsako3r.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:39:20.7580644Z /tmp/torchinductor_root/3o/c3ouqpfz652r7o4j7juljqku7fknds53vpazrynxeu6dtcsako3r.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:39:20.7581700Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:39:20.7582795Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=128, num_sm_multiplier=64, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[3, 3], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:39:20.7583800Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:39:20.7584066Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:39:23.2835722Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.0 configs/s 2026-02-21T08:39:23.2848638Z [43s] Adaptive compile timeout: 30s (90% percentile=7.7s, bounds=[30.0s, 30s]) 2026-02-21T08:39:23.9814689Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1409.8 configs/s 2026-02-21T08:39:24.0386171Z [43s] Initial random population of 100, 5 starting points: 2026-02-21T08:39:24.0390036Z error=8 2026-02-21T08:39:24.0395694Z timeout=3 2026-02-21T08:39:24.0400068Z ok=89 2026-02-21T08:39:24.0404611Z min=0.0532 2026-02-21T08:39:24.0406761Z mid=0.7076 2026-02-21T08:39:24.0406926Z max=49.5391 2026-02-21T08:39:24.0407089Z best={'block_sizes': [2, 1024], 2026-02-21T08:39:24.0407368Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:39:24.0407643Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:39:24.0407839Z 'num_sm_multiplier': 64, 2026-02-21T08:39:24.0407996Z 'num_stages': 5, 2026-02-21T08:39:24.0408138Z 'num_warps': 1, 2026-02-21T08:39:24.0408290Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:39:24.0408489Z 'range_flattens': [True, True], 2026-02-21T08:39:24.0408659Z 'range_multi_buffers': [False, None], 2026-02-21T08:39:24.0408845Z 'range_num_stages': [3, 1], 2026-02-21T08:39:24.0409016Z 'range_unroll_factors': [0, 2], 
2026-02-21T08:39:24.0409191Z 'range_warp_specializes': [True, None]} 2026-02-21T08:39:24.0409409Z [43s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:39:25.1250847Z [45s] Generation 1 starting: 79 neighbors, 5 active search path(s) 2026-02-21T08:39:34.1079827Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 5.4 configs/s 2026-02-21T08:39:38.9799634Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 82/82 17.0 configs/s 2026-02-21T08:39:43.3214923Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 232.8 2026-02-21T08:39:43.3215456Z configs/s 2026-02-21T08:39:43.5702801Z [63s] Generation 1 complete: 2026-02-21T08:39:43.5703128Z ok=85 2026-02-21T08:39:43.5703322Z min=0.0389 2026-02-21T08:39:43.5703582Z mid=0.0553 2026-02-21T08:39:43.5703765Z max=2.1699 2026-02-21T08:39:43.5703959Z best={'block_sizes': [1, 8192], 2026-02-21T08:39:43.5704338Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:39:43.5706666Z 'load_eviction_policies': ['', 'first'], 2026-02-21T08:39:43.5706923Z 'num_stages': 1, 2026-02-21T08:39:43.5707442Z 'num_warps': 4, 2026-02-21T08:39:43.5707626Z 'pid_type': 'flat', 2026-02-21T08:39:43.5707794Z 'range_flattens': [None, False], 2026-02-21T08:39:43.5707976Z 'range_multi_buffers': [None, None], 2026-02-21T08:39:43.5708164Z 'range_num_stages': [0, 4], 2026-02-21T08:39:43.5708330Z 'range_unroll_factors': [0, 1], 2026-02-21T08:39:43.5708517Z 'range_warp_specializes': [None, False]} 2026-02-21T08:39:43.5715378Z [63s] Fitting surrogate: 185 points, 185 targets 2026-02-21T08:39:44.5422992Z [64s] Generation 2 starting: 67 neighbors, 5 active search path(s) 2026-02-21T08:39:55.2078246Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 1.7 configs/s 2026-02-21T08:39:59.3226729Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 17.2 configs/s 2026-02-21T08:40:02.1747084Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 429.1 
2026-02-21T08:40:02.1750945Z configs/s 2026-02-21T08:40:02.3254177Z [82s] Generation 2 complete: 2026-02-21T08:40:02.3256162Z error=1 2026-02-21T08:40:02.3256343Z ok=72 2026-02-21T08:40:02.3256509Z min=0.0307 2026-02-21T08:40:02.3256668Z mid=0.0471 2026-02-21T08:40:02.3256809Z max=0.2642 2026-02-21T08:40:02.3256991Z best={'block_sizes': [1, 8192], 2026-02-21T08:40:02.3257268Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:40:02.3257585Z 'load_eviction_policies': ['', ''], 2026-02-21T08:40:02.3257806Z 'num_stages': 1, 2026-02-21T08:40:02.3257981Z 'num_warps': 4, 2026-02-21T08:40:02.3258144Z 'pid_type': 'flat', 2026-02-21T08:40:02.3258319Z 'range_flattens': [None, False], 2026-02-21T08:40:02.3258522Z 'range_multi_buffers': [None, None], 2026-02-21T08:40:02.3258724Z 'range_num_stages': [0, 4], 2026-02-21T08:40:02.3258907Z 'range_unroll_factors': [0, 1], 2026-02-21T08:40:02.3259092Z 'range_warp_specializes': [None, False]} 2026-02-21T08:40:02.3269788Z [82s] Fitting surrogate: 258 points, 258 targets 2026-02-21T08:40:03.3199280Z [83s] Generation 3 starting: 68 neighbors, 5 active search path(s) 2026-02-21T08:40:12.0047170Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 2.1 configs/s 2026-02-21T08:40:16.1532070Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 16.8 configs/s 2026-02-21T08:40:21.1595885Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 202.5 2026-02-21T08:40:21.1599824Z configs/s 2026-02-21T08:40:21.4876988Z [101s] Generation 3 complete: 2026-02-21T08:40:21.4882130Z ok=73 2026-02-21T08:40:21.4884154Z min=0.0307 2026-02-21T08:40:21.4884361Z mid=0.0430 2026-02-21T08:40:21.4889280Z max=1.1116 2026-02-21T08:40:21.4893840Z best={'block_sizes': [1, 8192], 2026-02-21T08:40:21.4898598Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:40:21.4902961Z 'load_eviction_policies': ['', ''], 2026-02-21T08:40:21.4904632Z 'num_stages': 1, 
2026-02-21T08:40:21.4904905Z 'num_warps': 4, 2026-02-21T08:40:21.4909599Z 'pid_type': 'flat', 2026-02-21T08:40:21.4913983Z 'range_flattens': [None, False], 2026-02-21T08:40:21.4917975Z 'range_multi_buffers': [None, None], 2026-02-21T08:40:21.4923027Z 'range_num_stages': [0, 4], 2026-02-21T08:40:21.4924662Z 'range_unroll_factors': [0, 1], 2026-02-21T08:40:21.4924912Z 'range_warp_specializes': [None, False]} 2026-02-21T08:40:21.4925138Z [101s] Fitting surrogate: 331 points, 331 targets 2026-02-21T08:40:22.3171875Z [102s] Generation 4 starting: 61 neighbors, 5 active search path(s) 2026-02-21T08:40:28.2963549Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 9.1 configs/s 2026-02-21T08:40:32.0635518Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 63/63 16.9 configs/s 2026-02-21T08:40:36.0445698Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 254.7 2026-02-21T08:40:36.0449526Z configs/s 2026-02-21T08:40:36.3158270Z [116s] Generation 4 complete: 2026-02-21T08:40:36.3159824Z ok=67 2026-02-21T08:40:36.3160039Z min=0.0306 2026-02-21T08:40:36.3160201Z mid=0.0389 2026-02-21T08:40:36.3160372Z max=0.1238 2026-02-21T08:40:36.3160552Z best={'block_sizes': [1, 8192], 2026-02-21T08:40:36.3160807Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:40:36.3161069Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:40:36.3161285Z 'num_stages': 6, 2026-02-21T08:40:36.3161447Z 'num_warps': 1, 2026-02-21T08:40:36.3161808Z 'pid_type': 'flat', 2026-02-21T08:40:36.3162000Z 'range_flattens': [None, False], 2026-02-21T08:40:36.3162185Z 'range_multi_buffers': [None, None], 2026-02-21T08:40:36.3162377Z 'range_num_stages': [0, 2], 2026-02-21T08:40:36.3162548Z 'range_unroll_factors': [0, 0], 2026-02-21T08:40:36.3162736Z 'range_warp_specializes': [None, True]} 2026-02-21T08:40:36.3174792Z [116s] Fitting surrogate: 398 points, 398 targets 2026-02-21T08:40:37.0302909Z [116s] Generation 5 starting: 50 neighbors, 4 active 
search path(s) 2026-02-21T08:40:42.1162602Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50/50 25.4 configs/s 2026-02-21T08:40:45.1005891Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 50/50 17.0 configs/s 2026-02-21T08:40:48.5252271Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 296.1 2026-02-21T08:40:48.5252633Z configs/s 2026-02-21T08:40:48.7650216Z [128s] Generation 5 complete: 2026-02-21T08:40:48.7651733Z ok=54 2026-02-21T08:40:48.7651912Z min=0.0288 2026-02-21T08:40:48.7652056Z mid=0.0327 2026-02-21T08:40:48.7652182Z max=0.0881 2026-02-21T08:40:48.7652336Z best={'block_sizes': [1, 8192], 2026-02-21T08:40:48.7652576Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:40:48.7652835Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:40:48.7653032Z 'num_stages': 6, 2026-02-21T08:40:48.7653187Z 'num_warps': 1, 2026-02-21T08:40:48.7653370Z 'pid_type': 'flat', 2026-02-21T08:40:48.7653562Z 'range_flattens': [None, False], 2026-02-21T08:40:48.7653757Z 'range_multi_buffers': [None, None], 2026-02-21T08:40:48.7653950Z 'range_num_stages': [0, 2], 2026-02-21T08:40:48.7654130Z 'range_unroll_factors': [0, 0], 2026-02-21T08:40:48.7654314Z 'range_warp_specializes': [None, True]} 2026-02-21T08:40:48.7672488Z [128s] Fitting surrogate: 452 points, 452 targets 2026-02-21T08:40:49.4179097Z [129s] Generation 6 starting: 44 neighbors, 4 active search path(s) 2026-02-21T08:40:53.6628547Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45/45 10.0 configs/s 2026-02-21T08:40:56.3453402Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 45/45 17.1 configs/s 2026-02-21T08:40:59.4493624Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 326.7 2026-02-21T08:40:59.4494879Z configs/s 2026-02-21T08:40:59.6803399Z [139s] Generation 6 complete: 2026-02-21T08:40:59.6804259Z ok=48 2026-02-21T08:40:59.6804391Z min=0.0287 2026-02-21T08:40:59.6804523Z mid=0.0307 
2026-02-21T08:40:59.6804641Z max=0.0779 2026-02-21T08:40:59.6804784Z best={'block_sizes': [1, 8192], 2026-02-21T08:40:59.6805021Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:40:59.6805280Z 'load_eviction_policies': ['', ''], 2026-02-21T08:40:59.6805467Z 'num_sm_multiplier': 32, 2026-02-21T08:40:59.6805624Z 'num_stages': 6, 2026-02-21T08:40:59.6805769Z 'num_warps': 2, 2026-02-21T08:40:59.6805919Z 'pid_type': 'persistent_blocked', 2026-02-21T08:40:59.6806104Z 'range_flattens': [True, True], 2026-02-21T08:40:59.6806277Z 'range_multi_buffers': [None, None], 2026-02-21T08:40:59.6806461Z 'range_num_stages': [3, 1], 2026-02-21T08:40:59.6806621Z 'range_unroll_factors': [0, 2], 2026-02-21T08:40:59.6806800Z 'range_warp_specializes': [True, None]} 2026-02-21T08:40:59.6820328Z [139s] Fitting surrogate: 500 points, 500 targets 2026-02-21T08:41:00.1652597Z [140s] Generation 7 starting: 27 neighbors, 2 active search path(s) 2026-02-21T08:41:03.8458607Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 9.3 configs/s 2026-02-21T08:41:05.4652929Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 27/27 17.1 configs/s 2026-02-21T08:41:07.5385657Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 487.8 2026-02-21T08:41:07.5389194Z configs/s 2026-02-21T08:41:07.6958347Z [147s] Generation 7 complete: 2026-02-21T08:41:07.6963676Z ok=29 2026-02-21T08:41:07.6965558Z min=0.0307 2026-02-21T08:41:07.6965716Z mid=0.0307 2026-02-21T08:41:07.6965849Z max=0.0471 2026-02-21T08:41:07.6965988Z best={'block_sizes': [1, 8192], 2026-02-21T08:41:07.6966249Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:41:07.6966505Z 'load_eviction_policies': ['', ''], 2026-02-21T08:41:07.6966702Z 'num_sm_multiplier': 32, 2026-02-21T08:41:07.6966899Z 'num_stages': 6, 2026-02-21T08:41:07.6967069Z 'num_warps': 2, 2026-02-21T08:41:07.6967234Z 'pid_type': 'persistent_blocked', 2026-02-21T08:41:07.6967417Z 
'range_flattens': [True, True], 2026-02-21T08:41:07.6967603Z 'range_multi_buffers': [None, None], 2026-02-21T08:41:07.6967781Z 'range_num_stages': [3, 1], 2026-02-21T08:41:07.6967950Z 'range_unroll_factors': [0, 2], 2026-02-21T08:41:07.6968126Z 'range_warp_specializes': [True, None]} 2026-02-21T08:41:07.6976468Z [147s] Fitting surrogate: 529 points, 529 targets 2026-02-21T08:41:08.0579554Z [147s] Generation 8 starting: 8 neighbors, 1 active search path(s) 2026-02-21T08:41:09.6672771Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9/9 4.4 configs/s 2026-02-21T08:41:10.2008011Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 18.6 configs/s 2026-02-21T08:41:10.7120956Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1929.6 2026-02-21T08:41:10.7125232Z configs/s 2026-02-21T08:41:10.7618642Z [150s] Generation 8 complete: 2026-02-21T08:41:10.7618906Z ok=10 2026-02-21T08:41:10.7619091Z min=0.0287 2026-02-21T08:41:10.7619258Z mid=0.0389 2026-02-21T08:41:10.7619389Z max=0.0553 2026-02-21T08:41:10.7619557Z best={'block_sizes': [1, 8192], 2026-02-21T08:41:10.7619817Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:41:10.7620083Z 'load_eviction_policies': ['', ''], 2026-02-21T08:41:10.7620269Z 'num_sm_multiplier': 32, 2026-02-21T08:41:10.7620424Z 'num_stages': 6, 2026-02-21T08:41:10.7620565Z 'num_warps': 2, 2026-02-21T08:41:10.7620711Z 'pid_type': 'persistent_blocked', 2026-02-21T08:41:10.7620897Z 'range_flattens': [True, True], 2026-02-21T08:41:10.7621070Z 'range_multi_buffers': [None, None], 2026-02-21T08:41:10.7621251Z 'range_num_stages': [3, 1], 2026-02-21T08:41:10.7621411Z 'range_unroll_factors': [0, 2], 2026-02-21T08:41:10.7621650Z 'range_warp_specializes': [True, None]} 2026-02-21T08:41:10.7643445Z [150s] Fitting surrogate: 539 points, 539 targets 2026-02-21T08:41:11.1121894Z [151s] Generation 9 starting: 10 neighbors, 1 active search path(s) 2026-02-21T08:41:12.2671301Z Generation 9: 
precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 4.6 configs/s 2026-02-21T08:41:12.8534527Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 18.6 configs/s 2026-02-21T08:41:14.2401942Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1151.9 2026-02-21T08:41:14.2402532Z configs/s 2026-02-21T08:41:14.3133492Z [154s] Generation 9 complete: 2026-02-21T08:41:14.3135047Z ok=12 2026-02-21T08:41:14.3135250Z min=0.0307 2026-02-21T08:41:14.3139880Z mid=0.0307 2026-02-21T08:41:14.3144405Z max=0.0389 2026-02-21T08:41:14.3148811Z best={'block_sizes': [1, 8192], 2026-02-21T08:41:14.3150362Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:41:14.3150726Z 'load_eviction_policies': ['', ''], 2026-02-21T08:41:14.3150951Z 'num_sm_multiplier': 32, 2026-02-21T08:41:14.3156177Z 'num_stages': 6, 2026-02-21T08:41:14.3159352Z 'num_warps': 2, 2026-02-21T08:41:14.3161488Z 'pid_type': 'persistent_blocked', 2026-02-21T08:41:14.3161791Z 'range_flattens': [True, True], 2026-02-21T08:41:14.3161995Z 'range_multi_buffers': [None, None], 2026-02-21T08:41:14.3162179Z 'range_num_stages': [3, 1], 2026-02-21T08:41:14.3162347Z 'range_unroll_factors': [0, 2], 2026-02-21T08:41:14.3162524Z 'range_warp_specializes': [True, None]} 2026-02-21T08:41:14.3162813Z [154s] Fitting surrogate: 551 points, 551 targets 2026-02-21T08:41:14.7353679Z [154s] Generation 10 starting: 12 neighbors, 1 active search path(s) 2026-02-21T08:41:16.3736876Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 12.7 configs/s 2026-02-21T08:41:17.0934380Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 17.8 configs/s 2026-02-21T08:41:18.1373291Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 959.9 2026-02-21T08:41:18.1377235Z configs/s 2026-02-21T08:41:18.2190296Z [158s] Generation 10 complete: 2026-02-21T08:41:18.2194678Z ok=14 2026-02-21T08:41:18.2199050Z min=0.0307 2026-02-21T08:41:18.2200906Z mid=0.0307 
2026-02-21T08:41:18.2201096Z max=0.0431 2026-02-21T08:41:18.2201275Z best={'block_sizes': [1, 8192], 2026-02-21T08:41:18.2201774Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:41:18.2202090Z 'load_eviction_policies': ['', ''], 2026-02-21T08:41:18.2202287Z 'num_sm_multiplier': 32, 2026-02-21T08:41:18.2202460Z 'num_stages': 6, 2026-02-21T08:41:18.2202623Z 'num_warps': 2, 2026-02-21T08:41:18.2202819Z 'pid_type': 'persistent_blocked', 2026-02-21T08:41:18.2203023Z 'range_flattens': [True, True], 2026-02-21T08:41:18.2203219Z 'range_multi_buffers': [None, None], 2026-02-21T08:41:18.2203422Z 'range_num_stages': [3, 1], 2026-02-21T08:41:18.2203647Z 'range_unroll_factors': [0, 2], 2026-02-21T08:41:18.2204184Z 'range_warp_specializes': [True, None]} 2026-02-21T08:41:18.2218744Z [158s] Fitting surrogate: 565 points, 565 targets 2026-02-21T08:41:18.4975428Z [158s] Autotuning complete in 158.4s after searching 534 configs. 2026-02-21T08:41:18.4977450Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:41:18.4978454Z @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', ''], num_sm_multiplier=32, num_stages=6, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[None, None], range_num_stages=[3, 1], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:41:18.4979311Z 2026-02-21T08:41:18.4979562Z [158s] Code of selected kernel: /tmp/torchinductor_root/gk/cgkn344xjwop3j7ywcryqion4i3hhvpupzxrejegoxquuyzp5mdx.py 2026-02-21T08:41:19.5940000Z WARNING:tritonbench.utils.triton_op:Completed input ID 56: 2026-02-21T08:41:19.5943842Z (M, N) 2026-02-21T08:41:19.5948159Z ------------ 2026-02-21T08:41:19.5949790Z (4096, 7424) 2026-02-21T08:41:19.5949983Z 2026-02-21T08:41:19.5955845Z 60%|██████ | 12/20 [32:17<22:50, 
171.30s/it]WARNING:tritonbench.utils.triton_op:Running input ID 61: 2026-02-21T08:41:19.5959709Z (M, N) 2026-02-21T08:41:19.5962882Z ------------ 2026-02-21T08:41:19.5963135Z (4096, 8064) 2026-02-21T08:41:19.5966269Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T08:41:20.7793105Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T08:41:22.2562985Z INFO:tritonbench.utils.triton_op:Took 2.24ms to get benchmark function for torch_compile_softmax 2026-02-21T08:41:25.7738979Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:41:25.7743113Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:41:25.7744455Z 'dtype': 'torch.float16', 2026-02-21T08:41:25.7744720Z 'shape': (4096, 8064), 2026-02-21T08:41:25.7744925Z 'stride': (8064, 1)},), 2026-02-21T08:41:25.7745095Z 'kwargs': {}} 2026-02-21T08:41:25.7761918Z INFO:tritonbench.utils.triton_op:Took 2.55ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T08:41:25.9583340Z [0s] Autotune random seed: 2134816249 2026-02-21T08:41:26.0970599Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:42:01.2694309Z [35s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None]) 2026-02-21T08:42:01.7723093Z [35s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, 
False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False]) 2026-02-21T08:42:02.0769492Z [35s] Timeout after 30s compiling Config(block_sizes=[256, 4096], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=128, num_sm_multiplier=128, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 2], range_unroll_factors=[3, 0], range_warp_specializes=[None, True]) 2026-02-21T08:42:02.0781119Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s 2026-02-21T08:42:09.4254296Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 13.7 configs/s 2026-02-21T08:42:09.4264542Z [43s] Adaptive compile timeout: 30s (90% percentile=8.3s, bounds=[30.0s, 30s]) 2026-02-21T08:42:10.0163388Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1651.3 configs/s 2026-02-21T08:42:10.0670994Z [43s] Initial random population of 100, 5 starting points: 2026-02-21T08:42:10.0672457Z error=6 2026-02-21T08:42:10.0672624Z timeout=3 2026-02-21T08:42:10.0672760Z ok=91 2026-02-21T08:42:10.0672883Z min=0.0532 2026-02-21T08:42:10.0673024Z mid=0.8255 2026-02-21T08:42:10.0673152Z max=53.8153 2026-02-21T08:42:10.0673301Z best={'block_sizes': [2, 1024], 2026-02-21T08:42:10.0673570Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:42:10.0673849Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:42:10.0674044Z 'num_sm_multiplier': 64, 2026-02-21T08:42:10.0674203Z 'num_stages': 5, 2026-02-21T08:42:10.0674346Z 'num_warps': 1, 2026-02-21T08:42:10.0674501Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:42:10.0675023Z 'range_flattens': [True, True], 2026-02-21T08:42:10.0675225Z 'range_multi_buffers': [False, None], 2026-02-21T08:42:10.0675414Z 'range_num_stages': [3, 1], 2026-02-21T08:42:10.0675578Z 
'range_unroll_factors': [0, 2], 2026-02-21T08:42:10.0675763Z 'range_warp_specializes': [True, None]} 2026-02-21T08:42:10.0687104Z [43s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:42:11.0331435Z [44s] Generation 1 starting: 79 neighbors, 5 active search path(s) 2026-02-21T08:42:23.5399350Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 1.8 configs/s 2026-02-21T08:42:28.5581883Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 16.9 configs/s 2026-02-21T08:42:33.6492212Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 198.7 2026-02-21T08:42:33.6496200Z configs/s 2026-02-21T08:42:33.9479726Z [67s] Generation 1 complete: 2026-02-21T08:42:33.9481718Z ok=85 2026-02-21T08:42:33.9481985Z min=0.0389 2026-02-21T08:42:33.9482223Z mid=0.0552 2026-02-21T08:42:33.9482384Z max=2.5283 2026-02-21T08:42:33.9482594Z best={'block_sizes': [1, 8192], 2026-02-21T08:42:33.9482881Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:42:33.9483202Z 'load_eviction_policies': ['', 'first'], 2026-02-21T08:42:33.9483455Z 'num_stages': 1, 2026-02-21T08:42:33.9483636Z 'num_warps': 4, 2026-02-21T08:42:33.9483857Z 'pid_type': 'flat', 2026-02-21T08:42:33.9484053Z 'range_flattens': [None, None], 2026-02-21T08:42:33.9484303Z 'range_multi_buffers': [None, None], 2026-02-21T08:42:33.9484532Z 'range_num_stages': [0, 4], 2026-02-21T08:42:33.9484770Z 'range_unroll_factors': [0, 1], 2026-02-21T08:42:33.9484990Z 'range_warp_specializes': [None, False]} 2026-02-21T08:42:33.9491457Z [67s] Fitting surrogate: 185 points, 185 targets 2026-02-21T08:42:34.8167932Z [68s] Generation 2 starting: 60 neighbors, 5 active search path(s) 2026-02-21T08:43:00.0769308Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 0.7 configs/s 2026-02-21T08:43:03.7909316Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 63/63 17.2 configs/s 2026-02-21T08:43:05.7981780Z Generation 2: verifying top configs 100% 
━━━━━━━━━━━━━━ 1000/1000 502.4 2026-02-21T08:43:05.7982256Z configs/s 2026-02-21T08:43:05.9228747Z [99s] Generation 2 complete: 2026-02-21T08:43:05.9233124Z error=1 2026-02-21T08:43:05.9234567Z ok=65 2026-02-21T08:43:05.9234782Z min=0.0307 2026-02-21T08:43:05.9234995Z mid=0.0491 2026-02-21T08:43:05.9235163Z max=0.8058 2026-02-21T08:43:05.9235374Z best={'block_sizes': [1, 8192], 2026-02-21T08:43:05.9235674Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:43:05.9236013Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:43:05.9236249Z 'num_stages': 1, 2026-02-21T08:43:05.9236466Z 'num_warps': 4, 2026-02-21T08:43:05.9236690Z 'pid_type': 'flat', 2026-02-21T08:43:05.9236915Z 'range_flattens': [None, None], 2026-02-21T08:43:05.9237481Z 'range_multi_buffers': [None, None], 2026-02-21T08:43:05.9237710Z 'range_num_stages': [0, 4], 2026-02-21T08:43:05.9237950Z 'range_unroll_factors': [0, 1], 2026-02-21T08:43:05.9238172Z 'range_warp_specializes': [None, False]} 2026-02-21T08:43:05.9241120Z [99s] Fitting surrogate: 251 points, 251 targets 2026-02-21T08:43:06.8374267Z [100s] Generation 3 starting: 60 neighbors, 5 active search path(s) 2026-02-21T08:43:13.9979690Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 5.5 configs/s 2026-02-21T08:43:17.6435285Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 61/61 16.9 configs/s 2026-02-21T08:43:22.0600434Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 229.5 2026-02-21T08:43:22.0605179Z configs/s 2026-02-21T08:43:22.3604167Z [116s] Generation 3 complete: 2026-02-21T08:43:22.3609131Z ok=65 2026-02-21T08:43:22.3614196Z min=0.0307 2026-02-21T08:43:22.3618454Z mid=0.0410 2026-02-21T08:43:22.3620464Z max=1.4491 2026-02-21T08:43:22.3620693Z best={'block_sizes': [1, 8192], 2026-02-21T08:43:22.3621071Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:43:22.3626114Z 'load_eviction_policies': ['first', 'first'], 
2026-02-21T08:43:22.3628212Z 'num_stages': 1, 2026-02-21T08:43:22.3628454Z 'num_warps': 4, 2026-02-21T08:43:22.3628691Z 'pid_type': 'flat', 2026-02-21T08:43:22.3628905Z 'range_flattens': [None, None], 2026-02-21T08:43:22.3629176Z 'range_multi_buffers': [None, None], 2026-02-21T08:43:22.3629414Z 'range_num_stages': [0, 4], 2026-02-21T08:43:22.3629662Z 'range_unroll_factors': [0, 1], 2026-02-21T08:43:22.3629893Z 'range_warp_specializes': [None, False]} 2026-02-21T08:43:22.3630198Z [116s] Fitting surrogate: 316 points, 316 targets 2026-02-21T08:43:23.1902284Z [117s] Generation 4 starting: 51 neighbors, 5 active search path(s) 2026-02-21T08:43:31.5280175Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 10.7 configs/s 2026-02-21T08:43:34.8953008Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 15.9 configs/s 2026-02-21T08:43:38.3426946Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 293.8 2026-02-21T08:43:38.3430147Z configs/s 2026-02-21T08:43:38.5797673Z [132s] Generation 4 complete: 2026-02-21T08:43:38.5799281Z ok=56 2026-02-21T08:43:38.5799579Z min=0.0307 2026-02-21T08:43:38.5804207Z mid=0.0389 2026-02-21T08:43:38.5809180Z max=0.1351 2026-02-21T08:43:38.5810629Z best={'block_sizes': [1, 8192], 2026-02-21T08:43:38.5810995Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:43:38.5811313Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:43:38.5811818Z 'num_stages': 1, 2026-02-21T08:43:38.5812015Z 'num_warps': 4, 2026-02-21T08:43:38.5812231Z 'pid_type': 'flat', 2026-02-21T08:43:38.5812472Z 'range_flattens': [None, None], 2026-02-21T08:43:38.5813093Z 'range_multi_buffers': [None, None], 2026-02-21T08:43:38.5813352Z 'range_num_stages': [0, 4], 2026-02-21T08:43:38.5813562Z 'range_unroll_factors': [0, 1], 2026-02-21T08:43:38.5813830Z 'range_warp_specializes': [None, False]} 2026-02-21T08:43:38.5815216Z [132s] Fitting surrogate: 372 points, 372 targets 2026-02-21T08:43:39.8043397Z 
[133s] Generation 5 starting: 35 neighbors, 4 active search path(s) 2026-02-21T08:43:45.1032452Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 29.0 configs/s 2026-02-21T08:43:47.2861005Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 16.3 configs/s 2026-02-21T08:43:50.1993547Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 348.0 2026-02-21T08:43:50.1994106Z configs/s 2026-02-21T08:43:50.4089743Z [144s] Generation 5 complete: 2026-02-21T08:43:50.4094147Z ok=40 2026-02-21T08:43:50.4097529Z min=0.0307 2026-02-21T08:43:50.4102586Z mid=0.0327 2026-02-21T08:43:50.4104484Z max=0.0820 2026-02-21T08:43:50.4104711Z best={'block_sizes': [1, 8192], 2026-02-21T08:43:50.4105051Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:43:50.4105423Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:43:50.4105700Z 'num_stages': 1, 2026-02-21T08:43:50.4110456Z 'num_warps': 4, 2026-02-21T08:43:50.4114476Z 'pid_type': 'flat', 2026-02-21T08:43:50.4119195Z 'range_flattens': [None, None], 2026-02-21T08:43:50.4119480Z 'range_multi_buffers': [None, None], 2026-02-21T08:43:50.4119735Z 'range_num_stages': [0, 4], 2026-02-21T08:43:50.4119989Z 'range_unroll_factors': [0, 1], 2026-02-21T08:43:50.4126220Z 'range_warp_specializes': [None, False]} 2026-02-21T08:43:50.4126582Z [144s] Fitting surrogate: 412 points, 412 targets 2026-02-21T08:43:50.7726550Z [144s] Generation 6 starting: 18 neighbors, 2 active search path(s) 2026-02-21T08:43:54.9089339Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 7.5 configs/s 2026-02-21T08:43:56.1478111Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 19/19 15.8 configs/s 2026-02-21T08:43:57.3760913Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 817.3 2026-02-21T08:43:57.3762787Z configs/s 2026-02-21T08:43:57.4665602Z [151s] Generation 6 complete: 2026-02-21T08:43:57.4667486Z ok=21 2026-02-21T08:43:57.4667713Z min=0.0307 
2026-02-21T08:43:57.4667928Z mid=0.0329 2026-02-21T08:43:57.4668103Z max=0.1801 2026-02-21T08:43:57.4668326Z best={'block_sizes': [1, 8192], 2026-02-21T08:43:57.4668639Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:43:57.4668929Z 'load_eviction_policies': ['', ''], 2026-02-21T08:43:57.4669191Z 'num_stages': 2, 2026-02-21T08:43:57.4669382Z 'num_warps': 4, 2026-02-21T08:43:57.4669604Z 'pid_type': 'flat', 2026-02-21T08:43:57.4669814Z 'range_flattens': [None, False], 2026-02-21T08:43:57.4670119Z 'range_multi_buffers': [None, None], 2026-02-21T08:43:57.4670367Z 'range_num_stages': [0, 1], 2026-02-21T08:43:57.4670619Z 'range_unroll_factors': [0, 1], 2026-02-21T08:43:57.4670851Z 'range_warp_specializes': [None, True]} 2026-02-21T08:43:57.4688784Z [151s] Fitting surrogate: 433 points, 433 targets 2026-02-21T08:43:57.9013689Z [151s] Generation 7 starting: 24 neighbors, 2 active search path(s) 2026-02-21T08:44:01.2284580Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24/24 18.2 configs/s 2026-02-21T08:44:02.6956845Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 24/24 16.9 configs/s 2026-02-21T08:44:04.3523021Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 610.1 2026-02-21T08:44:04.3523875Z configs/s 2026-02-21T08:44:04.4727861Z [158s] Generation 7 complete: 2026-02-21T08:44:04.4728168Z ok=26 2026-02-21T08:44:04.4728374Z min=0.0307 2026-02-21T08:44:04.4728582Z mid=0.0308 2026-02-21T08:44:04.4729091Z max=0.1004 2026-02-21T08:44:04.4729272Z best={'block_sizes': [1, 8192], 2026-02-21T08:44:04.4729564Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:44:04.4729839Z 'load_eviction_policies': ['', ''], 2026-02-21T08:44:04.4730088Z 'num_stages': 2, 2026-02-21T08:44:04.4730277Z 'num_warps': 4, 2026-02-21T08:44:04.4730494Z 'pid_type': 'flat', 2026-02-21T08:44:04.4730701Z 'range_flattens': [None, False], 2026-02-21T08:44:04.4730959Z 'range_multi_buffers': [None, None], 
2026-02-21T08:44:04.4731214Z 'range_num_stages': [0, 0], 2026-02-21T08:44:04.4731424Z 'range_unroll_factors': [0, 1], 2026-02-21T08:44:04.4731976Z 'range_warp_specializes': [None, True]} 2026-02-21T08:44:04.4749591Z [158s] Fitting surrogate: 459 points, 459 targets 2026-02-21T08:44:04.7711095Z [158s] Generation 8 starting: 11 neighbors, 1 active search path(s) 2026-02-21T08:44:07.6667982Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 2.7 configs/s 2026-02-21T08:44:09.0947983Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 11/11 7.5 configs/s 2026-02-21T08:44:09.9985265Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1107.3 2026-02-21T08:44:09.9989413Z configs/s 2026-02-21T08:44:10.0709044Z [163s] Generation 8 complete: 2026-02-21T08:44:10.0710546Z ok=13 2026-02-21T08:44:10.0710785Z min=0.0307 2026-02-21T08:44:10.0710964Z mid=0.0327 2026-02-21T08:44:10.0711162Z max=0.0696 2026-02-21T08:44:10.0711345Z best={'block_sizes': [1, 8192], 2026-02-21T08:44:10.0712478Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:44:10.0712759Z 'load_eviction_policies': ['', ''], 2026-02-21T08:44:10.0713006Z 'num_stages': 2, 2026-02-21T08:44:10.0713199Z 'num_warps': 4, 2026-02-21T08:44:10.0713409Z 'pid_type': 'flat', 2026-02-21T08:44:10.0715569Z 'range_flattens': [None, False], 2026-02-21T08:44:10.0716019Z 'range_multi_buffers': [None, None], 2026-02-21T08:44:10.0720861Z 'range_num_stages': [0, 0], 2026-02-21T08:44:10.0726545Z 'range_unroll_factors': [0, 1], 2026-02-21T08:44:10.0730654Z 'range_warp_specializes': [None, True]} 2026-02-21T08:44:10.0735304Z [163s] Fitting surrogate: 472 points, 472 targets 2026-02-21T08:44:10.3331217Z [164s] Generation 9 starting: 7 neighbors, 1 active search path(s) 2026-02-21T08:44:12.8013152Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7/7 90.8 configs/s 2026-02-21T08:44:13.2280646Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━━ 7/7 18.5 configs/s 
2026-02-21T08:44:13.8256385Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1658.0 2026-02-21T08:44:13.8257055Z configs/s 2026-02-21T08:44:13.8802025Z [167s] Generation 9 complete: 2026-02-21T08:44:13.8807229Z ok=9 2026-02-21T08:44:13.8809446Z min=0.0307 2026-02-21T08:44:13.8809694Z mid=0.0326 2026-02-21T08:44:13.8812193Z max=0.0553 2026-02-21T08:44:13.8812461Z best={'block_sizes': [1, 8192], 2026-02-21T08:44:13.8812789Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:44:13.8813064Z 'load_eviction_policies': ['', ''], 2026-02-21T08:44:13.8813312Z 'num_stages': 2, 2026-02-21T08:44:13.8813494Z 'num_warps': 4, 2026-02-21T08:44:13.8813708Z 'pid_type': 'flat', 2026-02-21T08:44:13.8813909Z 'range_flattens': [None, False], 2026-02-21T08:44:13.8814163Z 'range_multi_buffers': [None, None], 2026-02-21T08:44:13.8814417Z 'range_num_stages': [0, 0], 2026-02-21T08:44:13.8814629Z 'range_unroll_factors': [0, 1], 2026-02-21T08:44:13.8814882Z 'range_warp_specializes': [None, True]} 2026-02-21T08:44:13.8818854Z [167s] Fitting surrogate: 481 points, 481 targets 2026-02-21T08:44:14.1955028Z [168s] Generation 10 starting: 12 neighbors, 1 active search path(s) 2026-02-21T08:44:16.7643263Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 7.8 configs/s 2026-02-21T08:44:17.4969302Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 17.5 configs/s 2026-02-21T08:44:18.5478596Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 954.6 2026-02-21T08:44:18.5479619Z configs/s 2026-02-21T08:44:18.6353147Z [172s] Generation 10 complete: 2026-02-21T08:44:18.6353568Z ok=14 2026-02-21T08:44:18.6353794Z min=0.0307 2026-02-21T08:44:18.6354037Z mid=0.0308 2026-02-21T08:44:18.6354223Z max=0.0451 2026-02-21T08:44:18.6354445Z best={'block_sizes': [1, 8192], 2026-02-21T08:44:18.6354735Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:44:18.6355003Z 'load_eviction_policies': ['', ''], 
2026-02-21T08:44:18.6355253Z 'num_stages': 2, 2026-02-21T08:44:18.6355445Z 'num_warps': 4, 2026-02-21T08:44:18.6355658Z 'pid_type': 'flat', 2026-02-21T08:44:18.6355855Z 'range_flattens': [None, False], 2026-02-21T08:44:18.6356112Z 'range_multi_buffers': [None, None], 2026-02-21T08:44:18.6358722Z 'range_num_stages': [0, 0], 2026-02-21T08:44:18.6359453Z 'range_unroll_factors': [0, 1], 2026-02-21T08:44:18.6359758Z 'range_warp_specializes': [None, True]} 2026-02-21T08:44:18.6367305Z [172s] Fitting surrogate: 495 points, 495 targets 2026-02-21T08:44:18.9433002Z [172s] Generation 11 starting: 11 neighbors, 1 active search path(s) 2026-02-21T08:44:21.2717907Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 9.1 configs/s 2026-02-21T08:44:22.0769133Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 15.7 configs/s 2026-02-21T08:44:22.9985179Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1088.0 2026-02-21T08:44:22.9985825Z configs/s 2026-02-21T08:44:23.0720138Z [176s] Generation 11 complete: 2026-02-21T08:44:23.0722091Z ok=13 2026-02-21T08:44:23.0722357Z min=0.0307 2026-02-21T08:44:23.0722572Z mid=0.0369 2026-02-21T08:44:23.0722740Z max=0.0593 2026-02-21T08:44:23.0725449Z best={'block_sizes': [1, 8192], 2026-02-21T08:44:23.0725840Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:44:23.0726172Z 'load_eviction_policies': ['', ''], 2026-02-21T08:44:23.0726412Z 'num_stages': 2, 2026-02-21T08:44:23.0726644Z 'num_warps': 4, 2026-02-21T08:44:23.0726841Z 'pid_type': 'flat', 2026-02-21T08:44:23.0727084Z 'range_flattens': [None, False], 2026-02-21T08:44:23.0727354Z 'range_multi_buffers': [None, None], 2026-02-21T08:44:23.0727592Z 'range_num_stages': [0, 0], 2026-02-21T08:44:23.0727850Z 'range_unroll_factors': [0, 1], 2026-02-21T08:44:23.0728084Z 'range_warp_specializes': [None, True]} 2026-02-21T08:44:23.0738566Z [176s] Fitting surrogate: 508 points, 508 targets 2026-02-21T08:44:23.4468665Z [177s] 
Generation 12 starting: 8 neighbors, 1 active search path(s) 2026-02-21T08:44:25.8739518Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8/8 23.3 configs/s 2026-02-21T08:44:26.3623572Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 8/8 18.1 configs/s 2026-02-21T08:44:27.1250192Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1308.0 2026-02-21T08:44:27.1251284Z configs/s 2026-02-21T08:44:27.1898478Z [181s] Generation 12 complete: 2026-02-21T08:44:27.1898876Z ok=10 2026-02-21T08:44:27.1899077Z min=0.0307 2026-02-21T08:44:27.1899305Z mid=0.0308 2026-02-21T08:44:27.1899474Z max=0.0451 2026-02-21T08:44:27.1899718Z best={'block_sizes': [1, 8192], 2026-02-21T08:44:27.1900064Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:44:27.1900378Z 'load_eviction_policies': ['', ''], 2026-02-21T08:44:27.1900654Z 'num_stages': 2, 2026-02-21T08:44:27.1900852Z 'num_warps': 4, 2026-02-21T08:44:27.1901078Z 'pid_type': 'flat', 2026-02-21T08:44:27.1901289Z 'range_flattens': [None, False], 2026-02-21T08:44:27.1901591Z 'range_multi_buffers': [None, None], 2026-02-21T08:44:27.1901851Z 'range_num_stages': [0, 0], 2026-02-21T08:44:27.1902389Z 'range_unroll_factors': [0, 1], 2026-02-21T08:44:27.1902671Z 'range_warp_specializes': [None, True]} 2026-02-21T08:44:27.1920065Z [181s] Fitting surrogate: 518 points, 518 targets 2026-02-21T08:44:27.5485601Z [181s] Generation 13 starting: 9 neighbors, 1 active search path(s) 2026-02-21T08:44:31.0976568Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 3.7 configs/s 2026-02-21T08:44:31.6976207Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 10/10 18.1 configs/s 2026-02-21T08:44:32.3774941Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1462.1 2026-02-21T08:44:32.3776560Z configs/s 2026-02-21T08:44:32.4382112Z [186s] Generation 13 complete: 2026-02-21T08:44:32.4386653Z ok=11 2026-02-21T08:44:32.4388822Z min=0.0307 
2026-02-21T08:44:32.4389064Z mid=0.0308 2026-02-21T08:44:32.4389238Z max=0.0593 2026-02-21T08:44:32.4389451Z best={'block_sizes': [1, 8192], 2026-02-21T08:44:32.4389788Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:44:32.4390079Z 'load_eviction_policies': ['', ''], 2026-02-21T08:44:32.4390332Z 'num_stages': 2, 2026-02-21T08:44:32.4390515Z 'num_warps': 4, 2026-02-21T08:44:32.4390742Z 'pid_type': 'flat', 2026-02-21T08:44:32.4390942Z 'range_flattens': [None, False], 2026-02-21T08:44:32.4391186Z 'range_multi_buffers': [None, None], 2026-02-21T08:44:32.4391407Z 'range_num_stages': [0, 0], 2026-02-21T08:44:32.4391699Z 'range_unroll_factors': [0, 1], 2026-02-21T08:44:32.4391953Z 'range_warp_specializes': [None, True]} 2026-02-21T08:44:32.4397835Z [186s] Fitting surrogate: 529 points, 529 targets 2026-02-21T08:44:32.8353591Z [186s] Generation 14 starting: 11 neighbors, 1 active search path(s) 2026-02-21T08:44:35.3046960Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 5.6 configs/s 2026-02-21T08:44:35.9618259Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 18.0 configs/s 2026-02-21T08:44:36.9341362Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1030.6 2026-02-21T08:44:36.9342200Z configs/s 2026-02-21T08:44:37.0136965Z [190s] Generation 14 complete: 2026-02-21T08:44:37.0138694Z ok=13 2026-02-21T08:44:37.0138916Z min=0.0307 2026-02-21T08:44:37.0139126Z mid=0.0308 2026-02-21T08:44:37.0139294Z max=0.0451 2026-02-21T08:44:37.0139504Z best={'block_sizes': [1, 8192], 2026-02-21T08:44:37.0139771Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:44:37.0140068Z 'load_eviction_policies': ['', ''], 2026-02-21T08:44:37.0140284Z 'num_stages': 2, 2026-02-21T08:44:37.0140495Z 'num_warps': 4, 2026-02-21T08:44:37.0140675Z 'pid_type': 'flat', 2026-02-21T08:44:37.0140911Z 'range_flattens': [None, False], 2026-02-21T08:44:37.0146103Z 'range_multi_buffers': [None, None], 
2026-02-21T08:44:37.0147634Z 'range_num_stages': [0, 0], 2026-02-21T08:44:37.0147905Z 'range_unroll_factors': [0, 1], 2026-02-21T08:44:37.0148192Z 'range_warp_specializes': [None, True]} 2026-02-21T08:44:37.0153714Z [190s] Fitting surrogate: 542 points, 542 targets 2026-02-21T08:44:37.3120178Z [191s] Autotuning complete in 191.2s after searching 507 configs. 2026-02-21T08:44:37.3120785Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:44:37.3122114Z @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', ''], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:44:37.3122999Z 2026-02-21T08:44:37.3123366Z [191s] Code of selected kernel: /tmp/torchinductor_root/hq/chqj2t3kr6zihvxuaucr5c7hpwcwekhixqtqvxw23ih2hwb5gstv.py 2026-02-21T08:44:37.9169457Z WARNING:tritonbench.utils.triton_op:Completed input ID 61: 2026-02-21T08:44:37.9170025Z (M, N) 2026-02-21T08:44:37.9170257Z ------------ 2026-02-21T08:44:37.9171176Z (4096, 8064) 2026-02-21T08:44:37.9171388Z 2026-02-21T08:44:37.9179023Z 65%|██████▌ | 13/20 [35:35<20:56, 179.49s/it]WARNING:tritonbench.utils.triton_op:Running input ID 66: 2026-02-21T08:44:37.9179544Z (M, N) 2026-02-21T08:44:37.9182464Z ------------ 2026-02-21T08:44:37.9182693Z (4096, 8704) 2026-02-21T08:44:37.9183149Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T08:44:39.0623312Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T08:44:40.3452842Z INFO:tritonbench.utils.triton_op:Took 2.39ms to get benchmark function for torch_compile_softmax 2026-02-21T08:44:44.9857176Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:44:44.9862258Z { 'args': ( { 'device': 'cuda:0', 
2026-02-21T08:44:44.9865795Z 'dtype': 'torch.float16', 2026-02-21T08:44:44.9867953Z 'shape': (4096, 8704), 2026-02-21T08:44:44.9868318Z 'stride': (8704, 1)},), 2026-02-21T08:44:44.9873882Z 'kwargs': {}} 2026-02-21T08:44:44.9890955Z INFO:tritonbench.utils.triton_op:Took 3.59ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T08:44:45.1783753Z [0s] Autotune random seed: 2134816249 2026-02-21T08:44:45.3255016Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:45:24.9641265Z [39s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[4, 1], range_warp_specializes=[None, None]) 2026-02-21T08:45:24.9660846Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s 2026-02-21T08:45:25.8215922Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:45:25.8216631Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:45:25.8218851Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:45:25.8219091Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:45:25.8219378Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:45:25.8224108Z %cst = arith.constant dense<8704> : tensor<8x1xi32> 2026-02-21T08:45:25.8229226Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<8xf32> 2026-02-21T08:45:25.8234645Z %cst_1 = arith.constant dense<0xFF800000> : tensor<8xf32> 2026-02-21T08:45:25.8236728Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:45:25.8237016Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:45:25.8237295Z %c8704_i32 =
arith.constant 8704 : i32 2026-02-21T08:45:25.8237553Z %c8704_i64 = arith.constant 8704 : i64 2026-02-21T08:45:25.8237774Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:45:25.8238502Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8704_i32], [%c8704_i64, %c1_i64] : <f16>, <tensor<8x512xf16>> 2026-02-21T08:45:25.8238869Z %1 = tt.get_program_id x : i32 2026-02-21T08:45:25.8239153Z scf.for %arg2 = %1 to %c512_i32 step %c9472_i32 : i32 { 2026-02-21T08:45:25.8239411Z %2 = arith.muli %arg2, %c8_i32 : i32 2026-02-21T08:45:25.8239708Z %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:45:25.8240049Z %4 = tt.splat %2 : i32 -> tensor<8xi32> 2026-02-21T08:45:25.8240292Z %5 = arith.addi %4, %3 : tensor<8xi32> 2026-02-21T08:45:25.8240561Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:45:25.8240791Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:45:25.8241229Z %6:2 = scf.for %arg3 = %c0_i32 to %c8192_i32 step %c2048_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<8xf32>, tensor<8xf32>) : i32 { 2026-02-21T08:45:25.8242251Z %48 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16> 2026-02-21T08:45:25.8242657Z %49 = arith.extf %48 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:45:25.8242965Z %50 = "tt.reduce"(%49) <{axis = 1 : i32}> ({ 2026-02-21T08:45:25.8243206Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:45:25.8243464Z %126 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:45:25.8243702Z tt.reduce.return %126 : f32 2026-02-21T08:45:25.8243960Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:45:25.8244222Z %51 = arith.truncf %50 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:45:25.8244531Z %52 = arith.extf %51 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:45:25.8244827Z %53 = arith.cmpf ogt, %arg4, %52 : tensor<8xf32> 2026-02-21T08:45:25.8245089Z %54 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32> 2026-02-21T08:45:25.8245367Z %55 = arith.ori %53, %54 : tensor<8xi1> 2026-02-21T08:45:25.8245645Z %56 =
arith.select %55, %arg4, %52 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:45:25.8245948Z %57 = arith.subf %arg4, %56 : tensor<8xf32> 2026-02-21T08:45:25.8246347Z %58 = tt.extern_elementwise %57 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:45:25.8246769Z %59 = arith.mulf %arg5, %58 : tensor<8xf32> 2026-02-21T08:45:25.8247082Z %60 = tt.expand_dims %56 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:45:25.8247405Z %61 = tt.broadcast %60 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:45:25.8247704Z %62 = arith.subf %49, %61 : tensor<8x512xf32> 2026-02-21T08:45:25.8248106Z %63 = tt.extern_elementwise %62 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:45:25.8248550Z %64 = "tt.reduce"(%63) <{axis = 1 : i32}> ({ 2026-02-21T08:45:25.8248805Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:45:25.8249028Z %126 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:45:25.8249288Z tt.reduce.return %126 : f32 2026-02-21T08:45:25.8249519Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:45:25.8249784Z %65 = arith.addf %59, %64 : tensor<8xf32> 2026-02-21T08:45:25.8250014Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:45:25.8250278Z %66 = arith.muli %c512_i32, %c1_i32 : i32 2026-02-21T08:45:25.8250509Z %67 = arith.addi %arg3, %66 : i32 2026-02-21T08:45:25.8250853Z %68 = tt.descriptor_load %0[%2, %67] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16> 2026-02-21T08:45:25.8251234Z %69 = arith.extf %68 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:45:25.8251506Z %70 = "tt.reduce"(%69) <{axis = 1 : i32}> ({ 2026-02-21T08:45:25.8251798Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:45:25.8252024Z %126 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:45:25.8252284Z tt.reduce.return %126 : f32 2026-02-21T08:45:25.8252596Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:45:25.8252887Z %71 = arith.truncf %70 : tensor<8xf32> to
tensor<8xf16> 2026-02-21T08:45:25.8253200Z %72 = arith.extf %71 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:45:25.8253544Z %73 = arith.cmpf ogt, %56, %72 : tensor<8xf32> 2026-02-21T08:45:25.8253798Z %74 = arith.cmpf une, %56, %56 : tensor<8xf32> 2026-02-21T08:45:25.8254067Z %75 = arith.ori %73, %74 : tensor<8xi1> 2026-02-21T08:45:25.8254335Z %76 = arith.select %75, %56, %72 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:45:25.8254633Z %77 = arith.subf %56, %76 : tensor<8xf32> 2026-02-21T08:45:25.8255041Z %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:45:25.8255429Z %79 = arith.mulf %65, %78 : tensor<8xf32> 2026-02-21T08:45:25.8255807Z %80 = tt.expand_dims %76 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:45:25.8256131Z %81 = tt.broadcast %80 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:45:25.8256428Z %82 = arith.subf %69, %81 : tensor<8x512xf32> 2026-02-21T08:45:25.8256830Z %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:45:25.8257255Z %84 = "tt.reduce"(%83) <{axis = 1 : i32}> ({ 2026-02-21T08:45:25.8257514Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:45:25.8257737Z %126 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:45:25.8257993Z tt.reduce.return %126 : f32 2026-02-21T08:45:25.8258214Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:45:25.8258475Z %85 = arith.addf %79, %84 : tensor<8xf32> 2026-02-21T08:45:25.8258709Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:45:25.8258964Z %86 = arith.muli %c512_i32, %c2_i32 : i32 2026-02-21T08:45:25.8259221Z %87 = arith.addi %arg3, %86 : i32 2026-02-21T08:45:25.8259533Z %88 = tt.descriptor_load %0[%2, %87] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16> 2026-02-21T08:45:25.8259910Z %89 = arith.extf %88 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:45:25.8260174Z %90 = "tt.reduce"(%89)
<{axis = 1 : i32}> ({ 2026-02-21T08:45:25.8260473Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:45:25.8260691Z %126 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:45:25.8260948Z tt.reduce.return %126 : f32 2026-02-21T08:45:25.8261194Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:45:25.8261452Z %91 = arith.truncf %90 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:45:25.8261801Z %92 = arith.extf %91 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:45:25.8262071Z %93 = arith.cmpf ogt, %76, %92 : tensor<8xf32> 2026-02-21T08:45:25.8262345Z %94 = arith.cmpf une, %76, %76 : tensor<8xf32> 2026-02-21T08:45:25.8262586Z %95 = arith.ori %93, %94 : tensor<8xi1> 2026-02-21T08:45:25.8262878Z %96 = arith.select %95, %76, %92 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:45:25.8263174Z %97 = arith.subf %76, %96 : tensor<8xf32> 2026-02-21T08:45:25.8263563Z %98 = tt.extern_elementwise %97 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:45:25.8263976Z %99 = arith.mulf %85, %98 : tensor<8xf32> 2026-02-21T08:45:25.8264264Z %100 = tt.expand_dims %96 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:45:25.8264623Z %101 = tt.broadcast %100 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:45:25.8264905Z %102 = arith.subf %89, %101 : tensor<8x512xf32> 2026-02-21T08:45:25.8265343Z %103 = tt.extern_elementwise %102 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:45:25.8265782Z %104 = "tt.reduce"(%103) <{axis = 1 : i32}> ({ 2026-02-21T08:45:25.8266087Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:45:25.8266337Z %126 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:45:25.8266566Z tt.reduce.return %126 : f32 2026-02-21T08:45:25.8266820Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:45:25.8267058Z %105 = arith.addf %99, %104 : tensor<8xf32> 2026-02-21T08:45:25.8267327Z %c3_i32 = arith.constant 3 : i32 
2026-02-21T08:45:25.8267585Z %106 = arith.muli %c512_i32, %c3_i32 : i32 2026-02-21T08:45:25.8267823Z %107 = arith.addi %arg3, %106 : i32 2026-02-21T08:45:25.8268172Z %108 = tt.descriptor_load %0[%2, %107] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16> 2026-02-21T08:45:25.8268532Z %109 = arith.extf %108 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:45:25.8268831Z %110 = "tt.reduce"(%109) <{axis = 1 : i32}> ({ 2026-02-21T08:45:25.8269058Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:45:25.8269361Z %126 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:45:25.8269618Z tt.reduce.return %126 : f32 2026-02-21T08:45:25.8269843Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:45:25.8270132Z %111 = arith.truncf %110 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:45:25.8270416Z %112 = arith.extf %111 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:45:25.8270709Z %113 = arith.cmpf ogt, %96, %112 : tensor<8xf32> 2026-02-21T08:45:25.8270964Z %114 = arith.cmpf une, %96, %96 : tensor<8xf32> 2026-02-21T08:45:25.8271234Z %115 = arith.ori %113, %114 : tensor<8xi1> 2026-02-21T08:45:25.8271575Z %116 = arith.select %115, %96, %112 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:45:25.8271865Z %117 = arith.subf %96, %116 : tensor<8xf32> 2026-02-21T08:45:25.8272314Z %118 = tt.extern_elementwise %117 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:45:25.8272737Z %119 = arith.mulf %105, %118 : tensor<8xf32> 2026-02-21T08:45:25.8273079Z %120 = tt.expand_dims %116 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:45:25.8273433Z %121 = tt.broadcast %120 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:45:25.8273765Z %122 = arith.subf %109, %121 : tensor<8x512xf32> 2026-02-21T08:45:25.8274228Z %123 = tt.extern_elementwise %122 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:45:25.8274659Z %124 = "tt.reduce"(%123) <{axis = 1 : i32}> ({
2026-02-21T08:45:25.8274929Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:45:25.8275160Z %126 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:45:25.8275432Z tt.reduce.return %126 : f32 2026-02-21T08:45:25.8275663Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:45:25.8275941Z %125 = arith.addf %119, %124 : tensor<8xf32> 2026-02-21T08:45:25.8276239Z scf.yield %116, %125 : tensor<8xf32>, tensor<8xf32> 2026-02-21T08:45:25.8276537Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:45:25.8276938Z %7 = tt.descriptor_load %0[%2, %c8192_i32] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16> 2026-02-21T08:45:25.8277317Z %8 = arith.extf %7 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:45:25.8277616Z %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({ 2026-02-21T08:45:25.8277851Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:45:25.8278111Z %48 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:45:25.8278375Z tt.reduce.return %48 : f32 2026-02-21T08:45:25.8278610Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:45:25.8278905Z %10 = arith.truncf %9 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:45:25.8279197Z %11 = arith.extf %10 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:45:25.8279499Z %12 = arith.cmpf ogt, %6#0, %11 : tensor<8xf32> 2026-02-21T08:45:25.8279765Z %13 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32> 2026-02-21T08:45:25.8280134Z %14 = arith.ori %12, %13 : tensor<8xi1> 2026-02-21T08:45:25.8280433Z %15 = arith.select %14, %6#0, %11 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:45:25.8280755Z %16 = arith.subf %6#0, %15 : tensor<8xf32> 2026-02-21T08:45:25.8281341Z %17 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:45:25.8281986Z %18 = arith.mulf %6#1, %17 : tensor<8xf32> 2026-02-21T08:45:25.8282334Z %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:45:25.8282655Z %20 = tt.broadcast %19 : tensor<8x1xf32> ->
tensor<8x512xf32> 2026-02-21T08:45:25.8282954Z %21 = arith.subf %8, %20 : tensor<8x512xf32> 2026-02-21T08:45:25.8283382Z %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:45:25.8283872Z %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({ 2026-02-21T08:45:25.8284151Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:45:25.8284373Z %48 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:45:25.8284639Z tt.reduce.return %48 : f32 2026-02-21T08:45:25.8284865Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:45:25.8285129Z %24 = arith.addf %18, %23 : tensor<8xf32> 2026-02-21T08:45:25.8285390Z %c8192_i32_2 = arith.constant 8192 : i32 2026-02-21T08:45:25.8285663Z %c2048_i32_3 = arith.constant 2048 : i32 2026-02-21T08:45:25.8286089Z scf.for %arg3 = %c0_i32 to %c8192_i32_2 step %c2048_i32_3 : i32 { 2026-02-21T08:45:25.8286590Z %48 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:45:25.8287064Z %49 = tt.splat %arg3 : i32 -> tensor<512xi32> 2026-02-21T08:45:25.8287313Z %50 = arith.addi %49, %48 : tensor<512xi32> 2026-02-21T08:45:25.8287636Z %51 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:45:25.8287971Z %52 = arith.muli %51, %cst : tensor<8x1xi32> 2026-02-21T08:45:25.8288270Z %53 = tt.expand_dims %50 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:45:25.8288631Z %54 = tt.broadcast %52 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:45:25.8288929Z %55 = tt.broadcast %53 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:45:25.8289232Z %56 = arith.addi %54, %55 : tensor<8x512xi32> 2026-02-21T08:45:25.8289510Z %57 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>> 2026-02-21T08:45:25.8289857Z %58 = tt.addptr %57, %56 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32> 2026-02-21T08:45:25.8290226Z %59 = tt.load %58 evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>> 2026-02-21T08:45:25.8290563Z
%60 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:45:25.8290912Z %61 = arith.extf %59 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:45:25.8291351Z %62 = tt.broadcast %60 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:45:25.8291854Z %63 = arith.subf %61, %62 : tensor<8x512xf32> 2026-02-21T08:45:25.8292623Z %64 = tt.extern_elementwise %63 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:45:25.8293316Z %65 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:45:25.8293819Z %66 = tt.broadcast %65 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:45:25.8294209Z %67 = arith.divf %64, %66 : tensor<8x512xf32> 2026-02-21T08:45:25.8294636Z %68 = arith.truncf %67 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:45:25.8295084Z %69 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>> 2026-02-21T08:45:25.8295583Z %70 = tt.addptr %69, %56 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32> 2026-02-21T08:45:25.8296053Z tt.store %70, %68 : tensor<8x512x!tt.ptr<f16>> 2026-02-21T08:45:25.8296591Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:45:25.8296939Z %71 = arith.muli %c512_i32, %c1_i32 : i32 2026-02-21T08:45:25.8297276Z %72 = arith.addi %arg3, %71 : i32 2026-02-21T08:45:25.8297703Z %73 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:45:25.8298113Z %74 = tt.splat %72 : i32 -> tensor<512xi32> 2026-02-21T08:45:25.8298496Z %75 = arith.addi %74, %73 : tensor<512xi32> 2026-02-21T08:45:25.8298976Z %76 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:45:25.8299443Z %77 = arith.muli %76, %cst : tensor<8x1xi32> 2026-02-21T08:45:25.8299952Z %78 = tt.expand_dims %75 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:45:25.8300452Z %79 = tt.broadcast %77 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:45:25.8301098Z %80 = tt.broadcast %78 :
tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:45:25.8301583Z %81 = arith.addi %79, %80 : tensor<8x512xi32> 2026-02-21T08:45:25.8302038Z %82 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>> 2026-02-21T08:45:25.8302480Z %83 = tt.addptr %82, %81 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32> 2026-02-21T08:45:25.8302991Z %84 = tt.load %83 evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>> 2026-02-21T08:45:25.8303542Z %85 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:45:25.8304013Z %86 = arith.extf %84 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:45:25.8304487Z %87 = tt.broadcast %85 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:45:25.8304911Z %88 = arith.subf %86, %87 : tensor<8x512xf32> 2026-02-21T08:45:25.8305518Z %89 = tt.extern_elementwise %88 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:45:25.8306250Z %90 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:45:25.8306732Z %91 = tt.broadcast %90 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:45:25.8307155Z %92 = arith.divf %89, %91 : tensor<8x512xf32> 2026-02-21T08:45:25.8307548Z %93 = arith.truncf %92 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:45:25.8308033Z %94 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>> 2026-02-21T08:45:25.8308522Z %95 = tt.addptr %94, %81 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32> 2026-02-21T08:45:25.8308954Z tt.store %95, %93 : tensor<8x512x!tt.ptr<f16>> 2026-02-21T08:45:25.8309331Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:45:25.8309639Z %96 = arith.muli %c512_i32, %c2_i32 : i32 2026-02-21T08:45:25.8309969Z %97 = arith.addi %arg3, %96 : i32 2026-02-21T08:45:25.8310239Z %98 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:45:25.8310554Z %99 = tt.splat %97 : i32 -> tensor<512xi32> 2026-02-21T08:45:25.8310833Z %100 = arith.addi %99, %98 : tensor<512xi32>
2026-02-21T08:45:25.8311119Z %101 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:45:25.8311454Z %102 = arith.muli %101, %cst : tensor<8x1xi32> 2026-02-21T08:45:25.8311801Z %103 = tt.expand_dims %100 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:45:25.8312176Z %104 = tt.broadcast %102 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:45:25.8312486Z %105 = tt.broadcast %103 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:45:25.8312802Z %106 = arith.addi %104, %105 : tensor<8x512xi32> 2026-02-21T08:45:25.8313109Z %107 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>> 2026-02-21T08:45:25.8313429Z %108 = tt.addptr %107, %106 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32> 2026-02-21T08:45:25.8313806Z %109 = tt.load %108 evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>> 2026-02-21T08:45:25.8314270Z %110 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:45:25.8314627Z %111 = arith.extf %109 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:45:25.8314958Z %112 = tt.broadcast %110 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:45:25.8315236Z %113 = arith.subf %111, %112 : tensor<8x512xf32> 2026-02-21T08:45:25.8315677Z %114 = tt.extern_elementwise %113 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:45:25.8316150Z %115 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:45:25.8316526Z %116 = tt.broadcast %115 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:45:25.8316821Z %117 = arith.divf %114, %116 : tensor<8x512xf32> 2026-02-21T08:45:25.8317231Z %118 = arith.truncf %117 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:45:25.8317599Z %119 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>> 2026-02-21T08:45:25.8317943Z %120 = tt.addptr %119, %106 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32> 2026-02-21T08:45:25.8318287Z tt.store %120, %118 :
tensor<8x512x!tt.ptr> 2026-02-21T08:45:25.8318546Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:45:25.8318822Z %121 = arith.muli %c512_i32, %c3_i32 : i32 2026-02-21T08:45:25.8319066Z %122 = arith.addi %arg3, %121 : i32 2026-02-21T08:45:25.8319388Z %123 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:45:25.8319729Z %124 = tt.splat %122 : i32 -> tensor<512xi32> 2026-02-21T08:45:25.8319990Z %125 = arith.addi %124, %123 : tensor<512xi32> 2026-02-21T08:45:25.8320328Z %126 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:45:25.8320647Z %127 = arith.muli %126, %cst : tensor<8x1xi32> 2026-02-21T08:45:25.8321002Z %128 = tt.expand_dims %125 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:45:25.8321380Z %129 = tt.broadcast %127 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:45:25.8321737Z %130 = tt.broadcast %128 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:45:25.8322064Z %131 = arith.addi %129, %130 : tensor<8x512xi32> 2026-02-21T08:45:25.8322362Z %132 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:45:25.8322731Z %133 = tt.addptr %132, %131 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:45:25.8323099Z %134 = tt.load %133 evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:45:25.8323501Z %135 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:45:25.8323869Z %136 = arith.extf %134 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:45:25.8324188Z %137 = tt.broadcast %135 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:45:25.8324520Z %138 = arith.subf %136, %137 : tensor<8x512xf32> 2026-02-21T08:45:25.8324933Z %139 = tt.extern_elementwise %138 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:45:25.8325411Z %140 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:45:25.8325762Z 
%141 = tt.broadcast %140 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:45:25.8326040Z %142 = arith.divf %139, %141 : tensor<8x512xf32> 2026-02-21T08:45:25.8326341Z %143 = arith.truncf %142 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:45:25.8326654Z %144 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:45:25.8327002Z %145 = tt.addptr %144, %131 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:45:25.8327300Z tt.store %145, %143 : tensor<8x512x!tt.ptr> 2026-02-21T08:45:25.8327604Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:45:25.8327996Z %25 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:45:25.8328299Z %26 = tt.splat %c8192_i32_2 : i32 -> tensor<512xi32> 2026-02-21T08:45:25.8328577Z %27 = arith.addi %26, %25 : tensor<512xi32> 2026-02-21T08:45:25.8328871Z %28 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:45:25.8329190Z %29 = arith.muli %28, %cst : tensor<8x1xi32> 2026-02-21T08:45:25.8329479Z %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:45:25.8329833Z %31 = tt.broadcast %29 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:45:25.8330154Z %32 = tt.broadcast %30 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:45:25.8330419Z %33 = arith.addi %31, %32 : tensor<8x512xi32> 2026-02-21T08:45:25.8330790Z %34 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:45:25.8331102Z %35 = tt.addptr %34, %33 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:45:25.8331460Z %36 = tt.load %35 evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:45:25.8331813Z %37 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:45:25.8332159Z %38 = arith.extf %36 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:45:25.8332476Z %39 = tt.broadcast %37 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:45:25.8332803Z %40 = 
arith.subf %38, %39 : tensor<8x512xf32> 2026-02-21T08:45:25.8333232Z %41 = tt.extern_elementwise %40 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:45:25.8333675Z %42 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:45:25.8334013Z %43 = tt.broadcast %42 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:45:25.8334311Z %44 = arith.divf %41, %43 : tensor<8x512xf32> 2026-02-21T08:45:25.8334577Z %45 = arith.truncf %44 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:45:25.8334903Z %46 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:45:25.8335205Z %47 = tt.addptr %46, %33 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:45:25.8335523Z tt.store %47, %45 : tensor<8x512x!tt.ptr> 2026-02-21T08:45:25.8335832Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:45:25.8336152Z tt.return 2026-02-21T08:45:25.8336355Z } 2026-02-21T08:45:25.8336517Z } 2026-02-21T08:45:25.8336609Z 2026-02-21T08:45:25.8336709Z {-# 2026-02-21T08:45:25.8336878Z external_resources: { 2026-02-21T08:45:25.8337110Z mlir_reproducer: { 2026-02-21T08:45:25.8341491Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, 
tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:45:25.8346103Z disable_threading: false, 2026-02-21T08:45:25.8346338Z verify_each: true 2026-02-21T08:45:25.8346521Z } 2026-02-21T08:45:25.8346708Z } 2026-02-21T08:45:25.8346862Z #-} 2026-02-21T08:45:25.8347360Z /tmp/torchinductor_root/ho/cho633uoi2rsxfmseduqhol3myx3aexbxpany4zvj3jhnwb2qnfz.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:45:25.8348691Z /tmp/torchinductor_root/ho/cho633uoi2rsxfmseduqhol3myx3aexbxpany4zvj3jhnwb2qnfz.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:45:25.8349773Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:45:25.8350948Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 512], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'last'], maxnreg=32, num_sm_multiplier=64, num_stages=7, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[4, 4], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:45:25.8352031Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:45:25.8352338Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:45:30.5708996Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T08:45:30.5713823Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:45:30.5718438Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:45:30.5722155Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:45:30.5726229Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:45:30.5731031Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:45:30.5733234Z %cst = arith.constant dense<8704> : tensor<128x1xi32> 2026-02-21T08:45:30.5733587Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<128xf32> 2026-02-21T08:45:30.5733941Z %cst_1 = arith.constant dense<0xFF800000> : tensor<128xf32> 2026-02-21T08:45:30.5734220Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:45:30.5734526Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:45:30.5734776Z %c8704_i32 = arith.constant 8704 : i32 2026-02-21T08:45:30.5735031Z %c8704_i64 = arith.constant 8704 : i64 2026-02-21T08:45:30.5735281Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:45:30.5735646Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8704_i32], [%c8704_i64, %c1_i64] : , > 2026-02-21T08:45:30.5736041Z %1 = tt.get_program_id x : i32 2026-02-21T08:45:30.5736259Z %2 = arith.addi %1, %c1_i32 : i32 
2026-02-21T08:45:30.5736510Z %3 = arith.minsi %2, %c32_i32 : i32 2026-02-21T08:45:30.5736772Z scf.for %arg2 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T08:45:30.5737102Z %4 = arith.muli %arg2, %c128_i32 : i32 2026-02-21T08:45:30.5737378Z %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:45:30.5737709Z %6 = tt.splat %4 : i32 -> tensor<128xi32> 2026-02-21T08:45:30.5737971Z %7 = arith.addi %6, %5 : tensor<128xi32> 2026-02-21T08:45:30.5738209Z %c8448_i32 = arith.constant 8448 : i32 2026-02-21T08:45:30.5738817Z %c768_i32 = arith.constant 768 : i32 2026-02-21T08:45:30.5739238Z %8:2 = scf.for %arg3 = %c0_i32 to %c8448_i32 step %c768_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<128xf32>, tensor<128xf32>) : i32 { 2026-02-21T08:45:30.5739789Z %48 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:45:30.5740215Z %49 = arith.extf %48 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:45:30.5740514Z %50 = "tt.reduce"(%49) <{axis = 1 : i32}> ({ 2026-02-21T08:45:30.5740795Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:45:30.5741039Z %106 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:45:30.5741322Z tt.reduce.return %106 : f32 2026-02-21T08:45:30.5741634Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:45:30.5741954Z %51 = arith.truncf %50 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:45:30.5742454Z %52 = arith.extf %51 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:45:30.5742758Z %53 = arith.cmpf ogt, %arg4, %52 : tensor<128xf32> 2026-02-21T08:45:30.5743085Z %54 = arith.cmpf une, %arg4, %arg4 : tensor<128xf32> 2026-02-21T08:45:30.5743353Z %55 = arith.ori %53, %54 : tensor<128xi1> 2026-02-21T08:45:30.5743682Z %56 = arith.select %55, %arg4, %52 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:45:30.5744078Z %57 = arith.subf %arg4, %56 : tensor<128xf32> 2026-02-21T08:45:30.5744537Z %58 = tt.extern_elementwise %57 {libname = "", libpath = "", pure = true, symbol = 
"__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:45:30.5744964Z %59 = arith.mulf %arg5, %58 : tensor<128xf32> 2026-02-21T08:45:30.5745307Z %60 = tt.expand_dims %56 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:45:30.5745668Z %61 = tt.broadcast %60 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:45:30.5746005Z %62 = arith.subf %49, %61 : tensor<128x256xf32> 2026-02-21T08:45:30.5746443Z %63 = tt.extern_elementwise %62 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:45:30.5746914Z %64 = "tt.reduce"(%63) <{axis = 1 : i32}> ({ 2026-02-21T08:45:30.5747192Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:45:30.5747430Z %106 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:45:30.5747706Z tt.reduce.return %106 : f32 2026-02-21T08:45:30.5747946Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:45:30.5748230Z %65 = arith.addf %59, %64 : tensor<128xf32> 2026-02-21T08:45:30.5748476Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T08:45:30.5748758Z %66 = arith.muli %c256_i32, %c1_i32_4 : i32 2026-02-21T08:45:30.5749030Z %67 = arith.addi %arg3, %66 : i32 2026-02-21T08:45:30.5749376Z %68 = tt.descriptor_load %0[%4, %67] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:45:30.5749799Z %69 = arith.extf %68 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:45:30.5750093Z %70 = "tt.reduce"(%69) <{axis = 1 : i32}> ({ 2026-02-21T08:45:30.5750359Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:45:30.5750596Z %106 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:45:30.5750876Z tt.reduce.return %106 : f32 2026-02-21T08:45:30.5751126Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:45:30.5751395Z %71 = arith.truncf %70 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:45:30.5751753Z %72 = arith.extf %71 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:45:30.5752024Z %73 = arith.cmpf ogt, %56, %72 : tensor<128xf32> 
2026-02-21T08:45:30.5752305Z %74 = arith.cmpf une, %56, %56 : tensor<128xf32> 2026-02-21T08:45:30.5752546Z %75 = arith.ori %73, %74 : tensor<128xi1> 2026-02-21T08:45:30.5752848Z %76 = arith.select %75, %56, %72 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:45:30.5753231Z %77 = arith.subf %56, %76 : tensor<128xf32> 2026-02-21T08:45:30.5753629Z %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:45:30.5754053Z %79 = arith.mulf %65, %78 : tensor<128xf32> 2026-02-21T08:45:30.5754347Z %80 = tt.expand_dims %76 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:45:30.5754710Z %81 = tt.broadcast %80 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:45:30.5754995Z %82 = arith.subf %69, %81 : tensor<128x256xf32> 2026-02-21T08:45:30.5755430Z %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:45:30.5755872Z %84 = "tt.reduce"(%83) <{axis = 1 : i32}> ({ 2026-02-21T08:45:30.5756104Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:45:30.5756436Z %106 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:45:30.5756664Z tt.reduce.return %106 : f32 2026-02-21T08:45:30.5756927Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:45:30.5757165Z %85 = arith.addf %79, %84 : tensor<128xf32> 2026-02-21T08:45:30.5757429Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:45:30.5757694Z %86 = arith.muli %c256_i32, %c2_i32 : i32 2026-02-21T08:45:30.5757929Z %87 = arith.addi %arg3, %86 : i32 2026-02-21T08:45:30.5758271Z %88 = tt.descriptor_load %0[%4, %87] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:45:30.5758630Z %89 = arith.extf %88 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:45:30.5758935Z %90 = "tt.reduce"(%89) <{axis = 1 : i32}> ({ 2026-02-21T08:45:30.5759161Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:45:30.5759406Z %106 = arith.maxnumf 
%arg6, %arg7 : f32 2026-02-21T08:45:30.5759664Z tt.reduce.return %106 : f32 2026-02-21T08:45:30.5759893Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:45:30.5760187Z %91 = arith.truncf %90 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:45:30.5760472Z %92 = arith.extf %91 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:45:30.5760771Z %93 = arith.cmpf ogt, %76, %92 : tensor<128xf32> 2026-02-21T08:45:30.5761024Z %94 = arith.cmpf une, %76, %76 : tensor<128xf32> 2026-02-21T08:45:30.5761296Z %95 = arith.ori %93, %94 : tensor<128xi1> 2026-02-21T08:45:30.5761632Z %96 = arith.select %95, %76, %92 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:45:30.5761905Z %97 = arith.subf %76, %96 : tensor<128xf32> 2026-02-21T08:45:30.5762329Z %98 = tt.extern_elementwise %97 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:45:30.5762723Z %99 = arith.mulf %85, %98 : tensor<128xf32> 2026-02-21T08:45:30.5763053Z %100 = tt.expand_dims %96 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:45:30.5763403Z %101 = tt.broadcast %100 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:45:30.5763731Z %102 = arith.subf %89, %101 : tensor<128x256xf32> 2026-02-21T08:45:30.5764179Z %103 = tt.extern_elementwise %102 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:45:30.5764599Z %104 = "tt.reduce"(%103) <{axis = 1 : i32}> ({ 2026-02-21T08:45:30.5764868Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:45:30.5765096Z %106 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:45:30.5765353Z tt.reduce.return %106 : f32 2026-02-21T08:45:30.5765582Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:45:30.5765853Z %105 = arith.addf %99, %104 : tensor<128xf32> 2026-02-21T08:45:30.5766144Z scf.yield %96, %105 : tensor<128xf32>, tensor<128xf32> 2026-02-21T08:45:30.5766465Z } {tt.num_stages = 1 : i32} 2026-02-21T08:45:30.5766823Z 
%9 = tt.descriptor_load %0[%4, %c8448_i32] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:45:30.5767200Z %10 = arith.extf %9 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:45:30.5767502Z %11 = "tt.reduce"(%10) <{axis = 1 : i32}> ({ 2026-02-21T08:45:30.5767736Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:45:30.5767993Z %48 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:45:30.5768250Z tt.reduce.return %48 : f32 2026-02-21T08:45:30.5768475Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:45:30.5768768Z %12 = arith.truncf %11 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:45:30.5769052Z %13 = arith.extf %12 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:45:30.5769359Z %14 = arith.cmpf ogt, %8#0, %13 : tensor<128xf32> 2026-02-21T08:45:30.5769616Z %15 = arith.cmpf une, %8#0, %8#0 : tensor<128xf32> 2026-02-21T08:45:30.5769961Z %16 = arith.ori %14, %15 : tensor<128xi1> 2026-02-21T08:45:30.5770258Z %17 = arith.select %16, %8#0, %13 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:45:30.5770535Z %18 = arith.subf %8#0, %17 : tensor<128xf32> 2026-02-21T08:45:30.5770956Z %19 = tt.extern_elementwise %18 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:45:30.5771343Z %20 = arith.mulf %8#1, %19 : tensor<128xf32> 2026-02-21T08:45:30.5771712Z %21 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:45:30.5772046Z %22 = tt.broadcast %21 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:45:30.5772350Z %23 = arith.subf %10, %22 : tensor<128x256xf32> 2026-02-21T08:45:30.5772782Z %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:45:30.5773192Z %25 = "tt.reduce"(%24) <{axis = 1 : i32}> ({ 2026-02-21T08:45:30.5773468Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:45:30.5773692Z %48 = arith.addf %arg3, %arg4 : f32 
2026-02-21T08:45:30.5773943Z tt.reduce.return %48 : f32 2026-02-21T08:45:30.5774164Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:45:30.5774430Z %26 = arith.addf %20, %25 : tensor<128xf32> 2026-02-21T08:45:30.5774693Z %c8448_i32_2 = arith.constant 8448 : i32 2026-02-21T08:45:30.5774923Z %c768_i32_3 = arith.constant 768 : i32 2026-02-21T08:45:30.5775218Z scf.for %arg3 = %c0_i32 to %c8448_i32_2 step %c768_i32_3 : i32 { 2026-02-21T08:45:30.5775542Z %48 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:45:30.5775868Z %49 = tt.splat %arg3 : i32 -> tensor<256xi32> 2026-02-21T08:45:30.5776109Z %50 = arith.addi %49, %48 : tensor<256xi32> 2026-02-21T08:45:30.5776477Z %51 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:45:30.5776896Z %52 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:45:30.5777230Z %53 = arith.extf %51 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:45:30.5777564Z %54 = tt.broadcast %52 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:45:30.5777894Z %55 = arith.subf %53, %54 : tensor<128x256xf32> 2026-02-21T08:45:30.5778330Z %56 = tt.extern_elementwise %55 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:45:30.5778791Z %57 = tt.expand_dims %26 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:45:30.5779151Z %58 = tt.broadcast %57 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:45:30.5779437Z %59 = arith.divf %56, %58 : tensor<128x256xf32> 2026-02-21T08:45:30.5779755Z %60 = arith.truncf %59 : tensor<128x256xf32> to tensor<128x256xf16> 2026-02-21T08:45:30.5780244Z %61 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:45:30.5780552Z %62 = arith.muli %61, %cst : tensor<128x1xi32> 2026-02-21T08:45:30.5780877Z %63 = tt.expand_dims %50 {axis = 0 : i32} : tensor<256xi32> -> 
tensor<1x256xi32> 2026-02-21T08:45:30.5781208Z %64 = tt.broadcast %62 : tensor<128x1xi32> -> tensor<128x256xi32> 2026-02-21T08:45:30.5781583Z %65 = tt.broadcast %63 : tensor<1x256xi32> -> tensor<128x256xi32> 2026-02-21T08:45:30.5781889Z %66 = arith.addi %64, %65 : tensor<128x256xi32> 2026-02-21T08:45:30.5782171Z %67 = tt.splat %arg1 : !tt.ptr -> tensor<128x256x!tt.ptr> 2026-02-21T08:45:30.5782527Z %68 = tt.addptr %67, %66 : tensor<128x256x!tt.ptr>, tensor<128x256xi32> 2026-02-21T08:45:30.5782834Z tt.store %68, %60 : tensor<128x256x!tt.ptr> 2026-02-21T08:45:30.5783115Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T08:45:30.5783414Z %69 = arith.muli %c256_i32, %c1_i32_4 : i32 2026-02-21T08:45:30.5783681Z %70 = arith.addi %arg3, %69 : i32 2026-02-21T08:45:30.5783955Z %71 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:45:30.5784294Z %72 = tt.splat %70 : i32 -> tensor<256xi32> 2026-02-21T08:45:30.5784581Z %73 = arith.addi %72, %71 : tensor<256xi32> 2026-02-21T08:45:30.5784929Z %74 = tt.descriptor_load %0[%4, %70] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:45:30.5785363Z %75 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:45:30.5785710Z %76 = arith.extf %74 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:45:30.5786059Z %77 = tt.broadcast %75 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:45:30.5786384Z %78 = arith.subf %76, %77 : tensor<128x256xf32> 2026-02-21T08:45:30.5786817Z %79 = tt.extern_elementwise %78 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:45:30.5787339Z %80 = tt.expand_dims %26 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:45:30.5787686Z %81 = tt.broadcast %80 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:45:30.5788013Z %82 = arith.divf %79, %81 : tensor<128x256xf32> 2026-02-21T08:45:30.5788304Z %83 = arith.truncf %82 : 
tensor<128x256xf32> to tensor<128x256xf16> 2026-02-21T08:45:30.5788679Z %84 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:45:30.5789026Z %85 = arith.muli %84, %cst : tensor<128x1xi32> 2026-02-21T08:45:30.5789333Z %86 = tt.expand_dims %73 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T08:45:30.5789702Z %87 = tt.broadcast %85 : tensor<128x1xi32> -> tensor<128x256xi32> 2026-02-21T08:45:30.5790017Z %88 = tt.broadcast %86 : tensor<1x256xi32> -> tensor<128x256xi32> 2026-02-21T08:45:30.5790341Z %89 = arith.addi %87, %88 : tensor<128x256xi32> 2026-02-21T08:45:30.5790662Z %90 = tt.splat %arg1 : !tt.ptr -> tensor<128x256x!tt.ptr> 2026-02-21T08:45:30.5791000Z %91 = tt.addptr %90, %89 : tensor<128x256x!tt.ptr>, tensor<128x256xi32> 2026-02-21T08:45:30.5791341Z tt.store %91, %83 : tensor<128x256x!tt.ptr> 2026-02-21T08:45:30.5791619Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:45:30.5791891Z %92 = arith.muli %c256_i32, %c2_i32 : i32 2026-02-21T08:45:30.5792134Z %93 = arith.addi %arg3, %92 : i32 2026-02-21T08:45:30.5792444Z %94 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:45:30.5792772Z %95 = tt.splat %93 : i32 -> tensor<256xi32> 2026-02-21T08:45:30.5793032Z %96 = arith.addi %95, %94 : tensor<256xi32> 2026-02-21T08:45:30.5793387Z %97 = tt.descriptor_load %0[%4, %93] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:45:30.5793825Z %98 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:45:30.5794177Z %99 = arith.extf %97 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:45:30.5794496Z %100 = tt.broadcast %98 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:45:30.5794817Z %101 = arith.subf %99, %100 : tensor<128x256xf32> 2026-02-21T08:45:30.5795281Z %102 = tt.extern_elementwise %101 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:45:30.5795756Z 
%103 = tt.expand_dims %26 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:45:30.5796125Z %104 = tt.broadcast %103 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:45:30.5796421Z %105 = arith.divf %102, %104 : tensor<128x256xf32> 2026-02-21T08:45:30.5796748Z %106 = arith.truncf %105 : tensor<128x256xf32> to tensor<128x256xf16> 2026-02-21T08:45:30.5797178Z %107 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:45:30.5797496Z %108 = arith.muli %107, %cst : tensor<128x1xi32> 2026-02-21T08:45:30.5797830Z %109 = tt.expand_dims %96 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T08:45:30.5798174Z %110 = tt.broadcast %108 : tensor<128x1xi32> -> tensor<128x256xi32> 2026-02-21T08:45:30.5798524Z %111 = tt.broadcast %109 : tensor<1x256xi32> -> tensor<128x256xi32> 2026-02-21T08:45:30.5798817Z %112 = arith.addi %110, %111 : tensor<128x256xi32> 2026-02-21T08:45:30.5799139Z %113 = tt.splat %arg1 : !tt.ptr -> tensor<128x256x!tt.ptr> 2026-02-21T08:45:30.5799511Z %114 = tt.addptr %113, %112 : tensor<128x256x!tt.ptr>, tensor<128x256xi32> 2026-02-21T08:45:30.5799835Z tt.store %114, %106 : tensor<128x256x!tt.ptr> 2026-02-21T08:45:30.5800115Z } {tt.num_stages = 1 : i32} 2026-02-21T08:45:30.5800391Z %27 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:45:30.5800731Z %28 = tt.splat %c8448_i32_2 : i32 -> tensor<256xi32> 2026-02-21T08:45:30.5800990Z %29 = arith.addi %28, %27 : tensor<256xi32> 2026-02-21T08:45:30.5801374Z %30 = tt.descriptor_load %0[%4, %c8448_i32_2] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:45:30.5801848Z %31 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:45:30.5802182Z %32 = arith.extf %30 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:45:30.5802523Z %33 = tt.broadcast %31 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:45:30.5802813Z %34 = arith.subf %32, %33 : 
tensor<128x256xf32> 2026-02-21T08:45:30.5803270Z %35 = tt.extern_elementwise %34 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:45:30.5803767Z %36 = tt.expand_dims %26 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:45:30.5804109Z %37 = tt.broadcast %36 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:45:30.5804425Z %38 = arith.divf %35, %37 : tensor<128x256xf32> 2026-02-21T08:45:30.5804712Z %39 = arith.truncf %38 : tensor<128x256xf32> to tensor<128x256xf16> 2026-02-21T08:45:30.5805068Z %40 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:45:30.5805406Z %41 = arith.muli %40, %cst : tensor<128x1xi32> 2026-02-21T08:45:30.5805704Z %42 = tt.expand_dims %29 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T08:45:30.5806062Z %43 = tt.broadcast %41 : tensor<128x1xi32> -> tensor<128x256xi32> 2026-02-21T08:45:30.5806369Z %44 = tt.broadcast %42 : tensor<1x256xi32> -> tensor<128x256xi32> 2026-02-21T08:45:30.5806673Z %45 = arith.addi %43, %44 : tensor<128x256xi32> 2026-02-21T08:45:30.5806985Z %46 = tt.splat %arg1 : !tt.ptr -> tensor<128x256x!tt.ptr> 2026-02-21T08:45:30.5807399Z %47 = tt.addptr %46, %45 : tensor<128x256x!tt.ptr>, tensor<128x256xi32> 2026-02-21T08:45:30.5807724Z tt.store %47, %39 : tensor<128x256x!tt.ptr> 2026-02-21T08:45:30.5807987Z } {tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:45:30.5808246Z tt.return 2026-02-21T08:45:30.5808417Z } 2026-02-21T08:45:30.5808607Z } 2026-02-21T08:45:30.5808700Z 2026-02-21T08:45:30.5808771Z {-# 2026-02-21T08:45:30.5808969Z external_resources: { 2026-02-21T08:45:30.5809162Z mlir_reproducer: { 2026-02-21T08:45:30.5813612Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, 
tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:45:30.5818195Z disable_threading: false, 2026-02-21T08:45:30.5818442Z verify_each: true 2026-02-21T08:45:30.5818632Z } 2026-02-21T08:45:30.5818822Z } 2026-02-21T08:45:30.5818977Z #-} 2026-02-21T08:45:30.5819470Z /tmp/torchinductor_root/xo/cxoakznh6e5gx4h7szyr7cv3z4x2fdmhship7mdqmxtbzwj433yn.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:45:30.5820707Z 
/tmp/torchinductor_root/xo/cxoakznh6e5gx4h7szyr7cv3z4x2fdmhship7mdqmxtbzwj433yn.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:45:30.5821787Z [45s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:45:30.5822957Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=256, num_sm_multiplier=16, num_stages=8, num_warps=16, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:45:30.5824007Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:45:30.5824304Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:45:34.2292598Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 10.8 configs/s 2026-02-21T08:45:34.2303439Z [48s] Adaptive compile timeout: 30s (90% percentile=9.9s, bounds=[30.0s, 30s]) 2026-02-21T08:45:35.0098310Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1267.7 configs/s 2026-02-21T08:45:35.0735799Z [49s] Initial random population of 100, 5 starting points: 2026-02-21T08:45:35.0740031Z error=6 2026-02-21T08:45:35.0744232Z timeout=1 2026-02-21T08:45:35.0748733Z ok=93 2026-02-21T08:45:35.0754023Z min=0.0451 2026-02-21T08:45:35.0758856Z mid=0.7834 2026-02-21T08:45:35.0764253Z max=44.0392 2026-02-21T08:45:35.0764593Z best={'block_sizes': [1, 16384], 2026-02-21T08:45:35.0764918Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:45:35.0770746Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:45:35.0775283Z 'num_sm_multiplier': 8, 2026-02-21T08:45:35.0779315Z 'num_stages': 3, 2026-02-21T08:45:35.0784487Z 'num_warps': 1, 2026-02-21T08:45:35.0789068Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:45:35.0793087Z 'range_flattens': [False, None], 2026-02-21T08:45:35.0799254Z 'range_multi_buffers': [True, True], 2026-02-21T08:45:35.0804190Z 'range_num_stages': [1, 2], 2026-02-21T08:45:35.0804525Z 'range_unroll_factors': [0, 1], 2026-02-21T08:45:35.0804818Z 'range_warp_specializes': [True, None]} 2026-02-21T08:45:35.0810074Z [49s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:45:36.1323570Z [50s] Generation 1 starting: 77 neighbors, 5 active search path(s) 2026-02-21T08:45:51.1329191Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 3.8 configs/s 2026-02-21T08:45:56.6581288Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 82/82 14.9 configs/s 2026-02-21T08:46:01.4863499Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 209.4 2026-02-21T08:46:01.4867350Z configs/s 2026-02-21T08:46:01.7538173Z [76s] Generation 1 complete: 
2026-02-21T08:46:01.7540564Z ok=83 2026-02-21T08:46:01.7540787Z min=0.0430 2026-02-21T08:46:01.7540993Z mid=0.0594 2026-02-21T08:46:01.7541199Z max=0.2048 2026-02-21T08:46:01.7541425Z best={'block_sizes': [1, 16384], 2026-02-21T08:46:01.7541961Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:46:01.7542283Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:46:01.7542539Z 'num_stages': 3, 2026-02-21T08:46:01.7542724Z 'num_warps': 1, 2026-02-21T08:46:01.7542938Z 'pid_type': 'flat', 2026-02-21T08:46:01.7543136Z 'range_flattens': [None, None], 2026-02-21T08:46:01.7543388Z 'range_multi_buffers': [None, None], 2026-02-21T08:46:01.7543612Z 'range_num_stages': [0, 2], 2026-02-21T08:46:01.7543845Z 'range_unroll_factors': [0, 1], 2026-02-21T08:46:01.7544064Z 'range_warp_specializes': [None, True]} 2026-02-21T08:46:01.7552681Z [76s] Fitting surrogate: 183 points, 183 targets 2026-02-21T08:46:02.5674232Z [77s] Generation 2 starting: 55 neighbors, 5 active search path(s) 2026-02-21T08:46:12.6312966Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57/57 3.3 configs/s 2026-02-21T08:46:16.6946706Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 57/57 14.1 configs/s 2026-02-21T08:46:19.6157860Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 443.3 2026-02-21T08:46:19.6159935Z configs/s 2026-02-21T08:46:19.7583692Z [94s] Generation 2 complete: 2026-02-21T08:46:19.7587749Z ok=60 2026-02-21T08:46:19.7592170Z min=0.0348 2026-02-21T08:46:19.7593946Z mid=0.0553 2026-02-21T08:46:19.7594202Z max=0.1639 2026-02-21T08:46:19.7594395Z best={'block_sizes': [1, 16384], 2026-02-21T08:46:19.7594707Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:46:19.7594992Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:46:19.7595255Z 'num_stages': 3, 2026-02-21T08:46:19.7595437Z 'num_warps': 1, 2026-02-21T08:46:19.7595638Z 'pid_type': 'flat', 2026-02-21T08:46:19.7604909Z 'range_flattens': [None, None], 
2026-02-21T08:46:19.7605151Z 'range_multi_buffers': [None, None], 2026-02-21T08:46:19.7605410Z 'range_num_stages': [0, 2], 2026-02-21T08:46:19.7605969Z 'range_unroll_factors': [0, 1], 2026-02-21T08:46:19.7606189Z 'range_warp_specializes': [None, True]} 2026-02-21T08:46:19.7606862Z [94s] Fitting surrogate: 243 points, 243 targets 2026-02-21T08:46:20.4056457Z [95s] Generation 3 starting: 43 neighbors, 4 active search path(s) 2026-02-21T08:46:27.4099548Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45/45 3.7 configs/s 2026-02-21T08:46:30.6686903Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 45/45 13.9 configs/s 2026-02-21T08:46:32.2426505Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 638.3 2026-02-21T08:46:32.2427408Z configs/s 2026-02-21T08:46:32.3579741Z [107s] Generation 3 complete: 2026-02-21T08:46:32.3581124Z ok=47 2026-02-21T08:46:32.3581374Z min=0.0347 2026-02-21T08:46:32.3581835Z mid=0.0553 2026-02-21T08:46:32.3582039Z max=1.2422 2026-02-21T08:46:32.3582630Z best={'block_sizes': [1, 16384], 2026-02-21T08:46:32.3582965Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:46:32.3583249Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:46:32.3583504Z 'num_stages': 3, 2026-02-21T08:46:32.3583678Z 'num_warps': 1, 2026-02-21T08:46:32.3583885Z 'pid_type': 'flat', 2026-02-21T08:46:32.3584105Z 'range_flattens': [None, None], 2026-02-21T08:46:32.3584315Z 'range_multi_buffers': [None, None], 2026-02-21T08:46:32.3584557Z 'range_num_stages': [0, 2], 2026-02-21T08:46:32.3584763Z 'range_unroll_factors': [0, 1], 2026-02-21T08:46:32.3585006Z 'range_warp_specializes': [None, True]} 2026-02-21T08:46:32.3595841Z [107s] Fitting surrogate: 290 points, 290 targets 2026-02-21T08:46:32.8943245Z [107s] Generation 4 starting: 32 neighbors, 3 active search path(s) 2026-02-21T08:46:37.7933535Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 7.2 configs/s 2026-02-21T08:46:39.7433973Z 
Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 33/33 17.3 configs/s 2026-02-21T08:46:41.0894253Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 745.2 2026-02-21T08:46:41.0898833Z configs/s 2026-02-21T08:46:41.1841615Z [115s] Generation 4 complete: 2026-02-21T08:46:41.1845502Z ok=35 2026-02-21T08:46:41.1849804Z min=0.0348 2026-02-21T08:46:41.1851972Z mid=0.0532 2026-02-21T08:46:41.1852216Z max=0.1537 2026-02-21T08:46:41.1852478Z best={'block_sizes': [1, 16384], 2026-02-21T08:46:41.1852791Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:46:41.1853104Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:46:41.1853378Z 'num_stages': 3, 2026-02-21T08:46:41.1853602Z 'num_warps': 2, 2026-02-21T08:46:41.1853854Z 'pid_type': 'flat', 2026-02-21T08:46:41.1854052Z 'range_flattens': [None, None], 2026-02-21T08:46:41.1854303Z 'range_multi_buffers': [None, False], 2026-02-21T08:46:41.1854508Z 'range_num_stages': [0, 1], 2026-02-21T08:46:41.1854794Z 'range_unroll_factors': [0, 1], 2026-02-21T08:46:41.1855086Z 'range_warp_specializes': [None, True]} 2026-02-21T08:46:41.1855391Z [115s] Fitting surrogate: 325 points, 325 targets 2026-02-21T08:46:41.6091043Z [116s] Generation 5 starting: 20 neighbors, 2 active search path(s) 2026-02-21T08:46:45.3620077Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 6.7 configs/s 2026-02-21T08:46:46.6725485Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 22/22 17.4 configs/s 2026-02-21T08:46:47.7327004Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 940.7 2026-02-21T08:46:47.7328776Z configs/s 2026-02-21T08:46:47.8141277Z [122s] Generation 5 complete: 2026-02-21T08:46:47.8145142Z ok=23 2026-02-21T08:46:47.8148941Z min=0.0348 2026-02-21T08:46:47.8152854Z mid=0.0513 2026-02-21T08:46:47.8154425Z max=0.1331 2026-02-21T08:46:47.8154638Z best={'block_sizes': [1, 16384], 2026-02-21T08:46:47.8155280Z 'indexing': ['pointer', 'tensor_descriptor', 
'pointer'], 2026-02-21T08:46:47.8155708Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:46:47.8155963Z 'num_stages': 3, 2026-02-21T08:46:47.8156143Z 'num_warps': 2, 2026-02-21T08:46:47.8156345Z 'pid_type': 'flat', 2026-02-21T08:46:47.8156565Z 'range_flattens': [None, None], 2026-02-21T08:46:47.8156781Z 'range_multi_buffers': [None, False], 2026-02-21T08:46:47.8157028Z 'range_num_stages': [0, 1], 2026-02-21T08:46:47.8157231Z 'range_unroll_factors': [0, 1], 2026-02-21T08:46:47.8157478Z 'range_warp_specializes': [None, True]} 2026-02-21T08:46:47.8157805Z [122s] Fitting surrogate: 348 points, 348 targets 2026-02-21T08:46:48.1229737Z [122s] Generation 6 starting: 13 neighbors, 1 active search path(s) 2026-02-21T08:46:51.1131458Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 4.2 configs/s 2026-02-21T08:46:51.8949192Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 13/13 17.7 configs/s 2026-02-21T08:46:52.8434473Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1050.5 2026-02-21T08:46:52.8437876Z configs/s 2026-02-21T08:46:52.9182969Z [127s] Generation 6 complete: 2026-02-21T08:46:52.9186903Z ok=15 2026-02-21T08:46:52.9191484Z min=0.0348 2026-02-21T08:46:52.9196438Z mid=0.0471 2026-02-21T08:46:52.9198354Z max=0.0798 2026-02-21T08:46:52.9198654Z best={'block_sizes': [1, 16384], 2026-02-21T08:46:52.9204439Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:46:52.9206493Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:46:52.9206795Z 'num_stages': 3, 2026-02-21T08:46:52.9206987Z 'num_warps': 2, 2026-02-21T08:46:52.9207202Z 'pid_type': 'flat', 2026-02-21T08:46:52.9207406Z 'range_flattens': [None, None], 2026-02-21T08:46:52.9207655Z 'range_multi_buffers': [None, False], 2026-02-21T08:46:52.9207909Z 'range_num_stages': [0, 1], 2026-02-21T08:46:52.9208124Z 'range_unroll_factors': [0, 1], 2026-02-21T08:46:52.9208382Z 'range_warp_specializes': [None, True]} 2026-02-21T08:46:52.9208641Z 
[127s] Fitting surrogate: 363 points, 363 targets 2026-02-21T08:46:53.0962996Z [127s] Autotuning complete in 127.8s after searching 351 configs. 2026-02-21T08:46:53.0967058Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:46:53.0972277Z @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:46:53.0973369Z 2026-02-21T08:46:53.0977922Z [127s] Code of selected kernel: /tmp/torchinductor_root/gr/cgrf4vzaiybqq2vtl4f4ipkrbsrfxtqdt5p3nddu75heazgfy3hn.py 2026-02-21T08:46:53.7931333Z WARNING:tritonbench.utils.triton_op:Completed input ID 66: 2026-02-21T08:46:53.7933789Z (M, N) 2026-02-21T08:46:53.7934024Z ------------ 2026-02-21T08:46:53.7934207Z (4096, 8704) 2026-02-21T08:46:53.7934398Z 2026-02-21T08:46:53.7946577Z 70%|███████ | 14/20 [37:51<16:37, 166.31s/it]WARNING:tritonbench.utils.triton_op:Running input ID 71: 2026-02-21T08:46:53.7950687Z (M, N) 2026-02-21T08:46:53.7952189Z ------------ 2026-02-21T08:46:53.7952435Z (4096, 9344) 2026-02-21T08:46:53.7952804Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T08:46:55.0184371Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T08:46:56.3716248Z INFO:tritonbench.utils.triton_op:Took 2.60ms to get benchmark function for torch_compile_softmax 2026-02-21T08:47:00.0684098Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:47:00.0689597Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:47:00.0693675Z 'dtype': 'torch.float16', 2026-02-21T08:47:00.0696960Z 'shape': (4096, 9344), 2026-02-21T08:47:00.0699356Z 'stride': (9344, 1)},), 2026-02-21T08:47:00.0699680Z 'kwargs': {}} 
2026-02-21T08:47:00.0704539Z INFO:tritonbench.utils.triton_op:Took 2.40ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T08:47:00.2490805Z [0s] Autotune random seed: 2134816249 2026-02-21T08:47:00.3925338Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:47:39.6882572Z [39s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[4, 1], range_warp_specializes=[None, None]) 2026-02-21T08:47:39.6897224Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s 2026-02-21T08:47:42.0299828Z module { 2026-02-21T08:47:42.0304570Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:47:42.0309600Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:47:42.0310962Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:47:42.0311242Z %c148_i32 = arith.constant 148 : i32 2026-02-21T08:47:42.0311905Z %cst = arith.constant dense<9344> : tensor<128x1xi32> 2026-02-21T08:47:42.0312227Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<128xf32> 2026-02-21T08:47:42.0312566Z %cst_1 = arith.constant dense<0xFF800000> : tensor<128xf32> 2026-02-21T08:47:42.0312856Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:47:42.0313087Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:47:42.0313320Z %c9344_i32 = arith.constant 9344 : i32 2026-02-21T08:47:42.0313539Z %c9344_i64 = arith.constant 9344 : i64 2026-02-21T08:47:42.0313804Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:47:42.0314185Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, 
%c9344_i32], [%c9344_i64, %c1_i64] : , > 2026-02-21T08:47:42.0314574Z %1 = tt.get_program_id x : i32 2026-02-21T08:47:42.0314855Z scf.for %arg2 = %1 to %c32_i32 step %c148_i32 : i32 { 2026-02-21T08:47:42.0315117Z %2 = arith.muli %arg2, %c128_i32 : i32 2026-02-21T08:47:42.0315445Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:47:42.0315758Z %4 = tt.splat %2 : i32 -> tensor<128xi32> 2026-02-21T08:47:42.0316037Z %5 = arith.addi %4, %3 : tensor<128xi32> 2026-02-21T08:47:42.0316276Z %c9216_i32 = arith.constant 9216 : i32 2026-02-21T08:47:42.0316562Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:47:42.0317040Z %6:2 = scf.for %arg3 = %c0_i32 to %c9216_i32 step %c512_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<128xf32>, tensor<128xf32>) : i32 { 2026-02-21T08:47:42.0317497Z %55 = tt.splat %arg3 : i32 -> tensor<128xi32> 2026-02-21T08:47:42.0318146Z %56 = arith.addi %55, %3 : tensor<128xi32> 2026-02-21T08:47:42.0318475Z %57 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:47:42.0318848Z %58 = arith.muli %57, %cst : tensor<128x1xi32> 2026-02-21T08:47:42.0319166Z %59 = tt.expand_dims %56 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:47:42.0319544Z %60 = tt.broadcast %58 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0319897Z %61 = tt.broadcast %59 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0320193Z %62 = arith.addi %60, %61 : tensor<128x128xi32> 2026-02-21T08:47:42.0320510Z %63 = tt.splat %arg0 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0320831Z %64 = tt.addptr %63, %62 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:47:42.0321318Z %65 = tt.load %64 evictionPolicy = evict_first : tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0321694Z %66 = arith.extf %65 : tensor<128x128xf16> to tensor<128x128xf32> 2026-02-21T08:47:42.0321951Z %67 = "tt.reduce"(%66) <{axis = 1 : i32}> ({ 
2026-02-21T08:47:42.0322161Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:47:42.0322359Z %173 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:47:42.0322570Z tt.reduce.return %173 : f32 2026-02-21T08:47:42.0322855Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:47:42.0323093Z %68 = arith.truncf %67 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:47:42.0323363Z %69 = arith.extf %68 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:47:42.0323611Z %70 = arith.cmpf ogt, %arg4, %69 : tensor<128xf32> 2026-02-21T08:47:42.0323855Z %71 = arith.cmpf une, %arg4, %arg4 : tensor<128xf32> 2026-02-21T08:47:42.0324084Z %72 = arith.ori %70, %71 : tensor<128xi1> 2026-02-21T08:47:42.0324339Z %73 = arith.select %72, %arg4, %69 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:47:42.0324606Z %74 = arith.subf %arg4, %73 : tensor<128xf32> 2026-02-21T08:47:42.0324987Z %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:47:42.0325373Z %76 = arith.mulf %arg5, %75 : tensor<128xf32> 2026-02-21T08:47:42.0325638Z %77 = tt.expand_dims %73 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:42.0325952Z %78 = tt.broadcast %77 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:47:42.0326191Z %79 = arith.subf %66, %78 : tensor<128x128xf32> 2026-02-21T08:47:42.0326567Z %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:47:42.0326942Z %81 = "tt.reduce"(%80) <{axis = 1 : i32}> ({ 2026-02-21T08:47:42.0327133Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:47:42.0327322Z %173 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:47:42.0327506Z tt.reduce.return %173 : f32 2026-02-21T08:47:42.0327701Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:47:42.0327895Z %82 = arith.addf %76, %81 : tensor<128xf32> 2026-02-21T08:47:42.0328093Z %c1_i32 = arith.constant 1 : 
i32 2026-02-21T08:47:42.0328284Z %83 = arith.muli %c128_i32, %c1_i32 : i32 2026-02-21T08:47:42.0328468Z %84 = arith.addi %arg3, %83 : i32 2026-02-21T08:47:42.0328659Z %85 = tt.splat %84 : i32 -> tensor<128xi32> 2026-02-21T08:47:42.0328855Z %86 = arith.addi %85, %3 : tensor<128xi32> 2026-02-21T08:47:42.0329106Z %87 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:47:42.0329367Z %88 = arith.muli %87, %cst : tensor<128x1xi32> 2026-02-21T08:47:42.0329634Z %89 = tt.expand_dims %86 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:47:42.0330016Z %90 = tt.broadcast %88 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0330280Z %91 = tt.broadcast %89 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0330522Z %92 = arith.addi %90, %91 : tensor<128x128xi32> 2026-02-21T08:47:42.0330757Z %93 = tt.splat %arg0 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0331045Z %94 = tt.addptr %93, %92 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:47:42.0331347Z %95 = tt.load %94 evictionPolicy = evict_first : tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0331683Z %96 = arith.extf %95 : tensor<128x128xf16> to tensor<128x128xf32> 2026-02-21T08:47:42.0331924Z %97 = "tt.reduce"(%96) <{axis = 1 : i32}> ({ 2026-02-21T08:47:42.0332114Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:47:42.0332307Z %173 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:47:42.0332561Z tt.reduce.return %173 : f32 2026-02-21T08:47:42.0332768Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:47:42.0332996Z %98 = arith.truncf %97 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:47:42.0333257Z %99 = arith.extf %98 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:47:42.0333495Z %100 = arith.cmpf ogt, %73, %99 : tensor<128xf32> 2026-02-21T08:47:42.0333709Z %101 = arith.cmpf une, %73, %73 : tensor<128xf32> 2026-02-21T08:47:42.0333925Z %102 = arith.ori %100, %101 : tensor<128xi1> 
2026-02-21T08:47:42.0334165Z %103 = arith.select %102, %73, %99 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:47:42.0334411Z %104 = arith.subf %73, %103 : tensor<128xf32> 2026-02-21T08:47:42.0334773Z %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:47:42.0335141Z %106 = arith.mulf %82, %105 : tensor<128xf32> 2026-02-21T08:47:42.0335410Z %107 = tt.expand_dims %103 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:42.0335714Z %108 = tt.broadcast %107 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:47:42.0335967Z %109 = arith.subf %96, %108 : tensor<128x128xf32> 2026-02-21T08:47:42.0336341Z %110 = tt.extern_elementwise %109 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:47:42.0336724Z %111 = "tt.reduce"(%110) <{axis = 1 : i32}> ({ 2026-02-21T08:47:42.0336924Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:47:42.0337103Z %173 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:47:42.0337295Z tt.reduce.return %173 : f32 2026-02-21T08:47:42.0337484Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:47:42.0337691Z %112 = arith.addf %106, %111 : tensor<128xf32> 2026-02-21T08:47:42.0337884Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:47:42.0338079Z %113 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T08:47:42.0338271Z %114 = arith.addi %arg3, %113 : i32 2026-02-21T08:47:42.0338469Z %115 = tt.splat %114 : i32 -> tensor<128xi32> 2026-02-21T08:47:42.0338676Z %116 = arith.addi %115, %3 : tensor<128xi32> 2026-02-21T08:47:42.0338922Z %117 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:47:42.0339194Z %118 = arith.muli %117, %cst : tensor<128x1xi32> 2026-02-21T08:47:42.0339452Z %119 = tt.expand_dims %116 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:47:42.0339760Z %120 = tt.broadcast %118 : 
tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0340036Z %121 = tt.broadcast %119 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0340280Z %122 = arith.addi %120, %121 : tensor<128x128xi32> 2026-02-21T08:47:42.0340526Z %123 = tt.splat %arg0 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0340816Z %124 = tt.addptr %123, %122 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:47:42.0341199Z %125 = tt.load %124 evictionPolicy = evict_first : tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0341497Z %126 = arith.extf %125 : tensor<128x128xf16> to tensor<128x128xf32> 2026-02-21T08:47:42.0341794Z %127 = "tt.reduce"(%126) <{axis = 1 : i32}> ({ 2026-02-21T08:47:42.0341988Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:47:42.0342169Z %173 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:47:42.0342367Z tt.reduce.return %173 : f32 2026-02-21T08:47:42.0342554Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:47:42.0342786Z %128 = arith.truncf %127 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:47:42.0343040Z %129 = arith.extf %128 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:47:42.0343286Z %130 = arith.cmpf ogt, %103, %129 : tensor<128xf32> 2026-02-21T08:47:42.0343581Z %131 = arith.cmpf une, %103, %103 : tensor<128xf32> 2026-02-21T08:47:42.0343801Z %132 = arith.ori %130, %131 : tensor<128xi1> 2026-02-21T08:47:42.0344053Z %133 = arith.select %132, %103, %129 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:47:42.0344309Z %134 = arith.subf %103, %133 : tensor<128xf32> 2026-02-21T08:47:42.0344691Z %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:47:42.0345069Z %136 = arith.mulf %112, %135 : tensor<128xf32> 2026-02-21T08:47:42.0345347Z %137 = tt.expand_dims %133 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:42.0345669Z %138 = tt.broadcast %137 : tensor<128x1xf32> -> tensor<128x128xf32> 
2026-02-21T08:47:42.0345926Z %139 = arith.subf %126, %138 : tensor<128x128xf32> 2026-02-21T08:47:42.0346325Z %140 = tt.extern_elementwise %139 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:47:42.0346717Z %141 = "tt.reduce"(%140) <{axis = 1 : i32}> ({ 2026-02-21T08:47:42.0346936Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:47:42.0347140Z %173 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:47:42.0347335Z tt.reduce.return %173 : f32 2026-02-21T08:47:42.0347538Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:47:42.0347745Z %142 = arith.addf %136, %141 : tensor<128xf32> 2026-02-21T08:47:42.0347951Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:47:42.0348146Z %143 = arith.muli %c128_i32, %c3_i32 : i32 2026-02-21T08:47:42.0348353Z %144 = arith.addi %arg3, %143 : i32 2026-02-21T08:47:42.0348553Z %145 = tt.splat %144 : i32 -> tensor<128xi32> 2026-02-21T08:47:42.0348769Z %146 = arith.addi %145, %3 : tensor<128xi32> 2026-02-21T08:47:42.0349035Z %147 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:47:42.0349319Z %148 = arith.muli %147, %cst : tensor<128x1xi32> 2026-02-21T08:47:42.0349595Z %149 = tt.expand_dims %146 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:47:42.0349906Z %150 = tt.broadcast %148 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0350195Z %151 = tt.broadcast %149 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0350448Z %152 = arith.addi %150, %151 : tensor<128x128xi32> 2026-02-21T08:47:42.0350711Z %153 = tt.splat %arg0 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0351018Z %154 = tt.addptr %153, %152 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:47:42.0351343Z %155 = tt.load %154 evictionPolicy = evict_first : tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0351685Z %156 = arith.extf %155 : tensor<128x128xf16> to tensor<128x128xf32> 
2026-02-21T08:47:42.0351922Z %157 = "tt.reduce"(%156) <{axis = 1 : i32}> ({ 2026-02-21T08:47:42.0352176Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:47:42.0352364Z %173 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:47:42.0352553Z tt.reduce.return %173 : f32 2026-02-21T08:47:42.0352746Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:47:42.0352970Z %158 = arith.truncf %157 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:47:42.0353227Z %159 = arith.extf %158 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:47:42.0353459Z %160 = arith.cmpf ogt, %133, %159 : tensor<128xf32> 2026-02-21T08:47:42.0353684Z %161 = arith.cmpf une, %133, %133 : tensor<128xf32> 2026-02-21T08:47:42.0353894Z %162 = arith.ori %160, %161 : tensor<128xi1> 2026-02-21T08:47:42.0354134Z %163 = arith.select %162, %133, %159 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:47:42.0354382Z %164 = arith.subf %133, %163 : tensor<128xf32> 2026-02-21T08:47:42.0354828Z %165 = tt.extern_elementwise %164 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:47:42.0355200Z %166 = arith.mulf %142, %165 : tensor<128xf32> 2026-02-21T08:47:42.0355457Z %167 = tt.expand_dims %163 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:42.0355762Z %168 = tt.broadcast %167 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:47:42.0356018Z %169 = arith.subf %156, %168 : tensor<128x128xf32> 2026-02-21T08:47:42.0356388Z %170 = tt.extern_elementwise %169 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:47:42.0356764Z %171 = "tt.reduce"(%170) <{axis = 1 : i32}> ({ 2026-02-21T08:47:42.0356952Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:47:42.0357141Z %173 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:47:42.0357327Z tt.reduce.return %173 : f32 2026-02-21T08:47:42.0357522Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:47:42.0357732Z %172 
= arith.addf %166, %171 : tensor<128xf32> 2026-02-21T08:47:42.0357954Z scf.yield %163, %172 : tensor<128xf32>, tensor<128xf32> 2026-02-21T08:47:42.0358234Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:47:42.0358498Z %7 = tt.splat %c9216_i32 : i32 -> tensor<128xi32> 2026-02-21T08:47:42.0358726Z %8 = arith.addi %7, %3 : tensor<128xi32> 2026-02-21T08:47:42.0358966Z %9 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:47:42.0359268Z %10 = arith.muli %9, %cst : tensor<128x1xi32> 2026-02-21T08:47:42.0359522Z %11 = tt.expand_dims %8 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:47:42.0359809Z %12 = tt.broadcast %10 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0360091Z %13 = tt.broadcast %11 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0360335Z %14 = arith.addi %12, %13 : tensor<128x128xi32> 2026-02-21T08:47:42.0360589Z %15 = tt.splat %arg0 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0360881Z %16 = tt.addptr %15, %14 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:47:42.0361208Z %17 = tt.load %16 evictionPolicy = evict_first : tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0361516Z %18 = arith.extf %17 : tensor<128x128xf16> to tensor<128x128xf32> 2026-02-21T08:47:42.0361786Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:47:42.0361992Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:47:42.0362182Z %55 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:47:42.0362386Z tt.reduce.return %55 : f32 2026-02-21T08:47:42.0362578Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:47:42.0362818Z %20 = arith.truncf %19 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:47:42.0363085Z %21 = arith.extf %20 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:47:42.0363409Z %22 = arith.cmpf ogt, %6#0, %21 : tensor<128xf32> 2026-02-21T08:47:42.0363640Z %23 = arith.cmpf une, %6#0, %6#0 : tensor<128xf32> 
2026-02-21T08:47:42.0363854Z %24 = arith.ori %22, %23 : tensor<128xi1> 2026-02-21T08:47:42.0364099Z %25 = arith.select %24, %6#0, %21 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:47:42.0364343Z %26 = arith.subf %6#0, %25 : tensor<128xf32> 2026-02-21T08:47:42.0364720Z %27 = tt.extern_elementwise %26 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:47:42.0365093Z %28 = arith.mulf %6#1, %27 : tensor<128xf32> 2026-02-21T08:47:42.0365348Z %29 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:42.0365658Z %30 = tt.broadcast %29 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:47:42.0365904Z %31 = arith.subf %18, %30 : tensor<128x128xf32> 2026-02-21T08:47:42.0366344Z %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:47:42.0366732Z %33 = "tt.reduce"(%32) <{axis = 1 : i32}> ({ 2026-02-21T08:47:42.0366927Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:47:42.0367117Z %55 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:47:42.0367310Z tt.reduce.return %55 : f32 2026-02-21T08:47:42.0367507Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:47:42.0367711Z %34 = arith.addf %28, %33 : tensor<128xf32> 2026-02-21T08:47:42.0367921Z %c9216_i32_2 = arith.constant 9216 : i32 2026-02-21T08:47:42.0368122Z %c512_i32_3 = arith.constant 512 : i32 2026-02-21T08:47:42.0368365Z scf.for %arg3 = %c0_i32 to %c9216_i32_2 step %c512_i32_3 : i32 { 2026-02-21T08:47:42.0368627Z %55 = tt.splat %arg3 : i32 -> tensor<128xi32> 2026-02-21T08:47:42.0368838Z %56 = arith.addi %55, %3 : tensor<128xi32> 2026-02-21T08:47:42.0369161Z %57 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<128x128xf16> 2026-02-21T08:47:42.0369512Z %58 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:42.0369810Z %59 = arith.extf %57 : 
tensor<128x128xf16> to tensor<128x128xf32> 2026-02-21T08:47:42.0370078Z %60 = tt.broadcast %58 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:47:42.0370319Z %61 = arith.subf %59, %60 : tensor<128x128xf32> 2026-02-21T08:47:42.0370696Z %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:47:42.0371117Z %63 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:42.0371415Z %64 = tt.broadcast %63 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:47:42.0371687Z %65 = arith.divf %62, %64 : tensor<128x128xf32> 2026-02-21T08:47:42.0371942Z %66 = arith.truncf %65 : tensor<128x128xf32> to tensor<128x128xf16> 2026-02-21T08:47:42.0372237Z %67 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:47:42.0372499Z %68 = arith.muli %67, %cst : tensor<128x1xi32> 2026-02-21T08:47:42.0372755Z %69 = tt.expand_dims %56 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:47:42.0373038Z %70 = tt.broadcast %68 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0373307Z %71 = tt.broadcast %69 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0373546Z %72 = arith.addi %70, %71 : tensor<128x128xi32> 2026-02-21T08:47:42.0373780Z %73 = tt.splat %arg1 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0374066Z %74 = tt.addptr %73, %72 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:47:42.0374326Z tt.store %74, %66 : tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0374538Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:47:42.0374789Z %75 = arith.muli %c128_i32, %c1_i32 : i32 2026-02-21T08:47:42.0374984Z %76 = arith.addi %arg3, %75 : i32 2026-02-21T08:47:42.0375177Z %77 = tt.splat %76 : i32 -> tensor<128xi32> 2026-02-21T08:47:42.0375373Z %78 = arith.addi %77, %3 : tensor<128xi32> 2026-02-21T08:47:42.0375662Z %79 = tt.descriptor_load %0[%2, %76] : 
!tt.tensordesc> -> tensor<128x128xf16> 2026-02-21T08:47:42.0376001Z %80 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:42.0376288Z %81 = arith.extf %79 : tensor<128x128xf16> to tensor<128x128xf32> 2026-02-21T08:47:42.0376545Z %82 = tt.broadcast %80 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:47:42.0376789Z %83 = arith.subf %81, %82 : tensor<128x128xf32> 2026-02-21T08:47:42.0377215Z %84 = tt.extern_elementwise %83 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:47:42.0377633Z %85 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:42.0377929Z %86 = tt.broadcast %85 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:47:42.0378166Z %87 = arith.divf %84, %86 : tensor<128x128xf32> 2026-02-21T08:47:42.0378407Z %88 = arith.truncf %87 : tensor<128x128xf32> to tensor<128x128xf16> 2026-02-21T08:47:42.0378700Z %89 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:47:42.0378965Z %90 = arith.muli %89, %cst : tensor<128x1xi32> 2026-02-21T08:47:42.0379223Z %91 = tt.expand_dims %78 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:47:42.0379510Z %92 = tt.broadcast %90 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0379779Z %93 = tt.broadcast %91 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0380015Z %94 = arith.addi %92, %93 : tensor<128x128xi32> 2026-02-21T08:47:42.0380259Z %95 = tt.splat %arg1 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0380542Z %96 = tt.addptr %95, %94 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:47:42.0380802Z tt.store %96, %88 : tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0381010Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:47:42.0381195Z %97 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T08:47:42.0381390Z %98 = arith.addi %arg3, %97 : i32 
2026-02-21T08:47:42.0381603Z %99 = tt.splat %98 : i32 -> tensor<128xi32> 2026-02-21T08:47:42.0381811Z %100 = arith.addi %99, %3 : tensor<128xi32> 2026-02-21T08:47:42.0382105Z %101 = tt.descriptor_load %0[%2, %98] : !tt.tensordesc> -> tensor<128x128xf16> 2026-02-21T08:47:42.0382445Z %102 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:42.0382745Z %103 = arith.extf %101 : tensor<128x128xf16> to tensor<128x128xf32> 2026-02-21T08:47:42.0383018Z %104 = tt.broadcast %102 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:47:42.0383271Z %105 = arith.subf %103, %104 : tensor<128x128xf32> 2026-02-21T08:47:42.0383654Z %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:47:42.0384073Z %107 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:42.0384375Z %108 = tt.broadcast %107 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:47:42.0384626Z %109 = arith.divf %106, %108 : tensor<128x128xf32> 2026-02-21T08:47:42.0384884Z %110 = arith.truncf %109 : tensor<128x128xf32> to tensor<128x128xf16> 2026-02-21T08:47:42.0385180Z %111 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:47:42.0385466Z %112 = arith.muli %111, %cst : tensor<128x1xi32> 2026-02-21T08:47:42.0385799Z %113 = tt.expand_dims %100 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:47:42.0386090Z %114 = tt.broadcast %112 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0386363Z %115 = tt.broadcast %113 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0386606Z %116 = arith.addi %114, %115 : tensor<128x128xi32> 2026-02-21T08:47:42.0386853Z %117 = tt.splat %arg1 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0387145Z %118 = tt.addptr %117, %116 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:47:42.0387411Z 
tt.store %118, %110 : tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0387623Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:47:42.0387809Z %119 = arith.muli %c128_i32, %c3_i32 : i32 2026-02-21T08:47:42.0388008Z %120 = arith.addi %arg3, %119 : i32 2026-02-21T08:47:42.0388254Z %121 = tt.splat %120 : i32 -> tensor<128xi32> 2026-02-21T08:47:42.0388471Z %122 = arith.addi %121, %3 : tensor<128xi32> 2026-02-21T08:47:42.0388763Z %123 = tt.descriptor_load %0[%2, %120] : !tt.tensordesc> -> tensor<128x128xf16> 2026-02-21T08:47:42.0389119Z %124 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:42.0389416Z %125 = arith.extf %123 : tensor<128x128xf16> to tensor<128x128xf32> 2026-02-21T08:47:42.0389682Z %126 = tt.broadcast %124 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:47:42.0389930Z %127 = arith.subf %125, %126 : tensor<128x128xf32> 2026-02-21T08:47:42.0390307Z %128 = tt.extern_elementwise %127 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:47:42.0390739Z %129 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:42.0391040Z %130 = tt.broadcast %129 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:47:42.0391282Z %131 = arith.divf %128, %130 : tensor<128x128xf32> 2026-02-21T08:47:42.0391575Z %132 = arith.truncf %131 : tensor<128x128xf32> to tensor<128x128xf16> 2026-02-21T08:47:42.0391867Z %133 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:47:42.0392135Z %134 = arith.muli %133, %cst : tensor<128x1xi32> 2026-02-21T08:47:42.0392397Z %135 = tt.expand_dims %122 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:47:42.0392688Z %136 = tt.broadcast %134 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0392959Z %137 = tt.broadcast %135 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0393200Z %138 = arith.addi %136, 
%137 : tensor<128x128xi32> 2026-02-21T08:47:42.0393444Z %139 = tt.splat %arg1 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0393733Z %140 = tt.addptr %139, %138 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:47:42.0394000Z tt.store %140, %132 : tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0394261Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:47:42.0394519Z %35 = tt.splat %c9216_i32_2 : i32 -> tensor<128xi32> 2026-02-21T08:47:42.0394736Z %36 = arith.addi %35, %3 : tensor<128xi32> 2026-02-21T08:47:42.0395034Z %37 = tt.descriptor_load %0[%2, %c9216_i32_2] : !tt.tensordesc> -> tensor<128x128xf16> 2026-02-21T08:47:42.0395390Z %38 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:42.0395688Z %39 = arith.extf %37 : tensor<128x128xf16> to tensor<128x128xf32> 2026-02-21T08:47:42.0395950Z %40 = tt.broadcast %38 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:47:42.0396192Z %41 = arith.subf %39, %40 : tensor<128x128xf32> 2026-02-21T08:47:42.0396575Z %42 = tt.extern_elementwise %41 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:47:42.0397083Z %43 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:42.0397379Z %44 = tt.broadcast %43 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:47:42.0397615Z %45 = arith.divf %42, %44 : tensor<128x128xf32> 2026-02-21T08:47:42.0397857Z %46 = arith.truncf %45 : tensor<128x128xf32> to tensor<128x128xf16> 2026-02-21T08:47:42.0398143Z %47 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:47:42.0398413Z %48 = arith.muli %47, %cst : tensor<128x1xi32> 2026-02-21T08:47:42.0398674Z %49 = tt.expand_dims %36 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:47:42.0398963Z %50 = tt.broadcast %48 : tensor<128x1xi32> -> tensor<128x128xi32> 
2026-02-21T08:47:42.0399287Z %51 = tt.broadcast %49 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:47:42.0399523Z %52 = arith.addi %50, %51 : tensor<128x128xi32> 2026-02-21T08:47:42.0399763Z %53 = tt.splat %arg1 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0400036Z %54 = tt.addptr %53, %52 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:47:42.0400296Z tt.store %54, %46 : tensor<128x128x!tt.ptr> 2026-02-21T08:47:42.0400571Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:47:42.0400819Z tt.return 2026-02-21T08:47:42.0400953Z } 2026-02-21T08:47:42.0401076Z } 2026-02-21T08:47:42.0401144Z 2026-02-21T08:47:42.0401204Z {-# 2026-02-21T08:47:42.0401330Z external_resources: { 2026-02-21T08:47:42.0401493Z mlir_reproducer: { 2026-02-21T08:47:42.0405999Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal 
test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:47:42.0410704Z disable_threading: false, 2026-02-21T08:47:42.0410925Z verify_each: true 2026-02-21T08:47:42.0411098Z } 2026-02-21T08:47:42.0411242Z } 2026-02-21T08:47:42.0411392Z #-} 2026-02-21T08:47:42.0411923Z /tmp/torchinductor_root/a2/ca2sar7nma3f3dfsksjyc2hlyfuz4jhavdzoiwo4fl6jczqwnmxf.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:47:42.0413311Z /tmp/torchinductor_root/a2/ca2sar7nma3f3dfsksjyc2hlyfuz4jhavdzoiwo4fl6jczqwnmxf.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:47:42.0414443Z [41s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:47:42.0415551Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 128], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', ''], num_sm_multiplier=1, num_stages=7, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[1, 2], range_unroll_factors=[1, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:47:42.0416512Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:47:42.0416825Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:47:43.9958214Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T08:47:43.9958736Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:47:43.9959237Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:47:43.9959444Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:47:43.9959644Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:47:43.9959822Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:47:43.9960035Z %cst = arith.constant dense<9344> : tensor<128x1xi32> 2026-02-21T08:47:43.9960298Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<128x256xf32> 2026-02-21T08:47:43.9960567Z %cst_1 = arith.constant dense<0xFC00> : tensor<128x256xf16> 2026-02-21T08:47:43.9960812Z %cst_2 = arith.constant dense<9344> : tensor<256xi32> 2026-02-21T08:47:43.9961086Z %cst_3 = arith.constant dense<0.000000e+00> : tensor<128xf32> 2026-02-21T08:47:43.9961389Z %cst_4 = arith.constant dense<0xFF800000> : tensor<128xf32> 2026-02-21T08:47:43.9961818Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:47:43.9962052Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:47:43.9962323Z %c9344_i32 = arith.constant 9344 : i32 2026-02-21T08:47:43.9962598Z %c9344_i64 = arith.constant 9344 : i64 2026-02-21T08:47:43.9962860Z %c1_i64 = arith.constant 1 
: i64 2026-02-21T08:47:43.9963347Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c9344_i32], [%c9344_i64, %c1_i64] : , > 2026-02-21T08:47:43.9963851Z %1 = tt.get_program_id x : i32 2026-02-21T08:47:43.9964117Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T08:47:43.9964394Z %3 = arith.minsi %2, %c32_i32 : i32 2026-02-21T08:47:43.9964690Z scf.for %arg2 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T08:47:43.9965016Z %4 = arith.muli %arg2, %c128_i32 : i32 2026-02-21T08:47:43.9965398Z %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:47:43.9965817Z %6 = tt.splat %4 : i32 -> tensor<128xi32> 2026-02-21T08:47:43.9966116Z %7 = arith.addi %6, %5 : tensor<128xi32> 2026-02-21T08:47:43.9966378Z %c9216_i32 = arith.constant 9216 : i32 2026-02-21T08:47:43.9966576Z %c768_i32 = arith.constant 768 : i32 2026-02-21T08:47:43.9966942Z %8:2 = scf.for %arg3 = %c0_i32 to %c9216_i32 step %c768_i32 iter_args(%arg4 = %cst_4, %arg5 = %cst_3) -> (tensor<128xf32>, tensor<128xf32>) : i32 { 2026-02-21T08:47:43.9967405Z %60 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:47:43.9967732Z %61 = tt.splat %arg3 : i32 -> tensor<256xi32> 2026-02-21T08:47:43.9967941Z %62 = arith.addi %61, %60 : tensor<256xi32> 2026-02-21T08:47:43.9968162Z %63 = arith.cmpi slt, %62, %cst_2 : tensor<256xi32> 2026-02-21T08:47:43.9968482Z %64 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:47:43.9969156Z %65 = tt.expand_dims %63 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:47:43.9969476Z %66 = tt.broadcast %65 : tensor<1x256xi1> -> tensor<128x256xi1> 2026-02-21T08:47:43.9969763Z %67 = arith.select %66, %64, %cst_1 : tensor<128x256xi1>, tensor<128x256xf16> 2026-02-21T08:47:43.9977051Z %68 = arith.extf %67 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:47:43.9977547Z %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({ 2026-02-21T08:47:43.9977918Z ^bb0(%arg6: f32, %arg7: f32): 
2026-02-21T08:47:43.9978272Z %145 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:47:43.9978634Z tt.reduce.return %145 : f32 2026-02-21T08:47:43.9978982Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:47:43.9979347Z %70 = arith.truncf %69 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:47:43.9979978Z %71 = arith.extf %70 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:47:43.9980240Z %72 = arith.cmpf ogt, %arg4, %71 : tensor<128xf32> 2026-02-21T08:47:43.9980495Z %73 = arith.cmpf une, %arg4, %arg4 : tensor<128xf32> 2026-02-21T08:47:43.9980742Z %74 = arith.ori %72, %73 : tensor<128xi1> 2026-02-21T08:47:43.9981032Z %75 = arith.select %74, %arg4, %71 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:47:43.9981274Z %76 = arith.subf %arg4, %75 : tensor<128xf32> 2026-02-21T08:47:43.9981707Z %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:47:43.9982134Z %78 = arith.mulf %arg5, %77 : tensor<128xf32> 2026-02-21T08:47:43.9982400Z %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:43.9982712Z %80 = arith.extf %64 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:47:43.9982982Z %81 = tt.broadcast %79 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:47:43.9983237Z %82 = arith.subf %80, %81 : tensor<128x256xf32> 2026-02-21T08:47:43.9983607Z %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:47:43.9984029Z %84 = arith.select %66, %83, %cst_0 : tensor<128x256xi1>, tensor<128x256xf32> 2026-02-21T08:47:43.9984281Z %85 = "tt.reduce"(%84) <{axis = 1 : i32}> ({ 2026-02-21T08:47:43.9984507Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:47:43.9984740Z %145 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:47:43.9984926Z tt.reduce.return %145 : f32 2026-02-21T08:47:43.9985117Z }) : (tensor<128x256xf32>) -> 
tensor<128xf32> 2026-02-21T08:47:43.9985318Z %86 = arith.addf %78, %85 : tensor<128xf32> 2026-02-21T08:47:43.9985604Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T08:47:43.9985882Z %87 = arith.muli %c256_i32, %c1_i32_7 : i32 2026-02-21T08:47:43.9986167Z %88 = arith.addi %arg3, %87 : i32 2026-02-21T08:47:43.9986536Z %89 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:47:43.9986924Z %90 = tt.splat %88 : i32 -> tensor<256xi32> 2026-02-21T08:47:43.9987229Z %91 = arith.addi %90, %89 : tensor<256xi32> 2026-02-21T08:47:43.9987560Z %92 = arith.cmpi slt, %91, %cst_2 : tensor<256xi32> 2026-02-21T08:47:43.9988069Z %93 = tt.descriptor_load %0[%4, %88] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:47:43.9988656Z %94 = tt.expand_dims %92 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:47:43.9989108Z %95 = tt.broadcast %94 : tensor<1x256xi1> -> tensor<128x256xi1> 2026-02-21T08:47:43.9989452Z %96 = arith.select %95, %93, %cst_1 : tensor<128x256xi1>, tensor<128x256xf16> 2026-02-21T08:47:43.9989737Z %97 = arith.extf %96 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:47:43.9989984Z %98 = "tt.reduce"(%97) <{axis = 1 : i32}> ({ 2026-02-21T08:47:43.9990299Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:47:43.9990494Z %145 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:47:43.9990698Z tt.reduce.return %145 : f32 2026-02-21T08:47:43.9990886Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:47:43.9991121Z %99 = arith.truncf %98 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:47:43.9991376Z %100 = arith.extf %99 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:47:43.9991680Z %101 = arith.cmpf ogt, %75, %100 : tensor<128xf32> 2026-02-21T08:47:43.9991937Z %102 = arith.cmpf une, %75, %75 : tensor<128xf32> 2026-02-21T08:47:43.9992164Z %103 = arith.ori %101, %102 : tensor<128xi1> 2026-02-21T08:47:43.9992415Z %104 = arith.select %103, %75, %100 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:47:43.9992666Z 
%105 = arith.subf %75, %104 : tensor<128xf32> 2026-02-21T08:47:43.9993136Z %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:47:43.9993516Z %107 = arith.mulf %86, %106 : tensor<128xf32> 2026-02-21T08:47:43.9993786Z %108 = tt.expand_dims %104 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:43.9994094Z %109 = arith.extf %93 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:47:43.9994385Z %110 = tt.broadcast %108 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:47:43.9994687Z %111 = arith.subf %109, %110 : tensor<128x256xf32> 2026-02-21T08:47:43.9995071Z %112 = tt.extern_elementwise %111 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:47:43.9995566Z %113 = arith.select %95, %112, %cst_0 : tensor<128x256xi1>, tensor<128x256xf32> 2026-02-21T08:47:43.9995936Z %114 = "tt.reduce"(%113) <{axis = 1 : i32}> ({ 2026-02-21T08:47:43.9996191Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:47:43.9996386Z %145 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:47:43.9996575Z tt.reduce.return %145 : f32 2026-02-21T08:47:43.9996774Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:47:43.9996978Z %115 = arith.addf %107, %114 : tensor<128xf32> 2026-02-21T08:47:43.9997183Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:47:43.9997386Z %116 = arith.muli %c256_i32, %c2_i32 : i32 2026-02-21T08:47:43.9997597Z %117 = arith.addi %arg3, %116 : i32 2026-02-21T08:47:43.9997841Z %118 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:47:43.9998097Z %119 = tt.splat %117 : i32 -> tensor<256xi32> 2026-02-21T08:47:43.9998301Z %120 = arith.addi %119, %118 : tensor<256xi32> 2026-02-21T08:47:43.9998547Z %121 = arith.cmpi slt, %120, %cst_2 : tensor<256xi32> 2026-02-21T08:47:43.9999019Z %122 = tt.descriptor_load %0[%4, %117] : !tt.tensordesc> -> 
tensor<128x256xf16> 2026-02-21T08:47:43.9999577Z %123 = tt.expand_dims %121 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:47:44.0000032Z %124 = tt.broadcast %123 : tensor<1x256xi1> -> tensor<128x256xi1> 2026-02-21T08:47:44.0000494Z %125 = arith.select %124, %122, %cst_1 : tensor<128x256xi1>, tensor<128x256xf16> 2026-02-21T08:47:44.0000961Z %126 = arith.extf %125 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:47:44.0001350Z %127 = "tt.reduce"(%126) <{axis = 1 : i32}> ({ 2026-02-21T08:47:44.0001695Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:47:44.0001979Z %145 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:47:44.0002270Z tt.reduce.return %145 : f32 2026-02-21T08:47:44.0002555Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:47:44.0002887Z %128 = arith.truncf %127 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:47:44.0003255Z %129 = arith.extf %128 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:47:44.0003770Z %130 = arith.cmpf ogt, %104, %129 : tensor<128xf32> 2026-02-21T08:47:44.0004104Z %131 = arith.cmpf une, %104, %104 : tensor<128xf32> 2026-02-21T08:47:44.0004422Z %132 = arith.ori %130, %131 : tensor<128xi1> 2026-02-21T08:47:44.0004781Z %133 = arith.select %132, %104, %129 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:47:44.0005166Z %134 = arith.subf %104, %133 : tensor<128xf32> 2026-02-21T08:47:44.0005729Z %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:47:44.0006266Z %136 = arith.mulf %115, %135 : tensor<128xf32> 2026-02-21T08:47:44.0006648Z %137 = tt.expand_dims %133 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:44.0007073Z %138 = arith.extf %122 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:47:44.0007587Z %139 = tt.broadcast %137 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:47:44.0007942Z %140 = arith.subf %138, %139 : tensor<128x256xf32> 
2026-02-21T08:47:44.0008392Z %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:47:44.0008826Z %142 = arith.select %124, %141, %cst_0 : tensor<128x256xi1>, tensor<128x256xf32> 2026-02-21T08:47:44.0009084Z %143 = "tt.reduce"(%142) <{axis = 1 : i32}> ({ 2026-02-21T08:47:44.0009301Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:47:44.0009551Z %145 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:47:44.0009814Z tt.reduce.return %145 : f32 2026-02-21T08:47:44.0010077Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:47:44.0010355Z %144 = arith.addf %136, %143 : tensor<128xf32> 2026-02-21T08:47:44.0010654Z scf.yield %133, %144 : tensor<128xf32>, tensor<128xf32> 2026-02-21T08:47:44.0010935Z } {tt.num_stages = 1 : i32} 2026-02-21T08:47:44.0011168Z %9 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:47:44.0011422Z %10 = tt.splat %c9216_i32 : i32 -> tensor<256xi32> 2026-02-21T08:47:44.0011683Z %11 = arith.addi %10, %9 : tensor<256xi32> 2026-02-21T08:47:44.0011893Z %12 = arith.cmpi slt, %11, %cst_2 : tensor<256xi32> 2026-02-21T08:47:44.0012210Z %13 = tt.descriptor_load %0[%4, %c9216_i32] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:47:44.0012576Z %14 = tt.expand_dims %12 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:47:44.0012858Z %15 = tt.broadcast %14 : tensor<1x256xi1> -> tensor<128x256xi1> 2026-02-21T08:47:44.0013140Z %16 = arith.select %15, %13, %cst_1 : tensor<128x256xi1>, tensor<128x256xf16> 2026-02-21T08:47:44.0013415Z %17 = arith.extf %16 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:47:44.0013657Z %18 = "tt.reduce"(%17) <{axis = 1 : i32}> ({ 2026-02-21T08:47:44.0013861Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:47:44.0014043Z %60 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:47:44.0014239Z tt.reduce.return %60 : f32 2026-02-21T08:47:44.0014418Z }) : (tensor<128x256xf32>) 
-> tensor<128xf32> 2026-02-21T08:47:44.0014645Z %19 = arith.truncf %18 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:47:44.0014884Z %20 = arith.extf %19 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:47:44.0015121Z %21 = arith.cmpf ogt, %8#0, %20 : tensor<128xf32> 2026-02-21T08:47:44.0015340Z %22 = arith.cmpf une, %8#0, %8#0 : tensor<128xf32> 2026-02-21T08:47:44.0015542Z %23 = arith.ori %21, %22 : tensor<128xi1> 2026-02-21T08:47:44.0015773Z %24 = arith.select %23, %8#0, %20 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:47:44.0016003Z %25 = arith.subf %8#0, %24 : tensor<128xf32> 2026-02-21T08:47:44.0016367Z %26 = tt.extern_elementwise %25 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:47:44.0016789Z %27 = arith.mulf %8#1, %26 : tensor<128xf32> 2026-02-21T08:47:44.0017057Z %28 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:44.0017365Z %29 = arith.extf %13 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:47:44.0017635Z %30 = tt.broadcast %28 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:47:44.0017891Z %31 = arith.subf %29, %30 : tensor<128x256xf32> 2026-02-21T08:47:44.0018269Z %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:47:44.0018703Z %33 = arith.select %15, %32, %cst_0 : tensor<128x256xi1>, tensor<128x256xf32> 2026-02-21T08:47:44.0018969Z %34 = "tt.reduce"(%33) <{axis = 1 : i32}> ({ 2026-02-21T08:47:44.0019164Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:47:44.0019439Z %60 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:47:44.0019640Z tt.reduce.return %60 : f32 2026-02-21T08:47:44.0019839Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:47:44.0020042Z %35 = arith.addf %27, %34 : tensor<128xf32> 2026-02-21T08:47:44.0020254Z %c9216_i32_5 = arith.constant 9216 : i32 2026-02-21T08:47:44.0020453Z 
%c768_i32_6 = arith.constant 768 : i32 2026-02-21T08:47:44.0020699Z scf.for %arg3 = %c0_i32 to %c9216_i32_5 step %c768_i32_6 : i32 { 2026-02-21T08:47:44.0020999Z %60 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:47:44.0021266Z %61 = tt.splat %arg3 : i32 -> tensor<256xi32> 2026-02-21T08:47:44.0021485Z %62 = arith.addi %61, %60 : tensor<256xi32> 2026-02-21T08:47:44.0021749Z %63 = arith.cmpi slt, %62, %cst_2 : tensor<256xi32> 2026-02-21T08:47:44.0022075Z %64 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:47:44.0022437Z %65 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:44.0022746Z %66 = arith.extf %64 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:47:44.0023028Z %67 = tt.broadcast %65 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:47:44.0023276Z %68 = arith.subf %66, %67 : tensor<128x256xf32> 2026-02-21T08:47:44.0023669Z %69 = tt.extern_elementwise %68 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:47:44.0024102Z %70 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:44.0024410Z %71 = tt.broadcast %70 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:47:44.0024668Z %72 = arith.divf %69, %71 : tensor<128x256xf32> 2026-02-21T08:47:44.0024915Z %73 = arith.truncf %72 : tensor<128x256xf32> to tensor<128x256xf16> 2026-02-21T08:47:44.0025221Z %74 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:47:44.0025498Z %75 = arith.muli %74, %cst : tensor<128x1xi32> 2026-02-21T08:47:44.0025764Z %76 = tt.expand_dims %62 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T08:47:44.0026063Z %77 = tt.broadcast %75 : tensor<128x1xi32> -> tensor<128x256xi32> 2026-02-21T08:47:44.0026342Z %78 = tt.broadcast %76 : tensor<1x256xi32> -> tensor<128x256xi32> 
2026-02-21T08:47:44.0026593Z %79 = arith.addi %77, %78 : tensor<128x256xi32> 2026-02-21T08:47:44.0026842Z %80 = tt.splat %arg1 : !tt.ptr -> tensor<128x256x!tt.ptr> 2026-02-21T08:47:44.0027126Z %81 = tt.addptr %80, %79 : tensor<128x256x!tt.ptr>, tensor<128x256xi32> 2026-02-21T08:47:44.0027420Z %82 = tt.expand_dims %63 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:47:44.0027711Z %83 = tt.broadcast %82 : tensor<1x256xi1> -> tensor<128x256xi1> 2026-02-21T08:47:44.0028017Z tt.store %81, %73, %83 : tensor<128x256x!tt.ptr> 2026-02-21T08:47:44.0028231Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T08:47:44.0028428Z %84 = arith.muli %c256_i32, %c1_i32_7 : i32 2026-02-21T08:47:44.0028620Z %85 = arith.addi %arg3, %84 : i32 2026-02-21T08:47:44.0028854Z %86 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:47:44.0029098Z %87 = tt.splat %85 : i32 -> tensor<256xi32> 2026-02-21T08:47:44.0029296Z %88 = arith.addi %87, %86 : tensor<256xi32> 2026-02-21T08:47:44.0029501Z %89 = arith.cmpi slt, %88, %cst_2 : tensor<256xi32> 2026-02-21T08:47:44.0029804Z %90 = tt.descriptor_load %0[%4, %85] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:47:44.0030153Z %91 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:44.0030485Z %92 = arith.extf %90 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:47:44.0030752Z %93 = tt.broadcast %91 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:47:44.0030984Z %94 = arith.subf %92, %93 : tensor<128x256xf32> 2026-02-21T08:47:44.0031359Z %95 = tt.extern_elementwise %94 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:47:44.0031816Z %96 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:44.0032104Z %97 = tt.broadcast %96 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:47:44.0032348Z %98 = arith.divf %95, %97 : 
tensor<128x256xf32> 2026-02-21T08:47:44.0032582Z %99 = arith.truncf %98 : tensor<128x256xf32> to tensor<128x256xf16> 2026-02-21T08:47:44.0032882Z %100 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:47:44.0033160Z %101 = arith.muli %100, %cst : tensor<128x1xi32> 2026-02-21T08:47:44.0033423Z %102 = tt.expand_dims %88 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T08:47:44.0033733Z %103 = tt.broadcast %101 : tensor<128x1xi32> -> tensor<128x256xi32> 2026-02-21T08:47:44.0034006Z %104 = tt.broadcast %102 : tensor<1x256xi32> -> tensor<128x256xi32> 2026-02-21T08:47:44.0034263Z %105 = arith.addi %103, %104 : tensor<128x256xi32> 2026-02-21T08:47:44.0034510Z %106 = tt.splat %arg1 : !tt.ptr -> tensor<128x256x!tt.ptr> 2026-02-21T08:47:44.0034810Z %107 = tt.addptr %106, %105 : tensor<128x256x!tt.ptr>, tensor<128x256xi32> 2026-02-21T08:47:44.0035122Z %108 = tt.expand_dims %89 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:47:44.0035408Z %109 = tt.broadcast %108 : tensor<1x256xi1> -> tensor<128x256xi1> 2026-02-21T08:47:44.0035669Z tt.store %107, %99, %109 : tensor<128x256x!tt.ptr> 2026-02-21T08:47:44.0035884Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:47:44.0036081Z %110 = arith.muli %c256_i32, %c2_i32 : i32 2026-02-21T08:47:44.0036274Z %111 = arith.addi %arg3, %110 : i32 2026-02-21T08:47:44.0036512Z %112 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:47:44.0036772Z %113 = tt.splat %111 : i32 -> tensor<256xi32> 2026-02-21T08:47:44.0036977Z %114 = arith.addi %113, %112 : tensor<256xi32> 2026-02-21T08:47:44.0037201Z %115 = arith.cmpi slt, %114, %cst_2 : tensor<256xi32> 2026-02-21T08:47:44.0037505Z %116 = tt.descriptor_load %0[%4, %111] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:47:44.0037865Z %117 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:44.0038165Z %118 = arith.extf %116 : tensor<128x256xf16> to 
tensor<128x256xf32> 2026-02-21T08:47:44.0038436Z %119 = tt.broadcast %117 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:47:44.0038693Z %120 = arith.subf %118, %119 : tensor<128x256xf32> 2026-02-21T08:47:44.0039123Z %121 = tt.extern_elementwise %120 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:47:44.0039557Z %122 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:44.0039847Z %123 = tt.broadcast %122 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:47:44.0040101Z %124 = arith.divf %121, %123 : tensor<128x256xf32> 2026-02-21T08:47:44.0040352Z %125 = arith.truncf %124 : tensor<128x256xf32> to tensor<128x256xf16> 2026-02-21T08:47:44.0040643Z %126 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:47:44.0040914Z %127 = arith.muli %126, %cst : tensor<128x1xi32> 2026-02-21T08:47:44.0041173Z %128 = tt.expand_dims %114 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T08:47:44.0041532Z %129 = tt.broadcast %127 : tensor<128x1xi32> -> tensor<128x256xi32> 2026-02-21T08:47:44.0041845Z %130 = tt.broadcast %128 : tensor<1x256xi32> -> tensor<128x256xi32> 2026-02-21T08:47:44.0042084Z %131 = arith.addi %129, %130 : tensor<128x256xi32> 2026-02-21T08:47:44.0042331Z %132 = tt.splat %arg1 : !tt.ptr -> tensor<128x256x!tt.ptr> 2026-02-21T08:47:44.0042636Z %133 = tt.addptr %132, %131 : tensor<128x256x!tt.ptr>, tensor<128x256xi32> 2026-02-21T08:47:44.0042956Z %134 = tt.expand_dims %115 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:47:44.0043251Z %135 = tt.broadcast %134 : tensor<1x256xi1> -> tensor<128x256xi1> 2026-02-21T08:47:44.0043518Z tt.store %133, %125, %135 : tensor<128x256x!tt.ptr> 2026-02-21T08:47:44.0043747Z } {tt.num_stages = 1 : i32} 2026-02-21T08:47:44.0043968Z %36 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:47:44.0044229Z 
%37 = tt.splat %c9216_i32_5 : i32 -> tensor<256xi32> 2026-02-21T08:47:44.0044441Z %38 = arith.addi %37, %36 : tensor<256xi32> 2026-02-21T08:47:44.0044658Z %39 = arith.cmpi slt, %38, %cst_2 : tensor<256xi32> 2026-02-21T08:47:44.0044969Z %40 = tt.descriptor_load %0[%4, %c9216_i32_5] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:47:44.0045333Z %41 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:44.0045631Z %42 = arith.extf %40 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:47:44.0045894Z %43 = tt.broadcast %41 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:47:44.0046137Z %44 = arith.subf %42, %43 : tensor<128x256xf32> 2026-02-21T08:47:44.0046504Z %45 = tt.extern_elementwise %44 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:47:44.0046922Z %46 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:47:44.0047238Z %47 = tt.broadcast %46 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:47:44.0047474Z %48 = arith.divf %45, %47 : tensor<128x256xf32> 2026-02-21T08:47:44.0047713Z %49 = arith.truncf %48 : tensor<128x256xf32> to tensor<128x256xf16> 2026-02-21T08:47:44.0047993Z %50 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:47:44.0048258Z %51 = arith.muli %50, %cst : tensor<128x1xi32> 2026-02-21T08:47:44.0048509Z %52 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T08:47:44.0048788Z %53 = tt.broadcast %51 : tensor<128x1xi32> -> tensor<128x256xi32> 2026-02-21T08:47:44.0049049Z %54 = tt.broadcast %52 : tensor<1x256xi32> -> tensor<128x256xi32> 2026-02-21T08:47:44.0049279Z %55 = arith.addi %53, %54 : tensor<128x256xi32> 2026-02-21T08:47:44.0049517Z %56 = tt.splat %arg1 : !tt.ptr -> tensor<128x256x!tt.ptr> 2026-02-21T08:47:44.0049801Z %57 = tt.addptr %56, %55 : tensor<128x256x!tt.ptr>, 
tensor<128x256xi32> 2026-02-21T08:47:44.0050140Z %58 = tt.expand_dims %39 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:47:44.0050427Z %59 = tt.broadcast %58 : tensor<1x256xi1> -> tensor<128x256xi1> 2026-02-21T08:47:44.0050671Z tt.store %57, %49, %59 : tensor<128x256x!tt.ptr> 2026-02-21T08:47:44.0050899Z } {tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:47:44.0051088Z tt.return 2026-02-21T08:47:44.0051220Z } 2026-02-21T08:47:44.0051337Z } 2026-02-21T08:47:44.0051410Z 2026-02-21T08:47:44.0051460Z {-# 2026-02-21T08:47:44.0051642Z external_resources: { 2026-02-21T08:47:44.0051795Z mlir_reproducer: { 2026-02-21T08:47:44.0056165Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, 
triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:47:44.0060599Z disable_threading: false, 2026-02-21T08:47:44.0060782Z verify_each: true 2026-02-21T08:47:44.0060931Z } 2026-02-21T08:47:44.0061063Z } 2026-02-21T08:47:44.0061181Z #-} 2026-02-21T08:47:44.0061676Z /tmp/torchinductor_root/yd/cydwhqvmvwim7d7qkg4nbhz4g2n7u3a42tsph3mw7voszwmou3lu.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:47:44.0062935Z /tmp/torchinductor_root/yd/cydwhqvmvwim7d7qkg4nbhz4g2n7u3a42tsph3mw7voszwmou3lu.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:47:44.0063961Z [43s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:47:44.0065118Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=256, num_sm_multiplier=16, num_stages=8, num_warps=16, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:47:44.0066165Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:47:44.0066433Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:47:47.3807386Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 13.0 configs/s 2026-02-21T08:47:47.3820374Z [46s] Adaptive compile timeout: 30s (90% percentile=9.6s, bounds=[30.0s, 30s]) 2026-02-21T08:47:48.1964516Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1209.5 configs/s 2026-02-21T08:47:48.2603460Z [47s] Initial random population of 100, 5 starting points: 2026-02-21T08:47:48.2605561Z error=6 2026-02-21T08:47:48.2605756Z timeout=1 2026-02-21T08:47:48.2611129Z ok=93 2026-02-21T08:47:48.2615883Z min=0.0497 2026-02-21T08:47:48.2617370Z mid=0.8457 2026-02-21T08:47:48.2617531Z max=46.4589 2026-02-21T08:47:48.2617688Z best={'block_sizes': [1, 16384], 2026-02-21T08:47:48.2617917Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:47:48.2618153Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:47:48.2618339Z 'num_sm_multiplier': 8, 2026-02-21T08:47:48.2618502Z 'num_stages': 3, 2026-02-21T08:47:48.2618645Z 'num_warps': 1, 2026-02-21T08:47:48.2619152Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:47:48.2619386Z 'range_flattens': [False, None], 2026-02-21T08:47:48.2619564Z 'range_multi_buffers': [True, True], 2026-02-21T08:47:48.2619751Z 'range_num_stages': [1, 2], 2026-02-21T08:47:48.2619913Z 'range_unroll_factors': [0, 1], 2026-02-21T08:47:48.2620096Z 'range_warp_specializes': [True, None]} 2026-02-21T08:47:48.2624289Z [47s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:47:49.3126373Z [48s] Generation 1 starting: 81 neighbors, 5 active search path(s) 2026-02-21T08:48:02.4108844Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 6.7 configs/s 2026-02-21T08:48:07.5485381Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 16.9 configs/s 2026-02-21T08:48:10.9633889Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 295.5 2026-02-21T08:48:10.9635292Z configs/s 2026-02-21T08:48:11.1696831Z [70s] Generation 1 complete: 
2026-02-21T08:48:11.1701307Z ok=87 2026-02-21T08:48:11.1706852Z min=0.0390 2026-02-21T08:48:11.1711024Z mid=0.0615 2026-02-21T08:48:11.1712552Z max=0.2046 2026-02-21T08:48:11.1712765Z best={'block_sizes': [1, 16384], 2026-02-21T08:48:11.1713090Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:48:11.1715259Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:48:11.1715502Z 'num_sm_multiplier': 8, 2026-02-21T08:48:11.1715669Z 'num_stages': 3, 2026-02-21T08:48:11.1715821Z 'num_warps': 1, 2026-02-21T08:48:11.1715982Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:48:11.1716192Z 'range_flattens': [False, None], 2026-02-21T08:48:11.1716378Z 'range_multi_buffers': [True, True], 2026-02-21T08:48:11.1716573Z 'range_num_stages': [1, 2], 2026-02-21T08:48:11.1716742Z 'range_unroll_factors': [1, 1], 2026-02-21T08:48:11.1716930Z 'range_warp_specializes': [True, None]} 2026-02-21T08:48:11.1717225Z [70s] Fitting surrogate: 187 points, 187 targets 2026-02-21T08:48:12.1804020Z [71s] Generation 2 starting: 74 neighbors, 5 active search path(s) 2026-02-21T08:48:25.2385678Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 7.5 configs/s 2026-02-21T08:48:31.0180013Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 13.6 configs/s 2026-02-21T08:48:34.4705359Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 292.6 2026-02-21T08:48:34.4709212Z configs/s 2026-02-21T08:48:34.6738638Z [94s] Generation 2 complete: 2026-02-21T08:48:34.6738912Z ok=79 2026-02-21T08:48:34.6739079Z min=0.0389 2026-02-21T08:48:34.6739209Z mid=0.0594 2026-02-21T08:48:34.6739338Z max=0.3748 2026-02-21T08:48:34.6739476Z best={'block_sizes': [1, 16384], 2026-02-21T08:48:34.6739745Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:48:34.6739981Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:48:34.6740204Z 'num_stages': 3, 2026-02-21T08:48:34.6740786Z 'num_warps': 1, 2026-02-21T08:48:34.7121070Z 
'pid_type': 'flat', 2026-02-21T08:48:34.7121249Z 'range_flattens': [None, None], 2026-02-21T08:48:34.7121430Z 'range_multi_buffers': [None, True], 2026-02-21T08:48:34.7121858Z 'range_num_stages': [0, 2], 2026-02-21T08:48:34.7122025Z 'range_unroll_factors': [0, 1], 2026-02-21T08:48:34.7122213Z 'range_warp_specializes': [None, True]} 2026-02-21T08:48:34.7122441Z [94s] Fitting surrogate: 266 points, 266 targets 2026-02-21T08:48:35.4615497Z [95s] Generation 3 starting: 56 neighbors, 4 active search path(s) 2026-02-21T08:48:45.7695229Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 4.3 configs/s 2026-02-21T08:48:49.3014684Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 16.9 configs/s 2026-02-21T08:48:53.1186971Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 264.6 2026-02-21T08:48:53.1190444Z configs/s 2026-02-21T08:48:53.3430627Z [112s] Generation 3 complete: 2026-02-21T08:48:53.3434631Z ok=61 2026-02-21T08:48:53.3436741Z min=0.0369 2026-02-21T08:48:53.3436907Z mid=0.0532 2026-02-21T08:48:53.3437044Z max=0.6912 2026-02-21T08:48:53.3437199Z best={'block_sizes': [1, 16384], 2026-02-21T08:48:53.3437465Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:48:53.3437755Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:48:53.3437950Z 'num_sm_multiplier': 32, 2026-02-21T08:48:53.3438115Z 'num_stages': 5, 2026-02-21T08:48:53.3438252Z 'num_warps': 2, 2026-02-21T08:48:53.3438415Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:48:53.3438604Z 'range_flattens': [True, True], 2026-02-21T08:48:53.3438785Z 'range_multi_buffers': [False, None], 2026-02-21T08:48:53.3438971Z 'range_num_stages': [3, 2], 2026-02-21T08:48:53.3439135Z 'range_unroll_factors': [0, 2], 2026-02-21T08:48:53.3439317Z 'range_warp_specializes': [True, None]} 2026-02-21T08:48:53.3448108Z [112s] Fitting surrogate: 327 points, 327 targets 2026-02-21T08:48:53.9445761Z [113s] Generation 4 starting: 37 
neighbors, 2 active search path(s) 2026-02-21T08:49:03.3421183Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 2.8 configs/s 2026-02-21T08:49:05.7228050Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 39/39 16.7 configs/s 2026-02-21T08:49:07.8250692Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 479.3 2026-02-21T08:49:07.8254750Z configs/s 2026-02-21T08:49:07.9782881Z [127s] Generation 4 complete: 2026-02-21T08:49:07.9783237Z ok=40 2026-02-21T08:49:07.9783458Z min=0.0368 2026-02-21T08:49:07.9783625Z mid=0.0513 2026-02-21T08:49:07.9783808Z max=0.6871 2026-02-21T08:49:07.9783994Z best={'block_sizes': [1, 16384], 2026-02-21T08:49:07.9784293Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:49:07.9784594Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:49:07.9784845Z 'num_sm_multiplier': 32, 2026-02-21T08:49:07.9785033Z 'num_stages': 5, 2026-02-21T08:49:07.9785176Z 'num_warps': 4, 2026-02-21T08:49:07.9785344Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:49:07.9785600Z 'range_flattens': [True, True], 2026-02-21T08:49:07.9785872Z 'range_multi_buffers': [False, None], 2026-02-21T08:49:07.9786146Z 'range_num_stages': [3, 2], 2026-02-21T08:49:07.9786383Z 'range_unroll_factors': [0, 2], 2026-02-21T08:49:07.9786658Z 'range_warp_specializes': [True, None]} 2026-02-21T08:49:07.9806097Z [127s] Fitting surrogate: 367 points, 367 targets 2026-02-21T08:49:08.6287486Z [128s] Generation 5 starting: 38 neighbors, 2 active search path(s) 2026-02-21T08:49:15.3934103Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40/40 7.0 configs/s 2026-02-21T08:49:17.8500119Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 40/40 16.6 configs/s 2026-02-21T08:49:20.9142818Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 330.5 2026-02-21T08:49:20.9144175Z configs/s 2026-02-21T08:49:21.1283363Z [140s] Generation 5 complete: 2026-02-21T08:49:21.1287610Z ok=41 
2026-02-21T08:49:21.1289298Z min=0.0368 2026-02-21T08:49:21.1289514Z mid=0.0471 2026-02-21T08:49:21.1295328Z max=0.2550 2026-02-21T08:49:21.1300209Z best={'block_sizes': [1, 16384], 2026-02-21T08:49:21.1302313Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:49:21.1302601Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:49:21.1302799Z 'num_sm_multiplier': 32, 2026-02-21T08:49:21.1302969Z 'num_stages': 5, 2026-02-21T08:49:21.1303108Z 'num_warps': 1, 2026-02-21T08:49:21.1303273Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:49:21.1303466Z 'range_flattens': [True, True], 2026-02-21T08:49:21.1303648Z 'range_multi_buffers': [False, True], 2026-02-21T08:49:21.1303841Z 'range_num_stages': [3, 2], 2026-02-21T08:49:21.1304006Z 'range_unroll_factors': [0, 3], 2026-02-21T08:49:21.1304655Z 'range_warp_specializes': [True, None]} 2026-02-21T08:49:21.1304886Z [140s] Fitting surrogate: 408 points, 408 targets 2026-02-21T08:49:21.4886343Z [141s] Generation 6 starting: 16 neighbors, 1 active search path(s) 2026-02-21T08:49:24.5440345Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 5.7 configs/s 2026-02-21T08:49:25.5205522Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 16/16 17.2 configs/s 2026-02-21T08:49:26.9769639Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 690.0 2026-02-21T08:49:26.9770423Z configs/s 2026-02-21T08:49:27.0896231Z [146s] Generation 6 complete: 2026-02-21T08:49:27.0901278Z ok=18 2026-02-21T08:49:27.0905970Z min=0.0368 2026-02-21T08:49:27.0910685Z mid=0.0389 2026-02-21T08:49:27.0914878Z max=0.0532 2026-02-21T08:49:27.0917120Z best={'block_sizes': [1, 16384], 2026-02-21T08:49:27.0917436Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:49:27.0917727Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:49:27.0917926Z 'num_sm_multiplier': 32, 2026-02-21T08:49:27.0918092Z 'num_stages': 5, 2026-02-21T08:49:27.0918234Z 
'num_warps': 1, 2026-02-21T08:49:27.0918390Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:49:27.0918587Z 'range_flattens': [True, True], 2026-02-21T08:49:27.0918760Z 'range_multi_buffers': [False, True], 2026-02-21T08:49:27.0918946Z 'range_num_stages': [3, 2], 2026-02-21T08:49:27.0919105Z 'range_unroll_factors': [0, 3], 2026-02-21T08:49:27.0919285Z 'range_warp_specializes': [True, None]} 2026-02-21T08:49:27.0919495Z [146s] Fitting surrogate: 426 points, 426 targets 2026-02-21T08:49:27.4177499Z [147s] Generation 7 starting: 12 neighbors, 1 active search path(s) 2026-02-21T08:49:29.9281495Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 9.9 configs/s 2026-02-21T08:49:30.7225701Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 13/13 17.4 configs/s 2026-02-21T08:49:31.6850194Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1040.2 2026-02-21T08:49:31.6851933Z configs/s 2026-02-21T08:49:31.7596539Z [151s] Generation 7 complete: 2026-02-21T08:49:31.7598647Z ok=14 2026-02-21T08:49:31.7598861Z min=0.0368 2026-02-21T08:49:31.7603756Z mid=0.0389 2026-02-21T08:49:31.7605742Z max=0.0584 2026-02-21T08:49:31.7605923Z best={'block_sizes': [1, 16384], 2026-02-21T08:49:31.7606195Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:49:31.7606474Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:49:31.7606669Z 'num_stages': 4, 2026-02-21T08:49:31.7606825Z 'num_warps': 8, 2026-02-21T08:49:31.7606971Z 'pid_type': 'flat', 2026-02-21T08:49:31.7607139Z 'range_flattens': [None, True], 2026-02-21T08:49:31.7607330Z 'range_multi_buffers': [None, True], 2026-02-21T08:49:31.7607534Z 'range_num_stages': [0, 2], 2026-02-21T08:49:31.7608109Z 'range_unroll_factors': [0, 1], 2026-02-21T08:49:31.7608321Z 'range_warp_specializes': [None, True]} 2026-02-21T08:49:31.7622913Z [151s] Fitting surrogate: 440 points, 440 targets 2026-02-21T08:49:32.1029558Z [151s] Generation 8 starting: 12 neighbors, 1 active 
search path(s) 2026-02-21T08:49:34.6972166Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 7.7 configs/s 2026-02-21T08:49:35.4999706Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 13/13 17.1 configs/s 2026-02-21T08:49:36.6383299Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 879.3 2026-02-21T08:49:36.6387371Z configs/s 2026-02-21T08:49:36.7310791Z [156s] Generation 8 complete: 2026-02-21T08:49:36.7312019Z ok=14 2026-02-21T08:49:36.7312158Z min=0.0369 2026-02-21T08:49:36.7312311Z mid=0.0389 2026-02-21T08:49:36.7312434Z max=0.0532 2026-02-21T08:49:36.7312599Z best={'block_sizes': [1, 16384], 2026-02-21T08:49:36.7312910Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:49:36.7313736Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:49:36.7313937Z 'num_sm_multiplier': 32, 2026-02-21T08:49:36.7314103Z 'num_stages': 4, 2026-02-21T08:49:36.7314248Z 'num_warps': 8, 2026-02-21T08:49:36.7314405Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:49:36.7314602Z 'range_flattens': [False, True], 2026-02-21T08:49:36.7314790Z 'range_multi_buffers': [False, True], 2026-02-21T08:49:36.7314972Z 'range_num_stages': [0, 2], 2026-02-21T08:49:36.7315145Z 'range_unroll_factors': [1, 1], 2026-02-21T08:49:36.7315319Z 'range_warp_specializes': [True, None]} 2026-02-21T08:49:36.7321165Z [156s] Fitting surrogate: 454 points, 454 targets 2026-02-21T08:49:37.1104946Z [156s] Generation 9 starting: 19 neighbors, 1 active search path(s) 2026-02-21T08:49:45.8761107Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 2.0 configs/s 2026-02-21T08:49:47.2701470Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 19/19 13.9 configs/s 2026-02-21T08:49:48.3116886Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 962.8 2026-02-21T08:49:48.3117334Z configs/s 2026-02-21T08:49:48.3946366Z [168s] Generation 9 complete: 2026-02-21T08:49:48.3946605Z ok=20 
2026-02-21T08:49:48.3946743Z min=0.0369 2026-02-21T08:49:48.3946869Z mid=0.0410 2026-02-21T08:49:48.3946997Z max=0.1352 2026-02-21T08:49:48.3947131Z best={'block_sizes': [1, 16384], 2026-02-21T08:49:48.3947389Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:49:48.3947647Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:49:48.3947849Z 'num_sm_multiplier': 32, 2026-02-21T08:49:48.3948009Z 'num_stages': 4, 2026-02-21T08:49:48.3948160Z 'num_warps': 8, 2026-02-21T08:49:48.3948322Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:49:48.3948513Z 'range_flattens': [False, True], 2026-02-21T08:49:48.3949184Z 'range_multi_buffers': [False, True], 2026-02-21T08:49:48.3962488Z 'range_num_stages': [0, 2], 2026-02-21T08:49:48.3962720Z 'range_unroll_factors': [1, 1], 2026-02-21T08:49:48.3962913Z 'range_warp_specializes': [True, None]} 2026-02-21T08:49:48.3963138Z [168s] Fitting surrogate: 474 points, 474 targets 2026-02-21T08:49:48.5966868Z [168s] Autotuning complete in 168.2s after searching 462 configs. 
2026-02-21T08:49:48.5969294Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:49:48.5970353Z @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], num_sm_multiplier=32, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, True], range_num_stages=[0, 2], range_unroll_factors=[1, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:49:48.5971292Z 2026-02-21T08:49:48.5972280Z [168s] Code of selected kernel: /tmp/torchinductor_root/im/cimneque2jlxruvi7ivem6r5lzucrbfserm5ratl2hy54aqmngsj.py 2026-02-21T08:49:49.6363822Z WARNING:tritonbench.utils.triton_op:Completed input ID 71: 2026-02-21T08:49:49.6365487Z (M, N) 2026-02-21T08:49:49.6365650Z ------------ 2026-02-21T08:49:49.6365805Z (4096, 9344) 2026-02-21T08:49:49.6365885Z 2026-02-21T08:49:49.6376196Z 75%|███████▌ | 15/20 [40:47<14:05, 169.19s/it]WARNING:tritonbench.utils.triton_op:Running input ID 77: 2026-02-21T08:49:49.6378502Z (M, N) 2026-02-21T08:49:49.6378721Z ------------- 2026-02-21T08:49:49.6382188Z (4096, 10112) 2026-02-21T08:49:49.6382545Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T08:49:50.7998756Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T08:49:52.1452584Z INFO:tritonbench.utils.triton_op:Took 2.32ms to get benchmark function for torch_compile_softmax 2026-02-21T08:49:55.9499367Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:49:55.9501075Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:49:55.9501299Z 'dtype': 'torch.float16', 2026-02-21T08:49:55.9501496Z 'shape': (4096, 10112), 2026-02-21T08:49:55.9501965Z 'stride': (10112, 1)},), 2026-02-21T08:49:55.9502142Z 'kwargs': {}} 2026-02-21T08:49:55.9518570Z INFO:tritonbench.utils.triton_op:Took 2.63ms to get benchmark 
function for helion_softmax_tritonbench 2026-02-21T08:49:56.1330568Z [0s] Autotune random seed: 2134816249 2026-02-21T08:49:56.2818441Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:50:36.3632328Z [40s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[4, 1], range_warp_specializes=[None, None]) 2026-02-21T08:50:36.3648897Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.3 configs/s 2026-02-21T08:50:40.7471245Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T08:50:40.7475458Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:50:40.7476710Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:50:40.7477103Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:50:40.7477538Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:50:40.7477914Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:50:40.7478299Z %cst = arith.constant dense<10112> : tensor<128x1xi32> 2026-02-21T08:50:40.7478806Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<128x256xf32> 2026-02-21T08:50:40.7479334Z %cst_1 = arith.constant dense<0xFC00> : tensor<128x256xf16> 2026-02-21T08:50:40.7480119Z %cst_2 = arith.constant dense<10112> : tensor<256xi32> 2026-02-21T08:50:40.7480610Z %cst_3 = arith.constant dense<0.000000e+00> : tensor<128xf32> 2026-02-21T08:50:40.7481083Z %cst_4 = arith.constant dense<0xFF800000> : tensor<128xf32> 2026-02-21T08:50:40.7481523Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:50:40.7481912Z %c4096_i32 = arith.constant 4096 : i32 
2026-02-21T08:50:40.7482303Z %c10112_i32 = arith.constant 10112 : i32 2026-02-21T08:50:40.7482695Z %c10112_i64 = arith.constant 10112 : i64 2026-02-21T08:50:40.7483035Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:50:40.7483639Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c10112_i32], [%c10112_i64, %c1_i64] : , > 2026-02-21T08:50:40.7484234Z %1 = tt.get_program_id x : i32 2026-02-21T08:50:40.7484598Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T08:50:40.7484901Z %3 = arith.minsi %2, %c32_i32 : i32 2026-02-21T08:50:40.7485467Z scf.for %arg2 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T08:50:40.7485844Z %4 = arith.muli %arg2, %c128_i32 : i32 2026-02-21T08:50:40.7486299Z %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:50:40.7486789Z %6 = tt.splat %4 : i32 -> tensor<128xi32> 2026-02-21T08:50:40.7487151Z %7 = arith.addi %6, %5 : tensor<128xi32> 2026-02-21T08:50:40.7487535Z %c9984_i32 = arith.constant 9984 : i32 2026-02-21T08:50:40.7487851Z %c768_i32 = arith.constant 768 : i32 2026-02-21T08:50:40.7488549Z %8:2 = scf.for %arg3 = %c0_i32 to %c9984_i32 step %c768_i32 iter_args(%arg4 = %cst_4, %arg5 = %cst_3) -> (tensor<128xf32>, tensor<128xf32>) : i32 { 2026-02-21T08:50:40.7489330Z %60 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:50:40.7489797Z %61 = tt.splat %arg3 : i32 -> tensor<256xi32> 2026-02-21T08:50:40.7490203Z %62 = arith.addi %61, %60 : tensor<256xi32> 2026-02-21T08:50:40.7490586Z %63 = arith.cmpi slt, %62, %cst_2 : tensor<256xi32> 2026-02-21T08:50:40.7491163Z %64 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:50:40.7491843Z %65 = tt.expand_dims %63 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:50:40.7492419Z %66 = tt.broadcast %65 : tensor<1x256xi1> -> tensor<128x256xi1> 2026-02-21T08:50:40.7492988Z %67 = arith.select %66, %64, %cst_1 : tensor<128x256xi1>, tensor<128x256xf16> 2026-02-21T08:50:40.7493489Z %68 = 
arith.extf %67 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:50:40.7494003Z %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({ 2026-02-21T08:50:40.7494375Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:50:40.7494770Z %145 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:50:40.7495149Z tt.reduce.return %145 : f32 2026-02-21T08:50:40.7495553Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:50:40.7496034Z %70 = arith.truncf %69 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:50:40.7496513Z %71 = arith.extf %70 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:50:40.7497000Z %72 = arith.cmpf ogt, %arg4, %71 : tensor<128xf32> 2026-02-21T08:50:40.7497436Z %73 = arith.cmpf une, %arg4, %arg4 : tensor<128xf32> 2026-02-21T08:50:40.7497875Z %74 = arith.ori %72, %73 : tensor<128xi1> 2026-02-21T08:50:40.7498323Z %75 = arith.select %74, %arg4, %71 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:50:40.7498835Z %76 = arith.subf %arg4, %75 : tensor<128xf32> 2026-02-21T08:50:40.7499553Z %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:50:40.7500253Z %78 = arith.mulf %arg5, %77 : tensor<128xf32> 2026-02-21T08:50:40.7500772Z %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:50:40.7501323Z %80 = arith.extf %64 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:50:40.7502054Z %81 = tt.broadcast %79 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:50:40.7502463Z %82 = arith.subf %80, %81 : tensor<128x256xf32> 2026-02-21T08:50:40.7503149Z %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:50:40.7503943Z %84 = arith.select %66, %83, %cst_0 : tensor<128x256xi1>, tensor<128x256xf32> 2026-02-21T08:50:40.7504358Z %85 = "tt.reduce"(%84) <{axis = 1 : i32}> ({ 2026-02-21T08:50:40.7504670Z ^bb0(%arg6: f32, %arg7: 
f32): 2026-02-21T08:50:40.7504955Z %145 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:50:40.7505258Z tt.reduce.return %145 : f32 2026-02-21T08:50:40.7505563Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:50:40.7505878Z %86 = arith.addf %78, %85 : tensor<128xf32> 2026-02-21T08:50:40.7506312Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T08:50:40.7506611Z %87 = arith.muli %c256_i32, %c1_i32_7 : i32 2026-02-21T08:50:40.7506921Z %88 = arith.addi %arg3, %87 : i32 2026-02-21T08:50:40.7507281Z %89 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:50:40.7507692Z %90 = tt.splat %88 : i32 -> tensor<256xi32> 2026-02-21T08:50:40.7508007Z %91 = arith.addi %90, %89 : tensor<256xi32> 2026-02-21T08:50:40.7508354Z %92 = arith.cmpi slt, %91, %cst_2 : tensor<256xi32> 2026-02-21T08:50:40.7508858Z %93 = tt.descriptor_load %0[%4, %88] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:50:40.7509420Z %94 = tt.expand_dims %92 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:50:40.7509898Z %95 = tt.broadcast %94 : tensor<1x256xi1> -> tensor<128x256xi1> 2026-02-21T08:50:40.7510350Z %96 = arith.select %95, %93, %cst_1 : tensor<128x256xi1>, tensor<128x256xf16> 2026-02-21T08:50:40.7510826Z %97 = arith.extf %96 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:50:40.7511210Z %98 = "tt.reduce"(%97) <{axis = 1 : i32}> ({ 2026-02-21T08:50:40.7511504Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:50:40.7511832Z %145 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:50:40.7512142Z tt.reduce.return %145 : f32 2026-02-21T08:50:40.7512447Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:50:40.7512804Z %99 = arith.truncf %98 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:50:40.7513219Z %100 = arith.extf %99 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:50:40.7513605Z %101 = arith.cmpf ogt, %75, %100 : tensor<128xf32> 2026-02-21T08:50:40.7513959Z %102 = arith.cmpf une, %75, %75 : tensor<128xf32> 
2026-02-21T08:50:40.7514299Z %103 = arith.ori %101, %102 : tensor<128xi1> 2026-02-21T08:50:40.7514687Z %104 = arith.select %103, %75, %100 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:50:40.7515098Z %105 = arith.subf %75, %104 : tensor<128xf32> 2026-02-21T08:50:40.7515697Z %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:50:40.7516280Z %107 = arith.mulf %86, %106 : tensor<128xf32> 2026-02-21T08:50:40.7516701Z %108 = tt.expand_dims %104 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:50:40.7517186Z %109 = arith.extf %93 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:50:40.7517638Z %110 = tt.broadcast %108 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:50:40.7518042Z %111 = arith.subf %109, %110 : tensor<128x256xf32> 2026-02-21T08:50:40.7518686Z %112 = tt.extern_elementwise %111 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:50:40.7519413Z %113 = arith.select %95, %112, %cst_0 : tensor<128x256xi1>, tensor<128x256xf32> 2026-02-21T08:50:40.7519957Z %114 = "tt.reduce"(%113) <{axis = 1 : i32}> ({ 2026-02-21T08:50:40.7520261Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:50:40.7520541Z %145 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:50:40.7520840Z tt.reduce.return %145 : f32 2026-02-21T08:50:40.7521134Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:50:40.7521462Z %115 = arith.addf %107, %114 : tensor<128xf32> 2026-02-21T08:50:40.7521809Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:50:40.7522117Z %116 = arith.muli %c256_i32, %c2_i32 : i32 2026-02-21T08:50:40.7522430Z %117 = arith.addi %arg3, %116 : i32 2026-02-21T08:50:40.7522804Z %118 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:50:40.7523219Z %119 = tt.splat %117 : i32 -> tensor<256xi32> 2026-02-21T08:50:40.7523662Z %120 = arith.addi %119, 
%118 : tensor<256xi32> 2026-02-21T08:50:40.7524026Z %121 = arith.cmpi slt, %120, %cst_2 : tensor<256xi32> 2026-02-21T08:50:40.7524532Z %122 = tt.descriptor_load %0[%4, %117] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:50:40.7525106Z %123 = tt.expand_dims %121 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:50:40.7525598Z %124 = tt.broadcast %123 : tensor<1x256xi1> -> tensor<128x256xi1> 2026-02-21T08:50:40.7526073Z %125 = arith.select %124, %122, %cst_1 : tensor<128x256xi1>, tensor<128x256xf16> 2026-02-21T08:50:40.7526560Z %126 = arith.extf %125 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:50:40.7526950Z %127 = "tt.reduce"(%126) <{axis = 1 : i32}> ({ 2026-02-21T08:50:40.7527259Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:50:40.7527555Z %145 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:50:40.7527858Z tt.reduce.return %145 : f32 2026-02-21T08:50:40.7528164Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:50:40.7528530Z %128 = arith.truncf %127 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:50:40.7528949Z %129 = arith.extf %128 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:50:40.7529321Z %130 = arith.cmpf ogt, %104, %129 : tensor<128xf32> 2026-02-21T08:50:40.7529682Z %131 = arith.cmpf une, %104, %104 : tensor<128xf32> 2026-02-21T08:50:40.7530021Z %132 = arith.ori %130, %131 : tensor<128xi1> 2026-02-21T08:50:40.7530411Z %133 = arith.select %132, %104, %129 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:50:40.7530820Z %134 = arith.subf %104, %133 : tensor<128xf32> 2026-02-21T08:50:40.7531420Z %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:50:40.7532080Z %136 = arith.mulf %115, %135 : tensor<128xf32> 2026-02-21T08:50:40.7532500Z %137 = tt.expand_dims %133 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:50:40.7533005Z %138 = arith.extf %122 : tensor<128x256xf16> to 
tensor<128x256xf32> 2026-02-21T08:50:40.7533458Z %139 = tt.broadcast %137 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:50:40.7533845Z %140 = arith.subf %138, %139 : tensor<128x256xf32> 2026-02-21T08:50:40.7534489Z %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:50:40.7535213Z %142 = arith.select %124, %141, %cst_0 : tensor<128x256xi1>, tensor<128x256xf32> 2026-02-21T08:50:40.7535672Z %143 = "tt.reduce"(%142) <{axis = 1 : i32}> ({ 2026-02-21T08:50:40.7535996Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:50:40.7536297Z %145 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:50:40.7536620Z tt.reduce.return %145 : f32 2026-02-21T08:50:40.7536934Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:50:40.7537291Z %144 = arith.addf %136, %143 : tensor<128xf32> 2026-02-21T08:50:40.7541161Z scf.yield %133, %144 : tensor<128xf32>, tensor<128xf32> 2026-02-21T08:50:40.7541527Z } {tt.num_stages = 1 : i32} 2026-02-21T08:50:40.7541925Z %9 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:50:40.7542361Z %10 = tt.splat %c9984_i32 : i32 -> tensor<256xi32> 2026-02-21T08:50:40.7542728Z %11 = arith.addi %10, %9 : tensor<256xi32> 2026-02-21T08:50:40.7543077Z %12 = arith.cmpi slt, %11, %cst_2 : tensor<256xi32> 2026-02-21T08:50:40.7543624Z %13 = tt.descriptor_load %0[%4, %c9984_i32] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:50:40.7544246Z %14 = tt.expand_dims %12 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:50:40.7544722Z %15 = tt.broadcast %14 : tensor<1x256xi1> -> tensor<128x256xi1> 2026-02-21T08:50:40.7545299Z %16 = arith.select %15, %13, %cst_1 : tensor<128x256xi1>, tensor<128x256xf16> 2026-02-21T08:50:40.7545760Z %17 = arith.extf %16 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:50:40.7546142Z %18 = "tt.reduce"(%17) <{axis = 1 : i32}> ({ 2026-02-21T08:50:40.7546440Z ^bb0(%arg3: 
f32, %arg4: f32): 2026-02-21T08:50:40.7546734Z %60 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:50:40.7547032Z tt.reduce.return %60 : f32 2026-02-21T08:50:40.7547330Z }) : (tensor<128x256xf32>) -> tensor<128xf32> 2026-02-21T08:50:40.7547688Z %19 = arith.truncf %18 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:50:40.7548088Z %20 = arith.extf %19 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:50:40.7548507Z %21 = arith.cmpf ogt, %8#0, %20 : tensor<128xf32> 2026-02-21T08:50:40.7548861Z %22 = arith.cmpf une, %8#0, %8#0 : tensor<128xf32> 2026-02-21T08:50:40.7549186Z %23 = arith.ori %21, %22 : tensor<128xi1> 2026-02-21T08:50:40.7549562Z %24 = arith.select %23, %8#0, %20 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:50:40.7549929Z %25 = arith.subf %8#0, %24 : tensor<128xf32> 2026-02-21T08:50:40.7550516Z %26 = tt.extern_elementwise %25 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:50:40.7551115Z %27 = arith.mulf %8#1, %26 : tensor<128xf32> 2026-02-21T08:50:40.7551514Z %28 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:50:40.7552039Z %29 = arith.extf %13 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:50:40.7552467Z %30 = tt.broadcast %28 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:50:40.7552862Z %31 = arith.subf %29, %30 : tensor<128x256xf32> 2026-02-21T08:50:40.7553491Z %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:50:40.7554195Z %33 = arith.select %15, %32, %cst_0 : tensor<128x256xi1>, tensor<128x256xf32> 2026-02-21T08:50:40.7554599Z %34 = "tt.reduce"(%33) <{axis = 1 : i32}> ({ 2026-02-21T08:50:40.7554898Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:50:40.7555183Z %60 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:50:40.7555474Z tt.reduce.return %60 : f32 2026-02-21T08:50:40.7555770Z }) : (tensor<128x256xf32>) 
-> tensor<128xf32> 2026-02-21T08:50:40.7556093Z %35 = arith.addf %27, %34 : tensor<128xf32> 2026-02-21T08:50:40.7556402Z %c9984_i32_5 = arith.constant 9984 : i32 2026-02-21T08:50:40.7556712Z %c768_i32_6 = arith.constant 768 : i32 2026-02-21T08:50:40.7557070Z scf.for %arg3 = %c0_i32 to %c9984_i32_5 step %c768_i32_6 : i32 { 2026-02-21T08:50:40.7557529Z %60 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:50:40.7557937Z %61 = tt.splat %arg3 : i32 -> tensor<256xi32> 2026-02-21T08:50:40.7558269Z %62 = arith.addi %61, %60 : tensor<256xi32> 2026-02-21T08:50:40.7558610Z %63 = arith.cmpi slt, %62, %cst_2 : tensor<256xi32> 2026-02-21T08:50:40.7559224Z %64 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:50:40.7559819Z %65 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:50:40.7560364Z %66 = arith.extf %64 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:50:40.7560871Z %67 = tt.broadcast %65 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:50:40.7561316Z %68 = arith.subf %66, %67 : tensor<128x256xf32> 2026-02-21T08:50:40.7562064Z %69 = tt.extern_elementwise %68 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:50:40.7562858Z %70 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:50:40.7563375Z %71 = tt.broadcast %70 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:50:40.7563910Z %72 = arith.divf %69, %71 : tensor<128x256xf32> 2026-02-21T08:50:40.7564349Z %73 = arith.truncf %72 : tensor<128x256xf32> to tensor<128x256xf16> 2026-02-21T08:50:40.7564916Z %74 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:50:40.7565442Z %75 = arith.muli %74, %cst : tensor<128x1xi32> 2026-02-21T08:50:40.7565903Z %76 = tt.expand_dims %62 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 
2026-02-21T08:50:40.7566459Z %77 = tt.broadcast %75 : tensor<128x1xi32> -> tensor<128x256xi32> 2026-02-21T08:50:40.7566944Z %78 = tt.broadcast %76 : tensor<1x256xi32> -> tensor<128x256xi32> 2026-02-21T08:50:40.7567413Z %79 = arith.addi %77, %78 : tensor<128x256xi32> 2026-02-21T08:50:40.7567849Z %80 = tt.splat %arg1 : !tt.ptr -> tensor<128x256x!tt.ptr> 2026-02-21T08:50:40.7568395Z %81 = tt.addptr %80, %79 : tensor<128x256x!tt.ptr>, tensor<128x256xi32> 2026-02-21T08:50:40.7568972Z %82 = tt.expand_dims %63 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:50:40.7569497Z %83 = tt.broadcast %82 : tensor<1x256xi1> -> tensor<128x256xi1> 2026-02-21T08:50:40.7569986Z tt.store %81, %73, %83 : tensor<128x256x!tt.ptr> 2026-02-21T08:50:40.7570378Z %c1_i32_7 = arith.constant 1 : i32 2026-02-21T08:50:40.7570772Z %84 = arith.muli %c256_i32, %c1_i32_7 : i32 2026-02-21T08:50:40.7571129Z %85 = arith.addi %arg3, %84 : i32 2026-02-21T08:50:40.7571608Z %86 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:50:40.7572093Z %87 = tt.splat %85 : i32 -> tensor<256xi32> 2026-02-21T08:50:40.7572463Z %88 = arith.addi %87, %86 : tensor<256xi32> 2026-02-21T08:50:40.7572888Z %89 = arith.cmpi slt, %88, %cst_2 : tensor<256xi32> 2026-02-21T08:50:40.7573429Z %90 = tt.descriptor_load %0[%4, %85] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:50:40.7574090Z %91 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:50:40.7574625Z %92 = arith.extf %90 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:50:40.7575136Z %93 = tt.broadcast %91 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:50:40.7575591Z %94 = arith.subf %92, %93 : tensor<128x256xf32> 2026-02-21T08:50:40.7576256Z %95 = tt.extern_elementwise %94 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:50:40.7577041Z %96 = tt.expand_dims %35 {axis = 1 : i32} : 
tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:50:40.7577565Z %97 = tt.broadcast %96 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:50:40.7578038Z %98 = arith.divf %95, %97 : tensor<128x256xf32> 2026-02-21T08:50:40.7578508Z %99 = arith.truncf %98 : tensor<128x256xf32> to tensor<128x256xf16> 2026-02-21T08:50:40.7579042Z %100 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:50:40.7579708Z %101 = arith.muli %100, %cst : tensor<128x1xi32> 2026-02-21T08:50:40.7580210Z %102 = tt.expand_dims %88 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T08:50:40.7580815Z %103 = tt.broadcast %101 : tensor<128x1xi32> -> tensor<128x256xi32> 2026-02-21T08:50:40.7581320Z %104 = tt.broadcast %102 : tensor<1x256xi32> -> tensor<128x256xi32> 2026-02-21T08:50:40.7581871Z %105 = arith.addi %103, %104 : tensor<128x256xi32> 2026-02-21T08:50:40.7582380Z %106 = tt.splat %arg1 : !tt.ptr -> tensor<128x256x!tt.ptr> 2026-02-21T08:50:40.7582934Z %107 = tt.addptr %106, %105 : tensor<128x256x!tt.ptr>, tensor<128x256xi32> 2026-02-21T08:50:40.7583560Z %108 = tt.expand_dims %89 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:50:40.7584118Z %109 = tt.broadcast %108 : tensor<1x256xi1> -> tensor<128x256xi1> 2026-02-21T08:50:40.7584747Z tt.store %107, %99, %109 : tensor<128x256x!tt.ptr> 2026-02-21T08:50:40.7585201Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:50:40.7585572Z %110 = arith.muli %c256_i32, %c2_i32 : i32 2026-02-21T08:50:40.7585987Z %111 = arith.addi %arg3, %110 : i32 2026-02-21T08:50:40.7586430Z %112 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:50:40.7586954Z %113 = tt.splat %111 : i32 -> tensor<256xi32> 2026-02-21T08:50:40.7587353Z %114 = arith.addi %113, %112 : tensor<256xi32> 2026-02-21T08:50:40.7587787Z %115 = arith.cmpi slt, %114, %cst_2 : tensor<256xi32> 2026-02-21T08:50:40.7588375Z %116 = tt.descriptor_load %0[%4, %111] : !tt.tensordesc> -> tensor<128x256xf16> 
2026-02-21T08:50:40.7588999Z %117 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:50:40.7589538Z %118 = arith.extf %116 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:50:40.7590034Z %119 = tt.broadcast %117 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:50:40.7590485Z %120 = arith.subf %118, %119 : tensor<128x256xf32> 2026-02-21T08:50:40.7591149Z %121 = tt.extern_elementwise %120 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:50:40.7591987Z %122 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:50:40.7592572Z %123 = tt.broadcast %122 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:50:40.7593027Z %124 = arith.divf %121, %123 : tensor<128x256xf32> 2026-02-21T08:50:40.7593514Z %125 = arith.truncf %124 : tensor<128x256xf32> to tensor<128x256xf16> 2026-02-21T08:50:40.7594050Z %126 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:50:40.7594569Z %127 = arith.muli %126, %cst : tensor<128x1xi32> 2026-02-21T08:50:40.7595088Z %128 = tt.expand_dims %114 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T08:50:40.7595626Z %129 = tt.broadcast %127 : tensor<128x1xi32> -> tensor<128x256xi32> 2026-02-21T08:50:40.7596154Z %130 = tt.broadcast %128 : tensor<1x256xi32> -> tensor<128x256xi32> 2026-02-21T08:50:40.7596610Z %131 = arith.addi %129, %130 : tensor<128x256xi32> 2026-02-21T08:50:40.7597094Z %132 = tt.splat %arg1 : !tt.ptr -> tensor<128x256x!tt.ptr> 2026-02-21T08:50:40.7597628Z %133 = tt.addptr %132, %131 : tensor<128x256x!tt.ptr>, tensor<128x256xi32> 2026-02-21T08:50:40.7598232Z %134 = tt.expand_dims %115 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:50:40.7598790Z %135 = tt.broadcast %134 : tensor<1x256xi1> -> tensor<128x256xi1> 2026-02-21T08:50:40.7599245Z tt.store %133, %125, %135 : tensor<128x256x!tt.ptr> 
2026-02-21T08:50:40.7599645Z } {tt.num_stages = 1 : i32} 2026-02-21T08:50:40.7600049Z %36 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:50:40.7600667Z %37 = tt.splat %c9984_i32_5 : i32 -> tensor<256xi32> 2026-02-21T08:50:40.7601096Z %38 = arith.addi %37, %36 : tensor<256xi32> 2026-02-21T08:50:40.7601486Z %39 = arith.cmpi slt, %38, %cst_2 : tensor<256xi32> 2026-02-21T08:50:40.7602151Z %40 = tt.descriptor_load %0[%4, %c9984_i32_5] : !tt.tensordesc> -> tensor<128x256xf16> 2026-02-21T08:50:40.7602806Z %41 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:50:40.7603362Z %42 = arith.extf %40 : tensor<128x256xf16> to tensor<128x256xf32> 2026-02-21T08:50:40.7603829Z %43 = tt.broadcast %41 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:50:40.7604295Z %44 = arith.subf %42, %43 : tensor<128x256xf32> 2026-02-21T08:50:40.7605121Z %45 = tt.extern_elementwise %44 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32> 2026-02-21T08:50:40.7605883Z %46 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:50:40.7606439Z %47 = tt.broadcast %46 : tensor<128x1xf32> -> tensor<128x256xf32> 2026-02-21T08:50:40.7606873Z %48 = arith.divf %45, %47 : tensor<128x256xf32> 2026-02-21T08:50:40.7607338Z %49 = arith.truncf %48 : tensor<128x256xf32> to tensor<128x256xf16> 2026-02-21T08:50:40.7607882Z %50 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:50:40.7608365Z %51 = arith.muli %50, %cst : tensor<128x1xi32> 2026-02-21T08:50:40.7608860Z %52 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32> 2026-02-21T08:50:40.7609376Z %53 = tt.broadcast %51 : tensor<128x1xi32> -> tensor<128x256xi32> 2026-02-21T08:50:40.7609891Z %54 = tt.broadcast %52 : tensor<1x256xi32> -> tensor<128x256xi32> 2026-02-21T08:50:40.7610325Z %55 = arith.addi %53, %54 : 
tensor<128x256xi32> 2026-02-21T08:50:40.7610795Z %56 = tt.splat %arg1 : !tt.ptr -> tensor<128x256x!tt.ptr> 2026-02-21T08:50:40.7611332Z %57 = tt.addptr %56, %55 : tensor<128x256x!tt.ptr>, tensor<128x256xi32> 2026-02-21T08:50:40.7611904Z %58 = tt.expand_dims %39 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1> 2026-02-21T08:50:40.7612312Z %59 = tt.broadcast %58 : tensor<1x256xi1> -> tensor<128x256xi1> 2026-02-21T08:50:40.7612602Z tt.store %57, %49, %59 : tensor<128x256x!tt.ptr> 2026-02-21T08:50:40.7612901Z } {tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:50:40.7613134Z tt.return 2026-02-21T08:50:40.7613328Z } 2026-02-21T08:50:40.7613514Z } 2026-02-21T08:50:40.7613604Z 2026-02-21T08:50:40.7613676Z {-# 2026-02-21T08:50:40.7613852Z external_resources: { 2026-02-21T08:50:40.7614049Z mlir_reproducer: { 2026-02-21T08:50:40.7621764Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal 
test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:50:40.7630250Z disable_threading: false, 2026-02-21T08:50:40.7630587Z verify_each: true 2026-02-21T08:50:40.7630908Z } 2026-02-21T08:50:40.7631133Z } 2026-02-21T08:50:40.7631386Z #-} 2026-02-21T08:50:40.7632240Z /tmp/torchinductor_root/ev/cevvux4li7rnsh5wz7iwurivhxq66lvzlqufnv4e3fb3pym4xoo2.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:50:40.7634457Z /tmp/torchinductor_root/ev/cevvux4li7rnsh5wz7iwurivhxq66lvzlqufnv4e3fb3pym4xoo2.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:50:40.7636291Z [44s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:50:40.7638258Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=256, num_sm_multiplier=16, num_stages=8, num_warps=16, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:50:40.7640038Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:50:40.7640543Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:50:44.2124910Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 12.8 configs/s 2026-02-21T08:50:44.2135573Z [47s] Adaptive compile timeout: 30s (90% percentile=10.3s, bounds=[30.0s, 30s]) 2026-02-21T08:50:45.0400154Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1191.8 configs/s 2026-02-21T08:50:45.1036833Z [48s] Initial random population of 100, 5 starting points: 2026-02-21T08:50:45.1042155Z error=5 2026-02-21T08:50:45.1047135Z timeout=1 2026-02-21T08:50:45.1051479Z ok=94 2026-02-21T08:50:45.1053120Z min=0.0511 2026-02-21T08:50:45.1053367Z mid=0.9227 2026-02-21T08:50:45.1053539Z max=49.1755 2026-02-21T08:50:45.1053758Z best={'block_sizes': [1, 16384], 2026-02-21T08:50:45.1054031Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:50:45.1054336Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:50:45.1054611Z 'num_sm_multiplier': 8, 2026-02-21T08:50:45.1054853Z 'num_stages': 3, 2026-02-21T08:50:45.1055029Z 'num_warps': 1, 2026-02-21T08:50:45.1055249Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:50:45.1055479Z 'range_flattens': [False, None], 2026-02-21T08:50:45.1055729Z 'range_multi_buffers': [True, True], 2026-02-21T08:50:45.1055978Z 'range_num_stages': [1, 2], 2026-02-21T08:50:45.1056187Z 'range_unroll_factors': [0, 1], 2026-02-21T08:50:45.1056435Z 
'range_warp_specializes': [True, None]} 2026-02-21T08:50:45.1056688Z [48s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:50:46.1488572Z [49s] Generation 1 starting: 80 neighbors, 5 active search path(s) 2026-02-21T08:51:00.3485347Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 1.2 configs/s 2026-02-21T08:51:09.2137669Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 84/84 9.5 configs/s 2026-02-21T08:51:14.7880826Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 181.3 2026-02-21T08:51:14.7882522Z configs/s 2026-02-21T08:51:15.0754172Z [78s] Generation 1 complete: 2026-02-21T08:51:15.0756013Z ok=86 2026-02-21T08:51:15.0756274Z min=0.0471 2026-02-21T08:51:15.0756460Z mid=0.0655 2026-02-21T08:51:15.0756662Z max=0.3850 2026-02-21T08:51:15.0756854Z best={'block_sizes': [1, 16384], 2026-02-21T08:51:15.0757179Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:51:15.0757466Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:51:15.0757743Z 'num_sm_multiplier': 16, 2026-02-21T08:51:15.0757955Z 'num_stages': 3, 2026-02-21T08:51:15.0758177Z 'num_warps': 1, 2026-02-21T08:51:15.0758393Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:51:15.0758671Z 'range_flattens': [False, True], 2026-02-21T08:51:15.0758922Z 'range_multi_buffers': [True, True], 2026-02-21T08:51:15.0759150Z 'range_num_stages': [1, 2], 2026-02-21T08:51:15.0759356Z 'range_unroll_factors': [0, 1], 2026-02-21T08:51:15.0759903Z 'range_warp_specializes': [True, None]} 2026-02-21T08:51:15.0770986Z [78s] Fitting surrogate: 186 points, 186 targets 2026-02-21T08:51:16.1222051Z [79s] Generation 2 starting: 73 neighbors, 5 active search path(s) 2026-02-21T08:51:27.5429367Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 2.1 configs/s 2026-02-21T08:51:31.9270827Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 17.1 configs/s 2026-02-21T08:51:38.4827259Z Generation 2: verifying top configs 100% 
━━━━━━━━━━━━━━ 1000/1000 172.4 2026-02-21T08:51:38.4829573Z configs/s 2026-02-21T08:51:38.8119817Z [102s] Generation 2 complete: 2026-02-21T08:51:38.8123933Z ok=78 2026-02-21T08:51:38.8129031Z min=0.0451 2026-02-21T08:51:38.8130542Z mid=0.0573 2026-02-21T08:51:38.8130837Z max=0.2806 2026-02-21T08:51:38.8135696Z best={'block_sizes': [1, 16384], 2026-02-21T08:51:38.8140296Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:51:38.8144819Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:51:38.8149321Z 'num_sm_multiplier': 32, 2026-02-21T08:51:38.8153313Z 'num_stages': 4, 2026-02-21T08:51:38.8158598Z 'num_warps': 1, 2026-02-21T08:51:38.8163183Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:51:38.8165125Z 'range_flattens': [False, True], 2026-02-21T08:51:38.8165511Z 'range_multi_buffers': [True, True], 2026-02-21T08:51:38.8165795Z 'range_num_stages': [1, 2], 2026-02-21T08:51:38.8166085Z 'range_unroll_factors': [0, 1], 2026-02-21T08:51:38.8166349Z 'range_warp_specializes': [True, None]} 2026-02-21T08:51:38.8166785Z [102s] Fitting surrogate: 264 points, 264 targets 2026-02-21T08:51:39.7419093Z [103s] Generation 3 starting: 72 neighbors, 5 active search path(s) 2026-02-21T08:51:53.1800953Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 8.8 configs/s 2026-02-21T08:51:58.9119454Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 13.7 configs/s 2026-02-21T08:52:03.2253310Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 234.2 2026-02-21T08:52:03.2254774Z configs/s 2026-02-21T08:52:03.4755945Z [127s] Generation 3 complete: 2026-02-21T08:52:03.4760004Z ok=78 2026-02-21T08:52:03.4760305Z min=0.0470 2026-02-21T08:52:03.4760529Z mid=0.0616 2026-02-21T08:52:03.4760759Z max=0.2089 2026-02-21T08:52:03.4761019Z best={'block_sizes': [1, 16384], 2026-02-21T08:52:03.4761328Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:52:03.4762308Z 'load_eviction_policies': ['last', ''], 
2026-02-21T08:52:03.4762635Z 'num_sm_multiplier': 32, 2026-02-21T08:52:03.4762918Z 'num_stages': 4, 2026-02-21T08:52:03.4763121Z 'num_warps': 1, 2026-02-21T08:52:03.4763350Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:52:03.4763589Z 'range_flattens': [False, None], 2026-02-21T08:52:03.4763849Z 'range_multi_buffers': [True, True], 2026-02-21T08:52:03.4764113Z 'range_num_stages': [1, 2], 2026-02-21T08:52:03.4764761Z 'range_unroll_factors': [0, 1], 2026-02-21T08:52:03.4765075Z 'range_warp_specializes': [True, None]} 2026-02-21T08:52:03.4771054Z [127s] Fitting surrogate: 342 points, 342 targets 2026-02-21T08:52:04.5167703Z [128s] Generation 4 starting: 73 neighbors, 5 active search path(s) 2026-02-21T08:52:23.2237078Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 1.3 configs/s 2026-02-21T08:52:27.8813897Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 16.9 configs/s 2026-02-21T08:52:31.5934521Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 272.1 2026-02-21T08:52:31.5936676Z configs/s 2026-02-21T08:52:31.8340663Z [155s] Generation 4 complete: 2026-02-21T08:52:31.8345074Z ok=79 2026-02-21T08:52:31.8350800Z min=0.0389 2026-02-21T08:52:31.8352678Z mid=0.0554 2026-02-21T08:52:31.8353012Z max=0.2560 2026-02-21T08:52:31.8353680Z best={'block_sizes': [1, 16384], 2026-02-21T08:52:31.8354100Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:52:31.8360378Z 'load_eviction_policies': ['', ''], 2026-02-21T08:52:31.8363260Z 'num_sm_multiplier': 64, 2026-02-21T08:52:31.8363622Z 'num_stages': 5, 2026-02-21T08:52:31.8368228Z 'num_warps': 2, 2026-02-21T08:52:31.8370597Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:52:31.8370907Z 'range_flattens': [False, True], 2026-02-21T08:52:31.8371191Z 'range_multi_buffers': [False, False], 2026-02-21T08:52:31.8371443Z 'range_num_stages': [3, 1], 2026-02-21T08:52:31.8371801Z 'range_unroll_factors': [0, 2], 2026-02-21T08:52:31.8372068Z 
'range_warp_specializes': [True, None]} 2026-02-21T08:52:31.8372411Z [155s] Fitting surrogate: 421 points, 421 targets 2026-02-21T08:52:33.0531055Z [156s] Generation 5 starting: 76 neighbors, 5 active search path(s) 2026-02-21T08:52:51.8480958Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81/81 2.8 configs/s 2026-02-21T08:52:56.7467809Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 81/81 16.7 configs/s 2026-02-21T08:53:00.9418410Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 304.5 2026-02-21T08:53:00.9418852Z configs/s 2026-02-21T08:53:01.1560866Z [184s] Generation 5 complete: 2026-02-21T08:53:01.1565051Z ok=82 2026-02-21T08:53:01.1569611Z min=0.0369 2026-02-21T08:53:01.1574267Z mid=0.0615 2026-02-21T08:53:01.1578928Z max=0.3031 2026-02-21T08:53:01.1582947Z best={'block_sizes': [1, 16384], 2026-02-21T08:53:01.1586991Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:53:01.1588302Z 'load_eviction_policies': ['', ''], 2026-02-21T08:53:01.1588578Z 'num_sm_multiplier': 32, 2026-02-21T08:53:01.1588764Z 'num_stages': 5, 2026-02-21T08:53:01.1588969Z 'num_warps': 8, 2026-02-21T08:53:01.1589172Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:53:01.1589445Z 'range_flattens': [False, True], 2026-02-21T08:53:01.1589711Z 'range_multi_buffers': [False, False], 2026-02-21T08:53:01.1590387Z 'range_num_stages': [3, 1], 2026-02-21T08:53:01.1590600Z 'range_unroll_factors': [0, 1], 2026-02-21T08:53:01.1590861Z 'range_warp_specializes': [True, None]} 2026-02-21T08:53:01.1591144Z [184s] Fitting surrogate: 503 points, 503 targets 2026-02-21T08:53:02.0529588Z [185s] Generation 6 starting: 56 neighbors, 4 active search path(s) 2026-02-21T08:53:14.7028309Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60/60 3.9 configs/s 2026-02-21T08:53:18.6869741Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 60/60 15.2 configs/s 2026-02-21T08:53:20.8043378Z Generation 6: verifying top configs 100% 
━━━━━━━━━━━━━━ 1000/1000 475.3 2026-02-21T08:53:20.8046554Z configs/s 2026-02-21T08:53:20.9367445Z [204s] Generation 6 complete: 2026-02-21T08:53:20.9373416Z error=1 2026-02-21T08:53:20.9374968Z ok=60 2026-02-21T08:53:20.9375621Z min=0.0369 2026-02-21T08:53:20.9378366Z mid=0.0614 2026-02-21T08:53:20.9378608Z max=0.2376 2026-02-21T08:53:20.9378796Z best={'block_sizes': [1, 16384], 2026-02-21T08:53:20.9379095Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:53:20.9379365Z 'load_eviction_policies': ['', ''], 2026-02-21T08:53:20.9379618Z 'num_sm_multiplier': 32, 2026-02-21T08:53:20.9379844Z 'num_stages': 5, 2026-02-21T08:53:20.9380025Z 'num_warps': 8, 2026-02-21T08:53:20.9380250Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:53:20.9380481Z 'range_flattens': [False, True], 2026-02-21T08:53:20.9380732Z 'range_multi_buffers': [True, False], 2026-02-21T08:53:20.9380952Z 'range_num_stages': [3, 1], 2026-02-21T08:53:20.9381184Z 'range_unroll_factors': [0, 1], 2026-02-21T08:53:20.9381400Z 'range_warp_specializes': [True, None]} 2026-02-21T08:53:20.9392712Z [204s] Fitting surrogate: 564 points, 564 targets 2026-02-21T08:53:21.7409622Z [205s] Generation 7 starting: 44 neighbors, 3 active search path(s) 2026-02-21T08:53:32.8663061Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47/47 1.4 configs/s 2026-02-21T08:53:35.6734080Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 47/47 17.0 configs/s 2026-02-21T08:53:37.6774860Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 503.4 2026-02-21T08:53:37.6777386Z configs/s 2026-02-21T08:53:37.8091223Z [221s] Generation 7 complete: 2026-02-21T08:53:37.8092845Z ok=48 2026-02-21T08:53:37.8093063Z min=0.0369 2026-02-21T08:53:37.8093278Z mid=0.0554 2026-02-21T08:53:37.8093445Z max=0.2560 2026-02-21T08:53:37.8093659Z best={'block_sizes': [1, 16384], 2026-02-21T08:53:37.8093934Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:53:37.8094238Z 
'load_eviction_policies': ['', ''], 2026-02-21T08:53:37.8094473Z 'num_sm_multiplier': 32, 2026-02-21T08:53:37.8094704Z 'num_stages': 5, 2026-02-21T08:53:37.8094911Z 'num_warps': 8, 2026-02-21T08:53:37.8095155Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:53:37.8095441Z 'range_flattens': [False, True], 2026-02-21T08:53:37.8095662Z 'range_multi_buffers': [True, False], 2026-02-21T08:53:37.8095912Z 'range_num_stages': [3, 1], 2026-02-21T08:53:37.8096121Z 'range_unroll_factors': [0, 1], 2026-02-21T08:53:37.8096373Z 'range_warp_specializes': [True, None]} 2026-02-21T08:53:37.8114561Z [221s] Fitting surrogate: 612 points, 612 targets 2026-02-21T08:53:38.2034254Z [221s] Generation 8 starting: 10 neighbors, 1 active search path(s) 2026-02-21T08:53:42.5589477Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 3.3 configs/s 2026-02-21T08:53:43.2287872Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 15.9 configs/s 2026-02-21T08:53:44.0741063Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1176.6 2026-02-21T08:53:44.0741529Z configs/s 2026-02-21T08:53:44.1392341Z [227s] Generation 8 complete: 2026-02-21T08:53:44.1394426Z ok=12 2026-02-21T08:53:44.1394740Z min=0.0389 2026-02-21T08:53:44.1400728Z mid=0.0409 2026-02-21T08:53:44.1402582Z max=0.0798 2026-02-21T08:53:44.1402839Z best={'block_sizes': [1, 16384], 2026-02-21T08:53:44.1403125Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:53:44.1403440Z 'load_eviction_policies': ['', ''], 2026-02-21T08:53:44.1403672Z 'num_sm_multiplier': 32, 2026-02-21T08:53:44.1403909Z 'num_stages': 5, 2026-02-21T08:53:44.1404087Z 'num_warps': 8, 2026-02-21T08:53:44.1404321Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:53:44.1404586Z 'range_flattens': [False, True], 2026-02-21T08:53:44.1404810Z 'range_multi_buffers': [True, False], 2026-02-21T08:53:44.1405045Z 'range_num_stages': [3, 1], 2026-02-21T08:53:44.1405254Z 'range_unroll_factors': [0, 1], 
2026-02-21T08:53:44.1405508Z 'range_warp_specializes': [True, None]} 2026-02-21T08:53:44.1417577Z [227s] Fitting surrogate: 624 points, 624 targets 2026-02-21T08:53:44.4843439Z [228s] Generation 9 starting: 6 neighbors, 1 active search path(s) 2026-02-21T08:53:46.4425825Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6/6 3.0 configs/s 2026-02-21T08:53:46.8484579Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━━ 6/6 16.5 configs/s 2026-02-21T08:53:47.4898630Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1544.1 2026-02-21T08:53:47.4902887Z configs/s 2026-02-21T08:53:47.5447848Z [231s] Generation 9 complete: 2026-02-21T08:53:47.5451814Z ok=8 2026-02-21T08:53:47.5453127Z min=0.0389 2026-02-21T08:53:47.5453417Z mid=0.0408 2026-02-21T08:53:47.5458910Z max=0.0430 2026-02-21T08:53:47.5460332Z best={'block_sizes': [1, 16384], 2026-02-21T08:53:47.5460674Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:53:47.5460981Z 'load_eviction_policies': ['', ''], 2026-02-21T08:53:47.5461216Z 'num_sm_multiplier': 32, 2026-02-21T08:53:47.5461471Z 'num_stages': 5, 2026-02-21T08:53:47.5461732Z 'num_warps': 8, 2026-02-21T08:53:47.5461961Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:53:47.5462202Z 'range_flattens': [False, True], 2026-02-21T08:53:47.5462452Z 'range_multi_buffers': [True, False], 2026-02-21T08:53:47.5462677Z 'range_num_stages': [3, 1], 2026-02-21T08:53:47.5462911Z 'range_unroll_factors': [0, 1], 2026-02-21T08:53:47.5463130Z 'range_warp_specializes': [True, None]} 2026-02-21T08:53:47.5473310Z [231s] Fitting surrogate: 632 points, 632 targets 2026-02-21T08:53:47.8378046Z [231s] Autotuning complete in 231.6s after searching 616 configs. 
2026-02-21T08:53:47.8383369Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:53:47.8389289Z @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', ''], num_sm_multiplier=32, num_stages=5, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, False], range_num_stages=[3, 1], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:53:47.8390385Z 2026-02-21T08:53:47.8390907Z [231s] Code of selected kernel: /tmp/torchinductor_root/dq/cdqq7zp6bqkjpqkfmt7mzu4ydfou4qpeunr74tt4laftwcjdwbex.py 2026-02-21T08:53:48.9456211Z WARNING:tritonbench.utils.triton_op:Completed input ID 77: 2026-02-21T08:53:48.9458197Z (M, N) 2026-02-21T08:53:48.9458563Z ------------- 2026-02-21T08:53:48.9458829Z (4096, 10112) 2026-02-21T08:53:48.9463801Z 2026-02-21T08:53:48.9468591Z 80%|████████ | 16/20 [44:46<12:41, 190.29s/it]WARNING:tritonbench.utils.triton_op:Running input ID 82: 2026-02-21T08:53:48.9469338Z (M, N) 2026-02-21T08:53:48.9469535Z ------------- 2026-02-21T08:53:48.9469792Z (4096, 10752) 2026-02-21T08:53:48.9470153Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for naive_softmax 2026-02-21T08:53:50.1476764Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T08:53:51.4878186Z INFO:tritonbench.utils.triton_op:Took 2.33ms to get benchmark function for torch_compile_softmax 2026-02-21T08:53:56.2149995Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:53:56.2151858Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:53:56.2152210Z 'dtype': 'torch.float16', 2026-02-21T08:53:56.2152523Z 'shape': (4096, 10752), 2026-02-21T08:53:56.2152783Z 'stride': (10752, 1)},), 2026-02-21T08:53:56.2153068Z 'kwargs': {}} 2026-02-21T08:53:56.2176817Z INFO:tritonbench.utils.triton_op:Took 3.06ms to get benchmark function for 
helion_softmax_tritonbench 2026-02-21T08:53:56.4000372Z [0s] Autotune random seed: 2134816249 2026-02-21T08:53:59.0300278Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:54:40.3355677Z [41s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[4, 1], range_warp_specializes=[None, None]) 2026-02-21T08:54:40.3371997Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.3 configs/s 2026-02-21T08:54:41.4620093Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:54:41.4622361Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:54:41.4622998Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:54:41.4628559Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:54:41.4629948Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:54:41.4630303Z %cst = arith.constant dense<10752> : tensor<8x1xi32> 2026-02-21T08:54:41.4630643Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<8xf32> 2026-02-21T08:54:41.4630988Z %cst_1 = arith.constant dense<0xFF800000> : tensor<8xf32> 2026-02-21T08:54:41.4631271Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:54:41.4631503Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:54:41.4631843Z %c10752_i32 = arith.constant 10752 : i32 2026-02-21T08:54:41.4632076Z %c10752_i64 = arith.constant 10752 : i64 2026-02-21T08:54:41.4632338Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:54:41.4632705Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c10752_i32], [%c10752_i64, %c1_i64] : <f16>, <tensor<8x512xf16>>
2026-02-21T08:54:41.4633108Z %1 = tt.get_program_id x : i32 2026-02-21T08:54:41.4633394Z scf.for %arg2 = %1 to %c512_i32 step %c9472_i32 : i32 { 2026-02-21T08:54:41.4633657Z %2 = arith.muli %arg2, %c8_i32 : i32 2026-02-21T08:54:41.4633957Z %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:54:41.4634248Z %4 = tt.splat %2 : i32 -> tensor<8xi32> 2026-02-21T08:54:41.4634559Z %5 = arith.addi %4, %3 : tensor<8xi32> 2026-02-21T08:54:41.4634786Z %c10240_i32 = arith.constant 10240 : i32 2026-02-21T08:54:41.4635054Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:54:41.4635502Z %6:2 = scf.for %arg3 = %c0_i32 to %c10240_i32 step %c2048_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<8xf32>, tensor<8xf32>) : i32 { 2026-02-21T08:54:41.4636016Z %48 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:54:41.4636416Z %49 = arith.extf %48 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:54:41.4636695Z %50 = "tt.reduce"(%49) <{axis = 1 : i32}> ({ 2026-02-21T08:54:41.4636968Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:54:41.4637207Z %126 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:54:41.4637474Z tt.reduce.return %126 : f32 2026-02-21T08:54:41.4637732Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:54:41.4638326Z %51 = arith.truncf %50 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:54:41.4638679Z %52 = arith.extf %51 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:54:41.4638977Z %53 = arith.cmpf ogt, %arg4, %52 : tensor<8xf32> 2026-02-21T08:54:41.4639255Z %54 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32> 2026-02-21T08:54:41.4639540Z %55 = arith.ori %53, %54 : tensor<8xi1> 2026-02-21T08:54:41.4639838Z %56 = arith.select %55, %arg4, %52 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:54:41.4640113Z %57 = arith.subf %arg4, %56 : tensor<8xf32> 2026-02-21T08:54:41.4640540Z %58 = tt.extern_elementwise %57 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> 
tensor<8xf32> 2026-02-21T08:54:41.4640938Z %59 = arith.mulf %arg5, %58 : tensor<8xf32> 2026-02-21T08:54:41.4641259Z %60 = tt.expand_dims %56 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:54:41.4641768Z %61 = tt.broadcast %60 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:54:41.4642076Z %62 = arith.subf %49, %61 : tensor<8x512xf32> 2026-02-21T08:54:41.4642511Z %63 = tt.extern_elementwise %62 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:54:41.4642921Z %64 = "tt.reduce"(%63) <{axis = 1 : i32}> ({ 2026-02-21T08:54:41.4643178Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:54:41.4643403Z %126 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:54:41.4643667Z tt.reduce.return %126 : f32 2026-02-21T08:54:41.4643892Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:54:41.4644158Z %65 = arith.addf %59, %64 : tensor<8xf32> 2026-02-21T08:54:41.4644418Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:54:41.4644642Z %66 = arith.muli %c512_i32, %c1_i32 : i32 2026-02-21T08:54:41.4644916Z %67 = arith.addi %arg3, %66 : i32 2026-02-21T08:54:41.4645231Z %68 = tt.descriptor_load %0[%2, %67] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:54:41.4645609Z %69 = arith.extf %68 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:54:41.4645874Z %70 = "tt.reduce"(%69) <{axis = 1 : i32}> ({ 2026-02-21T08:54:41.4646128Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:54:41.4646370Z %126 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:54:41.4646602Z tt.reduce.return %126 : f32 2026-02-21T08:54:41.4646849Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:54:41.4647106Z %71 = arith.truncf %70 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:54:41.4647411Z %72 = arith.extf %71 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:54:41.4647677Z %73 = arith.cmpf ogt, %56, %72 : tensor<8xf32> 2026-02-21T08:54:41.4647955Z %74 = arith.cmpf une, %56, %56 : tensor<8xf32> 2026-02-21T08:54:41.4648222Z %75 
= arith.ori %73, %74 : tensor<8xi1> 2026-02-21T08:54:41.4648492Z %76 = arith.select %75, %56, %72 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:54:41.4648792Z %77 = arith.subf %56, %76 : tensor<8xf32> 2026-02-21T08:54:41.4649179Z %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:54:41.4649597Z %79 = arith.mulf %65, %78 : tensor<8xf32> 2026-02-21T08:54:41.4649883Z %80 = tt.expand_dims %76 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:54:41.4650254Z %81 = tt.broadcast %80 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:54:41.4650556Z %82 = arith.subf %69, %81 : tensor<8x512xf32> 2026-02-21T08:54:41.4650959Z %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:54:41.4651393Z %84 = "tt.reduce"(%83) <{axis = 1 : i32}> ({ 2026-02-21T08:54:41.4651662Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:54:41.4652003Z %126 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:54:41.4652232Z tt.reduce.return %126 : f32 2026-02-21T08:54:41.4652484Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:54:41.4652750Z %85 = arith.addf %79, %84 : tensor<8xf32> 2026-02-21T08:54:41.4652981Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:54:41.4653235Z %86 = arith.muli %c512_i32, %c2_i32 : i32 2026-02-21T08:54:41.4653462Z %87 = arith.addi %arg3, %86 : i32 2026-02-21T08:54:41.4653797Z %88 = tt.descriptor_load %0[%2, %87] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:54:41.4654149Z %89 = arith.extf %88 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:54:41.4654443Z %90 = "tt.reduce"(%89) <{axis = 1 : i32}> ({ 2026-02-21T08:54:41.4654697Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:54:41.4654917Z %126 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:54:41.4655246Z tt.reduce.return %126 : f32 2026-02-21T08:54:41.4655474Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 
2026-02-21T08:54:41.4655759Z %91 = arith.truncf %90 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:54:41.4656034Z %92 = arith.extf %91 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:54:41.4656326Z %93 = arith.cmpf ogt, %76, %92 : tensor<8xf32> 2026-02-21T08:54:41.4656607Z %94 = arith.cmpf une, %76, %76 : tensor<8xf32> 2026-02-21T08:54:41.4656845Z %95 = arith.ori %93, %94 : tensor<8xi1> 2026-02-21T08:54:41.4657132Z %96 = arith.select %95, %76, %92 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:54:41.4657395Z %97 = arith.subf %76, %96 : tensor<8xf32> 2026-02-21T08:54:41.4657836Z %98 = tt.extern_elementwise %97 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:54:41.4658226Z %99 = arith.mulf %85, %98 : tensor<8xf32> 2026-02-21T08:54:41.4658543Z %100 = tt.expand_dims %96 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:54:41.4658901Z %101 = tt.broadcast %100 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:54:41.4659182Z %102 = arith.subf %89, %101 : tensor<8x512xf32> 2026-02-21T08:54:41.4659621Z %103 = tt.extern_elementwise %102 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:54:41.4660028Z %104 = "tt.reduce"(%103) <{axis = 1 : i32}> ({ 2026-02-21T08:54:41.4660288Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:54:41.4660510Z %126 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:54:41.4660767Z tt.reduce.return %126 : f32 2026-02-21T08:54:41.4661009Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:54:41.4661243Z %105 = arith.addf %99, %104 : tensor<8xf32> 2026-02-21T08:54:41.4661506Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:54:41.4661780Z %106 = arith.muli %c512_i32, %c3_i32 : i32 2026-02-21T08:54:41.4662053Z %107 = arith.addi %arg3, %106 : i32 2026-02-21T08:54:41.4662371Z %108 = tt.descriptor_load %0[%2, %107] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:54:41.4662762Z %109 = arith.extf %108 : 
tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:54:41.4663069Z %110 = "tt.reduce"(%109) <{axis = 1 : i32}> ({ 2026-02-21T08:54:41.4663304Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:54:41.4663559Z %126 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:54:41.4663806Z tt.reduce.return %126 : f32 2026-02-21T08:54:41.4664048Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:54:41.4664315Z %111 = arith.truncf %110 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:54:41.4664649Z %112 = arith.extf %111 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:54:41.4664965Z %113 = arith.cmpf ogt, %96, %112 : tensor<8xf32> 2026-02-21T08:54:41.4665239Z %114 = arith.cmpf une, %96, %96 : tensor<8xf32> 2026-02-21T08:54:41.4665601Z %115 = arith.ori %113, %114 : tensor<8xi1> 2026-02-21T08:54:41.4665888Z %116 = arith.select %115, %96, %112 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:54:41.4666209Z %117 = arith.subf %96, %116 : tensor<8xf32> 2026-02-21T08:54:41.4666624Z %118 = tt.extern_elementwise %117 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:54:41.4667076Z %119 = arith.mulf %105, %118 : tensor<8xf32> 2026-02-21T08:54:41.4667417Z %120 = tt.expand_dims %116 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:54:41.4667767Z %121 = tt.broadcast %120 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:54:41.4668087Z %122 = arith.subf %109, %121 : tensor<8x512xf32> 2026-02-21T08:54:41.4668515Z %123 = tt.extern_elementwise %122 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:54:41.4669035Z %124 = "tt.reduce"(%123) <{axis = 1 : i32}> ({ 2026-02-21T08:54:41.4669277Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:54:41.4669534Z %126 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:54:41.4669792Z tt.reduce.return %126 : f32 2026-02-21T08:54:41.4670025Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:54:41.4670279Z %125 = arith.addf 
%119, %124 : tensor<8xf32> 2026-02-21T08:54:41.4670546Z scf.yield %116, %125 : tensor<8xf32>, tensor<8xf32> 2026-02-21T08:54:41.4670869Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:54:41.4671247Z %7 = tt.descriptor_load %0[%2, %c10240_i32] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T08:54:41.4671685Z %8 = arith.extf %7 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:54:41.4671994Z %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({ 2026-02-21T08:54:41.4672234Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:54:41.4672496Z %48 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:54:41.4672738Z tt.reduce.return %48 : f32 2026-02-21T08:54:41.4673012Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:54:41.4673284Z %10 = arith.truncf %9 : tensor<8xf32> to tensor<8xf16> 2026-02-21T08:54:41.4673612Z %11 = arith.extf %10 : tensor<8xf16> to tensor<8xf32> 2026-02-21T08:54:41.4673900Z %12 = arith.cmpf ogt, %6#0, %11 : tensor<8xf32> 2026-02-21T08:54:41.4674155Z %13 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32> 2026-02-21T08:54:41.4674422Z %14 = arith.ori %12, %13 : tensor<8xi1> 2026-02-21T08:54:41.4674683Z %15 = arith.select %14, %6#0, %11 : tensor<8xi1>, tensor<8xf32> 2026-02-21T08:54:41.4674975Z %16 = arith.subf %6#0, %15 : tensor<8xf32> 2026-02-21T08:54:41.4675362Z %17 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T08:54:41.4675781Z %18 = arith.mulf %6#1, %17 : tensor<8xf32> 2026-02-21T08:54:41.4676099Z %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:54:41.4676418Z %20 = tt.broadcast %19 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:54:41.4676719Z %21 = arith.subf %8, %20 : tensor<8x512xf32> 2026-02-21T08:54:41.4677112Z %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:54:41.4677534Z %23 = 
"tt.reduce"(%22) <{axis = 1 : i32}> ({ 2026-02-21T08:54:41.4677762Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:54:41.4678010Z %48 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:54:41.4678265Z tt.reduce.return %48 : f32 2026-02-21T08:54:41.4678489Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:54:41.4678748Z %24 = arith.addf %18, %23 : tensor<8xf32> 2026-02-21T08:54:41.4678985Z %c10240_i32_2 = arith.constant 10240 : i32 2026-02-21T08:54:41.4679259Z %c2048_i32_3 = arith.constant 2048 : i32 2026-02-21T08:54:41.4679607Z scf.for %arg3 = %c0_i32 to %c10240_i32_2 step %c2048_i32_3 : i32 { 2026-02-21T08:54:41.4679969Z %48 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:54:41.4680301Z %49 = tt.splat %arg3 : i32 -> tensor<512xi32> 2026-02-21T08:54:41.4680551Z %50 = arith.addi %49, %48 : tensor<512xi32> 2026-02-21T08:54:41.4680872Z %51 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:54:41.4681176Z %52 = arith.muli %51, %cst : tensor<8x1xi32> 2026-02-21T08:54:41.4681500Z %53 = tt.expand_dims %50 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:54:41.4681857Z %54 = tt.broadcast %52 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:54:41.4682187Z %55 = tt.broadcast %53 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:54:41.4682484Z %56 = arith.addi %54, %55 : tensor<8x512xi32> 2026-02-21T08:54:41.4682836Z %57 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4683176Z %58 = tt.addptr %57, %56 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:54:41.4683511Z %59 = tt.load %58 evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4683875Z %60 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:54:41.4684191Z %61 = arith.extf %59 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:54:41.4684511Z %62 = tt.broadcast %60 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:54:41.4684810Z %63 
= arith.subf %61, %62 : tensor<8x512xf32> 2026-02-21T08:54:41.4685213Z %64 = tt.extern_elementwise %63 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:54:41.4685684Z %65 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:54:41.4686003Z %66 = tt.broadcast %65 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:54:41.4686302Z %67 = arith.divf %64, %66 : tensor<8x512xf32> 2026-02-21T08:54:41.4686595Z %68 = arith.truncf %67 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:54:41.4686897Z %69 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4687237Z %70 = tt.addptr %69, %56 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:54:41.4687531Z tt.store %70, %68 : tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4687801Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:54:41.4688033Z %71 = arith.muli %c512_i32, %c1_i32 : i32 2026-02-21T08:54:41.4688291Z %72 = arith.addi %arg3, %71 : i32 2026-02-21T08:54:41.4688595Z %73 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:54:41.4688883Z %74 = tt.splat %72 : i32 -> tensor<512xi32> 2026-02-21T08:54:41.4689152Z %75 = arith.addi %74, %73 : tensor<512xi32> 2026-02-21T08:54:41.4689438Z %76 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:54:41.4689760Z %77 = arith.muli %76, %cst : tensor<8x1xi32> 2026-02-21T08:54:41.4690051Z %78 = tt.expand_dims %75 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:54:41.4690401Z %79 = tt.broadcast %77 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:54:41.4690728Z %80 = tt.broadcast %78 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:54:41.4690999Z %81 = arith.addi %79, %80 : tensor<8x512xi32> 2026-02-21T08:54:41.4691298Z %82 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4691635Z %83 = tt.addptr %82, %81 : tensor<8x512x!tt.ptr>, 
tensor<8x512xi32> 2026-02-21T08:54:41.4691999Z %84 = tt.load %83 evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4692337Z %85 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:54:41.4692742Z %86 = arith.extf %84 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:54:41.4693066Z %87 = tt.broadcast %85 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:54:41.4693334Z %88 = arith.subf %86, %87 : tensor<8x512xf32> 2026-02-21T08:54:41.4693770Z %89 = tt.extern_elementwise %88 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:54:41.4694215Z %90 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:54:41.4694571Z %91 = tt.broadcast %90 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:54:41.4694866Z %92 = arith.divf %89, %91 : tensor<8x512xf32> 2026-02-21T08:54:41.4695141Z %93 = arith.truncf %92 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:54:41.4695476Z %94 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4695841Z %95 = tt.addptr %94, %81 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:54:41.4696165Z tt.store %95, %93 : tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4696409Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:54:41.4696667Z %96 = arith.muli %c512_i32, %c2_i32 : i32 2026-02-21T08:54:41.4696933Z %97 = arith.addi %arg3, %96 : i32 2026-02-21T08:54:41.4697200Z %98 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:54:41.4697518Z %99 = tt.splat %97 : i32 -> tensor<512xi32> 2026-02-21T08:54:41.4697764Z %100 = arith.addi %99, %98 : tensor<512xi32> 2026-02-21T08:54:41.4698076Z %101 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:54:41.4698375Z %102 = arith.muli %101, %cst : tensor<8x1xi32> 2026-02-21T08:54:41.4698708Z %103 = tt.expand_dims %100 {axis = 0 : i32} : tensor<512xi32> -> 
tensor<1x512xi32> 2026-02-21T08:54:41.4699074Z %104 = tt.broadcast %102 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:54:41.4699377Z %105 = tt.broadcast %103 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:54:41.4699689Z %106 = arith.addi %104, %105 : tensor<8x512xi32> 2026-02-21T08:54:41.4699969Z %107 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4700310Z %108 = tt.addptr %107, %106 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:54:41.4700656Z %109 = tt.load %108 evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4701027Z %110 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:54:41.4701384Z %111 = arith.extf %109 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:54:41.4701721Z %112 = tt.broadcast %110 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:54:41.4702030Z %113 = arith.subf %111, %112 : tensor<8x512xf32> 2026-02-21T08:54:41.4702444Z %114 = tt.extern_elementwise %113 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:54:41.4702943Z %115 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:54:41.4703296Z %116 = tt.broadcast %115 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:54:41.4703576Z %117 = arith.divf %114, %116 : tensor<8x512xf32> 2026-02-21T08:54:41.4703878Z %118 = arith.truncf %117 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:54:41.4704189Z %119 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4704545Z %120 = tt.addptr %119, %106 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:54:41.4704869Z tt.store %120, %118 : tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4705119Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:54:41.4705376Z %121 = arith.muli %c512_i32, %c3_i32 : i32 2026-02-21T08:54:41.4705610Z %122 = arith.addi %arg3, %121 : i32 2026-02-21T08:54:41.4705980Z %123 = tt.make_range {end = 
512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:54:41.4706270Z %124 = tt.splat %122 : i32 -> tensor<512xi32> 2026-02-21T08:54:41.4706547Z %125 = arith.addi %124, %123 : tensor<512xi32> 2026-02-21T08:54:41.4706837Z %126 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:54:41.4707163Z %127 = arith.muli %126, %cst : tensor<8x1xi32> 2026-02-21T08:54:41.4707501Z %128 = tt.expand_dims %125 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:54:41.4707835Z %129 = tt.broadcast %127 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:54:41.4708169Z %130 = tt.broadcast %128 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:54:41.4708449Z %131 = arith.addi %129, %130 : tensor<8x512xi32> 2026-02-21T08:54:41.4708759Z %132 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4709184Z %133 = tt.addptr %132, %131 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:54:41.4709553Z %134 = tt.load %133 evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4709953Z %135 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:54:41.4710298Z %136 = arith.extf %134 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:54:41.4710645Z %137 = tt.broadcast %135 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:54:41.4710969Z %138 = arith.subf %136, %137 : tensor<8x512xf32> 2026-02-21T08:54:41.4711404Z %139 = tt.extern_elementwise %138 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:54:41.4711947Z %140 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:54:41.4712289Z %141 = tt.broadcast %140 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:54:41.4712616Z %142 = arith.divf %139, %141 : tensor<8x512xf32> 2026-02-21T08:54:41.4712941Z %143 = arith.truncf %142 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:54:41.4713272Z %144 = 
tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4713645Z %145 = tt.addptr %144, %131 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:54:41.4713964Z tt.store %145, %143 : tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4714290Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:54:41.4714621Z %25 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:54:41.4714976Z %26 = tt.splat %c10240_i32_2 : i32 -> tensor<512xi32> 2026-02-21T08:54:41.4715267Z %27 = arith.addi %26, %25 : tensor<512xi32> 2026-02-21T08:54:41.4715564Z %28 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T08:54:41.4715905Z %29 = arith.muli %28, %cst : tensor<8x1xi32> 2026-02-21T08:54:41.4716214Z %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:54:41.4716587Z %31 = tt.broadcast %29 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T08:54:41.4716899Z %32 = tt.broadcast %30 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T08:54:41.4717211Z %33 = arith.addi %31, %32 : tensor<8x512xi32> 2026-02-21T08:54:41.4717523Z %34 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4717853Z %35 = tt.addptr %34, %33 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:54:41.4718228Z %36 = tt.load %35 evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4718579Z %37 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:54:41.4718926Z %38 = arith.extf %36 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T08:54:41.4719240Z %39 = tt.broadcast %37 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:54:41.4763796Z %40 = arith.subf %38, %39 : tensor<8x512xf32> 2026-02-21T08:54:41.4764220Z %41 = tt.extern_elementwise %40 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:54:41.4764670Z %42 = tt.expand_dims %24 {axis = 1 : i32} 
: tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T08:54:41.4765009Z %43 = tt.broadcast %42 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T08:54:41.4765275Z %44 = arith.divf %41, %43 : tensor<8x512xf32> 2026-02-21T08:54:41.4765596Z %45 = arith.truncf %44 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T08:54:41.4765925Z %46 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4766231Z %47 = tt.addptr %46, %33 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T08:54:41.4766544Z tt.store %47, %45 : tensor<8x512x!tt.ptr> 2026-02-21T08:54:41.4767002Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:54:41.4767330Z tt.return 2026-02-21T08:54:41.4767503Z } 2026-02-21T08:54:41.4767697Z } 2026-02-21T08:54:41.4767791Z 2026-02-21T08:54:41.4767894Z {-# 2026-02-21T08:54:41.4768068Z external_resources: { 2026-02-21T08:54:41.4768292Z mlir_reproducer: { 2026-02-21T08:54:41.4772699Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, 
triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:54:41.4777209Z disable_threading: false, 2026-02-21T08:54:41.4777451Z verify_each: true 2026-02-21T08:54:41.4777635Z } 2026-02-21T08:54:41.4777827Z } 2026-02-21T08:54:41.4777984Z #-} 2026-02-21T08:54:41.4778480Z /tmp/torchinductor_root/ng/cng5e4yetnouvd2hbraksannbxwtzpiyha37lrnoy6ervzlsattq.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:54:41.4779743Z /tmp/torchinductor_root/ng/cng5e4yetnouvd2hbraksannbxwtzpiyha37lrnoy6ervzlsattq.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:54:41.4780766Z [42s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:54:41.4781944Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 512], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'last'], maxnreg=32, num_sm_multiplier=64, num_stages=7, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[4, 4], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:54:41.4783057Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:54:41.4783352Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:54:49.5574703Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 10.9 configs/s 2026-02-21T08:54:49.5585870Z [50s] Adaptive compile timeout: 30s (90% percentile=11.0s, bounds=[30.0s, 30s]) 2026-02-21T08:54:50.4239809Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1136.9 configs/s 2026-02-21T08:54:50.4892210Z [51s] Initial random population of 100, 5 starting points: 2026-02-21T08:54:50.4894399Z error=6 2026-02-21T08:54:50.4894698Z timeout=1 2026-02-21T08:54:50.4894915Z ok=93 2026-02-21T08:54:50.4895118Z min=0.0553 2026-02-21T08:54:50.4895288Z mid=1.0046 2026-02-21T08:54:50.4895483Z max=53.8255 2026-02-21T08:54:50.4895703Z best={'block_sizes': [1, 16384], 2026-02-21T08:54:50.4896010Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:54:50.4896281Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:54:50.4896540Z 'num_sm_multiplier': 8, 2026-02-21T08:54:50.4896738Z 'num_stages': 3, 2026-02-21T08:54:50.4896951Z 'num_warps': 1, 2026-02-21T08:54:50.4897150Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:54:50.4897413Z 'range_flattens': [False, None], 2026-02-21T08:54:50.4897706Z 'range_multi_buffers': [True, True], 2026-02-21T08:54:50.4897974Z 'range_num_stages': [1, 2], 2026-02-21T08:54:50.4898216Z 'range_unroll_factors': [0, 1], 2026-02-21T08:54:50.4898438Z 
'range_warp_specializes': [True, None]} 2026-02-21T08:54:50.4907217Z [51s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:54:51.5351664Z [52s] Generation 1 starting: 79 neighbors, 5 active search path(s) 2026-02-21T08:55:03.6071947Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 11.0 configs/s 2026-02-21T08:55:08.5832776Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 17.0 configs/s 2026-02-21T08:55:15.0529043Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 175.1 2026-02-21T08:55:15.0530479Z configs/s 2026-02-21T08:55:15.3387370Z [76s] Generation 1 complete: 2026-02-21T08:55:15.3389314Z ok=85 2026-02-21T08:55:15.3389555Z min=0.0512 2026-02-21T08:55:15.3389805Z mid=0.0696 2026-02-21T08:55:15.3394597Z max=0.3931 2026-02-21T08:55:15.3398147Z best={'block_sizes': [1, 16384], 2026-02-21T08:55:15.3400565Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:55:15.3400912Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:55:15.3401204Z 'num_stages': 3, 2026-02-21T08:55:15.3401463Z 'num_warps': 1, 2026-02-21T08:55:15.3401920Z 'pid_type': 'flat', 2026-02-21T08:55:15.3402120Z 'range_flattens': [None, None], 2026-02-21T08:55:15.3402369Z 'range_multi_buffers': [None, True], 2026-02-21T08:55:15.3406071Z 'range_num_stages': [0, 2], 2026-02-21T08:55:15.3410762Z 'range_unroll_factors': [0, 1], 2026-02-21T08:55:15.3412893Z 'range_warp_specializes': [None, True]} 2026-02-21T08:55:15.3413267Z [76s] Fitting surrogate: 185 points, 185 targets 2026-02-21T08:55:16.1277215Z [77s] Generation 2 starting: 61 neighbors, 5 active search path(s) 2026-02-21T08:55:27.5110597Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 17.1 configs/s 2026-02-21T08:55:31.2267729Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 63/63 17.2 configs/s 2026-02-21T08:55:33.4591357Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 449.9 2026-02-21T08:55:33.4595295Z configs/s 
2026-02-21T08:55:33.5918105Z [94s] Generation 2 complete: 2026-02-21T08:55:33.5920079Z ok=66 2026-02-21T08:55:33.5920330Z min=0.0410 2026-02-21T08:55:33.5920560Z mid=0.0675 2026-02-21T08:55:33.5920756Z max=0.7946 2026-02-21T08:55:33.5920992Z best={'block_sizes': [1, 16384], 2026-02-21T08:55:33.5921320Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:55:33.5921731Z 'load_eviction_policies': ['', ''], 2026-02-21T08:55:33.5921957Z 'num_stages': 3, 2026-02-21T08:55:33.5922211Z 'num_warps': 2, 2026-02-21T08:55:33.5922454Z 'pid_type': 'flat', 2026-02-21T08:55:33.5922654Z 'range_flattens': [None, None], 2026-02-21T08:55:33.5922919Z 'range_multi_buffers': [None, True], 2026-02-21T08:55:33.5923148Z 'range_num_stages': [0, 2], 2026-02-21T08:55:33.5923377Z 'range_unroll_factors': [0, 1], 2026-02-21T08:55:33.5923600Z 'range_warp_specializes': [None, True]} 2026-02-21T08:55:33.5935663Z [94s] Fitting surrogate: 251 points, 251 targets 2026-02-21T08:55:34.4207816Z [95s] Generation 3 starting: 56 neighbors, 4 active search path(s) 2026-02-21T08:55:45.1062402Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 4.1 configs/s 2026-02-21T08:55:49.3252470Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 14.1 configs/s 2026-02-21T08:55:51.0009540Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 599.6 2026-02-21T08:55:51.0010978Z configs/s 2026-02-21T08:55:51.1128437Z [112s] Generation 3 complete: 2026-02-21T08:55:51.1130134Z ok=61 2026-02-21T08:55:51.1130407Z min=0.0409 2026-02-21T08:55:51.1130666Z mid=0.0655 2026-02-21T08:55:51.1130872Z max=0.4711 2026-02-21T08:55:51.1131128Z best={'block_sizes': [1, 16384], 2026-02-21T08:55:51.1131423Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:55:51.1132134Z 'load_eviction_policies': ['', ''], 2026-02-21T08:55:51.1133936Z 'num_stages': 3, 2026-02-21T08:55:51.1134213Z 'num_warps': 4, 2026-02-21T08:55:51.1139814Z 'pid_type': 'flat', 2026-02-21T08:55:51.1144538Z 
'range_flattens': [None, None], 2026-02-21T08:55:51.1144906Z 'range_multi_buffers': [None, True], 2026-02-21T08:55:51.1149299Z 'range_num_stages': [0, 2], 2026-02-21T08:55:51.1149663Z 'range_unroll_factors': [0, 1], 2026-02-21T08:55:51.1149928Z 'range_warp_specializes': [None, True]} 2026-02-21T08:55:51.1155372Z [112s] Fitting surrogate: 312 points, 312 targets 2026-02-21T08:55:51.8470920Z [112s] Generation 4 starting: 48 neighbors, 3 active search path(s) 2026-02-21T08:56:01.9910726Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 11.2 configs/s 2026-02-21T08:56:05.0315831Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 17.0 configs/s 2026-02-21T08:56:07.6408852Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 386.2 2026-02-21T08:56:07.6415418Z configs/s 2026-02-21T08:56:07.7917549Z [128s] Generation 4 complete: 2026-02-21T08:56:07.7918575Z ok=52 2026-02-21T08:56:07.7918774Z min=0.0409 2026-02-21T08:56:07.7918946Z mid=0.0614 2026-02-21T08:56:07.7919134Z max=0.2724 2026-02-21T08:56:07.7919315Z best={'block_sizes': [1, 16384], 2026-02-21T08:56:07.7919584Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:56:07.7919837Z 'load_eviction_policies': ['', ''], 2026-02-21T08:56:07.7920092Z 'num_stages': 3, 2026-02-21T08:56:07.7920271Z 'num_warps': 4, 2026-02-21T08:56:07.7920479Z 'pid_type': 'flat', 2026-02-21T08:56:07.7920706Z 'range_flattens': [None, None], 2026-02-21T08:56:07.7920922Z 'range_multi_buffers': [None, True], 2026-02-21T08:56:07.7921174Z 'range_num_stages': [0, 1], 2026-02-21T08:56:07.7921383Z 'range_unroll_factors': [0, 1], 2026-02-21T08:56:07.7921893Z 'range_warp_specializes': [None, True]} 2026-02-21T08:56:07.7935451Z [128s] Fitting surrogate: 364 points, 364 targets 2026-02-21T08:56:08.2964802Z [129s] Generation 5 starting: 31 neighbors, 2 active search path(s) 2026-02-21T08:56:16.1704884Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 6.5 configs/s 
2026-02-21T08:56:18.1100513Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 32/32 16.9 configs/s 2026-02-21T08:56:20.8652835Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 497.5 2026-02-21T08:56:20.8653333Z configs/s 2026-02-21T08:56:20.9839707Z [142s] Generation 5 complete: 2026-02-21T08:56:20.9840097Z ok=34 2026-02-21T08:56:20.9840348Z min=0.0410 2026-02-21T08:56:20.9840560Z mid=0.0594 2026-02-21T08:56:20.9840782Z max=0.5181 2026-02-21T08:56:20.9840994Z best={'block_sizes': [1, 16384], 2026-02-21T08:56:20.9841306Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:56:20.9841682Z 'load_eviction_policies': ['', ''], 2026-02-21T08:56:20.9841907Z 'num_stages': 3, 2026-02-21T08:56:20.9842126Z 'num_warps': 4, 2026-02-21T08:56:20.9842315Z 'pid_type': 'flat', 2026-02-21T08:56:20.9842581Z 'range_flattens': [None, None], 2026-02-21T08:56:20.9842819Z 'range_multi_buffers': [None, True], 2026-02-21T08:56:20.9843076Z 'range_num_stages': [0, 1], 2026-02-21T08:56:20.9843277Z 'range_unroll_factors': [0, 1], 2026-02-21T08:56:20.9843534Z 'range_warp_specializes': [None, True]} 2026-02-21T08:56:20.9853015Z [142s] Fitting surrogate: 398 points, 398 targets 2026-02-21T08:56:21.4074037Z [142s] Generation 6 starting: 24 neighbors, 2 active search path(s) 2026-02-21T08:56:27.2540588Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26/26 1.5 configs/s 2026-02-21T08:56:28.8999835Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 26/26 16.2 configs/s 2026-02-21T08:56:30.3933190Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 670.4 2026-02-21T08:56:30.3933783Z configs/s 2026-02-21T08:56:30.4922587Z [151s] Generation 6 complete: 2026-02-21T08:56:30.4926294Z ok=27 2026-02-21T08:56:30.4929513Z min=0.0409 2026-02-21T08:56:30.4933438Z mid=0.0594 2026-02-21T08:56:30.4937880Z max=0.1885 2026-02-21T08:56:30.4939470Z best={'block_sizes': [1, 16384], 2026-02-21T08:56:30.4939851Z 'indexing': ['tensor_descriptor', 
'pointer', 'pointer'], 2026-02-21T08:56:30.4943888Z 'load_eviction_policies': ['', ''], 2026-02-21T08:56:30.4944263Z 'num_stages': 4, 2026-02-21T08:56:30.4944483Z 'num_warps': 1, 2026-02-21T08:56:30.4949483Z 'pid_type': 'flat', 2026-02-21T08:56:30.4952825Z 'range_flattens': [None, None], 2026-02-21T08:56:30.4956034Z 'range_multi_buffers': [None, True], 2026-02-21T08:56:30.4957973Z 'range_num_stages': [0, 1], 2026-02-21T08:56:30.4958257Z 'range_unroll_factors': [0, 0], 2026-02-21T08:56:30.4958495Z 'range_warp_specializes': [None, True]} 2026-02-21T08:56:30.4958862Z [151s] Fitting surrogate: 425 points, 425 targets 2026-02-21T08:56:30.7775049Z [151s] Generation 7 starting: 11 neighbors, 1 active search path(s) 2026-02-21T08:56:36.5980817Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 0.8 configs/s 2026-02-21T08:56:37.3105771Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 12/12 18.0 configs/s 2026-02-21T08:56:38.3217529Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 986.5 2026-02-21T08:56:38.3219192Z configs/s 2026-02-21T08:56:38.3956749Z [159s] Generation 7 complete: 2026-02-21T08:56:38.3960554Z ok=13 2026-02-21T08:56:38.3964479Z min=0.0409 2026-02-21T08:56:38.3966473Z mid=0.0409 2026-02-21T08:56:38.3966701Z max=0.3767 2026-02-21T08:56:38.3966885Z best={'block_sizes': [1, 16384], 2026-02-21T08:56:38.3967197Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:56:38.3967461Z 'load_eviction_policies': ['', ''], 2026-02-21T08:56:38.3967699Z 'num_stages': 4, 2026-02-21T08:56:38.3967963Z 'num_warps': 1, 2026-02-21T08:56:38.3972239Z 'pid_type': 'flat', 2026-02-21T08:56:38.3976795Z 'range_flattens': [None, None], 2026-02-21T08:56:38.3978578Z 'range_multi_buffers': [None, True], 2026-02-21T08:56:38.3978898Z 'range_num_stages': [0, 1], 2026-02-21T08:56:38.3979169Z 'range_unroll_factors': [0, 0], 2026-02-21T08:56:38.3979408Z 'range_warp_specializes': [None, True]} 2026-02-21T08:56:38.3979829Z [159s] 
Fitting surrogate: 438 points, 438 targets 2026-02-21T08:56:38.5687845Z [159s] Autotuning complete in 159.7s after searching 426 configs. 2026-02-21T08:56:38.5688303Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:56:38.5693132Z @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', ''], num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:56:38.5694018Z 2026-02-21T08:56:38.5694322Z [159s] Code of selected kernel: /tmp/torchinductor_root/v4/cv4c7vlts4ihrxwoqka3tpcvarlgtl5v757t2iold6n2j3k6uap6.py 2026-02-21T08:56:39.6588638Z WARNING:tritonbench.utils.triton_op:Completed input ID 82: 2026-02-21T08:56:39.6594747Z (M, N) 2026-02-21T08:56:39.6596881Z ------------- 2026-02-21T08:56:39.6597124Z (4096, 10752) 2026-02-21T08:56:39.6597264Z 2026-02-21T08:56:39.6603002Z 85%|████████▌ | 17/20 [47:37<09:13, 184.41s/it]WARNING:tritonbench.utils.triton_op:Running input ID 87: 2026-02-21T08:56:39.6607247Z (M, N) 2026-02-21T08:56:39.6612236Z ------------- 2026-02-21T08:56:39.6614213Z (4096, 11392) 2026-02-21T08:56:39.6614722Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T08:56:40.8571319Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T08:56:42.2133646Z INFO:tritonbench.utils.triton_op:Took 2.29ms to get benchmark function for torch_compile_softmax 2026-02-21T08:56:46.0290711Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:56:46.0295078Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:56:46.0296610Z 'dtype': 'torch.float16', 2026-02-21T08:56:46.0296926Z 'shape': (4096, 11392), 2026-02-21T08:56:46.0297156Z 'stride': (11392, 1)},), 2026-02-21T08:56:46.0297479Z 'kwargs': {}} 
2026-02-21T08:56:46.0310362Z INFO:tritonbench.utils.triton_op:Took 2.25ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T08:56:46.2069041Z [0s] Autotune random seed: 2134816249 2026-02-21T08:56:46.3509883Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:57:26.1987076Z [39s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[4, 1], range_warp_specializes=[None, None]) 2026-02-21T08:57:26.2005421Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.3 configs/s 2026-02-21T08:57:28.7032390Z module { 2026-02-21T08:57:28.7034597Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T08:57:28.7035158Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:57:28.7035401Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:57:28.7035665Z %c148_i32 = arith.constant 148 : i32 2026-02-21T08:57:28.7035927Z %cst = arith.constant dense<11392> : tensor<128x1xi32> 2026-02-21T08:57:28.7036260Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<128xf32> 2026-02-21T08:57:28.7036563Z %cst_1 = arith.constant dense<0xFF800000> : tensor<128xf32> 2026-02-21T08:57:28.7036854Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:57:28.7037110Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:57:28.7037718Z %c11392_i32 = arith.constant 11392 : i32 2026-02-21T08:57:28.7038017Z %c11392_i64 = arith.constant 11392 : i64 2026-02-21T08:57:28.7038244Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:57:28.7038639Z %0 = tt.make_tensor_descriptor %arg0, 
[%c4096_i32, %c11392_i32], [%c11392_i64, %c1_i64] : , > 2026-02-21T08:57:28.7039017Z %1 = tt.get_program_id x : i32 2026-02-21T08:57:28.7039298Z scf.for %arg2 = %1 to %c32_i32 step %c148_i32 : i32 { 2026-02-21T08:57:28.7039596Z %2 = arith.muli %arg2, %c128_i32 : i32 2026-02-21T08:57:28.7039887Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:57:28.7040232Z %4 = tt.splat %2 : i32 -> tensor<128xi32> 2026-02-21T08:57:28.7040478Z %5 = arith.addi %4, %3 : tensor<128xi32> 2026-02-21T08:57:28.7040750Z %c11264_i32 = arith.constant 11264 : i32 2026-02-21T08:57:28.7041021Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:57:28.7041513Z %6:2 = scf.for %arg3 = %c0_i32 to %c11264_i32 step %c512_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<128xf32>, tensor<128xf32>) : i32 { 2026-02-21T08:57:28.7042058Z %55 = tt.splat %arg3 : i32 -> tensor<128xi32> 2026-02-21T08:57:28.7042321Z %56 = arith.addi %55, %3 : tensor<128xi32> 2026-02-21T08:57:28.7042724Z %57 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:57:28.7043070Z %58 = arith.muli %57, %cst : tensor<128x1xi32> 2026-02-21T08:57:28.7043381Z %59 = tt.expand_dims %56 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:57:28.7043751Z %60 = tt.broadcast %58 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7044076Z %61 = tt.broadcast %59 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7044404Z %62 = arith.addi %60, %61 : tensor<128x128xi32> 2026-02-21T08:57:28.7044737Z %63 = tt.splat %arg0 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7045094Z %64 = tt.addptr %63, %62 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:57:28.7045489Z %65 = tt.load %64 evictionPolicy = evict_first : tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7045841Z %66 = arith.extf %65 : tensor<128x128xf16> to tensor<128x128xf32> 2026-02-21T08:57:28.7046159Z %67 = "tt.reduce"(%66) <{axis = 1 : i32}> ({ 
2026-02-21T08:57:28.7046403Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:57:28.7046674Z %173 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:57:28.7046945Z tt.reduce.return %173 : f32 2026-02-21T08:57:28.7047188Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:57:28.7047501Z %68 = arith.truncf %67 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:57:28.7047808Z %69 = arith.extf %68 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:57:28.7048130Z %70 = arith.cmpf ogt, %arg4, %69 : tensor<128xf32> 2026-02-21T08:57:28.7048417Z %71 = arith.cmpf une, %arg4, %arg4 : tensor<128xf32> 2026-02-21T08:57:28.7048870Z %72 = arith.ori %70, %71 : tensor<128xi1> 2026-02-21T08:57:28.7049157Z %73 = arith.select %72, %arg4, %69 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:57:28.7049460Z %74 = arith.subf %arg4, %73 : tensor<128xf32> 2026-02-21T08:57:28.7049926Z %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:57:28.7050337Z %76 = arith.mulf %arg5, %75 : tensor<128xf32> 2026-02-21T08:57:28.7050660Z %77 = tt.expand_dims %73 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:57:28.7050995Z %78 = tt.broadcast %77 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:57:28.7051311Z %79 = arith.subf %66, %78 : tensor<128x128xf32> 2026-02-21T08:57:28.7051904Z %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:57:28.7052326Z %81 = "tt.reduce"(%80) <{axis = 1 : i32}> ({ 2026-02-21T08:57:28.7052595Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:57:28.7052821Z %173 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:57:28.7053088Z tt.reduce.return %173 : f32 2026-02-21T08:57:28.7053320Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:57:28.7053590Z %82 = arith.addf %76, %81 : tensor<128xf32> 2026-02-21T08:57:28.7053854Z %c1_i32 = arith.constant 1 : 
i32 2026-02-21T08:57:28.7054101Z %83 = arith.muli %c128_i32, %c1_i32 : i32 2026-02-21T08:57:28.7054363Z %84 = arith.addi %arg3, %83 : i32 2026-02-21T08:57:28.7054590Z %85 = tt.splat %84 : i32 -> tensor<128xi32> 2026-02-21T08:57:28.7054856Z %86 = arith.addi %85, %3 : tensor<128xi32> 2026-02-21T08:57:28.7055139Z %87 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:57:28.7055476Z %88 = arith.muli %87, %cst : tensor<128x1xi32> 2026-02-21T08:57:28.7055793Z %89 = tt.expand_dims %86 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:57:28.7056118Z %90 = tt.broadcast %88 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7056447Z %91 = tt.broadcast %89 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7056729Z %92 = arith.addi %90, %91 : tensor<128x128xi32> 2026-02-21T08:57:28.7057036Z %93 = tt.splat %arg0 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7057359Z %94 = tt.addptr %93, %92 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:57:28.7057729Z %95 = tt.load %94 evictionPolicy = evict_first : tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7058088Z %96 = arith.extf %95 : tensor<128x128xf16> to tensor<128x128xf32> 2026-02-21T08:57:28.7058361Z %97 = "tt.reduce"(%96) <{axis = 1 : i32}> ({ 2026-02-21T08:57:28.7058653Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:57:28.7058883Z %173 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:57:28.7059175Z tt.reduce.return %173 : f32 2026-02-21T08:57:28.7059374Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:57:28.7059670Z %98 = arith.truncf %97 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:57:28.7059985Z %99 = arith.extf %98 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:57:28.7060257Z %100 = arith.cmpf ogt, %73, %99 : tensor<128xf32> 2026-02-21T08:57:28.7060543Z %101 = arith.cmpf une, %73, %73 : tensor<128xf32> 2026-02-21T08:57:28.7060791Z %102 = arith.ori %100, %101 : tensor<128xi1> 
2026-02-21T08:57:28.7061100Z %103 = arith.select %102, %73, %99 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:57:28.7061387Z %104 = arith.subf %73, %103 : tensor<128xf32> 2026-02-21T08:57:28.7061852Z %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:57:28.7062368Z %106 = arith.mulf %82, %105 : tensor<128xf32> 2026-02-21T08:57:28.7062641Z %107 = tt.expand_dims %103 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:57:28.7063016Z %108 = tt.broadcast %107 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:57:28.7063313Z %109 = arith.subf %96, %108 : tensor<128x128xf32> 2026-02-21T08:57:28.7063765Z %110 = tt.extern_elementwise %109 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:57:28.7064216Z %111 = "tt.reduce"(%110) <{axis = 1 : i32}> ({ 2026-02-21T08:57:28.7064459Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:57:28.7064708Z %173 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:57:28.7064938Z tt.reduce.return %173 : f32 2026-02-21T08:57:28.7065196Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:57:28.7065625Z %112 = arith.addf %106, %111 : tensor<128xf32> 2026-02-21T08:57:28.7065896Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:57:28.7066127Z %113 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T08:57:28.7066390Z %114 = arith.addi %arg3, %113 : i32 2026-02-21T08:57:28.7066627Z %115 = tt.splat %114 : i32 -> tensor<128xi32> 2026-02-21T08:57:28.7066904Z %116 = arith.addi %115, %3 : tensor<128xi32> 2026-02-21T08:57:28.7067199Z %117 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:57:28.7067539Z %118 = arith.muli %117, %cst : tensor<128x1xi32> 2026-02-21T08:57:28.7067842Z %119 = tt.expand_dims %116 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:57:28.7068211Z %120 = tt.broadcast %118 : 
tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7068557Z %121 = tt.broadcast %119 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7068849Z %122 = arith.addi %120, %121 : tensor<128x128xi32> 2026-02-21T08:57:28.7069161Z %123 = tt.splat %arg0 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7069495Z %124 = tt.addptr %123, %122 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:57:28.7069883Z %125 = tt.load %124 evictionPolicy = evict_first : tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7070249Z %126 = arith.extf %125 : tensor<128x128xf16> to tensor<128x128xf32> 2026-02-21T08:57:28.7070528Z %127 = "tt.reduce"(%126) <{axis = 1 : i32}> ({ 2026-02-21T08:57:28.7070789Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:57:28.7071013Z %173 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:57:28.7071271Z tt.reduce.return %173 : f32 2026-02-21T08:57:28.7071498Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:57:28.7071829Z %128 = arith.truncf %127 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:57:28.7072158Z %129 = arith.extf %128 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:57:28.7072443Z %130 = arith.cmpf ogt, %103, %129 : tensor<128xf32> 2026-02-21T08:57:28.7072735Z %131 = arith.cmpf une, %103, %103 : tensor<128xf32> 2026-02-21T08:57:28.7072988Z %132 = arith.ori %130, %131 : tensor<128xi1> 2026-02-21T08:57:28.7073306Z %133 = arith.select %132, %103, %129 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:57:28.7073602Z %134 = arith.subf %103, %133 : tensor<128xf32> 2026-02-21T08:57:28.7074043Z %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:57:28.7074472Z %136 = arith.mulf %112, %135 : tensor<128xf32> 2026-02-21T08:57:28.7074774Z %137 = tt.expand_dims %133 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:57:28.7075139Z %138 = tt.broadcast %137 : tensor<128x1xf32> -> tensor<128x128xf32> 
2026-02-21T08:57:28.7075431Z %139 = arith.subf %126, %138 : tensor<128x128xf32> 2026-02-21T08:57:28.7075946Z %140 = tt.extern_elementwise %139 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:57:28.7076385Z %141 = "tt.reduce"(%140) <{axis = 1 : i32}> ({ 2026-02-21T08:57:28.7076619Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:57:28.7076874Z %173 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:57:28.7077105Z tt.reduce.return %173 : f32 2026-02-21T08:57:28.7077363Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:57:28.7077605Z %142 = arith.addf %136, %141 : tensor<128xf32> 2026-02-21T08:57:28.7077870Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:57:28.7078100Z %143 = arith.muli %c128_i32, %c3_i32 : i32 2026-02-21T08:57:28.7078389Z %144 = arith.addi %arg3, %143 : i32 2026-02-21T08:57:28.7078657Z %145 = tt.splat %144 : i32 -> tensor<128xi32> 2026-02-21T08:57:28.7078994Z %146 = arith.addi %145, %3 : tensor<128xi32> 2026-02-21T08:57:28.7079327Z %147 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:57:28.7079645Z %148 = arith.muli %147, %cst : tensor<128x1xi32> 2026-02-21T08:57:28.7079980Z %149 = tt.expand_dims %146 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:57:28.7080320Z %150 = tt.broadcast %148 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7080665Z %151 = tt.broadcast %149 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7080984Z %152 = arith.addi %150, %151 : tensor<128x128xi32> 2026-02-21T08:57:28.7081271Z %153 = tt.splat %arg0 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7081664Z %154 = tt.addptr %153, %152 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:57:28.7082027Z %155 = tt.load %154 evictionPolicy = evict_first : tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7082400Z %156 = arith.extf %155 : tensor<128x128xf16> to tensor<128x128xf32> 
2026-02-21T08:57:28.7082707Z %157 = "tt.reduce"(%156) <{axis = 1 : i32}> ({ 2026-02-21T08:57:28.7082942Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:57:28.7083167Z %173 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T08:57:28.7083401Z tt.reduce.return %173 : f32 2026-02-21T08:57:28.7083656Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:57:28.7083921Z %158 = arith.truncf %157 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:57:28.7084262Z %159 = arith.extf %158 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:57:28.7084551Z %160 = arith.cmpf ogt, %133, %159 : tensor<128xf32> 2026-02-21T08:57:28.7084853Z %161 = arith.cmpf une, %133, %133 : tensor<128xf32> 2026-02-21T08:57:28.7085151Z %162 = arith.ori %160, %161 : tensor<128xi1> 2026-02-21T08:57:28.7085448Z %163 = arith.select %162, %133, %159 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:57:28.7085780Z %164 = arith.subf %133, %163 : tensor<128xf32> 2026-02-21T08:57:28.7086209Z %165 = tt.extern_elementwise %164 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:57:28.7086662Z %166 = arith.mulf %142, %165 : tensor<128xf32> 2026-02-21T08:57:28.7087010Z %167 = tt.expand_dims %163 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:57:28.7087371Z %168 = tt.broadcast %167 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:57:28.7087709Z %169 = arith.subf %156, %168 : tensor<128x128xf32> 2026-02-21T08:57:28.7088156Z %170 = tt.extern_elementwise %169 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:57:28.7088619Z %171 = "tt.reduce"(%170) <{axis = 1 : i32}> ({ 2026-02-21T08:57:28.7088863Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:57:28.7089143Z %173 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:57:28.7089482Z tt.reduce.return %173 : f32 2026-02-21T08:57:28.7089718Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:57:28.7090000Z %172 
= arith.addf %166, %171 : tensor<128xf32> 2026-02-21T08:57:28.7090276Z scf.yield %163, %172 : tensor<128xf32>, tensor<128xf32> 2026-02-21T08:57:28.7090642Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:57:28.7090974Z %7 = tt.splat %c11264_i32 : i32 -> tensor<128xi32> 2026-02-21T08:57:28.7091267Z %8 = arith.addi %7, %3 : tensor<128xi32> 2026-02-21T08:57:28.7091637Z %9 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:57:28.7091954Z %10 = arith.muli %9, %cst : tensor<128x1xi32> 2026-02-21T08:57:28.7092297Z %11 = tt.expand_dims %8 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:57:28.7092695Z %12 = tt.broadcast %10 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7093056Z %13 = tt.broadcast %11 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7093366Z %14 = arith.addi %12, %13 : tensor<128x128xi32> 2026-02-21T08:57:28.7093674Z %15 = tt.splat %arg0 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7094029Z %16 = tt.addptr %15, %14 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:57:28.7094372Z %17 = tt.load %16 evictionPolicy = evict_first : tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7094741Z %18 = arith.extf %17 : tensor<128x128xf16> to tensor<128x128xf32> 2026-02-21T08:57:28.7095016Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:57:28.7095277Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:57:28.7095504Z %55 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T08:57:28.7095771Z tt.reduce.return %55 : f32 2026-02-21T08:57:28.7096033Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:57:28.7096302Z %20 = arith.truncf %19 : tensor<128xf32> to tensor<128xf16> 2026-02-21T08:57:28.7096628Z %21 = arith.extf %20 : tensor<128xf16> to tensor<128xf32> 2026-02-21T08:57:28.7096898Z %22 = arith.cmpf ogt, %6#0, %21 : tensor<128xf32> 2026-02-21T08:57:28.7097182Z %23 = arith.cmpf une, %6#0, %6#0 : tensor<128xf32> 
2026-02-21T08:57:28.7097424Z %24 = arith.ori %22, %23 : tensor<128xi1> 2026-02-21T08:57:28.7097725Z %25 = arith.select %24, %6#0, %21 : tensor<128xi1>, tensor<128xf32> 2026-02-21T08:57:28.7098031Z %26 = arith.subf %6#0, %25 : tensor<128xf32> 2026-02-21T08:57:28.7098435Z %27 = tt.extern_elementwise %26 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T08:57:28.7098862Z %28 = arith.mulf %6#1, %27 : tensor<128xf32> 2026-02-21T08:57:28.7099151Z %29 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:57:28.7099520Z %30 = tt.broadcast %29 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:57:28.7099823Z %31 = arith.subf %18, %30 : tensor<128x128xf32> 2026-02-21T08:57:28.7100232Z %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:57:28.7100660Z %33 = "tt.reduce"(%32) <{axis = 1 : i32}> ({ 2026-02-21T08:57:28.7100878Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T08:57:28.7101106Z %55 = arith.addf %arg3, %arg4 : f32 2026-02-21T08:57:28.7101332Z tt.reduce.return %55 : f32 2026-02-21T08:57:28.7101615Z }) : (tensor<128x128xf32>) -> tensor<128xf32> 2026-02-21T08:57:28.7101880Z %34 = arith.addf %28, %33 : tensor<128xf32> 2026-02-21T08:57:28.7102124Z %c11264_i32_2 = arith.constant 11264 : i32 2026-02-21T08:57:28.7102360Z %c512_i32_3 = arith.constant 512 : i32 2026-02-21T08:57:28.7102630Z scf.for %arg3 = %c0_i32 to %c11264_i32_2 step %c512_i32_3 : i32 { 2026-02-21T08:57:28.7102947Z %55 = tt.splat %arg3 : i32 -> tensor<128xi32> 2026-02-21T08:57:28.7103255Z %56 = arith.addi %55, %3 : tensor<128xi32> 2026-02-21T08:57:28.7103619Z %57 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<128x128xf16> 2026-02-21T08:57:28.7104039Z %58 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:57:28.7104370Z %59 = arith.extf %57 : 
tensor<128x128xf16> to tensor<128x128xf32> 2026-02-21T08:57:28.7104713Z %60 = tt.broadcast %58 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:57:28.7105003Z %61 = arith.subf %59, %60 : tensor<128x128xf32> 2026-02-21T08:57:28.7105442Z %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:57:28.7105899Z %63 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:57:28.7106323Z %64 = tt.broadcast %63 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:57:28.7106632Z %65 = arith.divf %62, %64 : tensor<128x128xf32> 2026-02-21T08:57:28.7106905Z %66 = arith.truncf %65 : tensor<128x128xf32> to tensor<128x128xf16> 2026-02-21T08:57:28.7107255Z %67 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:57:28.7107556Z %68 = arith.muli %67, %cst : tensor<128x1xi32> 2026-02-21T08:57:28.7107879Z %69 = tt.expand_dims %56 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:57:28.7108232Z %70 = tt.broadcast %68 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7108533Z %71 = tt.broadcast %69 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7108838Z %72 = arith.addi %70, %71 : tensor<128x128xi32> 2026-02-21T08:57:28.7109119Z %73 = tt.splat %arg1 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7109473Z %74 = tt.addptr %73, %72 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:57:28.7109777Z tt.store %74, %66 : tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7110054Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:57:28.7110313Z %75 = arith.muli %c128_i32, %c1_i32 : i32 2026-02-21T08:57:28.7110548Z %76 = arith.addi %arg3, %75 : i32 2026-02-21T08:57:28.7110807Z %77 = tt.splat %76 : i32 -> tensor<128xi32> 2026-02-21T08:57:28.7111044Z %78 = arith.addi %77, %3 : tensor<128xi32> 2026-02-21T08:57:28.7111403Z %79 = tt.descriptor_load %0[%2, %76] : 
!tt.tensordesc> -> tensor<128x128xf16> 2026-02-21T08:57:28.7111818Z %80 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:57:28.7112177Z %81 = arith.extf %79 : tensor<128x128xf16> to tensor<128x128xf32> 2026-02-21T08:57:28.7112508Z %82 = tt.broadcast %80 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:57:28.7112791Z %83 = arith.subf %81, %82 : tensor<128x128xf32> 2026-02-21T08:57:28.7113236Z %84 = tt.extern_elementwise %83 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:57:28.7113697Z %85 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:57:28.7114058Z %86 = tt.broadcast %85 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:57:28.7114372Z %87 = arith.divf %84, %86 : tensor<128x128xf32> 2026-02-21T08:57:28.7114649Z %88 = arith.truncf %87 : tensor<128x128xf32> to tensor<128x128xf16> 2026-02-21T08:57:28.7115001Z %89 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:57:28.7115302Z %90 = arith.muli %89, %cst : tensor<128x1xi32> 2026-02-21T08:57:28.7115619Z %91 = tt.expand_dims %78 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:57:28.7115946Z %92 = tt.broadcast %90 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7116327Z %93 = tt.broadcast %91 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7116628Z %94 = arith.addi %92, %93 : tensor<128x128xi32> 2026-02-21T08:57:28.7116897Z %95 = tt.splat %arg1 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7117222Z %96 = tt.addptr %95, %94 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:57:28.7117523Z tt.store %96, %88 : tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7117799Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:57:28.7118028Z %97 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T08:57:28.7118286Z %98 = arith.addi %arg3, %97 : i32 
2026-02-21T08:57:28.7118543Z %99 = tt.splat %98 : i32 -> tensor<128xi32> 2026-02-21T08:57:28.7118792Z %100 = arith.addi %99, %3 : tensor<128xi32> 2026-02-21T08:57:28.7119211Z %101 = tt.descriptor_load %0[%2, %98] : !tt.tensordesc> -> tensor<128x128xf16> 2026-02-21T08:57:28.7119608Z %102 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:57:28.7119971Z %103 = arith.extf %101 : tensor<128x128xf16> to tensor<128x128xf32> 2026-02-21T08:57:28.7120286Z %104 = tt.broadcast %102 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:57:28.7120605Z %105 = arith.subf %103, %104 : tensor<128x128xf32> 2026-02-21T08:57:28.7121059Z %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:57:28.7121562Z %107 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:57:28.7121924Z %108 = tt.broadcast %107 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:57:28.7122213Z %109 = arith.divf %106, %108 : tensor<128x128xf32> 2026-02-21T08:57:28.7122533Z %110 = arith.truncf %109 : tensor<128x128xf32> to tensor<128x128xf16> 2026-02-21T08:57:28.7122899Z %111 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:57:28.7123210Z %112 = arith.muli %111, %cst : tensor<128x1xi32> 2026-02-21T08:57:28.7123543Z %113 = tt.expand_dims %100 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:57:28.7123879Z %114 = tt.broadcast %112 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7124214Z %115 = tt.broadcast %113 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7124497Z %116 = arith.addi %114, %115 : tensor<128x128xi32> 2026-02-21T08:57:28.7124812Z %117 = tt.splat %arg1 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7125174Z %118 = tt.addptr %117, %116 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:57:28.7125488Z 
tt.store %118, %110 : tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7125764Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:57:28.7125996Z %119 = arith.muli %c128_i32, %c3_i32 : i32 2026-02-21T08:57:28.7126262Z %120 = arith.addi %arg3, %119 : i32 2026-02-21T08:57:28.7126502Z %121 = tt.splat %120 : i32 -> tensor<128xi32> 2026-02-21T08:57:28.7126773Z %122 = arith.addi %121, %3 : tensor<128xi32> 2026-02-21T08:57:28.7127139Z %123 = tt.descriptor_load %0[%2, %120] : !tt.tensordesc> -> tensor<128x128xf16> 2026-02-21T08:57:28.7127546Z %124 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:57:28.7127926Z %125 = arith.extf %123 : tensor<128x128xf16> to tensor<128x128xf32> 2026-02-21T08:57:28.7128253Z %126 = tt.broadcast %124 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:57:28.7128582Z %127 = arith.subf %125, %126 : tensor<128x128xf32> 2026-02-21T08:57:28.7129059Z %128 = tt.extern_elementwise %127 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:57:28.7129625Z %129 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:57:28.7130012Z %130 = tt.broadcast %129 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:57:28.7130315Z %131 = arith.divf %128, %130 : tensor<128x128xf32> 2026-02-21T08:57:28.7130650Z %132 = arith.truncf %131 : tensor<128x128xf32> to tensor<128x128xf16> 2026-02-21T08:57:28.7131031Z %133 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:57:28.7131320Z %134 = arith.muli %133, %cst : tensor<128x1xi32> 2026-02-21T08:57:28.7131692Z %135 = tt.expand_dims %122 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:57:28.7132048Z %136 = tt.broadcast %134 : tensor<128x1xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7132397Z %137 = tt.broadcast %135 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7132752Z %138 = arith.addi %136, 
%137 : tensor<128x128xi32> 2026-02-21T08:57:28.7133081Z %139 = tt.splat %arg1 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7133452Z %140 = tt.addptr %139, %138 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:57:28.7133778Z tt.store %140, %132 : tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7134122Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:57:28.7134445Z %35 = tt.splat %c11264_i32_2 : i32 -> tensor<128xi32> 2026-02-21T08:57:28.7134749Z %36 = arith.addi %35, %3 : tensor<128xi32> 2026-02-21T08:57:28.7135094Z %37 = tt.descriptor_load %0[%2, %c11264_i32_2] : !tt.tensordesc> -> tensor<128x128xf16> 2026-02-21T08:57:28.7135521Z %38 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:57:28.7135883Z %39 = arith.extf %37 : tensor<128x128xf16> to tensor<128x128xf32> 2026-02-21T08:57:28.7136191Z %40 = tt.broadcast %38 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:57:28.7136499Z %41 = arith.subf %39, %40 : tensor<128x128xf32> 2026-02-21T08:57:28.7136904Z %42 = tt.extern_elementwise %41 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32> 2026-02-21T08:57:28.7137365Z %43 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T08:57:28.7137726Z %44 = tt.broadcast %43 : tensor<128x1xf32> -> tensor<128x128xf32> 2026-02-21T08:57:28.7138000Z %45 = arith.divf %42, %44 : tensor<128x128xf32> 2026-02-21T08:57:28.7138302Z %46 = arith.truncf %45 : tensor<128x128xf32> to tensor<128x128xf16> 2026-02-21T08:57:28.7138620Z %47 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:57:28.7138939Z %48 = arith.muli %47, %cst : tensor<128x1xi32> 2026-02-21T08:57:28.7139229Z %49 = tt.expand_dims %36 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T08:57:28.7139570Z %50 = tt.broadcast %48 : tensor<128x1xi32> -> tensor<128x128xi32> 
2026-02-21T08:57:28.7139895Z %51 = tt.broadcast %49 : tensor<1x128xi32> -> tensor<128x128xi32> 2026-02-21T08:57:28.7140166Z %52 = arith.addi %50, %51 : tensor<128x128xi32> 2026-02-21T08:57:28.7140466Z %53 = tt.splat %arg1 : !tt.ptr -> tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7140785Z %54 = tt.addptr %53, %52 : tensor<128x128x!tt.ptr>, tensor<128x128xi32> 2026-02-21T08:57:28.7141108Z tt.store %54, %46 : tensor<128x128x!tt.ptr> 2026-02-21T08:57:28.7141445Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:57:28.7141765Z tt.return 2026-02-21T08:57:28.7141967Z } 2026-02-21T08:57:28.7142132Z } 2026-02-21T08:57:28.7142222Z 2026-02-21T08:57:28.7142324Z {-# 2026-02-21T08:57:28.7142496Z external_resources: { 2026-02-21T08:57:28.7142725Z mlir_reproducer: { 2026-02-21T08:57:28.7147112Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal 
test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:57:28.7151745Z disable_threading: false, 2026-02-21T08:57:28.7152038Z verify_each: true 2026-02-21T08:57:28.7152247Z } 2026-02-21T08:57:28.7152468Z } 2026-02-21T08:57:28.7152662Z #-} 2026-02-21T08:57:28.7153271Z /tmp/torchinductor_root/yi/cyilusxyo4s4jj75drhhgdbymmwkq5c346veo2nlrpr3gu3qs3ra.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:57:28.7154675Z /tmp/torchinductor_root/yi/cyilusxyo4s4jj75drhhgdbymmwkq5c346veo2nlrpr3gu3qs3ra.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:57:28.7155784Z [42s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:57:28.7157035Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 128], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', ''], num_sm_multiplier=1, num_stages=7, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[1, 2], range_unroll_factors=[1, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:57:28.7158130Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:57:28.7158521Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:57:34.3836938Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 12.3 configs/s 2026-02-21T08:57:34.3847264Z [48s] Adaptive compile timeout: 30s (90% percentile=11.0s, bounds=[30.0s, 30s]) 2026-02-21T08:57:35.2905637Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1086.0 configs/s 2026-02-21T08:57:35.3583346Z [49s] Initial random population of 100, 5 starting points: 2026-02-21T08:57:35.3587809Z error=6 2026-02-21T08:57:35.3592145Z timeout=1 2026-02-21T08:57:35.3593642Z ok=93 2026-02-21T08:57:35.3593954Z min=0.0615 2026-02-21T08:57:35.3594201Z mid=1.0433 2026-02-21T08:57:35.3594425Z max=58.3782 2026-02-21T08:57:35.3594683Z best={'block_sizes': [1, 16384], 2026-02-21T08:57:35.3595003Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:57:35.3595747Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:57:35.3595984Z 'num_sm_multiplier': 8, 2026-02-21T08:57:35.3596220Z 'num_stages': 3, 2026-02-21T08:57:35.3596401Z 'num_warps': 1, 2026-02-21T08:57:35.3596632Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:57:35.3596883Z 'range_flattens': [False, None], 2026-02-21T08:57:35.3597139Z 'range_multi_buffers': [True, True], 2026-02-21T08:57:35.3597411Z 'range_num_stages': [1, 2], 2026-02-21T08:57:35.3597623Z 'range_unroll_factors': [0, 1], 2026-02-21T08:57:35.3597875Z 'range_warp_specializes': 
[True, None]} 2026-02-21T08:57:35.3603302Z [49s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:57:36.4409307Z [50s] Generation 1 starting: 81 neighbors, 5 active search path(s) 2026-02-21T08:57:49.5170075Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 19.1 configs/s 2026-02-21T08:57:54.5569740Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 17.0 configs/s 2026-02-21T08:58:02.3473442Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 129.6 2026-02-21T08:58:02.3474079Z configs/s 2026-02-21T08:58:02.7252096Z [76s] Generation 1 complete: 2026-02-21T08:58:02.7254108Z ok=87 2026-02-21T08:58:02.7254377Z min=0.0574 2026-02-21T08:58:02.7254602Z mid=0.0737 2026-02-21T08:58:02.7254780Z max=0.4055 2026-02-21T08:58:02.7254998Z best={'block_sizes': [1, 16384], 2026-02-21T08:58:02.7255290Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:58:02.7255610Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:58:02.7255836Z 'num_stages': 3, 2026-02-21T08:58:02.7256050Z 'num_warps': 1, 2026-02-21T08:58:02.7256247Z 'pid_type': 'flat', 2026-02-21T08:58:02.7256496Z 'range_flattens': [None, None], 2026-02-21T08:58:02.7256727Z 'range_multi_buffers': [None, True], 2026-02-21T08:58:02.7256996Z 'range_num_stages': [0, 2], 2026-02-21T08:58:02.7257256Z 'range_unroll_factors': [0, 1], 2026-02-21T08:58:02.7257534Z 'range_warp_specializes': [None, True]} 2026-02-21T08:58:02.7268793Z [76s] Fitting surrogate: 187 points, 187 targets 2026-02-21T08:58:03.6574084Z [77s] Generation 2 starting: 64 neighbors, 5 active search path(s) 2026-02-21T08:58:18.5251230Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 1.5 configs/s 2026-02-21T08:58:22.5372963Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 17.2 configs/s 2026-02-21T08:58:28.3390764Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 174.1 2026-02-21T08:58:28.3395012Z configs/s 2026-02-21T08:58:28.6258539Z 
[102s] Generation 2 complete: 2026-02-21T08:58:28.6262638Z ok=70 2026-02-21T08:58:28.6267845Z min=0.0573 2026-02-21T08:58:28.6269315Z mid=0.0737 2026-02-21T08:58:28.6275115Z max=1.2350 2026-02-21T08:58:28.6279638Z best={'block_sizes': [1, 16384], 2026-02-21T08:58:28.6283615Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:58:28.6284054Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:58:28.6284321Z 'num_stages': 3, 2026-02-21T08:58:28.6288729Z 'num_warps': 1, 2026-02-21T08:58:28.6293337Z 'pid_type': 'flat', 2026-02-21T08:58:28.6298378Z 'range_flattens': [None, None], 2026-02-21T08:58:28.6303255Z 'range_multi_buffers': [None, True], 2026-02-21T08:58:28.6307914Z 'range_num_stages': [0, 2], 2026-02-21T08:58:28.6309449Z 'range_unroll_factors': [0, 1], 2026-02-21T08:58:28.6309721Z 'range_warp_specializes': [None, True]} 2026-02-21T08:58:28.6310009Z [102s] Fitting surrogate: 257 points, 257 targets 2026-02-21T08:58:29.5177074Z [103s] Generation 3 starting: 65 neighbors, 5 active search path(s) 2026-02-21T08:59:08.0859624Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66/66 0.1 configs/s 2026-02-21T08:59:11.9543532Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 66/66 17.3 configs/s 2026-02-21T08:59:16.7408095Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 210.8 2026-02-21T08:59:16.9991295Z [150s] Generation 3 complete: 2026-02-21T08:59:16.9991942Z configs/s 2026-02-21T08:59:16.9992770Z ok=70 2026-02-21T08:59:16.9992945Z min=0.0450 2026-02-21T08:59:16.9993148Z mid=0.0614 2026-02-21T08:59:16.9993313Z max=0.2437 2026-02-21T08:59:16.9993526Z best={'block_sizes': [1, 16384], 2026-02-21T08:59:16.9993785Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:59:16.9994081Z 'load_eviction_policies': ['', ''], 2026-02-21T08:59:16.9994326Z 'num_stages': 4, 2026-02-21T08:59:16.9994509Z 'num_warps': 8, 2026-02-21T08:59:16.9994716Z 'pid_type': 'flat', 2026-02-21T08:59:16.9994916Z 'range_flattens': 
[None, None], 2026-02-21T08:59:16.9995155Z 'range_multi_buffers': [None, True], 2026-02-21T08:59:16.9995374Z 'range_num_stages': [0, 0], 2026-02-21T08:59:16.9995612Z 'range_unroll_factors': [0, 0], 2026-02-21T08:59:17.0009065Z 'range_warp_specializes': [None, True]} 2026-02-21T08:59:17.0009509Z [150s] Fitting surrogate: 327 points, 327 targets 2026-02-21T08:59:17.7726304Z [151s] Generation 4 starting: 51 neighbors, 4 active search path(s) 2026-02-21T08:59:27.3479497Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 3.3 configs/s 2026-02-21T08:59:30.4705034Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 17.2 configs/s 2026-02-21T08:59:33.8948245Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 294.3 2026-02-21T08:59:33.8948830Z configs/s 2026-02-21T08:59:34.0952968Z [167s] Generation 4 complete: 2026-02-21T08:59:34.0954691Z ok=56 2026-02-21T08:59:34.0954951Z min=0.0410 2026-02-21T08:59:34.0955144Z mid=0.0594 2026-02-21T08:59:34.0955354Z max=0.1946 2026-02-21T08:59:34.0955550Z best={'block_sizes': [1, 16384], 2026-02-21T08:59:34.0955875Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:59:34.0956208Z 'load_eviction_policies': ['', ''], 2026-02-21T08:59:34.0956490Z 'num_stages': 3, 2026-02-21T08:59:34.0956772Z 'num_warps': 2, 2026-02-21T08:59:34.0961851Z 'pid_type': 'flat', 2026-02-21T08:59:34.0963829Z 'range_flattens': [None, False], 2026-02-21T08:59:34.0964213Z 'range_multi_buffers': [None, True], 2026-02-21T08:59:34.0969254Z 'range_num_stages': [0, 0], 2026-02-21T08:59:34.0971856Z 'range_unroll_factors': [0, 0], 2026-02-21T08:59:34.0972186Z 'range_warp_specializes': [None, True]} 2026-02-21T08:59:34.0972535Z [167s] Fitting surrogate: 383 points, 383 targets 2026-02-21T08:59:34.8218339Z [168s] Generation 5 starting: 42 neighbors, 3 active search path(s) 2026-02-21T08:59:45.3503791Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44/44 1.6 configs/s 2026-02-21T08:59:47.9690580Z 
Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 44/44 16.7 configs/s 2026-02-21T08:59:51.6857883Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 357.5 2026-02-21T08:59:51.6858505Z configs/s 2026-02-21T08:59:51.8691887Z [185s] Generation 5 complete: 2026-02-21T08:59:51.8692209Z ok=46 2026-02-21T08:59:51.8692421Z min=0.0419 2026-02-21T08:59:51.8692757Z mid=0.0574 2026-02-21T08:59:51.8692952Z max=0.1004 2026-02-21T08:59:51.8693136Z best={'block_sizes': [1, 16384], 2026-02-21T08:59:51.8693454Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T08:59:51.8693753Z 'load_eviction_policies': ['', ''], 2026-02-21T08:59:51.8698415Z 'num_stages': 3, 2026-02-21T08:59:51.8702572Z 'num_warps': 4, 2026-02-21T08:59:51.8706660Z 'pid_type': 'flat', 2026-02-21T08:59:51.8710876Z 'range_flattens': [None, False], 2026-02-21T08:59:51.8714963Z 'range_multi_buffers': [None, True], 2026-02-21T08:59:51.8719561Z 'range_num_stages': [0, 0], 2026-02-21T08:59:51.8720924Z 'range_unroll_factors': [0, 0], 2026-02-21T08:59:51.8721197Z 'range_warp_specializes': [None, True]} 2026-02-21T08:59:51.8722032Z [185s] Fitting surrogate: 429 points, 429 targets 2026-02-21T08:59:52.5780798Z [186s] Generation 6 starting: 42 neighbors, 3 active search path(s) 2026-02-21T09:00:02.0660308Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43/43 1.7 configs/s 2026-02-21T09:00:04.5942267Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 43/43 17.3 configs/s 2026-02-21T09:00:07.0213027Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 414.5 2026-02-21T09:00:07.0213460Z configs/s 2026-02-21T09:00:07.1743816Z [200s] Generation 6 complete: 2026-02-21T09:00:07.1748217Z error=1 2026-02-21T09:00:07.1749837Z ok=44 2026-02-21T09:00:07.1750122Z min=0.0409 2026-02-21T09:00:07.1755398Z mid=0.0574 2026-02-21T09:00:07.1757152Z max=0.1290 2026-02-21T09:00:07.1757416Z best={'block_sizes': [1, 16384], 2026-02-21T09:00:07.1757802Z 
'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:00:07.1761846Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:00:07.1763261Z 'num_stages': 5, 2026-02-21T09:00:07.1763523Z 'num_warps': 4, 2026-02-21T09:00:07.1763720Z 'pid_type': 'flat', 2026-02-21T09:00:07.1763952Z 'range_flattens': [None, True], 2026-02-21T09:00:07.1764175Z 'range_multi_buffers': [None, None], 2026-02-21T09:00:07.1764431Z 'range_num_stages': [0, 2], 2026-02-21T09:00:07.1764642Z 'range_unroll_factors': [0, 0], 2026-02-21T09:00:07.1764893Z 'range_warp_specializes': [None, True]} 2026-02-21T09:00:07.1765221Z [200s] Fitting surrogate: 474 points, 474 targets 2026-02-21T09:00:07.4627304Z [201s] Generation 7 starting: 10 neighbors, 1 active search path(s) 2026-02-21T09:00:09.6617639Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 6.1 configs/s 2026-02-21T09:00:10.2611464Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 18.1 configs/s 2026-02-21T09:00:11.0408651Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1269.7 2026-02-21T09:00:11.0412900Z configs/s 2026-02-21T09:00:11.1026430Z [204s] Generation 7 complete: 2026-02-21T09:00:11.1030777Z ok=11 2026-02-21T09:00:11.1034865Z min=0.0428 2026-02-21T09:00:11.1036914Z mid=0.0451 2026-02-21T09:00:11.1037125Z max=0.0655 2026-02-21T09:00:11.1037350Z best={'block_sizes': [1, 16384], 2026-02-21T09:00:11.1037648Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:00:11.1037976Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:00:11.1038207Z 'num_stages': 5, 2026-02-21T09:00:11.1038415Z 'num_warps': 1, 2026-02-21T09:00:11.1038622Z 'pid_type': 'flat', 2026-02-21T09:00:11.1038823Z 'range_flattens': [None, True], 2026-02-21T09:00:11.1039064Z 'range_multi_buffers': [None, None], 2026-02-21T09:00:11.1039284Z 'range_num_stages': [0, 2], 2026-02-21T09:00:11.1039513Z 'range_unroll_factors': [0, 0], 2026-02-21T09:00:11.1039726Z 
'range_warp_specializes': [None, True]} 2026-02-21T09:00:11.1050292Z [204s] Fitting surrogate: 485 points, 485 targets 2026-02-21T09:00:11.2722746Z [204s] Autotuning complete in 204.9s after searching 468 configs. 2026-02-21T09:00:11.2724878Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:00:11.2725914Z @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T09:00:11.2726825Z 2026-02-21T09:00:11.2727129Z [204s] Code of selected kernel: /tmp/torchinductor_root/vy/cvyqpaemgsb3xa23z4dnhvblqyfc45ml3zbow5nujzu3wnhoqarl.py 2026-02-21T09:00:12.4054699Z WARNING:tritonbench.utils.triton_op:Completed input ID 87: 2026-02-21T09:00:12.4059176Z (M, N) 2026-02-21T09:00:12.4064730Z ------------- 2026-02-21T09:00:12.4069287Z (4096, 11392) 2026-02-21T09:00:12.4069537Z 2026-02-21T09:00:12.4070108Z 90%|█████████ | 18/20 [51:09<06:25, 192.92s/it]WARNING:tritonbench.utils.triton_op:Running input ID 92: 2026-02-21T09:00:12.4070491Z (M, N) 2026-02-21T09:00:12.4070668Z ------------- 2026-02-21T09:00:12.4070866Z (4096, 12032) 2026-02-21T09:00:12.4071176Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T09:00:13.5777690Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T09:00:14.9553898Z INFO:tritonbench.utils.triton_op:Took 2.49ms to get benchmark function for torch_compile_softmax 2026-02-21T09:00:18.9631315Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:00:18.9635592Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:00:18.9640219Z 'dtype': 'torch.float16', 2026-02-21T09:00:18.9644179Z 'shape': (4096, 12032), 
2026-02-21T09:00:18.9646235Z 'stride': (12032, 1)},), 2026-02-21T09:00:18.9646573Z 'kwargs': {}} 2026-02-21T09:00:18.9659892Z INFO:tritonbench.utils.triton_op:Took 3.14ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T09:00:19.1442127Z [0s] Autotune random seed: 2134816249 2026-02-21T09:00:19.2909662Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:01:01.0775325Z [41s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[4, 1], range_warp_specializes=[None, None]) 2026-02-21T09:01:01.0779124Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.2 configs/s 2026-02-21T09:01:09.3442610Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 12.1 configs/s 2026-02-21T09:01:09.3451232Z [50s] Adaptive compile timeout: 30s (90% percentile=13.2s, bounds=[30.0s, 30s]) 2026-02-21T09:01:10.2630089Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1072.3 configs/s 2026-02-21T09:01:10.3304352Z [51s] Initial random population of 100, 5 starting points: 2026-02-21T09:01:10.3305102Z error=5 2026-02-21T09:01:10.3305333Z timeout=1 2026-02-21T09:01:10.3305561Z ok=94 2026-02-21T09:01:10.3305766Z min=0.0634 2026-02-21T09:01:10.3305952Z mid=1.1162 2026-02-21T09:01:10.3306170Z max=61.4842 2026-02-21T09:01:10.3306385Z best={'block_sizes': [1, 16384], 2026-02-21T09:01:10.3306707Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:01:10.3307017Z 'load_eviction_policies': ['last', ''], 2026-02-21T09:01:10.3307322Z 'num_sm_multiplier': 8, 2026-02-21T09:01:10.3307531Z 'num_stages': 3, 2026-02-21T09:01:10.3307758Z 
'num_warps': 1, 2026-02-21T09:01:10.3307995Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:01:10.3308280Z 'range_flattens': [False, None], 2026-02-21T09:01:10.3308955Z 'range_multi_buffers': [True, True], 2026-02-21T09:01:10.3309214Z 'range_num_stages': [1, 2], 2026-02-21T09:01:10.3309491Z 'range_unroll_factors': [0, 1], 2026-02-21T09:01:10.3309715Z 'range_warp_specializes': [True, None]} 2026-02-21T09:01:10.3318336Z [51s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:01:11.5251778Z [52s] Generation 1 starting: 80 neighbors, 5 active search path(s) 2026-02-21T09:01:28.0558651Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 1.1 configs/s 2026-02-21T09:01:33.1090337Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 17.0 configs/s 2026-02-21T09:01:35.5074182Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 418.8 2026-02-21T09:01:35.5079201Z configs/s 2026-02-21T09:01:35.6416453Z [76s] Generation 1 complete: 2026-02-21T09:01:35.6418077Z ok=86 2026-02-21T09:01:35.6418861Z min=0.0450 2026-02-21T09:01:35.6419110Z mid=0.0758 2026-02-21T09:01:35.6419347Z max=0.3871 2026-02-21T09:01:35.6419561Z best={'block_sizes': [1, 16384], 2026-02-21T09:01:35.6419912Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:01:35.6420206Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:01:35.6420461Z 'num_sm_multiplier': 8, 2026-02-21T09:01:35.6420670Z 'num_stages': 3, 2026-02-21T09:01:35.6420885Z 'num_warps': 1, 2026-02-21T09:01:35.6421117Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:01:35.6421353Z 'range_flattens': [False, None], 2026-02-21T09:01:35.6421879Z 'range_multi_buffers': [True, True], 2026-02-21T09:01:35.6422115Z 'range_num_stages': [1, 2], 2026-02-21T09:01:35.6422364Z 'range_unroll_factors': [1, 1], 2026-02-21T09:01:35.6422585Z 'range_warp_specializes': [True, None]} 2026-02-21T09:01:35.6431325Z [76s] Fitting surrogate: 186 points, 186 targets 2026-02-21T09:01:36.6461214Z 
[77s] Generation 2 starting: 74 neighbors, 5 active search path(s) 2026-02-21T09:01:48.2962642Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 10.8 configs/s 2026-02-21T09:01:52.7255019Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 17.1 configs/s 2026-02-21T09:01:56.6919658Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 313.7 2026-02-21T09:01:56.6921106Z configs/s 2026-02-21T09:01:56.8817605Z [97s] Generation 2 complete: 2026-02-21T09:01:56.8819710Z ok=79 2026-02-21T09:01:56.8819924Z min=0.0430 2026-02-21T09:01:56.8820130Z mid=0.0655 2026-02-21T09:01:56.8820294Z max=0.2899 2026-02-21T09:01:56.8820510Z best={'block_sizes': [1, 16384], 2026-02-21T09:01:56.8820785Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:01:56.8821089Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:01:56.8821320Z 'num_stages': 3, 2026-02-21T09:01:56.8821523Z 'num_warps': 1, 2026-02-21T09:01:56.8821911Z 'pid_type': 'flat', 2026-02-21T09:01:56.8822193Z 'range_flattens': [None, None], 2026-02-21T09:01:56.8822430Z 'range_multi_buffers': [None, True], 2026-02-21T09:01:56.8822720Z 'range_num_stages': [0, 2], 2026-02-21T09:01:56.8825778Z 'range_unroll_factors': [0, 1], 2026-02-21T09:01:56.8827892Z 'range_warp_specializes': [None, True]} 2026-02-21T09:01:56.8830362Z [97s] Fitting surrogate: 265 points, 265 targets 2026-02-21T09:01:57.8227965Z [98s] Generation 3 starting: 68 neighbors, 5 active search path(s) 2026-02-21T09:02:13.8172242Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 1.1 configs/s 2026-02-21T09:02:17.8195188Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 17.2 configs/s 2026-02-21T09:02:22.9667128Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 196.2 2026-02-21T09:02:22.9667557Z configs/s 2026-02-21T09:02:23.2406936Z [123s] Generation 3 complete: 2026-02-21T09:02:23.2410714Z error=1 2026-02-21T09:02:23.2415760Z ok=72 
2026-02-21T09:02:23.2419867Z min=0.0430 2026-02-21T09:02:23.2421787Z mid=0.0613 2026-02-21T09:02:23.2422031Z max=0.5242 2026-02-21T09:02:23.2422219Z best={'block_sizes': [1, 16384], 2026-02-21T09:02:23.2422536Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:02:23.2422815Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:02:23.2423078Z 'num_stages': 3, 2026-02-21T09:02:23.2423292Z 'num_warps': 1, 2026-02-21T09:02:23.2423476Z 'pid_type': 'flat', 2026-02-21T09:02:23.2423700Z 'range_flattens': [None, None], 2026-02-21T09:02:23.2423923Z 'range_multi_buffers': [None, True], 2026-02-21T09:02:23.2424176Z 'range_num_stages': [0, 2], 2026-02-21T09:02:23.2424377Z 'range_unroll_factors': [0, 1], 2026-02-21T09:02:23.2424629Z 'range_warp_specializes': [None, True]} 2026-02-21T09:02:23.2426395Z [123s] Fitting surrogate: 338 points, 338 targets 2026-02-21T09:02:24.0184706Z [124s] Generation 4 starting: 49 neighbors, 5 active search path(s) 2026-02-21T09:02:33.7161081Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49/49 1.5 configs/s 2026-02-21T09:02:36.6228703Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 49/49 17.1 configs/s 2026-02-21T09:02:40.9405815Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 234.1 2026-02-21T09:02:40.9409925Z configs/s 2026-02-21T09:02:41.1901469Z [141s] Generation 4 complete: 2026-02-21T09:02:41.1905810Z ok=54 2026-02-21T09:02:41.1910263Z min=0.0429 2026-02-21T09:02:41.1910602Z mid=0.0594 2026-02-21T09:02:41.1910813Z max=0.1105 2026-02-21T09:02:41.1911051Z best={'block_sizes': [1, 16384], 2026-02-21T09:02:41.1911371Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:02:41.1916002Z 'load_eviction_policies': ['', ''], 2026-02-21T09:02:41.1920515Z 'num_stages': 5, 2026-02-21T09:02:41.1924910Z 'num_warps': 2, 2026-02-21T09:02:41.1926420Z 'pid_type': 'flat', 2026-02-21T09:02:41.1926715Z 'range_flattens': [None, True], 2026-02-21T09:02:41.1927018Z 
'range_multi_buffers': [None, None], 2026-02-21T09:02:41.1927251Z 'range_num_stages': [0, 1], 2026-02-21T09:02:41.1927568Z 'range_unroll_factors': [0, 1], 2026-02-21T09:02:41.1932411Z 'range_warp_specializes': [None, True]} 2026-02-21T09:02:41.1937344Z [141s] Fitting surrogate: 392 points, 392 targets 2026-02-21T09:02:41.6580651Z [142s] Generation 5 starting: 24 neighbors, 3 active search path(s) 2026-02-21T09:02:48.9620598Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26/26 0.7 configs/s 2026-02-21T09:02:50.4957879Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 26/26 17.5 configs/s 2026-02-21T09:02:52.7753190Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 442.0 2026-02-21T09:02:52.7754615Z configs/s 2026-02-21T09:02:52.9228353Z [153s] Generation 5 complete: 2026-02-21T09:02:52.9228724Z ok=28 2026-02-21T09:02:52.9229727Z min=0.0430 2026-02-21T09:02:52.9234155Z mid=0.0430 2026-02-21T09:02:52.9236125Z max=0.7054 2026-02-21T09:02:52.9236390Z best={'block_sizes': [1, 16384], 2026-02-21T09:02:52.9236704Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:02:52.9237044Z 'load_eviction_policies': ['', ''], 2026-02-21T09:02:52.9237278Z 'num_stages': 5, 2026-02-21T09:02:52.9237494Z 'num_warps': 2, 2026-02-21T09:02:52.9237687Z 'pid_type': 'flat', 2026-02-21T09:02:52.9237923Z 'range_flattens': [None, True], 2026-02-21T09:02:52.9238148Z 'range_multi_buffers': [None, None], 2026-02-21T09:02:52.9238414Z 'range_num_stages': [0, 1], 2026-02-21T09:02:52.9238660Z 'range_unroll_factors': [0, 0], 2026-02-21T09:02:52.9238881Z 'range_warp_specializes': [None, True]} 2026-02-21T09:02:52.9245479Z [153s] Fitting surrogate: 420 points, 420 targets 2026-02-21T09:02:53.0748127Z [153s] Autotuning complete in 153.8s after searching 402 configs. 
2026-02-21T09:02:53.0748937Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:02:53.0749981Z @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', ''], num_stages=5, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T09:02:53.0750804Z 2026-02-21T09:02:53.0751113Z [153s] Code of selected kernel: /tmp/torchinductor_root/nv/cnvxfp2hrfiggj3jsxrsbinlsyuo22orr5g7hz2573inp7sojikm.py 2026-02-21T09:02:54.1278763Z WARNING:tritonbench.utils.triton_op:Completed input ID 92: 2026-02-21T09:02:54.1283050Z (M, N) 2026-02-21T09:02:54.1287005Z ------------- 2026-02-21T09:02:54.1288385Z (4096, 12032) 2026-02-21T09:02:54.1288544Z 2026-02-21T09:02:54.1289099Z 95%|█████████▌| 19/20 [53:51<03:03, 183.55s/it]WARNING:tritonbench.utils.triton_op:Running input ID 97: 2026-02-21T09:02:54.1294220Z (M, N) 2026-02-21T09:02:54.1295626Z ------------- 2026-02-21T09:02:54.1295862Z (4096, 12672) 2026-02-21T09:02:54.1296272Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T09:02:55.3133179Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T09:02:56.6876721Z INFO:tritonbench.utils.triton_op:Took 2.29ms to get benchmark function for torch_compile_softmax 2026-02-21T09:03:00.3480996Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:03:00.3485057Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:03:00.3490073Z 'dtype': 'torch.float16', 2026-02-21T09:03:00.3493318Z 'shape': (4096, 12672), 2026-02-21T09:03:00.3494727Z 'stride': (12672, 1)},), 2026-02-21T09:03:00.3495026Z 'kwargs': {}} 2026-02-21T09:03:00.3503258Z INFO:tritonbench.utils.triton_op:Took 2.43ms to get benchmark function for helion_softmax_tritonbench 
2026-02-21T09:03:00.5296026Z [0s] Autotune random seed: 2134816249 2026-02-21T09:03:00.6725034Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:03:43.8873704Z [43s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[4, 1], range_warp_specializes=[None, None]) 2026-02-21T09:03:43.8891976Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.2 configs/s 2026-02-21T09:03:44.0846396Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T09:03:44.0848073Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:03:44.0848697Z %cst = arith.constant dense<0.000000e+00> : tensor<8x512xf16> 2026-02-21T09:03:44.0849307Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:03:44.0849574Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:03:44.0849805Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T09:03:44.0850096Z %cst_0 = arith.constant dense<12672> : tensor<8x1xi32> 2026-02-21T09:03:44.0850391Z %cst_1 = arith.constant dense<0.000000e+00> : tensor<8x512xf32> 2026-02-21T09:03:44.0850720Z %cst_2 = arith.constant dense<0xFC00> : tensor<8x512xf16> 2026-02-21T09:03:44.0851029Z %cst_3 = arith.constant dense<12672> : tensor<512xi32> 2026-02-21T09:03:44.0851313Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32> 2026-02-21T09:03:44.0851839Z %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32> 2026-02-21T09:03:44.0852097Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:03:44.0852381Z %c4096_i32 = arith.constant 4096 : i32 
2026-02-21T09:03:44.0852763Z %c12672_i32 = arith.constant 12672 : i32 2026-02-21T09:03:44.0853033Z %c12672_i64 = arith.constant 12672 : i64 2026-02-21T09:03:44.0853257Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:03:44.0853647Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c12672_i32], [%c12672_i64, %c1_i64] : , > 2026-02-21T09:03:44.0854043Z %1 = tt.get_program_id x : i32 2026-02-21T09:03:44.0854292Z scf.for %arg2 = %1 to %c512_i32 step %c9472_i32 : i32 { 2026-02-21T09:03:44.0854577Z %2 = arith.muli %arg2, %c8_i32 : i32 2026-02-21T09:03:44.0854843Z %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:03:44.0855157Z %4 = tt.splat %2 : i32 -> tensor<8xi32> 2026-02-21T09:03:44.0855395Z %5 = arith.addi %4, %3 : tensor<8xi32> 2026-02-21T09:03:44.0855651Z %c12288_i32 = arith.constant 12288 : i32 2026-02-21T09:03:44.0855919Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T09:03:44.0856328Z %6:2 = scf.for %arg3 = %c0_i32 to %c12288_i32 step %c2048_i32 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<8xf32>, tensor<8xf32>) : i32 { 2026-02-21T09:03:44.0856816Z %60 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T09:03:44.0857116Z %61 = tt.splat %arg3 : i32 -> tensor<512xi32> 2026-02-21T09:03:44.0857388Z %62 = arith.addi %61, %60 : tensor<512xi32> 2026-02-21T09:03:44.0857671Z %63 = arith.cmpi slt, %62, %cst_3 : tensor<512xi32> 2026-02-21T09:03:44.0858014Z %64 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T09:03:44.0858426Z %65 = tt.expand_dims %63 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T09:03:44.0858756Z %66 = tt.broadcast %65 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T09:03:44.0859096Z %67 = arith.select %66, %64, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T09:03:44.0859412Z %68 = arith.extf %67 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T09:03:44.0859715Z %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({ 
2026-02-21T09:03:44.0859989Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:03:44.0860218Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:03:44.0860484Z tt.reduce.return %174 : f32 2026-02-21T09:03:44.0860709Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T09:03:44.0860992Z %70 = arith.truncf %69 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:03:44.0861269Z %71 = arith.extf %70 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:03:44.0861605Z %72 = arith.cmpf ogt, %arg4, %71 : tensor<8xf32> 2026-02-21T09:03:44.0861900Z %73 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32> 2026-02-21T09:03:44.0862158Z %74 = arith.ori %72, %73 : tensor<8xi1> 2026-02-21T09:03:44.0862455Z %75 = arith.select %74, %arg4, %71 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:03:44.0862737Z %76 = arith.subf %arg4, %75 : tensor<8xf32> 2026-02-21T09:03:44.0863217Z %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:03:44.0863727Z %78 = arith.mulf %arg5, %77 : tensor<8xf32> 2026-02-21T09:03:44.0864045Z %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:03:44.0864372Z %80 = arith.extf %64 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T09:03:44.0864697Z %81 = tt.broadcast %79 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T09:03:44.0864973Z %82 = arith.subf %80, %81 : tensor<8x512xf32> 2026-02-21T09:03:44.0865405Z %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T09:03:44.0865849Z %84 = arith.select %66, %83, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T09:03:44.0866168Z %85 = "tt.reduce"(%84) <{axis = 1 : i32}> ({ 2026-02-21T09:03:44.0866496Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:03:44.0866720Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:03:44.0866977Z tt.reduce.return %174 : f32 2026-02-21T09:03:44.0867216Z }) : (tensor<8x512xf32>) -> 
tensor<8xf32> 2026-02-21T09:03:44.0867493Z %86 = arith.addf %78, %85 : tensor<8xf32> 2026-02-21T09:03:44.0867724Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:03:44.0867978Z %87 = arith.muli %c512_i32, %c1_i32 : i32 2026-02-21T09:03:44.0868233Z %88 = arith.addi %arg3, %87 : i32 2026-02-21T09:03:44.0868511Z %89 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T09:03:44.0868832Z %90 = tt.splat %88 : i32 -> tensor<512xi32> 2026-02-21T09:03:44.0869072Z %91 = arith.addi %90, %89 : tensor<512xi32> 2026-02-21T09:03:44.0869359Z %92 = arith.cmpi slt, %91, %cst_3 : tensor<512xi32> 2026-02-21T09:03:44.0869691Z %93 = tt.descriptor_load %0[%2, %88] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T09:03:44.0870100Z %94 = tt.expand_dims %92 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T09:03:44.0870464Z %95 = tt.broadcast %94 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T09:03:44.0870773Z %96 = arith.select %95, %93, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T09:03:44.0871116Z %97 = arith.extf %96 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T09:03:44.0871381Z %98 = "tt.reduce"(%97) <{axis = 1 : i32}> ({ 2026-02-21T09:03:44.0871687Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:03:44.0871906Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:03:44.0872169Z tt.reduce.return %174 : f32 2026-02-21T09:03:44.0872418Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T09:03:44.0872677Z %99 = arith.truncf %98 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:03:44.0872991Z %100 = arith.extf %99 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:03:44.0873272Z %101 = arith.cmpf ogt, %75, %100 : tensor<8xf32> 2026-02-21T09:03:44.0873562Z %102 = arith.cmpf une, %75, %75 : tensor<8xf32> 2026-02-21T09:03:44.0873828Z %103 = arith.ori %101, %102 : tensor<8xi1> 2026-02-21T09:03:44.0874137Z %104 = arith.select %103, %75, %100 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:03:44.0874466Z %105 = arith.subf %75, %104 : 
tensor<8xf32> 2026-02-21T09:03:44.0874892Z %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:03:44.0875342Z %107 = arith.mulf %86, %106 : tensor<8xf32> 2026-02-21T09:03:44.0875654Z %108 = tt.expand_dims %104 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:03:44.0876033Z %109 = arith.extf %93 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T09:03:44.0876355Z %110 = tt.broadcast %108 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T09:03:44.0876684Z %111 = arith.subf %109, %110 : tensor<8x512xf32> 2026-02-21T09:03:44.0877267Z %112 = tt.extern_elementwise %111 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T09:03:44.0877742Z %113 = arith.select %95, %112, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T09:03:44.0878085Z %114 = "tt.reduce"(%113) <{axis = 1 : i32}> ({ 2026-02-21T09:03:44.0878356Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:03:44.0878623Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:03:44.0878888Z tt.reduce.return %174 : f32 2026-02-21T09:03:44.0879125Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T09:03:44.0879409Z %115 = arith.addf %107, %114 : tensor<8xf32> 2026-02-21T09:03:44.0879655Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:03:44.0879929Z %116 = arith.muli %c512_i32, %c2_i32 : i32 2026-02-21T09:03:44.0880173Z %117 = arith.addi %arg3, %116 : i32 2026-02-21T09:03:44.0880542Z %118 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T09:03:44.0880851Z %119 = tt.splat %117 : i32 -> tensor<512xi32> 2026-02-21T09:03:44.0881135Z %120 = arith.addi %119, %118 : tensor<512xi32> 2026-02-21T09:03:44.0881428Z %121 = arith.cmpi slt, %120, %cst_3 : tensor<512xi32> 2026-02-21T09:03:44.0881819Z %122 = tt.descriptor_load %0[%2, %117] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T09:03:44.0882253Z %123 = tt.expand_dims %121 
{axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T09:03:44.0882607Z %124 = tt.broadcast %123 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T09:03:44.0882971Z %125 = arith.select %124, %122, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T09:03:44.0883340Z %126 = arith.extf %125 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T09:03:44.0883631Z %127 = "tt.reduce"(%126) <{axis = 1 : i32}> ({ 2026-02-21T09:03:44.0883908Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:03:44.0884143Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:03:44.0884412Z tt.reduce.return %174 : f32 2026-02-21T09:03:44.0884635Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T09:03:44.0884925Z %128 = arith.truncf %127 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:03:44.0885211Z %129 = arith.extf %128 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:03:44.0885507Z %130 = arith.cmpf ogt, %104, %129 : tensor<8xf32> 2026-02-21T09:03:44.0885798Z %131 = arith.cmpf une, %104, %104 : tensor<8xf32> 2026-02-21T09:03:44.0886043Z %132 = arith.ori %130, %131 : tensor<8xi1> 2026-02-21T09:03:44.0886345Z %133 = arith.select %132, %104, %129 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:03:44.0886625Z %134 = arith.subf %104, %133 : tensor<8xf32> 2026-02-21T09:03:44.0887048Z %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:03:44.0887474Z %136 = arith.mulf %115, %135 : tensor<8xf32> 2026-02-21T09:03:44.0887766Z %137 = tt.expand_dims %133 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:03:44.0888122Z %138 = arith.extf %122 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T09:03:44.0888424Z %139 = tt.broadcast %137 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T09:03:44.0888727Z %140 = arith.subf %138, %139 : tensor<8x512xf32> 2026-02-21T09:03:44.0889139Z %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : 
(tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T09:03:44.0889619Z %142 = arith.select %124, %141, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T09:03:44.0889945Z %143 = "tt.reduce"(%142) <{axis = 1 : i32}> ({ 2026-02-21T09:03:44.0890175Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:03:44.0890425Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:03:44.0890721Z tt.reduce.return %174 : f32 2026-02-21T09:03:44.0890972Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T09:03:44.0891210Z %144 = arith.addf %136, %143 : tensor<8xf32> 2026-02-21T09:03:44.0891474Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:03:44.0891764Z %145 = arith.muli %c512_i32, %c3_i32 : i32 2026-02-21T09:03:44.0891997Z %146 = arith.addi %arg3, %145 : i32 2026-02-21T09:03:44.0892296Z %147 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T09:03:44.0892590Z %148 = tt.splat %146 : i32 -> tensor<512xi32> 2026-02-21T09:03:44.0892862Z %149 = arith.addi %148, %147 : tensor<512xi32> 2026-02-21T09:03:44.0893122Z %150 = arith.cmpi slt, %149, %cst_3 : tensor<512xi32> 2026-02-21T09:03:44.0893487Z %151 = tt.descriptor_load %0[%2, %146] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T09:03:44.0893958Z %152 = tt.expand_dims %150 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T09:03:44.0894298Z %153 = tt.broadcast %152 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T09:03:44.0894650Z %154 = arith.select %153, %151, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T09:03:44.0894973Z %155 = arith.extf %154 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T09:03:44.0895271Z %156 = "tt.reduce"(%155) <{axis = 1 : i32}> ({ 2026-02-21T09:03:44.0895502Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:03:44.0895753Z %174 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:03:44.0896013Z tt.reduce.return %174 : f32 2026-02-21T09:03:44.0896236Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T09:03:44.0896524Z %157 = arith.truncf %156 : tensor<8xf32> to 
tensor<8xf16> 2026-02-21T09:03:44.0896805Z %158 = arith.extf %157 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:03:44.0897102Z %159 = arith.cmpf ogt, %133, %158 : tensor<8xf32> 2026-02-21T09:03:44.0897364Z %160 = arith.cmpf une, %133, %133 : tensor<8xf32> 2026-02-21T09:03:44.0897637Z %161 = arith.ori %159, %160 : tensor<8xi1> 2026-02-21T09:03:44.0897940Z %162 = arith.select %161, %133, %158 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:03:44.0898218Z %163 = arith.subf %133, %162 : tensor<8xf32> 2026-02-21T09:03:44.0898639Z %164 = tt.extern_elementwise %163 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:03:44.0899036Z %165 = arith.mulf %144, %164 : tensor<8xf32> 2026-02-21T09:03:44.0899353Z %166 = tt.expand_dims %162 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:03:44.0899681Z %167 = arith.extf %151 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T09:03:44.0900012Z %168 = tt.broadcast %166 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T09:03:44.0900316Z %169 = arith.subf %167, %168 : tensor<8x512xf32> 2026-02-21T09:03:44.0900731Z %170 = tt.extern_elementwise %169 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T09:03:44.0901234Z %171 = arith.select %153, %170, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T09:03:44.0901529Z %172 = "tt.reduce"(%171) <{axis = 1 : i32}> ({ 2026-02-21T09:03:44.0901818Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:03:44.0902067Z %174 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:03:44.0902292Z tt.reduce.return %174 : f32 2026-02-21T09:03:44.0902543Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T09:03:44.0902776Z %173 = arith.addf %165, %172 : tensor<8xf32> 2026-02-21T09:03:44.0903065Z scf.yield %162, %173 : tensor<8xf32>, tensor<8xf32> 2026-02-21T09:03:44.0903354Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:03:44.0903690Z %7 = 
tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T09:03:44.0904079Z %8 = tt.splat %c12288_i32 : i32 -> tensor<512xi32> 2026-02-21T09:03:44.0904323Z %9 = arith.addi %8, %7 : tensor<512xi32> 2026-02-21T09:03:44.0904598Z %10 = arith.cmpi slt, %9, %cst_3 : tensor<512xi32> 2026-02-21T09:03:44.0904951Z %11 = tt.descriptor_load %0[%2, %c12288_i32] : !tt.tensordesc> -> tensor<8x512xf16> 2026-02-21T09:03:44.0905371Z %12 = tt.expand_dims %10 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T09:03:44.0905698Z %13 = tt.broadcast %12 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T09:03:44.0906041Z %14 = arith.select %13, %11, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16> 2026-02-21T09:03:44.0906380Z %15 = arith.extf %14 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T09:03:44.0906645Z %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({ 2026-02-21T09:03:44.0906906Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:03:44.0907129Z %60 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:03:44.0907443Z tt.reduce.return %60 : f32 2026-02-21T09:03:44.0907672Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T09:03:44.0907955Z %17 = arith.truncf %16 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:03:44.0908230Z %18 = arith.extf %17 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:03:44.0908520Z %19 = arith.cmpf ogt, %6#0, %18 : tensor<8xf32> 2026-02-21T09:03:44.0908803Z %20 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32> 2026-02-21T09:03:44.0909043Z %21 = arith.ori %19, %20 : tensor<8xi1> 2026-02-21T09:03:44.0909331Z %22 = arith.select %21, %6#0, %18 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:03:44.0909599Z %23 = arith.subf %6#0, %22 : tensor<8xf32> 2026-02-21T09:03:44.0910016Z %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:03:44.0910434Z %25 = arith.mulf %6#1, %24 : tensor<8xf32> 2026-02-21T09:03:44.0910725Z %26 = tt.expand_dims %22 {axis = 1 : i32} : 
tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:03:44.0911074Z %27 = arith.extf %11 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T09:03:44.0911369Z %28 = tt.broadcast %26 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T09:03:44.0911692Z %29 = arith.subf %27, %28 : tensor<8x512xf32> 2026-02-21T09:03:44.0912091Z %30 = tt.extern_elementwise %29 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T09:03:44.0912564Z %31 = arith.select %13, %30, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T09:03:44.0912872Z %32 = "tt.reduce"(%31) <{axis = 1 : i32}> ({ 2026-02-21T09:03:44.0913101Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:03:44.0913348Z %60 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:03:44.0913577Z tt.reduce.return %60 : f32 2026-02-21T09:03:44.0913824Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T09:03:44.0914057Z %33 = arith.addf %25, %32 : tensor<8xf32> 2026-02-21T09:03:44.0914323Z %c12288_i32_6 = arith.constant 12288 : i32 2026-02-21T09:03:44.0914589Z %c2048_i32_7 = arith.constant 2048 : i32 2026-02-21T09:03:44.0914866Z scf.for %arg3 = %c0_i32 to %c12288_i32_6 step %c2048_i32_7 : i32 { 2026-02-21T09:03:44.0915219Z %60 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T09:03:44.0915513Z %61 = tt.splat %arg3 : i32 -> tensor<512xi32> 2026-02-21T09:03:44.0915784Z %62 = arith.addi %61, %60 : tensor<512xi32> 2026-02-21T09:03:44.0916042Z %63 = arith.cmpi slt, %62, %cst_3 : tensor<512xi32> 2026-02-21T09:03:44.0916404Z %64 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:44.0916728Z %65 = arith.muli %64, %cst_0 : tensor<8x1xi32> 2026-02-21T09:03:44.0917053Z %66 = tt.expand_dims %62 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T09:03:44.0917388Z %67 = tt.broadcast %65 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T09:03:44.0917821Z %68 = tt.broadcast %66 : tensor<1x512xi32> -> 
tensor<8x512xi32> 2026-02-21T09:03:44.0918104Z %69 = arith.addi %67, %68 : tensor<8x512xi32> 2026-02-21T09:03:44.0918429Z %70 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0918796Z %71 = tt.addptr %70, %69 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T09:03:44.0919149Z %72 = tt.expand_dims %63 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T09:03:44.0919529Z %73 = tt.broadcast %72 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T09:03:44.0919876Z %74 = tt.load %71, %73, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0920282Z %75 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:03:44.0920619Z %76 = arith.extf %74 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T09:03:44.0921036Z %77 = tt.broadcast %75 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T09:03:44.0921355Z %78 = arith.subf %76, %77 : tensor<8x512xf32> 2026-02-21T09:03:44.0921846Z %79 = tt.extern_elementwise %78 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T09:03:44.0922353Z %80 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:03:44.0922689Z %81 = tt.broadcast %80 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T09:03:44.0923008Z %82 = arith.divf %79, %81 : tensor<8x512xf32> 2026-02-21T09:03:44.0923322Z %83 = arith.truncf %82 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T09:03:44.0923645Z %84 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0924002Z %85 = tt.addptr %84, %69 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T09:03:44.0924321Z tt.store %85, %83, %73 : tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0924620Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:03:44.0924864Z %86 = arith.muli %c512_i32, %c1_i32 : i32 2026-02-21T09:03:44.0925139Z %87 = arith.addi %arg3, %86 : i32 2026-02-21T09:03:44.0925447Z %88 = tt.make_range {end = 512 : 
i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T09:03:44.0925753Z %89 = tt.splat %87 : i32 -> tensor<512xi32> 2026-02-21T09:03:44.0926021Z %90 = arith.addi %89, %88 : tensor<512xi32> 2026-02-21T09:03:44.0926275Z %91 = arith.cmpi slt, %90, %cst_3 : tensor<512xi32> 2026-02-21T09:03:44.0926602Z %92 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:44.0926901Z %93 = arith.muli %92, %cst_0 : tensor<8x1xi32> 2026-02-21T09:03:44.0927221Z %94 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T09:03:44.0927575Z %95 = tt.broadcast %93 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T09:03:44.0927872Z %96 = tt.broadcast %94 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T09:03:44.0928169Z %97 = arith.addi %95, %96 : tensor<8x512xi32> 2026-02-21T09:03:44.0928441Z %98 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0928771Z %99 = tt.addptr %98, %97 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T09:03:44.0929103Z %100 = tt.expand_dims %91 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T09:03:44.0929465Z %101 = tt.broadcast %100 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T09:03:44.0929836Z %102 = tt.load %99, %101, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0930199Z %103 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:03:44.0930553Z %104 = arith.extf %102 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T09:03:44.0930855Z %105 = tt.broadcast %103 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T09:03:44.0931223Z %106 = arith.subf %104, %105 : tensor<8x512xf32> 2026-02-21T09:03:44.0931714Z %107 = tt.extern_elementwise %106 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T09:03:44.0932167Z %108 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:03:44.0932519Z %109 = 
tt.broadcast %108 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T09:03:44.0932810Z %110 = arith.divf %107, %109 : tensor<8x512xf32> 2026-02-21T09:03:44.0933115Z %111 = arith.truncf %110 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T09:03:44.0933423Z %112 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0933772Z %113 = tt.addptr %112, %97 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T09:03:44.0934107Z tt.store %113, %111, %101 : tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0934412Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:03:44.0934671Z %114 = arith.muli %c512_i32, %c2_i32 : i32 2026-02-21T09:03:44.0934907Z %115 = arith.addi %arg3, %114 : i32 2026-02-21T09:03:44.0935203Z %116 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T09:03:44.0935491Z %117 = tt.splat %115 : i32 -> tensor<512xi32> 2026-02-21T09:03:44.0935759Z %118 = arith.addi %117, %116 : tensor<512xi32> 2026-02-21T09:03:44.0936045Z %119 = arith.cmpi slt, %118, %cst_3 : tensor<512xi32> 2026-02-21T09:03:44.0936353Z %120 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:44.0936688Z %121 = arith.muli %120, %cst_0 : tensor<8x1xi32> 2026-02-21T09:03:44.0936990Z %122 = tt.expand_dims %118 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T09:03:44.0937355Z %123 = tt.broadcast %121 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T09:03:44.0937682Z %124 = tt.broadcast %122 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T09:03:44.0937966Z %125 = arith.addi %123, %124 : tensor<8x512xi32> 2026-02-21T09:03:44.0938277Z %126 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0938601Z %127 = tt.addptr %126, %125 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T09:03:44.0938971Z %128 = tt.expand_dims %119 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T09:03:44.0939306Z %129 = tt.broadcast %128 : tensor<1x512xi1> -> tensor<8x512xi1> 
2026-02-21T09:03:44.0939683Z %130 = tt.load %127, %129, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0940073Z %131 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:03:44.0940396Z %132 = arith.extf %130 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T09:03:44.0940726Z %133 = tt.broadcast %131 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T09:03:44.0941009Z %134 = arith.subf %132, %133 : tensor<8x512xf32> 2026-02-21T09:03:44.0941445Z %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T09:03:44.0941962Z %136 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:03:44.0942285Z %137 = tt.broadcast %136 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T09:03:44.0942588Z %138 = arith.divf %135, %137 : tensor<8x512xf32> 2026-02-21T09:03:44.0942865Z %139 = arith.truncf %138 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T09:03:44.0943200Z %140 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0943518Z %141 = tt.addptr %140, %125 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T09:03:44.0943844Z tt.store %141, %139, %129 : tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0944129Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:03:44.0944444Z %142 = arith.muli %c512_i32, %c3_i32 : i32 2026-02-21T09:03:44.0944712Z %143 = arith.addi %arg3, %142 : i32 2026-02-21T09:03:44.0944987Z %144 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T09:03:44.0945308Z %145 = tt.splat %143 : i32 -> tensor<512xi32> 2026-02-21T09:03:44.0945562Z %146 = arith.addi %145, %144 : tensor<512xi32> 2026-02-21T09:03:44.0945848Z %147 = arith.cmpi slt, %146, %cst_3 : tensor<512xi32> 2026-02-21T09:03:44.0946177Z %148 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:44.0946478Z %149 = arith.muli %148, %cst_0 : 
tensor<8x1xi32> 2026-02-21T09:03:44.0946809Z %150 = tt.expand_dims %146 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T09:03:44.0947144Z %151 = tt.broadcast %149 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T09:03:44.0947542Z %152 = tt.broadcast %150 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T09:03:44.0947825Z %153 = arith.addi %151, %152 : tensor<8x512xi32> 2026-02-21T09:03:44.0948132Z %154 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0948475Z %155 = tt.addptr %154, %153 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T09:03:44.0948828Z %156 = tt.expand_dims %147 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T09:03:44.0949186Z %157 = tt.broadcast %156 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T09:03:44.0949534Z %158 = tt.load %155, %157, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0949926Z %159 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:03:44.0950269Z %160 = arith.extf %158 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T09:03:44.0950566Z %161 = tt.broadcast %159 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T09:03:44.0950872Z %162 = arith.subf %160, %161 : tensor<8x512xf32> 2026-02-21T09:03:44.0951282Z %163 = tt.extern_elementwise %162 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T09:03:44.0951776Z %164 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:03:44.0952121Z %165 = tt.broadcast %164 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T09:03:44.0952401Z %166 = arith.divf %163, %165 : tensor<8x512xf32> 2026-02-21T09:03:44.0952705Z %167 = arith.truncf %166 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T09:03:44.0953014Z %168 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0953360Z %169 = tt.addptr %168, %153 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 
2026-02-21T09:03:44.0953670Z tt.store %169, %167, %157 : tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0953990Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T09:03:44.0954330Z %34 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T09:03:44.0954631Z %35 = tt.splat %c12288_i32_6 : i32 -> tensor<512xi32> 2026-02-21T09:03:44.0954915Z %36 = arith.addi %35, %34 : tensor<512xi32> 2026-02-21T09:03:44.0955168Z %37 = arith.cmpi slt, %36, %cst_3 : tensor<512xi32> 2026-02-21T09:03:44.0955490Z %38 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:03:44.0955792Z %39 = arith.muli %38, %cst_0 : tensor<8x1xi32> 2026-02-21T09:03:44.0956113Z %40 = tt.expand_dims %36 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T09:03:44.0956464Z %41 = tt.broadcast %39 : tensor<8x1xi32> -> tensor<8x512xi32> 2026-02-21T09:03:44.0956758Z %42 = tt.broadcast %40 : tensor<1x512xi32> -> tensor<8x512xi32> 2026-02-21T09:03:44.0957061Z %43 = arith.addi %41, %42 : tensor<8x512xi32> 2026-02-21T09:03:44.0957335Z %44 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0957733Z %45 = tt.addptr %44, %43 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T09:03:44.0958063Z %46 = tt.expand_dims %37 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1> 2026-02-21T09:03:44.0958415Z %47 = tt.broadcast %46 : tensor<1x512xi1> -> tensor<8x512xi1> 2026-02-21T09:03:44.0958760Z %48 = tt.load %45, %47, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0959112Z %49 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:03:44.0959455Z %50 = arith.extf %48 : tensor<8x512xf16> to tensor<8x512xf32> 2026-02-21T09:03:44.0959740Z %51 = tt.broadcast %49 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T09:03:44.0960034Z %52 = arith.subf %50, %51 : tensor<8x512xf32> 2026-02-21T09:03:44.0960533Z %53 = tt.extern_elementwise %52 {libname = "", libpath = "", 
pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T09:03:44.0961003Z %54 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:03:44.0961357Z %55 = tt.broadcast %54 : tensor<8x1xf32> -> tensor<8x512xf32> 2026-02-21T09:03:44.0961668Z %56 = arith.divf %53, %55 : tensor<8x512xf32> 2026-02-21T09:03:44.0961973Z %57 = arith.truncf %56 : tensor<8x512xf32> to tensor<8x512xf16> 2026-02-21T09:03:44.0962286Z %58 = tt.splat %arg1 : !tt.ptr -> tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0962647Z %59 = tt.addptr %58, %43 : tensor<8x512x!tt.ptr>, tensor<8x512xi32> 2026-02-21T09:03:44.0962983Z tt.store %59, %57, %47 : tensor<8x512x!tt.ptr> 2026-02-21T09:03:44.0963319Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T09:03:44.0963656Z tt.return 2026-02-21T09:03:44.0963836Z } 2026-02-21T09:03:44.0964035Z } 2026-02-21T09:03:44.0964129Z 2026-02-21T09:03:44.0964203Z {-# 2026-02-21T09:03:44.0964413Z external_resources: { 2026-02-21T09:03:44.0964638Z mlir_reproducer: { 2026-02-21T09:03:44.0969187Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, 
tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:03:44.0973749Z disable_threading: false, 2026-02-21T09:03:44.0973986Z verify_each: true 2026-02-21T09:03:44.0974178Z } 2026-02-21T09:03:44.0974377Z } 2026-02-21T09:03:44.0974587Z #-} 2026-02-21T09:03:44.0975080Z /tmp/torchinductor_root/cb/ccb5nm5euemrmlemnx3l2mj5qkox75vbvwlyuk34mtkf6mmguhb5.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:03:44.0976329Z /tmp/torchinductor_root/cb/ccb5nm5euemrmlemnx3l2mj5qkox75vbvwlyuk34mtkf6mmguhb5.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:03:44.0977365Z [43s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:03:44.0978591Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 512], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'last'], maxnreg=32, num_sm_multiplier=64, num_stages=7, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[4, 4], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:03:44.0979647Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:03:44.0979949Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:03:52.3143519Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 11.9 configs/s 2026-02-21T09:03:52.3158020Z [51s] Adaptive compile timeout: 30s (90% percentile=13.8s, bounds=[30.0s, 30s]) 2026-02-21T09:03:53.2758177Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1023.1 configs/s 2026-02-21T09:03:53.3444072Z [52s] Initial random population of 100, 5 starting points: 2026-02-21T09:03:53.3448387Z error=7 2026-02-21T09:03:53.3452093Z timeout=1 2026-02-21T09:03:53.3453483Z ok=92 2026-02-21T09:03:53.3453751Z min=0.0674 2026-02-21T09:03:53.3453943Z mid=1.1777 2026-02-21T09:03:53.3454169Z max=62.4026 2026-02-21T09:03:53.3454413Z best={'block_sizes': [1, 16384], 2026-02-21T09:03:53.3454740Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:03:53.3455033Z 'load_eviction_policies': ['last', ''], 2026-02-21T09:03:53.3455295Z 'num_sm_multiplier': 8, 2026-02-21T09:03:53.3455494Z 'num_stages': 3, 2026-02-21T09:03:53.3455715Z 'num_warps': 1, 2026-02-21T09:03:53.3455963Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:03:53.3456198Z 'range_flattens': [False, None], 2026-02-21T09:03:53.3456445Z 'range_multi_buffers': [True, True], 2026-02-21T09:03:53.3456672Z 'range_num_stages': [1, 2], 2026-02-21T09:03:53.3456921Z 'range_unroll_factors': [0, 1], 2026-02-21T09:03:53.3457142Z 
'range_warp_specializes': [True, None]} 2026-02-21T09:03:53.3457814Z [52s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:03:54.4345269Z [53s] Generation 1 starting: 79 neighbors, 5 active search path(s) 2026-02-21T09:04:08.8620286Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 3.8 configs/s 2026-02-21T09:04:13.8129963Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 17.1 configs/s 2026-02-21T09:04:15.0791702Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 787.0 2026-02-21T09:04:15.0795156Z configs/s 2026-02-21T09:04:15.1600818Z [74s] Generation 1 complete: 2026-02-21T09:04:15.1605218Z ok=85 2026-02-21T09:04:15.1609104Z min=0.0471 2026-02-21T09:04:15.1613607Z mid=0.0820 2026-02-21T09:04:15.1615092Z max=0.4300 2026-02-21T09:04:15.1615349Z best={'block_sizes': [1, 16384], 2026-02-21T09:04:15.1615669Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:04:15.1615987Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:04:15.1616245Z 'num_sm_multiplier': 8, 2026-02-21T09:04:15.1616489Z 'num_stages': 3, 2026-02-21T09:04:15.1616675Z 'num_warps': 1, 2026-02-21T09:04:15.1616901Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:04:15.1617167Z 'range_flattens': [False, None], 2026-02-21T09:04:15.1617763Z 'range_multi_buffers': [True, True], 2026-02-21T09:04:15.1618004Z 'range_num_stages': [1, 2], 2026-02-21T09:04:15.1618247Z 'range_unroll_factors': [1, 1], 2026-02-21T09:04:15.1618471Z 'range_warp_specializes': [True, None]} 2026-02-21T09:04:15.1618869Z [74s] Fitting surrogate: 185 points, 185 targets 2026-02-21T09:04:16.0374417Z [75s] Generation 2 starting: 62 neighbors, 5 active search path(s) 2026-02-21T09:04:52.0737589Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64/64 0.4 configs/s 2026-02-21T09:04:55.8862805Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 64/64 17.0 configs/s 2026-02-21T09:04:58.5915293Z Generation 2: verifying top configs 100% 
━━━━━━━━━━━━━━ 1000/1000 371.9 2026-02-21T09:04:58.5918935Z configs/s 2026-02-21T09:04:58.7452218Z [118s] Generation 2 complete: 2026-02-21T09:04:58.7454093Z error=1 2026-02-21T09:04:58.7454298Z ok=66 2026-02-21T09:04:58.7454908Z min=0.0491 2026-02-21T09:04:58.7455096Z mid=0.0757 2026-02-21T09:04:58.7455291Z max=2.0634 2026-02-21T09:04:58.7455567Z best={'block_sizes': [1, 16384], 2026-02-21T09:04:58.7459831Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:04:58.7464142Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:04:58.7466259Z 'num_sm_multiplier': 8, 2026-02-21T09:04:58.7466517Z 'num_stages': 3, 2026-02-21T09:04:58.7466739Z 'num_warps': 1, 2026-02-21T09:04:58.7466931Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:04:58.7467209Z 'range_flattens': [False, None], 2026-02-21T09:04:58.7467440Z 'range_multi_buffers': [True, True], 2026-02-21T09:04:58.7467697Z 'range_num_stages': [1, 2], 2026-02-21T09:04:58.7467909Z 'range_unroll_factors': [1, 1], 2026-02-21T09:04:58.7468168Z 'range_warp_specializes': [True, None]} 2026-02-21T09:04:58.7468458Z [118s] Fitting surrogate: 252 points, 252 targets 2026-02-21T09:04:59.6050070Z [118s] Generation 3 starting: 62 neighbors, 5 active search path(s) 2026-02-21T09:05:11.4449171Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66/66 11.6 configs/s 2026-02-21T09:05:15.2157340Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 66/66 17.7 configs/s 2026-02-21T09:05:16.5088310Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 772.9 2026-02-21T09:05:16.5088707Z configs/s 2026-02-21T09:05:16.5950706Z [135s] Generation 3 complete: 2026-02-21T09:05:16.5953977Z error=2 2026-02-21T09:05:16.5958372Z ok=65 2026-02-21T09:05:16.5962250Z min=0.0472 2026-02-21T09:05:16.5966789Z mid=0.0799 2026-02-21T09:05:16.5971219Z max=0.2785 2026-02-21T09:05:16.5976292Z best={'block_sizes': [1, 16384], 2026-02-21T09:05:16.5980956Z 'indexing': ['pointer', 'tensor_descriptor', 
'pointer'], 2026-02-21T09:05:16.5981950Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:05:16.5982246Z 'num_sm_multiplier': 8, 2026-02-21T09:05:16.5982485Z 'num_stages': 3, 2026-02-21T09:05:16.5982710Z 'num_warps': 1, 2026-02-21T09:05:16.5982940Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:05:16.5983174Z 'range_flattens': [False, None], 2026-02-21T09:05:16.5983435Z 'range_multi_buffers': [True, True], 2026-02-21T09:05:16.5983657Z 'range_num_stages': [1, 2], 2026-02-21T09:05:16.5983891Z 'range_unroll_factors': [1, 1], 2026-02-21T09:05:16.5984107Z 'range_warp_specializes': [True, None]} 2026-02-21T09:05:16.5984384Z [135s] Fitting surrogate: 319 points, 319 targets 2026-02-21T09:05:17.1823918Z [136s] Generation 4 starting: 38 neighbors, 3 active search path(s) 2026-02-21T09:05:24.1080261Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41/41 7.7 configs/s 2026-02-21T09:05:26.5232952Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 17.3 configs/s 2026-02-21T09:05:26.5240822Z [145s] Generation 4 complete: 2026-02-21T09:05:26.5245515Z ok=42 2026-02-21T09:05:26.5250178Z min=0.0472 2026-02-21T09:05:26.5251858Z mid=0.0841 2026-02-21T09:05:26.5252489Z max=0.4608 2026-02-21T09:05:26.5252722Z best={'block_sizes': [1, 16384], 2026-02-21T09:05:26.5253014Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:05:26.5253347Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:05:26.5253610Z 'num_sm_multiplier': 8, 2026-02-21T09:05:26.5253870Z 'num_stages': 3, 2026-02-21T09:05:26.5254081Z 'num_warps': 1, 2026-02-21T09:05:26.5254347Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:05:26.5254595Z 'range_flattens': [False, None], 2026-02-21T09:05:26.5254861Z 'range_multi_buffers': [True, True], 2026-02-21T09:05:26.5255125Z 'range_num_stages': [1, 2], 2026-02-21T09:05:26.5255348Z 'range_unroll_factors': [1, 1], 2026-02-21T09:05:26.5255609Z 'range_warp_specializes': [True, None]} 2026-02-21T09:05:26.5256788Z 
[145s] Fitting surrogate: 361 points, 361 targets 2026-02-21T09:05:27.1254513Z [146s] Generation 5 starting: 38 neighbors, 3 active search path(s) 2026-02-21T09:05:38.6421256Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41/41 2.2 configs/s 2026-02-21T09:05:41.0785436Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 17.1 configs/s 2026-02-21T09:05:41.6746949Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1636.1 2026-02-21T09:05:41.6751245Z configs/s 2026-02-21T09:05:41.7259838Z [161s] Generation 5 complete: 2026-02-21T09:05:41.7264271Z ok=42 2026-02-21T09:05:41.7269187Z min=0.0471 2026-02-21T09:05:41.7273622Z mid=0.0820 2026-02-21T09:05:41.7282014Z max=0.4690 2026-02-21T09:05:41.7282190Z best={'block_sizes': [1, 16384], 2026-02-21T09:05:41.7282440Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:05:41.7282684Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:05:41.7282892Z 'num_sm_multiplier': 8, 2026-02-21T09:05:41.7283064Z 'num_stages': 3, 2026-02-21T09:05:41.7283207Z 'num_warps': 1, 2026-02-21T09:05:41.7283402Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:05:41.7283614Z 'range_flattens': [False, None], 2026-02-21T09:05:41.7283807Z 'range_multi_buffers': [True, True], 2026-02-21T09:05:41.7283989Z 'range_num_stages': [1, 2], 2026-02-21T09:05:41.7284163Z 'range_unroll_factors': [1, 1], 2026-02-21T09:05:41.7284344Z 'range_warp_specializes': [True, None]} 2026-02-21T09:05:41.7284559Z [161s] Fitting surrogate: 403 points, 403 targets 2026-02-21T09:05:42.2910183Z [161s] Generation 6 starting: 35 neighbors, 3 active search path(s) 2026-02-21T09:05:49.5501646Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 3.3 configs/s 2026-02-21T09:05:51.7390117Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 37/37 17.3 configs/s 2026-02-21T09:05:53.9808965Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 740.1 2026-02-21T09:05:53.9812681Z 
configs/s 2026-02-21T09:05:54.0722286Z [173s] Generation 6 complete: 2026-02-21T09:05:54.0724226Z ok=39 2026-02-21T09:05:54.0724741Z min=0.0471 2026-02-21T09:05:54.0724880Z mid=0.0779 2026-02-21T09:05:54.0725002Z max=0.1885 2026-02-21T09:05:54.0725148Z best={'block_sizes': [1, 16384], 2026-02-21T09:05:54.0725389Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:05:54.0725630Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:05:54.0725832Z 'num_sm_multiplier': 8, 2026-02-21T09:05:54.0725991Z 'num_stages': 3, 2026-02-21T09:05:54.0726140Z 'num_warps': 1, 2026-02-21T09:05:54.0726297Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:05:54.0726502Z 'range_flattens': [False, None], 2026-02-21T09:05:54.0726685Z 'range_multi_buffers': [True, True], 2026-02-21T09:05:54.0726881Z 'range_num_stages': [1, 2], 2026-02-21T09:05:54.0727045Z 'range_unroll_factors': [1, 1], 2026-02-21T09:05:54.0727234Z 'range_warp_specializes': [True, None]} 2026-02-21T09:05:54.0740286Z [173s] Fitting surrogate: 442 points, 442 targets 2026-02-21T09:05:54.5853106Z [173s] Generation 7 starting: 31 neighbors, 3 active search path(s) 2026-02-21T09:06:03.7412341Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 1.8 configs/s 2026-02-21T09:06:05.7522185Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 34/34 17.3 configs/s 2026-02-21T09:06:06.6093739Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1154.8 2026-02-21T09:06:06.6097807Z configs/s 2026-02-21T09:06:06.6747921Z [186s] Generation 7 complete: 2026-02-21T09:06:06.6749967Z ok=35 2026-02-21T09:06:06.6750170Z min=0.0471 2026-02-21T09:06:06.6755765Z mid=0.0799 2026-02-21T09:06:06.6760416Z max=0.8529 2026-02-21T09:06:06.6762537Z best={'block_sizes': [1, 16384], 2026-02-21T09:06:06.6762831Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:06:06.6763091Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:06:06.6763307Z 'num_sm_multiplier': 8, 
2026-02-21T09:06:06.6763515Z 'num_stages': 3, 2026-02-21T09:06:06.6763694Z 'num_warps': 1, 2026-02-21T09:06:06.6763864Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:06:06.6764062Z 'range_flattens': [False, None], 2026-02-21T09:06:06.6764253Z 'range_multi_buffers': [True, True], 2026-02-21T09:06:06.6764438Z 'range_num_stages': [1, 2], 2026-02-21T09:06:06.6764619Z 'range_unroll_factors': [1, 1], 2026-02-21T09:06:06.6764803Z 'range_warp_specializes': [True, None]} 2026-02-21T09:06:06.6765031Z [186s] Fitting surrogate: 477 points, 477 targets 2026-02-21T09:06:07.0738038Z [186s] Generation 8 starting: 24 neighbors, 2 active search path(s) 2026-02-21T09:06:12.2128665Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26/26 7.7 configs/s 2026-02-21T09:06:13.7539117Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 26/26 17.4 configs/s 2026-02-21T09:06:13.9319086Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 5051.4 2026-02-21T09:06:13.9322785Z configs/s 2026-02-21T09:06:13.9646547Z [193s] Generation 8 complete: 2026-02-21T09:06:13.9651006Z ok=27 2026-02-21T09:06:13.9652492Z min=0.0471 2026-02-21T09:06:13.9652662Z mid=0.0830 2026-02-21T09:06:13.9652792Z max=0.4035 2026-02-21T09:06:13.9652951Z best={'block_sizes': [1, 16384], 2026-02-21T09:06:13.9653190Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:06:13.9653445Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:06:13.9653642Z 'num_sm_multiplier': 8, 2026-02-21T09:06:13.9653812Z 'num_stages': 3, 2026-02-21T09:06:13.9653952Z 'num_warps': 1, 2026-02-21T09:06:13.9654118Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:06:13.9654310Z 'range_flattens': [False, None], 2026-02-21T09:06:13.9654500Z 'range_multi_buffers': [True, True], 2026-02-21T09:06:13.9654695Z 'range_num_stages': [1, 2], 2026-02-21T09:06:13.9654861Z 'range_unroll_factors': [1, 1], 2026-02-21T09:06:13.9655049Z 'range_warp_specializes': [True, None]} 2026-02-21T09:06:13.9664189Z 
[193s] Fitting surrogate: 504 points, 504 targets 2026-02-21T09:06:14.4964959Z [193s] Generation 9 starting: 24 neighbors, 2 active search path(s) 2026-02-21T09:06:19.5566492Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26/26 4.7 configs/s 2026-02-21T09:06:21.0958690Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 26/26 17.4 configs/s 2026-02-21T09:06:21.4956635Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2389.9 2026-02-21T09:06:21.4960313Z configs/s 2026-02-21T09:06:21.5402135Z [200s] Generation 9 complete: 2026-02-21T09:06:21.5406564Z ok=27 2026-02-21T09:06:21.5410421Z min=0.0472 2026-02-21T09:06:21.5412575Z mid=0.0819 2026-02-21T09:06:21.5412794Z max=0.2621 2026-02-21T09:06:21.5417511Z best={'block_sizes': [1, 16384], 2026-02-21T09:06:21.5421445Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:06:21.5426389Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:06:21.5430340Z 'num_sm_multiplier': 8, 2026-02-21T09:06:21.5434796Z 'num_stages': 3, 2026-02-21T09:06:21.5436154Z 'num_warps': 1, 2026-02-21T09:06:21.5436379Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:06:21.5436607Z 'range_flattens': [False, None], 2026-02-21T09:06:21.5436798Z 'range_multi_buffers': [True, True], 2026-02-21T09:06:21.5437001Z 'range_num_stages': [1, 2], 2026-02-21T09:06:21.5437182Z 'range_unroll_factors': [1, 1], 2026-02-21T09:06:21.5437379Z 'range_warp_specializes': [True, None]} 2026-02-21T09:06:21.5437682Z [200s] Fitting surrogate: 531 points, 531 targets 2026-02-21T09:06:22.0761158Z [201s] Generation 10 starting: 25 neighbors, 2 active search path(s) 2026-02-21T09:06:29.0514388Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 2.6 configs/s 2026-02-21T09:06:30.6542605Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 27/27 17.3 configs/s 2026-02-21T09:06:30.9432560Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3247.1 2026-02-21T09:06:30.9434384Z 
configs/s 2026-02-21T09:06:30.9819669Z [210s] Generation 10 complete: 2026-02-21T09:06:30.9825322Z ok=28 2026-02-21T09:06:30.9829345Z min=0.0471 2026-02-21T09:06:30.9830788Z mid=0.0820 2026-02-21T09:06:30.9830954Z max=0.2642 2026-02-21T09:06:30.9831117Z best={'block_sizes': [1, 16384], 2026-02-21T09:06:30.9831366Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:06:30.9831686Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:06:30.9831897Z 'num_sm_multiplier': 8, 2026-02-21T09:06:30.9832062Z 'num_stages': 3, 2026-02-21T09:06:30.9832218Z 'num_warps': 1, 2026-02-21T09:06:30.9832379Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:06:30.9832587Z 'range_flattens': [False, None], 2026-02-21T09:06:30.9832775Z 'range_multi_buffers': [True, True], 2026-02-21T09:06:30.9832972Z 'range_num_stages': [1, 2], 2026-02-21T09:06:30.9833165Z 'range_unroll_factors': [1, 1], 2026-02-21T09:06:30.9833367Z 'range_warp_specializes': [True, None]} 2026-02-21T09:06:30.9840514Z [210s] Fitting surrogate: 559 points, 559 targets 2026-02-21T09:06:31.3663079Z [210s] Generation 11 starting: 11 neighbors, 1 active search path(s) 2026-02-21T09:06:33.9350515Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 9.1 configs/s 2026-02-21T09:06:34.6429847Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 18.1 configs/s 2026-02-21T09:06:34.9283487Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3273.4 2026-02-21T09:06:34.9288256Z configs/s 2026-02-21T09:06:34.9664767Z [214s] Generation 11 complete: 2026-02-21T09:06:34.9669077Z ok=13 2026-02-21T09:06:34.9673414Z min=0.0472 2026-02-21T09:06:34.9674752Z mid=0.0841 2026-02-21T09:06:34.9674919Z max=0.1188 2026-02-21T09:06:34.9675062Z best={'block_sizes': [1, 16384], 2026-02-21T09:06:34.9675336Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:06:34.9675956Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:06:34.9676157Z 'num_sm_multiplier': 8, 
2026-02-21T09:06:34.9676321Z 'num_stages': 3, 2026-02-21T09:06:34.9676461Z 'num_warps': 1, 2026-02-21T09:06:34.9676622Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:06:34.9676814Z 'range_flattens': [False, None], 2026-02-21T09:06:34.9676999Z 'range_multi_buffers': [True, True], 2026-02-21T09:06:34.9677179Z 'range_num_stages': [1, 2], 2026-02-21T09:06:34.9677350Z 'range_unroll_factors': [1, 1], 2026-02-21T09:06:34.9677527Z 'range_warp_specializes': [True, None]} 2026-02-21T09:06:34.9684190Z [214s] Fitting surrogate: 572 points, 572 targets 2026-02-21T09:06:35.3578101Z [214s] Generation 12 starting: 10 neighbors, 1 active search path(s) 2026-02-21T09:06:37.7486899Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 12.1 configs/s 2026-02-21T09:06:38.3553069Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 10/10 17.9 configs/s 2026-02-21T09:06:39.4166286Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 931.0 2026-02-21T09:06:39.4170437Z configs/s 2026-02-21T09:06:39.4938802Z [218s] Generation 12 complete: 2026-02-21T09:06:39.4939171Z ok=12 2026-02-21T09:06:39.4939344Z min=0.0492 2026-02-21T09:06:39.4939521Z mid=0.0656 2026-02-21T09:06:39.4939661Z max=0.0819 2026-02-21T09:06:39.4939802Z best={'block_sizes': [1, 16384], 2026-02-21T09:06:39.4940025Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:06:39.4940261Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:06:39.4940462Z 'num_sm_multiplier': 8, 2026-02-21T09:06:39.4940618Z 'num_stages': 3, 2026-02-21T09:06:39.4940762Z 'num_warps': 1, 2026-02-21T09:06:39.4940916Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:06:39.4941112Z 'range_flattens': [False, None], 2026-02-21T09:06:39.4941321Z 'range_multi_buffers': [True, True], 2026-02-21T09:06:39.4941523Z 'range_num_stages': [1, 2], 2026-02-21T09:06:39.4941770Z 'range_unroll_factors': [1, 1], 2026-02-21T09:06:39.4956416Z 'range_warp_specializes': [True, None]} 
2026-02-21T09:06:39.4956676Z [218s] Fitting surrogate: 584 points, 584 targets 2026-02-21T09:06:39.8923283Z [219s] Generation 13 starting: 8 neighbors, 1 active search path(s) 2026-02-21T09:06:42.4528762Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8/8 5.5 configs/s 2026-02-21T09:06:42.9275971Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 8/8 18.8 configs/s 2026-02-21T09:06:43.3177506Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2425.0 2026-02-21T09:06:43.3179226Z configs/s 2026-02-21T09:06:43.3610161Z [222s] Generation 13 complete: 2026-02-21T09:06:43.3614531Z ok=10 2026-02-21T09:06:43.3618926Z min=0.0492 2026-02-21T09:06:43.3620438Z mid=0.0789 2026-02-21T09:06:43.3620667Z max=0.1638 2026-02-21T09:06:43.3626442Z best={'block_sizes': [1, 16384], 2026-02-21T09:06:43.3628554Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:06:43.3628829Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:06:43.3629024Z 'num_sm_multiplier': 8, 2026-02-21T09:06:43.3629193Z 'num_stages': 3, 2026-02-21T09:06:43.3629332Z 'num_warps': 1, 2026-02-21T09:06:43.3629496Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:06:43.3629690Z 'range_flattens': [False, None], 2026-02-21T09:06:43.3629876Z 'range_multi_buffers': [True, True], 2026-02-21T09:06:43.3630056Z 'range_num_stages': [1, 2], 2026-02-21T09:06:43.3630229Z 'range_unroll_factors': [1, 1], 2026-02-21T09:06:43.3630403Z 'range_warp_specializes': [True, None]} 2026-02-21T09:06:43.3630699Z [222s] Fitting surrogate: 594 points, 594 targets 2026-02-21T09:06:43.6448541Z [222s] Autotuning complete in 223.0s after searching 575 configs. 
2026-02-21T09:06:43.6448887Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:06:43.6450255Z @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_sm_multiplier=8, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 2], range_unroll_factors=[1, 1], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T09:06:43.6451152Z
2026-02-21T09:06:43.6451398Z [222s] Code of selected kernel: /tmp/torchinductor_root/sa/csaksjz5whuhhqvf6nkuvevnby6jtj627r5wbemjjclexvgnaj33.py
2026-02-21T09:06:44.5165236Z WARNING:tritonbench.utils.triton_op:Completed input ID 97:
2026-02-21T09:06:44.5165603Z (M, N)
2026-02-21T09:06:44.5165782Z -------------
2026-02-21T09:06:44.5165943Z (4096, 12672)
2026-02-21T09:06:44.5166032Z
2026-02-21T09:06:44.5166392Z 100%|██████████| 20/20 [57:42<00:00, 197.61s/it]
2026-02-21T09:06:44.5166654Z 100%|██████████| 20/20 [57:42<00:00, 173.10s/it]
2026-02-21T09:06:44.5205140Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmp7ibdmmxl.csv
2026-02-21T09:06:46.3381407Z (M, N)           triton_softmax-speedup    triton_softmax-accuracy    torch_compile_softmax-speedup    torch_compile_softmax-accuracy    helion_softmax_tritonbench-speedup    helion_softmax_tritonbench-accuracy
2026-02-21T09:06:46.3383249Z -------------  ------------------------  -------------------------  -------------------------------  --------------------------------  ------------------------------------  -------------------------------------
2026-02-21T09:06:46.3383923Z (4096, 256)                    0.927277                          1                          1.50605                                 1                                1.6272                                      1
2026-02-21T09:06:46.3384479Z (4096, 896)                     1.85317                          1                          1.43865                                 1                               2.22536                                      1
2026-02-21T09:06:46.3384978Z (4096, 1536)                    3.53206                          1                          2.22187                                 1                               4.50606                                      1
2026-02-21T09:06:46.3385446Z (4096, 2176)                    2.35532                          1                           1.5938                                 1                               4.63032                                      1
2026-02-21T09:06:46.3385922Z (4096, 2816)                    2.41506                          1                          1.64293                                 1                               4.06486                                      1
2026-02-21T09:06:46.3386393Z (4096, 3584)                    2.67651                          1                          1.54743                                 1                               3.49984                                      1
2026-02-21T09:06:46.3386861Z (4096, 4224)                    3.73652                          1                          1.97417                                 1                               5.01296                                      1
2026-02-21T09:06:46.3387347Z (4096, 4864)                    3.79746                          1                          1.86131                                 1                               4.97742                                      1
2026-02-21T09:06:46.3387810Z (4096, 5504)                    4.08316                          1                          1.88111                                 1                               4.88052                                      1
2026-02-21T09:06:46.3388257Z (4096, 6144)                    4.13037                          1                          2.09229                                 1                               4.47552                                      1
2026-02-21T09:06:46.3388725Z (4096, 6784)                    4.28609                          1                          1.67387                                 1                               4.58957                                      1
2026-02-21T09:06:46.3389559Z (4096, 7424)                    4.91419                          1                          1.86457                                 1                               4.83533                                      1
2026-02-21T09:06:46.3390034Z (4096, 8064)                    4.83901                          1                          1.78632                                 1                               4.77175                                      1
2026-02-21T09:06:46.3390510Z (4096, 8704)                    2.68658                          1                          1.91263                                 1                               4.13071                                      1
2026-02-21T09:06:46.3391136Z (4096, 9344)                    1.74183                          1                         0.986338                                 1                               2.44567                                      1
2026-02-21T09:06:46.3391675Z (4096, 10112)                   1.74482                          1                         0.949176                                 1                               2.40527                                      1
2026-02-21T09:06:46.3392150Z (4096, 10752)                   1.73486                          1                          1.06122                                 1                               2.25434                                      1
2026-02-21T09:06:46.3392626Z (4096, 11392)                   1.74539                          1                         0.864912                                 1                               2.25341                                      1
2026-02-21T09:06:46.3393099Z (4096, 12032)                    1.7451                          1                         0.843066                                 1                               2.20091                                      1
2026-02-21T09:06:46.3393573Z (4096, 12672)                    1.7628                          1                         0.834627                                 1                               2.05979                                      1
2026-02-21T09:06:46.3394052Z average                         2.83538                          1                          1.52682                                 1                               3.59234                                      1
2026-02-21T09:08:54.4878461Z Using num_inputs=20 for softmax
2026-02-21T09:08:54.4998627Z Running softmax benchmark with Helion implementation...
2026-02-21T09:08:54.5000316Z 2026-02-21T09:08:54.7235003Z Equally-spaced-k mode: Selected 20 equally spaced inputs (total available: 98) 2026-02-21T09:08:54.7236984Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 5, 10, 15, 20, 26, 31, 36, 41, 46, 51, 56, 61, 66, 71, 77, 82, 87, 92, 97] 2026-02-21T09:08:54.7243691Z 2026-02-21T09:08:54.7251174Z 0%| | 0/20 [00:00 {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:09:25.5307674Z %c32_i32 = arith.constant 32 : i32 2026-02-21T09:09:25.5307875Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:09:25.5308070Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:09:25.5308265Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T09:09:25.5308857Z %cst = arith.constant dense<0.000000e+00> : tensor<128xf32> 2026-02-21T09:09:25.5309176Z %cst_0 = arith.constant dense<0xFF800000> : tensor<128xf32> 2026-02-21T09:09:25.5309414Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:09:25.5309612Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:09:25.5309803Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:09:25.5310000Z %c256_i64 = arith.constant 256 : i64 2026-02-21T09:09:25.5310188Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:09:25.5310509Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c256_i32], [%c256_i64, %c1_i64] : , > 2026-02-21T09:09:25.5310981Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c256_i32], [%c256_i64, %c1_i64] : , > 2026-02-21T09:09:25.5311321Z %2 = tt.get_program_id x : i32 2026-02-21T09:09:25.5311720Z scf.for %arg2 = %2 to %c32_i32 step %c9472_i32 : i32 { 2026-02-21T09:09:25.5311971Z %3 = arith.muli %arg2, %c128_i32 : i32 2026-02-21T09:09:25.5312172Z %c240_i32 = arith.constant 240 : i32 2026-02-21T09:09:25.5312376Z %c48_i32 = arith.constant 48 : i32 2026-02-21T09:09:25.5312749Z %4:2 = scf.for %arg3 = %c0_i32 to %c240_i32 step %c48_i32 iter_args(%arg4 = %cst_0, %arg5 = %cst) -> (tensor<128xf32>, tensor<128xf32>) : i32 { 
2026-02-21T09:09:25.5313225Z %33 = tt.descriptor_load %0[%3, %arg3] : !tt.tensordesc> -> tensor<128x16xf16> 2026-02-21T09:09:25.5313555Z %34 = arith.extf %33 : tensor<128x16xf16> to tensor<128x16xf32> 2026-02-21T09:09:25.5313835Z %35 = "tt.reduce"(%34) <{axis = 1 : i32}> ({ 2026-02-21T09:09:25.5314035Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:09:25.5314220Z %91 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:09:25.5314417Z tt.reduce.return %91 : f32 2026-02-21T09:09:25.5314611Z }) : (tensor<128x16xf32>) -> tensor<128xf32> 2026-02-21T09:09:25.5314836Z %36 = arith.truncf %35 : tensor<128xf32> to tensor<128xf16> 2026-02-21T09:09:25.5315093Z %37 = arith.extf %36 : tensor<128xf16> to tensor<128xf32> 2026-02-21T09:09:25.5315486Z %38 = arith.cmpf ogt, %arg4, %37 : tensor<128xf32> 2026-02-21T09:09:25.5315722Z %39 = arith.cmpf une, %arg4, %arg4 : tensor<128xf32> 2026-02-21T09:09:25.5315940Z %40 = arith.ori %38, %39 : tensor<128xi1> 2026-02-21T09:09:25.5316182Z %41 = arith.select %40, %arg4, %37 : tensor<128xi1>, tensor<128xf32> 2026-02-21T09:09:25.5316436Z %42 = arith.subf %arg4, %41 : tensor<128xf32> 2026-02-21T09:09:25.5316804Z %43 = tt.extern_elementwise %42 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T09:09:25.5317175Z %44 = arith.mulf %arg5, %43 : tensor<128xf32> 2026-02-21T09:09:25.5317437Z %45 = tt.expand_dims %41 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T09:09:25.5317741Z %46 = tt.broadcast %45 : tensor<128x1xf32> -> tensor<128x16xf32> 2026-02-21T09:09:25.5317981Z %47 = arith.subf %34, %46 : tensor<128x16xf32> 2026-02-21T09:09:25.5318375Z %48 = tt.extern_elementwise %47 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x16xf32>) -> tensor<128x16xf32> 2026-02-21T09:09:25.5318744Z %49 = "tt.reduce"(%48) <{axis = 1 : i32}> ({ 2026-02-21T09:09:25.5318936Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:09:25.5319125Z %91 = arith.addf %arg6, %arg7 : 
f32 2026-02-21T09:09:25.5319315Z tt.reduce.return %91 : f32 2026-02-21T09:09:25.5319499Z }) : (tensor<128x16xf32>) -> tensor<128xf32> 2026-02-21T09:09:25.5319705Z %50 = arith.addf %44, %49 : tensor<128xf32> 2026-02-21T09:09:25.5319896Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:09:25.5320087Z %51 = arith.muli %c16_i32, %c1_i32 : i32 2026-02-21T09:09:25.5320273Z %52 = arith.addi %arg3, %51 : i32 2026-02-21T09:09:25.5320549Z %53 = tt.descriptor_load %0[%3, %52] : !tt.tensordesc> -> tensor<128x16xf16> 2026-02-21T09:09:25.5320937Z %54 = arith.extf %53 : tensor<128x16xf16> to tensor<128x16xf32> 2026-02-21T09:09:25.5321169Z %55 = "tt.reduce"(%54) <{axis = 1 : i32}> ({ 2026-02-21T09:09:25.5321363Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:09:25.5321577Z %91 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:09:25.5321773Z tt.reduce.return %91 : f32 2026-02-21T09:09:25.5321956Z }) : (tensor<128x16xf32>) -> tensor<128xf32> 2026-02-21T09:09:25.5322185Z %56 = arith.truncf %55 : tensor<128xf32> to tensor<128xf16> 2026-02-21T09:09:25.5322428Z %57 = arith.extf %56 : tensor<128xf16> to tensor<128xf32> 2026-02-21T09:09:25.5322664Z %58 = arith.cmpf ogt, %41, %57 : tensor<128xf32> 2026-02-21T09:09:25.5322884Z %59 = arith.cmpf une, %41, %41 : tensor<128xf32> 2026-02-21T09:09:25.5323089Z %60 = arith.ori %58, %59 : tensor<128xi1> 2026-02-21T09:09:25.5323326Z %61 = arith.select %60, %41, %57 : tensor<128xi1>, tensor<128xf32> 2026-02-21T09:09:25.5323560Z %62 = arith.subf %41, %61 : tensor<128xf32> 2026-02-21T09:09:25.5323920Z %63 = tt.extern_elementwise %62 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T09:09:25.5324281Z %64 = arith.mulf %50, %63 : tensor<128xf32> 2026-02-21T09:09:25.5324529Z %65 = tt.expand_dims %61 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T09:09:25.5324830Z %66 = tt.broadcast %65 : tensor<128x1xf32> -> tensor<128x16xf32> 2026-02-21T09:09:25.5325068Z %67 = arith.subf %54, %66 
: tensor<128x16xf32> 2026-02-21T09:09:25.5325429Z %68 = tt.extern_elementwise %67 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x16xf32>) -> tensor<128x16xf32> 2026-02-21T09:09:25.5325781Z %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({ 2026-02-21T09:09:25.5325973Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:09:25.5326162Z %91 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:09:25.5326419Z tt.reduce.return %91 : f32 2026-02-21T09:09:25.5326616Z }) : (tensor<128x16xf32>) -> tensor<128xf32> 2026-02-21T09:09:25.5326817Z %70 = arith.addf %64, %69 : tensor<128xf32> 2026-02-21T09:09:25.5327020Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:09:25.5327210Z %71 = arith.muli %c16_i32, %c2_i32 : i32 2026-02-21T09:09:25.5327409Z %72 = arith.addi %arg3, %71 : i32 2026-02-21T09:09:25.5327687Z %73 = tt.descriptor_load %0[%3, %72] : !tt.tensordesc> -> tensor<128x16xf16> 2026-02-21T09:09:25.5328009Z %74 = arith.extf %73 : tensor<128x16xf16> to tensor<128x16xf32> 2026-02-21T09:09:25.5328247Z %75 = "tt.reduce"(%74) <{axis = 1 : i32}> ({ 2026-02-21T09:09:25.5328438Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:09:25.5328633Z %91 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:09:25.5328827Z tt.reduce.return %91 : f32 2026-02-21T09:09:25.5329026Z }) : (tensor<128x16xf32>) -> tensor<128xf32> 2026-02-21T09:09:25.5329254Z %76 = arith.truncf %75 : tensor<128xf32> to tensor<128xf16> 2026-02-21T09:09:25.5329515Z %77 = arith.extf %76 : tensor<128xf16> to tensor<128xf32> 2026-02-21T09:09:25.5329758Z %78 = arith.cmpf ogt, %61, %77 : tensor<128xf32> 2026-02-21T09:09:25.5329980Z %79 = arith.cmpf une, %61, %61 : tensor<128xf32> 2026-02-21T09:09:25.5330193Z %80 = arith.ori %78, %79 : tensor<128xi1> 2026-02-21T09:09:25.5330425Z %81 = arith.select %80, %61, %77 : tensor<128xi1>, tensor<128xf32> 2026-02-21T09:09:25.5330668Z %82 = arith.subf %61, %81 : tensor<128xf32> 2026-02-21T09:09:25.5331019Z %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol 
= "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T09:09:25.5331388Z %84 = arith.mulf %70, %83 : tensor<128xf32> 2026-02-21T09:09:25.5331729Z %85 = tt.expand_dims %81 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T09:09:25.5332022Z %86 = tt.broadcast %85 : tensor<128x1xf32> -> tensor<128x16xf32> 2026-02-21T09:09:25.5332262Z %87 = arith.subf %74, %86 : tensor<128x16xf32> 2026-02-21T09:09:25.5332615Z %88 = tt.extern_elementwise %87 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x16xf32>) -> tensor<128x16xf32> 2026-02-21T09:09:25.5332980Z %89 = "tt.reduce"(%88) <{axis = 1 : i32}> ({ 2026-02-21T09:09:25.5333174Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:09:25.5333351Z %91 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:09:25.5333542Z tt.reduce.return %91 : f32 2026-02-21T09:09:25.5333725Z }) : (tensor<128x16xf32>) -> tensor<128xf32> 2026-02-21T09:09:25.5333922Z %90 = arith.addf %84, %89 : tensor<128xf32> 2026-02-21T09:09:25.5334157Z scf.yield %81, %90 : tensor<128xf32>, tensor<128xf32> 2026-02-21T09:09:25.5334381Z } {tt.num_stages = 1 : i32} 2026-02-21T09:09:25.5334675Z %5 = tt.descriptor_load %0[%3, %c240_i32] : !tt.tensordesc> -> tensor<128x16xf16> 2026-02-21T09:09:25.5335018Z %6 = arith.extf %5 : tensor<128x16xf16> to tensor<128x16xf32> 2026-02-21T09:09:25.5335259Z %7 = "tt.reduce"(%6) <{axis = 1 : i32}> ({ 2026-02-21T09:09:25.5335451Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:09:25.5335643Z %33 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:09:25.5335834Z tt.reduce.return %33 : f32 2026-02-21T09:09:25.5336031Z }) : (tensor<128x16xf32>) -> tensor<128xf32> 2026-02-21T09:09:25.5336256Z %8 = arith.truncf %7 : tensor<128xf32> to tensor<128xf16> 2026-02-21T09:09:25.5336509Z %9 = arith.extf %8 : tensor<128xf16> to tensor<128xf32> 2026-02-21T09:09:25.5336745Z %10 = arith.cmpf ogt, %4#0, %9 : tensor<128xf32> 2026-02-21T09:09:25.5336968Z %11 = arith.cmpf une, %4#0, %4#0 : tensor<128xf32> 
2026-02-21T09:09:25.5337187Z %12 = arith.ori %10, %11 : tensor<128xi1> 2026-02-21T09:09:25.5337423Z %13 = arith.select %12, %4#0, %9 : tensor<128xi1>, tensor<128xf32> 2026-02-21T09:09:25.5337792Z %14 = arith.subf %4#0, %13 : tensor<128xf32> 2026-02-21T09:09:25.5338158Z %15 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32> 2026-02-21T09:09:25.5338531Z %16 = arith.mulf %4#1, %15 : tensor<128xf32> 2026-02-21T09:09:25.5338798Z %17 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T09:09:25.5339101Z %18 = tt.broadcast %17 : tensor<128x1xf32> -> tensor<128x16xf32> 2026-02-21T09:09:25.5339349Z %19 = arith.subf %6, %18 : tensor<128x16xf32> 2026-02-21T09:09:25.5339725Z %20 = tt.extern_elementwise %19 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x16xf32>) -> tensor<128x16xf32> 2026-02-21T09:09:25.5340108Z %21 = "tt.reduce"(%20) <{axis = 1 : i32}> ({ 2026-02-21T09:09:25.5340312Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:09:25.5340498Z %33 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:09:25.5340702Z tt.reduce.return %33 : f32 2026-02-21T09:09:25.5340889Z }) : (tensor<128x16xf32>) -> tensor<128xf32> 2026-02-21T09:09:25.5341094Z %22 = arith.addf %16, %21 : tensor<128xf32> 2026-02-21T09:09:25.5341292Z %c240_i32_1 = arith.constant 240 : i32 2026-02-21T09:09:25.5341495Z %c48_i32_2 = arith.constant 48 : i32 2026-02-21T09:09:25.5341771Z scf.for %arg3 = %c0_i32 to %c240_i32_1 step %c48_i32_2 : i32 { 2026-02-21T09:09:25.5342104Z %33 = tt.descriptor_load %0[%3, %arg3] : !tt.tensordesc> -> tensor<128x16xf16> 2026-02-21T09:09:25.5342473Z %34 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T09:09:25.5342781Z %35 = arith.extf %33 : tensor<128x16xf16> to tensor<128x16xf32> 2026-02-21T09:09:25.5343059Z %36 = tt.broadcast %34 : tensor<128x1xf32> -> tensor<128x16xf32> 2026-02-21T09:09:25.5343361Z %37 = 
arith.subf %35, %36 : tensor<128x16xf32> 2026-02-21T09:09:25.5343733Z %38 = tt.extern_elementwise %37 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x16xf32>) -> tensor<128x16xf32> 2026-02-21T09:09:25.5344151Z %39 = tt.expand_dims %22 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T09:09:25.5344442Z %40 = tt.broadcast %39 : tensor<128x1xf32> -> tensor<128x16xf32> 2026-02-21T09:09:25.5344691Z %41 = arith.divf %38, %40 : tensor<128x16xf32> 2026-02-21T09:09:25.5344927Z %42 = arith.truncf %41 : tensor<128x16xf32> to tensor<128x16xf16> 2026-02-21T09:09:25.5345250Z tt.descriptor_store %1[%3, %arg3], %42 : !tt.tensordesc>, tensor<128x16xf16> 2026-02-21T09:09:25.5345560Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:09:25.5345746Z %43 = arith.muli %c16_i32, %c1_i32 : i32 2026-02-21T09:09:25.5345939Z %44 = arith.addi %arg3, %43 : i32 2026-02-21T09:09:25.5346207Z %45 = tt.descriptor_load %0[%3, %44] : !tt.tensordesc> -> tensor<128x16xf16> 2026-02-21T09:09:25.5346547Z %46 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T09:09:25.5346833Z %47 = arith.extf %45 : tensor<128x16xf16> to tensor<128x16xf32> 2026-02-21T09:09:25.5347086Z %48 = tt.broadcast %46 : tensor<128x1xf32> -> tensor<128x16xf32> 2026-02-21T09:09:25.5347324Z %49 = arith.subf %47, %48 : tensor<128x16xf32> 2026-02-21T09:09:25.5347684Z %50 = tt.extern_elementwise %49 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x16xf32>) -> tensor<128x16xf32> 2026-02-21T09:09:25.5348101Z %51 = tt.expand_dims %22 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T09:09:25.5348395Z %52 = tt.broadcast %51 : tensor<128x1xf32> -> tensor<128x16xf32> 2026-02-21T09:09:25.5348629Z %53 = arith.divf %50, %52 : tensor<128x16xf32> 2026-02-21T09:09:25.5348866Z %54 = arith.truncf %53 : tensor<128x16xf32> to tensor<128x16xf16> 2026-02-21T09:09:25.5349234Z tt.descriptor_store %1[%3, %44], %54 : !tt.tensordesc>, 
tensor<128x16xf16> 2026-02-21T09:09:25.5349521Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:09:25.5349706Z %55 = arith.muli %c16_i32, %c2_i32 : i32 2026-02-21T09:09:25.5349899Z %56 = arith.addi %arg3, %55 : i32 2026-02-21T09:09:25.5350164Z %57 = tt.descriptor_load %0[%3, %56] : !tt.tensordesc> -> tensor<128x16xf16> 2026-02-21T09:09:25.5350492Z %58 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T09:09:25.5350778Z %59 = arith.extf %57 : tensor<128x16xf16> to tensor<128x16xf32> 2026-02-21T09:09:25.5351030Z %60 = tt.broadcast %58 : tensor<128x1xf32> -> tensor<128x16xf32> 2026-02-21T09:09:25.5351262Z %61 = arith.subf %59, %60 : tensor<128x16xf32> 2026-02-21T09:09:25.5351674Z %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x16xf32>) -> tensor<128x16xf32> 2026-02-21T09:09:25.5352091Z %63 = tt.expand_dims %22 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T09:09:25.5352377Z %64 = tt.broadcast %63 : tensor<128x1xf32> -> tensor<128x16xf32> 2026-02-21T09:09:25.5352603Z %65 = arith.divf %62, %64 : tensor<128x16xf32> 2026-02-21T09:09:25.5352838Z %66 = arith.truncf %65 : tensor<128x16xf32> to tensor<128x16xf16> 2026-02-21T09:09:25.5353145Z tt.descriptor_store %1[%3, %56], %66 : !tt.tensordesc>, tensor<128x16xf16> 2026-02-21T09:09:25.5353426Z } {tt.num_stages = 1 : i32} 2026-02-21T09:09:25.5353707Z %23 = tt.descriptor_load %0[%3, %c240_i32_1] : !tt.tensordesc> -> tensor<128x16xf16> 2026-02-21T09:09:25.5354057Z %24 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T09:09:25.5354402Z %25 = arith.extf %23 : tensor<128x16xf16> to tensor<128x16xf32> 2026-02-21T09:09:25.5354664Z %26 = tt.broadcast %24 : tensor<128x1xf32> -> tensor<128x16xf32> 2026-02-21T09:09:25.5354903Z %27 = arith.subf %25, %26 : tensor<128x16xf32> 2026-02-21T09:09:25.5355261Z %28 = tt.extern_elementwise %27 {libname = "", libpath = "", pure = true, symbol = 
"__nv_expf"} : (tensor<128x16xf32>) -> tensor<128x16xf32> 2026-02-21T09:09:25.5355672Z %29 = tt.expand_dims %22 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32> 2026-02-21T09:09:25.5355963Z %30 = tt.broadcast %29 : tensor<128x1xf32> -> tensor<128x16xf32> 2026-02-21T09:09:25.5356189Z %31 = arith.divf %28, %30 : tensor<128x16xf32> 2026-02-21T09:09:25.5356430Z %32 = arith.truncf %31 : tensor<128x16xf32> to tensor<128x16xf16> 2026-02-21T09:09:25.5356755Z tt.descriptor_store %1[%3, %c240_i32_1], %32 : !tt.tensordesc>, tensor<128x16xf16> 2026-02-21T09:09:25.5357154Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T09:09:25.5357447Z tt.return 2026-02-21T09:09:25.5357576Z } 2026-02-21T09:09:25.5357703Z } 2026-02-21T09:09:25.5357770Z 2026-02-21T09:09:25.5357820Z {-# 2026-02-21T09:09:25.5357951Z external_resources: { 2026-02-21T09:09:25.5358108Z mlir_reproducer: { 2026-02-21T09:09:25.5362519Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, 
triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:09:25.5366969Z disable_threading: false, 2026-02-21T09:09:25.5367134Z verify_each: true 2026-02-21T09:09:25.5367289Z } 2026-02-21T09:09:25.5367412Z } 2026-02-21T09:09:25.5367522Z #-} 2026-02-21T09:09:25.5367945Z /tmp/torchinductor_root/gh/cghsz2gbngaiott6x5qitkdnthg7svazopk76hcnafw7y6j7mcjy.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:09:25.5369145Z /tmp/torchinductor_root/gh/cghsz2gbngaiott6x5qitkdnthg7svazopk76hcnafw7y6j7mcjy.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:09:25.5370125Z [25s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:09:25.5371311Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], maxnreg=32, num_sm_multiplier=64, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[1, 1], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:09:25.5372389Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:09:25.5372643Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:09:29.6949885Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 17.5 configs/s 2026-02-21T09:09:29.6960638Z [29s] Adaptive compile timeout: 30s (90% percentile=1.9s, bounds=[30.0s, 60s]) 2026-02-21T09:09:30.2080681Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1966.4 configs/s 2026-02-21T09:09:30.2622884Z [30s] Initial random population of 100, 5 starting points: 2026-02-21T09:09:30.2626578Z error=6 2026-02-21T09:09:30.2631720Z ok=94 2026-02-21T09:09:30.2634781Z min=0.0082 2026-02-21T09:09:30.2639265Z mid=0.0410 2026-02-21T09:09:30.2643050Z max=5.7293 2026-02-21T09:09:30.2644694Z best={'block_sizes': [4, 64], 2026-02-21T09:09:30.2644948Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:09:30.2645172Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T09:09:30.2645369Z 'num_stages': 6, 2026-02-21T09:09:30.2645510Z 'num_warps': 2, 2026-02-21T09:09:30.2645663Z 'pid_type': 'flat', 2026-02-21T09:09:30.2645823Z 'range_flattens': [None, True], 2026-02-21T09:09:30.2646007Z 'range_multi_buffers': [None, False], 2026-02-21T09:09:30.2646199Z 'range_num_stages': [0, 0], 2026-02-21T09:09:30.2646362Z 'range_unroll_factors': [0, 0], 2026-02-21T09:09:30.2646549Z 'range_warp_specializes': [None, True]} 2026-02-21T09:09:30.2646876Z [30s] Fitting surrogate: 100 points, 100 
targets 2026-02-21T09:09:31.4714609Z [31s] Generation 1 starting: 88 neighbors, 5 active search path(s) 2026-02-21T09:09:34.5019606Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 41.8 configs/s 2026-02-21T09:09:40.3688521Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 15.8 configs/s 2026-02-21T09:09:43.9366883Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 300.3 2026-02-21T09:09:43.9368326Z configs/s 2026-02-21T09:09:44.2836482Z [44s] Generation 1 complete: 2026-02-21T09:09:44.2838116Z ok=94 2026-02-21T09:09:44.2838293Z min=0.0081 2026-02-21T09:09:44.2838421Z mid=0.0083 2026-02-21T09:09:44.2838550Z max=0.1045 2026-02-21T09:09:44.2838689Z best={'block_sizes': [4, 256], 2026-02-21T09:09:44.2838906Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:09:44.2839122Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T09:09:44.2839315Z 'num_stages': 6, 2026-02-21T09:09:44.2839453Z 'num_warps': 1, 2026-02-21T09:09:44.2839598Z 'pid_type': 'flat', 2026-02-21T09:09:44.2840091Z 'range_flattens': [None, True], 2026-02-21T09:09:44.2840315Z 'range_multi_buffers': [None, None], 2026-02-21T09:09:44.2840509Z 'range_num_stages': [0, 0], 2026-02-21T09:09:44.2840678Z 'range_unroll_factors': [0, 0], 2026-02-21T09:09:44.2840871Z 'range_warp_specializes': [None, True]} 2026-02-21T09:09:44.2851197Z [44s] Fitting surrogate: 194 points, 194 targets 2026-02-21T09:09:45.3382249Z [45s] Generation 2 starting: 73 neighbors, 5 active search path(s) 2026-02-21T09:09:48.2370869Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 14.6 configs/s 2026-02-21T09:09:52.8535083Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.2 configs/s 2026-02-21T09:09:56.1650093Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 328.1 2026-02-21T09:09:56.1654124Z configs/s 2026-02-21T09:09:56.4695112Z [56s] Generation 2 complete: 2026-02-21T09:09:56.4698188Z ok=78 2026-02-21T09:09:56.4702114Z 
min=0.0062 2026-02-21T09:09:56.4705959Z mid=0.0082 2026-02-21T09:09:56.4707588Z max=0.0348 2026-02-21T09:09:56.4707805Z best={'block_sizes': [4, 256], 2026-02-21T09:09:56.4713615Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:09:56.4718012Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:09:56.4722367Z 'num_stages': 3, 2026-02-21T09:09:56.4725817Z 'num_warps': 1, 2026-02-21T09:09:56.4729244Z 'pid_type': 'flat', 2026-02-21T09:09:56.4733037Z 'range_flattens': [None, True], 2026-02-21T09:09:56.4737952Z 'range_multi_buffers': [None, True], 2026-02-21T09:09:56.4742352Z 'range_num_stages': [0, 0], 2026-02-21T09:09:56.4743660Z 'range_unroll_factors': [0, 0], 2026-02-21T09:09:56.4743910Z 'range_warp_specializes': [None, True]} 2026-02-21T09:09:56.4744199Z [56s] Fitting surrogate: 272 points, 272 targets 2026-02-21T09:09:57.4476336Z [57s] Generation 3 starting: 63 neighbors, 5 active search path(s) 2026-02-21T09:09:59.6350663Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 66/66 109.8 configs/s 2026-02-21T09:10:03.7739959Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 66/66 16.1 configs/s 2026-02-21T09:10:07.0550346Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 336.6 2026-02-21T09:10:07.0552015Z configs/s 2026-02-21T09:10:07.3819968Z [67s] Generation 3 complete: 2026-02-21T09:10:07.3822051Z ok=68 2026-02-21T09:10:07.3829241Z min=0.0062 2026-02-21T09:10:07.3835086Z mid=0.0082 2026-02-21T09:10:07.3840383Z max=0.0164 2026-02-21T09:10:07.3844720Z best={'block_sizes': [4, 256], 2026-02-21T09:10:07.3849501Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:10:07.3849860Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:10:07.3850090Z 'num_stages': 4, 2026-02-21T09:10:07.3855152Z 'num_warps': 1, 2026-02-21T09:10:07.3857356Z 'pid_type': 'flat', 2026-02-21T09:10:07.3857569Z 'range_flattens': [None, True], 2026-02-21T09:10:07.3857819Z 'range_multi_buffers': [None, True], 
2026-02-21T09:10:07.3858490Z 'range_num_stages': [0, 4], 2026-02-21T09:10:07.3858678Z 'range_unroll_factors': [0, 0], 2026-02-21T09:10:07.3858878Z 'range_warp_specializes': [None, True]} 2026-02-21T09:10:07.3859192Z [67s] Fitting surrogate: 340 points, 340 targets 2026-02-21T09:10:08.1969515Z [68s] Generation 4 starting: 58 neighbors, 5 active search path(s) 2026-02-21T09:10:10.2752319Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60/60 85.8 configs/s 2026-02-21T09:10:14.0334632Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 60/60 16.1 configs/s 2026-02-21T09:10:17.3706909Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 326.9 2026-02-21T09:10:17.3708083Z configs/s 2026-02-21T09:10:17.6706522Z [77s] Generation 4 complete: 2026-02-21T09:10:17.6710856Z ok=63 2026-02-21T09:10:17.6715350Z min=0.0062 2026-02-21T09:10:17.6719760Z mid=0.0081 2026-02-21T09:10:17.6721884Z max=0.0102 2026-02-21T09:10:17.6722178Z best={'block_sizes': [4, 256], 2026-02-21T09:10:17.6726384Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:10:17.6727722Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:10:17.6727960Z 'num_stages': 4, 2026-02-21T09:10:17.6728103Z 'num_warps': 1, 2026-02-21T09:10:17.6728250Z 'pid_type': 'flat', 2026-02-21T09:10:17.6728414Z 'range_flattens': [None, True], 2026-02-21T09:10:17.6728592Z 'range_multi_buffers': [None, True], 2026-02-21T09:10:17.6728782Z 'range_num_stages': [0, 4], 2026-02-21T09:10:17.6728944Z 'range_unroll_factors': [0, 0], 2026-02-21T09:10:17.6729126Z 'range_warp_specializes': [None, True]} 2026-02-21T09:10:17.6729415Z [77s] Fitting surrogate: 403 points, 403 targets 2026-02-21T09:10:18.5847506Z [78s] Generation 5 starting: 62 neighbors, 5 active search path(s) 2026-02-21T09:10:20.9560974Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64/64 83.7 configs/s 2026-02-21T09:10:24.9232996Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 64/64 16.3 
configs/s 2026-02-21T09:10:28.3169303Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 318.2 2026-02-21T09:10:28.3169751Z configs/s 2026-02-21T09:10:28.6658002Z [88s] Generation 5 complete: 2026-02-21T09:10:28.6662023Z ok=67 2026-02-21T09:10:28.6663163Z min=0.0062 2026-02-21T09:10:28.6663332Z mid=0.0062 2026-02-21T09:10:28.6663464Z max=0.0163 2026-02-21T09:10:28.6663603Z best={'block_sizes': [4, 256], 2026-02-21T09:10:28.6663820Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:10:28.6664047Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:10:28.6664242Z 'num_stages': 7, 2026-02-21T09:10:28.6664380Z 'num_warps': 1, 2026-02-21T09:10:28.6664523Z 'pid_type': 'flat', 2026-02-21T09:10:28.6664680Z 'range_flattens': [None, True], 2026-02-21T09:10:28.6664863Z 'range_multi_buffers': [None, None], 2026-02-21T09:10:28.6665106Z 'range_num_stages': [0, 0], 2026-02-21T09:10:28.6665280Z 'range_unroll_factors': [0, 1], 2026-02-21T09:10:28.6665467Z 'range_warp_specializes': [None, True]} 2026-02-21T09:10:28.6674168Z [88s] Fitting surrogate: 470 points, 470 targets 2026-02-21T09:10:29.5749900Z [89s] Generation 6 starting: 61 neighbors, 5 active search path(s) 2026-02-21T09:10:31.6266966Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 42.5 configs/s 2026-02-21T09:10:35.4802861Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 62/62 16.3 configs/s 2026-02-21T09:10:38.8673392Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 317.9 2026-02-21T09:10:38.8677274Z configs/s 2026-02-21T09:10:39.1798041Z [99s] Generation 6 complete: 2026-02-21T09:10:39.1802303Z ok=66 2026-02-21T09:10:39.1806758Z min=0.0062 2026-02-21T09:10:39.1812133Z mid=0.0080 2026-02-21T09:10:39.1816468Z max=0.0143 2026-02-21T09:10:39.1818191Z best={'block_sizes': [4, 256], 2026-02-21T09:10:39.1818878Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:10:39.1819126Z 'load_eviction_policies': ['first', 'first'], 
2026-02-21T09:10:39.1819325Z 'num_stages': 7, 2026-02-21T09:10:39.1819466Z 'num_warps': 1, 2026-02-21T09:10:39.1819614Z 'pid_type': 'flat', 2026-02-21T09:10:39.1819774Z 'range_flattens': [None, True], 2026-02-21T09:10:39.1819962Z 'range_multi_buffers': [None, None], 2026-02-21T09:10:39.1820150Z 'range_num_stages': [0, 0], 2026-02-21T09:10:39.1820322Z 'range_unroll_factors': [0, 1], 2026-02-21T09:10:39.1820503Z 'range_warp_specializes': [None, True]} 2026-02-21T09:10:39.1820773Z [99s] Fitting surrogate: 536 points, 536 targets 2026-02-21T09:10:40.0533261Z [100s] Generation 7 starting: 53 neighbors, 4 active search path(s) 2026-02-21T09:10:41.8706036Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56/56 86.1 configs/s 2026-02-21T09:10:45.3338030Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 56/56 16.4 configs/s 2026-02-21T09:10:48.2256521Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 378.1 2026-02-21T09:10:48.2257227Z configs/s 2026-02-21T09:10:48.5049297Z [108s] Generation 7 complete: 2026-02-21T09:10:48.5050908Z ok=58 2026-02-21T09:10:48.5051088Z min=0.0062 2026-02-21T09:10:48.5051227Z mid=0.0062 2026-02-21T09:10:48.5051363Z max=0.0143 2026-02-21T09:10:48.5051508Z best={'block_sizes': [4, 256], 2026-02-21T09:10:48.5052083Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:10:48.5052378Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:10:48.5052581Z 'num_stages': 7, 2026-02-21T09:10:48.5052738Z 'num_warps': 2, 2026-02-21T09:10:48.5052886Z 'pid_type': 'flat', 2026-02-21T09:10:48.5053058Z 'range_flattens': [None, True], 2026-02-21T09:10:48.5053244Z 'range_multi_buffers': [None, None], 2026-02-21T09:10:48.5053510Z 'range_num_stages': [0, 0], 2026-02-21T09:10:48.5053695Z 'range_unroll_factors': [0, 1], 2026-02-21T09:10:48.5053885Z 'range_warp_specializes': [None, True]} 2026-02-21T09:10:48.5065616Z [108s] Fitting surrogate: 594 points, 594 targets 2026-02-21T09:10:48.9429267Z 
[109s] Generation 8 starting: 12 neighbors, 1 active search path(s) 2026-02-21T09:10:49.4447848Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 34.7 configs/s 2026-02-21T09:10:50.1819474Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 12/12 17.3 configs/s 2026-02-21T09:10:50.8089587Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1614.6 2026-02-21T09:10:50.8096578Z configs/s 2026-02-21T09:10:50.8769151Z [110s] Generation 8 complete: 2026-02-21T09:10:50.8773591Z ok=13 2026-02-21T09:10:50.8777989Z min=0.0062 2026-02-21T09:10:50.8780104Z mid=0.0062 2026-02-21T09:10:50.8780317Z max=0.0081 2026-02-21T09:10:50.8785717Z best={'block_sizes': [4, 256], 2026-02-21T09:10:50.8789052Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:10:50.8794014Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:10:50.8798281Z 'num_stages': 7, 2026-02-21T09:10:50.8799576Z 'num_warps': 2, 2026-02-21T09:10:50.8799761Z 'pid_type': 'flat', 2026-02-21T09:10:50.8799940Z 'range_flattens': [None, True], 2026-02-21T09:10:50.8800121Z 'range_multi_buffers': [None, None], 2026-02-21T09:10:50.8800313Z 'range_num_stages': [0, 0], 2026-02-21T09:10:50.8800486Z 'range_unroll_factors': [0, 1], 2026-02-21T09:10:50.8800662Z 'range_warp_specializes': [None, True]} 2026-02-21T09:10:50.8800959Z [111s] Fitting surrogate: 607 points, 607 targets 2026-02-21T09:10:51.2915657Z [111s] Generation 9 starting: 11 neighbors, 1 active search path(s) 2026-02-21T09:10:51.8188003Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 36.5 configs/s 2026-02-21T09:10:52.4968052Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 11/11 17.4 configs/s 2026-02-21T09:10:53.0794189Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1735.4 2026-02-21T09:10:53.0798392Z configs/s 2026-02-21T09:10:53.1456019Z [113s] Generation 9 complete: 2026-02-21T09:10:53.1459889Z ok=13 2026-02-21T09:10:53.1464778Z min=0.0061 
2026-02-21T09:10:53.1466572Z mid=0.0062 2026-02-21T09:10:53.1466742Z max=0.0123 2026-02-21T09:10:53.1466887Z best={'block_sizes': [4, 256], 2026-02-21T09:10:53.1467148Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:10:53.1467412Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:10:53.1467608Z 'num_stages': 7, 2026-02-21T09:10:53.1467762Z 'num_warps': 4, 2026-02-21T09:10:53.1467906Z 'pid_type': 'flat', 2026-02-21T09:10:53.1468079Z 'range_flattens': [None, True], 2026-02-21T09:10:53.1468261Z 'range_multi_buffers': [None, None], 2026-02-21T09:10:53.1468451Z 'range_num_stages': [0, 0], 2026-02-21T09:10:53.1468649Z 'range_unroll_factors': [0, 1], 2026-02-21T09:10:53.1468833Z 'range_warp_specializes': [None, True]} 2026-02-21T09:10:53.1473024Z [113s] Fitting surrogate: 620 points, 620 targets 2026-02-21T09:10:53.4231841Z [113s] Autotuning complete in 113.5s after searching 597 configs. 2026-02-21T09:10:53.4232273Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:10:53.4233290Z @helion.kernel(config=helion.Config(block_sizes=[4, 256], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=7, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T09:10:53.4234158Z 2026-02-21T09:10:53.4234418Z [113s] Code of selected kernel: /tmp/torchinductor_root/fk/cfksxei47ym7ts5y5qdyesq4ngw3aqn56gffr2islwommctdkqzr.py 2026-02-21T09:10:53.4440495Z from __future__ import annotations 2026-02-21T09:10:53.4440720Z 2026-02-21T09:10:53.4445608Z import torch 2026-02-21T09:10:53.4450729Z import triton 2026-02-21T09:10:53.4455836Z import triton.language as tl 2026-02-21T09:10:53.4460401Z from torch._inductor.runtime import triton_helpers 2026-02-21T09:10:53.4462040Z from 
torch._inductor.runtime.triton_compat import libdevice 2026-02-21T09:10:53.4462418Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:10:53.4466954Z 2026-02-21T09:10:53.4470448Z _BLOCK_SIZE_0 = tl.constexpr(4) 2026-02-21T09:10:53.4475210Z _BLOCK_SIZE_1 = tl.constexpr(256) 2026-02-21T09:10:53.4479668Z 2026-02-21T09:10:53.4482514Z @triton.jit 2026-02-21T09:10:53.4484921Z def _helion_softmax_two_pass(x, out): 2026-02-21T09:10:53.4485279Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:10:53.4489296Z pid_0 = tl.program_id(0) 2026-02-21T09:10:53.4493868Z offset_0 = pid_0 * _BLOCK_SIZE_0 2026-02-21T09:10:53.4495437Z indices_0 = (offset_0 + tl.arange(0, _BLOCK_SIZE_0)).to(tl.int32) 2026-02-21T09:10:53.4496116Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:10:53.4496414Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T09:10:53.4496687Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:10:53.4496944Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T09:10:53.4497199Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:10:53.4497476Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:10:53.4497726Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:10:53.4497961Z # src[softmax.py:82-89]: ... 
2026-02-21T09:10:53.4498269Z for offset_2 in tl.range(0, 256, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=True, flatten=True): 2026-02-21T09:10:53.4498741Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:10:53.4498988Z mi_copy = mi 2026-02-21T09:10:53.4499131Z di_copy = di 2026-02-21T09:10:53.4499282Z mi_copy_0 = mi_copy 2026-02-21T09:10:53.4499433Z di_copy_0 = di_copy 2026-02-21T09:10:53.4499620Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:10:53.4499953Z values = tl.load(x + (indices_0[:, None] * 256 + indices_2[None, :] * 1), None, eviction_policy='evict_first') 2026-02-21T09:10:53.4500313Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:10:53.4500568Z local_amax = tl.cast(tl.max(values, 1), tl.float16) 2026-02-21T09:10:53.4500819Z # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax) 2026-02-21T09:10:53.4501058Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T09:10:53.4501266Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T09:10:53.4501528Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:10:53.4501845Z v_2 = mi_copy_0 - v_1 2026-02-21T09:10:53.4502038Z v_3 = libdevice.exp(v_2) 2026-02-21T09:10:53.4502219Z v_4 = di_copy_0 * v_3 2026-02-21T09:10:53.4502411Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:10:53.4502628Z subscript = v_1[:, None] 2026-02-21T09:10:53.4502808Z v_5 = tl.cast(values, tl.float32) 2026-02-21T09:10:53.4503005Z v_6 = v_5 - subscript 2026-02-21T09:10:53.4503220Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:10:53.4503488Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:10:53.4503699Z # src[softmax.py:88]: ).sum(dim=1) 2026-02-21T09:10:53.4503889Z v_7 = libdevice.exp(v_6) 2026-02-21T09:10:53.4504077Z sum_1 = tl.cast(tl.sum(v_7, 1), tl.float32) 2026-02-21T09:10:53.4504268Z di = v_4 + sum_1 2026-02-21T09:10:53.4504452Z # src[softmax.py:89]: mi = 
mi_next 2026-02-21T09:10:53.4504622Z mi = v_1 2026-02-21T09:10:53.4504831Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:10:53.4505098Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:10:53.4505390Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:10:53.4505794Z for offset_2 in tl.range(0, 256, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=True, flatten=True): 2026-02-21T09:10:53.4506152Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:10:53.4506383Z mi_copy_1 = mi 2026-02-21T09:10:53.4506527Z di_copy_1 = di 2026-02-21T09:10:53.4506681Z mi_copy_1_0 = mi_copy_1 2026-02-21T09:10:53.4506843Z di_copy_1_0 = di_copy_1 2026-02-21T09:10:53.4507030Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:10:53.4507363Z values_1 = tl.load(x + (indices_0[:, None] * 256 + indices_2[None, :] * 1), None, eviction_policy='evict_first') 2026-02-21T09:10:53.4507851Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:10:53.4508134Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T09:10:53.4508322Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T09:10:53.4508508Z v_10 = v_9 - subscript_1 2026-02-21T09:10:53.4508676Z v_11 = libdevice.exp(v_10) 2026-02-21T09:10:53.4508857Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T09:10:53.4509040Z v_12 = v_11 / subscript_2 2026-02-21T09:10:53.4509210Z v_13 = tl.cast(v_12, tl.float16) 2026-02-21T09:10:53.4509459Z tl.store(out + (indices_0[:, None] * 256 + indices_2[None, :] * 1), v_13, None) 2026-02-21T09:10:53.4509653Z 2026-02-21T09:10:53.4509780Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T09:10:53.4510017Z """ 2026-02-21T09:10:53.4510216Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T09:10:53.4510594Z This version uses fewer passes but is less numerically stable. 
2026-02-21T09:10:53.4510818Z Args: 2026-02-21T09:10:53.4510975Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T09:10:53.4511173Z Returns: 2026-02-21T09:10:53.4511348Z torch.Tensor: Softmax output tensor of the same shape. 2026-02-21T09:10:53.4511620Z """ 2026-02-21T09:10:53.4511757Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T09:10:53.4511939Z m, n = x.size() 2026-02-21T09:10:53.4512102Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T09:10:53.4512310Z out = torch.empty_like(x) 2026-02-21T09:10:53.4512545Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:10:53.4512778Z _BLOCK_SIZE_0 = 4 2026-02-21T09:10:53.4512998Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:10:53.4513313Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:10:53.4513635Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:10:53.4513873Z # src[softmax.py:79-92]: ... 
2026-02-21T09:10:53.4514191Z _launcher(_helion_softmax_two_pass, (triton.cdiv(4096, _BLOCK_SIZE_0),), x, out, num_warps=4, num_stages=7) 2026-02-21T09:10:53.4514524Z # src[softmax.py:93]: return out 2026-02-21T09:10:53.4514688Z return out 2026-02-21T09:10:54.0853893Z WARNING:tritonbench.utils.triton_op:Completed input ID 0: 2026-02-21T09:10:54.0857722Z (M, N) 2026-02-21T09:10:54.0862995Z ----------- 2026-02-21T09:10:54.0867384Z (4096, 256) 2026-02-21T09:10:54.0867544Z 2026-02-21T09:10:54.0872734Z 5%|▌ | 1/20 [01:59<37:47, 119.36s/it]WARNING:tritonbench.utils.triton_op:Running input ID 5: 2026-02-21T09:10:54.0876178Z (M, N) 2026-02-21T09:10:54.0876429Z ----------- 2026-02-21T09:10:54.0876598Z (4096, 896) 2026-02-21T09:10:54.0876883Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T09:10:55.7142604Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T09:10:57.0473996Z INFO:tritonbench.utils.triton_op:Took 2.49ms to get benchmark function for torch_compile_softmax 2026-02-21T09:10:58.0773184Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:10:58.0777576Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:10:58.0779169Z 'dtype': 'torch.float16', 2026-02-21T09:10:58.0779458Z 'shape': (4096, 896), 2026-02-21T09:10:58.0785225Z 'stride': (896, 1)},), 2026-02-21T09:10:58.0787416Z 'kwargs': {}} 2026-02-21T09:10:58.0787819Z INFO:tritonbench.utils.triton_op:Took 1.85ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T09:10:58.2521767Z [0s] Autotune random seed: 2138408546 2026-02-21T09:10:58.2773509Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:11:30.4749298Z [32s] Timeout after 30s compiling Config(block_sizes=[256, 512], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=16, num_stages=3, num_warps=4, 
pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[True, True], range_num_stages=[3, 4], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]) 2026-02-21T09:11:30.4767479Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.0 configs/s 2026-02-21T09:11:34.1974490Z module { 2026-02-21T09:11:34.1976880Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:11:34.1977446Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:11:34.1977684Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:11:34.1977912Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:11:34.1978161Z %cst = arith.constant dense<896> : tensor<64x1xi32> 2026-02-21T09:11:34.1978856Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<64xf32> 2026-02-21T09:11:34.1979194Z %cst_1 = arith.constant dense<0xFF800000> : tensor<64xf32> 2026-02-21T09:11:34.1979461Z %c64_i32 = arith.constant 64 : i32 2026-02-21T09:11:34.1979687Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:11:34.1979901Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:11:34.1980099Z %c896_i64 = arith.constant 896 : i64 2026-02-21T09:11:34.1980290Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:11:34.1980633Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c896_i32], [%c896_i64, %c1_i64] : <f16>, <tensor<64x128xf16>> 2026-02-21T09:11:34.1981100Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c896_i32], [%c896_i64, %c1_i64] : <f16>, <tensor<64x128xf16>> 2026-02-21T09:11:34.1981455Z %2 = tt.get_program_id x : i32 2026-02-21T09:11:34.1981748Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T09:11:34.1981959Z %4 = arith.minsi %3, %c64_i32 : i32 2026-02-21T09:11:34.1982201Z scf.for %arg2 = %2 to %4 step %c1_i32 : i32 { 2026-02-21T09:11:34.1982445Z %5 = arith.muli %arg2, %c64_i32 : i32 2026-02-21T09:11:34.1982719Z %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 
2026-02-21T09:11:34.1983012Z %7 = tt.splat %5 : i32 -> tensor<64xi32> 2026-02-21T09:11:34.1983245Z %8 = arith.addi %7, %6 : tensor<64xi32> 2026-02-21T09:11:34.1983463Z %c768_i32 = arith.constant 768 : i32 2026-02-21T09:11:34.1983685Z %c384_i32 = arith.constant 384 : i32 2026-02-21T09:11:34.1984109Z %9:2 = scf.for %arg3 = %c0_i32 to %c768_i32 step %c384_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<64xf32>, tensor<64xf32>) : i32 { 2026-02-21T09:11:34.1984586Z %49 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:11:34.1984911Z %50 = tt.splat %arg3 : i32 -> tensor<128xi32> 2026-02-21T09:11:34.1985152Z %51 = arith.addi %50, %49 : tensor<128xi32> 2026-02-21T09:11:34.1985455Z %52 = tt.expand_dims %8 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:11:34.1985819Z %53 = arith.muli %52, %cst : tensor<64x1xi32> 2026-02-21T09:11:34.1986119Z %54 = tt.expand_dims %51 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:11:34.1986456Z %55 = tt.broadcast %53 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:11:34.1986745Z %56 = tt.broadcast %54 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:11:34.1986996Z %57 = arith.addi %55, %56 : tensor<64x128xi32> 2026-02-21T09:11:34.1987258Z %58 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<64x128x!tt.ptr<f16>> 2026-02-21T09:11:34.1987570Z %59 = tt.addptr %58, %57 : tensor<64x128x!tt.ptr<f16>>, tensor<64x128xi32> 2026-02-21T09:11:34.1987846Z %60 = tt.load %59 : tensor<64x128x!tt.ptr<f16>> 2026-02-21T09:11:34.1988102Z %61 = arith.extf %60 : tensor<64x128xf16> to tensor<64x128xf32> 2026-02-21T09:11:34.1988351Z %62 = "tt.reduce"(%61) <{axis = 1 : i32}> ({ 2026-02-21T09:11:34.1988716Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:11:34.1988922Z %140 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:11:34.1989141Z tt.reduce.return %140 : f32 2026-02-21T09:11:34.1989353Z }) : (tensor<64x128xf32>) -> tensor<64xf32> 2026-02-21T09:11:34.1989594Z %63 = arith.truncf %62 : 
tensor<64xf32> to tensor<64xf16> 2026-02-21T09:11:34.1989864Z %64 = arith.extf %63 : tensor<64xf16> to tensor<64xf32> 2026-02-21T09:11:34.1990113Z %65 = arith.cmpf ogt, %arg4, %64 : tensor<64xf32> 2026-02-21T09:11:34.1990370Z %66 = arith.cmpf une, %arg4, %arg4 : tensor<64xf32> 2026-02-21T09:11:34.1990607Z %67 = arith.ori %65, %66 : tensor<64xi1> 2026-02-21T09:11:34.1990875Z %68 = arith.select %67, %arg4, %64 : tensor<64xi1>, tensor<64xf32> 2026-02-21T09:11:34.1991145Z %69 = arith.subf %arg4, %68 : tensor<64xf32> 2026-02-21T09:11:34.1991626Z %70 = tt.extern_elementwise %69 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64xf32>) -> tensor<64xf32> 2026-02-21T09:11:34.1992030Z %71 = arith.mulf %arg5, %70 : tensor<64xf32> 2026-02-21T09:11:34.1992304Z %72 = tt.expand_dims %68 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> 2026-02-21T09:11:34.1992629Z %73 = tt.broadcast %72 : tensor<64x1xf32> -> tensor<64x128xf32> 2026-02-21T09:11:34.1992887Z %74 = arith.subf %61, %73 : tensor<64x128xf32> 2026-02-21T09:11:34.1993283Z %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32> 2026-02-21T09:11:34.1993702Z %76 = "tt.reduce"(%75) <{axis = 1 : i32}> ({ 2026-02-21T09:11:34.1993923Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:11:34.1994144Z %140 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:11:34.1994363Z tt.reduce.return %140 : f32 2026-02-21T09:11:34.1994586Z }) : (tensor<64x128xf32>) -> tensor<64xf32> 2026-02-21T09:11:34.1994817Z %77 = arith.addf %71, %76 : tensor<64xf32> 2026-02-21T09:11:34.1995052Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T09:11:34.1995278Z %78 = arith.muli %c128_i32, %c1_i32_4 : i32 2026-02-21T09:11:34.1995495Z %79 = arith.addi %arg3, %78 : i32 2026-02-21T09:11:34.1995769Z %80 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:11:34.1996055Z %81 = tt.splat %79 : i32 -> tensor<128xi32> 
2026-02-21T09:11:34.1996289Z %82 = arith.addi %81, %80 : tensor<128xi32> 2026-02-21T09:11:34.1996569Z %83 = tt.expand_dims %8 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:11:34.1996874Z %84 = arith.muli %83, %cst : tensor<64x1xi32> 2026-02-21T09:11:34.1997175Z %85 = tt.expand_dims %82 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:11:34.1997485Z %86 = tt.broadcast %84 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:11:34.1997769Z %87 = tt.broadcast %85 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:11:34.1998023Z %88 = arith.addi %86, %87 : tensor<64x128xi32> 2026-02-21T09:11:34.1998281Z %89 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<64x128x!tt.ptr<f16>> 2026-02-21T09:11:34.1998578Z %90 = tt.addptr %89, %88 : tensor<64x128x!tt.ptr<f16>>, tensor<64x128xi32> 2026-02-21T09:11:34.1998858Z %91 = tt.load %90 : tensor<64x128x!tt.ptr<f16>> 2026-02-21T09:11:34.1999113Z %92 = arith.extf %91 : tensor<64x128xf16> to tensor<64x128xf32> 2026-02-21T09:11:34.1999357Z %93 = "tt.reduce"(%92) <{axis = 1 : i32}> ({ 2026-02-21T09:11:34.1999579Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:11:34.1999789Z %140 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:11:34.2000015Z tt.reduce.return %140 : f32 2026-02-21T09:11:34.2000225Z }) : (tensor<64x128xf32>) -> tensor<64xf32> 2026-02-21T09:11:34.2000488Z %94 = arith.truncf %93 : tensor<64xf32> to tensor<64xf16> 2026-02-21T09:11:34.2000774Z %95 = arith.extf %94 : tensor<64xf16> to tensor<64xf32> 2026-02-21T09:11:34.2001100Z %96 = arith.cmpf ogt, %68, %95 : tensor<64xf32> 2026-02-21T09:11:34.2001348Z %97 = arith.cmpf une, %68, %68 : tensor<64xf32> 2026-02-21T09:11:34.2001608Z %98 = arith.ori %96, %97 : tensor<64xi1> 2026-02-21T09:11:34.2001874Z %99 = arith.select %98, %68, %95 : tensor<64xi1>, tensor<64xf32> 2026-02-21T09:11:34.2002122Z %100 = arith.subf %68, %99 : tensor<64xf32> 2026-02-21T09:11:34.2002514Z %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} 
: (tensor<64xf32>) -> tensor<64xf32> 2026-02-21T09:11:34.2002913Z %102 = arith.mulf %77, %101 : tensor<64xf32> 2026-02-21T09:11:34.2003192Z %103 = tt.expand_dims %99 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> 2026-02-21T09:11:34.2003507Z %104 = tt.broadcast %103 : tensor<64x1xf32> -> tensor<64x128xf32> 2026-02-21T09:11:34.2003824Z %105 = arith.subf %92, %104 : tensor<64x128xf32> 2026-02-21T09:11:34.2004245Z %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32> 2026-02-21T09:11:34.2004704Z %107 = "tt.reduce"(%106) <{axis = 1 : i32}> ({ 2026-02-21T09:11:34.2004946Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:11:34.2005163Z %140 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:11:34.2005382Z tt.reduce.return %140 : f32 2026-02-21T09:11:34.2005602Z }) : (tensor<64x128xf32>) -> tensor<64xf32> 2026-02-21T09:11:34.2005846Z %108 = arith.addf %102, %107 : tensor<64xf32> 2026-02-21T09:11:34.2006079Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:11:34.2006302Z %109 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T09:11:34.2006544Z %110 = arith.addi %arg3, %109 : i32 2026-02-21T09:11:34.2006824Z %111 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:11:34.2007142Z %112 = tt.splat %110 : i32 -> tensor<128xi32> 2026-02-21T09:11:34.2007389Z %113 = arith.addi %112, %111 : tensor<128xi32> 2026-02-21T09:11:34.2007717Z %114 = tt.expand_dims %8 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:11:34.2008045Z %115 = arith.muli %114, %cst : tensor<64x1xi32> 2026-02-21T09:11:34.2008372Z %116 = tt.expand_dims %113 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:11:34.2008737Z %117 = tt.broadcast %115 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:11:34.2009055Z %118 = tt.broadcast %116 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:11:34.2009348Z %119 = arith.addi %117, %118 : tensor<64x128xi32> 
2026-02-21T09:11:34.2009635Z %120 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<64x128x!tt.ptr<f16>> 2026-02-21T09:11:34.2009973Z %121 = tt.addptr %120, %119 : tensor<64x128x!tt.ptr<f16>>, tensor<64x128xi32> 2026-02-21T09:11:34.2010251Z %122 = tt.load %121 : tensor<64x128x!tt.ptr<f16>> 2026-02-21T09:11:34.2010503Z %123 = arith.extf %122 : tensor<64x128xf16> to tensor<64x128xf32> 2026-02-21T09:11:34.2010747Z %124 = "tt.reduce"(%123) <{axis = 1 : i32}> ({ 2026-02-21T09:11:34.2010948Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:11:34.2011134Z %140 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:11:34.2011335Z tt.reduce.return %140 : f32 2026-02-21T09:11:34.2011520Z }) : (tensor<64x128xf32>) -> tensor<64xf32> 2026-02-21T09:11:34.2011789Z %125 = arith.truncf %124 : tensor<64xf32> to tensor<64xf16> 2026-02-21T09:11:34.2012035Z %126 = arith.extf %125 : tensor<64xf16> to tensor<64xf32> 2026-02-21T09:11:34.2012283Z %127 = arith.cmpf ogt, %99, %126 : tensor<64xf32> 2026-02-21T09:11:34.2012518Z %128 = arith.cmpf une, %99, %99 : tensor<64xf32> 2026-02-21T09:11:34.2012733Z %129 = arith.ori %127, %128 : tensor<64xi1> 2026-02-21T09:11:34.2012992Z %130 = arith.select %129, %99, %126 : tensor<64xi1>, tensor<64xf32> 2026-02-21T09:11:34.2013287Z %131 = arith.subf %99, %130 : tensor<64xf32> 2026-02-21T09:11:34.2013654Z %132 = tt.extern_elementwise %131 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64xf32>) -> tensor<64xf32> 2026-02-21T09:11:34.2014022Z %133 = arith.mulf %108, %132 : tensor<64xf32> 2026-02-21T09:11:34.2014279Z %134 = tt.expand_dims %130 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> 2026-02-21T09:11:34.2014580Z %135 = tt.broadcast %134 : tensor<64x1xf32> -> tensor<64x128xf32> 2026-02-21T09:11:34.2014822Z %136 = arith.subf %123, %135 : tensor<64x128xf32> 2026-02-21T09:11:34.2015199Z %137 = tt.extern_elementwise %136 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32> 2026-02-21T09:11:34.2015569Z %138 = 
"tt.reduce"(%137) <{axis = 1 : i32}> ({ 2026-02-21T09:11:34.2015845Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:11:34.2016036Z %140 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:11:34.2016220Z tt.reduce.return %140 : f32 2026-02-21T09:11:34.2016409Z }) : (tensor<64x128xf32>) -> tensor<64xf32> 2026-02-21T09:11:34.2016609Z %139 = arith.addf %133, %138 : tensor<64xf32> 2026-02-21T09:11:34.2016835Z scf.yield %130, %139 : tensor<64xf32>, tensor<64xf32> 2026-02-21T09:11:34.2017054Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:11:34.2017309Z %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:11:34.2017577Z %11 = tt.splat %c768_i32 : i32 -> tensor<128xi32> 2026-02-21T09:11:34.2017786Z %12 = arith.addi %11, %10 : tensor<128xi32> 2026-02-21T09:11:34.2018049Z %13 = tt.expand_dims %8 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32> 2026-02-21T09:11:34.2018309Z %14 = arith.muli %13, %cst : tensor<64x1xi32> 2026-02-21T09:11:34.2018580Z %15 = tt.expand_dims %12 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:11:34.2018873Z %16 = tt.broadcast %14 : tensor<64x1xi32> -> tensor<64x128xi32> 2026-02-21T09:11:34.2019147Z %17 = tt.broadcast %15 : tensor<1x128xi32> -> tensor<64x128xi32> 2026-02-21T09:11:34.2019385Z %18 = arith.addi %16, %17 : tensor<64x128xi32> 2026-02-21T09:11:34.2019616Z %19 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<64x128x!tt.ptr<f16>> 2026-02-21T09:11:34.2019897Z %20 = tt.addptr %19, %18 : tensor<64x128x!tt.ptr<f16>>, tensor<64x128xi32> 2026-02-21T09:11:34.2020141Z %21 = tt.load %20 : tensor<64x128x!tt.ptr<f16>> 2026-02-21T09:11:34.2020375Z %22 = arith.extf %21 : tensor<64x128xf16> to tensor<64x128xf32> 2026-02-21T09:11:34.2020601Z %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({ 2026-02-21T09:11:34.2020799Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:11:34.2020988Z %49 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:11:34.2021175Z tt.reduce.return %49 : f32 2026-02-21T09:11:34.2021365Z }) : (tensor<64x128xf32>) -> 
tensor<64xf32> 2026-02-21T09:11:34.2021631Z %24 = arith.truncf %23 : tensor<64xf32> to tensor<64xf16> 2026-02-21T09:11:34.2021893Z %25 = arith.extf %24 : tensor<64xf16> to tensor<64xf32> 2026-02-21T09:11:34.2022130Z %26 = arith.cmpf ogt, %9#0, %25 : tensor<64xf32> 2026-02-21T09:11:34.2022364Z %27 = arith.cmpf une, %9#0, %9#0 : tensor<64xf32> 2026-02-21T09:11:34.2022592Z %28 = arith.ori %26, %27 : tensor<64xi1> 2026-02-21T09:11:34.2022815Z %29 = arith.select %28, %9#0, %25 : tensor<64xi1>, tensor<64xf32> 2026-02-21T09:11:34.2023049Z %30 = arith.subf %9#0, %29 : tensor<64xf32> 2026-02-21T09:11:34.2023391Z %31 = tt.extern_elementwise %30 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64xf32>) -> tensor<64xf32> 2026-02-21T09:11:34.2023740Z %32 = arith.mulf %9#1, %31 : tensor<64xf32> 2026-02-21T09:11:34.2023984Z %33 = tt.expand_dims %29 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> 2026-02-21T09:11:34.2024327Z %34 = tt.broadcast %33 : tensor<64x1xf32> -> tensor<64x128xf32> 2026-02-21T09:11:34.2024564Z %35 = arith.subf %22, %34 : tensor<64x128xf32> 2026-02-21T09:11:34.2024914Z %36 = tt.extern_elementwise %35 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32> 2026-02-21T09:11:34.2025271Z %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({ 2026-02-21T09:11:34.2025456Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:11:34.2025638Z %49 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:11:34.2025818Z tt.reduce.return %49 : f32 2026-02-21T09:11:34.2026002Z }) : (tensor<64x128xf32>) -> tensor<64xf32> 2026-02-21T09:11:34.2026195Z %38 = arith.addf %32, %37 : tensor<64xf32> 2026-02-21T09:11:34.2026383Z %c768_i32_2 = arith.constant 768 : i32 2026-02-21T09:11:34.2026575Z %c384_i32_3 = arith.constant 384 : i32 2026-02-21T09:11:34.2026853Z scf.for %arg3 = %c0_i32 to %c768_i32_2 step %c384_i32_3 : i32 { 2026-02-21T09:11:34.2027193Z %49 = tt.descriptor_load %0[%5, %arg3] : !tt.tensordesc<tensor<64x128xf16>> -> tensor<64x128xf16> 
2026-02-21T09:11:34.2027540Z %50 = tt.expand_dims %29 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> 2026-02-21T09:11:34.2027839Z %51 = arith.extf %49 : tensor<64x128xf16> to tensor<64x128xf32> 2026-02-21T09:11:34.2028107Z %52 = tt.broadcast %50 : tensor<64x1xf32> -> tensor<64x128xf32> 2026-02-21T09:11:34.2028343Z %53 = arith.subf %51, %52 : tensor<64x128xf32> 2026-02-21T09:11:34.2028716Z %54 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32> 2026-02-21T09:11:34.2029127Z %55 = tt.expand_dims %38 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> 2026-02-21T09:11:34.2029421Z %56 = tt.broadcast %55 : tensor<64x1xf32> -> tensor<64x128xf32> 2026-02-21T09:11:34.2029666Z %57 = arith.divf %54, %56 : tensor<64x128xf32> 2026-02-21T09:11:34.2029903Z %58 = arith.truncf %57 : tensor<64x128xf32> to tensor<64x128xf16> 2026-02-21T09:11:34.2030234Z tt.descriptor_store %1[%5, %arg3], %58 : !tt.tensordesc<tensor<64x128xf16>>, tensor<64x128xf16> 2026-02-21T09:11:34.2030535Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T09:11:34.2030736Z %59 = arith.muli %c128_i32, %c1_i32_4 : i32 2026-02-21T09:11:34.2030927Z %60 = arith.addi %arg3, %59 : i32 2026-02-21T09:11:34.2031202Z %61 = tt.descriptor_load %0[%5, %60] : !tt.tensordesc<tensor<64x128xf16>> -> tensor<64x128xf16> 2026-02-21T09:11:34.2031583Z %62 = tt.expand_dims %29 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> 2026-02-21T09:11:34.2031884Z %63 = arith.extf %61 : tensor<64x128xf16> to tensor<64x128xf32> 2026-02-21T09:11:34.2032161Z %64 = tt.broadcast %62 : tensor<64x1xf32> -> tensor<64x128xf32> 2026-02-21T09:11:34.2032411Z %65 = arith.subf %63, %64 : tensor<64x128xf32> 2026-02-21T09:11:34.2032821Z %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32> 2026-02-21T09:11:34.2033233Z %67 = tt.expand_dims %38 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> 
2026-02-21T09:11:34.2033518Z %68 = tt.broadcast %67 : tensor<64x1xf32> -> tensor<64x128xf32> 2026-02-21T09:11:34.2033771Z %69 = arith.divf %66, %68 : tensor<64x128xf32> 2026-02-21T09:11:34.2034018Z %70 = arith.truncf %69 : tensor<64x128xf32> to tensor<64x128xf16> 2026-02-21T09:11:34.2034358Z tt.descriptor_store %1[%5, %60], %70 : !tt.tensordesc<tensor<64x128xf16>>, tensor<64x128xf16> 2026-02-21T09:11:34.2034658Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:11:34.2034866Z %71 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T09:11:34.2035073Z %72 = arith.addi %arg3, %71 : i32 2026-02-21T09:11:34.2035357Z %73 = tt.descriptor_load %0[%5, %72] : !tt.tensordesc<tensor<64x128xf16>> -> tensor<64x128xf16> 2026-02-21T09:11:34.2035775Z %74 = tt.expand_dims %29 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> 2026-02-21T09:11:34.2036073Z %75 = arith.extf %73 : tensor<64x128xf16> to tensor<64x128xf32> 2026-02-21T09:11:34.2036355Z %76 = tt.broadcast %74 : tensor<64x1xf32> -> tensor<64x128xf32> 2026-02-21T09:11:34.2036600Z %77 = arith.subf %75, %76 : tensor<64x128xf32> 2026-02-21T09:11:34.2037015Z %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32> 2026-02-21T09:11:34.2037458Z %79 = tt.expand_dims %38 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> 2026-02-21T09:11:34.2037755Z %80 = tt.broadcast %79 : tensor<64x1xf32> -> tensor<64x128xf32> 2026-02-21T09:11:34.2038011Z %81 = arith.divf %78, %80 : tensor<64x128xf32> 2026-02-21T09:11:34.2038264Z %82 = arith.truncf %81 : tensor<64x128xf32> to tensor<64x128xf16> 2026-02-21T09:11:34.2038666Z tt.descriptor_store %1[%5, %72], %82 : !tt.tensordesc<tensor<64x128xf16>>, tensor<64x128xf16> 2026-02-21T09:11:34.2038983Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:11:34.2039307Z %39 = tt.descriptor_load %0[%5, %c768_i32_2] : !tt.tensordesc<tensor<64x128xf16>> -> tensor<64x128xf16> 2026-02-21T09:11:34.2039700Z %40 = tt.expand_dims %29 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> 
2026-02-21T09:11:34.2040023Z %41 = arith.extf %39 : tensor<64x128xf16> to tensor<64x128xf32> 2026-02-21T09:11:34.2040305Z %42 = tt.broadcast %40 : tensor<64x1xf32> -> tensor<64x128xf32> 2026-02-21T09:11:34.2040566Z %43 = arith.subf %41, %42 : tensor<64x128xf32> 2026-02-21T09:11:34.2040962Z %44 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32> 2026-02-21T09:11:34.2041414Z %45 = tt.expand_dims %38 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32> 2026-02-21T09:11:34.2041742Z %46 = tt.broadcast %45 : tensor<64x1xf32> -> tensor<64x128xf32> 2026-02-21T09:11:34.2041982Z %47 = arith.divf %44, %46 : tensor<64x128xf32> 2026-02-21T09:11:34.2042222Z %48 = arith.truncf %47 : tensor<64x128xf32> to tensor<64x128xf16> 2026-02-21T09:11:34.2042574Z tt.descriptor_store %1[%5, %c768_i32_2], %48 : !tt.tensordesc<tensor<64x128xf16>>, tensor<64x128xf16> 2026-02-21T09:11:34.2042931Z } {tt.loop_unroll_factor = 1 : i32, tt.warp_specialize} 2026-02-21T09:11:34.2043152Z tt.return 2026-02-21T09:11:34.2043296Z } 2026-02-21T09:11:34.2043425Z } 2026-02-21T09:11:34.2043507Z 2026-02-21T09:11:34.2043561Z {-# 2026-02-21T09:11:34.2043705Z external_resources: { 2026-02-21T09:11:34.2043874Z mlir_reproducer: { 2026-02-21T09:11:34.2048752Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, 
tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:11:34.2053497Z disable_threading: false, 2026-02-21T09:11:34.2053672Z verify_each: true 2026-02-21T09:11:34.2053813Z } 2026-02-21T09:11:34.2053943Z } 2026-02-21T09:11:34.2054063Z #-} 2026-02-21T09:11:34.2054501Z /tmp/torchinductor_root/yd/cyd3dfdak4qyi5lrmdhe7jxqrsagzqh3pbfiqueeizcj7exphius.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:11:34.2055748Z /tmp/torchinductor_root/yd/cyd3dfdak4qyi5lrmdhe7jxqrsagzqh3pbfiqueeizcj7exphius.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:11:34.2056724Z [35s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:11:34.2057782Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_sm_multiplier=8, num_stages=6, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:11:34.2063035Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:11:34.2063349Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:11:36.3871183Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 17.0 configs/s 2026-02-21T09:11:36.3881791Z [38s] Adaptive compile timeout: 30s (90% percentile=1.3s, bounds=[30.0s, 30s]) 2026-02-21T09:11:36.5495516Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 5895.2 configs/s 2026-02-21T09:11:36.5758818Z [38s] Initial random population of 100, 5 starting points: 2026-02-21T09:11:36.5762532Z error=5 2026-02-21T09:11:36.5767013Z timeout=1 2026-02-21T09:11:36.5768619Z ok=94 2026-02-21T09:11:36.5768836Z min=0.0123 2026-02-21T09:11:36.5774916Z mid=0.0921 2026-02-21T09:11:36.5779318Z max=20.7043 2026-02-21T09:11:36.5780623Z best={'block_sizes': [4, 128], 2026-02-21T09:11:36.5780891Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:11:36.5781125Z 'load_eviction_policies': ['last', ''], 2026-02-21T09:11:36.5781315Z 'maxnreg': 128, 2026-02-21T09:11:36.5781467Z 'num_sm_multiplier': 8, 2026-02-21T09:11:36.5781748Z 'num_stages': 3, 2026-02-21T09:11:36.5781901Z 'num_warps': 1, 2026-02-21T09:11:36.5782062Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:11:36.5782252Z 'range_flattens': [True, True], 2026-02-21T09:11:36.5782432Z 'range_multi_buffers': [True, True], 2026-02-21T09:11:36.5782616Z 'range_num_stages': [3, 2], 2026-02-21T09:11:36.5782777Z 'range_unroll_factors': [1, 2], 
2026-02-21T09:11:36.5782958Z 'range_warp_specializes': [True, None]} 2026-02-21T09:11:36.5783169Z [38s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:11:37.9837829Z [39s] Generation 1 starting: 101 neighbors, 5 active search path(s) 2026-02-21T09:11:42.4781195Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 106/106 16.5 configs/s 2026-02-21T09:11:42.8082187Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T09:11:42.8085857Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:11:42.8086734Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:11:42.8087245Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:11:42.8087451Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:11:42.8087636Z %c1184_i32 = arith.constant 1184 : i32 2026-02-21T09:11:42.8087856Z %cst = arith.constant dense<896> : tensor<16x1xi32> 2026-02-21T09:11:42.8088102Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<16xf32> 2026-02-21T09:11:42.8088364Z %cst_1 = arith.constant dense<0xFF800000> : tensor<16xf32> 2026-02-21T09:11:42.8088576Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:11:42.8088763Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:11:42.8095061Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:11:42.8099859Z %c896_i64 = arith.constant 896 : i64 2026-02-21T09:11:42.8103607Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:11:42.8107133Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c896_i32], [%c896_i64, %c1_i64] : , > 2026-02-21T09:11:42.8110415Z %1 = tt.get_program_id x : i32 2026-02-21T09:11:42.8114308Z scf.for %arg2 = %1 to %c256_i32 step %c1184_i32 : i32 { 2026-02-21T09:11:42.8118351Z %2 = arith.muli %arg2, %c16_i32 : i32 2026-02-21T09:11:42.8121323Z %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:11:42.8121863Z %4 = tt.splat %2 : i32 -> tensor<16xi32> 2026-02-21T09:11:42.8122081Z %5 
= arith.addi %4, %3 : tensor<16xi32> 2026-02-21T09:11:42.8122278Z %c768_i32 = arith.constant 768 : i32 2026-02-21T09:11:42.8122483Z %c256_i32_2 = arith.constant 256 : i32 2026-02-21T09:11:42.8122866Z %6:2 = scf.for %arg3 = %c0_i32 to %c768_i32 step %c256_i32_2 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<16xf32>, tensor<16xf32>) : i32 { 2026-02-21T09:11:42.8123336Z %48 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:11:42.8123687Z %49 = arith.extf %48 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:11:42.8123931Z %50 = "tt.reduce"(%49) <{axis = 1 : i32}> ({ 2026-02-21T09:11:42.8124138Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:11:42.8124339Z %86 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:11:42.8124538Z tt.reduce.return %86 : f32 2026-02-21T09:11:42.8124742Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:11:42.8124968Z %51 = arith.truncf %50 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:11:42.8125213Z %52 = arith.extf %51 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:11:42.8125443Z %53 = arith.cmpf ogt, %arg4, %52 : tensor<16xf32> 2026-02-21T09:11:42.8125677Z %54 = arith.cmpf une, %arg4, %arg4 : tensor<16xf32> 2026-02-21T09:11:42.8125890Z %55 = arith.ori %53, %54 : tensor<16xi1> 2026-02-21T09:11:42.8126126Z %56 = arith.select %55, %arg4, %52 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:11:42.8126377Z %57 = arith.subf %arg4, %56 : tensor<16xf32> 2026-02-21T09:11:42.8126746Z %58 = tt.extern_elementwise %57 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:11:42.8127111Z %59 = arith.mulf %arg5, %58 : tensor<16xf32> 2026-02-21T09:11:42.8127363Z %60 = tt.expand_dims %56 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:11:42.8127658Z %61 = tt.broadcast %60 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:11:42.8127900Z %62 = arith.subf %49, %61 : tensor<16x128xf32> 2026-02-21T09:11:42.8128265Z %63 = 
tt.extern_elementwise %62 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:11:42.8128630Z %64 = "tt.reduce"(%63) <{axis = 1 : i32}> ({ 2026-02-21T09:11:42.8128821Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:11:42.8129011Z %86 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:11:42.8129198Z tt.reduce.return %86 : f32 2026-02-21T09:11:42.8129610Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:11:42.8129820Z %65 = arith.addf %59, %64 : tensor<16xf32> 2026-02-21T09:11:42.8130013Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:11:42.8130207Z %66 = arith.muli %c128_i32, %c1_i32 : i32 2026-02-21T09:11:42.8130395Z %67 = arith.addi %arg3, %66 : i32 2026-02-21T09:11:42.8130678Z %68 = tt.descriptor_load %0[%2, %67] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:11:42.8130989Z %69 = arith.extf %68 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:11:42.8131222Z %70 = "tt.reduce"(%69) <{axis = 1 : i32}> ({ 2026-02-21T09:11:42.8131415Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:11:42.8131634Z %86 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:11:42.8131828Z tt.reduce.return %86 : f32 2026-02-21T09:11:42.8132009Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:11:42.8132302Z %71 = arith.truncf %70 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:11:42.8132544Z %72 = arith.extf %71 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:11:42.8132777Z %73 = arith.cmpf ogt, %56, %72 : tensor<16xf32> 2026-02-21T09:11:42.8132990Z %74 = arith.cmpf une, %56, %56 : tensor<16xf32> 2026-02-21T09:11:42.8133196Z %75 = arith.ori %73, %74 : tensor<16xi1> 2026-02-21T09:11:42.8133429Z %76 = arith.select %75, %56, %72 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:11:42.8133657Z %77 = arith.subf %56, %76 : tensor<16xf32> 2026-02-21T09:11:42.8134009Z %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 
2026-02-21T09:11:42.8134359Z %79 = arith.mulf %65, %78 : tensor<16xf32> 2026-02-21T09:11:42.8134613Z %80 = tt.expand_dims %76 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:11:42.8134910Z %81 = tt.broadcast %80 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:11:42.8135144Z %82 = arith.subf %69, %81 : tensor<16x128xf32> 2026-02-21T09:11:42.8135509Z %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:11:42.8135868Z %84 = "tt.reduce"(%83) <{axis = 1 : i32}> ({ 2026-02-21T09:11:42.8136063Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:11:42.8136241Z %86 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:11:42.8136430Z tt.reduce.return %86 : f32 2026-02-21T09:11:42.8136622Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:11:42.8136816Z %85 = arith.addf %79, %84 : tensor<16xf32> 2026-02-21T09:11:42.8137039Z scf.yield %76, %85 : tensor<16xf32>, tensor<16xf32> 2026-02-21T09:11:42.8137248Z } {tt.num_stages = 1 : i32} 2026-02-21T09:11:42.8137537Z %7 = tt.descriptor_load %0[%2, %c768_i32] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:11:42.8137863Z %8 = arith.extf %7 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:11:42.8138094Z %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({ 2026-02-21T09:11:42.8138285Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:11:42.8138465Z %48 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:11:42.8138672Z tt.reduce.return %48 : f32 2026-02-21T09:11:42.8138857Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:11:42.8139089Z %10 = arith.truncf %9 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:11:42.8139331Z %11 = arith.extf %10 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:11:42.8139565Z %12 = arith.cmpf ogt, %6#0, %11 : tensor<16xf32> 2026-02-21T09:11:42.8139783Z %13 = arith.cmpf une, %6#0, %6#0 : tensor<16xf32> 2026-02-21T09:11:42.8139995Z %14 = arith.ori %12, %13 : tensor<16xi1> 
2026-02-21T09:11:42.8140229Z %15 = arith.select %14, %6#0, %11 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:11:42.8140467Z %16 = arith.subf %6#0, %15 : tensor<16xf32> 2026-02-21T09:11:42.8140895Z %17 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:11:42.8141286Z %18 = arith.mulf %6#1, %17 : tensor<16xf32> 2026-02-21T09:11:42.8141576Z %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:11:42.8141882Z %20 = tt.broadcast %19 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:11:42.8142115Z %21 = arith.subf %8, %20 : tensor<16x128xf32> 2026-02-21T09:11:42.8142486Z %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:11:42.8142851Z %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({ 2026-02-21T09:11:42.8143061Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:11:42.8143251Z %48 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:11:42.8143492Z tt.reduce.return %48 : f32 2026-02-21T09:11:42.8143698Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:11:42.8143889Z %24 = arith.addf %18, %23 : tensor<16xf32> 2026-02-21T09:11:42.8144086Z %c768_i32_3 = arith.constant 768 : i32 2026-02-21T09:11:42.8144270Z %c256_i32_4 = arith.constant 256 : i32 2026-02-21T09:11:42.8144500Z scf.for %arg3 = %c0_i32 to %c768_i32_3 step %c256_i32_4 : i32 { 2026-02-21T09:11:42.8144780Z %48 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:11:42.8145033Z %49 = tt.splat %arg3 : i32 -> tensor<128xi32> 2026-02-21T09:11:42.8145242Z %50 = arith.addi %49, %48 : tensor<128xi32> 2026-02-21T09:11:42.8145486Z %51 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:11:42.8145750Z %52 = arith.muli %51, %cst : tensor<16x1xi32> 2026-02-21T09:11:42.8145998Z %53 = tt.expand_dims %50 {axis = 0 : i32} : tensor<128xi32> 
-> tensor<1x128xi32> 2026-02-21T09:11:42.8146298Z %54 = tt.broadcast %52 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:11:42.8146567Z %55 = tt.broadcast %53 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:11:42.8146801Z %56 = arith.addi %54, %55 : tensor<16x128xi32> 2026-02-21T09:11:42.8147046Z %57 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:11:42.8147325Z %58 = tt.addptr %57, %56 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:11:42.8147582Z %59 = tt.load %58 : tensor<16x128x!tt.ptr> 2026-02-21T09:11:42.8147829Z %60 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:11:42.8148114Z %61 = arith.extf %59 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:11:42.8148373Z %62 = tt.broadcast %60 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:11:42.8148603Z %63 = arith.subf %61, %62 : tensor<16x128xf32> 2026-02-21T09:11:42.8148974Z %64 = tt.extern_elementwise %63 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:11:42.8149382Z %65 = tt.expand_dims %24 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:11:42.8149665Z %66 = tt.broadcast %65 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:11:42.8149899Z %67 = arith.divf %64, %66 : tensor<16x128xf32> 2026-02-21T09:11:42.8150129Z %68 = arith.truncf %67 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:11:42.8150399Z %69 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:11:42.8150668Z %70 = tt.addptr %69, %56 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:11:42.8150923Z tt.store %70, %68 : tensor<16x128x!tt.ptr> 2026-02-21T09:11:42.8151125Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:11:42.8151321Z %71 = arith.muli %c128_i32, %c1_i32 : i32 2026-02-21T09:11:42.8151518Z %72 = arith.addi %arg3, %71 : i32 2026-02-21T09:11:42.8151834Z %73 = tt.make_range {end = 128 : i32, start = 0 : i32} : 
tensor<128xi32> 2026-02-21T09:11:42.8152089Z %74 = tt.splat %72 : i32 -> tensor<128xi32> 2026-02-21T09:11:42.8152289Z %75 = arith.addi %74, %73 : tensor<128xi32> 2026-02-21T09:11:42.8152547Z %76 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:11:42.8152807Z %77 = arith.muli %76, %cst : tensor<16x1xi32> 2026-02-21T09:11:42.8153070Z %78 = tt.expand_dims %75 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:11:42.8153369Z %79 = tt.broadcast %77 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:11:42.8153629Z %80 = tt.broadcast %78 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:11:42.8153883Z %81 = arith.addi %79, %80 : tensor<16x128xi32> 2026-02-21T09:11:42.8154127Z %82 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:11:42.8154504Z %83 = tt.addptr %82, %81 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:11:42.8154759Z %84 = tt.load %83 : tensor<16x128x!tt.ptr> 2026-02-21T09:11:42.8155014Z %85 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:11:42.8155307Z %86 = arith.extf %84 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:11:42.8155566Z %87 = tt.broadcast %85 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:11:42.8155803Z %88 = arith.subf %86, %87 : tensor<16x128xf32> 2026-02-21T09:11:42.8156173Z %89 = tt.extern_elementwise %88 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:11:42.8156596Z %90 = tt.expand_dims %24 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:11:42.8156885Z %91 = tt.broadcast %90 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:11:42.8157117Z %92 = arith.divf %89, %91 : tensor<16x128xf32> 2026-02-21T09:11:42.8157360Z %93 = arith.truncf %92 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:11:42.8157629Z %94 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 
2026-02-21T09:11:42.8157909Z %95 = tt.addptr %94, %81 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:11:42.8158164Z tt.store %95, %93 : tensor<16x128x!tt.ptr> 2026-02-21T09:11:42.8158372Z } {tt.num_stages = 1 : i32} 2026-02-21T09:11:42.8158600Z %25 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:11:42.8158856Z %26 = tt.splat %c768_i32_3 : i32 -> tensor<128xi32> 2026-02-21T09:11:42.8159075Z %27 = arith.addi %26, %25 : tensor<128xi32> 2026-02-21T09:11:42.8159321Z %28 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:11:42.8159582Z %29 = arith.muli %28, %cst : tensor<16x1xi32> 2026-02-21T09:11:42.8159835Z %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:11:42.8160128Z %31 = tt.broadcast %29 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:11:42.8160395Z %32 = tt.broadcast %30 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:11:42.8160623Z %33 = arith.addi %31, %32 : tensor<16x128xi32> 2026-02-21T09:11:42.8160862Z %34 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:11:42.8161136Z %35 = tt.addptr %34, %33 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:11:42.8161395Z %36 = tt.load %35 : tensor<16x128x!tt.ptr> 2026-02-21T09:11:42.8161712Z %37 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:11:42.8162003Z %38 = arith.extf %36 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:11:42.8162263Z %39 = tt.broadcast %37 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:11:42.8162504Z %40 = arith.subf %38, %39 : tensor<16x128xf32> 2026-02-21T09:11:42.8162942Z %41 = tt.extern_elementwise %40 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:11:42.8163370Z %42 = tt.expand_dims %24 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:11:42.8163671Z %43 = tt.broadcast %42 : 
tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:11:42.8163919Z %44 = arith.divf %41, %43 : tensor<16x128xf32> 2026-02-21T09:11:42.8164156Z %45 = arith.truncf %44 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:11:42.8164436Z %46 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:11:42.8164718Z %47 = tt.addptr %46, %33 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:11:42.8164986Z tt.store %47, %45 : tensor<16x128x!tt.ptr> 2026-02-21T09:11:42.8165264Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T09:11:42.8165584Z tt.return 2026-02-21T09:11:42.8165726Z } 2026-02-21T09:11:42.8165849Z } 2026-02-21T09:11:42.8165923Z 2026-02-21T09:11:42.8165983Z {-# 2026-02-21T09:11:42.8166115Z external_resources: { 2026-02-21T09:11:42.8166286Z mlir_reproducer: { 2026-02-21T09:11:42.8170834Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 
region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:11:42.8175300Z disable_threading: false, 2026-02-21T09:11:42.8175468Z verify_each: true 2026-02-21T09:11:42.8175618Z } 2026-02-21T09:11:42.8175734Z } 2026-02-21T09:11:42.8175853Z #-} 2026-02-21T09:11:42.8176284Z /tmp/torchinductor_root/rc/crcdf4rqqxqatr7f3tzmzu3irll77hqde3z57327bjf64sdz6cal.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:11:42.8177464Z /tmp/torchinductor_root/rc/crcdf4rqqxqatr7f3tzmzu3irll77hqde3z57327bjf64sdz6cal.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:11:42.8178435Z [44s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:11:42.8179507Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', ''], maxnreg=128, num_sm_multiplier=8, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, True], range_num_stages=[3, 2], range_unroll_factors=[1, 2], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:11:42.8180512Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:11:42.8180772Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:11:43.6424763Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T09:11:43.6426719Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:11:43.6427199Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:11:43.6427406Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:11:43.6427595Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:11:43.6428091Z %c1184_i32 = arith.constant 1184 : i32 2026-02-21T09:11:43.6428340Z %cst = arith.constant dense<896> : tensor<8x1xi32> 2026-02-21T09:11:43.6428595Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<8xf32> 2026-02-21T09:11:43.6428847Z %cst_1 = arith.constant dense<0xFF800000> : tensor<8xf32> 2026-02-21T09:11:43.6429069Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:11:43.6429254Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:11:43.6429434Z %c896_i32 = arith.constant 896 : i32 2026-02-21T09:11:43.6429622Z %c896_i64 = arith.constant 896 : i64 2026-02-21T09:11:43.6429796Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:11:43.6430108Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c896_i32], [%c896_i64, %c1_i64] : , > 2026-02-21T09:11:43.6430421Z %1 = tt.get_program_id x : i32 2026-02-21T09:11:43.6430638Z scf.for %arg2 = %1 to %c512_i32 step %c1184_i32 : 
i32 { 2026-02-21T09:11:43.6430870Z %2 = arith.muli %arg2, %c8_i32 : i32 2026-02-21T09:11:43.6431105Z %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:11:43.6431365Z %4 = tt.splat %2 : i32 -> tensor<8xi32> 2026-02-21T09:11:43.6431635Z %5 = arith.addi %4, %3 : tensor<8xi32> 2026-02-21T09:11:43.6431834Z %c768_i32 = arith.constant 768 : i32 2026-02-21T09:11:43.6432016Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:11:43.6432383Z %6:2 = scf.for %arg3 = %c0_i32 to %c768_i32 step %c256_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<8xf32>, tensor<8xf32>) : i32 { 2026-02-21T09:11:43.6432850Z %48 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<8x128xf16> 2026-02-21T09:11:43.6433172Z %49 = arith.extf %48 : tensor<8x128xf16> to tensor<8x128xf32> 2026-02-21T09:11:43.6433412Z %50 = "tt.reduce"(%49) <{axis = 1 : i32}> ({ 2026-02-21T09:11:43.6433608Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:11:43.6433809Z %86 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:11:43.6434008Z tt.reduce.return %86 : f32 2026-02-21T09:11:43.6434201Z }) : (tensor<8x128xf32>) -> tensor<8xf32> 2026-02-21T09:11:43.6434462Z %51 = arith.truncf %50 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:11:43.6434706Z %52 = arith.extf %51 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:11:43.6434933Z %53 = arith.cmpf ogt, %arg4, %52 : tensor<8xf32> 2026-02-21T09:11:43.6435162Z %54 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32> 2026-02-21T09:11:43.6435374Z %55 = arith.ori %53, %54 : tensor<8xi1> 2026-02-21T09:11:43.6435612Z %56 = arith.select %55, %arg4, %52 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:11:43.6435857Z %57 = arith.subf %arg4, %56 : tensor<8xf32> 2026-02-21T09:11:43.6436214Z %58 = tt.extern_elementwise %57 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:11:43.6436578Z %59 = arith.mulf %arg5, %58 : tensor<8xf32> 2026-02-21T09:11:43.6436975Z %60 = tt.expand_dims %56 {axis = 1 : 
i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:11:43.6437268Z %61 = tt.broadcast %60 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T09:11:43.6437499Z %62 = arith.subf %49, %61 : tensor<8x128xf32> 2026-02-21T09:11:43.6437862Z %63 = tt.extern_elementwise %62 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32> 2026-02-21T09:11:43.6438234Z %64 = "tt.reduce"(%63) <{axis = 1 : i32}> ({ 2026-02-21T09:11:43.6438426Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:11:43.6438620Z %86 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:11:43.6438804Z tt.reduce.return %86 : f32 2026-02-21T09:11:43.6438995Z }) : (tensor<8x128xf32>) -> tensor<8xf32> 2026-02-21T09:11:43.6439194Z %65 = arith.addf %59, %64 : tensor<8xf32> 2026-02-21T09:11:43.6439396Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:11:43.6439665Z %66 = arith.muli %c128_i32, %c1_i32 : i32 2026-02-21T09:11:43.6439855Z %67 = arith.addi %arg3, %66 : i32 2026-02-21T09:11:43.6440131Z %68 = tt.descriptor_load %0[%2, %67] : !tt.tensordesc> -> tensor<8x128xf16> 2026-02-21T09:11:43.6440438Z %69 = arith.extf %68 : tensor<8x128xf16> to tensor<8x128xf32> 2026-02-21T09:11:43.6440672Z %70 = "tt.reduce"(%69) <{axis = 1 : i32}> ({ 2026-02-21T09:11:43.6440859Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:11:43.6441045Z %86 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:11:43.6441242Z tt.reduce.return %86 : f32 2026-02-21T09:11:43.6441424Z }) : (tensor<8x128xf32>) -> tensor<8xf32> 2026-02-21T09:11:43.6441696Z %71 = arith.truncf %70 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:11:43.6441936Z %72 = arith.extf %71 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:11:43.6442163Z %73 = arith.cmpf ogt, %56, %72 : tensor<8xf32> 2026-02-21T09:11:43.6442376Z %74 = arith.cmpf une, %56, %56 : tensor<8xf32> 2026-02-21T09:11:43.6442585Z %75 = arith.ori %73, %74 : tensor<8xi1> 2026-02-21T09:11:43.6442807Z %76 = arith.select %75, %56, %72 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:11:43.6443039Z 
%77 = arith.subf %56, %76 : tensor<8xf32> 2026-02-21T09:11:43.6443392Z %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:11:43.6443743Z %79 = arith.mulf %65, %78 : tensor<8xf32> 2026-02-21T09:11:43.6443995Z %80 = tt.expand_dims %76 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:11:43.6444277Z %81 = tt.broadcast %80 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T09:11:43.6444516Z %82 = arith.subf %69, %81 : tensor<8x128xf32> 2026-02-21T09:11:43.6444880Z %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32> 2026-02-21T09:11:43.6445242Z %84 = "tt.reduce"(%83) <{axis = 1 : i32}> ({ 2026-02-21T09:11:43.6445440Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:11:43.6445619Z %86 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:11:43.6445806Z tt.reduce.return %86 : f32 2026-02-21T09:11:43.6445984Z }) : (tensor<8x128xf32>) -> tensor<8xf32> 2026-02-21T09:11:43.6446209Z %85 = arith.addf %79, %84 : tensor<8xf32> 2026-02-21T09:11:43.6446418Z scf.yield %76, %85 : tensor<8xf32>, tensor<8xf32> 2026-02-21T09:11:43.6446636Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:11:43.6446928Z %7 = tt.descriptor_load %0[%2, %c768_i32] : !tt.tensordesc> -> tensor<8x128xf16> 2026-02-21T09:11:43.6447244Z %8 = arith.extf %7 : tensor<8x128xf16> to tensor<8x128xf32> 2026-02-21T09:11:43.6447476Z %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({ 2026-02-21T09:11:43.6447663Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:11:43.6447853Z %48 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:11:43.6448117Z tt.reduce.return %48 : f32 2026-02-21T09:11:43.6448305Z }) : (tensor<8x128xf32>) -> tensor<8xf32> 2026-02-21T09:11:43.6448518Z %10 = arith.truncf %9 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:11:43.6448758Z %11 = arith.extf %10 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:11:43.6448989Z %12 = arith.cmpf ogt, 
%6#0, %11 : tensor<8xf32> 2026-02-21T09:11:43.6449195Z %13 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32> 2026-02-21T09:11:43.6449397Z %14 = arith.ori %12, %13 : tensor<8xi1> 2026-02-21T09:11:43.6449613Z %15 = arith.select %14, %6#0, %11 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:11:43.6449844Z %16 = arith.subf %6#0, %15 : tensor<8xf32> 2026-02-21T09:11:43.6450184Z %17 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:11:43.6450599Z %18 = arith.mulf %6#1, %17 : tensor<8xf32> 2026-02-21T09:11:43.6450853Z %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:11:43.6451132Z %20 = tt.broadcast %19 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T09:11:43.6451366Z %21 = arith.subf %8, %20 : tensor<8x128xf32> 2026-02-21T09:11:43.6451747Z %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32> 2026-02-21T09:11:43.6452109Z %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({ 2026-02-21T09:11:43.6452303Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:11:43.6452475Z %48 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:11:43.6452661Z tt.reduce.return %48 : f32 2026-02-21T09:11:43.6452839Z }) : (tensor<8x128xf32>) -> tensor<8xf32> 2026-02-21T09:11:43.6453034Z %24 = arith.addf %18, %23 : tensor<8xf32> 2026-02-21T09:11:43.6453220Z %c768_i32_2 = arith.constant 768 : i32 2026-02-21T09:11:43.6453413Z %c256_i32_3 = arith.constant 256 : i32 2026-02-21T09:11:43.6453638Z scf.for %arg3 = %c0_i32 to %c768_i32_2 step %c256_i32_3 : i32 { 2026-02-21T09:11:43.6453926Z %48 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:11:43.6454188Z %49 = tt.splat %arg3 : i32 -> tensor<128xi32> 2026-02-21T09:11:43.6454393Z %50 = arith.addi %49, %48 : tensor<128xi32> 2026-02-21T09:11:43.6454650Z %51 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 
2026-02-21T09:11:43.6454906Z %52 = arith.muli %51, %cst : tensor<8x1xi32> 2026-02-21T09:11:43.6455167Z %53 = tt.expand_dims %50 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:11:43.6455457Z %54 = tt.broadcast %52 : tensor<8x1xi32> -> tensor<8x128xi32> 2026-02-21T09:11:43.6455724Z %55 = tt.broadcast %53 : tensor<1x128xi32> -> tensor<8x128xi32> 2026-02-21T09:11:43.6455965Z %56 = arith.addi %54, %55 : tensor<8x128xi32> 2026-02-21T09:11:43.6456203Z %57 = tt.splat %arg0 : !tt.ptr -> tensor<8x128x!tt.ptr> 2026-02-21T09:11:43.6456487Z %58 = tt.addptr %57, %56 : tensor<8x128x!tt.ptr>, tensor<8x128xi32> 2026-02-21T09:11:43.6456735Z %59 = tt.load %58 : tensor<8x128x!tt.ptr> 2026-02-21T09:11:43.6456995Z %60 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:11:43.6457286Z %61 = arith.extf %59 : tensor<8x128xf16> to tensor<8x128xf32> 2026-02-21T09:11:43.6457536Z %62 = tt.broadcast %60 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T09:11:43.6457772Z %63 = arith.subf %61, %62 : tensor<8x128xf32> 2026-02-21T09:11:43.6458130Z %64 = tt.extern_elementwise %63 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32> 2026-02-21T09:11:43.6458553Z %65 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:11:43.6458841Z %66 = tt.broadcast %65 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T09:11:43.6459129Z %67 = arith.divf %64, %66 : tensor<8x128xf32> 2026-02-21T09:11:43.6459361Z %68 = arith.truncf %67 : tensor<8x128xf32> to tensor<8x128xf16> 2026-02-21T09:11:43.6459624Z %69 = tt.splat %arg1 : !tt.ptr -> tensor<8x128x!tt.ptr> 2026-02-21T09:11:43.6459897Z %70 = tt.addptr %69, %56 : tensor<8x128x!tt.ptr>, tensor<8x128xi32> 2026-02-21T09:11:43.6460143Z tt.store %70, %68 : tensor<8x128x!tt.ptr> 2026-02-21T09:11:43.6460347Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:11:43.6460533Z %71 = arith.muli %c128_i32, %c1_i32 : i32 
2026-02-21T09:11:43.6460732Z %72 = arith.addi %arg3, %71 : i32 2026-02-21T09:11:43.6460965Z %73 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:11:43.6461204Z %74 = tt.splat %72 : i32 -> tensor<128xi32> 2026-02-21T09:11:43.6461406Z %75 = arith.addi %74, %73 : tensor<128xi32> 2026-02-21T09:11:43.6461756Z %76 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:11:43.6462024Z %77 = arith.muli %76, %cst : tensor<8x1xi32> 2026-02-21T09:11:43.6462275Z %78 = tt.expand_dims %75 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:11:43.6462568Z %79 = tt.broadcast %77 : tensor<8x1xi32> -> tensor<8x128xi32> 2026-02-21T09:11:43.6462831Z %80 = tt.broadcast %78 : tensor<1x128xi32> -> tensor<8x128xi32> 2026-02-21T09:11:43.6463061Z %81 = arith.addi %79, %80 : tensor<8x128xi32> 2026-02-21T09:11:43.6463295Z %82 = tt.splat %arg0 : !tt.ptr -> tensor<8x128x!tt.ptr> 2026-02-21T09:11:43.6463567Z %83 = tt.addptr %82, %81 : tensor<8x128x!tt.ptr>, tensor<8x128xi32> 2026-02-21T09:11:43.6463819Z %84 = tt.load %83 : tensor<8x128x!tt.ptr> 2026-02-21T09:11:43.6464073Z %85 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:11:43.6464348Z %86 = arith.extf %84 : tensor<8x128xf16> to tensor<8x128xf32> 2026-02-21T09:11:43.6464609Z %87 = tt.broadcast %85 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T09:11:43.6464835Z %88 = arith.subf %86, %87 : tensor<8x128xf32> 2026-02-21T09:11:43.6465200Z %89 = tt.extern_elementwise %88 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32> 2026-02-21T09:11:43.6465598Z %90 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:11:43.6465878Z %91 = tt.broadcast %90 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T09:11:43.6466107Z %92 = arith.divf %89, %91 : tensor<8x128xf32> 2026-02-21T09:11:43.6466333Z %93 = arith.truncf %92 : tensor<8x128xf32> to 
tensor<8x128xf16> 2026-02-21T09:11:43.6466601Z %94 = tt.splat %arg1 : !tt.ptr -> tensor<8x128x!tt.ptr> 2026-02-21T09:11:43.6466868Z %95 = tt.addptr %94, %81 : tensor<8x128x!tt.ptr>, tensor<8x128xi32> 2026-02-21T09:11:43.6467124Z tt.store %95, %93 : tensor<8x128x!tt.ptr> 2026-02-21T09:11:43.6467335Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:11:43.6467587Z %25 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:11:43.6467862Z %26 = tt.splat %c768_i32_2 : i32 -> tensor<128xi32> 2026-02-21T09:11:43.6468078Z %27 = arith.addi %26, %25 : tensor<128xi32> 2026-02-21T09:11:43.6468335Z %28 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:11:43.6468596Z %29 = arith.muli %28, %cst : tensor<8x1xi32> 2026-02-21T09:11:43.6468862Z %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:11:43.6469158Z %31 = tt.broadcast %29 : tensor<8x1xi32> -> tensor<8x128xi32> 2026-02-21T09:11:43.6469434Z %32 = tt.broadcast %30 : tensor<1x128xi32> -> tensor<8x128xi32> 2026-02-21T09:11:43.6469677Z %33 = arith.addi %31, %32 : tensor<8x128xi32> 2026-02-21T09:11:43.6470009Z %34 = tt.splat %arg0 : !tt.ptr -> tensor<8x128x!tt.ptr> 2026-02-21T09:11:43.6470317Z %35 = tt.addptr %34, %33 : tensor<8x128x!tt.ptr>, tensor<8x128xi32> 2026-02-21T09:11:43.6470587Z %36 = tt.load %35 : tensor<8x128x!tt.ptr> 2026-02-21T09:11:43.6470845Z %37 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:11:43.6471143Z %38 = arith.extf %36 : tensor<8x128xf16> to tensor<8x128xf32> 2026-02-21T09:11:43.6471405Z %39 = tt.broadcast %37 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T09:11:43.6471689Z %40 = arith.subf %38, %39 : tensor<8x128xf32> 2026-02-21T09:11:43.6472070Z %41 = tt.extern_elementwise %40 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32> 2026-02-21T09:11:43.6472509Z %42 = tt.expand_dims %24 {axis = 1 : i32} 
: tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:11:43.6472859Z %43 = tt.broadcast %42 : tensor<8x1xf32> -> tensor<8x128xf32> 2026-02-21T09:11:43.6473096Z %44 = arith.divf %41, %43 : tensor<8x128xf32> 2026-02-21T09:11:43.6473335Z %45 = arith.truncf %44 : tensor<8x128xf32> to tensor<8x128xf16> 2026-02-21T09:11:43.6473605Z %46 = tt.splat %arg1 : !tt.ptr -> tensor<8x128x!tt.ptr> 2026-02-21T09:11:43.6473894Z %47 = tt.addptr %46, %33 : tensor<8x128x!tt.ptr>, tensor<8x128xi32> 2026-02-21T09:11:43.6474166Z tt.store %47, %45 : tensor<8x128x!tt.ptr> 2026-02-21T09:11:43.6474393Z } {tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T09:11:43.6474601Z tt.return 2026-02-21T09:11:43.6474731Z } 2026-02-21T09:11:43.6474865Z } 2026-02-21T09:11:43.6474937Z 2026-02-21T09:11:43.6474988Z {-# 2026-02-21T09:11:43.6475129Z external_resources: { 2026-02-21T09:11:43.6475291Z mlir_reproducer: { 2026-02-21T09:11:43.6479649Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ 
max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:11:43.6484129Z disable_threading: false, 2026-02-21T09:11:43.6484297Z verify_each: true 2026-02-21T09:11:43.6484447Z } 2026-02-21T09:11:43.6484571Z } 2026-02-21T09:11:43.6484684Z #-} 2026-02-21T09:11:43.6485113Z /tmp/torchinductor_root/sn/csnj4e6rue3cqjy3o56b4y3yibltaqdjhwbfymxbfefijdc47pey.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:11:43.6486341Z /tmp/torchinductor_root/sn/csnj4e6rue3cqjy3o56b4y3yibltaqdjhwbfymxbfefijdc47pey.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:11:43.6487329Z [45s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:11:43.6488391Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', ''], maxnreg=128, num_sm_multiplier=8, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[3, 2], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:11:43.6489379Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:11:43.6489644Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:11:48.9920136Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 106/106 16.4 configs/s 2026-02-21T09:11:52.6177046Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 297.0 2026-02-21T09:11:52.6180659Z configs/s 2026-02-21T09:11:52.9190566Z [54s] Generation 1 complete: 2026-02-21T09:11:52.9194894Z error=2 2026-02-21T09:11:52.9195106Z ok=105 2026-02-21T09:11:52.9199249Z min=0.0104 2026-02-21T09:11:52.9204521Z mid=0.0184 2026-02-21T09:11:52.9209065Z max=0.0799 2026-02-21T09:11:52.9210488Z best={'block_sizes': [4, 512], 2026-02-21T09:11:52.9210759Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:11:52.9211019Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T09:11:52.9211214Z 'maxnreg': 128, 2026-02-21T09:11:52.9211380Z 'num_sm_multiplier': 8, 2026-02-21T09:11:52.9211791Z 'num_stages': 3, 2026-02-21T09:11:52.9211967Z 'num_warps': 1, 2026-02-21T09:11:52.9212131Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:11:52.9212342Z 'range_flattens': [True, True], 2026-02-21T09:11:52.9212534Z 'range_multi_buffers': [True, True], 2026-02-21T09:11:52.9212723Z 'range_num_stages': [4, 2], 2026-02-21T09:11:52.9212902Z 'range_unroll_factors': [1, 2], 2026-02-21T09:11:52.9213089Z 'range_warp_specializes': [True, None]} 2026-02-21T09:11:52.9213320Z [54s] Fitting surrogate: 207 points, 207 targets 
2026-02-21T09:11:54.0393306Z [55s] Generation 2 starting: 87 neighbors, 5 active search path(s) 2026-02-21T09:11:58.5828449Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91/91 11.2 configs/s 2026-02-21T09:12:04.1852872Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 91/91 16.4 configs/s 2026-02-21T09:12:08.5305197Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 247.5 2026-02-21T09:12:08.5309228Z configs/s 2026-02-21T09:12:08.9029463Z [70s] Generation 2 complete: 2026-02-21T09:12:08.9033722Z ok=93 2026-02-21T09:12:08.9037605Z min=0.0102 2026-02-21T09:12:08.9042052Z mid=0.0143 2026-02-21T09:12:08.9046600Z max=0.0409 2026-02-21T09:12:08.9051758Z best={'block_sizes': [4, 1024], 2026-02-21T09:12:08.9056038Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:12:08.9057555Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:12:08.9057789Z 'maxnreg': 128, 2026-02-21T09:12:08.9058030Z 'num_sm_multiplier': 16, 2026-02-21T09:12:08.9058202Z 'num_stages': 3, 2026-02-21T09:12:08.9058342Z 'num_warps': 4, 2026-02-21T09:12:08.9058508Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:12:08.9058733Z 'range_flattens': [True, True], 2026-02-21T09:12:08.9062809Z 'range_multi_buffers': [True, True], 2026-02-21T09:12:08.9064215Z 'range_num_stages': [4, 2], 2026-02-21T09:12:08.9064441Z 'range_unroll_factors': [1, 2], 2026-02-21T09:12:08.9064654Z 'range_warp_specializes': [True, None]} 2026-02-21T09:12:08.9065305Z [70s] Fitting surrogate: 300 points, 300 targets 2026-02-21T09:12:10.0837659Z [71s] Generation 3 starting: 89 neighbors, 5 active search path(s) 2026-02-21T09:12:13.9430178Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 93/93 30.9 configs/s 2026-02-21T09:12:19.7871984Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 93/93 16.0 configs/s 2026-02-21T09:12:23.8160127Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 253.9 2026-02-21T09:12:23.8160665Z configs/s 
2026-02-21T09:12:24.2108224Z [85s] Generation 3 complete: 2026-02-21T09:12:24.2109894Z error=2 2026-02-21T09:12:24.2110054Z ok=93 2026-02-21T09:12:24.2110177Z min=0.0102 2026-02-21T09:12:24.2110311Z mid=0.0123 2026-02-21T09:12:24.2110431Z max=0.0389 2026-02-21T09:12:24.2110576Z best={'block_sizes': [1, 1024], 2026-02-21T09:12:24.2111242Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:12:24.2111833Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:12:24.2112040Z 'num_stages': 2, 2026-02-21T09:12:24.2112180Z 'num_warps': 1, 2026-02-21T09:12:24.2112331Z 'pid_type': 'flat', 2026-02-21T09:12:24.2112486Z 'range_flattens': [None, True], 2026-02-21T09:12:24.2112671Z 'range_multi_buffers': [None, False], 2026-02-21T09:12:24.2112853Z 'range_num_stages': [0, 2], 2026-02-21T09:12:24.2113027Z 'range_unroll_factors': [0, 4], 2026-02-21T09:12:24.2113211Z 'range_warp_specializes': [None, None]} 2026-02-21T09:12:24.2124843Z [85s] Fitting surrogate: 395 points, 395 targets 2026-02-21T09:12:25.3177933Z [87s] Generation 4 starting: 76 neighbors, 5 active search path(s) 2026-02-21T09:12:28.6727758Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 57.7 configs/s 2026-02-21T09:12:33.5182732Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 16.2 configs/s 2026-02-21T09:12:37.9187306Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 243.9 2026-02-21T09:12:37.9188794Z configs/s 2026-02-21T09:12:38.3142675Z [100s] Generation 4 complete: 2026-02-21T09:12:38.3147044Z ok=81 2026-02-21T09:12:38.3150237Z min=0.0102 2026-02-21T09:12:38.3154231Z mid=0.0112 2026-02-21T09:12:38.3156228Z max=0.0307 2026-02-21T09:12:38.3156403Z best={'block_sizes': [1, 1024], 2026-02-21T09:12:38.3156689Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:12:38.3156977Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:12:38.3157186Z 'num_stages': 2, 
2026-02-21T09:12:38.3157347Z 'num_warps': 1, 2026-02-21T09:12:38.3157489Z 'pid_type': 'flat', 2026-02-21T09:12:38.3157654Z 'range_flattens': [None, True], 2026-02-21T09:12:38.3157834Z 'range_multi_buffers': [None, False], 2026-02-21T09:12:38.3158026Z 'range_num_stages': [0, 2], 2026-02-21T09:12:38.3158215Z 'range_unroll_factors': [0, 4], 2026-02-21T09:12:38.3158414Z 'range_warp_specializes': [None, None]} 2026-02-21T09:12:38.3162547Z [100s] Fitting surrogate: 476 points, 476 targets 2026-02-21T09:12:39.1232639Z [100s] Generation 5 starting: 52 neighbors, 4 active search path(s) 2026-02-21T09:12:41.7201291Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 19.7 configs/s 2026-02-21T09:12:45.0027032Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 16.4 configs/s 2026-02-21T09:12:47.7658801Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 370.2 2026-02-21T09:12:47.7662848Z configs/s 2026-02-21T09:12:48.0314445Z [109s] Generation 5 complete: 2026-02-21T09:12:48.0318758Z ok=56 2026-02-21T09:12:48.0322196Z min=0.0083 2026-02-21T09:12:48.0326615Z mid=0.0102 2026-02-21T09:12:48.0331816Z max=0.0512 2026-02-21T09:12:48.0336221Z best={'block_sizes': [1, 1024], 2026-02-21T09:12:48.0338240Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:12:48.0338920Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:12:48.0339111Z 'num_stages': 2, 2026-02-21T09:12:48.0339261Z 'num_warps': 1, 2026-02-21T09:12:48.0339399Z 'pid_type': 'flat', 2026-02-21T09:12:48.0339561Z 'range_flattens': [None, True], 2026-02-21T09:12:48.0339746Z 'range_multi_buffers': [None, False], 2026-02-21T09:12:48.0339930Z 'range_num_stages': [0, 2], 2026-02-21T09:12:48.0340101Z 'range_unroll_factors': [0, 4], 2026-02-21T09:12:48.0340275Z 'range_warp_specializes': [None, None]} 2026-02-21T09:12:48.0340499Z [109s] Fitting surrogate: 532 points, 532 targets 2026-02-21T09:12:48.7523820Z [110s] Generation 6 starting: 41 neighbors, 3 
active search path(s) 2026-02-21T09:12:50.8437112Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41/41 32.2 configs/s 2026-02-21T09:12:53.3897048Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 16.4 configs/s 2026-02-21T09:12:55.4399811Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 498.2 2026-02-21T09:12:55.4403878Z configs/s 2026-02-21T09:12:55.6362953Z [117s] Generation 6 complete: 2026-02-21T09:12:55.6364350Z ok=44 2026-02-21T09:12:55.6364523Z min=0.0102 2026-02-21T09:12:55.6364671Z mid=0.0102 2026-02-21T09:12:55.6364803Z max=0.0164 2026-02-21T09:12:55.6364963Z best={'block_sizes': [1, 1024], 2026-02-21T09:12:55.6365226Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:12:55.6365508Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:12:55.6365704Z 'num_stages': 2, 2026-02-21T09:12:55.6365856Z 'num_warps': 1, 2026-02-21T09:12:55.6366008Z 'pid_type': 'flat', 2026-02-21T09:12:55.6366170Z 'range_flattens': [None, True], 2026-02-21T09:12:55.6366359Z 'range_multi_buffers': [None, False], 2026-02-21T09:12:55.6366549Z 'range_num_stages': [0, 2], 2026-02-21T09:12:55.6366727Z 'range_unroll_factors': [0, 4], 2026-02-21T09:12:55.6366941Z 'range_warp_specializes': [None, None]} 2026-02-21T09:12:55.6382142Z [117s] Fitting surrogate: 576 points, 576 targets 2026-02-21T09:12:56.2326676Z [117s] Generation 7 starting: 27 neighbors, 2 active search path(s) 2026-02-21T09:12:57.4996463Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 42.1 configs/s 2026-02-21T09:12:59.1798977Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 27/27 16.5 configs/s 2026-02-21T09:13:00.6538329Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 691.2 2026-02-21T09:13:00.6540560Z configs/s 2026-02-21T09:13:00.7970951Z [122s] Generation 7 complete: 2026-02-21T09:13:00.7974125Z ok=30 2026-02-21T09:13:00.7978487Z min=0.0092 2026-02-21T09:13:00.7982111Z mid=0.0102 
2026-02-21T09:13:00.7986984Z max=0.0164 2026-02-21T09:13:00.7989121Z best={'block_sizes': [1, 1024], 2026-02-21T09:13:00.7989433Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:13:00.7990178Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:13:00.7990389Z 'num_stages': 2, 2026-02-21T09:13:00.7991502Z 'num_warps': 1, 2026-02-21T09:13:00.7991725Z 'pid_type': 'flat', 2026-02-21T09:13:00.7991927Z 'range_flattens': [None, True], 2026-02-21T09:13:00.7992149Z 'range_multi_buffers': [None, False], 2026-02-21T09:13:00.7992345Z 'range_num_stages': [0, 2], 2026-02-21T09:13:00.7992519Z 'range_unroll_factors': [0, 4], 2026-02-21T09:13:00.7992723Z 'range_warp_specializes': [None, None]} 2026-02-21T09:13:00.7994837Z [122s] Fitting surrogate: 606 points, 606 targets 2026-02-21T09:13:01.3496875Z [123s] Generation 8 starting: 23 neighbors, 2 active search path(s) 2026-02-21T09:13:02.6586338Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 46.4 configs/s 2026-02-21T09:13:04.0751126Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 16.7 configs/s 2026-02-21T09:13:05.3883864Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 775.5 2026-02-21T09:13:05.3885234Z configs/s 2026-02-21T09:13:05.5173879Z [127s] Generation 8 complete: 2026-02-21T09:13:05.5178276Z ok=26 2026-02-21T09:13:05.5179641Z min=0.0102 2026-02-21T09:13:05.5179814Z mid=0.0102 2026-02-21T09:13:05.5179954Z max=0.0143 2026-02-21T09:13:05.5180096Z best={'block_sizes': [1, 1024], 2026-02-21T09:13:05.5180356Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:13:05.5180620Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:13:05.5180819Z 'num_stages': 2, 2026-02-21T09:13:05.5180957Z 'num_warps': 1, 2026-02-21T09:13:05.5181102Z 'pid_type': 'flat', 2026-02-21T09:13:05.5181254Z 'range_flattens': [None, True], 2026-02-21T09:13:05.5181437Z 'range_multi_buffers': [None, False], 
2026-02-21T09:13:05.5181861Z 'range_num_stages': [0, 2], 2026-02-21T09:13:05.5182035Z 'range_unroll_factors': [0, 4], 2026-02-21T09:13:05.5182238Z 'range_warp_specializes': [None, None]} 2026-02-21T09:13:05.5194030Z [127s] Fitting surrogate: 632 points, 632 targets 2026-02-21T09:13:05.8070723Z [127s] Autotuning complete in 127.5s after searching 611 configs. 2026-02-21T09:13:05.8071036Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:13:05.8072173Z @helion.kernel(config=helion.Config(block_sizes=[1, 1024], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:13:05.8072997Z 2026-02-21T09:13:05.8073249Z [127s] Code of selected kernel: /tmp/torchinductor_root/sr/csrpjvbpym57p7lvzdhfstpheoqr7wm4hgy2bqbykun7hp3dddxd.py 2026-02-21T09:13:05.8298339Z from __future__ import annotations 2026-02-21T09:13:05.8301279Z 2026-02-21T09:13:05.8305591Z import torch 2026-02-21T09:13:05.8310749Z import triton 2026-02-21T09:13:05.8312433Z import triton.language as tl 2026-02-21T09:13:05.8312692Z from torch._inductor.runtime import triton_helpers 2026-02-21T09:13:05.8312961Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T09:13:05.8313254Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:13:05.8313429Z 2026-02-21T09:13:05.8313508Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T09:13:05.8313687Z _BLOCK_SIZE_1 = tl.constexpr(1024) 2026-02-21T09:13:05.8313800Z 2026-02-21T09:13:05.8313865Z @triton.jit 2026-02-21T09:13:05.8314008Z def _helion_softmax_two_pass(x, out): 2026-02-21T09:13:05.8314264Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:13:05.8314510Z pid_0 = tl.program_id(0) 
2026-02-21T09:13:05.8314678Z offset_0 = pid_0 2026-02-21T09:13:05.8314849Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T09:13:05.8315143Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:13:05.8315778Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T09:13:05.8316044Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:13:05.8316307Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T09:13:05.8316559Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:13:05.8316835Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:13:05.8317096Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:13:05.8317323Z # src[softmax.py:82-89]: ... 2026-02-21T09:13:05.8317689Z for offset_2 in tl.range(0, 896, _BLOCK_SIZE_1, loop_unroll_factor=4, num_stages=1, disallow_acc_multi_buffer=True, flatten=True): 2026-02-21T09:13:05.8318097Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:13:05.8318338Z mask_1 = indices_2 < 896 2026-02-21T09:13:05.8318612Z mi_copy = mi 2026-02-21T09:13:05.8318776Z di_copy = di 2026-02-21T09:13:05.8318924Z mi_copy_0 = mi_copy 2026-02-21T09:13:05.8319101Z di_copy_0 = di_copy 2026-02-21T09:13:05.8319297Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:13:05.8319667Z values = tl.load(x + (indices_0[:, None] * 896 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first') 2026-02-21T09:13:05.8320064Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:13:05.8320466Z _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16)) 2026-02-21T09:13:05.8320859Z local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16) 2026-02-21T09:13:05.8321119Z # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax) 
2026-02-21T09:13:05.8321353Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T09:13:05.8321663Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T09:13:05.8321924Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:13:05.8322168Z v_2 = mi_copy_0 - v_1 2026-02-21T09:13:05.8322336Z v_3 = libdevice.exp(v_2) 2026-02-21T09:13:05.8322507Z v_4 = di_copy_0 * v_3 2026-02-21T09:13:05.8322689Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:13:05.8322915Z subscript = v_1[:, None] 2026-02-21T09:13:05.8323093Z v_5 = tl.cast(values, tl.float32) 2026-02-21T09:13:05.8323270Z v_6 = v_5 - subscript 2026-02-21T09:13:05.8323485Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:13:05.8323742Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:13:05.8323961Z # src[softmax.py:88]: ).sum(dim=1) 2026-02-21T09:13:05.8324150Z v_7 = libdevice.exp(v_6) 2026-02-21T09:13:05.8324467Z _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32)) 2026-02-21T09:13:05.8324827Z sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32) 2026-02-21T09:13:05.8325019Z di = v_4 + sum_1 2026-02-21T09:13:05.8325180Z # src[softmax.py:89]: mi = mi_next 2026-02-21T09:13:05.8325351Z mi = v_1 2026-02-21T09:13:05.8325555Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:13:05.8325823Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:13:05.8326109Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:13:05.8326555Z for offset_2 in tl.range(0, 896, _BLOCK_SIZE_1, loop_unroll_factor=4, num_stages=1, disallow_acc_multi_buffer=True, flatten=True): 2026-02-21T09:13:05.8326958Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:13:05.8327191Z mask_2 = indices_2 < 896 2026-02-21T09:13:05.8327352Z mi_copy_1 = mi 2026-02-21T09:13:05.8327584Z di_copy_1 
= di 2026-02-21T09:13:05.8327737Z mi_copy_1_0 = mi_copy_1 2026-02-21T09:13:05.8327899Z di_copy_1_0 = di_copy_1 2026-02-21T09:13:05.8328088Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:13:05.8328445Z values_1 = tl.load(x + (indices_0[:, None] * 896 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first') 2026-02-21T09:13:05.8328877Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:13:05.8329149Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T09:13:05.8329345Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T09:13:05.8329531Z v_10 = v_9 - subscript_1 2026-02-21T09:13:05.8329700Z v_11 = libdevice.exp(v_10) 2026-02-21T09:13:05.8329883Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T09:13:05.8330062Z v_12 = v_11 / subscript_2 2026-02-21T09:13:05.8330307Z v_13 = tl.cast(v_12, tl.float16) 2026-02-21T09:13:05.8330579Z tl.store(out + (indices_0[:, None] * 896 + indices_2[None, :] * 1), v_13, mask_2[None, :]) 2026-02-21T09:13:05.8330801Z 2026-02-21T09:13:05.8330928Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T09:13:05.8331165Z """ 2026-02-21T09:13:05.8331365Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T09:13:05.8331715Z This version uses fewer passes but is less numerically stable. 2026-02-21T09:13:05.8331931Z Args: 2026-02-21T09:13:05.8332096Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T09:13:05.8332289Z Returns: 2026-02-21T09:13:05.8332476Z torch.Tensor: Softmax output tensor of the same shape. 
2026-02-21T09:13:05.8332689Z """ 2026-02-21T09:13:05.8332835Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T09:13:05.8333019Z m, n = x.size() 2026-02-21T09:13:05.8333187Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T09:13:05.8333417Z out = torch.empty_like(x) 2026-02-21T09:13:05.8333655Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:13:05.8333990Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:13:05.8334316Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:13:05.8334567Z # src[softmax.py:79-92]: ... 2026-02-21T09:13:05.8334828Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=1, num_stages=2) 2026-02-21T09:13:05.8335119Z # src[softmax.py:93]: return out 2026-02-21T09:13:05.8335302Z return out 2026-02-21T09:13:06.5882130Z WARNING:tritonbench.utils.triton_op:Completed input ID 5: 2026-02-21T09:13:06.5883875Z (M, N) 2026-02-21T09:13:06.5884048Z ----------- 2026-02-21T09:13:06.5884181Z (4096, 896) 2026-02-21T09:13:06.5884256Z 2026-02-21T09:13:06.5889110Z 10%|█ | 2/20 [04:11<38:07, 127.09s/it]WARNING:tritonbench.utils.triton_op:Running input ID 10: 2026-02-21T09:13:06.5893302Z (M, N) 2026-02-21T09:13:06.5898325Z ------------ 2026-02-21T09:13:06.5902707Z (4096, 1536) 2026-02-21T09:13:06.5907280Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T09:13:08.0589372Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T09:13:09.5184751Z INFO:tritonbench.utils.triton_op:Took 2.16ms to get benchmark function for torch_compile_softmax 2026-02-21T09:13:10.6219436Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:13:10.6221430Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:13:10.6221849Z 'dtype': 'torch.float16', 2026-02-21T09:13:10.6222037Z 'shape': (4096, 1536), 2026-02-21T09:13:10.6222216Z 'stride': (1536, 1)},), 
2026-02-21T09:13:10.6222389Z 'kwargs': {}} 2026-02-21T09:13:10.6236824Z INFO:tritonbench.utils.triton_op:Took 1.92ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T09:13:10.7955389Z [0s] Autotune random seed: 2138408546 2026-02-21T09:13:10.8200863Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:13:43.0809084Z [32s] Timeout after 30s compiling Config(block_sizes=[512, 1024], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T09:13:43.5386981Z [32s] Timeout after 30s compiling Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T09:13:43.5396967Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.0 configs/s 2026-02-21T09:13:49.4175661Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 17.1 configs/s 2026-02-21T09:13:49.4186336Z [38s] Adaptive compile timeout: 30s (90% percentile=2.8s, bounds=[30.0s, 30s]) 2026-02-21T09:13:50.1077990Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1448.2 configs/s 2026-02-21T09:13:50.1688244Z [39s] Initial random population of 100, 5 starting points: 2026-02-21T09:13:50.1692580Z error=7 2026-02-21T09:13:50.1693855Z timeout=2 2026-02-21T09:13:50.1694017Z ok=91 2026-02-21T09:13:50.1694141Z min=0.0225 2026-02-21T09:13:50.1694275Z mid=0.1536 2026-02-21T09:13:50.1694397Z 
max=35.8574 2026-02-21T09:13:50.1694547Z best={'block_sizes': [32, 512], 2026-02-21T09:13:50.1694798Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:13:50.1695058Z 'load_eviction_policies': ['', 'first'], 2026-02-21T09:13:50.1695276Z 'num_stages': 3, 2026-02-21T09:13:50.1695432Z 'num_warps': 32, 2026-02-21T09:13:50.1695577Z 'pid_type': 'flat', 2026-02-21T09:13:50.1695736Z 'range_flattens': [None, False], 2026-02-21T09:13:50.1695919Z 'range_multi_buffers': [None, None], 2026-02-21T09:13:50.1696099Z 'range_num_stages': [0, 1], 2026-02-21T09:13:50.1696270Z 'range_unroll_factors': [0, 1], 2026-02-21T09:13:50.1696445Z 'range_warp_specializes': [None, False]} 2026-02-21T09:13:50.1710710Z [39s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:13:51.4984015Z [40s] Generation 1 starting: 98 neighbors, 5 active search path(s) 2026-02-21T09:14:24.1054854Z [73s] Timeout after 30s compiling Config(block_sizes=[16, 512], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['first', 'last'], num_sm_multiplier=4, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, True], range_num_stages=[2, 1], range_unroll_factors=[4, 3], range_warp_specializes=[None, False]) 2026-02-21T09:14:24.1072366Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 102/102 0.6 configs/s 2026-02-21T09:14:30.0251486Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 102/102 17.4 configs/s 2026-02-21T09:14:31.7495919Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 588.8 2026-02-21T09:14:31.7499596Z configs/s 2026-02-21T09:14:31.8962250Z [81s] Generation 1 complete: 2026-02-21T09:14:31.8963960Z error=4 2026-02-21T09:14:31.8964130Z timeout=1 2026-02-21T09:14:31.8964261Z ok=99 2026-02-21T09:14:31.8964401Z min=0.0143 2026-02-21T09:14:31.8964531Z mid=0.0267 2026-02-21T09:14:31.8964667Z max=0.1085 2026-02-21T09:14:31.8964808Z best={'block_sizes': [16, 256], 
2026-02-21T09:14:31.8965024Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:14:31.8965236Z 'load_eviction_policies': ['', 'last'], 2026-02-21T09:14:31.8965423Z 'num_stages': 2, 2026-02-21T09:14:31.8965592Z 'num_warps': 16, 2026-02-21T09:14:31.8966065Z 'pid_type': 'flat', 2026-02-21T09:14:31.8966227Z 'range_flattens': [None, True], 2026-02-21T09:14:31.8966400Z 'range_multi_buffers': [None, True], 2026-02-21T09:14:31.8966588Z 'range_num_stages': [0, 1], 2026-02-21T09:14:31.8966753Z 'range_unroll_factors': [0, 3], 2026-02-21T09:14:31.8966939Z 'range_warp_specializes': [None, False]} 2026-02-21T09:14:31.8977718Z [81s] Fitting surrogate: 204 points, 204 targets 2026-02-21T09:14:33.0148034Z [82s] Generation 2 starting: 87 neighbors, 5 active search path(s) 2026-02-21T09:14:38.6470853Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 6.9 configs/s 2026-02-21T09:14:43.9114906Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 17.0 configs/s 2026-02-21T09:14:47.5768347Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 278.0 2026-02-21T09:14:47.5769859Z configs/s 2026-02-21T09:14:47.8686548Z [97s] Generation 2 complete: 2026-02-21T09:14:47.8690643Z error=3 2026-02-21T09:14:47.8695023Z ok=90 2026-02-21T09:14:47.8696488Z min=0.0143 2026-02-21T09:14:47.8696658Z mid=0.0205 2026-02-21T09:14:47.8696836Z max=0.0942 2026-02-21T09:14:47.8696988Z best={'block_sizes': [16, 256], 2026-02-21T09:14:47.8697229Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:14:47.8697452Z 'load_eviction_policies': ['', 'last'], 2026-02-21T09:14:47.8697646Z 'num_stages': 2, 2026-02-21T09:14:47.8697792Z 'num_warps': 16, 2026-02-21T09:14:47.8697942Z 'pid_type': 'flat', 2026-02-21T09:14:47.8698114Z 'range_flattens': [None, True], 2026-02-21T09:14:47.8698297Z 'range_multi_buffers': [None, True], 2026-02-21T09:14:47.8698493Z 'range_num_stages': [0, 1], 2026-02-21T09:14:47.8698662Z 'range_unroll_factors': [0, 3], 
2026-02-21T09:14:47.8698855Z 'range_warp_specializes': [None, False]} 2026-02-21T09:14:47.8701333Z [97s] Fitting surrogate: 297 points, 297 targets 2026-02-21T09:14:49.1202006Z [98s] Generation 3 starting: 88 neighbors, 5 active search path(s) 2026-02-21T09:14:53.7005438Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88/88 14.7 configs/s 2026-02-21T09:14:58.8533157Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 88/88 17.2 configs/s 2026-02-21T09:15:03.1168076Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 239.9 2026-02-21T09:15:03.1171288Z configs/s 2026-02-21T09:15:03.4606159Z [112s] Generation 3 complete: 2026-02-21T09:15:03.4607903Z error=4 2026-02-21T09:15:03.4608059Z ok=90 2026-02-21T09:15:03.4608185Z min=0.0143 2026-02-21T09:15:03.4608320Z mid=0.0184 2026-02-21T09:15:03.4608439Z max=0.0716 2026-02-21T09:15:03.4608586Z best={'block_sizes': [16, 256], 2026-02-21T09:15:03.4608793Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:15:03.4609009Z 'load_eviction_policies': ['', 'last'], 2026-02-21T09:15:03.4609184Z 'num_stages': 1, 2026-02-21T09:15:03.4609365Z 'num_warps': 16, 2026-02-21T09:15:03.4609515Z 'pid_type': 'flat', 2026-02-21T09:15:03.4609677Z 'range_flattens': [None, True], 2026-02-21T09:15:03.4609860Z 'range_multi_buffers': [None, None], 2026-02-21T09:15:03.4610040Z 'range_num_stages': [0, 1], 2026-02-21T09:15:03.4610207Z 'range_unroll_factors': [0, 3], 2026-02-21T09:15:03.4610383Z 'range_warp_specializes': [None, False]} 2026-02-21T09:15:03.4624735Z [112s] Fitting surrogate: 391 points, 391 targets 2026-02-21T09:15:04.4576758Z [113s] Generation 4 starting: 71 neighbors, 5 active search path(s) 2026-02-21T09:15:08.4866140Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 12.6 configs/s 2026-02-21T09:15:13.0659980Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 16.5 configs/s 2026-02-21T09:15:16.5543043Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 
1000/1000 292.5 2026-02-21T09:15:16.5543438Z configs/s 2026-02-21T09:15:16.8499882Z [126s] Generation 4 complete: 2026-02-21T09:15:16.8504019Z ok=77 2026-02-21T09:15:16.8505900Z min=0.0143 2026-02-21T09:15:16.8506103Z mid=0.0164 2026-02-21T09:15:16.8506243Z max=0.0389 2026-02-21T09:15:16.8506388Z best={'block_sizes': [32, 256], 2026-02-21T09:15:16.8506619Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:15:16.8506840Z 'load_eviction_policies': ['', 'first'], 2026-02-21T09:15:16.8507039Z 'num_stages': 4, 2026-02-21T09:15:16.8507185Z 'num_warps': 32, 2026-02-21T09:15:16.8507335Z 'pid_type': 'flat', 2026-02-21T09:15:16.8507501Z 'range_flattens': [None, False], 2026-02-21T09:15:16.8507698Z 'range_multi_buffers': [None, None], 2026-02-21T09:15:16.8507887Z 'range_num_stages': [0, 1], 2026-02-21T09:15:16.8508069Z 'range_unroll_factors': [0, 1], 2026-02-21T09:15:16.8508260Z 'range_warp_specializes': [None, False]} 2026-02-21T09:15:16.8514475Z [126s] Fitting surrogate: 468 points, 468 targets 2026-02-21T09:15:17.9311302Z [127s] Generation 5 starting: 69 neighbors, 5 active search path(s) 2026-02-21T09:15:22.1706902Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 11.5 configs/s 2026-02-21T09:15:26.4579173Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 16.5 configs/s 2026-02-21T09:15:28.4272699Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 517.6 2026-02-21T09:15:28.4276027Z configs/s 2026-02-21T09:15:28.6088532Z [137s] Generation 5 complete: 2026-02-21T09:15:28.6092801Z ok=74 2026-02-21T09:15:28.6096837Z min=0.0102 2026-02-21T09:15:28.6097078Z mid=0.0164 2026-02-21T09:15:28.6097232Z max=0.0471 2026-02-21T09:15:28.6097373Z best={'block_sizes': [1, 2048], 2026-02-21T09:15:28.6097630Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:15:28.6097901Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:15:28.6103361Z 'num_stages': 4, 2026-02-21T09:15:28.6105530Z 
'num_warps': 2, 2026-02-21T09:15:28.6105797Z 'pid_type': 'flat', 2026-02-21T09:15:28.6109686Z 'range_flattens': [None, True], 2026-02-21T09:15:28.6112869Z 'range_multi_buffers': [None, None], 2026-02-21T09:15:28.6116127Z 'range_num_stages': [0, 3], 2026-02-21T09:15:28.6121062Z 'range_unroll_factors': [0, 0], 2026-02-21T09:15:28.6125370Z 'range_warp_specializes': [None, False]} 2026-02-21T09:15:28.6129483Z [137s] Fitting surrogate: 542 points, 542 targets 2026-02-21T09:15:29.3717402Z [138s] Generation 6 starting: 39 neighbors, 3 active search path(s) 2026-02-21T09:15:31.5818662Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 20.4 configs/s 2026-02-21T09:15:33.9703356Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 39/39 16.6 configs/s 2026-02-21T09:15:35.7459081Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 574.5 2026-02-21T09:15:35.7463129Z configs/s 2026-02-21T09:15:35.9118752Z [145s] Generation 6 complete: 2026-02-21T09:15:35.9123035Z ok=42 2026-02-21T09:15:35.9124806Z min=0.0102 2026-02-21T09:15:35.9124974Z mid=0.0143 2026-02-21T09:15:35.9125097Z max=0.0205 2026-02-21T09:15:35.9125303Z best={'block_sizes': [1, 2048], 2026-02-21T09:15:35.9125555Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:15:35.9129012Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:15:35.9132258Z 'num_stages': 4, 2026-02-21T09:15:35.9136243Z 'num_warps': 2, 2026-02-21T09:15:35.9140195Z 'pid_type': 'flat', 2026-02-21T09:15:35.9140465Z 'range_flattens': [None, True], 2026-02-21T09:15:35.9140683Z 'range_multi_buffers': [None, None], 2026-02-21T09:15:35.9144675Z 'range_num_stages': [0, 3], 2026-02-21T09:15:35.9149062Z 'range_unroll_factors': [0, 0], 2026-02-21T09:15:35.9153396Z 'range_warp_specializes': [None, False]} 2026-02-21T09:15:35.9157794Z [145s] Fitting surrogate: 584 points, 584 targets 2026-02-21T09:15:36.5860340Z [145s] Generation 7 starting: 32 neighbors, 3 active search path(s) 
2026-02-21T09:15:38.4718160Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 35.4 configs/s 2026-02-21T09:15:40.4575886Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 32/32 16.5 configs/s 2026-02-21T09:15:41.8450439Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 733.2 2026-02-21T09:15:41.8453841Z configs/s 2026-02-21T09:15:41.9755496Z [151s] Generation 7 complete: 2026-02-21T09:15:41.9759756Z ok=35 2026-02-21T09:15:41.9764149Z min=0.0102 2026-02-21T09:15:41.9768476Z mid=0.0143 2026-02-21T09:15:41.9770468Z max=0.0245 2026-02-21T09:15:41.9770652Z best={'block_sizes': [1, 2048], 2026-02-21T09:15:41.9770906Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:15:41.9771181Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:15:41.9771383Z 'num_stages': 4, 2026-02-21T09:15:41.9771524Z 'num_warps': 2, 2026-02-21T09:15:41.9772148Z 'pid_type': 'flat', 2026-02-21T09:15:41.9772343Z 'range_flattens': [None, True], 2026-02-21T09:15:41.9772540Z 'range_multi_buffers': [None, None], 2026-02-21T09:15:41.9772727Z 'range_num_stages': [0, 3], 2026-02-21T09:15:41.9772905Z 'range_unroll_factors': [0, 0], 2026-02-21T09:15:41.9773087Z 'range_warp_specializes': [None, False]} 2026-02-21T09:15:41.9774448Z [151s] Fitting surrogate: 619 points, 619 targets 2026-02-21T09:15:42.5528535Z [151s] Generation 8 starting: 20 neighbors, 2 active search path(s) 2026-02-21T09:15:44.2934811Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 10.9 configs/s 2026-02-21T09:15:45.5174285Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 20/20 17.0 configs/s 2026-02-21T09:15:46.5777914Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 957.3 2026-02-21T09:15:46.5778261Z configs/s 2026-02-21T09:15:46.6769588Z [155s] Generation 8 complete: 2026-02-21T09:15:46.6771350Z ok=23 2026-02-21T09:15:46.6771765Z min=0.0102 2026-02-21T09:15:46.6771937Z mid=0.0143 2026-02-21T09:15:46.6772103Z 
max=0.0246 2026-02-21T09:15:46.6772273Z best={'block_sizes': [1, 2048], 2026-02-21T09:15:46.6772548Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:15:46.6772864Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:15:46.6773082Z 'num_stages': 4, 2026-02-21T09:15:46.6773247Z 'num_warps': 2, 2026-02-21T09:15:46.6773415Z 'pid_type': 'flat', 2026-02-21T09:15:46.6773603Z 'range_flattens': [None, True], 2026-02-21T09:15:46.6773814Z 'range_multi_buffers': [None, None], 2026-02-21T09:15:46.6774027Z 'range_num_stages': [0, 3], 2026-02-21T09:15:46.6774203Z 'range_unroll_factors': [0, 0], 2026-02-21T09:15:46.6774413Z 'range_warp_specializes': [None, False]} 2026-02-21T09:15:46.6787887Z [155s] Fitting surrogate: 642 points, 642 targets 2026-02-21T09:15:47.2108383Z [156s] Generation 9 starting: 22 neighbors, 2 active search path(s) 2026-02-21T09:15:48.4005950Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 40.1 configs/s 2026-02-21T09:15:49.7509363Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 22/22 16.8 configs/s 2026-02-21T09:15:50.9633808Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 837.8 2026-02-21T09:15:50.9639128Z configs/s 2026-02-21T09:15:51.0846715Z [160s] Generation 9 complete: 2026-02-21T09:15:51.0852308Z ok=25 2026-02-21T09:15:51.0857265Z min=0.0102 2026-02-21T09:15:51.0859660Z mid=0.0143 2026-02-21T09:15:51.0859872Z max=0.0184 2026-02-21T09:15:51.0860033Z best={'block_sizes': [1, 2048], 2026-02-21T09:15:51.0860325Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:15:51.0860610Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:15:51.0860824Z 'num_stages': 4, 2026-02-21T09:15:51.0860974Z 'num_warps': 2, 2026-02-21T09:15:51.0861133Z 'pid_type': 'flat', 2026-02-21T09:15:51.0861814Z 'range_flattens': [None, True], 2026-02-21T09:15:51.0862019Z 'range_multi_buffers': [None, None], 2026-02-21T09:15:51.0862228Z 'range_num_stages': [0, 
3], 2026-02-21T09:15:51.0862404Z 'range_unroll_factors': [0, 0], 2026-02-21T09:15:51.0862606Z 'range_warp_specializes': [None, False]} 2026-02-21T09:15:51.0866885Z [160s] Fitting surrogate: 667 points, 667 targets 2026-02-21T09:15:51.5686130Z [160s] Generation 10 starting: 11 neighbors, 1 active search path(s) 2026-02-21T09:15:52.6069573Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 23.5 configs/s 2026-02-21T09:15:53.2620033Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 18.1 configs/s 2026-02-21T09:15:53.4724157Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 4625.7 2026-02-21T09:15:53.4726114Z configs/s 2026-02-21T09:15:53.5022695Z [162s] Generation 10 complete: 2026-02-21T09:15:53.5024263Z ok=13 2026-02-21T09:15:53.5024534Z min=0.0102 2026-02-21T09:15:53.5029596Z mid=0.0184 2026-02-21T09:15:53.5031277Z max=0.0348 2026-02-21T09:15:53.5031507Z best={'block_sizes': [1, 2048], 2026-02-21T09:15:53.5032009Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:15:53.5032310Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:15:53.5032527Z 'num_stages': 4, 2026-02-21T09:15:53.5032663Z 'num_warps': 2, 2026-02-21T09:15:53.5032809Z 'pid_type': 'flat', 2026-02-21T09:15:53.5032962Z 'range_flattens': [None, True], 2026-02-21T09:15:53.5033144Z 'range_multi_buffers': [None, None], 2026-02-21T09:15:53.5033324Z 'range_num_stages': [0, 3], 2026-02-21T09:15:53.5033491Z 'range_unroll_factors': [0, 0], 2026-02-21T09:15:53.5033668Z 'range_warp_specializes': [None, False]} 2026-02-21T09:15:53.5043064Z [162s] Fitting surrogate: 680 points, 680 targets 2026-02-21T09:15:53.7777446Z [162s] Autotuning complete in 163.0s after searching 645 configs. 
2026-02-21T09:15:53.7777786Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:15:53.7778771Z @helion.kernel(config=helion.Config(block_sizes=[1, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T09:15:53.7779603Z 2026-02-21T09:15:53.7779858Z [162s] Code of selected kernel: /tmp/torchinductor_root/rq/crqwxm54fi4dqrcw7a7smoqs52tonjm5ukqqar7yfqzsg6sdozrg.py 2026-02-21T09:15:53.7998319Z from __future__ import annotations 2026-02-21T09:15:53.8000219Z 2026-02-21T09:15:53.8000387Z import torch 2026-02-21T09:15:53.8000567Z import triton 2026-02-21T09:15:53.8000722Z import triton.language as tl 2026-02-21T09:15:53.8000942Z from torch._inductor.runtime import triton_helpers 2026-02-21T09:15:53.8001223Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T09:15:53.8002014Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:15:53.8002193Z 2026-02-21T09:15:53.8002266Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T09:15:53.8002460Z _BLOCK_SIZE_1 = tl.constexpr(2048) 2026-02-21T09:15:53.8002579Z 2026-02-21T09:15:53.8002646Z @triton.jit 2026-02-21T09:15:53.8002790Z def _helion_softmax_two_pass(x, out): 2026-02-21T09:15:53.8003043Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:15:53.8003286Z pid_0 = tl.program_id(0) 2026-02-21T09:15:53.8003454Z offset_0 = pid_0 2026-02-21T09:15:53.8003621Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T09:15:53.8003905Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:15:53.8004197Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T09:15:53.8004457Z # src[softmax.py:81]: di = 
hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:15:53.8004834Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T09:15:53.8005087Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:15:53.8005363Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:15:53.8005608Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:15:53.8005854Z # src[softmax.py:82-89]: ... 2026-02-21T09:15:53.8006150Z for offset_2 in tl.range(0, 1536, _BLOCK_SIZE_1, warp_specialize=False, num_stages=3, flatten=True): 2026-02-21T09:15:53.8006505Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:15:53.8006769Z mask_1 = indices_2 < 1536 2026-02-21T09:15:53.8006939Z mi_copy = mi 2026-02-21T09:15:53.8007081Z di_copy = di 2026-02-21T09:15:53.8007230Z mi_copy_0 = mi_copy 2026-02-21T09:15:53.8007382Z di_copy_0 = di_copy 2026-02-21T09:15:53.8007573Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:15:53.8007943Z values = tl.load(x + (indices_0[:, None] * 1536 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first') 2026-02-21T09:15:53.8008332Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:15:53.8008743Z _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16)) 2026-02-21T09:15:53.8009131Z local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16) 2026-02-21T09:15:53.8009399Z # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax) 2026-02-21T09:15:53.8009631Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T09:15:53.8009838Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T09:15:53.8010088Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:15:53.8010326Z v_2 = mi_copy_0 - v_1 2026-02-21T09:15:53.8010498Z v_3 = libdevice.exp(v_2) 2026-02-21T09:15:53.8010667Z v_4 = di_copy_0 * v_3 2026-02-21T09:15:53.8010858Z # 
src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:15:53.8011054Z subscript = v_1[:, None] 2026-02-21T09:15:53.8011228Z v_5 = tl.cast(values, tl.float32) 2026-02-21T09:15:53.8011402Z v_6 = v_5 - subscript 2026-02-21T09:15:53.8011694Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:15:53.8011956Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:15:53.8012178Z # src[softmax.py:88]: ).sum(dim=1) 2026-02-21T09:15:53.8012370Z v_7 = libdevice.exp(v_6) 2026-02-21T09:15:53.8012686Z _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32)) 2026-02-21T09:15:53.8013050Z sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32) 2026-02-21T09:15:53.8013248Z di = v_4 + sum_1 2026-02-21T09:15:53.8013423Z # src[softmax.py:89]: mi = mi_next 2026-02-21T09:15:53.8013691Z mi = v_1 2026-02-21T09:15:53.8013899Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:15:53.8014181Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:15:53.8014479Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:15:53.8014882Z for offset_2 in tl.range(0, 1536, _BLOCK_SIZE_1, warp_specialize=False, num_stages=3, flatten=True): 2026-02-21T09:15:53.8015227Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:15:53.8015475Z mask_2 = indices_2 < 1536 2026-02-21T09:15:53.8015644Z mi_copy_1 = mi 2026-02-21T09:15:53.8015805Z di_copy_1 = di 2026-02-21T09:15:53.8015963Z mi_copy_1_0 = mi_copy_1 2026-02-21T09:15:53.8016124Z di_copy_1_0 = di_copy_1 2026-02-21T09:15:53.8016315Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:15:53.8016740Z values_1 = tl.load(x + (indices_0[:, None] * 1536 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first') 2026-02-21T09:15:53.8017184Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, 
None]) / di[:, None] 2026-02-21T09:15:53.8017460Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T09:15:53.8017654Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T09:15:53.8017841Z v_10 = v_9 - subscript_1 2026-02-21T09:15:53.8018009Z v_11 = libdevice.exp(v_10) 2026-02-21T09:15:53.8018191Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T09:15:53.8018366Z v_12 = v_11 / subscript_2 2026-02-21T09:15:53.8018542Z v_13 = tl.cast(v_12, tl.float16) 2026-02-21T09:15:53.8018810Z tl.store(out + (indices_0[:, None] * 1536 + indices_2[None, :] * 1), v_13, mask_2[None, :]) 2026-02-21T09:15:53.8019030Z 2026-02-21T09:15:53.8019159Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T09:15:53.8019402Z """ 2026-02-21T09:15:53.8019607Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T09:15:53.8019915Z This version uses fewer passes but is less numerically stable. 2026-02-21T09:15:53.8020129Z Args: 2026-02-21T09:15:53.8020296Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T09:15:53.8020485Z Returns: 2026-02-21T09:15:53.8020666Z torch.Tensor: Softmax output tensor of the same shape. 2026-02-21T09:15:53.8020870Z """ 2026-02-21T09:15:53.8021013Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T09:15:53.8021186Z m, n = x.size() 2026-02-21T09:15:53.8021354Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T09:15:53.8021603Z out = torch.empty_like(x) 2026-02-21T09:15:53.8021824Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:15:53.8022138Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:15:53.8022446Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:15:53.8022686Z # src[softmax.py:79-92]: ... 
2026-02-21T09:15:53.8022935Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=2, num_stages=4) 2026-02-21T09:15:53.8023208Z # src[softmax.py:93]: return out 2026-02-21T09:15:53.8023378Z return out 2026-02-21T09:15:54.8772251Z WARNING:tritonbench.utils.triton_op:Completed input ID 10: 2026-02-21T09:15:54.8776584Z (M, N) 2026-02-21T09:15:54.8781212Z ------------ 2026-02-21T09:15:54.8782600Z (4096, 1536) 2026-02-21T09:15:54.8782721Z 2026-02-21T09:15:54.8783245Z 15%|█▌ | 3/20 [07:00<41:20, 145.90s/it]WARNING:tritonbench.utils.triton_op:Running input ID 15: 2026-02-21T09:15:54.8787751Z (M, N) 2026-02-21T09:15:54.8792145Z ------------ 2026-02-21T09:15:54.8794025Z (4096, 2176) 2026-02-21T09:15:54.8794335Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T09:15:56.2953052Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T09:15:57.6638814Z INFO:tritonbench.utils.triton_op:Took 2.49ms to get benchmark function for torch_compile_softmax 2026-02-21T09:16:02.4500955Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:16:02.4502712Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:16:02.4502934Z 'dtype': 'torch.float16', 2026-02-21T09:16:02.4503135Z 'shape': (4096, 2176), 2026-02-21T09:16:02.4503316Z 'stride': (2176, 1)},), 2026-02-21T09:16:02.4503496Z 'kwargs': {}} 2026-02-21T09:16:02.4522953Z INFO:tritonbench.utils.triton_op:Took 2.43ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T09:16:02.8604451Z [0s] Autotune random seed: 2138408546 2026-02-21T09:16:02.8868412Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:16:35.3679518Z [32s] Timeout after 30s compiling Config(block_sizes=[256, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=32, num_stages=3, 
num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T09:16:35.8899428Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.9 configs/s 2026-02-21T09:16:37.8999170Z module { 2026-02-21T09:16:37.9003205Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:16:37.9004496Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:16:37.9004734Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:16:37.9004919Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:16:37.9005105Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:16:37.9005336Z %cst = arith.constant dense<2176> : tensor<16x1xi32> 2026-02-21T09:16:37.9005607Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<16xf32> 2026-02-21T09:16:37.9005860Z %cst_1 = arith.constant dense<0xFF800000> : tensor<16xf32> 2026-02-21T09:16:37.9006080Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:16:37.9006263Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:16:37.9006450Z %c2176_i32 = arith.constant 2176 : i32 2026-02-21T09:16:37.9006634Z %c2176_i64 = arith.constant 2176 : i64 2026-02-21T09:16:37.9006806Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:16:37.9007125Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c2176_i32], [%c2176_i64, %c1_i64] : , > 2026-02-21T09:16:37.9007486Z %1 = tt.get_program_id x : i32 2026-02-21T09:16:37.9007661Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T09:16:37.9007842Z %3 = arith.minsi %2, %c256_i32 : i32 2026-02-21T09:16:37.9008034Z scf.for %arg2 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T09:16:37.9008246Z %4 = arith.muli %arg2, %c16_i32 : i32 2026-02-21T09:16:37.9008483Z %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:16:37.9008734Z %6 = tt.splat %4 : i32 -> 
tensor<16xi32> 2026-02-21T09:16:37.9008932Z %7 = arith.addi %6, %5 : tensor<16xi32> 2026-02-21T09:16:37.9009118Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T09:16:37.9009307Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:16:37.9009677Z %8:2 = scf.for %arg3 = %c0_i32 to %c2048_i32 step %c512_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<16xf32>, tensor<16xf32>) : i32 { 2026-02-21T09:16:37.9010157Z %50 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:16:37.9010488Z %51 = arith.extf %50 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:16:37.9010749Z %52 = "tt.reduce"(%51) <{axis = 1 : i32}> ({ 2026-02-21T09:16:37.9010950Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:16:37.9011147Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:16:37.9011863Z tt.reduce.return %128 : f32 2026-02-21T09:16:37.9012062Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:16:37.9012288Z %53 = arith.truncf %52 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:16:37.9012538Z %54 = arith.extf %53 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:16:37.9012797Z %55 = arith.cmpf ogt, %arg4, %54 : tensor<16xf32> 2026-02-21T09:16:37.9013032Z %56 = arith.cmpf une, %arg4, %arg4 : tensor<16xf32> 2026-02-21T09:16:37.9013260Z %57 = arith.ori %55, %56 : tensor<16xi1> 2026-02-21T09:16:37.9013506Z %58 = arith.select %57, %arg4, %54 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:16:37.9013761Z %59 = arith.subf %arg4, %58 : tensor<16xf32> 2026-02-21T09:16:37.9014135Z %60 = tt.extern_elementwise %59 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:16:37.9014651Z %61 = arith.mulf %arg5, %60 : tensor<16xf32> 2026-02-21T09:16:37.9014928Z %62 = tt.expand_dims %58 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:16:37.9015243Z %63 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:16:37.9015498Z %64 = arith.subf %51, %63 : 
tensor<16x128xf32> 2026-02-21T09:16:37.9015882Z %65 = tt.extern_elementwise %64 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:16:37.9016271Z %66 = "tt.reduce"(%65) <{axis = 1 : i32}> ({ 2026-02-21T09:16:37.9016467Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:16:37.9016665Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:16:37.9016859Z tt.reduce.return %128 : f32 2026-02-21T09:16:37.9017060Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:16:37.9017277Z %67 = arith.addf %61, %66 : tensor<16xf32> 2026-02-21T09:16:37.9017481Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T09:16:37.9017690Z %68 = arith.muli %c128_i32, %c1_i32_4 : i32 2026-02-21T09:16:37.9017890Z %69 = arith.addi %arg3, %68 : i32 2026-02-21T09:16:37.9018185Z %70 = tt.descriptor_load %0[%4, %69] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:16:37.9018516Z %71 = arith.extf %70 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:16:37.9018760Z %72 = "tt.reduce"(%71) <{axis = 1 : i32}> ({ 2026-02-21T09:16:37.9018961Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:16:37.9019159Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:16:37.9019368Z tt.reduce.return %128 : f32 2026-02-21T09:16:37.9019565Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:16:37.9019801Z %73 = arith.truncf %72 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:16:37.9020054Z %74 = arith.extf %73 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:16:37.9020302Z %75 = arith.cmpf ogt, %58, %74 : tensor<16xf32> 2026-02-21T09:16:37.9020531Z %76 = arith.cmpf une, %58, %58 : tensor<16xf32> 2026-02-21T09:16:37.9020738Z %77 = arith.ori %75, %76 : tensor<16xi1> 2026-02-21T09:16:37.9020982Z %78 = arith.select %77, %58, %74 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:16:37.9021227Z %79 = arith.subf %58, %78 : tensor<16xf32> 2026-02-21T09:16:37.9021618Z %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = 
"__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:16:37.9021967Z %81 = arith.mulf %67, %80 : tensor<16xf32> 2026-02-21T09:16:37.9022219Z %82 = tt.expand_dims %78 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:16:37.9022517Z %83 = tt.broadcast %82 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:16:37.9022751Z %84 = arith.subf %71, %83 : tensor<16x128xf32> 2026-02-21T09:16:37.9023117Z %85 = tt.extern_elementwise %84 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:16:37.9023563Z %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({ 2026-02-21T09:16:37.9023759Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:16:37.9023938Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:16:37.9024136Z tt.reduce.return %128 : f32 2026-02-21T09:16:37.9024328Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:16:37.9024529Z %87 = arith.addf %81, %86 : tensor<16xf32> 2026-02-21T09:16:37.9024727Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:16:37.9024914Z %88 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T09:16:37.9025112Z %89 = arith.addi %arg3, %88 : i32 2026-02-21T09:16:37.9025382Z %90 = tt.descriptor_load %0[%4, %89] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:16:37.9025704Z %91 = arith.extf %90 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:16:37.9026006Z %92 = "tt.reduce"(%91) <{axis = 1 : i32}> ({ 2026-02-21T09:16:37.9026197Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:16:37.9026384Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:16:37.9026569Z tt.reduce.return %128 : f32 2026-02-21T09:16:37.9026756Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:16:37.9026972Z %93 = arith.truncf %92 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:16:37.9027214Z %94 = arith.extf %93 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:16:37.9027441Z %95 = arith.cmpf ogt, %78, %94 : tensor<16xf32> 2026-02-21T09:16:37.9027645Z %96 = arith.cmpf une, 
%78, %78 : tensor<16xf32> 2026-02-21T09:16:37.9027851Z %97 = arith.ori %95, %96 : tensor<16xi1> 2026-02-21T09:16:37.9028075Z %98 = arith.select %97, %78, %94 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:16:37.9028309Z %99 = arith.subf %78, %98 : tensor<16xf32> 2026-02-21T09:16:37.9028665Z %100 = tt.extern_elementwise %99 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:16:37.9029032Z %101 = arith.mulf %87, %100 : tensor<16xf32> 2026-02-21T09:16:37.9029289Z %102 = tt.expand_dims %98 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:16:37.9029583Z %103 = tt.broadcast %102 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:16:37.9029837Z %104 = arith.subf %91, %103 : tensor<16x128xf32> 2026-02-21T09:16:37.9030207Z %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:16:37.9030584Z %106 = "tt.reduce"(%105) <{axis = 1 : i32}> ({ 2026-02-21T09:16:37.9030772Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:16:37.9030958Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:16:37.9031151Z tt.reduce.return %128 : f32 2026-02-21T09:16:37.9031337Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:16:37.9031590Z %107 = arith.addf %101, %106 : tensor<16xf32> 2026-02-21T09:16:37.9031793Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:16:37.9031989Z %108 = arith.muli %c128_i32, %c3_i32 : i32 2026-02-21T09:16:37.9032176Z %109 = arith.addi %arg3, %108 : i32 2026-02-21T09:16:37.9032464Z %110 = tt.descriptor_load %0[%4, %109] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:16:37.9032795Z %111 = arith.extf %110 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:16:37.9033028Z %112 = "tt.reduce"(%111) <{axis = 1 : i32}> ({ 2026-02-21T09:16:37.9033223Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:16:37.9033403Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:16:37.9033599Z 
tt.reduce.return %128 : f32 2026-02-21T09:16:37.9033780Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:16:37.9034019Z %113 = arith.truncf %112 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:16:37.9034343Z %114 = arith.extf %113 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:16:37.9034572Z %115 = arith.cmpf ogt, %98, %114 : tensor<16xf32> 2026-02-21T09:16:37.9034797Z %116 = arith.cmpf une, %98, %98 : tensor<16xf32> 2026-02-21T09:16:37.9034998Z %117 = arith.ori %115, %116 : tensor<16xi1> 2026-02-21T09:16:37.9035237Z %118 = arith.select %117, %98, %114 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:16:37.9035472Z %119 = arith.subf %98, %118 : tensor<16xf32> 2026-02-21T09:16:37.9035836Z %120 = tt.extern_elementwise %119 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:16:37.9036205Z %121 = arith.mulf %107, %120 : tensor<16xf32> 2026-02-21T09:16:37.9036461Z %122 = tt.expand_dims %118 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:16:37.9036817Z %123 = tt.broadcast %122 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:16:37.9037070Z %124 = arith.subf %111, %123 : tensor<16x128xf32> 2026-02-21T09:16:37.9037482Z %125 = tt.extern_elementwise %124 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:16:37.9037853Z %126 = "tt.reduce"(%125) <{axis = 1 : i32}> ({ 2026-02-21T09:16:37.9038048Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:16:37.9038225Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:16:37.9038416Z tt.reduce.return %128 : f32 2026-02-21T09:16:37.9038596Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:16:37.9038797Z %127 = arith.addf %121, %126 : tensor<16xf32> 2026-02-21T09:16:37.9039085Z scf.yield %118, %127 : tensor<16xf32>, tensor<16xf32> 2026-02-21T09:16:37.9039309Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:16:37.9039617Z %9 = tt.descriptor_load %0[%4, 
%c2048_i32] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:16:37.9039944Z %10 = arith.extf %9 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:16:37.9040179Z %11 = "tt.reduce"(%10) <{axis = 1 : i32}> ({ 2026-02-21T09:16:37.9040368Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:16:37.9040554Z %50 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:16:37.9040739Z tt.reduce.return %50 : f32 2026-02-21T09:16:37.9040931Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:16:37.9041164Z %12 = arith.truncf %11 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:16:37.9041399Z %13 = arith.extf %12 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:16:37.9041673Z %14 = arith.cmpf ogt, %8#0, %13 : tensor<16xf32> 2026-02-21T09:16:37.9041885Z %15 = arith.cmpf une, %8#0, %8#0 : tensor<16xf32> 2026-02-21T09:16:37.9042091Z %16 = arith.ori %14, %15 : tensor<16xi1> 2026-02-21T09:16:37.9042319Z %17 = arith.select %16, %8#0, %13 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:16:37.9042558Z %18 = arith.subf %8#0, %17 : tensor<16xf32> 2026-02-21T09:16:37.9042910Z %19 = tt.extern_elementwise %18 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:16:37.9043257Z %20 = arith.mulf %8#1, %19 : tensor<16xf32> 2026-02-21T09:16:37.9043511Z %21 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:16:37.9043795Z %22 = tt.broadcast %21 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:16:37.9044039Z %23 = arith.subf %10, %22 : tensor<16x128xf32> 2026-02-21T09:16:37.9044400Z %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:16:37.9044775Z %25 = "tt.reduce"(%24) <{axis = 1 : i32}> ({ 2026-02-21T09:16:37.9044975Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:16:37.9045149Z %50 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:16:37.9045446Z tt.reduce.return %50 : f32 
2026-02-21T09:16:37.9045627Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:16:37.9045836Z %26 = arith.addf %20, %25 : tensor<16xf32> 2026-02-21T09:16:37.9046029Z %c2048_i32_2 = arith.constant 2048 : i32 2026-02-21T09:16:37.9046227Z %c512_i32_3 = arith.constant 512 : i32 2026-02-21T09:16:37.9046463Z scf.for %arg3 = %c0_i32 to %c2048_i32_2 step %c512_i32_3 : i32 { 2026-02-21T09:16:37.9046746Z %50 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:16:37.9047012Z %51 = tt.splat %arg3 : i32 -> tensor<128xi32> 2026-02-21T09:16:37.9047215Z %52 = arith.addi %51, %50 : tensor<128xi32> 2026-02-21T09:16:37.9047468Z %53 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:16:37.9047733Z %54 = arith.muli %53, %cst : tensor<16x1xi32> 2026-02-21T09:16:37.9048049Z %55 = tt.expand_dims %52 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:16:37.9048349Z %56 = tt.broadcast %54 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:16:37.9048609Z %57 = tt.broadcast %55 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:16:37.9048848Z %58 = arith.addi %56, %57 : tensor<16x128xi32> 2026-02-21T09:16:37.9049082Z %59 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9049367Z %60 = tt.addptr %59, %58 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:16:37.9049666Z %61 = tt.load %60 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9049978Z %62 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:16:37.9050268Z %63 = arith.extf %61 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:16:37.9050523Z %64 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:16:37.9050763Z %65 = arith.subf %63, %64 : tensor<16x128xf32> 2026-02-21T09:16:37.9051129Z %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> 
tensor<16x128xf32> 2026-02-21T09:16:37.9051574Z %67 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:16:37.9051858Z %68 = tt.broadcast %67 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:16:37.9052084Z %69 = arith.divf %66, %68 : tensor<16x128xf32> 2026-02-21T09:16:37.9052320Z %70 = arith.truncf %69 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:16:37.9052589Z %71 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9052870Z %72 = tt.addptr %71, %58 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:16:37.9053125Z tt.store %72, %70 : tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9053338Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T09:16:37.9053538Z %73 = arith.muli %c128_i32, %c1_i32_4 : i32 2026-02-21T09:16:37.9053729Z %74 = arith.addi %arg3, %73 : i32 2026-02-21T09:16:37.9053965Z %75 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:16:37.9054208Z %76 = tt.splat %74 : i32 -> tensor<128xi32> 2026-02-21T09:16:37.9054412Z %77 = arith.addi %76, %75 : tensor<128xi32> 2026-02-21T09:16:37.9054654Z %78 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:16:37.9054921Z %79 = arith.muli %78, %cst : tensor<16x1xi32> 2026-02-21T09:16:37.9055179Z %80 = tt.expand_dims %77 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:16:37.9055465Z %81 = tt.broadcast %79 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:16:37.9055728Z %82 = tt.broadcast %80 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:16:37.9055960Z %83 = arith.addi %81, %82 : tensor<16x128xi32> 2026-02-21T09:16:37.9056205Z %84 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9056551Z %85 = tt.addptr %84, %83 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:16:37.9056863Z %86 = tt.load %85 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9057206Z %87 = tt.expand_dims %17 {axis = 
1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:16:37.9057500Z %88 = arith.extf %86 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:16:37.9057776Z %89 = tt.broadcast %87 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:16:37.9058020Z %90 = arith.subf %88, %89 : tensor<16x128xf32> 2026-02-21T09:16:37.9058421Z %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:16:37.9058869Z %92 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:16:37.9059228Z %93 = tt.broadcast %92 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:16:37.9059481Z %94 = arith.divf %91, %93 : tensor<16x128xf32> 2026-02-21T09:16:37.9059723Z %95 = arith.truncf %94 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:16:37.9060008Z %96 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9060299Z %97 = tt.addptr %96, %83 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:16:37.9060566Z tt.store %97, %95 : tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9060786Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:16:37.9060983Z %98 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T09:16:37.9061189Z %99 = arith.addi %arg3, %98 : i32 2026-02-21T09:16:37.9061431Z %100 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:16:37.9061729Z %101 = tt.splat %99 : i32 -> tensor<128xi32> 2026-02-21T09:16:37.9061944Z %102 = arith.addi %101, %100 : tensor<128xi32> 2026-02-21T09:16:37.9062216Z %103 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:16:37.9062498Z %104 = arith.muli %103, %cst : tensor<16x1xi32> 2026-02-21T09:16:37.9062773Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:16:37.9063091Z %106 = tt.broadcast %104 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:16:37.9063371Z %107 = tt.broadcast 
%105 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:16:37.9063637Z %108 = arith.addi %106, %107 : tensor<16x128xi32> 2026-02-21T09:16:37.9063896Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9064192Z %110 = tt.addptr %109, %108 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:16:37.9064524Z %111 = tt.load %110 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9064854Z %112 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:16:37.9065172Z %113 = arith.extf %111 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:16:37.9065498Z %114 = tt.broadcast %112 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:16:37.9065750Z %115 = arith.subf %113, %114 : tensor<16x128xf32> 2026-02-21T09:16:37.9066128Z %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:16:37.9066540Z %117 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:16:37.9066830Z %118 = tt.broadcast %117 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:16:37.9067068Z %119 = arith.divf %116, %118 : tensor<16x128xf32> 2026-02-21T09:16:37.9067313Z %120 = arith.truncf %119 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:16:37.9067596Z %121 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9067933Z %122 = tt.addptr %121, %108 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:16:37.9068203Z tt.store %122, %120 : tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9068407Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:16:37.9068602Z %123 = arith.muli %c128_i32, %c3_i32 : i32 2026-02-21T09:16:37.9068791Z %124 = arith.addi %arg3, %123 : i32 2026-02-21T09:16:37.9069031Z %125 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:16:37.9069286Z %126 = tt.splat %124 : i32 -> tensor<128xi32> 
2026-02-21T09:16:37.9069490Z %127 = arith.addi %126, %125 : tensor<128xi32> 2026-02-21T09:16:37.9069748Z %128 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:16:37.9070012Z %129 = arith.muli %128, %cst : tensor<16x1xi32> 2026-02-21T09:16:37.9070338Z %130 = tt.expand_dims %127 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:16:37.9070638Z %131 = tt.broadcast %129 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:16:37.9070916Z %132 = tt.broadcast %130 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:16:37.9071162Z %133 = arith.addi %131, %132 : tensor<16x128xi32> 2026-02-21T09:16:37.9071396Z %134 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9071719Z %135 = tt.addptr %134, %133 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:16:37.9072026Z %136 = tt.load %135 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9072345Z %137 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:16:37.9072642Z %138 = arith.extf %136 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:16:37.9072908Z %139 = tt.broadcast %137 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:16:37.9073157Z %140 = arith.subf %138, %139 : tensor<16x128xf32> 2026-02-21T09:16:37.9073535Z %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:16:37.9073963Z %142 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:16:37.9074259Z %143 = tt.broadcast %142 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:16:37.9074500Z %144 = arith.divf %141, %143 : tensor<16x128xf32> 2026-02-21T09:16:37.9074748Z %145 = arith.truncf %144 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:16:37.9075019Z %146 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9075308Z %147 = tt.addptr 
%146, %133 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:16:37.9075565Z tt.store %147, %145 : tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9075784Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:16:37.9076032Z %27 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:16:37.9076289Z %28 = tt.splat %c2048_i32_2 : i32 -> tensor<128xi32> 2026-02-21T09:16:37.9076506Z %29 = arith.addi %28, %27 : tensor<128xi32> 2026-02-21T09:16:37.9076751Z %30 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:16:37.9077008Z %31 = arith.muli %30, %cst : tensor<16x1xi32> 2026-02-21T09:16:37.9077258Z %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:16:37.9077549Z %33 = tt.broadcast %31 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:16:37.9077815Z %34 = tt.broadcast %32 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:16:37.9078049Z %35 = arith.addi %33, %34 : tensor<16x128xi32> 2026-02-21T09:16:37.9078285Z %36 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9078555Z %37 = tt.addptr %36, %35 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:16:37.9078911Z %38 = tt.load %37 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9079208Z %39 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:16:37.9079493Z %40 = arith.extf %38 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:16:37.9079756Z %41 = tt.broadcast %39 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:16:37.9079982Z %42 = arith.subf %40, %41 : tensor<16x128xf32> 2026-02-21T09:16:37.9080347Z %43 = tt.extern_elementwise %42 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:16:37.9080748Z %44 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:16:37.9081030Z %45 = tt.broadcast %44 : 
tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:16:37.9081337Z %46 = arith.divf %43, %45 : tensor<16x128xf32> 2026-02-21T09:16:37.9081602Z %47 = arith.truncf %46 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:16:37.9081872Z %48 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9082139Z %49 = tt.addptr %48, %35 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:16:37.9082396Z tt.store %49, %47 : tensor<16x128x!tt.ptr> 2026-02-21T09:16:37.9082689Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T09:16:37.9082972Z tt.return 2026-02-21T09:16:37.9083108Z } 2026-02-21T09:16:37.9083230Z } 2026-02-21T09:16:37.9083300Z 2026-02-21T09:16:37.9083362Z {-# 2026-02-21T09:16:37.9083492Z external_resources: { 2026-02-21T09:16:37.9083659Z mlir_reproducer: { 2026-02-21T09:16:37.9087953Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 
region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:16:37.9092358Z disable_threading: false, 2026-02-21T09:16:37.9092531Z verify_each: true 2026-02-21T09:16:37.9092671Z } 2026-02-21T09:16:37.9092793Z } 2026-02-21T09:16:37.9092904Z #-} 2026-02-21T09:16:37.9093334Z /tmp/torchinductor_root/jy/cjyrwyokcoajngeke66pu5qatfgz2ltezf3ru4xoqva3brfvaa4b.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:16:37.9094532Z /tmp/torchinductor_root/jy/cjyrwyokcoajngeke66pu5qatfgz2ltezf3ru4xoqva3brfvaa4b.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:16:37.9095553Z [35s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:16:37.9096587Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_sm_multiplier=32, num_stages=3, num_warps=32, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[2, 3], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:16:37.9097524Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:16:37.9097823Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:16:42.0624354Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 16.1 configs/s 2026-02-21T09:16:42.0635299Z [39s] Adaptive compile timeout: 30s (90% percentile=2.8s, bounds=[30.0s, 30s]) 2026-02-21T09:16:42.0637356Z [39s] Initial random population of 100, 5 starting points: 2026-02-21T09:16:42.0637592Z error=11 2026-02-21T09:16:42.0637726Z timeout=1 2026-02-21T09:16:42.0637861Z ok=88 2026-02-21T09:16:42.0638000Z min=0.0123 2026-02-21T09:16:42.0638146Z mid=0.1992 2026-02-21T09:16:42.0638271Z max=50.0265 2026-02-21T09:16:42.0638424Z best={'block_sizes': [1, 4096], 2026-02-21T09:16:42.0638665Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:16:42.0638933Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:16:42.0639132Z 'num_stages': 5, 2026-02-21T09:16:42.0639278Z 'num_warps': 1, 2026-02-21T09:16:42.0639432Z 'pid_type': 'flat', 2026-02-21T09:16:42.0639596Z 'range_flattens': [None, False], 2026-02-21T09:16:42.0639843Z 'range_multi_buffers': [None, False], 2026-02-21T09:16:42.0640033Z 'range_num_stages': [0, 1], 2026-02-21T09:16:42.0640214Z 'range_unroll_factors': [0, 0], 2026-02-21T09:16:42.0640399Z 'range_warp_specializes': [None, False]} 2026-02-21T09:16:42.0651055Z [39s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:16:43.3329623Z [40s] Generation 1 starting: 89 neighbors, 5 
active search path(s) 2026-02-21T09:16:51.4028980Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 7.6 configs/s 2026-02-21T09:16:57.0053154Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 16.5 configs/s 2026-02-21T09:16:57.5299309Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1909.3 2026-02-21T09:16:57.5303250Z configs/s 2026-02-21T09:16:57.5860903Z [54s] Generation 1 complete: 2026-02-21T09:16:57.5865213Z ok=94 2026-02-21T09:16:57.5869213Z min=0.0123 2026-02-21T09:16:57.5873748Z mid=0.0266 2026-02-21T09:16:57.5878090Z max=0.1680 2026-02-21T09:16:57.5882143Z best={'block_sizes': [1, 4096], 2026-02-21T09:16:57.5885853Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:16:57.5889618Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:16:57.5892754Z 'num_stages': 5, 2026-02-21T09:16:57.5894875Z 'num_warps': 1, 2026-02-21T09:16:57.5895061Z 'pid_type': 'flat', 2026-02-21T09:16:57.5895237Z 'range_flattens': [None, False], 2026-02-21T09:16:57.5895438Z 'range_multi_buffers': [None, False], 2026-02-21T09:16:57.5895627Z 'range_num_stages': [0, 1], 2026-02-21T09:16:57.5895809Z 'range_unroll_factors': [0, 0], 2026-02-21T09:16:57.5895990Z 'range_warp_specializes': [None, False]} 2026-02-21T09:16:57.5896217Z [54s] Fitting surrogate: 194 points, 194 targets 2026-02-21T09:16:58.9084740Z [56s] Generation 2 starting: 79 neighbors, 5 active search path(s) 2026-02-21T09:17:05.1562143Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80/80 19.7 configs/s 2026-02-21T09:17:10.0788037Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 80/80 16.4 configs/s 2026-02-21T09:17:11.2564869Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 862.1 2026-02-21T09:17:11.2566425Z configs/s 2026-02-21T09:17:11.3626388Z [68s] Generation 2 complete: 2026-02-21T09:17:11.3632045Z ok=84 2026-02-21T09:17:11.3638122Z min=0.0123 2026-02-21T09:17:11.3638330Z mid=0.0205 
2026-02-21T09:17:11.3642816Z max=0.0430 2026-02-21T09:17:11.3647707Z best={'block_sizes': [1, 4096], 2026-02-21T09:17:11.3652041Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:17:11.3656781Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:17:11.3657057Z 'num_stages': 5, 2026-02-21T09:17:11.3662384Z 'num_warps': 1, 2026-02-21T09:17:11.3667400Z 'pid_type': 'flat', 2026-02-21T09:17:11.3667692Z 'range_flattens': [None, False], 2026-02-21T09:17:11.3668301Z 'range_multi_buffers': [None, False], 2026-02-21T09:17:11.3672924Z 'range_num_stages': [0, 1], 2026-02-21T09:17:11.3674555Z 'range_unroll_factors': [0, 1], 2026-02-21T09:17:11.3674793Z 'range_warp_specializes': [None, False]} 2026-02-21T09:17:11.3675110Z [68s] Fitting surrogate: 278 points, 278 targets 2026-02-21T09:17:12.3566378Z [69s] Generation 3 starting: 64 neighbors, 5 active search path(s) 2026-02-21T09:17:21.1596600Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66/66 4.9 configs/s 2026-02-21T09:17:25.2235623Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 66/66 16.4 configs/s 2026-02-21T09:17:26.5949342Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 891.0 2026-02-21T09:17:26.5950619Z configs/s 2026-02-21T09:17:26.6975640Z [83s] Generation 3 complete: 2026-02-21T09:17:26.6979272Z ok=69 2026-02-21T09:17:26.6983561Z min=0.0123 2026-02-21T09:17:26.6988402Z mid=0.0186 2026-02-21T09:17:26.6988666Z max=0.1475 2026-02-21T09:17:26.6988835Z best={'block_sizes': [1, 4096], 2026-02-21T09:17:26.6989074Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:17:26.6989307Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:17:26.6989500Z 'num_stages': 5, 2026-02-21T09:17:26.6989642Z 'num_warps': 1, 2026-02-21T09:17:26.6989789Z 'pid_type': 'flat', 2026-02-21T09:17:26.6989947Z 'range_flattens': [None, False], 2026-02-21T09:17:26.6990134Z 'range_multi_buffers': [None, None], 2026-02-21T09:17:26.6990318Z 'range_num_stages': [0, 1], 
2026-02-21T09:17:26.6990494Z 'range_unroll_factors': [0, 1], 2026-02-21T09:17:26.6990683Z 'range_warp_specializes': [None, False]} 2026-02-21T09:17:26.6994255Z [83s] Fitting surrogate: 347 points, 347 targets 2026-02-21T09:17:27.4893058Z [84s] Generation 4 starting: 50 neighbors, 4 active search path(s) 2026-02-21T09:17:30.9225909Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 19.8 configs/s 2026-02-21T09:17:34.1079660Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 16.5 configs/s 2026-02-21T09:17:35.7545721Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 618.2 2026-02-21T09:17:35.7546653Z configs/s 2026-02-21T09:17:35.9068028Z [93s] Generation 4 complete: 2026-02-21T09:17:35.9072337Z ok=55 2026-02-21T09:17:35.9075622Z min=0.0123 2026-02-21T09:17:35.9080014Z mid=0.0164 2026-02-21T09:17:35.9084325Z max=0.0470 2026-02-21T09:17:35.9088222Z best={'block_sizes': [1, 4096], 2026-02-21T09:17:35.9092607Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:17:35.9096583Z 'load_eviction_policies': ['', ''], 2026-02-21T09:17:35.9096876Z 'num_stages': 1, 2026-02-21T09:17:35.9097056Z 'num_warps': 1, 2026-02-21T09:17:35.9097236Z 'pid_type': 'flat', 2026-02-21T09:17:35.9097439Z 'range_flattens': [None, False], 2026-02-21T09:17:35.9097977Z 'range_multi_buffers': [None, None], 2026-02-21T09:17:35.9098264Z 'range_num_stages': [0, 0], 2026-02-21T09:17:35.9098468Z 'range_unroll_factors': [0, 1], 2026-02-21T09:17:35.9098661Z 'range_warp_specializes': [None, False]} 2026-02-21T09:17:35.9103053Z [93s] Fitting surrogate: 402 points, 402 targets 2026-02-21T09:17:36.5034524Z [93s] Generation 5 starting: 37 neighbors, 4 active search path(s) 2026-02-21T09:17:41.7366734Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 4.3 configs/s 2026-02-21T09:17:44.1450313Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 39/39 16.5 configs/s 2026-02-21T09:17:45.2680083Z Generation 5: 
verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 906.5 2026-02-21T09:17:45.2682009Z configs/s 2026-02-21T09:17:45.3832153Z [102s] Generation 5 complete: 2026-02-21T09:17:45.3835163Z ok=41 2026-02-21T09:17:45.3835368Z min=0.0123 2026-02-21T09:17:45.3835582Z mid=0.0185 2026-02-21T09:17:45.3835714Z max=0.0471 2026-02-21T09:17:45.3836247Z best={'block_sizes': [1, 4096], 2026-02-21T09:17:45.3841023Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:17:45.3841376Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:17:45.3841678Z 'num_stages': 6, 2026-02-21T09:17:45.3847724Z 'num_warps': 1, 2026-02-21T09:17:45.3849996Z 'pid_type': 'flat', 2026-02-21T09:17:45.3850203Z 'range_flattens': [None, False], 2026-02-21T09:17:45.3850410Z 'range_multi_buffers': [None, None], 2026-02-21T09:17:45.3854230Z 'range_num_stages': [0, 1], 2026-02-21T09:17:45.3856583Z 'range_unroll_factors': [0, 0], 2026-02-21T09:17:45.3860128Z 'range_warp_specializes': [None, False]} 2026-02-21T09:17:45.3860363Z [102s] Fitting surrogate: 443 points, 443 targets 2026-02-21T09:17:46.1060320Z [103s] Generation 6 starting: 31 neighbors, 3 active search path(s) 2026-02-21T09:17:49.4091514Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 5.0 configs/s 2026-02-21T09:17:51.6310487Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 32/32 14.6 configs/s 2026-02-21T09:17:52.7368933Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 917.5 2026-02-21T09:17:52.7369622Z configs/s 2026-02-21T09:17:52.8451468Z [109s] Generation 6 complete: 2026-02-21T09:17:52.8455591Z ok=34 2026-02-21T09:17:52.8459983Z min=0.0123 2026-02-21T09:17:52.8464415Z mid=0.0143 2026-02-21T09:17:52.8465999Z max=0.1495 2026-02-21T09:17:52.8466259Z best={'block_sizes': [1, 4096], 2026-02-21T09:17:52.8466509Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:17:52.8466779Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:17:52.8467006Z 'num_stages': 6, 
2026-02-21T09:17:52.8467170Z 'num_warps': 1, 2026-02-21T09:17:52.8467320Z 'pid_type': 'flat', 2026-02-21T09:17:52.8467484Z 'range_flattens': [None, False], 2026-02-21T09:17:52.8467735Z 'range_multi_buffers': [None, None], 2026-02-21T09:17:52.8467959Z 'range_num_stages': [0, 1], 2026-02-21T09:17:52.8468161Z 'range_unroll_factors': [0, 0], 2026-02-21T09:17:52.8468772Z 'range_warp_specializes': [None, False]} 2026-02-21T09:17:52.8468986Z [109s] Fitting surrogate: 477 points, 477 targets 2026-02-21T09:17:53.2341396Z [110s] Generation 7 starting: 21 neighbors, 2 active search path(s) 2026-02-21T09:17:55.1299271Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 18.0 configs/s 2026-02-21T09:17:56.5005386Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 22/22 16.6 configs/s 2026-02-21T09:17:57.5744413Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 945.8 2026-02-21T09:17:57.5748534Z configs/s 2026-02-21T09:17:57.6747493Z [114s] Generation 7 complete: 2026-02-21T09:17:57.6752224Z ok=24 2026-02-21T09:17:57.6756149Z min=0.0123 2026-02-21T09:17:57.6758028Z mid=0.0164 2026-02-21T09:17:57.6758197Z max=0.0327 2026-02-21T09:17:57.6758343Z best={'block_sizes': [1, 4096], 2026-02-21T09:17:57.6758964Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:17:57.6759254Z 'load_eviction_policies': ['', ''], 2026-02-21T09:17:57.6759443Z 'num_stages': 2, 2026-02-21T09:17:57.6759588Z 'num_warps': 1, 2026-02-21T09:17:57.6759742Z 'pid_type': 'flat', 2026-02-21T09:17:57.6759902Z 'range_flattens': [None, False], 2026-02-21T09:17:57.6760089Z 'range_multi_buffers': [None, None], 2026-02-21T09:17:57.6760272Z 'range_num_stages': [0, 0], 2026-02-21T09:17:57.6760447Z 'range_unroll_factors': [0, 0], 2026-02-21T09:17:57.6760634Z 'range_warp_specializes': [None, False]} 2026-02-21T09:17:57.6764782Z [114s] Fitting surrogate: 501 points, 501 targets 2026-02-21T09:17:58.1760293Z [115s] Generation 8 starting: 20 neighbors, 2 active 
search path(s) 2026-02-21T09:18:07.9923220Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 1.0 configs/s 2026-02-21T09:18:09.2824760Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 16.8 configs/s 2026-02-21T09:18:10.1123021Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1219.6 2026-02-21T09:18:10.1127164Z configs/s 2026-02-21T09:18:10.1969051Z [127s] Generation 8 complete: 2026-02-21T09:18:10.1973987Z ok=22 2026-02-21T09:18:10.1976193Z min=0.0123 2026-02-21T09:18:10.1976352Z mid=0.0143 2026-02-21T09:18:10.1976482Z max=0.0676 2026-02-21T09:18:10.1976616Z best={'block_sizes': [1, 4096], 2026-02-21T09:18:10.1976871Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:18:10.1977122Z 'load_eviction_policies': ['', ''], 2026-02-21T09:18:10.1977307Z 'num_stages': 2, 2026-02-21T09:18:10.1977445Z 'num_warps': 1, 2026-02-21T09:18:10.1977593Z 'pid_type': 'flat', 2026-02-21T09:18:10.1977754Z 'range_flattens': [None, False], 2026-02-21T09:18:10.1977932Z 'range_multi_buffers': [None, None], 2026-02-21T09:18:10.1978117Z 'range_num_stages': [0, 0], 2026-02-21T09:18:10.1978281Z 'range_unroll_factors': [0, 0], 2026-02-21T09:18:10.1978499Z 'range_warp_specializes': [None, False]} 2026-02-21T09:18:10.1986633Z [127s] Fitting surrogate: 523 points, 523 targets 2026-02-21T09:18:10.6819013Z [127s] Generation 9 starting: 11 neighbors, 1 active search path(s) 2026-02-21T09:18:13.2981085Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 4.4 configs/s 2026-02-21T09:18:13.9799505Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 11/11 17.3 configs/s 2026-02-21T09:18:14.5741151Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1696.3 2026-02-21T09:18:14.5742742Z configs/s 2026-02-21T09:18:14.6370461Z [131s] Generation 9 complete: 2026-02-21T09:18:14.6374801Z ok=13 2026-02-21T09:18:14.6376221Z min=0.0123 2026-02-21T09:18:14.6376422Z mid=0.0123 
2026-02-21T09:18:14.6379289Z max=0.0204 2026-02-21T09:18:14.6379475Z best={'block_sizes': [1, 4096], 2026-02-21T09:18:14.6379737Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:18:14.6380353Z 'load_eviction_policies': ['', ''], 2026-02-21T09:18:14.6380534Z 'num_stages': 8, 2026-02-21T09:18:14.6380685Z 'num_warps': 1, 2026-02-21T09:18:14.6380828Z 'pid_type': 'flat', 2026-02-21T09:18:14.6380989Z 'range_flattens': [None, True], 2026-02-21T09:18:14.6381171Z 'range_multi_buffers': [None, False], 2026-02-21T09:18:14.6381355Z 'range_num_stages': [0, 3], 2026-02-21T09:18:14.6381532Z 'range_unroll_factors': [0, 2], 2026-02-21T09:18:14.6381788Z 'range_warp_specializes': [None, None]} 2026-02-21T09:18:14.6386504Z [131s] Fitting surrogate: 536 points, 536 targets 2026-02-21T09:18:14.9126231Z [132s] Autotuning complete in 132.0s after searching 514 configs. 2026-02-21T09:18:14.9130502Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:18:14.9135901Z @helion.kernel(config=helion.Config(block_sizes=[1, 4096], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', ''], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 2], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T09:18:14.9136809Z 2026-02-21T09:18:14.9137138Z [132s] Code of selected kernel: /tmp/torchinductor_root/mo/cmoutw7hy3xdywgwxgkivyc2c2otyaqfmanyliv46l4mv3edno3n.py 2026-02-21T09:18:14.9350481Z from __future__ import annotations 2026-02-21T09:18:14.9355418Z 2026-02-21T09:18:14.9356963Z import torch 2026-02-21T09:18:14.9357141Z import triton 2026-02-21T09:18:14.9357311Z import triton.language as tl 2026-02-21T09:18:14.9357528Z from torch._inductor.runtime import triton_helpers 2026-02-21T09:18:14.9357810Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T09:18:14.9358113Z from helion.runtime import 
default_launcher as _default_launcher 2026-02-21T09:18:14.9358302Z 2026-02-21T09:18:14.9358375Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T09:18:14.9358567Z _BLOCK_SIZE_1 = tl.constexpr(4096) 2026-02-21T09:18:14.9358711Z 2026-02-21T09:18:14.9358770Z @triton.jit 2026-02-21T09:18:14.9358930Z def _helion_softmax_two_pass(x, out): 2026-02-21T09:18:14.9359194Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:18:14.9359461Z pid_0 = tl.program_id(0) 2026-02-21T09:18:14.9359635Z offset_0 = pid_0 2026-02-21T09:18:14.9359812Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T09:18:14.9360107Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:18:14.9360410Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T09:18:14.9360693Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:18:14.9360947Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T09:18:14.9361215Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:18:14.9361498Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:18:14.9361967Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:18:14.9362220Z # src[softmax.py:82-89]: ... 
2026-02-21T09:18:14.9362593Z for offset_2 in tl.range(0, 2176, _BLOCK_SIZE_1, loop_unroll_factor=2, num_stages=1, disallow_acc_multi_buffer=True, flatten=True): 2026-02-21T09:18:14.9363036Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:18:14.9363277Z mask_1 = indices_2 < 2176 2026-02-21T09:18:14.9363453Z mi_copy = mi 2026-02-21T09:18:14.9363606Z di_copy = di 2026-02-21T09:18:14.9363752Z mi_copy_0 = mi_copy 2026-02-21T09:18:14.9363922Z di_copy_0 = di_copy 2026-02-21T09:18:14.9364109Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:18:14.9364439Z values = tl.load(x + (indices_0[:, None] * 2176 + indices_2[None, :] * 1), mask_1[None, :], other=0) 2026-02-21T09:18:14.9364785Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:18:14.9365197Z _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16)) 2026-02-21T09:18:14.9365848Z local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16) 2026-02-21T09:18:14.9366105Z # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax) 2026-02-21T09:18:14.9366348Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T09:18:14.9366557Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T09:18:14.9366824Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:18:14.9367059Z v_2 = mi_copy_0 - v_1 2026-02-21T09:18:14.9367234Z v_3 = libdevice.exp(v_2) 2026-02-21T09:18:14.9367413Z v_4 = di_copy_0 * v_3 2026-02-21T09:18:14.9367603Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:18:14.9367819Z subscript = v_1[:, None] 2026-02-21T09:18:14.9367990Z v_5 = tl.cast(values, tl.float32) 2026-02-21T09:18:14.9368257Z v_6 = v_5 - subscript 2026-02-21T09:18:14.9368476Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:18:14.9368752Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:18:14.9368966Z # src[softmax.py:88]: 
).sum(dim=1) 2026-02-21T09:18:14.9369163Z v_7 = libdevice.exp(v_6) 2026-02-21T09:18:14.9369492Z _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32)) 2026-02-21T09:18:14.9369847Z sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32) 2026-02-21T09:18:14.9370052Z di = v_4 + sum_1 2026-02-21T09:18:14.9370210Z # src[softmax.py:89]: mi = mi_next 2026-02-21T09:18:14.9370387Z mi = v_1 2026-02-21T09:18:14.9370584Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:18:14.9370852Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:18:14.9371151Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:18:14.9371644Z for offset_2 in tl.range(0, 2176, _BLOCK_SIZE_1, loop_unroll_factor=2, num_stages=1, disallow_acc_multi_buffer=True, flatten=True): 2026-02-21T09:18:14.9372054Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:18:14.9372282Z mask_2 = indices_2 < 2176 2026-02-21T09:18:14.9372453Z mi_copy_1 = mi 2026-02-21T09:18:14.9372599Z di_copy_1 = di 2026-02-21T09:18:14.9372774Z mi_copy_1_0 = mi_copy_1 2026-02-21T09:18:14.9372943Z di_copy_1_0 = di_copy_1 2026-02-21T09:18:14.9373124Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:18:14.9373438Z values_1 = tl.load(x + (indices_0[:, None] * 2176 + indices_2[None, :] * 1), mask_2[None, :], other=0) 2026-02-21T09:18:14.9373812Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:18:14.9374095Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T09:18:14.9374290Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T09:18:14.9374467Z v_10 = v_9 - subscript_1 2026-02-21T09:18:14.9374640Z v_11 = libdevice.exp(v_10) 2026-02-21T09:18:14.9374815Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T09:18:14.9375000Z v_12 = v_11 / subscript_2 2026-02-21T09:18:14.9375167Z v_13 = tl.cast(v_12, tl.float16) 
2026-02-21T09:18:14.9375437Z tl.store(out + (indices_0[:, None] * 2176 + indices_2[None, :] * 1), v_13, mask_2[None, :]) 2026-02-21T09:18:14.9375648Z 2026-02-21T09:18:14.9375778Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T09:18:14.9376005Z """ 2026-02-21T09:18:14.9376210Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T09:18:14.9376511Z This version uses fewer passes but is less numerically stable. 2026-02-21T09:18:14.9376731Z Args: 2026-02-21T09:18:14.9376892Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T09:18:14.9377158Z Returns: 2026-02-21T09:18:14.9377332Z torch.Tensor: Softmax output tensor of the same shape. 2026-02-21T09:18:14.9377546Z """ 2026-02-21T09:18:14.9377687Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T09:18:14.9377861Z m, n = x.size() 2026-02-21T09:18:14.9378032Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T09:18:14.9378229Z out = torch.empty_like(x) 2026-02-21T09:18:14.9378459Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:18:14.9378768Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:18:14.9379088Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:18:14.9379331Z # src[softmax.py:79-92]: ... 
2026-02-21T09:18:14.9379582Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=1, num_stages=8) 2026-02-21T09:18:14.9379913Z # src[softmax.py:93]: return out 2026-02-21T09:18:14.9380086Z return out 2026-02-21T09:18:15.6558877Z WARNING:tritonbench.utils.triton_op:Completed input ID 15: 2026-02-21T09:18:15.6563304Z (M, N) 2026-02-21T09:18:15.6564814Z ------------ 2026-02-21T09:18:15.6564996Z (4096, 2176) 2026-02-21T09:18:15.6565077Z 2026-02-21T09:18:15.6565522Z 20%|██ | 4/20 [09:20<38:22, 143.88s/it]WARNING:tritonbench.utils.triton_op:Running input ID 20: 2026-02-21T09:18:15.6569238Z (M, N) 2026-02-21T09:18:15.6571342Z ------------ 2026-02-21T09:18:15.6571525Z (4096, 2816) 2026-02-21T09:18:15.6571937Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T09:18:17.0443344Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T09:18:18.5828636Z INFO:tritonbench.utils.triton_op:Took 2.36ms to get benchmark function for torch_compile_softmax 2026-02-21T09:18:19.8082121Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:18:19.8083826Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:18:19.8084105Z 'dtype': 'torch.float16', 2026-02-21T09:18:19.8084330Z 'shape': (4096, 2816), 2026-02-21T09:18:19.8084537Z 'stride': (2816, 1)},), 2026-02-21T09:18:19.8084725Z 'kwargs': {}} 2026-02-21T09:18:19.8100853Z INFO:tritonbench.utils.triton_op:Took 2.19ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T09:18:19.9838357Z [0s] Autotune random seed: 2138408546 2026-02-21T09:18:20.0086892Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:18:52.7239067Z [32s] Timeout after 30s compiling Config(block_sizes=[256, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=32, num_stages=3, 
num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T09:18:53.4617318Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s 2026-02-21T09:18:59.8265820Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 15.7 configs/s 2026-02-21T09:18:59.8268974Z [39s] Adaptive compile timeout: 30s (90% percentile=3.5s, bounds=[30.0s, 30s]) 2026-02-21T09:18:59.8280431Z [39s] Initial random population of 100, 5 starting points: 2026-02-21T09:18:59.8280808Z error=10 2026-02-21T09:18:59.8280960Z timeout=1 2026-02-21T09:18:59.8281137Z ok=89 2026-02-21T09:18:59.8281293Z min=0.0163 2026-02-21T09:18:59.8281458Z mid=0.2415 2026-02-21T09:18:59.8281710Z max=65.2831 2026-02-21T09:18:59.8281907Z best={'block_sizes': [1, 4096], 2026-02-21T09:18:59.8282164Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:18:59.8282431Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:18:59.8282640Z 'num_stages': 5, 2026-02-21T09:18:59.8282797Z 'num_warps': 1, 2026-02-21T09:18:59.8283368Z 'pid_type': 'flat', 2026-02-21T09:18:59.8283543Z 'range_flattens': [None, False], 2026-02-21T09:18:59.8283756Z 'range_multi_buffers': [None, False], 2026-02-21T09:18:59.8283950Z 'range_num_stages': [0, 1], 2026-02-21T09:18:59.8284118Z 'range_unroll_factors': [0, 0], 2026-02-21T09:18:59.8284300Z 'range_warp_specializes': [None, False]} 2026-02-21T09:18:59.8298996Z [39s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:19:01.2690632Z [41s] Generation 1 starting: 87 neighbors, 5 active search path(s) 2026-02-21T09:19:22.1731120Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88/88 1.3 configs/s 2026-02-21T09:19:27.5274011Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 88/88 16.6 configs/s 2026-02-21T09:19:28.0594849Z Generation 1: verifying top configs 100% 
━━━━━━━━━━━━━ 1000/1000 1885.3 2026-02-21T09:19:28.0599162Z configs/s 2026-02-21T09:19:28.1159155Z [68s] Generation 1 complete: 2026-02-21T09:19:28.1163192Z ok=92 2026-02-21T09:19:28.1168213Z min=0.0143 2026-02-21T09:19:28.1172532Z mid=0.0288 2026-02-21T09:19:28.1173826Z max=0.3052 2026-02-21T09:19:28.1174008Z best={'block_sizes': [1, 4096], 2026-02-21T09:19:28.1174256Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:19:28.1174520Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:19:28.1174706Z 'num_stages': 4, 2026-02-21T09:19:28.1174850Z 'num_warps': 4, 2026-02-21T09:19:28.1174986Z 'pid_type': 'flat', 2026-02-21T09:19:28.1175145Z 'range_flattens': [None, False], 2026-02-21T09:19:28.1175322Z 'range_multi_buffers': [None, False], 2026-02-21T09:19:28.1175544Z 'range_num_stages': [0, 1], 2026-02-21T09:19:28.1175721Z 'range_unroll_factors': [0, 0], 2026-02-21T09:19:28.1175900Z 'range_warp_specializes': [None, False]} 2026-02-21T09:19:28.1179726Z [68s] Fitting surrogate: 192 points, 192 targets 2026-02-21T09:19:29.3190374Z [69s] Generation 2 starting: 87 neighbors, 5 active search path(s) 2026-02-21T09:19:37.4572746Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 10.3 configs/s 2026-02-21T09:19:42.8625889Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 16.6 configs/s 2026-02-21T09:19:44.0826998Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 830.5 2026-02-21T09:19:44.0828723Z configs/s 2026-02-21T09:19:44.1887936Z [84s] Generation 2 complete: 2026-02-21T09:19:44.1892326Z error=2 2026-02-21T09:19:44.1894014Z ok=91 2026-02-21T09:19:44.1894223Z min=0.0143 2026-02-21T09:19:44.1894363Z mid=0.0246 2026-02-21T09:19:44.1898138Z max=0.0859 2026-02-21T09:19:44.1902103Z best={'block_sizes': [1, 4096], 2026-02-21T09:19:44.1903650Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:19:44.1903942Z 'load_eviction_policies': ['last', 'last'], 
2026-02-21T09:19:44.1904130Z 'num_stages': 4, 2026-02-21T09:19:44.1904318Z 'num_warps': 2, 2026-02-21T09:19:44.1904475Z 'pid_type': 'flat', 2026-02-21T09:19:44.1904636Z 'range_flattens': [None, True], 2026-02-21T09:19:44.1904814Z 'range_multi_buffers': [None, False], 2026-02-21T09:19:44.1905005Z 'range_num_stages': [0, 1], 2026-02-21T09:19:44.1905166Z 'range_unroll_factors': [0, 0], 2026-02-21T09:19:44.1905347Z 'range_warp_specializes': [None, False]} 2026-02-21T09:19:44.1915074Z [84s] Fitting surrogate: 285 points, 285 targets 2026-02-21T09:19:44.9785122Z [84s] Generation 3 starting: 62 neighbors, 5 active search path(s) 2026-02-21T09:19:53.0929590Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64/64 4.6 configs/s 2026-02-21T09:19:56.6508747Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 64/64 18.2 configs/s 2026-02-21T09:19:58.6649493Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 504.4 2026-02-21T09:19:58.6653174Z configs/s 2026-02-21T09:19:58.8374473Z [98s] Generation 3 complete: 2026-02-21T09:19:58.8375870Z error=6 2026-02-21T09:19:58.8378653Z ok=61 2026-02-21T09:19:58.8378828Z min=0.0143 2026-02-21T09:19:58.8379022Z mid=0.0206 2026-02-21T09:19:58.8379164Z max=0.2008 2026-02-21T09:19:58.8379307Z best={'block_sizes': [1, 4096], 2026-02-21T09:19:58.8384034Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:19:58.8388633Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:19:58.8392067Z 'num_stages': 4, 2026-02-21T09:19:58.8396298Z 'num_warps': 2, 2026-02-21T09:19:58.8397733Z 'pid_type': 'flat', 2026-02-21T09:19:58.8397997Z 'range_flattens': [None, True], 2026-02-21T09:19:58.8402838Z 'range_multi_buffers': [None, False], 2026-02-21T09:19:58.8404710Z 'range_num_stages': [0, 1], 2026-02-21T09:19:58.8404956Z 'range_unroll_factors': [0, 0], 2026-02-21T09:19:58.8405155Z 'range_warp_specializes': [None, False]} 2026-02-21T09:19:58.8405459Z [98s] Fitting surrogate: 352 points, 352 
targets 2026-02-21T09:19:59.5362688Z [99s] Generation 4 starting: 44 neighbors, 4 active search path(s) 2026-02-21T09:20:02.5737972Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47/47 19.1 configs/s 2026-02-21T09:20:05.4237408Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 47/47 16.7 configs/s 2026-02-21T09:20:07.8596595Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 418.5 2026-02-21T09:20:07.8599283Z configs/s 2026-02-21T09:20:08.0720253Z [108s] Generation 4 complete: 2026-02-21T09:20:08.0722234Z ok=49 2026-02-21T09:20:08.0722408Z min=0.0143 2026-02-21T09:20:08.0722538Z mid=0.0184 2026-02-21T09:20:08.0722665Z max=0.0246 2026-02-21T09:20:08.0722801Z best={'block_sizes': [1, 4096], 2026-02-21T09:20:08.0723058Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:20:08.0723323Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:20:08.0723507Z 'num_stages': 4, 2026-02-21T09:20:08.0723679Z 'num_warps': 2, 2026-02-21T09:20:08.0723838Z 'pid_type': 'flat', 2026-02-21T09:20:08.0724001Z 'range_flattens': [None, True], 2026-02-21T09:20:08.0724175Z 'range_multi_buffers': [None, False], 2026-02-21T09:20:08.0724364Z 'range_num_stages': [0, 1], 2026-02-21T09:20:08.0724524Z 'range_unroll_factors': [0, 0], 2026-02-21T09:20:08.0724711Z 'range_warp_specializes': [None, False]} 2026-02-21T09:20:08.0735415Z [108s] Fitting surrogate: 401 points, 401 targets 2026-02-21T09:20:08.6899161Z [108s] Generation 5 starting: 35 neighbors, 3 active search path(s) 2026-02-21T09:20:11.7622981Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 9.0 configs/s 2026-02-21T09:20:14.0299028Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 37/37 16.6 configs/s 2026-02-21T09:20:16.0217475Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 510.9 2026-02-21T09:20:16.0222780Z configs/s 2026-02-21T09:20:16.1911917Z [116s] Generation 5 complete: 2026-02-21T09:20:16.1915968Z ok=39 
2026-02-21T09:20:16.1919806Z min=0.0143 2026-02-21T09:20:16.1923768Z mid=0.0164 2026-02-21T09:20:16.1928206Z max=0.0287 2026-02-21T09:20:16.1929690Z best={'block_sizes': [1, 4096], 2026-02-21T09:20:16.1929958Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:20:16.1930199Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:20:16.1930391Z 'num_stages': 8, 2026-02-21T09:20:16.1930532Z 'num_warps': 2, 2026-02-21T09:20:16.1930680Z 'pid_type': 'flat', 2026-02-21T09:20:16.1930834Z 'range_flattens': [None, False], 2026-02-21T09:20:16.1931020Z 'range_multi_buffers': [None, False], 2026-02-21T09:20:16.1931201Z 'range_num_stages': [0, 3], 2026-02-21T09:20:16.1931374Z 'range_unroll_factors': [0, 1], 2026-02-21T09:20:16.1931641Z 'range_warp_specializes': [None, False]} 2026-02-21T09:20:16.1931876Z [116s] Fitting surrogate: 440 points, 440 targets 2026-02-21T09:20:16.6053039Z [116s] Generation 6 starting: 20 neighbors, 2 active search path(s) 2026-02-21T09:20:47.0660654Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 0.3 configs/s 2026-02-21T09:20:48.3686608Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 16.7 configs/s 2026-02-21T09:20:49.4054545Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1356.1 2026-02-21T09:20:49.4054954Z configs/s 2026-02-21T09:20:49.4792893Z [149s] Generation 6 complete: 2026-02-21T09:20:49.4797875Z ok=22 2026-02-21T09:20:49.4802364Z min=0.0143 2026-02-21T09:20:49.4803815Z mid=0.0184 2026-02-21T09:20:49.4804009Z max=0.1003 2026-02-21T09:20:49.4804189Z best={'block_sizes': [1, 4096], 2026-02-21T09:20:49.4804448Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:20:49.4804694Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:20:49.4804894Z 'num_stages': 8, 2026-02-21T09:20:49.4805039Z 'num_warps': 2, 2026-02-21T09:20:49.4805240Z 'pid_type': 'flat', 2026-02-21T09:20:49.4805402Z 'range_flattens': [None, False], 2026-02-21T09:20:49.4805583Z 
'range_multi_buffers': [None, False], 2026-02-21T09:20:49.4805772Z 'range_num_stages': [0, 3], 2026-02-21T09:20:49.4805938Z 'range_unroll_factors': [0, 1], 2026-02-21T09:20:49.4806121Z 'range_warp_specializes': [None, False]} 2026-02-21T09:20:49.4806339Z [149s] Fitting surrogate: 462 points, 462 targets 2026-02-21T09:20:49.7533094Z [149s] Generation 7 starting: 11 neighbors, 1 active search path(s) 2026-02-21T09:20:51.6623482Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 5.0 configs/s 2026-02-21T09:20:52.3409488Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 11/11 17.4 configs/s 2026-02-21T09:20:53.0279242Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1464.7 2026-02-21T09:20:53.0287515Z configs/s 2026-02-21T09:20:53.0942922Z [153s] Generation 7 complete: 2026-02-21T09:20:53.0947852Z ok=13 2026-02-21T09:20:53.0952768Z min=0.0143 2026-02-21T09:20:53.0954390Z mid=0.0162 2026-02-21T09:20:53.0954603Z max=0.0225 2026-02-21T09:20:53.0959859Z best={'block_sizes': [1, 4096], 2026-02-21T09:20:53.0964338Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:20:53.0968680Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:20:53.0973584Z 'num_stages': 8, 2026-02-21T09:20:53.0977858Z 'num_warps': 2, 2026-02-21T09:20:53.0981789Z 'pid_type': 'flat', 2026-02-21T09:20:53.0982069Z 'range_flattens': [None, False], 2026-02-21T09:20:53.0982305Z 'range_multi_buffers': [None, False], 2026-02-21T09:20:53.0986314Z 'range_num_stages': [0, 3], 2026-02-21T09:20:53.0990069Z 'range_unroll_factors': [0, 1], 2026-02-21T09:20:53.0995002Z 'range_warp_specializes': [None, False]} 2026-02-21T09:20:53.0999312Z [153s] Fitting surrogate: 475 points, 475 targets 2026-02-21T09:20:53.3937427Z [153s] Generation 8 starting: 12 neighbors, 1 active search path(s) 2026-02-21T09:20:54.7390409Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 9.2 configs/s 2026-02-21T09:20:55.4741238Z Generation 8: exploring 
neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 12/12 17.4 configs/s 2026-02-21T09:20:56.2692095Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1268.8 2026-02-21T09:20:56.2692599Z configs/s 2026-02-21T09:20:56.3449272Z [156s] Generation 8 complete: 2026-02-21T09:20:56.3450977Z ok=14 2026-02-21T09:20:56.3451148Z min=0.0143 2026-02-21T09:20:56.3451279Z mid=0.0143 2026-02-21T09:20:56.3451411Z max=0.0205 2026-02-21T09:20:56.3451611Z best={'block_sizes': [1, 4096], 2026-02-21T09:20:56.3451864Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:20:56.3452149Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:20:56.3452347Z 'num_stages': 8, 2026-02-21T09:20:56.3452513Z 'num_warps': 2, 2026-02-21T09:20:56.3452987Z 'pid_type': 'flat', 2026-02-21T09:20:56.3453190Z 'range_flattens': [None, False], 2026-02-21T09:20:56.3453378Z 'range_multi_buffers': [None, False], 2026-02-21T09:20:56.3453577Z 'range_num_stages': [0, 3], 2026-02-21T09:20:56.3453745Z 'range_unroll_factors': [0, 1], 2026-02-21T09:20:56.3453933Z 'range_warp_specializes': [None, False]} 2026-02-21T09:20:56.3465500Z [156s] Fitting surrogate: 489 points, 489 targets 2026-02-21T09:20:56.6299205Z [156s] Generation 9 starting: 7 neighbors, 1 active search path(s) 2026-02-21T09:20:57.1363307Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7/7 38.6 configs/s 2026-02-21T09:20:57.5642006Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━━ 7/7 18.4 configs/s 2026-02-21T09:20:58.0703318Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1976.0 2026-02-21T09:20:58.0703742Z configs/s 2026-02-21T09:20:58.1216982Z [158s] Generation 9 complete: 2026-02-21T09:20:58.1220303Z ok=9 2026-02-21T09:20:58.1223137Z min=0.0143 2026-02-21T09:20:58.1227430Z mid=0.0143 2026-02-21T09:20:58.1230900Z max=0.0184 2026-02-21T09:20:58.1233400Z best={'block_sizes': [1, 4096], 2026-02-21T09:20:58.1238641Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 
2026-02-21T09:20:58.1240013Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:20:58.1240230Z 'num_stages': 8, 2026-02-21T09:20:58.1240378Z 'num_warps': 2, 2026-02-21T09:20:58.1240519Z 'pid_type': 'flat', 2026-02-21T09:20:58.1240684Z 'range_flattens': [None, False], 2026-02-21T09:20:58.1240862Z 'range_multi_buffers': [None, False], 2026-02-21T09:20:58.1241050Z 'range_num_stages': [0, 3], 2026-02-21T09:20:58.1241215Z 'range_unroll_factors': [0, 1], 2026-02-21T09:20:58.1241391Z 'range_warp_specializes': [None, False]} 2026-02-21T09:20:58.1241673Z [158s] Fitting surrogate: 498 points, 498 targets 2026-02-21T09:20:58.3076784Z [158s] Autotuning complete in 158.3s after searching 475 configs. 2026-02-21T09:20:58.3077169Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:20:58.3078086Z @helion.kernel(config=helion.Config(block_sizes=[1, 4096], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=8, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T09:20:58.3078891Z 2026-02-21T09:20:58.3079141Z [158s] Code of selected kernel: /tmp/torchinductor_root/nn/cnn3xdqwqruzw2ye3vyccovtcwlzrcws36zg4c6mxkzxps7xq3ie.py 2026-02-21T09:20:58.3307506Z from __future__ import annotations 2026-02-21T09:20:58.3307731Z 2026-02-21T09:20:58.3312597Z import torch 2026-02-21T09:20:58.3314673Z import triton 2026-02-21T09:20:58.3314901Z import triton.language as tl 2026-02-21T09:20:58.3318988Z from torch._inductor.runtime import triton_helpers 2026-02-21T09:20:58.3322988Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T09:20:58.3324503Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:20:58.3324712Z 2026-02-21T09:20:58.3324788Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T09:20:58.3324982Z 
_BLOCK_SIZE_1 = tl.constexpr(4096) 2026-02-21T09:20:58.3325098Z 2026-02-21T09:20:58.3325169Z @triton.jit 2026-02-21T09:20:58.3325324Z def _helion_softmax_two_pass(x, out): 2026-02-21T09:20:58.3325585Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:20:58.3325833Z pid_0 = tl.program_id(0) 2026-02-21T09:20:58.3326002Z offset_0 = pid_0 2026-02-21T09:20:58.3326171Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T09:20:58.3326454Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:20:58.3326752Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T09:20:58.3327016Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:20:58.3327564Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T09:20:58.3327838Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:20:58.3328114Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:20:58.3328358Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:20:58.3328592Z # src[softmax.py:82-89]: ... 
2026-02-21T09:20:58.3329000Z for offset_2 in tl.range(0, 2816, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=False, num_stages=3, disallow_acc_multi_buffer=True, flatten=False): 2026-02-21T09:20:58.3329454Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:20:58.3329690Z mask_1 = indices_2 < 2816 2026-02-21T09:20:58.3329852Z mi_copy = mi 2026-02-21T09:20:58.3329996Z di_copy = di 2026-02-21T09:20:58.3330136Z mi_copy_0 = mi_copy 2026-02-21T09:20:58.3330293Z di_copy_0 = di_copy 2026-02-21T09:20:58.3330474Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:20:58.3330843Z values = tl.load(x + (indices_0[:, None] * 2816 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_last') 2026-02-21T09:20:58.3331229Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:20:58.3331861Z _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16)) 2026-02-21T09:20:58.3332265Z local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16) 2026-02-21T09:20:58.3332519Z # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax) 2026-02-21T09:20:58.3332757Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T09:20:58.3332968Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T09:20:58.3333222Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:20:58.3333463Z v_2 = mi_copy_0 - v_1 2026-02-21T09:20:58.3333632Z v_3 = libdevice.exp(v_2) 2026-02-21T09:20:58.3333804Z v_4 = di_copy_0 * v_3 2026-02-21T09:20:58.3333984Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:20:58.3334187Z subscript = v_1[:, None] 2026-02-21T09:20:58.3334364Z v_5 = tl.cast(values, tl.float32) 2026-02-21T09:20:58.3334538Z v_6 = v_5 - subscript 2026-02-21T09:20:58.3334749Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:20:58.3335006Z # src[softmax.py:87]: values - mi_next[:, None] 
2026-02-21T09:20:58.3335224Z # src[softmax.py:88]: ).sum(dim=1) 2026-02-21T09:20:58.3335404Z v_7 = libdevice.exp(v_6) 2026-02-21T09:20:58.3335727Z _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32)) 2026-02-21T09:20:58.3336076Z sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32) 2026-02-21T09:20:58.3336277Z di = v_4 + sum_1 2026-02-21T09:20:58.3336444Z # src[softmax.py:89]: mi = mi_next 2026-02-21T09:20:58.3336705Z mi = v_1 2026-02-21T09:20:58.3336916Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:20:58.3337186Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:20:58.3337527Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:20:58.3338045Z for offset_2 in tl.range(0, 2816, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=False, num_stages=3, disallow_acc_multi_buffer=True, flatten=False): 2026-02-21T09:20:58.3338509Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:20:58.3338747Z mask_2 = indices_2 < 2816 2026-02-21T09:20:58.3338909Z mi_copy_1 = mi 2026-02-21T09:20:58.3339059Z di_copy_1 = di 2026-02-21T09:20:58.3339207Z mi_copy_1_0 = mi_copy_1 2026-02-21T09:20:58.3339375Z di_copy_1_0 = di_copy_1 2026-02-21T09:20:58.3339634Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:20:58.3339995Z values_1 = tl.load(x + (indices_0[:, None] * 2816 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_last') 2026-02-21T09:20:58.3340421Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:20:58.3340698Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T09:20:58.3340895Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T09:20:58.3341075Z v_10 = v_9 - subscript_1 2026-02-21T09:20:58.3341250Z v_11 = libdevice.exp(v_10) 2026-02-21T09:20:58.3341433Z subscript_2 = di_copy_1_0[:, None] 
2026-02-21T09:20:58.3341649Z v_12 = v_11 / subscript_2 2026-02-21T09:20:58.3341826Z v_13 = tl.cast(v_12, tl.float16) 2026-02-21T09:20:58.3342089Z tl.store(out + (indices_0[:, None] * 2816 + indices_2[None, :] * 1), v_13, mask_2[None, :]) 2026-02-21T09:20:58.3342303Z 2026-02-21T09:20:58.3342433Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T09:20:58.3342662Z """ 2026-02-21T09:20:58.3342869Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T09:20:58.3343177Z This version uses fewer passes but is less numerically stable. 2026-02-21T09:20:58.3343391Z Args: 2026-02-21T09:20:58.3343555Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T09:20:58.3343748Z Returns: 2026-02-21T09:20:58.3343932Z torch.Tensor: Softmax output tensor of the same shape. 2026-02-21T09:20:58.3344135Z """ 2026-02-21T09:20:58.3344277Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T09:20:58.3344450Z m, n = x.size() 2026-02-21T09:20:58.3344620Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T09:20:58.3344820Z out = torch.empty_like(x) 2026-02-21T09:20:58.3345034Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:20:58.3345351Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:20:58.3345655Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:20:58.3345888Z # src[softmax.py:79-92]: ... 
2026-02-21T09:20:58.3346132Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=2, num_stages=8) 2026-02-21T09:20:58.3346401Z # src[softmax.py:93]: return out 2026-02-21T09:20:58.3346570Z return out 2026-02-21T09:20:59.1741499Z WARNING:tritonbench.utils.triton_op:Completed input ID 20: 2026-02-21T09:20:59.1745798Z (M, N) 2026-02-21T09:20:59.1747239Z ------------ 2026-02-21T09:20:59.1747412Z (4096, 2816) 2026-02-21T09:20:59.1747500Z 2026-02-21T09:20:59.1753773Z 25%|██▌ | 5/20 [12:04<37:44, 150.96s/it]WARNING:tritonbench.utils.triton_op:Running input ID 26: 2026-02-21T09:20:59.1755204Z (M, N) 2026-02-21T09:20:59.1755367Z ------------ 2026-02-21T09:20:59.1755516Z (4096, 3584) 2026-02-21T09:20:59.1755876Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T09:21:00.5442325Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T09:21:02.1035269Z INFO:tritonbench.utils.triton_op:Took 2.25ms to get benchmark function for torch_compile_softmax 2026-02-21T09:21:03.3692562Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:21:03.3696105Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:21:03.3700547Z 'dtype': 'torch.float16', 2026-02-21T09:21:03.3702248Z 'shape': (4096, 3584), 2026-02-21T09:21:03.3702517Z 'stride': (3584, 1)},), 2026-02-21T09:21:03.3706658Z 'kwargs': {}} 2026-02-21T09:21:03.3713961Z INFO:tritonbench.utils.triton_op:Took 2.28ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T09:21:03.5449274Z [0s] Autotune random seed: 2138408546 2026-02-21T09:21:03.5697677Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:21:36.0581636Z [32s] Timeout after 30s compiling Config(block_sizes=[256, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=32, num_stages=3, 
num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T09:21:36.8746842Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s 2026-02-21T09:21:43.5669403Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.9 configs/s 2026-02-21T09:21:43.5690033Z [39s] Adaptive compile timeout: 30s (90% percentile=4.1s, bounds=[30.0s, 30s]) 2026-02-21T09:21:43.6875233Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 7881.2 configs/s 2026-02-21T09:21:43.7119083Z [40s] Initial random population of 100, 5 starting points: 2026-02-21T09:21:43.7120952Z error=10 2026-02-21T09:21:43.7121177Z timeout=1 2026-02-21T09:21:43.7121368Z ok=89 2026-02-21T09:21:43.7121719Z min=0.0205 2026-02-21T09:21:43.7121895Z mid=0.3227 2026-02-21T09:21:43.7122072Z max=82.3716 2026-02-21T09:21:43.7122264Z best={'block_sizes': [1, 4096], 2026-02-21T09:21:43.7122575Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:21:43.7122894Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:21:43.7123105Z 'num_stages': 5, 2026-02-21T09:21:43.7123258Z 'num_warps': 1, 2026-02-21T09:21:43.7123417Z 'pid_type': 'flat', 2026-02-21T09:21:43.7123584Z 'range_flattens': [None, False], 2026-02-21T09:21:43.7123818Z 'range_multi_buffers': [None, False], 2026-02-21T09:21:43.7124020Z 'range_num_stages': [0, 1], 2026-02-21T09:21:43.7124209Z 'range_unroll_factors': [0, 0], 2026-02-21T09:21:43.7124404Z 'range_warp_specializes': [None, False]} 2026-02-21T09:21:43.7128833Z [40s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:21:45.2526608Z [41s] Generation 1 starting: 95 neighbors, 5 active search path(s) 2026-02-21T09:22:16.2279006Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99/99 0.9 configs/s 2026-02-21T09:22:22.2749839Z Generation 1: exploring neighbors 100% 
━━━━━━━━━━━━━━━━━━━━ 99/99 16.5 configs/s 2026-02-21T09:22:23.7496404Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 688.5 2026-02-21T09:22:23.7500580Z configs/s 2026-02-21T09:22:23.8641865Z [80s] Generation 1 complete: 2026-02-21T09:22:23.8644359Z ok=101 2026-02-21T09:22:23.8649485Z min=0.0184 2026-02-21T09:22:23.8649748Z mid=0.0328 2026-02-21T09:22:23.8649922Z max=0.4095 2026-02-21T09:22:23.8650102Z best={'block_sizes': [1, 4096], 2026-02-21T09:22:23.8650382Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:22:23.8650684Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:22:23.8655597Z 'num_stages': 5, 2026-02-21T09:22:23.8657465Z 'num_warps': 2, 2026-02-21T09:22:23.8657699Z 'pid_type': 'flat', 2026-02-21T09:22:23.8658284Z 'range_flattens': [None, None], 2026-02-21T09:22:23.8658472Z 'range_multi_buffers': [None, False], 2026-02-21T09:22:23.8658671Z 'range_num_stages': [0, 1], 2026-02-21T09:22:23.8658837Z 'range_unroll_factors': [0, 1], 2026-02-21T09:22:23.8659042Z 'range_warp_specializes': [None, False]} 2026-02-21T09:22:23.8671256Z [80s] Fitting surrogate: 201 points, 201 targets 2026-02-21T09:22:24.9737212Z [81s] Generation 2 starting: 83 neighbors, 5 active search path(s) 2026-02-21T09:22:33.9571279Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 6.0 configs/s 2026-02-21T09:22:39.1416547Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 16.5 configs/s 2026-02-21T09:22:43.2170540Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 269.5 2026-02-21T09:22:43.2174515Z configs/s 2026-02-21T09:22:43.4844282Z [99s] Generation 2 complete: 2026-02-21T09:22:43.4848995Z ok=88 2026-02-21T09:22:43.4850204Z min=0.0184 2026-02-21T09:22:43.4850411Z mid=0.0284 2026-02-21T09:22:43.4850544Z max=0.1289 2026-02-21T09:22:43.4854444Z best={'block_sizes': [1, 4096], 2026-02-21T09:22:43.4858513Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 
2026-02-21T09:22:43.4860036Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:22:43.4860301Z 'num_stages': 5, 2026-02-21T09:22:43.4862589Z 'num_warps': 4, 2026-02-21T09:22:43.4862783Z 'pid_type': 'flat', 2026-02-21T09:22:43.4862961Z 'range_flattens': [None, None], 2026-02-21T09:22:43.4863151Z 'range_multi_buffers': [None, None], 2026-02-21T09:22:43.4863349Z 'range_num_stages': [0, 1], 2026-02-21T09:22:43.4863519Z 'range_unroll_factors': [0, 0], 2026-02-21T09:22:43.4863711Z 'range_warp_specializes': [None, False]} 2026-02-21T09:22:43.4864010Z [99s] Fitting surrogate: 289 points, 289 targets 2026-02-21T09:22:44.5739688Z [101s] Generation 3 starting: 80 neighbors, 5 active search path(s) 2026-02-21T09:22:53.1328670Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 5.4 configs/s 2026-02-21T09:22:58.2464200Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 16.6 configs/s 2026-02-21T09:23:02.0314753Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 268.8 2026-02-21T09:23:02.0316095Z configs/s 2026-02-21T09:23:02.3292290Z [118s] Generation 3 complete: 2026-02-21T09:23:02.3298803Z ok=86 2026-02-21T09:23:02.3302264Z min=0.0184 2026-02-21T09:23:02.3306071Z mid=0.0247 2026-02-21T09:23:02.3309934Z max=0.0984 2026-02-21T09:23:02.3314570Z best={'block_sizes': [1, 4096], 2026-02-21T09:23:02.3318232Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:23:02.3319447Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:23:02.3319646Z 'num_stages': 5, 2026-02-21T09:23:02.3319813Z 'num_warps': 4, 2026-02-21T09:23:02.3325378Z 'pid_type': 'flat', 2026-02-21T09:23:02.3329221Z 'range_flattens': [None, None], 2026-02-21T09:23:02.3330782Z 'range_multi_buffers': [None, None], 2026-02-21T09:23:02.3331069Z 'range_num_stages': [0, 1], 2026-02-21T09:23:02.3335960Z 'range_unroll_factors': [0, 0], 2026-02-21T09:23:02.3340229Z 'range_warp_specializes': [None, False]} 2026-02-21T09:23:02.3341961Z [118s] 
Fitting surrogate: 375 points, 375 targets 2026-02-21T09:23:03.2583784Z [119s] Generation 4 starting: 64 neighbors, 5 active search path(s) 2026-02-21T09:23:08.5678956Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 5.2 configs/s 2026-02-21T09:23:12.8793365Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 15.2 configs/s 2026-02-21T09:23:16.9722642Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 249.0 2026-02-21T09:23:16.9724145Z configs/s 2026-02-21T09:23:17.3022154Z [133s] Generation 4 complete: 2026-02-21T09:23:17.3026290Z ok=70 2026-02-21T09:23:17.3029716Z min=0.0184 2026-02-21T09:23:17.3031069Z mid=0.0225 2026-02-21T09:23:17.3031226Z max=0.0572 2026-02-21T09:23:17.3031374Z best={'block_sizes': [1, 4096], 2026-02-21T09:23:17.3031810Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:23:17.3032083Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:23:17.3032266Z 'num_stages': 5, 2026-02-21T09:23:17.3032412Z 'num_warps': 4, 2026-02-21T09:23:17.3032550Z 'pid_type': 'flat', 2026-02-21T09:23:17.3032715Z 'range_flattens': [None, False], 2026-02-21T09:23:17.3032895Z 'range_multi_buffers': [None, True], 2026-02-21T09:23:17.3033084Z 'range_num_stages': [0, 1], 2026-02-21T09:23:17.3033253Z 'range_unroll_factors': [0, 0], 2026-02-21T09:23:17.3033431Z 'range_warp_specializes': [None, False]} 2026-02-21T09:23:17.3040285Z [133s] Fitting surrogate: 445 points, 445 targets 2026-02-21T09:23:18.1269144Z [134s] Generation 5 starting: 54 neighbors, 5 active search path(s) 2026-02-21T09:23:22.1827512Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56/56 8.4 configs/s 2026-02-21T09:23:25.5892685Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 56/56 16.6 configs/s 2026-02-21T09:23:29.1866501Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 307.9 2026-02-21T09:23:29.1867323Z configs/s 2026-02-21T09:23:29.4498439Z [145s] Generation 5 complete: 
2026-02-21T09:23:29.4503327Z ok=59 2026-02-21T09:23:29.4507675Z min=0.0184 2026-02-21T09:23:29.4512115Z mid=0.0184 2026-02-21T09:23:29.4516498Z max=0.0573 2026-02-21T09:23:29.4521453Z best={'block_sizes': [1, 4096], 2026-02-21T09:23:29.4525786Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:23:29.4526981Z 'load_eviction_policies': ['', ''], 2026-02-21T09:23:29.4527196Z 'maxnreg': 64, 2026-02-21T09:23:29.4527358Z 'num_sm_multiplier': 16, 2026-02-21T09:23:29.4527520Z 'num_stages': 3, 2026-02-21T09:23:29.4527697Z 'num_warps': 1, 2026-02-21T09:23:29.4527875Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:23:29.4528076Z 'range_flattens': [None, None], 2026-02-21T09:23:29.4528251Z 'range_multi_buffers': [True, None], 2026-02-21T09:23:29.4528442Z 'range_num_stages': [4, 0], 2026-02-21T09:23:29.4528604Z 'range_unroll_factors': [0, 1], 2026-02-21T09:23:29.4528784Z 'range_warp_specializes': [True, None]} 2026-02-21T09:23:29.4529001Z [145s] Fitting surrogate: 504 points, 504 targets 2026-02-21T09:23:30.3020418Z [146s] Generation 6 starting: 53 neighbors, 4 active search path(s) 2026-02-21T09:23:49.1764097Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56/56 0.9 configs/s 2026-02-21T09:23:52.5177138Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 56/56 17.0 configs/s 2026-02-21T09:23:55.0053712Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 409.4 2026-02-21T09:23:55.0057241Z configs/s 2026-02-21T09:23:55.2058578Z [171s] Generation 6 complete: 2026-02-21T09:23:55.2062242Z ok=57 2026-02-21T09:23:55.2066538Z min=0.0184 2026-02-21T09:23:55.2067956Z mid=0.0185 2026-02-21T09:23:55.2068130Z max=0.5161 2026-02-21T09:23:55.2068275Z best={'block_sizes': [1, 4096], 2026-02-21T09:23:55.2068537Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:23:55.2068789Z 'load_eviction_policies': ['', ''], 2026-02-21T09:23:55.2068979Z 'maxnreg': 64, 2026-02-21T09:23:55.2069139Z 
'num_sm_multiplier': 16, 2026-02-21T09:23:55.2069302Z 'num_stages': 3, 2026-02-21T09:23:55.2069448Z 'num_warps': 1, 2026-02-21T09:23:55.2069607Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:23:55.2069806Z 'range_flattens': [None, None], 2026-02-21T09:23:55.2069986Z 'range_multi_buffers': [True, None], 2026-02-21T09:23:55.2070179Z 'range_num_stages': [4, 0], 2026-02-21T09:23:55.2070349Z 'range_unroll_factors': [0, 1], 2026-02-21T09:23:55.2077641Z 'range_warp_specializes': [True, None]} 2026-02-21T09:23:55.2080795Z [171s] Fitting surrogate: 561 points, 561 targets 2026-02-21T09:23:55.9880090Z [172s] Generation 7 starting: 41 neighbors, 3 active search path(s) 2026-02-21T09:23:58.9578985Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41/41 19.5 configs/s 2026-02-21T09:24:01.4216505Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 16.9 configs/s 2026-02-21T09:24:04.1204317Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 427.6 2026-02-21T09:24:04.1208128Z configs/s 2026-02-21T09:24:04.3258719Z [180s] Generation 7 complete: 2026-02-21T09:24:04.3263044Z ok=45 2026-02-21T09:24:04.3264589Z min=0.0165 2026-02-21T09:24:04.3264795Z mid=0.0184 2026-02-21T09:24:04.3269684Z max=0.0880 2026-02-21T09:24:04.3273509Z best={'block_sizes': [1, 4096], 2026-02-21T09:24:04.3278223Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T09:24:04.3281868Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:24:04.3285353Z 'num_stages': 8, 2026-02-21T09:24:04.3289115Z 'num_warps': 2, 2026-02-21T09:24:04.3290478Z 'pid_type': 'flat', 2026-02-21T09:24:04.3290711Z 'range_flattens': [None, None], 2026-02-21T09:24:04.3290905Z 'range_multi_buffers': [None, None], 2026-02-21T09:24:04.3291105Z 'range_num_stages': [0, 2], 2026-02-21T09:24:04.3291277Z 'range_unroll_factors': [0, 0], 2026-02-21T09:24:04.3291471Z 'range_warp_specializes': [None, False]} 2026-02-21T09:24:04.3291837Z [180s] Fitting surrogate: 606 points, 606 
targets 2026-02-21T09:24:04.8592792Z [181s] Generation 8 starting: 19 neighbors, 2 active search path(s) 2026-02-21T09:24:06.4489716Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 22.4 configs/s 2026-02-21T09:24:07.6207930Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 19/19 16.8 configs/s 2026-02-21T09:24:08.8975294Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 792.9 2026-02-21T09:24:08.8978384Z configs/s 2026-02-21T09:24:09.0096936Z [185s] Generation 8 complete: 2026-02-21T09:24:09.0101160Z ok=21 2026-02-21T09:24:09.0102640Z min=0.0164 2026-02-21T09:24:09.0102825Z mid=0.0184 2026-02-21T09:24:09.0102952Z max=0.0245 2026-02-21T09:24:09.0103112Z best={'block_sizes': [1, 4096], 2026-02-21T09:24:09.0103340Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:24:09.0103562Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:24:09.0103758Z 'num_stages': 8, 2026-02-21T09:24:09.0103901Z 'num_warps': 2, 2026-02-21T09:24:09.0104052Z 'pid_type': 'flat', 2026-02-21T09:24:09.0104210Z 'range_flattens': [None, None], 2026-02-21T09:24:09.0104397Z 'range_multi_buffers': [None, None], 2026-02-21T09:24:09.0104581Z 'range_num_stages': [0, 2], 2026-02-21T09:24:09.0104755Z 'range_unroll_factors': [0, 0], 2026-02-21T09:24:09.0104936Z 'range_warp_specializes': [None, False]} 2026-02-21T09:24:09.0117382Z [185s] Fitting surrogate: 627 points, 627 targets 2026-02-21T09:24:09.5656944Z [185s] Generation 9 starting: 23 neighbors, 2 active search path(s) 2026-02-21T09:24:11.0508916Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 43.4 configs/s 2026-02-21T09:24:12.4511401Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 16.9 configs/s 2026-02-21T09:24:13.9012572Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 699.1 2026-02-21T09:24:13.9015958Z configs/s 2026-02-21T09:24:14.0193088Z [190s] Generation 9 complete: 2026-02-21T09:24:14.0197388Z ok=25 
2026-02-21T09:24:14.0201976Z min=0.0164 2026-02-21T09:24:14.0205716Z mid=0.0184 2026-02-21T09:24:14.0207623Z max=0.0307 2026-02-21T09:24:14.0207806Z best={'block_sizes': [1, 4096], 2026-02-21T09:24:14.0208038Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:24:14.0208276Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:24:14.0208847Z 'num_stages': 8, 2026-02-21T09:24:14.0209044Z 'num_warps': 2, 2026-02-21T09:24:14.0209191Z 'pid_type': 'flat', 2026-02-21T09:24:14.0209362Z 'range_flattens': [None, None], 2026-02-21T09:24:14.0209546Z 'range_multi_buffers': [None, None], 2026-02-21T09:24:14.0209740Z 'range_num_stages': [0, 1], 2026-02-21T09:24:14.0209905Z 'range_unroll_factors': [0, 0], 2026-02-21T09:24:14.0210096Z 'range_warp_specializes': [None, False]} 2026-02-21T09:24:14.0215371Z [190s] Fitting surrogate: 652 points, 652 targets 2026-02-21T09:24:14.5417730Z [190s] Generation 10 starting: 22 neighbors, 2 active search path(s) 2026-02-21T09:24:16.2879475Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 14.4 configs/s 2026-02-21T09:24:17.6200976Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 22/22 17.1 configs/s 2026-02-21T09:24:19.0120032Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 728.7 2026-02-21T09:24:19.0124246Z configs/s 2026-02-21T09:24:19.1327326Z [195s] Generation 10 complete: 2026-02-21T09:24:19.1329105Z ok=24 2026-02-21T09:24:19.1329332Z min=0.0174 2026-02-21T09:24:19.1329515Z mid=0.0184 2026-02-21T09:24:19.1329678Z max=0.0266 2026-02-21T09:24:19.1329864Z best={'block_sizes': [1, 4096], 2026-02-21T09:24:19.1330099Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:24:19.1330367Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:24:19.1330589Z 'num_stages': 8, 2026-02-21T09:24:19.1330746Z 'num_warps': 2, 2026-02-21T09:24:19.1330926Z 'pid_type': 'flat', 2026-02-21T09:24:19.1331105Z 'range_flattens': [None, None], 2026-02-21T09:24:19.1331334Z 
'range_multi_buffers': [None, False], 2026-02-21T09:24:19.1331810Z 'range_num_stages': [0, 1], 2026-02-21T09:24:19.1332022Z 'range_unroll_factors': [0, 0], 2026-02-21T09:24:19.1332250Z 'range_warp_specializes': [None, False]} 2026-02-21T09:24:19.1346160Z [195s] Fitting surrogate: 676 points, 676 targets 2026-02-21T09:24:19.6746480Z [196s] Generation 11 starting: 20 neighbors, 2 active search path(s) 2026-02-21T09:24:21.1345705Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 15.0 configs/s 2026-02-21T09:24:22.3406089Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 17.2 configs/s 2026-02-21T09:24:23.6700644Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 762.5 2026-02-21T09:24:23.6704447Z configs/s 2026-02-21T09:24:23.7848210Z [200s] Generation 11 complete: 2026-02-21T09:24:23.7849423Z ok=22 2026-02-21T09:24:23.7849610Z min=0.0164 2026-02-21T09:24:23.7849748Z mid=0.0184 2026-02-21T09:24:23.7849883Z max=0.0205 2026-02-21T09:24:23.7850029Z best={'block_sizes': [1, 4096], 2026-02-21T09:24:23.7850277Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:24:23.7850530Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:24:23.7850745Z 'num_stages': 8, 2026-02-21T09:24:23.7851361Z 'num_warps': 2, 2026-02-21T09:24:23.7851635Z 'pid_type': 'flat', 2026-02-21T09:24:23.7851815Z 'range_flattens': [None, None], 2026-02-21T09:24:23.7852000Z 'range_multi_buffers': [None, False], 2026-02-21T09:24:23.7852204Z 'range_num_stages': [0, 1], 2026-02-21T09:24:23.7852375Z 'range_unroll_factors': [0, 0], 2026-02-21T09:24:23.7852582Z 'range_warp_specializes': [None, False]} 2026-02-21T09:24:23.7869831Z [200s] Fitting surrogate: 698 points, 698 targets 2026-02-21T09:24:24.1868283Z [200s] Generation 12 starting: 11 neighbors, 1 active search path(s) 2026-02-21T09:24:25.0279482Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 22.6 configs/s 2026-02-21T09:24:25.6961916Z Generation 12: exploring 
neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 17.7 configs/s 2026-02-21T09:24:26.7628909Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1406.2 2026-02-21T09:24:26.7629248Z configs/s 2026-02-21T09:24:26.8301992Z [203s] Generation 12 complete: 2026-02-21T09:24:26.8306206Z ok=12 2026-02-21T09:24:26.8307687Z min=0.0164 2026-02-21T09:24:26.8307845Z mid=0.0184 2026-02-21T09:24:26.8307976Z max=0.0184 2026-02-21T09:24:26.8308113Z best={'block_sizes': [1, 4096], 2026-02-21T09:24:26.8308349Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:24:26.8308589Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:24:26.8308785Z 'num_stages': 8, 2026-02-21T09:24:26.8308924Z 'num_warps': 2, 2026-02-21T09:24:26.8309072Z 'pid_type': 'flat', 2026-02-21T09:24:26.8309234Z 'range_flattens': [None, None], 2026-02-21T09:24:26.8309408Z 'range_multi_buffers': [None, False], 2026-02-21T09:24:26.8309595Z 'range_num_stages': [0, 2], 2026-02-21T09:24:26.8309757Z 'range_unroll_factors': [0, 1], 2026-02-21T09:24:26.8309940Z 'range_warp_specializes': [None, False]} 2026-02-21T09:24:26.8318085Z [203s] Fitting surrogate: 710 points, 710 targets 2026-02-21T09:24:27.2515877Z [203s] Generation 13 starting: 13 neighbors, 1 active search path(s) 2026-02-21T09:24:28.3629019Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 22.7 configs/s 2026-02-21T09:24:29.1580155Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 13/13 17.3 configs/s 2026-02-21T09:24:29.9922629Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1208.2 2026-02-21T09:24:29.9926730Z configs/s 2026-02-21T09:24:30.0704400Z [206s] Generation 13 complete: 2026-02-21T09:24:30.0704672Z ok=14 2026-02-21T09:24:30.0708931Z min=0.0164 2026-02-21T09:24:30.0712346Z mid=0.0184 2026-02-21T09:24:30.0714864Z max=0.0184 2026-02-21T09:24:30.0718110Z best={'block_sizes': [1, 4096], 2026-02-21T09:24:30.0722204Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 
2026-02-21T09:24:30.0726534Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:24:30.0730812Z 'num_stages': 8, 2026-02-21T09:24:30.0732219Z 'num_warps': 2, 2026-02-21T09:24:30.0732785Z 'pid_type': 'flat', 2026-02-21T09:24:30.0732965Z 'range_flattens': [None, None], 2026-02-21T09:24:30.0733155Z 'range_multi_buffers': [None, False], 2026-02-21T09:24:30.0733339Z 'range_num_stages': [0, 2], 2026-02-21T09:24:30.0733508Z 'range_unroll_factors': [0, 1], 2026-02-21T09:24:30.0733684Z 'range_warp_specializes': [None, False]} 2026-02-21T09:24:30.0738645Z [206s] Fitting surrogate: 724 points, 724 targets 2026-02-21T09:24:30.4767123Z [206s] Generation 14 starting: 11 neighbors, 1 active search path(s) 2026-02-21T09:24:31.5668939Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 13.8 configs/s 2026-02-21T09:24:32.2395388Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 17.5 configs/s 2026-02-21T09:24:32.8329652Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1687.5 2026-02-21T09:24:32.8333984Z configs/s 2026-02-21T09:24:32.8909979Z [209s] Generation 14 complete: 2026-02-21T09:24:32.8914055Z ok=12 2026-02-21T09:24:32.8918488Z min=0.0165 2026-02-21T09:24:32.8919862Z mid=0.0184 2026-02-21T09:24:32.8920026Z max=0.0267 2026-02-21T09:24:32.8920173Z best={'block_sizes': [1, 4096], 2026-02-21T09:24:32.8920432Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:24:32.8920693Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:24:32.8920903Z 'num_stages': 8, 2026-02-21T09:24:32.8921042Z 'num_warps': 2, 2026-02-21T09:24:32.8921193Z 'pid_type': 'flat', 2026-02-21T09:24:32.8921351Z 'range_flattens': [None, None], 2026-02-21T09:24:32.8921912Z 'range_multi_buffers': [None, False], 2026-02-21T09:24:32.8922117Z 'range_num_stages': [0, 1], 2026-02-21T09:24:32.8922285Z 'range_unroll_factors': [0, 1], 2026-02-21T09:24:32.8922491Z 'range_warp_specializes': [None, False]} 
2026-02-21T09:24:32.8937682Z [209s] Fitting surrogate: 736 points, 736 targets 2026-02-21T09:24:33.2827662Z [209s] Generation 15 starting: 10 neighbors, 1 active search path(s) 2026-02-21T09:24:34.3542774Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 13.0 configs/s 2026-02-21T09:24:34.9638281Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 10/10 17.7 configs/s 2026-02-21T09:24:35.5554785Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1692.3 2026-02-21T09:24:35.5556000Z configs/s 2026-02-21T09:24:35.6153467Z [212s] Generation 15 complete: 2026-02-21T09:24:35.6157134Z ok=11 2026-02-21T09:24:35.6161803Z min=0.0164 2026-02-21T09:24:35.6165977Z mid=0.0183 2026-02-21T09:24:35.6170406Z max=0.0266 2026-02-21T09:24:35.6173655Z best={'block_sizes': [1, 4096], 2026-02-21T09:24:35.6175920Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:24:35.6176209Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:24:35.6176459Z 'num_stages': 8, 2026-02-21T09:24:35.6180241Z 'num_warps': 2, 2026-02-21T09:24:35.6184675Z 'pid_type': 'flat', 2026-02-21T09:24:35.6184921Z 'range_flattens': [None, None], 2026-02-21T09:24:35.6189062Z 'range_multi_buffers': [None, True], 2026-02-21T09:24:35.6192191Z 'range_num_stages': [0, 1], 2026-02-21T09:24:35.6196171Z 'range_unroll_factors': [0, 1], 2026-02-21T09:24:35.6196446Z 'range_warp_specializes': [None, False]} 2026-02-21T09:24:35.6200520Z [212s] Fitting surrogate: 747 points, 747 targets 2026-02-21T09:24:36.0031265Z [212s] Generation 16 starting: 10 neighbors, 1 active search path(s) 2026-02-21T09:24:37.0281189Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 15.5 configs/s 2026-02-21T09:24:37.6399834Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 10/10 17.7 configs/s 2026-02-21T09:24:38.2419996Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1664.6 2026-02-21T09:24:38.2424566Z configs/s 
2026-02-21T09:24:38.3025150Z [214s] Generation 16 complete: 2026-02-21T09:24:38.3029335Z ok=11 2026-02-21T09:24:38.3033702Z min=0.0183 2026-02-21T09:24:38.3038058Z mid=0.0184 2026-02-21T09:24:38.3041165Z max=0.0246 2026-02-21T09:24:38.3044434Z best={'block_sizes': [1, 4096], 2026-02-21T09:24:38.3048397Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:24:38.3048755Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:24:38.3048975Z 'num_stages': 8, 2026-02-21T09:24:38.3053311Z 'num_warps': 2, 2026-02-21T09:24:38.3057763Z 'pid_type': 'flat', 2026-02-21T09:24:38.3062035Z 'range_flattens': [None, None], 2026-02-21T09:24:38.3066899Z 'range_multi_buffers': [None, True], 2026-02-21T09:24:38.3071833Z 'range_num_stages': [0, 1], 2026-02-21T09:24:38.3076104Z 'range_unroll_factors': [0, 1], 2026-02-21T09:24:38.3080475Z 'range_warp_specializes': [None, False]} 2026-02-21T09:24:38.3083634Z [214s] Fitting surrogate: 758 points, 758 targets 2026-02-21T09:24:38.6770946Z [215s] Generation 17 starting: 10 neighbors, 1 active search path(s) 2026-02-21T09:24:39.7776942Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 12.6 configs/s 2026-02-21T09:24:40.3849674Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 10/10 17.8 configs/s 2026-02-21T09:24:40.9894978Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1657.1 2026-02-21T09:24:40.9898570Z configs/s 2026-02-21T09:24:41.0480347Z [217s] Generation 17 complete: 2026-02-21T09:24:41.0483586Z ok=11 2026-02-21T09:24:41.0488597Z min=0.0184 2026-02-21T09:24:41.0492913Z mid=0.0184 2026-02-21T09:24:41.0494352Z max=0.0266 2026-02-21T09:24:41.0494544Z best={'block_sizes': [1, 4096], 2026-02-21T09:24:41.0494786Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T09:24:41.0495049Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:24:41.0495256Z 'num_stages': 8, 2026-02-21T09:24:41.0495402Z 'num_warps': 2, 2026-02-21T09:24:41.0495585Z 'pid_type': 
'flat', 2026-02-21T09:24:41.0495763Z 'range_flattens': [None, None], 2026-02-21T09:24:41.0495961Z 'range_multi_buffers': [None, True], 2026-02-21T09:24:41.0496143Z 'range_num_stages': [0, 1], 2026-02-21T09:24:41.0496316Z 'range_unroll_factors': [0, 1], 2026-02-21T09:24:41.0496495Z 'range_warp_specializes': [None, False]} 2026-02-21T09:24:41.0509251Z [217s] Fitting surrogate: 769 points, 769 targets 2026-02-21T09:24:41.3276946Z [217s] Autotuning complete in 217.8s after searching 734 configs. 2026-02-21T09:24:41.3281362Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:24:41.3283454Z @helion.kernel(config=helion.Config(block_sizes=[1, 4096], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=8, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T09:24:41.3284358Z 2026-02-21T09:24:41.3284646Z [217s] Code of selected kernel: /tmp/torchinductor_root/ww/cww7kwfj4efxgrw7h2zuolovdqiaferwiztrkeu6jzgpxtlfzzv4.py 2026-02-21T09:24:41.3505982Z from __future__ import annotations 2026-02-21T09:24:41.3509776Z 2026-02-21T09:24:41.3511819Z import torch 2026-02-21T09:24:41.3511997Z import triton 2026-02-21T09:24:41.3512157Z import triton.language as tl 2026-02-21T09:24:41.3512363Z from torch._inductor.runtime import triton_helpers 2026-02-21T09:24:41.3512632Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T09:24:41.3512917Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:24:41.3513101Z 2026-02-21T09:24:41.3513171Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T09:24:41.3513348Z _BLOCK_SIZE_1 = tl.constexpr(4096) 2026-02-21T09:24:41.3513471Z 2026-02-21T09:24:41.3513526Z @triton.jit 2026-02-21T09:24:41.3513677Z def _helion_softmax_two_pass(x, out): 2026-02-21T09:24:41.3513940Z # 
src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:24:41.3514511Z pid_0 = tl.program_id(0) 2026-02-21T09:24:41.3514682Z offset_0 = pid_0 2026-02-21T09:24:41.3518425Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T09:24:41.3518788Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:24:41.3522899Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T09:24:41.3527724Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:24:41.3529473Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T09:24:41.3529824Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:24:41.3534778Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:24:41.3536883Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:24:41.3537203Z # src[softmax.py:82-89]: ... 2026-02-21T09:24:41.3542117Z for offset_2 in tl.range(0, 3584, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=False, num_stages=1, disallow_acc_multi_buffer=False): 2026-02-21T09:24:41.3546263Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:24:41.3546605Z mask_1 = indices_2 < 3584 2026-02-21T09:24:41.3546814Z mi_copy = mi 2026-02-21T09:24:41.3551267Z di_copy = di 2026-02-21T09:24:41.3555414Z mi_copy_0 = mi_copy 2026-02-21T09:24:41.3560069Z di_copy_0 = di_copy 2026-02-21T09:24:41.3565485Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:24:41.3567134Z values = tl.load(x + (indices_0[:, None] * 3584 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first') 2026-02-21T09:24:41.3567587Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:24:41.3568028Z _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16)) 2026-02-21T09:24:41.3568458Z local_amax = 
tl.cast(tl.max(_mask_to, 1), tl.float16) 2026-02-21T09:24:41.3568741Z # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax) 2026-02-21T09:24:41.3568996Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T09:24:41.3569217Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T09:24:41.3569491Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:24:41.3569729Z v_2 = mi_copy_0 - v_1 2026-02-21T09:24:41.3569904Z v_3 = libdevice.exp(v_2) 2026-02-21T09:24:41.3570071Z v_4 = di_copy_0 * v_3 2026-02-21T09:24:41.3570267Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:24:41.3570475Z subscript = v_1[:, None] 2026-02-21T09:24:41.3570649Z v_5 = tl.cast(values, tl.float32) 2026-02-21T09:24:41.3570835Z v_6 = v_5 - subscript 2026-02-21T09:24:41.3571045Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:24:41.3571318Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:24:41.3571611Z # src[softmax.py:88]: ).sum(dim=1) 2026-02-21T09:24:41.3571813Z v_7 = libdevice.exp(v_6) 2026-02-21T09:24:41.3572148Z _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32)) 2026-02-21T09:24:41.3572505Z sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32) 2026-02-21T09:24:41.3572714Z di = v_4 + sum_1 2026-02-21T09:24:41.3572876Z # src[softmax.py:89]: mi = mi_next 2026-02-21T09:24:41.3573057Z mi = v_1 2026-02-21T09:24:41.3573253Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:24:41.3573534Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:24:41.3573827Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:24:41.3574313Z for offset_2 in tl.range(0, 3584, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=False, num_stages=1, disallow_acc_multi_buffer=False): 2026-02-21T09:24:41.3574977Z indices_2 = offset_2 + tl.arange(0, 
_BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:24:41.3575212Z mask_2 = indices_2 < 3584 2026-02-21T09:24:41.3575387Z mi_copy_1 = mi 2026-02-21T09:24:41.3575532Z di_copy_1 = di 2026-02-21T09:24:41.3575686Z mi_copy_1_0 = mi_copy_1 2026-02-21T09:24:41.3575850Z di_copy_1_0 = di_copy_1 2026-02-21T09:24:41.3576040Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:24:41.3576412Z values_1 = tl.load(x + (indices_0[:, None] * 3584 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first') 2026-02-21T09:24:41.3576864Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:24:41.3577146Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T09:24:41.3577328Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T09:24:41.3577587Z v_10 = v_9 - subscript_1 2026-02-21T09:24:41.3577769Z v_11 = libdevice.exp(v_10) 2026-02-21T09:24:41.3577945Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T09:24:41.3578129Z v_12 = v_11 / subscript_2 2026-02-21T09:24:41.3578297Z v_13 = tl.cast(v_12, tl.float16) 2026-02-21T09:24:41.3578568Z tl.store(out + (indices_0[:, None] * 3584 + indices_2[None, :] * 1), v_13, mask_2[None, :]) 2026-02-21T09:24:41.3578783Z 2026-02-21T09:24:41.3578910Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T09:24:41.3579146Z """ 2026-02-21T09:24:41.3579358Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T09:24:41.3579660Z This version uses fewer passes but is less numerically stable. 2026-02-21T09:24:41.3579891Z Args: 2026-02-21T09:24:41.3580049Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T09:24:41.3580247Z Returns: 2026-02-21T09:24:41.3580423Z torch.Tensor: Softmax output tensor of the same shape. 
2026-02-21T09:24:41.3580635Z """ 2026-02-21T09:24:41.3580769Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T09:24:41.3580948Z m, n = x.size() 2026-02-21T09:24:41.3581118Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T09:24:41.3581311Z out = torch.empty_like(x) 2026-02-21T09:24:41.3581577Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:24:41.3581885Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:24:41.3582199Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:24:41.3582434Z # src[softmax.py:79-92]: ... 2026-02-21T09:24:41.3582714Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=2, num_stages=8) 2026-02-21T09:24:41.3582998Z # src[softmax.py:93]: return out 2026-02-21T09:24:41.3583173Z return out 2026-02-21T09:24:42.1946110Z WARNING:tritonbench.utils.triton_op:Completed input ID 26: 2026-02-21T09:24:42.1949816Z (M, N) 2026-02-21T09:24:42.1954887Z ------------ 2026-02-21T09:24:42.1956366Z (4096, 3584) 2026-02-21T09:24:42.1956501Z 2026-02-21T09:24:42.1957002Z 30%|███ | 6/20 [15:47<40:56, 175.46s/it]WARNING:tritonbench.utils.triton_op:Running input ID 31: 2026-02-21T09:24:42.1961970Z (M, N) 2026-02-21T09:24:42.1966309Z ------------ 2026-02-21T09:24:42.1967685Z (4096, 4224) 2026-02-21T09:24:42.1968002Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T09:24:43.4733746Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T09:24:44.8452609Z INFO:tritonbench.utils.triton_op:Took 2.38ms to get benchmark function for torch_compile_softmax 2026-02-21T09:24:46.1441507Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:24:46.1443790Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:24:46.1444014Z 'dtype': 'torch.float16', 2026-02-21T09:24:46.1444249Z 'shape': (4096, 4224), 2026-02-21T09:24:46.1444777Z 'stride': (4224, 1)},), 
2026-02-21T09:24:46.1444986Z 'kwargs': {}} 2026-02-21T09:24:46.1479603Z INFO:tritonbench.utils.triton_op:Took 3.87ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T09:24:46.3211275Z [0s] Autotune random seed: 2138408546 2026-02-21T09:24:46.3456935Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:25:20.4595738Z [34s] Timeout after 30s compiling Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T09:25:20.4609701Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s 2026-02-21T09:25:22.4264197Z module { 2026-02-21T09:25:22.4270836Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:25:22.4271689Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:25:22.4271922Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:25:22.4272133Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:25:22.4272359Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:25:22.4272579Z %cst = arith.constant dense<4224> : tensor<16x1xi32> 2026-02-21T09:25:22.4272852Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<16xf32> 2026-02-21T09:25:22.4273123Z %cst_1 = arith.constant dense<0xFF800000> : tensor<16xf32> 2026-02-21T09:25:22.4273790Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:25:22.4273983Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:25:22.4274185Z %c4224_i32 = arith.constant 4224 : i32 2026-02-21T09:25:22.4274445Z %c4224_i64 = arith.constant 4224 : i64 2026-02-21T09:25:22.4276420Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:25:22.4276849Z 
%0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4224_i32], [%c4224_i64, %c1_i64] : , > 2026-02-21T09:25:22.4280476Z %1 = tt.get_program_id x : i32 2026-02-21T09:25:22.4280776Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T09:25:22.4281007Z %3 = arith.minsi %2, %c256_i32 : i32 2026-02-21T09:25:22.4281269Z scf.for %arg2 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T09:25:22.4284028Z %4 = arith.muli %arg2, %c16_i32 : i32 2026-02-21T09:25:22.4284324Z %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:25:22.4284630Z %6 = tt.splat %4 : i32 -> tensor<16xi32> 2026-02-21T09:25:22.4284880Z %7 = arith.addi %6, %5 : tensor<16xi32> 2026-02-21T09:25:22.4287585Z %c4096_i32_2 = arith.constant 4096 : i32 2026-02-21T09:25:22.4287821Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:25:22.4288274Z %8:2 = scf.for %arg3 = %c0_i32 to %c4096_i32_2 step %c512_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<16xf32>, tensor<16xf32>) : i32 { 2026-02-21T09:25:22.4288881Z %50 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:25:22.4289305Z %51 = arith.extf %50 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:25:22.4289609Z %52 = "tt.reduce"(%51) <{axis = 1 : i32}> ({ 2026-02-21T09:25:22.4289862Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:25:22.4290099Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:25:22.4290327Z tt.reduce.return %128 : f32 2026-02-21T09:25:22.4290541Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:25:22.4290808Z %53 = arith.truncf %52 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:25:22.4291095Z %54 = arith.extf %53 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:25:22.4291364Z %55 = arith.cmpf ogt, %arg4, %54 : tensor<16xf32> 2026-02-21T09:25:22.4292001Z %56 = arith.cmpf une, %arg4, %arg4 : tensor<16xf32> 2026-02-21T09:25:22.4292268Z %57 = arith.ori %55, %56 : tensor<16xi1> 2026-02-21T09:25:22.4292547Z %58 = arith.select %57, %arg4, %54 : tensor<16xi1>, 
tensor<16xf32> 2026-02-21T09:25:22.4292825Z %59 = arith.subf %arg4, %58 : tensor<16xf32> 2026-02-21T09:25:22.4293255Z %60 = tt.extern_elementwise %59 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:25:22.4293687Z %61 = arith.mulf %arg5, %60 : tensor<16xf32> 2026-02-21T09:25:22.4293981Z %62 = tt.expand_dims %58 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:25:22.4294328Z %63 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:25:22.4294608Z %64 = arith.subf %51, %63 : tensor<16x128xf32> 2026-02-21T09:25:22.4295122Z %65 = tt.extern_elementwise %64 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:25:22.4295557Z %66 = "tt.reduce"(%65) <{axis = 1 : i32}> ({ 2026-02-21T09:25:22.4295779Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:25:22.4296001Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:25:22.4296223Z tt.reduce.return %128 : f32 2026-02-21T09:25:22.4296449Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:25:22.4296680Z %67 = arith.addf %61, %66 : tensor<16xf32> 2026-02-21T09:25:22.4296915Z %c1_i32_5 = arith.constant 1 : i32 2026-02-21T09:25:22.4297143Z %68 = arith.muli %c128_i32, %c1_i32_5 : i32 2026-02-21T09:25:22.4297365Z %69 = arith.addi %arg3, %68 : i32 2026-02-21T09:25:22.4297694Z %70 = tt.descriptor_load %0[%4, %69] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:25:22.4298064Z %71 = arith.extf %70 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:25:22.4298336Z %72 = "tt.reduce"(%71) <{axis = 1 : i32}> ({ 2026-02-21T09:25:22.4298563Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:25:22.4298775Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:25:22.4298995Z tt.reduce.return %128 : f32 2026-02-21T09:25:22.4299197Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:25:22.4299453Z %73 = arith.truncf %72 : tensor<16xf32> to tensor<16xf16> 
2026-02-21T09:25:22.4299720Z %74 = arith.extf %73 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:25:22.4299983Z %75 = arith.cmpf ogt, %58, %74 : tensor<16xf32> 2026-02-21T09:25:22.4300216Z %76 = arith.cmpf une, %58, %58 : tensor<16xf32> 2026-02-21T09:25:22.4300448Z %77 = arith.ori %75, %76 : tensor<16xi1> 2026-02-21T09:25:22.4300707Z %78 = arith.select %77, %58, %74 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:25:22.4300979Z %79 = arith.subf %58, %78 : tensor<16xf32> 2026-02-21T09:25:22.4301396Z %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:25:22.4301854Z %81 = arith.mulf %67, %80 : tensor<16xf32> 2026-02-21T09:25:22.4302144Z %82 = tt.expand_dims %78 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:25:22.4302476Z %83 = tt.broadcast %82 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:25:22.4302753Z %84 = arith.subf %71, %83 : tensor<16x128xf32> 2026-02-21T09:25:22.4303169Z %85 = tt.extern_elementwise %84 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:25:22.4303549Z %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({ 2026-02-21T09:25:22.4303753Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:25:22.4303943Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:25:22.4304148Z tt.reduce.return %128 : f32 2026-02-21T09:25:22.4304344Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:25:22.4304636Z %87 = arith.addf %81, %86 : tensor<16xf32> 2026-02-21T09:25:22.4304847Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:25:22.4305048Z %88 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T09:25:22.4305256Z %89 = arith.addi %arg3, %88 : i32 2026-02-21T09:25:22.4305547Z %90 = tt.descriptor_load %0[%4, %89] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:25:22.4305893Z %91 = arith.extf %90 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:25:22.4306137Z %92 = 
"tt.reduce"(%91) <{axis = 1 : i32}> ({ 2026-02-21T09:25:22.4306347Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:25:22.4306549Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:25:22.4306753Z tt.reduce.return %128 : f32 2026-02-21T09:25:22.4306955Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:25:22.4307189Z %93 = arith.truncf %92 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:25:22.4307509Z %94 = arith.extf %93 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:25:22.4307753Z %95 = arith.cmpf ogt, %78, %94 : tensor<16xf32> 2026-02-21T09:25:22.4307980Z %96 = arith.cmpf une, %78, %78 : tensor<16xf32> 2026-02-21T09:25:22.4308190Z %97 = arith.ori %95, %96 : tensor<16xi1> 2026-02-21T09:25:22.4308436Z %98 = arith.select %97, %78, %94 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:25:22.4308690Z %99 = arith.subf %78, %98 : tensor<16xf32> 2026-02-21T09:25:22.4309066Z %100 = tt.extern_elementwise %99 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:25:22.4309463Z %101 = arith.mulf %87, %100 : tensor<16xf32> 2026-02-21T09:25:22.4309754Z %102 = tt.expand_dims %98 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:25:22.4310112Z %103 = tt.broadcast %102 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:25:22.4310405Z %104 = arith.subf %91, %103 : tensor<16x128xf32> 2026-02-21T09:25:22.4310808Z %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:25:22.4311214Z %106 = "tt.reduce"(%105) <{axis = 1 : i32}> ({ 2026-02-21T09:25:22.4311417Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:25:22.4311652Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:25:22.4311852Z tt.reduce.return %128 : f32 2026-02-21T09:25:22.4312059Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:25:22.4312283Z %107 = arith.addf %101, %106 : tensor<16xf32> 2026-02-21T09:25:22.4312493Z %c3_i32 = 
arith.constant 3 : i32 2026-02-21T09:25:22.4312702Z %108 = arith.muli %c128_i32, %c3_i32 : i32 2026-02-21T09:25:22.4312904Z %109 = arith.addi %arg3, %108 : i32 2026-02-21T09:25:22.4313218Z %110 = tt.descriptor_load %0[%4, %109] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:25:22.4313570Z %111 = arith.extf %110 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:25:22.4313835Z %112 = "tt.reduce"(%111) <{axis = 1 : i32}> ({ 2026-02-21T09:25:22.4314045Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:25:22.4314246Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:25:22.4314468Z tt.reduce.return %128 : f32 2026-02-21T09:25:22.4314667Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:25:22.4314936Z %113 = arith.truncf %112 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:25:22.4315224Z %114 = arith.extf %113 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:25:22.4315503Z %115 = arith.cmpf ogt, %98, %114 : tensor<16xf32> 2026-02-21T09:25:22.4315757Z %116 = arith.cmpf une, %98, %98 : tensor<16xf32> 2026-02-21T09:25:22.4316000Z %117 = arith.ori %115, %116 : tensor<16xi1> 2026-02-21T09:25:22.4316285Z %118 = arith.select %117, %98, %114 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:25:22.4316595Z %119 = arith.subf %98, %118 : tensor<16xf32> 2026-02-21T09:25:22.4316981Z %120 = tt.extern_elementwise %119 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:25:22.4317368Z %121 = arith.mulf %107, %120 : tensor<16xf32> 2026-02-21T09:25:22.4317652Z %122 = tt.expand_dims %118 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:25:22.4318004Z %123 = tt.broadcast %122 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:25:22.4318282Z %124 = arith.subf %111, %123 : tensor<16x128xf32> 2026-02-21T09:25:22.4318705Z %125 = tt.extern_elementwise %124 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:25:22.4319101Z 
%126 = "tt.reduce"(%125) <{axis = 1 : i32}> ({ 2026-02-21T09:25:22.4319330Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:25:22.4319592Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:25:22.4319822Z tt.reduce.return %128 : f32 2026-02-21T09:25:22.4320039Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:25:22.4320270Z %127 = arith.addf %121, %126 : tensor<16xf32> 2026-02-21T09:25:22.4320528Z scf.yield %118, %127 : tensor<16xf32>, tensor<16xf32> 2026-02-21T09:25:22.4320778Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:25:22.4321134Z %9 = tt.descriptor_load %0[%4, %c4096_i32_2] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:25:22.4321516Z %10 = arith.extf %9 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:25:22.4321839Z %11 = "tt.reduce"(%10) <{axis = 1 : i32}> ({ 2026-02-21T09:25:22.4322064Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:25:22.4322271Z %50 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:25:22.4322496Z tt.reduce.return %50 : f32 2026-02-21T09:25:22.4322708Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:25:22.4322975Z %12 = arith.truncf %11 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:25:22.4323275Z %13 = arith.extf %12 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:25:22.4323550Z %14 = arith.cmpf ogt, %8#0, %13 : tensor<16xf32> 2026-02-21T09:25:22.4323792Z %15 = arith.cmpf une, %8#0, %8#0 : tensor<16xf32> 2026-02-21T09:25:22.4324029Z %16 = arith.ori %14, %15 : tensor<16xi1> 2026-02-21T09:25:22.4324293Z %17 = arith.select %16, %8#0, %13 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:25:22.4324557Z %18 = arith.subf %8#0, %17 : tensor<16xf32> 2026-02-21T09:25:22.4324969Z %19 = tt.extern_elementwise %18 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:25:22.4325377Z %20 = arith.mulf %8#1, %19 : tensor<16xf32> 2026-02-21T09:25:22.4325668Z %21 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 
2026-02-21T09:25:22.4326002Z %22 = tt.broadcast %21 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:25:22.4326278Z %23 = arith.subf %10, %22 : tensor<16x128xf32> 2026-02-21T09:25:22.4326699Z %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:25:22.4327108Z %25 = "tt.reduce"(%24) <{axis = 1 : i32}> ({ 2026-02-21T09:25:22.4327332Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:25:22.4327531Z %50 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:25:22.4327745Z tt.reduce.return %50 : f32 2026-02-21T09:25:22.4327948Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:25:22.4328178Z %26 = arith.addf %20, %25 : tensor<16xf32> 2026-02-21T09:25:22.4328401Z %c4096_i32_3 = arith.constant 4096 : i32 2026-02-21T09:25:22.4328620Z %c512_i32_4 = arith.constant 512 : i32 2026-02-21T09:25:22.4328886Z scf.for %arg3 = %c0_i32 to %c4096_i32_3 step %c512_i32_4 : i32 { 2026-02-21T09:25:22.4329212Z %50 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:25:22.4329645Z %51 = tt.splat %arg3 : i32 -> tensor<128xi32> 2026-02-21T09:25:22.4329882Z %52 = arith.addi %51, %50 : tensor<128xi32> 2026-02-21T09:25:22.4330176Z %53 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:25:22.4330482Z %54 = arith.muli %53, %cst : tensor<16x1xi32> 2026-02-21T09:25:22.4330765Z %55 = tt.expand_dims %52 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:25:22.4331084Z %56 = tt.broadcast %54 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:25:22.4331365Z %57 = tt.broadcast %55 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:25:22.4331676Z %58 = arith.addi %56, %57 : tensor<16x128xi32> 2026-02-21T09:25:22.4331930Z %59 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4332299Z %60 = tt.addptr %59, %58 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:25:22.4332632Z %61 = 
tt.load %60 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4332962Z %62 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:25:22.4333269Z %63 = arith.extf %61 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:25:22.4333541Z %64 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:25:22.4333792Z %65 = arith.subf %63, %64 : tensor<16x128xf32> 2026-02-21T09:25:22.4334190Z %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:25:22.4334628Z %67 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:25:22.4334932Z %68 = tt.broadcast %67 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:25:22.4335182Z %69 = arith.divf %66, %68 : tensor<16x128xf32> 2026-02-21T09:25:22.4335443Z %70 = arith.truncf %69 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:25:22.4335728Z %71 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4336030Z %72 = tt.addptr %71, %58 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:25:22.4336310Z tt.store %72, %70 : tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4336531Z %c1_i32_5 = arith.constant 1 : i32 2026-02-21T09:25:22.4336739Z %73 = arith.muli %c128_i32, %c1_i32_5 : i32 2026-02-21T09:25:22.4336943Z %74 = arith.addi %arg3, %73 : i32 2026-02-21T09:25:22.4337194Z %75 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:25:22.4337458Z %76 = tt.splat %74 : i32 -> tensor<128xi32> 2026-02-21T09:25:22.4337672Z %77 = arith.addi %76, %75 : tensor<128xi32> 2026-02-21T09:25:22.4337941Z %78 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:25:22.4338214Z %79 = arith.muli %78, %cst : tensor<16x1xi32> 2026-02-21T09:25:22.4338490Z %80 = tt.expand_dims %77 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 
2026-02-21T09:25:22.4338796Z %81 = tt.broadcast %79 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:25:22.4339076Z %82 = tt.broadcast %80 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:25:22.4339325Z %83 = arith.addi %81, %82 : tensor<16x128xi32> 2026-02-21T09:25:22.4339568Z %84 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4339866Z %85 = tt.addptr %84, %83 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:25:22.4340181Z %86 = tt.load %85 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4340515Z %87 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:25:22.4340816Z %88 = arith.extf %86 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:25:22.4341154Z %89 = tt.broadcast %87 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:25:22.4341407Z %90 = arith.subf %88, %89 : tensor<16x128xf32> 2026-02-21T09:25:22.4341860Z %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:25:22.4342308Z %92 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:25:22.4342606Z %93 = tt.broadcast %92 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:25:22.4342857Z %94 = arith.divf %91, %93 : tensor<16x128xf32> 2026-02-21T09:25:22.4343108Z %95 = arith.truncf %94 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:25:22.4343390Z %96 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4348491Z %97 = tt.addptr %96, %83 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:25:22.4348821Z tt.store %97, %95 : tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4349052Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:25:22.4349255Z %98 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T09:25:22.4349468Z %99 = arith.addi %arg3, %98 : i32 2026-02-21T09:25:22.4351352Z %100 = tt.make_range {end = 128 : i32, start = 0 : i32} 
: tensor<128xi32> 2026-02-21T09:25:22.4351680Z %101 = tt.splat %99 : i32 -> tensor<128xi32> 2026-02-21T09:25:22.4351909Z %102 = arith.addi %101, %100 : tensor<128xi32> 2026-02-21T09:25:22.4352179Z %103 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:25:22.4352492Z %104 = arith.muli %103, %cst : tensor<16x1xi32> 2026-02-21T09:25:22.4354419Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:25:22.4354771Z %106 = tt.broadcast %104 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:25:22.4355090Z %107 = tt.broadcast %105 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:25:22.4355374Z %108 = arith.addi %106, %107 : tensor<16x128xi32> 2026-02-21T09:25:22.4355659Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4355984Z %110 = tt.addptr %109, %108 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:25:22.4356358Z %111 = tt.load %110 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4356723Z %112 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:25:22.4357068Z %113 = arith.extf %111 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:25:22.4357383Z %114 = tt.broadcast %112 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:25:22.4357662Z %115 = arith.subf %113, %114 : tensor<16x128xf32> 2026-02-21T09:25:22.4358102Z %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:25:22.4358586Z %117 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:25:22.4358928Z %118 = tt.broadcast %117 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:25:22.4359210Z %119 = arith.divf %116, %118 : tensor<16x128xf32> 2026-02-21T09:25:22.4359486Z %120 = arith.truncf %119 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:25:22.4359815Z %121 = 
tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4360140Z %122 = tt.addptr %121, %108 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:25:22.4360448Z tt.store %122, %120 : tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4360682Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:25:22.4360906Z %123 = arith.muli %c128_i32, %c3_i32 : i32 2026-02-21T09:25:22.4361132Z %124 = arith.addi %arg3, %123 : i32 2026-02-21T09:25:22.4361402Z %125 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:25:22.4361803Z %126 = tt.splat %124 : i32 -> tensor<128xi32> 2026-02-21T09:25:22.4362037Z %127 = arith.addi %126, %125 : tensor<128xi32> 2026-02-21T09:25:22.4362312Z %128 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:25:22.4362597Z %129 = arith.muli %128, %cst : tensor<16x1xi32> 2026-02-21T09:25:22.4362881Z %130 = tt.expand_dims %127 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:25:22.4363208Z %131 = tt.broadcast %129 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:25:22.4363496Z %132 = tt.broadcast %130 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:25:22.4363766Z %133 = arith.addi %131, %132 : tensor<16x128xi32> 2026-02-21T09:25:22.4364024Z %134 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4364396Z %135 = tt.addptr %134, %133 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:25:22.4364734Z %136 = tt.load %135 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4365067Z %137 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:25:22.4365382Z %138 = arith.extf %136 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:25:22.4365661Z %139 = tt.broadcast %137 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:25:22.4365928Z %140 = arith.subf %138, %139 : tensor<16x128xf32> 2026-02-21T09:25:22.4366326Z %141 = tt.extern_elementwise %140 {libname = "", 
libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:25:22.4366775Z %142 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:25:22.4367085Z %143 = tt.broadcast %142 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:25:22.4367344Z %144 = arith.divf %141, %143 : tensor<16x128xf32> 2026-02-21T09:25:22.4367606Z %145 = arith.truncf %144 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:25:22.4367896Z %146 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4368205Z %147 = tt.addptr %146, %133 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:25:22.4368509Z tt.store %147, %145 : tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4368732Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:25:22.4368993Z %27 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:25:22.4369267Z %28 = tt.splat %c4096_i32_3 : i32 -> tensor<128xi32> 2026-02-21T09:25:22.4369496Z %29 = arith.addi %28, %27 : tensor<128xi32> 2026-02-21T09:25:22.4369764Z %30 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:25:22.4370037Z %31 = arith.muli %30, %cst : tensor<16x1xi32> 2026-02-21T09:25:22.4370321Z %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:25:22.4370626Z %33 = tt.broadcast %31 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:25:22.4370913Z %34 = tt.broadcast %32 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:25:22.4371165Z %35 = arith.addi %33, %34 : tensor<16x128xi32> 2026-02-21T09:25:22.4371422Z %36 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4371764Z %37 = tt.addptr %36, %35 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:25:22.4372079Z %38 = tt.load %37 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4372413Z %39 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> 
tensor<16x1xf32> 2026-02-21T09:25:22.4372708Z %40 = arith.extf %38 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:25:22.4372987Z %41 = tt.broadcast %39 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:25:22.4373303Z %42 = arith.subf %40, %41 : tensor<16x128xf32> 2026-02-21T09:25:22.4373691Z %43 = tt.extern_elementwise %42 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:25:22.4374134Z %44 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:25:22.4374429Z %45 = tt.broadcast %44 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:25:22.4374699Z %46 = arith.divf %43, %45 : tensor<16x128xf32> 2026-02-21T09:25:22.4374962Z %47 = arith.truncf %46 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:25:22.4375272Z %48 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4375594Z %49 = tt.addptr %48, %35 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:25:22.4375883Z tt.store %49, %47 : tensor<16x128x!tt.ptr> 2026-02-21T09:25:22.4376309Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T09:25:22.4376636Z tt.return 2026-02-21T09:25:22.4376791Z } 2026-02-21T09:25:22.4376929Z } 2026-02-21T09:25:22.4377014Z 2026-02-21T09:25:22.4377071Z {-# 2026-02-21T09:25:22.4377226Z external_resources: { 2026-02-21T09:25:22.4377408Z mlir_reproducer: { 2026-02-21T09:25:22.4382545Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, 
canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:25:22.4388079Z disable_threading: false, 2026-02-21T09:25:22.4388271Z verify_each: true 2026-02-21T09:25:22.4388425Z } 2026-02-21T09:25:22.4388558Z } 2026-02-21T09:25:22.4388678Z #-} 2026-02-21T09:25:22.4389186Z /tmp/torchinductor_root/w2/cw2csammbsly2xzje4frurab3f4fx7byjvsr2fjttwpoqvu6choy.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:25:22.4390568Z /tmp/torchinductor_root/w2/cw2csammbsly2xzje4frurab3f4fx7byjvsr2fjttwpoqvu6choy.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer 
above with Triton project.` 2026-02-21T09:25:22.4391717Z [36s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:25:22.4392900Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_sm_multiplier=32, num_stages=3, num_warps=32, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[2, 3], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:25:22.4394016Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:25:22.4394291Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:25:22.8462606Z module { 2026-02-21T09:25:22.8465085Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:25:22.8465571Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:25:22.8465880Z %cst = arith.constant dense<0.000000e+00> : tensor<8x1024xf16> 2026-02-21T09:25:22.8470860Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T09:25:22.8477400Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:25:22.8479522Z %c592_i32 = arith.constant 592 : i32 2026-02-21T09:25:22.8479864Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<8x1024xf32> 2026-02-21T09:25:22.8485449Z %cst_1 = arith.constant dense<0xFC00> : tensor<8x1024xf16> 2026-02-21T09:25:22.8487758Z %cst_2 = arith.constant dense<4224> : tensor<8x1xi32> 2026-02-21T09:25:22.8488114Z %cst_3 = arith.constant dense<4224> : tensor<1024xi32> 2026-02-21T09:25:22.8488394Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32> 2026-02-21T09:25:22.8494806Z %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32> 2026-02-21T09:25:22.8496996Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:25:22.8497285Z %c4096_i32 = arith.constant 
4096 : i32 2026-02-21T09:25:22.8502943Z %c4224_i32 = arith.constant 4224 : i32 2026-02-21T09:25:22.8505086Z %c4224_i64 = arith.constant 4224 : i64 2026-02-21T09:25:22.8505379Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:25:22.8510861Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4224_i32], [%c4224_i64, %c1_i64] : , > 2026-02-21T09:25:22.8512574Z %1 = tt.get_program_id x : i32 2026-02-21T09:25:22.8512832Z scf.for %arg2 = %1 to %c512_i32 step %c592_i32 : i32 { 2026-02-21T09:25:22.8513078Z %2 = arith.muli %arg2, %c8_i32 : i32 2026-02-21T09:25:22.8513313Z %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:25:22.8513569Z %4 = tt.splat %2 : i32 -> tensor<8xi32> 2026-02-21T09:25:22.8513768Z %5 = arith.addi %4, %3 : tensor<8xi32> 2026-02-21T09:25:22.8513954Z %c3072_i32 = arith.constant 3072 : i32 2026-02-21T09:25:22.8514147Z %c3072_i32_6 = arith.constant 3072 : i32 2026-02-21T09:25:22.8514381Z %6 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:25:22.8514659Z %7 = tt.splat %c0_i32 : i32 -> tensor<1024xi32> 2026-02-21T09:25:22.8514885Z %8 = arith.addi %7, %6 : tensor<1024xi32> 2026-02-21T09:25:22.8515101Z %9 = arith.cmpi slt, %8, %cst_3 : tensor<1024xi32> 2026-02-21T09:25:22.8515371Z %10 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:25:22.8515630Z %11 = arith.muli %10, %cst_2 : tensor<8x1xi32> 2026-02-21T09:25:22.8515893Z %12 = tt.expand_dims %8 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:25:22.8516179Z %13 = tt.broadcast %11 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:25:22.8516445Z %14 = tt.broadcast %12 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:25:22.8516674Z %15 = arith.addi %13, %14 : tensor<8x1024xi32> 2026-02-21T09:25:22.8516919Z %16 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:25:22.8517194Z %17 = tt.addptr %16, %15 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 
2026-02-21T09:25:22.8517508Z %18 = tt.expand_dims %9 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:25:22.8518075Z %19 = tt.broadcast %18 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:25:22.8518327Z %20 = tt.load %17, %19, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T09:25:22.8518597Z %21 = arith.select %19, %20, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:25:22.8518933Z %22 = arith.extf %21 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:25:22.8519160Z %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({ 2026-02-21T09:25:22.8519359Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:25:22.8519547Z %192 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:25:22.8519746Z tt.reduce.return %192 : f32 2026-02-21T09:25:22.8519928Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:25:22.8520153Z %24 = arith.truncf %23 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:25:22.8520386Z %25 = arith.extf %24 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:25:22.8520693Z %26 = arith.cmpf ogt, %cst_5, %25 : tensor<8xf32> 2026-02-21T09:25:22.8520928Z %27 = arith.cmpf une, %cst_5, %cst_5 : tensor<8xf32> 2026-02-21T09:25:22.8521137Z %28 = arith.ori %26, %27 : tensor<8xi1> 2026-02-21T09:25:22.8521372Z %29 = arith.select %28, %cst_5, %25 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:25:22.8521668Z %30 = arith.subf %cst_5, %29 : tensor<8xf32> 2026-02-21T09:25:22.8522031Z %31 = tt.extern_elementwise %30 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:25:22.8522385Z %32 = arith.mulf %cst_4, %31 : tensor<8xf32> 2026-02-21T09:25:22.8522644Z %33 = tt.expand_dims %29 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:25:22.8522936Z %34 = arith.extf %20 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:25:22.8523196Z %35 = tt.broadcast %33 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:25:22.8523447Z %36 = arith.subf %34, %35 : tensor<8x1024xf32> 
2026-02-21T09:25:22.8523815Z %37 = tt.extern_elementwise %36 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:25:22.8524239Z %38 = arith.select %19, %37, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T09:25:22.8524503Z %39 = "tt.reduce"(%38) <{axis = 1 : i32}> ({ 2026-02-21T09:25:22.8524697Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:25:22.8524893Z %192 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:25:22.8525083Z tt.reduce.return %192 : f32 2026-02-21T09:25:22.8525272Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:25:22.8525464Z %40 = arith.addf %32, %39 : tensor<8xf32> 2026-02-21T09:25:22.8525660Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:25:22.8525843Z %41 = arith.muli %c1024_i32, %c1_i32 : i32 2026-02-21T09:25:22.8526036Z %42 = arith.addi %c0_i32, %41 : i32 2026-02-21T09:25:22.8526276Z %43 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:25:22.8526524Z %44 = tt.splat %42 : i32 -> tensor<1024xi32> 2026-02-21T09:25:22.8526730Z %45 = arith.addi %44, %43 : tensor<1024xi32> 2026-02-21T09:25:22.8526939Z %46 = arith.cmpi slt, %45, %cst_3 : tensor<1024xi32> 2026-02-21T09:25:22.8527199Z %47 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:25:22.8527450Z %48 = arith.muli %47, %cst_2 : tensor<8x1xi32> 2026-02-21T09:25:22.8527717Z %49 = tt.expand_dims %45 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:25:22.8528015Z %50 = tt.broadcast %48 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:25:22.8528274Z %51 = tt.broadcast %49 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:25:22.8528512Z %52 = arith.addi %50, %51 : tensor<8x1024xi32> 2026-02-21T09:25:22.8528747Z %53 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:25:22.8529026Z %54 = tt.addptr %53, %52 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:25:22.8529387Z %55 = tt.expand_dims 
%46 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:25:22.8529670Z %56 = tt.broadcast %55 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:25:22.8529922Z %57 = tt.load %54, %56, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T09:25:22.8530180Z %58 = arith.select %56, %57, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:25:22.8530455Z %59 = arith.extf %58 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:25:22.8530678Z %60 = "tt.reduce"(%59) <{axis = 1 : i32}> ({ 2026-02-21T09:25:22.8530871Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:25:22.8531059Z %192 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:25:22.8531248Z tt.reduce.return %192 : f32 2026-02-21T09:25:22.8531437Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:25:22.8531744Z %61 = arith.truncf %60 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:25:22.8531987Z %62 = arith.extf %61 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:25:22.8532200Z %63 = arith.cmpf ogt, %29, %62 : tensor<8xf32> 2026-02-21T09:25:22.8532408Z %64 = arith.cmpf une, %29, %29 : tensor<8xf32> 2026-02-21T09:25:22.8532602Z %65 = arith.ori %63, %64 : tensor<8xi1> 2026-02-21T09:25:22.8532827Z %66 = arith.select %65, %29, %62 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:25:22.8533059Z %67 = arith.subf %29, %66 : tensor<8xf32> 2026-02-21T09:25:22.8534781Z %68 = tt.extern_elementwise %67 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:25:22.8535140Z %69 = arith.mulf %40, %68 : tensor<8xf32> 2026-02-21T09:25:22.8535391Z %70 = tt.expand_dims %66 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:25:22.8535689Z %71 = arith.extf %57 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:25:22.8535963Z %72 = tt.broadcast %70 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:25:22.8536199Z %73 = arith.subf %71, %72 : tensor<8x1024xf32> 2026-02-21T09:25:22.8536581Z %74 = tt.extern_elementwise %73 {libname = "", libpath 
= "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:25:22.8537005Z %75 = arith.select %56, %74, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T09:25:22.8537272Z %76 = "tt.reduce"(%75) <{axis = 1 : i32}> ({ 2026-02-21T09:25:22.8537467Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:25:22.8537656Z %192 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:25:22.8537854Z tt.reduce.return %192 : f32 2026-02-21T09:25:22.8538042Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:25:22.8538990Z %77 = arith.addf %69, %76 : tensor<8xf32> 2026-02-21T09:25:22.8539186Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:25:22.8539381Z %78 = arith.muli %c1024_i32, %c2_i32 : i32 2026-02-21T09:25:22.8539577Z %79 = arith.addi %c0_i32, %78 : i32 2026-02-21T09:25:22.8539829Z %80 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:25:22.8540088Z %81 = tt.splat %79 : i32 -> tensor<1024xi32> 2026-02-21T09:25:22.8540297Z %82 = arith.addi %81, %80 : tensor<1024xi32> 2026-02-21T09:25:22.8540524Z %83 = arith.cmpi slt, %82, %cst_3 : tensor<1024xi32> 2026-02-21T09:25:22.8540786Z %84 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:25:22.8541053Z %85 = arith.muli %84, %cst_2 : tensor<8x1xi32> 2026-02-21T09:25:22.8541315Z %86 = tt.expand_dims %82 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:25:22.8541662Z %87 = tt.broadcast %85 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:25:22.8544853Z %88 = tt.broadcast %86 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:25:22.8545079Z %89 = arith.addi %87, %88 : tensor<8x1024xi32> 2026-02-21T09:25:22.8545317Z %90 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:25:22.8545649Z %91 = tt.addptr %90, %89 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:25:22.8548289Z %92 = tt.expand_dims %83 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 
2026-02-21T09:25:22.8548570Z %93 = tt.broadcast %92 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:25:22.8548822Z %94 = tt.load %91, %93, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T09:25:22.8549085Z %95 = arith.select %93, %94, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:25:22.8549353Z %96 = arith.extf %95 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:25:22.8549584Z %97 = "tt.reduce"(%96) <{axis = 1 : i32}> ({ 2026-02-21T09:25:22.8549770Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:25:22.8549957Z %192 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:25:22.8550141Z tt.reduce.return %192 : f32 2026-02-21T09:25:22.8550403Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:25:22.8550628Z %98 = arith.truncf %97 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:25:22.8550855Z %99 = arith.extf %98 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:25:22.8551079Z %100 = arith.cmpf ogt, %66, %99 : tensor<8xf32> 2026-02-21T09:25:22.8551287Z %101 = arith.cmpf une, %66, %66 : tensor<8xf32> 2026-02-21T09:25:22.8551495Z %102 = arith.ori %100, %101 : tensor<8xi1> 2026-02-21T09:25:22.8551750Z %103 = arith.select %102, %66, %99 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:25:22.8551995Z %104 = arith.subf %66, %103 : tensor<8xf32> 2026-02-21T09:25:22.8552364Z %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:25:22.8552719Z %106 = arith.mulf %77, %105 : tensor<8xf32> 2026-02-21T09:25:22.8552976Z %107 = tt.expand_dims %103 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:25:22.8553273Z %108 = arith.extf %94 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:25:22.8553545Z %109 = tt.broadcast %107 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:25:22.8553790Z %110 = arith.subf %108, %109 : tensor<8x1024xf32> 2026-02-21T09:25:22.8554170Z %111 = tt.extern_elementwise %110 {libname = "", libpath = "", pure = true, symbol = 
"__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:25:22.8554591Z %112 = arith.select %93, %111, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T09:25:22.8554847Z %113 = "tt.reduce"(%112) <{axis = 1 : i32}> ({ 2026-02-21T09:25:22.8555045Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:25:22.8555224Z %192 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:25:22.8555415Z tt.reduce.return %192 : f32 2026-02-21T09:25:22.8555595Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:25:22.8555801Z %114 = arith.addf %106, %113 : tensor<8xf32> 2026-02-21T09:25:22.8556185Z %115:2 = scf.for %arg3 = %c3072_i32 to %c4224_i32 step %c1024_i32 iter_args(%arg4 = %103, %arg5 = %114) -> (tensor<8xf32>, tensor<8xf32>) : i32 { 2026-02-21T09:25:22.8556604Z %192 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:25:22.8556874Z %193 = tt.splat %arg3 : i32 -> tensor<1024xi32> 2026-02-21T09:25:22.8557086Z %194 = arith.addi %193, %192 : tensor<1024xi32> 2026-02-21T09:25:22.8557319Z %195 = arith.cmpi slt, %194, %cst_3 : tensor<1024xi32> 2026-02-21T09:25:22.8557588Z %196 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:25:22.8557864Z %197 = arith.muli %196, %cst_2 : tensor<8x1xi32> 2026-02-21T09:25:22.8558138Z %198 = tt.expand_dims %194 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:25:22.8558439Z %199 = tt.broadcast %197 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:25:22.8558721Z %200 = tt.broadcast %198 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:25:22.8559020Z %201 = arith.addi %199, %200 : tensor<8x1024xi32> 2026-02-21T09:25:22.8559270Z %202 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:25:22.8559560Z %203 = tt.addptr %202, %201 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:25:22.8559872Z %204 = tt.expand_dims %195 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:25:22.8560177Z %205 = 
tt.broadcast %204 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:25:22.8560431Z %206 = tt.load %203, %205, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T09:25:22.8560703Z %207 = arith.select %205, %206, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:25:22.8561009Z %208 = arith.extf %207 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:25:22.8561273Z %209 = "tt.reduce"(%208) <{axis = 1 : i32}> ({ 2026-02-21T09:25:22.8561585Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:25:22.8561782Z %227 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:25:22.8561990Z tt.reduce.return %227 : f32 2026-02-21T09:25:22.8562180Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:25:22.8562433Z %210 = arith.truncf %209 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:25:22.8562699Z %211 = arith.extf %210 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:25:22.8562963Z %212 = arith.cmpf ogt, %arg4, %211 : tensor<8xf32> 2026-02-21T09:25:22.8563220Z %213 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32> 2026-02-21T09:25:22.8563453Z %214 = arith.ori %212, %213 : tensor<8xi1> 2026-02-21T09:25:22.8563719Z %215 = arith.select %214, %arg4, %211 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:25:22.8563985Z %216 = arith.subf %arg4, %215 : tensor<8xf32> 2026-02-21T09:25:22.8564387Z %217 = tt.extern_elementwise %216 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:25:22.8564780Z %218 = arith.mulf %arg5, %217 : tensor<8xf32> 2026-02-21T09:25:22.8565064Z %219 = tt.expand_dims %215 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:25:22.8565386Z %220 = arith.extf %206 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:25:22.8565674Z %221 = tt.broadcast %219 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:25:22.8565944Z %222 = arith.subf %220, %221 : tensor<8x1024xf32> 2026-02-21T09:25:22.8566344Z %223 = tt.extern_elementwise %222 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : 
(tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:25:22.8566807Z %224 = arith.select %205, %223, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T09:25:22.8567093Z %225 = "tt.reduce"(%224) <{axis = 1 : i32}> ({ 2026-02-21T09:25:22.8567302Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:25:22.8567514Z %227 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:25:22.8567722Z tt.reduce.return %227 : f32 2026-02-21T09:25:22.8567933Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:25:22.8568150Z %226 = arith.addf %218, %225 : tensor<8xf32> 2026-02-21T09:25:22.8568394Z scf.yield %215, %226 : tensor<8xf32>, tensor<8xf32> 2026-02-21T09:25:22.8568636Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:25:22.8568852Z %c3072_i32_7 = arith.constant 3072 : i32 2026-02-21T09:25:22.8569079Z %c3072_i32_8 = arith.constant 3072 : i32 2026-02-21T09:25:22.8569342Z %116 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:25:22.8569634Z %117 = tt.splat %c0_i32 : i32 -> tensor<1024xi32> 2026-02-21T09:25:22.8569859Z %118 = arith.addi %117, %116 : tensor<1024xi32> 2026-02-21T09:25:22.8570104Z %119 = arith.cmpi slt, %118, %cst_3 : tensor<1024xi32> 2026-02-21T09:25:22.8570454Z %120 = tt.descriptor_load %0[%2, %c0_i32] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:25:22.8570874Z %121 = tt.expand_dims %115#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:25:22.8571196Z %122 = arith.extf %120 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:25:22.8571480Z %123 = tt.broadcast %121 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:25:22.8571802Z %124 = arith.subf %122, %123 : tensor<8x1024xf32> 2026-02-21T09:25:22.8572202Z %125 = tt.extern_elementwise %124 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:25:22.8572652Z %126 = tt.expand_dims %115#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:25:22.8572961Z %127 = 
tt.broadcast %126 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:25:22.8573213Z %128 = arith.divf %125, %127 : tensor<8x1024xf32> 2026-02-21T09:25:22.8573518Z %129 = arith.truncf %128 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:25:22.8573826Z %130 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:25:22.8574117Z %131 = arith.muli %130, %cst_2 : tensor<8x1xi32> 2026-02-21T09:25:22.8574382Z %132 = tt.expand_dims %118 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:25:22.8574672Z %133 = tt.broadcast %131 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:25:22.8574950Z %134 = tt.broadcast %132 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:25:22.8575206Z %135 = arith.addi %133, %134 : tensor<8x1024xi32> 2026-02-21T09:25:22.8575467Z %136 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:25:22.8575776Z %137 = tt.addptr %136, %135 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:25:22.8576111Z %138 = tt.expand_dims %119 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:25:22.8576445Z %139 = tt.broadcast %138 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:25:22.8576719Z tt.store %137, %129, %139 : tensor<8x1024x!tt.ptr> 2026-02-21T09:25:22.8576958Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:25:22.8577163Z %140 = arith.muli %c1024_i32, %c1_i32_9 : i32 2026-02-21T09:25:22.8577360Z %141 = arith.addi %c0_i32, %140 : i32 2026-02-21T09:25:22.8577595Z %142 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:25:22.8577855Z %143 = tt.splat %141 : i32 -> tensor<1024xi32> 2026-02-21T09:25:22.8578067Z %144 = arith.addi %143, %142 : tensor<1024xi32> 2026-02-21T09:25:22.8578284Z %145 = arith.cmpi slt, %144, %cst_3 : tensor<1024xi32> 2026-02-21T09:25:22.8578594Z %146 = tt.descriptor_load %0[%2, %141] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:25:22.8578938Z %147 = tt.expand_dims %115#0 {axis = 
1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:25:22.8579236Z %148 = arith.extf %146 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:25:22.8579502Z %149 = tt.broadcast %147 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:25:22.8579745Z %150 = arith.subf %148, %149 : tensor<8x1024xf32> 2026-02-21T09:25:22.8580127Z %151 = tt.extern_elementwise %150 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:25:22.8580551Z %152 = tt.expand_dims %115#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:25:22.8580846Z %153 = tt.broadcast %152 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:25:22.8581079Z %154 = arith.divf %151, %153 : tensor<8x1024xf32> 2026-02-21T09:25:22.8581324Z %155 = arith.truncf %154 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:25:22.8581648Z %156 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:25:22.8581902Z %157 = arith.muli %156, %cst_2 : tensor<8x1xi32> 2026-02-21T09:25:22.8582174Z %158 = tt.expand_dims %144 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:25:22.8582510Z %159 = tt.broadcast %157 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:25:22.8582777Z %160 = tt.broadcast %158 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:25:22.8583015Z %161 = arith.addi %159, %160 : tensor<8x1024xi32> 2026-02-21T09:25:22.8583258Z %162 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:25:22.8583547Z %163 = tt.addptr %162, %161 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:25:22.8583848Z %164 = tt.expand_dims %145 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:25:22.8584144Z %165 = tt.broadcast %164 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:25:22.8584391Z tt.store %163, %155, %165 : tensor<8x1024x!tt.ptr> 2026-02-21T09:25:22.8584612Z %c2_i32_10 = arith.constant 2 : i32 
2026-02-21T09:25:22.8584856Z %166 = arith.muli %c1024_i32, %c2_i32_10 : i32 2026-02-21T09:25:22.8585051Z %167 = arith.addi %c0_i32, %166 : i32 2026-02-21T09:25:22.8585291Z %168 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:25:22.8585541Z %169 = tt.splat %167 : i32 -> tensor<1024xi32> 2026-02-21T09:25:22.8585754Z %170 = arith.addi %169, %168 : tensor<1024xi32> 2026-02-21T09:25:22.8585971Z %171 = arith.cmpi slt, %170, %cst_3 : tensor<1024xi32> 2026-02-21T09:25:22.8586278Z %172 = tt.descriptor_load %0[%2, %167] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:25:22.8586630Z %173 = tt.expand_dims %115#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:25:22.8586920Z %174 = arith.extf %172 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:25:22.8587189Z %175 = tt.broadcast %173 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:25:22.8587421Z %176 = arith.subf %174, %175 : tensor<8x1024xf32> 2026-02-21T09:25:22.8587804Z %177 = tt.extern_elementwise %176 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:25:22.8588226Z %178 = tt.expand_dims %115#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:25:22.8588518Z %179 = tt.broadcast %178 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:25:22.8588759Z %180 = arith.divf %177, %179 : tensor<8x1024xf32> 2026-02-21T09:25:22.8588997Z %181 = arith.truncf %180 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:25:22.8589286Z %182 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:25:22.8589537Z %183 = arith.muli %182, %cst_2 : tensor<8x1xi32> 2026-02-21T09:25:22.8589802Z %184 = tt.expand_dims %170 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:25:22.8590096Z %185 = tt.broadcast %183 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:25:22.8590358Z %186 = tt.broadcast %184 : tensor<1x1024xi32> -> 
tensor<8x1024xi32> 2026-02-21T09:25:22.8590608Z %187 = arith.addi %185, %186 : tensor<8x1024xi32> 2026-02-21T09:25:22.8590845Z %188 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:25:22.8591133Z %189 = tt.addptr %188, %187 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:25:22.8591449Z %190 = tt.expand_dims %171 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:25:22.8591797Z %191 = tt.broadcast %190 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:25:22.8592063Z tt.store %189, %181, %191 : tensor<8x1024x!tt.ptr> 2026-02-21T09:25:22.8592328Z scf.for %arg3 = %c3072_i32_7 to %c4224_i32 step %c1024_i32 : i32 { 2026-02-21T09:25:22.8592629Z %192 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:25:22.8592900Z %193 = tt.splat %arg3 : i32 -> tensor<1024xi32> 2026-02-21T09:25:22.8593129Z %194 = arith.addi %193, %192 : tensor<1024xi32> 2026-02-21T09:25:22.8593406Z %195 = arith.cmpi slt, %194, %cst_3 : tensor<1024xi32> 2026-02-21T09:25:22.8593735Z %196 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:25:22.8594106Z %197 = tt.expand_dims %115#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:25:22.8594409Z %198 = arith.extf %196 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:25:22.8594690Z %199 = tt.broadcast %197 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:25:22.8594938Z %200 = arith.subf %198, %199 : tensor<8x1024xf32> 2026-02-21T09:25:22.8595333Z %201 = tt.extern_elementwise %200 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:25:22.8595779Z %202 = tt.expand_dims %115#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:25:22.8596146Z %203 = tt.broadcast %202 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:25:22.8596409Z %204 = arith.divf %201, %203 : tensor<8x1024xf32> 2026-02-21T09:25:22.8596661Z %205 = 
arith.truncf %204 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:25:22.8596972Z %206 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:25:22.8597251Z %207 = arith.muli %206, %cst_2 : tensor<8x1xi32> 2026-02-21T09:25:22.8597533Z %208 = tt.expand_dims %194 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:25:22.8597851Z %209 = tt.broadcast %207 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:25:22.8598134Z %210 = tt.broadcast %208 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:25:22.8598406Z %211 = arith.addi %209, %210 : tensor<8x1024xi32> 2026-02-21T09:25:22.8598664Z %212 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:25:22.8598984Z %213 = tt.addptr %212, %211 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:25:22.8599321Z %214 = tt.expand_dims %195 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:25:22.8599628Z %215 = tt.broadcast %214 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:25:22.8599903Z tt.store %213, %205, %215 : tensor<8x1024x!tt.ptr> 2026-02-21T09:25:22.8600139Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:25:22.8600531Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T09:25:22.8600880Z tt.return 2026-02-21T09:25:22.8601017Z } 2026-02-21T09:25:22.8601147Z } 2026-02-21T09:25:22.8601218Z 2026-02-21T09:25:22.8601270Z {-# 2026-02-21T09:25:22.8601408Z external_resources: { 2026-02-21T09:25:22.8601616Z mlir_reproducer: { 2026-02-21T09:25:22.8605991Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, 
tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:25:22.8610512Z disable_threading: false, 2026-02-21T09:25:22.8610683Z verify_each: true 2026-02-21T09:25:22.8610837Z } 2026-02-21T09:25:22.8610957Z } 2026-02-21T09:25:22.8611080Z #-} 2026-02-21T09:25:22.8611603Z /tmp/torchinductor_root/cu/ccudrra547xcwkchef5tpeaz2v4byqm5ndbgc52wz75djktnqouq.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:25:22.8612834Z /tmp/torchinductor_root/cu/ccudrra547xcwkchef5tpeaz2v4byqm5ndbgc52wz75djktnqouq.py:20:0: note: Pipeline failed while executing 
[`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:25:22.8613821Z [36s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:25:22.8614904Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=4, num_warps=32, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[4, 0], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:25:22.8615858Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:25:22.8616120Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:25:27.2348468Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.7 configs/s 2026-02-21T09:25:27.2358336Z [40s] Adaptive compile timeout: 30s (90% percentile=5.5s, bounds=[30.0s, 30s]) 2026-02-21T09:25:28.5194394Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 779.9 configs/s 2026-02-21T09:25:28.6053511Z [42s] Initial random population of 100, 5 starting points: 2026-02-21T09:25:28.6054850Z error=14 2026-02-21T09:25:28.6055014Z timeout=1 2026-02-21T09:25:28.6055140Z ok=85 2026-02-21T09:25:28.6055274Z min=0.0398 2026-02-21T09:25:28.6055400Z mid=0.3544 2026-02-21T09:25:28.6055539Z max=98.1514 2026-02-21T09:25:28.6055685Z best={'block_sizes': [4, 512], 2026-02-21T09:25:28.6055917Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:25:28.6056159Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T09:25:28.6056376Z 'maxnreg': 64, 2026-02-21T09:25:28.6056834Z 'num_sm_multiplier': 8, 2026-02-21T09:25:28.6057010Z 'num_stages': 8, 
2026-02-21T09:25:28.6057160Z 'num_warps': 8, 2026-02-21T09:25:28.6057312Z 'pid_type': 'persistent_blocked', 2026-02-21T09:25:28.6057502Z 'range_flattens': [False, None], 2026-02-21T09:25:28.6057680Z 'range_multi_buffers': [False, None], 2026-02-21T09:25:28.6057865Z 'range_num_stages': [4, 3], 2026-02-21T09:25:28.6058027Z 'range_unroll_factors': [2, 1], 2026-02-21T09:25:28.6058211Z 'range_warp_specializes': [False, True]} 2026-02-21T09:25:28.6068681Z [42s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:25:30.2294779Z [43s] Generation 1 starting: 92 neighbors, 5 active search path(s) 2026-02-21T09:25:41.0489524Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/95 3.9 configs/s 2026-02-21T09:25:46.7327387Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 95/95 16.8 configs/s 2026-02-21T09:25:49.7685374Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 333.6 2026-02-21T09:25:49.7685949Z configs/s 2026-02-21T09:25:49.9702082Z [63s] Generation 1 complete: 2026-02-21T09:25:49.9706072Z ok=98 2026-02-21T09:25:49.9710432Z min=0.0307 2026-02-21T09:25:49.9714972Z mid=0.0460 2026-02-21T09:25:49.9716484Z max=0.2191 2026-02-21T09:25:49.9716661Z best={'block_sizes': [2, 8192], 2026-02-21T09:25:49.9716920Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:25:49.9717180Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T09:25:49.9717380Z 'num_stages': 1, 2026-02-21T09:25:49.9717521Z 'num_warps': 1, 2026-02-21T09:25:49.9717668Z 'pid_type': 'flat', 2026-02-21T09:25:49.9717825Z 'range_flattens': [None, True], 2026-02-21T09:25:49.9718013Z 'range_multi_buffers': [None, True], 2026-02-21T09:25:49.9718193Z 'range_num_stages': [0, 4], 2026-02-21T09:25:49.9718365Z 'range_unroll_factors': [0, 0], 2026-02-21T09:25:49.9718572Z 'range_warp_specializes': [None, True]} 2026-02-21T09:25:49.9718870Z [63s] Fitting surrogate: 198 points, 198 targets 2026-02-21T09:25:51.1321352Z [64s] Generation 2 starting: 82 
neighbors, 5 active search path(s) 2026-02-21T09:25:59.7321130Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 12.5 configs/s 2026-02-21T09:26:05.3403065Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 15.4 configs/s 2026-02-21T09:26:06.7677683Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 706.8 2026-02-21T09:26:06.7678027Z configs/s 2026-02-21T09:26:06.8755644Z [80s] Generation 2 complete: 2026-02-21T09:26:06.8760494Z ok=88 2026-02-21T09:26:06.8765039Z min=0.0205 2026-02-21T09:26:06.8766522Z mid=0.0348 2026-02-21T09:26:06.8766691Z max=0.1106 2026-02-21T09:26:06.8766834Z best={'block_sizes': [1, 8192], 2026-02-21T09:26:06.8767109Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:26:06.8767440Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:26:06.8767652Z 'num_stages': 7, 2026-02-21T09:26:06.8767801Z 'num_warps': 4, 2026-02-21T09:26:06.8767940Z 'pid_type': 'flat', 2026-02-21T09:26:06.8768099Z 'range_flattens': [None, True], 2026-02-21T09:26:06.8768271Z 'range_multi_buffers': [None, None], 2026-02-21T09:26:06.8768456Z 'range_num_stages': [0, 3], 2026-02-21T09:26:06.8768616Z 'range_unroll_factors': [0, 3], 2026-02-21T09:26:06.8768797Z 'range_warp_specializes': [None, False]} 2026-02-21T09:26:06.8769014Z [80s] Fitting surrogate: 286 points, 286 targets 2026-02-21T09:26:07.8681831Z [81s] Generation 3 starting: 67 neighbors, 5 active search path(s) 2026-02-21T09:26:22.6211420Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 1.6 configs/s 2026-02-21T09:26:26.9243014Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 71/71 16.7 configs/s 2026-02-21T09:26:28.9599201Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 497.8 2026-02-21T09:26:28.9600388Z configs/s 2026-02-21T09:26:29.1198266Z [102s] Generation 3 complete: 2026-02-21T09:26:29.1199635Z ok=72 2026-02-21T09:26:29.1199805Z min=0.0204 
2026-02-21T09:26:29.1199952Z mid=0.0327 2026-02-21T09:26:29.1200090Z max=0.2712 2026-02-21T09:26:29.1200239Z best={'block_sizes': [1, 8192], 2026-02-21T09:26:29.1200529Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:26:29.1200828Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:26:29.1201034Z 'num_stages': 7, 2026-02-21T09:26:29.1201179Z 'num_warps': 4, 2026-02-21T09:26:29.1201331Z 'pid_type': 'flat', 2026-02-21T09:26:29.1201492Z 'range_flattens': [None, True], 2026-02-21T09:26:29.1201913Z 'range_multi_buffers': [None, None], 2026-02-21T09:26:29.1202119Z 'range_num_stages': [0, 3], 2026-02-21T09:26:29.1202292Z 'range_unroll_factors': [0, 3], 2026-02-21T09:26:29.1202828Z 'range_warp_specializes': [None, False]} 2026-02-21T09:26:29.1240876Z [102s] Fitting surrogate: 358 points, 358 targets 2026-02-21T09:26:30.0554517Z [103s] Generation 4 starting: 56 neighbors, 5 active search path(s) 2026-02-21T09:26:39.0748264Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58/58 1.7 configs/s 2026-02-21T09:26:42.5765503Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 58/58 16.8 configs/s 2026-02-21T09:26:44.0580239Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 684.0 2026-02-21T09:26:44.0580577Z configs/s 2026-02-21T09:26:44.1728270Z [117s] Generation 4 complete: 2026-02-21T09:26:44.1730233Z ok=61 2026-02-21T09:26:44.1730403Z min=0.0204 2026-02-21T09:26:44.1730534Z mid=0.0328 2026-02-21T09:26:44.1730663Z max=0.1576 2026-02-21T09:26:44.1730809Z best={'block_sizes': [1, 8192], 2026-02-21T09:26:44.1731071Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:26:44.1731396Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:26:44.1731675Z 'num_stages': 7, 2026-02-21T09:26:44.1731823Z 'num_warps': 4, 2026-02-21T09:26:44.1731967Z 'pid_type': 'flat', 2026-02-21T09:26:44.1732133Z 'range_flattens': [None, True], 2026-02-21T09:26:44.1732309Z 
'range_multi_buffers': [None, None], 2026-02-21T09:26:44.1732497Z 'range_num_stages': [0, 3], 2026-02-21T09:26:44.1732669Z 'range_unroll_factors': [0, 3], 2026-02-21T09:26:44.1732850Z 'range_warp_specializes': [None, False]} 2026-02-21T09:26:44.1749664Z [117s] Fitting surrogate: 419 points, 419 targets 2026-02-21T09:26:44.9382571Z [118s] Generation 5 starting: 18 neighbors, 1 active search path(s) 2026-02-21T09:26:49.7827219Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 3.0 configs/s 2026-02-21T09:26:50.9355287Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 19/19 17.1 configs/s 2026-02-21T09:26:50.9359812Z [124s] Generation 5 complete: 2026-02-21T09:26:50.9361911Z ok=20 2026-02-21T09:26:50.9362149Z min=0.0204 2026-02-21T09:26:50.9364344Z mid=0.0409 2026-02-21T09:26:50.9369498Z max=0.0552 2026-02-21T09:26:50.9371395Z best={'block_sizes': [1, 8192], 2026-02-21T09:26:50.9371788Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:26:50.9372096Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:26:50.9372309Z 'num_stages': 7, 2026-02-21T09:26:50.9372467Z 'num_warps': 4, 2026-02-21T09:26:50.9372626Z 'pid_type': 'flat', 2026-02-21T09:26:50.9372793Z 'range_flattens': [None, True], 2026-02-21T09:26:50.9372987Z 'range_multi_buffers': [None, None], 2026-02-21T09:26:50.9373186Z 'range_num_stages': [0, 3], 2026-02-21T09:26:50.9373358Z 'range_unroll_factors': [0, 3], 2026-02-21T09:26:50.9373551Z 'range_warp_specializes': [None, False]} 2026-02-21T09:26:50.9373778Z [124s] Fitting surrogate: 439 points, 439 targets 2026-02-21T09:26:51.3155925Z [124s] Generation 6 starting: 19 neighbors, 1 active search path(s) 2026-02-21T09:26:55.0395782Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 2.7 configs/s 2026-02-21T09:26:56.2345304Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 20/20 17.4 configs/s 2026-02-21T09:26:56.2351819Z [129s] Generation 6 complete: 
2026-02-21T09:26:56.2354886Z ok=21 2026-02-21T09:26:56.2355137Z min=0.0204 2026-02-21T09:26:56.2355281Z mid=0.0451 2026-02-21T09:26:56.2355432Z max=0.1044 2026-02-21T09:26:56.2359508Z best={'block_sizes': [1, 8192], 2026-02-21T09:26:56.2364892Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:26:56.2366276Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:26:56.2366501Z 'num_stages': 7, 2026-02-21T09:26:56.2366655Z 'num_warps': 4, 2026-02-21T09:26:56.2366799Z 'pid_type': 'flat', 2026-02-21T09:26:56.2366964Z 'range_flattens': [None, True], 2026-02-21T09:26:56.2367148Z 'range_multi_buffers': [None, None], 2026-02-21T09:26:56.2367343Z 'range_num_stages': [0, 3], 2026-02-21T09:26:56.2367894Z 'range_unroll_factors': [0, 3], 2026-02-21T09:26:56.2368110Z 'range_warp_specializes': [None, False]} 2026-02-21T09:26:56.2368341Z [129s] Fitting surrogate: 460 points, 460 targets 2026-02-21T09:26:56.4002873Z [130s] Autotuning complete in 130.1s after searching 448 configs. 
2026-02-21T09:26:56.4005212Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:26:56.4006188Z @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=7, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T09:26:56.4007025Z 2026-02-21T09:26:56.4007275Z [130s] Code of selected kernel: /tmp/torchinductor_root/bl/cbllivasagu25kv3xae4hgfckwjsmqt3hem5fxlwoeb2l3uruoyh.py 2026-02-21T09:26:56.4262493Z from __future__ import annotations 2026-02-21T09:26:56.4264269Z 2026-02-21T09:26:56.4264432Z import torch 2026-02-21T09:26:56.4264597Z import triton 2026-02-21T09:26:56.4264760Z import triton.language as tl 2026-02-21T09:26:56.4264971Z from torch._inductor.runtime import triton_helpers 2026-02-21T09:26:56.4265245Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T09:26:56.4265546Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:26:56.4265722Z 2026-02-21T09:26:56.4265793Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T09:26:56.4265977Z _BLOCK_SIZE_1 = tl.constexpr(8192) 2026-02-21T09:26:56.4266092Z 2026-02-21T09:26:56.4266148Z @triton.jit 2026-02-21T09:26:56.4266296Z def _helion_softmax_two_pass(x, out): 2026-02-21T09:26:56.4266543Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:26:56.4266796Z pid_0 = tl.program_id(0) 2026-02-21T09:26:56.4266958Z offset_0 = pid_0 2026-02-21T09:26:56.4267129Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T09:26:56.4267427Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:26:56.4267715Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T09:26:56.4267983Z # 
src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:26:56.4268229Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T09:26:56.4268484Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:26:56.4268751Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:26:56.4269005Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:26:56.4269237Z # src[softmax.py:82-89]: ... 2026-02-21T09:26:56.4269595Z for offset_2 in tl.range(0, 4224, _BLOCK_SIZE_1, loop_unroll_factor=3, warp_specialize=False, num_stages=1, flatten=True): 2026-02-21T09:26:56.4269993Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:26:56.4270239Z mask_1 = indices_2 < 4224 2026-02-21T09:26:56.4270745Z mi_copy = mi 2026-02-21T09:26:56.4270894Z di_copy = di 2026-02-21T09:26:56.4271037Z mi_copy_0 = mi_copy 2026-02-21T09:26:56.4271197Z di_copy_0 = di_copy 2026-02-21T09:26:56.4271378Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:26:56.4272018Z values = tl.load(x + (indices_0[:, None] * 4224 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first') 2026-02-21T09:26:56.4272418Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:26:56.4272817Z _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16)) 2026-02-21T09:26:56.4273213Z local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16) 2026-02-21T09:26:56.4273470Z # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax) 2026-02-21T09:26:56.4273825Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T09:26:56.4274048Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T09:26:56.4274308Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:26:56.4274550Z v_2 = mi_copy_0 - v_1 2026-02-21T09:26:56.4274719Z v_3 = libdevice.exp(v_2) 2026-02-21T09:26:56.4274891Z v_4 = 
di_copy_0 * v_3 2026-02-21T09:26:56.4275077Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:26:56.4275283Z subscript = v_1[:, None] 2026-02-21T09:26:56.4275453Z v_5 = tl.cast(values, tl.float32) 2026-02-21T09:26:56.4275636Z v_6 = v_5 - subscript 2026-02-21T09:26:56.4275849Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:26:56.4276106Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:26:56.4276325Z # src[softmax.py:88]: ).sum(dim=1) 2026-02-21T09:26:56.4276507Z v_7 = libdevice.exp(v_6) 2026-02-21T09:26:56.4276832Z _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32)) 2026-02-21T09:26:56.4277183Z sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32) 2026-02-21T09:26:56.4277384Z di = v_4 + sum_1 2026-02-21T09:26:56.4277554Z # src[softmax.py:89]: mi = mi_next 2026-02-21T09:26:56.4277727Z mi = v_1 2026-02-21T09:26:56.4277959Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:26:56.4278223Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:26:56.4278518Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:26:56.4278942Z for offset_2 in tl.range(0, 4224, _BLOCK_SIZE_1, loop_unroll_factor=3, warp_specialize=False, num_stages=1, flatten=True): 2026-02-21T09:26:56.4279337Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:26:56.4279572Z mask_2 = indices_2 < 4224 2026-02-21T09:26:56.4279740Z mi_copy_1 = mi 2026-02-21T09:26:56.4279896Z di_copy_1 = di 2026-02-21T09:26:56.4280041Z mi_copy_1_0 = mi_copy_1 2026-02-21T09:26:56.4280209Z di_copy_1_0 = di_copy_1 2026-02-21T09:26:56.4280389Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:26:56.4280758Z values_1 = tl.load(x + (indices_0[:, None] * 4224 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first') 2026-02-21T09:26:56.4281187Z # 
src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:26:56.4281455Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T09:26:56.4281726Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T09:26:56.4281906Z v_10 = v_9 - subscript_1 2026-02-21T09:26:56.4282083Z v_11 = libdevice.exp(v_10) 2026-02-21T09:26:56.4282259Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T09:26:56.4282445Z v_12 = v_11 / subscript_2 2026-02-21T09:26:56.4282616Z v_13 = tl.cast(v_12, tl.float16) 2026-02-21T09:26:56.4282974Z tl.store(out + (indices_0[:, None] * 4224 + indices_2[None, :] * 1), v_13, mask_2[None, :]) 2026-02-21T09:26:56.4283182Z 2026-02-21T09:26:56.4283315Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T09:26:56.4283541Z """ 2026-02-21T09:26:56.4283746Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T09:26:56.4284047Z This version uses fewer passes but is less numerically stable. 2026-02-21T09:26:56.4284268Z Args: 2026-02-21T09:26:56.4284427Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T09:26:56.4284625Z Returns: 2026-02-21T09:26:56.4284807Z torch.Tensor: Softmax output tensor of the same shape. 2026-02-21T09:26:56.4285010Z """ 2026-02-21T09:26:56.4285154Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T09:26:56.4285327Z m, n = x.size() 2026-02-21T09:26:56.4285560Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T09:26:56.4285769Z out = torch.empty_like(x) 2026-02-21T09:26:56.4286001Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:26:56.4286315Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:26:56.4286628Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:26:56.4286873Z # src[softmax.py:79-92]: ... 
2026-02-21T09:26:56.4287124Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=7) 2026-02-21T09:26:56.4287402Z # src[softmax.py:93]: return out 2026-02-21T09:26:56.4287573Z return out 2026-02-21T09:26:56.9256576Z WARNING:tritonbench.utils.triton_op:Completed input ID 31: 2026-02-21T09:26:56.9258544Z (M, N) 2026-02-21T09:26:56.9258707Z ------------ 2026-02-21T09:26:56.9258859Z (4096, 4224) 2026-02-21T09:26:56.9265458Z 2026-02-21T09:26:56.9266103Z 35%|███▌ | 7/20 [18:02<35:07, 162.15s/it]WARNING:tritonbench.utils.triton_op:Running input ID 36: 2026-02-21T09:26:56.9267822Z (M, N) 2026-02-21T09:26:56.9268028Z ------------ 2026-02-21T09:26:56.9273964Z (4096, 4864) 2026-02-21T09:26:56.9278636Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T09:26:58.1964418Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T09:26:59.6430358Z INFO:tritonbench.utils.triton_op:Took 2.48ms to get benchmark function for torch_compile_softmax 2026-02-21T09:27:00.9660899Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:27:00.9664519Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:27:00.9668836Z 'dtype': 'torch.float16', 2026-02-21T09:27:00.9673269Z 'shape': (4096, 4864), 2026-02-21T09:27:00.9677110Z 'stride': (4864, 1)},), 2026-02-21T09:27:00.9681158Z 'kwargs': {}} 2026-02-21T09:27:00.9681721Z INFO:tritonbench.utils.triton_op:Took 2.32ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T09:27:01.1463148Z [0s] Autotune random seed: 2138408546 2026-02-21T09:27:01.1721833Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:27:35.8884921Z [34s] Timeout after 30s compiling Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=1, num_warps=4, pid_type='flat', 
range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T09:27:35.8902999Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s 2026-02-21T09:27:38.2047113Z module { 2026-02-21T09:27:38.2051130Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:27:38.2051924Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:27:38.2057238Z %cst = arith.constant dense<0.000000e+00> : tensor<8x1024xf16> 2026-02-21T09:27:38.2061128Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T09:27:38.2065003Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:27:38.2067008Z %c592_i32 = arith.constant 592 : i32 2026-02-21T09:27:38.2067262Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<8x1024xf32> 2026-02-21T09:27:38.2067541Z %cst_1 = arith.constant dense<0xFC00> : tensor<8x1024xf16> 2026-02-21T09:27:38.2067797Z %cst_2 = arith.constant dense<4864> : tensor<8x1xi32> 2026-02-21T09:27:38.2068027Z %cst_3 = arith.constant dense<4864> : tensor<1024xi32> 2026-02-21T09:27:38.2068276Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32> 2026-02-21T09:27:38.2068528Z %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32> 2026-02-21T09:27:38.2068745Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:27:38.2068934Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:27:38.2069403Z %c4864_i32 = arith.constant 4864 : i32 2026-02-21T09:27:38.2069673Z %c4864_i64 = arith.constant 4864 : i64 2026-02-21T09:27:38.2069852Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:27:38.2070175Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4864_i32], [%c4864_i64, %c1_i64] : , > 2026-02-21T09:27:38.2070509Z %1 = tt.get_program_id x : i32 2026-02-21T09:27:38.2070721Z scf.for %arg2 = %1 to %c512_i32 step %c592_i32 : i32 { 2026-02-21T09:27:38.2070948Z %2 = 
arith.muli %arg2, %c8_i32 : i32 2026-02-21T09:27:38.2071173Z %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:27:38.2071427Z %4 = tt.splat %2 : i32 -> tensor<8xi32> 2026-02-21T09:27:38.2071687Z %5 = arith.addi %4, %3 : tensor<8xi32> 2026-02-21T09:27:38.2071879Z %c3072_i32 = arith.constant 3072 : i32 2026-02-21T09:27:38.2072079Z %c3072_i32_6 = arith.constant 3072 : i32 2026-02-21T09:27:38.2072324Z %6 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:27:38.2072599Z %7 = tt.splat %c0_i32 : i32 -> tensor<1024xi32> 2026-02-21T09:27:38.2072807Z %8 = arith.addi %7, %6 : tensor<1024xi32> 2026-02-21T09:27:38.2073025Z %9 = arith.cmpi slt, %8, %cst_3 : tensor<1024xi32> 2026-02-21T09:27:38.2073282Z %10 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:27:38.2073547Z %11 = arith.muli %10, %cst_2 : tensor<8x1xi32> 2026-02-21T09:27:38.2073809Z %12 = tt.expand_dims %8 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:27:38.2074095Z %13 = tt.broadcast %11 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:27:38.2074368Z %14 = tt.broadcast %12 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:27:38.2074605Z %15 = arith.addi %13, %14 : tensor<8x1024xi32> 2026-02-21T09:27:38.2074851Z %16 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:27:38.2075127Z %17 = tt.addptr %16, %15 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:27:38.2075425Z %18 = tt.expand_dims %9 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:27:38.2075717Z %19 = tt.broadcast %18 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:27:38.2075964Z %20 = tt.load %17, %19, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T09:27:38.2076227Z %21 = arith.select %19, %20, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:27:38.2076499Z %22 = arith.extf %21 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:27:38.2076731Z %23 = 
"tt.reduce"(%22) <{axis = 1 : i32}> ({ 2026-02-21T09:27:38.2076920Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:27:38.2077110Z %192 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:27:38.2077304Z tt.reduce.return %192 : f32 2026-02-21T09:27:38.2077485Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:27:38.2077706Z %24 = arith.truncf %23 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:27:38.2078097Z %25 = arith.extf %24 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:27:38.2078326Z %26 = arith.cmpf ogt, %cst_5, %25 : tensor<8xf32> 2026-02-21T09:27:38.2078547Z %27 = arith.cmpf une, %cst_5, %cst_5 : tensor<8xf32> 2026-02-21T09:27:38.2078762Z %28 = arith.ori %26, %27 : tensor<8xi1> 2026-02-21T09:27:38.2078990Z %29 = arith.select %28, %cst_5, %25 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:27:38.2079222Z %30 = arith.subf %cst_5, %29 : tensor<8xf32> 2026-02-21T09:27:38.2079586Z %31 = tt.extern_elementwise %30 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:27:38.2079936Z %32 = arith.mulf %cst_4, %31 : tensor<8xf32> 2026-02-21T09:27:38.2080185Z %33 = tt.expand_dims %29 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:27:38.2080463Z %34 = arith.extf %20 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:27:38.2080784Z %35 = tt.broadcast %33 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:27:38.2081026Z %36 = arith.subf %34, %35 : tensor<8x1024xf32> 2026-02-21T09:27:38.2081389Z %37 = tt.extern_elementwise %36 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:27:38.2081840Z %38 = arith.select %19, %37, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T09:27:38.2082079Z %39 = "tt.reduce"(%38) <{axis = 1 : i32}> ({ 2026-02-21T09:27:38.2082274Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:27:38.2082459Z %192 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:27:38.2082644Z tt.reduce.return %192 : f32 
2026-02-21T09:27:38.2082836Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:27:38.2083034Z %40 = arith.addf %32, %39 : tensor<8xf32> 2026-02-21T09:27:38.2083234Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:27:38.2083424Z %41 = arith.muli %c1024_i32, %c1_i32 : i32 2026-02-21T09:27:38.2083629Z %42 = arith.addi %c0_i32, %41 : i32 2026-02-21T09:27:38.2083861Z %43 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:27:38.2084121Z %44 = tt.splat %42 : i32 -> tensor<1024xi32> 2026-02-21T09:27:38.2084323Z %45 = arith.addi %44, %43 : tensor<1024xi32> 2026-02-21T09:27:38.2084531Z %46 = arith.cmpi slt, %45, %cst_3 : tensor<1024xi32> 2026-02-21T09:27:38.2084790Z %47 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:27:38.2085042Z %48 = arith.muli %47, %cst_2 : tensor<8x1xi32> 2026-02-21T09:27:38.2085303Z %49 = tt.expand_dims %45 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:27:38.2085590Z %50 = tt.broadcast %48 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:27:38.2085858Z %51 = tt.broadcast %49 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:27:38.2086098Z %52 = arith.addi %50, %51 : tensor<8x1024xi32> 2026-02-21T09:27:38.2086336Z %53 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:27:38.2086615Z %54 = tt.addptr %53, %52 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:27:38.2086907Z %55 = tt.expand_dims %46 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:27:38.2087198Z %56 = tt.broadcast %55 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:27:38.2087452Z %57 = tt.load %54, %56, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T09:27:38.2087707Z %58 = arith.select %56, %57, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:27:38.2088012Z %59 = arith.extf %58 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:27:38.2088235Z %60 = "tt.reduce"(%59) <{axis = 1 : i32}> ({ 
2026-02-21T09:27:38.2088428Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:27:38.2088607Z %192 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:27:38.2088802Z tt.reduce.return %192 : f32 2026-02-21T09:27:38.2089045Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:27:38.2089270Z %61 = arith.truncf %60 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:27:38.2089508Z %62 = arith.extf %61 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:27:38.2089724Z %63 = arith.cmpf ogt, %29, %62 : tensor<8xf32> 2026-02-21T09:27:38.2089932Z %64 = arith.cmpf une, %29, %29 : tensor<8xf32> 2026-02-21T09:27:38.2090127Z %65 = arith.ori %63, %64 : tensor<8xi1> 2026-02-21T09:27:38.2090345Z %66 = arith.select %65, %29, %62 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:27:38.2090568Z %67 = arith.subf %29, %66 : tensor<8xf32> 2026-02-21T09:27:38.2090919Z %68 = tt.extern_elementwise %67 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:27:38.2091277Z %69 = arith.mulf %40, %68 : tensor<8xf32> 2026-02-21T09:27:38.2091514Z %70 = tt.expand_dims %66 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:27:38.2091895Z %71 = arith.extf %57 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:27:38.2092151Z %72 = tt.broadcast %70 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:27:38.2092383Z %73 = arith.subf %71, %72 : tensor<8x1024xf32> 2026-02-21T09:27:38.2092751Z %74 = tt.extern_elementwise %73 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:27:38.2093186Z %75 = arith.select %56, %74, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T09:27:38.2093435Z %76 = "tt.reduce"(%75) <{axis = 1 : i32}> ({ 2026-02-21T09:27:38.2093628Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:27:38.2093803Z %192 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:27:38.2093998Z tt.reduce.return %192 : f32 2026-02-21T09:27:38.2094182Z }) : (tensor<8x1024xf32>) -> 
tensor<8xf32> 2026-02-21T09:27:38.2094385Z %77 = arith.addf %69, %76 : tensor<8xf32> 2026-02-21T09:27:38.2094576Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:27:38.2094766Z %78 = arith.muli %c1024_i32, %c2_i32 : i32 2026-02-21T09:27:38.2094960Z %79 = arith.addi %c0_i32, %78 : i32 2026-02-21T09:27:38.2095188Z %80 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:27:38.2095440Z %81 = tt.splat %79 : i32 -> tensor<1024xi32> 2026-02-21T09:27:38.2095635Z %82 = arith.addi %81, %80 : tensor<1024xi32> 2026-02-21T09:27:38.2095850Z %83 = arith.cmpi slt, %82, %cst_3 : tensor<1024xi32> 2026-02-21T09:27:38.2096103Z %84 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:27:38.2096363Z %85 = arith.muli %84, %cst_2 : tensor<8x1xi32> 2026-02-21T09:27:38.2096621Z %86 = tt.expand_dims %82 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:27:38.2096909Z %87 = tt.broadcast %85 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:27:38.2097182Z %88 = tt.broadcast %86 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:27:38.2097426Z %89 = arith.addi %87, %88 : tensor<8x1024xi32> 2026-02-21T09:27:38.2097675Z %90 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:27:38.2097960Z %91 = tt.addptr %90, %89 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:27:38.2098274Z %92 = tt.expand_dims %83 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:27:38.2098575Z %93 = tt.broadcast %92 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:27:38.2098832Z %94 = tt.load %91, %93, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T09:27:38.2099107Z %95 = arith.select %93, %94, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:27:38.2099385Z %96 = arith.extf %95 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:27:38.2099623Z %97 = "tt.reduce"(%96) <{axis = 1 : i32}> ({ 2026-02-21T09:27:38.2099816Z ^bb0(%arg3: f32, %arg4: f32): 
2026-02-21T09:27:38.2100008Z %192 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:27:38.2100260Z tt.reduce.return %192 : f32 2026-02-21T09:27:38.2100445Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:27:38.2100677Z %98 = arith.truncf %97 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:27:38.2100919Z %99 = arith.extf %98 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:27:38.2101155Z %100 = arith.cmpf ogt, %66, %99 : tensor<8xf32> 2026-02-21T09:27:38.2101375Z %101 = arith.cmpf une, %66, %66 : tensor<8xf32> 2026-02-21T09:27:38.2101662Z %102 = arith.ori %100, %101 : tensor<8xi1> 2026-02-21T09:27:38.2101910Z %103 = arith.select %102, %66, %99 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:27:38.2102158Z %104 = arith.subf %66, %103 : tensor<8xf32> 2026-02-21T09:27:38.2102544Z %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:27:38.2102984Z %106 = arith.mulf %77, %105 : tensor<8xf32> 2026-02-21T09:27:38.2103255Z %107 = tt.expand_dims %103 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:27:38.2103557Z %108 = arith.extf %94 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:27:38.2103840Z %109 = tt.broadcast %107 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:27:38.2104101Z %110 = arith.subf %108, %109 : tensor<8x1024xf32> 2026-02-21T09:27:38.2104486Z %111 = tt.extern_elementwise %110 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:27:38.2104928Z %112 = arith.select %93, %111, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T09:27:38.2105221Z %113 = "tt.reduce"(%112) <{axis = 1 : i32}> ({ 2026-02-21T09:27:38.2105420Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:27:38.2105609Z %192 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:27:38.2105793Z tt.reduce.return %192 : f32 2026-02-21T09:27:38.2105980Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 
2026-02-21T09:27:38.2106174Z %114 = arith.addf %106, %113 : tensor<8xf32> 2026-02-21T09:27:38.2106549Z %115:2 = scf.for %arg3 = %c3072_i32 to %c4864_i32 step %c1024_i32 iter_args(%arg4 = %103, %arg5 = %114) -> (tensor<8xf32>, tensor<8xf32>) : i32 { 2026-02-21T09:27:38.2106960Z %192 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:27:38.2107227Z %193 = tt.splat %arg3 : i32 -> tensor<1024xi32> 2026-02-21T09:27:38.2107442Z %194 = arith.addi %193, %192 : tensor<1024xi32> 2026-02-21T09:27:38.2107659Z %195 = arith.cmpi slt, %194, %cst_3 : tensor<1024xi32> 2026-02-21T09:27:38.2107931Z %196 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:27:38.2108194Z %197 = arith.muli %196, %cst_2 : tensor<8x1xi32> 2026-02-21T09:27:38.2108462Z %198 = tt.expand_dims %194 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:27:38.2108759Z %199 = tt.broadcast %197 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:27:38.2109037Z %200 = tt.broadcast %198 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:27:38.2109293Z %201 = arith.addi %199, %200 : tensor<8x1024xi32> 2026-02-21T09:27:38.2109532Z %202 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:27:38.2109822Z %203 = tt.addptr %202, %201 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:27:38.2110128Z %204 = tt.expand_dims %195 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:27:38.2110433Z %205 = tt.broadcast %204 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:27:38.2110701Z %206 = tt.load %203, %205, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T09:27:38.2110973Z %207 = arith.select %205, %206, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:27:38.2111266Z %208 = arith.extf %207 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:27:38.2111621Z %209 = "tt.reduce"(%208) <{axis = 1 : i32}> ({ 2026-02-21T09:27:38.2111826Z ^bb0(%arg6: f32, %arg7: f32): 
2026-02-21T09:27:38.2112013Z %227 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:27:38.2112216Z tt.reduce.return %227 : f32 2026-02-21T09:27:38.2112412Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:27:38.2112634Z %210 = arith.truncf %209 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:27:38.2112885Z %211 = arith.extf %210 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:27:38.2113118Z %212 = arith.cmpf ogt, %arg4, %211 : tensor<8xf32> 2026-02-21T09:27:38.2113348Z %213 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32> 2026-02-21T09:27:38.2113556Z %214 = arith.ori %212, %213 : tensor<8xi1> 2026-02-21T09:27:38.2113795Z %215 = arith.select %214, %arg4, %211 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:27:38.2114037Z %216 = arith.subf %arg4, %215 : tensor<8xf32> 2026-02-21T09:27:38.2114463Z %217 = tt.extern_elementwise %216 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:27:38.2114831Z %218 = arith.mulf %arg5, %217 : tensor<8xf32> 2026-02-21T09:27:38.2115080Z %219 = tt.expand_dims %215 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:27:38.2115371Z %220 = arith.extf %206 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:27:38.2115635Z %221 = tt.broadcast %219 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:27:38.2115883Z %222 = arith.subf %220, %221 : tensor<8x1024xf32> 2026-02-21T09:27:38.2116263Z %223 = tt.extern_elementwise %222 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:27:38.2116686Z %224 = arith.select %205, %223, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T09:27:38.2116956Z %225 = "tt.reduce"(%224) <{axis = 1 : i32}> ({ 2026-02-21T09:27:38.2117148Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:27:38.2117336Z %227 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:27:38.2117523Z tt.reduce.return %227 : f32 2026-02-21T09:27:38.2117713Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 
2026-02-21T09:27:38.2117923Z %226 = arith.addf %218, %225 : tensor<8xf32> 2026-02-21T09:27:38.2118138Z scf.yield %215, %226 : tensor<8xf32>, tensor<8xf32> 2026-02-21T09:27:38.2118360Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:27:38.2118556Z %c3072_i32_7 = arith.constant 3072 : i32 2026-02-21T09:27:38.2118753Z %c3072_i32_8 = arith.constant 3072 : i32 2026-02-21T09:27:38.2118989Z %116 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:27:38.2119260Z %117 = tt.splat %c0_i32 : i32 -> tensor<1024xi32> 2026-02-21T09:27:38.2119475Z %118 = arith.addi %117, %116 : tensor<1024xi32> 2026-02-21T09:27:38.2119696Z %119 = arith.cmpi slt, %118, %cst_3 : tensor<1024xi32> 2026-02-21T09:27:38.2120017Z %120 = tt.descriptor_load %0[%2, %c0_i32] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:27:38.2120363Z %121 = tt.expand_dims %115#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:27:38.2120658Z %122 = arith.extf %120 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:27:38.2120919Z %123 = tt.broadcast %121 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:27:38.2121165Z %124 = arith.subf %122, %123 : tensor<8x1024xf32> 2026-02-21T09:27:38.2121576Z %125 = tt.extern_elementwise %124 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:27:38.2121990Z %126 = tt.expand_dims %115#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:27:38.2122285Z %127 = tt.broadcast %126 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:27:38.2122523Z %128 = arith.divf %125, %127 : tensor<8x1024xf32> 2026-02-21T09:27:38.2122816Z %129 = arith.truncf %128 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:27:38.2123106Z %130 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:27:38.2123362Z %131 = arith.muli %130, %cst_2 : tensor<8x1xi32> 2026-02-21T09:27:38.2123629Z %132 = tt.expand_dims %118 {axis = 0 
: i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:27:38.2123922Z %133 = tt.broadcast %131 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:27:38.2124193Z %134 = tt.broadcast %132 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:27:38.2124430Z %135 = arith.addi %133, %134 : tensor<8x1024xi32> 2026-02-21T09:27:38.2124671Z %136 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:27:38.2124962Z %137 = tt.addptr %136, %135 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:27:38.2125310Z %138 = tt.expand_dims %119 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:27:38.2125611Z %139 = tt.broadcast %138 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:27:38.2125862Z tt.store %137, %129, %139 : tensor<8x1024x!tt.ptr> 2026-02-21T09:27:38.2126082Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:27:38.2126278Z %140 = arith.muli %c1024_i32, %c1_i32_9 : i32 2026-02-21T09:27:38.2126467Z %141 = arith.addi %c0_i32, %140 : i32 2026-02-21T09:27:38.2126704Z %142 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:27:38.2126956Z %143 = tt.splat %141 : i32 -> tensor<1024xi32> 2026-02-21T09:27:38.2127164Z %144 = arith.addi %143, %142 : tensor<1024xi32> 2026-02-21T09:27:38.2127375Z %145 = arith.cmpi slt, %144, %cst_3 : tensor<1024xi32> 2026-02-21T09:27:38.2127685Z %146 = tt.descriptor_load %0[%2, %141] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:27:38.2128028Z %147 = tt.expand_dims %115#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:27:38.2128313Z %148 = arith.extf %146 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:27:38.2128574Z %149 = tt.broadcast %147 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:27:38.2128809Z %150 = arith.subf %148, %149 : tensor<8x1024xf32> 2026-02-21T09:27:38.2129185Z %151 = tt.extern_elementwise %150 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 
2026-02-21T09:27:38.2129595Z %152 = tt.expand_dims %115#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:27:38.2129891Z %153 = tt.broadcast %152 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:27:38.2130135Z %154 = arith.divf %151, %153 : tensor<8x1024xf32> 2026-02-21T09:27:38.2130367Z %155 = arith.truncf %154 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:27:38.2130652Z %156 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:27:38.2130908Z %157 = arith.muli %156, %cst_2 : tensor<8x1xi32> 2026-02-21T09:27:38.2131176Z %158 = tt.expand_dims %144 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:27:38.2131473Z %159 = tt.broadcast %157 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:27:38.2131768Z %160 = tt.broadcast %158 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:27:38.2132017Z %161 = arith.addi %159, %160 : tensor<8x1024xi32> 2026-02-21T09:27:38.2132257Z %162 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:27:38.2132547Z %163 = tt.addptr %162, %161 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:27:38.2132852Z %164 = tt.expand_dims %145 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:27:38.2133157Z %165 = tt.broadcast %164 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:27:38.2133418Z tt.store %163, %155, %165 : tensor<8x1024x!tt.ptr> 2026-02-21T09:27:38.2133685Z %c2_i32_10 = arith.constant 2 : i32 2026-02-21T09:27:38.2133893Z %166 = arith.muli %c1024_i32, %c2_i32_10 : i32 2026-02-21T09:27:38.2134091Z %167 = arith.addi %c0_i32, %166 : i32 2026-02-21T09:27:38.2134329Z %168 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:27:38.2134578Z %169 = tt.splat %167 : i32 -> tensor<1024xi32> 2026-02-21T09:27:38.2134789Z %170 = arith.addi %169, %168 : tensor<1024xi32> 2026-02-21T09:27:38.2135013Z %171 = arith.cmpi slt, %170, %cst_3 : tensor<1024xi32> 
2026-02-21T09:27:38.2135313Z %172 = tt.descriptor_load %0[%2, %167] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:27:38.2135663Z %173 = tt.expand_dims %115#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:27:38.2135954Z %174 = arith.extf %172 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:27:38.2136265Z %175 = tt.broadcast %173 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:27:38.2136510Z %176 = arith.subf %174, %175 : tensor<8x1024xf32> 2026-02-21T09:27:38.2136892Z %177 = tt.extern_elementwise %176 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:27:38.2137318Z %178 = tt.expand_dims %115#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:27:38.2137608Z %179 = tt.broadcast %178 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:27:38.2137855Z %180 = arith.divf %177, %179 : tensor<8x1024xf32> 2026-02-21T09:27:38.2138093Z %181 = arith.truncf %180 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:27:38.2138387Z %182 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:27:38.2138653Z %183 = arith.muli %182, %cst_2 : tensor<8x1xi32> 2026-02-21T09:27:38.2138915Z %184 = tt.expand_dims %170 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:27:38.2139221Z %185 = tt.broadcast %183 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:27:38.2139484Z %186 = tt.broadcast %184 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:27:38.2139730Z %187 = arith.addi %185, %186 : tensor<8x1024xi32> 2026-02-21T09:27:38.2139967Z %188 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:27:38.2140261Z %189 = tt.addptr %188, %187 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:27:38.2140602Z %190 = tt.expand_dims %171 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:27:38.2140893Z %191 = tt.broadcast %190 : tensor<1x1024xi1> -> 
tensor<8x1024xi1> 2026-02-21T09:27:38.2141159Z tt.store %189, %181, %191 : tensor<8x1024x!tt.ptr> 2026-02-21T09:27:38.2141422Z scf.for %arg3 = %c3072_i32_7 to %c4864_i32 step %c1024_i32 : i32 { 2026-02-21T09:27:38.2141757Z %192 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:27:38.2142043Z %193 = tt.splat %arg3 : i32 -> tensor<1024xi32> 2026-02-21T09:27:38.2142263Z %194 = arith.addi %193, %192 : tensor<1024xi32> 2026-02-21T09:27:38.2142502Z %195 = arith.cmpi slt, %194, %cst_3 : tensor<1024xi32> 2026-02-21T09:27:38.2142829Z %196 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:27:38.2143207Z %197 = tt.expand_dims %115#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:27:38.2143513Z %198 = arith.extf %196 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:27:38.2143797Z %199 = tt.broadcast %197 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:27:38.2144051Z %200 = arith.subf %198, %199 : tensor<8x1024xf32> 2026-02-21T09:27:38.2144448Z %201 = tt.extern_elementwise %200 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:27:38.2144899Z %202 = tt.expand_dims %115#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:27:38.2145251Z %203 = tt.broadcast %202 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:27:38.2145509Z %204 = arith.divf %201, %203 : tensor<8x1024xf32> 2026-02-21T09:27:38.2145775Z %205 = arith.truncf %204 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:27:38.2146075Z %206 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:27:38.2146367Z %207 = arith.muli %206, %cst_2 : tensor<8x1xi32> 2026-02-21T09:27:38.2146642Z %208 = tt.expand_dims %194 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:27:38.2146958Z %209 = tt.broadcast %207 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:27:38.2147262Z 
%210 = tt.broadcast %208 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:27:38.2147516Z %211 = arith.addi %209, %210 : tensor<8x1024xi32> 2026-02-21T09:27:38.2147836Z %212 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:27:38.2148142Z %213 = tt.addptr %212, %211 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:27:38.2148478Z %214 = tt.expand_dims %195 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:27:38.2148790Z %215 = tt.broadcast %214 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:27:38.2149069Z tt.store %213, %205, %215 : tensor<8x1024x!tt.ptr> 2026-02-21T09:27:38.2149308Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:27:38.2149679Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T09:27:38.2150046Z tt.return 2026-02-21T09:27:38.2150178Z } 2026-02-21T09:27:38.2150311Z } 2026-02-21T09:27:38.2150383Z 2026-02-21T09:27:38.2150435Z {-# 2026-02-21T09:27:38.2150587Z external_resources: { 2026-02-21T09:27:38.2150750Z mlir_reproducer: { 2026-02-21T09:27:38.2155136Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, 
tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:27:38.2159540Z disable_threading: false, 2026-02-21T09:27:38.2159715Z verify_each: true 2026-02-21T09:27:38.2159857Z } 2026-02-21T09:27:38.2159982Z } 2026-02-21T09:27:38.2160094Z #-} 2026-02-21T09:27:38.2160521Z /tmp/torchinductor_root/2e/c2e47ra7xtiv7wxu5whsmkdqrvjbmzr2ti3rxjnzfqisyyhz3f4t.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:27:38.2161789Z /tmp/torchinductor_root/2e/c2e47ra7xtiv7wxu5whsmkdqrvjbmzr2ti3rxjnzfqisyyhz3f4t.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:27:38.2162763Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:27:38.2163857Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=4, num_warps=32, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[4, 0], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:27:38.2164816Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:27:38.2165067Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:27:42.7875901Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.4 configs/s 2026-02-21T09:27:42.7884702Z [41s] Adaptive compile timeout: 30s (90% percentile=6.3s, bounds=[30.0s, 30s]) 2026-02-21T09:27:43.9119335Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 888.1 configs/s 2026-02-21T09:27:43.9911156Z [42s] Initial random population of 100, 5 starting points: 2026-02-21T09:27:43.9914122Z error=13 2026-02-21T09:27:43.9918549Z timeout=1 2026-02-21T09:27:43.9922538Z ok=86 2026-02-21T09:27:43.9922732Z min=0.0411 2026-02-21T09:27:43.9922874Z mid=0.3808 2026-02-21T09:27:43.9922999Z max=112.9472 2026-02-21T09:27:43.9923159Z best={'block_sizes': [1, 8192], 2026-02-21T09:27:43.9923461Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:27:43.9923776Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:27:43.9923975Z 'num_sm_multiplier': 2, 2026-02-21T09:27:43.9924141Z 'num_stages': 7, 2026-02-21T09:27:43.9924278Z 'num_warps': 32, 2026-02-21T09:27:43.9924439Z 'pid_type': 'persistent_blocked', 2026-02-21T09:27:43.9924627Z 'range_flattens': [True, True], 2026-02-21T09:27:43.9924815Z 'range_multi_buffers': [False, None], 2026-02-21T09:27:43.9925006Z 'range_num_stages': [4, 3], 2026-02-21T09:27:43.9925172Z 'range_unroll_factors': [2, 3], 2026-02-21T09:27:43.9925361Z 
'range_warp_specializes': [False, False]} 2026-02-21T09:27:43.9925573Z [42s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:27:45.1477926Z [43s] Generation 1 starting: 85 neighbors, 5 active search path(s) 2026-02-21T09:28:17.7578148Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 0.8 configs/s 2026-02-21T09:28:23.0744331Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 16.9 configs/s 2026-02-21T09:28:23.5508856Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2077.3 2026-02-21T09:28:23.5513677Z configs/s 2026-02-21T09:28:23.5975377Z [82s] Generation 1 complete: 2026-02-21T09:28:23.5980294Z ok=91 2026-02-21T09:28:23.5984150Z min=0.0225 2026-02-21T09:28:23.5986044Z mid=0.0451 2026-02-21T09:28:23.5986214Z max=0.2663 2026-02-21T09:28:23.5986368Z best={'block_sizes': [1, 8192], 2026-02-21T09:28:23.5986617Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:28:23.5986888Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:28:23.5987085Z 'num_sm_multiplier': 64, 2026-02-21T09:28:23.5987251Z 'num_stages': 5, 2026-02-21T09:28:23.5987391Z 'num_warps': 1, 2026-02-21T09:28:23.5987550Z 'pid_type': 'persistent_interleaved', 2026-02-21T09:28:23.5987739Z 'range_flattens': [None, False], 2026-02-21T09:28:23.5987941Z 'range_multi_buffers': [True, False], 2026-02-21T09:28:23.5988460Z 'range_num_stages': [2, 1], 2026-02-21T09:28:23.5988639Z 'range_unroll_factors': [0, 0], 2026-02-21T09:28:23.5988821Z 'range_warp_specializes': [True, None]} 2026-02-21T09:28:23.5994735Z [82s] Fitting surrogate: 191 points, 191 targets 2026-02-21T09:28:24.5784674Z [83s] Generation 2 starting: 70 neighbors, 5 active search path(s) 2026-02-21T09:28:38.0716142Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 1.1 configs/s 2026-02-21T09:28:42.3582778Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 73/73 17.2 configs/s 2026-02-21T09:28:43.8572174Z Generation 2: verifying top 
configs 100% ━━━━━━━━━━━━━━ 1000/1000 675.6 2026-02-21T09:28:43.8572723Z configs/s 2026-02-21T09:28:43.9758168Z [102s] Generation 2 complete: 2026-02-21T09:28:43.9759678Z error=2 2026-02-21T09:28:43.9759859Z ok=74 2026-02-21T09:28:43.9760021Z min=0.0204 2026-02-21T09:28:43.9760606Z mid=0.0369 2026-02-21T09:28:43.9760869Z max=0.4075 2026-02-21T09:28:43.9761092Z best={'block_sizes': [1, 8192], 2026-02-21T09:28:43.9761454Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:28:43.9762029Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:28:43.9762242Z 'num_stages': 7, 2026-02-21T09:28:43.9762413Z 'num_warps': 4, 2026-02-21T09:28:43.9762589Z 'pid_type': 'flat', 2026-02-21T09:28:43.9762778Z 'range_flattens': [None, True], 2026-02-21T09:28:43.9762979Z 'range_multi_buffers': [None, None], 2026-02-21T09:28:43.9763190Z 'range_num_stages': [0, 3], 2026-02-21T09:28:43.9763355Z 'range_unroll_factors': [0, 3], 2026-02-21T09:28:43.9763576Z 'range_warp_specializes': [None, None]} 2026-02-21T09:28:43.9785989Z [102s] Fitting surrogate: 267 points, 267 targets 2026-02-21T09:28:44.8307384Z [103s] Generation 3 starting: 65 neighbors, 5 active search path(s) 2026-02-21T09:28:51.0540009Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 16.1 configs/s 2026-02-21T09:28:55.2888934Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 16.7 configs/s 2026-02-21T09:28:57.1039746Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 559.5 2026-02-21T09:28:57.1043707Z configs/s 2026-02-21T09:28:57.2457578Z [116s] Generation 3 complete: 2026-02-21T09:28:57.2459482Z ok=71 2026-02-21T09:28:57.2459682Z min=0.0204 2026-02-21T09:28:57.2464433Z mid=0.0328 2026-02-21T09:28:57.2467567Z max=0.1720 2026-02-21T09:28:57.2469735Z best={'block_sizes': [1, 8192], 2026-02-21T09:28:57.2470041Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:28:57.2470326Z 
'load_eviction_policies': ['first', 'first'], 2026-02-21T09:28:57.2470526Z 'num_stages': 7, 2026-02-21T09:28:57.2470677Z 'num_warps': 4, 2026-02-21T09:28:57.2470818Z 'pid_type': 'flat', 2026-02-21T09:28:57.2470980Z 'range_flattens': [None, True], 2026-02-21T09:28:57.2471192Z 'range_multi_buffers': [None, None], 2026-02-21T09:28:57.2471874Z 'range_num_stages': [0, 3], 2026-02-21T09:28:57.2472040Z 'range_unroll_factors': [0, 3], 2026-02-21T09:28:57.2472224Z 'range_warp_specializes': [None, None]} 2026-02-21T09:28:57.2476123Z [116s] Fitting surrogate: 338 points, 338 targets 2026-02-21T09:28:58.1510043Z [116s] Generation 4 starting: 64 neighbors, 5 active search path(s) 2026-02-21T09:29:03.7491099Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67/67 14.3 configs/s 2026-02-21T09:29:08.2039470Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 67/67 15.2 configs/s 2026-02-21T09:29:10.2029844Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 508.2 2026-02-21T09:29:10.2033808Z configs/s 2026-02-21T09:29:10.3697717Z [129s] Generation 4 complete: 2026-02-21T09:29:10.3701481Z ok=70 2026-02-21T09:29:10.3704677Z min=0.0204 2026-02-21T09:29:10.3709028Z mid=0.0328 2026-02-21T09:29:10.3709975Z max=0.0778 2026-02-21T09:29:10.3710155Z best={'block_sizes': [1, 8192], 2026-02-21T09:29:10.3710430Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:29:10.3710723Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:29:10.3710911Z 'num_stages': 6, 2026-02-21T09:29:10.3711060Z 'num_warps': 4, 2026-02-21T09:29:10.3711200Z 'pid_type': 'flat', 2026-02-21T09:29:10.3711361Z 'range_flattens': [None, True], 2026-02-21T09:29:10.3711606Z 'range_multi_buffers': [None, False], 2026-02-21T09:29:10.3711803Z 'range_num_stages': [0, 3], 2026-02-21T09:29:10.3711969Z 'range_unroll_factors': [0, 3], 2026-02-21T09:29:10.3712153Z 'range_warp_specializes': [None, None]} 2026-02-21T09:29:10.3714931Z [129s] Fitting 
surrogate: 408 points, 408 targets 2026-02-21T09:29:11.0910811Z [129s] Generation 5 starting: 52 neighbors, 4 active search path(s) 2026-02-21T09:29:15.5499299Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 9.9 configs/s 2026-02-21T09:29:18.7875590Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 54/54 16.9 configs/s 2026-02-21T09:29:20.7837124Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 508.7 2026-02-21T09:29:20.7838801Z configs/s 2026-02-21T09:29:20.9543174Z [139s] Generation 5 complete: 2026-02-21T09:29:20.9544844Z ok=57 2026-02-21T09:29:20.9545060Z min=0.0204 2026-02-21T09:29:20.9545231Z mid=0.0287 2026-02-21T09:29:20.9545400Z max=0.0614 2026-02-21T09:29:20.9545603Z best={'block_sizes': [1, 8192], 2026-02-21T09:29:20.9545854Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:29:20.9546139Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:29:20.9546356Z 'num_stages': 2, 2026-02-21T09:29:20.9546517Z 'num_warps': 2, 2026-02-21T09:29:20.9546681Z 'pid_type': 'flat', 2026-02-21T09:29:20.9546864Z 'range_flattens': [None, True], 2026-02-21T09:29:20.9547118Z 'range_multi_buffers': [None, None], 2026-02-21T09:29:20.9547363Z 'range_num_stages': [0, 4], 2026-02-21T09:29:20.9547593Z 'range_unroll_factors': [0, 0], 2026-02-21T09:29:20.9547809Z 'range_warp_specializes': [None, True]} 2026-02-21T09:29:20.9559649Z [139s] Fitting surrogate: 465 points, 465 targets 2026-02-21T09:29:21.6131181Z [140s] Generation 6 starting: 31 neighbors, 3 active search path(s) 2026-02-21T09:29:25.4767305Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 9.2 configs/s 2026-02-21T09:29:27.4108362Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 32/32 16.9 configs/s 2026-02-21T09:29:28.7608825Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 749.7 2026-02-21T09:29:28.7612950Z configs/s 2026-02-21T09:29:28.8713606Z [147s] Generation 6 complete: 
2026-02-21T09:29:28.8717909Z ok=34 2026-02-21T09:29:28.8722294Z min=0.0204 2026-02-21T09:29:28.8723823Z mid=0.0226 2026-02-21T09:29:28.8724442Z max=0.0655 2026-02-21T09:29:28.8724597Z best={'block_sizes': [1, 8192], 2026-02-21T09:29:28.8729424Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:29:28.8730849Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:29:28.8731086Z 'num_stages': 2, 2026-02-21T09:29:28.8731233Z 'num_warps': 2, 2026-02-21T09:29:28.8731384Z 'pid_type': 'flat', 2026-02-21T09:29:28.8731608Z 'range_flattens': [None, True], 2026-02-21T09:29:28.8731808Z 'range_multi_buffers': [None, False], 2026-02-21T09:29:28.8731994Z 'range_num_stages': [0, 4], 2026-02-21T09:29:28.8732178Z 'range_unroll_factors': [0, 0], 2026-02-21T09:29:28.8732373Z 'range_warp_specializes': [None, True]} 2026-02-21T09:29:28.8732593Z [147s] Fitting surrogate: 499 points, 499 targets 2026-02-21T09:29:29.4555779Z [148s] Generation 7 starting: 38 neighbors, 3 active search path(s) 2026-02-21T09:29:34.1447017Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 5.3 configs/s 2026-02-21T09:29:36.8658374Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 39/39 14.5 configs/s 2026-02-21T09:29:38.8028044Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 524.9 2026-02-21T09:29:38.8032227Z configs/s 2026-02-21T09:29:38.9656844Z [157s] Generation 7 complete: 2026-02-21T09:29:38.9661185Z ok=41 2026-02-21T09:29:38.9665587Z min=0.0204 2026-02-21T09:29:38.9670784Z mid=0.0225 2026-02-21T09:29:38.9672848Z max=0.0942 2026-02-21T09:29:38.9673037Z best={'block_sizes': [1, 8192], 2026-02-21T09:29:38.9673273Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:29:38.9673528Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:29:38.9673737Z 'num_stages': 1, 2026-02-21T09:29:38.9673887Z 'num_warps': 4, 2026-02-21T09:29:38.9674031Z 'pid_type': 'flat', 2026-02-21T09:29:38.9674199Z 'range_flattens': [None, 
True], 2026-02-21T09:29:38.9674380Z 'range_multi_buffers': [None, True], 2026-02-21T09:29:38.9674614Z 'range_num_stages': [0, 4], 2026-02-21T09:29:38.9674785Z 'range_unroll_factors': [0, 0], 2026-02-21T09:29:38.9674966Z 'range_warp_specializes': [None, True]} 2026-02-21T09:29:38.9675188Z [157s] Fitting surrogate: 540 points, 540 targets 2026-02-21T09:29:39.7560307Z [158s] Generation 8 starting: 41 neighbors, 3 active search path(s) 2026-02-21T09:29:43.9290952Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 7.8 configs/s 2026-02-21T09:29:46.4726925Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 42/42 16.8 configs/s 2026-02-21T09:29:48.4426973Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 514.9 2026-02-21T09:29:48.4431006Z configs/s 2026-02-21T09:29:48.6045244Z [167s] Generation 8 complete: 2026-02-21T09:29:48.6046614Z ok=44 2026-02-21T09:29:48.6046811Z min=0.0204 2026-02-21T09:29:48.6046987Z mid=0.0267 2026-02-21T09:29:48.6047156Z max=0.0513 2026-02-21T09:29:48.6047386Z best={'block_sizes': [1, 8192], 2026-02-21T09:29:48.6047631Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:29:48.6047892Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:29:48.6048104Z 'num_stages': 1, 2026-02-21T09:29:48.6048263Z 'num_warps': 4, 2026-02-21T09:29:48.6048423Z 'pid_type': 'flat', 2026-02-21T09:29:48.6048584Z 'range_flattens': [None, True], 2026-02-21T09:29:48.6048769Z 'range_multi_buffers': [None, True], 2026-02-21T09:29:48.6048953Z 'range_num_stages': [0, 4], 2026-02-21T09:29:48.6049126Z 'range_unroll_factors': [0, 0], 2026-02-21T09:29:48.6049325Z 'range_warp_specializes': [None, True]} 2026-02-21T09:29:48.6065202Z [167s] Fitting surrogate: 584 points, 584 targets 2026-02-21T09:29:49.1933260Z [168s] Generation 9 starting: 28 neighbors, 2 active search path(s) 2026-02-21T09:29:53.0090671Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 29/29 4.0 configs/s 
2026-02-21T09:29:54.7862175Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 29/29 16.7 configs/s 2026-02-21T09:29:56.1434076Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 745.2 2026-02-21T09:29:56.1438604Z configs/s 2026-02-21T09:29:56.2552235Z [175s] Generation 9 complete: 2026-02-21T09:29:56.2557299Z ok=31 2026-02-21T09:29:56.2561991Z min=0.0204 2026-02-21T09:29:56.2564978Z mid=0.0225 2026-02-21T09:29:56.2567747Z max=0.4096 2026-02-21T09:29:56.2567951Z best={'block_sizes': [1, 8192], 2026-02-21T09:29:56.2568212Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:29:56.2568503Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:29:56.2568710Z 'num_stages': 8, 2026-02-21T09:29:56.2568861Z 'num_warps': 4, 2026-02-21T09:29:56.2569003Z 'pid_type': 'flat', 2026-02-21T09:29:56.2569167Z 'range_flattens': [None, False], 2026-02-21T09:29:56.2569349Z 'range_multi_buffers': [None, False], 2026-02-21T09:29:56.2569910Z 'range_num_stages': [0, 3], 2026-02-21T09:29:56.2570108Z 'range_unroll_factors': [0, 2], 2026-02-21T09:29:56.2570299Z 'range_warp_specializes': [None, False]} 2026-02-21T09:29:56.7195334Z [175s] Fitting surrogate: 615 points, 615 targets 2026-02-21T09:29:56.7195768Z [175s] Generation 10 starting: 17 neighbors, 1 active search path(s) 2026-02-21T09:30:07.5100810Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 1.0 configs/s 2026-02-21T09:30:08.6093471Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 17.0 configs/s 2026-02-21T09:30:09.2688571Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1514.1 2026-02-21T09:30:09.2692823Z configs/s 2026-02-21T09:30:09.3292017Z [188s] Generation 10 complete: 2026-02-21T09:30:09.3293791Z ok=19 2026-02-21T09:30:09.3294021Z min=0.0204 2026-02-21T09:30:09.3294192Z mid=0.0287 2026-02-21T09:30:09.3294362Z max=0.3605 2026-02-21T09:30:09.3294581Z best={'block_sizes': [1, 8192], 2026-02-21T09:30:09.3294888Z 
'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:30:09.3295170Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:30:09.3295377Z 'num_stages': 8, 2026-02-21T09:30:09.3295529Z 'num_warps': 4, 2026-02-21T09:30:09.3295682Z 'pid_type': 'flat', 2026-02-21T09:30:09.3295850Z 'range_flattens': [None, False], 2026-02-21T09:30:09.3296057Z 'range_multi_buffers': [None, False], 2026-02-21T09:30:09.3296267Z 'range_num_stages': [0, 3], 2026-02-21T09:30:09.3296461Z 'range_unroll_factors': [0, 2], 2026-02-21T09:30:09.3296674Z 'range_warp_specializes': [None, False]} 2026-02-21T09:30:09.3314328Z [188s] Fitting surrogate: 634 points, 634 targets 2026-02-21T09:30:09.7470791Z [188s] Generation 11 starting: 16 neighbors, 1 active search path(s) 2026-02-21T09:30:13.8836438Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 2.2 configs/s 2026-02-21T09:30:14.9144330Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.2 configs/s 2026-02-21T09:30:15.5684755Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1528.2 2026-02-21T09:30:15.5686064Z configs/s 2026-02-21T09:30:15.6286103Z [194s] Generation 11 complete: 2026-02-21T09:30:15.6290499Z ok=18 2026-02-21T09:30:15.6294376Z min=0.0205 2026-02-21T09:30:15.6296374Z mid=0.0287 2026-02-21T09:30:15.6296533Z max=0.3411 2026-02-21T09:30:15.6296684Z best={'block_sizes': [1, 8192], 2026-02-21T09:30:15.6296936Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:30:15.6297211Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:30:15.6297407Z 'num_stages': 8, 2026-02-21T09:30:15.6297548Z 'num_warps': 4, 2026-02-21T09:30:15.6297693Z 'pid_type': 'flat', 2026-02-21T09:30:15.6297848Z 'range_flattens': [None, False], 2026-02-21T09:30:15.6298040Z 'range_multi_buffers': [None, False], 2026-02-21T09:30:15.6298583Z 'range_num_stages': [0, 3], 2026-02-21T09:30:15.6298783Z 'range_unroll_factors': [0, 2], 2026-02-21T09:30:15.6298962Z 
'range_warp_specializes': [None, False]} 2026-02-21T09:30:15.6305746Z [194s] Fitting surrogate: 652 points, 652 targets 2026-02-21T09:30:16.4919398Z [195s] Generation 12 starting: 16 neighbors, 1 active search path(s) 2026-02-21T09:30:30.1789096Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 0.6 configs/s 2026-02-21T09:30:31.2280154Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 16.9 configs/s 2026-02-21T09:30:31.8769466Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1542.3 2026-02-21T09:30:31.8770609Z configs/s 2026-02-21T09:30:31.9361528Z [210s] Generation 12 complete: 2026-02-21T09:30:31.9363156Z ok=18 2026-02-21T09:30:31.9363319Z min=0.0205 2026-02-21T09:30:31.9363459Z mid=0.0287 2026-02-21T09:30:31.9363587Z max=0.2090 2026-02-21T09:30:31.9363756Z best={'block_sizes': [1, 8192], 2026-02-21T09:30:31.9364031Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:30:31.9364293Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:30:31.9364486Z 'num_stages': 8, 2026-02-21T09:30:31.9364623Z 'num_warps': 4, 2026-02-21T09:30:31.9364769Z 'pid_type': 'flat', 2026-02-21T09:30:31.9364923Z 'range_flattens': [None, False], 2026-02-21T09:30:31.9365111Z 'range_multi_buffers': [None, False], 2026-02-21T09:30:31.9365291Z 'range_num_stages': [0, 3], 2026-02-21T09:30:31.9365465Z 'range_unroll_factors': [0, 2], 2026-02-21T09:30:31.9365647Z 'range_warp_specializes': [None, False]} 2026-02-21T09:30:31.9384438Z [210s] Fitting surrogate: 670 points, 670 targets 2026-02-21T09:30:32.4060312Z [211s] Generation 13 starting: 16 neighbors, 1 active search path(s) 2026-02-21T09:30:34.3361532Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 15.3 configs/s 2026-02-21T09:30:35.3737185Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.1 configs/s 2026-02-21T09:30:36.0927090Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1393.0 
2026-02-21T09:30:36.0928291Z configs/s 2026-02-21T09:30:36.1606319Z [214s] Generation 13 complete: 2026-02-21T09:30:36.1606651Z ok=18 2026-02-21T09:30:36.1611327Z min=0.0204 2026-02-21T09:30:36.1614959Z mid=0.0267 2026-02-21T09:30:36.1618739Z max=0.1106 2026-02-21T09:30:36.1621791Z best={'block_sizes': [1, 8192], 2026-02-21T09:30:36.1625693Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:30:36.1626883Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:30:36.1627083Z 'num_stages': 8, 2026-02-21T09:30:36.1627235Z 'num_warps': 4, 2026-02-21T09:30:36.1627376Z 'pid_type': 'flat', 2026-02-21T09:30:36.1627554Z 'range_flattens': [None, False], 2026-02-21T09:30:36.1627744Z 'range_multi_buffers': [None, False], 2026-02-21T09:30:36.1627965Z 'range_num_stages': [0, 3], 2026-02-21T09:30:36.1628510Z 'range_unroll_factors': [0, 2], 2026-02-21T09:30:36.1628691Z 'range_warp_specializes': [None, False]} 2026-02-21T09:30:36.1635177Z [214s] Fitting surrogate: 688 points, 688 targets 2026-02-21T09:30:36.5941139Z [215s] Generation 14 starting: 15 neighbors, 1 active search path(s) 2026-02-21T09:30:42.0092898Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 1.4 configs/s 2026-02-21T09:30:43.0049163Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.8 configs/s 2026-02-21T09:30:43.5767548Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1745.5 2026-02-21T09:30:43.5771880Z configs/s 2026-02-21T09:30:43.6320103Z [222s] Generation 14 complete: 2026-02-21T09:30:43.6324460Z ok=17 2026-02-21T09:30:43.6325793Z min=0.0204 2026-02-21T09:30:43.6325955Z mid=0.0266 2026-02-21T09:30:43.6326093Z max=0.4096 2026-02-21T09:30:43.6326599Z best={'block_sizes': [1, 8192], 2026-02-21T09:30:43.6326899Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:30:43.6327180Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:30:43.6327376Z 'num_stages': 8, 
2026-02-21T09:30:43.6327525Z 'num_warps': 4, 2026-02-21T09:30:43.6327672Z 'pid_type': 'flat', 2026-02-21T09:30:43.6327840Z 'range_flattens': [None, False], 2026-02-21T09:30:43.6328023Z 'range_multi_buffers': [None, False], 2026-02-21T09:30:43.6328216Z 'range_num_stages': [0, 3], 2026-02-21T09:30:43.6328386Z 'range_unroll_factors': [0, 2], 2026-02-21T09:30:43.6328576Z 'range_warp_specializes': [None, False]} 2026-02-21T09:30:43.6348709Z [222s] Fitting surrogate: 705 points, 705 targets 2026-02-21T09:30:44.0582026Z [222s] Generation 15 starting: 15 neighbors, 1 active search path(s) 2026-02-21T09:30:47.9787153Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 2.3 configs/s 2026-02-21T09:30:48.9068739Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 17.0 configs/s 2026-02-21T09:30:49.8141407Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1110.3 2026-02-21T09:30:49.8143059Z configs/s 2026-02-21T09:30:49.8902901Z [228s] Generation 15 complete: 2026-02-21T09:30:49.8907287Z ok=17 2026-02-21T09:30:49.8911135Z min=0.0204 2026-02-21T09:30:49.8914382Z mid=0.0225 2026-02-21T09:30:49.8918262Z max=0.3405 2026-02-21T09:30:49.8918517Z best={'block_sizes': [1, 8192], 2026-02-21T09:30:49.8918808Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:30:49.8922749Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:30:49.8927712Z 'num_stages': 8, 2026-02-21T09:30:49.8929032Z 'num_warps': 4, 2026-02-21T09:30:49.8929224Z 'pid_type': 'flat', 2026-02-21T09:30:49.8929413Z 'range_flattens': [None, False], 2026-02-21T09:30:49.8929608Z 'range_multi_buffers': [None, False], 2026-02-21T09:30:49.8929833Z 'range_num_stages': [0, 3], 2026-02-21T09:30:49.8930018Z 'range_unroll_factors': [0, 2], 2026-02-21T09:30:49.8930206Z 'range_warp_specializes': [None, False]} 2026-02-21T09:30:49.8934386Z [228s] Fitting surrogate: 722 points, 722 targets 2026-02-21T09:30:50.3299130Z [229s] Generation 16 starting: 14 
neighbors, 1 active search path(s) 2026-02-21T09:30:53.9480674Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 2.4 configs/s 2026-02-21T09:30:54.8069620Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 17.2 configs/s 2026-02-21T09:30:55.7027593Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1122.9 2026-02-21T09:30:55.7031296Z configs/s 2026-02-21T09:30:55.7875237Z [234s] Generation 16 complete: 2026-02-21T09:30:55.7877365Z ok=16 2026-02-21T09:30:55.7877563Z min=0.0205 2026-02-21T09:30:55.7877702Z mid=0.0225 2026-02-21T09:30:55.7877839Z max=0.0473 2026-02-21T09:30:55.7878013Z best={'block_sizes': [1, 8192], 2026-02-21T09:30:55.7878644Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:30:55.7878919Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:30:55.7879123Z 'num_stages': 8, 2026-02-21T09:30:55.7879268Z 'num_warps': 4, 2026-02-21T09:30:55.7879423Z 'pid_type': 'flat', 2026-02-21T09:30:55.7879585Z 'range_flattens': [None, False], 2026-02-21T09:30:55.7879777Z 'range_multi_buffers': [None, False], 2026-02-21T09:30:55.7879974Z 'range_num_stages': [0, 3], 2026-02-21T09:30:55.7880143Z 'range_unroll_factors': [0, 2], 2026-02-21T09:30:55.7880349Z 'range_warp_specializes': [None, False]} 2026-02-21T09:30:55.7907987Z [234s] Fitting surrogate: 738 points, 738 targets 2026-02-21T09:30:56.2800204Z [235s] Generation 17 starting: 18 neighbors, 1 active search path(s) 2026-02-21T09:30:58.2651841Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 12.4 configs/s 2026-02-21T09:30:59.4104577Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 17.3 configs/s 2026-02-21T09:31:00.1287469Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1394.8 2026-02-21T09:31:00.1291874Z configs/s 2026-02-21T09:31:00.1912557Z [239s] Generation 17 complete: 2026-02-21T09:31:00.1917447Z ok=20 2026-02-21T09:31:00.1918993Z min=0.0205 
2026-02-21T09:31:00.1919197Z mid=0.0286 2026-02-21T09:31:00.1924006Z max=0.3415 2026-02-21T09:31:00.1925566Z best={'block_sizes': [1, 8192], 2026-02-21T09:31:00.1925904Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:31:00.1931116Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:31:00.1933240Z 'num_stages': 8, 2026-02-21T09:31:00.1936876Z 'num_warps': 4, 2026-02-21T09:31:00.1940262Z 'pid_type': 'flat', 2026-02-21T09:31:00.1944158Z 'range_flattens': [None, False], 2026-02-21T09:31:00.1945625Z 'range_multi_buffers': [None, False], 2026-02-21T09:31:00.1945874Z 'range_num_stages': [0, 3], 2026-02-21T09:31:00.1946085Z 'range_unroll_factors': [0, 2], 2026-02-21T09:31:00.1946280Z 'range_warp_specializes': [None, False]} 2026-02-21T09:31:00.1946597Z [239s] Fitting surrogate: 758 points, 758 targets 2026-02-21T09:31:00.6081030Z [239s] Generation 18 starting: 15 neighbors, 1 active search path(s) 2026-02-21T09:31:04.7795658Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 4.5 configs/s 2026-02-21T09:31:05.6973136Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 17.2 configs/s 2026-02-21T09:31:06.5244601Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1215.4 2026-02-21T09:31:06.5245011Z configs/s 2026-02-21T09:31:06.5989092Z [245s] Generation 18 complete: 2026-02-21T09:31:06.5993992Z ok=17 2026-02-21T09:31:06.5995419Z min=0.0204 2026-02-21T09:31:06.5995587Z mid=0.0225 2026-02-21T09:31:06.5995712Z max=0.0574 2026-02-21T09:31:06.5995864Z best={'block_sizes': [1, 8192], 2026-02-21T09:31:06.5996166Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:31:06.5996427Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:31:06.5996617Z 'num_stages': 8, 2026-02-21T09:31:06.5996754Z 'num_warps': 4, 2026-02-21T09:31:06.5996898Z 'pid_type': 'flat', 2026-02-21T09:31:06.5997053Z 'range_flattens': [None, False], 2026-02-21T09:31:06.5997235Z 
'range_multi_buffers': [None, False], 2026-02-21T09:31:06.5997413Z 'range_num_stages': [0, 3], 2026-02-21T09:31:06.5997584Z 'range_unroll_factors': [0, 2], 2026-02-21T09:31:06.5997769Z 'range_warp_specializes': [None, False]} 2026-02-21T09:31:06.6019296Z [245s] Fitting surrogate: 775 points, 775 targets 2026-02-21T09:31:07.0283722Z [245s] Generation 19 starting: 15 neighbors, 1 active search path(s) 2026-02-21T09:31:08.7748318Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 11.3 configs/s 2026-02-21T09:31:09.6905488Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 17.2 configs/s 2026-02-21T09:31:11.0994483Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 992.4 2026-02-21T09:31:11.0997953Z configs/s 2026-02-21T09:31:11.1848185Z [250s] Generation 19 complete: 2026-02-21T09:31:11.1852761Z ok=17 2026-02-21T09:31:11.1854323Z min=0.0205 2026-02-21T09:31:11.1854525Z mid=0.0225 2026-02-21T09:31:11.1854651Z max=0.0327 2026-02-21T09:31:11.1854807Z best={'block_sizes': [1, 8192], 2026-02-21T09:31:11.1855056Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:31:11.1855326Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:31:11.1855520Z 'num_stages': 8, 2026-02-21T09:31:11.1855660Z 'num_warps': 4, 2026-02-21T09:31:11.1855806Z 'pid_type': 'flat', 2026-02-21T09:31:11.1855962Z 'range_flattens': [None, False], 2026-02-21T09:31:11.1856148Z 'range_multi_buffers': [None, False], 2026-02-21T09:31:11.1856327Z 'range_num_stages': [0, 3], 2026-02-21T09:31:11.1856875Z 'range_unroll_factors': [0, 2], 2026-02-21T09:31:11.1857090Z 'range_warp_specializes': [None, False]} 2026-02-21T09:31:11.1864372Z [250s] Fitting surrogate: 792 points, 792 targets 2026-02-21T09:31:11.5942326Z [250s] Generation 20 starting: 12 neighbors, 1 active search path(s) 2026-02-21T09:31:13.2869569Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 16.0 configs/s 2026-02-21T09:31:14.0287821Z Generation 20: 
exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 17.2 configs/s 2026-02-21T09:31:14.8678049Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1197.3 2026-02-21T09:31:14.8679857Z configs/s 2026-02-21T09:31:14.9413525Z [253s] Generation 20 complete: 2026-02-21T09:31:14.9415598Z ok=14 2026-02-21T09:31:14.9415763Z min=0.0205 2026-02-21T09:31:14.9415902Z mid=0.0225 2026-02-21T09:31:14.9416030Z max=0.0368 2026-02-21T09:31:14.9416165Z best={'block_sizes': [1, 8192], 2026-02-21T09:31:14.9416471Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:31:14.9416730Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:31:14.9416924Z 'num_stages': 8, 2026-02-21T09:31:14.9417064Z 'num_warps': 4, 2026-02-21T09:31:14.9417207Z 'pid_type': 'flat', 2026-02-21T09:31:14.9417361Z 'range_flattens': [None, False], 2026-02-21T09:31:14.9417548Z 'range_multi_buffers': [None, False], 2026-02-21T09:31:14.9417728Z 'range_num_stages': [0, 3], 2026-02-21T09:31:14.9417898Z 'range_unroll_factors': [0, 2], 2026-02-21T09:31:14.9418079Z 'range_warp_specializes': [None, False]} 2026-02-21T09:31:14.9443222Z [253s] Fitting surrogate: 806 points, 806 targets 2026-02-21T09:31:15.2299622Z [254s] Autotuning complete in 254.1s after searching 769 configs. 
2026-02-21T09:31:15.2301887Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:31:15.2303020Z @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 2], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T09:31:15.2304235Z 2026-02-21T09:31:15.2304503Z [254s] Code of selected kernel: /tmp/torchinductor_root/dc/cdcv4axqksvqcx43ypyvb56ys6vejja4lzjbcj5l3og76zsgyolr.py 2026-02-21T09:31:15.2526901Z from __future__ import annotations 2026-02-21T09:31:15.2531328Z 2026-02-21T09:31:15.2535674Z import torch 2026-02-21T09:31:15.2537662Z import triton 2026-02-21T09:31:15.2537857Z import triton.language as tl 2026-02-21T09:31:15.2538091Z from torch._inductor.runtime import triton_helpers 2026-02-21T09:31:15.2538360Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T09:31:15.2538665Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:31:15.2538841Z 2026-02-21T09:31:15.2539222Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T09:31:15.2539444Z _BLOCK_SIZE_1 = tl.constexpr(8192) 2026-02-21T09:31:15.2539567Z 2026-02-21T09:31:15.2539642Z @triton.jit 2026-02-21T09:31:15.2539799Z def _helion_softmax_two_pass(x, out): 2026-02-21T09:31:15.2540065Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:31:15.2540318Z pid_0 = tl.program_id(0) 2026-02-21T09:31:15.2540489Z offset_0 = pid_0 2026-02-21T09:31:15.2540664Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T09:31:15.2540956Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:31:15.2541246Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T09:31:15.2541523Z # src[softmax.py:81]: di = 
hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:31:15.2541830Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T09:31:15.2542079Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:31:15.2542363Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:31:15.2542608Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:31:15.2542839Z # src[softmax.py:82-89]: ... 2026-02-21T09:31:15.2543250Z for offset_2 in tl.range(0, 4864, _BLOCK_SIZE_1, loop_unroll_factor=2, warp_specialize=False, num_stages=1, disallow_acc_multi_buffer=True, flatten=False): 2026-02-21T09:31:15.2543726Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:31:15.2543969Z mask_1 = indices_2 < 4864 2026-02-21T09:31:15.2544131Z mi_copy = mi 2026-02-21T09:31:15.2544277Z di_copy = di 2026-02-21T09:31:15.2544417Z mi_copy_0 = mi_copy 2026-02-21T09:31:15.2544579Z di_copy_0 = di_copy 2026-02-21T09:31:15.2544757Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:31:15.2545133Z values = tl.load(x + (indices_0[:, None] * 4864 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first') 2026-02-21T09:31:15.2545523Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:31:15.2545921Z _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16)) 2026-02-21T09:31:15.2546313Z local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16) 2026-02-21T09:31:15.2546565Z # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax) 2026-02-21T09:31:15.2546806Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T09:31:15.2547020Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T09:31:15.2547276Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:31:15.2547540Z v_2 = mi_copy_0 - v_1 2026-02-21T09:31:15.2547708Z v_3 = libdevice.exp(v_2) 2026-02-21T09:31:15.2547880Z 
v_4 = di_copy_0 * v_3 2026-02-21T09:31:15.2548062Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:31:15.2548370Z subscript = v_1[:, None] 2026-02-21T09:31:15.2548549Z v_5 = tl.cast(values, tl.float32) 2026-02-21T09:31:15.2548728Z v_6 = v_5 - subscript 2026-02-21T09:31:15.2548946Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:31:15.2549204Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:31:15.2549421Z # src[softmax.py:88]: ).sum(dim=1) 2026-02-21T09:31:15.2549602Z v_7 = libdevice.exp(v_6) 2026-02-21T09:31:15.2549928Z _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32)) 2026-02-21T09:31:15.2550298Z sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32) 2026-02-21T09:31:15.2550500Z di = v_4 + sum_1 2026-02-21T09:31:15.2550676Z # src[softmax.py:89]: mi = mi_next 2026-02-21T09:31:15.2550852Z mi = v_1 2026-02-21T09:31:15.2551131Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:31:15.2551399Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:31:15.2551802Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:31:15.2552327Z for offset_2 in tl.range(0, 4864, _BLOCK_SIZE_1, loop_unroll_factor=2, warp_specialize=False, num_stages=1, disallow_acc_multi_buffer=True, flatten=False): 2026-02-21T09:31:15.2552789Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:31:15.2553034Z mask_2 = indices_2 < 4864 2026-02-21T09:31:15.2553198Z mi_copy_1 = mi 2026-02-21T09:31:15.2553350Z di_copy_1 = di 2026-02-21T09:31:15.2553497Z mi_copy_1_0 = mi_copy_1 2026-02-21T09:31:15.2553668Z di_copy_1_0 = di_copy_1 2026-02-21T09:31:15.2553854Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:31:15.2554233Z values_1 = tl.load(x + (indices_0[:, None] * 4864 + indices_2[None, :] * 1), mask_2[None, :], other=0, 
eviction_policy='evict_first') 2026-02-21T09:31:15.2554671Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:31:15.2554947Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T09:31:15.2555142Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T09:31:15.2555322Z v_10 = v_9 - subscript_1 2026-02-21T09:31:15.2555499Z v_11 = libdevice.exp(v_10) 2026-02-21T09:31:15.2555672Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T09:31:15.2555857Z v_12 = v_11 / subscript_2 2026-02-21T09:31:15.2556034Z v_13 = tl.cast(v_12, tl.float16) 2026-02-21T09:31:15.2556297Z tl.store(out + (indices_0[:, None] * 4864 + indices_2[None, :] * 1), v_13, mask_2[None, :]) 2026-02-21T09:31:15.2556515Z 2026-02-21T09:31:15.2556642Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T09:31:15.2556868Z """ 2026-02-21T09:31:15.2557077Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T09:31:15.2557378Z This version uses fewer passes but is less numerically stable. 2026-02-21T09:31:15.2557597Z Args: 2026-02-21T09:31:15.2557758Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T09:31:15.2557947Z Returns: 2026-02-21T09:31:15.2558128Z torch.Tensor: Softmax output tensor of the same shape. 2026-02-21T09:31:15.2558331Z """ 2026-02-21T09:31:15.2558470Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T09:31:15.2558640Z m, n = x.size() 2026-02-21T09:31:15.2558811Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T09:31:15.2559011Z out = torch.empty_like(x) 2026-02-21T09:31:15.2559240Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:31:15.2559554Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:31:15.2559856Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:31:15.2560096Z # src[softmax.py:79-92]: ... 
2026-02-21T09:31:15.2560404Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=8) 2026-02-21T09:31:15.2560677Z # src[softmax.py:93]: return out 2026-02-21T09:31:15.2560842Z return out 2026-02-21T09:31:16.0218367Z WARNING:tritonbench.utils.triton_op:Completed input ID 36: 2026-02-21T09:31:16.0218691Z (M, N) 2026-02-21T09:31:16.0223034Z ------------ 2026-02-21T09:31:16.0231916Z (4096, 4864) 2026-02-21T09:31:16.0235714Z 2026-02-21T09:31:16.0237948Z 40%|████ | 8/20 [22:21<38:36, 193.01s/it]WARNING:tritonbench.utils.triton_op:Running input ID 41: 2026-02-21T09:31:16.0238310Z (M, N) 2026-02-21T09:31:16.0241905Z ------------ 2026-02-21T09:31:16.0246572Z (4096, 5504) 2026-02-21T09:31:16.0251330Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T09:31:17.2439114Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T09:31:18.7616000Z INFO:tritonbench.utils.triton_op:Took 2.31ms to get benchmark function for torch_compile_softmax 2026-02-21T09:31:20.0959611Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:31:20.0963494Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:31:20.0967900Z 'dtype': 'torch.float16', 2026-02-21T09:31:20.0971974Z 'shape': (4096, 5504), 2026-02-21T09:31:20.0975823Z 'stride': (5504, 1)},), 2026-02-21T09:31:20.0976121Z 'kwargs': {}} 2026-02-21T09:31:20.0986505Z INFO:tritonbench.utils.triton_op:Took 2.75ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T09:31:20.2701717Z [0s] Autotune random seed: 2138408546 2026-02-21T09:31:20.2947892Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:31:55.3686024Z [35s] Timeout after 30s compiling Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=1, num_warps=4, pid_type='flat', 
range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T09:31:55.4096601Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s 2026-02-21T09:32:02.6039770Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 13.8 configs/s 2026-02-21T09:32:02.6050619Z [42s] Adaptive compile timeout: 30s (90% percentile=7.2s, bounds=[30.0s, 30s]) 2026-02-21T09:32:03.4351127Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1192.4 configs/s 2026-02-21T09:32:03.4968845Z [43s] Initial random population of 100, 5 starting points: 2026-02-21T09:32:03.4974024Z error=12 2026-02-21T09:32:03.4978448Z timeout=1 2026-02-21T09:32:03.4983201Z ok=87 2026-02-21T09:32:03.4987255Z min=0.0430 2026-02-21T09:32:03.4991950Z mid=0.4362 2026-02-21T09:32:03.4993986Z max=128.0870 2026-02-21T09:32:03.4994224Z best={'block_sizes': [1, 8192], 2026-02-21T09:32:03.4999738Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:32:03.5001936Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:32:03.5002172Z 'num_sm_multiplier': 2, 2026-02-21T09:32:03.5002341Z 'num_stages': 7, 2026-02-21T09:32:03.5002481Z 'num_warps': 32, 2026-02-21T09:32:03.5002652Z 'pid_type': 'persistent_blocked', 2026-02-21T09:32:03.5002841Z 'range_flattens': [True, True], 2026-02-21T09:32:03.5003029Z 'range_multi_buffers': [False, None], 2026-02-21T09:32:03.5003219Z 'range_num_stages': [4, 3], 2026-02-21T09:32:03.5003383Z 'range_unroll_factors': [2, 3], 2026-02-21T09:32:03.5003573Z 'range_warp_specializes': [False, False]} 2026-02-21T09:32:03.5003788Z [43s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:32:04.7300787Z [44s] Generation 1 starting: 92 neighbors, 5 active search path(s) 2026-02-21T09:32:42.9542244Z [82s] Timeout after 30s compiling Config(block_sizes=[2, 8192], indexing=['tensor_descriptor', 'pointer', 
'tensor_descriptor'], load_eviction_policies=['first', 'last'], maxnreg=32, num_sm_multiplier=64, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[True, True], range_num_stages=[2, 4], range_unroll_factors=[3, 4], range_warp_specializes=[False, None]) 2026-02-21T09:32:42.9555681Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 96/96 0.5 configs/s 2026-02-21T09:32:48.4066699Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 96/96 17.8 configs/s 2026-02-21T09:32:48.9079767Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1973.5 2026-02-21T09:32:48.9084151Z configs/s 2026-02-21T09:32:48.9561510Z [88s] Generation 1 complete: 2026-02-21T09:32:48.9566830Z error=4 2026-02-21T09:32:48.9568549Z timeout=1 2026-02-21T09:32:48.9568743Z ok=93 2026-02-21T09:32:48.9572784Z min=0.0246 2026-02-21T09:32:48.9577396Z mid=0.0512 2026-02-21T09:32:48.9579903Z max=0.3666 2026-02-21T09:32:48.9580122Z best={'block_sizes': [1, 8192], 2026-02-21T09:32:48.9580472Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:32:48.9580753Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:32:48.9580952Z 'num_stages': 7, 2026-02-21T09:32:48.9584469Z 'num_warps': 8, 2026-02-21T09:32:48.9585802Z 'pid_type': 'flat', 2026-02-21T09:32:48.9586071Z 'range_flattens': [None, True], 2026-02-21T09:32:48.9586275Z 'range_multi_buffers': [None, None], 2026-02-21T09:32:48.9590419Z 'range_num_stages': [0, 3], 2026-02-21T09:32:48.9594413Z 'range_unroll_factors': [0, 3], 2026-02-21T09:32:48.9598841Z 'range_warp_specializes': [None, False]} 2026-02-21T09:32:48.9602656Z [88s] Fitting surrogate: 198 points, 198 targets 2026-02-21T09:32:50.1290092Z [89s] Generation 2 starting: 78 neighbors, 5 active search path(s) 2026-02-21T09:33:18.3520177Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81/81 0.9 configs/s 2026-02-21T09:33:23.7941436Z Generation 2: exploring 
neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 81/81 16.4 configs/s 2026-02-21T09:33:25.5174905Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 587.2 2026-02-21T09:33:25.5176204Z configs/s 2026-02-21T09:33:25.6429952Z [125s] Generation 2 complete: 2026-02-21T09:33:25.6434919Z ok=84 2026-02-21T09:33:25.6439318Z min=0.0245 2026-02-21T09:33:25.6440763Z mid=0.0420 2026-02-21T09:33:25.6440917Z max=1.0669 2026-02-21T09:33:25.6441064Z best={'block_sizes': [1, 8192], 2026-02-21T09:33:25.6441309Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:33:25.6441649Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:33:25.6441847Z 'num_stages': 7, 2026-02-21T09:33:25.6442000Z 'num_warps': 4, 2026-02-21T09:33:25.6442150Z 'pid_type': 'flat', 2026-02-21T09:33:25.6442307Z 'range_flattens': [None, True], 2026-02-21T09:33:25.6442523Z 'range_multi_buffers': [None, None], 2026-02-21T09:33:25.6442722Z 'range_num_stages': [0, 3], 2026-02-21T09:33:25.6442892Z 'range_unroll_factors': [0, 3], 2026-02-21T09:33:25.6443070Z 'range_warp_specializes': [None, False]} 2026-02-21T09:33:25.6447857Z [125s] Fitting surrogate: 282 points, 282 targets 2026-02-21T09:33:26.5646264Z [126s] Generation 3 starting: 65 neighbors, 5 active search path(s) 2026-02-21T09:33:59.3393779Z [159s] Timeout after 30s compiling Config(block_sizes=[4, 8192], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=8, num_stages=8, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[3, 4], range_unroll_factors=[3, 1], range_warp_specializes=[False, None]) 2026-02-21T09:33:59.3411282Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 0.4 configs/s 2026-02-21T09:34:03.4509986Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 17.0 configs/s 2026-02-21T09:34:05.5382834Z Generation 3: verifying top configs 100% 
━━━━━━━━━━━━━━ 1000/1000 485.7 2026-02-21T09:34:05.5384077Z configs/s 2026-02-21T09:34:05.7028111Z [165s] Generation 3 complete: 2026-02-21T09:34:05.7032540Z timeout=1 2026-02-21T09:34:05.7033952Z ok=69 2026-02-21T09:34:05.7034127Z min=0.0226 2026-02-21T09:34:05.7034255Z mid=0.0369 2026-02-21T09:34:05.7034387Z max=0.2785 2026-02-21T09:34:05.7034529Z best={'block_sizes': [1, 8192], 2026-02-21T09:34:05.7034760Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T09:34:05.7035007Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:34:05.7035193Z 'num_stages': 1, 2026-02-21T09:34:05.7035341Z 'num_warps': 4, 2026-02-21T09:34:05.7035481Z 'pid_type': 'flat', 2026-02-21T09:34:05.7035643Z 'range_flattens': [None, True], 2026-02-21T09:34:05.7035816Z 'range_multi_buffers': [None, True], 2026-02-21T09:34:05.7035999Z 'range_num_stages': [0, 4], 2026-02-21T09:34:05.7036508Z 'range_unroll_factors': [0, 4], 2026-02-21T09:34:05.7036708Z 'range_warp_specializes': [None, None]} 2026-02-21T09:34:05.7044895Z [165s] Fitting surrogate: 352 points, 352 targets 2026-02-21T09:34:06.4904663Z [166s] Generation 4 starting: 50 neighbors, 4 active search path(s) 2026-02-21T09:34:13.7454526Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 3.7 configs/s 2026-02-21T09:34:16.8714810Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 16.9 configs/s 2026-02-21T09:34:18.9486500Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 487.6 2026-02-21T09:34:18.9491999Z configs/s 2026-02-21T09:34:19.1199298Z [178s] Generation 4 complete: 2026-02-21T09:34:19.1204541Z ok=54 2026-02-21T09:34:19.1206987Z min=0.0225 2026-02-21T09:34:19.1207206Z mid=0.0327 2026-02-21T09:34:19.1207347Z max=0.1516 2026-02-21T09:34:19.1207501Z best={'block_sizes': [1, 8192], 2026-02-21T09:34:19.1207820Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T09:34:19.1213424Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:34:19.1215702Z 
'num_stages': 1, 2026-02-21T09:34:19.1215910Z 'num_warps': 4, 2026-02-21T09:34:19.1216078Z 'pid_type': 'flat', 2026-02-21T09:34:19.1216272Z 'range_flattens': [None, True], 2026-02-21T09:34:19.1216472Z 'range_multi_buffers': [None, True], 2026-02-21T09:34:19.1216682Z 'range_num_stages': [0, 4], 2026-02-21T09:34:19.1216857Z 'range_unroll_factors': [0, 4], 2026-02-21T09:34:19.1217059Z 'range_warp_specializes': [None, None]} 2026-02-21T09:34:19.1217371Z [178s] Fitting surrogate: 406 points, 406 targets 2026-02-21T09:34:19.7963297Z [179s] Generation 5 starting: 22 neighbors, 3 active search path(s) 2026-02-21T09:34:23.8351099Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 6.0 configs/s 2026-02-21T09:34:25.2324425Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 17.0 configs/s 2026-02-21T09:34:26.6658443Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 704.8 2026-02-21T09:34:26.6659730Z configs/s 2026-02-21T09:34:26.7836372Z [186s] Generation 5 complete: 2026-02-21T09:34:26.7840267Z ok=25 2026-02-21T09:34:26.7845392Z min=0.0225 2026-02-21T09:34:26.7850619Z mid=0.0266 2026-02-21T09:34:26.7855112Z max=0.0451 2026-02-21T09:34:26.7858928Z best={'block_sizes': [1, 8192], 2026-02-21T09:34:26.7860395Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T09:34:26.7860679Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:34:26.7860879Z 'num_stages': 6, 2026-02-21T09:34:26.7861035Z 'num_warps': 4, 2026-02-21T09:34:26.7861178Z 'pid_type': 'flat', 2026-02-21T09:34:26.7861349Z 'range_flattens': [None, True], 2026-02-21T09:34:26.7861531Z 'range_multi_buffers': [None, False], 2026-02-21T09:34:26.7861799Z 'range_num_stages': [0, 3], 2026-02-21T09:34:26.7861994Z 'range_unroll_factors': [0, 2], 2026-02-21T09:34:26.7862551Z 'range_warp_specializes': [None, False]} 2026-02-21T09:34:26.7862784Z [186s] Fitting surrogate: 431 points, 431 targets 2026-02-21T09:34:27.2443784Z [186s] Generation 6 starting: 26 
neighbors, 2 active search path(s) 2026-02-21T09:34:29.7967700Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26/26 27.3 configs/s 2026-02-21T09:34:31.3678451Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 26/26 17.0 configs/s 2026-02-21T09:34:33.1223756Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 577.0 2026-02-21T09:34:33.1224446Z configs/s 2026-02-21T09:34:33.2706903Z [192s] Generation 6 complete: 2026-02-21T09:34:33.2712715Z ok=28 2026-02-21T09:34:33.2713265Z min=0.0225 2026-02-21T09:34:33.2713456Z mid=0.0246 2026-02-21T09:34:33.2713611Z max=0.0511 2026-02-21T09:34:33.2713765Z best={'block_sizes': [1, 8192], 2026-02-21T09:34:33.2714405Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T09:34:33.2714718Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:34:33.2714928Z 'num_stages': 6, 2026-02-21T09:34:33.2715091Z 'num_warps': 4, 2026-02-21T09:34:33.2715239Z 'pid_type': 'flat', 2026-02-21T09:34:33.2715411Z 'range_flattens': [None, True], 2026-02-21T09:34:33.2715602Z 'range_multi_buffers': [None, False], 2026-02-21T09:34:33.2715806Z 'range_num_stages': [0, 2], 2026-02-21T09:34:33.2715980Z 'range_unroll_factors': [0, 2], 2026-02-21T09:34:33.2716177Z 'range_warp_specializes': [None, False]} 2026-02-21T09:34:33.2723850Z [192s] Fitting surrogate: 459 points, 459 targets 2026-02-21T09:34:33.7238276Z [193s] Generation 7 starting: 21 neighbors, 2 active search path(s) 2026-02-21T09:34:35.9360708Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 12.4 configs/s 2026-02-21T09:34:37.1948743Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 17.3 configs/s 2026-02-21T09:34:38.7149118Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 665.7 2026-02-21T09:34:38.7153439Z configs/s 2026-02-21T09:34:38.8373681Z [198s] Generation 7 complete: 2026-02-21T09:34:38.8379081Z ok=23 2026-02-21T09:34:38.8386038Z min=0.0225 2026-02-21T09:34:38.8387518Z mid=0.0245 
2026-02-21T09:34:38.8387676Z max=0.0326 2026-02-21T09:34:38.8387829Z best={'block_sizes': [1, 8192], 2026-02-21T09:34:38.8388057Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T09:34:38.8388303Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:34:38.8388497Z 'num_stages': 3, 2026-02-21T09:34:38.8388637Z 'num_warps': 4, 2026-02-21T09:34:38.8388785Z 'pid_type': 'flat', 2026-02-21T09:34:38.8388940Z 'range_flattens': [None, True], 2026-02-21T09:34:38.8389124Z 'range_multi_buffers': [None, None], 2026-02-21T09:34:38.8389305Z 'range_num_stages': [0, 0], 2026-02-21T09:34:38.8389474Z 'range_unroll_factors': [0, 0], 2026-02-21T09:34:38.8389676Z 'range_warp_specializes': [None, True]} 2026-02-21T09:34:38.8390353Z [198s] Fitting surrogate: 482 points, 482 targets 2026-02-21T09:34:39.2881855Z [198s] Generation 8 starting: 18 neighbors, 2 active search path(s) 2026-02-21T09:34:41.3939149Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 10.5 configs/s 2026-02-21T09:34:42.4842337Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 18/18 17.2 configs/s 2026-02-21T09:34:43.7472351Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 799.1 2026-02-21T09:34:43.7476289Z configs/s 2026-02-21T09:34:43.8561805Z [203s] Generation 8 complete: 2026-02-21T09:34:43.8565055Z ok=20 2026-02-21T09:34:43.8568943Z min=0.0226 2026-02-21T09:34:43.8573035Z mid=0.0227 2026-02-21T09:34:43.8577747Z max=0.0389 2026-02-21T09:34:43.8579372Z best={'block_sizes': [1, 8192], 2026-02-21T09:34:43.8580097Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T09:34:43.8580388Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:34:43.8585690Z 'num_stages': 3, 2026-02-21T09:34:43.8585939Z 'num_warps': 2, 2026-02-21T09:34:43.8586133Z 'pid_type': 'flat', 2026-02-21T09:34:43.8586361Z 'range_flattens': [None, True], 2026-02-21T09:34:43.8586582Z 'range_multi_buffers': [None, False], 2026-02-21T09:34:43.8586781Z 
'range_num_stages': [0, 0], 2026-02-21T09:34:43.8586956Z 'range_unroll_factors': [0, 0], 2026-02-21T09:34:43.8587151Z 'range_warp_specializes': [None, True]} 2026-02-21T09:34:43.8587362Z [203s] Fitting surrogate: 502 points, 502 targets 2026-02-21T09:34:44.2294477Z [203s] Generation 9 starting: 9 neighbors, 1 active search path(s) 2026-02-21T09:34:45.4660912Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9/9 15.3 configs/s 2026-02-21T09:34:46.0085997Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 18.2 configs/s 2026-02-21T09:34:46.6495086Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1560.2 2026-02-21T09:34:46.6497712Z configs/s 2026-02-21T09:34:46.7098499Z [206s] Generation 9 complete: 2026-02-21T09:34:46.7101877Z ok=10 2026-02-21T09:34:46.7106122Z min=0.0225 2026-02-21T09:34:46.7111021Z mid=0.0226 2026-02-21T09:34:46.7115428Z max=0.0226 2026-02-21T09:34:46.7118711Z best={'block_sizes': [1, 8192], 2026-02-21T09:34:46.7123288Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T09:34:46.7126722Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:34:46.7126976Z 'num_stages': 3, 2026-02-21T09:34:46.7127130Z 'num_warps': 1, 2026-02-21T09:34:46.7127275Z 'pid_type': 'flat', 2026-02-21T09:34:46.7127440Z 'range_flattens': [None, True], 2026-02-21T09:34:46.7127618Z 'range_multi_buffers': [None, False], 2026-02-21T09:34:46.7127810Z 'range_num_stages': [0, 0], 2026-02-21T09:34:46.7127984Z 'range_unroll_factors': [0, 0], 2026-02-21T09:34:46.7128179Z 'range_warp_specializes': [None, True]} 2026-02-21T09:34:46.7128405Z [206s] Fitting surrogate: 512 points, 512 targets 2026-02-21T09:34:47.0588155Z [206s] Generation 10 starting: 5 neighbors, 1 active search path(s) 2026-02-21T09:34:47.9636499Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6/6 11.0 configs/s 2026-02-21T09:34:48.3264448Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 6/6 19.2 configs/s 
2026-02-21T09:34:48.7688407Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2239.6 2026-02-21T09:34:48.7689522Z configs/s 2026-02-21T09:34:48.8134253Z [208s] Generation 10 complete: 2026-02-21T09:34:48.8137340Z ok=7 2026-02-21T09:34:48.8141943Z min=0.0225 2026-02-21T09:34:48.8143940Z mid=0.0225 2026-02-21T09:34:48.8144112Z max=0.0225 2026-02-21T09:34:48.8144261Z best={'block_sizes': [1, 8192], 2026-02-21T09:34:48.8149103Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T09:34:48.8151216Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:34:48.8151915Z 'num_stages': 3, 2026-02-21T09:34:48.8156129Z 'num_warps': 1, 2026-02-21T09:34:48.8160072Z 'pid_type': 'flat', 2026-02-21T09:34:48.8160360Z 'range_flattens': [None, True], 2026-02-21T09:34:48.8160593Z 'range_multi_buffers': [None, False], 2026-02-21T09:34:48.8164639Z 'range_num_stages': [0, 0], 2026-02-21T09:34:48.8167911Z 'range_unroll_factors': [0, 0], 2026-02-21T09:34:48.8172226Z 'range_warp_specializes': [None, True]} 2026-02-21T09:34:48.8177147Z [208s] Fitting surrogate: 519 points, 519 targets 2026-02-21T09:34:49.0893106Z [208s] Autotuning complete in 208.8s after searching 498 configs. 
2026-02-21T09:34:49.0893530Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:34:49.0894779Z @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T09:34:49.0895621Z 2026-02-21T09:34:49.0895886Z [208s] Code of selected kernel: /tmp/torchinductor_root/ul/culrzb4qt45ddt4lkeeka2wtyrtxpfvhgey7flljwev7nwmsfbdj.py 2026-02-21T09:34:49.1115204Z from __future__ import annotations 2026-02-21T09:34:49.1116850Z 2026-02-21T09:34:49.1117006Z import torch 2026-02-21T09:34:49.1117179Z import triton 2026-02-21T09:34:49.1117346Z import triton.language as tl 2026-02-21T09:34:49.1117560Z from torch._inductor.runtime import triton_helpers 2026-02-21T09:34:49.1117844Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T09:34:49.1118141Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:34:49.1118337Z 2026-02-21T09:34:49.1118410Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T09:34:49.1118602Z _BLOCK_SIZE_1 = tl.constexpr(8192) 2026-02-21T09:34:49.1118723Z 2026-02-21T09:34:49.1118802Z @triton.jit 2026-02-21T09:34:49.1118961Z def _helion_softmax_two_pass(x, out): 2026-02-21T09:34:49.1119221Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:34:49.1119483Z pid_0 = tl.program_id(0) 2026-02-21T09:34:49.1119653Z offset_0 = pid_0 2026-02-21T09:34:49.1119839Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T09:34:49.1120133Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:34:49.1120435Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T09:34:49.1120719Z # src[softmax.py:81]: di = 
hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:34:49.1120979Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T09:34:49.1121255Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:34:49.1121631Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:34:49.1121921Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:34:49.1122183Z # src[softmax.py:82-89]: ... 2026-02-21T09:34:49.1122543Z for offset_2 in tl.range(0, 5504, _BLOCK_SIZE_1, warp_specialize=True, disallow_acc_multi_buffer=True, flatten=True): 2026-02-21T09:34:49.1122975Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:34:49.1123225Z mask_1 = indices_2 < 5504 2026-02-21T09:34:49.1123417Z mi_copy = mi 2026-02-21T09:34:49.1123567Z di_copy = di 2026-02-21T09:34:49.1123726Z mi_copy_0 = mi_copy 2026-02-21T09:34:49.1123892Z di_copy_0 = di_copy 2026-02-21T09:34:49.1124088Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:34:49.1124477Z values = tl.load(x + (indices_0[:, None] * 5504 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first') 2026-02-21T09:34:49.1124874Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:34:49.1125306Z _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16)) 2026-02-21T09:34:49.1125950Z local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16) 2026-02-21T09:34:49.1126207Z # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax) 2026-02-21T09:34:49.1126453Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T09:34:49.1126662Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T09:34:49.1126925Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:34:49.1127162Z v_2 = mi_copy_0 - v_1 2026-02-21T09:34:49.1127339Z v_3 = libdevice.exp(v_2) 2026-02-21T09:34:49.1127513Z v_4 = di_copy_0 * v_3 
2026-02-21T09:34:49.1127698Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:34:49.1127902Z subscript = v_1[:, None] 2026-02-21T09:34:49.1128074Z v_5 = tl.cast(values, tl.float32) 2026-02-21T09:34:49.1128320Z v_6 = v_5 - subscript 2026-02-21T09:34:49.1128531Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:34:49.1128802Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:34:49.1129013Z # src[softmax.py:88]: ).sum(dim=1) 2026-02-21T09:34:49.1129201Z v_7 = libdevice.exp(v_6) 2026-02-21T09:34:49.1129525Z _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32)) 2026-02-21T09:34:49.1129882Z sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32) 2026-02-21T09:34:49.1130087Z di = v_4 + sum_1 2026-02-21T09:34:49.1130245Z # src[softmax.py:89]: mi = mi_next 2026-02-21T09:34:49.1130424Z mi = v_1 2026-02-21T09:34:49.1130621Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:34:49.1130896Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:34:49.1131195Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:34:49.1131661Z for offset_2 in tl.range(0, 5504, _BLOCK_SIZE_1, warp_specialize=True, disallow_acc_multi_buffer=True, flatten=True): 2026-02-21T09:34:49.1132054Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:34:49.1132285Z mask_2 = indices_2 < 5504 2026-02-21T09:34:49.1132455Z mi_copy_1 = mi 2026-02-21T09:34:49.1132600Z di_copy_1 = di 2026-02-21T09:34:49.1132750Z mi_copy_1_0 = mi_copy_1 2026-02-21T09:34:49.1132922Z di_copy_1_0 = di_copy_1 2026-02-21T09:34:49.1133104Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:34:49.1133479Z values_1 = tl.load(x + (indices_0[:, None] * 5504 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first') 2026-02-21T09:34:49.1133907Z # src[softmax.py:92]: 
out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:34:49.1134197Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T09:34:49.1134390Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T09:34:49.1134608Z v_10 = v_9 - subscript_1 2026-02-21T09:34:49.1134784Z v_11 = libdevice.exp(v_10) 2026-02-21T09:34:49.1134957Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T09:34:49.1135143Z v_12 = v_11 / subscript_2 2026-02-21T09:34:49.1135311Z v_13 = tl.cast(v_12, tl.float16) 2026-02-21T09:34:49.1135582Z tl.store(out + (indices_0[:, None] * 5504 + indices_2[None, :] * 1), v_13, mask_2[None, :]) 2026-02-21T09:34:49.1135794Z 2026-02-21T09:34:49.1135921Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T09:34:49.1136162Z """ 2026-02-21T09:34:49.1136370Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T09:34:49.1136666Z This version uses fewer passes but is less numerically stable. 2026-02-21T09:34:49.1136887Z Args: 2026-02-21T09:34:49.1137047Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T09:34:49.1137374Z Returns: 2026-02-21T09:34:49.1137548Z torch.Tensor: Softmax output tensor of the same shape. 2026-02-21T09:34:49.1137762Z """ 2026-02-21T09:34:49.1137898Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T09:34:49.1138077Z m, n = x.size() 2026-02-21T09:34:49.1138248Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T09:34:49.1138447Z out = torch.empty_like(x) 2026-02-21T09:34:49.1138678Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:34:49.1138990Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:34:49.1139303Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:34:49.1139536Z # src[softmax.py:79-92]: ... 
2026-02-21T09:34:49.1139790Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=3) 2026-02-21T09:34:49.1140124Z # src[softmax.py:93]: return out 2026-02-21T09:34:49.1140295Z return out 2026-02-21T09:34:50.2567440Z WARNING:tritonbench.utils.triton_op:Completed input ID 41: 2026-02-21T09:34:50.2571211Z (M, N) 2026-02-21T09:34:50.2575118Z ------------ 2026-02-21T09:34:50.2577027Z (4096, 5504) 2026-02-21T09:34:50.2577162Z 2026-02-21T09:34:50.2577626Z 45%|████▌ | 9/20 [25:55<36:36, 199.65s/it]WARNING:tritonbench.utils.triton_op:Running input ID 46: 2026-02-21T09:34:50.2582171Z (M, N) 2026-02-21T09:34:50.2584097Z ------------ 2026-02-21T09:34:50.2584273Z (4096, 6144) 2026-02-21T09:34:50.2584606Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T09:34:51.4635234Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T09:34:52.9796018Z INFO:tritonbench.utils.triton_op:Took 2.16ms to get benchmark function for torch_compile_softmax 2026-02-21T09:34:54.2088347Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:34:54.2092638Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:34:54.2097017Z 'dtype': 'torch.float16', 2026-02-21T09:34:54.2101403Z 'shape': (4096, 6144), 2026-02-21T09:34:54.2105605Z 'stride': (6144, 1)},), 2026-02-21T09:34:54.2110698Z 'kwargs': {}} 2026-02-21T09:34:54.2112426Z INFO:tritonbench.utils.triton_op:Took 2.46ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T09:34:54.3839239Z [0s] Autotune random seed: 2138408546 2026-02-21T09:34:54.4087714Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:35:27.9158603Z [33s] Timeout after 30s compiling Config(block_sizes=[128, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=32, 
num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]) 2026-02-21T09:35:30.3274906Z [35s] Timeout after 30s compiling Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T09:35:32.4106365Z [38s] Timeout after 30s compiling Config(block_sizes=[256, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_stages=6, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, None]) 2026-02-21T09:35:32.4126949Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s 2026-02-21T09:35:39.7371496Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 13.6 configs/s 2026-02-21T09:35:39.7384161Z [45s] Adaptive compile timeout: 30s (90% percentile=7.9s, bounds=[30.0s, 30s]) 2026-02-21T09:35:40.4668087Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1355.7 configs/s 2026-02-21T09:35:40.5262829Z [46s] Initial random population of 100, 5 starting points: 2026-02-21T09:35:40.5267129Z error=10 2026-02-21T09:35:40.5268623Z timeout=3 2026-02-21T09:35:40.5268964Z ok=87 2026-02-21T09:35:40.5274437Z min=0.0430 2026-02-21T09:35:40.5276415Z mid=0.4813 2026-02-21T09:35:40.5276685Z max=141.7728 2026-02-21T09:35:40.5276915Z best={'block_sizes': [1, 8192], 2026-02-21T09:35:40.5277267Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:35:40.5277607Z 'load_eviction_policies': ['first', 
'first'], 2026-02-21T09:35:40.5277977Z 'num_sm_multiplier': 2, 2026-02-21T09:35:40.5282685Z 'num_stages': 7, 2026-02-21T09:35:40.5285097Z 'num_warps': 32, 2026-02-21T09:35:40.5290775Z 'pid_type': 'persistent_blocked', 2026-02-21T09:35:40.5294880Z 'range_flattens': [True, True], 2026-02-21T09:35:40.5299360Z 'range_multi_buffers': [False, None], 2026-02-21T09:35:40.5300834Z 'range_num_stages': [4, 3], 2026-02-21T09:35:40.5301179Z 'range_unroll_factors': [2, 3], 2026-02-21T09:35:40.5305606Z 'range_warp_specializes': [False, False]} 2026-02-21T09:35:40.5310062Z [46s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:35:41.6772300Z [47s] Generation 1 starting: 85 neighbors, 5 active search path(s) 2026-02-21T09:35:52.1949809Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 3.4 configs/s 2026-02-21T09:35:57.5680201Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 16.9 configs/s 2026-02-21T09:36:01.9364355Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 231.7 2026-02-21T09:36:01.9368355Z configs/s 2026-02-21T09:36:02.2228226Z [67s] Generation 1 complete: 2026-02-21T09:36:02.2229611Z ok=91 2026-02-21T09:36:02.2229872Z min=0.0328 2026-02-21T09:36:02.2230074Z mid=0.0471 2026-02-21T09:36:02.2230235Z max=0.2561 2026-02-21T09:36:02.2230441Z best={'block_sizes': [1, 2048], 2026-02-21T09:36:02.2230732Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:36:02.2231061Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:36:02.2231284Z 'num_stages': 5, 2026-02-21T09:36:02.2231498Z 'num_warps': 2, 2026-02-21T09:36:02.2231944Z 'pid_type': 'flat', 2026-02-21T09:36:02.2232185Z 'range_flattens': [None, False], 2026-02-21T09:36:02.2232424Z 'range_multi_buffers': [None, False], 2026-02-21T09:36:02.2232705Z 'range_num_stages': [0, 1], 2026-02-21T09:36:02.2232958Z 'range_unroll_factors': [0, 0], 2026-02-21T09:36:02.2233177Z 'range_warp_specializes': [None, False]} 2026-02-21T09:36:02.2246190Z 
[67s] Fitting surrogate: 191 points, 191 targets 2026-02-21T09:36:03.4733803Z [69s] Generation 2 starting: 75 neighbors, 5 active search path(s) 2026-02-21T09:36:27.1045441Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 0.8 configs/s 2026-02-21T09:36:32.4405724Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 14.9 configs/s 2026-02-21T09:36:34.0789594Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 617.5 2026-02-21T09:36:34.0794217Z configs/s 2026-02-21T09:36:34.2110422Z [99s] Generation 2 complete: 2026-02-21T09:36:34.2114906Z error=2 2026-02-21T09:36:34.2116594Z ok=79 2026-02-21T09:36:34.2116854Z min=0.0246 2026-02-21T09:36:34.2117049Z mid=0.0389 2026-02-21T09:36:34.2117264Z max=0.3337 2026-02-21T09:36:34.2117491Z best={'block_sizes': [1, 8192], 2026-02-21T09:36:34.2117805Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:36:34.2118171Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:36:34.2118514Z 'num_stages': 1, 2026-02-21T09:36:34.2122420Z 'num_warps': 2, 2026-02-21T09:36:34.2124110Z 'pid_type': 'flat', 2026-02-21T09:36:34.2124852Z 'range_flattens': [None, True], 2026-02-21T09:36:34.2125149Z 'range_multi_buffers': [None, True], 2026-02-21T09:36:34.2125443Z 'range_num_stages': [0, 4], 2026-02-21T09:36:34.2125682Z 'range_unroll_factors': [0, 0], 2026-02-21T09:36:34.2125979Z 'range_warp_specializes': [None, True]} 2026-02-21T09:36:34.2126355Z [99s] Fitting surrogate: 272 points, 272 targets 2026-02-21T09:36:35.2439751Z [100s] Generation 3 starting: 68 neighbors, 5 active search path(s) 2026-02-21T09:36:49.5302953Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 1.5 configs/s 2026-02-21T09:36:53.7757689Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 16.7 configs/s 2026-02-21T09:36:57.0586826Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 309.6 2026-02-21T09:36:57.0592026Z configs/s 
2026-02-21T09:36:57.3111252Z [122s] Generation 3 complete: 2026-02-21T09:36:57.3112386Z ok=74 2026-02-21T09:36:57.3116972Z min=0.0246 2026-02-21T09:36:57.3118563Z mid=0.0348 2026-02-21T09:36:57.3118793Z max=0.8202 2026-02-21T09:36:57.3119029Z best={'block_sizes': [1, 8192], 2026-02-21T09:36:57.3119345Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:36:57.3119705Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:36:57.3119957Z 'num_stages': 1, 2026-02-21T09:36:57.3120183Z 'num_warps': 2, 2026-02-21T09:36:57.3120378Z 'pid_type': 'flat', 2026-02-21T09:36:57.3120625Z 'range_flattens': [None, True], 2026-02-21T09:36:57.3120892Z 'range_multi_buffers': [None, True], 2026-02-21T09:36:57.3121139Z 'range_num_stages': [0, 4], 2026-02-21T09:36:57.3121395Z 'range_unroll_factors': [0, 0], 2026-02-21T09:36:57.3121726Z 'range_warp_specializes': [None, True]} 2026-02-21T09:36:57.3129663Z [122s] Fitting surrogate: 346 points, 346 targets 2026-02-21T09:36:58.2905991Z [123s] Generation 4 starting: 58 neighbors, 5 active search path(s) 2026-02-21T09:37:24.8774984Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 0.4 configs/s 2026-02-21T09:37:28.4225667Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 16.8 configs/s 2026-02-21T09:37:32.0190194Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 345.9 2026-02-21T09:37:32.0191373Z configs/s 2026-02-21T09:37:32.2404809Z [157s] Generation 4 complete: 2026-02-21T09:37:32.2409419Z error=1 2026-02-21T09:37:32.2413584Z ok=63 2026-02-21T09:37:32.2415218Z min=0.0246 2026-02-21T09:37:32.2415493Z mid=0.0328 2026-02-21T09:37:32.2415703Z max=0.3267 2026-02-21T09:37:32.2415935Z best={'block_sizes': [1, 8192], 2026-02-21T09:37:32.2416265Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:37:32.2416665Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:37:32.2416935Z 'num_stages': 7, 
2026-02-21T09:37:32.2417226Z 'num_warps': 2, 2026-02-21T09:37:32.2417906Z 'pid_type': 'flat', 2026-02-21T09:37:32.2418170Z 'range_flattens': [None, None], 2026-02-21T09:37:32.2418457Z 'range_multi_buffers': [None, False], 2026-02-21T09:37:32.2418739Z 'range_num_stages': [0, 4], 2026-02-21T09:37:32.2419253Z 'range_unroll_factors': [0, 0], 2026-02-21T09:37:32.2419510Z 'range_warp_specializes': [None, True]} 2026-02-21T09:37:32.2419885Z [157s] Fitting surrogate: 410 points, 410 targets 2026-02-21T09:37:32.8735759Z [158s] Generation 5 starting: 34 neighbors, 3 active search path(s) 2026-02-21T09:37:38.4721882Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 5.7 configs/s 2026-02-21T09:37:40.6590302Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 16.8 configs/s 2026-02-21T09:37:42.3645156Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 593.8 2026-02-21T09:37:42.3646320Z configs/s 2026-02-21T09:37:42.4949329Z [168s] Generation 5 complete: 2026-02-21T09:37:42.4950345Z ok=38 2026-02-21T09:37:42.4950534Z min=0.0246 2026-02-21T09:37:42.4950753Z mid=0.0328 2026-02-21T09:37:42.4950929Z max=0.0799 2026-02-21T09:37:42.4951151Z best={'block_sizes': [1, 8192], 2026-02-21T09:37:42.4951472Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:37:42.4952128Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:37:42.4952388Z 'num_stages': 7, 2026-02-21T09:37:42.4952625Z 'num_warps': 2, 2026-02-21T09:37:42.4952822Z 'pid_type': 'flat', 2026-02-21T09:37:42.4953064Z 'range_flattens': [None, None], 2026-02-21T09:37:42.4953328Z 'range_multi_buffers': [None, False], 2026-02-21T09:37:42.4953574Z 'range_num_stages': [0, 4], 2026-02-21T09:37:42.4953819Z 'range_unroll_factors': [0, 0], 2026-02-21T09:37:42.4954052Z 'range_warp_specializes': [None, True]} 2026-02-21T09:37:42.4965425Z [168s] Fitting surrogate: 448 points, 448 targets 2026-02-21T09:37:43.0046393Z [168s] Generation 6 starting: 23 
neighbors, 2 active search path(s) 2026-02-21T09:37:46.8439267Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 4.6 configs/s 2026-02-21T09:37:48.2293482Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 17.1 configs/s 2026-02-21T09:37:49.7497817Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 666.4 2026-02-21T09:37:49.7499385Z configs/s 2026-02-21T09:37:49.8803524Z [175s] Generation 6 complete: 2026-02-21T09:37:49.8807899Z ok=26 2026-02-21T09:37:49.8811358Z min=0.0246 2026-02-21T09:37:49.8813724Z mid=0.0266 2026-02-21T09:37:49.8813980Z max=0.0429 2026-02-21T09:37:49.8818704Z best={'block_sizes': [1, 8192], 2026-02-21T09:37:49.8820206Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:37:49.8820599Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:37:49.8820887Z 'num_stages': 7, 2026-02-21T09:37:49.8821132Z 'num_warps': 2, 2026-02-21T09:37:49.8821377Z 'pid_type': 'flat', 2026-02-21T09:37:49.8821666Z 'range_flattens': [None, None], 2026-02-21T09:37:49.8821938Z 'range_multi_buffers': [None, False], 2026-02-21T09:37:49.8822183Z 'range_num_stages': [0, 4], 2026-02-21T09:37:49.8822434Z 'range_unroll_factors': [0, 0], 2026-02-21T09:37:49.8822671Z 'range_warp_specializes': [None, True]} 2026-02-21T09:37:49.8822976Z [175s] Fitting surrogate: 474 points, 474 targets 2026-02-21T09:37:50.1749810Z [175s] Generation 7 starting: 9 neighbors, 1 active search path(s) 2026-02-21T09:37:51.8365551Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9/9 11.7 configs/s 2026-02-21T09:37:52.3719553Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 18.5 configs/s 2026-02-21T09:37:53.1120717Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1355.9 2026-02-21T09:37:53.1125507Z configs/s 2026-02-21T09:37:53.1825907Z [178s] Generation 7 complete: 2026-02-21T09:37:53.1827215Z ok=11 2026-02-21T09:37:53.1827491Z min=0.0246 2026-02-21T09:37:53.1827746Z 
mid=0.0255 2026-02-21T09:37:53.1827974Z max=0.0287 2026-02-21T09:37:53.1828254Z best={'block_sizes': [1, 8192], 2026-02-21T09:37:53.1828614Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:37:53.1829074Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:37:53.1829362Z 'num_stages': 7, 2026-02-21T09:37:53.1829638Z 'num_warps': 2, 2026-02-21T09:37:53.1829882Z 'pid_type': 'flat', 2026-02-21T09:37:53.1830163Z 'range_flattens': [None, None], 2026-02-21T09:37:53.1830477Z 'range_multi_buffers': [None, False], 2026-02-21T09:37:53.1830770Z 'range_num_stages': [0, 3], 2026-02-21T09:37:53.1831073Z 'range_unroll_factors': [0, 1], 2026-02-21T09:37:53.1831355Z 'range_warp_specializes': [None, True]} 2026-02-21T09:37:53.1840353Z [178s] Fitting surrogate: 485 points, 485 targets 2026-02-21T09:37:53.5017425Z [179s] Generation 8 starting: 10 neighbors, 1 active search path(s) 2026-02-21T09:37:55.1352032Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 14.1 configs/s 2026-02-21T09:37:55.7433556Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 17.8 configs/s 2026-02-21T09:37:56.4072947Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1506.2 2026-02-21T09:37:56.4074836Z configs/s 2026-02-21T09:37:56.4713596Z [182s] Generation 8 complete: 2026-02-21T09:37:56.4715018Z ok=11 2026-02-21T09:37:56.4715317Z min=0.0246 2026-02-21T09:37:56.4720856Z mid=0.0247 2026-02-21T09:37:56.4725558Z max=0.0369 2026-02-21T09:37:56.4727206Z best={'block_sizes': [1, 8192], 2026-02-21T09:37:56.4727610Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:37:56.4732990Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:37:56.4737858Z 'num_stages': 7, 2026-02-21T09:37:56.4740017Z 'num_warps': 4, 2026-02-21T09:37:56.4740383Z 'pid_type': 'flat', 2026-02-21T09:37:56.4740659Z 'range_flattens': [None, None], 2026-02-21T09:37:56.4744753Z 'range_multi_buffers': [None, False], 
2026-02-21T09:37:56.4748810Z 'range_num_stages': [0, 3], 2026-02-21T09:37:56.4752923Z 'range_unroll_factors': [0, 1], 2026-02-21T09:37:56.4757019Z 'range_warp_specializes': [None, True]} 2026-02-21T09:37:56.4761416Z [182s] Fitting surrogate: 496 points, 496 targets 2026-02-21T09:37:56.8179646Z [182s] Generation 9 starting: 12 neighbors, 1 active search path(s) 2026-02-21T09:37:58.5249300Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 84.5 configs/s 2026-02-21T09:37:59.2318817Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 12/12 18.2 configs/s 2026-02-21T09:38:00.1049478Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1150.0 2026-02-21T09:38:00.1050520Z configs/s 2026-02-21T09:38:00.1837447Z [185s] Generation 9 complete: 2026-02-21T09:38:00.1839113Z ok=13 2026-02-21T09:38:00.1839332Z min=0.0246 2026-02-21T09:38:00.1839540Z mid=0.0246 2026-02-21T09:38:00.1839712Z max=0.0327 2026-02-21T09:38:00.1839923Z best={'block_sizes': [1, 8192], 2026-02-21T09:38:00.1840180Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:38:00.1840474Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:38:00.1840731Z 'num_stages': 7, 2026-02-21T09:38:00.1840912Z 'num_warps': 2, 2026-02-21T09:38:00.1841118Z 'pid_type': 'flat', 2026-02-21T09:38:00.1841322Z 'range_flattens': [None, None], 2026-02-21T09:38:00.1841895Z 'range_multi_buffers': [None, False], 2026-02-21T09:38:00.1842130Z 'range_num_stages': [0, 3], 2026-02-21T09:38:00.1842373Z 'range_unroll_factors': [0, 1], 2026-02-21T09:38:00.1842594Z 'range_warp_specializes': [None, True]} 2026-02-21T09:38:00.1870003Z [185s] Fitting surrogate: 509 points, 509 targets 2026-02-21T09:38:00.5908799Z [186s] Generation 10 starting: 10 neighbors, 1 active search path(s) 2026-02-21T09:38:02.3770417Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 7.4 configs/s 2026-02-21T09:38:03.0261384Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 18.3 
configs/s 2026-02-21T09:38:03.8289812Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1248.7 2026-02-21T09:38:03.8290980Z configs/s 2026-02-21T09:38:03.9038155Z [189s] Generation 10 complete: 2026-02-21T09:38:03.9039837Z ok=12 2026-02-21T09:38:03.9040051Z min=0.0246 2026-02-21T09:38:03.9040254Z mid=0.0246 2026-02-21T09:38:03.9040420Z max=0.0287 2026-02-21T09:38:03.9040628Z best={'block_sizes': [1, 8192], 2026-02-21T09:38:03.9040879Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:38:03.9041173Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:38:03.9041402Z 'num_stages': 7, 2026-02-21T09:38:03.9041694Z 'num_warps': 2, 2026-02-21T09:38:03.9041909Z 'pid_type': 'flat', 2026-02-21T09:38:03.9042514Z 'range_flattens': [None, None], 2026-02-21T09:38:03.9042797Z 'range_multi_buffers': [None, False], 2026-02-21T09:38:03.9043046Z 'range_num_stages': [0, 3], 2026-02-21T09:38:03.9043281Z 'range_unroll_factors': [0, 1], 2026-02-21T09:38:03.9043504Z 'range_warp_specializes': [None, True]} 2026-02-21T09:38:03.9054399Z [189s] Fitting surrogate: 521 points, 521 targets 2026-02-21T09:38:04.2977308Z [189s] Generation 11 starting: 9 neighbors, 1 active search path(s) 2026-02-21T09:38:05.5441422Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9/9 10.2 configs/s 2026-02-21T09:38:06.0697829Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 9/9 18.9 configs/s 2026-02-21T09:38:06.7996180Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1372.9 2026-02-21T09:38:06.7996995Z configs/s 2026-02-21T09:38:06.8686475Z [192s] Generation 11 complete: 2026-02-21T09:38:06.8688038Z ok=11 2026-02-21T09:38:06.8688293Z min=0.0246 2026-02-21T09:38:06.8688517Z mid=0.0246 2026-02-21T09:38:06.8688683Z max=0.0247 2026-02-21T09:38:06.8688894Z best={'block_sizes': [1, 8192], 2026-02-21T09:38:06.8689246Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:38:06.8693512Z 'load_eviction_policies': ['first', 
'first'], 2026-02-21T09:38:06.8695270Z 'num_stages': 7, 2026-02-21T09:38:06.8695548Z 'num_warps': 2, 2026-02-21T09:38:06.8699746Z 'pid_type': 'flat', 2026-02-21T09:38:06.8704334Z 'range_flattens': [None, None], 2026-02-21T09:38:06.8706204Z 'range_multi_buffers': [None, False], 2026-02-21T09:38:06.8706517Z 'range_num_stages': [0, 4], 2026-02-21T09:38:06.8706740Z 'range_unroll_factors': [0, 0], 2026-02-21T09:38:06.8707004Z 'range_warp_specializes': [None, True]} 2026-02-21T09:38:06.8714680Z [192s] Fitting surrogate: 532 points, 532 targets 2026-02-21T09:38:07.1599032Z [192s] Autotuning complete in 192.8s after searching 505 configs. 2026-02-21T09:38:07.1599805Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:38:07.1600809Z @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=7, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T09:38:07.1601788Z 2026-02-21T09:38:07.1602087Z [192s] Code of selected kernel: /tmp/torchinductor_root/lc/clcx5yb5v47k23m7k6353vegzyi3ciijakdyml5lbrqctuz55van.py 2026-02-21T09:38:07.1850783Z from __future__ import annotations 2026-02-21T09:38:07.1852174Z 2026-02-21T09:38:07.1852410Z import torch 2026-02-21T09:38:07.1852658Z import triton 2026-02-21T09:38:07.1852861Z import triton.language as tl 2026-02-21T09:38:07.1853153Z from torch._inductor.runtime import triton_helpers 2026-02-21T09:38:07.1853456Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T09:38:07.1854262Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:38:07.1854460Z 2026-02-21T09:38:07.1854581Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T09:38:07.1854799Z _BLOCK_SIZE_1 = tl.constexpr(8192) 2026-02-21T09:38:07.1854963Z 
2026-02-21T09:38:07.1855042Z @triton.jit 2026-02-21T09:38:07.1855229Z def _helion_softmax_two_pass(x, out): 2026-02-21T09:38:07.1855550Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:38:07.1855838Z pid_0 = tl.program_id(0) 2026-02-21T09:38:07.1856069Z offset_0 = pid_0 2026-02-21T09:38:07.1856314Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T09:38:07.1856642Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:38:07.1856993Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T09:38:07.1857295Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:38:07.1857779Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T09:38:07.1858114Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:38:07.1858424Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:38:07.1858749Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:38:07.1859018Z # src[softmax.py:82-89]: ... 
2026-02-21T09:38:07.1859426Z for offset_2 in tl.range(0, 6144, _BLOCK_SIZE_1, warp_specialize=True, num_stages=4, disallow_acc_multi_buffer=True): 2026-02-21T09:38:07.1859880Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:38:07.1860155Z mask_1 = indices_2 < 6144 2026-02-21T09:38:07.1860387Z mi_copy = mi 2026-02-21T09:38:07.1860568Z di_copy = di 2026-02-21T09:38:07.1860778Z mi_copy_0 = mi_copy 2026-02-21T09:38:07.1860974Z di_copy_0 = di_copy 2026-02-21T09:38:07.1861223Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:38:07.1861717Z values = tl.load(x + (indices_0[:, None] * 6144 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first') 2026-02-21T09:38:07.1862180Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:38:07.1862659Z _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16)) 2026-02-21T09:38:07.1863092Z local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16) 2026-02-21T09:38:07.1863420Z # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax) 2026-02-21T09:38:07.1863701Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T09:38:07.1863992Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T09:38:07.1864315Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:38:07.1864597Z v_2 = mi_copy_0 - v_1 2026-02-21T09:38:07.1864837Z v_3 = libdevice.exp(v_2) 2026-02-21T09:38:07.1865042Z v_4 = di_copy_0 * v_3 2026-02-21T09:38:07.1865298Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:38:07.1865538Z subscript = v_1[:, None] 2026-02-21T09:38:07.1865783Z v_5 = tl.cast(values, tl.float32) 2026-02-21T09:38:07.1866001Z v_6 = v_5 - subscript 2026-02-21T09:38:07.1866281Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:38:07.1866610Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:38:07.1866864Z # 
src[softmax.py:88]: ).sum(dim=1) 2026-02-21T09:38:07.1867121Z v_7 = libdevice.exp(v_6) 2026-02-21T09:38:07.1867482Z _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32)) 2026-02-21T09:38:07.1867903Z sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32) 2026-02-21T09:38:07.1868139Z di = v_4 + sum_1 2026-02-21T09:38:07.1868362Z # src[softmax.py:89]: mi = mi_next 2026-02-21T09:38:07.1868605Z mi = v_1 2026-02-21T09:38:07.1868934Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:38:07.1869271Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:38:07.1869607Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:38:07.1870097Z for offset_2 in tl.range(0, 6144, _BLOCK_SIZE_1, warp_specialize=True, num_stages=4, disallow_acc_multi_buffer=True): 2026-02-21T09:38:07.1870526Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:38:07.1870828Z mask_2 = indices_2 < 6144 2026-02-21T09:38:07.1871059Z mi_copy_1 = mi 2026-02-21T09:38:07.1871246Z di_copy_1 = di 2026-02-21T09:38:07.1871464Z mi_copy_1_0 = mi_copy_1 2026-02-21T09:38:07.1871701Z di_copy_1_0 = di_copy_1 2026-02-21T09:38:07.1871952Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:38:07.1872425Z values_1 = tl.load(x + (indices_0[:, None] * 6144 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first') 2026-02-21T09:38:07.1872932Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:38:07.1873276Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T09:38:07.1873498Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T09:38:07.1873756Z v_10 = v_9 - subscript_1 2026-02-21T09:38:07.1873967Z v_11 = libdevice.exp(v_10) 2026-02-21T09:38:07.1874213Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T09:38:07.1874436Z v_12 = v_11 / subscript_2 2026-02-21T09:38:07.1874674Z 
v_13 = tl.cast(v_12, tl.float16) 2026-02-21T09:38:07.1874980Z tl.store(out + (indices_0[:, None] * 6144 + indices_2[None, :] * 1), v_13, mask_2[None, :]) 2026-02-21T09:38:07.1875239Z 2026-02-21T09:38:07.1875387Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T09:38:07.1875685Z """ 2026-02-21T09:38:07.1875935Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T09:38:07.1876302Z This version uses fewer passes but is less numerically stable. 2026-02-21T09:38:07.1876562Z Args: 2026-02-21T09:38:07.1876790Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T09:38:07.1877022Z Returns: 2026-02-21T09:38:07.1877268Z torch.Tensor: Softmax output tensor of the same shape. 2026-02-21T09:38:07.1877544Z """ 2026-02-21T09:38:07.1877723Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T09:38:07.1877969Z m, n = x.size() 2026-02-21T09:38:07.1878175Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T09:38:07.1878444Z out = torch.empty_like(x) 2026-02-21T09:38:07.1878708Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:38:07.1879085Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:38:07.1879466Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:38:07.1879749Z # src[softmax.py:79-92]: ... 
2026-02-21T09:38:07.1880068Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=7) 2026-02-21T09:38:07.1880377Z # src[softmax.py:93]: return out 2026-02-21T09:38:07.1880614Z return out 2026-02-21T09:38:08.0912864Z WARNING:tritonbench.utils.triton_op:Completed input ID 46: 2026-02-21T09:38:08.0916959Z (M, N) 2026-02-21T09:38:08.0920905Z ------------ 2026-02-21T09:38:08.0923119Z (4096, 6144) 2026-02-21T09:38:08.0923345Z 2026-02-21T09:38:08.0927984Z 50%|█████ | 10/20 [29:13<33:10, 199.09s/it]WARNING:tritonbench.utils.triton_op:Running input ID 51: 2026-02-21T09:38:08.0932216Z (M, N) 2026-02-21T09:38:08.0936078Z ------------ 2026-02-21T09:38:08.0939933Z (4096, 6784) 2026-02-21T09:38:08.0944009Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T09:38:09.3003319Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T09:38:10.7909365Z INFO:tritonbench.utils.triton_op:Took 2.35ms to get benchmark function for torch_compile_softmax 2026-02-21T09:38:12.0944607Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:38:12.0946348Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:38:12.0946715Z 'dtype': 'torch.float16', 2026-02-21T09:38:12.0952281Z 'shape': (4096, 6784), 2026-02-21T09:38:12.0956621Z 'stride': (6784, 1)},), 2026-02-21T09:38:12.0958170Z 'kwargs': {}} 2026-02-21T09:38:12.0968572Z INFO:tritonbench.utils.triton_op:Took 2.74ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T09:38:12.2744377Z [0s] Autotune random seed: 2138408546 2026-02-21T09:38:12.3002456Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:38:49.3072244Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s 2026-02-21T09:38:51.3370744Z module { 2026-02-21T09:38:51.3374068Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 
: i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:38:51.3375943Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:38:51.3376224Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:38:51.3376489Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:38:51.3376713Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:38:51.3377079Z %cst = arith.constant dense<6784> : tensor<16x1xi32> 2026-02-21T09:38:51.3378720Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<16xf32> 2026-02-21T09:38:51.3379098Z %cst_1 = arith.constant dense<0xFF800000> : tensor<16xf32> 2026-02-21T09:38:51.3379368Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:38:51.3379630Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:38:51.3379861Z %c6784_i32 = arith.constant 6784 : i32 2026-02-21T09:38:51.3380134Z %c6784_i64 = arith.constant 6784 : i64 2026-02-21T09:38:51.3380392Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:38:51.3380756Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c6784_i32], [%c6784_i64, %c1_i64] : , > 2026-02-21T09:38:51.3381152Z %1 = tt.get_program_id x : i32 2026-02-21T09:38:51.3381373Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T09:38:51.3381689Z %3 = arith.minsi %2, %c256_i32 : i32 2026-02-21T09:38:51.3381931Z scf.for %arg2 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T09:38:51.3382207Z %4 = arith.muli %arg2, %c16_i32 : i32 2026-02-21T09:38:51.3382510Z %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:38:51.3382801Z %6 = tt.splat %4 : i32 -> tensor<16xi32> 2026-02-21T09:38:51.3383062Z %7 = arith.addi %6, %5 : tensor<16xi32> 2026-02-21T09:38:51.3383289Z %c6656_i32 = arith.constant 6656 : i32 2026-02-21T09:38:51.3383537Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:38:51.3383954Z %8:2 = scf.for %arg3 = %c0_i32 to %c6656_i32 step %c512_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<16xf32>, tensor<16xf32>) : i32 { 2026-02-21T09:38:51.3384491Z %50 = tt.descriptor_load %0[%4, %arg3] : 
!tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:38:51.3384882Z %51 = arith.extf %50 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:38:51.3385159Z %52 = "tt.reduce"(%51) <{axis = 1 : i32}> ({ 2026-02-21T09:38:51.3385423Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:38:51.3385689Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:38:51.3385920Z tt.reduce.return %128 : f32 2026-02-21T09:38:51.3386174Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:38:51.3386438Z %53 = arith.truncf %52 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:38:51.3386743Z %54 = arith.extf %53 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:38:51.3387048Z %55 = arith.cmpf ogt, %arg4, %54 : tensor<16xf32> 2026-02-21T09:38:51.3387633Z %56 = arith.cmpf une, %arg4, %arg4 : tensor<16xf32> 2026-02-21T09:38:51.3387927Z %57 = arith.ori %55, %56 : tensor<16xi1> 2026-02-21T09:38:51.3388205Z %58 = arith.select %57, %arg4, %54 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:38:51.3388526Z %59 = arith.subf %arg4, %58 : tensor<16xf32> 2026-02-21T09:38:51.3388937Z %60 = tt.extern_elementwise %59 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:38:51.3389393Z %61 = arith.mulf %arg5, %60 : tensor<16xf32> 2026-02-21T09:38:51.3389721Z %62 = tt.expand_dims %58 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:38:51.3390135Z %63 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:38:51.3390417Z %64 = arith.subf %51, %63 : tensor<16x128xf32> 2026-02-21T09:38:51.3390926Z %65 = tt.extern_elementwise %64 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:38:51.3391371Z %66 = "tt.reduce"(%65) <{axis = 1 : i32}> ({ 2026-02-21T09:38:51.3391637Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:38:51.3391893Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:38:51.3392128Z tt.reduce.return %128 : f32 
2026-02-21T09:38:51.3392388Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:38:51.3392629Z %67 = arith.addf %61, %66 : tensor<16xf32> 2026-02-21T09:38:51.3392899Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T09:38:51.3393132Z %68 = arith.muli %c128_i32, %c1_i32_4 : i32 2026-02-21T09:38:51.3393394Z %69 = arith.addi %arg3, %68 : i32 2026-02-21T09:38:51.3393739Z %70 = tt.descriptor_load %0[%4, %69] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:38:51.3394102Z %71 = arith.extf %70 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:38:51.3394405Z %72 = "tt.reduce"(%71) <{axis = 1 : i32}> ({ 2026-02-21T09:38:51.3394638Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:38:51.3394892Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:38:51.3395126Z tt.reduce.return %128 : f32 2026-02-21T09:38:51.3395385Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:38:51.3395678Z %73 = arith.truncf %72 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:38:51.3395961Z %74 = arith.extf %73 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:38:51.3396260Z %75 = arith.cmpf ogt, %58, %74 : tensor<16xf32> 2026-02-21T09:38:51.3396511Z %76 = arith.cmpf une, %58, %58 : tensor<16xf32> 2026-02-21T09:38:51.3396779Z %77 = arith.ori %75, %76 : tensor<16xi1> 2026-02-21T09:38:51.3397061Z %78 = arith.select %77, %58, %74 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:38:51.3397357Z %79 = arith.subf %58, %78 : tensor<16xf32> 2026-02-21T09:38:51.3397776Z %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:38:51.3398197Z %81 = arith.mulf %67, %80 : tensor<16xf32> 2026-02-21T09:38:51.3398528Z %82 = tt.expand_dims %78 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:38:51.3398868Z %83 = tt.broadcast %82 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:38:51.3399184Z %84 = arith.subf %71, %83 : tensor<16x128xf32> 2026-02-21T09:38:51.3399601Z %85 = 
tt.extern_elementwise %84 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:38:51.3400041Z %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({ 2026-02-21T09:38:51.3400311Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:38:51.3400552Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:38:51.3400820Z tt.reduce.return %128 : f32 2026-02-21T09:38:51.3401055Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:38:51.3401400Z %87 = arith.addf %81, %86 : tensor<16xf32> 2026-02-21T09:38:51.3401670Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:38:51.3401952Z %88 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T09:38:51.3402218Z %89 = arith.addi %arg3, %88 : i32 2026-02-21T09:38:51.3402550Z %90 = tt.descriptor_load %0[%4, %89] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:38:51.3402951Z %91 = arith.extf %90 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:38:51.3403235Z %92 = "tt.reduce"(%91) <{axis = 1 : i32}> ({ 2026-02-21T09:38:51.3403508Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:38:51.3403740Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:38:51.3404009Z tt.reduce.return %128 : f32 2026-02-21T09:38:51.3404268Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:38:51.3404544Z %93 = arith.truncf %92 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:38:51.3404926Z %94 = arith.extf %93 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:38:51.3405204Z %95 = arith.cmpf ogt, %78, %94 : tensor<16xf32> 2026-02-21T09:38:51.3405496Z %96 = arith.cmpf une, %78, %78 : tensor<16xf32> 2026-02-21T09:38:51.3405749Z %97 = arith.ori %95, %96 : tensor<16xi1> 2026-02-21T09:38:51.3406058Z %98 = arith.select %97, %78, %94 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:38:51.3406377Z %99 = arith.subf %78, %98 : tensor<16xf32> 2026-02-21T09:38:51.3406793Z %100 = tt.extern_elementwise %99 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 
2026-02-21T09:38:51.3407225Z %101 = arith.mulf %87, %100 : tensor<16xf32> 2026-02-21T09:38:51.3407518Z %102 = tt.expand_dims %98 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:38:51.3407885Z %103 = tt.broadcast %102 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:38:51.3408176Z %104 = arith.subf %91, %103 : tensor<16x128xf32> 2026-02-21T09:38:51.3408617Z %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:38:51.3409063Z %106 = "tt.reduce"(%105) <{axis = 1 : i32}> ({ 2026-02-21T09:38:51.3409295Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:38:51.3409540Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:38:51.3409763Z tt.reduce.return %128 : f32 2026-02-21T09:38:51.3410012Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:38:51.3410255Z %107 = arith.addf %101, %106 : tensor<16xf32> 2026-02-21T09:38:51.3410513Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:38:51.3410760Z %108 = arith.muli %c128_i32, %c3_i32 : i32 2026-02-21T09:38:51.3410988Z %109 = arith.addi %arg3, %108 : i32 2026-02-21T09:38:51.3411334Z %110 = tt.descriptor_load %0[%4, %109] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:38:51.3411725Z %111 = arith.extf %110 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:38:51.3412032Z %112 = "tt.reduce"(%111) <{axis = 1 : i32}> ({ 2026-02-21T09:38:51.3412261Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:38:51.3412507Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:38:51.3412760Z tt.reduce.return %128 : f32 2026-02-21T09:38:51.3412980Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:38:51.3413275Z %113 = arith.truncf %112 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:38:51.3413565Z %114 = arith.extf %113 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:38:51.3413867Z %115 = arith.cmpf ogt, %98, %114 : tensor<16xf32> 2026-02-21T09:38:51.3414124Z %116 = arith.cmpf une, %98, %98 : 
tensor<16xf32> 2026-02-21T09:38:51.3414393Z %117 = arith.ori %115, %116 : tensor<16xi1> 2026-02-21T09:38:51.3414703Z %118 = arith.select %117, %98, %114 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:38:51.3415046Z %119 = arith.subf %98, %118 : tensor<16xf32> 2026-02-21T09:38:51.3415469Z %120 = tt.extern_elementwise %119 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:38:51.3415867Z %121 = arith.mulf %107, %120 : tensor<16xf32> 2026-02-21T09:38:51.3416190Z %122 = tt.expand_dims %118 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:38:51.3416526Z %123 = tt.broadcast %122 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:38:51.3416836Z %124 = arith.subf %111, %123 : tensor<16x128xf32> 2026-02-21T09:38:51.3417281Z %125 = tt.extern_elementwise %124 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:38:51.3417699Z %126 = "tt.reduce"(%125) <{axis = 1 : i32}> ({ 2026-02-21T09:38:51.3417950Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:38:51.3418222Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:38:51.3418477Z tt.reduce.return %128 : f32 2026-02-21T09:38:51.3418730Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:38:51.3418973Z %127 = arith.addf %121, %126 : tensor<16xf32> 2026-02-21T09:38:51.3419264Z scf.yield %118, %127 : tensor<16xf32>, tensor<16xf32> 2026-02-21T09:38:51.3419524Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:38:51.3419888Z %9 = tt.descriptor_load %0[%4, %c6656_i32] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:38:51.3420256Z %10 = arith.extf %9 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:38:51.3420555Z %11 = "tt.reduce"(%10) <{axis = 1 : i32}> ({ 2026-02-21T09:38:51.3420811Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:38:51.3421032Z %50 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:38:51.3421289Z tt.reduce.return %50 : f32 
2026-02-21T09:38:51.3421511Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:38:51.3421835Z %12 = arith.truncf %11 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:38:51.3422113Z %13 = arith.extf %12 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:38:51.3422402Z %14 = arith.cmpf ogt, %8#0, %13 : tensor<16xf32> 2026-02-21T09:38:51.3422656Z %15 = arith.cmpf une, %8#0, %8#0 : tensor<16xf32> 2026-02-21T09:38:51.3422935Z %16 = arith.ori %14, %15 : tensor<16xi1> 2026-02-21T09:38:51.3423231Z %17 = arith.select %16, %8#0, %13 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:38:51.3423500Z %18 = arith.subf %8#0, %17 : tensor<16xf32> 2026-02-21T09:38:51.3423916Z %19 = tt.extern_elementwise %18 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:38:51.3424303Z %20 = arith.mulf %8#1, %19 : tensor<16xf32> 2026-02-21T09:38:51.3424614Z %21 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:38:51.3424964Z %22 = tt.broadcast %21 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:38:51.3425240Z %23 = arith.subf %10, %22 : tensor<16x128xf32> 2026-02-21T09:38:51.3425660Z %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:38:51.3426059Z %25 = "tt.reduce"(%24) <{axis = 1 : i32}> ({ 2026-02-21T09:38:51.3426312Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:38:51.3426533Z %50 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:38:51.3426783Z tt.reduce.return %50 : f32 2026-02-21T09:38:51.3427031Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:38:51.3427262Z %26 = arith.addf %20, %25 : tensor<16xf32> 2026-02-21T09:38:51.3427522Z %c6656_i32_2 = arith.constant 6656 : i32 2026-02-21T09:38:51.3427750Z %c512_i32_3 = arith.constant 512 : i32 2026-02-21T09:38:51.3428045Z scf.for %arg3 = %c0_i32 to %c6656_i32_2 step %c512_i32_3 : i32 { 2026-02-21T09:38:51.3428430Z %50 = 
tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:38:51.3428756Z %51 = tt.splat %arg3 : i32 -> tensor<128xi32> 2026-02-21T09:38:51.3429002Z %52 = arith.addi %51, %50 : tensor<128xi32> 2026-02-21T09:38:51.3429321Z %53 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:38:51.3429653Z %54 = arith.muli %53, %cst : tensor<16x1xi32> 2026-02-21T09:38:51.3429958Z %55 = tt.expand_dims %52 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:38:51.3430322Z %56 = tt.broadcast %54 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:38:51.3430625Z %57 = tt.broadcast %55 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:38:51.3430931Z %58 = arith.addi %56, %57 : tensor<16x128xi32> 2026-02-21T09:38:51.3431240Z %59 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3431678Z %60 = tt.addptr %59, %58 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:38:51.3432112Z %61 = tt.load %60 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3432464Z %62 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:38:51.3432814Z %63 = arith.extf %61 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:38:51.3433117Z %64 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:38:51.3433425Z %65 = arith.subf %63, %64 : tensor<16x128xf32> 2026-02-21T09:38:51.3433865Z %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:38:51.3434320Z %67 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:38:51.3434672Z %68 = tt.broadcast %67 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:38:51.3434949Z %69 = arith.divf %66, %68 : tensor<16x128xf32> 2026-02-21T09:38:51.3435255Z %70 = arith.truncf %69 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:38:51.3435595Z 
%71 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3435917Z %72 = tt.addptr %71, %58 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:38:51.3436244Z tt.store %72, %70 : tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3436492Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T09:38:51.3436748Z %73 = arith.muli %c128_i32, %c1_i32_4 : i32 2026-02-21T09:38:51.3436980Z %74 = arith.addi %arg3, %73 : i32 2026-02-21T09:38:51.3437283Z %75 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:38:51.3437598Z %76 = tt.splat %74 : i32 -> tensor<128xi32> 2026-02-21T09:38:51.3437836Z %77 = arith.addi %76, %75 : tensor<128xi32> 2026-02-21T09:38:51.3438153Z %78 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:38:51.3438458Z %79 = arith.muli %78, %cst : tensor<16x1xi32> 2026-02-21T09:38:51.3438778Z %80 = tt.expand_dims %77 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:38:51.3439109Z %81 = tt.broadcast %79 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:38:51.3439435Z %82 = tt.broadcast %80 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:38:51.3439735Z %83 = arith.addi %81, %82 : tensor<16x128xi32> 2026-02-21T09:38:51.3440007Z %84 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3440344Z %85 = tt.addptr %84, %83 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:38:51.3440689Z %86 = tt.load %85 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3441078Z %87 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:38:51.3441413Z %88 = arith.extf %86 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:38:51.3441849Z %89 = tt.broadcast %87 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:38:51.3442155Z %90 = arith.subf %88, %89 : tensor<16x128xf32> 2026-02-21T09:38:51.3442578Z %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = 
"__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:38:51.3443081Z %92 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:38:51.3443418Z %93 = tt.broadcast %92 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:38:51.3443729Z %94 = arith.divf %91, %93 : tensor<16x128xf32> 2026-02-21T09:38:51.3444042Z %95 = arith.truncf %94 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:38:51.3444386Z %96 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3444746Z %97 = tt.addptr %96, %83 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:38:51.3445113Z tt.store %97, %95 : tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3445401Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:38:51.3445640Z %98 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T09:38:51.3445906Z %99 = arith.addi %arg3, %98 : i32 2026-02-21T09:38:51.3446258Z %100 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:38:51.3446566Z %101 = tt.splat %99 : i32 -> tensor<128xi32> 2026-02-21T09:38:51.3446849Z %102 = arith.addi %101, %100 : tensor<128xi32> 2026-02-21T09:38:51.3447151Z %103 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:38:51.3447496Z %104 = arith.muli %103, %cst : tensor<16x1xi32> 2026-02-21T09:38:51.3447819Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:38:51.3448212Z %106 = tt.broadcast %104 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:38:51.3448575Z %107 = tt.broadcast %105 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:38:51.3448865Z %108 = arith.addi %106, %107 : tensor<16x128xi32> 2026-02-21T09:38:51.3449175Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3449497Z %110 = tt.addptr %109, %108 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:38:51.3449873Z %111 = tt.load %110 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 
2026-02-21T09:38:51.3450260Z %112 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:38:51.3450589Z %113 = arith.extf %111 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:38:51.3450926Z %114 = tt.broadcast %112 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:38:51.3451205Z %115 = arith.subf %113, %114 : tensor<16x128xf32> 2026-02-21T09:38:51.3451686Z %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:38:51.3452145Z %117 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:38:51.3452506Z %118 = tt.broadcast %117 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:38:51.3452817Z %119 = arith.divf %116, %118 : tensor<16x128xf32> 2026-02-21T09:38:51.3453101Z %120 = arith.truncf %119 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:38:51.3453445Z %121 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3453774Z %122 = tt.addptr %121, %108 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:38:51.3454109Z tt.store %122, %120 : tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3454383Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:38:51.3454612Z %123 = arith.muli %c128_i32, %c3_i32 : i32 2026-02-21T09:38:51.3454880Z %124 = arith.addi %arg3, %123 : i32 2026-02-21T09:38:51.3455216Z %125 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:38:51.3455537Z %126 = tt.splat %124 : i32 -> tensor<128xi32> 2026-02-21T09:38:51.3455786Z %127 = arith.addi %126, %125 : tensor<128xi32> 2026-02-21T09:38:51.3456105Z %128 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:38:51.3456437Z %129 = arith.muli %128, %cst : tensor<16x1xi32> 2026-02-21T09:38:51.3456737Z %130 = tt.expand_dims %127 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:38:51.3457099Z %131 = tt.broadcast %129 : 
tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:38:51.3457405Z %132 = tt.broadcast %130 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:38:51.3457714Z %133 = arith.addi %131, %132 : tensor<16x128xi32> 2026-02-21T09:38:51.3457992Z %134 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3458383Z %135 = tt.addptr %134, %133 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:38:51.3458758Z %136 = tt.load %135 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3459114Z %137 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:38:51.3459470Z %138 = arith.extf %136 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:38:51.3459777Z %139 = tt.broadcast %137 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:38:51.3460088Z %140 = arith.subf %138, %139 : tensor<16x128xf32> 2026-02-21T09:38:51.3460530Z %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:38:51.3460989Z %142 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:38:51.3461345Z %143 = tt.broadcast %142 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:38:51.3461666Z %144 = arith.divf %141, %143 : tensor<16x128xf32> 2026-02-21T09:38:51.3461975Z %145 = arith.truncf %144 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:38:51.3462281Z %146 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3462642Z %147 = tt.addptr %146, %133 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:38:51.3462966Z tt.store %147, %145 : tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3463213Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:38:51.3463524Z %27 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:38:51.3463824Z %28 = tt.splat %c6656_i32_2 : i32 -> tensor<128xi32> 2026-02-21T09:38:51.3464108Z %29 = arith.addi 
%28, %27 : tensor<128xi32> 2026-02-21T09:38:51.3464394Z %30 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:38:51.3464717Z %31 = arith.muli %30, %cst : tensor<16x1xi32> 2026-02-21T09:38:51.3465033Z %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:38:51.3465358Z %33 = tt.broadcast %31 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:38:51.3465686Z %34 = tt.broadcast %32 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:38:51.3465957Z %35 = arith.addi %33, %34 : tensor<16x128xi32> 2026-02-21T09:38:51.3466253Z %36 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3466590Z %37 = tt.addptr %36, %35 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:38:51.3466928Z %38 = tt.load %37 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3467298Z %39 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:38:51.3467622Z %40 = arith.extf %38 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:38:51.3467947Z %41 = tt.broadcast %39 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:38:51.3468278Z %42 = arith.subf %40, %41 : tensor<16x128xf32> 2026-02-21T09:38:51.3468706Z %43 = tt.extern_elementwise %42 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:38:51.3469213Z %44 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:38:51.3469534Z %45 = tt.broadcast %44 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:38:51.3469831Z %46 = arith.divf %43, %45 : tensor<16x128xf32> 2026-02-21T09:38:51.3470107Z %47 = arith.truncf %46 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:38:51.3470442Z %48 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3470783Z %49 = tt.addptr %48, %35 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:38:51.3471076Z 
tt.store %49, %47 : tensor<16x128x!tt.ptr> 2026-02-21T09:38:51.3471494Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T09:38:51.3471857Z tt.return 2026-02-21T09:38:51.3472058Z } 2026-02-21T09:38:51.3472224Z } 2026-02-21T09:38:51.3472345Z 2026-02-21T09:38:51.3472416Z {-# 2026-02-21T09:38:51.3472611Z external_resources: { 2026-02-21T09:38:51.3472809Z mlir_reproducer: { 2026-02-21T09:38:51.3477217Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, 
triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:38:51.3481821Z disable_threading: false, 2026-02-21T09:38:51.3482059Z verify_each: true 2026-02-21T09:38:51.3482245Z } 2026-02-21T09:38:51.3482435Z } 2026-02-21T09:38:51.3482594Z #-} 2026-02-21T09:38:51.3483085Z /tmp/torchinductor_root/7s/c7s4dthdwq5dlgog2pjjyhvd6ij663jx2xp7majmr2ptyyxqymyr.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:38:51.3484346Z /tmp/torchinductor_root/7s/c7s4dthdwq5dlgog2pjjyhvd6ij663jx2xp7majmr2ptyyxqymyr.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:38:51.3485359Z [39s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:38:51.3486467Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_sm_multiplier=32, num_stages=3, num_warps=32, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[2, 3], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:38:51.3487566Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:38:51.3487860Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:38:51.7437238Z module { 2026-02-21T09:38:51.7438905Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:38:51.7439460Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:38:51.7439782Z %cst = arith.constant dense<0.000000e+00> : tensor<8x1024xf16> 2026-02-21T09:38:51.7440322Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T09:38:51.7440611Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:38:51.7440839Z %c592_i32 = arith.constant 592 : i32 2026-02-21T09:38:51.7441134Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<8x1024xf32> 2026-02-21T09:38:51.7441447Z %cst_1 = arith.constant dense<0xFC00> : tensor<8x1024xf16> 2026-02-21T09:38:51.7443469Z %cst_2 = arith.constant dense<6784> : tensor<8x1xi32> 2026-02-21T09:38:51.7443775Z %cst_3 = arith.constant dense<6784> : tensor<1024xi32> 2026-02-21T09:38:51.7444066Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32> 2026-02-21T09:38:51.7444378Z %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32> 2026-02-21T09:38:51.7444631Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:38:51.7444886Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:38:51.7445109Z %c6784_i32 = arith.constant 6784 : i32 2026-02-21T09:38:51.7445364Z %c6784_i64 = arith.constant 6784 : i64 2026-02-21T09:38:51.7445627Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:38:51.7445982Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c6784_i32], [%c6784_i64, %c1_i64] : , > 2026-02-21T09:38:51.7446394Z %1 = tt.get_program_id x : i32 2026-02-21T09:38:51.7446644Z scf.for %arg2 = %1 to %c512_i32 step %c592_i32 : i32 { 2026-02-21T09:38:51.7446943Z %2 = arith.muli %arg2, %c8_i32 : i32 2026-02-21T09:38:51.7447230Z %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:38:51.7447558Z %4 = tt.splat %2 : i32 -> tensor<8xi32> 2026-02-21T09:38:51.7447834Z %5 = arith.addi %4, %3 : 
tensor<8xi32> 2026-02-21T09:38:51.7448076Z %c6144_i32 = arith.constant 6144 : i32 2026-02-21T09:38:51.7448366Z %c3072_i32 = arith.constant 3072 : i32 2026-02-21T09:38:51.7448865Z %6:2 = scf.for %arg3 = %c0_i32 to %c6144_i32 step %c3072_i32 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<8xf32>, tensor<8xf32>) : i32 { 2026-02-21T09:38:51.7449355Z %66 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:38:51.7449711Z %67 = tt.splat %arg3 : i32 -> tensor<1024xi32> 2026-02-21T09:38:51.7450010Z %68 = arith.addi %67, %66 : tensor<1024xi32> 2026-02-21T09:38:51.7450288Z %69 = arith.cmpi slt, %68, %cst_3 : tensor<1024xi32> 2026-02-21T09:38:51.7450639Z %70 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:38:51.7450948Z %71 = arith.muli %70, %cst_2 : tensor<8x1xi32> 2026-02-21T09:38:51.7451292Z %72 = tt.expand_dims %68 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:38:51.7451712Z %73 = tt.broadcast %71 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:38:51.7452063Z %74 = tt.broadcast %72 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:38:51.7452387Z %75 = arith.addi %73, %74 : tensor<8x1024xi32> 2026-02-21T09:38:51.7452685Z %76 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:38:51.7453169Z %77 = tt.addptr %76, %75 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:38:51.7453531Z %78 = tt.expand_dims %69 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:38:51.7453914Z %79 = tt.broadcast %78 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:38:51.7454243Z %80 = tt.load %77, %79, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T09:38:51.7454567Z %81 = arith.select %79, %80, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:38:51.7454955Z %82 = arith.extf %81 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:38:51.7455227Z %83 = "tt.reduce"(%82) <{axis = 1 : i32}> ({ 2026-02-21T09:38:51.7455493Z ^bb0(%arg6: 
f32, %arg7: f32): 2026-02-21T09:38:51.7455724Z %175 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:38:51.7455987Z tt.reduce.return %175 : f32 2026-02-21T09:38:51.7456314Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:38:51.7456587Z %84 = arith.truncf %83 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:38:51.7456888Z %85 = arith.extf %84 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:38:51.7457154Z %86 = arith.cmpf ogt, %arg4, %85 : tensor<8xf32> 2026-02-21T09:38:51.7457438Z %87 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32> 2026-02-21T09:38:51.7457690Z %88 = arith.ori %86, %87 : tensor<8xi1> 2026-02-21T09:38:51.7457999Z %89 = arith.select %88, %arg4, %85 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:38:51.7458284Z %90 = arith.subf %arg4, %89 : tensor<8xf32> 2026-02-21T09:38:51.7458720Z %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:38:51.7459139Z %92 = arith.mulf %arg5, %91 : tensor<8xf32> 2026-02-21T09:38:51.7459422Z %93 = tt.expand_dims %89 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:38:51.7459776Z %94 = arith.extf %80 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:38:51.7460086Z %95 = tt.broadcast %93 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:38:51.7460390Z %96 = arith.subf %94, %95 : tensor<8x1024xf32> 2026-02-21T09:38:51.7460816Z %97 = tt.extern_elementwise %96 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:38:51.7461261Z %98 = arith.select %79, %97, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T09:38:51.7461608Z %99 = "tt.reduce"(%98) <{axis = 1 : i32}> ({ 2026-02-21T09:38:51.7461847Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:38:51.7462096Z %175 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:38:51.7462327Z tt.reduce.return %175 : f32 2026-02-21T09:38:51.7462585Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 
2026-02-21T09:38:51.7462853Z %100 = arith.addf %92, %99 : tensor<8xf32> 2026-02-21T09:38:51.7463091Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:38:51.7463355Z %101 = arith.muli %c1024_i32, %c1_i32 : i32 2026-02-21T09:38:51.7463591Z %102 = arith.addi %arg3, %101 : i32 2026-02-21T09:38:51.7463903Z %103 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:38:51.7464202Z %104 = tt.splat %102 : i32 -> tensor<1024xi32> 2026-02-21T09:38:51.7464482Z %105 = arith.addi %104, %103 : tensor<1024xi32> 2026-02-21T09:38:51.7464773Z %106 = arith.cmpi slt, %105, %cst_3 : tensor<1024xi32> 2026-02-21T09:38:51.7465076Z %107 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:38:51.7465409Z %108 = arith.muli %107, %cst_2 : tensor<8x1xi32> 2026-02-21T09:38:51.7465713Z %109 = tt.expand_dims %105 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:38:51.7466079Z %110 = tt.broadcast %108 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:38:51.7466454Z %111 = tt.broadcast %109 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:38:51.7466768Z %112 = arith.addi %110, %111 : tensor<8x1024xi32> 2026-02-21T09:38:51.7467078Z %113 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:38:51.7467411Z %114 = tt.addptr %113, %112 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:38:51.7467794Z %115 = tt.expand_dims %106 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:38:51.7468139Z %116 = tt.broadcast %115 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:38:51.7468469Z %117 = tt.load %114, %116, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T09:38:51.7468814Z %118 = arith.select %116, %117, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:38:51.7469147Z %119 = arith.extf %118 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:38:51.7469521Z %120 = "tt.reduce"(%119) <{axis = 1 : i32}> ({ 2026-02-21T09:38:51.7469762Z ^bb0(%arg6: f32, %arg7: 
f32): 2026-02-21T09:38:51.7470022Z %175 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:38:51.7470260Z tt.reduce.return %175 : f32 2026-02-21T09:38:51.7470517Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:38:51.7470788Z %121 = arith.truncf %120 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:38:51.7471101Z %122 = arith.extf %121 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:38:51.7482416Z %123 = arith.cmpf ogt, %89, %122 : tensor<8xf32> 2026-02-21T09:38:51.7482753Z %124 = arith.cmpf une, %89, %89 : tensor<8xf32> 2026-02-21T09:38:51.7483011Z %125 = arith.ori %123, %124 : tensor<8xi1> 2026-02-21T09:38:51.7483317Z %126 = arith.select %125, %89, %122 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:38:51.7483667Z %127 = arith.subf %89, %126 : tensor<8xf32> 2026-02-21T09:38:51.7484094Z %128 = tt.extern_elementwise %127 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:38:51.7484532Z %129 = arith.mulf %100, %128 : tensor<8xf32> 2026-02-21T09:38:51.7484833Z %130 = tt.expand_dims %126 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:38:51.7485176Z %131 = arith.extf %117 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:38:51.7485488Z %132 = tt.broadcast %130 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:38:51.7485800Z %133 = arith.subf %131, %132 : tensor<8x1024xf32> 2026-02-21T09:38:51.7486253Z %134 = tt.extern_elementwise %133 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:38:51.7486721Z %135 = arith.select %116, %134, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T09:38:51.7487052Z %136 = "tt.reduce"(%135) <{axis = 1 : i32}> ({ 2026-02-21T09:38:51.7487290Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:38:51.7487550Z %175 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:38:51.7487808Z tt.reduce.return %175 : f32 2026-02-21T09:38:51.7488034Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 
2026-02-21T09:38:51.7488303Z %137 = arith.addf %129, %136 : tensor<8xf32> 2026-02-21T09:38:51.7488536Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:38:51.7488790Z %138 = arith.muli %c1024_i32, %c2_i32 : i32 2026-02-21T09:38:51.7489019Z %139 = arith.addi %arg3, %138 : i32 2026-02-21T09:38:51.7489321Z %140 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:38:51.7489619Z %141 = tt.splat %139 : i32 -> tensor<1024xi32> 2026-02-21T09:38:51.7489891Z %142 = arith.addi %141, %140 : tensor<1024xi32> 2026-02-21T09:38:51.7490176Z %143 = arith.cmpi slt, %142, %cst_3 : tensor<1024xi32> 2026-02-21T09:38:51.7490480Z %144 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:38:51.7490826Z %145 = arith.muli %144, %cst_2 : tensor<8x1xi32> 2026-02-21T09:38:51.7491283Z %146 = tt.expand_dims %142 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:38:51.7491703Z %147 = tt.broadcast %145 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:38:51.7492054Z %148 = tt.broadcast %146 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:38:51.7492352Z %149 = arith.addi %147, %148 : tensor<8x1024xi32> 2026-02-21T09:38:51.7492678Z %150 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:38:51.7493027Z %151 = tt.addptr %150, %149 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:38:51.7493421Z %152 = tt.expand_dims %143 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:38:51.7493777Z %153 = tt.broadcast %152 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:38:51.7494122Z %154 = tt.load %151, %153, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T09:38:51.7494545Z %155 = arith.select %153, %154, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:38:51.7494885Z %156 = arith.extf %155 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:38:51.7495197Z %157 = "tt.reduce"(%156) <{axis = 1 : i32}> ({ 2026-02-21T09:38:51.7495437Z ^bb0(%arg6: f32, %arg7: 
f32): 2026-02-21T09:38:51.7495706Z %175 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:38:51.7495955Z tt.reduce.return %175 : f32 2026-02-21T09:38:51.7496224Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:38:51.7496533Z %158 = arith.truncf %157 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:38:51.7496833Z %159 = arith.extf %158 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:38:51.7497137Z %160 = arith.cmpf ogt, %126, %159 : tensor<8xf32> 2026-02-21T09:38:51.7497407Z %161 = arith.cmpf une, %126, %126 : tensor<8xf32> 2026-02-21T09:38:51.7497688Z %162 = arith.ori %160, %161 : tensor<8xi1> 2026-02-21T09:38:51.7497975Z %163 = arith.select %162, %126, %159 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:38:51.7498295Z %164 = arith.subf %126, %163 : tensor<8xf32> 2026-02-21T09:38:51.7498738Z %165 = tt.extern_elementwise %164 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:38:51.7499162Z %166 = arith.mulf %137, %165 : tensor<8xf32> 2026-02-21T09:38:51.7499496Z %167 = tt.expand_dims %163 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:38:51.7499818Z %168 = arith.extf %154 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:38:51.7500148Z %169 = tt.broadcast %167 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:38:51.7500452Z %170 = arith.subf %168, %169 : tensor<8x1024xf32> 2026-02-21T09:38:51.7500864Z %171 = tt.extern_elementwise %170 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:38:51.7501324Z %172 = arith.select %153, %171, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T09:38:51.7501647Z %173 = "tt.reduce"(%172) <{axis = 1 : i32}> ({ 2026-02-21T09:38:51.7501908Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:38:51.7502133Z %175 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:38:51.7502381Z tt.reduce.return %175 : f32 2026-02-21T09:38:51.7502624Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 
2026-02-21T09:38:51.7502863Z %174 = arith.addf %166, %173 : tensor<8xf32> 2026-02-21T09:38:51.7503146Z scf.yield %163, %174 : tensor<8xf32>, tensor<8xf32> 2026-02-21T09:38:51.7503386Z } {tt.flatten} 2026-02-21T09:38:51.7503654Z %7 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:38:51.7503953Z %8 = tt.splat %c6144_i32 : i32 -> tensor<1024xi32> 2026-02-21T09:38:51.7504224Z %9 = arith.addi %8, %7 : tensor<1024xi32> 2026-02-21T09:38:51.7504474Z %10 = arith.cmpi slt, %9, %cst_3 : tensor<1024xi32> 2026-02-21T09:38:51.7504859Z %11 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:38:51.7505168Z %12 = arith.muli %11, %cst_2 : tensor<8x1xi32> 2026-02-21T09:38:51.7505456Z %13 = tt.expand_dims %9 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:38:51.7505808Z %14 = tt.broadcast %12 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:38:51.7506107Z %15 = tt.broadcast %13 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:38:51.7506403Z %16 = arith.addi %14, %15 : tensor<8x1024xi32> 2026-02-21T09:38:51.7506699Z %17 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:38:51.7507012Z %18 = tt.addptr %17, %16 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:38:51.7507366Z %19 = tt.expand_dims %10 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:38:51.7507738Z %20 = tt.broadcast %19 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:38:51.7508060Z %21 = tt.load %18, %20, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T09:38:51.7508355Z %22 = arith.select %20, %21, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:38:51.7508690Z %23 = arith.extf %22 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:38:51.7508977Z %24 = "tt.reduce"(%23) <{axis = 1 : i32}> ({ 2026-02-21T09:38:51.7509203Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:38:51.7509445Z %66 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:38:51.7509673Z 
tt.reduce.return %66 : f32 2026-02-21T09:38:51.7509920Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:38:51.7510173Z %25 = arith.truncf %24 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:38:51.7510459Z %26 = arith.extf %25 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:38:51.7510746Z %27 = arith.cmpf ogt, %6#0, %26 : tensor<8xf32> 2026-02-21T09:38:51.7511001Z %28 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32> 2026-02-21T09:38:51.7511265Z %29 = arith.ori %27, %28 : tensor<8xi1> 2026-02-21T09:38:51.7511525Z %30 = arith.select %29, %6#0, %26 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:38:51.7511853Z %31 = arith.subf %6#0, %30 : tensor<8xf32> 2026-02-21T09:38:51.7512335Z %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:38:51.7512759Z %33 = arith.mulf %6#1, %32 : tensor<8xf32> 2026-02-21T09:38:51.7513065Z %34 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:38:51.7513380Z %35 = arith.extf %21 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:38:51.7513703Z %36 = tt.broadcast %34 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:38:51.7513967Z %37 = arith.subf %35, %36 : tensor<8x1024xf32> 2026-02-21T09:38:51.7514393Z %38 = tt.extern_elementwise %37 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:38:51.7514864Z %39 = arith.select %20, %38, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T09:38:51.7515151Z %40 = "tt.reduce"(%39) <{axis = 1 : i32}> ({ 2026-02-21T09:38:51.7515403Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:38:51.7515620Z %66 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:38:51.7515867Z tt.reduce.return %66 : f32 2026-02-21T09:38:51.7516089Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:38:51.7516345Z %41 = arith.addf %33, %40 : tensor<8xf32> 2026-02-21T09:38:51.7516577Z %c6144_i32_6 = arith.constant 6144 : i32 
2026-02-21T09:38:51.7516820Z %c3072_i32_7 = arith.constant 3072 : i32 2026-02-21T09:38:51.7517114Z scf.for %arg3 = %c0_i32 to %c6144_i32_6 step %c3072_i32_7 : i32 { 2026-02-21T09:38:51.7517436Z %66 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:38:51.7517775Z %67 = tt.splat %arg3 : i32 -> tensor<1024xi32> 2026-02-21T09:38:51.7518089Z %68 = arith.addi %67, %66 : tensor<1024xi32> 2026-02-21T09:38:51.7518359Z %69 = arith.cmpi slt, %68, %cst_3 : tensor<1024xi32> 2026-02-21T09:38:51.7518703Z %70 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:38:51.7519101Z %71 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:38:51.7519445Z %72 = arith.extf %70 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:38:51.7519741Z %73 = tt.broadcast %71 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:38:51.7520034Z %74 = arith.subf %72, %73 : tensor<8x1024xf32> 2026-02-21T09:38:51.7520437Z %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:38:51.7520938Z %76 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:38:51.7521284Z %77 = tt.broadcast %76 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:38:51.7521579Z %78 = arith.divf %75, %77 : tensor<8x1024xf32> 2026-02-21T09:38:51.7521900Z %79 = arith.truncf %78 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:38:51.7522236Z %80 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:38:51.7522525Z %81 = arith.muli %80, %cst_2 : tensor<8x1xi32> 2026-02-21T09:38:51.7522845Z %82 = tt.expand_dims %68 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:38:51.7523172Z %83 = tt.broadcast %81 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:38:51.7523494Z %84 = tt.broadcast %82 : tensor<1x1024xi32> -> tensor<8x1024xi32> 
2026-02-21T09:38:51.7523763Z %85 = arith.addi %83, %84 : tensor<8x1024xi32> 2026-02-21T09:38:51.7524064Z %86 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:38:51.7524409Z %87 = tt.addptr %86, %85 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:38:51.7524739Z %88 = tt.expand_dims %69 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:38:51.7525092Z %89 = tt.broadcast %88 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:38:51.7525376Z tt.store %87, %79, %89 : tensor<8x1024x!tt.ptr> 2026-02-21T09:38:51.7525647Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:38:51.7525871Z %90 = arith.muli %c1024_i32, %c1_i32 : i32 2026-02-21T09:38:51.7526102Z %91 = arith.addi %arg3, %90 : i32 2026-02-21T09:38:51.7526378Z %92 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:38:51.7526698Z %93 = tt.splat %91 : i32 -> tensor<1024xi32> 2026-02-21T09:38:51.7526940Z %94 = arith.addi %93, %92 : tensor<1024xi32> 2026-02-21T09:38:51.7527222Z %95 = arith.cmpi slt, %94, %cst_3 : tensor<1024xi32> 2026-02-21T09:38:51.7527563Z %96 = tt.descriptor_load %0[%2, %91] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:38:51.7527972Z %97 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:38:51.7528320Z %98 = arith.extf %96 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:38:51.7528616Z %99 = tt.broadcast %97 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:38:51.7528924Z %100 = arith.subf %98, %99 : tensor<8x1024xf32> 2026-02-21T09:38:51.7529341Z %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:38:51.7529835Z %102 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:38:51.7530199Z %103 = tt.broadcast %102 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:38:51.7530486Z %104 = arith.divf %101, %103 : tensor<8x1024xf32> 
2026-02-21T09:38:51.7530804Z %105 = arith.truncf %104 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:38:51.7531183Z %106 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:38:51.7531517Z %107 = arith.muli %106, %cst_2 : tensor<8x1xi32> 2026-02-21T09:38:51.7531869Z %108 = tt.expand_dims %94 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:38:51.7532235Z %109 = tt.broadcast %107 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:38:51.7532575Z %110 = tt.broadcast %108 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:38:51.7532868Z %111 = arith.addi %109, %110 : tensor<8x1024xi32> 2026-02-21T09:38:51.7533179Z %112 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:38:51.7533509Z %113 = tt.addptr %112, %111 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:38:51.7533965Z %114 = tt.expand_dims %95 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:38:51.7534325Z %115 = tt.broadcast %114 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:38:51.7534674Z tt.store %113, %105, %115 : tensor<8x1024x!tt.ptr> 2026-02-21T09:38:51.7534976Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:38:51.7535223Z %116 = arith.muli %c1024_i32, %c2_i32 : i32 2026-02-21T09:38:51.7535501Z %117 = arith.addi %arg3, %116 : i32 2026-02-21T09:38:51.7535789Z %118 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:38:51.7536136Z %119 = tt.splat %117 : i32 -> tensor<1024xi32> 2026-02-21T09:38:51.7536396Z %120 = arith.addi %119, %118 : tensor<1024xi32> 2026-02-21T09:38:51.7536698Z %121 = arith.cmpi slt, %120, %cst_3 : tensor<1024xi32> 2026-02-21T09:38:51.7537090Z %122 = tt.descriptor_load %0[%2, %117] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:38:51.7537490Z %123 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:38:51.7537866Z %124 = arith.extf %122 : tensor<8x1024xf16> to tensor<8x1024xf32> 
2026-02-21T09:38:51.7538193Z %125 = tt.broadcast %123 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:38:51.7538518Z %126 = arith.subf %124, %125 : tensor<8x1024xf32> 2026-02-21T09:38:51.7538984Z %127 = tt.extern_elementwise %126 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:38:51.7539460Z %128 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:38:51.7539837Z %129 = tt.broadcast %128 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:38:51.7540104Z %130 = arith.divf %127, %129 : tensor<8x1024xf32> 2026-02-21T09:38:51.7540424Z %131 = arith.truncf %130 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:38:51.7540766Z %132 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:38:51.7541120Z %133 = arith.muli %132, %cst_2 : tensor<8x1xi32> 2026-02-21T09:38:51.7541469Z %134 = tt.expand_dims %120 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:38:51.7541838Z %135 = tt.broadcast %133 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:38:51.7542189Z %136 = tt.broadcast %134 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:38:51.7542473Z %137 = arith.addi %135, %136 : tensor<8x1024xi32> 2026-02-21T09:38:51.7542780Z %138 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:38:51.7543132Z %139 = tt.addptr %138, %137 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:38:51.7543479Z %140 = tt.expand_dims %121 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:38:51.7543839Z %141 = tt.broadcast %140 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:38:51.7544143Z tt.store %139, %131, %141 : tensor<8x1024x!tt.ptr> 2026-02-21T09:38:51.7544470Z } {tt.flatten} 2026-02-21T09:38:51.7544720Z %42 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:38:51.7545029Z %43 = tt.splat %c6144_i32_6 : i32 -> tensor<1024xi32> 
2026-02-21T09:38:51.7545254Z %44 = arith.addi %43, %42 : tensor<1024xi32> 2026-02-21T09:38:51.7545476Z %45 = arith.cmpi slt, %44, %cst_3 : tensor<1024xi32> 2026-02-21T09:38:51.7545813Z %46 = tt.descriptor_load %0[%2, %c6144_i32_6] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:38:51.7546162Z %47 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:38:51.7546451Z %48 = arith.extf %46 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:38:51.7546714Z %49 = tt.broadcast %47 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:38:51.7546946Z %50 = arith.subf %48, %49 : tensor<8x1024xf32> 2026-02-21T09:38:51.7547367Z %51 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:38:51.7547775Z %52 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:38:51.7548060Z %53 = tt.broadcast %52 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:38:51.7548296Z %54 = arith.divf %51, %53 : tensor<8x1024xf32> 2026-02-21T09:38:51.7548532Z %55 = arith.truncf %54 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:38:51.7548816Z %56 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:38:51.7549065Z %57 = arith.muli %56, %cst_2 : tensor<8x1xi32> 2026-02-21T09:38:51.7549326Z %58 = tt.expand_dims %44 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:38:51.7549609Z %59 = tt.broadcast %57 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:38:51.7549876Z %60 = tt.broadcast %58 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:38:51.7550118Z %61 = arith.addi %59, %60 : tensor<8x1024xi32> 2026-02-21T09:38:51.7550349Z %62 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:38:51.7550630Z %63 = tt.addptr %62, %61 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:38:51.7550920Z %64 = tt.expand_dims %45 {axis = 0 : i32} : 
tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:38:51.7551208Z %65 = tt.broadcast %64 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:38:51.7551454Z tt.store %63, %55, %65 : tensor<8x1024x!tt.ptr> 2026-02-21T09:38:51.7551869Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T09:38:51.7552210Z tt.return 2026-02-21T09:38:51.7552332Z } 2026-02-21T09:38:51.7552452Z } 2026-02-21T09:38:51.7552520Z 2026-02-21T09:38:51.7552569Z {-# 2026-02-21T09:38:51.7552699Z external_resources: { 2026-02-21T09:38:51.7552854Z mlir_reproducer: { 2026-02-21T09:38:51.7557142Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, 
triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:38:51.7561569Z disable_threading: false, 2026-02-21T09:38:51.7561737Z verify_each: true 2026-02-21T09:38:51.7561875Z } 2026-02-21T09:38:51.7561991Z } 2026-02-21T09:38:51.7562100Z #-} 2026-02-21T09:38:51.7562571Z /tmp/torchinductor_root/au/cauvte7suhxwgudkpzlyh47qnfkfxs2lm5bmex7bmeotoztnmarb.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:38:51.7563735Z /tmp/torchinductor_root/au/cauvte7suhxwgudkpzlyh47qnfkfxs2lm5bmex7bmeotoztnmarb.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:38:51.7564711Z [39s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:38:51.7565750Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=4, num_warps=32, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[4, 0], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:38:51.7566686Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:38:51.7566938Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T09:38:57.4821020Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 12.1 configs/s 2026-02-21T09:38:57.4833618Z [45s] Adaptive compile timeout: 30s (90% percentile=10.5s, bounds=[30.0s, 30s]) 2026-02-21T09:38:58.1523000Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1470.2 configs/s 2026-02-21T09:38:58.2057645Z [45s] Initial random population of 100, 5 starting points: 2026-02-21T09:38:58.2059508Z error=14 2026-02-21T09:38:58.2070792Z ok=86 2026-02-21T09:38:58.2074711Z min=0.0450 2026-02-21T09:38:58.2076902Z mid=0.5674 2026-02-21T09:38:58.2077086Z max=156.2122 2026-02-21T09:38:58.2077240Z best={'block_sizes': [1, 8192], 2026-02-21T09:38:58.2077551Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:38:58.2078168Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:38:58.2078366Z 'num_sm_multiplier': 2, 2026-02-21T09:38:58.2078532Z 'num_stages': 7, 2026-02-21T09:38:58.2078669Z 'num_warps': 32, 2026-02-21T09:38:58.2078828Z 'pid_type': 'persistent_blocked', 2026-02-21T09:38:58.2079012Z 'range_flattens': [True, True], 2026-02-21T09:38:58.2079195Z 'range_multi_buffers': [False, None], 2026-02-21T09:38:58.2079376Z 'range_num_stages': [4, 3], 2026-02-21T09:38:58.2079546Z 'range_unroll_factors': [2, 3], 2026-02-21T09:38:58.2079725Z 'range_warp_specializes': [False, False]} 2026-02-21T09:38:59.3475588Z [45s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:38:59.3476016Z [47s] Generation 1 starting: 78 neighbors, 5 active search path(s) 2026-02-21T09:39:08.9813920Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 9.1 configs/s 2026-02-21T09:39:13.9214445Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 83/83 17.0 configs/s 2026-02-21T09:39:15.9897017Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 487.8 2026-02-21T09:39:15.9897791Z configs/s 2026-02-21T09:39:16.1384118Z [63s] Generation 1 complete: 2026-02-21T09:39:16.1388711Z 
ok=84 2026-02-21T09:39:16.1392946Z min=0.0307 2026-02-21T09:39:16.1397722Z mid=0.0533 2026-02-21T09:39:16.1401997Z max=0.3523 2026-02-21T09:39:16.1407568Z best={'block_sizes': [1, 8192], 2026-02-21T09:39:16.1409321Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:39:16.1409679Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:39:16.1409881Z 'num_stages': 7, 2026-02-21T09:39:16.1410041Z 'num_warps': 8, 2026-02-21T09:39:16.1410195Z 'pid_type': 'flat', 2026-02-21T09:39:16.1410359Z 'range_flattens': [None, True], 2026-02-21T09:39:16.1410551Z 'range_multi_buffers': [None, None], 2026-02-21T09:39:16.1410743Z 'range_num_stages': [0, 3], 2026-02-21T09:39:16.1410954Z 'range_unroll_factors': [0, 3], 2026-02-21T09:39:16.1411164Z 'range_warp_specializes': [None, False]} 2026-02-21T09:39:16.1411476Z [63s] Fitting surrogate: 184 points, 184 targets 2026-02-21T09:39:17.2129306Z [64s] Generation 2 starting: 72 neighbors, 5 active search path(s) 2026-02-21T09:39:27.8572167Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 6.9 configs/s 2026-02-21T09:39:32.3902748Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 16.9 configs/s 2026-02-21T09:39:34.6995557Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 437.9 2026-02-21T09:39:34.6999317Z configs/s 2026-02-21T09:39:34.8660285Z [82s] Generation 2 complete: 2026-02-21T09:39:34.8660608Z ok=78 2026-02-21T09:39:34.8664461Z min=0.0266 2026-02-21T09:39:34.8666490Z mid=0.0410 2026-02-21T09:39:34.8666659Z max=0.4158 2026-02-21T09:39:34.8666812Z best={'block_sizes': [1, 8192], 2026-02-21T09:39:34.8667094Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:39:34.8667386Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:39:34.8667574Z 'num_stages': 7, 2026-02-21T09:39:34.8667721Z 'num_warps': 4, 2026-02-21T09:39:34.8667859Z 'pid_type': 'flat', 2026-02-21T09:39:34.8668019Z 'range_flattens': [None, 
True], 2026-02-21T09:39:34.8668192Z 'range_multi_buffers': [None, None], 2026-02-21T09:39:34.8668379Z 'range_num_stages': [0, 3], 2026-02-21T09:39:34.8668540Z 'range_unroll_factors': [0, 3], 2026-02-21T09:39:34.8668729Z 'range_warp_specializes': [None, False]} 2026-02-21T09:39:34.8676126Z [82s] Fitting surrogate: 262 points, 262 targets 2026-02-21T09:39:35.8042851Z [83s] Generation 3 starting: 59 neighbors, 5 active search path(s) 2026-02-21T09:39:48.9159505Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 1.5 configs/s 2026-02-21T09:39:52.6877518Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 63/63 16.9 configs/s 2026-02-21T09:39:55.8447944Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 321.3 2026-02-21T09:39:55.8448744Z configs/s 2026-02-21T09:39:56.0796534Z [103s] Generation 3 complete: 2026-02-21T09:39:56.0798332Z ok=65 2026-02-21T09:39:56.0798509Z min=0.0266 2026-02-21T09:39:56.0798646Z mid=0.0348 2026-02-21T09:39:56.0798781Z max=0.5059 2026-02-21T09:39:56.0798924Z best={'block_sizes': [1, 8192], 2026-02-21T09:39:56.0799190Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:39:56.0799477Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:39:56.0799674Z 'num_stages': 7, 2026-02-21T09:39:56.0799829Z 'num_warps': 4, 2026-02-21T09:39:56.0799976Z 'pid_type': 'flat', 2026-02-21T09:39:56.0800145Z 'range_flattens': [None, True], 2026-02-21T09:39:56.0800329Z 'range_multi_buffers': [None, None], 2026-02-21T09:39:56.0800524Z 'range_num_stages': [0, 3], 2026-02-21T09:39:56.0801088Z 'range_unroll_factors': [0, 3], 2026-02-21T09:39:56.0801318Z 'range_warp_specializes': [None, False]} 2026-02-21T09:39:56.0814242Z [103s] Fitting surrogate: 327 points, 327 targets 2026-02-21T09:39:56.8741277Z [104s] Generation 4 starting: 49 neighbors, 4 active search path(s) 2026-02-21T09:40:04.8681493Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 3.0 configs/s 
2026-02-21T09:40:07.9249215Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 16.9 configs/s 2026-02-21T09:40:10.2020606Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 445.8 2026-02-21T09:40:10.2024163Z configs/s 2026-02-21T09:40:10.3711163Z [118s] Generation 4 complete: 2026-02-21T09:40:10.3717349Z ok=54 2026-02-21T09:40:10.3718807Z min=0.0266 2026-02-21T09:40:10.3719003Z mid=0.0307 2026-02-21T09:40:10.3719137Z max=0.1577 2026-02-21T09:40:10.3719299Z best={'block_sizes': [1, 8192], 2026-02-21T09:40:10.3719597Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:40:10.3719890Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:40:10.3720100Z 'num_stages': 6, 2026-02-21T09:40:10.3720245Z 'num_warps': 1, 2026-02-21T09:40:10.3720399Z 'pid_type': 'flat', 2026-02-21T09:40:10.3720560Z 'range_flattens': [None, True], 2026-02-21T09:40:10.3720751Z 'range_multi_buffers': [None, None], 2026-02-21T09:40:10.3720941Z 'range_num_stages': [0, 4], 2026-02-21T09:40:10.3721120Z 'range_unroll_factors': [0, 0], 2026-02-21T09:40:10.3721309Z 'range_warp_specializes': [None, True]} 2026-02-21T09:40:10.3729579Z [118s] Fitting surrogate: 381 points, 381 targets 2026-02-21T09:40:10.9659047Z [118s] Generation 5 starting: 29 neighbors, 3 active search path(s) 2026-02-21T09:40:18.9878932Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 1.8 configs/s 2026-02-21T09:40:20.8060037Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 31/31 17.5 configs/s 2026-02-21T09:40:21.9470873Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 882.1 2026-02-21T09:40:21.9475421Z configs/s 2026-02-21T09:40:22.0452472Z [129s] Generation 5 complete: 2026-02-21T09:40:22.0456720Z error=1 2026-02-21T09:40:22.0460838Z ok=32 2026-02-21T09:40:22.0466141Z min=0.0266 2026-02-21T09:40:22.0470672Z mid=0.0410 2026-02-21T09:40:22.0472448Z max=1.0916 2026-02-21T09:40:22.0472675Z best={'block_sizes': [1, 8192], 
2026-02-21T09:40:22.0472969Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:40:22.0473270Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:40:22.0473491Z 'num_stages': 6, 2026-02-21T09:40:22.0473665Z 'num_warps': 1, 2026-02-21T09:40:22.0473836Z 'pid_type': 'flat', 2026-02-21T09:40:22.0474384Z 'range_flattens': [None, True], 2026-02-21T09:40:22.0474629Z 'range_multi_buffers': [None, None], 2026-02-21T09:40:22.0474828Z 'range_num_stages': [0, 4], 2026-02-21T09:40:22.0475027Z 'range_unroll_factors': [0, 0], 2026-02-21T09:40:22.0475625Z 'range_warp_specializes': [None, True]} 2026-02-21T09:40:22.0475851Z [129s] Fitting surrogate: 414 points, 414 targets 2026-02-21T09:40:22.5134463Z [130s] Generation 6 starting: 23 neighbors, 2 active search path(s) 2026-02-21T09:40:25.9283331Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 17.3 configs/s 2026-02-21T09:40:27.3089441Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 17.2 configs/s 2026-02-21T09:40:29.6325800Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 591.0 2026-02-21T09:40:29.6329426Z configs/s 2026-02-21T09:40:29.7653991Z [137s] Generation 6 complete: 2026-02-21T09:40:29.7658960Z ok=26 2026-02-21T09:40:29.7662948Z min=0.0266 2026-02-21T09:40:29.7667482Z mid=0.0286 2026-02-21T09:40:29.7670801Z max=0.0492 2026-02-21T09:40:29.7675286Z best={'block_sizes': [1, 8192], 2026-02-21T09:40:29.7680222Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:40:29.7683624Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:40:29.7684832Z 'num_stages': 6, 2026-02-21T09:40:29.7685069Z 'num_warps': 1, 2026-02-21T09:40:29.7685260Z 'pid_type': 'flat', 2026-02-21T09:40:29.7685446Z 'range_flattens': [None, True], 2026-02-21T09:40:29.7685651Z 'range_multi_buffers': [None, None], 2026-02-21T09:40:29.7690779Z 'range_num_stages': [0, 4], 2026-02-21T09:40:29.7695268Z 'range_unroll_factors': [0, 0], 
2026-02-21T09:40:29.7699706Z 'range_warp_specializes': [None, True]} 2026-02-21T09:40:29.7704502Z [137s] Fitting surrogate: 440 points, 440 targets 2026-02-21T09:40:30.1938480Z [137s] Generation 7 starting: 21 neighbors, 2 active search path(s) 2026-02-21T09:40:32.9157256Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 16.3 configs/s 2026-02-21T09:40:34.1653446Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 17.4 configs/s 2026-02-21T09:40:35.5609349Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 723.4 2026-02-21T09:40:35.5613061Z configs/s 2026-02-21T09:40:35.6774765Z [143s] Generation 7 complete: 2026-02-21T09:40:35.6778816Z ok=24 2026-02-21T09:40:35.6780252Z min=0.0266 2026-02-21T09:40:35.6780407Z mid=0.0286 2026-02-21T09:40:35.6780539Z max=0.0471 2026-02-21T09:40:35.6780682Z best={'block_sizes': [1, 8192], 2026-02-21T09:40:35.6780940Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:40:35.6781207Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:40:35.6781409Z 'num_stages': 6, 2026-02-21T09:40:35.6781836Z 'num_warps': 1, 2026-02-21T09:40:35.6782002Z 'pid_type': 'flat', 2026-02-21T09:40:35.6782169Z 'range_flattens': [None, True], 2026-02-21T09:40:35.6782362Z 'range_multi_buffers': [None, None], 2026-02-21T09:40:35.6782563Z 'range_num_stages': [0, 4], 2026-02-21T09:40:35.6782775Z 'range_unroll_factors': [0, 0], 2026-02-21T09:40:35.6782985Z 'range_warp_specializes': [None, True]} 2026-02-21T09:40:35.6790650Z [143s] Fitting surrogate: 464 points, 464 targets 2026-02-21T09:40:36.1115368Z [143s] Generation 8 starting: 20 neighbors, 2 active search path(s) 2026-02-21T09:40:39.3597344Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 6.6 configs/s 2026-02-21T09:40:40.5599561Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 20/20 17.3 configs/s 2026-02-21T09:40:41.9825503Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 
708.8 2026-02-21T09:40:41.9825932Z configs/s 2026-02-21T09:40:42.0937657Z [149s] Generation 8 complete: 2026-02-21T09:40:42.0941464Z ok=23 2026-02-21T09:40:42.0942858Z min=0.0266 2026-02-21T09:40:42.0943048Z mid=0.0286 2026-02-21T09:40:42.0943195Z max=0.0492 2026-02-21T09:40:42.0943372Z best={'block_sizes': [1, 8192], 2026-02-21T09:40:42.0943713Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:40:42.0944354Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:40:42.0944553Z 'num_stages': 6, 2026-02-21T09:40:42.0944694Z 'num_warps': 1, 2026-02-21T09:40:42.0944845Z 'pid_type': 'flat', 2026-02-21T09:40:42.0945002Z 'range_flattens': [None, True], 2026-02-21T09:40:42.0945185Z 'range_multi_buffers': [None, None], 2026-02-21T09:40:42.0945367Z 'range_num_stages': [0, 4], 2026-02-21T09:40:42.0945544Z 'range_unroll_factors': [0, 0], 2026-02-21T09:40:42.0945725Z 'range_warp_specializes': [None, True]} 2026-02-21T09:40:42.0953449Z [149s] Fitting surrogate: 487 points, 487 targets 2026-02-21T09:40:42.5203542Z [150s] Generation 9 starting: 22 neighbors, 2 active search path(s) 2026-02-21T09:40:45.8698899Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 6.7 configs/s 2026-02-21T09:40:47.1902879Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 22/22 17.2 configs/s 2026-02-21T09:40:48.6791085Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 678.3 2026-02-21T09:40:48.6791483Z configs/s 2026-02-21T09:40:48.7949991Z [156s] Generation 9 complete: 2026-02-21T09:40:48.7954322Z ok=25 2026-02-21T09:40:48.7958885Z min=0.0266 2026-02-21T09:40:48.7960252Z mid=0.0286 2026-02-21T09:40:48.7960418Z max=0.0471 2026-02-21T09:40:48.7960559Z best={'block_sizes': [1, 8192], 2026-02-21T09:40:48.7960815Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:40:48.7961083Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:40:48.7961274Z 'num_stages': 6, 
2026-02-21T09:40:48.7961423Z 'num_warps': 1, 2026-02-21T09:40:48.7961650Z 'pid_type': 'flat', 2026-02-21T09:40:48.7961820Z 'range_flattens': [None, True], 2026-02-21T09:40:48.7961999Z 'range_multi_buffers': [None, None], 2026-02-21T09:40:48.7962193Z 'range_num_stages': [0, 4], 2026-02-21T09:40:48.7962379Z 'range_unroll_factors': [0, 0], 2026-02-21T09:40:48.7962574Z 'range_warp_specializes': [None, True]} 2026-02-21T09:40:48.7969481Z [156s] Fitting surrogate: 512 points, 512 targets 2026-02-21T09:40:49.3460385Z [157s] Generation 10 starting: 24 neighbors, 2 active search path(s) 2026-02-21T09:40:53.0436477Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25/25 6.4 configs/s 2026-02-21T09:40:54.5709854Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 25/25 16.9 configs/s 2026-02-21T09:40:56.0492917Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 682.4 2026-02-21T09:40:56.0493678Z configs/s 2026-02-21T09:40:56.1676384Z [163s] Generation 10 complete: 2026-02-21T09:40:56.1680078Z ok=27 2026-02-21T09:40:56.1684072Z min=0.0266 2026-02-21T09:40:56.1688579Z mid=0.0286 2026-02-21T09:40:56.1689955Z max=0.0553 2026-02-21T09:40:56.1690149Z best={'block_sizes': [1, 8192], 2026-02-21T09:40:56.1690450Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:40:56.1691111Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:40:56.1691322Z 'num_stages': 6, 2026-02-21T09:40:56.1691467Z 'num_warps': 1, 2026-02-21T09:40:56.1691888Z 'pid_type': 'flat', 2026-02-21T09:40:56.1692049Z 'range_flattens': [None, True], 2026-02-21T09:40:56.1692240Z 'range_multi_buffers': [None, None], 2026-02-21T09:40:56.1692427Z 'range_num_stages': [0, 4], 2026-02-21T09:40:56.1692607Z 'range_unroll_factors': [0, 0], 2026-02-21T09:40:56.1692786Z 'range_warp_specializes': [None, True]} 2026-02-21T09:40:56.1693072Z [163s] Fitting surrogate: 539 points, 539 targets 2026-02-21T09:40:56.6609603Z [164s] Generation 11 starting: 16 neighbors, 
2 active search path(s) 2026-02-21T09:40:58.6317539Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 14.6 configs/s 2026-02-21T09:40:59.6381940Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.7 configs/s 2026-02-21T09:41:00.7858426Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 876.1 2026-02-21T09:41:00.7858903Z configs/s 2026-02-21T09:41:00.8727517Z [168s] Generation 11 complete: 2026-02-21T09:41:00.8732022Z ok=19 2026-02-21T09:41:00.8733548Z min=0.0266 2026-02-21T09:41:00.8733709Z mid=0.0307 2026-02-21T09:41:00.8733845Z max=0.0492 2026-02-21T09:41:00.8733985Z best={'block_sizes': [1, 8192], 2026-02-21T09:41:00.8734248Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:41:00.8734528Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:41:00.8734722Z 'num_stages': 6, 2026-02-21T09:41:00.8734894Z 'num_warps': 1, 2026-02-21T09:41:00.8735033Z 'pid_type': 'flat', 2026-02-21T09:41:00.8735195Z 'range_flattens': [None, True], 2026-02-21T09:41:00.8735370Z 'range_multi_buffers': [None, None], 2026-02-21T09:41:00.8735558Z 'range_num_stages': [0, 4], 2026-02-21T09:41:00.8735722Z 'range_unroll_factors': [0, 0], 2026-02-21T09:41:00.8735933Z 'range_warp_specializes': [None, True]} 2026-02-21T09:41:00.8744061Z [168s] Fitting surrogate: 558 points, 558 targets 2026-02-21T09:41:01.3726255Z [169s] Generation 12 starting: 23 neighbors, 2 active search path(s) 2026-02-21T09:41:06.0257684Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24/24 3.3 configs/s 2026-02-21T09:41:07.4678872Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 24/24 17.2 configs/s 2026-02-21T09:41:09.2848843Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 770.5 2026-02-21T09:41:09.2849205Z configs/s 2026-02-21T09:41:09.3866725Z [177s] Generation 12 complete: 2026-02-21T09:41:09.3872320Z ok=26 2026-02-21T09:41:09.3876284Z min=0.0266 2026-02-21T09:41:09.3880162Z mid=0.0326 
2026-02-21T09:41:09.3882113Z max=0.0696 2026-02-21T09:41:09.3882291Z best={'block_sizes': [1, 8192], 2026-02-21T09:41:09.3882585Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:41:09.3882865Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:41:09.3883059Z 'num_stages': 6, 2026-02-21T09:41:09.3883197Z 'num_warps': 1, 2026-02-21T09:41:09.3883342Z 'pid_type': 'flat', 2026-02-21T09:41:09.3883495Z 'range_flattens': [None, True], 2026-02-21T09:41:09.3883677Z 'range_multi_buffers': [None, None], 2026-02-21T09:41:09.3883864Z 'range_num_stages': [0, 4], 2026-02-21T09:41:09.3884027Z 'range_unroll_factors': [0, 0], 2026-02-21T09:41:09.3884207Z 'range_warp_specializes': [None, True]} 2026-02-21T09:41:09.3884415Z [177s] Fitting surrogate: 584 points, 584 targets 2026-02-21T09:41:09.9859588Z [177s] Generation 13 starting: 24 neighbors, 2 active search path(s) 2026-02-21T09:41:13.5285966Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25/25 3.7 configs/s 2026-02-21T09:41:15.0231325Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 25/25 17.2 configs/s 2026-02-21T09:41:16.4997873Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 683.5 2026-02-21T09:41:16.5002002Z configs/s 2026-02-21T09:41:16.6131903Z [184s] Generation 13 complete: 2026-02-21T09:41:16.6133661Z ok=27 2026-02-21T09:41:16.6133829Z min=0.0266 2026-02-21T09:41:16.6133974Z mid=0.0286 2026-02-21T09:41:16.6134098Z max=0.0777 2026-02-21T09:41:16.6134253Z best={'block_sizes': [1, 8192], 2026-02-21T09:41:16.6134506Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:41:16.6134772Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:41:16.6134961Z 'num_stages': 6, 2026-02-21T09:41:16.6135107Z 'num_warps': 1, 2026-02-21T09:41:16.6135244Z 'pid_type': 'flat', 2026-02-21T09:41:16.6135406Z 'range_flattens': [None, True], 2026-02-21T09:41:16.6135588Z 'range_multi_buffers': [None, None], 
2026-02-21T09:41:16.6135767Z 'range_num_stages': [0, 4], 2026-02-21T09:41:16.6135935Z 'range_unroll_factors': [0, 0], 2026-02-21T09:41:16.6136423Z 'range_warp_specializes': [None, True]} 2026-02-21T09:41:16.6151907Z [184s] Fitting surrogate: 611 points, 611 targets 2026-02-21T09:41:17.1315944Z [184s] Generation 14 starting: 21 neighbors, 2 active search path(s) 2026-02-21T09:41:19.8036820Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 7.6 configs/s 2026-02-21T09:41:21.0349226Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 21/21 17.7 configs/s 2026-02-21T09:41:22.5929678Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 647.7 2026-02-21T09:41:22.5933821Z configs/s 2026-02-21T09:41:22.7070447Z [190s] Generation 14 complete: 2026-02-21T09:41:22.7073747Z ok=24 2026-02-21T09:41:22.7076358Z min=0.0266 2026-02-21T09:41:22.7076527Z mid=0.0286 2026-02-21T09:41:22.7076653Z max=0.0471 2026-02-21T09:41:22.7076798Z best={'block_sizes': [1, 8192], 2026-02-21T09:41:22.7077080Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:41:22.7077373Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:41:22.7077562Z 'num_stages': 6, 2026-02-21T09:41:22.7077709Z 'num_warps': 1, 2026-02-21T09:41:22.7077847Z 'pid_type': 'flat', 2026-02-21T09:41:22.7078007Z 'range_flattens': [None, True], 2026-02-21T09:41:22.7078181Z 'range_multi_buffers': [None, None], 2026-02-21T09:41:22.7078368Z 'range_num_stages': [0, 4], 2026-02-21T09:41:22.7078539Z 'range_unroll_factors': [0, 0], 2026-02-21T09:41:22.7078712Z 'range_warp_specializes': [None, True]} 2026-02-21T09:41:22.7087069Z [190s] Fitting surrogate: 635 points, 635 targets 2026-02-21T09:41:23.2565767Z [190s] Generation 15 starting: 22 neighbors, 2 active search path(s) 2026-02-21T09:41:25.8892241Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 16.1 configs/s 2026-02-21T09:41:27.2487866Z Generation 15: exploring neighbors 100% 
━━━━━━━━━━━━━━━━━━━ 23/23 17.5 configs/s 2026-02-21T09:41:28.6134341Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 738.3 2026-02-21T09:41:28.6138383Z configs/s 2026-02-21T09:41:28.7196746Z [196s] Generation 15 complete: 2026-02-21T09:41:28.7200543Z ok=25 2026-02-21T09:41:28.7204880Z min=0.0266 2026-02-21T09:41:28.7209248Z mid=0.0307 2026-02-21T09:41:28.7211255Z max=0.0491 2026-02-21T09:41:28.7211477Z best={'block_sizes': [1, 8192], 2026-02-21T09:41:28.7216543Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:41:28.7218412Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:41:28.7218636Z 'num_stages': 6, 2026-02-21T09:41:28.7218788Z 'num_warps': 1, 2026-02-21T09:41:28.7218930Z 'pid_type': 'flat', 2026-02-21T09:41:28.7219095Z 'range_flattens': [None, True], 2026-02-21T09:41:28.7219272Z 'range_multi_buffers': [None, None], 2026-02-21T09:41:28.7219465Z 'range_num_stages': [0, 4], 2026-02-21T09:41:28.7219636Z 'range_unroll_factors': [0, 0], 2026-02-21T09:41:28.7219828Z 'range_warp_specializes': [None, True]} 2026-02-21T09:41:28.7220390Z [196s] Fitting surrogate: 660 points, 660 targets 2026-02-21T09:41:29.2377436Z [196s] Generation 16 starting: 18 neighbors, 2 active search path(s) 2026-02-21T09:41:32.1622345Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 7.6 configs/s 2026-02-21T09:41:33.3377415Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 17.7 configs/s 2026-02-21T09:41:34.4991207Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 865.2 2026-02-21T09:41:34.4995687Z configs/s 2026-02-21T09:41:34.5904521Z [202s] Generation 16 complete: 2026-02-21T09:41:34.5909006Z ok=21 2026-02-21T09:41:34.5914074Z min=0.0266 2026-02-21T09:41:34.5918548Z mid=0.0328 2026-02-21T09:41:34.5920064Z max=0.0471 2026-02-21T09:41:34.5920310Z best={'block_sizes': [1, 8192], 2026-02-21T09:41:34.5920942Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 
2026-02-21T09:41:34.5925766Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:41:34.5929532Z 'num_stages': 6, 2026-02-21T09:41:34.5931046Z 'num_warps': 1, 2026-02-21T09:41:34.5931244Z 'pid_type': 'flat', 2026-02-21T09:41:34.5931417Z 'range_flattens': [None, True], 2026-02-21T09:41:34.5931692Z 'range_multi_buffers': [None, None], 2026-02-21T09:41:34.5931885Z 'range_num_stages': [0, 4], 2026-02-21T09:41:34.5932064Z 'range_unroll_factors': [0, 0], 2026-02-21T09:41:34.5932250Z 'range_warp_specializes': [None, True]} 2026-02-21T09:41:34.5932584Z [202s] Fitting surrogate: 681 points, 681 targets 2026-02-21T09:41:35.0833163Z [202s] Generation 17 starting: 20 neighbors, 2 active search path(s) 2026-02-21T09:41:40.7907692Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 1.2 configs/s 2026-02-21T09:41:42.0493347Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 21/21 17.3 configs/s 2026-02-21T09:41:43.1713630Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 896.3 2026-02-21T09:41:43.1717314Z configs/s 2026-02-21T09:41:43.2583981Z [210s] Generation 17 complete: 2026-02-21T09:41:43.2585822Z ok=23 2026-02-21T09:41:43.2586020Z min=0.0266 2026-02-21T09:41:43.2586185Z mid=0.0286 2026-02-21T09:41:43.2586341Z max=0.4762 2026-02-21T09:41:43.2586521Z best={'block_sizes': [1, 8192], 2026-02-21T09:41:43.2586805Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:41:43.2587115Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:41:43.2587326Z 'num_stages': 6, 2026-02-21T09:41:43.2587481Z 'num_warps': 1, 2026-02-21T09:41:43.2587662Z 'pid_type': 'flat', 2026-02-21T09:41:43.2587819Z 'range_flattens': [None, True], 2026-02-21T09:41:43.2588004Z 'range_multi_buffers': [None, None], 2026-02-21T09:41:43.2588187Z 'range_num_stages': [0, 4], 2026-02-21T09:41:43.2588359Z 'range_unroll_factors': [0, 0], 2026-02-21T09:41:43.2588553Z 'range_warp_specializes': [None, True]} 2026-02-21T09:41:43.2605001Z 
[210s] Fitting surrogate: 704 points, 704 targets 2026-02-21T09:41:43.7900560Z [211s] Generation 18 starting: 22 neighbors, 2 active search path(s) 2026-02-21T09:41:46.2909178Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 12.1 configs/s 2026-02-21T09:41:47.6046711Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 22/22 17.3 configs/s 2026-02-21T09:41:49.2927510Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 598.7 2026-02-21T09:41:49.2928917Z configs/s 2026-02-21T09:41:49.4258354Z [217s] Generation 18 complete: 2026-02-21T09:41:49.4262704Z ok=25 2026-02-21T09:41:49.4265933Z min=0.0266 2026-02-21T09:41:49.4270998Z mid=0.0286 2026-02-21T09:41:49.4275362Z max=0.0451 2026-02-21T09:41:49.4276872Z best={'block_sizes': [1, 8192], 2026-02-21T09:41:49.4277160Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:41:49.4277793Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:41:49.4278012Z 'num_stages': 6, 2026-02-21T09:41:49.4278156Z 'num_warps': 1, 2026-02-21T09:41:49.4278305Z 'pid_type': 'flat', 2026-02-21T09:41:49.4278462Z 'range_flattens': [None, True], 2026-02-21T09:41:49.4278650Z 'range_multi_buffers': [None, None], 2026-02-21T09:41:49.4278835Z 'range_num_stages': [0, 4], 2026-02-21T09:41:49.4279011Z 'range_unroll_factors': [0, 0], 2026-02-21T09:41:49.4279213Z 'range_warp_specializes': [None, True]} 2026-02-21T09:41:49.4284610Z [217s] Fitting surrogate: 729 points, 729 targets 2026-02-21T09:41:49.7051149Z [217s] Autotuning complete in 217.4s after searching 688 configs. 
2026-02-21T09:41:49.7051675Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:41:49.7052714Z @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=6, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T09:41:49.7053602Z 2026-02-21T09:41:49.7053867Z [217s] Code of selected kernel: /tmp/torchinductor_root/fp/cfpjpq5pjqbo6r2tocqrc7cpd4xs2lpovq445vunpnr4iei6c2ry.py 2026-02-21T09:41:49.7282717Z from __future__ import annotations 2026-02-21T09:41:49.7286552Z 2026-02-21T09:41:49.7289959Z import torch 2026-02-21T09:41:49.7291419Z import triton 2026-02-21T09:41:49.7291720Z import triton.language as tl 2026-02-21T09:41:49.7291947Z from torch._inductor.runtime import triton_helpers 2026-02-21T09:41:49.7292258Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T09:41:49.7292567Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:41:49.7292776Z 2026-02-21T09:41:49.7292849Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T09:41:49.7293030Z _BLOCK_SIZE_1 = tl.constexpr(8192) 2026-02-21T09:41:49.7293179Z 2026-02-21T09:41:49.7293239Z @triton.jit 2026-02-21T09:41:49.7293392Z def _helion_softmax_two_pass(x, out): 2026-02-21T09:41:49.7293643Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:41:49.7293905Z pid_0 = tl.program_id(0) 2026-02-21T09:41:49.7294069Z offset_0 = pid_0 2026-02-21T09:41:49.7294250Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T09:41:49.7294527Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:41:49.7294838Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T09:41:49.7295130Z # src[softmax.py:81]: di = 
hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:41:49.7295399Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T09:41:49.7295664Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:41:49.7295968Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:41:49.7296244Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:41:49.7296829Z # src[softmax.py:82-89]: ... 2026-02-21T09:41:49.7297151Z for offset_2 in tl.range(0, 6784, _BLOCK_SIZE_1, warp_specialize=True, num_stages=4, flatten=True): 2026-02-21T09:41:49.7297554Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:41:49.7297798Z mask_1 = indices_2 < 6784 2026-02-21T09:41:49.7297975Z mi_copy = mi 2026-02-21T09:41:49.7298122Z di_copy = di 2026-02-21T09:41:49.7298264Z mi_copy_0 = mi_copy 2026-02-21T09:41:49.7298433Z di_copy_0 = di_copy 2026-02-21T09:41:49.7298611Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:41:49.7298983Z values = tl.load(x + (indices_0[:, None] * 6784 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first') 2026-02-21T09:41:49.7299367Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:41:49.7299872Z _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16)) 2026-02-21T09:41:49.7300278Z local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16) 2026-02-21T09:41:49.7300537Z # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax) 2026-02-21T09:41:49.7300780Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T09:41:49.7300986Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T09:41:49.7301248Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:41:49.7301483Z v_2 = mi_copy_0 - v_1 2026-02-21T09:41:49.7301703Z v_3 = libdevice.exp(v_2) 2026-02-21T09:41:49.7301869Z v_4 = di_copy_0 * v_3 2026-02-21T09:41:49.7302067Z # 
src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:41:49.7302273Z subscript = v_1[:, None] 2026-02-21T09:41:49.7302449Z v_5 = tl.cast(values, tl.float32) 2026-02-21T09:41:49.7302650Z v_6 = v_5 - subscript 2026-02-21T09:41:49.7302866Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:41:49.7303140Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:41:49.7303357Z # src[softmax.py:88]: ).sum(dim=1) 2026-02-21T09:41:49.7303555Z v_7 = libdevice.exp(v_6) 2026-02-21T09:41:49.7303878Z _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32)) 2026-02-21T09:41:49.7304229Z sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32) 2026-02-21T09:41:49.7304435Z di = v_4 + sum_1 2026-02-21T09:41:49.7304592Z # src[softmax.py:89]: mi = mi_next 2026-02-21T09:41:49.7304768Z mi = v_1 2026-02-21T09:41:49.7304962Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:41:49.7305231Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:41:49.7305527Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:41:49.7305908Z for offset_2 in tl.range(0, 6784, _BLOCK_SIZE_1, warp_specialize=True, num_stages=4, flatten=True): 2026-02-21T09:41:49.7306255Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:41:49.7306480Z mask_2 = indices_2 < 6784 2026-02-21T09:41:49.7306651Z mi_copy_1 = mi 2026-02-21T09:41:49.7306796Z di_copy_1 = di 2026-02-21T09:41:49.7306949Z mi_copy_1_0 = mi_copy_1 2026-02-21T09:41:49.7307106Z di_copy_1_0 = di_copy_1 2026-02-21T09:41:49.7307296Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:41:49.7307655Z values_1 = tl.load(x + (indices_0[:, None] * 6784 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first') 2026-02-21T09:41:49.7308081Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, 
None]) / di[:, None] 2026-02-21T09:41:49.7308361Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T09:41:49.7308634Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T09:41:49.7308821Z v_10 = v_9 - subscript_1 2026-02-21T09:41:49.7308988Z v_11 = libdevice.exp(v_10) 2026-02-21T09:41:49.7309168Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T09:41:49.7309353Z v_12 = v_11 / subscript_2 2026-02-21T09:41:49.7309522Z v_13 = tl.cast(v_12, tl.float16) 2026-02-21T09:41:49.7309795Z tl.store(out + (indices_0[:, None] * 6784 + indices_2[None, :] * 1), v_13, mask_2[None, :]) 2026-02-21T09:41:49.7310008Z 2026-02-21T09:41:49.7310137Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T09:41:49.7310374Z """ 2026-02-21T09:41:49.7310576Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T09:41:49.7310885Z This version uses fewer passes but is less numerically stable. 2026-02-21T09:41:49.7311110Z Args: 2026-02-21T09:41:49.7311328Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T09:41:49.7311563Z Returns: 2026-02-21T09:41:49.7311740Z torch.Tensor: Softmax output tensor of the same shape. 2026-02-21T09:41:49.7311951Z """ 2026-02-21T09:41:49.7312086Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T09:41:49.7312269Z m, n = x.size() 2026-02-21T09:41:49.7312434Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T09:41:49.7312639Z out = torch.empty_like(x) 2026-02-21T09:41:49.7312870Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:41:49.7313183Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:41:49.7313504Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:41:49.7313745Z # src[softmax.py:79-92]: ... 
2026-02-21T09:41:49.7314012Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=6) 2026-02-21T09:41:49.7314284Z # src[softmax.py:93]: return out 2026-02-21T09:41:49.7314476Z return out 2026-02-21T09:41:50.6833239Z WARNING:tritonbench.utils.triton_op:Completed input ID 51: 2026-02-21T09:41:50.6835056Z (M, N) 2026-02-21T09:41:50.6835223Z ------------ 2026-02-21T09:41:50.6835374Z (4096, 6784) 2026-02-21T09:41:50.6835453Z 2026-02-21T09:41:50.6845245Z 55%|█████▌ | 11/20 [32:55<30:56, 206.28s/it]WARNING:tritonbench.utils.triton_op:Running input ID 56: 2026-02-21T09:41:50.6846803Z (M, N) 2026-02-21T09:41:50.6847017Z ------------ 2026-02-21T09:41:50.6847172Z (4096, 7424) 2026-02-21T09:41:50.6852214Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T09:41:51.8862556Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T09:41:53.3695168Z INFO:tritonbench.utils.triton_op:Took 2.33ms to get benchmark function for torch_compile_softmax 2026-02-21T09:41:54.7130296Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:41:54.7134592Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:41:54.7137832Z 'dtype': 'torch.float16', 2026-02-21T09:41:54.7142214Z 'shape': (4096, 7424), 2026-02-21T09:41:54.7143554Z 'stride': (7424, 1)},), 2026-02-21T09:41:54.7143775Z 'kwargs': {}} 2026-02-21T09:41:54.7156213Z INFO:tritonbench.utils.triton_op:Took 2.94ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T09:41:54.8891223Z [0s] Autotune random seed: 2138408546 2026-02-21T09:41:54.9142655Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:42:32.1148487Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s 2026-02-21T09:42:40.4993824Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 11.8 configs/s 2026-02-21T09:42:40.5004592Z [45s] 
Adaptive compile timeout: 30s (90% percentile=10.5s, bounds=[30.0s, 30s]) 2026-02-21T09:42:41.1703655Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1466.1 configs/s 2026-02-21T09:42:41.2271369Z [46s] Initial random population of 100, 5 starting points: 2026-02-21T09:42:41.2275834Z error=12 2026-02-21T09:42:41.2277346Z ok=88 2026-02-21T09:42:41.2277525Z min=0.0451 2026-02-21T09:42:41.2277655Z mid=0.6451 2026-02-21T09:42:41.2277786Z max=172.3709 2026-02-21T09:42:41.2277935Z best={'block_sizes': [1, 8192], 2026-02-21T09:42:41.2278200Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:42:41.2278490Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:42:41.2278702Z 'num_sm_multiplier': 2, 2026-02-21T09:42:41.2278863Z 'num_stages': 7, 2026-02-21T09:42:41.2279001Z 'num_warps': 32, 2026-02-21T09:42:41.2279157Z 'pid_type': 'persistent_blocked', 2026-02-21T09:42:41.2279339Z 'range_flattens': [True, True], 2026-02-21T09:42:41.2279517Z 'range_multi_buffers': [False, None], 2026-02-21T09:42:41.2279696Z 'range_num_stages': [4, 3], 2026-02-21T09:42:41.2279863Z 'range_unroll_factors': [2, 3], 2026-02-21T09:42:41.2280340Z 'range_warp_specializes': [False, False]} 2026-02-21T09:42:41.2291075Z [46s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:42:42.3191231Z [47s] Generation 1 starting: 76 neighbors, 5 active search path(s) 2026-02-21T09:43:02.1211244Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80/80 1.1 configs/s 2026-02-21T09:43:06.8472129Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 80/80 17.1 configs/s 2026-02-21T09:43:08.2875575Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 697.3 2026-02-21T09:43:08.2877340Z configs/s 2026-02-21T09:43:08.3888009Z [73s] Generation 1 complete: 2026-02-21T09:43:08.3893200Z error=1 2026-02-21T09:43:08.3898432Z ok=81 2026-02-21T09:43:08.3902621Z min=0.0328 2026-02-21T09:43:08.3907134Z mid=0.0553 
2026-02-21T09:43:08.3909505Z max=0.3358 2026-02-21T09:43:08.3909722Z best={'block_sizes': [1, 8192], 2026-02-21T09:43:08.3910106Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:43:08.3910469Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:43:08.3910700Z 'num_sm_multiplier': 4, 2026-02-21T09:43:08.3910875Z 'num_stages': 7, 2026-02-21T09:43:08.3911035Z 'num_warps': 8, 2026-02-21T09:43:08.3911377Z 'pid_type': 'persistent_blocked', 2026-02-21T09:43:08.3911721Z 'range_flattens': [False, True], 2026-02-21T09:43:08.3912036Z 'range_multi_buffers': [False, None], 2026-02-21T09:43:08.3912301Z 'range_num_stages': [4, 3], 2026-02-21T09:43:08.3912601Z 'range_unroll_factors': [2, 3], 2026-02-21T09:43:08.3912913Z 'range_warp_specializes': [False, False]} 2026-02-21T09:43:08.3913330Z [73s] Fitting surrogate: 182 points, 182 targets 2026-02-21T09:43:09.4914279Z [74s] Generation 2 starting: 71 neighbors, 5 active search path(s) 2026-02-21T09:43:18.8657849Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 10.4 configs/s 2026-02-21T09:43:23.2805853Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.9 configs/s 2026-02-21T09:43:28.5395609Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 215.9 2026-02-21T09:43:28.5396966Z configs/s 2026-02-21T09:43:28.8427634Z [93s] Generation 2 complete: 2026-02-21T09:43:28.8432102Z ok=77 2026-02-21T09:43:28.8437260Z min=0.0307 2026-02-21T09:43:28.8441247Z mid=0.0431 2026-02-21T09:43:28.8443339Z max=0.4157 2026-02-21T09:43:28.8443650Z best={'block_sizes': [1, 8192], 2026-02-21T09:43:28.8448398Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:43:28.8449958Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:43:28.8450280Z 'num_stages': 7, 2026-02-21T09:43:28.8455256Z 'num_warps': 4, 2026-02-21T09:43:28.8459783Z 'pid_type': 'flat', 2026-02-21T09:43:28.8463477Z 'range_flattens': [None, True], 
2026-02-21T09:43:28.8465496Z 'range_multi_buffers': [None, None], 2026-02-21T09:43:28.8465842Z 'range_num_stages': [0, 3], 2026-02-21T09:43:28.8466554Z 'range_unroll_factors': [0, 3], 2026-02-21T09:43:28.8470518Z 'range_warp_specializes': [None, False]} 2026-02-21T09:43:28.8474461Z [93s] Fitting surrogate: 259 points, 259 targets 2026-02-21T09:43:29.8156081Z [94s] Generation 3 starting: 64 neighbors, 5 active search path(s) 2026-02-21T09:43:47.5672143Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66/66 1.2 configs/s 2026-02-21T09:43:51.5429604Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 66/66 16.8 configs/s 2026-02-21T09:43:55.1715095Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 279.2 2026-02-21T09:43:55.1715965Z configs/s 2026-02-21T09:43:55.4186099Z [120s] Generation 3 complete: 2026-02-21T09:43:55.4189771Z ok=69 2026-02-21T09:43:55.4194331Z min=0.0307 2026-02-21T09:43:55.4199415Z mid=0.0410 2026-02-21T09:43:55.4203352Z max=0.6092 2026-02-21T09:43:55.4208146Z best={'block_sizes': [1, 8192], 2026-02-21T09:43:55.4208872Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:43:55.4209173Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:43:55.4209373Z 'num_stages': 7, 2026-02-21T09:43:55.4209537Z 'num_warps': 4, 2026-02-21T09:43:55.4209694Z 'pid_type': 'flat', 2026-02-21T09:43:55.4209866Z 'range_flattens': [None, True], 2026-02-21T09:43:55.4210051Z 'range_multi_buffers': [None, None], 2026-02-21T09:43:55.4210248Z 'range_num_stages': [0, 3], 2026-02-21T09:43:55.4210419Z 'range_unroll_factors': [0, 3], 2026-02-21T09:43:55.4210613Z 'range_warp_specializes': [None, False]} 2026-02-21T09:43:55.4210831Z [120s] Fitting surrogate: 328 points, 328 targets 2026-02-21T09:43:56.2887391Z [121s] Generation 4 starting: 62 neighbors, 5 active search path(s) 2026-02-21T09:44:05.3317637Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 5.1 configs/s 
2026-02-21T09:44:09.2168226Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 16.9 configs/s 2026-02-21T09:44:12.9197138Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 273.9 2026-02-21T09:44:12.9201202Z configs/s 2026-02-21T09:44:13.1792314Z [138s] Generation 4 complete: 2026-02-21T09:44:13.1794078Z ok=68 2026-02-21T09:44:13.1794285Z min=0.0307 2026-02-21T09:44:13.1798404Z mid=0.0328 2026-02-21T09:44:13.1802885Z max=0.1454 2026-02-21T09:44:13.1806106Z best={'block_sizes': [1, 8192], 2026-02-21T09:44:13.1810639Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:44:13.1812047Z 'load_eviction_policies': ['', ''], 2026-02-21T09:44:13.1812268Z 'num_stages': 1, 2026-02-21T09:44:13.1812422Z 'num_warps': 4, 2026-02-21T09:44:13.1812568Z 'pid_type': 'flat', 2026-02-21T09:44:13.1812737Z 'range_flattens': [None, True], 2026-02-21T09:44:13.1812920Z 'range_multi_buffers': [None, None], 2026-02-21T09:44:13.1813149Z 'range_num_stages': [0, 3], 2026-02-21T09:44:13.1813329Z 'range_unroll_factors': [0, 1], 2026-02-21T09:44:13.1813517Z 'range_warp_specializes': [None, False]} 2026-02-21T09:44:13.1813743Z [138s] Fitting surrogate: 396 points, 396 targets 2026-02-21T09:44:13.9510041Z [139s] Generation 5 starting: 51 neighbors, 4 active search path(s) 2026-02-21T09:44:22.5035197Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 2.8 configs/s 2026-02-21T09:44:26.2028005Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 14.2 configs/s 2026-02-21T09:44:29.7077085Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 289.2 2026-02-21T09:44:29.7078438Z configs/s 2026-02-21T09:44:29.9520156Z [155s] Generation 5 complete: 2026-02-21T09:44:29.9525450Z ok=55 2026-02-21T09:44:29.9526829Z min=0.0307 2026-02-21T09:44:29.9526995Z mid=0.0328 2026-02-21T09:44:29.9527127Z max=0.1474 2026-02-21T09:44:29.9527303Z best={'block_sizes': [1, 8192], 2026-02-21T09:44:29.9527911Z 'indexing': 
['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T09:44:29.9528149Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:44:29.9528340Z 'num_stages': 5, 2026-02-21T09:44:29.9528480Z 'num_warps': 4, 2026-02-21T09:44:29.9528626Z 'pid_type': 'flat', 2026-02-21T09:44:29.9528784Z 'range_flattens': [None, False], 2026-02-21T09:44:29.9528968Z 'range_multi_buffers': [None, None], 2026-02-21T09:44:29.9529154Z 'range_num_stages': [0, 2], 2026-02-21T09:44:29.9529318Z 'range_unroll_factors': [0, 1], 2026-02-21T09:44:29.9529502Z 'range_warp_specializes': [None, False]} 2026-02-21T09:44:29.9536743Z [155s] Fitting surrogate: 451 points, 451 targets 2026-02-21T09:44:30.7260040Z [155s] Generation 6 starting: 52 neighbors, 4 active search path(s) 2026-02-21T09:44:40.1558436Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 1.0 configs/s 2026-02-21T09:44:43.3287861Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 16.9 configs/s 2026-02-21T09:44:46.7268240Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 298.4 2026-02-21T09:44:46.7269551Z configs/s 2026-02-21T09:44:46.9578933Z [172s] Generation 6 complete: 2026-02-21T09:44:46.9583602Z ok=56 2026-02-21T09:44:46.9585011Z min=0.0307 2026-02-21T09:44:46.9585180Z mid=0.0327 2026-02-21T09:44:46.9585304Z max=0.6860 2026-02-21T09:44:46.9585459Z best={'block_sizes': [1, 8192], 2026-02-21T09:44:46.9585714Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:44:46.9585983Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:44:46.9586168Z 'num_stages': 5, 2026-02-21T09:44:46.9586311Z 'num_warps': 4, 2026-02-21T09:44:46.9586450Z 'pid_type': 'flat', 2026-02-21T09:44:46.9586614Z 'range_flattens': [None, False], 2026-02-21T09:44:46.9586807Z 'range_multi_buffers': [None, None], 2026-02-21T09:44:46.9586991Z 'range_num_stages': [0, 2], 2026-02-21T09:44:46.9587197Z 'range_unroll_factors': [0, 1], 2026-02-21T09:44:46.9587398Z 'range_warp_specializes': 
[None, False]} 2026-02-21T09:44:46.9597770Z [172s] Fitting surrogate: 507 points, 507 targets 2026-02-21T09:44:47.6952763Z [172s] Generation 7 starting: 35 neighbors, 3 active search path(s) 2026-02-21T09:44:53.3498715Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 5.2 configs/s 2026-02-21T09:44:55.4962466Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 17.1 configs/s 2026-02-21T09:44:57.8564599Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 428.3 2026-02-21T09:44:57.8566183Z configs/s 2026-02-21T09:44:58.0225651Z [183s] Generation 7 complete: 2026-02-21T09:44:58.0225898Z ok=38 2026-02-21T09:44:58.0230575Z min=0.0307 2026-02-21T09:44:58.0235142Z mid=0.0328 2026-02-21T09:44:58.0240186Z max=0.6861 2026-02-21T09:44:58.0241869Z best={'block_sizes': [1, 8192], 2026-02-21T09:44:58.0242181Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:44:58.0242804Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:44:58.0242994Z 'num_stages': 5, 2026-02-21T09:44:58.0243144Z 'num_warps': 4, 2026-02-21T09:44:58.0243287Z 'pid_type': 'flat', 2026-02-21T09:44:58.0243451Z 'range_flattens': [None, False], 2026-02-21T09:44:58.0243632Z 'range_multi_buffers': [None, False], 2026-02-21T09:44:58.0243822Z 'range_num_stages': [0, 3], 2026-02-21T09:44:58.0243987Z 'range_unroll_factors': [0, 1], 2026-02-21T09:44:58.0244173Z 'range_warp_specializes': [None, False]} 2026-02-21T09:44:58.0244398Z [183s] Fitting surrogate: 545 points, 545 targets 2026-02-21T09:44:58.4634572Z [183s] Generation 8 starting: 13 neighbors, 1 active search path(s) 2026-02-21T09:45:01.9451036Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 1.2 configs/s 2026-02-21T09:45:02.7334226Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 13/13 17.5 configs/s 2026-02-21T09:45:03.6063031Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1146.8 2026-02-21T09:45:03.6063875Z configs/s 
2026-02-21T09:45:03.6801198Z [188s] Generation 8 complete: 2026-02-21T09:45:03.6802708Z ok=14 2026-02-21T09:45:03.6802920Z min=0.0307 2026-02-21T09:45:03.6803096Z mid=0.0307 2026-02-21T09:45:03.6803329Z max=0.0656 2026-02-21T09:45:03.6807723Z best={'block_sizes': [1, 8192], 2026-02-21T09:45:03.6811206Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:45:03.6815135Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:45:03.6818996Z 'num_stages': 6, 2026-02-21T09:45:03.6823483Z 'num_warps': 4, 2026-02-21T09:45:03.6824842Z 'pid_type': 'flat', 2026-02-21T09:45:03.6825129Z 'range_flattens': [None, True], 2026-02-21T09:45:03.6825400Z 'range_multi_buffers': [None, False], 2026-02-21T09:45:03.6825635Z 'range_num_stages': [0, 3], 2026-02-21T09:45:03.6825892Z 'range_unroll_factors': [0, 1], 2026-02-21T09:45:03.6826138Z 'range_warp_specializes': [None, False]} 2026-02-21T09:45:03.6826532Z [188s] Fitting surrogate: 559 points, 559 targets 2026-02-21T09:45:04.0930037Z [189s] Generation 9 starting: 13 neighbors, 1 active search path(s) 2026-02-21T09:45:07.5811015Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 4.8 configs/s 2026-02-21T09:45:08.3675224Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 13/13 17.5 configs/s 2026-02-21T09:45:09.2723969Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1105.6 2026-02-21T09:45:09.2725532Z configs/s 2026-02-21T09:45:09.3432853Z [194s] Generation 9 complete: 2026-02-21T09:45:09.3434719Z ok=14 2026-02-21T09:45:09.3434958Z min=0.0307 2026-02-21T09:45:09.3435205Z mid=0.0307 2026-02-21T09:45:09.3440078Z max=0.0635 2026-02-21T09:45:09.3441530Z best={'block_sizes': [1, 8192], 2026-02-21T09:45:09.3442124Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:45:09.3442446Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:45:09.3442708Z 'num_stages': 6, 2026-02-21T09:45:09.3442891Z 'num_warps': 4, 2026-02-21T09:45:09.3443103Z 
'pid_type': 'flat', 2026-02-21T09:45:09.3443302Z 'range_flattens': [None, True], 2026-02-21T09:45:09.3443552Z 'range_multi_buffers': [None, False], 2026-02-21T09:45:09.3443779Z 'range_num_stages': [0, 3], 2026-02-21T09:45:09.3444018Z 'range_unroll_factors': [0, 1], 2026-02-21T09:45:09.3444264Z 'range_warp_specializes': [None, False]} 2026-02-21T09:45:09.3463491Z [194s] Fitting surrogate: 573 points, 573 targets 2026-02-21T09:45:09.6355063Z [194s] Autotuning complete in 194.7s after searching 552 configs. 2026-02-21T09:45:09.6357201Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:45:09.6358310Z @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], num_stages=6, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T09:45:09.6359509Z 2026-02-21T09:45:09.6359823Z [194s] Code of selected kernel: /tmp/torchinductor_root/xa/cxazsujyi2oj7h2x3luypkwqyn3moastc6gnnus24dsfv6e5nbnw.py 2026-02-21T09:45:09.6580873Z from __future__ import annotations 2026-02-21T09:45:09.6581206Z 2026-02-21T09:45:09.6581416Z import torch 2026-02-21T09:45:09.6581692Z import triton 2026-02-21T09:45:09.6581977Z import triton.language as tl 2026-02-21T09:45:09.6587056Z from torch._inductor.runtime import triton_helpers 2026-02-21T09:45:09.6587529Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T09:45:09.6587937Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:45:09.6588214Z 2026-02-21T09:45:09.6588379Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T09:45:09.6588938Z _BLOCK_SIZE_1 = tl.constexpr(8192) 2026-02-21T09:45:09.6589146Z 2026-02-21T09:45:09.6589323Z @triton.jit 2026-02-21T09:45:09.6589603Z def _helion_softmax_two_pass(x, out): 
2026-02-21T09:45:09.6589998Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:45:09.6590358Z pid_0 = tl.program_id(0) 2026-02-21T09:45:09.6590655Z offset_0 = pid_0 2026-02-21T09:45:09.6590895Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T09:45:09.6591311Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:45:09.6591792Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T09:45:09.6592124Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:45:09.6592461Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T09:45:09.6592772Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:45:09.6593101Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:45:09.6593438Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:45:09.6593718Z # src[softmax.py:82-89]: ... 2026-02-21T09:45:09.6594223Z for offset_2 in tl.range(0, 7424, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=False, num_stages=3, disallow_acc_multi_buffer=True, flatten=True): 2026-02-21T09:45:09.6594760Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:45:09.6595077Z mask_1 = indices_2 < 7424 2026-02-21T09:45:09.6595321Z mi_copy = mi 2026-02-21T09:45:09.6595520Z di_copy = di 2026-02-21T09:45:09.6595727Z mi_copy_0 = mi_copy 2026-02-21T09:45:09.6595912Z di_copy_0 = di_copy 2026-02-21T09:45:09.6596139Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:45:09.6596542Z values = tl.load(x + (indices_0[:, None] * 7424 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_last') 2026-02-21T09:45:09.6596994Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:45:09.6597468Z _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16)) 
2026-02-21T09:45:09.6597894Z local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16) 2026-02-21T09:45:09.6598216Z # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax) 2026-02-21T09:45:09.6598490Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T09:45:09.6598767Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T09:45:09.6599064Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:45:09.6599367Z v_2 = mi_copy_0 - v_1 2026-02-21T09:45:09.6599592Z v_3 = libdevice.exp(v_2) 2026-02-21T09:45:09.6599769Z v_4 = di_copy_0 * v_3 2026-02-21T09:45:09.6600021Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:45:09.6600262Z subscript = v_1[:, None] 2026-02-21T09:45:09.6600601Z v_5 = tl.cast(values, tl.float32) 2026-02-21T09:45:09.6600825Z v_6 = v_5 - subscript 2026-02-21T09:45:09.6601110Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:45:09.6601429Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:45:09.6601740Z # src[softmax.py:88]: ).sum(dim=1) 2026-02-21T09:45:09.6601988Z v_7 = libdevice.exp(v_6) 2026-02-21T09:45:09.6602347Z _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32)) 2026-02-21T09:45:09.6602773Z sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32) 2026-02-21T09:45:09.6603018Z di = v_4 + sum_1 2026-02-21T09:45:09.6603250Z # src[softmax.py:89]: mi = mi_next 2026-02-21T09:45:09.6603460Z mi = v_1 2026-02-21T09:45:09.6603726Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:45:09.6604132Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:45:09.6604468Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:45:09.6605035Z for offset_2 in tl.range(0, 7424, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=False, num_stages=3, disallow_acc_multi_buffer=True, flatten=True): 2026-02-21T09:45:09.6605535Z 
indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:45:09.6605837Z mask_2 = indices_2 < 7424 2026-02-21T09:45:09.6606074Z mi_copy_1 = mi 2026-02-21T09:45:09.6606264Z di_copy_1 = di 2026-02-21T09:45:09.6606484Z mi_copy_1_0 = mi_copy_1 2026-02-21T09:45:09.6606688Z di_copy_1_0 = di_copy_1 2026-02-21T09:45:09.6606948Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:45:09.6607351Z values_1 = tl.load(x + (indices_0[:, None] * 7424 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_last') 2026-02-21T09:45:09.6607853Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:45:09.6608172Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T09:45:09.6608421Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T09:45:09.6608660Z v_10 = v_9 - subscript_1 2026-02-21T09:45:09.6608869Z v_11 = libdevice.exp(v_10) 2026-02-21T09:45:09.6609116Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T09:45:09.6609336Z v_12 = v_11 / subscript_2 2026-02-21T09:45:09.6609567Z v_13 = tl.cast(v_12, tl.float16) 2026-02-21T09:45:09.6609883Z tl.store(out + (indices_0[:, None] * 7424 + indices_2[None, :] * 1), v_13, mask_2[None, :]) 2026-02-21T09:45:09.6610144Z 2026-02-21T09:45:09.6610289Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T09:45:09.6610590Z """ 2026-02-21T09:45:09.6610828Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T09:45:09.6611203Z This version uses fewer passes but is less numerically stable. 2026-02-21T09:45:09.6611461Z Args: 2026-02-21T09:45:09.6611731Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T09:45:09.6611959Z Returns: 2026-02-21T09:45:09.6612201Z torch.Tensor: Softmax output tensor of the same shape. 
2026-02-21T09:45:09.6612479Z """ 2026-02-21T09:45:09.6612664Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T09:45:09.6612920Z m, n = x.size() 2026-02-21T09:45:09.6613149Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T09:45:09.6613419Z out = torch.empty_like(x) 2026-02-21T09:45:09.6613684Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:45:09.6614054Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:45:09.6614429Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:45:09.6614699Z # src[softmax.py:79-92]: ... 2026-02-21T09:45:09.6615076Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=6) 2026-02-21T09:45:09.6615417Z # src[softmax.py:93]: return out 2026-02-21T09:45:09.6615672Z return out 2026-02-21T09:45:10.1774765Z WARNING:tritonbench.utils.triton_op:Completed input ID 56: 2026-02-21T09:45:10.1779071Z (M, N) 2026-02-21T09:45:10.1782995Z ------------ 2026-02-21T09:45:10.1787582Z (4096, 7424) 2026-02-21T09:45:10.1789295Z 2026-02-21T09:45:10.1789890Z 60%|██████ | 12/20 [36:15<27:13, 204.22s/it]WARNING:tritonbench.utils.triton_op:Running input ID 61: 2026-02-21T09:45:10.1790239Z (M, N) 2026-02-21T09:45:10.1790439Z ------------ 2026-02-21T09:45:10.1790618Z (4096, 8064) 2026-02-21T09:45:10.1790995Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T09:45:11.3456929Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T09:45:12.8374712Z INFO:tritonbench.utils.triton_op:Took 2.29ms to get benchmark function for torch_compile_softmax 2026-02-21T09:45:14.1619411Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:45:14.1623440Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:45:14.1624907Z 'dtype': 'torch.float16', 2026-02-21T09:45:14.1625221Z 'shape': (4096, 8064), 2026-02-21T09:45:14.1625457Z 'stride': (8064, 1)},), 
2026-02-21T09:45:14.1625712Z 'kwargs': {}} 2026-02-21T09:45:14.1640845Z INFO:tritonbench.utils.triton_op:Took 2.34ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T09:45:14.3378789Z [0s] Autotune random seed: 2138408546 2026-02-21T09:45:14.3630531Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:45:50.8322637Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s 2026-02-21T09:45:59.4848212Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 11.5 configs/s 2026-02-21T09:45:59.4856932Z [45s] Adaptive compile timeout: 30s (90% percentile=11.2s, bounds=[30.0s, 30s]) 2026-02-21T09:46:00.0609302Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1695.9 configs/s 2026-02-21T09:46:00.1126530Z [45s] Initial random population of 100, 5 starting points: 2026-02-21T09:46:00.1130790Z error=12 2026-02-21T09:46:00.1132384Z ok=88 2026-02-21T09:46:00.1132692Z min=0.0451 2026-02-21T09:46:00.1137462Z mid=0.6431 2026-02-21T09:46:00.1138799Z max=187.0090 2026-02-21T09:46:00.1139072Z best={'block_sizes': [1, 8192], 2026-02-21T09:46:00.1139494Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:46:00.1144679Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:46:00.1146068Z 'num_sm_multiplier': 2, 2026-02-21T09:46:00.1146352Z 'num_stages': 7, 2026-02-21T09:46:00.1146543Z 'num_warps': 32, 2026-02-21T09:46:00.1146796Z 'pid_type': 'persistent_blocked', 2026-02-21T09:46:00.1147029Z 'range_flattens': [True, True], 2026-02-21T09:46:00.1147308Z 'range_multi_buffers': [False, None], 2026-02-21T09:46:00.1147556Z 'range_num_stages': [4, 3], 2026-02-21T09:46:00.1147793Z 'range_unroll_factors': [2, 3], 2026-02-21T09:46:00.1148040Z 'range_warp_specializes': [False, False]} 2026-02-21T09:46:00.1148384Z [45s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:46:01.1359140Z [46s] 
Generation 1 starting: 76 neighbors, 5 active search path(s) 2026-02-21T09:46:21.2927857Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 1.4 configs/s 2026-02-21T09:46:26.0370386Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 16.8 configs/s 2026-02-21T09:46:29.0537718Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 334.6 2026-02-21T09:46:29.0539450Z configs/s 2026-02-21T09:46:29.2292580Z [74s] Generation 1 complete: 2026-02-21T09:46:29.2296944Z ok=82 2026-02-21T09:46:29.2301303Z min=0.0328 2026-02-21T09:46:29.2306530Z mid=0.0553 2026-02-21T09:46:29.2310822Z max=0.4742 2026-02-21T09:46:29.2314123Z best={'block_sizes': [1, 8192], 2026-02-21T09:46:29.2316142Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:46:29.2316583Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:46:29.2316884Z 'num_sm_multiplier': 4, 2026-02-21T09:46:29.2320787Z 'num_stages': 7, 2026-02-21T09:46:29.2324535Z 'num_warps': 8, 2026-02-21T09:46:29.2329591Z 'pid_type': 'persistent_blocked', 2026-02-21T09:46:29.2332888Z 'range_flattens': [False, True], 2026-02-21T09:46:29.2336681Z 'range_multi_buffers': [False, None], 2026-02-21T09:46:29.2340970Z 'range_num_stages': [4, 3], 2026-02-21T09:46:29.2345449Z 'range_unroll_factors': [2, 3], 2026-02-21T09:46:29.2350398Z 'range_warp_specializes': [False, False]} 2026-02-21T09:46:29.2354720Z [74s] Fitting surrogate: 182 points, 182 targets 2026-02-21T09:46:30.2292022Z [75s] Generation 2 starting: 78 neighbors, 5 active search path(s) 2026-02-21T09:46:42.5202419Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 3.2 configs/s 2026-02-21T09:46:47.4642207Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 83/83 16.9 configs/s 2026-02-21T09:46:52.6621297Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 217.2 2026-02-21T09:46:52.6623958Z configs/s 2026-02-21T09:46:52.9411195Z [98s] Generation 2 complete: 
2026-02-21T09:46:52.9415365Z ok=84 2026-02-21T09:46:52.9419783Z min=0.0328 2026-02-21T09:46:52.9424830Z mid=0.0451 2026-02-21T09:46:52.9426356Z max=0.2847 2026-02-21T09:46:52.9426580Z best={'block_sizes': [1, 8192], 2026-02-21T09:46:52.9426909Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:46:52.9427227Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:46:52.9427491Z 'num_stages': 1, 2026-02-21T09:46:52.9427707Z 'num_warps': 8, 2026-02-21T09:46:52.9427892Z 'pid_type': 'flat', 2026-02-21T09:46:52.9428160Z 'range_flattens': [None, True], 2026-02-21T09:46:52.9428406Z 'range_multi_buffers': [None, False], 2026-02-21T09:46:52.9428660Z 'range_num_stages': [0, 4], 2026-02-21T09:46:52.9428868Z 'range_unroll_factors': [0, 0], 2026-02-21T09:46:52.9429118Z 'range_warp_specializes': [None, True]} 2026-02-21T09:46:52.9429369Z [98s] Fitting surrogate: 266 points, 266 targets 2026-02-21T09:46:53.9876345Z [99s] Generation 3 starting: 74 neighbors, 5 active search path(s) 2026-02-21T09:47:06.3188125Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 2.0 configs/s 2026-02-21T09:47:10.7616048Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.8 configs/s 2026-02-21T09:47:15.9740401Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 194.5 2026-02-21T09:47:15.9741016Z configs/s 2026-02-21T09:47:16.2992423Z [121s] Generation 3 complete: 2026-02-21T09:47:16.2996554Z ok=79 2026-02-21T09:47:16.3000047Z min=0.0307 2026-02-21T09:47:16.3001673Z mid=0.0369 2026-02-21T09:47:16.3001974Z max=0.4690 2026-02-21T09:47:16.3002198Z best={'block_sizes': [1, 8192], 2026-02-21T09:47:16.3002526Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:47:16.3007140Z 'load_eviction_policies': ['', ''], 2026-02-21T09:47:16.3009127Z 'num_stages': 1, 2026-02-21T09:47:16.3009393Z 'num_warps': 4, 2026-02-21T09:47:16.3009587Z 'pid_type': 'flat', 2026-02-21T09:47:16.3009821Z 'range_flattens': [None, 
True], 2026-02-21T09:47:16.3010015Z 'range_multi_buffers': [None, None], 2026-02-21T09:47:16.3010269Z 'range_num_stages': [0, 4], 2026-02-21T09:47:16.3010464Z 'range_unroll_factors': [0, 1], 2026-02-21T09:47:16.3010712Z 'range_warp_specializes': [None, False]} 2026-02-21T09:47:16.3011073Z [121s] Fitting surrogate: 345 points, 345 targets 2026-02-21T09:47:17.2920750Z [122s] Generation 4 starting: 73 neighbors, 5 active search path(s) 2026-02-21T09:47:26.7697037Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 19.8 configs/s 2026-02-21T09:47:31.2046305Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.8 configs/s 2026-02-21T09:47:37.1531398Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 188.9 2026-02-21T09:47:37.1535590Z configs/s 2026-02-21T09:47:37.5020541Z [143s] Generation 4 complete: 2026-02-21T09:47:37.5024652Z ok=78 2026-02-21T09:47:37.5026093Z min=0.0307 2026-02-21T09:47:37.5026336Z mid=0.0328 2026-02-21T09:47:37.5026500Z max=0.1352 2026-02-21T09:47:37.5026681Z best={'block_sizes': [1, 8192], 2026-02-21T09:47:37.5026942Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:47:37.5027249Z 'load_eviction_policies': ['', ''], 2026-02-21T09:47:37.5027494Z 'num_stages': 1, 2026-02-21T09:47:37.5027676Z 'num_warps': 4, 2026-02-21T09:47:37.5027886Z 'pid_type': 'flat', 2026-02-21T09:47:37.5028089Z 'range_flattens': [None, True], 2026-02-21T09:47:37.5028763Z 'range_multi_buffers': [None, None], 2026-02-21T09:47:37.5029027Z 'range_num_stages': [0, 4], 2026-02-21T09:47:37.5029264Z 'range_unroll_factors': [0, 1], 2026-02-21T09:47:37.5029489Z 'range_warp_specializes': [None, False]} 2026-02-21T09:47:37.5038644Z [143s] Fitting surrogate: 423 points, 423 targets 2026-02-21T09:47:38.1474305Z [143s] Generation 5 starting: 41 neighbors, 3 active search path(s) 2026-02-21T09:47:43.9479220Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41/41 4.9 configs/s 2026-02-21T09:47:46.4118104Z 
Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 16.9 configs/s 2026-02-21T09:47:49.5068546Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 327.0 2026-02-21T09:47:49.5072655Z configs/s 2026-02-21T09:47:49.7137354Z [155s] Generation 5 complete: 2026-02-21T09:47:49.7142239Z ok=45 2026-02-21T09:47:49.7147109Z min=0.0307 2026-02-21T09:47:49.7148375Z mid=0.0328 2026-02-21T09:47:49.7148628Z max=0.0615 2026-02-21T09:47:49.7148817Z best={'block_sizes': [1, 8192], 2026-02-21T09:47:49.7149130Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:47:49.7149438Z 'load_eviction_policies': ['', ''], 2026-02-21T09:47:49.7149659Z 'num_stages': 1, 2026-02-21T09:47:49.7149872Z 'num_warps': 4, 2026-02-21T09:47:49.7150059Z 'pid_type': 'flat', 2026-02-21T09:47:49.7150255Z 'range_flattens': [None, True], 2026-02-21T09:47:49.7150477Z 'range_multi_buffers': [None, None], 2026-02-21T09:47:49.7150735Z 'range_num_stages': [0, 4], 2026-02-21T09:47:49.7150947Z 'range_unroll_factors': [0, 1], 2026-02-21T09:47:49.7151204Z 'range_warp_specializes': [None, False]} 2026-02-21T09:47:49.7157243Z [155s] Fitting surrogate: 468 points, 468 targets 2026-02-21T09:47:50.1138136Z [155s] Generation 6 starting: 17 neighbors, 2 active search path(s) 2026-02-21T09:47:52.4098451Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 16.8 configs/s 2026-02-21T09:47:53.4093172Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 17.8 configs/s 2026-02-21T09:47:54.9233788Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 665.1 2026-02-21T09:47:54.9235215Z configs/s 2026-02-21T09:47:55.0373559Z [160s] Generation 6 complete: 2026-02-21T09:47:55.0377980Z ok=20 2026-02-21T09:47:55.0382989Z min=0.0307 2026-02-21T09:47:55.0387389Z mid=0.0328 2026-02-21T09:47:55.0391310Z max=0.0460 2026-02-21T09:47:55.0395726Z best={'block_sizes': [1, 8192], 2026-02-21T09:47:55.0400178Z 'indexing': ['tensor_descriptor', 'pointer', 
'tensor_descriptor'], 2026-02-21T09:47:55.0403957Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:47:55.0404342Z 'num_stages': 2, 2026-02-21T09:47:55.0404588Z 'num_warps': 4, 2026-02-21T09:47:55.0408585Z 'pid_type': 'flat', 2026-02-21T09:47:55.0412901Z 'range_flattens': [None, True], 2026-02-21T09:47:55.0417634Z 'range_multi_buffers': [None, None], 2026-02-21T09:47:55.0421776Z 'range_num_stages': [0, 4], 2026-02-21T09:47:55.0426306Z 'range_unroll_factors': [0, 0], 2026-02-21T09:47:55.0428495Z 'range_warp_specializes': [None, True]} 2026-02-21T09:47:55.0428884Z [160s] Fitting surrogate: 488 points, 488 targets 2026-02-21T09:47:55.2055290Z [160s] Autotuning complete in 160.8s after searching 468 configs. 2026-02-21T09:47:55.2059114Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:47:55.2060409Z @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T09:47:55.2061366Z 2026-02-21T09:47:55.2061929Z [160s] Code of selected kernel: /tmp/torchinductor_root/qj/cqjadztvwdqy7k5dxsiy766h4h5udvc56cnlx2p24djkf67drx4o.py 2026-02-21T09:47:55.2289794Z from __future__ import annotations 2026-02-21T09:47:55.2293437Z 2026-02-21T09:47:55.2294910Z import torch 2026-02-21T09:47:55.2295162Z import triton 2026-02-21T09:47:55.2295390Z import triton.language as tl 2026-02-21T09:47:55.2295648Z from torch._inductor.runtime import triton_helpers 2026-02-21T09:47:55.2295990Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T09:47:55.2296323Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:47:55.2296552Z 2026-02-21T09:47:55.2296643Z _BLOCK_SIZE_0 = tl.constexpr(1) 
2026-02-21T09:47:55.2296863Z _BLOCK_SIZE_1 = tl.constexpr(8192) 2026-02-21T09:47:55.2297025Z 2026-02-21T09:47:55.2297104Z @triton.jit 2026-02-21T09:47:55.2297319Z def _helion_softmax_two_pass(x, out): 2026-02-21T09:47:55.2297611Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:47:55.2297918Z pid_0 = tl.program_id(0) 2026-02-21T09:47:55.2298219Z offset_0 = pid_0 2026-02-21T09:47:55.2298498Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T09:47:55.2304798Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:47:55.2309272Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T09:47:55.2311244Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:47:55.2311664Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T09:47:55.2312004Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:47:55.2312318Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:47:55.2312639Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:47:55.2312932Z # src[softmax.py:82-89]: ... 
2026-02-21T09:47:55.2313267Z for offset_2 in tl.range(0, 8064, _BLOCK_SIZE_1, warp_specialize=True, num_stages=4, flatten=True): 2026-02-21T09:47:55.2313700Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:47:55.2314277Z mask_1 = indices_2 < 8064 2026-02-21T09:47:55.2314517Z mi_copy = mi 2026-02-21T09:47:55.2314702Z di_copy = di 2026-02-21T09:47:55.2314914Z mi_copy_0 = mi_copy 2026-02-21T09:47:55.2315110Z di_copy_0 = di_copy 2026-02-21T09:47:55.2315363Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:47:55.2315804Z values = tl.load(x + (indices_0[:, None] * 8064 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first') 2026-02-21T09:47:55.2316232Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:47:55.2316706Z _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16)) 2026-02-21T09:47:55.2317133Z local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16) 2026-02-21T09:47:55.2317527Z # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax) 2026-02-21T09:47:55.2317825Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T09:47:55.2318076Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T09:47:55.2318391Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:47:55.2318673Z v_2 = mi_copy_0 - v_1 2026-02-21T09:47:55.2318916Z v_3 = libdevice.exp(v_2) 2026-02-21T09:47:55.2319126Z v_4 = di_copy_0 * v_3 2026-02-21T09:47:55.2319387Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:47:55.2319626Z subscript = v_1[:, None] 2026-02-21T09:47:55.2319866Z v_5 = tl.cast(values, tl.float32) 2026-02-21T09:47:55.2320107Z v_6 = v_5 - subscript 2026-02-21T09:47:55.2320356Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:47:55.2320679Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:47:55.2320930Z # src[softmax.py:88]: 
).sum(dim=1) 2026-02-21T09:47:55.2321204Z v_7 = libdevice.exp(v_6) 2026-02-21T09:47:55.2321638Z _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32)) 2026-02-21T09:47:55.2322034Z sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32) 2026-02-21T09:47:55.2322306Z di = v_4 + sum_1 2026-02-21T09:47:55.2322509Z # src[softmax.py:89]: mi = mi_next 2026-02-21T09:47:55.2322757Z mi = v_1 2026-02-21T09:47:55.2323001Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:47:55.2323340Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:47:55.2323701Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:47:55.2324127Z for offset_2 in tl.range(0, 8064, _BLOCK_SIZE_1, warp_specialize=True, num_stages=4, flatten=True): 2026-02-21T09:47:55.2324543Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:47:55.2324819Z mask_2 = indices_2 < 8064 2026-02-21T09:47:55.2325056Z mi_copy_1 = mi 2026-02-21T09:47:55.2325243Z di_copy_1 = di 2026-02-21T09:47:55.2325464Z mi_copy_1_0 = mi_copy_1 2026-02-21T09:47:55.2325669Z di_copy_1_0 = di_copy_1 2026-02-21T09:47:55.2325922Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:47:55.2326355Z values_1 = tl.load(x + (indices_0[:, None] * 8064 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first') 2026-02-21T09:47:55.2326827Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:47:55.2327168Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T09:47:55.2327400Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T09:47:55.2327646Z v_10 = v_9 - subscript_1 2026-02-21T09:47:55.2327851Z v_11 = libdevice.exp(v_10) 2026-02-21T09:47:55.2328092Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T09:47:55.2328336Z v_12 = v_11 / subscript_2 2026-02-21T09:47:55.2328631Z v_13 = tl.cast(v_12, tl.float16) 
2026-02-21T09:47:55.2328974Z tl.store(out + (indices_0[:, None] * 8064 + indices_2[None, :] * 1), v_13, mask_2[None, :]) 2026-02-21T09:47:55.2329209Z 2026-02-21T09:47:55.2329359Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T09:47:55.2329661Z """ 2026-02-21T09:47:55.2329906Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T09:47:55.2330280Z This version uses fewer passes but is less numerically stable. 2026-02-21T09:47:55.2330569Z Args: 2026-02-21T09:47:55.2330766Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T09:47:55.2331026Z Returns: 2026-02-21T09:47:55.2331241Z torch.Tensor: Softmax output tensor of the same shape. 2026-02-21T09:47:55.2331512Z """ 2026-02-21T09:47:55.2331727Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T09:47:55.2331967Z m, n = x.size() 2026-02-21T09:47:55.2332227Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T09:47:55.2332504Z out = torch.empty_like(x) 2026-02-21T09:47:55.2332799Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:47:55.2333170Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:47:55.2333552Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:47:55.2333837Z # src[softmax.py:79-92]: ... 
2026-02-21T09:47:55.2334173Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=2) 2026-02-21T09:47:55.2334503Z # src[softmax.py:93]: return out 2026-02-21T09:47:55.2334749Z return out 2026-02-21T09:47:56.3013641Z WARNING:tritonbench.utils.triton_op:Completed input ID 61: 2026-02-21T09:47:56.3018060Z (M, N) 2026-02-21T09:47:56.3019392Z ------------ 2026-02-21T09:47:56.3019651Z (4096, 8064) 2026-02-21T09:47:56.3019823Z 2026-02-21T09:47:56.3025633Z 65%|██████▌ | 13/20 [39:01<22:28, 192.68s/it]WARNING:tritonbench.utils.triton_op:Running input ID 66: 2026-02-21T09:47:56.3026423Z (M, N) 2026-02-21T09:47:56.3026618Z ------------ 2026-02-21T09:47:56.3026876Z (4096, 8704) 2026-02-21T09:47:56.3027215Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for naive_softmax 2026-02-21T09:47:57.4606067Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T09:47:58.7412181Z INFO:tritonbench.utils.triton_op:Took 2.40ms to get benchmark function for torch_compile_softmax 2026-02-21T09:48:00.0113709Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:48:00.0117874Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:48:00.0122277Z 'dtype': 'torch.float16', 2026-02-21T09:48:00.0126570Z 'shape': (4096, 8704), 2026-02-21T09:48:00.0127990Z 'stride': (8704, 1)},), 2026-02-21T09:48:00.0128300Z 'kwargs': {}} 2026-02-21T09:48:00.0134955Z INFO:tritonbench.utils.triton_op:Took 2.33ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T09:48:00.1880129Z [0s] Autotune random seed: 2138408546 2026-02-21T09:48:00.2132384Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:48:35.5266458Z [35s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', ''], maxnreg=32, num_sm_multiplier=2, num_stages=8, num_warps=2, 
pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[3, 2], range_unroll_factors=[4, 1], range_warp_specializes=[False, False]) 2026-02-21T09:48:37.9747102Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s 2026-02-21T09:48:46.7199986Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 11.3 configs/s 2026-02-21T09:48:46.7212577Z [46s] Adaptive compile timeout: 30s (90% percentile=12.3s, bounds=[30.0s, 30s]) 2026-02-21T09:48:47.9035033Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 836.4 configs/s 2026-02-21T09:48:47.9779976Z [47s] Initial random population of 100, 5 starting points: 2026-02-21T09:48:47.9783216Z error=11 2026-02-21T09:48:47.9787666Z timeout=1 2026-02-21T09:48:47.9791514Z ok=88 2026-02-21T09:48:47.9793255Z min=0.0635 2026-02-21T09:48:47.9793495Z mid=0.4302 2026-02-21T09:48:47.9796574Z max=203.7668 2026-02-21T09:48:47.9796804Z best={'block_sizes': [1, 1024], 2026-02-21T09:48:47.9797156Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:48:47.9797443Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T09:48:47.9798819Z 'num_stages': 6, 2026-02-21T09:48:47.9799029Z 'num_warps': 4, 2026-02-21T09:48:47.9799235Z 'pid_type': 'flat', 2026-02-21T09:48:47.9804148Z 'range_flattens': [None, None], 2026-02-21T09:48:47.9807099Z 'range_multi_buffers': [None, True], 2026-02-21T09:48:47.9810978Z 'range_num_stages': [0, 0], 2026-02-21T09:48:47.9815092Z 'range_unroll_factors': [0, 4], 2026-02-21T09:48:47.9819179Z 'range_warp_specializes': [None, False]} 2026-02-21T09:48:47.9823365Z [47s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:48:49.7116290Z [49s] Generation 1 starting: 84 neighbors, 5 active search path(s) 2026-02-21T09:49:02.3013196Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 11.3 configs/s 2026-02-21T09:49:07.3687946Z Generation 1: exploring neighbors 100% 
━━━━━━━━━━━━━━━━━━━━ 86/86 17.1 configs/s 2026-02-21T09:49:12.8999140Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 182.6 2026-02-21T09:49:12.9000032Z configs/s 2026-02-21T09:49:13.1771114Z [72s] Generation 1 complete: 2026-02-21T09:49:13.1772571Z ok=89 2026-02-21T09:49:13.1772797Z min=0.0492 2026-02-21T09:49:13.1773011Z mid=0.0676 2026-02-21T09:49:13.1773199Z max=0.4833 2026-02-21T09:49:13.1773408Z best={'block_sizes': [2, 512], 2026-02-21T09:49:13.1773791Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:49:13.1774146Z 'load_eviction_policies': ['first', ''], 2026-02-21T09:49:13.1774369Z 'num_stages': 3, 2026-02-21T09:49:13.1774576Z 'num_warps': 1, 2026-02-21T09:49:13.1774763Z 'pid_type': 'flat', 2026-02-21T09:49:13.1774991Z 'range_flattens': [None, False], 2026-02-21T09:49:13.1775218Z 'range_multi_buffers': [None, False], 2026-02-21T09:49:13.1775472Z 'range_num_stages': [0, 3], 2026-02-21T09:49:13.1775707Z 'range_unroll_factors': [0, 1], 2026-02-21T09:49:13.1775926Z 'range_warp_specializes': [None, False]} 2026-02-21T09:49:13.1787946Z [72s] Fitting surrogate: 189 points, 189 targets 2026-02-21T09:49:14.4563931Z [74s] Generation 2 starting: 75 neighbors, 5 active search path(s) 2026-02-21T09:49:27.0144879Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 5.2 configs/s 2026-02-21T09:49:31.5472950Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 17.2 configs/s 2026-02-21T09:49:38.8004899Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 139.4 2026-02-21T09:49:38.8005416Z configs/s 2026-02-21T09:49:39.1875538Z [98s] Generation 2 complete: 2026-02-21T09:49:39.1880487Z ok=80 2026-02-21T09:49:39.1884183Z min=0.0451 2026-02-21T09:49:39.1887329Z mid=0.0572 2026-02-21T09:49:39.1891902Z max=0.1965 2026-02-21T09:49:39.1895181Z best={'block_sizes': [1, 512], 2026-02-21T09:49:39.1899538Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 
'tensor_descriptor'], 2026-02-21T09:49:39.1903280Z 'load_eviction_policies': ['first', ''], 2026-02-21T09:49:39.1903638Z 'num_stages': 2, 2026-02-21T09:49:39.1907171Z 'num_warps': 1, 2026-02-21T09:49:39.1909214Z 'pid_type': 'flat', 2026-02-21T09:49:39.1909474Z 'range_flattens': [None, False], 2026-02-21T09:49:39.1909796Z 'range_multi_buffers': [None, False], 2026-02-21T09:49:39.1910031Z 'range_num_stages': [0, 3], 2026-02-21T09:49:39.1912959Z 'range_unroll_factors': [0, 1], 2026-02-21T09:49:39.1915468Z 'range_warp_specializes': [None, False]} 2026-02-21T09:49:39.1915761Z [98s] Fitting surrogate: 269 points, 269 targets 2026-02-21T09:49:40.1649386Z [99s] Generation 3 starting: 68 neighbors, 5 active search path(s) 2026-02-21T09:50:14.0756934Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 0.2 configs/s 2026-02-21T09:50:15.7744356Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T09:50:15.7745294Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:50:15.7751272Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:50:15.7751817Z %cst = arith.constant dense<0.000000e+00> : tensor<8x2048xf16> 2026-02-21T09:50:15.7752159Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T09:50:15.7752468Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:50:15.7753080Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:50:15.7753463Z %cst_0 = arith.constant dense<8704> : tensor<8x1xi32> 2026-02-21T09:50:15.7753885Z %cst_1 = arith.constant dense<0.000000e+00> : tensor<8x2048xf32> 2026-02-21T09:50:15.7754243Z %cst_2 = arith.constant dense<0xFC00> : tensor<8x2048xf16> 2026-02-21T09:50:15.7754608Z %cst_3 = arith.constant dense<8704> : tensor<2048xi32> 2026-02-21T09:50:15.7754944Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32> 2026-02-21T09:50:15.7755314Z %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32> 
2026-02-21T09:50:15.7755636Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:50:15.7755944Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:50:15.7756227Z %c8704_i32 = arith.constant 8704 : i32 2026-02-21T09:50:15.7756544Z %c8704_i64 = arith.constant 8704 : i64 2026-02-21T09:50:15.7756858Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:50:15.7757287Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8704_i32], [%c8704_i64, %c1_i64] : , > 2026-02-21T09:50:15.7757740Z %1 = tt.get_program_id x : i32 2026-02-21T09:50:15.7757996Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T09:50:15.7758282Z %3 = arith.minsi %2, %c512_i32 : i32 2026-02-21T09:50:15.7758555Z scf.for %arg2 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T09:50:15.7758868Z %4 = arith.muli %arg2, %c8_i32 : i32 2026-02-21T09:50:15.7759215Z %5 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:50:15.7759563Z %6 = tt.splat %4 : i32 -> tensor<8xi32> 2026-02-21T09:50:15.7759875Z %7 = arith.addi %6, %5 : tensor<8xi32> 2026-02-21T09:50:15.7760150Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T09:50:15.7760470Z %c4096_i32_6 = arith.constant 4096 : i32 2026-02-21T09:50:15.7760989Z %8:2 = scf.for %arg3 = %c0_i32 to %c8192_i32 step %c4096_i32_6 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<8xf32>, tensor<8xf32>) : i32 { 2026-02-21T09:50:15.7761657Z %62 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32> 2026-02-21T09:50:15.7762078Z %63 = tt.splat %arg3 : i32 -> tensor<2048xi32> 2026-02-21T09:50:15.7762409Z %64 = arith.addi %63, %62 : tensor<2048xi32> 2026-02-21T09:50:15.7762782Z %65 = arith.cmpi slt, %64, %cst_3 : tensor<2048xi32> 2026-02-21T09:50:15.7763194Z %66 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc> -> tensor<8x2048xf16> 2026-02-21T09:50:15.7763680Z %67 = tt.expand_dims %65 {axis = 0 : i32} : tensor<2048xi1> -> tensor<1x2048xi1> 2026-02-21T09:50:15.7764130Z %68 = tt.broadcast %67 : tensor<1x2048xi1> -> tensor<8x2048xi1> 
2026-02-21T09:50:15.7764542Z %69 = arith.select %68, %66, %cst_2 : tensor<8x2048xi1>, tensor<8x2048xf16> 2026-02-21T09:50:15.7764967Z %70 = arith.extf %69 : tensor<8x2048xf16> to tensor<8x2048xf32> 2026-02-21T09:50:15.7765302Z %71 = "tt.reduce"(%70) <{axis = 1 : i32}> ({ 2026-02-21T09:50:15.7765624Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:50:15.7766058Z %118 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:50:15.7766377Z tt.reduce.return %118 : f32 2026-02-21T09:50:15.7766649Z }) : (tensor<8x2048xf32>) -> tensor<8xf32> 2026-02-21T09:50:15.7766993Z %72 = arith.truncf %71 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:50:15.7767350Z %73 = arith.extf %72 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:50:15.7767665Z %74 = arith.cmpf ogt, %arg4, %73 : tensor<8xf32> 2026-02-21T09:50:15.7768082Z %75 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32> 2026-02-21T09:50:15.7768420Z %76 = arith.ori %74, %75 : tensor<8xi1> 2026-02-21T09:50:15.7768756Z %77 = arith.select %76, %arg4, %73 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:50:15.7769123Z %78 = arith.subf %arg4, %77 : tensor<8xf32> 2026-02-21T09:50:15.7769735Z %79 = tt.extern_elementwise %78 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:50:15.7770230Z %80 = arith.mulf %arg5, %79 : tensor<8xf32> 2026-02-21T09:50:15.7770621Z %81 = tt.expand_dims %77 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:50:15.7771011Z %82 = arith.extf %66 : tensor<8x2048xf16> to tensor<8x2048xf32> 2026-02-21T09:50:15.7771400Z %83 = tt.broadcast %81 : tensor<8x1xf32> -> tensor<8x2048xf32> 2026-02-21T09:50:15.7771831Z %84 = arith.subf %82, %83 : tensor<8x2048xf32> 2026-02-21T09:50:15.7772341Z %85 = tt.extern_elementwise %84 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x2048xf32>) -> tensor<8x2048xf32> 2026-02-21T09:50:15.7772931Z %86 = arith.select %68, %85, %cst_1 : tensor<8x2048xi1>, tensor<8x2048xf32> 2026-02-21T09:50:15.7773300Z %87 
= "tt.reduce"(%86) <{axis = 1 : i32}> ({ 2026-02-21T09:50:15.7773608Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:50:15.7773875Z %118 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:50:15.7774183Z tt.reduce.return %118 : f32 2026-02-21T09:50:15.7774496Z }) : (tensor<8x2048xf32>) -> tensor<8xf32> 2026-02-21T09:50:15.7774780Z %88 = arith.addf %80, %87 : tensor<8xf32> 2026-02-21T09:50:15.7775090Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:50:15.7775367Z %89 = arith.muli %c2048_i32, %c1_i32_9 : i32 2026-02-21T09:50:15.7775671Z %90 = arith.addi %arg3, %89 : i32 2026-02-21T09:50:15.7776002Z %91 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32> 2026-02-21T09:50:15.7776382Z %92 = tt.splat %90 : i32 -> tensor<2048xi32> 2026-02-21T09:50:15.7776663Z %93 = arith.addi %92, %91 : tensor<2048xi32> 2026-02-21T09:50:15.7777018Z %94 = arith.cmpi slt, %93, %cst_3 : tensor<2048xi32> 2026-02-21T09:50:15.7777461Z %95 = tt.descriptor_load %0[%4, %90] : !tt.tensordesc> -> tensor<8x2048xf16> 2026-02-21T09:50:15.7777920Z %96 = tt.expand_dims %94 {axis = 0 : i32} : tensor<2048xi1> -> tensor<1x2048xi1> 2026-02-21T09:50:15.7778338Z %97 = tt.broadcast %96 : tensor<1x2048xi1> -> tensor<8x2048xi1> 2026-02-21T09:50:15.7778703Z %98 = arith.select %97, %95, %cst_2 : tensor<8x2048xi1>, tensor<8x2048xf16> 2026-02-21T09:50:15.7779101Z %99 = arith.extf %98 : tensor<8x2048xf16> to tensor<8x2048xf32> 2026-02-21T09:50:15.7779449Z %100 = "tt.reduce"(%99) <{axis = 1 : i32}> ({ 2026-02-21T09:50:15.7779724Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:50:15.7780017Z %118 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:50:15.7780285Z tt.reduce.return %118 : f32 2026-02-21T09:50:15.7780573Z }) : (tensor<8x2048xf32>) -> tensor<8xf32> 2026-02-21T09:50:15.7780887Z %101 = arith.truncf %100 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:50:15.7781245Z %102 = arith.extf %101 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:50:15.7781625Z %103 = arith.cmpf ogt, %77, %102 : tensor<8xf32> 
2026-02-21T09:50:15.7781926Z %104 = arith.cmpf une, %77, %77 : tensor<8xf32> 2026-02-21T09:50:15.7782323Z %105 = arith.ori %103, %104 : tensor<8xi1> 2026-02-21T09:50:15.7782640Z %106 = arith.select %105, %77, %102 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:50:15.7782995Z %107 = arith.subf %77, %106 : tensor<8xf32> 2026-02-21T09:50:15.7783459Z %108 = tt.extern_elementwise %107 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:50:15.7783959Z %109 = arith.mulf %88, %108 : tensor<8xf32> 2026-02-21T09:50:15.7784288Z %110 = tt.expand_dims %106 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:50:15.7784675Z %111 = arith.extf %95 : tensor<8x2048xf16> to tensor<8x2048xf32> 2026-02-21T09:50:15.7785055Z %112 = tt.broadcast %110 : tensor<8x1xf32> -> tensor<8x2048xf32> 2026-02-21T09:50:15.7785386Z %113 = arith.subf %111, %112 : tensor<8x2048xf32> 2026-02-21T09:50:15.7785960Z %114 = tt.extern_elementwise %113 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x2048xf32>) -> tensor<8x2048xf32> 2026-02-21T09:50:15.7786520Z %115 = arith.select %97, %114, %cst_1 : tensor<8x2048xi1>, tensor<8x2048xf32> 2026-02-21T09:50:15.7786869Z %116 = "tt.reduce"(%115) <{axis = 1 : i32}> ({ 2026-02-21T09:50:15.7787167Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:50:15.7787424Z %118 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:50:15.7787721Z tt.reduce.return %118 : f32 2026-02-21T09:50:15.7787983Z }) : (tensor<8x2048xf32>) -> tensor<8xf32> 2026-02-21T09:50:15.7788295Z %117 = arith.addf %109, %116 : tensor<8xf32> 2026-02-21T09:50:15.7788594Z scf.yield %106, %117 : tensor<8xf32>, tensor<8xf32> 2026-02-21T09:50:15.7788920Z } {tt.num_stages = 1 : i32} 2026-02-21T09:50:15.7789264Z %9 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32> 2026-02-21T09:50:15.7789620Z %10 = tt.splat %c8192_i32 : i32 -> tensor<2048xi32> 2026-02-21T09:50:15.7789951Z %11 = arith.addi %10, %9 : 
tensor<2048xi32> 2026-02-21T09:50:15.7790249Z %12 = arith.cmpi slt, %11, %cst_3 : tensor<2048xi32> 2026-02-21T09:50:15.7790690Z %13 = tt.descriptor_load %0[%4, %c8192_i32] : !tt.tensordesc> -> tensor<8x2048xf16> 2026-02-21T09:50:15.7791146Z %14 = tt.expand_dims %12 {axis = 0 : i32} : tensor<2048xi1> -> tensor<1x2048xi1> 2026-02-21T09:50:15.7791593Z %15 = tt.broadcast %14 : tensor<1x2048xi1> -> tensor<8x2048xi1> 2026-02-21T09:50:15.7791993Z %16 = arith.select %15, %13, %cst_2 : tensor<8x2048xi1>, tensor<8x2048xf16> 2026-02-21T09:50:15.7792364Z %17 = arith.extf %16 : tensor<8x2048xf16> to tensor<8x2048xf32> 2026-02-21T09:50:15.7792710Z %18 = "tt.reduce"(%17) <{axis = 1 : i32}> ({ 2026-02-21T09:50:15.7792977Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:50:15.7793270Z %62 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:50:15.7793540Z tt.reduce.return %62 : f32 2026-02-21T09:50:15.7793833Z }) : (tensor<8x2048xf32>) -> tensor<8xf32> 2026-02-21T09:50:15.7794160Z %19 = arith.truncf %18 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:50:15.7794483Z %20 = arith.extf %19 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:50:15.7794813Z %21 = arith.cmpf ogt, %8#0, %20 : tensor<8xf32> 2026-02-21T09:50:15.7795106Z %22 = arith.cmpf une, %8#0, %8#0 : tensor<8xf32> 2026-02-21T09:50:15.7795410Z %23 = arith.ori %21, %22 : tensor<8xi1> 2026-02-21T09:50:15.7795690Z %24 = arith.select %23, %8#0, %20 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:50:15.7795999Z %25 = arith.subf %8#0, %24 : tensor<8xf32> 2026-02-21T09:50:15.7796441Z %26 = tt.extern_elementwise %25 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:50:15.7796861Z %27 = arith.mulf %8#1, %26 : tensor<8xf32> 2026-02-21T09:50:15.7797196Z %28 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:50:15.7797601Z %29 = arith.extf %13 : tensor<8x2048xf16> to tensor<8x2048xf32> 2026-02-21T09:50:15.7797913Z %30 = tt.broadcast %28 : tensor<8x1xf32> 
-> tensor<8x2048xf32> 2026-02-21T09:50:15.7798205Z %31 = arith.subf %29, %30 : tensor<8x2048xf32> 2026-02-21T09:50:15.7798664Z %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x2048xf32>) -> tensor<8x2048xf32> 2026-02-21T09:50:15.7799171Z %33 = arith.select %15, %32, %cst_1 : tensor<8x2048xi1>, tensor<8x2048xf32> 2026-02-21T09:50:15.7799479Z %34 = "tt.reduce"(%33) <{axis = 1 : i32}> ({ 2026-02-21T09:50:15.7799780Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:50:15.7800015Z %62 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:50:15.7800287Z tt.reduce.return %62 : f32 2026-02-21T09:50:15.7800521Z }) : (tensor<8x2048xf32>) -> tensor<8xf32> 2026-02-21T09:50:15.7800856Z %35 = arith.addf %27, %34 : tensor<8xf32> 2026-02-21T09:50:15.7801134Z %c8192_i32_7 = arith.constant 8192 : i32 2026-02-21T09:50:15.7801385Z %c4096_i32_8 = arith.constant 4096 : i32 2026-02-21T09:50:15.7801773Z scf.for %arg3 = %c0_i32 to %c8192_i32_7 step %c4096_i32_8 : i32 { 2026-02-21T09:50:15.7802124Z %62 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32> 2026-02-21T09:50:15.7802470Z %63 = tt.splat %arg3 : i32 -> tensor<2048xi32> 2026-02-21T09:50:15.7802737Z %64 = arith.addi %63, %62 : tensor<2048xi32> 2026-02-21T09:50:15.7803039Z %65 = arith.cmpi slt, %64, %cst_3 : tensor<2048xi32> 2026-02-21T09:50:15.7803393Z %66 = tt.expand_dims %7 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:50:15.7803717Z %67 = arith.muli %66, %cst_0 : tensor<8x1xi32> 2026-02-21T09:50:15.7804075Z %68 = tt.expand_dims %64 {axis = 0 : i32} : tensor<2048xi32> -> tensor<1x2048xi32> 2026-02-21T09:50:15.7804438Z %69 = tt.broadcast %67 : tensor<8x1xi32> -> tensor<8x2048xi32> 2026-02-21T09:50:15.7804796Z %70 = tt.broadcast %68 : tensor<1x2048xi32> -> tensor<8x2048xi32> 2026-02-21T09:50:15.7805121Z %71 = arith.addi %69, %70 : tensor<8x2048xi32> 2026-02-21T09:50:15.7805421Z %72 = tt.splat %arg0 : !tt.ptr -> tensor<8x2048x!tt.ptr> 
2026-02-21T09:50:15.7805795Z %73 = tt.addptr %72, %71 : tensor<8x2048x!tt.ptr>, tensor<8x2048xi32> 2026-02-21T09:50:15.7806159Z %74 = tt.expand_dims %65 {axis = 0 : i32} : tensor<2048xi1> -> tensor<1x2048xi1> 2026-02-21T09:50:15.7806546Z %75 = tt.broadcast %74 : tensor<1x2048xi1> -> tensor<8x2048xi1> 2026-02-21T09:50:15.7806917Z %76 = tt.load %73, %75, %cst evictionPolicy = evict_first : tensor<8x2048x!tt.ptr> 2026-02-21T09:50:15.7807342Z %77 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:50:15.7807718Z %78 = arith.extf %76 : tensor<8x2048xf16> to tensor<8x2048xf32> 2026-02-21T09:50:15.7808044Z %79 = tt.broadcast %77 : tensor<8x1xf32> -> tensor<8x2048xf32> 2026-02-21T09:50:15.7808366Z %80 = arith.subf %78, %79 : tensor<8x2048xf32> 2026-02-21T09:50:15.7808805Z %81 = tt.extern_elementwise %80 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x2048xf32>) -> tensor<8x2048xf32> 2026-02-21T09:50:15.7809321Z %82 = tt.expand_dims %35 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:50:15.7809693Z %83 = tt.broadcast %82 : tensor<8x1xf32> -> tensor<8x2048xf32> 2026-02-21T09:50:15.7809982Z %84 = arith.divf %81, %83 : tensor<8x2048xf32> 2026-02-21T09:50:15.7810307Z %85 = arith.truncf %84 : tensor<8x2048xf32> to tensor<8x2048xf16> 2026-02-21T09:50:15.7810636Z %86 = tt.splat %arg1 : !tt.ptr -> tensor<8x2048x!tt.ptr> 2026-02-21T09:50:15.7811002Z %87 = tt.addptr %86, %71 : tensor<8x2048x!tt.ptr>, tensor<8x2048xi32> 2026-02-21T09:50:15.7811324Z tt.store %87, %85, %75 : tensor<8x2048x!tt.ptr> 2026-02-21T09:50:15.7811768Z %c1_i32_9 = arith.constant 1 : i32 2026-02-21T09:50:15.7812053Z %88 = arith.muli %c2048_i32, %c1_i32_9 : i32 2026-02-21T09:50:15.7812304Z %89 = arith.addi %arg3, %88 : i32 2026-02-21T09:50:15.7812626Z %90 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32> 2026-02-21T09:50:15.7812938Z %91 = tt.splat %89 : i32 -> tensor<2048xi32> 2026-02-21T09:50:15.7813225Z %92 = 
arith.addi %91, %90 : tensor<2048xi32> 2026-02-21T09:50:15.7813500Z %93 = arith.cmpi slt, %92, %cst_3 : tensor<2048xi32> 2026-02-21T09:50:15.7813850Z %94 = tt.expand_dims %7 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:50:15.7814165Z %95 = arith.muli %94, %cst_0 : tensor<8x1xi32> 2026-02-21T09:50:15.7814487Z %96 = tt.expand_dims %92 {axis = 0 : i32} : tensor<2048xi32> -> tensor<1x2048xi32> 2026-02-21T09:50:15.7814871Z %97 = tt.broadcast %95 : tensor<8x1xi32> -> tensor<8x2048xi32> 2026-02-21T09:50:15.7815254Z %98 = tt.broadcast %96 : tensor<1x2048xi32> -> tensor<8x2048xi32> 2026-02-21T09:50:15.7815579Z %99 = arith.addi %97, %98 : tensor<8x2048xi32> 2026-02-21T09:50:15.7815877Z %100 = tt.splat %arg0 : !tt.ptr -> tensor<8x2048x!tt.ptr> 2026-02-21T09:50:15.7816256Z %101 = tt.addptr %100, %99 : tensor<8x2048x!tt.ptr>, tensor<8x2048xi32> 2026-02-21T09:50:15.7816662Z %102 = tt.expand_dims %93 {axis = 0 : i32} : tensor<2048xi1> -> tensor<1x2048xi1> 2026-02-21T09:50:15.7817022Z %103 = tt.broadcast %102 : tensor<1x2048xi1> -> tensor<8x2048xi1> 2026-02-21T09:50:15.7817437Z %104 = tt.load %101, %103, %cst evictionPolicy = evict_first : tensor<8x2048x!tt.ptr> 2026-02-21T09:50:15.7817835Z %105 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:50:15.7818219Z %106 = arith.extf %104 : tensor<8x2048xf16> to tensor<8x2048xf32> 2026-02-21T09:50:15.7818577Z %107 = tt.broadcast %105 : tensor<8x1xf32> -> tensor<8x2048xf32> 2026-02-21T09:50:15.7818880Z %108 = arith.subf %106, %107 : tensor<8x2048xf32> 2026-02-21T09:50:15.7819362Z %109 = tt.extern_elementwise %108 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x2048xf32>) -> tensor<8x2048xf32> 2026-02-21T09:50:15.7819856Z %110 = tt.expand_dims %35 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:50:15.7820239Z %111 = tt.broadcast %110 : tensor<8x1xf32> -> tensor<8x2048xf32> 2026-02-21T09:50:15.7820567Z %112 = arith.divf %109, %111 : 
tensor<8x2048xf32> 2026-02-21T09:50:15.7820871Z %113 = arith.truncf %112 : tensor<8x2048xf32> to tensor<8x2048xf16> 2026-02-21T09:50:15.7821239Z %114 = tt.splat %arg1 : !tt.ptr -> tensor<8x2048x!tt.ptr> 2026-02-21T09:50:15.7821614Z %115 = tt.addptr %114, %99 : tensor<8x2048x!tt.ptr>, tensor<8x2048xi32> 2026-02-21T09:50:15.7821980Z tt.store %115, %113, %103 : tensor<8x2048x!tt.ptr> 2026-02-21T09:50:15.7822255Z } {tt.num_stages = 1 : i32} 2026-02-21T09:50:15.7822575Z %36 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32> 2026-02-21T09:50:15.7822929Z %37 = tt.splat %c8192_i32_7 : i32 -> tensor<2048xi32> 2026-02-21T09:50:15.7823203Z %38 = arith.addi %37, %36 : tensor<2048xi32> 2026-02-21T09:50:15.7823510Z %39 = arith.cmpi slt, %38, %cst_3 : tensor<2048xi32> 2026-02-21T09:50:15.7823834Z %40 = tt.expand_dims %7 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:50:15.7824181Z %41 = arith.muli %40, %cst_0 : tensor<8x1xi32> 2026-02-21T09:50:15.7824498Z %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<2048xi32> -> tensor<1x2048xi32> 2026-02-21T09:50:15.7824885Z %43 = tt.broadcast %41 : tensor<8x1xi32> -> tensor<8x2048xi32> 2026-02-21T09:50:15.7825238Z %44 = tt.broadcast %42 : tensor<1x2048xi32> -> tensor<8x2048xi32> 2026-02-21T09:50:15.7825532Z %45 = arith.addi %43, %44 : tensor<8x2048xi32> 2026-02-21T09:50:15.7825923Z %46 = tt.splat %arg0 : !tt.ptr -> tensor<8x2048x!tt.ptr> 2026-02-21T09:50:15.7826261Z %47 = tt.addptr %46, %45 : tensor<8x2048x!tt.ptr>, tensor<8x2048xi32> 2026-02-21T09:50:15.7826647Z %48 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2048xi1> -> tensor<1x2048xi1> 2026-02-21T09:50:15.7827000Z %49 = tt.broadcast %48 : tensor<1x2048xi1> -> tensor<8x2048xi1> 2026-02-21T09:50:15.7827393Z %50 = tt.load %47, %49, %cst evictionPolicy = evict_first : tensor<8x2048x!tt.ptr> 2026-02-21T09:50:15.7827810Z %51 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:50:15.7828154Z %52 = arith.extf %50 : 
tensor<8x2048xf16> to tensor<8x2048xf32> 2026-02-21T09:50:15.7828502Z %53 = tt.broadcast %51 : tensor<8x1xf32> -> tensor<8x2048xf32> 2026-02-21T09:50:15.7828788Z %54 = arith.subf %52, %53 : tensor<8x2048xf32> 2026-02-21T09:50:15.7829306Z %55 = tt.extern_elementwise %54 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x2048xf32>) -> tensor<8x2048xf32> 2026-02-21T09:50:15.7829818Z %56 = tt.expand_dims %35 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:50:15.7830157Z %57 = tt.broadcast %56 : tensor<8x1xf32> -> tensor<8x2048xf32> 2026-02-21T09:50:15.7830470Z %58 = arith.divf %55, %57 : tensor<8x2048xf32> 2026-02-21T09:50:15.7830763Z %59 = arith.truncf %58 : tensor<8x2048xf32> to tensor<8x2048xf16> 2026-02-21T09:50:15.7831117Z %60 = tt.splat %arg1 : !tt.ptr -> tensor<8x2048x!tt.ptr> 2026-02-21T09:50:15.7831450Z %61 = tt.addptr %60, %45 : tensor<8x2048x!tt.ptr>, tensor<8x2048xi32> 2026-02-21T09:50:15.7831833Z tt.store %61, %59, %49 : tensor<8x2048x!tt.ptr> 2026-02-21T09:50:15.7832143Z } {tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T09:50:15.7832394Z tt.return 2026-02-21T09:50:15.7832605Z } 2026-02-21T09:50:15.7832775Z } 2026-02-21T09:50:15.7832900Z 2026-02-21T09:50:15.7832979Z {-# 2026-02-21T09:50:15.7833162Z external_resources: { 2026-02-21T09:50:15.7833415Z mlir_reproducer: { 2026-02-21T09:50:15.7838225Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, 
triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:50:15.7843525Z disable_threading: false, 2026-02-21T09:50:15.7843801Z verify_each: true 2026-02-21T09:50:15.7844073Z } 2026-02-21T09:50:15.7844285Z } 2026-02-21T09:50:15.7844460Z #-} 2026-02-21T09:50:15.7845027Z /tmp/torchinductor_root/54/c546bkyi7sl4ccjuehpnzcndjhoufwdajeu553u5laotkp7oj7xq.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:50:15.7846512Z /tmp/torchinductor_root/54/c546bkyi7sl4ccjuehpnzcndjhoufwdajeu553u5laotkp7oj7xq.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:50:15.7847706Z [135s] Triton compile failed. 
This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T09:50:15.7849079Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 2048], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=8, num_stages=8, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[4, 3], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:50:15.7850189Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:50:15.7850517Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:50:18.1963608Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 17.2 configs/s 2026-02-21T09:50:23.2333237Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 200.9 2026-02-21T09:50:23.2336778Z configs/s 2026-02-21T09:50:23.5080037Z [143s] Generation 3 complete: 2026-02-21T09:50:23.5083958Z error=2 2026-02-21T09:50:23.5088177Z ok=71 2026-02-21T09:50:23.5092695Z min=0.0451 2026-02-21T09:50:23.5097714Z mid=0.0513 2026-02-21T09:50:23.5102167Z max=0.4997 2026-02-21T09:50:23.5103631Z best={'block_sizes': [1, 512], 2026-02-21T09:50:23.5104000Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T09:50:23.5104357Z 'load_eviction_policies': ['first', ''], 2026-02-21T09:50:23.5104587Z 'num_stages': 2, 2026-02-21T09:50:23.5104810Z 'num_warps': 1, 2026-02-21T09:50:23.5104996Z 'pid_type': 'flat', 2026-02-21T09:50:23.5105223Z 'range_flattens': [None, False], 2026-02-21T09:50:23.5105448Z 'range_multi_buffers': [None, False], 2026-02-21T09:50:23.5105702Z 'range_num_stages': [0, 3], 2026-02-21T09:50:23.5105940Z 'range_unroll_factors': [0, 1], 2026-02-21T09:50:23.5106159Z 'range_warp_specializes': [None, False]} 2026-02-21T09:50:23.5106454Z [143s] Fitting surrogate: 342 points, 342 targets 2026-02-21T09:50:24.3527177Z 
[144s] Generation 4 starting: 59 neighbors, 5 active search path(s) 2026-02-21T09:50:52.2447243Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60/60 0.3 configs/s 2026-02-21T09:50:55.7913613Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 60/60 17.1 configs/s 2026-02-21T09:51:00.1320973Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 232.8 2026-02-21T09:51:00.1322341Z configs/s 2026-02-21T09:51:00.3816581Z [180s] Generation 4 complete: 2026-02-21T09:51:00.3820845Z ok=64 2026-02-21T09:51:00.3824778Z min=0.0451 2026-02-21T09:51:00.3826398Z mid=0.0511 2026-02-21T09:51:00.3826649Z max=0.9605 2026-02-21T09:51:00.3826845Z best={'block_sizes': [1, 512], 2026-02-21T09:51:00.3827177Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:51:00.3827490Z 'load_eviction_policies': ['first', ''], 2026-02-21T09:51:00.3827752Z 'num_stages': 2, 2026-02-21T09:51:00.3827941Z 'num_warps': 1, 2026-02-21T09:51:00.3828158Z 'pid_type': 'flat', 2026-02-21T09:51:00.3828392Z 'range_flattens': [None, False], 2026-02-21T09:51:00.3828611Z 'range_multi_buffers': [None, False], 2026-02-21T09:51:00.3828863Z 'range_num_stages': [0, 3], 2026-02-21T09:51:00.3829566Z 'range_unroll_factors': [0, 1], 2026-02-21T09:51:00.3833330Z 'range_warp_specializes': [None, False]} 2026-02-21T09:51:00.3835382Z [180s] Fitting surrogate: 406 points, 406 targets 2026-02-21T09:51:01.1403830Z [180s] Generation 5 starting: 50 neighbors, 4 active search path(s) 2026-02-21T09:51:08.5171345Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 8.3 configs/s 2026-02-21T09:51:11.5599629Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 17.3 configs/s 2026-02-21T09:51:15.9164878Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 272.0 2026-02-21T09:51:15.9168355Z configs/s 2026-02-21T09:51:16.1313593Z [195s] Generation 5 complete: 2026-02-21T09:51:16.1316684Z ok=55 2026-02-21T09:51:16.1319899Z min=0.0451 
2026-02-21T09:51:16.1323946Z mid=0.0532 2026-02-21T09:51:16.1328602Z max=0.2764 2026-02-21T09:51:16.1330722Z best={'block_sizes': [1, 512], 2026-02-21T09:51:16.1331503Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:51:16.1336667Z 'load_eviction_policies': ['first', ''], 2026-02-21T09:51:16.1340991Z 'num_stages': 2, 2026-02-21T09:51:16.1341321Z 'num_warps': 1, 2026-02-21T09:51:16.1346136Z 'pid_type': 'flat', 2026-02-21T09:51:16.1350508Z 'range_flattens': [None, False], 2026-02-21T09:51:16.1350865Z 'range_multi_buffers': [None, False], 2026-02-21T09:51:16.1355527Z 'range_num_stages': [0, 3], 2026-02-21T09:51:16.1359927Z 'range_unroll_factors': [0, 1], 2026-02-21T09:51:16.1361702Z 'range_warp_specializes': [None, False]} 2026-02-21T09:51:16.8563014Z [195s] Fitting surrogate: 461 points, 461 targets 2026-02-21T09:51:16.8563443Z [196s] Generation 6 starting: 45 neighbors, 4 active search path(s) 2026-02-21T09:51:23.0759538Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46/46 7.3 configs/s 2026-02-21T09:51:25.7705044Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 46/46 17.4 configs/s 2026-02-21T09:51:29.6622202Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 259.7 2026-02-21T09:51:29.6622662Z configs/s 2026-02-21T09:51:29.8945506Z [209s] Generation 6 complete: 2026-02-21T09:51:29.8949735Z ok=50 2026-02-21T09:51:29.8953665Z min=0.0451 2026-02-21T09:51:29.8957918Z mid=0.0471 2026-02-21T09:51:29.8959425Z max=0.0799 2026-02-21T09:51:29.8959688Z best={'block_sizes': [1, 512], 2026-02-21T09:51:29.8959997Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:51:29.8960428Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:51:29.8965485Z 'num_stages': 8, 2026-02-21T09:51:29.8967640Z 'num_warps': 1, 2026-02-21T09:51:29.8967923Z 'pid_type': 'flat', 2026-02-21T09:51:29.8972142Z 'range_flattens': [None, None], 2026-02-21T09:51:29.8973662Z 'range_multi_buffers': 
[None, None], 2026-02-21T09:51:29.8973936Z 'range_num_stages': [0, 3], 2026-02-21T09:51:29.8974174Z 'range_unroll_factors': [0, 1], 2026-02-21T09:51:29.8974802Z 'range_warp_specializes': [None, False]} 2026-02-21T09:51:29.8975103Z [209s] Fitting surrogate: 511 points, 511 targets 2026-02-21T09:51:30.2619258Z [210s] Generation 7 starting: 7 neighbors, 1 active search path(s) 2026-02-21T09:51:32.0076495Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7/7 3.8 configs/s 2026-02-21T09:51:32.4199140Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━━ 7/7 19.4 configs/s 2026-02-21T09:51:33.1281076Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1395.7 2026-02-21T09:51:33.1285176Z configs/s 2026-02-21T09:51:33.1858206Z [212s] Generation 7 complete: 2026-02-21T09:51:33.1858514Z ok=8 2026-02-21T09:51:33.1862577Z min=0.0451 2026-02-21T09:51:33.1867054Z mid=0.0451 2026-02-21T09:51:33.1871327Z max=0.0635 2026-02-21T09:51:33.1875743Z best={'block_sizes': [1, 512], 2026-02-21T09:51:33.1878376Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T09:51:33.1878861Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:51:33.1882787Z 'num_stages': 8, 2026-02-21T09:51:33.1887191Z 'num_warps': 1, 2026-02-21T09:51:33.1891517Z 'pid_type': 'flat', 2026-02-21T09:51:33.1895398Z 'range_flattens': [None, None], 2026-02-21T09:51:33.1898683Z 'range_multi_buffers': [None, None], 2026-02-21T09:51:33.1902450Z 'range_num_stages': [0, 3], 2026-02-21T09:51:33.1907374Z 'range_unroll_factors': [0, 1], 2026-02-21T09:51:33.1911773Z 'range_warp_specializes': [None, False]} 2026-02-21T09:51:33.1913896Z [212s] Fitting surrogate: 519 points, 519 targets 2026-02-21T09:51:33.4434631Z [213s] Autotuning complete in 213.2s after searching 497 configs. 
2026-02-21T09:51:33.4438521Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:51:33.4439649Z @helion.kernel(config=helion.Config(block_sizes=[1, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T09:51:33.4440524Z 2026-02-21T09:51:33.4440812Z [213s] Code of selected kernel: /tmp/torchinductor_root/kg/ckgyyc5wosnskbkyt7nxbsquxpemydjjp6uf3lfhfvqntz2bcqqp.py 2026-02-21T09:51:33.4637249Z from __future__ import annotations 2026-02-21T09:51:33.4637527Z 2026-02-21T09:51:33.4637706Z import torch 2026-02-21T09:51:33.4637890Z import triton 2026-02-21T09:51:33.4638171Z import triton.language as tl 2026-02-21T09:51:33.4638419Z from torch._inductor.runtime import triton_helpers 2026-02-21T09:51:33.4638779Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T09:51:33.4639119Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:51:33.4639350Z 2026-02-21T09:51:33.4639454Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T09:51:33.4639708Z _BLOCK_SIZE_1 = tl.constexpr(512) 2026-02-21T09:51:33.4639846Z 2026-02-21T09:51:33.4639921Z @triton.jit 2026-02-21T09:51:33.4640138Z def _helion_softmax_two_pass(x, out): 2026-02-21T09:51:33.4640424Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:51:33.4640749Z pid_0 = tl.program_id(0) 2026-02-21T09:51:33.4640951Z offset_0 = pid_0 2026-02-21T09:51:33.4641190Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T09:51:33.4641702Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:51:33.4642040Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T09:51:33.4642382Z # src[softmax.py:81]: di = 
hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:51:33.4642673Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T09:51:33.4642996Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:51:33.4643315Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:51:33.4643927Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:51:33.4644229Z # src[softmax.py:82-89]: ... 2026-02-21T09:51:33.4644603Z for offset_2 in tl.range(0, 8704, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=False, num_stages=3): 2026-02-21T09:51:33.4645048Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:51:33.4645309Z mi_copy = mi 2026-02-21T09:51:33.4645514Z di_copy = di 2026-02-21T09:51:33.4645697Z mi_copy_0 = mi_copy 2026-02-21T09:51:33.4645919Z di_copy_0 = di_copy 2026-02-21T09:51:33.4646138Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:51:33.4646544Z values = tl.load(x + (indices_0[:, None] * 8704 + indices_2[None, :] * 1), None, eviction_policy='evict_last') 2026-02-21T09:51:33.4647018Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:51:33.4647382Z local_amax = tl.cast(tl.max(values, 1), tl.float16) 2026-02-21T09:51:33.4647731Z # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax) 2026-02-21T09:51:33.4648058Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T09:51:33.4648321Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T09:51:33.4648662Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:51:33.4648956Z v_2 = mi_copy_0 - v_1 2026-02-21T09:51:33.4649205Z v_3 = libdevice.exp(v_2) 2026-02-21T09:51:33.4649421Z v_4 = di_copy_0 * v_3 2026-02-21T09:51:33.4649692Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:51:33.4649925Z subscript = v_1[:, None] 2026-02-21T09:51:33.4650169Z v_5 = tl.cast(values, tl.float32) 2026-02-21T09:51:33.4650431Z v_6 = v_5 - subscript 
2026-02-21T09:51:33.4650694Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:51:33.4651044Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:51:33.4651312Z # src[softmax.py:88]: ).sum(dim=1) 2026-02-21T09:51:33.4651616Z v_7 = libdevice.exp(v_6) 2026-02-21T09:51:33.4651852Z sum_1 = tl.cast(tl.sum(v_7, 1), tl.float32) 2026-02-21T09:51:33.4652131Z di = v_4 + sum_1 2026-02-21T09:51:33.4652342Z # src[softmax.py:89]: mi = mi_next 2026-02-21T09:51:33.4652598Z mi = v_1 2026-02-21T09:51:33.4652879Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:51:33.4653208Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:51:33.4653583Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:51:33.4654049Z for offset_2 in tl.range(0, 8704, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=False, num_stages=3): 2026-02-21T09:51:33.4654497Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:51:33.4654780Z mi_copy_1 = mi 2026-02-21T09:51:33.4654998Z di_copy_1 = di 2026-02-21T09:51:33.4655231Z mi_copy_1_0 = mi_copy_1 2026-02-21T09:51:33.4655434Z di_copy_1_0 = di_copy_1 2026-02-21T09:51:33.4655677Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:51:33.4656036Z values_1 = tl.load(x + (indices_0[:, None] * 8704 + indices_2[None, :] * 1), None, eviction_policy='evict_last') 2026-02-21T09:51:33.4656504Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:51:33.4656821Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T09:51:33.4657082Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T09:51:33.4657334Z v_10 = v_9 - subscript_1 2026-02-21T09:51:33.4657549Z v_11 = libdevice.exp(v_10) 2026-02-21T09:51:33.4657793Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T09:51:33.4658018Z v_12 = v_11 / subscript_2 2026-02-21T09:51:33.4658258Z v_13 = tl.cast(v_12, tl.float16) 
2026-02-21T09:51:33.4658634Z tl.store(out + (indices_0[:, None] * 8704 + indices_2[None, :] * 1), v_13, None) 2026-02-21T09:51:33.4658878Z 2026-02-21T09:51:33.4659037Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T09:51:33.4659339Z """ 2026-02-21T09:51:33.4659587Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T09:51:33.4659967Z This version uses fewer passes but is less numerically stable. 2026-02-21T09:51:33.4660230Z Args: 2026-02-21T09:51:33.4660458Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T09:51:33.4660690Z Returns: 2026-02-21T09:51:33.4660936Z torch.Tensor: Softmax output tensor of the same shape. 2026-02-21T09:51:33.4661182Z """ 2026-02-21T09:51:33.4661383Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T09:51:33.4661671Z m, n = x.size() 2026-02-21T09:51:33.4661877Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T09:51:33.4662212Z out = torch.empty_like(x) 2026-02-21T09:51:33.4662483Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:51:33.4662868Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:51:33.4663203Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:51:33.4663506Z # src[softmax.py:79-92]: ... 
2026-02-21T09:51:33.4663830Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=1, num_stages=8) 2026-02-21T09:51:33.4664140Z # src[softmax.py:93]: return out 2026-02-21T09:51:33.4664379Z return out 2026-02-21T09:51:34.5504665Z WARNING:tritonbench.utils.triton_op:Completed input ID 66: 2026-02-21T09:51:34.5506219Z (M, N) 2026-02-21T09:51:34.5506509Z ------------ 2026-02-21T09:51:34.5506723Z (4096, 8704) 2026-02-21T09:51:34.5506918Z 2026-02-21T09:51:34.5518132Z 70%|███████ | 14/20 [42:39<20:02, 200.40s/it]WARNING:tritonbench.utils.triton_op:Running input ID 71: 2026-02-21T09:51:34.5522790Z (M, N) 2026-02-21T09:51:34.5525950Z ------------ 2026-02-21T09:51:34.5530959Z (4096, 9344) 2026-02-21T09:51:34.5532943Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T09:51:35.7681199Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T09:51:37.1276165Z INFO:tritonbench.utils.triton_op:Took 2.42ms to get benchmark function for torch_compile_softmax 2026-02-21T09:51:38.4721349Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:51:38.4725512Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:51:38.4730483Z 'dtype': 'torch.float16', 2026-02-21T09:51:38.4734314Z 'shape': (4096, 9344), 2026-02-21T09:51:38.4735888Z 'stride': (9344, 1)},), 2026-02-21T09:51:38.4736307Z 'kwargs': {}} 2026-02-21T09:51:38.4744200Z INFO:tritonbench.utils.triton_op:Took 2.45ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T09:51:38.6479554Z [0s] Autotune random seed: 2138408546 2026-02-21T09:51:38.6732335Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:52:14.3133198Z [35s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', ''], maxnreg=32, num_sm_multiplier=2, num_stages=8, num_warps=2, 
pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[3, 2], range_unroll_factors=[4, 1], range_warp_specializes=[False, False]) 2026-02-21T09:52:17.2048686Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s 2026-02-21T09:52:19.2281335Z module { 2026-02-21T09:52:19.2282122Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:52:19.2282677Z %c256_i32 = arith.constant 256 : i32 2026-02-21T09:52:19.2283253Z %c128_i32 = arith.constant 128 : i32 2026-02-21T09:52:19.2283512Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:52:19.2283736Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:52:19.2284019Z %cst = arith.constant dense<9344> : tensor<16x1xi32> 2026-02-21T09:52:19.2284319Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<16xf32> 2026-02-21T09:52:19.2284651Z %cst_1 = arith.constant dense<0xFF800000> : tensor<16xf32> 2026-02-21T09:52:19.2284940Z %c16_i32 = arith.constant 16 : i32 2026-02-21T09:52:19.2285168Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:52:19.2285427Z %c9344_i32 = arith.constant 9344 : i32 2026-02-21T09:52:19.2285650Z %c9344_i64 = arith.constant 9344 : i64 2026-02-21T09:52:19.2285898Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:52:19.2286372Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c9344_i32], [%c9344_i64, %c1_i64] : , > 2026-02-21T09:52:19.2286918Z %1 = tt.get_program_id x : i32 2026-02-21T09:52:19.2287178Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T09:52:19.2287396Z %3 = arith.minsi %2, %c256_i32 : i32 2026-02-21T09:52:19.2287667Z scf.for %arg2 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T09:52:19.2287911Z %4 = arith.muli %arg2, %c16_i32 : i32 2026-02-21T09:52:19.2288218Z %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T09:52:19.2288507Z %6 = tt.splat %4 : i32 -> tensor<16xi32> 
2026-02-21T09:52:19.2288778Z %7 = arith.addi %6, %5 : tensor<16xi32> 2026-02-21T09:52:19.2292086Z %c9216_i32 = arith.constant 9216 : i32 2026-02-21T09:52:19.2292360Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:52:19.2292851Z %8:2 = scf.for %arg3 = %c0_i32 to %c9216_i32 step %c512_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<16xf32>, tensor<16xf32>) : i32 { 2026-02-21T09:52:19.2293390Z %50 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:52:19.2293819Z %51 = arith.extf %50 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:52:19.2294146Z %52 = "tt.reduce"(%51) <{axis = 1 : i32}> ({ 2026-02-21T09:52:19.2294394Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:52:19.2294667Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:52:19.2294915Z tt.reduce.return %128 : f32 2026-02-21T09:52:19.2295209Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:52:19.2295530Z %53 = arith.truncf %52 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:52:19.2295862Z %54 = arith.extf %53 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:52:19.2296151Z %55 = arith.cmpf ogt, %arg4, %54 : tensor<16xf32> 2026-02-21T09:52:19.2296466Z %56 = arith.cmpf une, %arg4, %arg4 : tensor<16xf32> 2026-02-21T09:52:19.2296735Z %57 = arith.ori %55, %56 : tensor<16xi1> 2026-02-21T09:52:19.2297059Z %58 = arith.select %57, %arg4, %54 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:52:19.2297362Z %59 = arith.subf %arg4, %58 : tensor<16xf32> 2026-02-21T09:52:19.2297821Z %60 = tt.extern_elementwise %59 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:52:19.2298273Z %61 = arith.mulf %arg5, %60 : tensor<16xf32> 2026-02-21T09:52:19.2298584Z %62 = tt.expand_dims %58 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:52:19.2298971Z %63 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:52:19.2299264Z %64 = arith.subf %51, %63 : tensor<16x128xf32> 
2026-02-21T09:52:19.2299724Z %65 = tt.extern_elementwise %64 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:52:19.2300179Z %66 = "tt.reduce"(%65) <{axis = 1 : i32}> ({ 2026-02-21T09:52:19.2300429Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:52:19.2300692Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:52:19.2301013Z tt.reduce.return %128 : f32 2026-02-21T09:52:19.2301275Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:52:19.2301516Z %67 = arith.addf %61, %66 : tensor<16xf32> 2026-02-21T09:52:19.2301823Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T09:52:19.2302055Z %68 = arith.muli %c128_i32, %c1_i32_4 : i32 2026-02-21T09:52:19.2302317Z %69 = arith.addi %arg3, %68 : i32 2026-02-21T09:52:19.2302661Z %70 = tt.descriptor_load %0[%4, %69] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:52:19.2303023Z %71 = arith.extf %70 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:52:19.2303354Z %72 = "tt.reduce"(%71) <{axis = 1 : i32}> ({ 2026-02-21T09:52:19.2303583Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:52:19.2303841Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:52:19.2304076Z tt.reduce.return %128 : f32 2026-02-21T09:52:19.2304405Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:52:19.2304696Z %73 = arith.truncf %72 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:52:19.2304984Z %74 = arith.extf %73 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:52:19.2305285Z %75 = arith.cmpf ogt, %58, %74 : tensor<16xf32> 2026-02-21T09:52:19.2305555Z %76 = arith.cmpf une, %58, %58 : tensor<16xf32> 2026-02-21T09:52:19.2305822Z %77 = arith.ori %75, %76 : tensor<16xi1> 2026-02-21T09:52:19.2306087Z %78 = arith.select %77, %58, %74 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:52:19.2306386Z %79 = arith.subf %58, %78 : tensor<16xf32> 2026-02-21T09:52:19.2306803Z %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : 
(tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:52:19.2307191Z %81 = arith.mulf %67, %80 : tensor<16xf32> 2026-02-21T09:52:19.2307517Z %82 = tt.expand_dims %78 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:52:19.2307850Z %83 = tt.broadcast %82 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:52:19.2308162Z %84 = arith.subf %71, %83 : tensor<16x128xf32> 2026-02-21T09:52:19.2308596Z %85 = tt.extern_elementwise %84 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:52:19.2308994Z %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({ 2026-02-21T09:52:19.2309253Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:52:19.2309478Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:52:19.2309729Z tt.reduce.return %128 : f32 2026-02-21T09:52:19.2309956Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:52:19.2310216Z %87 = arith.addf %81, %86 : tensor<16xf32> 2026-02-21T09:52:19.2310455Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:52:19.2310712Z %88 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T09:52:19.2310967Z %89 = arith.addi %arg3, %88 : i32 2026-02-21T09:52:19.2311282Z %90 = tt.descriptor_load %0[%4, %89] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:52:19.2311698Z %91 = arith.extf %90 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:52:19.2311965Z %92 = "tt.reduce"(%91) <{axis = 1 : i32}> ({ 2026-02-21T09:52:19.2312220Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:52:19.2312447Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:52:19.2312705Z tt.reduce.return %128 : f32 2026-02-21T09:52:19.2312960Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:52:19.2313217Z %93 = arith.truncf %92 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:52:19.2313525Z %94 = arith.extf %93 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:52:19.2313787Z %95 = arith.cmpf ogt, %78, %94 : tensor<16xf32> 2026-02-21T09:52:19.2314062Z %96 = arith.cmpf une, %78, %78 : 
tensor<16xf32> 2026-02-21T09:52:19.2314396Z %97 = arith.ori %95, %96 : tensor<16xi1> 2026-02-21T09:52:19.2314691Z %98 = arith.select %97, %78, %94 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:52:19.2314994Z %99 = arith.subf %78, %98 : tensor<16xf32> 2026-02-21T09:52:19.2315400Z %100 = tt.extern_elementwise %99 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:52:19.2315837Z %101 = arith.mulf %87, %100 : tensor<16xf32> 2026-02-21T09:52:19.2316130Z %102 = tt.expand_dims %98 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:52:19.2316497Z %103 = tt.broadcast %102 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:52:19.2316788Z %104 = arith.subf %91, %103 : tensor<16x128xf32> 2026-02-21T09:52:19.2317235Z %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:52:19.2317754Z %106 = "tt.reduce"(%105) <{axis = 1 : i32}> ({ 2026-02-21T09:52:19.2317991Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:52:19.2318245Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:52:19.2318471Z tt.reduce.return %128 : f32 2026-02-21T09:52:19.2318733Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:52:19.2318975Z %107 = arith.addf %101, %106 : tensor<16xf32> 2026-02-21T09:52:19.2319244Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:52:19.2319501Z %108 = arith.muli %c128_i32, %c3_i32 : i32 2026-02-21T09:52:19.2319735Z %109 = arith.addi %arg3, %108 : i32 2026-02-21T09:52:19.2320088Z %110 = tt.descriptor_load %0[%4, %109] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:52:19.2320455Z %111 = arith.extf %110 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:52:19.2320763Z %112 = "tt.reduce"(%111) <{axis = 1 : i32}> ({ 2026-02-21T09:52:19.2321001Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:52:19.2321253Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:52:19.2321509Z tt.reduce.return 
%128 : f32 2026-02-21T09:52:19.2321774Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:52:19.2322071Z %113 = arith.truncf %112 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:52:19.2322369Z %114 = arith.extf %113 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:52:19.2322675Z %115 = arith.cmpf ogt, %98, %114 : tensor<16xf32> 2026-02-21T09:52:19.2322940Z %116 = arith.cmpf une, %98, %98 : tensor<16xf32> 2026-02-21T09:52:19.2323213Z %117 = arith.ori %115, %116 : tensor<16xi1> 2026-02-21T09:52:19.2323517Z %118 = arith.select %117, %98, %114 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:52:19.2323800Z %119 = arith.subf %98, %118 : tensor<16xf32> 2026-02-21T09:52:19.2324229Z %120 = tt.extern_elementwise %119 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:52:19.2324636Z %121 = arith.mulf %107, %120 : tensor<16xf32> 2026-02-21T09:52:19.2324963Z %122 = tt.expand_dims %118 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:52:19.2325339Z %123 = tt.broadcast %122 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:52:19.2325630Z %124 = arith.subf %111, %123 : tensor<16x128xf32> 2026-02-21T09:52:19.2326071Z %125 = tt.extern_elementwise %124 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:52:19.2326479Z %126 = "tt.reduce"(%125) <{axis = 1 : i32}> ({ 2026-02-21T09:52:19.2326737Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:52:19.2326958Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:52:19.2327214Z tt.reduce.return %128 : f32 2026-02-21T09:52:19.2327463Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:52:19.2327705Z %127 = arith.addf %121, %126 : tensor<16xf32> 2026-02-21T09:52:19.2328058Z scf.yield %118, %127 : tensor<16xf32>, tensor<16xf32> 2026-02-21T09:52:19.2328314Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:52:19.2328676Z %9 = tt.descriptor_load %0[%4, %c9216_i32] : 
!tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T09:52:19.2329047Z %10 = arith.extf %9 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:52:19.2329348Z %11 = "tt.reduce"(%10) <{axis = 1 : i32}> ({ 2026-02-21T09:52:19.2329606Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:52:19.2329830Z %50 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:52:19.2330091Z tt.reduce.return %50 : f32 2026-02-21T09:52:19.2330317Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:52:19.2330609Z %12 = arith.truncf %11 : tensor<16xf32> to tensor<16xf16> 2026-02-21T09:52:19.2330890Z %13 = arith.extf %12 : tensor<16xf16> to tensor<16xf32> 2026-02-21T09:52:19.2331281Z %14 = arith.cmpf ogt, %8#0, %13 : tensor<16xf32> 2026-02-21T09:52:19.2331572Z %15 = arith.cmpf une, %8#0, %8#0 : tensor<16xf32> 2026-02-21T09:52:19.2331850Z %16 = arith.ori %14, %15 : tensor<16xi1> 2026-02-21T09:52:19.2332152Z %17 = arith.select %16, %8#0, %13 : tensor<16xi1>, tensor<16xf32> 2026-02-21T09:52:19.2332426Z %18 = arith.subf %8#0, %17 : tensor<16xf32> 2026-02-21T09:52:19.2332846Z %19 = tt.extern_elementwise %18 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T09:52:19.2333242Z %20 = arith.mulf %8#1, %19 : tensor<16xf32> 2026-02-21T09:52:19.2333561Z %21 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:52:19.2333940Z %22 = tt.broadcast %21 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:52:19.2334230Z %23 = arith.subf %10, %22 : tensor<16x128xf32> 2026-02-21T09:52:19.2334680Z %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:52:19.2335112Z %25 = "tt.reduce"(%24) <{axis = 1 : i32}> ({ 2026-02-21T09:52:19.2335377Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:52:19.2335603Z %50 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:52:19.2335869Z tt.reduce.return %50 : f32 2026-02-21T09:52:19.2336128Z }) : 
(tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T09:52:19.2336373Z %26 = arith.addf %20, %25 : tensor<16xf32> 2026-02-21T09:52:19.2336648Z %c9216_i32_2 = arith.constant 9216 : i32 2026-02-21T09:52:19.2336890Z %c512_i32_3 = arith.constant 512 : i32 2026-02-21T09:52:19.2337201Z scf.for %arg3 = %c0_i32 to %c9216_i32_2 step %c512_i32_3 : i32 { 2026-02-21T09:52:19.2337538Z %50 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:52:19.2337884Z %51 = tt.splat %arg3 : i32 -> tensor<128xi32> 2026-02-21T09:52:19.2338142Z %52 = arith.addi %51, %50 : tensor<128xi32> 2026-02-21T09:52:19.2338475Z %53 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:52:19.2338825Z %54 = arith.muli %53, %cst : tensor<16x1xi32> 2026-02-21T09:52:19.2339139Z %55 = tt.expand_dims %52 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:52:19.2339521Z %56 = tt.broadcast %54 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:52:19.2339840Z %57 = tt.broadcast %55 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:52:19.2340158Z %58 = arith.addi %56, %57 : tensor<16x128xi32> 2026-02-21T09:52:19.2340516Z %59 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2340858Z %60 = tt.addptr %59, %58 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:52:19.2341250Z %61 = tt.load %60 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2341649Z %62 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:52:19.2342085Z %63 = arith.extf %61 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:52:19.2342463Z %64 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:52:19.2342778Z %65 = arith.subf %63, %64 : tensor<16x128xf32> 2026-02-21T09:52:19.2343203Z %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 
2026-02-21T09:52:19.2343708Z %67 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:52:19.2344051Z %68 = tt.broadcast %67 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:52:19.2344357Z %69 = arith.divf %66, %68 : tensor<16x128xf32> 2026-02-21T09:52:19.2344660Z %70 = arith.truncf %69 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:52:19.2344969Z %71 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2345370Z %72 = tt.addptr %71, %58 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:52:19.2345672Z tt.store %72, %70 : tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2345950Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T09:52:19.2346180Z %73 = arith.muli %c128_i32, %c1_i32_4 : i32 2026-02-21T09:52:19.2346445Z %74 = arith.addi %arg3, %73 : i32 2026-02-21T09:52:19.2346749Z %75 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:52:19.2347036Z %76 = tt.splat %74 : i32 -> tensor<128xi32> 2026-02-21T09:52:19.2347301Z %77 = arith.addi %76, %75 : tensor<128xi32> 2026-02-21T09:52:19.2347590Z %78 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:52:19.2347915Z %79 = arith.muli %78, %cst : tensor<16x1xi32> 2026-02-21T09:52:19.2348206Z %80 = tt.expand_dims %77 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:52:19.2348564Z %81 = tt.broadcast %79 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:52:19.2348917Z %82 = tt.broadcast %80 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:52:19.2349221Z %83 = arith.addi %81, %82 : tensor<16x128xi32> 2026-02-21T09:52:19.2349546Z %84 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2349867Z %85 = tt.addptr %84, %83 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:52:19.2350239Z %86 = tt.load %85 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2350583Z %87 = tt.expand_dims %17 {axis = 1 : i32} : 
tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:52:19.2350925Z %88 = arith.extf %86 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:52:19.2351258Z %89 = tt.broadcast %87 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:52:19.2351639Z %90 = arith.subf %88, %89 : tensor<16x128xf32> 2026-02-21T09:52:19.2352091Z %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:52:19.2352593Z %92 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:52:19.2352956Z %93 = tt.broadcast %92 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:52:19.2353257Z %94 = arith.divf %91, %93 : tensor<16x128xf32> 2026-02-21T09:52:19.2353532Z %95 = arith.truncf %94 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:52:19.2353868Z %96 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2354180Z %97 = tt.addptr %96, %83 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:52:19.2354471Z tt.store %97, %95 : tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2354722Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:52:19.2354981Z %98 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T09:52:19.2355241Z %99 = arith.addi %arg3, %98 : i32 2026-02-21T09:52:19.2355578Z %100 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:52:19.2355896Z %101 = tt.splat %99 : i32 -> tensor<128xi32> 2026-02-21T09:52:19.2356142Z %102 = arith.addi %101, %100 : tensor<128xi32> 2026-02-21T09:52:19.2356464Z %103 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:52:19.2356774Z %104 = arith.muli %103, %cst : tensor<16x1xi32> 2026-02-21T09:52:19.2357111Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:52:19.2357474Z %106 = tt.broadcast %104 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:52:19.2357782Z %107 = tt.broadcast %105 : 
tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:52:19.2358114Z %108 = arith.addi %106, %107 : tensor<16x128xi32> 2026-02-21T09:52:19.2358451Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2358805Z %110 = tt.addptr %109, %108 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:52:19.2359185Z %111 = tt.load %110 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2359540Z %112 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:52:19.2359902Z %113 = arith.extf %111 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:52:19.2360210Z %114 = tt.broadcast %112 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:52:19.2360522Z %115 = arith.subf %113, %114 : tensor<16x128xf32> 2026-02-21T09:52:19.2360939Z %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:52:19.2361425Z %117 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:52:19.2361821Z %118 = tt.broadcast %117 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:52:19.2362106Z %119 = arith.divf %116, %118 : tensor<16x128xf32> 2026-02-21T09:52:19.2362421Z %120 = arith.truncf %119 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:52:19.2362738Z %121 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2363093Z %122 = tt.addptr %121, %108 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:52:19.2363428Z tt.store %122, %120 : tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2363673Z %c3_i32 = arith.constant 3 : i32 2026-02-21T09:52:19.2363930Z %123 = arith.muli %c128_i32, %c3_i32 : i32 2026-02-21T09:52:19.2364164Z %124 = arith.addi %arg3, %123 : i32 2026-02-21T09:52:19.2364467Z %125 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:52:19.2364761Z %126 = tt.splat %124 : i32 -> tensor<128xi32> 
2026-02-21T09:52:19.2365041Z %127 = arith.addi %126, %125 : tensor<128xi32> 2026-02-21T09:52:19.2365358Z %128 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:52:19.2365664Z %129 = arith.muli %128, %cst : tensor<16x1xi32> 2026-02-21T09:52:19.2365991Z %130 = tt.expand_dims %127 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:52:19.2366316Z %131 = tt.broadcast %129 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:52:19.2366649Z %132 = tt.broadcast %130 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:52:19.2366909Z %133 = arith.addi %131, %132 : tensor<16x128xi32> 2026-02-21T09:52:19.2367213Z %134 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2367565Z %135 = tt.addptr %134, %133 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:52:19.2367912Z %136 = tt.load %135 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2368283Z %137 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:52:19.2368669Z %138 = arith.extf %136 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:52:19.2369009Z %139 = tt.broadcast %137 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:52:19.2369317Z %140 = arith.subf %138, %139 : tensor<16x128xf32> 2026-02-21T09:52:19.2369734Z %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:52:19.2370185Z %142 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:52:19.2370516Z %143 = tt.broadcast %142 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:52:19.2370825Z %144 = arith.divf %141, %143 : tensor<16x128xf32> 2026-02-21T09:52:19.2371112Z %145 = arith.truncf %144 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:52:19.2371507Z %146 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2371900Z %147 = tt.addptr 
%146, %133 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:52:19.2372201Z tt.store %147, %145 : tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2372489Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T09:52:19.2372772Z %27 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T09:52:19.2373112Z %28 = tt.splat %c9216_i32_2 : i32 -> tensor<128xi32> 2026-02-21T09:52:19.2373371Z %29 = arith.addi %28, %27 : tensor<128xi32> 2026-02-21T09:52:19.2373655Z %30 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T09:52:19.2373974Z %31 = arith.muli %30, %cst : tensor<16x1xi32> 2026-02-21T09:52:19.2374264Z %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T09:52:19.2374618Z %33 = tt.broadcast %31 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T09:52:19.2374924Z %34 = tt.broadcast %32 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T09:52:19.2375230Z %35 = arith.addi %33, %34 : tensor<16x128xi32> 2026-02-21T09:52:19.2375532Z %36 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2375850Z %37 = tt.addptr %36, %35 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:52:19.2376221Z %38 = tt.load %37 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2376568Z %39 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:52:19.2376923Z %40 = arith.extf %38 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T09:52:19.2377225Z %41 = tt.broadcast %39 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:52:19.2379383Z %42 = arith.subf %40, %41 : tensor<16x128xf32> 2026-02-21T09:52:19.2379798Z %43 = tt.extern_elementwise %42 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T09:52:19.2380326Z %44 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T09:52:19.2380696Z %45 = tt.broadcast %44 : 
tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T09:52:19.2380985Z %46 = arith.divf %43, %45 : tensor<16x128xf32> 2026-02-21T09:52:19.2381306Z %47 = arith.truncf %46 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T09:52:19.2381670Z %48 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2382036Z %49 = tt.addptr %48, %35 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T09:52:19.2382350Z tt.store %49, %47 : tensor<16x128x!tt.ptr> 2026-02-21T09:52:19.2382745Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T09:52:19.2383133Z tt.return 2026-02-21T09:52:19.2383341Z } 2026-02-21T09:52:19.2383501Z } 2026-02-21T09:52:19.2383626Z 2026-02-21T09:52:19.2383701Z {-# 2026-02-21T09:52:19.2383909Z external_resources: { 2026-02-21T09:52:19.2384187Z mlir_reproducer: { 2026-02-21T09:52:19.2388931Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 
region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:52:19.2393429Z disable_threading: false, 2026-02-21T09:52:19.2393664Z verify_each: true 2026-02-21T09:52:19.2393848Z } 2026-02-21T09:52:19.2394036Z } 2026-02-21T09:52:19.2394195Z #-} 2026-02-21T09:52:19.2394678Z /tmp/torchinductor_root/nt/cntgwioi7k7tujhdfmow44l2zvmhvplvsustl766b4zlecuifsdz.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:52:19.2395919Z /tmp/torchinductor_root/nt/cntgwioi7k7tujhdfmow44l2zvmhvplvsustl766b4zlecuifsdz.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:52:19.2396958Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:52:19.2398087Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_sm_multiplier=32, num_stages=3, num_warps=32, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[2, 3], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:52:19.2399174Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:52:19.2399473Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:52:19.6310771Z module { 2026-02-21T09:52:19.6311441Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:52:19.6312232Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:52:19.6312584Z %cst = arith.constant dense<0.000000e+00> : tensor<8x1024xf16> 2026-02-21T09:52:19.6312910Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T09:52:19.6313172Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:52:19.6313433Z %c592_i32 = arith.constant 592 : i32 2026-02-21T09:52:19.6313722Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<8x1024xf32> 2026-02-21T09:52:19.6314324Z %cst_1 = arith.constant dense<0xFC00> : tensor<8x1024xf16> 2026-02-21T09:52:19.6314613Z %cst_2 = arith.constant dense<9344> : tensor<8x1xi32> 2026-02-21T09:52:19.6314925Z %cst_3 = arith.constant dense<9344> : tensor<1024xi32> 2026-02-21T09:52:19.6315212Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32> 2026-02-21T09:52:19.6315529Z %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32> 2026-02-21T09:52:19.6315814Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:52:19.6316037Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:52:19.6316298Z %c9344_i32 = arith.constant 9344 : i32 2026-02-21T09:52:19.6316522Z %c9344_i64 = arith.constant 9344 : i64 
2026-02-21T09:52:19.6316773Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:52:19.6317134Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c9344_i32], [%c9344_i64, %c1_i64] : , > 2026-02-21T09:52:19.6317642Z %1 = tt.get_program_id x : i32 2026-02-21T09:52:19.6317930Z scf.for %arg2 = %1 to %c512_i32 step %c592_i32 : i32 { 2026-02-21T09:52:19.6318191Z %2 = arith.muli %arg2, %c8_i32 : i32 2026-02-21T09:52:19.6318495Z %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:52:19.6318784Z %4 = tt.splat %2 : i32 -> tensor<8xi32> 2026-02-21T09:52:19.6319046Z %5 = arith.addi %4, %3 : tensor<8xi32> 2026-02-21T09:52:19.6319276Z %c9216_i32 = arith.constant 9216 : i32 2026-02-21T09:52:19.6319538Z %c3072_i32 = arith.constant 3072 : i32 2026-02-21T09:52:19.6319982Z %6:2 = scf.for %arg3 = %c0_i32 to %c9216_i32 step %c3072_i32 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<8xf32>, tensor<8xf32>) : i32 { 2026-02-21T09:52:19.6320451Z %66 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:52:19.6320783Z %67 = tt.splat %arg3 : i32 -> tensor<1024xi32> 2026-02-21T09:52:19.6321036Z %68 = arith.addi %67, %66 : tensor<1024xi32> 2026-02-21T09:52:19.6321322Z %69 = arith.cmpi slt, %68, %cst_3 : tensor<1024xi32> 2026-02-21T09:52:19.6321681Z %70 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:52:19.6322013Z %71 = arith.muli %70, %cst_2 : tensor<8x1xi32> 2026-02-21T09:52:19.6322344Z %72 = tt.expand_dims %68 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:52:19.6322683Z %73 = tt.broadcast %71 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:52:19.6323020Z %74 = tt.broadcast %72 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:52:19.6323304Z %75 = arith.addi %73, %74 : tensor<8x1024xi32> 2026-02-21T09:52:19.6323610Z %76 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:52:19.6324027Z %77 = tt.addptr %76, %75 : tensor<8x1024x!tt.ptr>, 
tensor<8x1024xi32> 2026-02-21T09:52:19.6324454Z %78 = tt.expand_dims %69 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:52:19.6324794Z %79 = tt.broadcast %78 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:52:19.6325118Z %80 = tt.load %77, %79, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T09:52:19.6325450Z %81 = arith.select %79, %80, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:52:19.6325800Z %82 = arith.extf %81 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:52:19.6326078Z %83 = "tt.reduce"(%82) <{axis = 1 : i32}> ({ 2026-02-21T09:52:19.6326343Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:52:19.6326576Z %175 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:52:19.6326841Z tt.reduce.return %175 : f32 2026-02-21T09:52:19.6327096Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:52:19.6327390Z %84 = arith.truncf %83 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:52:19.6327669Z %85 = arith.extf %84 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:52:19.6327965Z %86 = arith.cmpf ogt, %arg4, %85 : tensor<8xf32> 2026-02-21T09:52:19.6328300Z %87 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32> 2026-02-21T09:52:19.6328550Z %88 = arith.ori %86, %87 : tensor<8xi1> 2026-02-21T09:52:19.6328849Z %89 = arith.select %88, %arg4, %85 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:52:19.6329125Z %90 = arith.subf %arg4, %89 : tensor<8xf32> 2026-02-21T09:52:19.6329555Z %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:52:19.6329984Z %92 = arith.mulf %arg5, %91 : tensor<8xf32> 2026-02-21T09:52:19.6330274Z %93 = tt.expand_dims %89 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:52:19.6330632Z %94 = arith.extf %80 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:52:19.6330934Z %95 = tt.broadcast %93 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:52:19.6331303Z %96 = arith.subf %94, %95 : tensor<8x1024xf32> 
2026-02-21T09:52:19.6331745Z %97 = tt.extern_elementwise %96 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:52:19.6332227Z %98 = arith.select %79, %97, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T09:52:19.6332553Z %99 = "tt.reduce"(%98) <{axis = 1 : i32}> ({ 2026-02-21T09:52:19.6332790Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:52:19.6333063Z %175 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:52:19.6333295Z tt.reduce.return %175 : f32 2026-02-21T09:52:19.6333557Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:52:19.6333794Z %100 = arith.addf %92, %99 : tensor<8xf32> 2026-02-21T09:52:19.6334057Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:52:19.6334319Z %101 = arith.muli %c1024_i32, %c1_i32 : i32 2026-02-21T09:52:19.6334550Z %102 = arith.addi %arg3, %101 : i32 2026-02-21T09:52:19.6334860Z %103 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:52:19.6335158Z %104 = tt.splat %102 : i32 -> tensor<1024xi32> 2026-02-21T09:52:19.6335439Z %105 = arith.addi %104, %103 : tensor<1024xi32> 2026-02-21T09:52:19.6335701Z %106 = arith.cmpi slt, %105, %cst_3 : tensor<1024xi32> 2026-02-21T09:52:19.6336033Z %107 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:52:19.6336371Z %108 = arith.muli %107, %cst_2 : tensor<8x1xi32> 2026-02-21T09:52:19.6336679Z %109 = tt.expand_dims %105 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:52:19.6337054Z %110 = tt.broadcast %108 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:52:19.6337365Z %111 = tt.broadcast %109 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:52:19.6337720Z %112 = arith.addi %110, %111 : tensor<8x1024xi32> 2026-02-21T09:52:19.6338007Z %113 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:52:19.6338364Z %114 = tt.addptr %113, %112 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 
2026-02-21T09:52:19.6338742Z %115 = tt.expand_dims %106 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:52:19.6339082Z %116 = tt.broadcast %115 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:52:19.6339410Z %117 = tt.load %114, %116, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T09:52:19.6339726Z %118 = arith.select %116, %117, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:52:19.6340086Z %119 = arith.extf %118 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:52:19.6340391Z %120 = "tt.reduce"(%119) <{axis = 1 : i32}> ({ 2026-02-21T09:52:19.6340624Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:52:19.6340880Z %175 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:52:19.6341112Z tt.reduce.return %175 : f32 2026-02-21T09:52:19.6341366Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:52:19.6341689Z %121 = arith.truncf %120 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:52:19.6342004Z %122 = arith.extf %121 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:52:19.6342273Z %123 = arith.cmpf ogt, %89, %122 : tensor<8xf32> 2026-02-21T09:52:19.6342554Z %124 = arith.cmpf une, %89, %89 : tensor<8xf32> 2026-02-21T09:52:19.6342823Z %125 = arith.ori %123, %124 : tensor<8xi1> 2026-02-21T09:52:19.6343100Z %126 = arith.select %125, %89, %122 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:52:19.6343411Z %127 = arith.subf %89, %126 : tensor<8xf32> 2026-02-21T09:52:19.6343816Z %128 = tt.extern_elementwise %127 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:52:19.6344260Z %129 = arith.mulf %100, %128 : tensor<8xf32> 2026-02-21T09:52:19.6344653Z %130 = tt.expand_dims %126 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:52:19.6344992Z %131 = arith.extf %117 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:52:19.6345332Z %132 = tt.broadcast %130 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:52:19.6345617Z %133 = arith.subf %131, %132 : 
tensor<8x1024xf32> 2026-02-21T09:52:19.6346068Z %134 = tt.extern_elementwise %133 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:52:19.6346502Z %135 = arith.select %116, %134, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T09:52:19.6346840Z %136 = "tt.reduce"(%135) <{axis = 1 : i32}> ({ 2026-02-21T09:52:19.6347104Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:52:19.6347321Z %175 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:52:19.6347561Z tt.reduce.return %175 : f32 2026-02-21T09:52:19.6347792Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:52:19.6348069Z %137 = arith.addf %129, %136 : tensor<8xf32> 2026-02-21T09:52:19.6348307Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:52:19.6348571Z %138 = arith.muli %c1024_i32, %c2_i32 : i32 2026-02-21T09:52:19.6348844Z %139 = arith.addi %arg3, %138 : i32 2026-02-21T09:52:19.6349121Z %140 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:52:19.6349452Z %141 = tt.splat %139 : i32 -> tensor<1024xi32> 2026-02-21T09:52:19.6349703Z %142 = arith.addi %141, %140 : tensor<1024xi32> 2026-02-21T09:52:19.6349991Z %143 = arith.cmpi slt, %142, %cst_3 : tensor<1024xi32> 2026-02-21T09:52:19.6350293Z %144 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:52:19.6350625Z %145 = arith.muli %144, %cst_2 : tensor<8x1xi32> 2026-02-21T09:52:19.6350990Z %146 = tt.expand_dims %142 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:52:19.6351332Z %147 = tt.broadcast %145 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:52:19.6351720Z %148 = tt.broadcast %146 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:52:19.6352017Z %149 = arith.addi %147, %148 : tensor<8x1024xi32> 2026-02-21T09:52:19.6352326Z %150 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:52:19.6352651Z %151 = tt.addptr %150, %149 : tensor<8x1024x!tt.ptr>, 
tensor<8x1024xi32> 2026-02-21T09:52:19.6353030Z %152 = tt.expand_dims %143 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:52:19.6353400Z %153 = tt.broadcast %152 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:52:19.6353701Z %154 = tt.load %151, %153, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T09:52:19.6354044Z %155 = arith.select %153, %154, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:52:19.6354374Z %156 = arith.extf %155 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:52:19.6354679Z %157 = "tt.reduce"(%156) <{axis = 1 : i32}> ({ 2026-02-21T09:52:19.6354973Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:52:19.6355214Z %175 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:52:19.6355486Z tt.reduce.return %175 : f32 2026-02-21T09:52:19.6355716Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:52:19.6356027Z %158 = arith.truncf %157 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:52:19.6356326Z %159 = arith.extf %158 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:52:19.6356636Z %160 = arith.cmpf ogt, %126, %159 : tensor<8xf32> 2026-02-21T09:52:19.6356903Z %161 = arith.cmpf une, %126, %126 : tensor<8xf32> 2026-02-21T09:52:19.6357185Z %162 = arith.ori %160, %161 : tensor<8xi1> 2026-02-21T09:52:19.6357500Z %163 = arith.select %162, %126, %159 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:52:19.6357793Z %164 = arith.subf %126, %163 : tensor<8xf32> 2026-02-21T09:52:19.6358324Z %165 = tt.extern_elementwise %164 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:52:19.6358751Z %166 = arith.mulf %137, %165 : tensor<8xf32> 2026-02-21T09:52:19.6359091Z %167 = tt.expand_dims %163 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:52:19.6359463Z %168 = arith.extf %154 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:52:19.6359788Z %169 = tt.broadcast %167 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:52:19.6360117Z %170 = arith.subf 
%168, %169 : tensor<8x1024xf32> 2026-02-21T09:52:19.6360551Z %171 = tt.extern_elementwise %170 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:52:19.6361067Z %172 = arith.select %153, %171, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T09:52:19.6361387Z %173 = "tt.reduce"(%172) <{axis = 1 : i32}> ({ 2026-02-21T09:52:19.6361699Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:52:19.6361963Z %175 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:52:19.6362204Z tt.reduce.return %175 : f32 2026-02-21T09:52:19.6362504Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:52:19.6362758Z %174 = arith.addf %166, %173 : tensor<8xf32> 2026-02-21T09:52:19.6363060Z scf.yield %163, %174 : tensor<8xf32>, tensor<8xf32> 2026-02-21T09:52:19.6363317Z } {tt.flatten} 2026-02-21T09:52:19.6363596Z %7 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:52:19.6363930Z %8 = tt.splat %c9216_i32 : i32 -> tensor<1024xi32> 2026-02-21T09:52:19.6364190Z %9 = arith.addi %8, %7 : tensor<1024xi32> 2026-02-21T09:52:19.6364483Z %10 = arith.cmpi slt, %9, %cst_3 : tensor<1024xi32> 2026-02-21T09:52:19.6364830Z %11 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:52:19.6365166Z %12 = arith.muli %11, %cst_2 : tensor<8x1xi32> 2026-02-21T09:52:19.6365482Z %13 = tt.expand_dims %9 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:52:19.6365839Z %14 = tt.broadcast %12 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:52:19.6366173Z %15 = tt.broadcast %13 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:52:19.6366450Z %16 = arith.addi %14, %15 : tensor<8x1024xi32> 2026-02-21T09:52:19.6366754Z %17 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:52:19.6367133Z %18 = tt.addptr %17, %16 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:52:19.6367469Z %19 = tt.expand_dims %10 {axis = 0 : i32} : 
tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:52:19.6367824Z %20 = tt.broadcast %19 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:52:19.6368140Z %21 = tt.load %18, %20, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T09:52:19.6368446Z %22 = arith.select %20, %21, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:52:19.6368811Z %23 = arith.extf %22 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:52:19.6369083Z %24 = "tt.reduce"(%23) <{axis = 1 : i32}> ({ 2026-02-21T09:52:19.6369359Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:52:19.6369586Z %66 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:52:19.6369848Z tt.reduce.return %66 : f32 2026-02-21T09:52:19.6370100Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:52:19.6370367Z %25 = arith.truncf %24 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:52:19.6370666Z %26 = arith.extf %25 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:52:19.6370927Z %27 = arith.cmpf ogt, %6#0, %26 : tensor<8xf32> 2026-02-21T09:52:19.6371204Z %28 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32> 2026-02-21T09:52:19.6371447Z %29 = arith.ori %27, %28 : tensor<8xi1> 2026-02-21T09:52:19.6371779Z %30 = arith.select %29, %6#0, %26 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:52:19.6372128Z %31 = arith.subf %6#0, %30 : tensor<8xf32> 2026-02-21T09:52:19.6372519Z %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:52:19.6372955Z %33 = arith.mulf %6#1, %32 : tensor<8xf32> 2026-02-21T09:52:19.6373239Z %34 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:52:19.6373588Z %35 = arith.extf %21 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:52:19.6373867Z %36 = tt.broadcast %34 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:52:19.6374173Z %37 = arith.subf %35, %36 : tensor<8x1024xf32> 2026-02-21T09:52:19.6374605Z %38 = tt.extern_elementwise %37 {libname = "", libpath = "", pure = true, 
symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:52:19.6375052Z %39 = arith.select %20, %38, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T09:52:19.6375373Z %40 = "tt.reduce"(%39) <{axis = 1 : i32}> ({ 2026-02-21T09:52:19.6375609Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:52:19.6375861Z %66 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:52:19.6376090Z tt.reduce.return %66 : f32 2026-02-21T09:52:19.6376348Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:52:19.6376617Z %41 = arith.addf %33, %40 : tensor<8xf32> 2026-02-21T09:52:19.6376853Z %c9216_i32_6 = arith.constant 9216 : i32 2026-02-21T09:52:19.6377123Z %c3072_i32_7 = arith.constant 3072 : i32 2026-02-21T09:52:19.6377395Z scf.for %arg3 = %c0_i32 to %c9216_i32_6 step %c3072_i32_7 : i32 { 2026-02-21T09:52:19.6377754Z %66 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:52:19.6378059Z %67 = tt.splat %arg3 : i32 -> tensor<1024xi32> 2026-02-21T09:52:19.6378369Z %68 = arith.addi %67, %66 : tensor<1024xi32> 2026-02-21T09:52:19.6378664Z %69 = arith.cmpi slt, %68, %cst_3 : tensor<1024xi32> 2026-02-21T09:52:19.6379023Z %70 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:52:19.6379441Z %71 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:52:19.6379768Z %72 = arith.extf %70 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:52:19.6380098Z %73 = tt.broadcast %71 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:52:19.6380372Z %74 = arith.subf %72, %73 : tensor<8x1024xf32> 2026-02-21T09:52:19.6380808Z %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:52:19.6381289Z %76 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:52:19.6381650Z %77 = tt.broadcast %76 : tensor<8x1xf32> -> tensor<8x1024xf32> 
2026-02-21T09:52:19.6381955Z %78 = arith.divf %75, %77 : tensor<8x1024xf32> 2026-02-21T09:52:19.6382234Z %79 = arith.truncf %78 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:52:19.6382613Z %80 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:52:19.6382934Z %81 = arith.muli %80, %cst_2 : tensor<8x1xi32> 2026-02-21T09:52:19.6383232Z %82 = tt.expand_dims %68 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:52:19.6383587Z %83 = tt.broadcast %81 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:52:19.6383891Z %84 = tt.broadcast %82 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:52:19.6384193Z %85 = arith.addi %83, %84 : tensor<8x1024xi32> 2026-02-21T09:52:19.6384468Z %86 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:52:19.6384817Z %87 = tt.addptr %86, %85 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:52:19.6385203Z %88 = tt.expand_dims %69 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:52:19.6385606Z %89 = tt.broadcast %88 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:52:19.6385931Z tt.store %87, %79, %89 : tensor<8x1024x!tt.ptr> 2026-02-21T09:52:19.6386185Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:52:19.6386442Z %90 = arith.muli %c1024_i32, %c1_i32 : i32 2026-02-21T09:52:19.6386710Z %91 = arith.addi %arg3, %90 : i32 2026-02-21T09:52:19.6386991Z %92 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:52:19.6387316Z %93 = tt.splat %91 : i32 -> tensor<1024xi32> 2026-02-21T09:52:19.6387557Z %94 = arith.addi %93, %92 : tensor<1024xi32> 2026-02-21T09:52:19.6387847Z %95 = arith.cmpi slt, %94, %cst_3 : tensor<1024xi32> 2026-02-21T09:52:19.6388187Z %96 = tt.descriptor_load %0[%2, %91] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:52:19.6388593Z %97 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:52:19.6388951Z %98 = arith.extf %96 : 
tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:52:19.6389252Z %99 = tt.broadcast %97 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:52:19.6389559Z %100 = arith.subf %98, %99 : tensor<8x1024xf32> 2026-02-21T09:52:19.6389978Z %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:52:19.6390474Z %102 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:52:19.6390834Z %103 = tt.broadcast %102 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:52:19.6391118Z %104 = arith.divf %101, %103 : tensor<8x1024xf32> 2026-02-21T09:52:19.6391442Z %105 = arith.truncf %104 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:52:19.6391846Z %106 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:52:19.6392178Z %107 = arith.muli %106, %cst_2 : tensor<8x1xi32> 2026-02-21T09:52:19.6392491Z %108 = tt.expand_dims %94 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:52:19.6392856Z %109 = tt.broadcast %107 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:52:19.6393200Z %110 = tt.broadcast %108 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:52:19.6393488Z %111 = arith.addi %109, %110 : tensor<8x1024xi32> 2026-02-21T09:52:19.6393809Z %112 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:52:19.6394138Z %113 = tt.addptr %112, %111 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:52:19.6394520Z %114 = tt.expand_dims %95 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:52:19.6394856Z %115 = tt.broadcast %114 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:52:19.6395185Z tt.store %113, %105, %115 : tensor<8x1024x!tt.ptr> 2026-02-21T09:52:19.6395476Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:52:19.6395740Z %116 = arith.muli %c1024_i32, %c2_i32 : i32 2026-02-21T09:52:19.6396001Z %117 = arith.addi %arg3, %116 : 
i32 2026-02-21T09:52:19.6396281Z %118 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:52:19.6396609Z %119 = tt.splat %117 : i32 -> tensor<1024xi32> 2026-02-21T09:52:19.6396861Z %120 = arith.addi %119, %118 : tensor<1024xi32> 2026-02-21T09:52:19.6397170Z %121 = arith.cmpi slt, %120, %cst_3 : tensor<1024xi32> 2026-02-21T09:52:19.6397543Z %122 = tt.descriptor_load %0[%2, %117] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:52:19.6397922Z %123 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:52:19.6398282Z %124 = arith.extf %122 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:52:19.6398586Z %125 = tt.broadcast %123 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:52:19.6398946Z %126 = arith.subf %124, %125 : tensor<8x1024xf32> 2026-02-21T09:52:19.6399396Z %127 = tt.extern_elementwise %126 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:52:19.6399869Z %128 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:52:19.6400222Z %129 = tt.broadcast %128 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:52:19.6400516Z %130 = arith.divf %127, %129 : tensor<8x1024xf32> 2026-02-21T09:52:19.6400847Z %131 = arith.truncf %130 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:52:19.6401222Z %132 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:52:19.6401569Z %133 = arith.muli %132, %cst_2 : tensor<8x1xi32> 2026-02-21T09:52:19.6401931Z %134 = tt.expand_dims %120 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:52:19.6402284Z %135 = tt.broadcast %133 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:52:19.6402633Z %136 = tt.broadcast %134 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:52:19.6402939Z %137 = arith.addi %135, %136 : tensor<8x1024xi32> 2026-02-21T09:52:19.6403265Z %138 = tt.splat %arg1 : 
!tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:52:19.6403639Z %139 = tt.addptr %138, %137 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:52:19.6404011Z %140 = tt.expand_dims %121 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:52:19.6404395Z %141 = tt.broadcast %140 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:52:19.6404707Z tt.store %139, %131, %141 : tensor<8x1024x!tt.ptr> 2026-02-21T09:52:19.6405024Z } {tt.flatten} 2026-02-21T09:52:19.6405284Z %42 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:52:19.6405641Z %43 = tt.splat %c9216_i32_6 : i32 -> tensor<1024xi32> 2026-02-21T09:52:19.6405940Z %44 = arith.addi %43, %42 : tensor<1024xi32> 2026-02-21T09:52:19.6406207Z %45 = arith.cmpi slt, %44, %cst_3 : tensor<1024xi32> 2026-02-21T09:52:19.6406615Z %46 = tt.descriptor_load %0[%2, %c9216_i32_6] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:52:19.6407024Z %47 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:52:19.6407395Z %48 = arith.extf %46 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:52:19.6407744Z %49 = tt.broadcast %47 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:52:19.6408031Z %50 = arith.subf %48, %49 : tensor<8x1024xf32> 2026-02-21T09:52:19.6408490Z %51 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:52:19.6408944Z %52 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:52:19.6409295Z %53 = tt.broadcast %52 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:52:19.6409597Z %54 = arith.divf %51, %53 : tensor<8x1024xf32> 2026-02-21T09:52:19.6409912Z %55 = arith.truncf %54 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:52:19.6410264Z %56 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:52:19.6410554Z %57 = arith.muli %56, %cst_2 : 
tensor<8x1xi32> 2026-02-21T09:52:19.6410882Z %58 = tt.expand_dims %44 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:52:19.6411208Z %59 = tt.broadcast %57 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:52:19.6411570Z %60 = tt.broadcast %58 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:52:19.6411844Z %61 = arith.addi %59, %60 : tensor<8x1024xi32> 2026-02-21T09:52:19.6412142Z %62 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:52:19.6412573Z %63 = tt.addptr %62, %61 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:52:19.6412910Z %64 = tt.expand_dims %45 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:52:19.6413265Z %65 = tt.broadcast %64 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:52:19.6413552Z tt.store %63, %55, %65 : tensor<8x1024x!tt.ptr> 2026-02-21T09:52:19.6413992Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T09:52:19.6414406Z tt.return 2026-02-21T09:52:19.6414576Z } 2026-02-21T09:52:19.6414774Z } 2026-02-21T09:52:19.6414866Z 2026-02-21T09:52:19.6414937Z {-# 2026-02-21T09:52:19.6415137Z external_resources: { 2026-02-21T09:52:19.6415337Z mlir_reproducer: { 2026-02-21T09:52:19.6419692Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, 
tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T09:52:19.6424314Z disable_threading: false, 2026-02-21T09:52:19.6424521Z verify_each: true 2026-02-21T09:52:19.6424737Z } 2026-02-21T09:52:19.6424923Z } 2026-02-21T09:52:19.6425078Z #-} 2026-02-21T09:52:19.6425568Z /tmp/torchinductor_root/js/cjswdicw4cghex4jyofrnzl3zuhjuy5ja2oac7iwkk7djpqzg7jc.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T09:52:19.6426831Z /tmp/torchinductor_root/js/cjswdicw4cghex4jyofrnzl3zuhjuy5ja2oac7iwkk7djpqzg7jc.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T09:52:19.6427891Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T09:52:19.6429014Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=4, num_warps=32, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[4, 0], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T09:52:19.6430034Z Error: RuntimeError: PassManager::run failed 2026-02-21T09:52:19.6430341Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T09:52:26.2978724Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 10.9 configs/s 2026-02-21T09:52:26.2989139Z [47s] Adaptive compile timeout: 30s (90% percentile=13.3s, bounds=[30.0s, 30s]) 2026-02-21T09:52:27.4009535Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 894.7 configs/s 2026-02-21T09:52:27.4748055Z [48s] Initial random population of 100, 5 starting points: 2026-02-21T09:52:27.4752992Z error=13 2026-02-21T09:52:27.4757226Z timeout=1 2026-02-21T09:52:27.4761732Z ok=86 2026-02-21T09:52:27.4763183Z min=0.0656 2026-02-21T09:52:27.4763424Z mid=0.5489 2026-02-21T09:52:27.4763602Z max=218.3107 2026-02-21T09:52:27.4763824Z best={'block_sizes': [1, 1024], 2026-02-21T09:52:27.4764094Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:52:27.4764438Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T09:52:27.4764669Z 'num_stages': 6, 2026-02-21T09:52:27.4764878Z 'num_warps': 4, 2026-02-21T09:52:27.4765060Z 'pid_type': 'flat', 2026-02-21T09:52:27.4765327Z 'range_flattens': [None, None], 2026-02-21T09:52:27.4765543Z 'range_multi_buffers': [None, True], 2026-02-21T09:52:27.4765793Z 'range_num_stages': [0, 0], 2026-02-21T09:52:27.4766030Z 'range_unroll_factors': [0, 4], 2026-02-21T09:52:27.4766253Z 'range_warp_specializes': [None, False]} 2026-02-21T09:52:27.4767957Z [48s] Fitting 
surrogate: 100 points, 100 targets 2026-02-21T09:52:28.5095211Z [49s] Generation 1 starting: 75 neighbors, 5 active search path(s) 2026-02-21T09:52:51.3903431Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 1.5 configs/s 2026-02-21T09:52:56.0902569Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 16.7 configs/s 2026-02-21T09:53:00.2842133Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 240.2 2026-02-21T09:53:00.2846985Z configs/s 2026-02-21T09:53:00.4932858Z [81s] Generation 1 complete: 2026-02-21T09:53:00.4934508Z ok=81 2026-02-21T09:53:00.4934824Z min=0.0492 2026-02-21T09:53:00.4935084Z mid=0.0798 2026-02-21T09:53:00.4935299Z max=1.3835 2026-02-21T09:53:00.4935570Z best={'block_sizes': [1, 1024], 2026-02-21T09:53:00.4935963Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:53:00.4936410Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:53:00.4936749Z 'num_stages': 3, 2026-02-21T09:53:00.4936998Z 'num_warps': 1, 2026-02-21T09:53:00.4937273Z 'pid_type': 'flat', 2026-02-21T09:53:00.4937537Z 'range_flattens': [None, False], 2026-02-21T09:53:00.4937866Z 'range_multi_buffers': [None, True], 2026-02-21T09:53:00.4938171Z 'range_num_stages': [0, 2], 2026-02-21T09:53:00.4938470Z 'range_unroll_factors': [0, 3], 2026-02-21T09:53:00.4938691Z 'range_warp_specializes': [None, None]} 2026-02-21T09:53:00.4947570Z [81s] Fitting surrogate: 181 points, 181 targets 2026-02-21T09:53:02.0471298Z [83s] Generation 2 starting: 65 neighbors, 5 active search path(s) 2026-02-21T09:53:20.3982501Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 0.5 configs/s 2026-02-21T09:53:24.4776246Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 17.1 configs/s 2026-02-21T09:53:30.4813986Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 168.2 2026-02-21T09:53:30.4815833Z configs/s 2026-02-21T09:53:30.8112255Z [112s] Generation 2 complete: 
2026-02-21T09:53:30.8115641Z ok=71 2026-02-21T09:53:30.8120157Z min=0.0511 2026-02-21T09:53:30.8127409Z mid=0.0614 2026-02-21T09:53:30.8132143Z max=0.1434 2026-02-21T09:53:30.8132688Z best={'block_sizes': [1, 1024], 2026-02-21T09:53:30.8133008Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:53:30.8133393Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:53:30.8133670Z 'num_stages': 3, 2026-02-21T09:53:30.8133862Z 'num_warps': 1, 2026-02-21T09:53:30.8134090Z 'pid_type': 'flat', 2026-02-21T09:53:30.8134661Z 'range_flattens': [None, False], 2026-02-21T09:53:30.8134993Z 'range_multi_buffers': [None, True], 2026-02-21T09:53:30.8135226Z 'range_num_stages': [0, 2], 2026-02-21T09:53:30.8139428Z 'range_unroll_factors': [0, 3], 2026-02-21T09:53:30.8139754Z 'range_warp_specializes': [None, None]} 2026-02-21T09:53:30.8140051Z [112s] Fitting surrogate: 252 points, 252 targets 2026-02-21T09:53:31.8072213Z [113s] Generation 3 starting: 61 neighbors, 5 active search path(s) 2026-02-21T09:53:43.7563118Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 1.4 configs/s 2026-02-21T09:53:47.3615601Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 61/61 17.1 configs/s 2026-02-21T09:53:53.3387923Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 169.2 2026-02-21T09:53:53.3389086Z configs/s 2026-02-21T09:53:53.6509645Z [134s] Generation 3 complete: 2026-02-21T09:53:53.6512628Z ok=67 2026-02-21T09:53:53.6515931Z min=0.0492 2026-02-21T09:53:53.6519057Z mid=0.0573 2026-02-21T09:53:53.6523660Z max=0.2036 2026-02-21T09:53:53.6528116Z best={'block_sizes': [1, 1024], 2026-02-21T09:53:53.6532030Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:53:53.6532430Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T09:53:53.6536360Z 'num_stages': 2, 2026-02-21T09:53:53.6540763Z 'num_warps': 1, 2026-02-21T09:53:53.6545625Z 'pid_type': 'flat', 2026-02-21T09:53:53.6549869Z 
'range_flattens': [None, False], 2026-02-21T09:53:53.6554351Z 'range_multi_buffers': [None, True], 2026-02-21T09:53:53.6558710Z 'range_num_stages': [0, 2], 2026-02-21T09:53:53.6561812Z 'range_unroll_factors': [0, 3], 2026-02-21T09:53:53.6563861Z 'range_warp_specializes': [None, None]} 2026-02-21T09:53:53.6564607Z [134s] Fitting surrogate: 319 points, 319 targets 2026-02-21T09:53:54.3824253Z [135s] Generation 4 starting: 47 neighbors, 4 active search path(s) 2026-02-21T09:54:04.4298995Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48/48 2.7 configs/s 2026-02-21T09:54:07.2780369Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 48/48 17.1 configs/s 2026-02-21T09:54:11.6049648Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 272.8 2026-02-21T09:54:11.6050982Z configs/s 2026-02-21T09:54:11.8126641Z [153s] Generation 4 complete: 2026-02-21T09:54:11.8131000Z ok=52 2026-02-21T09:54:11.8135370Z min=0.0492 2026-02-21T09:54:11.8139312Z mid=0.0553 2026-02-21T09:54:11.8143786Z max=0.2171 2026-02-21T09:54:11.8148132Z best={'block_sizes': [1, 1024], 2026-02-21T09:54:11.8152769Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T09:54:11.8156028Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T09:54:11.8156370Z 'num_stages': 6, 2026-02-21T09:54:11.8160432Z 'num_warps': 1, 2026-02-21T09:54:11.8162198Z 'pid_type': 'flat', 2026-02-21T09:54:11.8162489Z 'range_flattens': [None, None], 2026-02-21T09:54:11.8163025Z 'range_multi_buffers': [None, True], 2026-02-21T09:54:11.8163302Z 'range_num_stages': [0, 1], 2026-02-21T09:54:11.8163524Z 'range_unroll_factors': [0, 3], 2026-02-21T09:54:11.8163795Z 'range_warp_specializes': [None, False]} 2026-02-21T09:54:11.8164072Z [153s] Fitting surrogate: 371 points, 371 targets 2026-02-21T09:54:12.4133986Z [153s] Generation 5 starting: 34 neighbors, 3 active search path(s) 2026-02-21T09:54:18.6278894Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 4.8 
configs/s 2026-02-21T09:54:20.7034889Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 17.2 configs/s 2026-02-21T09:54:23.8668086Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 318.6 2026-02-21T09:54:23.8669215Z configs/s 2026-02-21T09:54:24.0436721Z [165s] Generation 5 complete: 2026-02-21T09:54:24.0440958Z ok=38 2026-02-21T09:54:24.0442794Z min=0.0480 2026-02-21T09:54:24.0443052Z mid=0.0532 2026-02-21T09:54:24.0443266Z max=0.1272 2026-02-21T09:54:24.0443447Z best={'block_sizes': [1, 8192], 2026-02-21T09:54:24.0443750Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:54:24.0444029Z 'load_eviction_policies': ['last', ''], 2026-02-21T09:54:24.0444293Z 'num_stages': 1, 2026-02-21T09:54:24.0444473Z 'num_warps': 4, 2026-02-21T09:54:24.0444680Z 'pid_type': 'flat', 2026-02-21T09:54:24.0444873Z 'range_flattens': [None, True], 2026-02-21T09:54:24.0445123Z 'range_multi_buffers': [None, None], 2026-02-21T09:54:24.0445376Z 'range_num_stages': [0, 3], 2026-02-21T09:54:24.0445583Z 'range_unroll_factors': [0, 2], 2026-02-21T09:54:24.0445833Z 'range_warp_specializes': [None, False]} 2026-02-21T09:54:24.0453518Z [165s] Fitting surrogate: 409 points, 409 targets 2026-02-21T09:54:24.4942612Z [165s] Generation 6 starting: 26 neighbors, 2 active search path(s) 2026-02-21T09:54:32.1565750Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 1.5 configs/s 2026-02-21T09:54:33.7572283Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 27/27 17.3 configs/s 2026-02-21T09:54:36.1271102Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 424.1 2026-02-21T09:54:36.1271832Z configs/s 2026-02-21T09:54:36.2625102Z [177s] Generation 6 complete: 2026-02-21T09:54:36.2630284Z ok=28 2026-02-21T09:54:36.2631890Z min=0.0472 2026-02-21T09:54:36.2632121Z mid=0.0532 2026-02-21T09:54:36.2632290Z max=0.2273 2026-02-21T09:54:36.2632500Z best={'block_sizes': [1, 8192], 2026-02-21T09:54:36.2632770Z 'indexing': 
['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:54:36.2633072Z 'load_eviction_policies': ['last', ''], 2026-02-21T09:54:36.2633628Z 'num_stages': 1, 2026-02-21T09:54:36.2633847Z 'num_warps': 4, 2026-02-21T09:54:36.2634034Z 'pid_type': 'flat', 2026-02-21T09:54:36.2634269Z 'range_flattens': [None, True], 2026-02-21T09:54:36.2634504Z 'range_multi_buffers': [None, None], 2026-02-21T09:54:36.2634866Z 'range_num_stages': [0, 3], 2026-02-21T09:54:36.2635099Z 'range_unroll_factors': [0, 2], 2026-02-21T09:54:36.2635321Z 'range_warp_specializes': [None, False]} 2026-02-21T09:54:36.2640449Z [177s] Fitting surrogate: 437 points, 437 targets 2026-02-21T09:54:36.5508845Z [177s] Generation 7 starting: 9 neighbors, 1 active search path(s) 2026-02-21T09:54:39.9020148Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9/9 2.4 configs/s 2026-02-21T09:54:40.4376117Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 18.4 configs/s 2026-02-21T09:54:41.1847572Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1319.1 2026-02-21T09:54:41.1851794Z configs/s 2026-02-21T09:54:41.2435829Z [182s] Generation 7 complete: 2026-02-21T09:54:41.2440107Z ok=10 2026-02-21T09:54:41.2441841Z min=0.0490 2026-02-21T09:54:41.2442140Z mid=0.0553 2026-02-21T09:54:41.2447092Z max=0.0758 2026-02-21T09:54:41.2448463Z best={'block_sizes': [1, 8192], 2026-02-21T09:54:41.2448862Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:54:41.2453879Z 'load_eviction_policies': ['last', ''], 2026-02-21T09:54:41.2455395Z 'num_stages': 1, 2026-02-21T09:54:41.2455702Z 'num_warps': 4, 2026-02-21T09:54:41.2458684Z 'pid_type': 'flat', 2026-02-21T09:54:41.2458984Z 'range_flattens': [None, True], 2026-02-21T09:54:41.2459252Z 'range_multi_buffers': [None, None], 2026-02-21T09:54:41.2462956Z 'range_num_stages': [0, 3], 2026-02-21T09:54:41.2467276Z 'range_unroll_factors': [0, 2], 2026-02-21T09:54:41.2471446Z 'range_warp_specializes': [None, False]} 
2026-02-21T09:54:41.2475911Z [182s] Fitting surrogate: 447 points, 447 targets 2026-02-21T09:54:41.4195133Z [182s] Autotuning complete in 182.7s after searching 426 configs. 2026-02-21T09:54:41.4199534Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:54:41.4201923Z @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', ''], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 2], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T09:54:41.4202876Z 2026-02-21T09:54:41.4203160Z [182s] Code of selected kernel: /tmp/torchinductor_root/7k/c7k7oelmsebgubewu5tkpv7n5yseqxxi252ufqh4z6tavu5nsnni.py 2026-02-21T09:54:41.4419596Z from __future__ import annotations 2026-02-21T09:54:41.4423153Z 2026-02-21T09:54:41.4427815Z import torch 2026-02-21T09:54:41.4429781Z import triton 2026-02-21T09:54:41.4430054Z import triton.language as tl 2026-02-21T09:54:41.4430317Z from torch._inductor.runtime import triton_helpers 2026-02-21T09:54:41.4430666Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T09:54:41.4431027Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:54:41.4431230Z 2026-02-21T09:54:41.4431322Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T09:54:41.4431689Z _BLOCK_SIZE_1 = tl.constexpr(8192) 2026-02-21T09:54:41.4431830Z 2026-02-21T09:54:41.4431910Z @triton.jit 2026-02-21T09:54:41.4432124Z def _helion_softmax_two_pass(x, out): 2026-02-21T09:54:41.4432416Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:54:41.4432734Z pid_0 = tl.program_id(0) 2026-02-21T09:54:41.4432936Z offset_0 = pid_0 2026-02-21T09:54:41.4433181Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T09:54:41.4433546Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), 
dtype=torch.float32) 2026-02-21T09:54:41.4433876Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T09:54:41.4434416Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:54:41.4434703Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T09:54:41.4435029Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:54:41.4435422Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:54:41.4435736Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:54:41.4436036Z # src[softmax.py:82-89]: ... 2026-02-21T09:54:41.4436428Z for offset_2 in tl.range(0, 9344, _BLOCK_SIZE_1, loop_unroll_factor=2, warp_specialize=False, num_stages=1, flatten=True): 2026-02-21T09:54:41.4436900Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:54:41.4437175Z mask_1 = indices_2 < 9344 2026-02-21T09:54:41.4437410Z mi_copy = mi 2026-02-21T09:54:41.4437591Z di_copy = di 2026-02-21T09:54:41.4437798Z mi_copy_0 = mi_copy 2026-02-21T09:54:41.4438043Z di_copy_0 = di_copy 2026-02-21T09:54:41.4438262Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:54:41.4438771Z values = tl.load(x + (indices_0[:, None] * 9344 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_last') 2026-02-21T09:54:41.4439235Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:54:41.4439685Z _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16)) 2026-02-21T09:54:41.4440146Z local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16) 2026-02-21T09:54:41.4440447Z # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax) 2026-02-21T09:54:41.4440749Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T09:54:41.4441022Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T09:54:41.4441325Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 
2026-02-21T09:54:41.4441671Z v_2 = mi_copy_0 - v_1 2026-02-21T09:54:41.4441888Z v_3 = libdevice.exp(v_2) 2026-02-21T09:54:41.4442124Z v_4 = di_copy_0 * v_3 2026-02-21T09:54:41.4442354Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:54:41.4442632Z subscript = v_1[:, None] 2026-02-21T09:54:41.4442845Z v_5 = tl.cast(values, tl.float32) 2026-02-21T09:54:41.4443091Z v_6 = v_5 - subscript 2026-02-21T09:54:41.4443376Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:54:41.4443679Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:54:41.4443956Z # src[softmax.py:88]: ).sum(dim=1) 2026-02-21T09:54:41.4444181Z v_7 = libdevice.exp(v_6) 2026-02-21T09:54:41.4444567Z _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32)) 2026-02-21T09:54:41.4444957Z sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32) 2026-02-21T09:54:41.4445225Z di = v_4 + sum_1 2026-02-21T09:54:41.4445451Z # src[softmax.py:89]: mi = mi_next 2026-02-21T09:54:41.4445663Z mi = v_1 2026-02-21T09:54:41.4445929Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:54:41.4446238Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:54:41.4446594Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:54:41.4447063Z for offset_2 in tl.range(0, 9344, _BLOCK_SIZE_1, loop_unroll_factor=2, warp_specialize=False, num_stages=1, flatten=True): 2026-02-21T09:54:41.4447523Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:54:41.4447821Z mask_2 = indices_2 < 9344 2026-02-21T09:54:41.4448025Z mi_copy_1 = mi 2026-02-21T09:54:41.4448239Z di_copy_1 = di 2026-02-21T09:54:41.4448430Z mi_copy_1_0 = mi_copy_1 2026-02-21T09:54:41.4448662Z di_copy_1_0 = di_copy_1 2026-02-21T09:54:41.4448932Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:54:41.4449306Z values_1 = tl.load(x + 
(indices_0[:, None] * 9344 + indices_2[None, :] * 1), mask_2[None, :], other=0) 2026-02-21T09:54:41.4449759Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:54:41.4450102Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T09:54:41.4450355Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T09:54:41.4450588Z v_10 = v_9 - subscript_1 2026-02-21T09:54:41.4450825Z v_11 = libdevice.exp(v_10) 2026-02-21T09:54:41.4451048Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T09:54:41.4451292Z v_12 = v_11 / subscript_2 2026-02-21T09:54:41.4451566Z v_13 = tl.cast(v_12, tl.float16) 2026-02-21T09:54:41.4451908Z tl.store(out + (indices_0[:, None] * 9344 + indices_2[None, :] * 1), v_13, mask_2[None, :]) 2026-02-21T09:54:41.4452143Z 2026-02-21T09:54:41.4452322Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T09:54:41.4452598Z """ 2026-02-21T09:54:41.4452882Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T09:54:41.4453326Z This version uses fewer passes but is less numerically stable. 2026-02-21T09:54:41.4453623Z Args: 2026-02-21T09:54:41.4453828Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T09:54:41.4454097Z Returns: 2026-02-21T09:54:41.4454321Z torch.Tensor: Softmax output tensor of the same shape. 2026-02-21T09:54:41.4454608Z """ 2026-02-21T09:54:41.4454816Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T09:54:41.4455044Z m, n = x.size() 2026-02-21T09:54:41.4455276Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T09:54:41.4455516Z out = torch.empty_like(x) 2026-02-21T09:54:41.4455804Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:54:41.4456159Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:54:41.4456539Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:54:41.4456842Z # src[softmax.py:79-92]: ... 
2026-02-21T09:54:41.4457140Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=1) 2026-02-21T09:54:41.4457478Z # src[softmax.py:93]: return out 2026-02-21T09:54:41.4457687Z return out 2026-02-21T09:54:42.5312639Z WARNING:tritonbench.utils.triton_op:Completed input ID 71: 2026-02-21T09:54:42.5316845Z (M, N) 2026-02-21T09:54:42.5318288Z ------------ 2026-02-21T09:54:42.5318563Z (4096, 9344) 2026-02-21T09:54:42.5318671Z 2026-02-21T09:54:42.5325786Z 75%|███████▌ | 15/20 [45:47<16:23, 196.66s/it]WARNING:tritonbench.utils.triton_op:Running input ID 77: 2026-02-21T09:54:42.5327561Z (M, N) 2026-02-21T09:54:42.5327836Z ------------- 2026-02-21T09:54:42.5331973Z (4096, 10112) 2026-02-21T09:54:42.5336443Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T09:54:43.7389299Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T09:54:45.1404267Z INFO:tritonbench.utils.triton_op:Took 2.25ms to get benchmark function for torch_compile_softmax 2026-02-21T09:54:46.4619831Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:54:46.4624367Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:54:46.4628175Z 'dtype': 'torch.float16', 2026-02-21T09:54:46.4629702Z 'shape': (4096, 10112), 2026-02-21T09:54:46.4630049Z 'stride': (10112, 1)},), 2026-02-21T09:54:46.4635015Z 'kwargs': {}} 2026-02-21T09:54:46.4639708Z INFO:tritonbench.utils.triton_op:Took 2.17ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T09:54:46.6356122Z [0s] Autotune random seed: 2138408546 2026-02-21T09:54:46.6600668Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:55:22.6519002Z [35s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', ''], maxnreg=32, num_sm_multiplier=2, num_stages=8, 
num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[3, 2], range_unroll_factors=[4, 1], range_warp_specializes=[False, False]) 2026-02-21T09:55:25.6869697Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s 2026-02-21T09:55:28.0705784Z module { 2026-02-21T09:55:28.0708124Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T09:55:28.0713051Z %c512_i32 = arith.constant 512 : i32 2026-02-21T09:55:28.0717597Z %cst = arith.constant dense<0.000000e+00> : tensor<8x1024xf16> 2026-02-21T09:55:28.0718868Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T09:55:28.0719223Z %c0_i32 = arith.constant 0 : i32 2026-02-21T09:55:28.0719481Z %c592_i32 = arith.constant 592 : i32 2026-02-21T09:55:28.0720149Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<8x1024xf32> 2026-02-21T09:55:28.0720526Z %cst_1 = arith.constant dense<0xFC00> : tensor<8x1024xf16> 2026-02-21T09:55:28.0720830Z %cst_2 = arith.constant dense<10112> : tensor<8x1xi32> 2026-02-21T09:55:28.0721157Z %cst_3 = arith.constant dense<10112> : tensor<1024xi32> 2026-02-21T09:55:28.0721460Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32> 2026-02-21T09:55:28.0721881Z %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32> 2026-02-21T09:55:28.0723515Z %c8_i32 = arith.constant 8 : i32 2026-02-21T09:55:28.0723792Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T09:55:28.0724075Z %c10112_i32 = arith.constant 10112 : i32 2026-02-21T09:55:28.0724351Z %c10112_i64 = arith.constant 10112 : i64 2026-02-21T09:55:28.0724581Z %c1_i64 = arith.constant 1 : i64 2026-02-21T09:55:28.0724999Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c10112_i32], [%c10112_i64, %c1_i64] : <f16>, <tensor<8x1024xf16>> 2026-02-21T09:55:28.0725377Z %1 = tt.get_program_id x : i32 2026-02-21T09:55:28.0725662Z scf.for %arg2 = %1 to 
%c512_i32 step %c592_i32 : i32 { 2026-02-21T09:55:28.0725923Z %2 = arith.muli %arg2, %c8_i32 : i32 2026-02-21T09:55:28.0726251Z %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T09:55:28.0726567Z %4 = tt.splat %2 : i32 -> tensor<8xi32> 2026-02-21T09:55:28.0726807Z %5 = arith.addi %4, %3 : tensor<8xi32> 2026-02-21T09:55:28.0727081Z %c9216_i32 = arith.constant 9216 : i32 2026-02-21T09:55:28.0727317Z %c3072_i32 = arith.constant 3072 : i32 2026-02-21T09:55:28.0727789Z %6:2 = scf.for %arg3 = %c0_i32 to %c9216_i32 step %c3072_i32 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<8xf32>, tensor<8xf32>) : i32 { 2026-02-21T09:55:28.0728287Z %66 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:55:28.0728652Z %67 = tt.splat %arg3 : i32 -> tensor<1024xi32> 2026-02-21T09:55:28.0728956Z %68 = arith.addi %67, %66 : tensor<1024xi32> 2026-02-21T09:55:28.0729238Z %69 = arith.cmpi slt, %68, %cst_3 : tensor<1024xi32> 2026-02-21T09:55:28.0729591Z %70 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:55:28.0729884Z %71 = arith.muli %70, %cst_2 : tensor<8x1xi32> 2026-02-21T09:55:28.0730239Z %72 = tt.expand_dims %68 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:55:28.0730594Z %73 = tt.broadcast %71 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:55:28.0730948Z %74 = tt.broadcast %72 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:55:28.0731278Z %75 = arith.addi %73, %74 : tensor<8x1024xi32> 2026-02-21T09:55:28.0731651Z %76 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:55:28.0732298Z %77 = tt.addptr %76, %75 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:55:28.0732673Z %78 = tt.expand_dims %69 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:55:28.0733119Z %79 = tt.broadcast %78 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:55:28.0733464Z %80 = tt.load %77, %79, %cst : 
tensor<8x1024x!tt.ptr<f16>> 2026-02-21T09:55:28.0733793Z %81 = arith.select %79, %80, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:55:28.0734158Z %82 = arith.extf %81 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:55:28.0734464Z %83 = "tt.reduce"(%82) <{axis = 1 : i32}> ({ 2026-02-21T09:55:28.0734744Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:55:28.0734991Z %175 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:55:28.0735270Z tt.reduce.return %175 : f32 2026-02-21T09:55:28.0735544Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:55:28.0735827Z %84 = arith.truncf %83 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:55:28.0736159Z %85 = arith.extf %84 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:55:28.0736515Z %86 = arith.cmpf ogt, %arg4, %85 : tensor<8xf32> 2026-02-21T09:55:28.0736814Z %87 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32> 2026-02-21T09:55:28.0737069Z %88 = arith.ori %86, %87 : tensor<8xi1> 2026-02-21T09:55:28.0737371Z %89 = arith.select %88, %arg4, %85 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:55:28.0737680Z %90 = arith.subf %arg4, %89 : tensor<8xf32> 2026-02-21T09:55:28.0738159Z %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:55:28.0738576Z %92 = arith.mulf %arg5, %91 : tensor<8xf32> 2026-02-21T09:55:28.0738863Z %93 = tt.expand_dims %89 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:55:28.0739211Z %94 = arith.extf %80 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:55:28.0739512Z %95 = tt.broadcast %93 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:55:28.0739830Z %96 = arith.subf %94, %95 : tensor<8x1024xf32> 2026-02-21T09:55:28.0740269Z %97 = tt.extern_elementwise %96 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:55:28.0740720Z %98 = arith.select %79, %97, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 
2026-02-21T09:55:28.0741037Z %99 = "tt.reduce"(%98) <{axis = 1 : i32}> ({ 2026-02-21T09:55:28.0741276Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:55:28.0741527Z %175 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:55:28.0741833Z tt.reduce.return %175 : f32 2026-02-21T09:55:28.0742062Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:55:28.0742328Z %100 = arith.addf %92, %99 : tensor<8xf32> 2026-02-21T09:55:28.0742560Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:55:28.0742809Z %101 = arith.muli %c1024_i32, %c1_i32 : i32 2026-02-21T09:55:28.0743055Z %102 = arith.addi %arg3, %101 : i32 2026-02-21T09:55:28.0743363Z %103 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:55:28.0743667Z %104 = tt.splat %102 : i32 -> tensor<1024xi32> 2026-02-21T09:55:28.0743947Z %105 = arith.addi %104, %103 : tensor<1024xi32> 2026-02-21T09:55:28.0744236Z %106 = arith.cmpi slt, %105, %cst_3 : tensor<1024xi32> 2026-02-21T09:55:28.0744538Z %107 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:55:28.0744871Z %108 = arith.muli %107, %cst_2 : tensor<8x1xi32> 2026-02-21T09:55:28.0745176Z %109 = tt.expand_dims %105 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:55:28.0745542Z %110 = tt.broadcast %108 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:55:28.0745877Z %111 = tt.broadcast %109 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:55:28.0746209Z %112 = arith.addi %110, %111 : tensor<8x1024xi32> 2026-02-21T09:55:28.0746524Z %113 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:55:28.0746893Z %114 = tt.addptr %113, %112 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:55:28.0747275Z %115 = tt.expand_dims %106 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:55:28.0747618Z %116 = tt.broadcast %115 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:55:28.0747953Z %117 = tt.load %114, %116, %cst : tensor<8x1024x!tt.ptr> 
2026-02-21T09:55:28.0748299Z %118 = arith.select %116, %117, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:55:28.0748627Z %119 = arith.extf %118 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:55:28.0748934Z %120 = "tt.reduce"(%119) <{axis = 1 : i32}> ({ 2026-02-21T09:55:28.0749172Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:55:28.0749433Z %175 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:55:28.0749673Z tt.reduce.return %175 : f32 2026-02-21T09:55:28.0749986Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:55:28.0750281Z %121 = arith.truncf %120 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:55:28.0750569Z %122 = arith.extf %121 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:55:28.0750874Z %123 = arith.cmpf ogt, %89, %122 : tensor<8xf32> 2026-02-21T09:55:28.0751131Z %124 = arith.cmpf une, %89, %89 : tensor<8xf32> 2026-02-21T09:55:28.0751405Z %125 = arith.ori %123, %124 : tensor<8xi1> 2026-02-21T09:55:28.0751700Z %126 = arith.select %125, %89, %122 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:55:28.0752002Z %127 = arith.subf %89, %126 : tensor<8xf32> 2026-02-21T09:55:28.0752431Z %128 = tt.extern_elementwise %127 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:55:28.0752836Z %129 = arith.mulf %100, %128 : tensor<8xf32> 2026-02-21T09:55:28.0753157Z %130 = tt.expand_dims %126 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:55:28.0753488Z %131 = arith.extf %117 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:55:28.0753821Z %132 = tt.broadcast %130 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:55:28.0754100Z %133 = arith.subf %131, %132 : tensor<8x1024xf32> 2026-02-21T09:55:28.0754540Z %134 = tt.extern_elementwise %133 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:55:28.0755031Z %135 = arith.select %116, %134, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 
2026-02-21T09:55:28.0755331Z %136 = "tt.reduce"(%135) <{axis = 1 : i32}> ({ 2026-02-21T09:55:28.0755593Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:55:28.0755812Z %175 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:55:28.0756068Z tt.reduce.return %175 : f32 2026-02-21T09:55:28.0756322Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:55:28.0756568Z %137 = arith.addf %129, %136 : tensor<8xf32> 2026-02-21T09:55:28.0756831Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:55:28.0757059Z %138 = arith.muli %c1024_i32, %c2_i32 : i32 2026-02-21T09:55:28.0757320Z %139 = arith.addi %arg3, %138 : i32 2026-02-21T09:55:28.0757593Z %140 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:55:28.0757919Z %141 = tt.splat %139 : i32 -> tensor<1024xi32> 2026-02-21T09:55:28.0758167Z %142 = arith.addi %141, %140 : tensor<1024xi32> 2026-02-21T09:55:28.0758461Z %143 = arith.cmpi slt, %142, %cst_3 : tensor<1024xi32> 2026-02-21T09:55:28.0758790Z %144 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:55:28.0759093Z %145 = arith.muli %144, %cst_2 : tensor<8x1xi32> 2026-02-21T09:55:28.0759457Z %146 = tt.expand_dims %142 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:55:28.0759799Z %147 = tt.broadcast %145 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:55:28.0760172Z %148 = tt.broadcast %146 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:55:28.0760495Z %149 = arith.addi %147, %148 : tensor<8x1024xi32> 2026-02-21T09:55:28.0760783Z %150 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:55:28.0761136Z %151 = tt.addptr %150, %149 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:55:28.0761486Z %152 = tt.expand_dims %143 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:55:28.0761878Z %153 = tt.broadcast %152 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:55:28.0762177Z %154 = tt.load %151, %153, %cst : tensor<8x1024x!tt.ptr> 
2026-02-21T09:55:28.0762520Z %155 = arith.select %153, %154, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T09:55:28.0762878Z %156 = arith.extf %155 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:55:28.0763212Z %157 = "tt.reduce"(%156) <{axis = 1 : i32}> ({ 2026-02-21T09:55:28.0763478Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:55:28.0763706Z %175 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T09:55:28.0763971Z tt.reduce.return %175 : f32 2026-02-21T09:55:28.0764197Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:55:28.0764497Z %158 = arith.truncf %157 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:55:28.0764813Z %159 = arith.extf %158 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:55:28.0765088Z %160 = arith.cmpf ogt, %126, %159 : tensor<8xf32> 2026-02-21T09:55:28.0765377Z %161 = arith.cmpf une, %126, %126 : tensor<8xf32> 2026-02-21T09:55:28.0765625Z %162 = arith.ori %160, %161 : tensor<8xi1> 2026-02-21T09:55:28.0765927Z %163 = arith.select %162, %126, %159 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:55:28.0766210Z %164 = arith.subf %126, %163 : tensor<8xf32> 2026-02-21T09:55:28.0766645Z %165 = tt.extern_elementwise %164 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:55:28.0767075Z %166 = arith.mulf %137, %165 : tensor<8xf32> 2026-02-21T09:55:28.0767365Z %167 = tt.expand_dims %163 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:55:28.0767725Z %168 = arith.extf %154 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:55:28.0768031Z %169 = tt.broadcast %167 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:55:28.0768341Z %170 = arith.subf %168, %169 : tensor<8x1024xf32> 2026-02-21T09:55:28.0768786Z %171 = tt.extern_elementwise %170 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:55:28.0769262Z %172 = arith.select %153, %171, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 
2026-02-21T09:55:28.0769604Z %173 = "tt.reduce"(%172) <{axis = 1 : i32}> ({ 2026-02-21T09:55:28.0769848Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T09:55:28.0770114Z %175 = arith.addf %arg6, %arg7 : f32 2026-02-21T09:55:28.0770351Z tt.reduce.return %175 : f32 2026-02-21T09:55:28.0770616Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:55:28.0770901Z %174 = arith.addf %166, %173 : tensor<8xf32> 2026-02-21T09:55:28.0771175Z scf.yield %163, %174 : tensor<8xf32>, tensor<8xf32> 2026-02-21T09:55:28.0771461Z } {tt.flatten} 2026-02-21T09:55:28.0771737Z %7 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:55:28.0772085Z %8 = tt.splat %c9216_i32 : i32 -> tensor<1024xi32> 2026-02-21T09:55:28.0773998Z %9 = arith.addi %8, %7 : tensor<1024xi32> 2026-02-21T09:55:28.0774305Z %10 = arith.cmpi slt, %9, %cst_3 : tensor<1024xi32> 2026-02-21T09:55:28.0774688Z %11 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:55:28.0775025Z %12 = arith.muli %11, %cst_2 : tensor<8x1xi32> 2026-02-21T09:55:28.0775405Z %13 = tt.expand_dims %9 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:55:28.0775785Z %14 = tt.broadcast %12 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:55:28.0776103Z %15 = tt.broadcast %13 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:55:28.0776419Z %16 = arith.addi %14, %15 : tensor<8x1024xi32> 2026-02-21T09:55:28.0776739Z %17 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:55:28.0777084Z %18 = tt.addptr %17, %16 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:55:28.0777463Z %19 = tt.expand_dims %10 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:55:28.0777846Z %20 = tt.broadcast %19 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:55:28.0778136Z %21 = tt.load %18, %20, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T09:55:28.0778469Z %22 = arith.select %20, %21, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 
2026-02-21T09:55:28.0778850Z %23 = arith.extf %22 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:55:28.0779115Z %24 = "tt.reduce"(%23) <{axis = 1 : i32}> ({ 2026-02-21T09:55:28.0779374Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T09:55:28.0779598Z %66 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T09:55:28.0779855Z tt.reduce.return %66 : f32 2026-02-21T09:55:28.0780080Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:55:28.0780371Z %25 = arith.truncf %24 : tensor<8xf32> to tensor<8xf16> 2026-02-21T09:55:28.0780677Z %26 = arith.extf %25 : tensor<8xf16> to tensor<8xf32> 2026-02-21T09:55:28.0780939Z %27 = arith.cmpf ogt, %6#0, %26 : tensor<8xf32> 2026-02-21T09:55:28.0781221Z %28 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32> 2026-02-21T09:55:28.0781459Z %29 = arith.ori %27, %28 : tensor<8xi1> 2026-02-21T09:55:28.0781789Z %30 = arith.select %29, %6#0, %26 : tensor<8xi1>, tensor<8xf32> 2026-02-21T09:55:28.0782053Z %31 = arith.subf %6#0, %30 : tensor<8xf32> 2026-02-21T09:55:28.0782477Z %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T09:55:28.0782897Z %33 = arith.mulf %6#1, %32 : tensor<8xf32> 2026-02-21T09:55:28.0783186Z %34 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:55:28.0783534Z %35 = arith.extf %21 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:55:28.0783834Z %36 = tt.broadcast %34 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:55:28.0784135Z %37 = arith.subf %35, %36 : tensor<8x1024xf32> 2026-02-21T09:55:28.0784534Z %38 = tt.extern_elementwise %37 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:55:28.0785005Z %39 = arith.select %20, %38, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T09:55:28.0785325Z %40 = "tt.reduce"(%39) <{axis = 1 : i32}> ({ 2026-02-21T09:55:28.0785559Z ^bb0(%arg3: f32, %arg4: f32): 
2026-02-21T09:55:28.0785812Z %66 = arith.addf %arg3, %arg4 : f32 2026-02-21T09:55:28.0786041Z tt.reduce.return %66 : f32 2026-02-21T09:55:28.0786264Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T09:55:28.0786498Z %41 = arith.addf %33, %40 : tensor<8xf32> 2026-02-21T09:55:28.0786761Z %c9216_i32_6 = arith.constant 9216 : i32 2026-02-21T09:55:28.0787025Z %c3072_i32_7 = arith.constant 3072 : i32 2026-02-21T09:55:28.0787300Z scf.for %arg3 = %c0_i32 to %c9216_i32_6 step %c3072_i32_7 : i32 { 2026-02-21T09:55:28.0787655Z %66 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:55:28.0788031Z %67 = tt.splat %arg3 : i32 -> tensor<1024xi32> 2026-02-21T09:55:28.0788337Z %68 = arith.addi %67, %66 : tensor<1024xi32> 2026-02-21T09:55:28.0788597Z %69 = arith.cmpi slt, %68, %cst_3 : tensor<1024xi32> 2026-02-21T09:55:28.0788976Z %70 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:55:28.0789423Z %71 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:55:28.0789748Z %72 = arith.extf %70 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:55:28.0790070Z %73 = tt.broadcast %71 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:55:28.0790343Z %74 = arith.subf %72, %73 : tensor<8x1024xf32> 2026-02-21T09:55:28.0790776Z %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:55:28.0791253Z %76 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:55:28.0791617Z %77 = tt.broadcast %76 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:55:28.0791918Z %78 = arith.divf %75, %77 : tensor<8x1024xf32> 2026-02-21T09:55:28.0792221Z %79 = arith.truncf %78 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:55:28.0792573Z %80 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:55:28.0792868Z %81 = arith.muli 
%80, %cst_2 : tensor<8x1xi32> 2026-02-21T09:55:28.0793195Z %82 = tt.expand_dims %68 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:55:28.0793545Z %83 = tt.broadcast %81 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:55:28.0793839Z %84 = tt.broadcast %82 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:55:28.0794138Z %85 = arith.addi %83, %84 : tensor<8x1024xi32> 2026-02-21T09:55:28.0794423Z %86 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:55:28.0794771Z %87 = tt.addptr %86, %85 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:55:28.0795138Z %88 = tt.expand_dims %69 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:55:28.0795468Z %89 = tt.broadcast %88 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:55:28.0795785Z tt.store %87, %79, %89 : tensor<8x1024x!tt.ptr> 2026-02-21T09:55:28.0796040Z %c1_i32 = arith.constant 1 : i32 2026-02-21T09:55:28.0796299Z %90 = arith.muli %c1024_i32, %c1_i32 : i32 2026-02-21T09:55:28.0796532Z %91 = arith.addi %arg3, %90 : i32 2026-02-21T09:55:28.0796834Z %92 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:55:28.0797159Z %93 = tt.splat %91 : i32 -> tensor<1024xi32> 2026-02-21T09:55:28.0797401Z %94 = arith.addi %93, %92 : tensor<1024xi32> 2026-02-21T09:55:28.0797685Z %95 = arith.cmpi slt, %94, %cst_3 : tensor<1024xi32> 2026-02-21T09:55:28.0798021Z %96 = tt.descriptor_load %0[%2, %91] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:55:28.0798405Z %97 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:55:28.0798728Z %98 = arith.extf %96 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:55:28.0799036Z %99 = tt.broadcast %97 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:55:28.0799308Z %100 = arith.subf %98, %99 : tensor<8x1024xf32> 2026-02-21T09:55:28.0799730Z %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = 
"__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:55:28.0800222Z %102 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:55:28.0800548Z %103 = tt.broadcast %102 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:55:28.0800907Z %104 = arith.divf %101, %103 : tensor<8x1024xf32> 2026-02-21T09:55:28.0801197Z %105 = arith.truncf %104 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:55:28.0801627Z %106 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:55:28.0801964Z %107 = arith.muli %106, %cst_2 : tensor<8x1xi32> 2026-02-21T09:55:28.0802294Z %108 = tt.expand_dims %94 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:55:28.0802659Z %109 = tt.broadcast %107 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:55:28.0802968Z %110 = tt.broadcast %108 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:55:28.0803287Z %111 = arith.addi %109, %110 : tensor<8x1024xi32> 2026-02-21T09:55:28.0803599Z %112 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:55:28.0803927Z %113 = tt.addptr %112, %111 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:55:28.0804307Z %114 = tt.expand_dims %95 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:55:28.0804646Z %115 = tt.broadcast %114 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:55:28.0804968Z tt.store %113, %105, %115 : tensor<8x1024x!tt.ptr> 2026-02-21T09:55:28.0805256Z %c2_i32 = arith.constant 2 : i32 2026-02-21T09:55:28.0805518Z %116 = arith.muli %c1024_i32, %c2_i32 : i32 2026-02-21T09:55:28.0805779Z %117 = arith.addi %arg3, %116 : i32 2026-02-21T09:55:28.0806055Z %118 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:55:28.0806383Z %119 = tt.splat %117 : i32 -> tensor<1024xi32> 2026-02-21T09:55:28.0806631Z %120 = arith.addi %119, %118 : tensor<1024xi32> 2026-02-21T09:55:28.0806923Z %121 = arith.cmpi slt, %120, %cst_3 
: tensor<1024xi32> 2026-02-21T09:55:28.0807272Z %122 = tt.descriptor_load %0[%2, %117] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:55:28.0807682Z %123 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:55:28.0808018Z %124 = arith.extf %122 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:55:28.0808327Z %125 = tt.broadcast %123 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:55:28.0808628Z %126 = arith.subf %124, %125 : tensor<8x1024xf32> 2026-02-21T09:55:28.0809036Z %127 = tt.extern_elementwise %126 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:55:28.0809523Z %128 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:55:28.0809877Z %129 = tt.broadcast %128 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:55:28.0810156Z %130 = arith.divf %127, %129 : tensor<8x1024xf32> 2026-02-21T09:55:28.0810474Z %131 = arith.truncf %130 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:55:28.0810802Z %132 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:55:28.0811130Z %133 = arith.muli %132, %cst_2 : tensor<8x1xi32> 2026-02-21T09:55:28.0811444Z %134 = tt.expand_dims %120 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:55:28.0811839Z %135 = tt.broadcast %133 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:55:28.0812173Z %136 = tt.broadcast %134 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T09:55:28.0812461Z %137 = arith.addi %135, %136 : tensor<8x1024xi32> 2026-02-21T09:55:28.0812766Z %138 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:55:28.0813107Z %139 = tt.addptr %138, %137 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:55:28.0813489Z %140 = tt.expand_dims %121 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:55:28.0813887Z %141 = tt.broadcast %140 : tensor<1x1024xi1> -> 
tensor<8x1024xi1> 2026-02-21T09:55:28.0814237Z tt.store %139, %131, %141 : tensor<8x1024x!tt.ptr> 2026-02-21T09:55:28.0814597Z } {tt.flatten} 2026-02-21T09:55:28.0814858Z %42 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T09:55:28.0815206Z %43 = tt.splat %c9216_i32_6 : i32 -> tensor<1024xi32> 2026-02-21T09:55:28.0815511Z %44 = arith.addi %43, %42 : tensor<1024xi32> 2026-02-21T09:55:28.0815816Z %45 = arith.cmpi slt, %44, %cst_3 : tensor<1024xi32> 2026-02-21T09:55:28.0816221Z %46 = tt.descriptor_load %0[%2, %c9216_i32_6] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T09:55:28.0816632Z %47 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:55:28.0816998Z %48 = arith.extf %46 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T09:55:28.0817313Z %49 = tt.broadcast %47 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:55:28.0817623Z %50 = arith.subf %48, %49 : tensor<8x1024xf32> 2026-02-21T09:55:28.0818054Z %51 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T09:55:28.0818554Z %52 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T09:55:28.0818946Z %53 = tt.broadcast %52 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T09:55:28.0819232Z %54 = arith.divf %51, %53 : tensor<8x1024xf32> 2026-02-21T09:55:28.0819536Z %55 = arith.truncf %54 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T09:55:28.0819871Z %56 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T09:55:28.0820206Z %57 = arith.muli %56, %cst_2 : tensor<8x1xi32> 2026-02-21T09:55:28.0820545Z %58 = tt.expand_dims %44 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T09:55:28.0820894Z %59 = tt.broadcast %57 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T09:55:28.0821226Z %60 = tt.broadcast %58 : tensor<1x1024xi32> -> tensor<8x1024xi32> 
2026-02-21T09:55:28.0821500Z %61 = arith.addi %59, %60 : tensor<8x1024xi32> 2026-02-21T09:55:28.0821824Z %62 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T09:55:28.0822139Z %63 = tt.addptr %62, %61 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T09:55:28.0822498Z %64 = tt.expand_dims %45 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T09:55:28.0822850Z %65 = tt.broadcast %64 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T09:55:28.0823137Z tt.store %63, %55, %65 : tensor<8x1024x!tt.ptr> 2026-02-21T09:55:28.0823577Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T09:55:28.0823952Z tt.return 2026-02-21T09:55:28.0824146Z } 2026-02-21T09:55:28.0824313Z } 2026-02-21T09:55:28.0824428Z 2026-02-21T09:55:28.0824499Z {-# 2026-02-21T09:55:28.0824662Z external_resources: { 2026-02-21T09:55:28.0824865Z mlir_reproducer: { 2026-02-21T09:55:28.0829277Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, 
tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T09:55:28.0833886Z disable_threading: false,
2026-02-21T09:55:28.0834091Z verify_each: true
2026-02-21T09:55:28.0834301Z }
2026-02-21T09:55:28.0834464Z }
2026-02-21T09:55:28.0834655Z #-}
2026-02-21T09:55:28.0835119Z /tmp/torchinductor_root/oz/cozizfrq6lg2d66agxaqvnd37afoobfh5mjafjqpkqpqj4x5xmon.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:55:28.0836401Z /tmp/torchinductor_root/oz/cozizfrq6lg2d66agxaqvnd37afoobfh5mjafjqpkqpqj4x5xmon.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:55:28.0837446Z [41s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:55:28.0838562Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=4, num_warps=32, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[4, 0], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T09:55:28.0839539Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:55:28.0839858Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:55:35.0485274Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 10.6 configs/s
2026-02-21T09:55:35.0495935Z [48s] Adaptive compile timeout: 30s (90% percentile=14.3s, bounds=[30.0s, 30s])
2026-02-21T09:55:36.1672028Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 881.4 configs/s
2026-02-21T09:55:36.2382145Z [49s] Initial random population of 100, 5 starting points:
2026-02-21T09:55:36.2386392Z error=12
2026-02-21T09:55:36.2388047Z timeout=1
2026-02-21T09:55:36.2388260Z ok=87
2026-02-21T09:55:36.2392952Z min=0.0676
2026-02-21T09:55:36.2397361Z mid=0.4945
2026-02-21T09:55:36.2398840Z max=234.3701
2026-02-21T09:55:36.2399047Z best={'block_sizes': [1, 1024],
2026-02-21T09:55:36.2399298Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:55:36.2399827Z 'load_eviction_policies': ['first', 'last'],
2026-02-21T09:55:36.2400019Z 'num_stages': 6,
2026-02-21T09:55:36.2400175Z 'num_warps': 4,
2026-02-21T09:55:36.2400321Z 'pid_type': 'flat',
2026-02-21T09:55:36.2400512Z 'range_flattens': [None, None],
2026-02-21T09:55:36.2400693Z 'range_multi_buffers': [None, True],
2026-02-21T09:55:36.2400888Z 'range_num_stages': [0, 0],
2026-02-21T09:55:36.2401059Z 'range_unroll_factors': [0, 4],
2026-02-21T09:55:36.2401252Z 'range_warp_specializes': [None, False]}
2026-02-21T09:55:36.2402844Z [49s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:55:37.2300693Z [50s] Generation 1 starting: 74 neighbors, 5 active search path(s)
2026-02-21T09:55:59.4151963Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 1.0 configs/s
2026-02-21T09:56:03.9999093Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 17.0 configs/s
2026-02-21T09:56:09.3928077Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 214.6
2026-02-21T09:56:09.3928709Z configs/s
2026-02-21T09:56:09.6223955Z [82s] Generation 1 complete:
2026-02-21T09:56:09.6228834Z ok=80
2026-02-21T09:56:09.6232158Z min=0.0553
2026-02-21T09:56:09.6233970Z mid=0.0819
2026-02-21T09:56:09.6234157Z max=1.2482
2026-02-21T09:56:09.6234348Z best={'block_sizes': [1, 1024],
2026-02-21T09:56:09.6234635Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:56:09.6234964Z 'load_eviction_policies': ['first', 'first'],
2026-02-21T09:56:09.6235188Z 'num_stages': 3,
2026-02-21T09:56:09.6235348Z 'num_warps': 1,
2026-02-21T09:56:09.6235517Z 'pid_type': 'flat',
2026-02-21T09:56:09.6235687Z 'range_flattens': [None, False],
2026-02-21T09:56:09.6235919Z 'range_multi_buffers': [None, False],
2026-02-21T09:56:09.6236129Z 'range_num_stages': [0, 2],
2026-02-21T09:56:09.6236346Z 'range_unroll_factors': [0, 3],
2026-02-21T09:56:09.6236556Z 'range_warp_specializes': [None, None]}
2026-02-21T09:56:09.6236987Z [82s] Fitting surrogate: 180 points, 180 targets
2026-02-21T09:56:10.6497922Z [83s] Generation 2 starting: 72 neighbors, 5 active search path(s)
2026-02-21T09:56:35.6919637Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 0.7 configs/s
2026-02-21T09:56:40.1101745Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.9 configs/s
2026-02-21T09:56:45.8972772Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 174.6
2026-02-21T09:56:45.8973598Z configs/s
2026-02-21T09:56:46.1792290Z [119s] Generation 2 complete:
2026-02-21T09:56:46.1796474Z ok=78
2026-02-21T09:56:46.1800913Z min=0.0531
2026-02-21T09:56:46.1802490Z mid=0.0635
2026-02-21T09:56:46.1802654Z max=6.4338
2026-02-21T09:56:46.1802811Z best={'block_sizes': [1, 1024],
2026-02-21T09:56:46.1803083Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:56:46.1803320Z 'load_eviction_policies': ['last', ''],
2026-02-21T09:56:46.1803526Z 'num_stages': 5,
2026-02-21T09:56:46.1803670Z 'num_warps': 1,
2026-02-21T09:56:46.1803822Z 'pid_type': 'flat',
2026-02-21T09:56:46.1803995Z 'range_flattens': [None, False],
2026-02-21T09:56:46.1804201Z 'range_multi_buffers': [None, False],
2026-02-21T09:56:46.1804388Z 'range_num_stages': [0, 1],
2026-02-21T09:56:46.1804568Z 'range_unroll_factors': [0, 0],
2026-02-21T09:56:46.1804752Z 'range_warp_specializes': [None, False]}
2026-02-21T09:56:46.1807563Z [119s] Fitting surrogate: 258 points, 258 targets
2026-02-21T09:56:47.0162884Z [120s] Generation 3 starting: 58 neighbors, 5 active search path(s)
2026-02-21T09:57:09.3156623Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60/60 0.4 configs/s
2026-02-21T09:57:12.8436928Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 60/60 17.2 configs/s
2026-02-21T09:57:17.7613985Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 205.3
2026-02-21T09:57:17.7615021Z configs/s
2026-02-21T09:57:18.0278302Z [151s] Generation 3 complete:
2026-02-21T09:57:18.0282533Z ok=63
2026-02-21T09:57:18.0286983Z min=0.0492
2026-02-21T09:57:18.0291306Z mid=0.0594
2026-02-21T09:57:18.0295874Z max=0.9093
2026-02-21T09:57:18.0297357Z best={'block_sizes': [1, 16384],
2026-02-21T09:57:18.0297603Z 'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:57:18.0297841Z 'load_eviction_policies': ['', 'last'],
2026-02-21T09:57:18.0298028Z 'num_stages': 1,
2026-02-21T09:57:18.0298187Z 'num_warps': 4,
2026-02-21T09:57:18.0298344Z 'pid_type': 'flat',
2026-02-21T09:57:18.0298505Z 'range_flattens': [None, None],
2026-02-21T09:57:18.0298695Z 'range_multi_buffers': [None, False],
2026-02-21T09:57:18.0298880Z 'range_num_stages': [0, 4],
2026-02-21T09:57:18.0299053Z 'range_unroll_factors': [0, 1],
2026-02-21T09:57:18.0299517Z 'range_warp_specializes': [None, False]}
2026-02-21T09:57:18.7590756Z [151s] Fitting surrogate: 321 points, 321 targets
2026-02-21T09:57:18.7591133Z [152s] Generation 4 starting: 44 neighbors, 5 active search path(s)
2026-02-21T09:57:29.7989270Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46/46 2.1 configs/s
2026-02-21T09:57:32.5339804Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 46/46 17.1 configs/s
2026-02-21T09:57:35.1140730Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 532.8
2026-02-21T09:57:35.1144717Z configs/s
2026-02-21T09:57:35.2286136Z [168s] Generation 4 complete:
2026-02-21T09:57:35.2288059Z ok=49
2026-02-21T09:57:35.2288224Z min=0.0389
2026-02-21T09:57:35.2288363Z mid=0.0635
2026-02-21T09:57:35.2288484Z max=0.9082
2026-02-21T09:57:35.2288635Z best={'block_sizes': [1, 16384],
2026-02-21T09:57:35.2288855Z 'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:57:35.2289098Z 'load_eviction_policies': ['last', 'last'],
2026-02-21T09:57:35.2289302Z 'num_stages': 1,
2026-02-21T09:57:35.2289441Z 'num_warps': 8,
2026-02-21T09:57:35.2289584Z 'pid_type': 'flat',
2026-02-21T09:57:35.2289738Z 'range_flattens': [None, None],
2026-02-21T09:57:35.2290209Z 'range_multi_buffers': [None, False],
2026-02-21T09:57:35.2290417Z 'range_num_stages': [0, 4],
2026-02-21T09:57:35.2290590Z 'range_unroll_factors': [0, 1],
2026-02-21T09:57:35.2290777Z 'range_warp_specializes': [None, False]}
2026-02-21T09:57:35.2301923Z [168s] Fitting surrogate: 370 points, 370 targets
2026-02-21T09:57:36.1650331Z [169s] Generation 5 starting: 61 neighbors, 5 active search path(s)
2026-02-21T09:57:47.3418738Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 8.0 configs/s
2026-02-21T09:57:51.0794447Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 63/63 17.1 configs/s
2026-02-21T09:57:54.0379201Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 341.0
2026-02-21T09:57:54.0380748Z configs/s
2026-02-21T09:57:54.2118418Z [187s] Generation 5 complete:
2026-02-21T09:57:54.2122834Z ok=66
2026-02-21T09:57:54.2126826Z min=0.0389
2026-02-21T09:57:54.2131135Z mid=0.0594
2026-02-21T09:57:54.2132566Z max=0.2827
2026-02-21T09:57:54.2132770Z best={'block_sizes': [1, 16384],
2026-02-21T09:57:54.2132986Z 'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:57:54.2133212Z 'load_eviction_policies': ['last', 'last'],
2026-02-21T09:57:54.2133402Z 'num_stages': 1,
2026-02-21T09:57:54.2133544Z 'num_warps': 8,
2026-02-21T09:57:54.2133694Z 'pid_type': 'flat',
2026-02-21T09:57:54.2133851Z 'range_flattens': [None, None],
2026-02-21T09:57:54.2134040Z 'range_multi_buffers': [None, False],
2026-02-21T09:57:54.2134226Z 'range_num_stages': [0, 4],
2026-02-21T09:57:54.2134394Z 'range_unroll_factors': [0, 0],
2026-02-21T09:57:54.2134571Z 'range_warp_specializes': [None, False]}
2026-02-21T09:57:54.2138355Z [187s] Fitting surrogate: 436 points, 436 targets
2026-02-21T09:57:54.9274996Z [188s] Generation 6 starting: 46 neighbors, 4 active search path(s)
2026-02-21T09:58:03.6176098Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48/48 9.6 configs/s
2026-02-21T09:58:06.4739001Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 48/48 17.1 configs/s
2026-02-21T09:58:09.0613311Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 389.5
2026-02-21T09:58:09.0616913Z configs/s
2026-02-21T09:58:09.2105203Z [202s] Generation 6 complete:
2026-02-21T09:58:09.2110035Z ok=50
2026-02-21T09:58:09.2111430Z min=0.0389
2026-02-21T09:58:09.2111829Z mid=0.0553
2026-02-21T09:58:09.2111962Z max=0.9032
2026-02-21T09:58:09.2112101Z best={'block_sizes': [1, 16384],
2026-02-21T09:58:09.2112321Z 'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:58:09.2112537Z 'load_eviction_policies': ['last', 'last'],
2026-02-21T09:58:09.2113063Z 'num_stages': 1,
2026-02-21T09:58:09.2113217Z 'num_warps': 8,
2026-02-21T09:58:09.2113377Z 'pid_type': 'flat',
2026-02-21T09:58:09.2113541Z 'range_flattens': [None, None],
2026-02-21T09:58:09.2113720Z 'range_multi_buffers': [None, False],
2026-02-21T09:58:09.2113922Z 'range_num_stages': [0, 4],
2026-02-21T09:58:09.2114177Z 'range_unroll_factors': [0, 0],
2026-02-21T09:58:09.2114360Z 'range_warp_specializes': [None, False]}
2026-02-21T09:58:09.2123889Z [202s] Fitting surrogate: 486 points, 486 targets
2026-02-21T09:58:09.7220672Z [203s] Generation 7 starting: 30 neighbors, 2 active search path(s)
2026-02-21T09:58:17.1096158Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 1.8 configs/s
2026-02-21T09:58:18.9376851Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 31/31 17.4 configs/s
2026-02-21T09:58:20.3414640Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 711.8
2026-02-21T09:58:20.3418673Z configs/s
2026-02-21T09:58:20.4353390Z [213s] Generation 7 complete:
2026-02-21T09:58:20.4358334Z ok=33
2026-02-21T09:58:20.4362661Z min=0.0389
2026-02-21T09:58:20.4367044Z mid=0.0595
2026-02-21T09:58:20.4371629Z max=0.1659
2026-02-21T09:58:20.4376759Z best={'block_sizes': [1, 16384],
2026-02-21T09:58:20.4380562Z 'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:58:20.4384208Z 'load_eviction_policies': ['last', 'last'],
2026-02-21T09:58:20.4386214Z 'num_stages': 1,
2026-02-21T09:58:20.4386391Z 'num_warps': 8,
2026-02-21T09:58:20.4386553Z 'pid_type': 'flat',
2026-02-21T09:58:20.4386800Z 'range_flattens': [None, None],
2026-02-21T09:58:20.4386999Z 'range_multi_buffers': [None, False],
2026-02-21T09:58:20.4387206Z 'range_num_stages': [0, 4],
2026-02-21T09:58:20.4392106Z 'range_unroll_factors': [0, 0],
2026-02-21T09:58:20.4397141Z 'range_warp_specializes': [None, False]}
2026-02-21T09:58:20.4400889Z [213s] Fitting surrogate: 519 points, 519 targets
2026-02-21T09:58:20.8105058Z [214s] Generation 8 starting: 10 neighbors, 1 active search path(s)
2026-02-21T09:58:24.0013182Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 1.5 configs/s
2026-02-21T09:58:24.6464743Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 11/11 18.4 configs/s
2026-02-21T09:58:24.9993724Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2710.2
2026-02-21T09:58:24.9995306Z configs/s
2026-02-21T09:58:25.0393725Z [218s] Generation 8 complete:
2026-02-21T09:58:25.0398071Z ok=12
2026-02-21T09:58:25.0402494Z min=0.0389
2026-02-21T09:58:25.0406786Z mid=0.0655
2026-02-21T09:58:25.0408804Z max=0.0900
2026-02-21T09:58:25.0413987Z best={'block_sizes': [1, 16384],
2026-02-21T09:58:25.0417450Z 'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:58:25.0417744Z 'load_eviction_policies': ['last', 'last'],
2026-02-21T09:58:25.0421969Z 'num_stages': 1,
2026-02-21T09:58:25.0422442Z 'num_warps': 8,
2026-02-21T09:58:25.0424801Z 'pid_type': 'flat',
2026-02-21T09:58:25.0425072Z 'range_flattens': [None, None],
2026-02-21T09:58:25.0425287Z 'range_multi_buffers': [None, False],
2026-02-21T09:58:25.0430074Z 'range_num_stages': [0, 4],
2026-02-21T09:58:25.0433130Z 'range_unroll_factors': [0, 0],
2026-02-21T09:58:25.0436803Z 'range_warp_specializes': [None, False]}
2026-02-21T09:58:25.0440430Z [218s] Fitting surrogate: 531 points, 531 targets
2026-02-21T09:58:25.4128335Z [218s] Generation 9 starting: 12 neighbors, 1 active search path(s)
2026-02-21T09:58:27.8865932Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 7.6 configs/s
2026-02-21T09:58:28.6504517Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 13/13 18.1 configs/s
2026-02-21T09:58:29.0030140Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2712.8
2026-02-21T09:58:29.0032016Z configs/s
2026-02-21T09:58:29.0427186Z [222s] Generation 9 complete:
2026-02-21T09:58:29.0432100Z ok=14
2026-02-21T09:58:29.0436017Z min=0.0389
2026-02-21T09:58:29.0440389Z mid=0.0614
2026-02-21T09:58:29.0444316Z max=0.0840
2026-02-21T09:58:29.0448109Z best={'block_sizes': [1, 16384],
2026-02-21T09:58:29.0449832Z 'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:58:29.0450127Z 'load_eviction_policies': ['last', 'last'],
2026-02-21T09:58:29.0450332Z 'num_stages': 1,
2026-02-21T09:58:29.0450480Z 'num_warps': 8,
2026-02-21T09:58:29.0450635Z 'pid_type': 'flat',
2026-02-21T09:58:29.0450795Z 'range_flattens': [None, None],
2026-02-21T09:58:29.0450999Z 'range_multi_buffers': [None, False],
2026-02-21T09:58:29.0451189Z 'range_num_stages': [0, 4],
2026-02-21T09:58:29.0451351Z 'range_unroll_factors': [0, 0],
2026-02-21T09:58:29.0451624Z 'range_warp_specializes': [None, False]}
2026-02-21T09:58:29.0451947Z [222s] Fitting surrogate: 545 points, 545 targets
2026-02-21T09:58:29.3931497Z [222s] Generation 10 starting: 11 neighbors, 1 active search path(s)
2026-02-21T09:58:31.7487318Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 6.8 configs/s
2026-02-21T09:58:32.4468237Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 18.5 configs/s
2026-02-21T09:58:32.9962973Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1778.2
2026-02-21T09:58:32.9967155Z configs/s
2026-02-21T09:58:33.0446535Z [226s] Generation 10 complete:
2026-02-21T09:58:33.0450854Z ok=13
2026-02-21T09:58:33.0454868Z min=0.0389
2026-02-21T09:58:33.0459561Z mid=0.0594
2026-02-21T09:58:33.0463888Z max=0.0840
2026-02-21T09:58:33.0468514Z best={'block_sizes': [1, 16384],
2026-02-21T09:58:33.0472909Z 'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:58:33.0473210Z 'load_eviction_policies': ['last', 'last'],
2026-02-21T09:58:33.0477910Z 'num_stages': 1,
2026-02-21T09:58:33.0482336Z 'num_warps': 8,
2026-02-21T09:58:33.0486739Z 'pid_type': 'flat',
2026-02-21T09:58:33.0490520Z 'range_flattens': [None, None],
2026-02-21T09:58:33.0495540Z
'range_multi_buffers': [None, False], 2026-02-21T09:58:33.0499586Z 'range_num_stages': [0, 4], 2026-02-21T09:58:33.0503026Z 'range_unroll_factors': [0, 0], 2026-02-21T09:58:33.0505761Z 'range_warp_specializes': [None, False]} 2026-02-21T09:58:33.0506116Z [226s] Fitting surrogate: 558 points, 558 targets 2026-02-21T09:58:33.4495578Z [226s] Generation 11 starting: 11 neighbors, 1 active search path(s) 2026-02-21T09:58:36.4149384Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 5.0 configs/s 2026-02-21T09:58:37.1242535Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 18.1 configs/s 2026-02-21T09:58:37.3797248Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3676.0 2026-02-21T09:58:37.3797613Z configs/s 2026-02-21T09:58:37.4143642Z [230s] Generation 11 complete: 2026-02-21T09:58:37.4148270Z ok=13 2026-02-21T09:58:37.4149361Z min=0.0389 2026-02-21T09:58:37.4149545Z mid=0.0635 2026-02-21T09:58:37.4149687Z max=0.0922 2026-02-21T09:58:37.4149843Z best={'block_sizes': [1, 16384], 2026-02-21T09:58:37.4150061Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:58:37.4150556Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:58:37.4150743Z 'num_stages': 1, 2026-02-21T09:58:37.4150893Z 'num_warps': 8, 2026-02-21T09:58:37.4151032Z 'pid_type': 'flat', 2026-02-21T09:58:37.4151193Z 'range_flattens': [None, None], 2026-02-21T09:58:37.4151369Z 'range_multi_buffers': [None, False], 2026-02-21T09:58:37.4151633Z 'range_num_stages': [0, 4], 2026-02-21T09:58:37.4151800Z 'range_unroll_factors': [0, 0], 2026-02-21T09:58:37.4151986Z 'range_warp_specializes': [None, False]} 2026-02-21T09:58:37.4164971Z [230s] Fitting surrogate: 571 points, 571 targets 2026-02-21T09:58:37.8623752Z [231s] Generation 12 starting: 16 neighbors, 1 active search path(s) 2026-02-21T09:58:40.9609062Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 8.3 configs/s 2026-02-21T09:58:41.9596612Z Generation 12: exploring neighbors 
100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.8 configs/s 2026-02-21T09:58:41.9601796Z [235s] Generation 12 complete: 2026-02-21T09:58:41.9603833Z ok=18 2026-02-21T09:58:41.9604032Z min=0.0389 2026-02-21T09:58:41.9604170Z mid=0.0719 2026-02-21T09:58:41.9604290Z max=0.2233 2026-02-21T09:58:41.9604430Z best={'block_sizes': [1, 16384], 2026-02-21T09:58:41.9604642Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:58:41.9604861Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:58:41.9605041Z 'num_stages': 1, 2026-02-21T09:58:41.9605186Z 'num_warps': 8, 2026-02-21T09:58:41.9605324Z 'pid_type': 'flat', 2026-02-21T09:58:41.9605485Z 'range_flattens': [None, None], 2026-02-21T09:58:41.9605668Z 'range_multi_buffers': [None, False], 2026-02-21T09:58:41.9605849Z 'range_num_stages': [0, 4], 2026-02-21T09:58:41.9606026Z 'range_unroll_factors': [0, 0], 2026-02-21T09:58:41.9606202Z 'range_warp_specializes': [None, False]} 2026-02-21T09:58:41.9620557Z [235s] Fitting surrogate: 589 points, 589 targets 2026-02-21T09:58:42.3411379Z [235s] Generation 13 starting: 11 neighbors, 1 active search path(s) 2026-02-21T09:58:44.8638198Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 16.0 configs/s 2026-02-21T09:58:45.5725032Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 18.1 configs/s 2026-02-21T09:58:45.8317251Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3619.9 2026-02-21T09:58:45.8318993Z configs/s 2026-02-21T09:58:45.8667170Z [239s] Generation 13 complete: 2026-02-21T09:58:45.8668697Z ok=13 2026-02-21T09:58:45.8668876Z min=0.0389 2026-02-21T09:58:45.8669013Z mid=0.0636 2026-02-21T09:58:45.8669152Z max=0.0942 2026-02-21T09:58:45.8669297Z best={'block_sizes': [1, 16384], 2026-02-21T09:58:45.8669536Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:58:45.8669753Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:58:45.8670250Z 'num_stages': 1, 2026-02-21T09:58:45.8670397Z 'num_warps': 8, 
2026-02-21T09:58:45.8670636Z 'pid_type': 'flat', 2026-02-21T09:58:45.8670861Z 'range_flattens': [None, None], 2026-02-21T09:58:45.8671057Z 'range_multi_buffers': [None, False], 2026-02-21T09:58:45.8671257Z 'range_num_stages': [0, 4], 2026-02-21T09:58:45.8671422Z 'range_unroll_factors': [0, 0], 2026-02-21T09:58:45.8671915Z 'range_warp_specializes': [None, False]} 2026-02-21T09:58:45.8686095Z [239s] Fitting surrogate: 602 points, 602 targets 2026-02-21T09:58:46.2596600Z [239s] Generation 14 starting: 11 neighbors, 1 active search path(s) 2026-02-21T09:58:48.6542029Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 11.7 configs/s 2026-02-21T09:58:49.3620850Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 18.2 configs/s 2026-02-21T09:58:49.6216460Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3630.7 2026-02-21T09:58:49.6217766Z configs/s 2026-02-21T09:58:49.6566516Z [242s] Generation 14 complete: 2026-02-21T09:58:49.6571795Z ok=13 2026-02-21T09:58:49.6575258Z min=0.0389 2026-02-21T09:58:49.6576705Z mid=0.0655 2026-02-21T09:58:49.6576871Z max=0.1352 2026-02-21T09:58:49.6577010Z best={'block_sizes': [1, 16384], 2026-02-21T09:58:49.6577232Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T09:58:49.6577450Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T09:58:49.6577641Z 'num_stages': 1, 2026-02-21T09:58:49.6577787Z 'num_warps': 8, 2026-02-21T09:58:49.6577923Z 'pid_type': 'flat', 2026-02-21T09:58:49.6578082Z 'range_flattens': [None, None], 2026-02-21T09:58:49.6578259Z 'range_multi_buffers': [None, False], 2026-02-21T09:58:49.6578449Z 'range_num_stages': [0, 4], 2026-02-21T09:58:49.6578614Z 'range_unroll_factors': [0, 0], 2026-02-21T09:58:49.6578807Z 'range_warp_specializes': [None, False]} 2026-02-21T09:58:49.6590311Z [242s] Fitting surrogate: 615 points, 615 targets 2026-02-21T09:58:49.9372819Z [243s] Autotuning complete in 243.3s after searching 587 configs. 
2026-02-21T09:58:49.9373196Z One can hardcode the best config and skip autotuning with: 2026-02-21T09:58:49.9378076Z @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=1, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T09:58:49.9378958Z 2026-02-21T09:58:49.9379229Z [243s] Code of selected kernel: /tmp/torchinductor_root/vu/cvuzgumv3hnvyvaeve4ak7ugh5wvpszrpourxpqbu2qumrcgmzqb.py 2026-02-21T09:58:49.9595239Z from __future__ import annotations 2026-02-21T09:58:49.9595444Z 2026-02-21T09:58:49.9600203Z import torch 2026-02-21T09:58:49.9603548Z import triton 2026-02-21T09:58:49.9607328Z import triton.language as tl 2026-02-21T09:58:49.9611261Z from torch._inductor.runtime import triton_helpers 2026-02-21T09:58:49.9615907Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T09:58:49.9620418Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T09:58:49.9621501Z 2026-02-21T09:58:49.9621642Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T09:58:49.9621848Z _BLOCK_SIZE_1 = tl.constexpr(16384) 2026-02-21T09:58:49.9621970Z 2026-02-21T09:58:49.9622039Z @triton.jit 2026-02-21T09:58:49.9627026Z def _helion_softmax_two_pass(x, out): 2026-02-21T09:58:49.9627364Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:58:49.9630068Z pid_0 = tl.program_id(0) 2026-02-21T09:58:49.9630243Z offset_0 = pid_0 2026-02-21T09:58:49.9630446Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T09:58:49.9634935Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:58:49.9638293Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T09:58:49.9642843Z # src[softmax.py:81]: di = hl.zeros([tile_m], 
dtype=torch.float32) 2026-02-21T09:58:49.9646988Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T09:58:49.9650537Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:58:49.9654790Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:58:49.9656053Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:58:49.9656298Z # src[softmax.py:82-89]: ... 2026-02-21T09:58:49.9656647Z for offset_2 in tl.range(0, 10112, _BLOCK_SIZE_1, warp_specialize=False, num_stages=4, disallow_acc_multi_buffer=True): 2026-02-21T09:58:49.9657048Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:58:49.9657281Z mask_1 = indices_2 < 10112 2026-02-21T09:58:49.9657455Z mi_copy = mi 2026-02-21T09:58:49.9657782Z di_copy = di 2026-02-21T09:58:49.9657944Z mi_copy_0 = mi_copy 2026-02-21T09:58:49.9658103Z di_copy_0 = di_copy 2026-02-21T09:58:49.9658293Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T09:58:49.9658675Z values = tl.load(x + (indices_0[:, None] * 10112 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_last') 2026-02-21T09:58:49.9659129Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T09:58:49.9659570Z _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16)) 2026-02-21T09:58:49.9659966Z local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16) 2026-02-21T09:58:49.9660226Z # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax) 2026-02-21T09:58:49.9660467Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T09:58:49.9660675Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T09:58:49.9660939Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:58:49.9661177Z v_2 = mi_copy_0 - v_1 2026-02-21T09:58:49.9661358Z v_3 = libdevice.exp(v_2) 2026-02-21T09:58:49.9661526Z v_4 = di_copy_0 * v_3 2026-02-21T09:58:49.9661843Z # 
src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:58:49.9662071Z subscript = v_1[:, None] 2026-02-21T09:58:49.9662254Z v_5 = tl.cast(values, tl.float32) 2026-02-21T09:58:49.9662446Z v_6 = v_5 - subscript 2026-02-21T09:58:49.9662666Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T09:58:49.9662952Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T09:58:49.9663183Z # src[softmax.py:88]: ).sum(dim=1) 2026-02-21T09:58:49.9663388Z v_7 = libdevice.exp(v_6) 2026-02-21T09:58:49.9663731Z _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32)) 2026-02-21T09:58:49.9664106Z sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32) 2026-02-21T09:58:49.9664323Z di = v_4 + sum_1 2026-02-21T09:58:49.9664492Z # src[softmax.py:89]: mi = mi_next 2026-02-21T09:58:49.9664679Z mi = v_1 2026-02-21T09:58:49.9664886Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T09:58:49.9665175Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:58:49.9665482Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:58:49.9665940Z for offset_2 in tl.range(0, 10112, _BLOCK_SIZE_1, warp_specialize=False, num_stages=4, disallow_acc_multi_buffer=True): 2026-02-21T09:58:49.9666356Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T09:58:49.9666599Z mask_2 = indices_2 < 10112 2026-02-21T09:58:49.9666779Z mi_copy_1 = mi 2026-02-21T09:58:49.9666930Z di_copy_1 = di 2026-02-21T09:58:49.9667094Z mi_copy_1_0 = mi_copy_1 2026-02-21T09:58:49.9667306Z di_copy_1_0 = di_copy_1 2026-02-21T09:58:49.9667503Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T09:58:49.9667894Z values_1 = tl.load(x + (indices_0[:, None] * 10112 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_last') 2026-02-21T09:58:49.9668338Z # src[softmax.py:92]: out[tile_m, tile_n] = 
torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T09:58:49.9668636Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T09:58:49.9668830Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T09:58:49.9669026Z v_10 = v_9 - subscript_1 2026-02-21T09:58:49.9669199Z v_11 = libdevice.exp(v_10) 2026-02-21T09:58:49.9669386Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T09:58:49.9669587Z v_12 = v_11 / subscript_2 2026-02-21T09:58:49.9669755Z v_13 = tl.cast(v_12, tl.float16) 2026-02-21T09:58:49.9670065Z tl.store(out + (indices_0[:, None] * 10112 + indices_2[None, :] * 1), v_13, mask_2[None, :]) 2026-02-21T09:58:49.9670284Z 2026-02-21T09:58:49.9670413Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T09:58:49.9670651Z """ 2026-02-21T09:58:49.9670852Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T09:58:49.9671197Z This version uses fewer passes but is less numerically stable. 2026-02-21T09:58:49.9671420Z Args: 2026-02-21T09:58:49.9671624Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T09:58:49.9671824Z Returns: 2026-02-21T09:58:49.9671998Z torch.Tensor: Softmax output tensor of the same shape. 2026-02-21T09:58:49.9672208Z """ 2026-02-21T09:58:49.9672340Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T09:58:49.9672522Z m, n = x.size() 2026-02-21T09:58:49.9672685Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T09:58:49.9672892Z out = torch.empty_like(x) 2026-02-21T09:58:49.9673125Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T09:58:49.9673435Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T09:58:49.9673786Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T09:58:49.9674024Z # src[softmax.py:79-92]: ... 
2026-02-21T09:58:49.9674286Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=8, num_stages=1) 2026-02-21T09:58:49.9674553Z # src[softmax.py:93]: return out 2026-02-21T09:58:49.9674724Z return out 2026-02-21T09:58:50.7350628Z WARNING:tritonbench.utils.triton_op:Completed input ID 77: 2026-02-21T09:58:50.7354928Z (M, N) 2026-02-21T09:58:50.7359469Z ------------- 2026-02-21T09:58:50.7363362Z (4096, 10112) 2026-02-21T09:58:50.7364731Z 2026-02-21T09:58:50.7365331Z 80%|████████ | 16/20 [49:56<14:08, 212.17s/it]WARNING:tritonbench.utils.triton_op:Running input ID 82: 2026-02-21T09:58:50.7365713Z (M, N) 2026-02-21T09:58:50.7367426Z ------------- 2026-02-21T09:58:50.7367613Z (4096, 10752) 2026-02-21T09:58:50.7367902Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T09:58:51.9227403Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T09:58:53.2734245Z INFO:tritonbench.utils.triton_op:Took 2.35ms to get benchmark function for torch_compile_softmax 2026-02-21T09:58:54.5246971Z WARNING:__main__:Input tensor metadata: 2026-02-21T09:58:54.5248538Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T09:58:54.5248818Z 'dtype': 'torch.float16', 2026-02-21T09:58:54.5249019Z 'shape': (4096, 10752), 2026-02-21T09:58:54.5253710Z 'stride': (10752, 1)},), 2026-02-21T09:58:54.5258070Z 'kwargs': {}} 2026-02-21T09:58:54.5269163Z INFO:tritonbench.utils.triton_op:Took 2.50ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T09:58:55.3766789Z [0s] Autotune random seed: 2138408546 2026-02-21T09:58:55.4023982Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T09:59:32.0400207Z [36s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', ''], maxnreg=32, num_sm_multiplier=2, num_stages=8, 
num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[3, 2], range_unroll_factors=[4, 1], range_warp_specializes=[False, False]) 2026-02-21T09:59:34.6357160Z [39s] Timeout after 30s compiling Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T09:59:36.8459290Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s 2026-02-21T09:59:45.7658778Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 11.1 configs/s 2026-02-21T09:59:45.7670604Z [50s] Adaptive compile timeout: 30s (90% percentile=15.9s, bounds=[30.0s, 30s]) 2026-02-21T09:59:47.0472604Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 772.5 configs/s 2026-02-21T09:59:47.1272177Z [51s] Initial random population of 100, 5 starting points: 2026-02-21T09:59:47.1276413Z error=11 2026-02-21T09:59:47.1277883Z timeout=2 2026-02-21T09:59:47.1278101Z ok=87 2026-02-21T09:59:47.1278247Z min=0.0717 2026-02-21T09:59:47.1282959Z mid=0.5366 2026-02-21T09:59:47.1287606Z max=249.8283 2026-02-21T09:59:47.1289595Z best={'block_sizes': [1, 1024], 2026-02-21T09:59:47.1289870Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T09:59:47.1290124Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T09:59:47.1290314Z 'num_stages': 6, 2026-02-21T09:59:47.1290479Z 'num_warps': 4, 2026-02-21T09:59:47.1290646Z 'pid_type': 'flat', 2026-02-21T09:59:47.1290813Z 'range_flattens': [None, None], 2026-02-21T09:59:47.1290996Z 'range_multi_buffers': [None, True], 2026-02-21T09:59:47.1291177Z 'range_num_stages': [0, 0], 2026-02-21T09:59:47.1291660Z 'range_unroll_factors': [0, 4], 2026-02-21T09:59:47.1291871Z 
'range_warp_specializes': [None, False]} 2026-02-21T09:59:47.1292164Z [51s] Fitting surrogate: 100 points, 100 targets 2026-02-21T09:59:48.1878557Z [52s] Generation 1 starting: 74 neighbors, 5 active search path(s) 2026-02-21T10:00:03.8775537Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 4.8 configs/s 2026-02-21T10:00:08.3964167Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 17.0 configs/s 2026-02-21T10:00:13.8702571Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 184.1 2026-02-21T10:00:13.8703893Z configs/s 2026-02-21T10:00:14.1189164Z [78s] Generation 1 complete: 2026-02-21T10:00:14.1193390Z ok=79 2026-02-21T10:00:14.1197809Z min=0.0594 2026-02-21T10:00:14.1202293Z mid=0.0840 2026-02-21T10:00:14.1206155Z max=0.9549 2026-02-21T10:00:14.1206415Z best={'block_sizes': [1, 1024], 2026-02-21T10:00:14.1206679Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T10:00:14.1210636Z 'load_eviction_policies': ['first', ''], 2026-02-21T10:00:14.1213776Z 'num_stages': 1, 2026-02-21T10:00:14.1218807Z 'num_warps': 1, 2026-02-21T10:00:14.1223714Z 'pid_type': 'flat', 2026-02-21T10:00:14.1228142Z 'range_flattens': [None, True], 2026-02-21T10:00:14.1231383Z 'range_multi_buffers': [None, None], 2026-02-21T10:00:14.1235067Z 'range_num_stages': [0, 4], 2026-02-21T10:00:14.1239512Z 'range_unroll_factors': [0, 1], 2026-02-21T10:00:14.1242824Z 'range_warp_specializes': [None, False]} 2026-02-21T10:00:14.1247187Z [78s] Fitting surrogate: 179 points, 179 targets 2026-02-21T10:00:15.1225159Z [79s] Generation 2 starting: 71 neighbors, 5 active search path(s) 2026-02-21T10:00:28.6057829Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 4.6 configs/s 2026-02-21T10:00:32.8727794Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 72/72 17.0 configs/s 2026-02-21T10:00:40.4511159Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 149.1 2026-02-21T10:00:40.4512063Z configs/s 
2026-02-21T10:00:40.7874767Z [105s] Generation 2 complete: 2026-02-21T10:00:40.7878835Z ok=77 2026-02-21T10:00:40.7880461Z min=0.0573 2026-02-21T10:00:40.7880624Z mid=0.0656 2026-02-21T10:00:40.7880747Z max=0.5509 2026-02-21T10:00:40.7880895Z best={'block_sizes': [1, 16384], 2026-02-21T10:00:40.7881144Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T10:00:40.7881411Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T10:00:40.7881711Z 'num_sm_multiplier': 128, 2026-02-21T10:00:40.7881877Z 'num_stages': 5, 2026-02-21T10:00:40.7882329Z 'num_warps': 2, 2026-02-21T10:00:40.7882498Z 'pid_type': 'persistent_blocked', 2026-02-21T10:00:40.7882698Z 'range_flattens': [False, False], 2026-02-21T10:00:40.7882879Z 'range_multi_buffers': [True, False], 2026-02-21T10:00:40.7883070Z 'range_num_stages': [2, 1], 2026-02-21T10:00:40.7883326Z 'range_unroll_factors': [0, 0], 2026-02-21T10:00:40.7883526Z 'range_warp_specializes': [True, None]} 2026-02-21T10:00:40.7887916Z [105s] Fitting surrogate: 256 points, 256 targets 2026-02-21T10:00:41.7508140Z [106s] Generation 3 starting: 64 neighbors, 5 active search path(s) 2026-02-21T10:01:21.5611486Z [146s] Timeout after 30s compiling Config(block_sizes=[8, 4096], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, False]) 2026-02-21T10:01:21.5621468Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 0.2 configs/s 2026-02-21T10:01:25.3387611Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 17.4 configs/s 2026-02-21T10:01:29.4390860Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 245.7 2026-02-21T10:01:29.4394862Z configs/s 2026-02-21T10:01:29.6504446Z [154s] Generation 3 complete: 2026-02-21T10:01:29.6506296Z timeout=1 
2026-02-21T10:01:29.6506452Z ok=68 2026-02-21T10:01:29.6506587Z min=0.0451 2026-02-21T10:01:29.6506716Z mid=0.0635 2026-02-21T10:01:29.6506847Z max=0.5408 2026-02-21T10:01:29.6506987Z best={'block_sizes': [1, 16384], 2026-02-21T10:01:29.6507250Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T10:01:29.6507532Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:01:29.6507743Z 'num_sm_multiplier': 128, 2026-02-21T10:01:29.6507913Z 'num_stages': 5, 2026-02-21T10:01:29.6508055Z 'num_warps': 2, 2026-02-21T10:01:29.6508235Z 'pid_type': 'persistent_blocked', 2026-02-21T10:01:29.6508733Z 'range_flattens': [False, False], 2026-02-21T10:01:29.6508922Z 'range_multi_buffers': [True, False], 2026-02-21T10:01:29.6509109Z 'range_num_stages': [2, 1], 2026-02-21T10:01:29.6509295Z 'range_unroll_factors': [0, 0], 2026-02-21T10:01:29.6509486Z 'range_warp_specializes': [True, None]} 2026-02-21T10:01:29.6522019Z [154s] Fitting surrogate: 325 points, 325 targets 2026-02-21T10:01:30.5574020Z [155s] Generation 4 starting: 63 neighbors, 5 active search path(s) 2026-02-21T10:01:42.0677260Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 11.9 configs/s 2026-02-21T10:01:45.8841788Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 17.2 configs/s 2026-02-21T10:01:49.9513481Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 248.2 2026-02-21T10:01:49.9513820Z configs/s 2026-02-21T10:01:50.1768195Z [174s] Generation 4 complete: 2026-02-21T10:01:50.1771931Z ok=68 2026-02-21T10:01:50.1776826Z min=0.0390 2026-02-21T10:01:50.1781289Z mid=0.0594 2026-02-21T10:01:50.1783205Z max=0.1412 2026-02-21T10:01:50.1783396Z best={'block_sizes': [1, 16384], 2026-02-21T10:01:50.1783671Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T10:01:50.1784228Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:01:50.1784428Z 'num_sm_multiplier': 64, 2026-02-21T10:01:50.1784597Z 'num_stages': 5, 
2026-02-21T10:01:50.1784734Z 'num_warps': 4, 2026-02-21T10:01:50.1784899Z 'pid_type': 'persistent_blocked', 2026-02-21T10:01:50.1785094Z 'range_flattens': [False, False], 2026-02-21T10:01:50.1785275Z 'range_multi_buffers': [True, False], 2026-02-21T10:01:50.1785467Z 'range_num_stages': [2, 1], 2026-02-21T10:01:50.1785633Z 'range_unroll_factors': [0, 0], 2026-02-21T10:01:50.1785817Z 'range_warp_specializes': [True, None]} 2026-02-21T10:01:50.1786115Z [174s] Fitting surrogate: 393 points, 393 targets 2026-02-21T10:01:51.0761006Z [175s] Generation 5 starting: 62 neighbors, 5 active search path(s) 2026-02-21T10:02:05.7450657Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 1.7 configs/s 2026-02-21T10:02:09.4074756Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 62/62 17.1 configs/s 2026-02-21T10:02:13.5181329Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 245.7 2026-02-21T10:02:13.5182690Z configs/s 2026-02-21T10:02:13.7563240Z [198s] Generation 5 complete: 2026-02-21T10:02:13.7567465Z ok=67 2026-02-21T10:02:13.7571394Z min=0.0409 2026-02-21T10:02:13.7579546Z mid=0.0533 2026-02-21T10:02:13.7583257Z max=0.4260 2026-02-21T10:02:13.7585598Z best={'block_sizes': [1, 16384], 2026-02-21T10:02:13.7587815Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T10:02:13.7588112Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:02:13.7588332Z 'num_sm_multiplier': 64, 2026-02-21T10:02:13.7588505Z 'num_stages': 5, 2026-02-21T10:02:13.7588657Z 'num_warps': 4, 2026-02-21T10:02:13.7588819Z 'pid_type': 'persistent_blocked', 2026-02-21T10:02:13.7589001Z 'range_flattens': [False, False], 2026-02-21T10:02:13.7589190Z 'range_multi_buffers': [True, False], 2026-02-21T10:02:13.7589380Z 'range_num_stages': [2, 1], 2026-02-21T10:02:13.7589554Z 'range_unroll_factors': [0, 0], 2026-02-21T10:02:13.7589739Z 'range_warp_specializes': [True, None]} 2026-02-21T10:02:13.7589958Z [198s] Fitting surrogate: 460 points, 
460 targets 2026-02-21T10:02:14.4430804Z [199s] Generation 6 starting: 42 neighbors, 4 active search path(s) 2026-02-21T10:02:23.7630457Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 2.5 configs/s 2026-02-21T10:02:26.2556609Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 42/42 17.2 configs/s 2026-02-21T10:02:30.3796779Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 296.3 2026-02-21T10:02:30.3801394Z configs/s 2026-02-21T10:02:30.5842757Z [215s] Generation 6 complete: 2026-02-21T10:02:30.5846495Z ok=46 2026-02-21T10:02:30.5848017Z min=0.0389 2026-02-21T10:02:30.5848189Z mid=0.0470 2026-02-21T10:02:30.5848339Z max=0.7865 2026-02-21T10:02:30.5848504Z best={'block_sizes': [1, 16384], 2026-02-21T10:02:30.5848745Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:02:30.5848986Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:02:30.5849183Z 'num_stages': 6, 2026-02-21T10:02:30.5849325Z 'num_warps': 1, 2026-02-21T10:02:30.5849473Z 'pid_type': 'flat', 2026-02-21T10:02:30.5849632Z 'range_flattens': [None, False], 2026-02-21T10:02:30.5849821Z 'range_multi_buffers': [None, False], 2026-02-21T10:02:30.5850007Z 'range_num_stages': [0, 0], 2026-02-21T10:02:30.5850179Z 'range_unroll_factors': [0, 0], 2026-02-21T10:02:30.5850367Z 'range_warp_specializes': [None, True]} 2026-02-21T10:02:30.5859327Z [215s] Fitting surrogate: 506 points, 506 targets 2026-02-21T10:02:31.2069450Z [215s] Generation 7 starting: 32 neighbors, 3 active search path(s) 2026-02-21T10:02:43.3170181Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 1.3 configs/s 2026-02-21T10:02:45.2911120Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 33/33 17.1 configs/s 2026-02-21T10:02:46.9730319Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 597.8 2026-02-21T10:02:46.9731443Z configs/s 2026-02-21T10:02:47.0911313Z [231s] Generation 7 complete: 2026-02-21T10:02:47.0916260Z ok=35 
2026-02-21T10:02:47.0920672Z min=0.0409 2026-02-21T10:02:47.0923847Z mid=0.0512 2026-02-21T10:02:47.0928250Z max=0.7873 2026-02-21T10:02:47.0932778Z best={'block_sizes': [1, 16384], 2026-02-21T10:02:47.0934283Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:02:47.0934581Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:02:47.0934774Z 'num_stages': 6, 2026-02-21T10:02:47.0934940Z 'num_warps': 1, 2026-02-21T10:02:47.0935081Z 'pid_type': 'flat', 2026-02-21T10:02:47.0935249Z 'range_flattens': [None, True], 2026-02-21T10:02:47.0935427Z 'range_multi_buffers': [None, False], 2026-02-21T10:02:47.0935869Z 'range_num_stages': [0, 0], 2026-02-21T10:02:47.0936051Z 'range_unroll_factors': [0, 0], 2026-02-21T10:02:47.0936243Z 'range_warp_specializes': [None, True]} 2026-02-21T10:02:47.0936472Z [231s] Fitting surrogate: 541 points, 541 targets 2026-02-21T10:02:47.5715596Z [232s] Generation 8 starting: 19 neighbors, 2 active search path(s) 2026-02-21T10:02:54.4129430Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 2.3 configs/s 2026-02-21T10:02:55.6060739Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 20/20 17.4 configs/s 2026-02-21T10:02:56.9346476Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 754.6 2026-02-21T10:02:56.9346928Z configs/s 2026-02-21T10:02:57.0247580Z [241s] Generation 8 complete: 2026-02-21T10:02:57.0252387Z ok=22 2026-02-21T10:02:57.0256692Z min=0.0409 2026-02-21T10:02:57.0261206Z mid=0.0410 2026-02-21T10:02:57.0266265Z max=0.6174 2026-02-21T10:02:57.0271180Z best={'block_sizes': [1, 16384], 2026-02-21T10:02:57.0272577Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:02:57.0272848Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:02:57.0273051Z 'num_stages': 6, 2026-02-21T10:02:57.0273197Z 'num_warps': 1, 2026-02-21T10:02:57.0273347Z 'pid_type': 'flat', 2026-02-21T10:02:57.0273514Z 'range_flattens': [None, True], 
2026-02-21T10:02:57.0273689Z 'range_multi_buffers': [None, False], 2026-02-21T10:02:57.0273880Z 'range_num_stages': [0, 1], 2026-02-21T10:02:57.0274042Z 'range_unroll_factors': [0, 0], 2026-02-21T10:02:57.0274224Z 'range_warp_specializes': [None, True]} 2026-02-21T10:02:57.0274444Z [241s] Fitting surrogate: 563 points, 563 targets 2026-02-21T10:02:57.3069561Z [241s] Autotuning complete in 241.9s after searching 532 configs. 2026-02-21T10:02:57.3071262Z One can hardcode the best config and skip autotuning with: 2026-02-21T10:02:57.3072318Z @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=6, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T10:02:57.3073265Z 2026-02-21T10:02:57.3073521Z [241s] Code of selected kernel: /tmp/torchinductor_root/pr/cprifmp37aytfpvnokgwzudeoq5y7jzemovnyvoi4hezk62ibwks.py 2026-02-21T10:02:57.3297479Z from __future__ import annotations 2026-02-21T10:02:57.3299312Z 2026-02-21T10:02:57.3299471Z import torch 2026-02-21T10:02:57.3299633Z import triton 2026-02-21T10:02:57.3299790Z import triton.language as tl 2026-02-21T10:02:57.3300248Z from torch._inductor.runtime import triton_helpers 2026-02-21T10:02:57.3300539Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T10:02:57.3300820Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T10:02:57.3300998Z 2026-02-21T10:02:57.3301075Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T10:02:57.3301258Z _BLOCK_SIZE_1 = tl.constexpr(16384) 2026-02-21T10:02:57.3301375Z 2026-02-21T10:02:57.3301432Z @triton.jit 2026-02-21T10:02:57.3301643Z def _helion_softmax_two_pass(x, out): 2026-02-21T10:02:57.3301895Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 
2026-02-21T10:02:57.3302149Z pid_0 = tl.program_id(0) 2026-02-21T10:02:57.3302310Z offset_0 = pid_0 2026-02-21T10:02:57.3302491Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T10:02:57.3302780Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T10:02:57.3303077Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T10:02:57.3303353Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T10:02:57.3303606Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T10:02:57.3303935Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T10:02:57.3304213Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T10:02:57.3304476Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T10:02:57.3304717Z # src[softmax.py:82-89]: ... 2026-02-21T10:02:57.3305079Z for offset_2 in tl.range(0, 10752, _BLOCK_SIZE_1, warp_specialize=True, num_stages=1, disallow_acc_multi_buffer=True, flatten=True): 2026-02-21T10:02:57.3305502Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T10:02:57.3305733Z mask_1 = indices_2 < 10752 2026-02-21T10:02:57.3305907Z mi_copy = mi 2026-02-21T10:02:57.3306046Z di_copy = di 2026-02-21T10:02:57.3306198Z mi_copy_0 = mi_copy 2026-02-21T10:02:57.3306359Z di_copy_0 = di_copy 2026-02-21T10:02:57.3306538Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T10:02:57.3306916Z values = tl.load(x + (indices_0[:, None] * 10752 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first') 2026-02-21T10:02:57.3307301Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T10:02:57.3307709Z _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16)) 2026-02-21T10:02:57.3308101Z local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16) 2026-02-21T10:02:57.3308363Z # src[softmax.py:85]: mi_next 
= torch.maximum(mi, local_amax) 2026-02-21T10:02:57.3308624Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T10:02:57.3308828Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T10:02:57.3309089Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T10:02:57.3309325Z v_2 = mi_copy_0 - v_1 2026-02-21T10:02:57.3309549Z v_3 = libdevice.exp(v_2) 2026-02-21T10:02:57.3309720Z v_4 = di_copy_0 * v_3 2026-02-21T10:02:57.3309907Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T10:02:57.3310150Z subscript = v_1[:, None] 2026-02-21T10:02:57.3310317Z v_5 = tl.cast(values, tl.float32) 2026-02-21T10:02:57.3310496Z v_6 = v_5 - subscript 2026-02-21T10:02:57.3310701Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T10:02:57.3310967Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T10:02:57.3311182Z # src[softmax.py:88]: ).sum(dim=1) 2026-02-21T10:02:57.3311364Z v_7 = libdevice.exp(v_6) 2026-02-21T10:02:57.3311721Z _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32)) 2026-02-21T10:02:57.3312123Z sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32) 2026-02-21T10:02:57.3312345Z di = v_4 + sum_1 2026-02-21T10:02:57.3312516Z # src[softmax.py:89]: mi = mi_next 2026-02-21T10:02:57.3312708Z mi = v_1 2026-02-21T10:02:57.3312922Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T10:02:57.3313214Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T10:02:57.3313534Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T10:02:57.3314018Z for offset_2 in tl.range(0, 10752, _BLOCK_SIZE_1, warp_specialize=True, num_stages=1, disallow_acc_multi_buffer=True, flatten=True): 2026-02-21T10:02:57.3314459Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T10:02:57.3314708Z mask_2 = indices_2 < 10752 2026-02-21T10:02:57.3314895Z mi_copy_1 = mi 
2026-02-21T10:02:57.3315055Z di_copy_1 = di 2026-02-21T10:02:57.3315228Z mi_copy_1_0 = mi_copy_1 2026-02-21T10:02:57.3315415Z di_copy_1_0 = di_copy_1 2026-02-21T10:02:57.3315617Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T10:02:57.3316059Z values_1 = tl.load(x + (indices_0[:, None] * 10752 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first') 2026-02-21T10:02:57.3316516Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T10:02:57.3316817Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T10:02:57.3317014Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T10:02:57.3317210Z v_10 = v_9 - subscript_1 2026-02-21T10:02:57.3317392Z v_11 = libdevice.exp(v_10) 2026-02-21T10:02:57.3317572Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T10:02:57.3317764Z v_12 = v_11 / subscript_2 2026-02-21T10:02:57.3317941Z v_13 = tl.cast(v_12, tl.float16) 2026-02-21T10:02:57.3318232Z tl.store(out + (indices_0[:, None] * 10752 + indices_2[None, :] * 1), v_13, mask_2[None, :]) 2026-02-21T10:02:57.3318458Z 2026-02-21T10:02:57.3318590Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T10:02:57.3318839Z """ 2026-02-21T10:02:57.3319071Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T10:02:57.3319387Z This version uses fewer passes but is less numerically stable. 2026-02-21T10:02:57.3319620Z Args: 2026-02-21T10:02:57.3319783Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T10:02:57.3319990Z Returns: 2026-02-21T10:02:57.3320172Z torch.Tensor: Softmax output tensor of the same shape. 
2026-02-21T10:02:57.3320395Z """ 2026-02-21T10:02:57.3320535Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T10:02:57.3320719Z m, n = x.size() 2026-02-21T10:02:57.3320896Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T10:02:57.3321105Z out = torch.empty_like(x) 2026-02-21T10:02:57.3321332Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T10:02:57.3321675Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T10:02:57.3322012Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T10:02:57.3322245Z # src[softmax.py:79-92]: ... 2026-02-21T10:02:57.3322530Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=6) 2026-02-21T10:02:57.3322805Z # src[softmax.py:93]: return out 2026-02-21T10:02:57.3322971Z return out 2026-02-21T10:02:58.2790354Z WARNING:tritonbench.utils.triton_op:Completed input ID 82: 2026-02-21T10:02:58.2794477Z (M, N) 2026-02-21T10:02:58.2798208Z ------------- 2026-02-21T10:02:58.2799647Z (4096, 10752) 2026-02-21T10:02:58.2799773Z 2026-02-21T10:02:58.2800364Z 85%|████████▌ | 17/20 [54:03<11:08, 222.81s/it]WARNING:tritonbench.utils.triton_op:Running input ID 87: 2026-02-21T10:02:58.2800691Z (M, N) 2026-02-21T10:02:58.2805983Z ------------- 2026-02-21T10:02:58.2807646Z (4096, 11392) 2026-02-21T10:02:58.2808054Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T10:02:59.5120131Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T10:03:00.8957588Z INFO:tritonbench.utils.triton_op:Took 2.24ms to get benchmark function for torch_compile_softmax 2026-02-21T10:03:02.2235223Z WARNING:__main__:Input tensor metadata: 2026-02-21T10:03:02.2239608Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T10:03:02.2243949Z 'dtype': 'torch.float16', 2026-02-21T10:03:02.2248783Z 'shape': (4096, 11392), 2026-02-21T10:03:02.2253163Z 'stride': (11392, 
1)},), 2026-02-21T10:03:02.2256351Z 'kwargs': {}} 2026-02-21T10:03:02.2260423Z INFO:tritonbench.utils.triton_op:Took 2.49ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T10:03:02.4003460Z [0s] Autotune random seed: 2138408546 2026-02-21T10:03:02.4296242Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T10:03:39.5119503Z [37s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', ''], maxnreg=32, num_sm_multiplier=2, num_stages=8, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[3, 2], range_unroll_factors=[4, 1], range_warp_specializes=[False, False]) 2026-02-21T10:03:43.9689502Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s 2026-02-21T10:03:46.0365023Z module { 2026-02-21T10:03:46.0367071Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T10:03:46.0367542Z %c256_i32 = arith.constant 256 : i32 2026-02-21T10:03:46.0367772Z %c128_i32 = arith.constant 128 : i32 2026-02-21T10:03:46.0367957Z %c0_i32 = arith.constant 0 : i32 2026-02-21T10:03:46.0368155Z %c1_i32 = arith.constant 1 : i32 2026-02-21T10:03:46.0368375Z %cst = arith.constant dense<11392> : tensor<16x1xi32> 2026-02-21T10:03:46.0368639Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<16xf32> 2026-02-21T10:03:46.0368913Z %cst_1 = arith.constant dense<0xFF800000> : tensor<16xf32> 2026-02-21T10:03:46.0369129Z %c16_i32 = arith.constant 16 : i32 2026-02-21T10:03:46.0369319Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T10:03:46.0369508Z %c11392_i32 = arith.constant 11392 : i32 2026-02-21T10:03:46.0369703Z %c11392_i64 = arith.constant 11392 : i64
2026-02-21T10:03:46.0369883Z %c1_i64 = arith.constant 1 : i64 2026-02-21T10:03:46.0370218Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c11392_i32], [%c11392_i64, %c1_i64] : , > 2026-02-21T10:03:46.0370551Z %1 = tt.get_program_id x : i32 2026-02-21T10:03:46.0370733Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T10:03:46.0370920Z %3 = arith.minsi %2, %c256_i32 : i32 2026-02-21T10:03:46.0371413Z scf.for %arg2 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T10:03:46.0372429Z %4 = arith.muli %arg2, %c16_i32 : i32 2026-02-21T10:03:46.0372674Z %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T10:03:46.0373066Z %6 = tt.splat %4 : i32 -> tensor<16xi32> 2026-02-21T10:03:46.0373289Z %7 = arith.addi %6, %5 : tensor<16xi32> 2026-02-21T10:03:46.0373496Z %c11264_i32 = arith.constant 11264 : i32 2026-02-21T10:03:46.0373717Z %c512_i32 = arith.constant 512 : i32 2026-02-21T10:03:46.0374167Z %8:2 = scf.for %arg3 = %c0_i32 to %c11264_i32 step %c512_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<16xf32>, tensor<16xf32>) : i32 { 2026-02-21T10:03:46.0374710Z %50 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T10:03:46.0375183Z %51 = arith.extf %50 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T10:03:46.0375479Z %52 = "tt.reduce"(%51) <{axis = 1 : i32}> ({ 2026-02-21T10:03:46.0375707Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T10:03:46.0379252Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T10:03:46.0384050Z tt.reduce.return %128 : f32 2026-02-21T10:03:46.0387247Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T10:03:46.0391153Z %53 = arith.truncf %52 : tensor<16xf32> to tensor<16xf16> 2026-02-21T10:03:46.0395920Z %54 = arith.extf %53 : tensor<16xf16> to tensor<16xf32> 2026-02-21T10:03:46.0396277Z %55 = arith.cmpf ogt, %arg4, %54 : tensor<16xf32> 2026-02-21T10:03:46.0396522Z %56 = arith.cmpf une, %arg4, %arg4 : tensor<16xf32> 2026-02-21T10:03:46.0396741Z %57 = arith.ori %55, %56 : 
tensor<16xi1> 2026-02-21T10:03:46.0396988Z %58 = arith.select %57, %arg4, %54 : tensor<16xi1>, tensor<16xf32> 2026-02-21T10:03:46.0397247Z %59 = arith.subf %arg4, %58 : tensor<16xf32> 2026-02-21T10:03:46.0397623Z %60 = tt.extern_elementwise %59 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T10:03:46.0397992Z %61 = arith.mulf %arg5, %60 : tensor<16xf32> 2026-02-21T10:03:46.0398251Z %62 = tt.expand_dims %58 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T10:03:46.0398553Z %63 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T10:03:46.0398791Z %64 = arith.subf %51, %63 : tensor<16x128xf32> 2026-02-21T10:03:46.0399160Z %65 = tt.extern_elementwise %64 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T10:03:46.0399517Z %66 = "tt.reduce"(%65) <{axis = 1 : i32}> ({ 2026-02-21T10:03:46.0399718Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T10:03:46.0399911Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T10:03:46.0400105Z tt.reduce.return %128 : f32 2026-02-21T10:03:46.0400299Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T10:03:46.0400495Z %67 = arith.addf %61, %66 : tensor<16xf32> 2026-02-21T10:03:46.0400695Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T10:03:46.0400886Z %68 = arith.muli %c128_i32, %c1_i32_4 : i32 2026-02-21T10:03:46.0401093Z %69 = arith.addi %arg3, %68 : i32 2026-02-21T10:03:46.0401365Z %70 = tt.descriptor_load %0[%4, %69] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T10:03:46.0401773Z %71 = arith.extf %70 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T10:03:46.0402015Z %72 = "tt.reduce"(%71) <{axis = 1 : i32}> ({ 2026-02-21T10:03:46.0402204Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T10:03:46.0402396Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T10:03:46.0402587Z tt.reduce.return %128 : f32 2026-02-21T10:03:46.0402781Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 
2026-02-21T10:03:46.0403004Z %73 = arith.truncf %72 : tensor<16xf32> to tensor<16xf16> 2026-02-21T10:03:46.0403420Z %74 = arith.extf %73 : tensor<16xf16> to tensor<16xf32> 2026-02-21T10:03:46.0403659Z %75 = arith.cmpf ogt, %58, %74 : tensor<16xf32> 2026-02-21T10:03:46.0403874Z %76 = arith.cmpf une, %58, %58 : tensor<16xf32> 2026-02-21T10:03:46.0404129Z %77 = arith.ori %75, %76 : tensor<16xi1> 2026-02-21T10:03:46.0404358Z %78 = arith.select %77, %58, %74 : tensor<16xi1>, tensor<16xf32> 2026-02-21T10:03:46.0404599Z %79 = arith.subf %58, %78 : tensor<16xf32> 2026-02-21T10:03:46.0404948Z %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T10:03:46.0405317Z %81 = arith.mulf %67, %80 : tensor<16xf32> 2026-02-21T10:03:46.0405580Z %82 = tt.expand_dims %78 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T10:03:46.0405914Z %83 = tt.broadcast %82 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T10:03:46.0406172Z %84 = arith.subf %71, %83 : tensor<16x128xf32> 2026-02-21T10:03:46.0406584Z %85 = tt.extern_elementwise %84 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T10:03:46.0406979Z %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({ 2026-02-21T10:03:46.0407184Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T10:03:46.0407374Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T10:03:46.0407575Z tt.reduce.return %128 : f32 2026-02-21T10:03:46.0407769Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T10:03:46.0407981Z %87 = arith.addf %81, %86 : tensor<16xf32> 2026-02-21T10:03:46.0408183Z %c2_i32 = arith.constant 2 : i32 2026-02-21T10:03:46.0408387Z %88 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T10:03:46.0408587Z %89 = arith.addi %arg3, %88 : i32 2026-02-21T10:03:46.0408886Z %90 = tt.descriptor_load %0[%4, %89] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T10:03:46.0409232Z %91 = arith.extf %90 : 
tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T10:03:46.0409473Z %92 = "tt.reduce"(%91) <{axis = 1 : i32}> ({ 2026-02-21T10:03:46.0409677Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T10:03:46.0409871Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T10:03:46.0410074Z tt.reduce.return %128 : f32 2026-02-21T10:03:46.0410266Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T10:03:46.0410502Z %93 = arith.truncf %92 : tensor<16xf32> to tensor<16xf16> 2026-02-21T10:03:46.0410763Z %94 = arith.extf %93 : tensor<16xf16> to tensor<16xf32> 2026-02-21T10:03:46.0411001Z %95 = arith.cmpf ogt, %78, %94 : tensor<16xf32> 2026-02-21T10:03:46.0411224Z %96 = arith.cmpf une, %78, %78 : tensor<16xf32> 2026-02-21T10:03:46.0411432Z %97 = arith.ori %95, %96 : tensor<16xi1> 2026-02-21T10:03:46.0411735Z %98 = arith.select %97, %78, %94 : tensor<16xi1>, tensor<16xf32> 2026-02-21T10:03:46.0411975Z %99 = arith.subf %78, %98 : tensor<16xf32> 2026-02-21T10:03:46.0412352Z %100 = tt.extern_elementwise %99 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T10:03:46.0412734Z %101 = arith.mulf %87, %100 : tensor<16xf32> 2026-02-21T10:03:46.0412995Z %102 = tt.expand_dims %98 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T10:03:46.0413307Z %103 = tt.broadcast %102 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T10:03:46.0413562Z %104 = arith.subf %91, %103 : tensor<16x128xf32> 2026-02-21T10:03:46.0413932Z %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T10:03:46.0414312Z %106 = "tt.reduce"(%105) <{axis = 1 : i32}> ({ 2026-02-21T10:03:46.0414505Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T10:03:46.0414689Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T10:03:46.0414903Z tt.reduce.return %128 : f32 2026-02-21T10:03:46.0415093Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T10:03:46.0415295Z %107 = 
arith.addf %101, %106 : tensor<16xf32> 2026-02-21T10:03:46.0415525Z %c3_i32 = arith.constant 3 : i32 2026-02-21T10:03:46.0415716Z %108 = arith.muli %c128_i32, %c3_i32 : i32 2026-02-21T10:03:46.0415904Z %109 = arith.addi %arg3, %108 : i32 2026-02-21T10:03:46.0416191Z %110 = tt.descriptor_load %0[%4, %109] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T10:03:46.0416516Z %111 = arith.extf %110 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T10:03:46.0416761Z %112 = "tt.reduce"(%111) <{axis = 1 : i32}> ({ 2026-02-21T10:03:46.0416948Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T10:03:46.0417168Z %128 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T10:03:46.0417359Z tt.reduce.return %128 : f32 2026-02-21T10:03:46.0417554Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T10:03:46.0417785Z %113 = arith.truncf %112 : tensor<16xf32> to tensor<16xf16> 2026-02-21T10:03:46.0418060Z %114 = arith.extf %113 : tensor<16xf16> to tensor<16xf32> 2026-02-21T10:03:46.0418305Z %115 = arith.cmpf ogt, %98, %114 : tensor<16xf32> 2026-02-21T10:03:46.0418528Z %116 = arith.cmpf une, %98, %98 : tensor<16xf32> 2026-02-21T10:03:46.0418742Z %117 = arith.ori %115, %116 : tensor<16xi1> 2026-02-21T10:03:46.0418976Z %118 = arith.select %117, %98, %114 : tensor<16xi1>, tensor<16xf32> 2026-02-21T10:03:46.0419225Z %119 = arith.subf %98, %118 : tensor<16xf32> 2026-02-21T10:03:46.0419591Z %120 = tt.extern_elementwise %119 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T10:03:46.0419959Z %121 = arith.mulf %107, %120 : tensor<16xf32> 2026-02-21T10:03:46.0420224Z %122 = tt.expand_dims %118 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T10:03:46.0420526Z %123 = tt.broadcast %122 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T10:03:46.0420786Z %124 = arith.subf %111, %123 : tensor<16x128xf32> 2026-02-21T10:03:46.0421167Z %125 = tt.extern_elementwise %124 {libname = "", libpath = "", pure = true, symbol = 
"__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T10:03:46.0421579Z %126 = "tt.reduce"(%125) <{axis = 1 : i32}> ({ 2026-02-21T10:03:46.0421778Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T10:03:46.0421953Z %128 = arith.addf %arg6, %arg7 : f32 2026-02-21T10:03:46.0422144Z tt.reduce.return %128 : f32 2026-02-21T10:03:46.0422323Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T10:03:46.0422528Z %127 = arith.addf %121, %126 : tensor<16xf32> 2026-02-21T10:03:46.0422755Z scf.yield %118, %127 : tensor<16xf32>, tensor<16xf32> 2026-02-21T10:03:46.0422970Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T10:03:46.0423272Z %9 = tt.descriptor_load %0[%4, %c11264_i32] : !tt.tensordesc> -> tensor<16x128xf16> 2026-02-21T10:03:46.0423602Z %10 = arith.extf %9 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T10:03:46.0423835Z %11 = "tt.reduce"(%10) <{axis = 1 : i32}> ({ 2026-02-21T10:03:46.0424023Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T10:03:46.0424209Z %50 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T10:03:46.0424404Z tt.reduce.return %50 : f32 2026-02-21T10:03:46.0424581Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T10:03:46.0424802Z %12 = arith.truncf %11 : tensor<16xf32> to tensor<16xf16> 2026-02-21T10:03:46.0425036Z %13 = arith.extf %12 : tensor<16xf16> to tensor<16xf32> 2026-02-21T10:03:46.0425263Z %14 = arith.cmpf ogt, %8#0, %13 : tensor<16xf32> 2026-02-21T10:03:46.0425473Z %15 = arith.cmpf une, %8#0, %8#0 : tensor<16xf32> 2026-02-21T10:03:46.0425680Z %16 = arith.ori %14, %15 : tensor<16xi1> 2026-02-21T10:03:46.0425940Z %17 = arith.select %16, %8#0, %13 : tensor<16xi1>, tensor<16xf32> 2026-02-21T10:03:46.0426179Z %18 = arith.subf %8#0, %17 : tensor<16xf32> 2026-02-21T10:03:46.0426580Z %19 = tt.extern_elementwise %18 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32> 2026-02-21T10:03:46.0426925Z %20 = arith.mulf %8#1, %19 : tensor<16xf32> 2026-02-21T10:03:46.0427174Z %21 = 
tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T10:03:46.0427457Z %22 = tt.broadcast %21 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T10:03:46.0427694Z %23 = arith.subf %10, %22 : tensor<16x128xf32> 2026-02-21T10:03:46.0428056Z %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T10:03:46.0428436Z %25 = "tt.reduce"(%24) <{axis = 1 : i32}> ({ 2026-02-21T10:03:46.0428634Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T10:03:46.0428810Z %50 = arith.addf %arg3, %arg4 : f32 2026-02-21T10:03:46.0428998Z tt.reduce.return %50 : f32 2026-02-21T10:03:46.0429207Z }) : (tensor<16x128xf32>) -> tensor<16xf32> 2026-02-21T10:03:46.0429411Z %26 = arith.addf %20, %25 : tensor<16xf32> 2026-02-21T10:03:46.0429615Z %c11264_i32_2 = arith.constant 11264 : i32 2026-02-21T10:03:46.0429811Z %c512_i32_3 = arith.constant 512 : i32 2026-02-21T10:03:46.0430070Z scf.for %arg3 = %c0_i32 to %c11264_i32_2 step %c512_i32_3 : i32 { 2026-02-21T10:03:46.0430352Z %50 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T10:03:46.0430614Z %51 = tt.splat %arg3 : i32 -> tensor<128xi32> 2026-02-21T10:03:46.0430816Z %52 = arith.addi %51, %50 : tensor<128xi32> 2026-02-21T10:03:46.0431082Z %53 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T10:03:46.0431362Z %54 = arith.muli %53, %cst : tensor<16x1xi32> 2026-02-21T10:03:46.0431661Z %55 = tt.expand_dims %52 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T10:03:46.0431960Z %56 = tt.broadcast %54 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T10:03:46.0432220Z %57 = tt.broadcast %55 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T10:03:46.0432457Z %58 = arith.addi %56, %57 : tensor<16x128xi32> 2026-02-21T10:03:46.0432692Z %59 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0432981Z %60 = tt.addptr %59, %58 : 
tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T10:03:46.0433286Z %61 = tt.load %60 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0433597Z %62 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T10:03:46.0433887Z %63 = arith.extf %61 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T10:03:46.0434142Z %64 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T10:03:46.0434380Z %65 = arith.subf %63, %64 : tensor<16x128xf32> 2026-02-21T10:03:46.0434751Z %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T10:03:46.0435150Z %67 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T10:03:46.0435434Z %68 = tt.broadcast %67 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T10:03:46.0435662Z %69 = arith.divf %66, %68 : tensor<16x128xf32> 2026-02-21T10:03:46.0435898Z %70 = arith.truncf %69 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T10:03:46.0436164Z %71 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0436443Z %72 = tt.addptr %71, %58 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T10:03:46.0436698Z tt.store %72, %70 : tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0436929Z %c1_i32_4 = arith.constant 1 : i32 2026-02-21T10:03:46.0437124Z %73 = arith.muli %c128_i32, %c1_i32_4 : i32 2026-02-21T10:03:46.0437314Z %74 = arith.addi %arg3, %73 : i32 2026-02-21T10:03:46.0437576Z %75 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T10:03:46.0437819Z %76 = tt.splat %74 : i32 -> tensor<128xi32> 2026-02-21T10:03:46.0438019Z %77 = arith.addi %76, %75 : tensor<128xi32> 2026-02-21T10:03:46.0438267Z %78 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T10:03:46.0438526Z %79 = arith.muli %78, %cst : tensor<16x1xi32> 2026-02-21T10:03:46.0438785Z %80 = tt.expand_dims %77 
{axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T10:03:46.0439077Z %81 = tt.broadcast %79 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T10:03:46.0439370Z %82 = tt.broadcast %80 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T10:03:46.0439604Z %83 = arith.addi %81, %82 : tensor<16x128xi32> 2026-02-21T10:03:46.0439842Z %84 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0440148Z %85 = tt.addptr %84, %83 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T10:03:46.0440444Z %86 = tt.load %85 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0440760Z %87 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T10:03:46.0441041Z %88 = arith.extf %86 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T10:03:46.0441301Z %89 = tt.broadcast %87 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T10:03:46.0441565Z %90 = arith.subf %88, %89 : tensor<16x128xf32> 2026-02-21T10:03:46.0441929Z %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T10:03:46.0442339Z %92 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T10:03:46.0442618Z %93 = tt.broadcast %92 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T10:03:46.0442858Z %94 = arith.divf %91, %93 : tensor<16x128xf32> 2026-02-21T10:03:46.0443091Z %95 = arith.truncf %94 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T10:03:46.0443366Z %96 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0443646Z %97 = tt.addptr %96, %83 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T10:03:46.0443894Z tt.store %97, %95 : tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0444100Z %c2_i32 = arith.constant 2 : i32 2026-02-21T10:03:46.0444294Z %98 = arith.muli %c128_i32, %c2_i32 : i32 2026-02-21T10:03:46.0444496Z %99 = arith.addi %arg3, %98 : i32 2026-02-21T10:03:46.0444730Z 
%100 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T10:03:46.0444991Z %101 = tt.splat %99 : i32 -> tensor<128xi32> 2026-02-21T10:03:46.0445200Z %102 = arith.addi %101, %100 : tensor<128xi32> 2026-02-21T10:03:46.0445450Z %103 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T10:03:46.0445718Z %104 = arith.muli %103, %cst : tensor<16x1xi32> 2026-02-21T10:03:46.0445978Z %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T10:03:46.0446281Z %106 = tt.broadcast %104 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T10:03:46.0446553Z %107 = tt.broadcast %105 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T10:03:46.0446795Z %108 = arith.addi %106, %107 : tensor<16x128xi32> 2026-02-21T10:03:46.0447042Z %109 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0447324Z %110 = tt.addptr %109, %108 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T10:03:46.0447667Z %111 = tt.load %110 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0447980Z %112 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T10:03:46.0448300Z %113 = arith.extf %111 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T10:03:46.0448572Z %114 = tt.broadcast %112 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T10:03:46.0448811Z %115 = arith.subf %113, %114 : tensor<16x128xf32> 2026-02-21T10:03:46.0449193Z %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T10:03:46.0449636Z %117 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T10:03:46.0449942Z %118 = tt.broadcast %117 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T10:03:46.0450226Z %119 = arith.divf %116, %118 : tensor<16x128xf32> 2026-02-21T10:03:46.0450481Z %120 = arith.truncf %119 : tensor<16x128xf32> to 
tensor<16x128xf16> 2026-02-21T10:03:46.0450778Z %121 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0451105Z %122 = tt.addptr %121, %108 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T10:03:46.0451386Z tt.store %122, %120 : tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0451632Z %c3_i32 = arith.constant 3 : i32 2026-02-21T10:03:46.0451835Z %123 = arith.muli %c128_i32, %c3_i32 : i32 2026-02-21T10:03:46.0452040Z %124 = arith.addi %arg3, %123 : i32 2026-02-21T10:03:46.0452283Z %125 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T10:03:46.0452551Z %126 = tt.splat %124 : i32 -> tensor<128xi32> 2026-02-21T10:03:46.0452764Z %127 = arith.addi %126, %125 : tensor<128xi32> 2026-02-21T10:03:46.0453032Z %128 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T10:03:46.0453308Z %129 = arith.muli %128, %cst : tensor<16x1xi32> 2026-02-21T10:03:46.0453586Z %130 = tt.expand_dims %127 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T10:03:46.0453902Z %131 = tt.broadcast %129 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T10:03:46.0454183Z %132 = tt.broadcast %130 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T10:03:46.0454466Z %133 = arith.addi %131, %132 : tensor<16x128xi32> 2026-02-21T10:03:46.0454715Z %134 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0455018Z %135 = tt.addptr %134, %133 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T10:03:46.0455349Z %136 = tt.load %135 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0455679Z %137 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T10:03:46.0455986Z %138 = arith.extf %136 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T10:03:46.0456271Z %139 = tt.broadcast %137 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T10:03:46.0456535Z %140 = arith.subf %138, %139 : tensor<16x128xf32> 
2026-02-21T10:03:46.0456965Z %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T10:03:46.0457387Z %142 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T10:03:46.0457685Z %143 = tt.broadcast %142 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T10:03:46.0457926Z %144 = arith.divf %141, %143 : tensor<16x128xf32> 2026-02-21T10:03:46.0458168Z %145 = arith.truncf %144 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T10:03:46.0458442Z %146 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0458718Z %147 = tt.addptr %146, %133 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T10:03:46.0459021Z tt.store %147, %145 : tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0459230Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T10:03:46.0459482Z %27 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T10:03:46.0459765Z %28 = tt.splat %c11264_i32_2 : i32 -> tensor<128xi32> 2026-02-21T10:03:46.0459986Z %29 = arith.addi %28, %27 : tensor<128xi32> 2026-02-21T10:03:46.0460237Z %30 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T10:03:46.0460489Z %31 = arith.muli %30, %cst : tensor<16x1xi32> 2026-02-21T10:03:46.0460743Z %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32> 2026-02-21T10:03:46.0461022Z %33 = tt.broadcast %31 : tensor<16x1xi32> -> tensor<16x128xi32> 2026-02-21T10:03:46.0461287Z %34 = tt.broadcast %32 : tensor<1x128xi32> -> tensor<16x128xi32> 2026-02-21T10:03:46.0461596Z %35 = arith.addi %33, %34 : tensor<16x128xi32> 2026-02-21T10:03:46.0461842Z %36 = tt.splat %arg0 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0462123Z %37 = tt.addptr %36, %35 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T10:03:46.0462457Z %38 = tt.load %37 evictionPolicy = evict_first : tensor<16x128x!tt.ptr> 
2026-02-21T10:03:46.0462768Z %39 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T10:03:46.0463044Z %40 = arith.extf %38 : tensor<16x128xf16> to tensor<16x128xf32> 2026-02-21T10:03:46.0463307Z %41 = tt.broadcast %39 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T10:03:46.0463537Z %42 = arith.subf %40, %41 : tensor<16x128xf32> 2026-02-21T10:03:46.0463909Z %43 = tt.extern_elementwise %42 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32> 2026-02-21T10:03:46.0464328Z %44 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32> 2026-02-21T10:03:46.0464608Z %45 = tt.broadcast %44 : tensor<16x1xf32> -> tensor<16x128xf32> 2026-02-21T10:03:46.0464841Z %46 = arith.divf %43, %45 : tensor<16x128xf32> 2026-02-21T10:03:46.0465072Z %47 = arith.truncf %46 : tensor<16x128xf32> to tensor<16x128xf16> 2026-02-21T10:03:46.0465340Z %48 = tt.splat %arg1 : !tt.ptr -> tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0465614Z %49 = tt.addptr %48, %35 : tensor<16x128x!tt.ptr>, tensor<16x128xi32> 2026-02-21T10:03:46.0465864Z tt.store %49, %47 : tensor<16x128x!tt.ptr> 2026-02-21T10:03:46.0466170Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T10:03:46.0466445Z tt.return 2026-02-21T10:03:46.0466578Z } 2026-02-21T10:03:46.0466695Z } 2026-02-21T10:03:46.0466771Z 2026-02-21T10:03:46.0466821Z {-# 2026-02-21T10:03:46.0466946Z external_resources: { 2026-02-21T10:03:46.0467111Z mlir_reproducer: { 2026-02-21T10:03:46.0471481Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, 
triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T10:03:46.0476094Z disable_threading: false, 2026-02-21T10:03:46.0476277Z verify_each: true 2026-02-21T10:03:46.0476426Z } 2026-02-21T10:03:46.0476555Z } 2026-02-21T10:03:46.0476672Z #-} 2026-02-21T10:03:46.0477134Z /tmp/torchinductor_root/cc/cccgmf6dglaruus34hbvdsr7nfyctkle4fuecbsoxmkrcgdutke5.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T10:03:46.0478352Z /tmp/torchinductor_root/cc/cccgmf6dglaruus34hbvdsr7nfyctkle4fuecbsoxmkrcgdutke5.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` 
on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T10:03:46.0479332Z [43s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T10:03:46.0480393Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_sm_multiplier=32, num_stages=3, num_warps=32, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[2, 3], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T10:03:46.0481344Z Error: RuntimeError: PassManager::run failed 2026-02-21T10:03:46.0481633Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T10:03:53.9298852Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 9.9 configs/s 2026-02-21T10:03:53.9308427Z [51s] Adaptive compile timeout: 30s (90% percentile=16.5s, bounds=[30.0s, 30s]) 2026-02-21T10:03:55.4259193Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 661.4 configs/s 2026-02-21T10:03:55.5087133Z [53s] Initial random population of 100, 5 starting points: 2026-02-21T10:03:55.5091447Z error=12 2026-02-21T10:03:55.5092960Z timeout=1 2026-02-21T10:03:55.5093125Z ok=87 2026-02-21T10:03:55.5093251Z min=0.0757 2026-02-21T10:03:55.5093384Z mid=0.5713 2026-02-21T10:03:55.5093505Z max=267.8518 2026-02-21T10:03:55.5093659Z best={'block_sizes': [1, 4096], 2026-02-21T10:03:55.5093924Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T10:03:55.5094185Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T10:03:55.5094376Z 'num_stages': 5, 2026-02-21T10:03:55.5094511Z 'num_warps': 1, 2026-02-21T10:03:55.5094715Z 'pid_type': 'flat', 2026-02-21T10:03:55.5094877Z 'range_flattens': [None, False], 2026-02-21T10:03:55.5095076Z 
'range_multi_buffers': [None, False], 2026-02-21T10:03:55.5095259Z 'range_num_stages': [0, 1], 2026-02-21T10:03:55.5099835Z 'range_unroll_factors': [0, 0], 2026-02-21T10:03:55.5101331Z 'range_warp_specializes': [None, False]} 2026-02-21T10:03:55.5101709Z [53s] Fitting surrogate: 100 points, 100 targets 2026-02-21T10:03:56.4444644Z [54s] Generation 1 starting: 73 neighbors, 5 active search path(s) 2026-02-21T10:04:12.6976664Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 3.7 configs/s 2026-02-21T10:04:17.2107432Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 17.0 configs/s 2026-02-21T10:04:22.3785047Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 194.7 2026-02-21T10:04:22.3785950Z configs/s 2026-02-21T10:04:22.6192151Z [80s] Generation 1 complete: 2026-02-21T10:04:22.6196443Z ok=79 2026-02-21T10:04:22.6201419Z min=0.0614 2026-02-21T10:04:22.6205387Z mid=0.0901 2026-02-21T10:04:22.6209704Z max=0.8972 2026-02-21T10:04:22.6214060Z best={'block_sizes': [1, 4096], 2026-02-21T10:04:22.6217641Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:04:22.6220659Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:04:22.6220905Z 'num_stages': 7, 2026-02-21T10:04:22.6221061Z 'num_warps': 4, 2026-02-21T10:04:22.6221203Z 'pid_type': 'flat', 2026-02-21T10:04:22.6221368Z 'range_flattens': [None, None], 2026-02-21T10:04:22.6221648Z 'range_multi_buffers': [None, True], 2026-02-21T10:04:22.6221841Z 'range_num_stages': [0, 0], 2026-02-21T10:04:22.6222014Z 'range_unroll_factors': [0, 4], 2026-02-21T10:04:22.6222207Z 'range_warp_specializes': [None, False]} 2026-02-21T10:04:22.6222427Z [80s] Fitting surrogate: 179 points, 179 targets 2026-02-21T10:04:23.6000365Z [81s] Generation 2 starting: 69 neighbors, 5 active search path(s) 2026-02-21T10:04:38.2608885Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 4.3 configs/s 2026-02-21T10:04:42.3483323Z Generation 2: exploring neighbors 
100% ━━━━━━━━━━━━━━━━━━━━ 69/69 17.1 configs/s 2026-02-21T10:04:42.9343800Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1662.5 2026-02-21T10:04:42.9344221Z configs/s 2026-02-21T10:04:42.9852518Z [100s] Generation 2 complete: 2026-02-21T10:04:42.9856932Z ok=74 2026-02-21T10:04:42.9858486Z min=0.0421 2026-02-21T10:04:42.9858655Z mid=0.0716 2026-02-21T10:04:42.9858779Z max=0.1883 2026-02-21T10:04:42.9858927Z best={'block_sizes': [1, 16384], 2026-02-21T10:04:42.9859166Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:04:42.9859431Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:04:42.9859639Z 'num_stages': 7, 2026-02-21T10:04:42.9859846Z 'num_warps': 4, 2026-02-21T10:04:42.9860004Z 'pid_type': 'flat', 2026-02-21T10:04:42.9860165Z 'range_flattens': [None, None], 2026-02-21T10:04:42.9860379Z 'range_multi_buffers': [None, True], 2026-02-21T10:04:42.9863545Z 'range_num_stages': [0, 0], 2026-02-21T10:04:42.9868560Z 'range_unroll_factors': [0, 4], 2026-02-21T10:04:42.9868909Z 'range_warp_specializes': [None, False]} 2026-02-21T10:04:42.9869198Z [100s] Fitting surrogate: 253 points, 253 targets 2026-02-21T10:04:43.9194428Z [101s] Generation 3 starting: 62 neighbors, 5 active search path(s) 2026-02-21T10:04:56.4580254Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 4.4 configs/s 2026-02-21T10:05:00.1172688Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 62/62 17.2 configs/s 2026-02-21T10:05:01.7754284Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 603.8 2026-02-21T10:05:01.7755102Z configs/s 2026-02-21T10:05:01.8736201Z [119s] Generation 3 complete: 2026-02-21T10:05:01.8738228Z ok=67 2026-02-21T10:05:01.8738448Z min=0.0429 2026-02-21T10:05:01.8738646Z mid=0.0655 2026-02-21T10:05:01.8738844Z max=0.9596 2026-02-21T10:05:01.8739034Z best={'block_sizes': [1, 16384], 2026-02-21T10:05:01.8739314Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 
2026-02-21T10:05:01.8739605Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:05:01.8739848Z 'num_stages': 7, 2026-02-21T10:05:01.8739995Z 'num_warps': 4, 2026-02-21T10:05:01.8740134Z 'pid_type': 'flat', 2026-02-21T10:05:01.8740295Z 'range_flattens': [None, None], 2026-02-21T10:05:01.8740470Z 'range_multi_buffers': [None, True], 2026-02-21T10:05:01.8740657Z 'range_num_stages': [0, 0], 2026-02-21T10:05:01.8740822Z 'range_unroll_factors': [0, 4], 2026-02-21T10:05:01.8741236Z 'range_warp_specializes': [None, False]} 2026-02-21T10:05:01.8751030Z [119s] Fitting surrogate: 320 points, 320 targets 2026-02-21T10:05:02.5497715Z [120s] Generation 4 starting: 41 neighbors, 4 active search path(s) 2026-02-21T10:05:12.1968073Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 2.7 configs/s 2026-02-21T10:05:14.6602893Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 42/42 17.4 configs/s 2026-02-21T10:05:16.9856099Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 432.4 2026-02-21T10:05:16.9860043Z configs/s 2026-02-21T10:05:17.1222820Z [134s] Generation 4 complete: 2026-02-21T10:05:17.1227103Z ok=45 2026-02-21T10:05:17.1231443Z min=0.0410 2026-02-21T10:05:17.1232755Z mid=0.0614 2026-02-21T10:05:17.1232917Z max=0.7526 2026-02-21T10:05:17.1233078Z best={'block_sizes': [1, 16384], 2026-02-21T10:05:17.1233317Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:05:17.1233581Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:05:17.1233783Z 'num_stages': 7, 2026-02-21T10:05:17.1233932Z 'num_warps': 4, 2026-02-21T10:05:17.1234073Z 'pid_type': 'flat', 2026-02-21T10:05:17.1234238Z 'range_flattens': [None, None], 2026-02-21T10:05:17.1234669Z 'range_multi_buffers': [None, True], 2026-02-21T10:05:17.1234881Z 'range_num_stages': [0, 0], 2026-02-21T10:05:17.1235054Z 'range_unroll_factors': [0, 4], 2026-02-21T10:05:17.1235234Z 'range_warp_specializes': [None, False]} 2026-02-21T10:05:17.1239353Z [134s] 
Fitting surrogate: 365 points, 365 targets 2026-02-21T10:05:17.5216597Z [135s] Generation 5 starting: 21 neighbors, 2 active search path(s) 2026-02-21T10:05:22.0605943Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 3.7 configs/s 2026-02-21T10:05:23.3531014Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 22/22 17.6 configs/s 2026-02-21T10:05:24.8055765Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 689.4 2026-02-21T10:05:24.8057646Z configs/s 2026-02-21T10:05:24.9002593Z [142s] Generation 5 complete: 2026-02-21T10:05:24.9007037Z ok=24 2026-02-21T10:05:24.9011334Z min=0.0409 2026-02-21T10:05:24.9012770Z mid=0.0593 2026-02-21T10:05:24.9012961Z max=0.0820 2026-02-21T10:05:24.9013119Z best={'block_sizes': [1, 16384], 2026-02-21T10:05:24.9013359Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:05:24.9013599Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:05:24.9013791Z 'num_stages': 8, 2026-02-21T10:05:24.9013932Z 'num_warps': 1, 2026-02-21T10:05:24.9014079Z 'pid_type': 'flat', 2026-02-21T10:05:24.9014232Z 'range_flattens': [None, None], 2026-02-21T10:05:24.9014416Z 'range_multi_buffers': [None, None], 2026-02-21T10:05:24.9014605Z 'range_num_stages': [0, 3], 2026-02-21T10:05:24.9014767Z 'range_unroll_factors': [0, 1], 2026-02-21T10:05:24.9014952Z 'range_warp_specializes': [None, True]} 2026-02-21T10:05:24.9017610Z [142s] Fitting surrogate: 389 points, 389 targets 2026-02-21T10:05:25.2806936Z [142s] Generation 6 starting: 17 neighbors, 2 active search path(s) 2026-02-21T10:05:28.8889913Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 9.3 configs/s 2026-02-21T10:05:29.8835672Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 17.9 configs/s 2026-02-21T10:05:31.3039487Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 704.6 2026-02-21T10:05:31.3039994Z configs/s 2026-02-21T10:05:31.3968314Z [148s] Generation 6 complete: 
2026-02-21T10:05:31.3972657Z ok=19 2026-02-21T10:05:31.3977079Z min=0.0410 2026-02-21T10:05:31.3981459Z mid=0.0430 2026-02-21T10:05:31.3985897Z max=0.0799 2026-02-21T10:05:31.3987351Z best={'block_sizes': [1, 16384], 2026-02-21T10:05:31.3987614Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:05:31.3988142Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:05:31.3988360Z 'num_stages': 8, 2026-02-21T10:05:31.3988513Z 'num_warps': 1, 2026-02-21T10:05:31.3988663Z 'pid_type': 'flat', 2026-02-21T10:05:31.3988818Z 'range_flattens': [None, None], 2026-02-21T10:05:31.3989003Z 'range_multi_buffers': [None, None], 2026-02-21T10:05:31.3989341Z 'range_num_stages': [0, 3], 2026-02-21T10:05:31.3989513Z 'range_unroll_factors': [0, 1], 2026-02-21T10:05:31.3989690Z 'range_warp_specializes': [None, True]} 2026-02-21T10:05:31.3989915Z [148s] Fitting surrogate: 408 points, 408 targets 2026-02-21T10:05:31.7673179Z [149s] Generation 7 starting: 13 neighbors, 2 active search path(s) 2026-02-21T10:05:35.9842158Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 1.7 configs/s 2026-02-21T10:05:36.7541055Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 13/13 18.0 configs/s 2026-02-21T10:05:37.9881379Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 809.5 2026-02-21T10:05:37.9882626Z configs/s 2026-02-21T10:05:38.0734950Z [155s] Generation 7 complete: 2026-02-21T10:05:38.0736753Z ok=15 2026-02-21T10:05:38.0736923Z min=0.0409 2026-02-21T10:05:38.0737052Z mid=0.0471 2026-02-21T10:05:38.0737184Z max=0.1352 2026-02-21T10:05:38.0737581Z best={'block_sizes': [1, 16384], 2026-02-21T10:05:38.0737846Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:05:38.0738086Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:05:38.0738283Z 'num_stages': 8, 2026-02-21T10:05:38.0738421Z 'num_warps': 1, 2026-02-21T10:05:38.0738566Z 'pid_type': 'flat', 2026-02-21T10:05:38.0738719Z 'range_flattens': [None, 
None], 2026-02-21T10:05:38.0738897Z 'range_multi_buffers': [None, None], 2026-02-21T10:05:38.0739084Z 'range_num_stages': [0, 3], 2026-02-21T10:05:38.0739246Z 'range_unroll_factors': [0, 1], 2026-02-21T10:05:38.0739429Z 'range_warp_specializes': [None, True]} 2026-02-21T10:05:38.0749921Z [155s] Fitting surrogate: 423 points, 423 targets 2026-02-21T10:05:38.2463756Z [155s] Autotuning complete in 155.8s after searching 400 configs. 2026-02-21T10:05:38.2464149Z One can hardcode the best config and skip autotuning with: 2026-02-21T10:05:38.2465124Z @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T10:05:38.2465973Z 2026-02-21T10:05:38.2466232Z [155s] Code of selected kernel: /tmp/torchinductor_root/di/cdiqaqmp6hqpf6mdpxmaeajmeqiazcpvadjtb4krqe5mqsv4i3vl.py 2026-02-21T10:05:38.2689116Z from __future__ import annotations 2026-02-21T10:05:38.2691090Z 2026-02-21T10:05:38.2691315Z import torch 2026-02-21T10:05:38.2691506Z import triton 2026-02-21T10:05:38.2691900Z import triton.language as tl 2026-02-21T10:05:38.2692145Z from torch._inductor.runtime import triton_helpers 2026-02-21T10:05:38.2692692Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T10:05:38.2692995Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T10:05:38.2693179Z 2026-02-21T10:05:38.2693362Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T10:05:38.2693549Z _BLOCK_SIZE_1 = tl.constexpr(16384) 2026-02-21T10:05:38.2693683Z 2026-02-21T10:05:38.2693742Z @triton.jit 2026-02-21T10:05:38.2693897Z def _helion_softmax_two_pass(x, out): 2026-02-21T10:05:38.2694170Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 
2026-02-21T10:05:38.2694441Z pid_0 = tl.program_id(0) 2026-02-21T10:05:38.2694610Z offset_0 = pid_0 2026-02-21T10:05:38.2694798Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T10:05:38.2695084Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T10:05:38.2695461Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T10:05:38.2695731Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T10:05:38.2695996Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T10:05:38.2696264Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T10:05:38.2696554Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T10:05:38.2696825Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T10:05:38.2697065Z # src[softmax.py:82-89]: ... 2026-02-21T10:05:38.2697393Z for offset_2 in tl.range(0, 11392, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=True, num_stages=3): 2026-02-21T10:05:38.2697772Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T10:05:38.2698010Z mask_1 = indices_2 < 11392 2026-02-21T10:05:38.2698174Z mi_copy = mi 2026-02-21T10:05:38.2698321Z di_copy = di 2026-02-21T10:05:38.2698466Z mi_copy_0 = mi_copy 2026-02-21T10:05:38.2698620Z di_copy_0 = di_copy 2026-02-21T10:05:38.2698808Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T10:05:38.2699216Z values = tl.load(x + (indices_0[:, None] * 11392 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first') 2026-02-21T10:05:38.2699613Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T10:05:38.2700015Z _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16)) 2026-02-21T10:05:38.2700413Z local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16) 2026-02-21T10:05:38.2700680Z # src[softmax.py:85]: mi_next = torch.maximum(mi, 
local_amax) 2026-02-21T10:05:38.2700913Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T10:05:38.2701129Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T10:05:38.2701385Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T10:05:38.2701672Z v_2 = mi_copy_0 - v_1 2026-02-21T10:05:38.2701843Z v_3 = libdevice.exp(v_2) 2026-02-21T10:05:38.2702020Z v_4 = di_copy_0 * v_3 2026-02-21T10:05:38.2702219Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T10:05:38.2702424Z subscript = v_1[:, None] 2026-02-21T10:05:38.2702614Z v_5 = tl.cast(values, tl.float32) 2026-02-21T10:05:38.2702796Z v_6 = v_5 - subscript 2026-02-21T10:05:38.2703023Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T10:05:38.2703288Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T10:05:38.2703515Z # src[softmax.py:88]: ).sum(dim=1) 2026-02-21T10:05:38.2703712Z v_7 = libdevice.exp(v_6) 2026-02-21T10:05:38.2704026Z _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32)) 2026-02-21T10:05:38.2704388Z sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32) 2026-02-21T10:05:38.2704585Z di = v_4 + sum_1 2026-02-21T10:05:38.2704803Z # src[softmax.py:89]: mi = mi_next 2026-02-21T10:05:38.2704970Z mi = v_1 2026-02-21T10:05:38.2705175Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T10:05:38.2705493Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T10:05:38.2705795Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T10:05:38.2706210Z for offset_2 in tl.range(0, 11392, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=True, num_stages=3): 2026-02-21T10:05:38.2706572Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T10:05:38.2706813Z mask_2 = indices_2 < 11392 2026-02-21T10:05:38.2706984Z mi_copy_1 = mi 2026-02-21T10:05:38.2707142Z di_copy_1 = di 
2026-02-21T10:05:38.2707291Z mi_copy_1_0 = mi_copy_1 2026-02-21T10:05:38.2707508Z di_copy_1_0 = di_copy_1 2026-02-21T10:05:38.2707699Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T10:05:38.2708066Z values_1 = tl.load(x + (indices_0[:, None] * 11392 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first') 2026-02-21T10:05:38.2708511Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T10:05:38.2708785Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T10:05:38.2708977Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T10:05:38.2709157Z v_10 = v_9 - subscript_1 2026-02-21T10:05:38.2709328Z v_11 = libdevice.exp(v_10) 2026-02-21T10:05:38.2709506Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T10:05:38.2709682Z v_12 = v_11 / subscript_2 2026-02-21T10:05:38.2709858Z v_13 = tl.cast(v_12, tl.float16) 2026-02-21T10:05:38.2710132Z tl.store(out + (indices_0[:, None] * 11392 + indices_2[None, :] * 1), v_13, mask_2[None, :]) 2026-02-21T10:05:38.2710354Z 2026-02-21T10:05:38.2710481Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T10:05:38.2710708Z """ 2026-02-21T10:05:38.2710915Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T10:05:38.2711254Z This version uses fewer passes but is less numerically stable. 2026-02-21T10:05:38.2711474Z Args: 2026-02-21T10:05:38.2711664Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T10:05:38.2711853Z Returns: 2026-02-21T10:05:38.2712034Z torch.Tensor: Softmax output tensor of the same shape. 
2026-02-21T10:05:38.2712238Z """ 2026-02-21T10:05:38.2712381Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T10:05:38.2712557Z m, n = x.size() 2026-02-21T10:05:38.2712733Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T10:05:38.2712939Z out = torch.empty_like(x) 2026-02-21T10:05:38.2713159Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T10:05:38.2713480Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T10:05:38.2713788Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T10:05:38.2714032Z # src[softmax.py:79-92]: ... 2026-02-21T10:05:38.2714288Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=8) 2026-02-21T10:05:38.2714565Z # src[softmax.py:93]: return out 2026-02-21T10:05:38.2714737Z return out 2026-02-21T10:05:39.2720443Z WARNING:tritonbench.utils.triton_op:Completed input ID 87: 2026-02-21T10:05:39.2722317Z (M, N) 2026-02-21T10:05:39.2722481Z ------------- 2026-02-21T10:05:39.2722636Z (4096, 11392) 2026-02-21T10:05:39.2722766Z 2026-02-21T10:05:39.2734771Z 90%|█████████ | 18/20 [56:44<06:48, 204.23s/it]WARNING:tritonbench.utils.triton_op:Running input ID 92: 2026-02-21T10:05:39.2738796Z (M, N) 2026-02-21T10:05:39.2743117Z ------------- 2026-02-21T10:05:39.2748115Z (4096, 12032) 2026-02-21T10:05:39.2751476Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T10:05:40.4471262Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T10:05:41.8207464Z INFO:tritonbench.utils.triton_op:Took 2.34ms to get benchmark function for torch_compile_softmax 2026-02-21T10:05:43.1571417Z WARNING:__main__:Input tensor metadata: 2026-02-21T10:05:43.1575894Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T10:05:43.1579111Z 'dtype': 'torch.float16', 2026-02-21T10:05:43.1583478Z 'shape': (4096, 12032), 2026-02-21T10:05:43.1585033Z 'stride': (12032, 
1)},), 2026-02-21T10:05:43.1585257Z 'kwargs': {}} 2026-02-21T10:05:43.1594307Z INFO:tritonbench.utils.triton_op:Took 2.47ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T10:05:43.3339296Z [0s] Autotune random seed: 2138408546 2026-02-21T10:05:43.3593977Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T10:06:20.4021737Z [37s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', ''], maxnreg=32, num_sm_multiplier=2, num_stages=8, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[3, 2], range_unroll_factors=[4, 1], range_warp_specializes=[False, False]) 2026-02-21T10:06:23.2134804Z [39s] Timeout after 30s compiling Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T10:06:25.2487547Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s 2026-02-21T10:06:34.6461307Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 10.5 configs/s 2026-02-21T10:06:34.6469913Z [51s] Adaptive compile timeout: 30s (90% percentile=17.6s, bounds=[30.0s, 30s]) 2026-02-21T10:06:36.8640468Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 661.8 configs/s 2026-02-21T10:06:36.9525256Z [53s] Initial random population of 100, 5 starting points: 2026-02-21T10:06:36.9529644Z error=11 2026-02-21T10:06:36.9533109Z timeout=2 2026-02-21T10:06:36.9539035Z ok=87 2026-02-21T10:06:36.9542509Z min=0.0757 2026-02-21T10:06:36.9546831Z mid=0.5504 2026-02-21T10:06:36.9550279Z max=280.9385 
2026-02-21T10:06:36.9552618Z best={'block_sizes': [1, 4096], 2026-02-21T10:06:36.9553002Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T10:06:36.9553311Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T10:06:36.9557890Z 'num_stages': 5, 2026-02-21T10:06:36.9561822Z 'num_warps': 1, 2026-02-21T10:06:36.9563007Z 'pid_type': 'flat', 2026-02-21T10:06:36.9563249Z 'range_flattens': [None, False], 2026-02-21T10:06:36.9563477Z 'range_multi_buffers': [None, False], 2026-02-21T10:06:36.9563704Z 'range_num_stages': [0, 1], 2026-02-21T10:06:36.9563894Z 'range_unroll_factors': [0, 0], 2026-02-21T10:06:36.9564109Z 'range_warp_specializes': [None, False]} 2026-02-21T10:06:36.9564448Z [53s] Fitting surrogate: 100 points, 100 targets 2026-02-21T10:06:37.9882443Z [54s] Generation 1 starting: 73 neighbors, 5 active search path(s) 2026-02-21T10:07:03.4796825Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 0.5 configs/s 2026-02-21T10:07:07.9494607Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 16.9 configs/s 2026-02-21T10:07:11.6557298Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 271.1 2026-02-21T10:07:11.6560827Z configs/s 2026-02-21T10:07:11.8339404Z [88s] Generation 1 complete: 2026-02-21T10:07:11.8340627Z ok=79 2026-02-21T10:07:11.8341133Z min=0.0573 2026-02-21T10:07:11.8341281Z mid=0.0882 2026-02-21T10:07:11.8341502Z max=0.7086 2026-02-21T10:07:11.8341920Z best={'block_sizes': [1, 4096], 2026-02-21T10:07:11.8342180Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T10:07:11.8342466Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T10:07:11.8342769Z 'num_sm_multiplier': 128, 2026-02-21T10:07:11.8342938Z 'num_stages': 5, 2026-02-21T10:07:11.8343096Z 'num_warps': 1, 2026-02-21T10:07:11.8343258Z 'pid_type': 'persistent_interleaved', 2026-02-21T10:07:11.8343463Z 'range_flattens': [True, False], 2026-02-21T10:07:11.8343641Z 'range_multi_buffers': [True, 
False], 2026-02-21T10:07:11.8343826Z 'range_num_stages': [2, 1], 2026-02-21T10:07:11.8343995Z 'range_unroll_factors': [0, 0], 2026-02-21T10:07:11.8344183Z 'range_warp_specializes': [True, None]} 2026-02-21T10:07:11.8354149Z [88s] Fitting surrogate: 179 points, 179 targets 2026-02-21T10:07:12.8814370Z [89s] Generation 2 starting: 81 neighbors, 5 active search path(s) 2026-02-21T10:07:48.3457595Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 0.3 configs/s 2026-02-21T10:07:53.2780170Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 17.2 configs/s 2026-02-21T10:07:59.2428864Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 169.1 2026-02-21T10:07:59.2429647Z configs/s 2026-02-21T10:07:59.5409129Z [136s] Generation 2 complete: 2026-02-21T10:07:59.5410843Z error=1 2026-02-21T10:07:59.5411003Z ok=86 2026-02-21T10:07:59.5411130Z min=0.0555 2026-02-21T10:07:59.5411265Z mid=0.0726 2026-02-21T10:07:59.5411385Z max=1.8360 2026-02-21T10:07:59.5411527Z best={'block_sizes': [1, 4096], 2026-02-21T10:07:59.5411956Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T10:07:59.5412221Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T10:07:59.5412414Z 'num_sm_multiplier': 128, 2026-02-21T10:07:59.5412599Z 'num_stages': 5, 2026-02-21T10:07:59.5412740Z 'num_warps': 2, 2026-02-21T10:07:59.5412917Z 'pid_type': 'persistent_interleaved', 2026-02-21T10:07:59.5413118Z 'range_flattens': [True, False], 2026-02-21T10:07:59.5413297Z 'range_multi_buffers': [True, False], 2026-02-21T10:07:59.5413487Z 'range_num_stages': [2, 1], 2026-02-21T10:07:59.5413656Z 'range_unroll_factors': [0, 0], 2026-02-21T10:07:59.5413838Z 'range_warp_specializes': [None, False]} 2026-02-21T10:07:59.5430724Z [136s] Fitting surrogate: 266 points, 266 targets 2026-02-21T10:08:00.6434436Z [137s] Generation 3 starting: 73 neighbors, 5 active search path(s) 2026-02-21T10:08:17.7459972Z Generation 3: precompiling 100% 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 3.1 configs/s 2026-02-21T10:08:22.1700943Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 17.1 configs/s 2026-02-21T10:08:25.4870347Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 303.6 2026-02-21T10:08:25.4874236Z configs/s 2026-02-21T10:08:25.6828700Z [162s] Generation 3 complete: 2026-02-21T10:08:25.6833023Z ok=79 2026-02-21T10:08:25.6837386Z min=0.0430 2026-02-21T10:08:25.6841200Z mid=0.0676 2026-02-21T10:08:25.6845685Z max=0.5201 2026-02-21T10:08:25.6850240Z best={'block_sizes': [1, 16384], 2026-02-21T10:08:25.6854685Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:08:25.6856007Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:08:25.6856227Z 'num_stages': 6, 2026-02-21T10:08:25.6856377Z 'num_warps': 4, 2026-02-21T10:08:25.6856520Z 'pid_type': 'flat', 2026-02-21T10:08:25.6856689Z 'range_flattens': [None, None], 2026-02-21T10:08:25.6856875Z 'range_multi_buffers': [None, True], 2026-02-21T10:08:25.6857072Z 'range_num_stages': [0, 0], 2026-02-21T10:08:25.6857242Z 'range_unroll_factors': [0, 4], 2026-02-21T10:08:25.6857423Z 'range_warp_specializes': [None, False]} 2026-02-21T10:08:25.6857637Z [162s] Fitting surrogate: 345 points, 345 targets 2026-02-21T10:08:27.2176510Z [163s] Generation 4 starting: 50 neighbors, 4 active search path(s) 2026-02-21T10:08:38.0029543Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 4.4 configs/s 2026-02-21T10:08:41.0188822Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 17.2 configs/s 2026-02-21T10:08:44.5172814Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 288.3 2026-02-21T10:08:44.5175978Z configs/s 2026-02-21T10:08:44.7158410Z [181s] Generation 4 complete: 2026-02-21T10:08:44.7160464Z ok=54 2026-02-21T10:08:44.7160688Z min=0.0430 2026-02-21T10:08:44.7165341Z mid=0.0594 2026-02-21T10:08:44.7167327Z max=0.4955 2026-02-21T10:08:44.7167517Z best={'block_sizes': [1, 
16384], 2026-02-21T10:08:44.7167768Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:08:44.7168014Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:08:44.7168229Z 'num_stages': 6, 2026-02-21T10:08:44.7168411Z 'num_warps': 4, 2026-02-21T10:08:44.7168565Z 'pid_type': 'flat', 2026-02-21T10:08:44.7168728Z 'range_flattens': [None, None], 2026-02-21T10:08:44.7168907Z 'range_multi_buffers': [None, True], 2026-02-21T10:08:44.7169101Z 'range_num_stages': [0, 0], 2026-02-21T10:08:44.7169560Z 'range_unroll_factors': [0, 4], 2026-02-21T10:08:44.7169767Z 'range_warp_specializes': [None, False]} 2026-02-21T10:08:44.7176270Z [181s] Fitting surrogate: 399 points, 399 targets 2026-02-21T10:08:45.4604732Z [182s] Generation 5 starting: 46 neighbors, 4 active search path(s) 2026-02-21T10:08:57.0369977Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48/48 3.5 configs/s 2026-02-21T10:08:59.8744648Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 48/48 17.2 configs/s 2026-02-21T10:09:02.5140455Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 381.4 2026-02-21T10:09:02.5144471Z configs/s 2026-02-21T10:09:02.6684626Z [199s] Generation 5 complete: 2026-02-21T10:09:02.6688955Z ok=50 2026-02-21T10:09:02.6693345Z min=0.0429 2026-02-21T10:09:02.6697763Z mid=0.0594 2026-02-21T10:09:02.6701679Z max=0.7075 2026-02-21T10:09:02.6704952Z best={'block_sizes': [1, 16384], 2026-02-21T10:09:02.6705290Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:09:02.6709488Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:09:02.6713208Z 'num_stages': 6, 2026-02-21T10:09:02.6716425Z 'num_warps': 4, 2026-02-21T10:09:02.6719704Z 'pid_type': 'flat', 2026-02-21T10:09:02.6719990Z 'range_flattens': [None, None], 2026-02-21T10:09:02.6720213Z 'range_multi_buffers': [None, True], 2026-02-21T10:09:02.6724194Z 'range_num_stages': [0, 0], 2026-02-21T10:09:02.6728508Z 'range_unroll_factors': [0, 4], 
2026-02-21T10:09:02.6732298Z 'range_warp_specializes': [None, False]} 2026-02-21T10:09:02.6736725Z [199s] Fitting surrogate: 449 points, 449 targets 2026-02-21T10:09:03.1433878Z [199s] Generation 6 starting: 28 neighbors, 2 active search path(s) 2026-02-21T10:09:12.2548576Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 29/29 1.3 configs/s 2026-02-21T10:09:13.9795206Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 29/29 17.3 configs/s 2026-02-21T10:09:16.1828935Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 456.1 2026-02-21T10:09:16.1833068Z configs/s 2026-02-21T10:09:16.3168171Z [212s] Generation 6 complete: 2026-02-21T10:09:16.3170022Z ok=31 2026-02-21T10:09:16.3170179Z min=0.0410 2026-02-21T10:09:16.3170316Z mid=0.0593 2026-02-21T10:09:16.3170437Z max=0.5233 2026-02-21T10:09:16.3170581Z best={'block_sizes': [1, 16384], 2026-02-21T10:09:16.3170809Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:09:16.3171056Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:09:16.3171253Z 'num_stages': 6, 2026-02-21T10:09:16.3171883Z 'num_warps': 4, 2026-02-21T10:09:16.3172049Z 'pid_type': 'flat', 2026-02-21T10:09:16.3172214Z 'range_flattens': [None, None], 2026-02-21T10:09:16.3172400Z 'range_multi_buffers': [None, True], 2026-02-21T10:09:16.3172583Z 'range_num_stages': [0, 0], 2026-02-21T10:09:16.3172755Z 'range_unroll_factors': [0, 4], 2026-02-21T10:09:16.3173030Z 'range_warp_specializes': [None, False]} 2026-02-21T10:09:16.3186077Z [212s] Fitting surrogate: 480 points, 480 targets 2026-02-21T10:09:16.7680881Z [213s] Generation 7 starting: 24 neighbors, 2 active search path(s) 2026-02-21T10:09:27.4817103Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25/25 0.5 configs/s 2026-02-21T10:09:28.9706037Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 25/25 17.3 configs/s 2026-02-21T10:09:30.8318105Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 539.3 
2026-02-21T10:09:30.8322116Z configs/s 2026-02-21T10:09:30.9448542Z [227s] Generation 7 complete: 2026-02-21T10:09:30.9452978Z ok=27 2026-02-21T10:09:30.9457259Z min=0.0410 2026-02-21T10:09:30.9458617Z mid=0.0573 2026-02-21T10:09:30.9458786Z max=0.1597 2026-02-21T10:09:30.9458930Z best={'block_sizes': [1, 16384], 2026-02-21T10:09:30.9459442Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:09:30.9459733Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:09:30.9459925Z 'num_stages': 6, 2026-02-21T10:09:30.9460104Z 'num_warps': 4, 2026-02-21T10:09:30.9460244Z 'pid_type': 'flat', 2026-02-21T10:09:30.9460404Z 'range_flattens': [None, None], 2026-02-21T10:09:30.9460580Z 'range_multi_buffers': [None, True], 2026-02-21T10:09:30.9460767Z 'range_num_stages': [0, 0], 2026-02-21T10:09:30.9460931Z 'range_unroll_factors': [0, 4], 2026-02-21T10:09:30.9461115Z 'range_warp_specializes': [None, False]} 2026-02-21T10:09:30.9465864Z [227s] Fitting surrogate: 507 points, 507 targets 2026-02-21T10:09:31.5479687Z [228s] Generation 8 starting: 29 neighbors, 2 active search path(s) 2026-02-21T10:09:39.4077052Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 3.5 configs/s 2026-02-21T10:09:41.2305545Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 31/31 17.4 configs/s 2026-02-21T10:09:43.2755226Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 491.8 2026-02-21T10:09:43.2758775Z configs/s 2026-02-21T10:09:43.4060578Z [240s] Generation 8 complete: 2026-02-21T10:09:43.4065531Z ok=32 2026-02-21T10:09:43.4066971Z min=0.0429 2026-02-21T10:09:43.4067138Z mid=0.0431 2026-02-21T10:09:43.4067261Z max=0.1270 2026-02-21T10:09:43.4067412Z best={'block_sizes': [1, 16384], 2026-02-21T10:09:43.4067640Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:09:43.4067888Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:09:43.4068083Z 'num_stages': 6, 2026-02-21T10:09:43.4068220Z 'num_warps': 4, 
2026-02-21T10:09:43.4068380Z 'pid_type': 'flat', 2026-02-21T10:09:43.4068538Z 'range_flattens': [None, None], 2026-02-21T10:09:43.4068987Z 'range_multi_buffers': [None, True], 2026-02-21T10:09:43.4069170Z 'range_num_stages': [0, 0], 2026-02-21T10:09:43.4069341Z 'range_unroll_factors': [0, 4], 2026-02-21T10:09:43.4069525Z 'range_warp_specializes': [None, False]} 2026-02-21T10:09:43.4080434Z [240s] Fitting surrogate: 539 points, 539 targets 2026-02-21T10:09:43.7844189Z [240s] Generation 9 starting: 11 neighbors, 1 active search path(s) 2026-02-21T10:10:03.2401226Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 0.2 configs/s 2026-02-21T10:10:03.9607946Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 12/12 17.8 configs/s 2026-02-21T10:10:04.6140076Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1505.8 2026-02-21T10:10:04.6144933Z configs/s 2026-02-21T10:10:04.6669754Z [261s] Generation 9 complete: 2026-02-21T10:10:04.6674300Z ok=13 2026-02-21T10:10:04.6675526Z min=0.0411 2026-02-21T10:10:04.6675748Z mid=0.0594 2026-02-21T10:10:04.6675914Z max=0.4300 2026-02-21T10:10:04.6676107Z best={'block_sizes': [1, 16384], 2026-02-21T10:10:04.6676595Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:10:04.6676887Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:10:04.6677109Z 'num_stages': 6, 2026-02-21T10:10:04.6677254Z 'num_warps': 4, 2026-02-21T10:10:04.6677427Z 'pid_type': 'flat', 2026-02-21T10:10:04.6677583Z 'range_flattens': [None, None], 2026-02-21T10:10:04.6677782Z 'range_multi_buffers': [None, True], 2026-02-21T10:10:04.6677965Z 'range_num_stages': [0, 0], 2026-02-21T10:10:04.6678139Z 'range_unroll_factors': [0, 4], 2026-02-21T10:10:04.6678318Z 'range_warp_specializes': [None, False]} 2026-02-21T10:10:04.6693335Z [261s] Fitting surrogate: 552 points, 552 targets 2026-02-21T10:10:05.0334402Z [261s] Generation 10 starting: 10 neighbors, 1 active search path(s) 2026-02-21T10:10:09.5712543Z 
Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 2.0 configs/s 2026-02-21T10:10:10.1704140Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 10/10 18.1 configs/s 2026-02-21T10:10:10.8365288Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1483.6 2026-02-21T10:10:10.8367026Z configs/s 2026-02-21T10:10:10.8923567Z [267s] Generation 10 complete: 2026-02-21T10:10:10.8925540Z ok=12 2026-02-21T10:10:10.8925708Z min=0.0428 2026-02-21T10:10:10.8925844Z mid=0.0430 2026-02-21T10:10:10.8925963Z max=0.7067 2026-02-21T10:10:10.8926109Z best={'block_sizes': [1, 16384], 2026-02-21T10:10:10.8926342Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:10:10.8926594Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:10:10.8926783Z 'num_stages': 6, 2026-02-21T10:10:10.8926928Z 'num_warps': 4, 2026-02-21T10:10:10.8927070Z 'pid_type': 'flat', 2026-02-21T10:10:10.8927250Z 'range_flattens': [None, None], 2026-02-21T10:10:10.8927687Z 'range_multi_buffers': [None, True], 2026-02-21T10:10:10.8927877Z 'range_num_stages': [0, 0], 2026-02-21T10:10:10.8928053Z 'range_unroll_factors': [0, 4], 2026-02-21T10:10:10.8928239Z 'range_warp_specializes': [None, False]} 2026-02-21T10:10:10.8940206Z [267s] Fitting surrogate: 564 points, 564 targets 2026-02-21T10:10:11.3583978Z [267s] Generation 11 starting: 10 neighbors, 1 active search path(s) 2026-02-21T10:10:14.7780430Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 3.7 configs/s 2026-02-21T10:10:15.3679405Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 10/10 18.5 configs/s 2026-02-21T10:10:16.9297969Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1305.3 2026-02-21T10:10:16.9299303Z configs/s 2026-02-21T10:10:16.9931488Z [273s] Generation 11 complete: 2026-02-21T10:10:16.9935677Z ok=12 2026-02-21T10:10:16.9939820Z min=0.0429 2026-02-21T10:10:16.9944374Z mid=0.0430 2026-02-21T10:10:16.9948329Z max=0.4566 
2026-02-21T10:10:16.9952702Z best={'block_sizes': [1, 16384], 2026-02-21T10:10:16.9954379Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:10:16.9954721Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:10:16.9955169Z 'num_stages': 6, 2026-02-21T10:10:16.9959360Z 'num_warps': 4, 2026-02-21T10:10:16.9959633Z 'pid_type': 'flat', 2026-02-21T10:10:16.9959845Z 'range_flattens': [None, None], 2026-02-21T10:10:16.9960053Z 'range_multi_buffers': [None, True], 2026-02-21T10:10:16.9964677Z 'range_num_stages': [0, 0], 2026-02-21T10:10:16.9968051Z 'range_unroll_factors': [0, 4], 2026-02-21T10:10:16.9971836Z 'range_warp_specializes': [None, False]} 2026-02-21T10:10:16.9976263Z [273s] Fitting surrogate: 576 points, 576 targets 2026-02-21T10:10:17.2729691Z [273s] Autotuning complete in 273.9s after searching 548 configs. 2026-02-21T10:10:17.2730121Z One can hardcode the best config and skip autotuning with: 2026-02-21T10:10:17.2731302Z @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=6, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T10:10:17.2732308Z 2026-02-21T10:10:17.2732562Z [273s] Code of selected kernel: /tmp/torchinductor_root/es/cesmpf2kmfidgpmoulxbj57xbmpocavecijc4vzx5bpbfemvuaoq.py 2026-02-21T10:10:17.2951638Z from __future__ import annotations 2026-02-21T10:10:17.2955842Z 2026-02-21T10:10:17.2960422Z import torch 2026-02-21T10:10:17.2961992Z import triton 2026-02-21T10:10:17.2962186Z import triton.language as tl 2026-02-21T10:10:17.2962414Z from torch._inductor.runtime import triton_helpers 2026-02-21T10:10:17.2962702Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T10:10:17.2963030Z from helion.runtime import default_launcher as 
_default_launcher 2026-02-21T10:10:17.2963214Z 2026-02-21T10:10:17.2963292Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T10:10:17.2963469Z _BLOCK_SIZE_1 = tl.constexpr(16384) 2026-02-21T10:10:17.2963592Z 2026-02-21T10:10:17.2963659Z @triton.jit 2026-02-21T10:10:17.2963803Z def _helion_softmax_two_pass(x, out): 2026-02-21T10:10:17.2964060Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T10:10:17.2964307Z pid_0 = tl.program_id(0) 2026-02-21T10:10:17.2964477Z offset_0 = pid_0 2026-02-21T10:10:17.2964647Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T10:10:17.2964930Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T10:10:17.2965225Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T10:10:17.2965484Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T10:10:17.2965739Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T10:10:17.2966247Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T10:10:17.2966529Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T10:10:17.2966792Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T10:10:17.2967028Z # src[softmax.py:82-89]: ... 
2026-02-21T10:10:17.2967384Z for offset_2 in tl.range(0, 12032, _BLOCK_SIZE_1, loop_unroll_factor=4, warp_specialize=False, disallow_acc_multi_buffer=False): 2026-02-21T10:10:17.2967783Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T10:10:17.2968019Z mask_1 = indices_2 < 12032 2026-02-21T10:10:17.2968181Z mi_copy = mi 2026-02-21T10:10:17.2968347Z di_copy = di 2026-02-21T10:10:17.2968494Z mi_copy_0 = mi_copy 2026-02-21T10:10:17.2968645Z di_copy_0 = di_copy 2026-02-21T10:10:17.2968887Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T10:10:17.2969259Z values = tl.load(x + (indices_0[:, None] * 12032 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first') 2026-02-21T10:10:17.2969657Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T10:10:17.2970113Z _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16)) 2026-02-21T10:10:17.2970500Z local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16) 2026-02-21T10:10:17.2970765Z # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax) 2026-02-21T10:10:17.2971000Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T10:10:17.2971215Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T10:10:17.2971477Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T10:10:17.2971782Z v_2 = mi_copy_0 - v_1 2026-02-21T10:10:17.2971959Z v_3 = libdevice.exp(v_2) 2026-02-21T10:10:17.2972125Z v_4 = di_copy_0 * v_3 2026-02-21T10:10:17.2972323Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T10:10:17.2972524Z subscript = v_1[:, None] 2026-02-21T10:10:17.2972703Z v_5 = tl.cast(values, tl.float32) 2026-02-21T10:10:17.2972920Z v_6 = v_5 - subscript 2026-02-21T10:10:17.2973137Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T10:10:17.2973403Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T10:10:17.2973612Z 
# src[softmax.py:88]: ).sum(dim=1) 2026-02-21T10:10:17.2973798Z v_7 = libdevice.exp(v_6) 2026-02-21T10:10:17.2974109Z _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32)) 2026-02-21T10:10:17.2974466Z sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32) 2026-02-21T10:10:17.2974660Z di = v_4 + sum_1 2026-02-21T10:10:17.2974829Z # src[softmax.py:89]: mi = mi_next 2026-02-21T10:10:17.2975006Z mi = v_1 2026-02-21T10:10:17.2975205Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T10:10:17.2975475Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T10:10:17.2975768Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T10:10:17.2976214Z for offset_2 in tl.range(0, 12032, _BLOCK_SIZE_1, loop_unroll_factor=4, warp_specialize=False, disallow_acc_multi_buffer=False): 2026-02-21T10:10:17.2976610Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T10:10:17.2976846Z mask_2 = indices_2 < 12032 2026-02-21T10:10:17.2977018Z mi_copy_1 = mi 2026-02-21T10:10:17.2977161Z di_copy_1 = di 2026-02-21T10:10:17.2977315Z mi_copy_1_0 = mi_copy_1 2026-02-21T10:10:17.2977475Z di_copy_1_0 = di_copy_1 2026-02-21T10:10:17.2977666Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T10:10:17.2978032Z values_1 = tl.load(x + (indices_0[:, None] * 12032 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first') 2026-02-21T10:10:17.2978512Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T10:10:17.2978797Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T10:10:17.2978987Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T10:10:17.2979181Z v_10 = v_9 - subscript_1 2026-02-21T10:10:17.2979377Z v_11 = libdevice.exp(v_10) 2026-02-21T10:10:17.2979558Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T10:10:17.2979740Z v_12 = v_11 / subscript_2 
2026-02-21T10:10:17.2979925Z v_13 = tl.cast(v_12, tl.float16) 2026-02-21T10:10:17.2980199Z tl.store(out + (indices_0[:, None] * 12032 + indices_2[None, :] * 1), v_13, mask_2[None, :]) 2026-02-21T10:10:17.2980422Z 2026-02-21T10:10:17.2980587Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T10:10:17.2980827Z """ 2026-02-21T10:10:17.2981029Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T10:10:17.2981335Z This version uses fewer passes but is less numerically stable. 2026-02-21T10:10:17.2981585Z Args: 2026-02-21T10:10:17.2981808Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T10:10:17.2982001Z Returns: 2026-02-21T10:10:17.2982185Z torch.Tensor: Softmax output tensor of the same shape. 2026-02-21T10:10:17.2982397Z """ 2026-02-21T10:10:17.2982532Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T10:10:17.2982715Z m, n = x.size() 2026-02-21T10:10:17.2982882Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T10:10:17.2983089Z out = torch.empty_like(x) 2026-02-21T10:10:17.2983310Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T10:10:17.2983630Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T10:10:17.2983934Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T10:10:17.2984173Z # src[softmax.py:79-92]: ... 
2026-02-21T10:10:17.2984429Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=6) 2026-02-21T10:10:17.2984725Z # src[softmax.py:93]: return out 2026-02-21T10:10:17.2984903Z return out 2026-02-21T10:10:18.5000414Z WARNING:tritonbench.utils.triton_op:Completed input ID 92: 2026-02-21T10:10:18.5004674Z (M, N) 2026-02-21T10:10:18.5009195Z ------------- 2026-02-21T10:10:18.5013284Z (4096, 12032) 2026-02-21T10:10:18.5013448Z 2026-02-21T10:10:18.5018978Z 95%|█████████▌| 19/20 [1:01:23<03:46, 226.76s/it]WARNING:tritonbench.utils.triton_op:Running input ID 97: 2026-02-21T10:10:18.5022927Z (M, N) 2026-02-21T10:10:18.5024612Z ------------- 2026-02-21T10:10:18.5024790Z (4096, 12672) 2026-02-21T10:10:18.5025064Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax 2026-02-21T10:10:19.6547503Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax 2026-02-21T10:10:21.0207058Z INFO:tritonbench.utils.triton_op:Took 2.09ms to get benchmark function for torch_compile_softmax 2026-02-21T10:10:22.3439481Z WARNING:__main__:Input tensor metadata: 2026-02-21T10:10:22.3443473Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T10:10:22.3445542Z 'dtype': 'torch.float16', 2026-02-21T10:10:22.3445802Z 'shape': (4096, 12672), 2026-02-21T10:10:22.3446012Z 'stride': (12672, 1)},), 2026-02-21T10:10:22.3446218Z 'kwargs': {}} 2026-02-21T10:10:22.3458069Z INFO:tritonbench.utils.triton_op:Took 1.79ms to get benchmark function for helion_softmax_tritonbench 2026-02-21T10:10:22.5184967Z [0s] Autotune random seed: 2138408546 2026-02-21T10:10:22.5433727Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T10:11:00.2231038Z [37s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', ''], maxnreg=32, num_sm_multiplier=2, num_stages=8, 
num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[3, 2], range_unroll_factors=[4, 1], range_warp_specializes=[False, False]) 2026-02-21T10:11:03.5032819Z [40s] Timeout after 30s compiling Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T10:11:06.3581526Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s 2026-02-21T10:11:08.8679513Z module { 2026-02-21T10:11:08.8683806Z tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T10:11:08.8688445Z %c512_i32 = arith.constant 512 : i32 2026-02-21T10:11:08.8692413Z %cst = arith.constant dense<0.000000e+00> : tensor<8x1024xf16> 2026-02-21T10:11:08.8696856Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T10:11:08.8699021Z %c0_i32 = arith.constant 0 : i32 2026-02-21T10:11:08.8699262Z %c592_i32 = arith.constant 592 : i32 2026-02-21T10:11:08.8699500Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<8x1024xf32> 2026-02-21T10:11:08.8699776Z %cst_1 = arith.constant dense<0xFC00> : tensor<8x1024xf16> 2026-02-21T10:11:08.8700031Z %cst_2 = arith.constant dense<12672> : tensor<8x1xi32> 2026-02-21T10:11:08.8700265Z %cst_3 = arith.constant dense<12672> : tensor<1024xi32> 2026-02-21T10:11:08.8700513Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32> 2026-02-21T10:11:08.8700760Z %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32> 2026-02-21T10:11:08.8700990Z %c8_i32 = arith.constant 8 : i32 2026-02-21T10:11:08.8701176Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T10:11:08.8701369Z %c12672_i32 = 
arith.constant 12672 : i32 2026-02-21T10:11:08.8701761Z %c12672_i64 = arith.constant 12672 : i64 2026-02-21T10:11:08.8702193Z %c1_i64 = arith.constant 1 : i64 2026-02-21T10:11:08.8702529Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c12672_i32], [%c12672_i64, %c1_i64] : , > 2026-02-21T10:11:08.8702858Z %1 = tt.get_program_id x : i32 2026-02-21T10:11:08.8703074Z scf.for %arg2 = %1 to %c512_i32 step %c592_i32 : i32 { 2026-02-21T10:11:08.8703288Z %2 = arith.muli %arg2, %c8_i32 : i32 2026-02-21T10:11:08.8703520Z %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T10:11:08.8703763Z %4 = tt.splat %2 : i32 -> tensor<8xi32> 2026-02-21T10:11:08.8703960Z %5 = arith.addi %4, %3 : tensor<8xi32> 2026-02-21T10:11:08.8704154Z %c12288_i32 = arith.constant 12288 : i32 2026-02-21T10:11:08.8704339Z %c3072_i32 = arith.constant 3072 : i32 2026-02-21T10:11:08.8704714Z %6:2 = scf.for %arg3 = %c0_i32 to %c12288_i32 step %c3072_i32 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<8xf32>, tensor<8xf32>) : i32 { 2026-02-21T10:11:08.8705141Z %66 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T10:11:08.8705408Z %67 = tt.splat %arg3 : i32 -> tensor<1024xi32> 2026-02-21T10:11:08.8705618Z %68 = arith.addi %67, %66 : tensor<1024xi32> 2026-02-21T10:11:08.8705844Z %69 = arith.cmpi slt, %68, %cst_3 : tensor<1024xi32> 2026-02-21T10:11:08.8706111Z %70 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T10:11:08.8706366Z %71 = arith.muli %70, %cst_2 : tensor<8x1xi32> 2026-02-21T10:11:08.8706629Z %72 = tt.expand_dims %68 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T10:11:08.8706926Z %73 = tt.broadcast %71 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T10:11:08.8707282Z %74 = tt.broadcast %72 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T10:11:08.8707665Z %75 = arith.addi %73, %74 : tensor<8x1024xi32> 2026-02-21T10:11:08.8707909Z %76 = tt.splat %arg0 : !tt.ptr -> 
tensor<8x1024x!tt.ptr> 2026-02-21T10:11:08.8708231Z %77 = tt.addptr %76, %75 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T10:11:08.8711285Z %78 = tt.expand_dims %69 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T10:11:08.8711619Z %79 = tt.broadcast %78 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T10:11:08.8711918Z %80 = tt.load %77, %79, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T10:11:08.8712186Z %81 = arith.select %79, %80, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T10:11:08.8712477Z %82 = arith.extf %81 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T10:11:08.8712772Z %83 = "tt.reduce"(%82) <{axis = 1 : i32}> ({ 2026-02-21T10:11:08.8712968Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T10:11:08.8713168Z %175 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T10:11:08.8713361Z tt.reduce.return %175 : f32 2026-02-21T10:11:08.8713567Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T10:11:08.8713834Z %84 = arith.truncf %83 : tensor<8xf32> to tensor<8xf16> 2026-02-21T10:11:08.8714076Z %85 = arith.extf %84 : tensor<8xf16> to tensor<8xf32> 2026-02-21T10:11:08.8714308Z %86 = arith.cmpf ogt, %arg4, %85 : tensor<8xf32> 2026-02-21T10:11:08.8714530Z %87 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32> 2026-02-21T10:11:08.8714745Z %88 = arith.ori %86, %87 : tensor<8xi1> 2026-02-21T10:11:08.8714968Z %89 = arith.select %88, %arg4, %85 : tensor<8xi1>, tensor<8xf32> 2026-02-21T10:11:08.8715212Z %90 = arith.subf %arg4, %89 : tensor<8xf32> 2026-02-21T10:11:08.8715569Z %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T10:11:08.8715934Z %92 = arith.mulf %arg5, %91 : tensor<8xf32> 2026-02-21T10:11:08.8716191Z %93 = tt.expand_dims %89 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T10:11:08.8716511Z %94 = arith.extf %80 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T10:11:08.8716785Z %95 = tt.broadcast %93 : 
tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T10:11:08.8717016Z %96 = arith.subf %94, %95 : tensor<8x1024xf32> 2026-02-21T10:11:08.8717380Z %97 = tt.extern_elementwise %96 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T10:11:08.8717786Z %98 = arith.select %79, %97, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T10:11:08.8718045Z %99 = "tt.reduce"(%98) <{axis = 1 : i32}> ({ 2026-02-21T10:11:08.8718245Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T10:11:08.8718425Z %175 = arith.addf %arg6, %arg7 : f32 2026-02-21T10:11:08.8718618Z tt.reduce.return %175 : f32 2026-02-21T10:11:08.8718799Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T10:11:08.8719006Z %100 = arith.addf %92, %99 : tensor<8xf32> 2026-02-21T10:11:08.8719206Z %c1_i32 = arith.constant 1 : i32 2026-02-21T10:11:08.8719393Z %101 = arith.muli %c1024_i32, %c1_i32 : i32 2026-02-21T10:11:08.8719589Z %102 = arith.addi %arg3, %101 : i32 2026-02-21T10:11:08.8719828Z %103 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T10:11:08.8720092Z %104 = tt.splat %102 : i32 -> tensor<1024xi32> 2026-02-21T10:11:08.8720308Z %105 = arith.addi %104, %103 : tensor<1024xi32> 2026-02-21T10:11:08.8720544Z %106 = arith.cmpi slt, %105, %cst_3 : tensor<1024xi32> 2026-02-21T10:11:08.8720826Z %107 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T10:11:08.8721105Z %108 = arith.muli %107, %cst_2 : tensor<8x1xi32> 2026-02-21T10:11:08.8721429Z %109 = tt.expand_dims %105 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T10:11:08.8721779Z %110 = tt.broadcast %108 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T10:11:08.8722073Z %111 = tt.broadcast %109 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T10:11:08.8722335Z %112 = arith.addi %110, %111 : tensor<8x1024xi32> 2026-02-21T10:11:08.8722594Z %113 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 
2026-02-21T10:11:08.8722902Z %114 = tt.addptr %113, %112 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T10:11:08.8723229Z %115 = tt.expand_dims %106 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T10:11:08.8723551Z %116 = tt.broadcast %115 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T10:11:08.8723824Z %117 = tt.load %114, %116, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T10:11:08.8724155Z %118 = arith.select %116, %117, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T10:11:08.8724470Z %119 = arith.extf %118 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T10:11:08.8724723Z %120 = "tt.reduce"(%119) <{axis = 1 : i32}> ({ 2026-02-21T10:11:08.8724963Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T10:11:08.8725154Z %175 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T10:11:08.8725358Z tt.reduce.return %175 : f32 2026-02-21T10:11:08.8725549Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T10:11:08.8725788Z %121 = arith.truncf %120 : tensor<8xf32> to tensor<8xf16> 2026-02-21T10:11:08.8726042Z %122 = arith.extf %121 : tensor<8xf16> to tensor<8xf32> 2026-02-21T10:11:08.8726288Z %123 = arith.cmpf ogt, %89, %122 : tensor<8xf32> 2026-02-21T10:11:08.8726518Z %124 = arith.cmpf une, %89, %89 : tensor<8xf32> 2026-02-21T10:11:08.8726729Z %125 = arith.ori %123, %124 : tensor<8xi1> 2026-02-21T10:11:08.8726972Z %126 = arith.select %125, %89, %122 : tensor<8xi1>, tensor<8xf32> 2026-02-21T10:11:08.8727218Z %127 = arith.subf %89, %126 : tensor<8xf32> 2026-02-21T10:11:08.8727640Z %128 = tt.extern_elementwise %127 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T10:11:08.8728009Z %129 = arith.mulf %100, %128 : tensor<8xf32> 2026-02-21T10:11:08.8728257Z %130 = tt.expand_dims %126 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T10:11:08.8728557Z %131 = arith.extf %117 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T10:11:08.8728825Z %132 = tt.broadcast %130 : 
tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T10:11:08.8729074Z %133 = arith.subf %131, %132 : tensor<8x1024xf32> 2026-02-21T10:11:08.8729446Z %134 = tt.extern_elementwise %133 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T10:11:08.8729873Z %135 = arith.select %116, %134, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T10:11:08.8730144Z %136 = "tt.reduce"(%135) <{axis = 1 : i32}> ({ 2026-02-21T10:11:08.8730333Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T10:11:08.8730519Z %175 = arith.addf %arg6, %arg7 : f32 2026-02-21T10:11:08.8730754Z tt.reduce.return %175 : f32 2026-02-21T10:11:08.8730938Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T10:11:08.8731143Z %137 = arith.addf %129, %136 : tensor<8xf32> 2026-02-21T10:11:08.8731342Z %c2_i32 = arith.constant 2 : i32 2026-02-21T10:11:08.8731528Z %138 = arith.muli %c1024_i32, %c2_i32 : i32 2026-02-21T10:11:08.8731772Z %139 = arith.addi %arg3, %138 : i32 2026-02-21T10:11:08.8732017Z %140 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T10:11:08.8732281Z %141 = tt.splat %139 : i32 -> tensor<1024xi32> 2026-02-21T10:11:08.8732492Z %142 = arith.addi %141, %140 : tensor<1024xi32> 2026-02-21T10:11:08.8732774Z %143 = arith.cmpi slt, %142, %cst_3 : tensor<1024xi32> 2026-02-21T10:11:08.8733040Z %144 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T10:11:08.8733309Z %145 = arith.muli %144, %cst_2 : tensor<8x1xi32> 2026-02-21T10:11:08.8733580Z %146 = tt.expand_dims %142 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T10:11:08.8733887Z %147 = tt.broadcast %145 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T10:11:08.8734166Z %148 = tt.broadcast %146 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T10:11:08.8734418Z %149 = arith.addi %147, %148 : tensor<8x1024xi32> 2026-02-21T10:11:08.8734666Z %150 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 
2026-02-21T10:11:08.8734950Z %151 = tt.addptr %150, %149 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T10:11:08.8735302Z %152 = tt.expand_dims %143 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T10:11:08.8735607Z %153 = tt.broadcast %152 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T10:11:08.8735861Z %154 = tt.load %151, %153, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T10:11:08.8736169Z %155 = arith.select %153, %154, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T10:11:08.8736451Z %156 = arith.extf %155 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T10:11:08.8736691Z %157 = "tt.reduce"(%156) <{axis = 1 : i32}> ({ 2026-02-21T10:11:08.8736879Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T10:11:08.8737070Z %175 = arith.maxnumf %arg6, %arg7 : f32 2026-02-21T10:11:08.8737267Z tt.reduce.return %175 : f32 2026-02-21T10:11:08.8737449Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T10:11:08.8737677Z %158 = arith.truncf %157 : tensor<8xf32> to tensor<8xf16> 2026-02-21T10:11:08.8737916Z %159 = arith.extf %158 : tensor<8xf16> to tensor<8xf32> 2026-02-21T10:11:08.8738152Z %160 = arith.cmpf ogt, %126, %159 : tensor<8xf32> 2026-02-21T10:11:08.8738364Z %161 = arith.cmpf une, %126, %126 : tensor<8xf32> 2026-02-21T10:11:08.8738603Z %162 = arith.ori %160, %161 : tensor<8xi1> 2026-02-21T10:11:08.8738840Z %163 = arith.select %162, %126, %159 : tensor<8xi1>, tensor<8xf32> 2026-02-21T10:11:08.8739077Z %164 = arith.subf %126, %163 : tensor<8xf32> 2026-02-21T10:11:08.8739439Z %165 = tt.extern_elementwise %164 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T10:11:08.8739796Z %166 = arith.mulf %137, %165 : tensor<8xf32> 2026-02-21T10:11:08.8740050Z %167 = tt.expand_dims %163 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T10:11:08.8740333Z %168 = arith.extf %154 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T10:11:08.8740604Z %169 = tt.broadcast %167 : 
tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T10:11:08.8740852Z %170 = arith.subf %168, %169 : tensor<8x1024xf32> 2026-02-21T10:11:08.8741223Z %171 = tt.extern_elementwise %170 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T10:11:08.8741678Z %172 = arith.select %153, %171, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T10:11:08.8741937Z %173 = "tt.reduce"(%172) <{axis = 1 : i32}> ({ 2026-02-21T10:11:08.8742136Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T10:11:08.8742323Z %175 = arith.addf %arg6, %arg7 : f32 2026-02-21T10:11:08.8742508Z tt.reduce.return %175 : f32 2026-02-21T10:11:08.8742696Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T10:11:08.8742891Z %174 = arith.addf %166, %173 : tensor<8xf32> 2026-02-21T10:11:08.8743114Z scf.yield %163, %174 : tensor<8xf32>, tensor<8xf32> 2026-02-21T10:11:08.8743314Z } {tt.flatten} 2026-02-21T10:11:08.8743522Z %7 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T10:11:08.8743855Z %8 = tt.splat %c12288_i32 : i32 -> tensor<1024xi32> 2026-02-21T10:11:08.8744074Z %9 = arith.addi %8, %7 : tensor<1024xi32> 2026-02-21T10:11:08.8744291Z %10 = arith.cmpi slt, %9, %cst_3 : tensor<1024xi32> 2026-02-21T10:11:08.8744548Z %11 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T10:11:08.8744805Z %12 = arith.muli %11, %cst_2 : tensor<8x1xi32> 2026-02-21T10:11:08.8745053Z %13 = tt.expand_dims %9 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T10:11:08.8745357Z %14 = tt.broadcast %12 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T10:11:08.8745625Z %15 = tt.broadcast %13 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T10:11:08.8745871Z %16 = arith.addi %14, %15 : tensor<8x1024xi32> 2026-02-21T10:11:08.8746144Z %17 = tt.splat %arg0 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T10:11:08.8746419Z %18 = tt.addptr %17, %16 : tensor<8x1024x!tt.ptr>, 
tensor<8x1024xi32> 2026-02-21T10:11:08.8746722Z %19 = tt.expand_dims %10 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T10:11:08.8747008Z %20 = tt.broadcast %19 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T10:11:08.8747292Z %21 = tt.load %18, %20, %cst : tensor<8x1024x!tt.ptr> 2026-02-21T10:11:08.8747559Z %22 = arith.select %20, %21, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16> 2026-02-21T10:11:08.8747834Z %23 = arith.extf %22 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T10:11:08.8748068Z %24 = "tt.reduce"(%23) <{axis = 1 : i32}> ({ 2026-02-21T10:11:08.8748257Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T10:11:08.8748445Z %66 = arith.maxnumf %arg3, %arg4 : f32 2026-02-21T10:11:08.8748633Z tt.reduce.return %66 : f32 2026-02-21T10:11:08.8748822Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T10:11:08.8749040Z %25 = arith.truncf %24 : tensor<8xf32> to tensor<8xf16> 2026-02-21T10:11:08.8749279Z %26 = arith.extf %25 : tensor<8xf16> to tensor<8xf32> 2026-02-21T10:11:08.8749505Z %27 = arith.cmpf ogt, %6#0, %26 : tensor<8xf32> 2026-02-21T10:11:08.8749752Z %28 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32> 2026-02-21T10:11:08.8749960Z %29 = arith.ori %27, %28 : tensor<8xi1> 2026-02-21T10:11:08.8750178Z %30 = arith.select %29, %6#0, %26 : tensor<8xi1>, tensor<8xf32> 2026-02-21T10:11:08.8750407Z %31 = arith.subf %6#0, %30 : tensor<8xf32> 2026-02-21T10:11:08.8750757Z %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32> 2026-02-21T10:11:08.8751120Z %33 = arith.mulf %6#1, %32 : tensor<8xf32> 2026-02-21T10:11:08.8751372Z %34 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T10:11:08.8751691Z %35 = arith.extf %21 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T10:11:08.8751959Z %36 = tt.broadcast %34 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T10:11:08.8752192Z %37 = arith.subf %35, %36 : tensor<8x1024xf32> 
2026-02-21T10:11:08.8752577Z %38 = tt.extern_elementwise %37 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T10:11:08.8752991Z %39 = arith.select %20, %38, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T10:11:08.8753240Z %40 = "tt.reduce"(%39) <{axis = 1 : i32}> ({ 2026-02-21T10:11:08.8753440Z ^bb0(%arg3: f32, %arg4: f32): 2026-02-21T10:11:08.8753620Z %66 = arith.addf %arg3, %arg4 : f32 2026-02-21T10:11:08.8753812Z tt.reduce.return %66 : f32 2026-02-21T10:11:08.8753995Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T10:11:08.8754200Z %41 = arith.addf %33, %40 : tensor<8xf32> 2026-02-21T10:11:08.8754414Z %c12288_i32_6 = arith.constant 12288 : i32 2026-02-21T10:11:08.8754611Z %c3072_i32_7 = arith.constant 3072 : i32 2026-02-21T10:11:08.8754901Z scf.for %arg3 = %c0_i32 to %c12288_i32_6 step %c3072_i32_7 : i32 { 2026-02-21T10:11:08.8755183Z %66 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T10:11:08.8755448Z %67 = tt.splat %arg3 : i32 -> tensor<1024xi32> 2026-02-21T10:11:08.8755653Z %68 = arith.addi %67, %66 : tensor<1024xi32> 2026-02-21T10:11:08.8755873Z %69 = arith.cmpi slt, %68, %cst_3 : tensor<1024xi32> 2026-02-21T10:11:08.8756185Z %70 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T10:11:08.8756524Z %71 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T10:11:08.8756815Z %72 = arith.extf %70 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T10:11:08.8757074Z %73 = tt.broadcast %71 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T10:11:08.8757345Z %74 = arith.subf %72, %73 : tensor<8x1024xf32> 2026-02-21T10:11:08.8757711Z %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T10:11:08.8758126Z %76 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 
2026-02-21T10:11:08.8758442Z %77 = tt.broadcast %76 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T10:11:08.8758671Z %78 = arith.divf %75, %77 : tensor<8x1024xf32> 2026-02-21T10:11:08.8758912Z %79 = arith.truncf %78 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T10:11:08.8759192Z %80 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T10:11:08.8759456Z %81 = arith.muli %80, %cst_2 : tensor<8x1xi32> 2026-02-21T10:11:08.8759773Z %82 = tt.expand_dims %68 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T10:11:08.8760058Z %83 = tt.broadcast %81 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T10:11:08.8760321Z %84 = tt.broadcast %82 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T10:11:08.8760553Z %85 = arith.addi %83, %84 : tensor<8x1024xi32> 2026-02-21T10:11:08.8760821Z %86 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T10:11:08.8761107Z %87 = tt.addptr %86, %85 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T10:11:08.8761414Z %88 = tt.expand_dims %69 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T10:11:08.8761745Z %89 = tt.broadcast %88 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T10:11:08.8761990Z tt.store %87, %79, %89 : tensor<8x1024x!tt.ptr> 2026-02-21T10:11:08.8762213Z %c1_i32 = arith.constant 1 : i32 2026-02-21T10:11:08.8762401Z %90 = arith.muli %c1024_i32, %c1_i32 : i32 2026-02-21T10:11:08.8762598Z %91 = arith.addi %arg3, %90 : i32 2026-02-21T10:11:08.8762829Z %92 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T10:11:08.8763087Z %93 = tt.splat %91 : i32 -> tensor<1024xi32> 2026-02-21T10:11:08.8763295Z %94 = arith.addi %93, %92 : tensor<1024xi32> 2026-02-21T10:11:08.8763506Z %95 = arith.cmpi slt, %94, %cst_3 : tensor<1024xi32> 2026-02-21T10:11:08.8763808Z %96 = tt.descriptor_load %0[%2, %91] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T10:11:08.8764137Z %97 = tt.expand_dims %30 {axis = 1 : i32} : 
tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T10:11:08.8764421Z %98 = arith.extf %96 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T10:11:08.8764679Z %99 = tt.broadcast %97 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T10:11:08.8764923Z %100 = arith.subf %98, %99 : tensor<8x1024xf32> 2026-02-21T10:11:08.8765299Z %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T10:11:08.8765712Z %102 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T10:11:08.8766030Z %103 = tt.broadcast %102 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T10:11:08.8766275Z %104 = arith.divf %101, %103 : tensor<8x1024xf32> 2026-02-21T10:11:08.8766524Z %105 = arith.truncf %104 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T10:11:08.8766815Z %106 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T10:11:08.8767075Z %107 = arith.muli %106, %cst_2 : tensor<8x1xi32> 2026-02-21T10:11:08.8767340Z %108 = tt.expand_dims %94 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T10:11:08.8767635Z %109 = tt.broadcast %107 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T10:11:08.8767909Z %110 = tt.broadcast %108 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T10:11:08.8768155Z %111 = arith.addi %109, %110 : tensor<8x1024xi32> 2026-02-21T10:11:08.8768432Z %112 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T10:11:08.8768723Z %113 = tt.addptr %112, %111 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T10:11:08.8769028Z %114 = tt.expand_dims %95 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T10:11:08.8769357Z %115 = tt.broadcast %114 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T10:11:08.8769613Z tt.store %113, %105, %115 : tensor<8x1024x!tt.ptr> 2026-02-21T10:11:08.8769838Z %c2_i32 = arith.constant 2 : i32 2026-02-21T10:11:08.8770035Z %116 = arith.muli 
%c1024_i32, %c2_i32 : i32 2026-02-21T10:11:08.8770229Z %117 = arith.addi %arg3, %116 : i32 2026-02-21T10:11:08.8770475Z %118 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T10:11:08.8770735Z %119 = tt.splat %117 : i32 -> tensor<1024xi32> 2026-02-21T10:11:08.8770951Z %120 = arith.addi %119, %118 : tensor<1024xi32> 2026-02-21T10:11:08.8771170Z %121 = arith.cmpi slt, %120, %cst_3 : tensor<1024xi32> 2026-02-21T10:11:08.8771485Z %122 = tt.descriptor_load %0[%2, %117] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T10:11:08.8771913Z %123 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T10:11:08.8772217Z %124 = arith.extf %122 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T10:11:08.8772501Z %125 = tt.broadcast %123 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T10:11:08.8772749Z %126 = arith.subf %124, %125 : tensor<8x1024xf32> 2026-02-21T10:11:08.8773150Z %127 = tt.extern_elementwise %126 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T10:11:08.8773595Z %128 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T10:11:08.8773895Z %129 = tt.broadcast %128 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T10:11:08.8774154Z %130 = arith.divf %127, %129 : tensor<8x1024xf32> 2026-02-21T10:11:08.8774409Z %131 = arith.truncf %130 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T10:11:08.8774712Z %132 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T10:11:08.8774986Z %133 = arith.muli %132, %cst_2 : tensor<8x1xi32> 2026-02-21T10:11:08.8775277Z %134 = tt.expand_dims %120 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T10:11:08.8775597Z %135 = tt.broadcast %133 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T10:11:08.8775878Z %136 = tt.broadcast %134 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T10:11:08.8776145Z %137 = 
arith.addi %135, %136 : tensor<8x1024xi32> 2026-02-21T10:11:08.8776395Z %138 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T10:11:08.8776703Z %139 = tt.addptr %138, %137 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T10:11:08.8777028Z %140 = tt.expand_dims %121 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T10:11:08.8777378Z %141 = tt.broadcast %140 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T10:11:08.8777657Z tt.store %139, %131, %141 : tensor<8x1024x!tt.ptr> 2026-02-21T10:11:08.8777883Z } {tt.flatten} 2026-02-21T10:11:08.8778101Z %42 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T10:11:08.8778385Z %43 = tt.splat %c12288_i32_6 : i32 -> tensor<1024xi32> 2026-02-21T10:11:08.8778619Z %44 = arith.addi %43, %42 : tensor<1024xi32> 2026-02-21T10:11:08.8778976Z %45 = arith.cmpi slt, %44, %cst_3 : tensor<1024xi32> 2026-02-21T10:11:08.8779333Z %46 = tt.descriptor_load %0[%2, %c12288_i32_6] : !tt.tensordesc> -> tensor<8x1024xf16> 2026-02-21T10:11:08.8779718Z %47 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T10:11:08.8780046Z %48 = arith.extf %46 : tensor<8x1024xf16> to tensor<8x1024xf32> 2026-02-21T10:11:08.8780313Z %49 = tt.broadcast %47 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T10:11:08.8780544Z %50 = arith.subf %48, %49 : tensor<8x1024xf32> 2026-02-21T10:11:08.8780916Z %51 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T10:11:08.8781349Z %52 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32> 2026-02-21T10:11:08.8781654Z %53 = tt.broadcast %52 : tensor<8x1xf32> -> tensor<8x1024xf32> 2026-02-21T10:11:08.8781883Z %54 = arith.divf %51, %53 : tensor<8x1024xf32> 2026-02-21T10:11:08.8782123Z %55 = arith.truncf %54 : tensor<8x1024xf32> to tensor<8x1024xf16> 2026-02-21T10:11:08.8782405Z %56 = tt.expand_dims %5 {axis = 1 : i32} 
: tensor<8xi32> -> tensor<8x1xi32> 2026-02-21T10:11:08.8782668Z %57 = arith.muli %56, %cst_2 : tensor<8x1xi32> 2026-02-21T10:11:08.8782931Z %58 = tt.expand_dims %44 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T10:11:08.8783226Z %59 = tt.broadcast %57 : tensor<8x1xi32> -> tensor<8x1024xi32> 2026-02-21T10:11:08.8783524Z %60 = tt.broadcast %58 : tensor<1x1024xi32> -> tensor<8x1024xi32> 2026-02-21T10:11:08.8783767Z %61 = arith.addi %59, %60 : tensor<8x1024xi32> 2026-02-21T10:11:08.8783998Z %62 = tt.splat %arg1 : !tt.ptr -> tensor<8x1024x!tt.ptr> 2026-02-21T10:11:08.8784275Z %63 = tt.addptr %62, %61 : tensor<8x1024x!tt.ptr>, tensor<8x1024xi32> 2026-02-21T10:11:08.8784566Z %64 = tt.expand_dims %45 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1> 2026-02-21T10:11:08.8784860Z %65 = tt.broadcast %64 : tensor<1x1024xi1> -> tensor<8x1024xi1> 2026-02-21T10:11:08.8785110Z tt.store %63, %55, %65 : tensor<8x1024x!tt.ptr> 2026-02-21T10:11:08.8785476Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T10:11:08.8785820Z tt.return 2026-02-21T10:11:08.8785948Z } 2026-02-21T10:11:08.8786075Z } 2026-02-21T10:11:08.8786144Z 2026-02-21T10:11:08.8786193Z {-# 2026-02-21T10:11:08.8786332Z external_resources: { 2026-02-21T10:11:08.8786490Z mlir_reproducer: { 2026-02-21T10:11:08.8790806Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false 
top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T10:11:08.8795263Z disable_threading: false, 2026-02-21T10:11:08.8795438Z verify_each: true 2026-02-21T10:11:08.8795581Z } 2026-02-21T10:11:08.8795707Z } 2026-02-21T10:11:08.8795818Z #-} 2026-02-21T10:11:08.8796243Z /tmp/torchinductor_root/4d/c4dzntrkfazw2bra6phnmfryd2xwzdozc56m5imk4xkozxkd2n3l.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T10:11:08.8797438Z /tmp/torchinductor_root/4d/c4dzntrkfazw2bra6phnmfryd2xwzdozc56m5imk4xkozxkd2n3l.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T10:11:08.8798416Z [46s] Triton compile failed. 
This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T10:11:08.8799499Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=4, num_warps=32, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[4, 0], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T10:11:08.8800444Z Error: RuntimeError: PassManager::run failed 2026-02-21T10:11:08.8800696Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T10:11:16.0629448Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 10.2 configs/s 2026-02-21T10:11:16.0639878Z [53s] Adaptive compile timeout: 30s (90% percentile=19.2s, bounds=[30.0s, 30s]) 2026-02-21T10:11:17.5460951Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 664.7 configs/s 2026-02-21T10:11:17.6304034Z [55s] Initial random population of 100, 5 starting points: 2026-02-21T10:11:17.6308244Z error=12 2026-02-21T10:11:17.6309891Z timeout=2 2026-02-21T10:11:17.6310065Z ok=86 2026-02-21T10:11:17.6310211Z min=0.0901 2026-02-21T10:11:17.6310353Z mid=0.6880 2026-02-21T10:11:17.6310486Z max=293.4436 2026-02-21T10:11:17.6310634Z best={'block_sizes': [1, 1024], 2026-02-21T10:11:17.6310873Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:11:17.6311124Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T10:11:17.6311313Z 'num_stages': 6, 2026-02-21T10:11:17.6311454Z 'num_warps': 4, 2026-02-21T10:11:17.6311686Z 'pid_type': 'flat', 2026-02-21T10:11:17.6311844Z 'range_flattens': [None, None], 2026-02-21T10:11:17.6312030Z 'range_multi_buffers': [None, True], 2026-02-21T10:11:17.6312212Z 'range_num_stages': [0, 0], 2026-02-21T10:11:17.6312385Z 'range_unroll_factors': [0, 4], 2026-02-21T10:11:17.6312574Z 'range_warp_specializes': 
[None, False]} 2026-02-21T10:11:17.6329152Z [55s] Fitting surrogate: 100 points, 100 targets 2026-02-21T10:11:18.7696217Z [56s] Generation 1 starting: 83 neighbors, 5 active search path(s) 2026-02-21T10:11:52.7590910Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 0.3 configs/s 2026-02-21T10:11:57.8599743Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 17.0 configs/s 2026-02-21T10:12:03.6672162Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 173.1 2026-02-21T10:12:03.6673558Z configs/s 2026-02-21T10:12:03.9132603Z [101s] Generation 1 complete: 2026-02-21T10:12:03.9135775Z ok=89 2026-02-21T10:12:03.9139705Z min=0.0737 2026-02-21T10:12:03.9143566Z mid=0.1127 2026-02-21T10:12:03.9148097Z max=0.7374 2026-02-21T10:12:03.9149745Z best={'block_sizes': [1, 512], 2026-02-21T10:12:03.9150078Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'], 2026-02-21T10:12:03.9154753Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T10:12:03.9159111Z 'num_stages': 6, 2026-02-21T10:12:03.9160507Z 'num_warps': 1, 2026-02-21T10:12:03.9160975Z 'pid_type': 'flat', 2026-02-21T10:12:03.9161174Z 'range_flattens': [None, None], 2026-02-21T10:12:03.9161395Z 'range_multi_buffers': [None, True], 2026-02-21T10:12:03.9161817Z 'range_num_stages': [0, 0], 2026-02-21T10:12:03.9162017Z 'range_unroll_factors': [0, 4], 2026-02-21T10:12:03.9162318Z 'range_warp_specializes': [None, False]} 2026-02-21T10:12:03.9162636Z [101s] Fitting surrogate: 189 points, 189 targets 2026-02-21T10:12:04.9121884Z [102s] Generation 2 starting: 71 neighbors, 5 active search path(s) 2026-02-21T10:12:22.7382177Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 5.3 configs/s 2026-02-21T10:12:27.0999093Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 17.1 configs/s 2026-02-21T10:12:34.1513259Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 142.9 2026-02-21T10:12:34.1513902Z configs/s 
2026-02-21T10:12:34.4584483Z [131s] Generation 2 complete: 2026-02-21T10:12:34.4588197Z ok=77 2026-02-21T10:12:34.4592653Z min=0.0676 2026-02-21T10:12:34.4597260Z mid=0.0840 2026-02-21T10:12:34.4597492Z max=0.2999 2026-02-21T10:12:34.4597648Z best={'block_sizes': [1, 1024], 2026-02-21T10:12:34.4598197Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:12:34.4598512Z 'load_eviction_policies': ['', 'last'], 2026-02-21T10:12:34.4598738Z 'num_stages': 3, 2026-02-21T10:12:34.4598913Z 'num_warps': 1, 2026-02-21T10:12:34.4599065Z 'pid_type': 'flat', 2026-02-21T10:12:34.4599287Z 'range_flattens': [None, False], 2026-02-21T10:12:34.4599500Z 'range_multi_buffers': [None, None], 2026-02-21T10:12:34.4605628Z 'range_num_stages': [0, 2], 2026-02-21T10:12:34.4605864Z 'range_unroll_factors': [0, 3], 2026-02-21T10:12:34.4606059Z 'range_warp_specializes': [None, None]} 2026-02-21T10:12:34.4606288Z [131s] Fitting surrogate: 266 points, 266 targets 2026-02-21T10:12:35.3041486Z [132s] Generation 3 starting: 58 neighbors, 5 active search path(s) 2026-02-21T10:12:50.2638018Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 1.8 configs/s 2026-02-21T10:12:53.9015223Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 62/62 17.3 configs/s 2026-02-21T10:13:00.6007286Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 150.4 2026-02-21T10:13:00.6008660Z configs/s 2026-02-21T10:13:00.9107488Z [158s] Generation 3 complete: 2026-02-21T10:13:00.9112268Z ok=64 2026-02-21T10:13:00.9115478Z min=0.0676 2026-02-21T10:13:00.9119948Z mid=0.0778 2026-02-21T10:13:00.9122538Z max=0.2601 2026-02-21T10:13:00.9122713Z best={'block_sizes': [1, 512], 2026-02-21T10:13:00.9122977Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T10:13:00.9123246Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T10:13:00.9123444Z 'num_stages': 3, 2026-02-21T10:13:00.9123585Z 'num_warps': 1, 2026-02-21T10:13:00.9123731Z 'pid_type': 
'flat', 2026-02-21T10:13:00.9123911Z 'range_flattens': [None, True], 2026-02-21T10:13:00.9124437Z 'range_multi_buffers': [None, None], 2026-02-21T10:13:00.9124628Z 'range_num_stages': [0, 2], 2026-02-21T10:13:00.9124793Z 'range_unroll_factors': [0, 4], 2026-02-21T10:13:00.9124989Z 'range_warp_specializes': [None, None]} 2026-02-21T10:13:00.9125212Z [158s] Fitting surrogate: 330 points, 330 targets 2026-02-21T10:13:01.7151698Z [159s] Generation 4 starting: 52 neighbors, 5 active search path(s) 2026-02-21T10:13:14.5845382Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 2.8 configs/s 2026-02-21T10:13:17.7010833Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 17.3 configs/s 2026-02-21T10:13:23.5849780Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 171.2 2026-02-21T10:13:23.5851024Z configs/s 2026-02-21T10:13:23.8584164Z [181s] Generation 4 complete: 2026-02-21T10:13:23.8586149Z ok=57 2026-02-21T10:13:23.8586337Z min=0.0655 2026-02-21T10:13:23.8586483Z mid=0.0738 2026-02-21T10:13:23.8586604Z max=0.3216 2026-02-21T10:13:23.8586746Z best={'block_sizes': [1, 1024], 2026-02-21T10:13:23.8586995Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T10:13:23.8587460Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T10:13:23.8587664Z 'num_stages': 8, 2026-02-21T10:13:23.8587809Z 'num_warps': 1, 2026-02-21T10:13:23.8587961Z 'pid_type': 'flat', 2026-02-21T10:13:23.8588116Z 'range_flattens': [None, None], 2026-02-21T10:13:23.8588300Z 'range_multi_buffers': [None, None], 2026-02-21T10:13:23.8588482Z 'range_num_stages': [0, 3], 2026-02-21T10:13:23.8588658Z 'range_unroll_factors': [0, 2], 2026-02-21T10:13:23.8588842Z 'range_warp_specializes': [None, None]} 2026-02-21T10:13:23.8599912Z [181s] Fitting surrogate: 387 points, 387 targets 2026-02-21T10:13:24.4364517Z [181s] Generation 5 starting: 34 neighbors, 3 active search path(s) 2026-02-21T10:13:32.8357743Z Generation 5: precompiling 100% 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 3.1 configs/s 2026-02-21T10:13:34.8437697Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 34/34 17.3 configs/s 2026-02-21T10:13:38.5568633Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 270.8 2026-02-21T10:13:38.5569033Z configs/s 2026-02-21T10:13:38.7475025Z [196s] Generation 5 complete: 2026-02-21T10:13:38.7479935Z ok=37 2026-02-21T10:13:38.7483795Z min=0.0655 2026-02-21T10:13:38.7488760Z mid=0.0685 2026-02-21T10:13:38.7490747Z max=0.2909 2026-02-21T10:13:38.7490955Z best={'block_sizes': [1, 1024], 2026-02-21T10:13:38.7491270Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T10:13:38.7491658Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T10:13:38.7491886Z 'num_stages': 8, 2026-02-21T10:13:38.7492042Z 'num_warps': 1, 2026-02-21T10:13:38.7492195Z 'pid_type': 'flat', 2026-02-21T10:13:38.7492380Z 'range_flattens': [None, None], 2026-02-21T10:13:38.7492582Z 'range_multi_buffers': [None, None], 2026-02-21T10:13:38.7492766Z 'range_num_stages': [0, 3], 2026-02-21T10:13:38.7492939Z 'range_unroll_factors': [0, 2], 2026-02-21T10:13:38.7493128Z 'range_warp_specializes': [None, None]} 2026-02-21T10:13:38.7493354Z [196s] Fitting surrogate: 424 points, 424 targets 2026-02-21T10:13:39.1828793Z [196s] Generation 6 starting: 21 neighbors, 2 active search path(s) 2026-02-21T10:13:44.6847005Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 5.3 configs/s 2026-02-21T10:13:45.9219371Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 17.6 configs/s 2026-02-21T10:13:48.1878981Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 441.3 2026-02-21T10:13:48.1882932Z configs/s 2026-02-21T10:13:48.3109245Z [205s] Generation 6 complete: 2026-02-21T10:13:48.3113411Z ok=23 2026-02-21T10:13:48.3115258Z min=0.0655 2026-02-21T10:13:48.3115482Z mid=0.0676 2026-02-21T10:13:48.3115772Z max=0.2459 2026-02-21T10:13:48.3115947Z best={'block_sizes': 
[1, 1024], 2026-02-21T10:13:48.3116239Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T10:13:48.3116571Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T10:13:48.3116884Z 'num_stages': 8, 2026-02-21T10:13:48.3117032Z 'num_warps': 1, 2026-02-21T10:13:48.3117190Z 'pid_type': 'flat', 2026-02-21T10:13:48.3117354Z 'range_flattens': [None, None], 2026-02-21T10:13:48.3117555Z 'range_multi_buffers': [None, None], 2026-02-21T10:13:48.3117758Z 'range_num_stages': [0, 4], 2026-02-21T10:13:48.3117932Z 'range_unroll_factors': [0, 2], 2026-02-21T10:13:48.3118130Z 'range_warp_specializes': [None, None]} 2026-02-21T10:13:48.3144135Z [205s] Fitting surrogate: 447 points, 447 targets 2026-02-21T10:13:48.6066004Z [206s] Generation 7 starting: 11 neighbors, 1 active search path(s) 2026-02-21T10:13:51.5454995Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 5.7 configs/s 2026-02-21T10:13:52.1846939Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 11/11 18.6 configs/s 2026-02-21T10:13:53.5229308Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 741.6 2026-02-21T10:13:53.5231068Z configs/s 2026-02-21T10:13:53.6046988Z [211s] Generation 7 complete: 2026-02-21T10:13:53.6051925Z ok=13 2026-02-21T10:13:53.6055107Z min=0.0655 2026-02-21T10:13:53.6059564Z mid=0.0657 2026-02-21T10:13:53.6063932Z max=0.1107 2026-02-21T10:13:53.6065438Z best={'block_sizes': [1, 1024], 2026-02-21T10:13:53.6065718Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T10:13:53.6065959Z 'load_eviction_policies': ['', ''], 2026-02-21T10:13:53.6066153Z 'num_stages': 1, 2026-02-21T10:13:53.6066299Z 'num_warps': 1, 2026-02-21T10:13:53.6066452Z 'pid_type': 'flat', 2026-02-21T10:13:53.6066626Z 'range_flattens': [None, None], 2026-02-21T10:13:53.6066817Z 'range_multi_buffers': [None, None], 2026-02-21T10:13:53.6067015Z 'range_num_stages': [0, 2], 2026-02-21T10:13:53.6067181Z 'range_unroll_factors': [0, 2], 
2026-02-21T10:13:53.6067369Z 'range_warp_specializes': [None, None]} 2026-02-21T10:13:53.6067689Z [211s] Fitting surrogate: 460 points, 460 targets 2026-02-21T10:13:53.7665435Z [211s] Autotuning complete in 211.2s after searching 439 configs. 2026-02-21T10:13:53.7667078Z One can hardcode the best config and skip autotuning with: 2026-02-21T10:13:53.7667985Z @helion.kernel(config=helion.Config(block_sizes=[1, 1024], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', ''], num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[None, None]), static_shapes=True) 2026-02-21T10:13:53.7668757Z 2026-02-21T10:13:53.7669017Z [211s] Code of selected kernel: /tmp/torchinductor_root/2i/c2iwy5mebrvu2qeluv3o7rszzw2fzncbw5e2bp6uphikash4umg5.py 2026-02-21T10:13:53.7895747Z from __future__ import annotations 2026-02-21T10:13:53.7895989Z 2026-02-21T10:13:53.7900119Z import torch 2026-02-21T10:13:53.7904168Z import triton 2026-02-21T10:13:53.7908647Z import triton.language as tl 2026-02-21T10:13:53.7913217Z from torch._inductor.runtime import triton_helpers 2026-02-21T10:13:53.7915367Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T10:13:53.7915699Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T10:13:53.7915876Z 2026-02-21T10:13:53.7915951Z _BLOCK_SIZE_0 = tl.constexpr(1) 2026-02-21T10:13:53.7916142Z _BLOCK_SIZE_1 = tl.constexpr(1024) 2026-02-21T10:13:53.7916259Z 2026-02-21T10:13:53.7916329Z @triton.jit 2026-02-21T10:13:53.7916479Z def _helion_softmax_two_pass(x, out): 2026-02-21T10:13:53.7916746Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T10:13:53.7917217Z pid_0 = tl.program_id(0) 2026-02-21T10:13:53.7917410Z offset_0 = pid_0 2026-02-21T10:13:53.7917590Z indices_0 = offset_0 + tl.zeros([1], tl.int32) 2026-02-21T10:13:53.7917882Z # 
src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T10:13:53.7918191Z mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32) 2026-02-21T10:13:53.7918541Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T10:13:53.7918804Z di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T10:13:53.7919056Z # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T10:13:53.7919326Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T10:13:53.7919573Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T10:13:53.7919805Z # src[softmax.py:82-89]: ... 2026-02-21T10:13:53.7920071Z for offset_2 in tl.range(0, 12672, _BLOCK_SIZE_1, loop_unroll_factor=2, num_stages=1): 2026-02-21T10:13:53.7920395Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T10:13:53.7920639Z mask_1 = indices_2 < 12672 2026-02-21T10:13:53.7920804Z mi_copy = mi 2026-02-21T10:13:53.7920950Z di_copy = di 2026-02-21T10:13:53.7921092Z mi_copy_0 = mi_copy 2026-02-21T10:13:53.7921294Z di_copy_0 = di_copy 2026-02-21T10:13:53.7921476Z # src[softmax.py:83]: values = x[tile_m, tile_n] 2026-02-21T10:13:53.7921875Z values = tl.load(x + (indices_0[:, None] * 12672 + indices_2[None, :] * 1), mask_1[None, :], other=0) 2026-02-21T10:13:53.7922217Z # src[softmax.py:84]: local_amax = torch.amax(values, dim=1) 2026-02-21T10:13:53.7922617Z _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16)) 2026-02-21T10:13:53.7923018Z local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16) 2026-02-21T10:13:53.7923278Z # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax) 2026-02-21T10:13:53.7923519Z v_0 = tl.cast(local_amax, tl.float32) 2026-02-21T10:13:53.7923735Z v_1 = triton_helpers.maximum(mi_copy_0, v_0) 2026-02-21T10:13:53.7923991Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 
2026-02-21T10:13:53.7924238Z v_2 = mi_copy_0 - v_1 2026-02-21T10:13:53.7924409Z v_3 = libdevice.exp(v_2) 2026-02-21T10:13:53.7924581Z v_4 = di_copy_0 * v_3 2026-02-21T10:13:53.7924764Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T10:13:53.7924967Z subscript = v_1[:, None] 2026-02-21T10:13:53.7925134Z v_5 = tl.cast(values, tl.float32) 2026-02-21T10:13:53.7925316Z v_6 = v_5 - subscript 2026-02-21T10:13:53.7925527Z # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp( 2026-02-21T10:13:53.7925781Z # src[softmax.py:87]: values - mi_next[:, None] 2026-02-21T10:13:53.7926000Z # src[softmax.py:88]: ).sum(dim=1) 2026-02-21T10:13:53.7926185Z v_7 = libdevice.exp(v_6) 2026-02-21T10:13:53.7926505Z _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32)) 2026-02-21T10:13:53.7926904Z sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32) 2026-02-21T10:13:53.7927114Z di = v_4 + sum_1 2026-02-21T10:13:53.7927285Z # src[softmax.py:89]: mi = mi_next 2026-02-21T10:13:53.7927456Z mi = v_1 2026-02-21T10:13:53.7927660Z # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n): 2026-02-21T10:13:53.7927927Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T10:13:53.7928225Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T10:13:53.7928578Z for offset_2 in tl.range(0, 12672, _BLOCK_SIZE_1, loop_unroll_factor=2, num_stages=1): 2026-02-21T10:13:53.7928908Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32) 2026-02-21T10:13:53.7929207Z mask_2 = indices_2 < 12672 2026-02-21T10:13:53.7929377Z mi_copy_1 = mi 2026-02-21T10:13:53.7929529Z di_copy_1 = di 2026-02-21T10:13:53.7929676Z mi_copy_1_0 = mi_copy_1 2026-02-21T10:13:53.7929845Z di_copy_1_0 = di_copy_1 2026-02-21T10:13:53.7930028Z # src[softmax.py:91]: values = x[tile_m, tile_n] 2026-02-21T10:13:53.7930390Z values_1 = tl.load(x + (indices_0[:, None] * 12672 + 
indices_2[None, :] * 1), mask_2[None, :], other=0) 2026-02-21T10:13:53.7930784Z # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None] 2026-02-21T10:13:53.7931065Z subscript_1 = mi_copy_1_0[:, None] 2026-02-21T10:13:53.7931260Z v_9 = tl.cast(values_1, tl.float32) 2026-02-21T10:13:53.7931440Z v_10 = v_9 - subscript_1 2026-02-21T10:13:53.7931658Z v_11 = libdevice.exp(v_10) 2026-02-21T10:13:53.7931832Z subscript_2 = di_copy_1_0[:, None] 2026-02-21T10:13:53.7932016Z v_12 = v_11 / subscript_2 2026-02-21T10:13:53.7932194Z v_13 = tl.cast(v_12, tl.float16) 2026-02-21T10:13:53.7932461Z tl.store(out + (indices_0[:, None] * 12672 + indices_2[None, :] * 1), v_13, mask_2[None, :]) 2026-02-21T10:13:53.7932672Z 2026-02-21T10:13:53.7932841Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher): 2026-02-21T10:13:53.7933075Z """ 2026-02-21T10:13:53.7933288Z Numerically optimized Helion kernel performing softmax in two passes. 2026-02-21T10:13:53.7933592Z This version uses fewer passes but is less numerically stable. 2026-02-21T10:13:53.7933820Z Args: 2026-02-21T10:13:53.7933990Z x (torch.Tensor): Input tensor of shape [m, n]. 2026-02-21T10:13:53.7934183Z Returns: 2026-02-21T10:13:53.7934370Z torch.Tensor: Softmax output tensor of the same shape. 2026-02-21T10:13:53.7934577Z """ 2026-02-21T10:13:53.7934721Z # src[softmax.py:75]: m, n = x.size() 2026-02-21T10:13:53.7934897Z m, n = x.size() 2026-02-21T10:13:53.7935076Z # src[softmax.py:76]: out = torch.empty_like(x) 2026-02-21T10:13:53.7935285Z out = torch.empty_like(x) 2026-02-21T10:13:53.7935530Z # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m): 2026-02-21T10:13:53.7935868Z # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32) 2026-02-21T10:13:53.7936197Z # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32) 2026-02-21T10:13:53.7936450Z # src[softmax.py:79-92]: ... 
2026-02-21T10:13:53.7936716Z _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=1, num_stages=1) 2026-02-21T10:13:53.7937007Z # src[softmax.py:93]: return out 2026-02-21T10:13:53.7937185Z return out 2026-02-21T10:13:54.5308979Z WARNING:tritonbench.utils.triton_op:Completed input ID 97: 2026-02-21T10:13:54.5310916Z (M, N) 2026-02-21T10:13:54.5311091Z ------------- 2026-02-21T10:13:54.5311243Z (4096, 12672) 2026-02-21T10:13:54.5311323Z 2026-02-21T10:13:54.5311691Z 100%|██████████| 20/20 [1:04:59<00:00, 223.54s/it] 2026-02-21T10:13:54.5315005Z 100%|██████████| 20/20 [1:04:59<00:00, 194.99s/it] 2026-02-21T10:13:54.5341086Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmp4vqrkpfg.csv 2026-02-21T10:13:56.2112601Z (M, N) triton_softmax-speedup triton_softmax-accuracy torch_compile_softmax-speedup torch_compile_softmax-accuracy helion_softmax_tritonbench-speedup helion_softmax_tritonbench-accuracy 2026-02-21T10:13:56.2113461Z ------------- ------------------------ ------------------------- ------------------------------- -------------------------------- ------------------------------------ ------------------------------------- 2026-02-21T10:13:56.2114073Z (4096, 256) 0.931476 1 1.25056 1 1.39284 1 2026-02-21T10:13:56.2114834Z (4096, 896) 1.85641 1 1.43552 1 2.02169 1 2026-02-21T10:13:56.2115363Z (4096, 1536) 3.52879 1 2.21949 1 4.483 1 2026-02-21T10:13:56.2115937Z (4096, 2176) 2.36344 1 1.98727 1 4.14473 1 2026-02-21T10:13:56.2116410Z (4096, 2816) 2.41335 1 1.68812 1 4.28562 1 2026-02-21T10:13:56.2116882Z (4096, 3584) 2.7054 1 1.54707 1 3.36528 1 2026-02-21T10:13:56.2117419Z (4096, 4224) 3.72627 1 1.97747 1 4.89165 1 2026-02-21T10:13:56.2117896Z (4096, 4864) 3.78826 1 1.85434 1 5.02964 1 2026-02-21T10:13:56.2118362Z (4096, 5504) 4.1215 1 1.88542 1 4.8864 1 2026-02-21T10:13:56.2118827Z (4096, 6144) 4.15768 1 2.09188 1 4.54654 1 2026-02-21T10:13:56.2119288Z (4096, 6784) 4.28904 1 1.67317 1 4.62843 1 2026-02-21T10:13:56.2119762Z (4096, 
7424) 4.89155 1 1.86434 1 4.77075 1 2026-02-21T10:13:56.2120231Z (4096, 8064) 4.84175 1 1.78775 1 4.6481 1 2026-02-21T10:13:56.2120692Z (4096, 8704) 2.67811 1 1.91351 1 2.96255 1 2026-02-21T10:13:56.2121165Z (4096, 9344) 1.74363 1 0.985811 1 1.85503 1 2026-02-21T10:13:56.2121770Z (4096, 10112) 1.74389 1 0.950067 1 2.43056 1 2026-02-21T10:13:56.2122246Z (4096, 10752) 1.73459 1 1.06085 1 2.27849 1 2026-02-21T10:13:56.2122711Z (4096, 11392) 1.74057 1 0.862269 1 2.2063 1 2026-02-21T10:13:56.2123210Z (4096, 12032) 1.74097 1 0.843324 1 2.2441 1 2026-02-21T10:13:56.2123679Z (4096, 12672) 1.7545 1 0.83287 1 1.35196 1 2026-02-21T10:13:56.2124217Z average 2.83756 1 1.53556 1 3.42118 1 2026-02-21T10:14:00.7122044Z ✅ Completed benchmark for kernel: softmax 2026-02-21T10:14:00.7129413Z [ 2026-02-21T10:14:00.7134406Z { 2026-02-21T10:14:00.7136381Z "benchmark": { 2026-02-21T10:14:00.7136596Z "name": "Helion Benchmark", 2026-02-21T10:14:00.7136783Z "extra_info": { 2026-02-21T10:14:00.7136948Z "device": "NVIDIA B200" 2026-02-21T10:14:00.7137110Z } 2026-02-21T10:14:00.7137239Z }, 2026-02-21T10:14:00.7137390Z "model": { 2026-02-21T10:14:00.7137546Z "name": "softmax" 2026-02-21T10:14:00.7137696Z }, 2026-02-21T10:14:00.7137812Z "metric": { 2026-02-21T10:14:00.7137961Z "name": "triton_speedup", 2026-02-21T10:14:00.7138468Z "benchmark_values": [ 2026-02-21T10:14:00.7138659Z 0.931475984939404, 2026-02-21T10:14:00.7138808Z 1.8564143200843095, 2026-02-21T10:14:00.7138962Z 3.528789059960257, 2026-02-21T10:14:00.7139104Z 2.3634400016828625, 2026-02-21T10:14:00.7139255Z 2.4133479657797072, 2026-02-21T10:14:00.7139397Z 2.7053978669166314, 2026-02-21T10:14:00.7139548Z 3.7262709990419967, 2026-02-21T10:14:00.7139702Z 3.788257933823361, 2026-02-21T10:14:00.7139842Z 4.121495204138856, 2026-02-21T10:14:00.7139989Z 4.157681244965289, 2026-02-21T10:14:00.7140126Z 4.289042305257703, 2026-02-21T10:14:00.7140270Z 4.891546023739693, 2026-02-21T10:14:00.7140413Z 4.841745530599691, 
2026-02-21T10:14:00.7140555Z 2.678114711491322, 2026-02-21T10:14:00.7140694Z 1.743631011334874, 2026-02-21T10:14:00.7140837Z 1.7438875801093323, 2026-02-21T10:14:00.7140976Z 1.7345903226492485, 2026-02-21T10:14:00.7141122Z 1.7405653952784175, 2026-02-21T10:14:00.7141275Z 1.7409664670249556, 2026-02-21T10:14:00.7141413Z 1.7544973423304502 2026-02-21T10:14:00.7141621Z ] 2026-02-21T10:14:00.7141740Z }, 2026-02-21T10:14:00.7141867Z "shape": [ 2026-02-21T10:14:00.7141994Z "(4096, 256)", 2026-02-21T10:14:00.7142136Z "(4096, 896)", 2026-02-21T10:14:00.7142267Z "(4096, 1536)", 2026-02-21T10:14:00.7142411Z "(4096, 2176)", 2026-02-21T10:14:00.7142543Z "(4096, 2816)", 2026-02-21T10:14:00.7142759Z "(4096, 3584)", 2026-02-21T10:14:00.7142898Z "(4096, 4224)", 2026-02-21T10:14:00.7143029Z "(4096, 4864)", 2026-02-21T10:14:00.7143171Z "(4096, 5504)", 2026-02-21T10:14:00.7143300Z "(4096, 6144)", 2026-02-21T10:14:00.7143436Z "(4096, 6784)", 2026-02-21T10:14:00.7143653Z "(4096, 7424)", 2026-02-21T10:14:00.7143799Z "(4096, 8064)", 2026-02-21T10:14:00.7143933Z "(4096, 8704)", 2026-02-21T10:14:00.7144079Z "(4096, 9344)", 2026-02-21T10:14:00.7144215Z "(4096, 10112)", 2026-02-21T10:14:00.7144364Z "(4096, 10752)", 2026-02-21T10:14:00.7144506Z "(4096, 11392)", 2026-02-21T10:14:00.7144638Z "(4096, 12032)", 2026-02-21T10:14:00.7144780Z "(4096, 12672)" 2026-02-21T10:14:00.7144907Z ] 2026-02-21T10:14:00.7145031Z }, 2026-02-21T10:14:00.7145145Z { 2026-02-21T10:14:00.7145271Z "benchmark": { 2026-02-21T10:14:00.7145416Z "name": "Helion Benchmark", 2026-02-21T10:14:00.7145588Z "extra_info": { 2026-02-21T10:14:00.7145728Z "device": "NVIDIA B200" 2026-02-21T10:14:00.7145887Z } 2026-02-21T10:14:00.7145998Z }, 2026-02-21T10:14:00.7146120Z "model": { 2026-02-21T10:14:00.7146253Z "name": "softmax" 2026-02-21T10:14:00.7146450Z }, 2026-02-21T10:14:00.7146577Z "metric": { 2026-02-21T10:14:00.7146718Z "name": "triton_accuracy", 2026-02-21T10:14:00.7146889Z "benchmark_values": [ 2026-02-21T10:14:00.7147033Z 
1.0, 2026-02-21T10:14:00.7147159Z 1.0, 2026-02-21T10:14:00.7147282Z 1.0, 2026-02-21T10:14:00.7147469Z 1.0, 2026-02-21T10:14:00.7147583Z 1.0, 2026-02-21T10:14:00.7147705Z 1.0, 2026-02-21T10:14:00.7147818Z 1.0, 2026-02-21T10:14:00.7147939Z 1.0, 2026-02-21T10:14:00.7148052Z 1.0, 2026-02-21T10:14:00.7148172Z 1.0, 2026-02-21T10:14:00.7148294Z 1.0, 2026-02-21T10:14:00.7148409Z 1.0, 2026-02-21T10:14:00.7148532Z 1.0, 2026-02-21T10:14:00.7148647Z 1.0, 2026-02-21T10:14:00.7148771Z 1.0, 2026-02-21T10:14:00.7148886Z 1.0, 2026-02-21T10:14:00.7149010Z 1.0, 2026-02-21T10:14:00.7149123Z 1.0, 2026-02-21T10:14:00.7149244Z 1.0, 2026-02-21T10:14:00.7149362Z 1.0 2026-02-21T10:14:00.7149490Z ] 2026-02-21T10:14:00.7149604Z }, 2026-02-21T10:14:00.7149735Z "shape": [ 2026-02-21T10:14:00.7149872Z "(4096, 256)", 2026-02-21T10:14:00.7150005Z "(4096, 896)", 2026-02-21T10:14:00.7150143Z "(4096, 1536)", 2026-02-21T10:14:00.7150315Z "(4096, 2176)", 2026-02-21T10:14:00.7150455Z "(4096, 2816)", 2026-02-21T10:14:00.7150582Z "(4096, 3584)", 2026-02-21T10:14:00.7150717Z "(4096, 4224)", 2026-02-21T10:14:00.7150844Z "(4096, 4864)", 2026-02-21T10:14:00.7150980Z "(4096, 5504)", 2026-02-21T10:14:00.7151107Z "(4096, 6144)", 2026-02-21T10:14:00.7151240Z "(4096, 6784)", 2026-02-21T10:14:00.7151367Z "(4096, 7424)", 2026-02-21T10:14:00.7151503Z "(4096, 8064)", 2026-02-21T10:14:00.7151680Z "(4096, 8704)", 2026-02-21T10:14:00.7151807Z "(4096, 9344)", 2026-02-21T10:14:00.7151944Z "(4096, 10112)", 2026-02-21T10:14:00.7152078Z "(4096, 10752)", 2026-02-21T10:14:00.7152222Z "(4096, 11392)", 2026-02-21T10:14:00.7152355Z "(4096, 12032)", 2026-02-21T10:14:00.7152496Z "(4096, 12672)" 2026-02-21T10:14:00.7152626Z ] 2026-02-21T10:14:00.7152750Z }, 2026-02-21T10:14:00.7152862Z { 2026-02-21T10:14:00.7152989Z "benchmark": { 2026-02-21T10:14:00.7153136Z "name": "Helion Benchmark", 2026-02-21T10:14:00.7153310Z "extra_info": { 2026-02-21T10:14:00.7153454Z "device": "NVIDIA B200" 2026-02-21T10:14:00.7153602Z } 
2026-02-21T10:14:00.7153720Z }, 2026-02-21T10:14:00.7153832Z "model": { 2026-02-21T10:14:00.7153965Z "name": "softmax" 2026-02-21T10:14:00.7154099Z }, 2026-02-21T10:14:00.7154222Z "metric": { 2026-02-21T10:14:00.7154367Z "name": "torch_compile_speedup", 2026-02-21T10:14:00.7154550Z "benchmark_values": [ 2026-02-21T10:14:00.7154699Z 1.250564387047188, 2026-02-21T10:14:00.7154851Z 1.435517213469286, 2026-02-21T10:14:00.7154997Z 2.219487315150675, 2026-02-21T10:14:00.7155138Z 1.9872697309694431, 2026-02-21T10:14:00.7155287Z 1.6881221240745872, 2026-02-21T10:14:00.7155470Z 1.5470672097488474, 2026-02-21T10:14:00.7155615Z 1.9774716427575567, 2026-02-21T10:14:00.7155751Z 1.8543425154620692, 2026-02-21T10:14:00.7155895Z 1.8854201106396877, 2026-02-21T10:14:00.7156035Z 2.0918793313501003, 2026-02-21T10:14:00.7156182Z 1.673172486995061, 2026-02-21T10:14:00.7156321Z 1.8643445788674753, 2026-02-21T10:14:00.7156470Z 1.7877475922999537, 2026-02-21T10:14:00.7156623Z 1.913510707795816, 2026-02-21T10:14:00.7156767Z 0.9858111921332556, 2026-02-21T10:14:00.7156922Z 0.9500668006873804, 2026-02-21T10:14:00.7157069Z 1.0608469534383906, 2026-02-21T10:14:00.7157219Z 0.8622689777512719, 2026-02-21T10:14:00.7157358Z 0.843324263350364, 2026-02-21T10:14:00.7157506Z 0.8328696803411412 2026-02-21T10:14:00.7157643Z ] 2026-02-21T10:14:00.7157763Z }, 2026-02-21T10:14:00.7157911Z "shape": [ 2026-02-21T10:14:00.7158045Z "(4096, 256)", 2026-02-21T10:14:00.7158183Z "(4096, 896)", 2026-02-21T10:14:00.7158312Z "(4096, 1536)", 2026-02-21T10:14:00.7158448Z "(4096, 2176)", 2026-02-21T10:14:00.7158577Z "(4096, 2816)", 2026-02-21T10:14:00.7158715Z "(4096, 3584)", 2026-02-21T10:14:00.7158889Z "(4096, 4224)", 2026-02-21T10:14:00.7159025Z "(4096, 4864)", 2026-02-21T10:14:00.7159152Z "(4096, 5504)", 2026-02-21T10:14:00.7159283Z "(4096, 6144)", 2026-02-21T10:14:00.7159408Z "(4096, 6784)", 2026-02-21T10:14:00.7159541Z "(4096, 7424)", 2026-02-21T10:14:00.7159667Z "(4096, 8064)", 2026-02-21T10:14:00.7159802Z 
"(4096, 8704)", 2026-02-21T10:14:00.7159936Z "(4096, 9344)", 2026-02-21T10:14:00.7160067Z "(4096, 10112)", 2026-02-21T10:14:00.7160208Z "(4096, 10752)", 2026-02-21T10:14:00.7160369Z "(4096, 11392)", 2026-02-21T10:14:00.7160500Z "(4096, 12032)", 2026-02-21T10:14:00.7160638Z "(4096, 12672)" 2026-02-21T10:14:00.7160773Z ] 2026-02-21T10:14:00.7160886Z }, 2026-02-21T10:14:00.7161006Z { 2026-02-21T10:14:00.7161123Z "benchmark": { 2026-02-21T10:14:00.7161271Z "name": "Helion Benchmark", 2026-02-21T10:14:00.7161431Z "extra_info": { 2026-02-21T10:14:00.7161652Z "device": "NVIDIA B200" 2026-02-21T10:14:00.7161806Z } 2026-02-21T10:14:00.7161926Z }, 2026-02-21T10:14:00.7162040Z "model": { 2026-02-21T10:14:00.7162176Z "name": "softmax" 2026-02-21T10:14:00.7162311Z }, 2026-02-21T10:14:00.7162436Z "metric": { 2026-02-21T10:14:00.7162587Z "name": "torch_compile_accuracy", 2026-02-21T10:14:00.7162767Z "benchmark_values": [ 2026-02-21T10:14:00.7162922Z 1.0, 2026-02-21T10:14:00.7163041Z 1.0, 2026-02-21T10:14:00.7163164Z 1.0, 2026-02-21T10:14:00.7163281Z 1.0, 2026-02-21T10:14:00.7163404Z 1.0, 2026-02-21T10:14:00.7163517Z 1.0, 2026-02-21T10:14:00.7163638Z 1.0, 2026-02-21T10:14:00.7163753Z 1.0, 2026-02-21T10:14:00.7163875Z 1.0, 2026-02-21T10:14:00.7163990Z 1.0, 2026-02-21T10:14:00.7164108Z 1.0, 2026-02-21T10:14:00.7164229Z 1.0, 2026-02-21T10:14:00.7164342Z 1.0, 2026-02-21T10:14:00.7164462Z 1.0, 2026-02-21T10:14:00.7164578Z 1.0, 2026-02-21T10:14:00.7164700Z 1.0, 2026-02-21T10:14:00.7164814Z 1.0, 2026-02-21T10:14:00.7164934Z 1.0, 2026-02-21T10:14:00.7165049Z 1.0, 2026-02-21T10:14:00.7165172Z 1.0 2026-02-21T10:14:00.7165287Z ] 2026-02-21T10:14:00.7165407Z }, 2026-02-21T10:14:00.7165521Z "shape": [ 2026-02-21T10:14:00.7165653Z "(4096, 256)", 2026-02-21T10:14:00.7165783Z "(4096, 896)", 2026-02-21T10:14:00.7165921Z "(4096, 1536)", 2026-02-21T10:14:00.7166059Z "(4096, 2176)", 2026-02-21T10:14:00.7166190Z "(4096, 2816)", 2026-02-21T10:14:00.7166326Z "(4096, 3584)", 2026-02-21T10:14:00.7166452Z 
"(4096, 4224)", 2026-02-21T10:14:00.7166594Z "(4096, 4864)", 2026-02-21T10:14:00.7166725Z "(4096, 5504)", 2026-02-21T10:14:00.7166904Z "(4096, 6144)", 2026-02-21T10:14:00.7167031Z "(4096, 6784)", 2026-02-21T10:14:00.7167166Z "(4096, 7424)", 2026-02-21T10:14:00.7167295Z "(4096, 8064)", 2026-02-21T10:14:00.7167435Z "(4096, 8704)", 2026-02-21T10:14:00.7167587Z "(4096, 9344)", 2026-02-21T10:14:00.7167721Z "(4096, 10112)", 2026-02-21T10:14:00.7167870Z "(4096, 10752)", 2026-02-21T10:14:00.7168008Z "(4096, 11392)", 2026-02-21T10:14:00.7168152Z "(4096, 12032)", 2026-02-21T10:14:00.7168286Z "(4096, 12672)" 2026-02-21T10:14:00.7168427Z ] 2026-02-21T10:14:00.7168544Z }, 2026-02-21T10:14:00.7168671Z { 2026-02-21T10:14:00.7168790Z "benchmark": { 2026-02-21T10:14:00.7168944Z "name": "Helion Benchmark", 2026-02-21T10:14:00.7169111Z "extra_info": { 2026-02-21T10:14:00.7169266Z "device": "NVIDIA B200" 2026-02-21T10:14:00.7169429Z } 2026-02-21T10:14:00.7169586Z }, 2026-02-21T10:14:00.7169717Z "model": { 2026-02-21T10:14:00.7169851Z "name": "softmax" 2026-02-21T10:14:00.7170001Z }, 2026-02-21T10:14:00.7170119Z "metric": { 2026-02-21T10:14:00.7170267Z "name": "helion_speedup", 2026-02-21T10:14:00.7170434Z "benchmark_values": [ 2026-02-21T10:14:00.7170634Z 1.3928367223219527, 2026-02-21T10:14:00.7170782Z 2.021686809958268, 2026-02-21T10:14:00.7170937Z 4.482999885115271, 2026-02-21T10:14:00.7171092Z 4.144729722512214, 2026-02-21T10:14:00.7171236Z 4.285616670885379, 2026-02-21T10:14:00.7171389Z 3.3652760520851452, 2026-02-21T10:14:00.7171563Z 4.891645563073838, 2026-02-21T10:14:00.7171719Z 5.029636573721792, 2026-02-21T10:14:00.7171863Z 4.886399900166625, 2026-02-21T10:14:00.7172015Z 4.5465352701655695, 2026-02-21T10:14:00.7172162Z 4.628430633856397, 2026-02-21T10:14:00.7172315Z 4.770753433747418, 2026-02-21T10:14:00.7172459Z 4.648096395332824, 2026-02-21T10:14:00.7172612Z 2.9625486935284497, 2026-02-21T10:14:00.7172769Z 1.855029253167436, 2026-02-21T10:14:00.7172914Z 2.4305577597188064, 
2026-02-21T10:14:00.7173073Z 2.278486748774394, 2026-02-21T10:14:00.7173268Z 2.2062960096361914, 2026-02-21T10:14:00.7173427Z 2.2441016414386814, 2026-02-21T10:14:00.7173573Z 1.3519636111885713 2026-02-21T10:14:00.7173725Z ] 2026-02-21T10:14:00.7173842Z }, 2026-02-21T10:14:00.7173973Z "shape": [ 2026-02-21T10:14:00.7174107Z "(4096, 256)", 2026-02-21T10:14:00.7174257Z "(4096, 896)", 2026-02-21T10:14:00.7174399Z "(4096, 1536)", 2026-02-21T10:14:00.7174555Z "(4096, 2176)", 2026-02-21T10:14:00.7174709Z "(4096, 2816)", 2026-02-21T10:14:00.7174850Z "(4096, 3584)", 2026-02-21T10:14:00.7175002Z "(4096, 4224)", 2026-02-21T10:14:00.7175135Z "(4096, 4864)", 2026-02-21T10:14:00.7175277Z "(4096, 5504)", 2026-02-21T10:14:00.7175413Z "(4096, 6144)", 2026-02-21T10:14:00.7175558Z "(4096, 6784)", 2026-02-21T10:14:00.7175697Z "(4096, 7424)", 2026-02-21T10:14:00.7175841Z "(4096, 8064)", 2026-02-21T10:14:00.7175977Z "(4096, 8704)", 2026-02-21T10:14:00.7176121Z "(4096, 9344)", 2026-02-21T10:14:00.7176268Z "(4096, 10112)", 2026-02-21T10:14:00.7176411Z "(4096, 10752)", 2026-02-21T10:14:00.7176562Z "(4096, 11392)", 2026-02-21T10:14:00.7176700Z "(4096, 12032)", 2026-02-21T10:14:00.7176846Z "(4096, 12672)" 2026-02-21T10:14:00.7176983Z ] 2026-02-21T10:14:00.7177113Z }, 2026-02-21T10:14:00.7177230Z { 2026-02-21T10:14:00.7177360Z "benchmark": { 2026-02-21T10:14:00.7177509Z "name": "Helion Benchmark", 2026-02-21T10:14:00.7177687Z "extra_info": { 2026-02-21T10:14:00.7177834Z "device": "NVIDIA B200" 2026-02-21T10:14:00.7177999Z } 2026-02-21T10:14:00.7178124Z }, 2026-02-21T10:14:00.7178257Z "model": { 2026-02-21T10:14:00.7178396Z "name": "softmax" 2026-02-21T10:14:00.7178535Z }, 2026-02-21T10:14:00.7178703Z "metric": { 2026-02-21T10:14:00.7178847Z "name": "helion_accuracy", 2026-02-21T10:14:00.7179023Z "benchmark_values": [ 2026-02-21T10:14:00.7179172Z 1.0, 2026-02-21T10:14:00.7179303Z 1.0, 2026-02-21T10:14:00.7179428Z 1.0, 2026-02-21T10:14:00.7179557Z 1.0, 2026-02-21T10:14:00.7179677Z 1.0, 
2026-02-21T10:14:00.7179804Z 1.0, 2026-02-21T10:14:00.7179930Z 1.0, 2026-02-21T10:14:00.7180051Z 1.0, 2026-02-21T10:14:00.7180180Z 1.0, 2026-02-21T10:14:00.7180304Z 1.0, 2026-02-21T10:14:00.7180436Z 1.0, 2026-02-21T10:14:00.7180558Z 1.0, 2026-02-21T10:14:00.7180690Z 1.0, 2026-02-21T10:14:00.7180813Z 1.0, 2026-02-21T10:14:00.7180944Z 1.0, 2026-02-21T10:14:00.7181069Z 1.0, 2026-02-21T10:14:00.7181204Z 1.0, 2026-02-21T10:14:00.7181329Z 1.0, 2026-02-21T10:14:00.7181460Z 1.0, 2026-02-21T10:14:00.7181685Z 1.0 2026-02-21T10:14:00.7181837Z ] 2026-02-21T10:14:00.7181975Z }, 2026-02-21T10:14:00.7182100Z "shape": [ 2026-02-21T10:14:00.7182236Z "(4096, 256)", 2026-02-21T10:14:00.7182374Z "(4096, 896)", 2026-02-21T10:14:00.7182523Z "(4096, 1536)", 2026-02-21T10:14:00.7182708Z "(4096, 2176)", 2026-02-21T10:14:00.7182850Z "(4096, 2816)", 2026-02-21T10:14:00.7182985Z "(4096, 3584)", 2026-02-21T10:14:00.7183137Z "(4096, 4224)", 2026-02-21T10:14:00.7183270Z "(4096, 4864)", 2026-02-21T10:14:00.7183410Z "(4096, 5504)", 2026-02-21T10:14:00.7183548Z "(4096, 6144)", 2026-02-21T10:14:00.7183680Z "(4096, 6784)", 2026-02-21T10:14:00.7183818Z "(4096, 7424)", 2026-02-21T10:14:00.7183950Z "(4096, 8064)", 2026-02-21T10:14:00.7184090Z "(4096, 8704)", 2026-02-21T10:14:00.7184225Z "(4096, 9344)", 2026-02-21T10:14:00.7184365Z "(4096, 10112)", 2026-02-21T10:14:00.7184502Z "(4096, 10752)", 2026-02-21T10:14:00.7184650Z "(4096, 11392)", 2026-02-21T10:14:00.7184789Z "(4096, 12032)", 2026-02-21T10:14:00.7184935Z "(4096, 12672)" 2026-02-21T10:14:00.7185067Z ] 2026-02-21T10:14:00.7185193Z } 2026-02-21T10:14:00.7197948Z ] 2026-02-21T10:14:00.7268495Z ##[group]Run pytorch/test-infra/.github/actions/gather-benchmark-metadata@main 2026-02-21T10:14:00.7268808Z with: 2026-02-21T10:14:00.7269167Z github-token: *** 2026-02-21T10:14:00.7269319Z venv: .venv/bin/activate 2026-02-21T10:14:00.7269491Z schema-version: v3 2026-02-21T10:14:00.7269630Z env: 2026-02-21T10:14:00.7269776Z HELION_AUTOTUNE_LOG_LEVEL: INFO 
2026-02-21T10:14:00.7269986Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:00.7270237Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T10:14:00.7270491Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:00.7270709Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:00.7270932Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:00.7271293Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T10:14:00.7271806Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T10:14:00.7272030Z ##[endgroup] 2026-02-21T10:14:00.7327680Z ##[group]Run set -eux 2026-02-21T10:14:00.7327861Z set -eux 2026-02-21T10:14:00.7328002Z  2026-02-21T10:14:00.7328157Z if [[ -z "${GITHUB_TOKEN}" ]]; then 2026-02-21T10:14:00.7328365Z  echo "Missing github-token input" 2026-02-21T10:14:00.7328554Z  exit 1 2026-02-21T10:14:00.7328682Z fi 2026-02-21T10:14:00.7329600Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T10:14:00.7329800Z env: 2026-02-21T10:14:00.7329947Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T10:14:00.7330147Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:00.7330408Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T10:14:00.7330663Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:00.7330970Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:00.7331203Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:00.7331650Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T10:14:00.7332060Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T10:14:00.7332432Z GITHUB_TOKEN: *** 2026-02-21T10:14:00.7332581Z ##[endgroup] 2026-02-21T10:14:00.7981784Z + [[ -z *** ]] 2026-02-21T10:14:00.8062347Z ##[group]Run 
pytorch/test-infra/.github/actions/get-workflow-job-id@main 2026-02-21T10:14:00.8062640Z with: 2026-02-21T10:14:00.8062910Z github-token: *** 2026-02-21T10:14:00.8063067Z env: 2026-02-21T10:14:00.8063211Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T10:14:00.8063430Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:00.8063684Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T10:14:00.8063937Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:00.8064162Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:00.8064387Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:00.8064774Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T10:14:00.8065326Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T10:14:00.8065569Z ##[endgroup] 2026-02-21T10:14:00.8074939Z ##[group]Run set -eux 2026-02-21T10:14:00.8075121Z set -eux 2026-02-21T10:14:00.8075273Z  2026-02-21T10:14:00.8075566Z python3 "${GITHUB_ACTION_PATH}/../../scripts/get_workflow_job_id.py" "${GITHUB_RUN_ID}" "${RUNNER_NAME}" 2026-02-21T10:14:00.8075984Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T10:14:00.8076186Z env: 2026-02-21T10:14:00.8076324Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T10:14:00.8076533Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:00.8076802Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T10:14:00.8077197Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:00.8077445Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:00.8077675Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:00.8078064Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T10:14:00.8078471Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T10:14:00.8078808Z 
GITHUB_TOKEN: *** 2026-02-21T10:14:00.8078953Z ##[endgroup] 2026-02-21T10:14:00.8664862Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/get-workflow-job-id/../../scripts/get_workflow_job_id.py 22253280836 dgxb200-03-1005 2026-02-21T10:14:02.4019841Z setting job-id=64380329741 2026-02-21T10:14:02.4024489Z setting job-name=run-b200 (softmax) / benchmark-cu130-softmax-py3.12-b200 2026-02-21T10:14:02.4208653Z ##[group]Run set -eux 2026-02-21T10:14:02.4208832Z set -eux 2026-02-21T10:14:02.4208967Z  2026-02-21T10:14:02.4209135Z if [[ -n ".venv/bin/activate" ]]; then 2026-02-21T10:14:02.4209351Z  source ".venv/bin/activate" 2026-02-21T10:14:02.4209531Z fi 2026-02-21T10:14:02.4209659Z  2026-02-21T10:14:02.4209885Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_metadata.py" \ 2026-02-21T10:14:02.4210193Z  --schema-version "${SCHEMA_VERSION}" \ 2026-02-21T10:14:02.4210395Z  --repo "${REPO}" \ 2026-02-21T10:14:02.4210581Z  --head-branch "${HEAD_BRANCH}" \ 2026-02-21T10:14:02.4210781Z  --head-sha "${HEAD_SHA}" \ 2026-02-21T10:14:02.4210973Z  --workflow-id "${WORKFLOW_RUN_ID}" \ 2026-02-21T10:14:02.4211182Z  --run-attempt "${RUN_ATTEMPT}" \ 2026-02-21T10:14:02.4211365Z  --job-id "${JOB_ID}" \ 2026-02-21T10:14:02.4211588Z  --job-name "${JOB_NAME}" 2026-02-21T10:14:02.4211982Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T10:14:02.4212204Z env: 2026-02-21T10:14:02.4212364Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T10:14:02.4212571Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.4212838Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T10:14:02.4213094Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.4213326Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.4213558Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.4213914Z LD_LIBRARY_PATH: 
/__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T10:14:02.4214300Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T10:14:02.4214532Z SCHEMA_VERSION: v3 2026-02-21T10:14:02.4214697Z REPO: pytorch/helion 2026-02-21T10:14:02.4214859Z HEAD_BRANCH: refs/heads/main 2026-02-21T10:14:02.4215070Z HEAD_SHA: 874a7d0cadab18218a84ad3579d329dc95c51820 2026-02-21T10:14:02.4215280Z WORKFLOW_RUN_ID: 22253280836 2026-02-21T10:14:02.4215451Z RUN_ATTEMPT: 1 2026-02-21T10:14:02.4215595Z JOB_ID: 64380329741 2026-02-21T10:14:02.4215909Z JOB_NAME: run-b200 (softmax) / benchmark-cu130-softmax-py3.12-b200 2026-02-21T10:14:02.4216166Z ##[endgroup] 2026-02-21T10:14:02.4903770Z + [[ -n .venv/bin/activate ]] 2026-02-21T10:14:02.4908661Z + source .venv/bin/activate 2026-02-21T10:14:02.4908959Z ++ '[' -z '' ']' 2026-02-21T10:14:02.4909150Z ++ '[' -n x ']' 2026-02-21T10:14:02.4909354Z ++ SCRIPT_PATH=.venv/bin/activate 2026-02-21T10:14:02.4909660Z ++ '[' .venv/bin/activate = /__w/_temp/c1f10b70-5831-4a52-a48a-1f73ca90304c.sh ']' 2026-02-21T10:14:02.4909993Z ++ deactivate nondestructive 2026-02-21T10:14:02.4910195Z ++ unset -f pydoc 2026-02-21T10:14:02.4910343Z ++ '[' -z '' ']' 2026-02-21T10:14:02.4910477Z ++ '[' -z '' ']' 2026-02-21T10:14:02.4910617Z ++ hash -r 2026-02-21T10:14:02.4910764Z ++ '[' -z '' ']' 2026-02-21T10:14:02.4910960Z ++ unset VIRTUAL_ENV 2026-02-21T10:14:02.4911138Z ++ unset VIRTUAL_ENV_PROMPT 2026-02-21T10:14:02.4911758Z ++ '[' '!' 
nondestructive = nondestructive ']' 2026-02-21T10:14:02.4912005Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv 2026-02-21T10:14:02.4912233Z ++ '[' linux-gnu = cygwin ']' 2026-02-21T10:14:02.4912430Z ++ '[' linux-gnu = msys ']' 2026-02-21T10:14:02.4912602Z ++ export VIRTUAL_ENV 2026-02-21T10:14:02.4912762Z ++ '[' -z '' ']' 2026-02-21T10:14:02.4912906Z ++ unset SCRIPT_PATH 2026-02-21T10:14:02.4913634Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T10:14:02.4914894Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T10:14:02.4915615Z ++ export PATH 2026-02-21T10:14:02.4915778Z ++ '[' xhelion '!=' x ']' 2026-02-21T10:14:02.4915956Z ++ VIRTUAL_ENV_PROMPT=helion 2026-02-21T10:14:02.4916146Z ++ export VIRTUAL_ENV_PROMPT 2026-02-21T10:14:02.4916311Z ++ '[' -z '' ']' 2026-02-21T10:14:02.4916455Z ++ '[' -z '' ']' 2026-02-21T10:14:02.4916597Z ++ _OLD_VIRTUAL_PS1= 2026-02-21T10:14:02.4916756Z ++ PS1='(helion) ' 2026-02-21T10:14:02.4916902Z ++ export PS1 2026-02-21T10:14:02.4917049Z ++ alias pydoc 2026-02-21T10:14:02.4917196Z ++ true 2026-02-21T10:14:02.4917326Z ++ hash -r 2026-02-21T10:14:02.4918404Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/gather-benchmark-metadata/../../scripts/benchmarks/gather_metadata.py --schema-version v3 --repo pytorch/helion --head-branch refs/heads/main --head-sha 874a7d0cadab18218a84ad3579d329dc95c51820 --workflow-id 22253280836 --run-attempt 1 --job-id 64380329741 --job-name 'run-b200 (softmax) / benchmark-cu130-softmax-py3.12-b200' 
2026-02-21T10:14:02.5309104Z ##[group]Run pytorch/test-infra/.github/actions/gather-runners-info@main 2026-02-21T10:14:02.5309376Z with: 2026-02-21T10:14:02.5309521Z venv: .venv/bin/activate 2026-02-21T10:14:02.5309683Z env: 2026-02-21T10:14:02.5309826Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T10:14:02.5310025Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.5310279Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T10:14:02.5310520Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.5310743Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.5310954Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.5311326Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T10:14:02.5311763Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T10:14:02.5311985Z ##[endgroup] 2026-02-21T10:14:02.5321692Z ##[group]Run set -eux 2026-02-21T10:14:02.5321870Z set -eux 2026-02-21T10:14:02.5322013Z  2026-02-21T10:14:02.5322160Z if command -v nvidia-smi; then 2026-02-21T10:14:02.5322427Z  DEVICE_NAME=cuda 2026-02-21T10:14:02.5322593Z  nvidia-smi 2026-02-21T10:14:02.5322751Z elif command -v rocm-smi; then 2026-02-21T10:14:02.5322940Z  DEVICE_NAME=rocm 2026-02-21T10:14:02.5323094Z  rocm-smi 2026-02-21T10:14:02.5323254Z elif command -v hl-smi; then 2026-02-21T10:14:02.5323433Z  DEVICE_NAME=hpu 2026-02-21T10:14:02.5323596Z  hl-smi 2026-02-21T10:14:02.5323727Z else 2026-02-21T10:14:02.5323873Z  arch=$(uname -m) 2026-02-21T10:14:02.5324031Z  2026-02-21T10:14:02.5324160Z  case "$arch" in 2026-02-21T10:14:02.5324323Z  aarch64|arm64) 2026-02-21T10:14:02.5324488Z  DEVICE_NAME=arm64-cpu 2026-02-21T10:14:02.5324672Z  ;; 2026-02-21T10:14:02.5324806Z  *) 2026-02-21T10:14:02.5324956Z  DEVICE_NAME=cpu 2026-02-21T10:14:02.5325115Z  ;; 2026-02-21T10:14:02.5325256Z  esac 2026-02-21T10:14:02.5325387Z  lscpu 
2026-02-21T10:14:02.5325522Z fi 2026-02-21T10:14:02.5325696Z echo "DEVICE_NAME=$DEVICE_NAME" >> $GITHUB_ENV 2026-02-21T10:14:02.5326002Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T10:14:02.5326201Z env: 2026-02-21T10:14:02.5326337Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T10:14:02.5326543Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.5326785Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T10:14:02.5327033Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.5327252Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.5327463Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.5327832Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T10:14:02.5328209Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T10:14:02.5328428Z ##[endgroup] 2026-02-21T10:14:02.5972699Z + command -v nvidia-smi 2026-02-21T10:14:02.5972916Z + DEVICE_NAME=cuda 2026-02-21T10:14:02.5973071Z + nvidia-smi 2026-02-21T10:14:02.5973234Z /usr/bin/nvidia-smi 2026-02-21T10:14:02.6138739Z Sat Feb 21 10:14:02 2026 2026-02-21T10:14:02.6140501Z +-----------------------------------------------------------------------------------------+ 2026-02-21T10:14:02.6140974Z | NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 | 2026-02-21T10:14:02.6141387Z +-----------------------------------------+------------------------+----------------------+ 2026-02-21T10:14:02.6141843Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2026-02-21T10:14:02.6142653Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2026-02-21T10:14:02.6142981Z | | | MIG M. 
| 2026-02-21T10:14:02.6143278Z |=========================================+========================+======================| 2026-02-21T10:14:02.6243819Z | 0 NVIDIA B200 Off | 00000000:52:00.0 Off | 0 | 2026-02-21T10:14:02.6245869Z | N/A 33C P0 191W / 750W | 0MiB / 183359MiB | 0% Default | 2026-02-21T10:14:02.6246188Z | | | Disabled | 2026-02-21T10:14:02.6246488Z +-----------------------------------------+------------------------+----------------------+ 2026-02-21T10:14:02.6246679Z 2026-02-21T10:14:02.6246815Z +-----------------------------------------------------------------------------------------+ 2026-02-21T10:14:02.6247126Z | Processes: | 2026-02-21T10:14:02.6247429Z | GPU GI CI PID Type Process name GPU Memory | 2026-02-21T10:14:02.6247932Z | ID ID Usage | 2026-02-21T10:14:02.6248180Z |=========================================================================================| 2026-02-21T10:14:02.6248467Z | No running processes found | 2026-02-21T10:14:02.6248776Z +-----------------------------------------------------------------------------------------+ 2026-02-21T10:14:02.6565158Z + echo DEVICE_NAME=cuda 2026-02-21T10:14:02.6607435Z ##[group]Run set -eux 2026-02-21T10:14:02.6607624Z set -eux 2026-02-21T10:14:02.6607769Z  2026-02-21T10:14:02.6607918Z if [[ "${DEVICE_NAME}" == "cuda" ]]; then 2026-02-21T10:14:02.6608149Z  # Return the same device name as PyTorch 2026-02-21T10:14:02.6608447Z  DEVICE_TYPE=$(nvidia-smi -i 0 --query-gpu=name --format=csv,noheader) 2026-02-21T10:14:02.6608732Z elif [[ "${DEVICE_NAME}" == "rocm" ]]; then 2026-02-21T10:14:02.6609051Z  DEVICE_TYPE=$(rocminfo | grep "Marketing Name" | tail -n1 | awk -F':' '{print $2}' | xargs) 2026-02-21T10:14:02.6609363Z elif [[ "${DEVICE_NAME}" == "hpu" ]]; then 2026-02-21T10:14:02.6609717Z  DEVICE_TYPE="Intel Gaudi3 "$(hl-smi -q | grep "Product Name" | head -n 1 | awk -F ':' '{print $2}' | sed 's/^ *//') 2026-02-21T10:14:02.6610063Z elif [[ "${DEVICE_NAME}" == "cpu" ]]; then 
2026-02-21T10:14:02.6610739Z  DEVICE_TYPE="$(lscpu | grep "Model name" | sed -E 's/.*Model name:[[:space:]]*//; s/Intel\(R\)//g; s/\(R\)//g; s/\(TM\)//g; s/CPU//g; s/Processor//g; s/[[:space:]]+/ /g; s/^ //; s/ $//; s/ /_/g')_$(awk -F: '/Core\(s\) per socket/ {c=$2} /Socket\(s\)/ {s=$2} END {gsub(/ /,"",c); gsub(/ /,"",s); printf "%sc", c*s}' < <(lscpu))" 2026-02-21T10:14:02.6611420Z elif [[ "${DEVICE_NAME}" == "arm64-cpu" ]]; then 2026-02-21T10:14:02.6611814Z  DEVICE_TYPE=$(lscpu | grep 'Vendor ID' | cut -f 2 -d ":" | awk '{$1=$1}1' | cut -f 2 -d " ") 2026-02-21T10:14:02.6612096Z fi 2026-02-21T10:14:02.6612270Z echo "DEVICE_TYPE=$DEVICE_TYPE" >> $GITHUB_ENV 2026-02-21T10:14:02.6612561Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T10:14:02.6612757Z env: 2026-02-21T10:14:02.6612890Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T10:14:02.6613089Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.6613330Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T10:14:02.6613570Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.6613788Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.6613996Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.6614417Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T10:14:02.6614842Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T10:14:02.6615061Z DEVICE_NAME: cuda 2026-02-21T10:14:02.6615198Z ##[endgroup] 2026-02-21T10:14:02.7253576Z + [[ cuda == \c\u\d\a ]] 2026-02-21T10:14:02.7253835Z ++ nvidia-smi -i 0 --query-gpu=name --format=csv,noheader 2026-02-21T10:14:02.7448363Z + DEVICE_TYPE='NVIDIA B200' 2026-02-21T10:14:02.7452647Z + echo 'DEVICE_TYPE=NVIDIA B200' 2026-02-21T10:14:02.7502000Z ##[group]Run set -eux 2026-02-21T10:14:02.7502176Z set -eux 2026-02-21T10:14:02.7502310Z  2026-02-21T10:14:02.7502481Z if [[ -n ".venv/bin/activate" 
]]; then 2026-02-21T10:14:02.7502685Z  source ".venv/bin/activate" 2026-02-21T10:14:02.7502862Z fi 2026-02-21T10:14:02.7502983Z  2026-02-21T10:14:02.7503182Z python3 -mpip install psutil==7.0.0 nvidia-ml-py==13.580.82 2026-02-21T10:14:02.7503538Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_runners_info.py" 2026-02-21T10:14:02.7503882Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T10:14:02.7504079Z env: 2026-02-21T10:14:02.7504276Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T10:14:02.7504488Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.7504743Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T10:14:02.7505006Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.7505241Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.7505462Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:02.7505842Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T10:14:02.7506239Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T10:14:02.7506461Z DEVICE_NAME: cuda 2026-02-21T10:14:02.7506605Z DEVICE_TYPE: NVIDIA B200 2026-02-21T10:14:02.7506767Z ##[endgroup] 2026-02-21T10:14:02.8065770Z + [[ -n .venv/bin/activate ]] 2026-02-21T10:14:02.8066065Z + source .venv/bin/activate 2026-02-21T10:14:02.8066259Z ++ '[' -z '' ']' 2026-02-21T10:14:02.8066420Z ++ '[' -n x ']' 2026-02-21T10:14:02.8066597Z ++ SCRIPT_PATH=.venv/bin/activate 2026-02-21T10:14:02.8066943Z ++ '[' .venv/bin/activate = /__w/_temp/6601c08a-5e44-423e-b852-79dcb39b8f99.sh ']' 2026-02-21T10:14:02.8067252Z ++ deactivate nondestructive 2026-02-21T10:14:02.8067456Z ++ unset -f pydoc 2026-02-21T10:14:02.8067624Z ++ '[' -z '' ']' 2026-02-21T10:14:02.8067776Z ++ '[' -z '' ']' 2026-02-21T10:14:02.8067935Z ++ hash -r 2026-02-21T10:14:02.8068078Z ++ '[' -z '' ']' 2026-02-21T10:14:02.8068239Z ++ unset VIRTUAL_ENV 
2026-02-21T10:14:02.8068414Z ++ unset VIRTUAL_ENV_PROMPT 2026-02-21T10:14:02.8068635Z ++ '[' '!' nondestructive = nondestructive ']' 2026-02-21T10:14:02.8068877Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv 2026-02-21T10:14:02.8069119Z ++ '[' linux-gnu = cygwin ']' 2026-02-21T10:14:02.8069334Z ++ '[' linux-gnu = msys ']' 2026-02-21T10:14:02.8070932Z ++ export VIRTUAL_ENV 2026-02-21T10:14:02.8071121Z ++ '[' -z '' ']' 2026-02-21T10:14:02.8071293Z ++ unset SCRIPT_PATH 2026-02-21T10:14:02.8072035Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T10:14:02.8073257Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T10:14:02.8073990Z ++ export PATH 2026-02-21T10:14:02.8074149Z ++ '[' xhelion '!=' x ']' 2026-02-21T10:14:02.8074324Z ++ VIRTUAL_ENV_PROMPT=helion 2026-02-21T10:14:02.8074515Z ++ export VIRTUAL_ENV_PROMPT 2026-02-21T10:14:02.8075810Z ++ '[' -z '' ']' 2026-02-21T10:14:02.8076040Z ++ '[' -z '' ']' 2026-02-21T10:14:02.8076186Z ++ _OLD_VIRTUAL_PS1= 2026-02-21T10:14:02.8076348Z ++ PS1='(helion) ' 2026-02-21T10:14:02.8076501Z ++ export PS1 2026-02-21T10:14:02.8076651Z ++ alias pydoc 2026-02-21T10:14:02.8076802Z ++ true 2026-02-21T10:14:02.8076939Z ++ hash -r 2026-02-21T10:14:02.8077147Z + python3 -mpip install psutil==7.0.0 nvidia-ml-py==13.580.82 2026-02-21T10:14:03.4734263Z Collecting psutil==7.0.0 2026-02-21T10:14:03.5913583Z Downloading psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB) 
2026-02-21T10:14:03.6134487Z Collecting nvidia-ml-py==13.580.82 2026-02-21T10:14:03.6211042Z Downloading nvidia_ml_py-13.580.82-py3-none-any.whl.metadata (9.6 kB) 2026-02-21T10:14:03.6332930Z Downloading psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (277 kB) 2026-02-21T10:14:03.6590878Z Downloading nvidia_ml_py-13.580.82-py3-none-any.whl (49 kB) 2026-02-21T10:14:03.7437579Z Installing collected packages: nvidia-ml-py, psutil 2026-02-21T10:14:03.7445127Z Attempting uninstall: nvidia-ml-py 2026-02-21T10:14:03.7466433Z Found existing installation: nvidia-ml-py 13.590.48 2026-02-21T10:14:03.7478587Z Uninstalling nvidia-ml-py-13.590.48: 2026-02-21T10:14:03.8118479Z Successfully uninstalled nvidia-ml-py-13.590.48 2026-02-21T10:14:03.8583780Z Attempting uninstall: psutil 2026-02-21T10:14:03.8612810Z Found existing installation: psutil 7.2.2 2026-02-21T10:14:03.8626674Z Uninstalling psutil-7.2.2: 2026-02-21T10:14:03.8628899Z Successfully uninstalled psutil-7.2.2 2026-02-21T10:14:03.9789849Z 2026-02-21T10:14:03.9823149Z Successfully installed nvidia-ml-py-13.580.82 psutil-7.0.0 2026-02-21T10:14:04.1136576Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/gather-runners-info/../../scripts/benchmarks/gather_runners_info.py 2026-02-21T10:14:05.8259606Z ##[group]Run pytorch/test-infra/.github/actions/gather-dependencies@main 2026-02-21T10:14:05.8259916Z with: 2026-02-21T10:14:05.8260063Z venv: .venv/bin/activate 2026-02-21T10:14:05.8260214Z env: 2026-02-21T10:14:05.8260372Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T10:14:05.8260574Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:05.8260830Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T10:14:05.8261072Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:05.8261303Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:05.8261527Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 
2026-02-21T10:14:05.8261986Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T10:14:05.8262384Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T10:14:05.8262596Z DEVICE_NAME: cuda 2026-02-21T10:14:05.8262856Z DEVICE_TYPE: NVIDIA B200 2026-02-21T10:14:05.8263021Z ##[endgroup] 2026-02-21T10:14:05.8271912Z ##[group]Run set -eux 2026-02-21T10:14:05.8272091Z set -eux 2026-02-21T10:14:05.8272236Z  2026-02-21T10:14:05.8272395Z # TODO (huydhn): Implement this part 2026-02-21T10:14:05.8272630Z echo "dependencies={}" >> "${GITHUB_OUTPUT}" 2026-02-21T10:14:05.8272940Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T10:14:05.8273135Z env: 2026-02-21T10:14:05.8273278Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T10:14:05.8273473Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:05.8273725Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T10:14:05.8273971Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:05.8274183Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:05.8274408Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:05.8274773Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T10:14:05.8275269Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T10:14:05.8275493Z DEVICE_NAME: cuda 2026-02-21T10:14:05.8275658Z DEVICE_TYPE: NVIDIA B200 2026-02-21T10:14:05.8275835Z ##[endgroup] 2026-02-21T10:14:05.8819428Z + echo 'dependencies={}' 2026-02-21T10:14:05.8871311Z ##[group]Run actions/upload-artifact@v6 2026-02-21T10:14:05.8871507Z with: 2026-02-21T10:14:05.8871733Z name: benchmark-results-b200-softmax 2026-02-21T10:14:05.8871925Z path: test/test-reports 2026-02-21T10:14:05.8872098Z if-no-files-found: warn 2026-02-21T10:14:05.8872265Z compression-level: 6 2026-02-21T10:14:05.8872433Z 
overwrite: false 2026-02-21T10:14:05.8872593Z include-hidden-files: false 2026-02-21T10:14:05.8872775Z env: 2026-02-21T10:14:05.8872918Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T10:14:05.8873140Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:05.8873404Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T10:14:05.8873684Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:05.8873911Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:05.8874126Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T10:14:05.8874625Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T10:14:05.8875015Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T10:14:05.8875252Z DEVICE_NAME: cuda 2026-02-21T10:14:05.8875405Z DEVICE_TYPE: NVIDIA B200 2026-02-21T10:14:05.8875583Z ##[endgroup] 2026-02-21T10:14:05.8877789Z ##[command]/usr/bin/docker exec 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T10:14:06.1183033Z With the provided path, there will be 1 file uploaded 2026-02-21T10:14:06.1188371Z Artifact name is valid! 2026-02-21T10:14:06.1188647Z Root directory input is valid! 2026-02-21T10:14:06.3901997Z Beginning upload of artifact content to blob storage 2026-02-21T10:14:06.7572908Z Uploaded bytes 1090 2026-02-21T10:14:06.8517183Z Finished uploading artifact content to blob storage! 2026-02-21T10:14:06.8518779Z SHA256 digest of uploaded artifact zip is ff7e00cf30fa3c0a5eaec5360e821f9d3995630ca5e7325469adbcc103e6907d 2026-02-21T10:14:06.8519261Z Finalizing artifact upload 2026-02-21T10:14:07.1586852Z Artifact benchmark-results-b200-softmax.zip successfully finalized. Artifact ID 5600810980 2026-02-21T10:14:07.1587426Z Artifact benchmark-results-b200-softmax has been successfully uploaded! Final size is 1090 bytes. 
Artifact ID is 5600810980 2026-02-21T10:14:07.1587996Z Artifact download URL: https://github.com/pytorch/helion/actions/runs/22253280836/artifacts/5600810980 2026-02-21T10:14:07.1805568Z Post job cleanup. 2026-02-21T10:14:07.1809514Z ##[command]/usr/bin/docker exec 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T10:14:07.3989757Z (node:388805) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead. 2026-02-21T10:14:07.3992619Z UV_PYTHON_INSTALL_DIR is already set to /github/home/.local/share/uv/python 2026-02-21T10:14:07.3993073Z (Use `node --trace-deprecation ...` to show where the warning was created) 2026-02-21T10:14:07.4139379Z Post job cleanup. 2026-02-21T10:14:07.4142057Z ##[command]/usr/bin/docker exec 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T10:14:07.6382994Z Post job cleanup. 2026-02-21T10:14:07.6386372Z ##[command]/usr/bin/docker exec 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T10:14:07.8198226Z [command]/usr/bin/git version 2026-02-21T10:14:07.8236782Z git version 2.43.0 2026-02-21T10:14:07.8270711Z Temporarily overriding HOME='/__w/_temp/08fd4e3a-806a-43b1-bcb8-1e2238f47cf5' before making global git config changes 2026-02-21T10:14:07.8274876Z Adding repository directory to the temporary git global config as a safe directory 2026-02-21T10:14:07.8280462Z [command]/usr/bin/git config --global --add safe.directory /__w/helion/helion 2026-02-21T10:14:07.8320364Z Removing SSH command configuration 2026-02-21T10:14:07.8323883Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2026-02-21T10:14:07.8366053Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' 
|| :" 2026-02-21T10:14:07.8624718Z Removing HTTP extra header 2026-02-21T10:14:07.8667632Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2026-02-21T10:14:07.8689861Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2026-02-21T10:14:07.8926142Z Removing includeIf entries pointing to credentials config files 2026-02-21T10:14:07.8926561Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2026-02-21T10:14:07.8956763Z includeif.gitdir:/__w/helion/helion/.git.path 2026-02-21T10:14:07.8960457Z includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path 2026-02-21T10:14:07.8964619Z includeif.gitdir:/github/workspace/.git.path 2026-02-21T10:14:07.8969668Z includeif.gitdir:/github/workspace/.git/worktrees/*.path 2026-02-21T10:14:07.8978369Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/__w/helion/helion/.git.path 2026-02-21T10:14:07.8985492Z /__w/_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config 2026-02-21T10:14:07.8997252Z [command]/usr/bin/git config --local --unset includeif.gitdir:/__w/helion/helion/.git.path /__w/_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config 2026-02-21T10:14:07.9031003Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path 2026-02-21T10:14:07.9043878Z /__w/_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config 2026-02-21T10:14:07.9055569Z [command]/usr/bin/git config --local --unset includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path /__w/_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config 2026-02-21T10:14:07.9085603Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/github/workspace/.git.path 2026-02-21T10:14:07.9094089Z 
/github/runner_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config 2026-02-21T10:14:07.9097996Z [command]/usr/bin/git config --local --unset includeif.gitdir:/github/workspace/.git.path /github/runner_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config 2026-02-21T10:14:07.9125628Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/github/workspace/.git/worktrees/*.path 2026-02-21T10:14:07.9145987Z /github/runner_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config 2026-02-21T10:14:07.9159803Z [command]/usr/bin/git config --local --unset includeif.gitdir:/github/workspace/.git/worktrees/*.path /github/runner_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config 2026-02-21T10:14:07.9185403Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2026-02-21T10:14:07.9419696Z Removing credentials config '/__w/_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config' 2026-02-21T10:14:07.9538287Z Stop and remove container: 6d984ead33f845ac9a028d8d082e23df_nvidiacuda1301develubuntu2404_d3efdf 2026-02-21T10:14:07.9541876Z ##[command]/usr/bin/docker rm --force 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 2026-02-21T10:14:13.2385143Z 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 2026-02-21T10:14:13.2422617Z Remove container network: github_network_1dabb68bcff447bd84adae5308b06429 2026-02-21T10:14:13.2425347Z ##[command]/usr/bin/docker network rm github_network_1dabb68bcff447bd84adae5308b06429 2026-02-21T10:14:13.3362642Z github_network_1dabb68bcff447bd84adae5308b06429 2026-02-21T10:14:13.3407047Z Evaluate and set job outputs 2026-02-21T10:14:13.3411964Z Set output 'benchmark-metadata' 2026-02-21T10:14:13.3413424Z Set output 'runners-info' 2026-02-21T10:14:13.3413931Z Set output 'dependencies' 2026-02-21T10:14:13.3414327Z Cleaning up orphan processes