2026-02-21T08:04:43.4330387Z Current runner version: '2.331.0'
2026-02-21T08:04:43.4334570Z Runner name: 'dgxb200-03-1005'
2026-02-21T08:04:43.4335215Z Runner group name: 'default'
2026-02-21T08:04:43.4335863Z Machine name: '3565fcf04df8'
2026-02-21T08:04:43.4337861Z ##[group]GITHUB_TOKEN Permissions
2026-02-21T08:04:43.4339429Z Contents: read
2026-02-21T08:04:43.4339873Z Metadata: read
2026-02-21T08:04:43.4340331Z ##[endgroup]
2026-02-21T08:04:43.4341936Z Secret source: Actions
2026-02-21T08:04:43.4342519Z Prepare workflow directory
2026-02-21T08:04:43.4704674Z Prepare all required actions
2026-02-21T08:04:43.4733445Z Getting action download info
2026-02-21T08:04:43.9422916Z Download action repository 'actions/checkout@v6' (SHA:de0fac2e4500dabe0009e67214ff5f5447ce83dd)
2026-02-21T08:04:44.2973999Z Download action repository 'actions/setup-python@v6' (SHA:a309ff8b426b58ec0e2a45f0f869d46889d02405)
2026-02-21T08:04:44.6762885Z Download action repository 'astral-sh/setup-uv@v7' (SHA:eac588ad8def6316056a12d4907a9d4d84ff7a3b)
2026-02-21T08:04:45.0512751Z Download action repository 'pytorch/test-infra@main' (SHA:bb8f04ff3961233c844fde6533c7c6c5f0857909)
2026-02-21T08:04:45.7173150Z Download action repository 'actions/upload-artifact@v6' (SHA:b7c566a772e6b6bfb58ed0dc250532a479d7789f)
2026-02-21T08:04:46.2983497Z Getting action download info
2026-02-21T08:04:46.5280899Z Uses: pytorch/helion/.github/workflows/benchmark.yml@refs/heads/main (874a7d0cadab18218a84ad3579d329dc95c51820)
2026-02-21T08:04:46.5283613Z ##[group] Inputs
2026-02-21T08:04:46.5283916Z   runner: linux.dgx.b200
2026-02-21T08:04:46.5284221Z   python-version: 3.12
2026-02-21T08:04:46.5284532Z   image: nvidia/cuda:13.0.1-devel-ubuntu24.04
2026-02-21T08:04:46.5284823Z   runtime-version: cu130
2026-02-21T08:04:46.5285141Z   container-options: --gpus all
2026-02-21T08:04:46.5285440Z   alias: b200
2026-02-21T08:04:46.5285668Z   kernels: softmax
2026-02-21T08:04:46.5285931Z   env-vars: 
2026-02-21T08:04:46.5286159Z   custom-args: 
2026-02-21T08:04:46.5286661Z   run_h100: true
2026-02-21T08:04:46.5286885Z   run_b200: true
2026-02-21T08:04:46.5287169Z   run_mi325x: true
2026-02-21T08:04:46.5287392Z ##[endgroup]
2026-02-21T08:04:46.5287731Z Complete job name: run-b200 (softmax) / benchmark-cu130-softmax-py3.12-b200
2026-02-21T08:04:46.5523868Z ##[group]Checking docker version
2026-02-21T08:04:46.5533540Z ##[command]/usr/bin/docker version --format '{{.Server.APIVersion}}'
2026-02-21T08:04:46.5708640Z '1.53'
2026-02-21T08:04:46.5726923Z Docker daemon API version: '1.53'
2026-02-21T08:04:46.5727378Z ##[command]/usr/bin/docker version --format '{{.Client.APIVersion}}'
2026-02-21T08:04:46.5878967Z '1.52'
2026-02-21T08:04:46.5898365Z Docker client API version: '1.52'
2026-02-21T08:04:46.5902811Z ##[endgroup]
2026-02-21T08:04:46.5904712Z ##[group]Clean up resources from previous jobs
2026-02-21T08:04:46.5908321Z ##[command]/usr/bin/docker ps --all --quiet --no-trunc --filter "label=0581a9"
2026-02-21T08:04:46.6020450Z ##[command]/usr/bin/docker network prune --force --filter "label=0581a9"
2026-02-21T08:04:46.6124176Z ##[endgroup]
2026-02-21T08:04:46.6124476Z ##[group]Create local container network
2026-02-21T08:04:46.6130773Z ##[command]/usr/bin/docker network create --label 0581a9 github_network_1dabb68bcff447bd84adae5308b06429
2026-02-21T08:04:46.6461789Z 09773e3a4e0ced1bc0281e806d312428394d5dada89c5653e3d75c24718b90b7
2026-02-21T08:04:46.6485178Z ##[endgroup]
2026-02-21T08:04:46.6503235Z ##[group]Starting job container
2026-02-21T08:04:46.6518915Z ##[command]/usr/bin/docker pull nvidia/cuda:13.0.1-devel-ubuntu24.04
2026-02-21T08:04:47.4473345Z 13.0.1-devel-ubuntu24.04: Pulling from nvidia/cuda
2026-02-21T08:04:47.7314469Z 1cd98a0b9132: Pulling fs layer
2026-02-21T08:04:47.7314863Z 76249c7cd503: Pulling fs layer
2026-02-21T08:04:47.7315297Z 8fb7ecb711ef: Pulling fs layer
2026-02-21T08:04:47.7315703Z afcf80b42416: Pulling fs layer
2026-02-21T08:04:47.7316079Z ab7341a40ee7: Pulling fs layer
2026-02-21T08:04:47.7316552Z e93dd1223ff5: Pulling fs layer
2026-02-21T08:04:47.7316917Z 401d11fb2a09: Pulling fs layer
2026-02-21T08:04:47.7317558Z d7913b78456a: Pulling fs layer
2026-02-21T08:04:47.7317814Z eea924c2c8fb: Pulling fs layer
2026-02-21T08:04:47.7321300Z c03b8ec8dd33: Pulling fs layer
2026-02-21T08:04:47.7321624Z c20926c42231: Pulling fs layer
2026-02-21T08:04:47.8627992Z 1cd98a0b9132: Download complete
2026-02-21T08:04:47.9636776Z d7913b78456a: Download complete
2026-02-21T08:04:48.0629567Z afcf80b42416: Download complete
2026-02-21T08:04:48.0631477Z c20926c42231: Download complete
2026-02-21T08:04:48.1632844Z 8fb7ecb711ef: Download complete
2026-02-21T08:04:48.1639294Z c03b8ec8dd33: Download complete
2026-02-21T08:04:48.6630245Z 401d11fb2a09: Download complete
2026-02-21T08:04:50.2643586Z 76249c7cd503: Download complete
2026-02-21T08:04:51.7666598Z 76249c7cd503: Pull complete
2026-02-21T08:04:51.8629416Z ab7341a40ee7: Download complete
2026-02-21T08:04:53.2671300Z 401d11fb2a09: Pull complete
2026-02-21T08:04:58.3680934Z ab7341a40ee7: Pull complete
2026-02-21T08:04:58.4641215Z d7913b78456a: Pull complete
2026-02-21T08:04:58.4678875Z c03b8ec8dd33: Pull complete
2026-02-21T08:05:13.6637444Z eea924c2c8fb: Download complete
2026-02-21T08:05:26.8631342Z e93dd1223ff5: Download complete
2026-02-21T08:05:32.1632859Z afcf80b42416: Pull complete
2026-02-21T08:05:32.1633544Z c20926c42231: Pull complete
2026-02-21T08:05:32.1642947Z eea924c2c8fb: Pull complete
2026-02-21T08:05:32.2641263Z 8fb7ecb711ef: Pull complete
2026-02-21T08:06:21.2640991Z e93dd1223ff5: Pull complete
2026-02-21T08:06:21.6873083Z 1cd98a0b9132: Pull complete
2026-02-21T08:06:21.6875055Z Digest: sha256:7d2f6a8c2071d911524f95061a0db363e24d27aa51ec831fcccf9e76eb72bc92
2026-02-21T08:06:21.6875522Z Status: Downloaded newer image for nvidia/cuda:13.0.1-devel-ubuntu24.04
2026-02-21T08:06:21.6886807Z docker.io/nvidia/cuda:13.0.1-devel-ubuntu24.04
2026-02-21T08:06:21.6963236Z ##[command]/usr/bin/docker create --name 6d984ead33f845ac9a028d8d082e23df_nvidiacuda1301develubuntu2404_d3efdf --label 0581a9 --workdir /__w/helion/helion --network github_network_1dabb68bcff447bd84adae5308b06429 --gpus all -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/charlie/_work":"/__w" -v "/home/charlie/externals":"/__e":ro -v "/home/charlie/_work/_temp":"/__w/_temp" -v "/home/charlie/_work/_actions":"/__w/_actions" -v "/home/charlie/_work/_tool":"/__w/_tool" -v "/home/charlie/_work/_temp/_github_home":"/github/home" -v "/home/charlie/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" nvidia/cuda:13.0.1-devel-ubuntu24.04 "-f" "/dev/null"
2026-02-21T08:06:21.9752053Z 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66
2026-02-21T08:06:21.9775801Z ##[command]/usr/bin/docker start 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66
2026-02-21T08:06:22.2510786Z 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66
2026-02-21T08:06:22.2523678Z ##[command]/usr/bin/docker ps --all --filter id=2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 --filter status=running --no-trunc --format "{{.ID}} {{.Status}}"
2026-02-21T08:06:22.2682073Z 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 Up Less than a second
2026-02-21T08:06:22.2698152Z ##[command]/usr/bin/docker inspect --format "{{range .Config.Env}}{{println .}}{{end}}" 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66
2026-02-21T08:06:22.2785856Z HOME=/github/home
2026-02-21T08:06:22.2787968Z GITHUB_ACTIONS=true
2026-02-21T08:06:22.2788345Z CI=true
2026-02-21T08:06:22.2788783Z PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T08:06:22.2789218Z NVARCH=x86_64
2026-02-21T08:06:22.2793746Z NVIDIA_REQUIRE_CUDA=cuda>=13.0 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571 brand=unknown,driver>=575,driver<576 brand=grid,driver>=575,driver<576 brand=tesla,driver>=575,driver<576 brand=nvidia,driver>=575,driver<576 brand=quadro,driver>=575,driver<576 brand=quadrortx,driver>=575,driver<576 brand=nvidiartx,driver>=575,driver<576 brand=vapps,driver>=575,driver<576 brand=vpc,driver>=575,driver<576 brand=vcs,driver>=575,driver<576 brand=vws,driver>=575,driver<576 brand=cloudgaming,driver>=575,driver<576
2026-02-21T08:06:22.2799048Z NV_CUDA_CUDART_VERSION=13.0.88-1
2026-02-21T08:06:22.2799270Z CUDA_VERSION=13.0.1
2026-02-21T08:06:22.2799696Z LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:06:22.2800064Z NVIDIA_VISIBLE_DEVICES=all
2026-02-21T08:06:22.2800345Z NVIDIA_DRIVER_CAPABILITIES=compute,utility
2026-02-21T08:06:22.2800665Z NV_CUDA_LIB_VERSION=13.0.1-1
2026-02-21T08:06:22.2801013Z NV_NVTX_VERSION=13.0.85-1
2026-02-21T08:06:22.2801266Z NV_LIBNPP_VERSION=13.0.1.2-1
2026-02-21T08:06:22.2801531Z NV_LIBNPP_PACKAGE=libnpp-13-0=13.0.1.2-1
2026-02-21T08:06:22.2801834Z NV_LIBCUSPARSE_VERSION=12.6.3.3-1
2026-02-21T08:06:22.2802096Z NV_LIBCUBLAS_PACKAGE_NAME=libcublas-13-0
2026-02-21T08:06:22.2802395Z NV_LIBCUBLAS_VERSION=13.0.2.14-1
2026-02-21T08:06:22.2802635Z NV_LIBCUBLAS_PACKAGE=libcublas-13-0=13.0.2.14-1
2026-02-21T08:06:22.2802933Z NV_LIBNCCL_PACKAGE_NAME=libnccl2
2026-02-21T08:06:22.2803230Z NV_LIBNCCL_PACKAGE_VERSION=2.28.3-1
2026-02-21T08:06:22.2803456Z NCCL_VERSION=2.28.3-1
2026-02-21T08:06:22.2803718Z NV_LIBNCCL_PACKAGE=libnccl2=2.28.3-1+cuda13.0
2026-02-21T08:06:22.2803970Z NVIDIA_PRODUCT_NAME=CUDA
2026-02-21T08:06:22.2804235Z NV_CUDA_CUDART_DEV_VERSION=13.0.88-1
2026-02-21T08:06:22.2804480Z NV_NVML_DEV_VERSION=13.0.87-1
2026-02-21T08:06:22.2804736Z NV_LIBCUSPARSE_DEV_VERSION=12.6.3.3-1
2026-02-21T08:06:22.2804984Z NV_LIBNPP_DEV_VERSION=13.0.1.2-1
2026-02-21T08:06:22.2805271Z NV_LIBNPP_DEV_PACKAGE=libnpp-dev-13-0=13.0.1.2-1
2026-02-21T08:06:22.2805562Z NV_LIBCUBLAS_DEV_VERSION=13.0.2.14-1
2026-02-21T08:06:22.2805822Z NV_LIBCUBLAS_DEV_PACKAGE_NAME=libcublas-dev-13-0
2026-02-21T08:06:22.2806145Z NV_LIBCUBLAS_DEV_PACKAGE=libcublas-dev-13-0=13.0.2.14-1
2026-02-21T08:06:22.2806403Z NV_CUDA_NSIGHT_COMPUTE_VERSION=13.0.1-1
2026-02-21T08:06:22.2806767Z NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE=cuda-nsight-compute-13-0=13.0.1-1
2026-02-21T08:06:22.2807097Z NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
2026-02-21T08:06:22.2807348Z NV_LIBNCCL_DEV_PACKAGE_VERSION=2.28.3-1
2026-02-21T08:06:22.2807679Z NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.28.3-1+cuda13.0
2026-02-21T08:06:22.2807944Z LIBRARY_PATH=/usr/local/cuda/lib64/stubs
2026-02-21T08:06:22.2813616Z ##[endgroup]
2026-02-21T08:06:22.2820614Z ##[group]Waiting for all services to be ready
2026-02-21T08:06:22.2821807Z ##[endgroup]
2026-02-21T08:06:22.2952669Z ##[group]Run echo "Detected NVIDIA image"
2026-02-21T08:06:22.2953022Z echo "Detected NVIDIA image"
2026-02-21T08:06:22.2953392Z nvidia-smi || echo "nvidia-smi not found"
2026-02-21T08:06:22.2955589Z shell: bash -l {0}
2026-02-21T08:06:22.2955895Z env:
2026-02-21T08:06:22.2956336Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:22.2956557Z ##[endgroup]
2026-02-21T08:06:22.3582319Z Detected NVIDIA image
2026-02-21T08:06:22.3857090Z Sat Feb 21 08:06:22 2026       
2026-02-21T08:06:22.3857634Z +-----------------------------------------------------------------------------------------+
2026-02-21T08:06:22.3863223Z | NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
2026-02-21T08:06:22.3864253Z +-----------------------------------------+------------------------+----------------------+
2026-02-21T08:06:22.3864743Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2026-02-21T08:06:22.3865514Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2026-02-21T08:06:22.3865958Z |                                         |                        |               MIG M. |
2026-02-21T08:06:22.3866403Z |=========================================+========================+======================|
2026-02-21T08:06:22.3945836Z |   0  NVIDIA B200                    Off |   00000000:52:00.0 Off |                    0 |
2026-02-21T08:06:22.3947614Z | N/A   30C    P0            141W /  750W |       0MiB / 183359MiB |      0%      Default |
2026-02-21T08:06:22.3947969Z |                                         |                        |             Disabled |
2026-02-21T08:06:22.3948382Z +-----------------------------------------+------------------------+----------------------+
2026-02-21T08:06:22.3948683Z 
2026-02-21T08:06:22.3953168Z +-----------------------------------------------------------------------------------------+
2026-02-21T08:06:22.3957642Z | Processes:                                                                              |
2026-02-21T08:06:22.3962431Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2026-02-21T08:06:22.3962886Z |        ID   ID                                                               Usage      |
2026-02-21T08:06:22.3963274Z |=========================================================================================|
2026-02-21T08:06:22.3963690Z |  No running processes found                                                             |
2026-02-21T08:06:22.3964096Z +-----------------------------------------------------------------------------------------+
2026-02-21T08:06:22.4328127Z ##[group]Run set -x
2026-02-21T08:06:22.4328352Z set -x
2026-02-21T08:06:22.4328586Z apt-get update
2026-02-21T08:06:22.4328797Z apt-get install -y git
2026-02-21T08:06:22.4329225Z shell: bash -l {0}
2026-02-21T08:06:22.4329409Z env:
2026-02-21T08:06:22.4329638Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:22.4329849Z ##[endgroup]
2026-02-21T08:06:22.4860916Z + apt-get update
2026-02-21T08:06:22.5413084Z Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64  InRelease [1581 B]
2026-02-21T08:06:22.6520420Z Get:2 http://security.ubuntu.com/ubuntu noble-security InRelease [126 kB]
2026-02-21T08:06:22.6566269Z Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64  Packages [1218 kB]
2026-02-21T08:06:22.8514636Z Get:4 http://archive.ubuntu.com/ubuntu noble InRelease [256 kB]
2026-02-21T08:06:23.0037773Z Get:5 http://security.ubuntu.com/ubuntu noble-security/multiverse amd64 Packages [34.8 kB]
2026-02-21T08:06:23.0874458Z Get:6 http://security.ubuntu.com/ubuntu noble-security/universe amd64 Packages [1207 kB]
2026-02-21T08:06:23.3744709Z Get:7 http://security.ubuntu.com/ubuntu noble-security/restricted amd64 Packages [3196 kB]
2026-02-21T08:06:23.5224596Z Get:8 http://security.ubuntu.com/ubuntu noble-security/main amd64 Packages [1857 kB]
2026-02-21T08:06:23.7832545Z Get:9 http://archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB]
2026-02-21T08:06:24.0146785Z Get:10 http://archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB]
2026-02-21T08:06:24.2510865Z Get:11 http://archive.ubuntu.com/ubuntu noble/multiverse amd64 Packages [331 kB]
2026-02-21T08:06:24.4162123Z Get:12 http://archive.ubuntu.com/ubuntu noble/universe amd64 Packages [19.3 MB]
2026-02-21T08:06:25.6637567Z Get:13 http://archive.ubuntu.com/ubuntu noble/main amd64 Packages [1808 kB]
2026-02-21T08:06:25.7251103Z Get:14 http://archive.ubuntu.com/ubuntu noble/restricted amd64 Packages [117 kB]
2026-02-21T08:06:25.7284419Z Get:15 http://archive.ubuntu.com/ubuntu noble-updates/restricted amd64 Packages [3381 kB]
2026-02-21T08:06:25.8719293Z Get:16 http://archive.ubuntu.com/ubuntu noble-updates/multiverse amd64 Packages [38.1 kB]
2026-02-21T08:06:25.8724485Z Get:17 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 Packages [2240 kB]
2026-02-21T08:06:25.9512578Z Get:18 http://archive.ubuntu.com/ubuntu noble-updates/universe amd64 Packages [2016 kB]
2026-02-21T08:06:26.0547296Z Get:19 http://archive.ubuntu.com/ubuntu noble-backports/universe amd64 Packages [34.6 kB]
2026-02-21T08:06:26.0564446Z Get:20 http://archive.ubuntu.com/ubuntu noble-backports/main amd64 Packages [49.5 kB]
2026-02-21T08:06:26.5851216Z Fetched 37.5 MB in 4s (9503 kB/s)
2026-02-21T08:06:27.8374442Z Reading package lists...
2026-02-21T08:06:27.8477353Z W: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
2026-02-21T08:06:27.8484115Z + apt-get install -y git
2026-02-21T08:06:30.4025279Z Reading package lists...
2026-02-21T08:06:30.5209354Z Building dependency tree...
2026-02-21T08:06:30.5213906Z Reading state information...
2026-02-21T08:06:30.6599743Z The following additional packages will be installed:
2026-02-21T08:06:30.6604574Z   git-man krb5-locales less libbrotli1 libbsd0 libcbor0.10 libcurl3t64-gnutls
2026-02-21T08:06:30.6605925Z   libedit2 liberror-perl libexpat1 libfido2-1 libgssapi-krb5-2 libk5crypto3
2026-02-21T08:06:30.6606552Z   libkeyutils1 libkrb5-3 libkrb5support0 libnghttp2-14 libpsl5t64 librtmp1
2026-02-21T08:06:30.6607095Z   libssh-4 libx11-6 libx11-data libxau6 libxcb1 libxdmcp6 libxext6 libxmuu1
2026-02-21T08:06:30.6609386Z   openssh-client publicsuffix xauth
2026-02-21T08:06:30.6614113Z Suggested packages:
2026-02-21T08:06:30.6614506Z   gettext-base git-daemon-run | git-daemon-sysvinit git-doc git-email git-gui
2026-02-21T08:06:30.6615307Z   gitk gitweb git-cvs git-mediawiki git-svn krb5-doc krb5-user keychain
2026-02-21T08:06:30.6615647Z   libpam-ssh monkeysphere ssh-askpass
2026-02-21T08:06:31.2370091Z The following NEW packages will be installed:
2026-02-21T08:06:31.2371769Z   git git-man krb5-locales less libbrotli1 libbsd0 libcbor0.10
2026-02-21T08:06:31.2372219Z   libcurl3t64-gnutls libedit2 liberror-perl libexpat1 libfido2-1
2026-02-21T08:06:31.2372608Z   libgssapi-krb5-2 libk5crypto3 libkeyutils1 libkrb5-3 libkrb5support0
2026-02-21T08:06:31.2373120Z   libnghttp2-14 libpsl5t64 librtmp1 libssh-4 libx11-6 libx11-data libxau6
2026-02-21T08:06:31.2373821Z   libxcb1 libxdmcp6 libxext6 libxmuu1 openssh-client publicsuffix xauth
2026-02-21T08:06:31.5313239Z 0 upgraded, 31 newly installed, 0 to remove and 86 not upgraded.
2026-02-21T08:06:31.5318395Z Need to get 8886 kB of archives.
2026-02-21T08:06:31.5322615Z After this operation, 38.0 MB of additional disk space will be used.
2026-02-21T08:06:31.5323337Z Get:1 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 krb5-locales all 1.20.1-6ubuntu2.6 [14.8 kB]
2026-02-21T08:06:31.8108252Z Get:2 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 less amd64 590-2ubuntu2.1 [142 kB]
2026-02-21T08:06:32.1904785Z Get:3 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libbsd0 amd64 0.12.1-1build1.1 [41.2 kB]
2026-02-21T08:06:32.2454831Z Get:4 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libexpat1 amd64 2.6.1-2ubuntu0.4 [88.2 kB]
2026-02-21T08:06:32.3204953Z Get:5 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libkrb5support0 amd64 1.20.1-6ubuntu2.6 [34.4 kB]
2026-02-21T08:06:32.3438031Z Get:6 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libk5crypto3 amd64 1.20.1-6ubuntu2.6 [82.0 kB]
2026-02-21T08:06:32.3892113Z Get:7 http://archive.ubuntu.com/ubuntu noble/main amd64 libkeyutils1 amd64 1.6.3-3build1 [9490 B]
2026-02-21T08:06:32.3932339Z Get:8 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libkrb5-3 amd64 1.20.1-6ubuntu2.6 [348 kB]
2026-02-21T08:06:32.5175967Z Get:9 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libgssapi-krb5-2 amd64 1.20.1-6ubuntu2.6 [143 kB]
2026-02-21T08:06:32.5673852Z Get:10 http://archive.ubuntu.com/ubuntu noble/main amd64 libcbor0.10 amd64 0.10.2-1.2ubuntu2 [25.8 kB]
2026-02-21T08:06:32.5707357Z Get:11 http://archive.ubuntu.com/ubuntu noble/main amd64 libedit2 amd64 3.1-20230828-1build1 [97.6 kB]
2026-02-21T08:06:32.5989507Z Get:12 http://archive.ubuntu.com/ubuntu noble/main amd64 libfido2-1 amd64 1.14.0-1build3 [83.5 kB]
2026-02-21T08:06:32.6102267Z Get:13 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libnghttp2-14 amd64 1.59.0-1ubuntu0.2 [74.3 kB]
2026-02-21T08:06:32.6216675Z Get:14 http://archive.ubuntu.com/ubuntu noble/main amd64 libpsl5t64 amd64 0.21.2-1.1build1 [57.1 kB]
2026-02-21T08:06:32.6310502Z Get:15 http://archive.ubuntu.com/ubuntu noble/main amd64 libxau6 amd64 1:1.0.9-1build6 [7160 B]
2026-02-21T08:06:32.6326878Z Get:16 http://archive.ubuntu.com/ubuntu noble/main amd64 libxdmcp6 amd64 1:1.1.3-0ubuntu6 [10.3 kB]
2026-02-21T08:06:32.6345134Z Get:17 http://archive.ubuntu.com/ubuntu noble/main amd64 libxcb1 amd64 1.15-1ubuntu2 [47.7 kB]
2026-02-21T08:06:32.6561789Z Get:18 http://archive.ubuntu.com/ubuntu noble/main amd64 libx11-data all 2:1.8.7-1build1 [115 kB]
2026-02-21T08:06:32.7230046Z Get:19 http://archive.ubuntu.com/ubuntu noble/main amd64 libx11-6 amd64 2:1.8.7-1build1 [650 kB]
2026-02-21T08:06:32.7933455Z Get:20 http://archive.ubuntu.com/ubuntu noble/main amd64 libxext6 amd64 2:1.3.4-1build2 [30.4 kB]
2026-02-21T08:06:32.7997962Z Get:21 http://archive.ubuntu.com/ubuntu noble/main amd64 libxmuu1 amd64 2:1.1.3-3build2 [8958 B]
2026-02-21T08:06:32.8010852Z Get:22 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 openssh-client amd64 1:9.6p1-3ubuntu13.14 [906 kB]
2026-02-21T08:06:32.8721829Z Get:23 http://archive.ubuntu.com/ubuntu noble/main amd64 publicsuffix all 20231001.0357-0.1 [129 kB]
2026-02-21T08:06:32.8815329Z Get:24 http://archive.ubuntu.com/ubuntu noble/main amd64 xauth amd64 1:1.1.2-1build1 [25.6 kB]
2026-02-21T08:06:32.8842063Z Get:25 http://archive.ubuntu.com/ubuntu noble/main amd64 libbrotli1 amd64 1.1.0-2build2 [331 kB]
2026-02-21T08:06:32.9117431Z Get:26 http://archive.ubuntu.com/ubuntu noble/main amd64 librtmp1 amd64 2.4+20151223.gitfa8646d.1-2build7 [56.3 kB]
2026-02-21T08:06:32.9147126Z Get:27 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libssh-4 amd64 0.10.6-2ubuntu0.3 [190 kB]
2026-02-21T08:06:32.9301014Z Get:28 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libcurl3t64-gnutls amd64 8.5.0-2ubuntu10.6 [333 kB]
2026-02-21T08:06:32.9590769Z Get:29 http://archive.ubuntu.com/ubuntu noble/main amd64 liberror-perl all 0.17029-2 [25.6 kB]
2026-02-21T08:06:32.9609320Z Get:30 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 git-man all 1:2.43.0-1ubuntu7.3 [1100 kB]
2026-02-21T08:06:33.0091168Z Get:31 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 git amd64 1:2.43.0-1ubuntu7.3 [3680 kB]
2026-02-21T08:06:33.2234064Z debconf: delaying package configuration, since apt-utils is not installed
2026-02-21T08:06:33.2455808Z Fetched 8886 kB in 2s (4721 kB/s)
2026-02-21T08:06:33.2950144Z Selecting previously unselected package krb5-locales.
2026-02-21T08:06:33.3006843Z (Reading database ... 15507 files and directories currently installed.)
2026-02-21T08:06:33.3015251Z Preparing to unpack .../00-krb5-locales_1.20.1-6ubuntu2.6_all.deb ...
2026-02-21T08:06:33.3042745Z Unpacking krb5-locales (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:33.3346901Z Selecting previously unselected package less.
2026-02-21T08:06:33.3354978Z Preparing to unpack .../01-less_590-2ubuntu2.1_amd64.deb ...
2026-02-21T08:06:33.3407821Z Unpacking less (590-2ubuntu2.1) ...
2026-02-21T08:06:33.3734759Z Selecting previously unselected package libbsd0:amd64.
2026-02-21T08:06:33.3743101Z Preparing to unpack .../02-libbsd0_0.12.1-1build1.1_amd64.deb ...
2026-02-21T08:06:33.3798253Z Unpacking libbsd0:amd64 (0.12.1-1build1.1) ...
2026-02-21T08:06:33.4101278Z Selecting previously unselected package libexpat1:amd64.
2026-02-21T08:06:33.4109370Z Preparing to unpack .../03-libexpat1_2.6.1-2ubuntu0.4_amd64.deb ...
2026-02-21T08:06:33.4137113Z Unpacking libexpat1:amd64 (2.6.1-2ubuntu0.4) ...
2026-02-21T08:06:33.4454760Z Selecting previously unselected package libkrb5support0:amd64.
2026-02-21T08:06:33.4461515Z Preparing to unpack .../04-libkrb5support0_1.20.1-6ubuntu2.6_amd64.deb ...
2026-02-21T08:06:33.4490124Z Unpacking libkrb5support0:amd64 (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:33.4776235Z Selecting previously unselected package libk5crypto3:amd64.
2026-02-21T08:06:33.4787864Z Preparing to unpack .../05-libk5crypto3_1.20.1-6ubuntu2.6_amd64.deb ...
2026-02-21T08:06:33.4801977Z Unpacking libk5crypto3:amd64 (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:33.5066755Z Selecting previously unselected package libkeyutils1:amd64.
2026-02-21T08:06:33.5074483Z Preparing to unpack .../06-libkeyutils1_1.6.3-3build1_amd64.deb ...
2026-02-21T08:06:33.5107706Z Unpacking libkeyutils1:amd64 (1.6.3-3build1) ...
2026-02-21T08:06:33.5362953Z Selecting previously unselected package libkrb5-3:amd64.
2026-02-21T08:06:33.5372833Z Preparing to unpack .../07-libkrb5-3_1.20.1-6ubuntu2.6_amd64.deb ...
2026-02-21T08:06:33.5399591Z Unpacking libkrb5-3:amd64 (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:33.5690021Z Selecting previously unselected package libgssapi-krb5-2:amd64.
2026-02-21T08:06:33.5697394Z Preparing to unpack .../08-libgssapi-krb5-2_1.20.1-6ubuntu2.6_amd64.deb ...
2026-02-21T08:06:33.5723946Z Unpacking libgssapi-krb5-2:amd64 (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:33.5992363Z Selecting previously unselected package libcbor0.10:amd64.
2026-02-21T08:06:33.5994395Z Preparing to unpack .../09-libcbor0.10_0.10.2-1.2ubuntu2_amd64.deb ...
2026-02-21T08:06:33.6257631Z Unpacking libcbor0.10:amd64 (0.10.2-1.2ubuntu2) ...
2026-02-21T08:06:33.6556233Z Selecting previously unselected package libedit2:amd64.
2026-02-21T08:06:33.6565019Z Preparing to unpack .../10-libedit2_3.1-20230828-1build1_amd64.deb ...
2026-02-21T08:06:33.6590375Z Unpacking libedit2:amd64 (3.1-20230828-1build1) ...
2026-02-21T08:06:33.7195695Z Selecting previously unselected package libfido2-1:amd64.
2026-02-21T08:06:33.7201436Z Preparing to unpack .../11-libfido2-1_1.14.0-1build3_amd64.deb ...
2026-02-21T08:06:33.7225794Z Unpacking libfido2-1:amd64 (1.14.0-1build3) ...
2026-02-21T08:06:33.7528116Z Selecting previously unselected package libnghttp2-14:amd64.
2026-02-21T08:06:33.7530117Z Preparing to unpack .../12-libnghttp2-14_1.59.0-1ubuntu0.2_amd64.deb ...
2026-02-21T08:06:33.7566505Z Unpacking libnghttp2-14:amd64 (1.59.0-1ubuntu0.2) ...
2026-02-21T08:06:33.7786740Z Selecting previously unselected package libpsl5t64:amd64.
2026-02-21T08:06:33.7796509Z Preparing to unpack .../13-libpsl5t64_0.21.2-1.1build1_amd64.deb ...
2026-02-21T08:06:33.7854942Z Unpacking libpsl5t64:amd64 (0.21.2-1.1build1) ...
2026-02-21T08:06:33.8087262Z Selecting previously unselected package libxau6:amd64.
2026-02-21T08:06:33.8089973Z Preparing to unpack .../14-libxau6_1%3a1.0.9-1build6_amd64.deb ...
2026-02-21T08:06:33.8130501Z Unpacking libxau6:amd64 (1:1.0.9-1build6) ...
2026-02-21T08:06:33.8342206Z Selecting previously unselected package libxdmcp6:amd64.
2026-02-21T08:06:33.8348559Z Preparing to unpack .../15-libxdmcp6_1%3a1.1.3-0ubuntu6_amd64.deb ...
2026-02-21T08:06:33.8376657Z Unpacking libxdmcp6:amd64 (1:1.1.3-0ubuntu6) ...
2026-02-21T08:06:33.8615219Z Selecting previously unselected package libxcb1:amd64.
2026-02-21T08:06:33.8621751Z Preparing to unpack .../16-libxcb1_1.15-1ubuntu2_amd64.deb ...
2026-02-21T08:06:33.8644804Z Unpacking libxcb1:amd64 (1.15-1ubuntu2) ...
2026-02-21T08:06:33.8845020Z Selecting previously unselected package libx11-data.
2026-02-21T08:06:33.8851951Z Preparing to unpack .../17-libx11-data_2%3a1.8.7-1build1_all.deb ...
2026-02-21T08:06:33.8878280Z Unpacking libx11-data (2:1.8.7-1build1) ...
2026-02-21T08:06:33.9235910Z Selecting previously unselected package libx11-6:amd64.
2026-02-21T08:06:33.9245712Z Preparing to unpack .../18-libx11-6_2%3a1.8.7-1build1_amd64.deb ...
2026-02-21T08:06:33.9270274Z Unpacking libx11-6:amd64 (2:1.8.7-1build1) ...
2026-02-21T08:06:33.9554460Z Selecting previously unselected package libxext6:amd64.
2026-02-21T08:06:33.9560865Z Preparing to unpack .../19-libxext6_2%3a1.3.4-1build2_amd64.deb ...
2026-02-21T08:06:33.9584161Z Unpacking libxext6:amd64 (2:1.3.4-1build2) ...
2026-02-21T08:06:33.9803065Z Selecting previously unselected package libxmuu1:amd64.
2026-02-21T08:06:33.9810794Z Preparing to unpack .../20-libxmuu1_2%3a1.1.3-3build2_amd64.deb ...
2026-02-21T08:06:33.9835825Z Unpacking libxmuu1:amd64 (2:1.1.3-3build2) ...
2026-02-21T08:06:34.0064481Z Selecting previously unselected package openssh-client.
2026-02-21T08:06:34.0070141Z Preparing to unpack .../21-openssh-client_1%3a9.6p1-3ubuntu13.14_amd64.deb ...
2026-02-21T08:06:34.0145884Z Unpacking openssh-client (1:9.6p1-3ubuntu13.14) ...
2026-02-21T08:06:34.0493834Z Selecting previously unselected package publicsuffix.
2026-02-21T08:06:34.0498393Z Preparing to unpack .../22-publicsuffix_20231001.0357-0.1_all.deb ...
2026-02-21T08:06:34.0525429Z Unpacking publicsuffix (20231001.0357-0.1) ...
2026-02-21T08:06:34.0733548Z Selecting previously unselected package xauth.
2026-02-21T08:06:34.0744304Z Preparing to unpack .../23-xauth_1%3a1.1.2-1build1_amd64.deb ...
2026-02-21T08:06:34.0766781Z Unpacking xauth (1:1.1.2-1build1) ...
2026-02-21T08:06:34.0984433Z Selecting previously unselected package libbrotli1:amd64.
2026-02-21T08:06:34.0990920Z Preparing to unpack .../24-libbrotli1_1.1.0-2build2_amd64.deb ...
2026-02-21T08:06:34.1013396Z Unpacking libbrotli1:amd64 (1.1.0-2build2) ...
2026-02-21T08:06:34.1261942Z Selecting previously unselected package librtmp1:amd64.
2026-02-21T08:06:34.1269567Z Preparing to unpack .../25-librtmp1_2.4+20151223.gitfa8646d.1-2build7_amd64.deb ...
2026-02-21T08:06:34.1298172Z Unpacking librtmp1:amd64 (2.4+20151223.gitfa8646d.1-2build7) ...
2026-02-21T08:06:34.1511598Z Selecting previously unselected package libssh-4:amd64.
2026-02-21T08:06:34.1516360Z Preparing to unpack .../26-libssh-4_0.10.6-2ubuntu0.3_amd64.deb ...
2026-02-21T08:06:34.1540738Z Unpacking libssh-4:amd64 (0.10.6-2ubuntu0.3) ...
2026-02-21T08:06:34.1785939Z Selecting previously unselected package libcurl3t64-gnutls:amd64.
2026-02-21T08:06:34.1794583Z Preparing to unpack .../27-libcurl3t64-gnutls_8.5.0-2ubuntu10.6_amd64.deb ...
2026-02-21T08:06:34.1821885Z Unpacking libcurl3t64-gnutls:amd64 (8.5.0-2ubuntu10.6) ...
2026-02-21T08:06:34.2033663Z Selecting previously unselected package liberror-perl.
2026-02-21T08:06:34.2040933Z Preparing to unpack .../28-liberror-perl_0.17029-2_all.deb ...
2026-02-21T08:06:34.2066117Z Unpacking liberror-perl (0.17029-2) ...
2026-02-21T08:06:34.2253247Z Selecting previously unselected package git-man.
2026-02-21T08:06:34.2259265Z Preparing to unpack .../29-git-man_1%3a2.43.0-1ubuntu7.3_all.deb ...
2026-02-21T08:06:34.2283957Z Unpacking git-man (1:2.43.0-1ubuntu7.3) ...
2026-02-21T08:06:34.2561865Z Selecting previously unselected package git.
2026-02-21T08:06:34.2567173Z Preparing to unpack .../30-git_1%3a2.43.0-1ubuntu7.3_amd64.deb ...
2026-02-21T08:06:34.2631609Z Unpacking git (1:2.43.0-1ubuntu7.3) ...
2026-02-21T08:06:34.3648536Z Setting up libexpat1:amd64 (2.6.1-2ubuntu0.4) ...
2026-02-21T08:06:34.3719118Z Setting up libxau6:amd64 (1:1.0.9-1build6) ...
2026-02-21T08:06:34.3793785Z Setting up libkeyutils1:amd64 (1.6.3-3build1) ...
2026-02-21T08:06:34.3858996Z Setting up libcbor0.10:amd64 (0.10.2-1.2ubuntu2) ...
2026-02-21T08:06:34.3925276Z Setting up libbrotli1:amd64 (1.1.0-2build2) ...
2026-02-21T08:06:34.3979537Z Setting up libpsl5t64:amd64 (0.21.2-1.1build1) ...
2026-02-21T08:06:34.4040666Z Setting up libnghttp2-14:amd64 (1.59.0-1ubuntu0.2) ...
2026-02-21T08:06:34.4105087Z Setting up less (590-2ubuntu2.1) ...
2026-02-21T08:06:34.4239954Z Setting up krb5-locales (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:34.4302123Z Setting up libkrb5support0:amd64 (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:34.4365314Z Setting up liberror-perl (0.17029-2) ...
2026-02-21T08:06:34.4421481Z Setting up libx11-data (2:1.8.7-1build1) ...
2026-02-21T08:06:34.4491497Z Setting up librtmp1:amd64 (2.4+20151223.gitfa8646d.1-2build7) ...
2026-02-21T08:06:34.4563613Z Setting up libk5crypto3:amd64 (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:34.4617067Z Setting up git-man (1:2.43.0-1ubuntu7.3) ...
2026-02-21T08:06:34.4685074Z Setting up libkrb5-3:amd64 (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:34.4755895Z Setting up libfido2-1:amd64 (1.14.0-1build3) ...
2026-02-21T08:06:34.4816637Z Setting up libbsd0:amd64 (0.12.1-1build1.1) ...
2026-02-21T08:06:34.4865614Z Setting up publicsuffix (20231001.0357-0.1) ...
2026-02-21T08:06:34.4937418Z Setting up libxdmcp6:amd64 (1:1.1.3-0ubuntu6) ...
2026-02-21T08:06:34.5000357Z Setting up libxcb1:amd64 (1.15-1ubuntu2) ...
2026-02-21T08:06:34.5075954Z Setting up libedit2:amd64 (3.1-20230828-1build1) ...
2026-02-21T08:06:34.5139528Z Setting up libgssapi-krb5-2:amd64 (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:34.5224105Z Setting up libssh-4:amd64 (0.10.6-2ubuntu0.3) ...
2026-02-21T08:06:34.5304115Z Setting up libx11-6:amd64 (2:1.8.7-1build1) ...
2026-02-21T08:06:34.5410460Z Setting up libxmuu1:amd64 (2:1.1.3-3build2) ...
2026-02-21T08:06:34.5472556Z Setting up openssh-client (1:9.6p1-3ubuntu13.14) ...
2026-02-21T08:06:34.6042327Z Setting up libcurl3t64-gnutls:amd64 (8.5.0-2ubuntu10.6) ...
2026-02-21T08:06:34.6109673Z Setting up libxext6:amd64 (2:1.3.4-1build2) ...
2026-02-21T08:06:34.6172824Z Setting up git (1:2.43.0-1ubuntu7.3) ...
2026-02-21T08:06:34.6282543Z Setting up xauth (1:1.1.2-1build1) ...
2026-02-21T08:06:34.6348759Z Processing triggers for libc-bin (2.39-0ubuntu8.5) ...
2026-02-21T08:06:34.6732841Z ##[group]Run actions/checkout@v6
2026-02-21T08:06:34.6733124Z with:
2026-02-21T08:06:34.6733450Z   repository: pytorch/helion
2026-02-21T08:06:34.6733860Z   token: ***
2026-02-21T08:06:34.6734051Z   ssh-strict: true
2026-02-21T08:06:34.6734298Z   ssh-user: git
2026-02-21T08:06:34.6734499Z   persist-credentials: true
2026-02-21T08:06:34.6734748Z   clean: true
2026-02-21T08:06:34.6734964Z   sparse-checkout-cone-mode: true
2026-02-21T08:06:34.6735235Z   fetch-depth: 1
2026-02-21T08:06:34.6735464Z   fetch-tags: false
2026-02-21T08:06:34.6735660Z   show-progress: true
2026-02-21T08:06:34.6736055Z   lfs: false
2026-02-21T08:06:34.6736248Z   submodules: false
2026-02-21T08:06:34.6736481Z   set-safe-directory: true
2026-02-21T08:06:34.6736699Z env:
2026-02-21T08:06:34.6736922Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:34.6737126Z ##[endgroup]
2026-02-21T08:06:34.6769892Z ##[command]/usr/bin/docker exec  2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T08:06:34.8403050Z Syncing repository: pytorch/helion
2026-02-21T08:06:34.8403994Z ##[group]Getting Git version info
2026-02-21T08:06:34.8404366Z Working directory is '/__w/helion/helion'
2026-02-21T08:06:34.8404787Z [command]/usr/bin/git version
2026-02-21T08:06:34.8410773Z git version 2.43.0
2026-02-21T08:06:34.8423985Z ##[endgroup]
2026-02-21T08:06:34.8440481Z Temporarily overriding HOME='/__w/_temp/4c216d0f-717e-4c76-a6ea-eec8e78f2ef5' before making global git config changes
2026-02-21T08:06:34.8441161Z Adding repository directory to the temporary git global config as a safe directory
2026-02-21T08:06:34.8441684Z [command]/usr/bin/git config --global --add safe.directory /__w/helion/helion
2026-02-21T08:06:34.8471455Z Deleting the contents of '/__w/helion/helion'
2026-02-21T08:06:34.8476914Z ##[group]Initializing the repository
2026-02-21T08:06:34.8482516Z [command]/usr/bin/git init /__w/helion/helion
2026-02-21T08:06:34.8506670Z hint: Using 'master' as the name for the initial branch. This default branch name
2026-02-21T08:06:34.8508261Z hint: is subject to change. To configure the initial branch name to use in all
2026-02-21T08:06:34.8508694Z hint: of your new repositories, which will suppress this warning, call:
2026-02-21T08:06:34.8508991Z hint: 
2026-02-21T08:06:34.8509267Z hint: 	git config --global init.defaultBranch <name>
2026-02-21T08:06:34.8509525Z hint: 
2026-02-21T08:06:34.8509818Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2026-02-21T08:06:34.8510436Z hint: 'development'. The just-created branch can be renamed via this command:
2026-02-21T08:06:34.8510790Z hint: 
2026-02-21T08:06:34.8511012Z hint: 	git branch -m <name>
2026-02-21T08:06:34.8511299Z Initialized empty Git repository in /__w/helion/helion/.git/
2026-02-21T08:06:34.8516754Z [command]/usr/bin/git remote add origin https://github.com/pytorch/helion
2026-02-21T08:06:34.8541332Z ##[endgroup]
2026-02-21T08:06:34.8541763Z ##[group]Disabling automatic garbage collection
2026-02-21T08:06:34.8544045Z [command]/usr/bin/git config --local gc.auto 0
2026-02-21T08:06:34.8568188Z ##[endgroup]
2026-02-21T08:06:34.8568540Z ##[group]Setting up auth
2026-02-21T08:06:34.8568817Z Removing SSH command configuration
2026-02-21T08:06:34.8576593Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2026-02-21T08:06:34.8600143Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2026-02-21T08:06:34.8833979Z Removing HTTP extra header
2026-02-21T08:06:34.8835436Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2026-02-21T08:06:34.8854841Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2026-02-21T08:06:34.9074576Z Removing includeIf entries pointing to credentials config files
2026-02-21T08:06:34.9077230Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir:
2026-02-21T08:06:34.9101066Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url
2026-02-21T08:06:34.9328967Z [command]/usr/bin/git config --file /__w/_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config http.https://github.com/.extraheader AUTHORIZATION: basic ***
2026-02-21T08:06:34.9369445Z [command]/usr/bin/git config --local includeIf.gitdir:/__w/helion/helion/.git.path /__w/_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config
2026-02-21T08:06:34.9397004Z [command]/usr/bin/git config --local includeIf.gitdir:/__w/helion/helion/.git/worktrees/*.path /__w/_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config
2026-02-21T08:06:34.9417609Z [command]/usr/bin/git config --local includeIf.gitdir:/github/workspace/.git.path /github/runner_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config
2026-02-21T08:06:34.9442382Z [command]/usr/bin/git config --local includeIf.gitdir:/github/workspace/.git/worktrees/*.path /github/runner_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config
2026-02-21T08:06:34.9463874Z ##[endgroup]
2026-02-21T08:06:34.9464233Z ##[group]Fetching the repository
2026-02-21T08:06:34.9470358Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +874a7d0cadab18218a84ad3579d329dc95c51820:refs/remotes/origin/main
2026-02-21T08:06:35.4130214Z From https://github.com/pytorch/helion
2026-02-21T08:06:35.4130890Z  * [new ref]         874a7d0cadab18218a84ad3579d329dc95c51820 -> origin/main
2026-02-21T08:06:35.4156029Z [command]/usr/bin/git branch --list --remote origin/main
2026-02-21T08:06:35.4178929Z   origin/main
2026-02-21T08:06:35.4181012Z [command]/usr/bin/git rev-parse refs/remotes/origin/main
2026-02-21T08:06:35.4199272Z 874a7d0cadab18218a84ad3579d329dc95c51820
2026-02-21T08:06:35.4206368Z ##[endgroup]
2026-02-21T08:06:35.4206751Z ##[group]Determining the checkout info
2026-02-21T08:06:35.4207156Z ##[endgroup]
2026-02-21T08:06:35.4207474Z [command]/usr/bin/git sparse-checkout disable
2026-02-21T08:06:35.4241955Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2026-02-21T08:06:35.4260865Z ##[group]Checking out the ref
2026-02-21T08:06:35.4261395Z [command]/usr/bin/git checkout --progress --force -B main refs/remotes/origin/main
2026-02-21T08:06:35.4460235Z Switched to a new branch 'main'
2026-02-21T08:06:35.4463852Z branch 'main' set up to track 'origin/main'.
2026-02-21T08:06:35.4466429Z ##[endgroup]
2026-02-21T08:06:35.4493215Z [command]/usr/bin/git log -1 --format=%H
2026-02-21T08:06:35.4515240Z 874a7d0cadab18218a84ad3579d329dc95c51820
2026-02-21T08:06:35.4653425Z ##[group]Run actions/setup-python@v6
2026-02-21T08:06:35.4653645Z with:
2026-02-21T08:06:35.4653868Z   python-version: 3.12
2026-02-21T08:06:35.4654126Z   check-latest: false
2026-02-21T08:06:35.4654395Z   token: ***
2026-02-21T08:06:35.4654625Z   update-environment: true
2026-02-21T08:06:35.4654832Z   allow-prereleases: false
2026-02-21T08:06:35.4655040Z   freethreaded: false
2026-02-21T08:06:35.4655231Z env:
2026-02-21T08:06:35.4655436Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:35.4655653Z ##[endgroup]
2026-02-21T08:06:35.4659048Z ##[command]/usr/bin/docker exec  2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T08:06:35.6652821Z ##[group]Installed versions
2026-02-21T08:06:35.6664839Z Version 3.12 was not found in the local cache
2026-02-21T08:06:36.4925738Z Version 3.12 is available for downloading
2026-02-21T08:06:36.4926399Z Download from "https://github.com/actions/python-versions/releases/download/3.12.12-18393146713/python-3.12.12-linux-24.04-x64.tar.gz"
2026-02-21T08:06:37.3676211Z Extract downloaded archive
2026-02-21T08:06:37.3801877Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /__w/_temp/76928f81-147e-4c44-8c08-6641cfb954e6 -f /__w/_temp/f4fc44dd-7761-48f1-9ebb-70b591183c16
2026-02-21T08:06:39.3035322Z Execute installation script
2026-02-21T08:06:39.3146433Z Check if Python hostedtoolcache folder exist...
2026-02-21T08:06:39.3146850Z Creating Python hostedtoolcache folder...
2026-02-21T08:06:39.3154160Z Create Python 3.12.12 folder
2026-02-21T08:06:39.3168375Z Copy Python binaries to hostedtoolcache folder
2026-02-21T08:06:39.6927270Z Create additional symlinks (Required for the UsePythonVersion Azure Pipelines task and the setup-python GitHub Action)
2026-02-21T08:06:39.6965996Z Upgrading pip...
2026-02-21T08:06:41.0438233Z Looking in links: /tmp/tmpexo42m5v
2026-02-21T08:06:41.0440373Z Requirement already satisfied: pip in /__w/_tool/Python/3.12.12/x64/lib/python3.12/site-packages (25.0.1)
2026-02-21T08:06:41.0476502Z ##[error]WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
2026-02-21T08:06:41.6149159Z ##[error]WARNING: The directory '/github/home/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.
2026-02-21T08:06:41.7707157Z Collecting pip
2026-02-21T08:06:41.8095101Z Downloading pip-26.0.1-py3-none-any.whl.metadata (4.7 kB)
2026-02-21T08:06:41.8173989Z Downloading pip-26.0.1-py3-none-any.whl (1.8 MB)
2026-02-21T08:06:41.8461167Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 122.2 MB/s eta 0:00:00
2026-02-21T08:06:41.8553274Z Installing collected packages: pip
2026-02-21T08:06:41.8554829Z   Attempting uninstall: pip
2026-02-21T08:06:41.8565713Z Found existing installation: pip 25.0.1
2026-02-21T08:06:41.8738364Z Uninstalling pip-25.0.1:
2026-02-21T08:06:41.8770138Z Successfully uninstalled pip-25.0.1
2026-02-21T08:06:42.4390583Z Successfully installed pip-26.0.1
2026-02-21T08:06:42.4869366Z Create complete file
2026-02-21T08:06:42.4906980Z Successfully set up CPython (3.12.12)
2026-02-21T08:06:42.4907506Z ##[endgroup]
2026-02-21T08:06:42.5115837Z ##[group]Run astral-sh/setup-uv@v7
2026-02-21T08:06:42.5116074Z with:
2026-02-21T08:06:42.5116312Z   activate-environment: false
2026-02-21T08:06:42.5116584Z   working-directory: /home/charlie/_work/helion/helion
2026-02-21T08:06:42.5117095Z   github-token: ***
2026-02-21T08:06:42.5117345Z   enable-cache: auto
2026-02-21T08:06:42.5117853Z   cache-dependency-glob: **/*requirements*.txt
**/*requirements*.in
**/*constraints*.txt
**/*constraints*.in
**/pyproject.toml
**/uv.lock
**/*.py.lock

2026-02-21T08:06:42.5118357Z   restore-cache: true
2026-02-21T08:06:42.5118566Z   save-cache: true
2026-02-21T08:06:42.5118794Z   prune-cache: true
2026-02-21T08:06:42.5119015Z   cache-python: false
2026-02-21T08:06:42.5119243Z   ignore-nothing-to-cache: false
2026-02-21T08:06:42.5119500Z   ignore-empty-workdir: false
2026-02-21T08:06:42.5119709Z   add-problem-matchers: true
2026-02-21T08:06:42.5119989Z   resolution-strategy: highest
2026-02-21T08:06:42.5120197Z env:
2026-02-21T08:06:42.5120514Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:42.5120762Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:42.5121093Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:06:42.5121388Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:42.5121740Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:42.5122032Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:42.5122522Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:06:42.5122961Z ##[endgroup]
2026-02-21T08:06:42.5129239Z ##[command]/usr/bin/docker exec  2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T08:06:42.7393744Z (node:802) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
2026-02-21T08:06:42.7395801Z (Use `node --trace-deprecation ...` to show where the warning was created)
2026-02-21T08:06:42.7464934Z Trying to find version for uv in: /__w/helion/helion/uv.toml
2026-02-21T08:06:42.7465378Z Could not find file: /__w/helion/helion/uv.toml
2026-02-21T08:06:42.7465969Z Trying to find version for uv in: /__w/helion/helion/pyproject.toml
2026-02-21T08:06:42.7471631Z Could not determine uv version from uv.toml or pyproject.toml. Falling back to latest.
2026-02-21T08:06:42.7472310Z Getting latest version from GitHub API...
2026-02-21T08:06:43.0271787Z manifest-file not provided, reading from local file.
2026-02-21T08:06:43.0306776Z manifest-file does not contain version 0.10.4, arch x86_64, platform unknown-linux-gnu. Falling back to GitHub releases.
2026-02-21T08:06:43.0310021Z Downloading uv from "https://github.com/astral-sh/uv/releases/download/0.10.4/uv-x86_64-unknown-linux-gnu.tar.gz" ...
2026-02-21T08:06:43.3415128Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /__w/_temp/52f4ea79-f6cf-4673-b3a4-42e3b22b5a32 -f /__w/_temp/9043aff8-f81b-4d13-ad8e-c41d78bd8e10
2026-02-21T08:06:43.7246483Z Added /github/home/.local/bin to the path
2026-02-21T08:06:43.7248715Z Added /__w/_tool/uv/0.10.4/x86_64 to the path
2026-02-21T08:06:43.7249136Z Set UV_PYTHON_INSTALL_DIR to /github/home/.local/share/uv/python
2026-02-21T08:06:43.7249525Z Added /github/home/.local/share/uv/python to the path
2026-02-21T08:06:43.7255736Z Successfully installed uv version 0.10.4
2026-02-21T08:06:43.8671140Z ##[group]Run uv venv --python 3.12
2026-02-21T08:06:43.8671476Z uv venv --python 3.12
2026-02-21T08:06:43.8671887Z shell: bash -l {0}
2026-02-21T08:06:43.8672110Z env:
2026-02-21T08:06:43.8672357Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:43.8672622Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:43.8672959Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:06:43.8673246Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:43.8673574Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:43.8673831Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:43.8674263Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:06:43.8674779Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:06:43.8675041Z ##[endgroup]
2026-02-21T08:06:43.9859704Z Using CPython 3.12.12 interpreter at: /__w/_tool/Python/3.12.12/x64/bin/python3.12
2026-02-21T08:06:43.9860558Z Creating virtual environment at: .venv
2026-02-21T08:06:43.9860937Z Activate with: source .venv/bin/activate
2026-02-21T08:06:43.9933966Z ##[group]Run source .venv/bin/activate
2026-02-21T08:06:43.9934291Z source .venv/bin/activate
2026-02-21T08:06:43.9934656Z uv pip install -U "torch==2.9.*" --index-url https://download.pytorch.org/whl/cu130
2026-02-21T08:06:43.9935108Z shell: bash -l {0}
2026-02-21T08:06:43.9935339Z env:
2026-02-21T08:06:43.9935519Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:43.9935802Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:43.9936102Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:06:43.9936410Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:43.9936666Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:43.9936986Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:43.9937417Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:06:43.9937946Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:06:43.9938205Z ##[endgroup]
2026-02-21T08:06:44.6154029Z Resolved 26 packages in 528ms
2026-02-21T08:06:44.6226361Z Downloading networkx (2.0MiB)
2026-02-21T08:06:44.6267518Z Downloading sympy (6.0MiB)
2026-02-21T08:06:44.6270212Z Downloading nvidia-cufft (204.2MiB)
2026-02-21T08:06:44.6307558Z Downloading nvidia-cuda-cupti (10.2MiB)
2026-02-21T08:06:44.6421845Z Downloading nvidia-cuda-runtime (2.1MiB)
2026-02-21T08:06:44.6500628Z Downloading nvidia-cusolver (184.5MiB)
2026-02-21T08:06:44.6505596Z Downloading nvidia-curand (56.8MiB)
2026-02-21T08:06:44.6507252Z Downloading nvidia-cufile (1.2MiB)
2026-02-21T08:06:44.6512194Z Downloading nvidia-nvjitlink (38.8MiB)
2026-02-21T08:06:44.6516450Z Downloading nvidia-cudnn-cu13 (332.4MiB)
2026-02-21T08:06:44.6518018Z Downloading nvidia-cusparse (133.8MiB)
2026-02-21T08:06:44.6518328Z Downloading triton (162.6MiB)
2026-02-21T08:06:44.6518587Z Downloading torch (584.2MiB)
2026-02-21T08:06:44.6600300Z Downloading nvidia-nvshmem-cu13 (57.6MiB)
2026-02-21T08:06:44.6670168Z Downloading nvidia-cusparselt-cu13 (162.0MiB)
2026-02-21T08:06:44.6794128Z Downloading nvidia-cuda-nvrtc (86.0MiB)
2026-02-21T08:06:44.6818942Z Downloading nvidia-nccl-cu13 (184.9MiB)
2026-02-21T08:06:44.6938861Z Downloading nvidia-cublas (400.0MiB)
2026-02-21T08:06:45.0604659Z  Downloaded nvidia-cufile
2026-02-21T08:06:45.1960745Z  Downloaded nvidia-cuda-runtime
2026-02-21T08:06:45.7926215Z  Downloaded networkx
2026-02-21T08:06:46.2374903Z  Downloaded nvidia-cuda-cupti
2026-02-21T08:06:47.5020805Z  Downloaded sympy
2026-02-21T08:06:47.6670899Z  Downloaded triton
2026-02-21T08:06:48.6856059Z  Downloaded nvidia-nvjitlink
2026-02-21T08:06:49.6266982Z  Downloaded nvidia-curand
2026-02-21T08:06:49.8723445Z  Downloaded nvidia-nvshmem-cu13
2026-02-21T08:06:50.9419276Z  Downloaded nvidia-cuda-nvrtc
2026-02-21T08:06:52.1602073Z  Downloaded nvidia-cusparse
2026-02-21T08:06:52.2649528Z  Downloaded nvidia-cufft
2026-02-21T08:06:52.6007897Z  Downloaded nvidia-cusparselt-cu13
2026-02-21T08:06:52.7733060Z  Downloaded nvidia-cusolver
2026-02-21T08:06:53.0037862Z  Downloaded nvidia-nccl-cu13
2026-02-21T08:06:54.0746405Z  Downloaded nvidia-cudnn-cu13
2026-02-21T08:06:54.5562613Z  Downloaded nvidia-cublas
2026-02-21T08:07:00.3320083Z  Downloaded torch
2026-02-21T08:07:00.3328180Z Prepared 26 packages in 15.71s
2026-02-21T08:07:00.3358578Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:07:00.3359259Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:07:00.3359855Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:07:01.9071449Z Installed 26 packages in 1.57s
2026-02-21T08:07:01.9073288Z  + filelock==3.20.0
2026-02-21T08:07:01.9073766Z  + fsspec==2025.12.0
2026-02-21T08:07:01.9073990Z  + jinja2==3.1.6
2026-02-21T08:07:01.9074271Z  + markupsafe==3.0.2
2026-02-21T08:07:01.9075081Z  + mpmath==1.3.0
2026-02-21T08:07:01.9075309Z  + networkx==3.6.1
2026-02-21T08:07:01.9080699Z  + nvidia-cublas==13.0.0.19
2026-02-21T08:07:01.9084778Z  + nvidia-cuda-cupti==13.0.48
2026-02-21T08:07:01.9086567Z  + nvidia-cuda-nvrtc==13.0.48
2026-02-21T08:07:01.9086916Z  + nvidia-cuda-runtime==13.0.48
2026-02-21T08:07:01.9087415Z  + nvidia-cudnn-cu13==9.13.0.50
2026-02-21T08:07:01.9087713Z  + nvidia-cufft==12.0.0.15
2026-02-21T08:07:01.9087951Z  + nvidia-cufile==1.15.0.42
2026-02-21T08:07:01.9088321Z  + nvidia-curand==10.4.0.35
2026-02-21T08:07:01.9088572Z  + nvidia-cusolver==12.0.3.29
2026-02-21T08:07:01.9088863Z  + nvidia-cusparse==12.6.2.49
2026-02-21T08:07:01.9089197Z  + nvidia-cusparselt-cu13==0.8.0
2026-02-21T08:07:01.9089497Z  + nvidia-nccl-cu13==2.27.7
2026-02-21T08:07:01.9089730Z  + nvidia-nvjitlink==13.0.39
2026-02-21T08:07:01.9090065Z  + nvidia-nvshmem-cu13==3.3.24
2026-02-21T08:07:01.9090363Z  + nvidia-nvtx==13.0.39
2026-02-21T08:07:01.9090602Z  + setuptools==70.2.0
2026-02-21T08:07:01.9090919Z  + sympy==1.14.0
2026-02-21T08:07:01.9091141Z  + torch==2.9.1+cu130
2026-02-21T08:07:01.9091404Z  + triton==3.5.1
2026-02-21T08:07:01.9091736Z  + typing-extensions==4.15.0
2026-02-21T08:07:01.9190058Z ##[group]Run source .venv/bin/activate
2026-02-21T08:07:01.9190365Z source .venv/bin/activate
2026-02-21T08:07:01.9190730Z SETUPTOOLS_SCM_PRETEND_VERSION="0.0.0" uv pip install -e .'[dev]'
2026-02-21T08:07:01.9191120Z python -c "import helion; print(helion.__name__)"
2026-02-21T08:07:01.9191686Z shell: bash -l {0}
2026-02-21T08:07:01.9191924Z env:
2026-02-21T08:07:01.9192131Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:07:01.9192651Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:07:01.9193089Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:07:01.9193542Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:07:01.9193816Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:07:01.9194113Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:07:01.9194671Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:07:01.9195116Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:07:01.9195459Z ##[endgroup]
2026-02-21T08:07:02.9027396Z Resolved 30 packages in 882ms
2026-02-21T08:07:02.9041978Z    Building helion @ file:///__w/helion/helion
2026-02-21T08:07:02.9226899Z Downloading pygments (1.2MiB)
2026-02-21T08:07:02.9231896Z Downloading virtualenv (5.6MiB)
2026-02-21T08:07:02.9232229Z Downloading numpy (15.8MiB)
2026-02-21T08:07:02.9427522Z Downloading scikit-learn (8.5MiB)
2026-02-21T08:07:02.9449936Z Downloading scipy (33.4MiB)
2026-02-21T08:07:03.0455791Z       Built helion @ file:///__w/helion/helion
2026-02-21T08:07:03.2235212Z  Downloaded virtualenv
2026-02-21T08:07:03.2363005Z  Downloaded pygments
2026-02-21T08:07:03.9617187Z  Downloaded scikit-learn
2026-02-21T08:07:03.9621527Z  Downloaded numpy
2026-02-21T08:07:04.4162150Z  Downloaded scipy
2026-02-21T08:07:04.4170531Z Prepared 27 packages in 1.51s
2026-02-21T08:07:04.4174763Z Uninstalled 1 package in 0.37ms
2026-02-21T08:07:04.4184545Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:07:04.4186308Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:07:04.4186929Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:07:05.0852052Z Installed 29 packages in 667ms
2026-02-21T08:07:05.0852446Z  + cfgv==3.5.0
2026-02-21T08:07:05.0857108Z  + distlib==0.4.0
2026-02-21T08:07:05.0859015Z  + expecttest==0.3.0
2026-02-21T08:07:05.0859369Z  + filecheck==1.0.3
2026-02-21T08:07:05.0859753Z  - filelock==3.20.0
2026-02-21T08:07:05.0859989Z  + filelock==3.24.3
2026-02-21T08:07:05.0860266Z  + helion==0.0.0 (from file:///__w/helion/helion)
2026-02-21T08:07:05.0860517Z  + hypothesis==6.151.9
2026-02-21T08:07:05.0860810Z  + identify==2.6.16
2026-02-21T08:07:05.0861000Z  + iniconfig==2.3.0
2026-02-21T08:07:05.0861229Z  + joblib==1.5.3
2026-02-21T08:07:05.0861681Z  + markdown-it-py==4.0.0
2026-02-21T08:07:05.0861925Z  + mdurl==0.1.2
2026-02-21T08:07:05.0862132Z  + nodeenv==1.10.0
2026-02-21T08:07:05.0862378Z  + numpy==2.4.2
2026-02-21T08:07:05.0862595Z  + packaging==26.0
2026-02-21T08:07:05.0862795Z  + platformdirs==4.9.2
2026-02-21T08:07:05.0863040Z  + pluggy==1.6.0
2026-02-21T08:07:05.0863219Z  + pre-commit==4.5.1
2026-02-21T08:07:05.0863438Z  + psutil==7.2.2
2026-02-21T08:07:05.0863625Z  + pygments==2.19.2
2026-02-21T08:07:05.0863852Z  + pytest==9.0.2
2026-02-21T08:07:05.0864033Z  + pytest-timeout==2.4.0
2026-02-21T08:07:05.0864274Z  + pyyaml==6.0.3
2026-02-21T08:07:05.0864487Z  + rich==14.3.3
2026-02-21T08:07:05.0864673Z  + scikit-learn==1.8.0
2026-02-21T08:07:05.0864896Z  + scipy==1.17.0
2026-02-21T08:07:05.0865101Z  + sortedcontainers==2.4.0
2026-02-21T08:07:05.0865348Z  + threadpoolctl==3.6.0
2026-02-21T08:07:05.0865537Z  + virtualenv==20.38.0
2026-02-21T08:07:17.1532627Z helion
2026-02-21T08:07:18.0976091Z ##[group]Run set -x
2026-02-21T08:07:18.0976321Z set -x
2026-02-21T08:07:18.0976584Z source .venv/bin/activate
2026-02-21T08:07:18.0976859Z uv pip install pip
2026-02-21T08:07:18.0977098Z uv pip install quack-kernels --no-deps
2026-02-21T08:07:18.0977439Z mkdir -p benchmarks/ && pushd benchmarks/
2026-02-21T08:07:18.0977766Z git clone https://github.com/pytorch-labs/tritonbench/
2026-02-21T08:07:18.0978080Z pushd tritonbench/
2026-02-21T08:07:18.0978486Z git submodule update --init --recursive
2026-02-21T08:07:18.0978799Z uv pip install -r requirements.txt
2026-02-21T08:07:18.0979068Z python install.py --liger
2026-02-21T08:07:18.0979336Z uv pip install -e . --no-deps
2026-02-21T08:07:18.0979696Z popd
2026-02-21T08:07:18.0979882Z popd
2026-02-21T08:07:18.0980228Z shell: bash -l {0}
2026-02-21T08:07:18.0980423Z env:
2026-02-21T08:07:18.0980732Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:07:18.0981001Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:07:18.0981329Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:07:18.0981698Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:07:18.0981985Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:07:18.0982281Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:07:18.0982691Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:07:18.0983209Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:07:18.0983475Z ##[endgroup]
2026-02-21T08:07:21.8834694Z + source .venv/bin/activate
2026-02-21T08:07:21.8838986Z ++ '[' -z '' ']'
2026-02-21T08:07:21.8843061Z ++ '[' -n x ']'
2026-02-21T08:07:21.8846248Z ++ SCRIPT_PATH=.venv/bin/activate
2026-02-21T08:07:21.8847623Z ++ '[' .venv/bin/activate = /__w/_temp/043c16d0-a9f6-432c-a45e-a8ca6f064229.sh ']'
2026-02-21T08:07:21.8848041Z ++ deactivate nondestructive
2026-02-21T08:07:21.8848333Z ++ unset -f pydoc
2026-02-21T08:07:21.8848602Z ++ '[' -z '' ']'
2026-02-21T08:07:21.8848869Z ++ '[' -z '' ']'
2026-02-21T08:07:21.8849090Z ++ hash -r
2026-02-21T08:07:21.8849352Z ++ '[' -z '' ']'
2026-02-21T08:07:21.8849600Z ++ unset VIRTUAL_ENV
2026-02-21T08:07:21.8849882Z ++ unset VIRTUAL_ENV_PROMPT
2026-02-21T08:07:21.8850204Z ++ '[' '!' nondestructive = nondestructive ']'
2026-02-21T08:07:21.8850545Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv
2026-02-21T08:07:21.8850883Z ++ '[' linux-gnu = cygwin ']'
2026-02-21T08:07:21.8851141Z ++ '[' linux-gnu = msys ']'
2026-02-21T08:07:21.8851441Z ++ export VIRTUAL_ENV
2026-02-21T08:07:21.8851935Z ++ '[' -z '' ']'
2026-02-21T08:07:21.8852158Z ++ unset SCRIPT_PATH
2026-02-21T08:07:21.8852896Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T08:07:21.8854060Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T08:07:21.8854792Z ++ export PATH
2026-02-21T08:07:21.8855017Z ++ '[' xhelion '!=' x ']'
2026-02-21T08:07:21.8855236Z ++ VIRTUAL_ENV_PROMPT=helion
2026-02-21T08:07:21.8855495Z ++ export VIRTUAL_ENV_PROMPT
2026-02-21T08:07:21.8855705Z ++ '[' -z '' ']'
2026-02-21T08:07:21.8855908Z ++ '[' -z '' ']'
2026-02-21T08:07:21.8856077Z ++ _OLD_VIRTUAL_PS1=
2026-02-21T08:07:21.8856333Z ++ PS1='(helion) '
2026-02-21T08:07:21.8856513Z ++ export PS1
2026-02-21T08:07:21.8857418Z ++ alias pydoc
2026-02-21T08:07:21.8857658Z ++ true
2026-02-21T08:07:21.8857826Z ++ hash -r
2026-02-21T08:07:21.8858340Z + uv pip install pip
2026-02-21T08:07:21.9416721Z Resolved 1 package in 49ms
2026-02-21T08:07:21.9477670Z Downloading pip (1.7MiB)
2026-02-21T08:07:22.0072209Z  Downloaded pip
2026-02-21T08:07:22.0074227Z Prepared 1 package in 65ms
2026-02-21T08:07:22.0101765Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:07:22.0102337Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:07:22.0103356Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:07:22.3975199Z Installed 1 package in 389ms
2026-02-21T08:07:22.3977674Z  + pip==26.0.1
2026-02-21T08:07:22.4006096Z + uv pip install quack-kernels --no-deps
2026-02-21T08:07:22.4979596Z Resolved 1 package in 91ms
2026-02-21T08:07:22.5476188Z Prepared 1 package in 49ms
2026-02-21T08:07:22.5514613Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:07:22.5515287Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:07:22.5515877Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:07:22.5527082Z Installed 1 package in 5ms
2026-02-21T08:07:22.5527575Z  + quack-kernels==0.2.10
2026-02-21T08:07:22.5546929Z + mkdir -p benchmarks/
2026-02-21T08:07:22.5556600Z + pushd benchmarks/
2026-02-21T08:07:22.5556984Z + git clone https://github.com/pytorch-labs/tritonbench/
2026-02-21T08:07:22.5557437Z /__w/helion/helion/benchmarks /__w/helion/helion
2026-02-21T08:07:22.5568638Z Cloning into 'tritonbench'...
2026-02-21T08:07:25.4910170Z + pushd tritonbench/
2026-02-21T08:07:25.4910668Z /__w/helion/helion/benchmarks/tritonbench /__w/helion/helion/benchmarks /__w/helion/helion
2026-02-21T08:07:25.4911726Z + git submodule update --init --recursive
2026-02-21T08:07:25.8009916Z Submodule 'submodules/ThunderKittens' (https://github.com/HazyResearch/ThunderKittens.git) registered for path 'submodules/ThunderKittens'
2026-02-21T08:07:25.8527979Z Submodule 'submodules/aiter' (https://github.com/ROCm/aiter.git) registered for path 'submodules/aiter'
2026-02-21T08:07:26.1946880Z Submodule 'submodules/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/cutlass'
2026-02-21T08:07:26.3212008Z Submodule 'submodules/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'submodules/flash-attention'
2026-02-21T08:07:26.3348336Z Submodule 'submodules/generative-recommenders' (https://github.com/facebookresearch/generative-recommenders.git) registered for path 'submodules/generative-recommenders'
2026-02-21T08:07:26.3373116Z Submodule 'submodules/xformers' (https://github.com/facebookresearch/xformers.git) registered for path 'submodules/xformers'
2026-02-21T08:07:26.3395269Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/ThunderKittens'...
2026-02-21T08:07:31.1777253Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/aiter'...
2026-02-21T08:07:48.5620461Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/cutlass'...
2026-02-21T08:07:54.5535122Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention'...
2026-02-21T08:07:58.2049748Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/generative-recommenders'...
2026-02-21T08:08:01.2146603Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers'...
2026-02-21T08:08:05.9258402Z Submodule path 'submodules/ThunderKittens': checked out '25f7568450b412a1984a4f619fb28373df06fa1b'
2026-02-21T08:08:06.2188629Z Submodule path 'submodules/aiter': checked out '1f5b378dcc9d9b0bcd9456c8c767b7424a5e8190'
2026-02-21T08:08:06.4387316Z Submodule '3rdparty/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/aiter/3rdparty/composable_kernel'
2026-02-21T08:08:06.4413869Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/aiter/3rdparty/composable_kernel'...
2026-02-21T08:08:12.7430126Z Submodule path 'submodules/aiter/3rdparty/composable_kernel': checked out 'e31a7a4f29b371c32ea9daf9211b6ae1fed2fa40'
2026-02-21T08:08:13.2052710Z Submodule path 'submodules/cutlass': checked out 'ad7b2f5e84fcfa124cb02b91d5bd26d238c0459e'
2026-02-21T08:08:13.2792919Z Submodule path 'submodules/flash-attention': checked out '43375aab2893018dfb7950db1cfa623c14946ad6'
2026-02-21T08:08:13.2833520Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/flash-attention/csrc/composable_kernel'
2026-02-21T08:08:13.2865870Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/flash-attention/csrc/cutlass'
2026-02-21T08:08:13.2892847Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention/csrc/composable_kernel'...
2026-02-21T08:08:17.7259319Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention/csrc/cutlass'...
2026-02-21T08:08:21.4016728Z Submodule path 'submodules/flash-attention/csrc/composable_kernel': checked out 'e8709c24f403173ad21a2da907d1347957e324fb'
2026-02-21T08:08:21.8907552Z Submodule path 'submodules/flash-attention/csrc/cutlass': checked out 'b1d6e2c9b334dfa811e4183dfbd02419249e4b52'
2026-02-21T08:08:21.9162486Z Submodule path 'submodules/generative-recommenders': checked out '88512dbd71b053226bc4ef8ec1630e3db53e55e5'
2026-02-21T08:08:21.9180384Z Submodule 'generative_recommenders/ops/cpp/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass'
2026-02-21T08:08:21.9204201Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass'...
2026-02-21T08:08:25.9087037Z Submodule path 'submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass': checked out 'dc4817921edda44a549197ff3a9dcf5df0636e7b'
2026-02-21T08:08:25.9682457Z Submodule path 'submodules/xformers': checked out '8fc8ec5a4d6498ff81c0c418b89bbaf133ae3a44'
2026-02-21T08:08:25.9693055Z Submodule 'third_party/composable_kernel_tiled' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/xformers/third_party/composable_kernel_tiled'
2026-02-21T08:08:25.9701976Z Submodule 'third_party/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/xformers/third_party/cutlass'
2026-02-21T08:08:25.9702788Z Submodule 'third_party/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'submodules/xformers/third_party/flash-attention'
2026-02-21T08:08:25.9725483Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/composable_kernel_tiled'...
2026-02-21T08:08:29.8770373Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/cutlass'...
2026-02-21T08:08:33.5053746Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention'...
2026-02-21T08:08:34.4995515Z Submodule path 'submodules/xformers/third_party/composable_kernel_tiled': checked out '4f54fa30583704f34da2ac50372d524cae6bad7d'
2026-02-21T08:08:34.9312671Z Submodule path 'submodules/xformers/third_party/cutlass': checked out 'e9627ce55b42fd2599f58cd4396da9380954def0'
2026-02-21T08:08:34.9811196Z Submodule path 'submodules/xformers/third_party/flash-attention': checked out '979702c87a8713a8e0a5e9fee122b90d2ef13be5'
2026-02-21T08:08:34.9824880Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/xformers/third_party/flash-attention/csrc/composable_kernel'
2026-02-21T08:08:34.9829936Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/xformers/third_party/flash-attention/csrc/cutlass'
2026-02-21T08:08:34.9851749Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention/csrc/composable_kernel'...
2026-02-21T08:08:38.8817439Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention/csrc/cutlass'...
2026-02-21T08:08:43.0156774Z Submodule path 'submodules/xformers/third_party/flash-attention/csrc/composable_kernel': checked out '888317e698e9803c62bd38568abc9e05d7709f33'
2026-02-21T08:08:43.4349597Z Submodule path 'submodules/xformers/third_party/flash-attention/csrc/cutlass': checked out 'c506e16788cb08416a4a57e11a9067beeee29420'
2026-02-21T08:08:43.4389697Z + uv pip install -r requirements.txt
2026-02-21T08:08:43.4461056Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv
2026-02-21T08:08:43.6143613Z Resolved 30 packages in 167ms
2026-02-21T08:08:43.6251161Z Downloading pillow (6.7MiB)
2026-02-21T08:08:43.6251404Z Downloading hf-xet (3.2MiB)
2026-02-21T08:08:43.6251795Z Downloading kiwisolver (1.4MiB)
2026-02-21T08:08:43.6301777Z Downloading tokenizers (3.0MiB)
2026-02-21T08:08:43.6307472Z Downloading fonttools (4.7MiB)
2026-02-21T08:08:43.6311237Z Downloading matplotlib (8.3MiB)
2026-02-21T08:08:43.6315828Z Downloading transformers (10.3MiB)
2026-02-21T08:08:43.7889914Z  Downloaded kiwisolver
2026-02-21T08:08:43.8721030Z  Downloaded tokenizers
2026-02-21T08:08:43.8762875Z  Downloaded hf-xet
2026-02-21T08:08:44.0743288Z  Downloaded pillow
2026-02-21T08:08:44.1199624Z  Downloaded fonttools
2026-02-21T08:08:44.1812663Z  Downloaded matplotlib
2026-02-21T08:08:45.2564729Z  Downloaded transformers
2026-02-21T08:08:45.2576389Z Prepared 23 packages in 1.64s
2026-02-21T08:08:45.2612852Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:08:45.2613428Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:08:45.2614010Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:08:45.3284721Z Installed 23 packages in 70ms
2026-02-21T08:08:45.3289946Z  + certifi==2026.1.4
2026-02-21T08:08:45.3293699Z  + charset-normalizer==3.4.4
2026-02-21T08:08:45.3295879Z  + contourpy==1.3.3
2026-02-21T08:08:45.3296066Z  + cycler==0.12.1
2026-02-21T08:08:45.3296280Z  + fonttools==4.61.1
2026-02-21T08:08:45.3296434Z  + hf-xet==1.2.0
2026-02-21T08:08:45.3299770Z  + huggingface-hub==0.36.2
2026-02-21T08:08:45.3303734Z  + idna==3.11
2026-02-21T08:08:45.3305087Z  + kiwisolver==1.4.9
2026-02-21T08:08:45.3305287Z  + matplotlib==3.10.8
2026-02-21T08:08:45.3305449Z  + nvidia-ml-py==13.590.48
2026-02-21T08:08:45.3305622Z  + pillow==12.1.1
2026-02-21T08:08:45.3305780Z  + pyparsing==3.3.2
2026-02-21T08:08:45.3305944Z  + python-dateutil==2.9.0.post0
2026-02-21T08:08:45.3306117Z  + regex==2026.2.19
2026-02-21T08:08:45.3306255Z  + requests==2.32.5
2026-02-21T08:08:45.3306403Z  + safetensors==0.7.0
2026-02-21T08:08:45.3306545Z  + six==1.17.0
2026-02-21T08:08:45.3306688Z  + tabulate==0.9.0
2026-02-21T08:08:45.3306841Z  + tokenizers==0.21.4
2026-02-21T08:08:45.3306995Z  + tqdm==4.67.3
2026-02-21T08:08:45.3307136Z  + transformers==4.53.0
2026-02-21T08:08:45.3307293Z  + urllib3==2.6.3
2026-02-21T08:08:45.3380694Z + python install.py --liger
2026-02-21T08:08:49.1621525Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv
2026-02-21T08:08:49.1648762Z Audited 6 packages in 3ms
2026-02-21T08:08:49.6447306Z INFO:__main__:[tritonbench] installing liger-kernels...
2026-02-21T08:08:49.6515562Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv
2026-02-21T08:08:49.7473039Z Resolved 1 package in 94ms
2026-02-21T08:08:49.8123299Z Prepared 1 package in 65ms
2026-02-21T08:08:49.8169996Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:08:49.8170570Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:08:49.8171469Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:08:49.9358516Z Installed 1 package in 123ms
2026-02-21T08:08:49.9358939Z  + liger-kernel-nightly==0.7.0.dev20260219183429
2026-02-21T08:08:49.9396947Z INFO:__main__:[tritonbench] installation complete!
2026-02-21T08:08:50.4101482Z + uv pip install -e . --no-deps
2026-02-21T08:08:50.5533574Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv
2026-02-21T08:08:50.5570567Z Resolved 1 package in 2ms
2026-02-21T08:08:50.6112719Z    Building tritonbench @ file:///__w/helion/helion/benchmarks/tritonbench
2026-02-21T08:08:51.8192524Z       Built tritonbench @ file:///__w/helion/helion/benchmarks/tritonbench
2026-02-21T08:08:51.8207269Z Prepared 1 package in 1.26s
2026-02-21T08:08:51.8214554Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:08:51.8215097Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:08:51.8215557Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:08:51.8216227Z Installed 1 package in 0.54ms
2026-02-21T08:08:51.8216513Z  + tritonbench==0.0.1 (from file:///__w/helion/helion/benchmarks/tritonbench)
2026-02-21T08:08:52.0803694Z /__w/helion/helion/benchmarks /__w/helion/helion
2026-02-21T08:08:52.0803962Z /__w/helion/helion
2026-02-21T08:08:52.0805158Z + popd
2026-02-21T08:08:52.0805293Z + popd
2026-02-21T08:08:52.0858715Z ##[group]Run rm -rf /tmp/torchinductor_*/ || true
2026-02-21T08:08:52.0859040Z rm -rf /tmp/torchinductor_*/ || true
2026-02-21T08:08:52.0859235Z
2026-02-21T08:08:52.0859388Z source .venv/bin/activate
2026-02-21T08:08:52.0859559Z
2026-02-21T08:08:52.0859731Z TEST_REPORTS_DIR=$(pwd)/test/test-reports
2026-02-21T08:08:52.0859948Z mkdir -p "$TEST_REPORTS_DIR"
2026-02-21T08:08:52.0860142Z echo "$TEST_REPORTS_DIR"
2026-02-21T08:08:52.0860307Z
2026-02-21T08:08:52.0860450Z KERNEL_LIST="softmax"
2026-02-21T08:08:52.0860646Z for kernel in ${KERNEL_LIST//,/ }; do
2026-02-21T08:08:52.0860871Z   echo "=========================================="
2026-02-21T08:08:52.0861121Z   echo "Running benchmark for kernel: $kernel"
2026-02-21T08:08:52.0861354Z   echo "=========================================="
2026-02-21T08:08:52.0861608Z
2026-02-21T08:08:52.0861860Z   # Get available implementations and baseline for this kernel
2026-02-21T08:08:52.0862260Z   KERNEL_INFO=$(python benchmarks/run.py --list-impls-for-benchmark-ci --op $kernel | grep "^$kernel:")
2026-02-21T08:08:52.0862659Z   IMPLS=$(echo "$KERNEL_INFO" | sed -n 's/.*impls=\([^ ]*\).*/\1/p')
2026-02-21T08:08:52.0862980Z   BASELINE=$(echo "$KERNEL_INFO" | sed -n 's/.*baseline=\([^ ]*\).*/\1/p')
2026-02-21T08:08:52.0863220Z
2026-02-21T08:08:52.0863360Z   if [[ -z "$IMPLS" ]]; then
2026-02-21T08:08:52.0863620Z     echo "Warning: No implementations found for kernel $kernel, skipping..."
2026-02-21T08:08:52.0863889Z     continue
2026-02-21T08:08:52.0864029Z   fi
2026-02-21T08:08:52.0864179Z   if [[ -z "$BASELINE" ]]; then
2026-02-21T08:08:52.0864427Z     echo "Warning: No baseline found for kernel $kernel, skipping..."
2026-02-21T08:08:52.0864664Z     continue
2026-02-21T08:08:52.0864819Z   fi
2026-02-21T08:08:52.0864971Z   echo "Using baseline: $BASELINE"
2026-02-21T08:08:52.0865213Z   echo "Available implementations for $kernel: $IMPLS"
2026-02-21T08:08:52.0865423Z
2026-02-21T08:08:52.0865583Z   # Do autotuning but do not record the results
2026-02-21T08:08:52.0865791Z    python benchmarks/run.py \
2026-02-21T08:08:52.0865978Z       --op $kernel \
2026-02-21T08:08:52.0866161Z       --metrics speedup,accuracy \
2026-02-21T08:08:52.0866375Z       --latency-measure-mode triton_do_bench \
2026-02-21T08:08:52.0866586Z       --cudagraph \
2026-02-21T08:08:52.0866747Z       --only $IMPLS \
2026-02-21T08:08:52.0866946Z       --only-match-mode prefix-with-baseline \
2026-02-21T08:08:52.0867150Z       --baseline $BASELINE \
2026-02-21T08:08:52.0867327Z       --atol 1e-2 \
2026-02-21T08:08:52.0867501Z       --rtol 1e-2 \
2026-02-21T08:08:52.0867835Z       --input-sample-mode equally-spaced-k \
2026-02-21T08:08:52.0868041Z       --keep-going \
2026-02-21T08:08:52.0868192Z
2026-02-21T08:08:52.0868326Z
2026-02-21T08:08:52.0868451Z   # Relax the GPU
2026-02-21T08:08:52.0868612Z   sleep 2m
2026-02-21T08:08:52.0868746Z
2026-02-21T08:08:52.0868906Z   # Run again with cache and record results
2026-02-21T08:08:52.0869208Z    HELION_PRINT_OUTPUT_CODE=1 HELION_ASSERT_CACHE_HIT=1 python benchmarks/run.py \
2026-02-21T08:08:52.0869487Z       --op $kernel \
2026-02-21T08:08:52.0869672Z       --metrics speedup,accuracy \
2026-02-21T08:08:52.0869890Z       --latency-measure-mode triton_do_bench \
2026-02-21T08:08:52.0870092Z       --cudagraph \
2026-02-21T08:08:52.0870243Z       --only $IMPLS \
2026-02-21T08:08:52.0870552Z       --only-match-mode prefix-with-baseline \
2026-02-21T08:08:52.0870760Z       --baseline $BASELINE \
2026-02-21T08:08:52.0870935Z       --atol 1e-2 \
2026-02-21T08:08:52.0871091Z       --rtol 1e-2 \
2026-02-21T08:08:52.0871272Z       --input-sample-mode equally-spaced-k \
2026-02-21T08:08:52.0871513Z       --output "$TEST_REPORTS_DIR/helionbench.json" \
2026-02-21T08:08:52.0871766Z       --append-to-output \
2026-02-21T08:08:52.0871950Z       --keep-going \
2026-02-21T08:08:52.0872102Z
2026-02-21T08:08:52.0872233Z
2026-02-21T08:08:52.0872409Z   echo "✅ Completed benchmark for kernel: $kernel"
2026-02-21T08:08:52.0872614Z done
2026-02-21T08:08:52.0872748Z
2026-02-21T08:08:52.0872918Z if [[ ! -s "$TEST_REPORTS_DIR/helionbench.json" ]]; then
2026-02-21T08:08:52.0873173Z   echo "❌ helionbench.json is missing or empty"
2026-02-21T08:08:52.0873367Z   exit 1
2026-02-21T08:08:52.0873512Z fi
2026-02-21T08:08:52.0873670Z cat "$TEST_REPORTS_DIR/helionbench.json"
2026-02-21T08:08:52.0873988Z shell: bash -l {0}
2026-02-21T08:08:52.0874125Z env:
2026-02-21T08:08:52.0874264Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:08:52.0874470Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:08:52.0874708Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:08:52.0874954Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:08:52.0875169Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:08:52.0875399Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:08:52.0875762Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:08:52.0876141Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:08:52.0876368Z ##[endgroup]
2026-02-21T08:08:52.1485435Z /__w/helion/helion/test/test-reports
2026-02-21T08:08:52.1485779Z ==========================================
2026-02-21T08:08:52.1486094Z Running benchmark for kernel: softmax
2026-02-21T08:08:52.1486374Z ==========================================
2026-02-21T08:08:56.9200510Z Using baseline: naive_softmax
2026-02-21T08:08:56.9205871Z Available implementations for softmax: helion_softmax_tritonbench,torch_compile_softmax,triton_softmax
2026-02-21T08:09:02.1568299Z Using num_inputs=20 for softmax
2026-02-21T08:09:02.2028533Z Running softmax benchmark with Helion implementation...
2026-02-21T08:09:02.2029231Z 
2026-02-21T08:09:02.4426258Z Equally-spaced-k mode: Selected 20 equally spaced inputs (total available: 98)
2026-02-21T08:09:02.4430959Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 5, 10, 15, 20, 26, 31, 36, 41, 46, 51, 56, 61, 66, 71, 77, 82, 87, 92, 97]
2026-02-21T08:09:02.4435171Z 
2026-02-21T08:09:02.4442277Z   0%|          | 0/20 [00:00<?, ?it/s]WARNING:tritonbench.utils.triton_op:Running input ID 0:
2026-02-21T08:09:02.4442705Z (M, N)
2026-02-21T08:09:02.4443364Z -----------
2026-02-21T08:09:02.4443596Z (4096, 256)
2026-02-21T08:09:02.4445227Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T08:09:04.0872767Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T08:09:05.9978589Z INFO:tritonbench.utils.triton_op:Took 129.91ms to get benchmark function for torch_compile_softmax
2026-02-21T08:09:07.8878300Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:09:07.8882226Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:09:07.8883862Z               'dtype': 'torch.float16',
2026-02-21T08:09:07.8884104Z               'shape': (4096, 256),
2026-02-21T08:09:07.8884284Z               'stride': (256, 1)},),
2026-02-21T08:09:07.8884471Z   'kwargs': {}}
2026-02-21T08:09:07.8884764Z INFO:tritonbench.utils.triton_op:Took 0.48ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T08:09:08.1211437Z [0s] Autotune random seed: 2134816249
2026-02-21T08:09:08.3949666Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:09:41.2514447Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s
2026-02-21T08:09:46.0213980Z module attributes {ttg.maxnreg = 64 : i32} {
2026-02-21T08:09:46.0216846Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:09:46.0217630Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:09:46.0217898Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:09:46.0218149Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:09:46.0218394Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:09:46.0218678Z     %cst = arith.constant dense<256> : tensor<512x1xi32>
2026-02-21T08:09:46.0219025Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<512xf32>
2026-02-21T08:09:46.0219443Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<512xf32>
2026-02-21T08:09:46.0219774Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:09:46.0220023Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:09:46.0220266Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:09:46.0220518Z     %c256_i64 = arith.constant 256 : i64
2026-02-21T08:09:46.0220763Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:09:46.0221205Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c256_i32], [%c256_i64, %c1_i64] : <f16>, <tensor<512x16xf16>>
2026-02-21T08:09:46.0221903Z     %1 = tt.get_program_id x : i32
2026-02-21T08:09:46.0222143Z     %2 = arith.addi %1, %c1_i32 : i32
2026-02-21T08:09:46.0222389Z     %3 = arith.minsi %2, %c8_i32 : i32
2026-02-21T08:09:46.0222648Z     scf.for %arg2 = %1 to %3 step %c1_i32  : i32 {
2026-02-21T08:09:46.0222936Z       %4 = arith.muli %arg2, %c512_i32 : i32
2026-02-21T08:09:46.0223262Z       %5 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:09:46.0223638Z       %6 = tt.splat %4 : i32 -> tensor<512xi32>
2026-02-21T08:09:46.0223909Z       %7 = arith.addi %6, %5 : tensor<512xi32>
2026-02-21T08:09:46.0224175Z       %c240_i32 = arith.constant 240 : i32
2026-02-21T08:09:46.0224423Z       %c48_i32 = arith.constant 48 : i32
2026-02-21T08:09:46.0224931Z       %8:2 = scf.for %arg3 = %c0_i32 to %c240_i32 step %c48_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<512xf32>, tensor<512xf32>)  : i32 {
2026-02-21T08:09:46.0225485Z         %59 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:09:46.0225846Z         %60 = tt.splat %arg3 : i32 -> tensor<16xi32>
2026-02-21T08:09:46.0226121Z         %61 = arith.addi %60, %59 : tensor<16xi32>
2026-02-21T08:09:46.0226475Z         %62 = tt.expand_dims %7 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32>
2026-02-21T08:09:46.0226850Z         %63 = arith.muli %62, %cst : tensor<512x1xi32>
2026-02-21T08:09:46.0227207Z         %64 = tt.expand_dims %61 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32>
2026-02-21T08:09:46.0228110Z         %65 = tt.broadcast %63 : tensor<512x1xi32> -> tensor<512x16xi32>
2026-02-21T08:09:46.0228479Z         %66 = tt.broadcast %64 : tensor<1x16xi32> -> tensor<512x16xi32>
2026-02-21T08:09:46.0228809Z         %67 = arith.addi %65, %66 : tensor<512x16xi32>
2026-02-21T08:09:46.0229132Z         %68 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<512x16x!tt.ptr<f16>>
2026-02-21T08:09:46.0229525Z         %69 = tt.addptr %68, %67 : tensor<512x16x!tt.ptr<f16>>, tensor<512x16xi32>
2026-02-21T08:09:46.0229882Z         %70 = tt.load %69 : tensor<512x16x!tt.ptr<f16>>
2026-02-21T08:09:46.0230202Z         %71 = arith.extf %70 : tensor<512x16xf16> to tensor<512x16xf32>
2026-02-21T08:09:46.0230517Z         %72 = "tt.reduce"(%71) <{axis = 1 : i32}> ({
2026-02-21T08:09:46.0230775Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:46.0231031Z           %150 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:09:46.0231287Z           tt.reduce.return %150 : f32
2026-02-21T08:09:46.0231777Z         }) : (tensor<512x16xf32>) -> tensor<512xf32>
2026-02-21T08:09:46.0232084Z         %73 = arith.truncf %72 : tensor<512xf32> to tensor<512xf16>
2026-02-21T08:09:46.0232416Z         %74 = arith.extf %73 : tensor<512xf16> to tensor<512xf32>
2026-02-21T08:09:46.0232725Z         %75 = arith.cmpf ogt, %arg4, %74 : tensor<512xf32>
2026-02-21T08:09:46.0233037Z         %76 = arith.cmpf une, %arg4, %arg4 : tensor<512xf32>
2026-02-21T08:09:46.0233330Z         %77 = arith.ori %75, %76 : tensor<512xi1>
2026-02-21T08:09:46.0233641Z         %78 = arith.select %77, %arg4, %74 : tensor<512xi1>, tensor<512xf32>
2026-02-21T08:09:46.0233989Z         %79 = arith.subf %arg4, %78 : tensor<512xf32>
2026-02-21T08:09:46.0234474Z         %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512xf32>) -> tensor<512xf32>
2026-02-21T08:09:46.0234970Z         %81 = arith.mulf %arg5, %80 : tensor<512xf32>
2026-02-21T08:09:46.0235307Z         %82 = tt.expand_dims %78 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32>
2026-02-21T08:09:46.0235706Z         %83 = tt.broadcast %82 : tensor<512x1xf32> -> tensor<512x16xf32>
2026-02-21T08:09:46.0236028Z         %84 = arith.subf %71, %83 : tensor<512x16xf32>
2026-02-21T08:09:46.0236515Z         %85 = tt.extern_elementwise %84 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x16xf32>) -> tensor<512x16xf32>
2026-02-21T08:09:46.0237005Z         %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({
2026-02-21T08:09:46.0237252Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:46.0237497Z           %150 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:09:46.0237737Z           tt.reduce.return %150 : f32
2026-02-21T08:09:46.0237990Z         }) : (tensor<512x16xf32>) -> tensor<512xf32>
2026-02-21T08:09:46.0238249Z         %87 = arith.addf %81, %86 : tensor<512xf32>
2026-02-21T08:09:46.0238493Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T08:09:46.0238741Z         %88 = arith.muli %c16_i32, %c1_i32_4 : i32
2026-02-21T08:09:46.0238986Z         %89 = arith.addi %arg3, %88 : i32
2026-02-21T08:09:46.0239278Z         %90 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:09:46.0239587Z         %91 = tt.splat %89 : i32 -> tensor<16xi32>
2026-02-21T08:09:46.0239838Z         %92 = arith.addi %91, %90 : tensor<16xi32>
2026-02-21T08:09:46.0240156Z         %93 = tt.expand_dims %7 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32>
2026-02-21T08:09:46.0240490Z         %94 = arith.muli %93, %cst : tensor<512x1xi32>
2026-02-21T08:09:46.0240809Z         %95 = tt.expand_dims %92 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32>
2026-02-21T08:09:46.0241172Z         %96 = tt.broadcast %94 : tensor<512x1xi32> -> tensor<512x16xi32>
2026-02-21T08:09:46.0241507Z         %97 = tt.broadcast %95 : tensor<1x16xi32> -> tensor<512x16xi32>
2026-02-21T08:09:46.0241845Z         %98 = arith.addi %96, %97 : tensor<512x16xi32>
2026-02-21T08:09:46.0242158Z         %99 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<512x16x!tt.ptr<f16>>
2026-02-21T08:09:46.0242621Z         %100 = tt.addptr %99, %98 : tensor<512x16x!tt.ptr<f16>>, tensor<512x16xi32>
2026-02-21T08:09:46.0242960Z         %101 = tt.load %100 : tensor<512x16x!tt.ptr<f16>>
2026-02-21T08:09:46.0243281Z         %102 = arith.extf %101 : tensor<512x16xf16> to tensor<512x16xf32>
2026-02-21T08:09:46.0243586Z         %103 = "tt.reduce"(%102) <{axis = 1 : i32}> ({
2026-02-21T08:09:46.0243842Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:46.0244081Z           %150 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:09:46.0244333Z           tt.reduce.return %150 : f32
2026-02-21T08:09:46.0244589Z         }) : (tensor<512x16xf32>) -> tensor<512xf32>
2026-02-21T08:09:46.0244882Z         %104 = arith.truncf %103 : tensor<512xf32> to tensor<512xf16>
2026-02-21T08:09:46.0245221Z         %105 = arith.extf %104 : tensor<512xf16> to tensor<512xf32>
2026-02-21T08:09:46.0245529Z         %106 = arith.cmpf ogt, %78, %105 : tensor<512xf32>
2026-02-21T08:09:46.0245901Z         %107 = arith.cmpf une, %78, %78 : tensor<512xf32>
2026-02-21T08:09:46.0246169Z         %108 = arith.ori %106, %107 : tensor<512xi1>
2026-02-21T08:09:46.0246474Z         %109 = arith.select %108, %78, %105 : tensor<512xi1>, tensor<512xf32>
2026-02-21T08:09:46.0246788Z         %110 = arith.subf %78, %109 : tensor<512xf32>
2026-02-21T08:09:46.0247253Z         %111 = tt.extern_elementwise %110 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512xf32>) -> tensor<512xf32>
2026-02-21T08:09:46.0247718Z         %112 = arith.mulf %87, %111 : tensor<512xf32>
2026-02-21T08:09:46.0248043Z         %113 = tt.expand_dims %109 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32>
2026-02-21T08:09:46.0248431Z         %114 = tt.broadcast %113 : tensor<512x1xf32> -> tensor<512x16xf32>
2026-02-21T08:09:46.0248751Z         %115 = arith.subf %102, %114 : tensor<512x16xf32>
2026-02-21T08:09:46.0249228Z         %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x16xf32>) -> tensor<512x16xf32>
2026-02-21T08:09:46.0249708Z         %117 = "tt.reduce"(%116) <{axis = 1 : i32}> ({
2026-02-21T08:09:46.0249947Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:46.0250179Z           %150 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:09:46.0250414Z           tt.reduce.return %150 : f32
2026-02-21T08:09:46.0250656Z         }) : (tensor<512x16xf32>) -> tensor<512xf32>
2026-02-21T08:09:46.0250954Z         %118 = arith.addf %112, %117 : tensor<512xf32>
2026-02-21T08:09:46.0251203Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:09:46.0251450Z         %119 = arith.muli %c16_i32, %c2_i32 : i32
2026-02-21T08:09:46.0251743Z         %120 = arith.addi %arg3, %119 : i32
2026-02-21T08:09:46.0252035Z         %121 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:09:46.0252358Z         %122 = tt.splat %120 : i32 -> tensor<16xi32>
2026-02-21T08:09:46.0252621Z         %123 = arith.addi %122, %121 : tensor<16xi32>
2026-02-21T08:09:46.0252956Z         %124 = tt.expand_dims %7 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32>
2026-02-21T08:09:46.0253303Z         %125 = arith.muli %124, %cst : tensor<512x1xi32>
2026-02-21T08:09:46.0253636Z         %126 = tt.expand_dims %123 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32>
2026-02-21T08:09:46.0254045Z         %127 = tt.broadcast %125 : tensor<512x1xi32> -> tensor<512x16xi32>
2026-02-21T08:09:46.0254400Z         %128 = tt.broadcast %126 : tensor<1x16xi32> -> tensor<512x16xi32>
2026-02-21T08:09:46.0254729Z         %129 = arith.addi %127, %128 : tensor<512x16xi32>
2026-02-21T08:09:46.0255070Z         %130 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<512x16x!tt.ptr<f16>>
2026-02-21T08:09:46.0255436Z         %131 = tt.addptr %130, %129 : tensor<512x16x!tt.ptr<f16>>, tensor<512x16xi32>
2026-02-21T08:09:46.0255751Z         %132 = tt.load %131 : tensor<512x16x!tt.ptr<f16>>
2026-02-21T08:09:46.0256042Z         %133 = arith.extf %132 : tensor<512x16xf16> to tensor<512x16xf32>
2026-02-21T08:09:46.0256334Z         %134 = "tt.reduce"(%133) <{axis = 1 : i32}> ({
2026-02-21T08:09:46.0256630Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:46.0256856Z           %150 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:09:46.0257085Z           tt.reduce.return %150 : f32
2026-02-21T08:09:46.0257334Z         }) : (tensor<512x16xf32>) -> tensor<512xf32>
2026-02-21T08:09:46.0257667Z         %135 = arith.truncf %134 : tensor<512xf32> to tensor<512xf16>
2026-02-21T08:09:46.0258010Z         %136 = arith.extf %135 : tensor<512xf16> to tensor<512xf32>
2026-02-21T08:09:46.0258347Z         %137 = arith.cmpf ogt, %109, %136 : tensor<512xf32>
2026-02-21T08:09:46.0258644Z         %138 = arith.cmpf une, %109, %109 : tensor<512xf32>
2026-02-21T08:09:46.0258950Z         %139 = arith.ori %137, %138 : tensor<512xi1>
2026-02-21T08:09:46.0259302Z         %140 = arith.select %139, %109, %136 : tensor<512xi1>, tensor<512xf32>
2026-02-21T08:09:46.0259645Z         %141 = arith.subf %109, %140 : tensor<512xf32>
2026-02-21T08:09:46.0260228Z         %142 = tt.extern_elementwise %141 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512xf32>) -> tensor<512xf32>
2026-02-21T08:09:46.0260728Z         %143 = arith.mulf %118, %142 : tensor<512xf32>
2026-02-21T08:09:46.0261040Z         %144 = tt.expand_dims %140 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32>
2026-02-21T08:09:46.0261401Z         %145 = tt.broadcast %144 : tensor<512x1xf32> -> tensor<512x16xf32>
2026-02-21T08:09:46.0261729Z         %146 = arith.subf %133, %145 : tensor<512x16xf32>
2026-02-21T08:09:46.0262183Z         %147 = tt.extern_elementwise %146 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x16xf32>) -> tensor<512x16xf32>
2026-02-21T08:09:46.0262636Z         %148 = "tt.reduce"(%147) <{axis = 1 : i32}> ({
2026-02-21T08:09:46.0262864Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:46.0267841Z           %150 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:09:46.0268095Z           tt.reduce.return %150 : f32
2026-02-21T08:09:46.0268370Z         }) : (tensor<512x16xf32>) -> tensor<512xf32>
2026-02-21T08:09:46.0268632Z         %149 = arith.addf %143, %148 : tensor<512xf32>
2026-02-21T08:09:46.0268916Z         scf.yield %140, %149 : tensor<512xf32>, tensor<512xf32>
2026-02-21T08:09:46.0269159Z       }
2026-02-21T08:09:46.0269392Z       %9 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:09:46.0269705Z       %10 = tt.splat %c240_i32 : i32 -> tensor<16xi32>
2026-02-21T08:09:46.0269952Z       %11 = arith.addi %10, %9 : tensor<16xi32>
2026-02-21T08:09:46.0270265Z       %12 = tt.expand_dims %7 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32>
2026-02-21T08:09:46.0270589Z       %13 = arith.muli %12, %cst : tensor<512x1xi32>
2026-02-21T08:09:46.0270899Z       %14 = tt.expand_dims %11 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32>
2026-02-21T08:09:46.0271245Z       %15 = tt.broadcast %13 : tensor<512x1xi32> -> tensor<512x16xi32>
2026-02-21T08:09:46.0271642Z       %16 = tt.broadcast %14 : tensor<1x16xi32> -> tensor<512x16xi32>
2026-02-21T08:09:46.0271939Z       %17 = arith.addi %15, %16 : tensor<512x16xi32>
2026-02-21T08:09:46.0272225Z       %18 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<512x16x!tt.ptr<f16>>
2026-02-21T08:09:46.0272569Z       %19 = tt.addptr %18, %17 : tensor<512x16x!tt.ptr<f16>>, tensor<512x16xi32>
2026-02-21T08:09:46.0272877Z       %20 = tt.load %19 : tensor<512x16x!tt.ptr<f16>>
2026-02-21T08:09:46.0273162Z       %21 = arith.extf %20 : tensor<512x16xf16> to tensor<512x16xf32>
2026-02-21T08:09:46.0273434Z       %22 = "tt.reduce"(%21) <{axis = 1 : i32}> ({
2026-02-21T08:09:46.0273748Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:09:46.0273977Z         %59 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:09:46.0274205Z         tt.reduce.return %59 : f32
2026-02-21T08:09:46.0274436Z       }) : (tensor<512x16xf32>) -> tensor<512xf32>
2026-02-21T08:09:46.0274714Z       %23 = arith.truncf %22 : tensor<512xf32> to tensor<512xf16>
2026-02-21T08:09:46.0275022Z       %24 = arith.extf %23 : tensor<512xf16> to tensor<512xf32>
2026-02-21T08:09:46.0275545Z       %25 = arith.cmpf ogt, %8#0, %24 : tensor<512xf32>
2026-02-21T08:09:46.0275813Z       %26 = arith.cmpf une, %8#0, %8#0 : tensor<512xf32>
2026-02-21T08:09:46.0276069Z       %27 = arith.ori %25, %26 : tensor<512xi1>
2026-02-21T08:09:46.0276343Z       %28 = arith.select %27, %8#0, %24 : tensor<512xi1>, tensor<512xf32>
2026-02-21T08:09:46.0276633Z       %29 = arith.subf %8#0, %28 : tensor<512xf32>
2026-02-21T08:09:46.0277067Z       %30 = tt.extern_elementwise %29 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512xf32>) -> tensor<512xf32>
2026-02-21T08:09:46.0277505Z       %31 = arith.mulf %8#1, %30 : tensor<512xf32>
2026-02-21T08:09:46.0277800Z       %32 = tt.expand_dims %28 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32>
2026-02-21T08:09:46.0278153Z       %33 = tt.broadcast %32 : tensor<512x1xf32> -> tensor<512x16xf32>
2026-02-21T08:09:46.0278443Z       %34 = arith.subf %21, %33 : tensor<512x16xf32>
2026-02-21T08:09:46.0278956Z       %35 = tt.extern_elementwise %34 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x16xf32>) -> tensor<512x16xf32>
2026-02-21T08:09:46.0279401Z       %36 = "tt.reduce"(%35) <{axis = 1 : i32}> ({
2026-02-21T08:09:46.0279627Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:09:46.0279855Z         %59 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:09:46.0280085Z         tt.reduce.return %59 : f32
2026-02-21T08:09:46.0280316Z       }) : (tensor<512x16xf32>) -> tensor<512xf32>
2026-02-21T08:09:46.0280570Z       %37 = arith.addf %31, %36 : tensor<512xf32>
2026-02-21T08:09:46.0280809Z       %c240_i32_2 = arith.constant 240 : i32
2026-02-21T08:09:46.0281056Z       %c48_i32_3 = arith.constant 48 : i32
2026-02-21T08:09:46.0281334Z       scf.for %arg3 = %c0_i32 to %c240_i32_2 step %c48_i32_3  : i32 {
2026-02-21T08:09:46.0281712Z         %59 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:09:46.0282011Z         %60 = tt.splat %arg3 : i32 -> tensor<16xi32>
2026-02-21T08:09:46.0282299Z         %61 = arith.addi %60, %59 : tensor<16xi32>
2026-02-21T08:09:46.0282752Z         %62 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc<tensor<512x16xf16>> -> tensor<512x16xf16>
2026-02-21T08:09:46.0283287Z         %63 = tt.expand_dims %28 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32>
2026-02-21T08:09:46.0283690Z         %64 = arith.extf %62 : tensor<512x16xf16> to tensor<512x16xf32>
2026-02-21T08:09:46.0284012Z         %65 = tt.broadcast %63 : tensor<512x1xf32> -> tensor<512x16xf32>
2026-02-21T08:09:46.0284314Z         %66 = arith.subf %64, %65 : tensor<512x16xf32>
2026-02-21T08:09:46.0284767Z         %67 = tt.extern_elementwise %66 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x16xf32>) -> tensor<512x16xf32>
2026-02-21T08:09:46.0285277Z         %68 = tt.expand_dims %37 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32>
2026-02-21T08:09:46.0285651Z         %69 = tt.broadcast %68 : tensor<512x1xf32> -> tensor<512x16xf32>
2026-02-21T08:09:46.0285948Z         %70 = arith.divf %67, %69 : tensor<512x16xf32>
2026-02-21T08:09:46.0286247Z         %71 = arith.truncf %70 : tensor<512x16xf32> to tensor<512x16xf16>
2026-02-21T08:09:46.0286603Z         %72 = tt.expand_dims %7 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32>
2026-02-21T08:09:46.0286939Z         %73 = arith.muli %72, %cst : tensor<512x1xi32>
2026-02-21T08:09:46.0287255Z         %74 = tt.expand_dims %61 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32>
2026-02-21T08:09:46.0287594Z         %75 = tt.broadcast %73 : tensor<512x1xi32> -> tensor<512x16xi32>
2026-02-21T08:09:46.0287903Z         %76 = tt.broadcast %74 : tensor<1x16xi32> -> tensor<512x16xi32>
2026-02-21T08:09:46.0288174Z         %77 = arith.addi %75, %76 : tensor<512x16xi32>
2026-02-21T08:09:46.0288454Z         %78 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<512x16x!tt.ptr<f16>>
2026-02-21T08:09:46.0288782Z         %79 = tt.addptr %78, %77 : tensor<512x16x!tt.ptr<f16>>, tensor<512x16xi32>
2026-02-21T08:09:46.0289154Z         tt.store %79, %71 : tensor<512x16x!tt.ptr<f16>>
2026-02-21T08:09:46.0289405Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T08:09:46.0289630Z         %80 = arith.muli %c16_i32, %c1_i32_4 : i32
2026-02-21T08:09:46.0289866Z         %81 = arith.addi %arg3, %80 : i32
2026-02-21T08:09:46.0290134Z         %82 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:09:46.0290427Z         %83 = tt.splat %81 : i32 -> tensor<16xi32>
2026-02-21T08:09:46.0290660Z         %84 = arith.addi %83, %82 : tensor<16xi32>
2026-02-21T08:09:46.0291002Z         %85 = tt.descriptor_load %0[%4, %81] : !tt.tensordesc<tensor<512x16xf16>> -> tensor<512x16xf16>
2026-02-21T08:09:46.0291404Z         %86 = tt.expand_dims %28 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32>
2026-02-21T08:09:46.0291784Z         %87 = arith.extf %85 : tensor<512x16xf16> to tensor<512x16xf32>
2026-02-21T08:09:46.0292143Z         %88 = tt.broadcast %86 : tensor<512x1xf32> -> tensor<512x16xf32>
2026-02-21T08:09:46.0292416Z         %89 = arith.subf %87, %88 : tensor<512x16xf32>
2026-02-21T08:09:46.0292849Z         %90 = tt.extern_elementwise %89 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x16xf32>) -> tensor<512x16xf32>
2026-02-21T08:09:46.0293336Z         %91 = tt.expand_dims %37 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32>
2026-02-21T08:09:46.0293669Z         %92 = tt.broadcast %91 : tensor<512x1xf32> -> tensor<512x16xf32>
2026-02-21T08:09:46.0293946Z         %93 = arith.divf %90, %92 : tensor<512x16xf32>
2026-02-21T08:09:46.0294214Z         %94 = arith.truncf %93 : tensor<512x16xf32> to tensor<512x16xf16>
2026-02-21T08:09:46.0294549Z         %95 = tt.expand_dims %7 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32>
2026-02-21T08:09:46.0294851Z         %96 = arith.muli %95, %cst : tensor<512x1xi32>
2026-02-21T08:09:46.0295149Z         %97 = tt.expand_dims %84 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32>
2026-02-21T08:09:46.0295494Z         %98 = tt.broadcast %96 : tensor<512x1xi32> -> tensor<512x16xi32>
2026-02-21T08:09:46.0295802Z         %99 = tt.broadcast %97 : tensor<1x16xi32> -> tensor<512x16xi32>
2026-02-21T08:09:46.0296096Z         %100 = arith.addi %98, %99 : tensor<512x16xi32>
2026-02-21T08:09:46.0296503Z         %101 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<512x16x!tt.ptr<f16>>
2026-02-21T08:09:46.0296910Z         %102 = tt.addptr %101, %100 : tensor<512x16x!tt.ptr<f16>>, tensor<512x16xi32>
2026-02-21T08:09:46.0297228Z         tt.store %102, %94 : tensor<512x16x!tt.ptr<f16>>
2026-02-21T08:09:46.0297483Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:09:46.0297711Z         %103 = arith.muli %c16_i32, %c2_i32 : i32
2026-02-21T08:09:46.0297953Z         %104 = arith.addi %arg3, %103 : i32
2026-02-21T08:09:46.0298230Z         %105 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:09:46.0298521Z         %106 = tt.splat %104 : i32 -> tensor<16xi32>
2026-02-21T08:09:46.0298770Z         %107 = arith.addi %106, %105 : tensor<16xi32>
2026-02-21T08:09:46.0299107Z         %108 = tt.descriptor_load %0[%4, %104] : !tt.tensordesc<tensor<512x16xf16>> -> tensor<512x16xf16>
2026-02-21T08:09:46.0299519Z         %109 = tt.expand_dims %28 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32>
2026-02-21T08:09:46.0299865Z         %110 = arith.extf %108 : tensor<512x16xf16> to tensor<512x16xf32>
2026-02-21T08:09:46.0300186Z         %111 = tt.broadcast %109 : tensor<512x1xf32> -> tensor<512x16xf32>
2026-02-21T08:09:46.0300484Z         %112 = arith.subf %110, %111 : tensor<512x16xf32>
2026-02-21T08:09:46.0300921Z         %113 = tt.extern_elementwise %112 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x16xf32>) -> tensor<512x16xf32>
2026-02-21T08:09:46.0301420Z         %114 = tt.expand_dims %37 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32>
2026-02-21T08:09:46.0301827Z         %115 = tt.broadcast %114 : tensor<512x1xf32> -> tensor<512x16xf32>
2026-02-21T08:09:46.0302128Z         %116 = arith.divf %113, %115 : tensor<512x16xf32>
2026-02-21T08:09:46.0302498Z         %117 = arith.truncf %116 : tensor<512x16xf32> to tensor<512x16xf16>
2026-02-21T08:09:46.0302832Z         %118 = tt.expand_dims %7 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32>
2026-02-21T08:09:46.0303148Z         %119 = arith.muli %118, %cst : tensor<512x1xi32>
2026-02-21T08:09:46.0303444Z         %120 = tt.expand_dims %107 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32>
2026-02-21T08:09:46.0303792Z         %121 = tt.broadcast %119 : tensor<512x1xi32> -> tensor<512x16xi32>
2026-02-21T08:09:46.0304110Z         %122 = tt.broadcast %120 : tensor<1x16xi32> -> tensor<512x16xi32>
2026-02-21T08:09:46.0304397Z         %123 = arith.addi %121, %122 : tensor<512x16xi32>
2026-02-21T08:09:46.0304691Z         %124 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<512x16x!tt.ptr<f16>>
2026-02-21T08:09:46.0305033Z         %125 = tt.addptr %124, %123 : tensor<512x16x!tt.ptr<f16>>, tensor<512x16xi32>
2026-02-21T08:09:46.0305409Z         tt.store %125, %117 : tensor<512x16x!tt.ptr<f16>>
2026-02-21T08:09:46.0305637Z       }
2026-02-21T08:09:46.0305856Z       %38 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:09:46.0306160Z       %39 = tt.splat %c240_i32_2 : i32 -> tensor<16xi32>
2026-02-21T08:09:46.0306404Z       %40 = arith.addi %39, %38 : tensor<16xi32>
2026-02-21T08:09:46.0306764Z       %41 = tt.descriptor_load %0[%4, %c240_i32_2] : !tt.tensordesc<tensor<512x16xf16>> -> tensor<512x16xf16>
2026-02-21T08:09:46.0307176Z       %42 = tt.expand_dims %28 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32>
2026-02-21T08:09:46.0307522Z       %43 = arith.extf %41 : tensor<512x16xf16> to tensor<512x16xf32>
2026-02-21T08:09:46.0307827Z       %44 = tt.broadcast %42 : tensor<512x1xf32> -> tensor<512x16xf32>
2026-02-21T08:09:46.0308112Z       %45 = arith.subf %43, %44 : tensor<512x16xf32>
2026-02-21T08:09:46.0308549Z       %46 = tt.extern_elementwise %45 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x16xf32>) -> tensor<512x16xf32>
2026-02-21T08:09:46.0309030Z       %47 = tt.expand_dims %37 {axis = 1 : i32} : tensor<512xf32> -> tensor<512x1xf32>
2026-02-21T08:09:46.0309373Z       %48 = tt.broadcast %47 : tensor<512x1xf32> -> tensor<512x16xf32>
2026-02-21T08:09:46.0309646Z       %49 = arith.divf %46, %48 : tensor<512x16xf32>
2026-02-21T08:09:46.0309932Z       %50 = arith.truncf %49 : tensor<512x16xf32> to tensor<512x16xf16>
2026-02-21T08:09:46.0310268Z       %51 = tt.expand_dims %7 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32>
2026-02-21T08:09:46.0310572Z       %52 = arith.muli %51, %cst : tensor<512x1xi32>
2026-02-21T08:09:46.0310873Z       %53 = tt.expand_dims %40 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32>
2026-02-21T08:09:46.0311200Z       %54 = tt.broadcast %52 : tensor<512x1xi32> -> tensor<512x16xi32>
2026-02-21T08:09:46.0311516Z       %55 = tt.broadcast %53 : tensor<1x16xi32> -> tensor<512x16xi32>
2026-02-21T08:09:46.0311825Z       %56 = arith.addi %54, %55 : tensor<512x16xi32>
2026-02-21T08:09:46.0312118Z       %57 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<512x16x!tt.ptr<f16>>
2026-02-21T08:09:46.0312447Z       %58 = tt.addptr %57, %56 : tensor<512x16x!tt.ptr<f16>>, tensor<512x16xi32>
2026-02-21T08:09:46.0312751Z       tt.store %58, %50 : tensor<512x16x!tt.ptr<f16>>
2026-02-21T08:09:46.0312995Z     } {tt.warp_specialize}
2026-02-21T08:09:46.0313182Z     tt.return
2026-02-21T08:09:46.0313338Z   }
2026-02-21T08:09:46.0313480Z }
2026-02-21T08:09:46.0313569Z 
2026-02-21T08:09:46.0313628Z {-#
2026-02-21T08:09:46.0313795Z   external_resources: {
2026-02-21T08:09:46.0313990Z     mlir_reproducer: {
2026-02-21T08:09:46.0319192Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:09:46.0324459Z       disable_threading: false,
2026-02-21T08:09:46.0324670Z       verify_each: true
2026-02-21T08:09:46.0324842Z     }
2026-02-21T08:09:46.0324994Z   }
2026-02-21T08:09:46.0325133Z #-}
2026-02-21T08:09:46.0325650Z /tmp/torchinductor_root/iq/ciqafosyvytmppnhx4u2hjiekjtwu3zcm2aprmnhs7mquxpoo5mi.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:09:46.0327101Z /tmp/torchinductor_root/iq/ciqafosyvytmppnhx4u2hjiekjtwu3zcm2aprmnhs7mquxpoo5mi.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:09:46.0328264Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:09:46.0329511Z Config: @helion.kernel(config=helion.Config(block_sizes=[512, 16], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', ''], maxnreg=64, num_sm_multiplier=2, num_stages=7, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:09:46.0330701Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:09:46.0331005Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:09:47.1901386Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 17.0 configs/s
2026-02-21T08:09:47.1912749Z [38s] Adaptive compile timeout: 30s (90% percentile=1.5s, bounds=[30.0s, 60s])
2026-02-21T08:09:47.5438402Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 2863.1 configs/s
2026-02-21T08:09:47.5917655Z [39s] Initial random population of 100, 5 starting points: 
2026-02-21T08:09:47.5922653Z error=8
2026-02-21T08:09:47.5927244Z ok=92
2026-02-21T08:09:47.5930615Z min=0.0082
2026-02-21T08:09:47.5935459Z mid=0.0533
2026-02-21T08:09:47.5937001Z max=2.5487
2026-02-21T08:09:47.5937225Z best={'block_sizes': [16, 256],
2026-02-21T08:09:47.5937506Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:09:47.5937773Z  'load_eviction_policies': ['last', ''],
2026-02-21T08:09:47.5938074Z  'num_stages': 8,
2026-02-21T08:09:47.5938249Z  'num_warps': 4,
2026-02-21T08:09:47.5938430Z  'pid_type': 'flat',
2026-02-21T08:09:47.5938626Z  'range_flattens': [None, True],
2026-02-21T08:09:47.5942924Z  'range_multi_buffers': [None, True],
2026-02-21T08:09:47.5944393Z  'range_num_stages': [0, 1],
2026-02-21T08:09:47.5945206Z  'range_unroll_factors': [0, 0],
2026-02-21T08:09:47.5945443Z  'range_warp_specializes': [None, False]}
2026-02-21T08:09:47.5945830Z [39s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:09:48.9668151Z [40s] Generation 1 starting: 83 neighbors, 5 active search path(s)
2026-02-21T08:09:54.5760712Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 5.1 configs/s
2026-02-21T08:10:00.4278976Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 14.8 configs/s
2026-02-21T08:10:03.2060326Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 370.6 configs/s
2026-02-21T08:10:03.5255754Z [55s] Generation 1 complete: 
2026-02-21T08:10:03.5256107Z ok=88
2026-02-21T08:10:03.5256278Z min=0.0061
2026-02-21T08:10:03.5256459Z mid=0.0082
2026-02-21T08:10:03.5256639Z max=0.0881
2026-02-21T08:10:03.5257388Z best={'block_sizes': [4, 256],
2026-02-21T08:10:03.5257681Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:10:03.5257954Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:10:03.5258214Z  'num_stages': 5,
2026-02-21T08:10:03.5258395Z  'num_warps': 4,
2026-02-21T08:10:03.5258616Z  'pid_type': 'flat',
2026-02-21T08:10:03.5258823Z  'range_flattens': [None, None],
2026-02-21T08:10:03.5259065Z  'range_multi_buffers': [None, None],
2026-02-21T08:10:03.5259286Z  'range_num_stages': [0, 1],
2026-02-21T08:10:03.5259480Z  'range_unroll_factors': [0, 3],
2026-02-21T08:10:03.5259688Z  'range_warp_specializes': [None, None]}
2026-02-21T08:10:03.5269917Z [55s] Fitting surrogate: 188 points, 188 targets
2026-02-21T08:10:04.7086052Z [56s] Generation 2 starting: 75 neighbors, 5 active search path(s)
2026-02-21T08:10:08.2766241Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 58.1 configs/s
2026-02-21T08:10:13.2355514Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 15.7 configs/s
2026-02-21T08:10:16.6636830Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 325.4 configs/s
2026-02-21T08:10:17.0079338Z [68s] Generation 2 complete: 
2026-02-21T08:10:17.0079653Z ok=81
2026-02-21T08:10:17.0079867Z min=0.0061
2026-02-21T08:10:17.0080069Z mid=0.0082
2026-02-21T08:10:17.0080270Z max=0.0758
2026-02-21T08:10:17.0080484Z best={'block_sizes': [16, 256],
2026-02-21T08:10:17.0080823Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:10:17.0081173Z  'load_eviction_policies': ['', ''],
2026-02-21T08:10:17.0081484Z  'num_stages': 7,
2026-02-21T08:10:17.0082089Z  'num_warps': 16,
2026-02-21T08:10:17.0082318Z  'pid_type': 'flat',
2026-02-21T08:10:17.0082572Z  'range_flattens': [None, False],
2026-02-21T08:10:17.0082887Z  'range_multi_buffers': [None, True],
2026-02-21T08:10:17.0083205Z  'range_num_stages': [0, 4],
2026-02-21T08:10:17.0083488Z  'range_unroll_factors': [0, 4],
2026-02-21T08:10:17.0083864Z  'range_warp_specializes': [None, None]}
2026-02-21T08:10:17.0095423Z [68s] Fitting surrogate: 269 points, 269 targets
2026-02-21T08:10:17.9834503Z [69s] Generation 3 starting: 65 neighbors, 5 active search path(s)
2026-02-21T08:10:20.9207437Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 33.0 configs/s
2026-02-21T08:10:25.2986197Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 15.7 configs/s
2026-02-21T08:10:28.6212272Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 338.1 configs/s
2026-02-21T08:10:28.9493041Z [80s] Generation 3 complete: 
2026-02-21T08:10:28.9497171Z ok=70
2026-02-21T08:10:28.9498578Z min=0.0061
2026-02-21T08:10:28.9498747Z mid=0.0082
2026-02-21T08:10:28.9498873Z max=0.0348
2026-02-21T08:10:28.9499022Z best={'block_sizes': [16, 256],
2026-02-21T08:10:28.9499283Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:10:28.9500002Z  'load_eviction_policies': ['', ''],
2026-02-21T08:10:28.9500186Z  'num_stages': 7,
2026-02-21T08:10:28.9500328Z  'num_warps': 16,
2026-02-21T08:10:28.9500478Z  'pid_type': 'flat',
2026-02-21T08:10:28.9500638Z  'range_flattens': [None, False],
2026-02-21T08:10:28.9500828Z  'range_multi_buffers': [None, True],
2026-02-21T08:10:28.9501012Z  'range_num_stages': [0, 4],
2026-02-21T08:10:28.9501186Z  'range_unroll_factors': [0, 4],
2026-02-21T08:10:28.9501363Z  'range_warp_specializes': [None, None]}
2026-02-21T08:10:28.9507057Z [80s] Fitting surrogate: 339 points, 339 targets
2026-02-21T08:10:29.8300312Z [81s] Generation 4 starting: 63 neighbors, 5 active search path(s)
2026-02-21T08:10:31.9267621Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 74.8 configs/s
2026-02-21T08:10:35.8818445Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 63/63 16.1 configs/s
2026-02-21T08:10:39.3138749Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 315.3 configs/s
2026-02-21T08:10:39.6331138Z [91s] Generation 4 complete: 
2026-02-21T08:10:39.6335387Z ok=68
2026-02-21T08:10:39.6337016Z min=0.0061
2026-02-21T08:10:39.6337220Z mid=0.0063
2026-02-21T08:10:39.6337355Z max=0.0102
2026-02-21T08:10:39.6342665Z best={'block_sizes': [16, 256],
2026-02-21T08:10:39.6346613Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:10:39.6351067Z  'load_eviction_policies': ['', ''],
2026-02-21T08:10:39.6355585Z  'num_stages': 7,
2026-02-21T08:10:39.6357511Z  'num_warps': 16,
2026-02-21T08:10:39.6357698Z  'pid_type': 'flat',
2026-02-21T08:10:39.6357874Z  'range_flattens': [None, False],
2026-02-21T08:10:39.6358062Z  'range_multi_buffers': [None, True],
2026-02-21T08:10:39.6358255Z  'range_num_stages': [0, 4],
2026-02-21T08:10:39.6358421Z  'range_unroll_factors': [0, 4],
2026-02-21T08:10:39.6358608Z  'range_warp_specializes': [None, None]}
2026-02-21T08:10:39.6358945Z [91s] Fitting surrogate: 407 points, 407 targets
2026-02-21T08:10:40.3712307Z [91s] Generation 5 starting: 50 neighbors, 4 active search path(s)
2026-02-21T08:10:42.2116544Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 41.6 configs/s
2026-02-21T08:10:45.4145060Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 16.1 configs/s
2026-02-21T08:10:48.0286974Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 392.4 configs/s
2026-02-21T08:10:48.3000593Z [99s] Generation 5 complete: 
2026-02-21T08:10:48.3005537Z ok=55
2026-02-21T08:10:48.3009909Z min=0.0061
2026-02-21T08:10:48.3011337Z mid=0.0063
2026-02-21T08:10:48.3011505Z max=0.0164
2026-02-21T08:10:48.3011724Z best={'block_sizes': [16, 256],
2026-02-21T08:10:48.3011949Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:10:48.3012160Z  'load_eviction_policies': ['', ''],
2026-02-21T08:10:48.3012397Z  'num_stages': 7,
2026-02-21T08:10:48.3012538Z  'num_warps': 16,
2026-02-21T08:10:48.3012683Z  'pid_type': 'flat',
2026-02-21T08:10:48.3012847Z  'range_flattens': [None, False],
2026-02-21T08:10:48.3013027Z  'range_multi_buffers': [None, True],
2026-02-21T08:10:48.3013219Z  'range_num_stages': [0, 4],
2026-02-21T08:10:48.3013386Z  'range_unroll_factors': [0, 4],
2026-02-21T08:10:48.3013577Z  'range_warp_specializes': [None, None]}
2026-02-21T08:10:48.3016106Z [99s] Fitting surrogate: 462 points, 462 targets
2026-02-21T08:10:48.9941230Z [100s] Generation 6 starting: 34 neighbors, 3 active search path(s)
2026-02-21T08:10:50.2529555Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 40.1 configs/s
2026-02-21T08:10:52.4546710Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 16.2 configs/s
2026-02-21T08:10:54.2748913Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 561.7 configs/s
2026-02-21T08:10:54.4669893Z [106s] Generation 6 complete: 
2026-02-21T08:10:54.4674240Z ok=38
2026-02-21T08:10:54.4678677Z min=0.0061
2026-02-21T08:10:54.4680649Z mid=0.0063
2026-02-21T08:10:54.4680808Z max=0.0102
2026-02-21T08:10:54.4680962Z best={'block_sizes': [16, 256],
2026-02-21T08:10:54.4681174Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:10:54.4681397Z  'load_eviction_policies': ['', ''],
2026-02-21T08:10:54.4681649Z  'num_stages': 7,
2026-02-21T08:10:54.4681804Z  'num_warps': 16,
2026-02-21T08:10:54.4681951Z  'pid_type': 'flat',
2026-02-21T08:10:54.4682107Z  'range_flattens': [None, False],
2026-02-21T08:10:54.4682300Z  'range_multi_buffers': [None, True],
2026-02-21T08:10:54.4682482Z  'range_num_stages': [0, 4],
2026-02-21T08:10:54.4682655Z  'range_unroll_factors': [0, 4],
2026-02-21T08:10:54.4682834Z  'range_warp_specializes': [None, None]}
2026-02-21T08:10:54.4687900Z [106s] Fitting surrogate: 500 points, 500 targets
2026-02-21T08:10:54.7874654Z [106s] Generation 7 starting: 14 neighbors, 1 active search path(s)
2026-02-21T08:10:55.4748157Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 60.2 configs/s
2026-02-21T08:10:56.3649956Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 14/14 16.5 configs/s
2026-02-21T08:10:57.0908320Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1398.9 configs/s
2026-02-21T08:10:57.1738579Z [108s] Generation 7 complete: 
2026-02-21T08:10:57.1741749Z ok=16
2026-02-21T08:10:57.1746173Z min=0.0061
2026-02-21T08:10:57.1748292Z mid=0.0061
2026-02-21T08:10:57.1748516Z max=0.0143
2026-02-21T08:10:57.1753752Z best={'block_sizes': [16, 256],
2026-02-21T08:10:57.1755754Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:10:57.1756092Z  'load_eviction_policies': ['', ''],
2026-02-21T08:10:57.1756329Z  'num_stages': 7,
2026-02-21T08:10:57.1756556Z  'num_warps': 16,
2026-02-21T08:10:57.1756756Z  'pid_type': 'flat',
2026-02-21T08:10:57.1761341Z  'range_flattens': [None, False],
2026-02-21T08:10:57.1765305Z  'range_multi_buffers': [None, True],
2026-02-21T08:10:57.1768772Z  'range_num_stages': [0, 4],
2026-02-21T08:10:57.1772212Z  'range_unroll_factors': [0, 4],
2026-02-21T08:10:57.1776106Z  'range_warp_specializes': [None, None]}
2026-02-21T08:10:57.1780172Z [108s] Fitting surrogate: 516 points, 516 targets
2026-02-21T08:10:57.4729528Z [109s] Autotuning complete in 109.1s after searching 494 configs.
2026-02-21T08:10:57.4732285Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:10:57.4733413Z     @helion.kernel(config=helion.Config(block_sizes=[16, 256], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['', ''], num_stages=7, num_warps=16, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True)
2026-02-21T08:10:57.4734674Z 
2026-02-21T08:10:57.4739502Z [109s] Code of selected kernel: /tmp/torchinductor_root/fe/cfe46z52drtjal4sey6zsroulfver3cr6rioiwvny46ikfnh4jrf.py
2026-02-21T08:10:57.9772526Z WARNING:tritonbench.utils.triton_op:Completed input ID 0:
2026-02-21T08:10:57.9776567Z (M, N)
2026-02-21T08:10:57.9778245Z -----------
2026-02-21T08:10:57.9778436Z (4096, 256)
2026-02-21T08:10:57.9778526Z 
2026-02-21T08:10:57.9779167Z   5%|▌         | 1/20 [01:55<36:35, 115.53s/it]WARNING:tritonbench.utils.triton_op:Running input ID 5:
2026-02-21T08:10:57.9781798Z (M, N)
2026-02-21T08:10:57.9781983Z -----------
2026-02-21T08:10:57.9782146Z (4096, 896)
2026-02-21T08:10:57.9782541Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for naive_softmax
2026-02-21T08:10:59.5973158Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T08:11:01.0600344Z INFO:tritonbench.utils.triton_op:Took 2.50ms to get benchmark function for torch_compile_softmax
2026-02-21T08:11:02.8720238Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:11:02.8723834Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:11:02.8727796Z               'dtype': 'torch.float16',
2026-02-21T08:11:02.8731959Z               'shape': (4096, 896),
2026-02-21T08:11:02.8736511Z               'stride': (896, 1)},),
2026-02-21T08:11:02.8741659Z   'kwargs': {}}
2026-02-21T08:11:02.8746194Z INFO:tritonbench.utils.triton_op:Took 1.67ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T08:11:03.0470822Z [0s] Autotune random seed: 2134816249
2026-02-21T08:11:03.1823892Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:11:36.3408960Z [33s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'first'], num_stages=8, num_warps=32, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[None, False])
2026-02-21T08:11:36.3427099Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s
2026-02-21T08:11:38.8565645Z module attributes {ttg.maxnreg = 64 : i32} {
2026-02-21T08:11:38.8566282Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:11:38.8570162Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:11:38.8574903Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:11:38.8576293Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:11:38.8576542Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:11:38.8576760Z     %cst = arith.constant dense<896> : tensor<16x1xi32>
2026-02-21T08:11:38.8577105Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<16xf32>
2026-02-21T08:11:38.8582235Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<16xf32>
2026-02-21T08:11:38.8583661Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:11:38.8583898Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:11:38.8584103Z     %c896_i32 = arith.constant 896 : i32
2026-02-21T08:11:38.8584285Z     %c896_i64 = arith.constant 896 : i64
2026-02-21T08:11:38.8584471Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:11:38.8584794Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c896_i32], [%c896_i64, %c1_i64] : <f16>, <tensor<16x128xf16>>
2026-02-21T08:11:38.8585126Z     %1 = tt.get_program_id x : i32
2026-02-21T08:11:38.8585303Z     %2 = arith.addi %1, %c1_i32 : i32
2026-02-21T08:11:38.8585490Z     %3 = arith.minsi %2, %c256_i32 : i32
2026-02-21T08:11:38.8585691Z     scf.for %arg2 = %1 to %3 step %c1_i32  : i32 {
2026-02-21T08:11:38.8585893Z       %4 = arith.muli %arg2, %c16_i32 : i32
2026-02-21T08:11:38.8586131Z       %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:11:38.8586394Z       %6 = tt.splat %4 : i32 -> tensor<16xi32>
2026-02-21T08:11:38.8586941Z       %7 = arith.addi %6, %5 : tensor<16xi32>
2026-02-21T08:11:38.8587131Z       %c768_i32 = arith.constant 768 : i32
2026-02-21T08:11:38.8587318Z       %c384_i32 = arith.constant 384 : i32
2026-02-21T08:11:38.8587688Z       %8:2 = scf.for %arg3 = %c0_i32 to %c768_i32 step %c384_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<16xf32>, tensor<16xf32>)  : i32 {
2026-02-21T08:11:38.8588157Z         %50 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T08:11:38.8588498Z         %51 = arith.extf %50 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T08:11:38.8588734Z         %52 = "tt.reduce"(%51) <{axis = 1 : i32}> ({
2026-02-21T08:11:38.8588937Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:38.8589126Z           %108 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:11:38.8589327Z           tt.reduce.return %108 : f32
2026-02-21T08:11:38.8589526Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T08:11:38.8589753Z         %53 = arith.truncf %52 : tensor<16xf32> to tensor<16xf16>
2026-02-21T08:11:38.8590000Z         %54 = arith.extf %53 : tensor<16xf16> to tensor<16xf32>
2026-02-21T08:11:38.8590227Z         %55 = arith.cmpf ogt, %arg4, %54 : tensor<16xf32>
2026-02-21T08:11:38.8590461Z         %56 = arith.cmpf une, %arg4, %arg4 : tensor<16xf32>
2026-02-21T08:11:38.8590676Z         %57 = arith.ori %55, %56 : tensor<16xi1>
2026-02-21T08:11:38.8590916Z         %58 = arith.select %57, %arg4, %54 : tensor<16xi1>, tensor<16xf32>
2026-02-21T08:11:38.8591163Z         %59 = arith.subf %arg4, %58 : tensor<16xf32>
2026-02-21T08:11:38.8591522Z         %60 = tt.extern_elementwise %59 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T08:11:38.8592242Z         %61 = arith.mulf %arg5, %60 : tensor<16xf32>
2026-02-21T08:11:38.8592507Z         %62 = tt.expand_dims %58 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:11:38.8592973Z         %63 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T08:11:38.8593309Z         %64 = arith.subf %51, %63 : tensor<16x128xf32>
2026-02-21T08:11:38.8593689Z         %65 = tt.extern_elementwise %64 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T08:11:38.8594055Z         %66 = "tt.reduce"(%65) <{axis = 1 : i32}> ({
2026-02-21T08:11:38.8594263Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:38.8594452Z           %108 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:11:38.8594653Z           tt.reduce.return %108 : f32
2026-02-21T08:11:38.8594843Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T08:11:38.8595054Z         %67 = arith.addf %61, %66 : tensor<16xf32>
2026-02-21T08:11:38.8595251Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T08:11:38.8595441Z         %68 = arith.muli %c128_i32, %c1_i32_4 : i32
2026-02-21T08:11:38.8595637Z         %69 = arith.addi %arg3, %68 : i32
2026-02-21T08:11:38.8595917Z         %70 = tt.descriptor_load %0[%4, %69] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T08:11:38.8596243Z         %71 = arith.extf %70 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T08:11:38.8596472Z         %72 = "tt.reduce"(%71) <{axis = 1 : i32}> ({
2026-02-21T08:11:38.8596668Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:38.8596857Z           %108 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:11:38.8597046Z           tt.reduce.return %108 : f32
2026-02-21T08:11:38.8597236Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T08:11:38.8597488Z         %73 = arith.truncf %72 : tensor<16xf32> to tensor<16xf16>
2026-02-21T08:11:38.8597745Z         %74 = arith.extf %73 : tensor<16xf16> to tensor<16xf32>
2026-02-21T08:11:38.8597983Z         %75 = arith.cmpf ogt, %58, %74 : tensor<16xf32>
2026-02-21T08:11:38.8598211Z         %76 = arith.cmpf une, %58, %58 : tensor<16xf32>
2026-02-21T08:11:38.8598430Z         %77 = arith.ori %75, %76 : tensor<16xi1>
2026-02-21T08:11:38.8598743Z         %78 = arith.select %77, %58, %74 : tensor<16xi1>, tensor<16xf32>
2026-02-21T08:11:38.8598993Z         %79 = arith.subf %58, %78 : tensor<16xf32>
2026-02-21T08:11:38.8599362Z         %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T08:11:38.8599737Z         %81 = arith.mulf %67, %80 : tensor<16xf32>
2026-02-21T08:11:38.8599996Z         %82 = tt.expand_dims %78 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:11:38.8600308Z         %83 = tt.broadcast %82 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T08:11:38.8600559Z         %84 = arith.subf %71, %83 : tensor<16x128xf32>
2026-02-21T08:11:38.8600942Z         %85 = tt.extern_elementwise %84 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T08:11:38.8601333Z         %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({
2026-02-21T08:11:38.8601572Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:38.8601767Z           %108 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:11:38.8601956Z           tt.reduce.return %108 : f32
2026-02-21T08:11:38.8602153Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T08:11:38.8602363Z         %87 = arith.addf %81, %86 : tensor<16xf32>
2026-02-21T08:11:38.8602566Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:11:38.8602769Z         %88 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T08:11:38.8602962Z         %89 = arith.addi %arg3, %88 : i32
2026-02-21T08:11:38.8603250Z         %90 = tt.descriptor_load %0[%4, %89] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T08:11:38.8603579Z         %91 = arith.extf %90 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T08:11:38.8603825Z         %92 = "tt.reduce"(%91) <{axis = 1 : i32}> ({
2026-02-21T08:11:38.8604025Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:38.8604274Z           %108 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:11:38.8604484Z           tt.reduce.return %108 : f32
2026-02-21T08:11:38.8604672Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T08:11:38.8604911Z         %93 = arith.truncf %92 : tensor<16xf32> to tensor<16xf16>
2026-02-21T08:11:38.8605165Z         %94 = arith.extf %93 : tensor<16xf16> to tensor<16xf32>
2026-02-21T08:11:38.8605410Z         %95 = arith.cmpf ogt, %78, %94 : tensor<16xf32>
2026-02-21T08:11:38.8605634Z         %96 = arith.cmpf une, %78, %78 : tensor<16xf32>
2026-02-21T08:11:38.8605843Z         %97 = arith.ori %95, %96 : tensor<16xi1>
2026-02-21T08:11:38.8606087Z         %98 = arith.select %97, %78, %94 : tensor<16xi1>, tensor<16xf32>
2026-02-21T08:11:38.8606316Z         %99 = arith.subf %78, %98 : tensor<16xf32>
2026-02-21T08:11:38.8606669Z         %100 = tt.extern_elementwise %99 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T08:11:38.8607027Z         %101 = arith.mulf %87, %100 : tensor<16xf32>
2026-02-21T08:11:38.8607286Z         %102 = tt.expand_dims %98 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:11:38.8607588Z         %103 = tt.broadcast %102 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T08:11:38.8607831Z         %104 = arith.subf %91, %103 : tensor<16x128xf32>
2026-02-21T08:11:38.8608205Z         %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T08:11:38.8608575Z         %106 = "tt.reduce"(%105) <{axis = 1 : i32}> ({
2026-02-21T08:11:38.8608771Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:38.8608955Z           %108 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:11:38.8609139Z           tt.reduce.return %108 : f32
2026-02-21T08:11:38.8609328Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T08:11:38.8609526Z         %107 = arith.addf %101, %106 : tensor<16xf32>
2026-02-21T08:11:38.8609751Z         scf.yield %98, %107 : tensor<16xf32>, tensor<16xf32>
2026-02-21T08:11:38.8610021Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:11:38.8610320Z       %9 = tt.descriptor_load %0[%4, %c768_i32] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T08:11:38.8610642Z       %10 = arith.extf %9 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T08:11:38.8610877Z       %11 = "tt.reduce"(%10) <{axis = 1 : i32}> ({
2026-02-21T08:11:38.8611070Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:11:38.8611248Z         %50 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:11:38.8611441Z         tt.reduce.return %50 : f32
2026-02-21T08:11:38.8611660Z       }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T08:11:38.8611886Z       %12 = arith.truncf %11 : tensor<16xf32> to tensor<16xf16>
2026-02-21T08:11:38.8612123Z       %13 = arith.extf %12 : tensor<16xf16> to tensor<16xf32>
2026-02-21T08:11:38.8612347Z       %14 = arith.cmpf ogt, %8#0, %13 : tensor<16xf32>
2026-02-21T08:11:38.8612567Z       %15 = arith.cmpf une, %8#0, %8#0 : tensor<16xf32>
2026-02-21T08:11:38.8612771Z       %16 = arith.ori %14, %15 : tensor<16xi1>
2026-02-21T08:11:38.8613002Z       %17 = arith.select %16, %8#0, %13 : tensor<16xi1>, tensor<16xf32>
2026-02-21T08:11:38.8613228Z       %18 = arith.subf %8#0, %17 : tensor<16xf32>
2026-02-21T08:11:38.8613577Z       %19 = tt.extern_elementwise %18 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T08:11:38.8613922Z       %20 = arith.mulf %8#1, %19 : tensor<16xf32>
2026-02-21T08:11:38.8614171Z       %21 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:11:38.8614464Z       %22 = tt.broadcast %21 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T08:11:38.8614693Z       %23 = arith.subf %10, %22 : tensor<16x128xf32>
2026-02-21T08:11:38.8615062Z       %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T08:11:38.8615483Z       %25 = "tt.reduce"(%24) <{axis = 1 : i32}> ({
2026-02-21T08:11:38.8615684Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:11:38.8615867Z         %50 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:11:38.8616051Z         tt.reduce.return %50 : f32
2026-02-21T08:11:38.8616237Z       }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T08:11:38.8616428Z       %26 = arith.addf %20, %25 : tensor<16xf32>
2026-02-21T08:11:38.8616626Z       %c768_i32_2 = arith.constant 768 : i32
2026-02-21T08:11:38.8616811Z       %c384_i32_3 = arith.constant 384 : i32
2026-02-21T08:11:38.8617043Z       scf.for %arg3 = %c0_i32 to %c768_i32_2 step %c384_i32_3  : i32 {
2026-02-21T08:11:38.8617316Z         %50 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:11:38.8617583Z         %51 = tt.splat %arg3 : i32 -> tensor<128xi32>
2026-02-21T08:11:38.8617796Z         %52 = arith.addi %51, %50 : tensor<128xi32>
2026-02-21T08:11:38.8618044Z         %53 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T08:11:38.8618309Z         %54 = arith.muli %53, %cst : tensor<16x1xi32>
2026-02-21T08:11:38.8618559Z         %55 = tt.expand_dims %52 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:11:38.8618852Z         %56 = tt.broadcast %54 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T08:11:38.8619111Z         %57 = tt.broadcast %55 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T08:11:38.8619353Z         %58 = arith.addi %56, %57 : tensor<16x128xi32>
2026-02-21T08:11:38.8619595Z         %59 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T08:11:38.8619873Z         %60 = tt.addptr %59, %58 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T08:11:38.8620175Z         %61 = tt.load %60 evictionPolicy = evict_last : tensor<16x128x!tt.ptr<f16>>
2026-02-21T08:11:38.8620475Z         %62 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:11:38.8620765Z         %63 = arith.extf %61 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T08:11:38.8621122Z         %64 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T08:11:38.8621355Z         %65 = arith.subf %63, %64 : tensor<16x128xf32>
2026-02-21T08:11:38.8621780Z         %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T08:11:38.8622192Z         %67 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:11:38.8622486Z         %68 = tt.broadcast %67 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T08:11:38.8622714Z         %69 = arith.divf %66, %68 : tensor<16x128xf32>
2026-02-21T08:11:38.8622958Z         %70 = arith.truncf %69 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T08:11:38.8623235Z         %71 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T08:11:38.8623512Z         %72 = tt.addptr %71, %58 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T08:11:38.8623778Z         tt.store %72, %70 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T08:11:38.8623981Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T08:11:38.8624179Z         %73 = arith.muli %c128_i32, %c1_i32_4 : i32
2026-02-21T08:11:38.8624368Z         %74 = arith.addi %arg3, %73 : i32
2026-02-21T08:11:38.8624605Z         %75 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:11:38.8624860Z         %76 = tt.splat %74 : i32 -> tensor<128xi32>
2026-02-21T08:11:38.8625055Z         %77 = arith.addi %76, %75 : tensor<128xi32>
2026-02-21T08:11:38.8625305Z         %78 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T08:11:38.8625560Z         %79 = arith.muli %78, %cst : tensor<16x1xi32>
2026-02-21T08:11:38.8625817Z         %80 = tt.expand_dims %77 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:11:38.8626109Z         %81 = tt.broadcast %79 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T08:11:38.8626416Z         %82 = tt.broadcast %80 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T08:11:38.8626660Z         %83 = arith.addi %81, %82 : tensor<16x128xi32>
2026-02-21T08:11:38.8626891Z         %84 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T08:11:38.8627175Z         %85 = tt.addptr %84, %83 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T08:11:38.8627466Z         %86 = tt.load %85 evictionPolicy = evict_last : tensor<16x128x!tt.ptr<f16>>
2026-02-21T08:11:38.8627772Z         %87 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:11:38.8628056Z         %88 = arith.extf %86 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T08:11:38.8628309Z         %89 = tt.broadcast %87 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T08:11:38.8628543Z         %90 = arith.subf %88, %89 : tensor<16x128xf32>
2026-02-21T08:11:38.8628904Z         %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T08:11:38.8629313Z         %92 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:11:38.8629597Z         %93 = tt.broadcast %92 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T08:11:38.8629828Z         %94 = arith.divf %91, %93 : tensor<16x128xf32>
2026-02-21T08:11:38.8630066Z         %95 = arith.truncf %94 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T08:11:38.8630332Z         %96 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T08:11:38.8630616Z         %97 = tt.addptr %96, %83 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T08:11:38.8630869Z         tt.store %97, %95 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T08:11:38.8631082Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:11:38.8631279Z         %98 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T08:11:38.8631467Z         %99 = arith.addi %arg3, %98 : i32
2026-02-21T08:11:38.8631742Z         %100 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:11:38.8632050Z         %101 = tt.splat %99 : i32 -> tensor<128xi32>
2026-02-21T08:11:38.8632259Z         %102 = arith.addi %101, %100 : tensor<128xi32>
2026-02-21T08:11:38.8632503Z         %103 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T08:11:38.8632772Z         %104 = arith.muli %103, %cst : tensor<16x1xi32>
2026-02-21T08:11:38.8633035Z         %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:11:38.8633328Z         %106 = tt.broadcast %104 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T08:11:38.8633601Z         %107 = tt.broadcast %105 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T08:11:38.8633843Z         %108 = arith.addi %106, %107 : tensor<16x128xi32>
2026-02-21T08:11:38.8634090Z         %109 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T08:11:38.8634374Z         %110 = tt.addptr %109, %108 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T08:11:38.8634689Z         %111 = tt.load %110 evictionPolicy = evict_last : tensor<16x128x!tt.ptr<f16>>
2026-02-21T08:11:38.8635000Z         %112 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:11:38.8635286Z         %113 = arith.extf %111 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T08:11:38.8635557Z         %114 = tt.broadcast %112 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T08:11:38.8635797Z         %115 = arith.subf %113, %114 : tensor<16x128xf32>
2026-02-21T08:11:38.8636176Z         %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T08:11:38.8636595Z         %117 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:11:38.8636881Z         %118 = tt.broadcast %117 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T08:11:38.8637172Z         %119 = arith.divf %116, %118 : tensor<16x128xf32>
2026-02-21T08:11:38.8637418Z         %120 = arith.truncf %119 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T08:11:38.8637696Z         %121 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T08:11:38.8637975Z         %122 = tt.addptr %121, %108 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T08:11:38.8638246Z         tt.store %122, %120 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T08:11:38.8638461Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:11:38.8638697Z       %27 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:11:38.8638957Z       %28 = tt.splat %c768_i32_2 : i32 -> tensor<128xi32>
2026-02-21T08:11:38.8639160Z       %29 = arith.addi %28, %27 : tensor<128xi32>
2026-02-21T08:11:38.8639413Z       %30 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T08:11:38.8639673Z       %31 = arith.muli %30, %cst : tensor<16x1xi32>
2026-02-21T08:11:38.8639926Z       %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:11:38.8640221Z       %33 = tt.broadcast %31 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T08:11:38.8640479Z       %34 = tt.broadcast %32 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T08:11:38.8640714Z       %35 = arith.addi %33, %34 : tensor<16x128xi32>
2026-02-21T08:11:38.8640940Z       %36 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T08:11:38.8641250Z       %37 = tt.addptr %36, %35 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T08:11:38.8641587Z       %38 = tt.load %37 evictionPolicy = evict_last : tensor<16x128x!tt.ptr<f16>>
2026-02-21T08:11:38.8641904Z       %39 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:11:38.8642203Z       %40 = arith.extf %38 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T08:11:38.8642466Z       %41 = tt.broadcast %39 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T08:11:38.8642716Z       %42 = arith.subf %40, %41 : tensor<16x128xf32>
2026-02-21T08:11:38.8643149Z       %43 = tt.extern_elementwise %42 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T08:11:38.8643580Z       %44 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:11:38.8643882Z       %45 = tt.broadcast %44 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T08:11:38.8644122Z       %46 = arith.divf %43, %45 : tensor<16x128xf32>
2026-02-21T08:11:38.8644374Z       %47 = arith.truncf %46 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T08:11:38.8644652Z       %48 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T08:11:38.8644958Z       %49 = tt.addptr %48, %35 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T08:11:38.8645227Z       tt.store %49, %47 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T08:11:38.8645579Z     } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:11:38.8645912Z     tt.return
2026-02-21T08:11:38.8646042Z   }
2026-02-21T08:11:38.8646175Z }
2026-02-21T08:11:38.8646245Z 
2026-02-21T08:11:38.8646297Z {-#
2026-02-21T08:11:38.8646439Z   external_resources: {
2026-02-21T08:11:38.8646600Z     mlir_reproducer: {
2026-02-21T08:11:38.8651116Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:11:38.8655585Z       disable_threading: false,
2026-02-21T08:11:38.8655758Z       verify_each: true
2026-02-21T08:11:38.8655902Z     }
2026-02-21T08:11:38.8656028Z   }
2026-02-21T08:11:38.8656142Z #-}
2026-02-21T08:11:38.8656604Z /tmp/torchinductor_root/57/c57bheky2o2t4z7zwekod4ad533kma5u7rgiwiyx473bitht7jc2.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:11:38.8657800Z /tmp/torchinductor_root/57/c57bheky2o2t4z7zwekod4ad533kma5u7rgiwiyx473bitht7jc2.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:11:38.8658768Z [35s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:11:38.8659837Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'last'], maxnreg=64, num_sm_multiplier=4, num_stages=5, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[2, 3], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:11:38.8660852Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:11:38.8661106Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:11:40.5010852Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:11:40.5016399Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:11:40.5020437Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:11:40.5025047Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:11:40.5028974Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:11:40.5033017Z     %cst = arith.constant dense<896> : tensor<32x1xi32>
2026-02-21T08:11:40.5033376Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<32xf32>
2026-02-21T08:11:40.5033648Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<32xf32>
2026-02-21T08:11:40.5033867Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:11:40.5034061Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:11:40.5034253Z     %c896_i32 = arith.constant 896 : i32
2026-02-21T08:11:40.5034433Z     %c896_i64 = arith.constant 896 : i64
2026-02-21T08:11:40.5034619Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:11:40.5034927Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c896_i32], [%c896_i64, %c1_i64] : <f16>, <tensor<32x32xf16>>
2026-02-21T08:11:40.5035252Z     %1 = tt.get_program_id x : i32
2026-02-21T08:11:40.5035459Z     scf.for %arg2 = %1 to %c128_i32 step %c9472_i32  : i32 {
2026-02-21T08:11:40.5035682Z       %2 = arith.muli %arg2, %c32_i32 : i32
2026-02-21T08:11:40.5036214Z       %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32>
2026-02-21T08:11:40.5036480Z       %4 = tt.splat %2 : i32 -> tensor<32xi32>
2026-02-21T08:11:40.5036681Z       %5 = arith.addi %4, %3 : tensor<32xi32>
2026-02-21T08:11:40.5036868Z       %c864_i32 = arith.constant 864 : i32
2026-02-21T08:11:40.5037062Z       %c96_i32 = arith.constant 96 : i32
2026-02-21T08:11:40.5037414Z       %6:2 = scf.for %arg3 = %c0_i32 to %c864_i32 step %c96_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<32xf32>, tensor<32xf32>)  : i32 {
2026-02-21T08:11:40.5037871Z         %47 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:11:40.5038205Z         %48 = arith.extf %47 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:11:40.5038437Z         %49 = "tt.reduce"(%48) <{axis = 1 : i32}> ({
2026-02-21T08:11:40.5038639Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:40.5038836Z           %105 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:11:40.5039049Z           tt.reduce.return %105 : f32
2026-02-21T08:11:40.5039237Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:11:40.5039469Z         %50 = arith.truncf %49 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:11:40.5039720Z         %51 = arith.extf %50 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:11:40.5039953Z         %52 = arith.cmpf ogt, %arg4, %51 : tensor<32xf32>
2026-02-21T08:11:40.5040182Z         %53 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32>
2026-02-21T08:11:40.5040392Z         %54 = arith.ori %52, %53 : tensor<32xi1>
2026-02-21T08:11:40.5040627Z         %55 = arith.select %54, %arg4, %51 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:11:40.5040863Z         %56 = arith.subf %arg4, %55 : tensor<32xf32>
2026-02-21T08:11:40.5041233Z         %57 = tt.extern_elementwise %56 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:11:40.5041681Z         %58 = arith.mulf %arg5, %57 : tensor<32xf32>
2026-02-21T08:11:40.5042082Z         %59 = tt.expand_dims %55 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:11:40.5042376Z         %60 = tt.broadcast %59 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:11:40.5042608Z         %61 = arith.subf %48, %60 : tensor<32x32xf32>
2026-02-21T08:11:40.5042977Z         %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:11:40.5043347Z         %63 = "tt.reduce"(%62) <{axis = 1 : i32}> ({
2026-02-21T08:11:40.5043535Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:40.5043724Z           %105 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:11:40.5043910Z           tt.reduce.return %105 : f32
2026-02-21T08:11:40.5044103Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:11:40.5044300Z         %64 = arith.addf %58, %63 : tensor<32xf32>
2026-02-21T08:11:40.5044498Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:11:40.5044690Z         %65 = arith.muli %c32_i32, %c1_i32 : i32
2026-02-21T08:11:40.5044888Z         %66 = arith.addi %arg3, %65 : i32
2026-02-21T08:11:40.5045161Z         %67 = tt.descriptor_load %0[%2, %66] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:11:40.5045465Z         %68 = arith.extf %67 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:11:40.5045695Z         %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({
2026-02-21T08:11:40.5045879Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:40.5046063Z           %105 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:11:40.5046248Z           tt.reduce.return %105 : f32
2026-02-21T08:11:40.5046433Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:11:40.5046657Z         %70 = arith.truncf %69 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:11:40.5046890Z         %71 = arith.extf %70 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:11:40.5047120Z         %72 = arith.cmpf ogt, %55, %71 : tensor<32xf32>
2026-02-21T08:11:40.5047488Z         %73 = arith.cmpf une, %55, %55 : tensor<32xf32>
2026-02-21T08:11:40.5047719Z         %74 = arith.ori %72, %73 : tensor<32xi1>
2026-02-21T08:11:40.5047945Z         %75 = arith.select %74, %55, %71 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:11:40.5048184Z         %76 = arith.subf %55, %75 : tensor<32xf32>
2026-02-21T08:11:40.5048539Z         %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:11:40.5048895Z         %78 = arith.mulf %64, %77 : tensor<32xf32>
2026-02-21T08:11:40.5049152Z         %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:11:40.5049440Z         %80 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:11:40.5049680Z         %81 = arith.subf %68, %80 : tensor<32x32xf32>
2026-02-21T08:11:40.5050103Z         %82 = tt.extern_elementwise %81 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:11:40.5050525Z         %83 = "tt.reduce"(%82) <{axis = 1 : i32}> ({
2026-02-21T08:11:40.5050728Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:40.5050938Z           %105 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:11:40.5051136Z           tt.reduce.return %105 : f32
2026-02-21T08:11:40.5051323Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:11:40.5051564Z         %84 = arith.addf %78, %83 : tensor<32xf32>
2026-02-21T08:11:40.5051761Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:11:40.5051967Z         %85 = arith.muli %c32_i32, %c2_i32 : i32
2026-02-21T08:11:40.5052164Z         %86 = arith.addi %arg3, %85 : i32
2026-02-21T08:11:40.5052433Z         %87 = tt.descriptor_load %0[%2, %86] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:11:40.5052753Z         %88 = arith.extf %87 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:11:40.5052984Z         %89 = "tt.reduce"(%88) <{axis = 1 : i32}> ({
2026-02-21T08:11:40.5053259Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:40.5053447Z           %105 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:11:40.5053645Z           tt.reduce.return %105 : f32
2026-02-21T08:11:40.5053830Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:11:40.5054052Z         %90 = arith.truncf %89 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:11:40.5054308Z         %91 = arith.extf %90 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:11:40.5054530Z         %92 = arith.cmpf ogt, %75, %91 : tensor<32xf32>
2026-02-21T08:11:40.5054746Z         %93 = arith.cmpf une, %75, %75 : tensor<32xf32>
2026-02-21T08:11:40.5054943Z         %94 = arith.ori %92, %93 : tensor<32xi1>
2026-02-21T08:11:40.5055172Z         %95 = arith.select %94, %75, %91 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:11:40.5055403Z         %96 = arith.subf %75, %95 : tensor<32xf32>
2026-02-21T08:11:40.5055751Z         %97 = tt.extern_elementwise %96 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:11:40.5056107Z         %98 = arith.mulf %84, %97 : tensor<32xf32>
2026-02-21T08:11:40.5056352Z         %99 = tt.expand_dims %95 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:11:40.5056649Z         %100 = tt.broadcast %99 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:11:40.5056883Z         %101 = arith.subf %88, %100 : tensor<32x32xf32>
2026-02-21T08:11:40.5057251Z         %102 = tt.extern_elementwise %101 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:11:40.5057626Z         %103 = "tt.reduce"(%102) <{axis = 1 : i32}> ({
2026-02-21T08:11:40.5057815Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:40.5057996Z           %105 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:11:40.5058175Z           tt.reduce.return %105 : f32
2026-02-21T08:11:40.5058358Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:11:40.5058602Z         %104 = arith.addf %98, %103 : tensor<32xf32>
2026-02-21T08:11:40.5058826Z         scf.yield %95, %104 : tensor<32xf32>, tensor<32xf32>
2026-02-21T08:11:40.5059043Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:11:40.5059330Z       %7 = tt.descriptor_load %0[%2, %c864_i32] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:11:40.5059648Z       %8 = arith.extf %7 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:11:40.5059870Z       %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({
2026-02-21T08:11:40.5060061Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:11:40.5060238Z         %47 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:11:40.5060433Z         tt.reduce.return %47 : f32
2026-02-21T08:11:40.5060621Z       }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:11:40.5060835Z       %10 = arith.truncf %9 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:11:40.5061079Z       %11 = arith.extf %10 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:11:40.5061302Z       %12 = arith.cmpf ogt, %6#0, %11 : tensor<32xf32>
2026-02-21T08:11:40.5061521Z       %13 = arith.cmpf une, %6#0, %6#0 : tensor<32xf32>
2026-02-21T08:11:40.5061770Z       %14 = arith.ori %12, %13 : tensor<32xi1>
2026-02-21T08:11:40.5062002Z       %15 = arith.select %14, %6#0, %11 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:11:40.5062238Z       %16 = arith.subf %6#0, %15 : tensor<32xf32>
2026-02-21T08:11:40.5062582Z       %17 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:11:40.5062941Z       %18 = arith.mulf %6#1, %17 : tensor<32xf32>
2026-02-21T08:11:40.5063188Z       %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:11:40.5063482Z       %20 = tt.broadcast %19 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:11:40.5063711Z       %21 = arith.subf %8, %20 : tensor<32x32xf32>
2026-02-21T08:11:40.5064079Z       %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:11:40.5064504Z       %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({
2026-02-21T08:11:40.5064693Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:11:40.5064875Z         %47 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:11:40.5065056Z         tt.reduce.return %47 : f32
2026-02-21T08:11:40.5065242Z       }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:11:40.5065432Z       %24 = arith.addf %18, %23 : tensor<32xf32>
2026-02-21T08:11:40.5065627Z       %c864_i32_2 = arith.constant 864 : i32
2026-02-21T08:11:40.5065819Z       %c96_i32_3 = arith.constant 96 : i32
2026-02-21T08:11:40.5066043Z       scf.for %arg3 = %c0_i32 to %c864_i32_2 step %c96_i32_3  : i32 {
2026-02-21T08:11:40.5066291Z         %47 = tt.splat %arg3 : i32 -> tensor<32xi32>
2026-02-21T08:11:40.5066491Z         %48 = arith.addi %47, %3 : tensor<32xi32>
2026-02-21T08:11:40.5066775Z         %49 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:11:40.5067049Z         %50 = arith.muli %49, %cst : tensor<32x1xi32>
2026-02-21T08:11:40.5067300Z         %51 = tt.expand_dims %48 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:11:40.5067593Z         %52 = tt.broadcast %50 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:11:40.5067847Z         %53 = tt.broadcast %51 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:11:40.5068086Z         %54 = arith.addi %52, %53 : tensor<32x32xi32>
2026-02-21T08:11:40.5068330Z         %55 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:11:40.5068608Z         %56 = tt.addptr %55, %54 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:11:40.5068916Z         %57 = tt.load %56 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:11:40.5069222Z         %58 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:11:40.5069566Z         %59 = arith.extf %57 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:11:40.5069828Z         %60 = tt.broadcast %58 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:11:40.5070052Z         %61 = arith.subf %59, %60 : tensor<32x32xf32>
2026-02-21T08:11:40.5070418Z         %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:11:40.5070841Z         %63 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:11:40.5071126Z         %64 = tt.broadcast %63 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:11:40.5071352Z         %65 = arith.divf %62, %64 : tensor<32x32xf32>
2026-02-21T08:11:40.5071664Z         %66 = arith.truncf %65 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:11:40.5071945Z         %67 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:11:40.5072227Z         %68 = tt.addptr %67, %54 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:11:40.5072497Z         tt.store %68, %66 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:11:40.5072709Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:11:40.5072910Z         %69 = arith.muli %c32_i32, %c1_i32 : i32
2026-02-21T08:11:40.5073133Z         %70 = arith.addi %arg3, %69 : i32
2026-02-21T08:11:40.5073347Z         %71 = tt.splat %70 : i32 -> tensor<32xi32>
2026-02-21T08:11:40.5073558Z         %72 = arith.addi %71, %3 : tensor<32xi32>
2026-02-21T08:11:40.5073820Z         %73 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:11:40.5074095Z         %74 = arith.muli %73, %cst : tensor<32x1xi32>
2026-02-21T08:11:40.5074347Z         %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:11:40.5074647Z         %76 = tt.broadcast %74 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:11:40.5074911Z         %77 = tt.broadcast %75 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:11:40.5075156Z         %78 = arith.addi %76, %77 : tensor<32x32xi32>
2026-02-21T08:11:40.5075453Z         %79 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:11:40.5075744Z         %80 = tt.addptr %79, %78 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:11:40.5076063Z         %81 = tt.load %80 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:11:40.5076379Z         %82 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:11:40.5076676Z         %83 = arith.extf %81 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:11:40.5076939Z         %84 = tt.broadcast %82 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:11:40.5077185Z         %85 = arith.subf %83, %84 : tensor<32x32xf32>
2026-02-21T08:11:40.5077569Z         %86 = tt.extern_elementwise %85 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:11:40.5077995Z         %87 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:11:40.5078301Z         %88 = tt.broadcast %87 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:11:40.5078544Z         %89 = arith.divf %86, %88 : tensor<32x32xf32>
2026-02-21T08:11:40.5078790Z         %90 = arith.truncf %89 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:11:40.5079062Z         %91 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:11:40.5079350Z         %92 = tt.addptr %91, %78 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:11:40.5079618Z         tt.store %92, %90 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:11:40.5079825Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:11:40.5080031Z         %93 = arith.muli %c32_i32, %c2_i32 : i32
2026-02-21T08:11:40.5080215Z         %94 = arith.addi %arg3, %93 : i32
2026-02-21T08:11:40.5080409Z         %95 = tt.splat %94 : i32 -> tensor<32xi32>
2026-02-21T08:11:40.5080604Z         %96 = arith.addi %95, %3 : tensor<32xi32>
2026-02-21T08:11:40.5080903Z         %97 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:11:40.5081164Z         %98 = arith.muli %97, %cst : tensor<32x1xi32>
2026-02-21T08:11:40.5081404Z         %99 = tt.expand_dims %96 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:11:40.5081727Z         %100 = tt.broadcast %98 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:11:40.5081981Z         %101 = tt.broadcast %99 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:11:40.5082227Z         %102 = arith.addi %100, %101 : tensor<32x32xi32>
2026-02-21T08:11:40.5082465Z         %103 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:11:40.5082756Z         %104 = tt.addptr %103, %102 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:11:40.5083069Z         %105 = tt.load %104 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:11:40.5083379Z         %106 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:11:40.5083675Z         %107 = arith.extf %105 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:11:40.5083938Z         %108 = tt.broadcast %106 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:11:40.5084183Z         %109 = arith.subf %107, %108 : tensor<32x32xf32>
2026-02-21T08:11:40.5084556Z         %110 = tt.extern_elementwise %109 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:11:40.5084975Z         %111 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:11:40.5085270Z         %112 = tt.broadcast %111 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:11:40.5085507Z         %113 = arith.divf %110, %112 : tensor<32x32xf32>
2026-02-21T08:11:40.5085749Z         %114 = arith.truncf %113 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:11:40.5086021Z         %115 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:11:40.5086316Z         %116 = tt.addptr %115, %102 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:11:40.5086665Z         tt.store %116, %114 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:11:40.5086876Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:11:40.5087091Z       %25 = tt.splat %c864_i32_2 : i32 -> tensor<32xi32>
2026-02-21T08:11:40.5087296Z       %26 = arith.addi %25, %3 : tensor<32xi32>
2026-02-21T08:11:40.5087546Z       %27 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:11:40.5087806Z       %28 = arith.muli %27, %cst : tensor<32x1xi32>
2026-02-21T08:11:40.5088053Z       %29 = tt.expand_dims %26 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:11:40.5088343Z       %30 = tt.broadcast %28 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:11:40.5088599Z       %31 = tt.broadcast %29 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:11:40.5088832Z       %32 = arith.addi %30, %31 : tensor<32x32xi32>
2026-02-21T08:11:40.5089064Z       %33 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:11:40.5089346Z       %34 = tt.addptr %33, %32 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:11:40.5089649Z       %35 = tt.load %34 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:11:40.5089953Z       %36 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:11:40.5090238Z       %37 = arith.extf %35 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:11:40.5090490Z       %38 = tt.broadcast %36 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:11:40.5090724Z       %39 = arith.subf %37, %38 : tensor<32x32xf32>
2026-02-21T08:11:40.5091082Z       %40 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:11:40.5091494Z       %41 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:11:40.5091815Z       %42 = tt.broadcast %41 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:11:40.5092087Z       %43 = arith.divf %40, %42 : tensor<32x32xf32>
2026-02-21T08:11:40.5092324Z       %44 = arith.truncf %43 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:11:40.5092578Z       %45 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:11:40.5092849Z       %46 = tt.addptr %45, %32 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:11:40.5093102Z       tt.store %46, %44 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:11:40.5093366Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:11:40.5093621Z     tt.return
2026-02-21T08:11:40.5093749Z   }
2026-02-21T08:11:40.5093879Z }
2026-02-21T08:11:40.5093948Z 
2026-02-21T08:11:40.5093998Z {-#
2026-02-21T08:11:40.5094132Z   external_resources: {
2026-02-21T08:11:40.5094286Z     mlir_reproducer: {
2026-02-21T08:11:40.5098562Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:11:40.5103024Z       disable_threading: false,
2026-02-21T08:11:40.5103218Z       verify_each: true
2026-02-21T08:11:40.5103383Z     }
2026-02-21T08:11:40.5103522Z   }
2026-02-21T08:11:40.5103668Z #-}
2026-02-21T08:11:40.5104151Z /tmp/torchinductor_root/xw/cxwatlt3kybgikn2evimt44eih4znjmoz32sxtzwi3iwawmh5uqd.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:11:40.5105478Z /tmp/torchinductor_root/xw/cxwatlt3kybgikn2evimt44eih4znjmoz32sxtzwi3iwawmh5uqd.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:11:40.5106554Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:11:40.5107725Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=128, num_sm_multiplier=64, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[3, 3], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:11:40.5108736Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:11:40.5109050Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:11:42.2400119Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 17.1 configs/s
2026-02-21T08:11:42.2413956Z [39s] Adaptive compile timeout: 30s (90% percentile=1.5s, bounds=[30.0s, 30s])
2026-02-21T08:11:42.6316946Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 2551.3 configs/s
2026-02-21T08:11:42.6772418Z [39s] Initial random population of 100, 5 starting points: 
2026-02-21T08:11:42.6776730Z error=6
2026-02-21T08:11:42.6779854Z timeout=1
2026-02-21T08:11:42.6784390Z ok=93
2026-02-21T08:11:42.6786487Z min=0.0123
2026-02-21T08:11:42.6786684Z mid=0.1127
2026-02-21T08:11:42.6791358Z max=8.6245
2026-02-21T08:11:42.6793265Z best={'block_sizes': [1, 1024],
2026-02-21T08:11:42.6793538Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:11:42.6793770Z  'load_eviction_policies': ['', 'last'],
2026-02-21T08:11:42.6793956Z  'maxnreg': 32,
2026-02-21T08:11:42.6794142Z  'num_sm_multiplier': 64,
2026-02-21T08:11:42.6794325Z  'num_stages': 7,
2026-02-21T08:11:42.6794461Z  'num_warps': 4,
2026-02-21T08:11:42.6794621Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:11:42.6794810Z  'range_flattens': [None, True],
2026-02-21T08:11:42.6794989Z  'range_multi_buffers': [False, True],
2026-02-21T08:11:42.6795170Z  'range_num_stages': [1, 4],
2026-02-21T08:11:42.6795331Z  'range_unroll_factors': [1, 4],
2026-02-21T08:11:42.6795511Z  'range_warp_specializes': [True, None]}
2026-02-21T08:11:42.6795720Z [39s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:11:43.6956862Z [40s] Generation 1 starting: 82 neighbors, 5 active search path(s)
2026-02-21T08:11:49.2056607Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 8.4 configs/s
2026-02-21T08:11:54.7385272Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 15.7 configs/s
2026-02-21T08:11:58.2010509Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 295.3 configs/s
2026-02-21T08:11:58.5297272Z [55s] Generation 1 complete: 
2026-02-21T08:11:58.5299058Z ok=88
2026-02-21T08:11:58.5299215Z min=0.0102
2026-02-21T08:11:58.5299352Z mid=0.0142
2026-02-21T08:11:58.5299472Z max=0.1782
2026-02-21T08:11:58.5299618Z best={'block_sizes': [4, 1024],
2026-02-21T08:11:58.5299828Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:11:58.5300057Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:11:58.5300243Z  'num_stages': 5,
2026-02-21T08:11:58.5300380Z  'num_warps': 16,
2026-02-21T08:11:58.5300524Z  'pid_type': 'flat',
2026-02-21T08:11:58.5300676Z  'range_flattens': [None, True],
2026-02-21T08:11:58.5300857Z  'range_multi_buffers': [None, True],
2026-02-21T08:11:58.5301035Z  'range_num_stages': [0, 1],
2026-02-21T08:11:58.5301202Z  'range_unroll_factors': [0, 3],
2026-02-21T08:11:58.5301379Z  'range_warp_specializes': [None, False]}
2026-02-21T08:11:58.5310572Z [55s] Fitting surrogate: 188 points, 188 targets
2026-02-21T08:11:59.4406909Z [56s] Generation 2 starting: 62 neighbors, 5 active search path(s)
2026-02-21T08:12:02.7791200Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66/66 39.6 configs/s
2026-02-21T08:12:07.2205060Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 66/66 15.0 configs/s
2026-02-21T08:12:10.6700790Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 296.7 configs/s
2026-02-21T08:12:11.0360477Z [67s] Generation 2 complete: 
2026-02-21T08:12:11.0365386Z ok=68
2026-02-21T08:12:11.0370801Z min=0.0102
2026-02-21T08:12:11.0372339Z mid=0.0123
2026-02-21T08:12:11.0372521Z max=0.0185
2026-02-21T08:12:11.0372691Z best={'block_sizes': [4, 1024],
2026-02-21T08:12:11.0372939Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:12:11.0373178Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:12:11.0373417Z  'num_stages': 5,
2026-02-21T08:12:11.0373594Z  'num_warps': 16,
2026-02-21T08:12:11.0373731Z  'pid_type': 'flat',
2026-02-21T08:12:11.0373891Z  'range_flattens': [None, True],
2026-02-21T08:12:11.0374065Z  'range_multi_buffers': [None, True],
2026-02-21T08:12:11.0374251Z  'range_num_stages': [0, 1],
2026-02-21T08:12:11.0374412Z  'range_unroll_factors': [0, 3],
2026-02-21T08:12:11.0374598Z  'range_warp_specializes': [None, False]}
2026-02-21T08:12:11.0374917Z [67s] Fitting surrogate: 256 points, 256 targets
2026-02-21T08:12:12.0664199Z [68s] Generation 3 starting: 63 neighbors, 5 active search path(s)
2026-02-21T08:12:15.8301237Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64/64 27.4 configs/s
2026-02-21T08:12:19.8772019Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 64/64 16.0 configs/s
2026-02-21T08:12:23.4302414Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 288.2 configs/s
2026-02-21T08:12:23.7988683Z [80s] Generation 3 complete: 
2026-02-21T08:12:23.7992795Z ok=69
2026-02-21T08:12:23.7994536Z min=0.0102
2026-02-21T08:12:23.7994743Z mid=0.0102
2026-02-21T08:12:23.7994912Z max=0.0173
2026-02-21T08:12:23.7995110Z best={'block_sizes': [1, 1024],
2026-02-21T08:12:23.7995482Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:12:23.7995827Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:12:23.7996052Z  'num_stages': 7,
2026-02-21T08:12:23.8000233Z  'num_warps': 2,
2026-02-21T08:12:23.8001919Z  'pid_type': 'flat',
2026-02-21T08:12:23.8002176Z  'range_flattens': [None, True],
2026-02-21T08:12:23.8002413Z  'range_multi_buffers': [None, None],
2026-02-21T08:12:23.8002650Z  'range_num_stages': [0, 4],
2026-02-21T08:12:23.8002854Z  'range_unroll_factors': [0, 4],
2026-02-21T08:12:23.8003085Z  'range_warp_specializes': [None, None]}
2026-02-21T08:12:23.8003428Z [80s] Fitting surrogate: 325 points, 325 targets
2026-02-21T08:12:24.5978382Z [81s] Generation 4 starting: 45 neighbors, 4 active search path(s)
2026-02-21T08:12:27.3919729Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46/46 21.6 configs/s
2026-02-21T08:12:30.3279017Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 46/46 15.9 configs/s
2026-02-21T08:12:32.8506895Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 406.3 configs/s
2026-02-21T08:12:33.1169753Z [89s] Generation 4 complete: 
2026-02-21T08:12:33.1171304Z ok=49
2026-02-21T08:12:33.1171487Z min=0.0091
2026-02-21T08:12:33.1171690Z mid=0.0102
2026-02-21T08:12:33.1171828Z max=0.0164
2026-02-21T08:12:33.1171979Z best={'block_sizes': [1, 1024],
2026-02-21T08:12:33.1172232Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:12:33.1172486Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:12:33.1172689Z  'num_stages': 6,
2026-02-21T08:12:33.1173318Z  'num_warps': 1,
2026-02-21T08:12:33.1173507Z  'pid_type': 'flat',
2026-02-21T08:12:33.1173676Z  'range_flattens': [None, True],
2026-02-21T08:12:33.1173878Z  'range_multi_buffers': [None, True],
2026-02-21T08:12:33.1174082Z  'range_num_stages': [0, 1],
2026-02-21T08:12:33.1174255Z  'range_unroll_factors': [0, 0],
2026-02-21T08:12:33.1174456Z  'range_warp_specializes': [None, False]}
2026-02-21T08:12:33.1184175Z [89s] Fitting surrogate: 374 points, 374 targets
2026-02-21T08:12:33.7321853Z [90s] Generation 5 starting: 32 neighbors, 3 active search path(s)
2026-02-21T08:12:35.5380836Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 20.8 configs/s
2026-02-21T08:12:37.9089176Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 32/32 13.7 configs/s
2026-02-21T08:12:39.6233058Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 597.0         
2026-02-21T08:12:39.6234361Z                                                                   configs/s     
2026-02-21T08:12:39.8109558Z [96s] Generation 5 complete: 
2026-02-21T08:12:39.8113344Z ok=35
2026-02-21T08:12:39.8114850Z min=0.0084
2026-02-21T08:12:39.8115046Z mid=0.0101
2026-02-21T08:12:39.8115187Z max=0.0210
2026-02-21T08:12:39.8115402Z best={'block_sizes': [1, 1024],
2026-02-21T08:12:39.8115661Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:12:39.8115955Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:12:39.8116154Z  'num_stages': 6,
2026-02-21T08:12:39.8116314Z  'num_warps': 1,
2026-02-21T08:12:39.8116466Z  'pid_type': 'flat',
2026-02-21T08:12:39.8116644Z  'range_flattens': [None, True],
2026-02-21T08:12:39.8116859Z  'range_multi_buffers': [None, True],
2026-02-21T08:12:39.8117080Z  'range_num_stages': [0, 1],
2026-02-21T08:12:39.8117260Z  'range_unroll_factors': [0, 0],
2026-02-21T08:12:39.8117478Z  'range_warp_specializes': [None, False]}
2026-02-21T08:12:39.8124006Z [96s] Fitting surrogate: 409 points, 409 targets
2026-02-21T08:12:40.3376125Z [97s] Generation 6 starting: 19 neighbors, 3 active search path(s)
2026-02-21T08:12:41.6032748Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 16.2 configs/s
2026-02-21T08:12:42.9842866Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 19/19 14.1 configs/s
2026-02-21T08:12:44.0702716Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 969.9         
2026-02-21T08:12:44.0703490Z                                                                   configs/s     
2026-02-21T08:12:44.1976198Z [101s] Generation 6 complete: 
2026-02-21T08:12:44.1976696Z ok=22
2026-02-21T08:12:44.1977011Z min=0.0162
2026-02-21T08:12:44.1977325Z mid=0.0220
2026-02-21T08:12:44.1977621Z max=0.0307
2026-02-21T08:12:44.1977956Z best={'block_sizes': [1, 1024],
2026-02-21T08:12:44.1978580Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:12:44.1979197Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:12:44.1979634Z  'num_stages': 2,
2026-02-21T08:12:44.1979966Z  'num_warps': 2,
2026-02-21T08:12:44.1980945Z  'pid_type': 'flat',
2026-02-21T08:12:44.1981364Z  'range_flattens': [None, False],
2026-02-21T08:12:44.1982231Z  'range_multi_buffers': [None, None],
2026-02-21T08:12:44.1982612Z  'range_num_stages': [0, 1],
2026-02-21T08:12:44.1982972Z  'range_unroll_factors': [0, 3],
2026-02-21T08:12:44.1983352Z  'range_warp_specializes': [None, False]}
2026-02-21T08:12:44.2003598Z [101s] Fitting surrogate: 431 points, 431 targets
2026-02-21T08:12:45.0344745Z [101s] Generation 7 starting: 33 neighbors, 3 active search path(s)
2026-02-21T08:12:48.0355430Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 14.2 configs/s
2026-02-21T08:12:50.4011411Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 15.5 configs/s
2026-02-21T08:12:52.5968282Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 543.7         
2026-02-21T08:12:52.5968846Z                                                                   configs/s     
2026-02-21T08:12:52.8083760Z [109s] Generation 7 complete: 
2026-02-21T08:12:52.8084098Z ok=37
2026-02-21T08:12:52.8086274Z min=0.0082
2026-02-21T08:12:52.8086480Z mid=0.0102
2026-02-21T08:12:52.8086674Z max=0.0162
2026-02-21T08:12:52.8086901Z best={'block_sizes': [1, 1024],
2026-02-21T08:12:52.8087267Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:12:52.8087662Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:12:52.8087935Z  'num_stages': 2,
2026-02-21T08:12:52.8088159Z  'num_warps': 2,
2026-02-21T08:12:52.8088382Z  'pid_type': 'flat',
2026-02-21T08:12:52.8088636Z  'range_flattens': [None, False],
2026-02-21T08:12:52.8088907Z  'range_multi_buffers': [None, True],
2026-02-21T08:12:52.8089174Z  'range_num_stages': [0, 1],
2026-02-21T08:12:52.8089438Z  'range_unroll_factors': [0, 3],
2026-02-21T08:12:52.8089717Z  'range_warp_specializes': [None, False]}
2026-02-21T08:12:52.8101070Z [109s] Fitting surrogate: 468 points, 468 targets
2026-02-21T08:12:53.4134885Z [110s] Generation 8 starting: 30 neighbors, 3 active search path(s)
2026-02-21T08:12:55.3615083Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 20.3 configs/s
2026-02-21T08:12:57.3505408Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 31/31 15.9 configs/s
2026-02-21T08:12:59.0702165Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 596.5         
2026-02-21T08:12:59.0702645Z                                                                   configs/s     
2026-02-21T08:12:59.2695713Z [116s] Generation 8 complete: 
2026-02-21T08:12:59.2696518Z ok=33
2026-02-21T08:12:59.2696712Z min=0.0100
2026-02-21T08:12:59.2696900Z mid=0.0102
2026-02-21T08:12:59.2697075Z max=0.0102
2026-02-21T08:12:59.2697279Z best={'block_sizes': [1, 1024],
2026-02-21T08:12:59.2697563Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:12:59.2697882Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:12:59.2698117Z  'num_stages': 2,
2026-02-21T08:12:59.2698286Z  'num_warps': 2,
2026-02-21T08:12:59.2698501Z  'pid_type': 'flat',
2026-02-21T08:12:59.2699229Z  'range_flattens': [None, False],
2026-02-21T08:12:59.2699440Z  'range_multi_buffers': [None, True],
2026-02-21T08:12:59.2699662Z  'range_num_stages': [0, 1],
2026-02-21T08:12:59.2699866Z  'range_unroll_factors': [0, 3],
2026-02-21T08:12:59.2700076Z  'range_warp_specializes': [None, False]}
2026-02-21T08:12:59.2715340Z [116s] Fitting surrogate: 501 points, 501 targets
2026-02-21T08:12:59.7270178Z [116s] Generation 9 starting: 12 neighbors, 1 active search path(s)
2026-02-21T08:13:00.5611631Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 39.0 configs/s
2026-02-21T08:13:01.3236121Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 12/12 16.7 configs/s
2026-02-21T08:13:02.0410882Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1412.9         
2026-02-21T08:13:02.0411251Z                                                                  configs/s      
2026-02-21T08:13:02.1242592Z [118s] Generation 9 complete: 
2026-02-21T08:13:02.1243097Z ok=14
2026-02-21T08:13:02.1243248Z min=0.0083
2026-02-21T08:13:02.1243401Z mid=0.0101
2026-02-21T08:13:02.1243522Z max=0.0102
2026-02-21T08:13:02.1243669Z best={'block_sizes': [1, 1024],
2026-02-21T08:13:02.1243906Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:13:02.1244174Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:13:02.1244357Z  'num_stages': 7,
2026-02-21T08:13:02.1244505Z  'num_warps': 2,
2026-02-21T08:13:02.1244642Z  'pid_type': 'flat',
2026-02-21T08:13:02.1244800Z  'range_flattens': [None, True],
2026-02-21T08:13:02.1244983Z  'range_multi_buffers': [None, True],
2026-02-21T08:13:02.1245162Z  'range_num_stages': [0, 4],
2026-02-21T08:13:02.1245332Z  'range_unroll_factors': [0, 3],
2026-02-21T08:13:02.1245508Z  'range_warp_specializes': [None, False]}
2026-02-21T08:13:02.1258634Z [118s] Fitting surrogate: 515 points, 515 targets
2026-02-21T08:13:02.4787357Z [119s] Generation 10 starting: 9 neighbors, 1 active search path(s)
2026-02-21T08:13:03.0100187Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9/9 43.9 configs/s
2026-02-21T08:13:03.5776050Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 9/9 17.2 configs/s
2026-02-21T08:13:04.3298258Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1967.0        
2026-02-21T08:13:04.3298749Z                                                                   configs/s     
2026-02-21T08:13:04.3902879Z [121s] Generation 10 complete: 
2026-02-21T08:13:04.3903206Z ok=10
2026-02-21T08:13:04.3907168Z min=0.0083
2026-02-21T08:13:04.3910363Z mid=0.0101
2026-02-21T08:13:04.3914384Z max=0.0123
2026-02-21T08:13:04.3918266Z best={'block_sizes': [1, 1024],
2026-02-21T08:13:04.3922452Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:13:04.3926380Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:13:04.3928117Z  'num_stages': 7,
2026-02-21T08:13:04.3928303Z  'num_warps': 2,
2026-02-21T08:13:04.3928490Z  'pid_type': 'flat',
2026-02-21T08:13:04.3928676Z  'range_flattens': [None, True],
2026-02-21T08:13:04.3928860Z  'range_multi_buffers': [None, True],
2026-02-21T08:13:04.3929039Z  'range_num_stages': [0, 4],
2026-02-21T08:13:04.3929208Z  'range_unroll_factors': [0, 3],
2026-02-21T08:13:04.3929385Z  'range_warp_specializes': [None, False]}
2026-02-21T08:13:04.3929607Z [121s] Fitting surrogate: 525 points, 525 targets
2026-02-21T08:13:04.6764528Z [121s] Autotuning complete in 121.5s after searching 500 configs.
2026-02-21T08:13:04.6766777Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:13:04.6767688Z     @helion.kernel(config=helion.Config(block_sizes=[1, 1024], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=7, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]), static_shapes=True)
2026-02-21T08:13:04.6768540Z 
2026-02-21T08:13:04.6769197Z [121s] Code of selected kernel: /tmp/torchinductor_root/xz/cxzmpdljubuu42pq6odv7xvbsl2kv7wub6tl52m7bjruwfhzfnkd.py
2026-02-21T08:13:05.3243409Z WARNING:tritonbench.utils.triton_op:Completed input ID 5:
2026-02-21T08:13:05.3248232Z (M, N)
2026-02-21T08:13:05.3252587Z -----------
2026-02-21T08:13:05.3255944Z (4096, 896)
2026-02-21T08:13:05.3259819Z 
2026-02-21T08:13:05.3263875Z  10%|█         | 2/20 [04:02<36:44, 122.48s/it]WARNING:tritonbench.utils.triton_op:Running input ID 10:
2026-02-21T08:13:05.3264739Z (M, N)
2026-02-21T08:13:05.3264875Z ------------
2026-02-21T08:13:05.3265017Z (4096, 1536)
2026-02-21T08:13:05.3265270Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T08:13:06.7753415Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T08:13:08.0988840Z INFO:tritonbench.utils.triton_op:Took 2.13ms to get benchmark function for torch_compile_softmax
2026-02-21T08:13:12.0912767Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:13:12.0913563Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:13:12.0913798Z               'dtype': 'torch.float16',
2026-02-21T08:13:12.0914030Z               'shape': (4096, 1536),
2026-02-21T08:13:12.0914256Z               'stride': (1536, 1)},),
2026-02-21T08:13:12.0914473Z   'kwargs': {}}
2026-02-21T08:13:12.0924871Z INFO:tritonbench.utils.triton_op:Took 1.59ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T08:13:12.2809404Z [0s] Autotune random seed: 2134816249
2026-02-21T08:13:12.4398101Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:13:39.5073404Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s
2026-02-21T08:13:45.5260258Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 16.7 configs/s
2026-02-21T08:13:45.5269607Z [33s] Adaptive compile timeout: 30s (90% percentile=2.0s, bounds=[30.0s, 30s])
2026-02-21T08:13:46.0042615Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 2089.8 configs/s
2026-02-21T08:13:46.0567745Z [33s] Initial random population of 100, 5 starting points: 
2026-02-21T08:13:46.0572332Z error=5
2026-02-21T08:13:46.0577699Z ok=95
2026-02-21T08:13:46.0579726Z min=0.0164
2026-02-21T08:13:46.0579880Z mid=0.1618
2026-02-21T08:13:46.0580015Z max=14.7754
2026-02-21T08:13:46.0580162Z best={'block_sizes': [4, 2048],
2026-02-21T08:13:46.0580419Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:13:46.0580682Z  'load_eviction_policies': ['', 'first'],
2026-02-21T08:13:46.0580867Z  'num_stages': 8,
2026-02-21T08:13:46.0581017Z  'num_warps': 4,
2026-02-21T08:13:46.0581159Z  'pid_type': 'flat',
2026-02-21T08:13:46.0581321Z  'range_flattens': [None, False],
2026-02-21T08:13:46.0581501Z  'range_multi_buffers': [None, False],
2026-02-21T08:13:46.0581765Z  'range_num_stages': [0, 0],
2026-02-21T08:13:46.0581963Z  'range_unroll_factors': [0, 0],
2026-02-21T08:13:46.0582167Z  'range_warp_specializes': [None, True]}
2026-02-21T08:13:46.0582370Z [33s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:13:47.0271020Z [34s] Generation 1 starting: 78 neighbors, 5 active search path(s)
2026-02-21T08:13:53.5241119Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 4.8 configs/s
2026-02-21T08:13:58.7805263Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 82/82 15.7 configs/s
2026-02-21T08:14:02.4061379Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 281.4         
2026-02-21T08:14:02.4062421Z                                                                   configs/s     
2026-02-21T08:14:02.7388559Z [50s] Generation 1 complete: 
2026-02-21T08:14:02.7393988Z ok=84
2026-02-21T08:14:02.7399512Z min=0.0123
2026-02-21T08:14:02.7400879Z mid=0.0164
2026-02-21T08:14:02.7401056Z max=0.2846
2026-02-21T08:14:02.7401216Z best={'block_sizes': [2, 2048],
2026-02-21T08:14:02.7401528Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:14:02.7405056Z  'load_eviction_policies': ['', ''],
2026-02-21T08:14:02.7405243Z  'num_stages': 8,
2026-02-21T08:14:02.7405403Z  'num_warps': 4,
2026-02-21T08:14:02.7405557Z  'pid_type': 'flat',
2026-02-21T08:14:02.7405719Z  'range_flattens': [None, False],
2026-02-21T08:14:02.7405922Z  'range_multi_buffers': [None, False],
2026-02-21T08:14:02.7406120Z  'range_num_stages': [0, 0],
2026-02-21T08:14:02.7406310Z  'range_unroll_factors': [0, 0],
2026-02-21T08:14:02.7406502Z  'range_warp_specializes': [None, True]}
2026-02-21T08:14:02.7406741Z [50s] Fitting surrogate: 184 points, 184 targets
2026-02-21T08:14:03.8094304Z [51s] Generation 2 starting: 66 neighbors, 5 active search path(s)
2026-02-21T08:14:06.8648337Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 35.6 configs/s
2026-02-21T08:14:11.3347204Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 15.6 configs/s
2026-02-21T08:14:14.6533754Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 307.9         
2026-02-21T08:14:14.6538138Z                                                                   configs/s     
2026-02-21T08:14:14.9416114Z [62s] Generation 2 complete: 
2026-02-21T08:14:14.9420387Z ok=71
2026-02-21T08:14:14.9424948Z min=0.0123
2026-02-21T08:14:14.9429400Z mid=0.0163
2026-02-21T08:14:14.9431772Z max=0.0350
2026-02-21T08:14:14.9431977Z best={'block_sizes': [2, 2048],
2026-02-21T08:14:14.9432231Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:14:14.9432500Z  'load_eviction_policies': ['', ''],
2026-02-21T08:14:14.9432678Z  'num_stages': 8,
2026-02-21T08:14:14.9432828Z  'num_warps': 4,
2026-02-21T08:14:14.9432980Z  'pid_type': 'flat',
2026-02-21T08:14:14.9433138Z  'range_flattens': [None, False],
2026-02-21T08:14:14.9433328Z  'range_multi_buffers': [None, False],
2026-02-21T08:14:14.9433514Z  'range_num_stages': [0, 0],
2026-02-21T08:14:14.9433688Z  'range_unroll_factors': [0, 0],
2026-02-21T08:14:14.9433892Z  'range_warp_specializes': [None, True]}
2026-02-21T08:14:14.9434130Z [62s] Fitting surrogate: 255 points, 255 targets
2026-02-21T08:14:15.8447379Z [63s] Generation 3 starting: 63 neighbors, 5 active search path(s)
2026-02-21T08:14:18.9944220Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 23.1 configs/s
2026-02-21T08:14:23.0364030Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 16.2 configs/s
2026-02-21T08:14:26.2615735Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 337.4         
2026-02-21T08:14:26.2616076Z                                                                   configs/s     
2026-02-21T08:14:26.5552660Z [74s] Generation 3 complete: 
2026-02-21T08:14:26.5554476Z ok=69
2026-02-21T08:14:26.5554655Z min=0.0123
2026-02-21T08:14:26.5554812Z mid=0.0143
2026-02-21T08:14:26.5554958Z max=0.0410
2026-02-21T08:14:26.5555110Z best={'block_sizes': [1, 2048],
2026-02-21T08:14:26.5555407Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:14:26.5555697Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:14:26.5555903Z  'num_stages': 4,
2026-02-21T08:14:26.5556050Z  'num_warps': 1,
2026-02-21T08:14:26.5556203Z  'pid_type': 'flat',
2026-02-21T08:14:26.5556365Z  'range_flattens': [None, None],
2026-02-21T08:14:26.5556553Z  'range_multi_buffers': [None, False],
2026-02-21T08:14:26.5556743Z  'range_num_stages': [0, 4],
2026-02-21T08:14:26.5556919Z  'range_unroll_factors': [0, 0],
2026-02-21T08:14:26.5557111Z  'range_warp_specializes': [None, True]}
2026-02-21T08:14:26.5570340Z [74s] Fitting surrogate: 324 points, 324 targets
2026-02-21T08:14:27.2677349Z [74s] Generation 4 starting: 46 neighbors, 4 active search path(s)
2026-02-21T08:14:29.6879696Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47/47 19.5 configs/s
2026-02-21T08:14:32.5622022Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 47/47 16.6 configs/s
2026-02-21T08:14:35.0766098Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 444.5         
2026-02-21T08:14:35.0767139Z                                                                   configs/s     
2026-02-21T08:14:35.2956913Z [82s] Generation 4 complete: 
2026-02-21T08:14:35.2962107Z ok=50
2026-02-21T08:14:35.2966696Z min=0.0102
2026-02-21T08:14:35.2971272Z mid=0.0123
2026-02-21T08:14:35.2975777Z max=0.0368
2026-02-21T08:14:35.2977905Z best={'block_sizes': [1, 2048],
2026-02-21T08:14:35.2978219Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:14:35.2978504Z  'load_eviction_policies': ['', ''],
2026-02-21T08:14:35.2978751Z  'num_stages': 5,
2026-02-21T08:14:35.2978905Z  'num_warps': 1,
2026-02-21T08:14:35.2982321Z  'pid_type': 'flat',
2026-02-21T08:14:35.2986886Z  'range_flattens': [None, True],
2026-02-21T08:14:35.2987202Z  'range_multi_buffers': [None, True],
2026-02-21T08:14:35.2987439Z  'range_num_stages': [0, 1],
2026-02-21T08:14:35.2991440Z  'range_unroll_factors': [0, 2],
2026-02-21T08:14:35.2996922Z  'range_warp_specializes': [None, False]}
2026-02-21T08:14:35.3000991Z [82s] Fitting surrogate: 374 points, 374 targets
2026-02-21T08:14:35.9349993Z [83s] Generation 5 starting: 39 neighbors, 4 active search path(s)
2026-02-21T08:14:39.8759092Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40/40 6.1 configs/s
2026-02-21T08:14:42.3402799Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 40/40 16.5 configs/s
2026-02-21T08:14:44.5064991Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 471.3         
2026-02-21T08:14:44.5065489Z                                                                   configs/s     
2026-02-21T08:14:44.7074250Z [92s] Generation 5 complete: 
2026-02-21T08:14:44.7078535Z ok=43
2026-02-21T08:14:44.7082928Z min=0.0102
2026-02-21T08:14:44.7087444Z mid=0.0123
2026-02-21T08:14:44.7091994Z max=0.1085
2026-02-21T08:14:44.7093577Z best={'block_sizes': [1, 2048],
2026-02-21T08:14:44.7093918Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:14:44.7097518Z  'load_eviction_policies': ['', ''],
2026-02-21T08:14:44.7101068Z  'num_stages': 5,
2026-02-21T08:14:44.7102478Z  'num_warps': 1,
2026-02-21T08:14:44.7102674Z  'pid_type': 'flat',
2026-02-21T08:14:44.7102843Z  'range_flattens': [None, True],
2026-02-21T08:14:44.7103040Z  'range_multi_buffers': [None, True],
2026-02-21T08:14:44.7103237Z  'range_num_stages': [0, 1],
2026-02-21T08:14:44.7103405Z  'range_unroll_factors': [0, 2],
2026-02-21T08:14:44.7103596Z  'range_warp_specializes': [None, False]}
2026-02-21T08:14:44.7103894Z [92s] Fitting surrogate: 417 points, 417 targets
2026-02-21T08:14:45.3127059Z [92s] Generation 6 starting: 35 neighbors, 4 active search path(s)
2026-02-21T08:14:47.4359882Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38/38 22.3 configs/s
2026-02-21T08:14:49.7991395Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 38/38 16.4 configs/s
2026-02-21T08:14:52.1101027Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 487.5         
2026-02-21T08:14:52.1105476Z                                                                   configs/s     
2026-02-21T08:14:52.3176610Z [99s] Generation 6 complete: 
2026-02-21T08:14:52.3178195Z ok=39
2026-02-21T08:14:52.3178367Z min=0.0102
2026-02-21T08:14:52.3178506Z mid=0.0123
2026-02-21T08:14:52.3178625Z max=0.0123
2026-02-21T08:14:52.3178771Z best={'block_sizes': [1, 2048],
2026-02-21T08:14:52.3179037Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:14:52.3179314Z  'load_eviction_policies': ['', ''],
2026-02-21T08:14:52.3179544Z  'num_stages': 5,
2026-02-21T08:14:52.3179693Z  'num_warps': 1,
2026-02-21T08:14:52.3179916Z  'pid_type': 'flat',
2026-02-21T08:14:52.3181374Z  'range_flattens': [None, True],
2026-02-21T08:14:52.3181648Z  'range_multi_buffers': [None, True],
2026-02-21T08:14:52.3181899Z  'range_num_stages': [0, 1],
2026-02-21T08:14:52.3182096Z  'range_unroll_factors': [0, 2],
2026-02-21T08:14:52.3182284Z  'range_warp_specializes': [None, False]}
2026-02-21T08:14:52.3189887Z [99s] Fitting surrogate: 456 points, 456 targets
2026-02-21T08:14:52.5824418Z [100s] Generation 7 starting: 10 neighbors, 1 active search path(s)
2026-02-21T08:14:53.5044188Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 18.3 configs/s
2026-02-21T08:14:54.1233762Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 17.4 configs/s
2026-02-21T08:14:54.7594941Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1587.9         
2026-02-21T08:14:54.7599035Z                                                                  configs/s      
2026-02-21T08:14:54.8258562Z [102s] Generation 7 complete: 
2026-02-21T08:14:54.8262837Z ok=12
2026-02-21T08:14:54.8264380Z min=0.0102
2026-02-21T08:14:54.8264549Z mid=0.0123
2026-02-21T08:14:54.8264687Z max=0.0123
2026-02-21T08:14:54.8264831Z best={'block_sizes': [1, 2048],
2026-02-21T08:14:54.8265110Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:14:54.8265438Z  'load_eviction_policies': ['', ''],
2026-02-21T08:14:54.8265641Z  'num_stages': 5,
2026-02-21T08:14:54.8265793Z  'num_warps': 1,
2026-02-21T08:14:54.8265937Z  'pid_type': 'flat',
2026-02-21T08:14:54.8266104Z  'range_flattens': [None, True],
2026-02-21T08:14:54.8266284Z  'range_multi_buffers': [None, True],
2026-02-21T08:14:54.8266476Z  'range_num_stages': [0, 1],
2026-02-21T08:14:54.8266643Z  'range_unroll_factors': [0, 2],
2026-02-21T08:14:54.8266829Z  'range_warp_specializes': [None, False]}
2026-02-21T08:14:54.8280695Z [102s] Fitting surrogate: 468 points, 468 targets
2026-02-21T08:14:54.9988720Z [102s] Autotuning complete in 102.6s after searching 451 configs.
2026-02-21T08:14:54.9990428Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:14:54.9991413Z     @helion.kernel(config=helion.Config(block_sizes=[1, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', ''], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[None, False]), static_shapes=True)
2026-02-21T08:14:54.9992534Z 
2026-02-21T08:14:54.9992790Z [102s] Code of selected kernel: /tmp/torchinductor_root/aw/caw6zyyh5d37sapqzmdiy3gwzprk2nqshcqamsu2nqflpi4266ja.py
2026-02-21T08:14:55.8270005Z WARNING:tritonbench.utils.triton_op:Completed input ID 10:
2026-02-21T08:14:55.8273310Z (M, N)
2026-02-21T08:14:55.8279730Z ------------
2026-02-21T08:14:55.8280048Z (4096, 1536)
2026-02-21T08:14:55.8280214Z 
2026-02-21T08:14:55.8280955Z  15%|█▌        | 3/20 [05:53<33:09, 117.01s/it]WARNING:tritonbench.utils.triton_op:Running input ID 15:
2026-02-21T08:14:55.8281358Z (M, N)
2026-02-21T08:14:55.8285351Z ------------
2026-02-21T08:14:55.8287345Z (4096, 2176)
2026-02-21T08:14:55.8287681Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T08:14:57.2528171Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T08:14:58.5048946Z INFO:tritonbench.utils.triton_op:Took 2.33ms to get benchmark function for torch_compile_softmax
2026-02-21T08:15:01.6170684Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:15:01.6174828Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:15:01.6179001Z               'dtype': 'torch.float16',
2026-02-21T08:15:01.6182902Z               'shape': (4096, 2176),
2026-02-21T08:15:01.6187395Z               'stride': (2176, 1)},),
2026-02-21T08:15:01.6191495Z   'kwargs': {}}
2026-02-21T08:15:01.6196335Z INFO:tritonbench.utils.triton_op:Took 1.95ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T08:15:01.7952460Z [0s] Autotune random seed: 2134816249
2026-02-21T08:15:01.9370186Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:15:34.6720575Z [32s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None])
2026-02-21T08:15:34.7888334Z [32s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False])
2026-02-21T08:15:34.7910174Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s
2026-02-21T08:15:40.8088256Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 16.6 configs/s
2026-02-21T08:15:40.8100072Z [38s] Adaptive compile timeout: 30s (90% percentile=2.8s, bounds=[30.0s, 30s])
2026-02-21T08:15:41.5405377Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1370.0 configs/s
2026-02-21T08:15:41.6113533Z [39s] Initial random population of 100, 5 starting points: 
2026-02-21T08:15:41.6117150Z error=5
2026-02-21T08:15:41.6119211Z timeout=2
2026-02-21T08:15:41.6119421Z ok=93
2026-02-21T08:15:41.6124819Z min=0.0224
2026-02-21T08:15:41.6126923Z mid=0.2211
2026-02-21T08:15:41.6127093Z max=14.6094
2026-02-21T08:15:41.6127254Z best={'block_sizes': [2, 1024],
2026-02-21T08:15:41.6127526Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:15:41.6127813Z  'load_eviction_policies': ['first', ''],
2026-02-21T08:15:41.6128010Z  'num_sm_multiplier': 64,
2026-02-21T08:15:41.6128175Z  'num_stages': 5,
2026-02-21T08:15:41.6128313Z  'num_warps': 1,
2026-02-21T08:15:41.6128473Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:15:41.6128661Z  'range_flattens': [True, True],
2026-02-21T08:15:41.6128874Z  'range_multi_buffers': [False, None],
2026-02-21T08:15:41.6129077Z  'range_num_stages': [3, 1],
2026-02-21T08:15:41.6129240Z  'range_unroll_factors': [0, 2],
2026-02-21T08:15:41.6129421Z  'range_warp_specializes': [True, None]}
2026-02-21T08:15:41.6129705Z [39s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:15:42.7867990Z [40s] Generation 1 starting: 88 neighbors, 5 active search path(s)
2026-02-21T08:15:47.3939634Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 22.9 configs/s
2026-02-21T08:15:53.0084660Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 16.5 configs/s
2026-02-21T08:15:56.2509728Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 314.0         
2026-02-21T08:15:56.2513472Z                                                                   configs/s     
2026-02-21T08:15:56.5241490Z [54s] Generation 1 complete: 
2026-02-21T08:15:56.5245412Z ok=94
2026-02-21T08:15:56.5250422Z min=0.0164
2026-02-21T08:15:56.5252592Z mid=0.0245
2026-02-21T08:15:56.5252852Z max=0.1107
2026-02-21T08:15:56.5253430Z best={'block_sizes': [1, 4096],
2026-02-21T08:15:56.5257540Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:15:56.5262018Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:15:56.5266297Z  'maxnreg': 256,
2026-02-21T08:15:56.5269512Z  'num_sm_multiplier': 16,
2026-02-21T08:15:56.5271683Z  'num_stages': 5,
2026-02-21T08:15:56.5271955Z  'num_warps': 4,
2026-02-21T08:15:56.5272132Z  'pid_type': 'persistent_blocked',
2026-02-21T08:15:56.5274804Z  'range_flattens': [None, False],
2026-02-21T08:15:56.5275102Z  'range_multi_buffers': [None, True],
2026-02-21T08:15:56.5279201Z  'range_num_stages': [3, 4],
2026-02-21T08:15:56.5282405Z  'range_unroll_factors': [1, 0],
2026-02-21T08:15:56.5286810Z  'range_warp_specializes': [None, True]}
2026-02-21T08:15:56.5290015Z [54s] Fitting surrogate: 194 points, 194 targets
2026-02-21T08:15:57.6697582Z [55s] Generation 2 starting: 82 neighbors, 5 active search path(s)
2026-02-21T08:16:09.6692124Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 1.8 configs/s
2026-02-21T08:16:14.8908864Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 16.4 configs/s
2026-02-21T08:16:18.5806296Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 276.4         
2026-02-21T08:16:18.5807612Z                                                                   configs/s     
2026-02-21T08:16:18.8890107Z [76s] Generation 2 complete: 
2026-02-21T08:16:18.8893899Z ok=88
2026-02-21T08:16:18.8898407Z min=0.0143
2026-02-21T08:16:18.8902545Z mid=0.0205
2026-02-21T08:16:18.8907341Z max=0.0777
2026-02-21T08:16:18.8911815Z best={'block_sizes': [1, 4096],
2026-02-21T08:16:18.8913210Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:16:18.8913481Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:16:18.8913686Z  'maxnreg': 256,
2026-02-21T08:16:18.8913839Z  'num_sm_multiplier': 16,
2026-02-21T08:16:18.8914010Z  'num_stages': 5,
2026-02-21T08:16:18.8914181Z  'num_warps': 1,
2026-02-21T08:16:18.8914352Z  'pid_type': 'persistent_blocked',
2026-02-21T08:16:18.8914539Z  'range_flattens': [None, False],
2026-02-21T08:16:18.8914716Z  'range_multi_buffers': [None, True],
2026-02-21T08:16:18.8914902Z  'range_num_stages': [3, 3],
2026-02-21T08:16:18.8915067Z  'range_unroll_factors': [1, 0],
2026-02-21T08:16:18.8915252Z  'range_warp_specializes': [None, False]}
2026-02-21T08:16:18.8915583Z [76s] Fitting surrogate: 282 points, 282 targets
2026-02-21T08:16:20.0589111Z [78s] Generation 3 starting: 83 neighbors, 5 active search path(s)
2026-02-21T08:16:24.8309374Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 19.4 configs/s
2026-02-21T08:16:29.9557595Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 16.5 configs/s
2026-02-21T08:16:34.0345848Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 250.3 configs/s
2026-02-21T08:16:34.3952211Z [92s] Generation 3 complete: 
2026-02-21T08:16:34.3954070Z error=2
2026-02-21T08:16:34.3954229Z ok=87
2026-02-21T08:16:34.3954360Z min=0.0143
2026-02-21T08:16:34.3954495Z mid=0.0184
2026-02-21T08:16:34.3954615Z max=0.0532
2026-02-21T08:16:34.3954759Z best={'block_sizes': [1, 4096],
2026-02-21T08:16:34.3954984Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:16:34.3955225Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:16:34.3955410Z  'maxnreg': 256,
2026-02-21T08:16:34.3955562Z  'num_sm_multiplier': 16,
2026-02-21T08:16:34.3955719Z  'num_stages': 5,
2026-02-21T08:16:34.3955863Z  'num_warps': 1,
2026-02-21T08:16:34.3956020Z  'pid_type': 'persistent_blocked',
2026-02-21T08:16:34.3956199Z  'range_flattens': [None, False],
2026-02-21T08:16:34.3956382Z  'range_multi_buffers': [None, True],
2026-02-21T08:16:34.3956562Z  'range_num_stages': [3, 4],
2026-02-21T08:16:34.3956743Z  'range_unroll_factors': [1, 0],
2026-02-21T08:16:34.3960321Z  'range_warp_specializes': [None, False]}
2026-02-21T08:16:34.3971106Z [92s] Fitting surrogate: 371 points, 371 targets
2026-02-21T08:16:35.2697707Z [93s] Generation 4 starting: 60 neighbors, 4 active search path(s)
2026-02-21T08:16:38.5768566Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 29.4 configs/s
2026-02-21T08:16:42.4252485Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 63/63 16.6 configs/s
2026-02-21T08:16:45.9549133Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 289.2 configs/s
2026-02-21T08:16:46.2796060Z [104s] Generation 4 complete: 
2026-02-21T08:16:46.2800147Z ok=65
2026-02-21T08:16:46.2802303Z min=0.0143
2026-02-21T08:16:46.2802465Z mid=0.0144
2026-02-21T08:16:46.2802600Z max=0.0266
2026-02-21T08:16:46.2802742Z best={'block_sizes': [1, 4096],
2026-02-21T08:16:46.2803026Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:16:46.2803825Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:16:46.2804033Z  'num_stages': 5,
2026-02-21T08:16:46.2804186Z  'num_warps': 4,
2026-02-21T08:16:46.2804330Z  'pid_type': 'flat',
2026-02-21T08:16:46.2804496Z  'range_flattens': [None, True],
2026-02-21T08:16:46.2804678Z  'range_multi_buffers': [None, True],
2026-02-21T08:16:46.2804873Z  'range_num_stages': [0, 1],
2026-02-21T08:16:46.2805044Z  'range_unroll_factors': [0, 0],
2026-02-21T08:16:46.2805231Z  'range_warp_specializes': [None, True]}
2026-02-21T08:16:46.2814884Z [104s] Fitting surrogate: 436 points, 436 targets
2026-02-21T08:16:47.0156973Z [105s] Generation 5 starting: 52 neighbors, 4 active search path(s)
2026-02-21T08:16:50.6182724Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 11.9 configs/s
2026-02-21T08:16:53.9282918Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 54/54 16.5 configs/s
2026-02-21T08:16:57.1678030Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 351.7 configs/s
2026-02-21T08:16:57.4452495Z [115s] Generation 5 complete: 
2026-02-21T08:16:57.4456546Z ok=56
2026-02-21T08:16:57.4460586Z min=0.0123
2026-02-21T08:16:57.4462628Z mid=0.0143
2026-02-21T08:16:57.4462863Z max=0.0225
2026-02-21T08:16:57.4463034Z best={'block_sizes': [1, 4096],
2026-02-21T08:16:57.4463317Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:16:57.4468405Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:16:57.4472235Z  'num_stages': 4,
2026-02-21T08:16:57.4475340Z  'num_warps': 1,
2026-02-21T08:16:57.4479284Z  'pid_type': 'flat',
2026-02-21T08:16:57.4483145Z  'range_flattens': [None, False],
2026-02-21T08:16:57.4487494Z  'range_multi_buffers': [None, False],
2026-02-21T08:16:57.4488992Z  'range_num_stages': [0, 4],
2026-02-21T08:16:57.4489219Z  'range_unroll_factors': [0, 3],
2026-02-21T08:16:57.4489468Z  'range_warp_specializes': [None, None]}
2026-02-21T08:16:57.4489790Z [115s] Fitting surrogate: 492 points, 492 targets
2026-02-21T08:16:58.1185695Z [116s] Generation 6 starting: 39 neighbors, 4 active search path(s)
2026-02-21T08:17:00.7963710Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 17.9 configs/s
2026-02-21T08:17:03.3880631Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 42/42 16.5 configs/s
2026-02-21T08:17:05.6458904Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 451.7 configs/s
2026-02-21T08:17:05.8611297Z [123s] Generation 6 complete: 
2026-02-21T08:17:05.8611931Z ok=44
2026-02-21T08:17:05.8612086Z min=0.0123
2026-02-21T08:17:05.8612219Z mid=0.0143
2026-02-21T08:17:05.8612346Z max=0.0266
2026-02-21T08:17:05.8612482Z best={'block_sizes': [1, 4096],
2026-02-21T08:17:05.8612739Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:17:05.8613425Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:17:05.8613789Z  'num_stages': 5,
2026-02-21T08:17:05.8613929Z  'num_warps': 1,
2026-02-21T08:17:05.8614080Z  'pid_type': 'flat',
2026-02-21T08:17:05.8614243Z  'range_flattens': [None, False],
2026-02-21T08:17:05.8614419Z  'range_multi_buffers': [None, True],
2026-02-21T08:17:05.8614604Z  'range_num_stages': [0, 1],
2026-02-21T08:17:05.8614766Z  'range_unroll_factors': [0, 1],
2026-02-21T08:17:05.8614948Z  'range_warp_specializes': [None, False]}
2026-02-21T08:17:05.8628861Z [123s] Fitting surrogate: 536 points, 536 targets
2026-02-21T08:17:06.4232020Z [124s] Generation 7 starting: 29 neighbors, 3 active search path(s)
2026-02-21T08:17:09.4192934Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 5.9 configs/s
2026-02-21T08:17:11.3207513Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 31/31 16.7 configs/s
2026-02-21T08:17:13.3667319Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 569.4 configs/s
2026-02-21T08:17:13.5383013Z [131s] Generation 7 complete: 
2026-02-21T08:17:13.5384669Z ok=33
2026-02-21T08:17:13.5384838Z min=0.0123
2026-02-21T08:17:13.5384981Z mid=0.0125
2026-02-21T08:17:13.5385110Z max=0.0225
2026-02-21T08:17:13.5385270Z best={'block_sizes': [1, 4096],
2026-02-21T08:17:13.5385544Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:17:13.5385843Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:17:13.5386028Z  'num_stages': 5,
2026-02-21T08:17:13.5386173Z  'num_warps': 1,
2026-02-21T08:17:13.5386315Z  'pid_type': 'flat',
2026-02-21T08:17:13.5386481Z  'range_flattens': [None, False],
2026-02-21T08:17:13.5386667Z  'range_multi_buffers': [None, None],
2026-02-21T08:17:13.5386850Z  'range_num_stages': [0, 1],
2026-02-21T08:17:13.5387021Z  'range_unroll_factors': [0, 1],
2026-02-21T08:17:13.5387222Z  'range_warp_specializes': [None, False]}
2026-02-21T08:17:13.5399319Z [131s] Fitting surrogate: 569 points, 569 targets
2026-02-21T08:17:14.1907745Z [132s] Generation 8 starting: 30 neighbors, 3 active search path(s)
2026-02-21T08:17:15.7871113Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 35.7 configs/s
2026-02-21T08:17:17.6247742Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 30/30 16.7 configs/s
2026-02-21T08:17:19.3435087Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 592.3 configs/s
2026-02-21T08:17:19.5022492Z [137s] Generation 8 complete: 
2026-02-21T08:17:19.5024361Z ok=33
2026-02-21T08:17:19.5024576Z min=0.0123
2026-02-21T08:17:19.5024742Z mid=0.0142
2026-02-21T08:17:19.5024909Z max=0.0247
2026-02-21T08:17:19.5025099Z best={'block_sizes': [1, 4096],
2026-02-21T08:17:19.5025369Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:17:19.5025690Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:17:19.5026265Z  'num_stages': 4,
2026-02-21T08:17:19.5026433Z  'num_warps': 1,
2026-02-21T08:17:19.5026598Z  'pid_type': 'flat',
2026-02-21T08:17:19.5026780Z  'range_flattens': [None, False],
2026-02-21T08:17:19.5026989Z  'range_multi_buffers': [None, None],
2026-02-21T08:17:19.5027202Z  'range_num_stages': [0, 4],
2026-02-21T08:17:19.5027394Z  'range_unroll_factors': [0, 3],
2026-02-21T08:17:19.5027607Z  'range_warp_specializes': [None, None]}
2026-02-21T08:17:19.5042212Z [137s] Fitting surrogate: 602 points, 602 targets
2026-02-21T08:17:19.9696302Z [138s] Generation 9 starting: 14 neighbors, 2 active search path(s)
2026-02-21T08:17:20.9959111Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 24.9 configs/s
2026-02-21T08:17:21.9159353Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 15/15 17.1 configs/s
2026-02-21T08:17:22.7386720Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1230.8 configs/s
2026-02-21T08:17:22.8176803Z [140s] Generation 9 complete: 
2026-02-21T08:17:22.8181802Z ok=17
2026-02-21T08:17:22.8185643Z min=0.0123
2026-02-21T08:17:22.8190045Z mid=0.0123
2026-02-21T08:17:22.8193928Z max=0.0267
2026-02-21T08:17:22.8195907Z best={'block_sizes': [1, 4096],
2026-02-21T08:17:22.8196213Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:17:22.8196484Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:17:22.8196680Z  'num_stages': 5,
2026-02-21T08:17:22.8196821Z  'num_warps': 1,
2026-02-21T08:17:22.8196973Z  'pid_type': 'flat',
2026-02-21T08:17:22.8197130Z  'range_flattens': [None, False],
2026-02-21T08:17:22.8197317Z  'range_multi_buffers': [None, None],
2026-02-21T08:17:22.8197496Z  'range_num_stages': [0, 1],
2026-02-21T08:17:22.8197667Z  'range_unroll_factors': [0, 1],
2026-02-21T08:17:22.8197852Z  'range_warp_specializes': [None, False]}
2026-02-21T08:17:22.8202029Z [140s] Fitting surrogate: 619 points, 619 targets
2026-02-21T08:17:23.3387999Z [141s] Generation 10 starting: 18 neighbors, 2 active search path(s)
2026-02-21T08:17:25.9632741Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 6.6 configs/s
2026-02-21T08:17:27.0423558Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 17.4 configs/s
2026-02-21T08:17:28.4955936Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 966.0 configs/s
2026-02-21T08:17:28.6059138Z [146s] Generation 10 complete: 
2026-02-21T08:17:28.6060878Z ok=20
2026-02-21T08:17:28.6061105Z min=0.0123
2026-02-21T08:17:28.6061303Z mid=0.0124
2026-02-21T08:17:28.6061494Z max=0.0204
2026-02-21T08:17:28.6061796Z best={'block_sizes': [1, 4096],
2026-02-21T08:17:28.6062087Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:17:28.6062435Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:17:28.6062698Z  'num_stages': 5,
2026-02-21T08:17:28.6065747Z  'num_warps': 1,
2026-02-21T08:17:28.6069166Z  'pid_type': 'flat',
2026-02-21T08:17:28.6074804Z  'range_flattens': [None, False],
2026-02-21T08:17:28.6076913Z  'range_multi_buffers': [None, None],
2026-02-21T08:17:28.6077181Z  'range_num_stages': [0, 1],
2026-02-21T08:17:28.6077391Z  'range_unroll_factors': [0, 1],
2026-02-21T08:17:28.6077618Z  'range_warp_specializes': [None, False]}
2026-02-21T08:17:28.6077962Z [146s] Fitting surrogate: 639 points, 639 targets
2026-02-21T08:17:29.1957036Z [147s] Generation 11 starting: 19 neighbors, 2 active search path(s)
2026-02-21T08:17:32.5379400Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 2.0 configs/s
2026-02-21T08:17:33.7126781Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 16.8 configs/s
2026-02-21T08:17:34.8153116Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 924.4 configs/s
2026-02-21T08:17:34.9376779Z [153s] Generation 11 complete: 
2026-02-21T08:17:34.9380949Z ok=21
2026-02-21T08:17:34.9384270Z min=0.0123
2026-02-21T08:17:34.9388851Z mid=0.0124
2026-02-21T08:17:34.9392779Z max=0.0204
2026-02-21T08:17:34.9398117Z best={'block_sizes': [1, 4096],
2026-02-21T08:17:34.9401931Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:17:34.9402342Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:17:34.9406189Z  'num_stages': 5,
2026-02-21T08:17:34.9409865Z  'num_warps': 1,
2026-02-21T08:17:34.9413560Z  'pid_type': 'flat',
2026-02-21T08:17:34.9417697Z  'range_flattens': [None, False],
2026-02-21T08:17:34.9421531Z  'range_multi_buffers': [None, None],
2026-02-21T08:17:34.9423127Z  'range_num_stages': [0, 1],
2026-02-21T08:17:34.9423373Z  'range_unroll_factors': [0, 2],
2026-02-21T08:17:34.9423590Z  'range_warp_specializes': [None, False]}
2026-02-21T08:17:34.9424467Z [153s] Fitting surrogate: 660 points, 660 targets
2026-02-21T08:17:35.5002414Z [153s] Generation 12 starting: 17 neighbors, 2 active search path(s)
2026-02-21T08:17:37.5054589Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 6.3 configs/s
2026-02-21T08:17:38.5692148Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 16.7 configs/s
2026-02-21T08:17:39.6249799Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 964.4 configs/s
2026-02-21T08:17:39.7421194Z [157s] Generation 12 complete: 
2026-02-21T08:17:39.7421476Z ok=19
2026-02-21T08:17:39.7422133Z min=0.0123
2026-02-21T08:17:39.7422288Z mid=0.0124
2026-02-21T08:17:39.7427022Z max=0.0164
2026-02-21T08:17:39.7431806Z best={'block_sizes': [1, 4096],
2026-02-21T08:17:39.7433497Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:17:39.7433890Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:17:39.7434220Z  'num_stages': 5,
2026-02-21T08:17:39.7434419Z  'num_warps': 1,
2026-02-21T08:17:39.7434610Z  'pid_type': 'flat',
2026-02-21T08:17:39.7434838Z  'range_flattens': [None, False],
2026-02-21T08:17:39.7435086Z  'range_multi_buffers': [None, False],
2026-02-21T08:17:39.7435326Z  'range_num_stages': [0, 1],
2026-02-21T08:17:39.7435568Z  'range_unroll_factors': [0, 2],
2026-02-21T08:17:39.7435816Z  'range_warp_specializes': [None, False]}
2026-02-21T08:17:39.7445030Z [157s] Fitting surrogate: 679 points, 679 targets
2026-02-21T08:17:40.1865023Z [158s] Generation 13 starting: 11 neighbors, 1 active search path(s)
2026-02-21T08:17:43.2242296Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 1.8 configs/s
2026-02-21T08:17:43.9083569Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 17.2 configs/s
2026-02-21T08:17:44.5091156Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1682.5 configs/s
2026-02-21T08:17:44.5786172Z [162s] Generation 13 complete: 
2026-02-21T08:17:44.5790566Z ok=12
2026-02-21T08:17:44.5795175Z min=0.0123
2026-02-21T08:17:44.5799588Z mid=0.0123
2026-02-21T08:17:44.5804282Z max=0.0204
2026-02-21T08:17:44.5808913Z best={'block_sizes': [1, 4096],
2026-02-21T08:17:44.5813828Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:17:44.5814249Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:17:44.5814523Z  'num_stages': 5,
2026-02-21T08:17:44.5818930Z  'num_warps': 1,
2026-02-21T08:17:44.5823649Z  'pid_type': 'flat',
2026-02-21T08:17:44.5825087Z  'range_flattens': [None, False],
2026-02-21T08:17:44.5825347Z  'range_multi_buffers': [None, False],
2026-02-21T08:17:44.5825565Z  'range_num_stages': [0, 1],
2026-02-21T08:17:44.5825765Z  'range_unroll_factors': [0, 2],
2026-02-21T08:17:44.5825966Z  'range_warp_specializes': [None, False]}
2026-02-21T08:17:44.5826336Z [162s] Fitting surrogate: 691 points, 691 targets
2026-02-21T08:17:44.8930976Z [162s] Autotuning complete in 163.0s after searching 660 configs.
2026-02-21T08:17:44.8931379Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:17:44.8932758Z     @helion.kernel(config=helion.Config(block_sizes=[1, 4096], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[None, False]), static_shapes=True)
2026-02-21T08:17:44.8933720Z 
2026-02-21T08:17:44.8934004Z [162s] Code of selected kernel: /tmp/torchinductor_root/ds/cdsvhtpe65d2cr6qxifitp7ecvoikhbcqzhfyd7orochwnmaoojc.py
2026-02-21T08:17:45.7236453Z WARNING:tritonbench.utils.triton_op:Completed input ID 15:
2026-02-21T08:17:45.7240600Z (M, N)
2026-02-21T08:17:45.7242680Z ------------
2026-02-21T08:17:45.7242920Z (4096, 2176)
2026-02-21T08:17:45.7243556Z 
2026-02-21T08:17:45.7244244Z  20%|██        | 4/20 [08:43<36:46, 137.89s/it]WARNING:tritonbench.utils.triton_op:Running input ID 20:
2026-02-21T08:17:45.7249184Z (M, N)
2026-02-21T08:17:45.7250582Z ------------
2026-02-21T08:17:45.7250793Z (4096, 2816)
2026-02-21T08:17:45.7251169Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T08:17:47.1232461Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T08:17:48.6522774Z INFO:tritonbench.utils.triton_op:Took 2.57ms to get benchmark function for torch_compile_softmax
2026-02-21T08:17:52.2940321Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:17:52.2943921Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:17:52.2948598Z               'dtype': 'torch.float16',
2026-02-21T08:17:52.2953902Z               'shape': (4096, 2816),
2026-02-21T08:17:52.2958388Z               'stride': (2816, 1)},),
2026-02-21T08:17:52.2961123Z   'kwargs': {}}
2026-02-21T08:17:52.2966194Z INFO:tritonbench.utils.triton_op:Took 2.78ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T08:17:52.5005468Z [0s] Autotune random seed: 2134816249
2026-02-21T08:17:52.6670658Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:18:26.9070732Z [34s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None])
2026-02-21T08:18:27.1778009Z [34s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False])
2026-02-21T08:18:27.1799086Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s
2026-02-21T08:18:27.3766059Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:18:27.3766630Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:18:27.3767158Z     %cst = arith.constant dense<0.000000e+00> : tensor<8x512xf16>
2026-02-21T08:18:27.3767412Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:18:27.3767624Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:18:27.3767832Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:18:27.3768069Z     %cst_0 = arith.constant dense<2816> : tensor<8x1xi32>
2026-02-21T08:18:27.3768386Z     %cst_1 = arith.constant dense<0.000000e+00> : tensor<8x512xf32>
2026-02-21T08:18:27.3769222Z     %cst_2 = arith.constant dense<0xFC00> : tensor<8x512xf16>
2026-02-21T08:18:27.3769496Z     %cst_3 = arith.constant dense<2816> : tensor<512xi32>
2026-02-21T08:18:27.3769756Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32>
2026-02-21T08:18:27.3770032Z     %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32>
2026-02-21T08:18:27.3770264Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:18:27.3770477Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:18:27.3770677Z     %c2816_i32 = arith.constant 2816 : i32
2026-02-21T08:18:27.3770869Z     %c2816_i64 = arith.constant 2816 : i64
2026-02-21T08:18:27.3771066Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:18:27.3771401Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c2816_i32], [%c2816_i64, %c1_i64] : <f16>, <tensor<8x512xf16>>
2026-02-21T08:18:27.3772027Z     %1 = tt.get_program_id x : i32
2026-02-21T08:18:27.3772411Z     scf.for %arg2 = %1 to %c512_i32 step %c9472_i32  : i32 {
2026-02-21T08:18:27.3772661Z       %2 = arith.muli %arg2, %c8_i32 : i32
2026-02-21T08:18:27.3773058Z       %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:18:27.3773329Z       %4 = tt.splat %2 : i32 -> tensor<8xi32>
2026-02-21T08:18:27.3773527Z       %5 = arith.addi %4, %3 : tensor<8xi32>
2026-02-21T08:18:27.3773714Z       %c2048_i32 = arith.constant 2048 : i32
2026-02-21T08:18:27.3773912Z       %c2048_i32_6 = arith.constant 2048 : i32
2026-02-21T08:18:27.3774195Z       %6 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:18:27.3774795Z       %7 = tt.splat %c0_i32 : i32 -> tensor<512xi32>
2026-02-21T08:18:27.3775039Z       %8 = arith.addi %7, %6 : tensor<512xi32>
2026-02-21T08:18:27.3775313Z       %9 = arith.cmpi slt, %8, %cst_3 : tensor<512xi32>
2026-02-21T08:18:27.3775644Z       %10 = tt.descriptor_load %0[%2, %c0_i32] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:18:27.3776035Z       %11 = tt.expand_dims %9 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:18:27.3776350Z       %12 = tt.broadcast %11 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:18:27.3776649Z       %13 = arith.select %12, %10, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T08:18:27.3776950Z       %14 = arith.extf %13 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:18:27.3777196Z       %15 = "tt.reduce"(%14) <{axis = 1 : i32}> ({
2026-02-21T08:18:27.3777411Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:18:27.3777609Z         %231 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:18:27.3777822Z         tt.reduce.return %231 : f32
2026-02-21T08:18:27.3778014Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:18:27.3778254Z       %16 = arith.truncf %15 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:18:27.3778516Z       %17 = arith.extf %16 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:18:27.3778762Z       %18 = arith.cmpf ogt, %cst_5, %17 : tensor<8xf32>
2026-02-21T08:18:27.3779013Z       %19 = arith.cmpf une, %cst_5, %cst_5 : tensor<8xf32>
2026-02-21T08:18:27.3779240Z       %20 = arith.ori %18, %19 : tensor<8xi1>
2026-02-21T08:18:27.3779488Z       %21 = arith.select %20, %cst_5, %17 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:18:27.3779739Z       %22 = arith.subf %cst_5, %21 : tensor<8xf32>
2026-02-21T08:18:27.3780140Z       %23 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:18:27.3780532Z       %24 = arith.mulf %cst_4, %23 : tensor<8xf32>
2026-02-21T08:18:27.3780794Z       %25 = tt.expand_dims %21 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:18:27.3781114Z       %26 = arith.extf %10 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:18:27.3781380Z       %27 = tt.broadcast %25 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:18:27.3781667Z       %28 = arith.subf %26, %27 : tensor<8x512xf32>
2026-02-21T08:18:27.3782054Z       %29 = tt.extern_elementwise %28 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:18:27.3782581Z       %30 = arith.select %12, %29, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T08:18:27.3782856Z       %31 = "tt.reduce"(%30) <{axis = 1 : i32}> ({
2026-02-21T08:18:27.3783054Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:18:27.3783249Z         %231 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:18:27.3783445Z         tt.reduce.return %231 : f32
2026-02-21T08:18:27.3783646Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:18:27.3783848Z       %32 = arith.addf %24, %31 : tensor<8xf32>
2026-02-21T08:18:27.3784059Z       %c1_i32 = arith.constant 1 : i32
2026-02-21T08:18:27.3784262Z       %33 = arith.muli %c512_i32, %c1_i32 : i32
2026-02-21T08:18:27.3784460Z       %34 = arith.addi %c0_i32, %33 : i32
2026-02-21T08:18:27.3784713Z       %35 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:18:27.3785046Z       %36 = tt.splat %34 : i32 -> tensor<512xi32>
2026-02-21T08:18:27.3785269Z       %37 = arith.addi %36, %35 : tensor<512xi32>
2026-02-21T08:18:27.3785495Z       %38 = arith.cmpi slt, %37, %cst_3 : tensor<512xi32>
2026-02-21T08:18:27.3785869Z       %39 = tt.descriptor_load %0[%2, %34] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:18:27.3786238Z       %40 = tt.expand_dims %38 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:18:27.3786543Z       %41 = tt.broadcast %40 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:18:27.3786839Z       %42 = arith.select %41, %39, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T08:18:27.3787136Z       %43 = arith.extf %42 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:18:27.3787390Z       %44 = "tt.reduce"(%43) <{axis = 1 : i32}> ({
2026-02-21T08:18:27.3787597Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:18:27.3787806Z         %231 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:18:27.3788022Z         tt.reduce.return %231 : f32
2026-02-21T08:18:27.3788224Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:18:27.3788469Z       %45 = arith.truncf %44 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:18:27.3788721Z       %46 = arith.extf %45 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:18:27.3788968Z       %47 = arith.cmpf ogt, %21, %46 : tensor<8xf32>
2026-02-21T08:18:27.3789191Z       %48 = arith.cmpf une, %21, %21 : tensor<8xf32>
2026-02-21T08:18:27.3789409Z       %49 = arith.ori %47, %48 : tensor<8xi1>
2026-02-21T08:18:27.3789649Z       %50 = arith.select %49, %21, %46 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:18:27.3789889Z       %51 = arith.subf %21, %50 : tensor<8xf32>
2026-02-21T08:18:27.3790262Z       %52 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:18:27.3790634Z       %53 = arith.mulf %32, %52 : tensor<8xf32>
2026-02-21T08:18:27.3790895Z       %54 = tt.expand_dims %50 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:18:27.3791196Z       %55 = arith.extf %39 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:18:27.3791469Z       %56 = tt.broadcast %54 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:18:27.3791779Z       %57 = arith.subf %55, %56 : tensor<8x512xf32>
2026-02-21T08:18:27.3792161Z       %58 = tt.extern_elementwise %57 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:18:27.3792606Z       %59 = arith.select %41, %58, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T08:18:27.3792867Z       %60 = "tt.reduce"(%59) <{axis = 1 : i32}> ({
2026-02-21T08:18:27.3793072Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:18:27.3793259Z         %231 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:18:27.3793464Z         tt.reduce.return %231 : f32
2026-02-21T08:18:27.3793662Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:18:27.3793865Z       %61 = arith.addf %53, %60 : tensor<8xf32>
2026-02-21T08:18:27.3794075Z       %c2_i32 = arith.constant 2 : i32
2026-02-21T08:18:27.3794337Z       %62 = arith.muli %c512_i32, %c2_i32 : i32
2026-02-21T08:18:27.3794538Z       %63 = arith.addi %c0_i32, %62 : i32
2026-02-21T08:18:27.3794779Z       %64 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:18:27.3795049Z       %65 = tt.splat %63 : i32 -> tensor<512xi32>
2026-02-21T08:18:27.3795265Z       %66 = arith.addi %65, %64 : tensor<512xi32>
2026-02-21T08:18:27.3795492Z       %67 = arith.cmpi slt, %66, %cst_3 : tensor<512xi32>
2026-02-21T08:18:27.3795829Z       %68 = tt.descriptor_load %0[%2, %63] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:18:27.3796239Z       %69 = tt.expand_dims %67 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:18:27.3796543Z       %70 = tt.broadcast %69 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:18:27.3796837Z       %71 = arith.select %70, %68, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T08:18:27.3797183Z       %72 = arith.extf %71 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:18:27.3797423Z       %73 = "tt.reduce"(%72) <{axis = 1 : i32}> ({
2026-02-21T08:18:27.3797629Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:18:27.3797817Z         %231 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:18:27.3798021Z         tt.reduce.return %231 : f32
2026-02-21T08:18:27.3798210Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:18:27.3798445Z       %74 = arith.truncf %73 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:18:27.3798702Z       %75 = arith.extf %74 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:18:27.3798939Z       %76 = arith.cmpf ogt, %50, %75 : tensor<8xf32>
2026-02-21T08:18:27.3799169Z       %77 = arith.cmpf une, %50, %50 : tensor<8xf32>
2026-02-21T08:18:27.3799380Z       %78 = arith.ori %76, %77 : tensor<8xi1>
2026-02-21T08:18:27.3799624Z       %79 = arith.select %78, %50, %75 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:18:27.3799861Z       %80 = arith.subf %50, %79 : tensor<8xf32>
2026-02-21T08:18:27.3800237Z       %81 = tt.extern_elementwise %80 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:18:27.3800612Z       %82 = arith.mulf %61, %81 : tensor<8xf32>
2026-02-21T08:18:27.3800865Z       %83 = tt.expand_dims %79 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:18:27.3801164Z       %84 = arith.extf %68 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:18:27.3801427Z       %85 = tt.broadcast %83 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:18:27.3801723Z       %86 = arith.subf %84, %85 : tensor<8x512xf32>
2026-02-21T08:18:27.3802107Z       %87 = tt.extern_elementwise %86 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:18:27.3802547Z       %88 = arith.select %70, %87, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T08:18:27.3802817Z       %89 = "tt.reduce"(%88) <{axis = 1 : i32}> ({
2026-02-21T08:18:27.3803063Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:18:27.3803262Z         %231 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:18:27.3803458Z         tt.reduce.return %231 : f32
2026-02-21T08:18:27.3803657Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:18:27.3803861Z       %90 = arith.addf %82, %89 : tensor<8xf32>
2026-02-21T08:18:27.3804066Z       %c3_i32 = arith.constant 3 : i32
2026-02-21T08:18:27.3804268Z       %91 = arith.muli %c512_i32, %c3_i32 : i32
2026-02-21T08:18:27.3804469Z       %92 = arith.addi %c0_i32, %91 : i32
2026-02-21T08:18:27.3804720Z       %93 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:18:27.3804987Z       %94 = tt.splat %92 : i32 -> tensor<512xi32>
2026-02-21T08:18:27.3805203Z       %95 = arith.addi %94, %93 : tensor<512xi32>
2026-02-21T08:18:27.3805431Z       %96 = arith.cmpi slt, %95, %cst_3 : tensor<512xi32>
2026-02-21T08:18:27.3805752Z       %97 = tt.descriptor_load %0[%2, %92] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:18:27.3806180Z       %98 = tt.expand_dims %96 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:18:27.3806482Z       %99 = tt.broadcast %98 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:18:27.3806775Z       %100 = arith.select %99, %97, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T08:18:27.3807071Z       %101 = arith.extf %100 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:18:27.3807323Z       %102 = "tt.reduce"(%101) <{axis = 1 : i32}> ({
2026-02-21T08:18:27.3807524Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:18:27.3807716Z         %231 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:18:27.3807920Z         tt.reduce.return %231 : f32
2026-02-21T08:18:27.3808107Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:18:27.3808347Z       %103 = arith.truncf %102 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:18:27.3808607Z       %104 = arith.extf %103 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:18:27.3808911Z       %105 = arith.cmpf ogt, %79, %104 : tensor<8xf32>
2026-02-21T08:18:27.3809141Z       %106 = arith.cmpf une, %79, %79 : tensor<8xf32>
2026-02-21T08:18:27.3809365Z       %107 = arith.ori %105, %106 : tensor<8xi1>
2026-02-21T08:18:27.3809612Z       %108 = arith.select %107, %79, %104 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:18:27.3809866Z       %109 = arith.subf %79, %108 : tensor<8xf32>
2026-02-21T08:18:27.3810250Z       %110 = tt.extern_elementwise %109 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:18:27.3810624Z       %111 = arith.mulf %90, %110 : tensor<8xf32>
2026-02-21T08:18:27.3810893Z       %112 = tt.expand_dims %108 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:18:27.3811201Z       %113 = arith.extf %97 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:18:27.3811483Z       %114 = tt.broadcast %112 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:18:27.3811776Z       %115 = arith.subf %113, %114 : tensor<8x512xf32>
2026-02-21T08:18:27.3812172Z       %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:18:27.3812621Z       %117 = arith.select %99, %116, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T08:18:27.3812899Z       %118 = "tt.reduce"(%117) <{axis = 1 : i32}> ({
2026-02-21T08:18:27.3813114Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:18:27.3813303Z         %231 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:18:27.3813504Z         tt.reduce.return %231 : f32
2026-02-21T08:18:27.3813699Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:18:27.3813904Z       %119 = arith.addf %111, %118 : tensor<8xf32>
2026-02-21T08:18:27.3814313Z       %120:2 = scf.for %arg3 = %c2048_i32 to %c2816_i32 step %c512_i32 iter_args(%arg4 = %108, %arg5 = %119) -> (tensor<8xf32>, tensor<8xf32>)  : i32 {
2026-02-21T08:18:27.3814762Z         %231 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:18:27.3815053Z         %232 = tt.splat %arg3 : i32 -> tensor<512xi32>
2026-02-21T08:18:27.3815261Z         %233 = arith.addi %232, %231 : tensor<512xi32>
2026-02-21T08:18:27.3815484Z         %234 = arith.cmpi slt, %233, %cst_3 : tensor<512xi32>
2026-02-21T08:18:27.3815809Z         %235 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:18:27.3816173Z         %236 = tt.expand_dims %234 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:18:27.3816487Z         %237 = tt.broadcast %236 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:18:27.3816777Z         %238 = arith.select %237, %235, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T08:18:27.3817082Z         %239 = arith.extf %238 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:18:27.3817333Z         %240 = "tt.reduce"(%239) <{axis = 1 : i32}> ({
2026-02-21T08:18:27.3817532Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:18:27.3817734Z           %258 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:18:27.3818028Z           tt.reduce.return %258 : f32
2026-02-21T08:18:27.3818235Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:18:27.3818471Z         %241 = arith.truncf %240 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:18:27.3818737Z         %242 = arith.extf %241 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:18:27.3818985Z         %243 = arith.cmpf ogt, %arg4, %242 : tensor<8xf32>
2026-02-21T08:18:27.3819235Z         %244 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32>
2026-02-21T08:18:27.3819472Z         %245 = arith.ori %243, %244 : tensor<8xi1>
2026-02-21T08:18:27.3819726Z         %246 = arith.select %245, %arg4, %242 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:18:27.3819995Z         %247 = arith.subf %arg4, %246 : tensor<8xf32>
2026-02-21T08:18:27.3820382Z         %248 = tt.extern_elementwise %247 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:18:27.3820828Z         %249 = arith.mulf %arg5, %248 : tensor<8xf32>
2026-02-21T08:18:27.3821105Z         %250 = tt.expand_dims %246 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:18:27.3821408Z         %251 = arith.extf %235 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:18:27.3821717Z         %252 = tt.broadcast %250 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:18:27.3821969Z         %253 = arith.subf %251, %252 : tensor<8x512xf32>
2026-02-21T08:18:27.3822363Z         %254 = tt.extern_elementwise %253 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:18:27.3822798Z         %255 = arith.select %237, %254, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T08:18:27.3823071Z         %256 = "tt.reduce"(%255) <{axis = 1 : i32}> ({
2026-02-21T08:18:27.3823276Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:18:27.3823462Z           %258 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:18:27.3823664Z           tt.reduce.return %258 : f32
2026-02-21T08:18:27.3823859Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:18:27.3824086Z         %257 = arith.addf %249, %256 : tensor<8xf32>
2026-02-21T08:18:27.3824298Z         scf.yield %246, %257 : tensor<8xf32>, tensor<8xf32>
2026-02-21T08:18:27.3824547Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:18:27.3824775Z       %c2048_i32_7 = arith.constant 2048 : i32
2026-02-21T08:18:27.3824965Z       %c2048_i32_8 = arith.constant 2048 : i32
2026-02-21T08:18:27.3825203Z       %121 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:18:27.3825457Z       %122 = tt.splat %c0_i32 : i32 -> tensor<512xi32>
2026-02-21T08:18:27.3825674Z       %123 = arith.addi %122, %121 : tensor<512xi32>
2026-02-21T08:18:27.3825889Z       %124 = arith.cmpi slt, %123, %cst_3 : tensor<512xi32>
2026-02-21T08:18:27.3826189Z       %125 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:18:27.3826459Z       %126 = arith.muli %125, %cst_0 : tensor<8x1xi32>
2026-02-21T08:18:27.3826718Z       %127 = tt.expand_dims %123 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:18:27.3827019Z       %128 = tt.broadcast %126 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:18:27.3827278Z       %129 = tt.broadcast %127 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:18:27.3827517Z       %130 = arith.addi %128, %129 : tensor<8x512xi32>
2026-02-21T08:18:27.3827754Z       %131 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3828039Z       %132 = tt.addptr %131, %130 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:18:27.3828349Z       %133 = tt.expand_dims %124 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:18:27.3828638Z       %134 = tt.broadcast %133 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:18:27.3828944Z       %135 = tt.load %132, %134, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3829282Z       %136 = tt.expand_dims %120#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:18:27.3829636Z       %137 = arith.extf %135 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:18:27.3829896Z       %138 = tt.broadcast %136 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:18:27.3830126Z       %139 = arith.subf %137, %138 : tensor<8x512xf32>
2026-02-21T08:18:27.3830497Z       %140 = tt.extern_elementwise %139 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:18:27.3830910Z       %141 = tt.expand_dims %120#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:18:27.3831200Z       %142 = tt.broadcast %141 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:18:27.3831429Z       %143 = arith.divf %140, %142 : tensor<8x512xf32>
2026-02-21T08:18:27.3831713Z       %144 = arith.truncf %143 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:18:27.3831992Z       %145 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3832325Z       %146 = tt.addptr %145, %130 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:18:27.3832602Z       tt.store %146, %144, %134 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3832814Z       %c1_i32_9 = arith.constant 1 : i32
2026-02-21T08:18:27.3833011Z       %147 = arith.muli %c512_i32, %c1_i32_9 : i32
2026-02-21T08:18:27.3833201Z       %148 = arith.addi %c0_i32, %147 : i32
2026-02-21T08:18:27.3833437Z       %149 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:18:27.3833694Z       %150 = tt.splat %148 : i32 -> tensor<512xi32>
2026-02-21T08:18:27.3833899Z       %151 = arith.addi %150, %149 : tensor<512xi32>
2026-02-21T08:18:27.3834120Z       %152 = arith.cmpi slt, %151, %cst_3 : tensor<512xi32>
2026-02-21T08:18:27.3834379Z       %153 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:18:27.3834639Z       %154 = arith.muli %153, %cst_0 : tensor<8x1xi32>
2026-02-21T08:18:27.3834899Z       %155 = tt.expand_dims %151 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:18:27.3835202Z       %156 = tt.broadcast %154 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:18:27.3835469Z       %157 = tt.broadcast %155 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:18:27.3835702Z       %158 = arith.addi %156, %157 : tensor<8x512xi32>
2026-02-21T08:18:27.3835942Z       %159 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3836217Z       %160 = tt.addptr %159, %158 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:18:27.3836525Z       %161 = tt.expand_dims %152 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:18:27.3836818Z       %162 = tt.broadcast %161 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:18:27.3837122Z       %163 = tt.load %160, %162, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3837461Z       %164 = tt.expand_dims %120#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:18:27.3837747Z       %165 = arith.extf %163 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:18:27.3838037Z       %166 = tt.broadcast %164 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:18:27.3838265Z       %167 = arith.subf %165, %166 : tensor<8x512xf32>
2026-02-21T08:18:27.3838652Z       %168 = tt.extern_elementwise %167 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:18:27.3839072Z       %169 = tt.expand_dims %120#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:18:27.3839350Z       %170 = tt.broadcast %169 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:18:27.3839590Z       %171 = arith.divf %168, %170 : tensor<8x512xf32>
2026-02-21T08:18:27.3839831Z       %172 = arith.truncf %171 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:18:27.3840112Z       %173 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3840408Z       %174 = tt.addptr %173, %158 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:18:27.3840748Z       tt.store %174, %172, %162 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3840965Z       %c2_i32_10 = arith.constant 2 : i32
2026-02-21T08:18:27.3841159Z       %175 = arith.muli %c512_i32, %c2_i32_10 : i32
2026-02-21T08:18:27.3841446Z       %176 = arith.addi %c0_i32, %175 : i32
2026-02-21T08:18:27.3841712Z       %177 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:18:27.3841969Z       %178 = tt.splat %176 : i32 -> tensor<512xi32>
2026-02-21T08:18:27.3842179Z       %179 = arith.addi %178, %177 : tensor<512xi32>
2026-02-21T08:18:27.3842397Z       %180 = arith.cmpi slt, %179, %cst_3 : tensor<512xi32>
2026-02-21T08:18:27.3842664Z       %181 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:18:27.3842922Z       %182 = arith.muli %181, %cst_0 : tensor<8x1xi32>
2026-02-21T08:18:27.3843252Z       %183 = tt.expand_dims %179 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:18:27.3843542Z       %184 = tt.broadcast %182 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:18:27.3843809Z       %185 = tt.broadcast %183 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:18:27.3844049Z       %186 = arith.addi %184, %185 : tensor<8x512xi32>
2026-02-21T08:18:27.3844285Z       %187 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3844567Z       %188 = tt.addptr %187, %186 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:18:27.3844862Z       %189 = tt.expand_dims %180 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:18:27.3845153Z       %190 = tt.broadcast %189 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:18:27.3845448Z       %191 = tt.load %188, %190, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3845776Z       %192 = tt.expand_dims %120#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:18:27.3846063Z       %193 = arith.extf %191 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:18:27.3846320Z       %194 = tt.broadcast %192 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:18:27.3846556Z       %195 = arith.subf %193, %194 : tensor<8x512xf32>
2026-02-21T08:18:27.3846922Z       %196 = tt.extern_elementwise %195 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:18:27.3847335Z       %197 = tt.expand_dims %120#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:18:27.3847619Z       %198 = tt.broadcast %197 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:18:27.3847847Z       %199 = arith.divf %196, %198 : tensor<8x512xf32>
2026-02-21T08:18:27.3848085Z       %200 = arith.truncf %199 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:18:27.3848347Z       %201 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3848625Z       %202 = tt.addptr %201, %186 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:18:27.3848896Z       tt.store %202, %200, %190 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3849106Z       %c3_i32_11 = arith.constant 3 : i32
2026-02-21T08:18:27.3849305Z       %203 = arith.muli %c512_i32, %c3_i32_11 : i32
2026-02-21T08:18:27.3849497Z       %204 = arith.addi %c0_i32, %203 : i32
2026-02-21T08:18:27.3849731Z       %205 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:18:27.3849973Z       %206 = tt.splat %204 : i32 -> tensor<512xi32>
2026-02-21T08:18:27.3850183Z       %207 = arith.addi %206, %205 : tensor<512xi32>
2026-02-21T08:18:27.3850397Z       %208 = arith.cmpi slt, %207, %cst_3 : tensor<512xi32>
2026-02-21T08:18:27.3850666Z       %209 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:18:27.3850946Z       %210 = arith.muli %209, %cst_0 : tensor<8x1xi32>
2026-02-21T08:18:27.3851204Z       %211 = tt.expand_dims %207 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:18:27.3851501Z       %212 = tt.broadcast %210 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:18:27.3851845Z       %213 = tt.broadcast %211 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:18:27.3852085Z       %214 = arith.addi %212, %213 : tensor<8x512xi32>
2026-02-21T08:18:27.3852327Z       %215 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3852604Z       %216 = tt.addptr %215, %214 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:18:27.3852910Z       %217 = tt.expand_dims %208 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:18:27.3853199Z       %218 = tt.broadcast %217 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:18:27.3853499Z       %219 = tt.load %216, %218, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3853822Z       %220 = tt.expand_dims %120#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:18:27.3854111Z       %221 = arith.extf %219 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:18:27.3854440Z       %222 = tt.broadcast %220 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:18:27.3854674Z       %223 = arith.subf %221, %222 : tensor<8x512xf32>
2026-02-21T08:18:27.3855050Z       %224 = tt.extern_elementwise %223 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:18:27.3855461Z       %225 = tt.expand_dims %120#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:18:27.3855749Z       %226 = tt.broadcast %225 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:18:27.3855985Z       %227 = arith.divf %224, %226 : tensor<8x512xf32>
2026-02-21T08:18:27.3856217Z       %228 = arith.truncf %227 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:18:27.3856487Z       %229 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3856764Z       %230 = tt.addptr %229, %214 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:18:27.3857040Z       tt.store %230, %228, %218 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3857293Z       scf.for %arg3 = %c2048_i32_7 to %c2816_i32 step %c512_i32  : i32 {
2026-02-21T08:18:27.3857579Z         %231 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:18:27.3857838Z         %232 = tt.splat %arg3 : i32 -> tensor<512xi32>
2026-02-21T08:18:27.3858045Z         %233 = arith.addi %232, %231 : tensor<512xi32>
2026-02-21T08:18:27.3858268Z         %234 = arith.cmpi slt, %233, %cst_3 : tensor<512xi32>
2026-02-21T08:18:27.3858529Z         %235 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:18:27.3858797Z         %236 = arith.muli %235, %cst_0 : tensor<8x1xi32>
2026-02-21T08:18:27.3859060Z         %237 = tt.expand_dims %233 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:18:27.3859357Z         %238 = tt.broadcast %236 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:18:27.3859663Z         %239 = tt.broadcast %237 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:18:27.3859907Z         %240 = arith.addi %238, %239 : tensor<8x512xi32>
2026-02-21T08:18:27.3860155Z         %241 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3860458Z         %242 = tt.addptr %241, %240 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:18:27.3860786Z         %243 = tt.expand_dims %234 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:18:27.3861094Z         %244 = tt.broadcast %243 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:18:27.3861408Z         %245 = tt.load %242, %244, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3861779Z         %246 = tt.expand_dims %120#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:18:27.3862078Z         %247 = arith.extf %245 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:18:27.3862352Z         %248 = tt.broadcast %246 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:18:27.3862600Z         %249 = arith.subf %247, %248 : tensor<8x512xf32>
2026-02-21T08:18:27.3863101Z         %250 = tt.extern_elementwise %249 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:18:27.3863565Z         %251 = tt.expand_dims %120#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:18:27.3863862Z         %252 = tt.broadcast %251 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:18:27.3864116Z         %253 = arith.divf %250, %252 : tensor<8x512xf32>
2026-02-21T08:18:27.3864382Z         %254 = arith.truncf %253 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:18:27.3864674Z         %255 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3864980Z         %256 = tt.addptr %255, %240 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:18:27.3865264Z         tt.store %256, %254, %244 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:18:27.3865529Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:18:27.3865897Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:18:27.3866171Z     tt.return
2026-02-21T08:18:27.3866307Z   }
2026-02-21T08:18:27.3866443Z }
2026-02-21T08:18:27.3866520Z 
2026-02-21T08:18:27.3866582Z {-#
2026-02-21T08:18:27.3866717Z   external_resources: {
2026-02-21T08:18:27.3866896Z     mlir_reproducer: {
2026-02-21T08:18:27.3871261Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:18:27.3875741Z       disable_threading: false,
2026-02-21T08:18:27.3875913Z       verify_each: true
2026-02-21T08:18:27.3876056Z     }
2026-02-21T08:18:27.3876179Z   }
2026-02-21T08:18:27.3876290Z #-}
2026-02-21T08:18:27.3876715Z /tmp/torchinductor_root/bw/cbwqrdl2s7tkwz2cqoupn5wyjcm3wl4ekwbt6r3xubd4zarcf4r3.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:18:27.3877919Z /tmp/torchinductor_root/bw/cbwqrdl2s7tkwz2cqoupn5wyjcm3wl4ekwbt6r3xubd4zarcf4r3.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:18:27.3878894Z [34s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:18:27.3880060Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 512], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'last'], maxnreg=32, num_sm_multiplier=64, num_stages=7, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[4, 4], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:18:27.3881055Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:18:27.3881312Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:18:30.0433361Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:18:30.0435221Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:18:30.0435715Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:18:30.0436343Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:18:30.0436570Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:18:30.0436759Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:18:30.0436974Z     %cst = arith.constant dense<2816> : tensor<32x1xi32>
2026-02-21T08:18:30.0437246Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<32xf32>
2026-02-21T08:18:30.0437521Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<32xf32>
2026-02-21T08:18:30.0437748Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:18:30.0437950Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:18:30.0438145Z     %c2816_i32 = arith.constant 2816 : i32
2026-02-21T08:18:30.0438342Z     %c2816_i64 = arith.constant 2816 : i64
2026-02-21T08:18:30.0438529Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:18:30.0438860Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c2816_i32], [%c2816_i64, %c1_i64] : <f16>, <tensor<32x8xf16>>
2026-02-21T08:18:30.0439365Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c2816_i32], [%c2816_i64, %c1_i64] : <f16>, <tensor<32x8xf16>>
2026-02-21T08:18:30.0439675Z     %2 = tt.get_program_id x : i32
2026-02-21T08:18:30.0439859Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:18:30.0440037Z     %4 = arith.minsi %3, %c128_i32 : i32
2026-02-21T08:18:30.0440238Z     scf.for %arg2 = %2 to %4 step %c1_i32  : i32 {
2026-02-21T08:18:30.0440531Z       %5 = arith.muli %arg2, %c32_i32 : i32
2026-02-21T08:18:30.0445671Z       %6 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32>
2026-02-21T08:18:30.0447641Z       %7 = tt.splat %5 : i32 -> tensor<32xi32>
2026-02-21T08:18:30.0447893Z       %8 = arith.addi %7, %6 : tensor<32xi32>
2026-02-21T08:18:30.0448125Z       %c2808_i32 = arith.constant 2808 : i32
2026-02-21T08:18:30.0448335Z       %c24_i32 = arith.constant 24 : i32
2026-02-21T08:18:30.0448721Z       %9:2 = scf.for %arg3 = %c0_i32 to %c2808_i32 step %c24_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<32xf32>, tensor<32xf32>)  : i32 {
2026-02-21T08:18:30.0449139Z         %49 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:18:30.0449406Z         %50 = tt.splat %arg3 : i32 -> tensor<8xi32>
2026-02-21T08:18:30.0449610Z         %51 = arith.addi %50, %49 : tensor<8xi32>
2026-02-21T08:18:30.0449869Z         %52 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:18:30.0450132Z         %53 = arith.muli %52, %cst : tensor<32x1xi32>
2026-02-21T08:18:30.0450388Z         %54 = tt.expand_dims %51 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:18:30.0450675Z         %55 = tt.broadcast %53 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:18:30.0450928Z         %56 = tt.broadcast %54 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:18:30.0451165Z         %57 = arith.addi %55, %56 : tensor<32x8xi32>
2026-02-21T08:18:30.0451405Z         %58 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:18:30.0451794Z         %59 = tt.addptr %58, %57 : tensor<32x8x!tt.ptr<f16>>, tensor<32x8xi32>
2026-02-21T08:18:30.0452385Z         %60 = tt.load %59 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:18:30.0452716Z         %61 = arith.extf %60 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:18:30.0452945Z         %62 = "tt.reduce"(%61) <{axis = 1 : i32}> ({
2026-02-21T08:18:30.0453146Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:18:30.0453338Z           %140 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:18:30.0453536Z           tt.reduce.return %140 : f32
2026-02-21T08:18:30.0453725Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:18:30.0453953Z         %63 = arith.truncf %62 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:18:30.0454204Z         %64 = arith.extf %63 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:18:30.0454436Z         %65 = arith.cmpf ogt, %arg4, %64 : tensor<32xf32>
2026-02-21T08:18:30.0454669Z         %66 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32>
2026-02-21T08:18:30.0454960Z         %67 = arith.ori %65, %66 : tensor<32xi1>
2026-02-21T08:18:30.0455204Z         %68 = arith.select %67, %arg4, %64 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:18:30.0455445Z         %69 = arith.subf %arg4, %68 : tensor<32xf32>
2026-02-21T08:18:30.0455810Z         %70 = tt.extern_elementwise %69 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:18:30.0456174Z         %71 = arith.mulf %arg5, %70 : tensor<32xf32>
2026-02-21T08:18:30.0456423Z         %72 = tt.expand_dims %68 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:30.0456716Z         %73 = tt.broadcast %72 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:18:30.0456940Z         %74 = arith.subf %61, %73 : tensor<32x8xf32>
2026-02-21T08:18:30.0457294Z         %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:18:30.0457653Z         %76 = "tt.reduce"(%75) <{axis = 1 : i32}> ({
2026-02-21T08:18:30.0457848Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:18:30.0458036Z           %140 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:18:30.0458223Z           tt.reduce.return %140 : f32
2026-02-21T08:18:30.0458412Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:18:30.0458605Z         %77 = arith.addf %71, %76 : tensor<32xf32>
2026-02-21T08:18:30.0458806Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T08:18:30.0458992Z         %78 = arith.muli %c8_i32, %c1_i32_4 : i32
2026-02-21T08:18:30.0459191Z         %79 = arith.addi %arg3, %78 : i32
2026-02-21T08:18:30.0459418Z         %80 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:18:30.0459657Z         %81 = tt.splat %79 : i32 -> tensor<8xi32>
2026-02-21T08:18:30.0459859Z         %82 = arith.addi %81, %80 : tensor<8xi32>
2026-02-21T08:18:30.0460103Z         %83 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:18:30.0460375Z         %84 = arith.muli %83, %cst : tensor<32x1xi32>
2026-02-21T08:18:30.0460621Z         %85 = tt.expand_dims %82 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:18:30.0460906Z         %86 = tt.broadcast %84 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:18:30.0461163Z         %87 = tt.broadcast %85 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:18:30.0461384Z         %88 = arith.addi %86, %87 : tensor<32x8xi32>
2026-02-21T08:18:30.0461665Z         %89 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:18:30.0461930Z         %90 = tt.addptr %89, %88 : tensor<32x8x!tt.ptr<f16>>, tensor<32x8xi32>
2026-02-21T08:18:30.0462227Z         %91 = tt.load %90 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:18:30.0462510Z         %92 = arith.extf %91 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:18:30.0462732Z         %93 = "tt.reduce"(%92) <{axis = 1 : i32}> ({
2026-02-21T08:18:30.0462924Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:18:30.0463109Z           %140 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:18:30.0463387Z           tt.reduce.return %140 : f32
2026-02-21T08:18:30.0463569Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:18:30.0463790Z         %94 = arith.truncf %93 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:18:30.0464027Z         %95 = arith.extf %94 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:18:30.0464259Z         %96 = arith.cmpf ogt, %68, %95 : tensor<32xf32>
2026-02-21T08:18:30.0464474Z         %97 = arith.cmpf une, %68, %68 : tensor<32xf32>
2026-02-21T08:18:30.0464672Z         %98 = arith.ori %96, %97 : tensor<32xi1>
2026-02-21T08:18:30.0464904Z         %99 = arith.select %98, %68, %95 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:18:30.0465136Z         %100 = arith.subf %68, %99 : tensor<32xf32>
2026-02-21T08:18:30.0465506Z         %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:18:30.0465937Z         %102 = arith.mulf %77, %101 : tensor<32xf32>
2026-02-21T08:18:30.0466190Z         %103 = tt.expand_dims %99 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:30.0466484Z         %104 = tt.broadcast %103 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:18:30.0466716Z         %105 = arith.subf %92, %104 : tensor<32x8xf32>
2026-02-21T08:18:30.0467085Z         %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:18:30.0467447Z         %107 = "tt.reduce"(%106) <{axis = 1 : i32}> ({
2026-02-21T08:18:30.0467644Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:18:30.0467829Z           %140 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:18:30.0468014Z           tt.reduce.return %140 : f32
2026-02-21T08:18:30.0468203Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:18:30.0468401Z         %108 = arith.addf %102, %107 : tensor<32xf32>
2026-02-21T08:18:30.0468606Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:18:30.0468794Z         %109 = arith.muli %c8_i32, %c2_i32 : i32
2026-02-21T08:18:30.0468989Z         %110 = arith.addi %arg3, %109 : i32
2026-02-21T08:18:30.0469213Z         %111 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:18:30.0469470Z         %112 = tt.splat %110 : i32 -> tensor<8xi32>
2026-02-21T08:18:30.0469677Z         %113 = arith.addi %112, %111 : tensor<8xi32>
2026-02-21T08:18:30.0469923Z         %114 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:18:30.0470194Z         %115 = arith.muli %114, %cst : tensor<32x1xi32>
2026-02-21T08:18:30.0470440Z         %116 = tt.expand_dims %113 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:18:30.0470731Z         %117 = tt.broadcast %115 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:18:30.0470997Z         %118 = tt.broadcast %116 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:18:30.0471231Z         %119 = arith.addi %117, %118 : tensor<32x8xi32>
2026-02-21T08:18:30.0471476Z         %120 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:18:30.0471789Z         %121 = tt.addptr %120, %119 : tensor<32x8x!tt.ptr<f16>>, tensor<32x8xi32>
2026-02-21T08:18:30.0472099Z         %122 = tt.load %121 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:18:30.0472388Z         %123 = arith.extf %122 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:18:30.0472626Z         %124 = "tt.reduce"(%123) <{axis = 1 : i32}> ({
2026-02-21T08:18:30.0472826Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:18:30.0473006Z           %140 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:18:30.0473201Z           tt.reduce.return %140 : f32
2026-02-21T08:18:30.0473381Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:18:30.0473607Z         %125 = arith.truncf %124 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:18:30.0473857Z         %126 = arith.extf %125 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:18:30.0474107Z         %127 = arith.cmpf ogt, %99, %126 : tensor<32xf32>
2026-02-21T08:18:30.0474406Z         %128 = arith.cmpf une, %99, %99 : tensor<32xf32>
2026-02-21T08:18:30.0474612Z         %129 = arith.ori %127, %128 : tensor<32xi1>
2026-02-21T08:18:30.0474851Z         %130 = arith.select %129, %99, %126 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:18:30.0475088Z         %131 = arith.subf %99, %130 : tensor<32xf32>
2026-02-21T08:18:30.0475457Z         %132 = tt.extern_elementwise %131 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:18:30.0475817Z         %133 = arith.mulf %108, %132 : tensor<32xf32>
2026-02-21T08:18:30.0476078Z         %134 = tt.expand_dims %130 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:30.0476389Z         %135 = tt.broadcast %134 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:18:30.0476663Z         %136 = arith.subf %123, %135 : tensor<32x8xf32>
2026-02-21T08:18:30.0477150Z         %137 = tt.extern_elementwise %136 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:18:30.0477570Z         %138 = "tt.reduce"(%137) <{axis = 1 : i32}> ({
2026-02-21T08:18:30.0477804Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:18:30.0478014Z           %140 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:18:30.0478246Z           tt.reduce.return %140 : f32
2026-02-21T08:18:30.0478480Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:18:30.0478695Z         %139 = arith.addf %133, %138 : tensor<32xf32>
2026-02-21T08:18:30.0478967Z         scf.yield %130, %139 : tensor<32xf32>, tensor<32xf32>
2026-02-21T08:18:30.0479230Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:18:30.0479551Z       %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:18:30.0479824Z       %11 = tt.splat %c2808_i32 : i32 -> tensor<8xi32>
2026-02-21T08:18:30.0480067Z       %12 = arith.addi %11, %10 : tensor<8xi32>
2026-02-21T08:18:30.0480356Z       %13 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:18:30.0480654Z       %14 = arith.muli %13, %cst : tensor<32x1xi32>
2026-02-21T08:18:30.0480951Z       %15 = tt.expand_dims %12 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:18:30.0481294Z       %16 = tt.broadcast %14 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:18:30.0482883Z       %17 = tt.broadcast %15 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:18:30.0483132Z       %18 = arith.addi %16, %17 : tensor<32x8xi32>
2026-02-21T08:18:30.0483383Z       %19 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:18:30.0483682Z       %20 = tt.addptr %19, %18 : tensor<32x8x!tt.ptr<f16>>, tensor<32x8xi32>
2026-02-21T08:18:30.0483999Z       %21 = tt.load %20 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:18:30.0484313Z       %22 = arith.extf %21 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:18:30.0484568Z       %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({
2026-02-21T08:18:30.0484780Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:18:30.0484972Z         %49 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:18:30.0485179Z         tt.reduce.return %49 : f32
2026-02-21T08:18:30.0485379Z       }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:18:30.0485605Z       %24 = arith.truncf %23 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:18:30.0485861Z       %25 = arith.extf %24 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:18:30.0486133Z       %26 = arith.cmpf ogt, %9#0, %25 : tensor<32xf32>
2026-02-21T08:18:30.0486350Z       %27 = arith.cmpf une, %9#0, %9#0 : tensor<32xf32>
2026-02-21T08:18:30.0486554Z       %28 = arith.ori %26, %27 : tensor<32xi1>
2026-02-21T08:18:30.0486793Z       %29 = arith.select %28, %9#0, %25 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:18:30.0487033Z       %30 = arith.subf %9#0, %29 : tensor<32xf32>
2026-02-21T08:18:30.0487379Z       %31 = tt.extern_elementwise %30 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:18:30.0487847Z       %32 = arith.mulf %9#1, %31 : tensor<32xf32>
2026-02-21T08:18:30.0488096Z       %33 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:30.0488393Z       %34 = tt.broadcast %33 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:18:30.0488622Z       %35 = arith.subf %22, %34 : tensor<32x8xf32>
2026-02-21T08:18:30.0488982Z       %36 = tt.extern_elementwise %35 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:18:30.0489346Z       %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({
2026-02-21T08:18:30.0489535Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:18:30.0489716Z         %49 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:18:30.0489900Z         tt.reduce.return %49 : f32
2026-02-21T08:18:30.0490090Z       }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:18:30.0490342Z       %38 = arith.addf %32, %37 : tensor<32xf32>
2026-02-21T08:18:30.0490546Z       %c2808_i32_2 = arith.constant 2808 : i32
2026-02-21T08:18:30.0490740Z       %c24_i32_3 = arith.constant 24 : i32
2026-02-21T08:18:30.0490961Z       scf.for %arg3 = %c0_i32 to %c2808_i32_2 step %c24_i32_3  : i32 {
2026-02-21T08:18:30.0491366Z         %49 = tt.descriptor_load %0[%5, %arg3] : !tt.tensordesc<tensor<32x8xf16>> -> tensor<32x8xf16>
2026-02-21T08:18:30.0491801Z         %50 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:30.0492109Z         %51 = arith.extf %49 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:18:30.0492360Z         %52 = tt.broadcast %50 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:18:30.0492595Z         %53 = arith.subf %51, %52 : tensor<32x8xf32>
2026-02-21T08:18:30.0492960Z         %54 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:18:30.0493368Z         %55 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:30.0493658Z         %56 = tt.broadcast %55 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:18:30.0493883Z         %57 = arith.divf %54, %56 : tensor<32x8xf32>
2026-02-21T08:18:30.0494116Z         %58 = arith.truncf %57 : tensor<32x8xf32> to tensor<32x8xf16>
2026-02-21T08:18:30.0494427Z         tt.descriptor_store %1[%5, %arg3], %58 : !tt.tensordesc<tensor<32x8xf16>>, tensor<32x8xf16>
2026-02-21T08:18:30.0494708Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T08:18:30.0494907Z         %59 = arith.muli %c8_i32, %c1_i32_4 : i32
2026-02-21T08:18:30.0495095Z         %60 = arith.addi %arg3, %59 : i32
2026-02-21T08:18:30.0495360Z         %61 = tt.descriptor_load %0[%5, %60] : !tt.tensordesc<tensor<32x8xf16>> -> tensor<32x8xf16>
2026-02-21T08:18:30.0495681Z         %62 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:30.0495965Z         %63 = arith.extf %61 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:18:30.0496214Z         %64 = tt.broadcast %62 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:18:30.0496437Z         %65 = arith.subf %63, %64 : tensor<32x8xf32>
2026-02-21T08:18:30.0496797Z         %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:18:30.0497195Z         %67 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:30.0497479Z         %68 = tt.broadcast %67 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:18:30.0497706Z         %69 = arith.divf %66, %68 : tensor<32x8xf32>
2026-02-21T08:18:30.0497925Z         %70 = arith.truncf %69 : tensor<32x8xf32> to tensor<32x8xf16>
2026-02-21T08:18:30.0498229Z         tt.descriptor_store %1[%5, %60], %70 : !tt.tensordesc<tensor<32x8xf16>>, tensor<32x8xf16>
2026-02-21T08:18:30.0498497Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:18:30.0498692Z         %71 = arith.muli %c8_i32, %c2_i32 : i32
2026-02-21T08:18:30.0498927Z         %72 = arith.addi %arg3, %71 : i32
2026-02-21T08:18:30.0499191Z         %73 = tt.descriptor_load %0[%5, %72] : !tt.tensordesc<tensor<32x8xf16>> -> tensor<32x8xf16>
2026-02-21T08:18:30.0499518Z         %74 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:30.0499792Z         %75 = arith.extf %73 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:18:30.0500041Z         %76 = tt.broadcast %74 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:18:30.0500265Z         %77 = arith.subf %75, %76 : tensor<32x8xf32>
2026-02-21T08:18:30.0500627Z         %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:18:30.0501058Z         %79 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:30.0501333Z         %80 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:18:30.0501637Z         %81 = arith.divf %78, %80 : tensor<32x8xf32>
2026-02-21T08:18:30.0501857Z         %82 = arith.truncf %81 : tensor<32x8xf32> to tensor<32x8xf16>
2026-02-21T08:18:30.0502156Z         tt.descriptor_store %1[%5, %72], %82 : !tt.tensordesc<tensor<32x8xf16>>, tensor<32x8xf16>
2026-02-21T08:18:30.0502456Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:18:30.0502773Z       %39 = tt.descriptor_load %0[%5, %c2808_i32_2] : !tt.tensordesc<tensor<32x8xf16>> -> tensor<32x8xf16>
2026-02-21T08:18:30.0503117Z       %40 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:30.0503387Z       %41 = arith.extf %39 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:18:30.0503635Z       %42 = tt.broadcast %40 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:18:30.0503856Z       %43 = arith.subf %41, %42 : tensor<32x8xf32>
2026-02-21T08:18:30.0504213Z       %44 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:18:30.0504618Z       %45 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:30.0504888Z       %46 = tt.broadcast %45 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:18:30.0505118Z       %47 = arith.divf %44, %46 : tensor<32x8xf32>
2026-02-21T08:18:30.0505339Z       %48 = arith.truncf %47 : tensor<32x8xf32> to tensor<32x8xf16>
2026-02-21T08:18:30.0505661Z       tt.descriptor_store %1[%5, %c2808_i32_2], %48 : !tt.tensordesc<tensor<32x8xf16>>, tensor<32x8xf16>
2026-02-21T08:18:30.0505963Z     } {tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:18:30.0506163Z     tt.return
2026-02-21T08:18:30.0506295Z   }
2026-02-21T08:18:30.0506415Z }
2026-02-21T08:18:30.0506481Z 
2026-02-21T08:18:30.0506541Z {-#
2026-02-21T08:18:30.0506670Z   external_resources: {
2026-02-21T08:18:30.0506831Z     mlir_reproducer: {
2026-02-21T08:18:30.0511127Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:18:30.0515759Z       disable_threading: false,
2026-02-21T08:18:30.0515936Z       verify_each: true
2026-02-21T08:18:30.0516080Z     }
2026-02-21T08:18:30.0516211Z   }
2026-02-21T08:18:30.0516328Z #-}
2026-02-21T08:18:30.0516821Z /tmp/torchinductor_root/e4/ce4plioxhs4dkgdsuiqbo5taxqthwvps2cnco56uv7pliohm3mam.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:18:30.0518022Z /tmp/torchinductor_root/e4/ce4plioxhs4dkgdsuiqbo5taxqthwvps2cnco56uv7pliohm3mam.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:18:30.0518996Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:18:30.0520095Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 8], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[2, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:18:30.0521127Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:18:30.0521391Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:18:31.6545041Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:18:31.6546527Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:18:31.6547022Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:18:31.6547221Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:18:31.6547421Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:18:31.6547643Z     %cst = arith.constant dense<2816> : tensor<32x1xi32>
2026-02-21T08:18:31.6547904Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<32xf32>
2026-02-21T08:18:31.6548161Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<32xf32>
2026-02-21T08:18:31.6548422Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:18:31.6548627Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:18:31.6548812Z     %c2816_i32 = arith.constant 2816 : i32
2026-02-21T08:18:31.6548998Z     %c2816_i64 = arith.constant 2816 : i64
2026-02-21T08:18:31.6549180Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:18:31.6549503Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c2816_i32], [%c2816_i64, %c1_i64] : <f16>, <tensor<32x32xf16>>
2026-02-21T08:18:31.6549828Z     %1 = tt.get_program_id x : i32
2026-02-21T08:18:31.6550068Z     scf.for %arg2 = %1 to %c128_i32 step %c9472_i32  : i32 {
2026-02-21T08:18:31.6550328Z       %2 = arith.muli %arg2, %c32_i32 : i32
2026-02-21T08:18:31.6550557Z       %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32>
2026-02-21T08:18:31.6550811Z       %4 = tt.splat %2 : i32 -> tensor<32xi32>
2026-02-21T08:18:31.6551011Z       %5 = arith.addi %4, %3 : tensor<32xi32>
2026-02-21T08:18:31.6551196Z       %c2784_i32 = arith.constant 2784 : i32
2026-02-21T08:18:31.6551393Z       %c96_i32 = arith.constant 96 : i32
2026-02-21T08:18:31.6552296Z       %6:2 = scf.for %arg3 = %c0_i32 to %c2784_i32 step %c96_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<32xf32>, tensor<32xf32>)  : i32 {
2026-02-21T08:18:31.6552762Z         %47 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:18:31.6553086Z         %48 = arith.extf %47 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:18:31.6553334Z         %49 = "tt.reduce"(%48) <{axis = 1 : i32}> ({
2026-02-21T08:18:31.6553549Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:18:31.6553739Z           %105 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:18:31.6553938Z           tt.reduce.return %105 : f32
2026-02-21T08:18:31.6554121Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:18:31.6554350Z         %50 = arith.truncf %49 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:18:31.6554587Z         %51 = arith.extf %50 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:18:31.6554929Z         %52 = arith.cmpf ogt, %arg4, %51 : tensor<32xf32>
2026-02-21T08:18:31.6555170Z         %53 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32>
2026-02-21T08:18:31.6555386Z         %54 = arith.ori %52, %53 : tensor<32xi1>
2026-02-21T08:18:31.6555631Z         %55 = arith.select %54, %arg4, %51 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:18:31.6555876Z         %56 = arith.subf %arg4, %55 : tensor<32xf32>
2026-02-21T08:18:31.6556246Z         %57 = tt.extern_elementwise %56 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:18:31.6556656Z         %58 = arith.mulf %arg5, %57 : tensor<32xf32>
2026-02-21T08:18:31.6556918Z         %59 = tt.expand_dims %55 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:31.6557223Z         %60 = tt.broadcast %59 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:18:31.6557471Z         %61 = arith.subf %48, %60 : tensor<32x32xf32>
2026-02-21T08:18:31.6557865Z         %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:18:31.6558240Z         %63 = "tt.reduce"(%62) <{axis = 1 : i32}> ({
2026-02-21T08:18:31.6558451Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:18:31.6558649Z           %105 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:18:31.6558844Z           tt.reduce.return %105 : f32
2026-02-21T08:18:31.6559041Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:18:31.6559243Z         %64 = arith.addf %58, %63 : tensor<32xf32>
2026-02-21T08:18:31.6559444Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:18:31.6559638Z         %65 = arith.muli %c32_i32, %c1_i32 : i32
2026-02-21T08:18:31.6559839Z         %66 = arith.addi %arg3, %65 : i32
2026-02-21T08:18:31.6560118Z         %67 = tt.descriptor_load %0[%2, %66] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:18:31.6560449Z         %68 = arith.extf %67 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:18:31.6560692Z         %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({
2026-02-21T08:18:31.6560884Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:18:31.6561084Z           %105 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:18:31.6561280Z           tt.reduce.return %105 : f32
2026-02-21T08:18:31.6561474Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:18:31.6561755Z         %70 = arith.truncf %69 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:18:31.6562012Z         %71 = arith.extf %70 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:18:31.6562259Z         %72 = arith.cmpf ogt, %55, %71 : tensor<32xf32>
2026-02-21T08:18:31.6562481Z         %73 = arith.cmpf une, %55, %55 : tensor<32xf32>
2026-02-21T08:18:31.6562699Z         %74 = arith.ori %72, %73 : tensor<32xi1>
2026-02-21T08:18:31.6562934Z         %75 = arith.select %74, %55, %71 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:18:31.6563182Z         %76 = arith.subf %55, %75 : tensor<32xf32>
2026-02-21T08:18:31.6563553Z         %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:18:31.6564018Z         %78 = arith.mulf %64, %77 : tensor<32xf32>
2026-02-21T08:18:31.6564284Z         %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:31.6564588Z         %80 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:18:31.6564841Z         %81 = arith.subf %68, %80 : tensor<32x32xf32>
2026-02-21T08:18:31.6565217Z         %82 = tt.extern_elementwise %81 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:18:31.6565601Z         %83 = "tt.reduce"(%82) <{axis = 1 : i32}> ({
2026-02-21T08:18:31.6565808Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:18:31.6565999Z           %105 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:18:31.6566212Z           tt.reduce.return %105 : f32
2026-02-21T08:18:31.6566470Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:18:31.6566683Z         %84 = arith.addf %78, %83 : tensor<32xf32>
2026-02-21T08:18:31.6566878Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:18:31.6567076Z         %85 = arith.muli %c32_i32, %c2_i32 : i32
2026-02-21T08:18:31.6567274Z         %86 = arith.addi %arg3, %85 : i32
2026-02-21T08:18:31.6567558Z         %87 = tt.descriptor_load %0[%2, %86] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:18:31.6567869Z         %88 = arith.extf %87 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:18:31.6568091Z         %89 = "tt.reduce"(%88) <{axis = 1 : i32}> ({
2026-02-21T08:18:31.6568282Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:18:31.6568464Z           %105 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:18:31.6568658Z           tt.reduce.return %105 : f32
2026-02-21T08:18:31.6568837Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:18:31.6569064Z         %90 = arith.truncf %89 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:18:31.6569310Z         %91 = arith.extf %90 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:18:31.6569532Z         %92 = arith.cmpf ogt, %75, %91 : tensor<32xf32>
2026-02-21T08:18:31.6569748Z         %93 = arith.cmpf une, %75, %75 : tensor<32xf32>
2026-02-21T08:18:31.6569944Z         %94 = arith.ori %92, %93 : tensor<32xi1>
2026-02-21T08:18:31.6570173Z         %95 = arith.select %94, %75, %91 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:18:31.6570396Z         %96 = arith.subf %75, %95 : tensor<32xf32>
2026-02-21T08:18:31.6570742Z         %97 = tt.extern_elementwise %96 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:18:31.6571094Z         %98 = arith.mulf %84, %97 : tensor<32xf32>
2026-02-21T08:18:31.6571336Z         %99 = tt.expand_dims %95 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:31.6571676Z         %100 = tt.broadcast %99 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:18:31.6571917Z         %101 = arith.subf %88, %100 : tensor<32x32xf32>
2026-02-21T08:18:31.6572284Z         %102 = tt.extern_elementwise %101 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:18:31.6572651Z         %103 = "tt.reduce"(%102) <{axis = 1 : i32}> ({
2026-02-21T08:18:31.6572840Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:18:31.6573025Z           %105 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:18:31.6573208Z           tt.reduce.return %105 : f32
2026-02-21T08:18:31.6573394Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:18:31.6573591Z         %104 = arith.addf %98, %103 : tensor<32xf32>
2026-02-21T08:18:31.6573810Z         scf.yield %95, %104 : tensor<32xf32>, tensor<32xf32>
2026-02-21T08:18:31.6574027Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:18:31.6574327Z       %7 = tt.descriptor_load %0[%2, %c2784_i32] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:18:31.6574710Z       %8 = arith.extf %7 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:18:31.6574925Z       %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({
2026-02-21T08:18:31.6575116Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:18:31.6575296Z         %47 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:18:31.6575489Z         tt.reduce.return %47 : f32
2026-02-21T08:18:31.6575670Z       }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:18:31.6575892Z       %10 = arith.truncf %9 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:18:31.6576133Z       %11 = arith.extf %10 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:18:31.6576350Z       %12 = arith.cmpf ogt, %6#0, %11 : tensor<32xf32>
2026-02-21T08:18:31.6576569Z       %13 = arith.cmpf une, %6#0, %6#0 : tensor<32xf32>
2026-02-21T08:18:31.6576770Z       %14 = arith.ori %12, %13 : tensor<32xi1>
2026-02-21T08:18:31.6577009Z       %15 = arith.select %14, %6#0, %11 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:18:31.6577300Z       %16 = arith.subf %6#0, %15 : tensor<32xf32>
2026-02-21T08:18:31.6577661Z       %17 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:18:31.6578018Z       %18 = arith.mulf %6#1, %17 : tensor<32xf32>
2026-02-21T08:18:31.6578262Z       %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:31.6578551Z       %20 = tt.broadcast %19 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:18:31.6578776Z       %21 = arith.subf %8, %20 : tensor<32x32xf32>
2026-02-21T08:18:31.6579132Z       %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:18:31.6579488Z       %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({
2026-02-21T08:18:31.6579677Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:18:31.6579858Z         %47 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:18:31.6580038Z         tt.reduce.return %47 : f32
2026-02-21T08:18:31.6580233Z       }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:18:31.6580425Z       %24 = arith.addf %18, %23 : tensor<32xf32>
2026-02-21T08:18:31.6580627Z       %c2784_i32_2 = arith.constant 2784 : i32
2026-02-21T08:18:31.6580814Z       %c96_i32_3 = arith.constant 96 : i32
2026-02-21T08:18:31.6581047Z       scf.for %arg3 = %c0_i32 to %c2784_i32_2 step %c96_i32_3  : i32 {
2026-02-21T08:18:31.6581293Z         %47 = tt.splat %arg3 : i32 -> tensor<32xi32>
2026-02-21T08:18:31.6581491Z         %48 = arith.addi %47, %3 : tensor<32xi32>
2026-02-21T08:18:31.6581783Z         %49 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:18:31.6582045Z         %50 = arith.muli %49, %cst : tensor<32x1xi32>
2026-02-21T08:18:31.6582300Z         %51 = tt.expand_dims %48 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:18:31.6582581Z         %52 = tt.broadcast %50 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:18:31.6582871Z         %53 = tt.broadcast %51 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:18:31.6583108Z         %54 = arith.addi %52, %53 : tensor<32x32xi32>
2026-02-21T08:18:31.6583339Z         %55 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:18:31.6583617Z         %56 = tt.addptr %55, %54 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:18:31.6583917Z         %57 = tt.load %56 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:18:31.6584259Z         %58 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:31.6584549Z         %59 = arith.extf %57 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:18:31.6584818Z         %60 = tt.broadcast %58 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:18:31.6585053Z         %61 = arith.subf %59, %60 : tensor<32x32xf32>
2026-02-21T08:18:31.6585421Z         %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:18:31.6585877Z         %63 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:31.6586158Z         %64 = tt.broadcast %63 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:18:31.6586392Z         %65 = arith.divf %62, %64 : tensor<32x32xf32>
2026-02-21T08:18:31.6586619Z         %66 = arith.truncf %65 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:18:31.6586887Z         %67 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:18:31.6587155Z         %68 = tt.addptr %67, %54 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:18:31.6587411Z         tt.store %68, %66 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:18:31.6587612Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:18:31.6587805Z         %69 = arith.muli %c32_i32, %c1_i32 : i32
2026-02-21T08:18:31.6587996Z         %70 = arith.addi %arg3, %69 : i32
2026-02-21T08:18:31.6588184Z         %71 = tt.splat %70 : i32 -> tensor<32xi32>
2026-02-21T08:18:31.6588466Z         %72 = arith.addi %71, %3 : tensor<32xi32>
2026-02-21T08:18:31.6588709Z         %73 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:18:31.6588967Z         %74 = arith.muli %73, %cst : tensor<32x1xi32>
2026-02-21T08:18:31.6589208Z         %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:18:31.6589490Z         %76 = tt.broadcast %74 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:18:31.6589744Z         %77 = tt.broadcast %75 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:18:31.6589965Z         %78 = arith.addi %76, %77 : tensor<32x32xi32>
2026-02-21T08:18:31.6590201Z         %79 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:18:31.6590468Z         %80 = tt.addptr %79, %78 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:18:31.6590768Z         %81 = tt.load %80 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:18:31.6591067Z         %82 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:31.6591352Z         %83 = arith.extf %81 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:18:31.6591649Z         %84 = tt.broadcast %82 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:18:31.6591874Z         %85 = arith.subf %83, %84 : tensor<32x32xf32>
2026-02-21T08:18:31.6592241Z         %86 = tt.extern_elementwise %85 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:18:31.6592645Z         %87 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:31.6592928Z         %88 = tt.broadcast %87 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:18:31.6593161Z         %89 = arith.divf %86, %88 : tensor<32x32xf32>
2026-02-21T08:18:31.6593387Z         %90 = arith.truncf %89 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:18:31.6593659Z         %91 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:18:31.6593932Z         %92 = tt.addptr %91, %78 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:18:31.6594190Z         tt.store %92, %90 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:18:31.6594387Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:18:31.6594578Z         %93 = arith.muli %c32_i32, %c2_i32 : i32
2026-02-21T08:18:31.6594770Z         %94 = arith.addi %arg3, %93 : i32
2026-02-21T08:18:31.6594957Z         %95 = tt.splat %94 : i32 -> tensor<32xi32>
2026-02-21T08:18:31.6595159Z         %96 = arith.addi %95, %3 : tensor<32xi32>
2026-02-21T08:18:31.6595400Z         %97 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:18:31.6595663Z         %98 = arith.muli %97, %cst : tensor<32x1xi32>
2026-02-21T08:18:31.6595905Z         %99 = tt.expand_dims %96 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:18:31.6596196Z         %100 = tt.broadcast %98 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:18:31.6596467Z         %101 = tt.broadcast %99 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:18:31.6596765Z         %102 = arith.addi %100, %101 : tensor<32x32xi32>
2026-02-21T08:18:31.6597016Z         %103 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:18:31.6597302Z         %104 = tt.addptr %103, %102 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:18:31.6597617Z         %105 = tt.load %104 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:18:31.6597935Z         %106 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:31.6598219Z         %107 = arith.extf %105 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:18:31.6598486Z         %108 = tt.broadcast %106 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:18:31.6598723Z         %109 = arith.subf %107, %108 : tensor<32x32xf32>
2026-02-21T08:18:31.6599176Z         %110 = tt.extern_elementwise %109 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:18:31.6599597Z         %111 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:31.6599903Z         %112 = tt.broadcast %111 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:18:31.6600160Z         %113 = arith.divf %110, %112 : tensor<32x32xf32>
2026-02-21T08:18:31.6600409Z         %114 = arith.truncf %113 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:18:31.6600698Z         %115 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:18:31.6600993Z         %116 = tt.addptr %115, %102 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:18:31.6601273Z         tt.store %116, %114 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:18:31.6601489Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:18:31.6601763Z       %25 = tt.splat %c2784_i32_2 : i32 -> tensor<32xi32>
2026-02-21T08:18:31.6602117Z       %26 = arith.addi %25, %3 : tensor<32xi32>
2026-02-21T08:18:31.6602445Z       %27 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:18:31.6602902Z       %28 = arith.muli %27, %cst : tensor<32x1xi32>
2026-02-21T08:18:31.6603251Z       %29 = tt.expand_dims %26 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:18:31.6603612Z       %30 = tt.broadcast %28 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:18:31.6603995Z       %31 = tt.broadcast %29 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:18:31.6604285Z       %32 = arith.addi %30, %31 : tensor<32x32xi32>
2026-02-21T08:18:31.6604631Z       %33 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:18:31.6604972Z       %34 = tt.addptr %33, %32 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:18:31.6605378Z       %35 = tt.load %34 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:18:31.6605798Z       %36 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:31.6606173Z       %37 = arith.extf %35 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:18:31.6606532Z       %38 = tt.broadcast %36 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:18:31.6606844Z       %39 = arith.subf %37, %38 : tensor<32x32xf32>
2026-02-21T08:18:31.6607327Z       %40 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:18:31.6607853Z       %41 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:18:31.6608216Z       %42 = tt.broadcast %41 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:18:31.6608542Z       %43 = arith.divf %40, %42 : tensor<32x32xf32>
2026-02-21T08:18:31.6608837Z       %44 = arith.truncf %43 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:18:31.6609204Z       %45 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:18:31.6609586Z       %46 = tt.addptr %45, %32 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:18:31.6609973Z       tt.store %46, %44 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:18:31.6610321Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:18:31.6610604Z     tt.return
2026-02-21T08:18:31.6610853Z   }
2026-02-21T08:18:31.6611022Z }
2026-02-21T08:18:31.6611147Z 
2026-02-21T08:18:31.6611206Z {-#
2026-02-21T08:18:31.6611459Z   external_resources: {
2026-02-21T08:18:31.6611702Z     mlir_reproducer: {
2026-02-21T08:18:31.6616076Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:18:31.6620667Z       disable_threading: false,
2026-02-21T08:18:31.6621026Z       verify_each: true
2026-02-21T08:18:31.6621250Z     }
2026-02-21T08:18:31.6621455Z   }
2026-02-21T08:18:31.6621677Z #-}
2026-02-21T08:18:31.6622211Z /tmp/torchinductor_root/sc/csc5amcls62xcw2ccldevip5lr6kn5mvhagnayklxtlwdzf5nnbl.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:18:31.6623537Z /tmp/torchinductor_root/sc/csc5amcls62xcw2ccldevip5lr6kn5mvhagnayklxtlwdzf5nnbl.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:18:31.6624603Z [38s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:18:31.6625782Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=128, num_sm_multiplier=64, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[3, 3], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:18:31.6626866Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:18:31.6627181Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:18:33.5733708Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 15.8 configs/s
2026-02-21T08:18:33.5744743Z [40s] Adaptive compile timeout: 30s (90% percentile=4.3s, bounds=[30.0s, 30s])
2026-02-21T08:18:34.3689999Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1883.4 configs/s
2026-02-21T08:18:34.4224982Z [41s] Initial random population of 100, 5 starting points: 
2026-02-21T08:18:34.4226952Z error=7
2026-02-21T08:18:34.4227295Z timeout=2
2026-02-21T08:18:34.4227473Z ok=91
2026-02-21T08:18:34.4227842Z min=0.0225
2026-02-21T08:18:34.4228029Z mid=0.2968
2026-02-21T08:18:34.4228226Z max=18.9768
2026-02-21T08:18:34.4228485Z best={'block_sizes': [2, 1024],
2026-02-21T08:18:34.4228876Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:18:34.4229220Z  'load_eviction_policies': ['first', ''],
2026-02-21T08:18:34.4229541Z  'num_sm_multiplier': 64,
2026-02-21T08:18:34.4229795Z  'num_stages': 5,
2026-02-21T08:18:34.4229994Z  'num_warps': 1,
2026-02-21T08:18:34.4230277Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:18:34.4230535Z  'range_flattens': [True, True],
2026-02-21T08:18:34.4230810Z  'range_multi_buffers': [False, None],
2026-02-21T08:18:34.4231059Z  'range_num_stages': [3, 1],
2026-02-21T08:18:34.4231887Z  'range_unroll_factors': [0, 2],
2026-02-21T08:18:34.4232176Z  'range_warp_specializes': [True, None]}
2026-02-21T08:18:34.4237790Z [41s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:18:35.4987847Z [42s] Generation 1 starting: 86 neighbors, 5 active search path(s)
2026-02-21T08:18:54.1295413Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 1.1 configs/s
2026-02-21T08:18:59.5436465Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 16.8 configs/s
2026-02-21T08:19:00.5146816Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1040.8 configs/s
2026-02-21T08:19:00.6009917Z [67s] Generation 1 complete: 
2026-02-21T08:19:00.6014428Z error=1
2026-02-21T08:19:00.6016540Z ok=90
2026-02-21T08:19:00.6016779Z min=0.0143
2026-02-21T08:19:00.6017058Z mid=0.0245
2026-02-21T08:19:00.6017394Z max=0.1863
2026-02-21T08:19:00.6019729Z best={'block_sizes': [1, 4096],
2026-02-21T08:19:00.6020042Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:19:00.6020441Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:19:00.6020706Z  'num_stages': 5,
2026-02-21T08:19:00.6020934Z  'num_warps': 4,
2026-02-21T08:19:00.6021188Z  'pid_type': 'flat',
2026-02-21T08:19:00.6021443Z  'range_flattens': [None, False],
2026-02-21T08:19:00.6021912Z  'range_multi_buffers': [None, True],
2026-02-21T08:19:00.6022207Z  'range_num_stages': [0, 4],
2026-02-21T08:19:00.6022476Z  'range_unroll_factors': [0, 0],
2026-02-21T08:19:00.6022727Z  'range_warp_specializes': [None, False]}
2026-02-21T08:19:00.6023125Z [67s] Fitting surrogate: 191 points, 191 targets
2026-02-21T08:19:01.6616005Z [68s] Generation 2 starting: 72 neighbors, 5 active search path(s)
2026-02-21T08:19:10.9697247Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 2.4 configs/s
2026-02-21T08:19:15.5610753Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 16.5 configs/s
2026-02-21T08:19:18.0489650Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 462.0 configs/s
2026-02-21T08:19:18.2465112Z [85s] Generation 2 complete: 
2026-02-21T08:19:18.2469081Z ok=78
2026-02-21T08:19:18.2471385Z min=0.0143
2026-02-21T08:19:18.2472148Z mid=0.0225
2026-02-21T08:19:18.2476601Z max=0.8029
2026-02-21T08:19:18.2481166Z best={'block_sizes': [1, 4096],
2026-02-21T08:19:18.2482413Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:19:18.2482771Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:19:18.2483033Z  'num_stages': 5,
2026-02-21T08:19:18.2483352Z  'num_warps': 4,
2026-02-21T08:19:18.2483597Z  'pid_type': 'flat',
2026-02-21T08:19:18.2483811Z  'range_flattens': [None, False],
2026-02-21T08:19:18.2484100Z  'range_multi_buffers': [None, True],
2026-02-21T08:19:18.2484348Z  'range_num_stages': [0, 4],
2026-02-21T08:19:18.2485011Z  'range_unroll_factors': [0, 0],
2026-02-21T08:19:18.2485266Z  'range_warp_specializes': [None, False]}
2026-02-21T08:19:18.2485601Z [85s] Fitting surrogate: 269 points, 269 targets
2026-02-21T08:19:19.2855660Z [86s] Generation 3 starting: 72 neighbors, 5 active search path(s)
2026-02-21T08:19:27.1098643Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 4.1 configs/s
2026-02-21T08:19:31.5710621Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 17.0 configs/s
2026-02-21T08:19:34.2592021Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 379.2 configs/s
2026-02-21T08:19:34.4994627Z [101s] Generation 3 complete: 
2026-02-21T08:19:34.4998385Z error=3
2026-02-21T08:19:34.5000099Z ok=75
2026-02-21T08:19:34.5000354Z min=0.0143
2026-02-21T08:19:34.5000553Z mid=0.0205
2026-02-21T08:19:34.5000816Z max=0.0880
2026-02-21T08:19:34.5001033Z best={'block_sizes': [1, 4096],
2026-02-21T08:19:34.5001373Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:19:34.5001837Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:19:34.5004482Z  'num_stages': 5,
2026-02-21T08:19:34.5008153Z  'num_warps': 2,
2026-02-21T08:19:34.5009394Z  'pid_type': 'flat',
2026-02-21T08:19:34.5009686Z  'range_flattens': [None, True],
2026-02-21T08:19:34.5009947Z  'range_multi_buffers': [None, None],
2026-02-21T08:19:34.5010236Z  'range_num_stages': [0, 4],
2026-02-21T08:19:34.5010499Z  'range_unroll_factors': [0, 0],
2026-02-21T08:19:34.5010735Z  'range_warp_specializes': [None, None]}
2026-02-21T08:19:34.5012652Z [101s] Fitting surrogate: 347 points, 347 targets
2026-02-21T08:19:35.3079479Z [102s] Generation 4 starting: 59 neighbors, 4 active search path(s)
2026-02-21T08:19:42.2967966Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 2.9 configs/s
2026-02-21T08:19:46.0653510Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 61/61 16.4 configs/s
2026-02-21T08:19:48.7213186Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 384.0 configs/s
2026-02-21T08:19:48.9632115Z [116s] Generation 4 complete: 
2026-02-21T08:19:48.9633123Z ok=64
2026-02-21T08:19:48.9633314Z min=0.0143
2026-02-21T08:19:48.9633604Z mid=0.0164
2026-02-21T08:19:48.9638172Z max=0.4035
2026-02-21T08:19:48.9639573Z best={'block_sizes': [1, 4096],
2026-02-21T08:19:48.9639964Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:19:48.9640367Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:19:48.9640605Z  'num_stages': 5,
2026-02-21T08:19:48.9640836Z  'num_warps': 2,
2026-02-21T08:19:48.9641095Z  'pid_type': 'flat',
2026-02-21T08:19:48.9641302Z  'range_flattens': [None, True],
2026-02-21T08:19:48.9641626Z  'range_multi_buffers': [None, None],
2026-02-21T08:19:48.9641877Z  'range_num_stages': [0, 4],
2026-02-21T08:19:48.9642167Z  'range_unroll_factors': [0, 0],
2026-02-21T08:19:48.9642922Z  'range_warp_specializes': [None, None]}
2026-02-21T08:19:48.9645573Z [116s] Fitting surrogate: 411 points, 411 targets
2026-02-21T08:19:49.7529893Z [117s] Generation 5 starting: 53 neighbors, 4 active search path(s)
2026-02-21T08:19:53.0857187Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 19.0 configs/s
2026-02-21T08:19:56.4285347Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 54/54 16.4 configs/s
2026-02-21T08:19:59.6035465Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 364.2 configs/s
2026-02-21T08:19:59.8640688Z [127s] Generation 5 complete: 
2026-02-21T08:19:59.8642998Z ok=58
2026-02-21T08:19:59.8644517Z min=0.0143
2026-02-21T08:19:59.8647605Z mid=0.0164
2026-02-21T08:19:59.8648906Z max=0.1352
2026-02-21T08:19:59.8649109Z best={'block_sizes': [1, 4096],
2026-02-21T08:19:59.8649972Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:19:59.8650345Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:19:59.8650627Z  'num_stages': 5,
2026-02-21T08:19:59.8650824Z  'num_warps': 2,
2026-02-21T08:19:59.8651076Z  'pid_type': 'flat',
2026-02-21T08:19:59.8651328Z  'range_flattens': [None, True],
2026-02-21T08:19:59.8651737Z  'range_multi_buffers': [None, None],
2026-02-21T08:19:59.8652031Z  'range_num_stages': [0, 4],
2026-02-21T08:19:59.8652260Z  'range_unroll_factors': [0, 0],
2026-02-21T08:19:59.8652523Z  'range_warp_specializes': [None, None]}
2026-02-21T08:19:59.8657014Z [127s] Fitting surrogate: 469 points, 469 targets
2026-02-21T08:20:00.7031273Z [128s] Generation 6 starting: 53 neighbors, 4 active search path(s)
2026-02-21T08:20:03.6680864Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 25.0 configs/s
2026-02-21T08:20:06.9515742Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 16.3 configs/s
2026-02-21T08:20:10.0670219Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 327.7 configs/s
2026-02-21T08:20:10.3446711Z [137s] Generation 6 complete: 
2026-02-21T08:20:10.3450797Z ok=58
2026-02-21T08:20:10.3454484Z min=0.0143
2026-02-21T08:20:10.3456019Z mid=0.0144
2026-02-21T08:20:10.3456383Z max=0.0235
2026-02-21T08:20:10.3456621Z best={'block_sizes': [1, 4096],
2026-02-21T08:20:10.3456915Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:20:10.3457269Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:20:10.3457502Z  'num_stages': 5,
2026-02-21T08:20:10.3457728Z  'num_warps': 2,
2026-02-21T08:20:10.3457978Z  'pid_type': 'flat',
2026-02-21T08:20:10.3458199Z  'range_flattens': [None, True],
2026-02-21T08:20:10.3458463Z  'range_multi_buffers': [None, None],
2026-02-21T08:20:10.3458714Z  'range_num_stages': [0, 4],
2026-02-21T08:20:10.3458991Z  'range_unroll_factors': [0, 0],
2026-02-21T08:20:10.3459230Z  'range_warp_specializes': [None, None]}
2026-02-21T08:20:10.3466119Z [137s] Fitting surrogate: 527 points, 527 targets
2026-02-21T08:20:10.9062413Z [138s] Generation 7 starting: 25 neighbors, 2 active search path(s)
2026-02-21T08:20:12.3999008Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26/26 40.7 configs/s
2026-02-21T08:20:14.0097145Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 26/26 16.6 configs/s
2026-02-21T08:20:15.8592604Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 642.9 configs/s
2026-02-21T08:20:16.0006370Z [143s] Generation 7 complete: 
2026-02-21T08:20:16.0011455Z ok=28
2026-02-21T08:20:16.0015853Z min=0.0143
2026-02-21T08:20:16.0020638Z mid=0.0145
2026-02-21T08:20:16.0024324Z max=0.0266
2026-02-21T08:20:16.0026403Z best={'block_sizes': [1, 4096],
2026-02-21T08:20:16.0026880Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:20:16.0027622Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:20:16.0027866Z  'num_stages': 2,
2026-02-21T08:20:16.0028128Z  'num_warps': 2,
2026-02-21T08:20:16.0028342Z  'pid_type': 'flat',
2026-02-21T08:20:16.0028595Z  'range_flattens': [None, False],
2026-02-21T08:20:16.0028836Z  'range_multi_buffers': [None, False],
2026-02-21T08:20:16.0029122Z  'range_num_stages': [0, 1],
2026-02-21T08:20:16.0029373Z  'range_unroll_factors': [0, 0],
2026-02-21T08:20:16.0029607Z  'range_warp_specializes': [None, None]}
2026-02-21T08:20:16.0029927Z [143s] Fitting surrogate: 555 points, 555 targets
2026-02-21T08:20:16.5049969Z [143s] Generation 8 starting: 22 neighbors, 2 active search path(s)
2026-02-21T08:20:17.9437924Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 36.5 configs/s
2026-02-21T08:20:19.3627837Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 16.7 configs/s
2026-02-21T08:20:20.7143578Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 750.3 configs/s
2026-02-21T08:20:20.8285040Z [148s] Generation 8 complete: 
2026-02-21T08:20:20.8286882Z ok=24
2026-02-21T08:20:20.8287198Z min=0.0143
2026-02-21T08:20:20.8287546Z mid=0.0164
2026-02-21T08:20:20.8287799Z max=0.0286
2026-02-21T08:20:20.8288021Z best={'block_sizes': [1, 4096],
2026-02-21T08:20:20.8288412Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:20:20.8288753Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:20:20.8289091Z  'num_stages': 2,
2026-02-21T08:20:20.8289419Z  'num_warps': 2,
2026-02-21T08:20:20.8289611Z  'pid_type': 'flat',
2026-02-21T08:20:20.8289872Z  'range_flattens': [None, False],
2026-02-21T08:20:20.8290112Z  'range_multi_buffers': [None, False],
2026-02-21T08:20:20.8290418Z  'range_num_stages': [0, 1],
2026-02-21T08:20:20.8290648Z  'range_unroll_factors': [0, 0],
2026-02-21T08:20:20.8290937Z  'range_warp_specializes': [None, None]}
2026-02-21T08:20:20.8302085Z [148s] Fitting surrogate: 579 points, 579 targets
2026-02-21T08:20:21.2609616Z [148s] Generation 9 starting: 16 neighbors, 1 active search path(s)
2026-02-21T08:20:22.3639381Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 21.6 configs/s
2026-02-21T08:20:23.4105094Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 16.9 configs/s
2026-02-21T08:20:24.4027958Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1019.1 configs/s
2026-02-21T08:20:24.4950213Z [151s] Generation 9 complete: 
2026-02-21T08:20:24.4954519Z ok=18
2026-02-21T08:20:24.4956054Z min=0.0143
2026-02-21T08:20:24.4956329Z mid=0.0144
2026-02-21T08:20:24.4956541Z max=0.0235
2026-02-21T08:20:24.4956766Z best={'block_sizes': [1, 4096],
2026-02-21T08:20:24.4957322Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:20:24.4959213Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:20:24.4959543Z  'num_stages': 2,
2026-02-21T08:20:24.4964211Z  'num_warps': 2,
2026-02-21T08:20:24.4968474Z  'pid_type': 'flat',
2026-02-21T08:20:24.4972267Z  'range_flattens': [None, False],
2026-02-21T08:20:24.4972610Z  'range_multi_buffers': [None, False],
2026-02-21T08:20:24.4972887Z  'range_num_stages': [0, 1],
2026-02-21T08:20:24.4977056Z  'range_unroll_factors': [0, 0],
2026-02-21T08:20:24.4979117Z  'range_warp_specializes': [None, None]}
2026-02-21T08:20:24.4979417Z [151s] Fitting surrogate: 597 points, 597 targets
2026-02-21T08:20:24.9466902Z [152s] Generation 10 starting: 15 neighbors, 1 active search path(s)
2026-02-21T08:20:29.4577990Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.7 configs/s
2026-02-21T08:20:30.4466501Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.9 configs/s
2026-02-21T08:20:31.3312834Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1141.0 configs/s
2026-02-21T08:20:31.4171337Z [158s] Generation 10 complete: 
2026-02-21T08:20:31.4175681Z ok=17
2026-02-21T08:20:31.4177185Z min=0.0143
2026-02-21T08:20:31.4177436Z mid=0.0164
2026-02-21T08:20:31.4177646Z max=0.0328
2026-02-21T08:20:31.4177825Z best={'block_sizes': [1, 4096],
2026-02-21T08:20:31.4178200Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:20:31.4178537Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:20:31.4178758Z  'num_stages': 2,
2026-02-21T08:20:31.4179030Z  'num_warps': 2,
2026-02-21T08:20:31.4179223Z  'pid_type': 'flat',
2026-02-21T08:20:31.4179446Z  'range_flattens': [None, False],
2026-02-21T08:20:31.4179714Z  'range_multi_buffers': [None, False],
2026-02-21T08:20:31.4179983Z  'range_num_stages': [0, 1],
2026-02-21T08:20:31.4180200Z  'range_unroll_factors': [0, 0],
2026-02-21T08:20:31.4180505Z  'range_warp_specializes': [None, None]}
2026-02-21T08:20:31.4190411Z [158s] Fitting surrogate: 614 points, 614 targets
2026-02-21T08:20:31.6817565Z [159s] Autotuning complete in 159.0s after searching 588 configs.
2026-02-21T08:20:31.6818115Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:20:31.6819275Z     @helion.kernel(config=helion.Config(block_sizes=[1, 4096], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]), static_shapes=True)
2026-02-21T08:20:31.6820127Z 
2026-02-21T08:20:31.6820407Z [159s] Code of selected kernel: /tmp/torchinductor_root/tv/ctvw6nvwxfmleqem5ps4ftlh564h4auc6z3vbynzdex7px3ify2o.py
2026-02-21T08:20:32.4457353Z WARNING:tritonbench.utils.triton_op:Completed input ID 20:
2026-02-21T08:20:32.4462289Z (M, N)
2026-02-21T08:20:32.4464600Z ------------
2026-02-21T08:20:32.4466865Z (4096, 2816)
2026-02-21T08:20:32.4467242Z 
2026-02-21T08:20:32.4472826Z  25%|██▌       | 5/20 [11:30<37:04, 148.29s/it]
2026-02-21T08:20:32.4472826Z WARNING:tritonbench.utils.triton_op:Running input ID 26:
2026-02-21T08:20:32.4473545Z (M, N)
2026-02-21T08:20:32.4473847Z ------------
2026-02-21T08:20:32.4474075Z (4096, 3584)
2026-02-21T08:20:32.4474445Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T08:20:33.8102030Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T08:20:35.3632561Z INFO:tritonbench.utils.triton_op:Took 2.37ms to get benchmark function for torch_compile_softmax
2026-02-21T08:20:38.6476619Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:20:38.6478106Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:20:38.6478354Z               'dtype': 'torch.float16',
2026-02-21T08:20:38.6478658Z               'shape': (4096, 3584),
2026-02-21T08:20:38.6478902Z               'stride': (3584, 1)},),
2026-02-21T08:20:38.6479237Z   'kwargs': {}}
2026-02-21T08:20:38.6495892Z INFO:tritonbench.utils.triton_op:Took 2.47ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T08:20:38.8265286Z [0s] Autotune random seed: 2134816249
2026-02-21T08:20:38.9705959Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:21:12.9628131Z [33s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None])
2026-02-21T08:21:13.2226530Z [34s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False])
2026-02-21T08:21:13.2241942Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.9 configs/s
2026-02-21T08:21:13.3914693Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:21:13.3915451Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:21:13.3916241Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:21:13.3916555Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:21:13.3916819Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:21:13.3917124Z     %cst = arith.constant dense<3584> : tensor<8x1xi32>
2026-02-21T08:21:13.3917461Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<8xf32>
2026-02-21T08:21:13.3917824Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<8xf32>
2026-02-21T08:21:13.3918136Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:21:13.3918410Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:21:13.3918717Z     %c3584_i32 = arith.constant 3584 : i32
2026-02-21T08:21:13.3918942Z     %c3584_i64 = arith.constant 3584 : i64
2026-02-21T08:21:13.3919199Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:21:13.3919634Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c3584_i32], [%c3584_i64, %c1_i64] : <f16>, <tensor<8x512xf16>>
2026-02-21T08:21:13.3920007Z     %1 = tt.get_program_id x : i32
2026-02-21T08:21:13.3920282Z     scf.for %arg2 = %1 to %c512_i32 step %c9472_i32  : i32 {
2026-02-21T08:21:13.3920590Z       %2 = arith.muli %arg2, %c8_i32 : i32
2026-02-21T08:21:13.3920893Z       %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:21:13.3921185Z       %4 = tt.splat %2 : i32 -> tensor<8xi32>
2026-02-21T08:21:13.3921617Z       %5 = arith.addi %4, %3 : tensor<8xi32>
2026-02-21T08:21:13.3921899Z       %c2048_i32 = arith.constant 2048 : i32
2026-02-21T08:21:13.3922272Z       %c2048_i32_2 = arith.constant 2048 : i32
2026-02-21T08:21:13.3922690Z       %6 = tt.descriptor_load %0[%2, %c0_i32] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:21:13.3923061Z       %7 = arith.extf %6 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:21:13.3923372Z       %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:21:13.3923629Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:21:13.3923892Z         %183 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:21:13.3924165Z         tt.reduce.return %183 : f32
2026-02-21T08:21:13.3924410Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:21:13.3924702Z       %9 = arith.truncf %8 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:21:13.3924972Z       %10 = arith.extf %9 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:21:13.3925411Z       %11 = arith.cmpf ogt, %cst_1, %10 : tensor<8xf32>
2026-02-21T08:21:13.3925691Z       %12 = arith.cmpf une, %cst_1, %cst_1 : tensor<8xf32>
2026-02-21T08:21:13.3926337Z       %13 = arith.ori %11, %12 : tensor<8xi1>
2026-02-21T08:21:13.3926654Z       %14 = arith.select %13, %cst_1, %10 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:21:13.3926944Z       %15 = arith.subf %cst_1, %14 : tensor<8xf32>
2026-02-21T08:21:13.3927405Z       %16 = tt.extern_elementwise %15 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:21:13.3927819Z       %17 = arith.mulf %cst_0, %16 : tensor<8xf32>
2026-02-21T08:21:13.3928179Z       %18 = tt.expand_dims %14 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:21:13.3928554Z       %19 = tt.broadcast %18 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:21:13.3928829Z       %20 = arith.subf %7, %19 : tensor<8x512xf32>
2026-02-21T08:21:13.3929380Z       %21 = tt.extern_elementwise %20 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:21:13.3929788Z       %22 = "tt.reduce"(%21) <{axis = 1 : i32}> ({
2026-02-21T08:21:13.3930066Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:21:13.3930670Z         %183 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:21:13.3930972Z         tt.reduce.return %183 : f32
2026-02-21T08:21:13.3931241Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:21:13.3931484Z       %23 = arith.addf %17, %22 : tensor<8xf32>
2026-02-21T08:21:13.3931836Z       %c1_i32 = arith.constant 1 : i32
2026-02-21T08:21:13.3932070Z       %24 = arith.muli %c512_i32, %c1_i32 : i32
2026-02-21T08:21:13.3932361Z       %25 = arith.addi %c0_i32, %24 : i32
2026-02-21T08:21:13.3932747Z       %26 = tt.descriptor_load %0[%2, %25] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:21:13.3933118Z       %27 = arith.extf %26 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:21:13.3933413Z       %28 = "tt.reduce"(%27) <{axis = 1 : i32}> ({
2026-02-21T08:21:13.3933674Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:21:13.3933932Z         %183 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:21:13.3934166Z         tt.reduce.return %183 : f32
2026-02-21T08:21:13.3934439Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:21:13.3934742Z       %29 = arith.truncf %28 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:21:13.3935011Z       %30 = arith.extf %29 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:21:13.3935341Z       %31 = arith.cmpf ogt, %14, %30 : tensor<8xf32>
2026-02-21T08:21:13.3935599Z       %32 = arith.cmpf une, %14, %14 : tensor<8xf32>
2026-02-21T08:21:13.3935865Z       %33 = arith.ori %31, %32 : tensor<8xi1>
2026-02-21T08:21:13.3936165Z       %34 = arith.select %33, %14, %30 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:21:13.3936464Z       %35 = arith.subf %14, %34 : tensor<8xf32>
2026-02-21T08:21:13.3936872Z       %36 = tt.extern_elementwise %35 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:21:13.3937297Z       %37 = arith.mulf %23, %36 : tensor<8xf32>
2026-02-21T08:21:13.3937625Z       %38 = tt.expand_dims %34 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:21:13.3937948Z       %39 = tt.broadcast %38 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:21:13.3938268Z       %40 = arith.subf %27, %39 : tensor<8x512xf32>
2026-02-21T08:21:13.3938707Z       %41 = tt.extern_elementwise %40 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:21:13.3939114Z       %42 = "tt.reduce"(%41) <{axis = 1 : i32}> ({
2026-02-21T08:21:13.3939386Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:21:13.3939615Z         %183 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:21:13.3939875Z         tt.reduce.return %183 : f32
2026-02-21T08:21:13.3940102Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:21:13.3940386Z       %43 = arith.addf %37, %42 : tensor<8xf32>
2026-02-21T08:21:13.3940651Z       %c2_i32 = arith.constant 2 : i32
2026-02-21T08:21:13.3940954Z       %44 = arith.muli %c512_i32, %c2_i32 : i32
2026-02-21T08:21:13.3941233Z       %45 = arith.addi %c0_i32, %44 : i32
2026-02-21T08:21:13.3941580Z       %46 = tt.descriptor_load %0[%2, %45] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:21:13.3941970Z       %47 = arith.extf %46 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:21:13.3942289Z       %48 = "tt.reduce"(%47) <{axis = 1 : i32}> ({
2026-02-21T08:21:13.3942520Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:21:13.3942783Z         %183 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:21:13.3943026Z         tt.reduce.return %183 : f32
2026-02-21T08:21:13.3943281Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:21:13.3943527Z       %49 = arith.truncf %48 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:21:13.3943875Z       %50 = arith.extf %49 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:21:13.3944168Z       %51 = arith.cmpf ogt, %34, %50 : tensor<8xf32>
2026-02-21T08:21:13.3944476Z       %52 = arith.cmpf une, %34, %34 : tensor<8xf32>
2026-02-21T08:21:13.3944796Z       %53 = arith.ori %51, %52 : tensor<8xi1>
2026-02-21T08:21:13.3945063Z       %54 = arith.select %53, %34, %50 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:21:13.3945525Z       %55 = arith.subf %34, %54 : tensor<8xf32>
2026-02-21T08:21:13.3945939Z       %56 = tt.extern_elementwise %55 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:21:13.3946362Z       %57 = arith.mulf %43, %56 : tensor<8xf32>
2026-02-21T08:21:13.3946679Z       %58 = tt.expand_dims %54 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:21:13.3947035Z       %59 = tt.broadcast %58 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:21:13.3947346Z       %60 = arith.subf %47, %59 : tensor<8x512xf32>
2026-02-21T08:21:13.3947750Z       %61 = tt.extern_elementwise %60 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:21:13.3948205Z       %62 = "tt.reduce"(%61) <{axis = 1 : i32}> ({
2026-02-21T08:21:13.3948479Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:21:13.3948699Z         %183 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:21:13.3948975Z         tt.reduce.return %183 : f32
2026-02-21T08:21:13.3949205Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:21:13.3949473Z       %63 = arith.addf %57, %62 : tensor<8xf32>
2026-02-21T08:21:13.3949709Z       %c3_i32 = arith.constant 3 : i32
2026-02-21T08:21:13.3949992Z       %64 = arith.muli %c512_i32, %c3_i32 : i32
2026-02-21T08:21:13.3950252Z       %65 = arith.addi %c0_i32, %64 : i32
2026-02-21T08:21:13.3950600Z       %66 = tt.descriptor_load %0[%2, %65] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:21:13.3950988Z       %67 = arith.extf %66 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:21:13.3951252Z       %68 = "tt.reduce"(%67) <{axis = 1 : i32}> ({
2026-02-21T08:21:13.3951578Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:21:13.3951813Z         %183 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:21:13.3952073Z         tt.reduce.return %183 : f32
2026-02-21T08:21:13.4026259Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:21:13.4026708Z       %69 = arith.truncf %68 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:21:13.4026966Z       %70 = arith.extf %69 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:21:13.4027206Z       %71 = arith.cmpf ogt, %54, %70 : tensor<8xf32>
2026-02-21T08:21:13.4027415Z       %72 = arith.cmpf une, %54, %54 : tensor<8xf32>
2026-02-21T08:21:13.4027629Z       %73 = arith.ori %71, %72 : tensor<8xi1>
2026-02-21T08:21:13.4027868Z       %74 = arith.select %73, %54, %70 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:21:13.4028102Z       %75 = arith.subf %54, %74 : tensor<8xf32>
2026-02-21T08:21:13.4028465Z       %76 = tt.extern_elementwise %75 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:21:13.4028835Z       %77 = arith.mulf %63, %76 : tensor<8xf32>
2026-02-21T08:21:13.4029319Z       %78 = tt.expand_dims %74 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:21:13.4029618Z       %79 = tt.broadcast %78 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:21:13.4029855Z       %80 = arith.subf %67, %79 : tensor<8x512xf32>
2026-02-21T08:21:13.4030267Z       %81 = tt.extern_elementwise %80 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:21:13.4030639Z       %82 = "tt.reduce"(%81) <{axis = 1 : i32}> ({
2026-02-21T08:21:13.4030830Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:21:13.4031022Z         %183 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:21:13.4031206Z         tt.reduce.return %183 : f32
2026-02-21T08:21:13.4031395Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:21:13.4031630Z       %83 = arith.addf %77, %82 : tensor<8xf32>
2026-02-21T08:21:13.4032082Z       %84:2 = scf.for %arg3 = %c2048_i32 to %c3584_i32 step %c512_i32 iter_args(%arg4 = %74, %arg5 = %83) -> (tensor<8xf32>, tensor<8xf32>)  : i32 {
2026-02-21T08:21:13.4032544Z         %183 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:21:13.4032866Z         %184 = arith.extf %183 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:21:13.4033116Z         %185 = "tt.reduce"(%184) <{axis = 1 : i32}> ({
2026-02-21T08:21:13.4033322Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:21:13.4033511Z           %201 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:21:13.4033713Z           tt.reduce.return %201 : f32
2026-02-21T08:21:13.4033899Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:21:13.4034133Z         %186 = arith.truncf %185 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:21:13.4034379Z         %187 = arith.extf %186 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:21:13.4034623Z         %188 = arith.cmpf ogt, %arg4, %187 : tensor<8xf32>
2026-02-21T08:21:13.4034852Z         %189 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32>
2026-02-21T08:21:13.4035076Z         %190 = arith.ori %188, %189 : tensor<8xi1>
2026-02-21T08:21:13.4035319Z         %191 = arith.select %190, %arg4, %187 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:21:13.4035564Z         %192 = arith.subf %arg4, %191 : tensor<8xf32>
2026-02-21T08:21:13.4035927Z         %193 = tt.extern_elementwise %192 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:21:13.4036288Z         %194 = arith.mulf %arg5, %193 : tensor<8xf32>
2026-02-21T08:21:13.4036552Z         %195 = tt.expand_dims %191 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:21:13.4036847Z         %196 = tt.broadcast %195 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:21:13.4037089Z         %197 = arith.subf %184, %196 : tensor<8x512xf32>
2026-02-21T08:21:13.4037463Z         %198 = tt.extern_elementwise %197 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:21:13.4037828Z         %199 = "tt.reduce"(%198) <{axis = 1 : i32}> ({
2026-02-21T08:21:13.4038024Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:21:13.4038205Z           %201 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:21:13.4038398Z           tt.reduce.return %201 : f32
2026-02-21T08:21:13.4038588Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:21:13.4038784Z         %200 = arith.addf %194, %199 : tensor<8xf32>
2026-02-21T08:21:13.4039007Z         scf.yield %191, %200 : tensor<8xf32>, tensor<8xf32>
2026-02-21T08:21:13.4039243Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:21:13.4039474Z       %c2048_i32_3 = arith.constant 2048 : i32
2026-02-21T08:21:13.4039659Z       %c2048_i32_4 = arith.constant 2048 : i32
2026-02-21T08:21:13.4039899Z       %85 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:21:13.4040159Z       %86 = tt.splat %c0_i32 : i32 -> tensor<512xi32>
2026-02-21T08:21:13.4040365Z       %87 = arith.addi %86, %85 : tensor<512xi32>
2026-02-21T08:21:13.4040678Z       %88 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:21:13.4040926Z       %89 = arith.muli %88, %cst : tensor<8x1xi32>
2026-02-21T08:21:13.4041194Z       %90 = tt.expand_dims %87 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:21:13.4041488Z       %91 = tt.broadcast %89 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:21:13.4041800Z       %92 = tt.broadcast %90 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:21:13.4042053Z       %93 = arith.addi %91, %92 : tensor<8x512xi32>
2026-02-21T08:21:13.4042298Z       %94 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4042597Z       %95 = tt.addptr %94, %93 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:21:13.4042908Z       %96 = tt.load %95 evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4043310Z       %97 = tt.expand_dims %84#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:21:13.4043611Z       %98 = arith.extf %96 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:21:13.4043876Z       %99 = tt.broadcast %97 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:21:13.4044120Z       %100 = arith.subf %98, %99 : tensor<8x512xf32>
2026-02-21T08:21:13.4044508Z       %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:21:13.4044947Z       %102 = tt.expand_dims %84#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:21:13.4045247Z       %103 = tt.broadcast %102 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:21:13.4045502Z       %104 = arith.divf %101, %103 : tensor<8x512xf32>
2026-02-21T08:21:13.4045756Z       %105 = arith.truncf %104 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:21:13.4046044Z       %106 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4046350Z       %107 = tt.addptr %106, %93 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:21:13.4046622Z       tt.store %107, %105 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4046847Z       %c1_i32_5 = arith.constant 1 : i32
2026-02-21T08:21:13.4047048Z       %108 = arith.muli %c512_i32, %c1_i32_5 : i32
2026-02-21T08:21:13.4047261Z       %109 = arith.addi %c0_i32, %108 : i32
2026-02-21T08:21:13.4047515Z       %110 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:21:13.4047775Z       %111 = tt.splat %109 : i32 -> tensor<512xi32>
2026-02-21T08:21:13.4048000Z       %112 = arith.addi %111, %110 : tensor<512xi32>
2026-02-21T08:21:13.4048258Z       %113 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:21:13.4048529Z       %114 = arith.muli %113, %cst : tensor<8x1xi32>
2026-02-21T08:21:13.4048801Z       %115 = tt.expand_dims %112 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:21:13.4049115Z       %116 = tt.broadcast %114 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:21:13.4049403Z       %117 = tt.broadcast %115 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:21:13.4049639Z       %118 = arith.addi %116, %117 : tensor<8x512xi32>
2026-02-21T08:21:13.4049878Z       %119 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4050153Z       %120 = tt.addptr %119, %118 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:21:13.4050459Z       %121 = tt.load %120 evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4050758Z       %122 = tt.expand_dims %84#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:21:13.4051047Z       %123 = arith.extf %121 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:21:13.4051312Z       %124 = tt.broadcast %122 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:21:13.4051588Z       %125 = arith.subf %123, %124 : tensor<8x512xf32>
2026-02-21T08:21:13.4051968Z       %126 = tt.extern_elementwise %125 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:21:13.4052437Z       %127 = tt.expand_dims %84#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:21:13.4052725Z       %128 = tt.broadcast %127 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:21:13.4052965Z       %129 = arith.divf %126, %128 : tensor<8x512xf32>
2026-02-21T08:21:13.4053200Z       %130 = arith.truncf %129 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:21:13.4053476Z       %131 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4053750Z       %132 = tt.addptr %131, %118 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:21:13.4054019Z       tt.store %132, %130 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4054219Z       %c2_i32_6 = arith.constant 2 : i32
2026-02-21T08:21:13.4054415Z       %133 = arith.muli %c512_i32, %c2_i32_6 : i32
2026-02-21T08:21:13.4054614Z       %134 = arith.addi %c0_i32, %133 : i32
2026-02-21T08:21:13.4054901Z       %135 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:21:13.4055159Z       %136 = tt.splat %134 : i32 -> tensor<512xi32>
2026-02-21T08:21:13.4055363Z       %137 = arith.addi %136, %135 : tensor<512xi32>
2026-02-21T08:21:13.4055616Z       %138 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:21:13.4055870Z       %139 = arith.muli %138, %cst : tensor<8x1xi32>
2026-02-21T08:21:13.4056137Z       %140 = tt.expand_dims %137 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:21:13.4056434Z       %141 = tt.broadcast %139 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:21:13.4056697Z       %142 = tt.broadcast %140 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:21:13.4056940Z       %143 = arith.addi %141, %142 : tensor<8x512xi32>
2026-02-21T08:21:13.4057174Z       %144 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4057457Z       %145 = tt.addptr %144, %143 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:21:13.4057759Z       %146 = tt.load %145 evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4058071Z       %147 = tt.expand_dims %84#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:21:13.4058358Z       %148 = arith.extf %146 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:21:13.4058612Z       %149 = tt.broadcast %147 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:21:13.4058852Z       %150 = arith.subf %148, %149 : tensor<8x512xf32>
2026-02-21T08:21:13.4059217Z       %151 = tt.extern_elementwise %150 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:21:13.4059634Z       %152 = tt.expand_dims %84#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:21:13.4059925Z       %153 = tt.broadcast %152 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:21:13.4060155Z       %154 = arith.divf %151, %153 : tensor<8x512xf32>
2026-02-21T08:21:13.4060393Z       %155 = arith.truncf %154 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:21:13.4060656Z       %156 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4060940Z       %157 = tt.addptr %156, %143 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:21:13.4061191Z       tt.store %157, %155 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4061398Z       %c3_i32_7 = arith.constant 3 : i32
2026-02-21T08:21:13.4061617Z       %158 = arith.muli %c512_i32, %c3_i32_7 : i32
2026-02-21T08:21:13.4061808Z       %159 = arith.addi %c0_i32, %158 : i32
2026-02-21T08:21:13.4062049Z       %160 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:21:13.4062301Z       %161 = tt.splat %159 : i32 -> tensor<512xi32>
2026-02-21T08:21:13.4062513Z       %162 = arith.addi %161, %160 : tensor<512xi32>
2026-02-21T08:21:13.4062757Z       %163 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:21:13.4063020Z       %164 = arith.muli %163, %cst : tensor<8x1xi32>
2026-02-21T08:21:13.4063346Z       %165 = tt.expand_dims %162 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:21:13.4063630Z       %166 = tt.broadcast %164 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:21:13.4063893Z       %167 = tt.broadcast %165 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:21:13.4064131Z       %168 = arith.addi %166, %167 : tensor<8x512xi32>
2026-02-21T08:21:13.4064358Z       %169 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4064636Z       %170 = tt.addptr %169, %168 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:21:13.4064930Z       %171 = tt.load %170 evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4065233Z       %172 = tt.expand_dims %84#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:21:13.4065513Z       %173 = arith.extf %171 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:21:13.4065807Z       %174 = tt.broadcast %172 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:21:13.4066046Z       %175 = arith.subf %173, %174 : tensor<8x512xf32>
2026-02-21T08:21:13.4066403Z       %176 = tt.extern_elementwise %175 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:21:13.4066814Z       %177 = tt.expand_dims %84#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:21:13.4067098Z       %178 = tt.broadcast %177 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:21:13.4067323Z       %179 = arith.divf %176, %178 : tensor<8x512xf32>
2026-02-21T08:21:13.4067561Z       %180 = arith.truncf %179 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:21:13.4067821Z       %181 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4068098Z       %182 = tt.addptr %181, %168 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:21:13.4068352Z       tt.store %182, %180 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4068598Z       scf.for %arg3 = %c2048_i32_3 to %c3584_i32 step %c512_i32  : i32 {
2026-02-21T08:21:13.4068880Z         %183 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:21:13.4069126Z         %184 = tt.splat %arg3 : i32 -> tensor<512xi32>
2026-02-21T08:21:13.4069334Z         %185 = arith.addi %184, %183 : tensor<512xi32>
2026-02-21T08:21:13.4069575Z         %186 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:21:13.4069832Z         %187 = arith.muli %186, %cst : tensor<8x1xi32>
2026-02-21T08:21:13.4070095Z         %188 = tt.expand_dims %185 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:21:13.4070386Z         %189 = tt.broadcast %187 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:21:13.4070654Z         %190 = tt.broadcast %188 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:21:13.4070890Z         %191 = arith.addi %189, %190 : tensor<8x512xi32>
2026-02-21T08:21:13.4071133Z         %192 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4071415Z         %193 = tt.addptr %192, %191 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:21:13.4071883Z         %194 = tt.load %193 evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4072203Z         %195 = tt.expand_dims %84#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:21:13.4072490Z         %196 = arith.extf %194 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:21:13.4072757Z         %197 = tt.broadcast %195 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:21:13.4072991Z         %198 = arith.subf %196, %197 : tensor<8x512xf32>
2026-02-21T08:21:13.4073371Z         %199 = tt.extern_elementwise %198 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:21:13.4073799Z         %200 = tt.expand_dims %84#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:21:13.4074144Z         %201 = tt.broadcast %200 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:21:13.4074386Z         %202 = arith.divf %199, %201 : tensor<8x512xf32>
2026-02-21T08:21:13.4074622Z         %203 = arith.truncf %202 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:21:13.4074894Z         %204 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4075173Z         %205 = tt.addptr %204, %191 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:21:13.4075439Z         tt.store %205, %203 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:21:13.4075679Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:21:13.4075965Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:21:13.4076224Z     tt.return
2026-02-21T08:21:13.4076358Z   }
2026-02-21T08:21:13.4076493Z }
2026-02-21T08:21:13.4076561Z 
2026-02-21T08:21:13.4076611Z {-#
2026-02-21T08:21:13.4076745Z   external_resources: {
2026-02-21T08:21:13.4076950Z     mlir_reproducer: {
2026-02-21T08:21:13.4081212Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:21:13.4085647Z       disable_threading: false,
2026-02-21T08:21:13.4085824Z       verify_each: true
2026-02-21T08:21:13.4085990Z     }
2026-02-21T08:21:13.4086138Z   }
2026-02-21T08:21:13.4086291Z #-}
2026-02-21T08:21:13.4086849Z /tmp/torchinductor_root/gh/cghtk5hmlnh2ptfqo2is6rzrnewaeoxnr36ykwemcetkbrwezaxq.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:21:13.4088270Z /tmp/torchinductor_root/gh/cghtk5hmlnh2ptfqo2is6rzrnewaeoxnr36ykwemcetkbrwezaxq.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:21:13.4089385Z [34s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:21:13.4090605Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 512], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'last'], maxnreg=32, num_sm_multiplier=64, num_stages=7, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[4, 4], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:21:13.4091785Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:21:13.4092089Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:21:16.5269678Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:21:16.5274738Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:21:16.5275716Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:21:16.5275925Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:21:16.5276161Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:21:16.5279266Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:21:16.5279545Z     %cst = arith.constant dense<3584> : tensor<32x1xi32>
2026-02-21T08:21:16.5280151Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<32xf32>
2026-02-21T08:21:16.5280452Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<32xf32>
2026-02-21T08:21:16.5280674Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:21:16.5280866Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:21:16.5281047Z     %c3584_i32 = arith.constant 3584 : i32
2026-02-21T08:21:16.5281231Z     %c3584_i64 = arith.constant 3584 : i64
2026-02-21T08:21:16.5281407Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:21:16.5282220Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c3584_i32], [%c3584_i64, %c1_i64] : <f16>, <tensor<32x8xf16>>
2026-02-21T08:21:16.5282664Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c3584_i32], [%c3584_i64, %c1_i64] : <f16>, <tensor<32x8xf16>>
2026-02-21T08:21:16.5282980Z     %2 = tt.get_program_id x : i32
2026-02-21T08:21:16.5283177Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:21:16.5283360Z     %4 = arith.minsi %3, %c128_i32 : i32
2026-02-21T08:21:16.5283575Z     scf.for %arg2 = %2 to %4 step %c1_i32  : i32 {
2026-02-21T08:21:16.5283784Z       %5 = arith.muli %arg2, %c32_i32 : i32
2026-02-21T08:21:16.5284021Z       %6 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32>
2026-02-21T08:21:16.5284279Z       %7 = tt.splat %5 : i32 -> tensor<32xi32>
2026-02-21T08:21:16.5284468Z       %8 = arith.addi %7, %6 : tensor<32xi32>
2026-02-21T08:21:16.5284659Z       %c3576_i32 = arith.constant 3576 : i32
2026-02-21T08:21:16.5284843Z       %c24_i32 = arith.constant 24 : i32
2026-02-21T08:21:16.5285209Z       %9:2 = scf.for %arg3 = %c0_i32 to %c3576_i32 step %c24_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<32xf32>, tensor<32xf32>)  : i32 {
2026-02-21T08:21:16.5285614Z         %49 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:21:16.5285867Z         %50 = tt.splat %arg3 : i32 -> tensor<8xi32>
2026-02-21T08:21:16.5286076Z         %51 = arith.addi %50, %49 : tensor<8xi32>
2026-02-21T08:21:16.5286325Z         %52 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:21:16.5286595Z         %53 = arith.muli %52, %cst : tensor<32x1xi32>
2026-02-21T08:21:16.5286843Z         %54 = tt.expand_dims %51 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:21:16.5287128Z         %55 = tt.broadcast %53 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:21:16.5287389Z         %56 = tt.broadcast %54 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:21:16.5287615Z         %57 = arith.addi %55, %56 : tensor<32x8xi32>
2026-02-21T08:21:16.5287896Z         %58 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:21:16.5288171Z         %59 = tt.addptr %58, %57 : tensor<32x8x!tt.ptr<f16>>, tensor<32x8xi32>
2026-02-21T08:21:16.5288460Z         %60 = tt.load %59 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:21:16.5288743Z         %61 = arith.extf %60 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:21:16.5288973Z         %62 = "tt.reduce"(%61) <{axis = 1 : i32}> ({
2026-02-21T08:21:16.5289165Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:21:16.5289520Z           %140 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:21:16.5289716Z           tt.reduce.return %140 : f32
2026-02-21T08:21:16.5289909Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:21:16.5290130Z         %63 = arith.truncf %62 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:21:16.5290379Z         %64 = arith.extf %63 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:21:16.5290607Z         %65 = arith.cmpf ogt, %arg4, %64 : tensor<32xf32>
2026-02-21T08:21:16.5290838Z         %66 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32>
2026-02-21T08:21:16.5291054Z         %67 = arith.ori %65, %66 : tensor<32xi1>
2026-02-21T08:21:16.5291285Z         %68 = arith.select %67, %arg4, %64 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:21:16.5291531Z         %69 = arith.subf %arg4, %68 : tensor<32xf32>
2026-02-21T08:21:16.5292036Z         %70 = tt.extern_elementwise %69 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:21:16.5292410Z         %71 = arith.mulf %arg5, %70 : tensor<32xf32>
2026-02-21T08:21:16.5292674Z         %72 = tt.expand_dims %68 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:16.5292961Z         %73 = tt.broadcast %72 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:21:16.5293189Z         %74 = arith.subf %61, %73 : tensor<32x8xf32>
2026-02-21T08:21:16.5293540Z         %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:21:16.5293906Z         %76 = "tt.reduce"(%75) <{axis = 1 : i32}> ({
2026-02-21T08:21:16.5294099Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:21:16.5294296Z           %140 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:21:16.5294498Z           tt.reduce.return %140 : f32
2026-02-21T08:21:16.5294682Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:21:16.5294886Z         %77 = arith.addf %71, %76 : tensor<32xf32>
2026-02-21T08:21:16.5295080Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T08:21:16.5295274Z         %78 = arith.muli %c8_i32, %c1_i32_4 : i32
2026-02-21T08:21:16.5295461Z         %79 = arith.addi %arg3, %78 : i32
2026-02-21T08:21:16.5295688Z         %80 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:21:16.5295925Z         %81 = tt.splat %79 : i32 -> tensor<8xi32>
2026-02-21T08:21:16.5296121Z         %82 = arith.addi %81, %80 : tensor<8xi32>
2026-02-21T08:21:16.5296371Z         %83 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:21:16.5296626Z         %84 = arith.muli %83, %cst : tensor<32x1xi32>
2026-02-21T08:21:16.5296878Z         %85 = tt.expand_dims %82 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:21:16.5297150Z         %86 = tt.broadcast %84 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:21:16.5297407Z         %87 = tt.broadcast %85 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:21:16.5297644Z         %88 = arith.addi %86, %87 : tensor<32x8xi32>
2026-02-21T08:21:16.5297873Z         %89 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:21:16.5298142Z         %90 = tt.addptr %89, %88 : tensor<32x8x!tt.ptr<f16>>, tensor<32x8xi32>
2026-02-21T08:21:16.5298425Z         %91 = tt.load %90 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:21:16.5298708Z         %92 = arith.extf %91 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:21:16.5298923Z         %93 = "tt.reduce"(%92) <{axis = 1 : i32}> ({
2026-02-21T08:21:16.5299114Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:21:16.5299301Z           %140 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:21:16.5299486Z           tt.reduce.return %140 : f32
2026-02-21T08:21:16.5299671Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:21:16.5299886Z         %94 = arith.truncf %93 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:21:16.5300130Z         %95 = arith.extf %94 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:21:16.5300419Z         %96 = arith.cmpf ogt, %68, %95 : tensor<32xf32>
2026-02-21T08:21:16.5300639Z         %97 = arith.cmpf une, %68, %68 : tensor<32xf32>
2026-02-21T08:21:16.5300836Z         %98 = arith.ori %96, %97 : tensor<32xi1>
2026-02-21T08:21:16.5301064Z         %99 = arith.select %98, %68, %95 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:21:16.5301304Z         %100 = arith.subf %68, %99 : tensor<32xf32>
2026-02-21T08:21:16.5301696Z         %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:21:16.5302068Z         %102 = arith.mulf %77, %101 : tensor<32xf32>
2026-02-21T08:21:16.5302318Z         %103 = tt.expand_dims %99 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:16.5302612Z         %104 = tt.broadcast %103 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:21:16.5302852Z         %105 = arith.subf %92, %104 : tensor<32x8xf32>
2026-02-21T08:21:16.5303278Z         %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:21:16.5303656Z         %107 = "tt.reduce"(%106) <{axis = 1 : i32}> ({
2026-02-21T08:21:16.5303850Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:21:16.5304046Z           %140 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:21:16.5304240Z           tt.reduce.return %140 : f32
2026-02-21T08:21:16.5304439Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:21:16.5304653Z         %108 = arith.addf %102, %107 : tensor<32xf32>
2026-02-21T08:21:16.5304858Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:21:16.5305060Z         %109 = arith.muli %c8_i32, %c2_i32 : i32
2026-02-21T08:21:16.5305251Z         %110 = arith.addi %arg3, %109 : i32
2026-02-21T08:21:16.5305493Z         %111 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:21:16.5305746Z         %112 = tt.splat %110 : i32 -> tensor<8xi32>
2026-02-21T08:21:16.5305961Z         %113 = arith.addi %112, %111 : tensor<8xi32>
2026-02-21T08:21:16.5306222Z         %114 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:21:16.5306494Z         %115 = arith.muli %114, %cst : tensor<32x1xi32>
2026-02-21T08:21:16.5306760Z         %116 = tt.expand_dims %113 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:21:16.5307057Z         %117 = tt.broadcast %115 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:21:16.5307333Z         %118 = tt.broadcast %116 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:21:16.5307574Z         %119 = arith.addi %117, %118 : tensor<32x8xi32>
2026-02-21T08:21:16.5307823Z         %120 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:21:16.5308117Z         %121 = tt.addptr %120, %119 : tensor<32x8x!tt.ptr<f16>>, tensor<32x8xi32>
2026-02-21T08:21:16.5308426Z         %122 = tt.load %121 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:21:16.5308733Z         %123 = arith.extf %122 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:21:16.5308968Z         %124 = "tt.reduce"(%123) <{axis = 1 : i32}> ({
2026-02-21T08:21:16.5309170Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:21:16.5309357Z           %140 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:21:16.5309559Z           tt.reduce.return %140 : f32
2026-02-21T08:21:16.5309754Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:21:16.5309982Z         %125 = arith.truncf %124 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:21:16.5310243Z         %126 = arith.extf %125 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:21:16.5310480Z         %127 = arith.cmpf ogt, %99, %126 : tensor<32xf32>
2026-02-21T08:21:16.5310710Z         %128 = arith.cmpf une, %99, %99 : tensor<32xf32>
2026-02-21T08:21:16.5310917Z         %129 = arith.ori %127, %128 : tensor<32xi1>
2026-02-21T08:21:16.5311165Z         %130 = arith.select %129, %99, %126 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:21:16.5311466Z         %131 = arith.subf %99, %130 : tensor<32xf32>
2026-02-21T08:21:16.5311858Z         %132 = tt.extern_elementwise %131 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:21:16.5312225Z         %133 = arith.mulf %108, %132 : tensor<32xf32>
2026-02-21T08:21:16.5312480Z         %134 = tt.expand_dims %130 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:16.5312779Z         %135 = tt.broadcast %134 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:21:16.5313015Z         %136 = arith.subf %123, %135 : tensor<32x8xf32>
2026-02-21T08:21:16.5313378Z         %137 = tt.extern_elementwise %136 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:21:16.5313747Z         %138 = "tt.reduce"(%137) <{axis = 1 : i32}> ({
2026-02-21T08:21:16.5313935Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:21:16.5314177Z           %140 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:21:16.5314366Z           tt.reduce.return %140 : f32
2026-02-21T08:21:16.5314554Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:21:16.5314747Z         %139 = arith.addf %133, %138 : tensor<32xf32>
2026-02-21T08:21:16.5314969Z         scf.yield %130, %139 : tensor<32xf32>, tensor<32xf32>
2026-02-21T08:21:16.5315217Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:21:16.5315475Z       %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:21:16.5315728Z       %11 = tt.splat %c3576_i32 : i32 -> tensor<8xi32>
2026-02-21T08:21:16.5315927Z       %12 = arith.addi %11, %10 : tensor<8xi32>
2026-02-21T08:21:16.5316175Z       %13 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:21:16.5316429Z       %14 = arith.muli %13, %cst : tensor<32x1xi32>
2026-02-21T08:21:16.5316681Z       %15 = tt.expand_dims %12 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:21:16.5316972Z       %16 = tt.broadcast %14 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:21:16.5317228Z       %17 = tt.broadcast %15 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:21:16.5317464Z       %18 = arith.addi %16, %17 : tensor<32x8xi32>
2026-02-21T08:21:16.5317695Z       %19 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:21:16.5317968Z       %20 = tt.addptr %19, %18 : tensor<32x8x!tt.ptr<f16>>, tensor<32x8xi32>
2026-02-21T08:21:16.5318260Z       %21 = tt.load %20 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:21:16.5318533Z       %22 = arith.extf %21 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:21:16.5318757Z       %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({
2026-02-21T08:21:16.5318946Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:21:16.5319134Z         %49 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:21:16.5319318Z         tt.reduce.return %49 : f32
2026-02-21T08:21:16.5319509Z       }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:21:16.5319725Z       %24 = arith.truncf %23 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:21:16.5319967Z       %25 = arith.extf %24 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:21:16.5320192Z       %26 = arith.cmpf ogt, %9#0, %25 : tensor<32xf32>
2026-02-21T08:21:16.5320404Z       %27 = arith.cmpf une, %9#0, %9#0 : tensor<32xf32>
2026-02-21T08:21:16.5320611Z       %28 = arith.ori %26, %27 : tensor<32xi1>
2026-02-21T08:21:16.5320835Z       %29 = arith.select %28, %9#0, %25 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:21:16.5321070Z       %30 = arith.subf %9#0, %29 : tensor<32xf32>
2026-02-21T08:21:16.5321415Z       %31 = tt.extern_elementwise %30 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:21:16.5321812Z       %32 = arith.mulf %9#1, %31 : tensor<32xf32>
2026-02-21T08:21:16.5322083Z       %33 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:16.5322376Z       %34 = tt.broadcast %33 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:21:16.5322717Z       %35 = arith.subf %22, %34 : tensor<32x8xf32>
2026-02-21T08:21:16.5323081Z       %36 = tt.extern_elementwise %35 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:21:16.5323460Z       %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({
2026-02-21T08:21:16.5323669Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:21:16.5323855Z         %49 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:21:16.5324077Z         tt.reduce.return %49 : f32
2026-02-21T08:21:16.5324286Z       }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:21:16.5324508Z       %38 = arith.addf %32, %37 : tensor<32xf32>
2026-02-21T08:21:16.5324718Z       %c3576_i32_2 = arith.constant 3576 : i32
2026-02-21T08:21:16.5324925Z       %c24_i32_3 = arith.constant 24 : i32
2026-02-21T08:21:16.5325164Z       scf.for %arg3 = %c0_i32 to %c3576_i32_2 step %c24_i32_3  : i32 {
2026-02-21T08:21:16.5325566Z         %49 = tt.descriptor_load %0[%5, %arg3] : !tt.tensordesc<tensor<32x8xf16>> -> tensor<32x8xf16>
2026-02-21T08:21:16.5325921Z         %50 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:16.5326217Z         %51 = arith.extf %49 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:21:16.5326481Z         %52 = tt.broadcast %50 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:21:16.5326715Z         %53 = arith.subf %51, %52 : tensor<32x8xf32>
2026-02-21T08:21:16.5327094Z         %54 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:21:16.5327513Z         %55 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:16.5327813Z         %56 = tt.broadcast %55 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:21:16.5328046Z         %57 = arith.divf %54, %56 : tensor<32x8xf32>
2026-02-21T08:21:16.5328287Z         %58 = arith.truncf %57 : tensor<32x8xf32> to tensor<32x8xf16>
2026-02-21T08:21:16.5328617Z         tt.descriptor_store %1[%5, %arg3], %58 : !tt.tensordesc<tensor<32x8xf16>>, tensor<32x8xf16>
2026-02-21T08:21:16.5328907Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T08:21:16.5329113Z         %59 = arith.muli %c8_i32, %c1_i32_4 : i32
2026-02-21T08:21:16.5329311Z         %60 = arith.addi %arg3, %59 : i32
2026-02-21T08:21:16.5329593Z         %61 = tt.descriptor_load %0[%5, %60] : !tt.tensordesc<tensor<32x8xf16>> -> tensor<32x8xf16>
2026-02-21T08:21:16.5329947Z         %62 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:16.5330219Z         %63 = arith.extf %61 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:21:16.5330472Z         %64 = tt.broadcast %62 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:21:16.5330693Z         %65 = arith.subf %63, %64 : tensor<32x8xf32>
2026-02-21T08:21:16.5331050Z         %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:21:16.5331449Z         %67 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:16.5331769Z         %68 = tt.broadcast %67 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:21:16.5331996Z         %69 = arith.divf %66, %68 : tensor<32x8xf32>
2026-02-21T08:21:16.5332216Z         %70 = arith.truncf %69 : tensor<32x8xf32> to tensor<32x8xf16>
2026-02-21T08:21:16.5332520Z         tt.descriptor_store %1[%5, %60], %70 : !tt.tensordesc<tensor<32x8xf16>>, tensor<32x8xf16>
2026-02-21T08:21:16.5332791Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:21:16.5332982Z         %71 = arith.muli %c8_i32, %c2_i32 : i32
2026-02-21T08:21:16.5333167Z         %72 = arith.addi %arg3, %71 : i32
2026-02-21T08:21:16.5333429Z         %73 = tt.descriptor_load %0[%5, %72] : !tt.tensordesc<tensor<32x8xf16>> -> tensor<32x8xf16>
2026-02-21T08:21:16.5333763Z         %74 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:16.5334090Z         %75 = arith.extf %73 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:21:16.5334338Z         %76 = tt.broadcast %74 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:21:16.5334561Z         %77 = arith.subf %75, %76 : tensor<32x8xf32>
2026-02-21T08:21:16.5334917Z         %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:21:16.5335317Z         %79 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:16.5335590Z         %80 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:21:16.5335819Z         %81 = arith.divf %78, %80 : tensor<32x8xf32>
2026-02-21T08:21:16.5336041Z         %82 = arith.truncf %81 : tensor<32x8xf32> to tensor<32x8xf16>
2026-02-21T08:21:16.5336339Z         tt.descriptor_store %1[%5, %72], %82 : !tt.tensordesc<tensor<32x8xf16>>, tensor<32x8xf16>
2026-02-21T08:21:16.5336702Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:21:16.5337034Z       %39 = tt.descriptor_load %0[%5, %c3576_i32_2] : !tt.tensordesc<tensor<32x8xf16>> -> tensor<32x8xf16>
2026-02-21T08:21:16.5337389Z       %40 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:16.5337670Z       %41 = arith.extf %39 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:21:16.5337932Z       %42 = tt.broadcast %40 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:21:16.5338160Z       %43 = arith.subf %41, %42 : tensor<32x8xf32>
2026-02-21T08:21:16.5338516Z       %44 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:21:16.5338926Z       %45 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:16.5339207Z       %46 = tt.broadcast %45 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:21:16.5339448Z       %47 = arith.divf %44, %46 : tensor<32x8xf32>
2026-02-21T08:21:16.5339672Z       %48 = arith.truncf %47 : tensor<32x8xf32> to tensor<32x8xf16>
2026-02-21T08:21:16.5339998Z       tt.descriptor_store %1[%5, %c3576_i32_2], %48 : !tt.tensordesc<tensor<32x8xf16>>, tensor<32x8xf16>
2026-02-21T08:21:16.5340302Z     } {tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:21:16.5340504Z     tt.return
2026-02-21T08:21:16.5340638Z   }
2026-02-21T08:21:16.5340758Z }
2026-02-21T08:21:16.5340827Z 
2026-02-21T08:21:16.5340884Z {-#
2026-02-21T08:21:16.5341012Z   external_resources: {
2026-02-21T08:21:16.5341177Z     mlir_reproducer: {
2026-02-21T08:21:16.5345600Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:21:16.5350104Z       disable_threading: false,
2026-02-21T08:21:16.5350270Z       verify_each: true
2026-02-21T08:21:16.5350422Z     }
2026-02-21T08:21:16.5350536Z   }
2026-02-21T08:21:16.5350652Z #-}
2026-02-21T08:21:16.5351081Z /tmp/torchinductor_root/67/c6752cyk4ba67x52vcddptv4uiempdrdll4dxehbdy4w7nhlfag6.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:21:16.5352351Z /tmp/torchinductor_root/67/c6752cyk4ba67x52vcddptv4uiempdrdll4dxehbdy4w7nhlfag6.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:21:16.5353338Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:21:16.5354435Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 8], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[2, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:21:16.5355425Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:21:16.5355683Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:21:18.2043372Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:21:18.2048600Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:21:18.2050368Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:21:18.2050659Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:21:18.2057245Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:21:18.2059189Z     %cst = arith.constant dense<3584> : tensor<32x1xi32>
2026-02-21T08:21:18.2059480Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<32xf32>
2026-02-21T08:21:18.2059733Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<32xf32>
2026-02-21T08:21:18.2059963Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:21:18.2060158Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:21:18.2060345Z     %c3584_i32 = arith.constant 3584 : i32
2026-02-21T08:21:18.2060529Z     %c3584_i64 = arith.constant 3584 : i64
2026-02-21T08:21:18.2060708Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:21:18.2061045Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c3584_i32], [%c3584_i64, %c1_i64] : <f16>, <tensor<32x32xf16>>
2026-02-21T08:21:18.2061373Z     %1 = tt.get_program_id x : i32
2026-02-21T08:21:18.2061867Z     scf.for %arg2 = %1 to %c128_i32 step %c9472_i32  : i32 {
2026-02-21T08:21:18.2062093Z       %2 = arith.muli %arg2, %c32_i32 : i32
2026-02-21T08:21:18.2062337Z       %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32>
2026-02-21T08:21:18.2062598Z       %4 = tt.splat %2 : i32 -> tensor<32xi32>
2026-02-21T08:21:18.2062791Z       %5 = arith.addi %4, %3 : tensor<32xi32>
2026-02-21T08:21:18.2062991Z       %c3552_i32 = arith.constant 3552 : i32
2026-02-21T08:21:18.2063174Z       %c96_i32 = arith.constant 96 : i32
2026-02-21T08:21:18.2063547Z       %6:2 = scf.for %arg3 = %c0_i32 to %c3552_i32 step %c96_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<32xf32>, tensor<32xf32>)  : i32 {
2026-02-21T08:21:18.2064012Z         %47 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:21:18.2064674Z         %48 = arith.extf %47 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:21:18.2064917Z         %49 = "tt.reduce"(%48) <{axis = 1 : i32}> ({
2026-02-21T08:21:18.2065112Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:21:18.2065310Z           %105 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:21:18.2065501Z           tt.reduce.return %105 : f32
2026-02-21T08:21:18.2065691Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:21:18.2065913Z         %50 = arith.truncf %49 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:21:18.2066165Z         %51 = arith.extf %50 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:21:18.2066404Z         %52 = arith.cmpf ogt, %arg4, %51 : tensor<32xf32>
2026-02-21T08:21:18.2066652Z         %53 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32>
2026-02-21T08:21:18.2066884Z         %54 = arith.ori %52, %53 : tensor<32xi1>
2026-02-21T08:21:18.2067225Z         %55 = arith.select %54, %arg4, %51 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:21:18.2067493Z         %56 = arith.subf %arg4, %55 : tensor<32xf32>
2026-02-21T08:21:18.2067870Z         %57 = tt.extern_elementwise %56 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:21:18.2068309Z         %58 = arith.mulf %arg5, %57 : tensor<32xf32>
2026-02-21T08:21:18.2068578Z         %59 = tt.expand_dims %55 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:18.2068891Z         %60 = tt.broadcast %59 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:21:18.2069144Z         %61 = arith.subf %48, %60 : tensor<32x32xf32>
2026-02-21T08:21:18.2069527Z         %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:21:18.2069912Z         %63 = "tt.reduce"(%62) <{axis = 1 : i32}> ({
2026-02-21T08:21:18.2070109Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:21:18.2070305Z           %105 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:21:18.2070502Z           tt.reduce.return %105 : f32
2026-02-21T08:21:18.2070707Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:21:18.2070913Z         %64 = arith.addf %58, %63 : tensor<32xf32>
2026-02-21T08:21:18.2071122Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:21:18.2071318Z         %65 = arith.muli %c32_i32, %c1_i32 : i32
2026-02-21T08:21:18.2071516Z         %66 = arith.addi %arg3, %65 : i32
2026-02-21T08:21:18.2071851Z         %67 = tt.descriptor_load %0[%2, %66] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:21:18.2072172Z         %68 = arith.extf %67 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:21:18.2072417Z         %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({
2026-02-21T08:21:18.2072610Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:21:18.2072807Z           %105 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:21:18.2073004Z           tt.reduce.return %105 : f32
2026-02-21T08:21:18.2073209Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:21:18.2073445Z         %70 = arith.truncf %69 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:21:18.2073695Z         %71 = arith.extf %70 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:21:18.2073940Z         %72 = arith.cmpf ogt, %55, %71 : tensor<32xf32>
2026-02-21T08:21:18.2074162Z         %73 = arith.cmpf une, %55, %55 : tensor<32xf32>
2026-02-21T08:21:18.2074378Z         %74 = arith.ori %72, %73 : tensor<32xi1>
2026-02-21T08:21:18.2074616Z         %75 = arith.select %74, %55, %71 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:21:18.2074863Z         %76 = arith.subf %55, %75 : tensor<32xf32>
2026-02-21T08:21:18.2075237Z         %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:21:18.2075583Z         %78 = arith.mulf %64, %77 : tensor<32xf32>
2026-02-21T08:21:18.2075833Z         %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:18.2076198Z         %80 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:21:18.2076431Z         %81 = arith.subf %68, %80 : tensor<32x32xf32>
2026-02-21T08:21:18.2076783Z         %82 = tt.extern_elementwise %81 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:21:18.2077138Z         %83 = "tt.reduce"(%82) <{axis = 1 : i32}> ({
2026-02-21T08:21:18.2077330Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:21:18.2077505Z           %105 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:21:18.2077691Z           tt.reduce.return %105 : f32
2026-02-21T08:21:18.2077870Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:21:18.2078067Z         %84 = arith.addf %78, %83 : tensor<32xf32>
2026-02-21T08:21:18.2078254Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:21:18.2078446Z         %85 = arith.muli %c32_i32, %c2_i32 : i32
2026-02-21T08:21:18.2078698Z         %86 = arith.addi %arg3, %85 : i32
2026-02-21T08:21:18.2078967Z         %87 = tt.descriptor_load %0[%2, %86] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:21:18.2079280Z         %88 = arith.extf %87 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:21:18.2079505Z         %89 = "tt.reduce"(%88) <{axis = 1 : i32}> ({
2026-02-21T08:21:18.2079695Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:21:18.2079875Z           %105 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:21:18.2080070Z           tt.reduce.return %105 : f32
2026-02-21T08:21:18.2080255Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:21:18.2080473Z         %90 = arith.truncf %89 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:21:18.2080721Z         %91 = arith.extf %90 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:21:18.2080944Z         %92 = arith.cmpf ogt, %75, %91 : tensor<32xf32>
2026-02-21T08:21:18.2081167Z         %93 = arith.cmpf une, %75, %75 : tensor<32xf32>
2026-02-21T08:21:18.2081372Z         %94 = arith.ori %92, %93 : tensor<32xi1>
2026-02-21T08:21:18.2081655Z         %95 = arith.select %94, %75, %91 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:21:18.2081889Z         %96 = arith.subf %75, %95 : tensor<32xf32>
2026-02-21T08:21:18.2082236Z         %97 = tt.extern_elementwise %96 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:21:18.2082595Z         %98 = arith.mulf %84, %97 : tensor<32xf32>
2026-02-21T08:21:18.2082844Z         %99 = tt.expand_dims %95 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:18.2083145Z         %100 = tt.broadcast %99 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:21:18.2083385Z         %101 = arith.subf %88, %100 : tensor<32x32xf32>
2026-02-21T08:21:18.2083765Z         %102 = tt.extern_elementwise %101 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:21:18.2084149Z         %103 = "tt.reduce"(%102) <{axis = 1 : i32}> ({
2026-02-21T08:21:18.2084344Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:21:18.2084530Z           %105 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:21:18.2084715Z           tt.reduce.return %105 : f32
2026-02-21T08:21:18.2084905Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:21:18.2085104Z         %104 = arith.addf %98, %103 : tensor<32xf32>
2026-02-21T08:21:18.2085350Z         scf.yield %95, %104 : tensor<32xf32>, tensor<32xf32>
2026-02-21T08:21:18.2085568Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:21:18.2085866Z       %7 = tt.descriptor_load %0[%2, %c3552_i32] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:21:18.2086181Z       %8 = arith.extf %7 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:21:18.2086408Z       %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({
2026-02-21T08:21:18.2086598Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:21:18.2086781Z         %47 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:21:18.2086976Z         tt.reduce.return %47 : f32
2026-02-21T08:21:18.2087217Z       }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:21:18.2087442Z       %10 = arith.truncf %9 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:21:18.2087678Z       %11 = arith.extf %10 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:21:18.2087908Z       %12 = arith.cmpf ogt, %6#0, %11 : tensor<32xf32>
2026-02-21T08:21:18.2088124Z       %13 = arith.cmpf une, %6#0, %6#0 : tensor<32xf32>
2026-02-21T08:21:18.2088325Z       %14 = arith.ori %12, %13 : tensor<32xi1>
2026-02-21T08:21:18.2088555Z       %15 = arith.select %14, %6#0, %11 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:21:18.2088786Z       %16 = arith.subf %6#0, %15 : tensor<32xf32>
2026-02-21T08:21:18.2089137Z       %17 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:21:18.2089488Z       %18 = arith.mulf %6#1, %17 : tensor<32xf32>
2026-02-21T08:21:18.2089801Z       %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:18.2090099Z       %20 = tt.broadcast %19 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:21:18.2090321Z       %21 = arith.subf %8, %20 : tensor<32x32xf32>
2026-02-21T08:21:18.2090677Z       %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:21:18.2091028Z       %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({
2026-02-21T08:21:18.2091222Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:21:18.2091396Z         %47 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:21:18.2091613Z         tt.reduce.return %47 : f32
2026-02-21T08:21:18.2091800Z       }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:21:18.2091989Z       %24 = arith.addf %18, %23 : tensor<32xf32>
2026-02-21T08:21:18.2092184Z       %c3552_i32_2 = arith.constant 3552 : i32
2026-02-21T08:21:18.2092367Z       %c96_i32_3 = arith.constant 96 : i32
2026-02-21T08:21:18.2092600Z       scf.for %arg3 = %c0_i32 to %c3552_i32_2 step %c96_i32_3  : i32 {
2026-02-21T08:21:18.2092839Z         %47 = tt.splat %arg3 : i32 -> tensor<32xi32>
2026-02-21T08:21:18.2093043Z         %48 = arith.addi %47, %3 : tensor<32xi32>
2026-02-21T08:21:18.2093295Z         %49 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:21:18.2093555Z         %50 = arith.muli %49, %cst : tensor<32x1xi32>
2026-02-21T08:21:18.2093812Z         %51 = tt.expand_dims %48 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:21:18.2094092Z         %52 = tt.broadcast %50 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:21:18.2094351Z         %53 = tt.broadcast %51 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:21:18.2094581Z         %54 = arith.addi %52, %53 : tensor<32x32xi32>
2026-02-21T08:21:18.2094828Z         %55 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:21:18.2095110Z         %56 = tt.addptr %55, %54 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:21:18.2095410Z         %57 = tt.load %56 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:21:18.2095720Z         %58 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:18.2095995Z         %59 = arith.extf %57 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:21:18.2096252Z         %60 = tt.broadcast %58 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:21:18.2096476Z         %61 = arith.subf %59, %60 : tensor<32x32xf32>
2026-02-21T08:21:18.2096843Z         %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:21:18.2097256Z         %63 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:18.2097536Z         %64 = tt.broadcast %63 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:21:18.2097771Z         %65 = arith.divf %62, %64 : tensor<32x32xf32>
2026-02-21T08:21:18.2098002Z         %66 = arith.truncf %65 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:21:18.2098332Z         %67 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:21:18.2098605Z         %68 = tt.addptr %67, %54 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:21:18.2098853Z         tt.store %68, %66 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:21:18.2099061Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:21:18.2099245Z         %69 = arith.muli %c32_i32, %c1_i32 : i32
2026-02-21T08:21:18.2099438Z         %70 = arith.addi %arg3, %69 : i32
2026-02-21T08:21:18.2099625Z         %71 = tt.splat %70 : i32 -> tensor<32xi32>
2026-02-21T08:21:18.2099827Z         %72 = arith.addi %71, %3 : tensor<32xi32>
2026-02-21T08:21:18.2100074Z         %73 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:21:18.2100329Z         %74 = arith.muli %73, %cst : tensor<32x1xi32>
2026-02-21T08:21:18.2100657Z         %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:21:18.2100939Z         %76 = tt.broadcast %74 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:21:18.2101198Z         %77 = tt.broadcast %75 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:21:18.2101424Z         %78 = arith.addi %76, %77 : tensor<32x32xi32>
2026-02-21T08:21:18.2101687Z         %79 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:21:18.2101965Z         %80 = tt.addptr %79, %78 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:21:18.2102259Z         %81 = tt.load %80 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:21:18.2102572Z         %82 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:18.2102853Z         %83 = arith.extf %81 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:21:18.2103114Z         %84 = tt.broadcast %82 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:21:18.2103342Z         %85 = arith.subf %83, %84 : tensor<32x32xf32>
2026-02-21T08:21:18.2103714Z         %86 = tt.extern_elementwise %85 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:21:18.2104128Z         %87 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:18.2104405Z         %88 = tt.broadcast %87 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:21:18.2104636Z         %89 = arith.divf %86, %88 : tensor<32x32xf32>
2026-02-21T08:21:18.2104864Z         %90 = arith.truncf %89 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:21:18.2105131Z         %91 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:21:18.2105407Z         %92 = tt.addptr %91, %78 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:21:18.2105652Z         tt.store %92, %90 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:21:18.2105858Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:21:18.2106045Z         %93 = arith.muli %c32_i32, %c2_i32 : i32
2026-02-21T08:21:18.2106238Z         %94 = arith.addi %arg3, %93 : i32
2026-02-21T08:21:18.2106424Z         %95 = tt.splat %94 : i32 -> tensor<32xi32>
2026-02-21T08:21:18.2106624Z         %96 = arith.addi %95, %3 : tensor<32xi32>
2026-02-21T08:21:18.2106872Z         %97 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:21:18.2107124Z         %98 = arith.muli %97, %cst : tensor<32x1xi32>
2026-02-21T08:21:18.2107373Z         %99 = tt.expand_dims %96 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:21:18.2107656Z         %100 = tt.broadcast %98 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:21:18.2107923Z         %101 = tt.broadcast %99 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:21:18.2108160Z         %102 = arith.addi %100, %101 : tensor<32x32xi32>
2026-02-21T08:21:18.2108405Z         %103 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:21:18.2108698Z         %104 = tt.addptr %103, %102 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:21:18.2109073Z         %105 = tt.load %104 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:21:18.2109390Z         %106 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:18.2109688Z         %107 = arith.extf %105 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:21:18.2109974Z         %108 = tt.broadcast %106 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:21:18.2110235Z         %109 = arith.subf %107, %108 : tensor<32x32xf32>
2026-02-21T08:21:18.2110627Z         %110 = tt.extern_elementwise %109 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:21:18.2111086Z         %111 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:18.2111396Z         %112 = tt.broadcast %111 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:21:18.2111770Z         %113 = arith.divf %110, %112 : tensor<32x32xf32>
2026-02-21T08:21:18.2112023Z         %114 = arith.truncf %113 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:21:18.2112311Z         %115 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:21:18.2112613Z         %116 = tt.addptr %115, %102 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:21:18.2112884Z         tt.store %116, %114 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:21:18.2113110Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:21:18.2113326Z       %25 = tt.splat %c3552_i32_2 : i32 -> tensor<32xi32>
2026-02-21T08:21:18.2113550Z       %26 = arith.addi %25, %3 : tensor<32xi32>
2026-02-21T08:21:18.2113800Z       %27 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:21:18.2114074Z       %28 = arith.muli %27, %cst : tensor<32x1xi32>
2026-02-21T08:21:18.2114340Z       %29 = tt.expand_dims %26 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:21:18.2114639Z       %30 = tt.broadcast %28 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:21:18.2114921Z       %31 = tt.broadcast %29 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:21:18.2115162Z       %32 = arith.addi %30, %31 : tensor<32x32xi32>
2026-02-21T08:21:18.2115408Z       %33 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:21:18.2115686Z       %34 = tt.addptr %33, %32 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:21:18.2116000Z       %35 = tt.load %34 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:21:18.2116323Z       %36 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:18.2116613Z       %37 = arith.extf %35 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:21:18.2116884Z       %38 = tt.broadcast %36 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:21:18.2117117Z       %39 = arith.subf %37, %38 : tensor<32x32xf32>
2026-02-21T08:21:18.2117500Z       %40 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:21:18.2117946Z       %41 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:21:18.2118219Z       %42 = tt.broadcast %41 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:21:18.2118447Z       %43 = arith.divf %40, %42 : tensor<32x32xf32>
2026-02-21T08:21:18.2118672Z       %44 = arith.truncf %43 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:21:18.2118936Z       %45 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:21:18.2119196Z       %46 = tt.addptr %45, %32 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:21:18.2119447Z       tt.store %46, %44 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:21:18.2119719Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:21:18.2119964Z     tt.return
2026-02-21T08:21:18.2120098Z   }
2026-02-21T08:21:18.2120215Z }
2026-02-21T08:21:18.2120291Z 
2026-02-21T08:21:18.2120398Z {-#
2026-02-21T08:21:18.2120526Z   external_resources: {
2026-02-21T08:21:18.2120687Z     mlir_reproducer: {
2026-02-21T08:21:18.2125105Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:21:18.2129591Z       disable_threading: false,
2026-02-21T08:21:18.2129774Z       verify_each: true
2026-02-21T08:21:18.2129920Z     }
2026-02-21T08:21:18.2130045Z   }
2026-02-21T08:21:18.2130160Z #-}
2026-02-21T08:21:18.2130608Z /tmp/torchinductor_root/3k/c3kroivlf44f54gqsibocybymp3h5kiwuog4cw4ukazisoehdqy2.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:21:18.2131920Z /tmp/torchinductor_root/3k/c3kroivlf44f54gqsibocybymp3h5kiwuog4cw4ukazisoehdqy2.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:21:18.2132924Z [39s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:21:18.2134037Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=128, num_sm_multiplier=64, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[3, 3], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:21:18.2135060Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:21:18.2135319Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:21:20.2480762Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.3 configs/s
2026-02-21T08:21:20.2492848Z [41s] Adaptive compile timeout: 30s (90% percentile=4.6s, bounds=[30.0s, 30s])
2026-02-21T08:21:20.8085590Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1777.5 configs/s
2026-02-21T08:21:20.8652777Z [41s] Initial random population of 100, 5 starting points: 
2026-02-21T08:21:20.8654571Z error=7
2026-02-21T08:21:20.8654782Z timeout=2
2026-02-21T08:21:20.8659879Z ok=91
2026-02-21T08:21:20.8664428Z min=0.0246
2026-02-21T08:21:20.8666426Z mid=0.3922
2026-02-21T08:21:20.8667113Z max=23.9688
2026-02-21T08:21:20.8667255Z best={'block_sizes': [1, 4096],
2026-02-21T08:21:20.8667490Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:21:20.8667733Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:21:20.8667922Z  'num_sm_multiplier': 16,
2026-02-21T08:21:20.8668082Z  'num_stages': 5,
2026-02-21T08:21:20.8668217Z  'num_warps': 16,
2026-02-21T08:21:20.8668371Z  'pid_type': 'persistent_blocked',
2026-02-21T08:21:20.8668553Z  'range_flattens': [None, False],
2026-02-21T08:21:20.8668735Z  'range_multi_buffers': [None, True],
2026-02-21T08:21:20.8668914Z  'range_num_stages': [3, 4],
2026-02-21T08:21:20.8669085Z  'range_unroll_factors': [1, 0],
2026-02-21T08:21:20.8669267Z  'range_warp_specializes': [None, False]}
2026-02-21T08:21:20.8669484Z [41s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:21:22.0248003Z [43s] Generation 1 starting: 84 neighbors, 5 active search path(s)
2026-02-21T08:21:28.2440866Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 9.4 configs/s
2026-02-21T08:21:33.5669319Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 16.9 configs/s
2026-02-21T08:21:36.6359505Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 331.3         
2026-02-21T08:21:36.6364442Z                                                                   configs/s     
2026-02-21T08:21:36.8619628Z [57s] Generation 1 complete: 
2026-02-21T08:21:36.8621401Z error=1
2026-02-21T08:21:36.8621838Z ok=89
2026-02-21T08:21:36.8621989Z min=0.0184
2026-02-21T08:21:36.8622118Z mid=0.0267
2026-02-21T08:21:36.8622272Z max=0.2519
2026-02-21T08:21:36.8622413Z best={'block_sizes': [1, 4096],
2026-02-21T08:21:36.8622653Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:21:36.8622895Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:21:36.8623093Z  'num_stages': 5,
2026-02-21T08:21:36.8623235Z  'num_warps': 4,
2026-02-21T08:21:36.8623384Z  'pid_type': 'flat',
2026-02-21T08:21:36.8623581Z  'range_flattens': [None, False],
2026-02-21T08:21:36.8623783Z  'range_multi_buffers': [None, True],
2026-02-21T08:21:36.8623976Z  'range_num_stages': [0, 4],
2026-02-21T08:21:36.8624144Z  'range_unroll_factors': [0, 0],
2026-02-21T08:21:36.8624334Z  'range_warp_specializes': [None, False]}
2026-02-21T08:21:36.8636088Z [57s] Fitting surrogate: 190 points, 190 targets
2026-02-21T08:21:37.7506172Z [58s] Generation 2 starting: 71 neighbors, 5 active search path(s)
2026-02-21T08:21:46.3460833Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 3.4 configs/s
2026-02-21T08:21:50.8722069Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.5 configs/s
2026-02-21T08:21:54.8270271Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 257.5         
2026-02-21T08:21:54.8271855Z                                                                   configs/s     
2026-02-21T08:21:55.1403182Z [76s] Generation 2 complete: 
2026-02-21T08:21:55.1407497Z ok=77
2026-02-21T08:21:55.1410937Z min=0.0184
2026-02-21T08:21:55.1411187Z mid=0.0246
2026-02-21T08:21:55.1411337Z max=0.6145
2026-02-21T08:21:55.1411507Z best={'block_sizes': [1, 4096],
2026-02-21T08:21:55.1411853Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:21:55.1412141Z  'load_eviction_policies': ['', ''],
2026-02-21T08:21:55.1412338Z  'num_stages': 7,
2026-02-21T08:21:55.1412490Z  'num_warps': 2,
2026-02-21T08:21:55.1412630Z  'pid_type': 'flat',
2026-02-21T08:21:55.1412796Z  'range_flattens': [None, False],
2026-02-21T08:21:55.1412975Z  'range_multi_buffers': [None, True],
2026-02-21T08:21:55.1413168Z  'range_num_stages': [0, 4],
2026-02-21T08:21:55.1413333Z  'range_unroll_factors': [0, 0],
2026-02-21T08:21:55.1413523Z  'range_warp_specializes': [None, True]}
2026-02-21T08:21:55.1417892Z [76s] Fitting surrogate: 267 points, 267 targets
2026-02-21T08:21:56.0279861Z [77s] Generation 3 starting: 63 neighbors, 5 active search path(s)
2026-02-21T08:22:01.8609225Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 4.8 configs/s
2026-02-21T08:22:05.7724958Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 16.8 configs/s
2026-02-21T08:22:09.8410593Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 278.6         
2026-02-21T08:22:09.8411907Z                                                                   configs/s     
2026-02-21T08:22:10.1412901Z [91s] Generation 3 complete: 
2026-02-21T08:22:10.1414818Z error=1
2026-02-21T08:22:10.1414998Z ok=67
2026-02-21T08:22:10.1415161Z min=0.0184
2026-02-21T08:22:10.1415324Z mid=0.0225
2026-02-21T08:22:10.1415549Z max=0.2908
2026-02-21T08:22:10.1415730Z best={'block_sizes': [1, 4096],
2026-02-21T08:22:10.1415980Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:22:10.1419439Z  'load_eviction_policies': ['', ''],
2026-02-21T08:22:10.1423967Z  'num_stages': 7,
2026-02-21T08:22:10.1425493Z  'num_warps': 2,
2026-02-21T08:22:10.1425696Z  'pid_type': 'flat',
2026-02-21T08:22:10.1426269Z  'range_flattens': [None, False],
2026-02-21T08:22:10.1426501Z  'range_multi_buffers': [None, False],
2026-02-21T08:22:10.1426693Z  'range_num_stages': [0, 4],
2026-02-21T08:22:10.1426875Z  'range_unroll_factors': [0, 0],
2026-02-21T08:22:10.1427059Z  'range_warp_specializes': [None, True]}
2026-02-21T08:22:10.1427365Z [91s] Fitting surrogate: 335 points, 335 targets
2026-02-21T08:22:10.8721365Z [91s] Generation 4 starting: 46 neighbors, 4 active search path(s)
2026-02-21T08:22:13.7518632Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48/48 19.1 configs/s
2026-02-21T08:22:16.6795118Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 48/48 16.6 configs/s
2026-02-21T08:22:19.5456580Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 355.4         
2026-02-21T08:22:19.5458064Z                                                                   configs/s     
2026-02-21T08:22:19.7767332Z [100s] Generation 4 complete: 
2026-02-21T08:22:19.7771882Z ok=50
2026-02-21T08:22:19.7775319Z min=0.0184
2026-02-21T08:22:19.7776877Z mid=0.0184
2026-02-21T08:22:19.7777055Z max=0.0307
2026-02-21T08:22:19.7777203Z best={'block_sizes': [1, 4096],
2026-02-21T08:22:19.7777457Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:22:19.7777698Z  'load_eviction_policies': ['', ''],
2026-02-21T08:22:19.7777896Z  'num_stages': 7,
2026-02-21T08:22:19.7778128Z  'num_warps': 2,
2026-02-21T08:22:19.7782044Z  'pid_type': 'flat',
2026-02-21T08:22:19.7786036Z  'range_flattens': [None, False],
2026-02-21T08:22:19.7788078Z  'range_multi_buffers': [None, False],
2026-02-21T08:22:19.7788305Z  'range_num_stages': [0, 4],
2026-02-21T08:22:19.7788548Z  'range_unroll_factors': [0, 0],
2026-02-21T08:22:19.7788739Z  'range_warp_specializes': [None, True]}
2026-02-21T08:22:19.7793640Z [100s] Fitting surrogate: 385 points, 385 targets
2026-02-21T08:22:20.4991306Z [101s] Generation 5 starting: 43 neighbors, 4 active search path(s)
2026-02-21T08:22:23.1029725Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44/44 31.4 configs/s
2026-02-21T08:22:25.7390805Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 44/44 17.0 configs/s
2026-02-21T08:22:28.4076205Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 381.9         
2026-02-21T08:22:28.4080078Z                                                                   configs/s     
2026-02-21T08:22:28.6339891Z [109s] Generation 5 complete: 
2026-02-21T08:22:28.6343700Z ok=47
2026-02-21T08:22:28.6345339Z min=0.0164
2026-02-21T08:22:28.6345561Z mid=0.0184
2026-02-21T08:22:28.6350239Z max=0.0247
2026-02-21T08:22:28.6352358Z best={'block_sizes': [1, 4096],
2026-02-21T08:22:28.6352652Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:22:28.6352932Z  'load_eviction_policies': ['', ''],
2026-02-21T08:22:28.6353188Z  'num_stages': 6,
2026-02-21T08:22:28.6353339Z  'num_warps': 2,
2026-02-21T08:22:28.6357931Z  'pid_type': 'flat',
2026-02-21T08:22:28.6362538Z  'range_flattens': [None, False],
2026-02-21T08:22:28.6366988Z  'range_multi_buffers': [None, None],
2026-02-21T08:22:28.6368753Z  'range_num_stages': [0, 4],
2026-02-21T08:22:28.6368930Z  'range_unroll_factors': [0, 4],
2026-02-21T08:22:28.6369149Z  'range_warp_specializes': [None, None]}
2026-02-21T08:22:28.6374060Z [109s] Fitting surrogate: 432 points, 432 targets
2026-02-21T08:22:29.0223824Z [110s] Generation 6 starting: 20 neighbors, 2 active search path(s)
2026-02-21T08:22:30.4451997Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 27.5 configs/s
2026-02-21T08:22:31.7356455Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 16.8 configs/s
2026-02-21T08:22:32.8779733Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 886.4         
2026-02-21T08:22:32.8781034Z                                                                   configs/s     
2026-02-21T08:22:32.9877378Z [114s] Generation 6 complete: 
2026-02-21T08:22:32.9881516Z ok=22
2026-02-21T08:22:32.9883077Z min=0.0164
2026-02-21T08:22:32.9883247Z mid=0.0184
2026-02-21T08:22:32.9884041Z max=0.0246
2026-02-21T08:22:32.9888673Z best={'block_sizes': [1, 4096],
2026-02-21T08:22:32.9890927Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:22:32.9891221Z  'load_eviction_policies': ['', ''],
2026-02-21T08:22:32.9891416Z  'num_stages': 6,
2026-02-21T08:22:32.9891636Z  'num_warps': 2,
2026-02-21T08:22:32.9891810Z  'pid_type': 'flat',
2026-02-21T08:22:32.9891989Z  'range_flattens': [None, False],
2026-02-21T08:22:32.9892176Z  'range_multi_buffers': [None, None],
2026-02-21T08:22:32.9892379Z  'range_num_stages': [0, 3],
2026-02-21T08:22:32.9892551Z  'range_unroll_factors': [0, 4],
2026-02-21T08:22:32.9892740Z  'range_warp_specializes': [None, None]}
2026-02-21T08:22:32.9897243Z [114s] Fitting surrogate: 454 points, 454 targets
2026-02-21T08:22:33.3251212Z [114s] Generation 7 starting: 19 neighbors, 2 active search path(s)
2026-02-21T08:22:35.0147184Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 13.8 configs/s
2026-02-21T08:22:36.1666289Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 19/19 17.2 configs/s
2026-02-21T08:22:37.3092886Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 886.2 configs/s
2026-02-21T08:22:37.4181117Z [118s] Generation 7 complete: 
2026-02-21T08:22:37.4182995Z ok=21
2026-02-21T08:22:37.4183164Z min=0.0164
2026-02-21T08:22:37.4183304Z mid=0.0184
2026-02-21T08:22:37.4183426Z max=0.0287
2026-02-21T08:22:37.4183567Z best={'block_sizes': [1, 4096],
2026-02-21T08:22:37.4183817Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:22:37.4184070Z  'load_eviction_policies': ['', ''],
2026-02-21T08:22:37.4184243Z  'num_stages': 6,
2026-02-21T08:22:37.4184386Z  'num_warps': 2,
2026-02-21T08:22:37.4184525Z  'pid_type': 'flat',
2026-02-21T08:22:37.4184686Z  'range_flattens': [None, False],
2026-02-21T08:22:37.4184863Z  'range_multi_buffers': [None, None],
2026-02-21T08:22:37.4185519Z  'range_num_stages': [0, 3],
2026-02-21T08:22:37.4185693Z  'range_unroll_factors': [0, 4],
2026-02-21T08:22:37.4185869Z  'range_warp_specializes': [None, None]}
2026-02-21T08:22:37.4199173Z [118s] Fitting surrogate: 475 points, 475 targets
2026-02-21T08:22:37.8119764Z [118s] Generation 8 starting: 17 neighbors, 2 active search path(s)
2026-02-21T08:22:40.2890453Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 2.1 configs/s
2026-02-21T08:22:41.3183220Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 17.3 configs/s
2026-02-21T08:22:42.7674413Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 944.0 configs/s
2026-02-21T08:22:42.8633417Z [123s] Generation 8 complete: 
2026-02-21T08:22:42.8635273Z ok=19
2026-02-21T08:22:42.8635505Z min=0.0164
2026-02-21T08:22:42.8635667Z mid=0.0184
2026-02-21T08:22:42.8635835Z max=0.0307
2026-02-21T08:22:42.8636530Z best={'block_sizes': [1, 4096],
2026-02-21T08:22:42.8636794Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:22:42.8637056Z  'load_eviction_policies': ['', ''],
2026-02-21T08:22:42.8637253Z  'num_stages': 6,
2026-02-21T08:22:42.8637418Z  'num_warps': 2,
2026-02-21T08:22:42.8637586Z  'pid_type': 'flat',
2026-02-21T08:22:42.8637781Z  'range_flattens': [None, False],
2026-02-21T08:22:42.8637971Z  'range_multi_buffers': [None, None],
2026-02-21T08:22:42.8638172Z  'range_num_stages': [0, 4],
2026-02-21T08:22:42.8638349Z  'range_unroll_factors': [0, 4],
2026-02-21T08:22:42.8638546Z  'range_warp_specializes': [None, None]}
2026-02-21T08:22:42.8645073Z [123s] Fitting surrogate: 494 points, 494 targets
2026-02-21T08:22:43.0346498Z [124s] Autotuning complete in 124.1s after searching 475 configs.
2026-02-21T08:22:43.0348188Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:22:43.0349102Z     @helion.kernel(config=helion.Config(block_sizes=[1, 4096], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', ''], num_stages=6, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True)
2026-02-21T08:22:43.0349903Z 
2026-02-21T08:22:43.0350146Z [124s] Code of selected kernel: /tmp/torchinductor_root/jz/cjz33fcpu2l4q3itrjabq4fkakv5igzr4fmx6v2h6g5idtokg7x5.py
2026-02-21T08:22:43.9590754Z WARNING:tritonbench.utils.triton_op:Completed input ID 26:
2026-02-21T08:22:43.9594823Z (M, N)
2026-02-21T08:22:43.9596325Z ------------
2026-02-21T08:22:43.9596541Z (4096, 3584)
2026-02-21T08:22:43.9596688Z 
2026-02-21T08:22:43.9597213Z  30%|███       | 6/20 [13:41<33:16, 142.58s/it]WARNING:tritonbench.utils.triton_op:Running input ID 31:
2026-02-21T08:22:43.9598749Z (M, N)
2026-02-21T08:22:43.9598924Z ------------
2026-02-21T08:22:43.9599071Z (4096, 4224)
2026-02-21T08:22:43.9603634Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T08:22:45.2386709Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T08:22:46.4849516Z INFO:tritonbench.utils.triton_op:Took 2.18ms to get benchmark function for torch_compile_softmax
2026-02-21T08:22:49.9077533Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:22:49.9081960Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:22:49.9086102Z               'dtype': 'torch.float16',
2026-02-21T08:22:49.9087692Z               'shape': (4096, 4224),
2026-02-21T08:22:49.9087957Z               'stride': (4224, 1)},),
2026-02-21T08:22:49.9092751Z   'kwargs': {}}
2026-02-21T08:22:49.9097294Z INFO:tritonbench.utils.triton_op:Took 1.77ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T08:22:50.0855943Z [0s] Autotune random seed: 2134816249
2026-02-21T08:22:50.2288159Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:23:23.7029818Z [33s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None])
2026-02-21T08:23:23.9431151Z [33s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False])
2026-02-21T08:23:24.0934471Z [33s] Timeout after 30s compiling Config(block_sizes=[256, 4096], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=128, num_sm_multiplier=128, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 2], range_unroll_factors=[3, 0], range_warp_specializes=[None, True])
2026-02-21T08:23:24.0950774Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s
2026-02-21T08:23:24.2833541Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:23:24.2836044Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:23:24.2836604Z     %cst = arith.constant dense<0.000000e+00> : tensor<8x512xf16>
2026-02-21T08:23:24.2841498Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:23:24.2845995Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:23:24.2851299Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:23:24.2855886Z     %cst_0 = arith.constant dense<4224> : tensor<8x1xi32>
2026-02-21T08:23:24.2857451Z     %cst_1 = arith.constant dense<0.000000e+00> : tensor<8x512xf32>
2026-02-21T08:23:24.2857817Z     %cst_2 = arith.constant dense<0xFC00> : tensor<8x512xf16>
2026-02-21T08:23:24.2863984Z     %cst_3 = arith.constant dense<4224> : tensor<512xi32>
2026-02-21T08:23:24.2868531Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32>
2026-02-21T08:23:24.2870678Z     %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32>
2026-02-21T08:23:24.2870985Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:23:24.2875767Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:23:24.2880329Z     %c4224_i32 = arith.constant 4224 : i32
2026-02-21T08:23:24.2884367Z     %c4224_i64 = arith.constant 4224 : i64
2026-02-21T08:23:24.2888808Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:23:24.2892689Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4224_i32], [%c4224_i64, %c1_i64] : <f16>, <tensor<8x512xf16>>
2026-02-21T08:23:24.2897122Z     %1 = tt.get_program_id x : i32
2026-02-21T08:23:24.2899007Z     scf.for %arg2 = %1 to %c512_i32 step %c9472_i32  : i32 {
2026-02-21T08:23:24.2899280Z       %2 = arith.muli %arg2, %c8_i32 : i32
2026-02-21T08:23:24.2899516Z       %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:23:24.2899777Z       %4 = tt.splat %2 : i32 -> tensor<8xi32>
2026-02-21T08:23:24.2899976Z       %5 = arith.addi %4, %3 : tensor<8xi32>
2026-02-21T08:23:24.2900177Z       %c4096_i32_6 = arith.constant 4096 : i32
2026-02-21T08:23:24.2900363Z       %c2048_i32 = arith.constant 2048 : i32
2026-02-21T08:23:24.2900741Z       %6:2 = scf.for %arg3 = %c0_i32 to %c4096_i32_6 step %c2048_i32 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<8xf32>, tensor<8xf32>)  : i32 {
2026-02-21T08:23:24.2901156Z         %60 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:23:24.2901416Z         %61 = tt.splat %arg3 : i32 -> tensor<512xi32>
2026-02-21T08:23:24.2901706Z         %62 = arith.addi %61, %60 : tensor<512xi32>
2026-02-21T08:23:24.2902235Z         %63 = arith.cmpi slt, %62, %cst_3 : tensor<512xi32>
2026-02-21T08:23:24.2902555Z         %64 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:23:24.2902909Z         %65 = tt.expand_dims %63 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:23:24.2903214Z         %66 = tt.broadcast %65 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:23:24.2903503Z         %67 = arith.select %66, %64, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T08:23:24.2903783Z         %68 = arith.extf %67 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:23:24.2904023Z         %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({
2026-02-21T08:23:24.2904216Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:23:24.2904411Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:23:24.2904604Z           tt.reduce.return %174 : f32
2026-02-21T08:23:24.2904865Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:23:24.2905106Z         %70 = arith.truncf %69 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:23:24.2905357Z         %71 = arith.extf %70 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:23:24.2905607Z         %72 = arith.cmpf ogt, %arg4, %71 : tensor<8xf32>
2026-02-21T08:23:24.2905833Z         %73 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32>
2026-02-21T08:23:24.2906132Z         %74 = arith.ori %72, %73 : tensor<8xi1>
2026-02-21T08:23:24.2906360Z         %75 = arith.select %74, %arg4, %71 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:23:24.2906605Z         %76 = arith.subf %arg4, %75 : tensor<8xf32>
2026-02-21T08:23:24.2906970Z         %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:23:24.2907327Z         %78 = arith.mulf %arg5, %77 : tensor<8xf32>
2026-02-21T08:23:24.2907586Z         %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:24.2907871Z         %80 = arith.extf %64 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:23:24.2908134Z         %81 = tt.broadcast %79 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:23:24.2908373Z         %82 = arith.subf %80, %81 : tensor<8x512xf32>
2026-02-21T08:23:24.2908736Z         %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:23:24.2909147Z         %84 = arith.select %66, %83, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T08:23:24.2909401Z         %85 = "tt.reduce"(%84) <{axis = 1 : i32}> ({
2026-02-21T08:23:24.2909600Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:23:24.2909782Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:23:24.2909980Z           tt.reduce.return %174 : f32
2026-02-21T08:23:24.2910177Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:23:24.2910369Z         %86 = arith.addf %78, %85 : tensor<8xf32>
2026-02-21T08:23:24.2910568Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:23:24.2910752Z         %87 = arith.muli %c512_i32, %c1_i32 : i32
2026-02-21T08:23:24.2910942Z         %88 = arith.addi %arg3, %87 : i32
2026-02-21T08:23:24.2911173Z         %89 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:23:24.2911432Z         %90 = tt.splat %88 : i32 -> tensor<512xi32>
2026-02-21T08:23:24.2911693Z         %91 = arith.addi %90, %89 : tensor<512xi32>
2026-02-21T08:23:24.2911903Z         %92 = arith.cmpi slt, %91, %cst_3 : tensor<512xi32>
2026-02-21T08:23:24.2912203Z         %93 = tt.descriptor_load %0[%2, %88] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:23:24.2912534Z         %94 = tt.expand_dims %92 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:23:24.2912822Z         %95 = tt.broadcast %94 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:23:24.2913090Z         %96 = arith.select %95, %93, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T08:23:24.2913447Z         %97 = arith.extf %96 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:23:24.2913677Z         %98 = "tt.reduce"(%97) <{axis = 1 : i32}> ({
2026-02-21T08:23:24.2913864Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:23:24.2914047Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:23:24.2914233Z           tt.reduce.return %174 : f32
2026-02-21T08:23:24.2914420Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:23:24.2914634Z         %99 = arith.truncf %98 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:23:24.2914886Z         %100 = arith.extf %99 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:23:24.2915125Z         %101 = arith.cmpf ogt, %75, %100 : tensor<8xf32>
2026-02-21T08:23:24.2915341Z         %102 = arith.cmpf une, %75, %75 : tensor<8xf32>
2026-02-21T08:23:24.2915555Z         %103 = arith.ori %101, %102 : tensor<8xi1>
2026-02-21T08:23:24.2915838Z         %104 = arith.select %103, %75, %100 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:23:24.2916082Z         %105 = arith.subf %75, %104 : tensor<8xf32>
2026-02-21T08:23:24.2916453Z         %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:23:24.2916830Z         %107 = arith.mulf %86, %106 : tensor<8xf32>
2026-02-21T08:23:24.2917095Z         %108 = tt.expand_dims %104 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:24.2917397Z         %109 = arith.extf %93 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:23:24.2917676Z         %110 = tt.broadcast %108 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:23:24.2917924Z         %111 = arith.subf %109, %110 : tensor<8x512xf32>
2026-02-21T08:23:24.2918319Z         %112 = tt.extern_elementwise %111 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:23:24.2918756Z         %113 = arith.select %95, %112, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T08:23:24.2919024Z         %114 = "tt.reduce"(%113) <{axis = 1 : i32}> ({
2026-02-21T08:23:24.2919231Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:23:24.2919413Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:23:24.2919613Z           tt.reduce.return %174 : f32
2026-02-21T08:23:24.2919803Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:23:24.2920015Z         %115 = arith.addf %107, %114 : tensor<8xf32>
2026-02-21T08:23:24.2920216Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:23:24.2920418Z         %116 = arith.muli %c512_i32, %c2_i32 : i32
2026-02-21T08:23:24.2920638Z         %117 = arith.addi %arg3, %116 : i32
2026-02-21T08:23:24.2920882Z         %118 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:23:24.2921153Z         %119 = tt.splat %117 : i32 -> tensor<512xi32>
2026-02-21T08:23:24.2921367Z         %120 = arith.addi %119, %118 : tensor<512xi32>
2026-02-21T08:23:24.2921645Z         %121 = arith.cmpi slt, %120, %cst_3 : tensor<512xi32>
2026-02-21T08:23:24.2921968Z         %122 = tt.descriptor_load %0[%2, %117] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:23:24.2922342Z         %123 = tt.expand_dims %121 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:23:24.2922667Z         %124 = tt.broadcast %123 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:23:24.2922957Z         %125 = arith.select %124, %122, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T08:23:24.2923261Z         %126 = arith.extf %125 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:23:24.2923504Z         %127 = "tt.reduce"(%126) <{axis = 1 : i32}> ({
2026-02-21T08:23:24.2923709Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:23:24.2923897Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:23:24.2924105Z           tt.reduce.return %174 : f32
2026-02-21T08:23:24.2924307Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:23:24.2924540Z         %128 = arith.truncf %127 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:23:24.2924870Z         %129 = arith.extf %128 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:23:24.2925116Z         %130 = arith.cmpf ogt, %104, %129 : tensor<8xf32>
2026-02-21T08:23:24.2925337Z         %131 = arith.cmpf une, %104, %104 : tensor<8xf32>
2026-02-21T08:23:24.2925543Z         %132 = arith.ori %130, %131 : tensor<8xi1>
2026-02-21T08:23:24.2925781Z         %133 = arith.select %132, %104, %129 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:23:24.2926026Z         %134 = arith.subf %104, %133 : tensor<8xf32>
2026-02-21T08:23:24.2926376Z         %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:23:24.2926749Z         %136 = arith.mulf %115, %135 : tensor<8xf32>
2026-02-21T08:23:24.2926998Z         %137 = tt.expand_dims %133 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:24.2927332Z         %138 = arith.extf %122 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:23:24.2927603Z         %139 = tt.broadcast %137 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:23:24.2927844Z         %140 = arith.subf %138, %139 : tensor<8x512xf32>
2026-02-21T08:23:24.2928240Z         %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:23:24.2928654Z         %142 = arith.select %124, %141, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T08:23:24.2928909Z         %143 = "tt.reduce"(%142) <{axis = 1 : i32}> ({
2026-02-21T08:23:24.2929106Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:23:24.2929282Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:23:24.2929471Z           tt.reduce.return %174 : f32
2026-02-21T08:23:24.2929651Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:23:24.2929851Z         %144 = arith.addf %136, %143 : tensor<8xf32>
2026-02-21T08:23:24.2930043Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T08:23:24.2930236Z         %145 = arith.muli %c512_i32, %c3_i32 : i32
2026-02-21T08:23:24.2930431Z         %146 = arith.addi %arg3, %145 : i32
2026-02-21T08:23:24.2930660Z         %147 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:23:24.2930920Z         %148 = tt.splat %146 : i32 -> tensor<512xi32>
2026-02-21T08:23:24.2931126Z         %149 = arith.addi %148, %147 : tensor<512xi32>
2026-02-21T08:23:24.2931348Z         %150 = arith.cmpi slt, %149, %cst_3 : tensor<512xi32>
2026-02-21T08:23:24.2931676Z         %151 = tt.descriptor_load %0[%2, %146] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:23:24.2932024Z         %152 = tt.expand_dims %150 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:23:24.2932326Z         %153 = tt.broadcast %152 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:23:24.2932604Z         %154 = arith.select %153, %151, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T08:23:24.2932893Z         %155 = arith.extf %154 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:23:24.2933125Z         %156 = "tt.reduce"(%155) <{axis = 1 : i32}> ({
2026-02-21T08:23:24.2933324Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:23:24.2933512Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:23:24.2933697Z           tt.reduce.return %174 : f32
2026-02-21T08:23:24.2933884Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:23:24.2934106Z         %157 = arith.truncf %156 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:23:24.2934355Z         %158 = arith.extf %157 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:23:24.2934582Z         %159 = arith.cmpf ogt, %133, %158 : tensor<8xf32>
2026-02-21T08:23:24.2934801Z         %160 = arith.cmpf une, %133, %133 : tensor<8xf32>
2026-02-21T08:23:24.2935001Z         %161 = arith.ori %159, %160 : tensor<8xi1>
2026-02-21T08:23:24.2935238Z         %162 = arith.select %161, %133, %158 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:23:24.2935486Z         %163 = arith.subf %133, %162 : tensor<8xf32>
2026-02-21T08:23:24.2935924Z         %164 = tt.extern_elementwise %163 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:23:24.2936283Z         %165 = arith.mulf %144, %164 : tensor<8xf32>
2026-02-21T08:23:24.2936529Z         %166 = tt.expand_dims %162 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:24.2936824Z         %167 = arith.extf %151 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:23:24.2937086Z         %168 = tt.broadcast %166 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:23:24.2937323Z         %169 = arith.subf %167, %168 : tensor<8x512xf32>
2026-02-21T08:23:24.2937692Z         %170 = tt.extern_elementwise %169 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:23:24.2938097Z         %171 = arith.select %153, %170, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T08:23:24.2938405Z         %172 = "tt.reduce"(%171) <{axis = 1 : i32}> ({
2026-02-21T08:23:24.2938598Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:23:24.2938783Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:23:24.2938975Z           tt.reduce.return %174 : f32
2026-02-21T08:23:24.2939155Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:23:24.2939358Z         %173 = arith.addf %165, %172 : tensor<8xf32>
2026-02-21T08:23:24.2939572Z         scf.yield %162, %173 : tensor<8xf32>, tensor<8xf32>
2026-02-21T08:23:24.2939822Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:23:24.2940085Z       %7 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:23:24.2940347Z       %8 = tt.splat %c4096_i32_6 : i32 -> tensor<512xi32>
2026-02-21T08:23:24.2940562Z       %9 = arith.addi %8, %7 : tensor<512xi32>
2026-02-21T08:23:24.2940768Z       %10 = arith.cmpi slt, %9, %cst_3 : tensor<512xi32>
2026-02-21T08:23:24.2941078Z       %11 = tt.descriptor_load %0[%2, %c4096_i32_6] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:23:24.2941425Z       %12 = tt.expand_dims %10 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:23:24.2941747Z       %13 = tt.broadcast %12 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:23:24.2942016Z       %14 = arith.select %13, %11, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T08:23:24.2942289Z       %15 = arith.extf %14 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:23:24.2942520Z       %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({
2026-02-21T08:23:24.2942711Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:23:24.2942899Z         %60 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:23:24.2943086Z         tt.reduce.return %60 : f32
2026-02-21T08:23:24.2943279Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:23:24.2943491Z       %17 = arith.truncf %16 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:23:24.2943734Z       %18 = arith.extf %17 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:23:24.2943964Z       %19 = arith.cmpf ogt, %6#0, %18 : tensor<8xf32>
2026-02-21T08:23:24.2944174Z       %20 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32>
2026-02-21T08:23:24.2944379Z       %21 = arith.ori %19, %20 : tensor<8xi1>
2026-02-21T08:23:24.2944597Z       %22 = arith.select %21, %6#0, %18 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:23:24.2944829Z       %23 = arith.subf %6#0, %22 : tensor<8xf32>
2026-02-21T08:23:24.2945179Z       %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:23:24.2945543Z       %25 = arith.mulf %6#1, %24 : tensor<8xf32>
2026-02-21T08:23:24.2945792Z       %26 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:24.2946069Z       %27 = arith.extf %11 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:23:24.2946324Z       %28 = tt.broadcast %26 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:23:24.2946552Z       %29 = arith.subf %27, %28 : tensor<8x512xf32>
2026-02-21T08:23:24.2946986Z       %30 = tt.extern_elementwise %29 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:23:24.2947388Z       %31 = arith.select %13, %30, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T08:23:24.2947631Z       %32 = "tt.reduce"(%31) <{axis = 1 : i32}> ({
2026-02-21T08:23:24.2947825Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:23:24.2947996Z         %60 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:23:24.2948182Z         tt.reduce.return %60 : f32
2026-02-21T08:23:24.2948361Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:23:24.2948556Z       %33 = arith.addf %25, %32 : tensor<8xf32>
2026-02-21T08:23:24.2948746Z       %c4096_i32_7 = arith.constant 4096 : i32
2026-02-21T08:23:24.2948942Z       %c2048_i32_8 = arith.constant 2048 : i32
2026-02-21T08:23:24.2949173Z       scf.for %arg3 = %c0_i32 to %c4096_i32_7 step %c2048_i32_8  : i32 {
2026-02-21T08:23:24.2949498Z         %60 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:23:24.2949763Z         %61 = tt.splat %arg3 : i32 -> tensor<512xi32>
2026-02-21T08:23:24.2949966Z         %62 = arith.addi %61, %60 : tensor<512xi32>
2026-02-21T08:23:24.2950183Z         %63 = arith.cmpi slt, %62, %cst_3 : tensor<512xi32>
2026-02-21T08:23:24.2950439Z         %64 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:23:24.2950702Z         %65 = arith.muli %64, %cst_0 : tensor<8x1xi32>
2026-02-21T08:23:24.2950962Z         %66 = tt.expand_dims %62 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:23:24.2951248Z         %67 = tt.broadcast %65 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:23:24.2951511Z         %68 = tt.broadcast %66 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:23:24.2951783Z         %69 = arith.addi %67, %68 : tensor<8x512xi32>
2026-02-21T08:23:24.2952031Z         %70 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2952316Z         %71 = tt.addptr %70, %69 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:23:24.2952619Z         %72 = tt.expand_dims %63 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:23:24.2952917Z         %73 = tt.broadcast %72 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:23:24.2953214Z         %74 = tt.load %71, %73, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2953540Z         %75 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:24.2953819Z         %76 = arith.extf %74 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:23:24.2954078Z         %77 = tt.broadcast %75 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:23:24.2954310Z         %78 = arith.subf %76, %77 : tensor<8x512xf32>
2026-02-21T08:23:24.2954676Z         %79 = tt.extern_elementwise %78 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:23:24.2955102Z         %80 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:24.2955379Z         %81 = tt.broadcast %80 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:23:24.2955610Z         %82 = arith.divf %79, %81 : tensor<8x512xf32>
2026-02-21T08:23:24.2955837Z         %83 = arith.truncf %82 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:23:24.2956106Z         %84 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2956386Z         %85 = tt.addptr %84, %69 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:23:24.2956644Z         tt.store %85, %83, %73 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2956859Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:23:24.2957049Z         %86 = arith.muli %c512_i32, %c1_i32 : i32
2026-02-21T08:23:24.2957246Z         %87 = arith.addi %arg3, %86 : i32
2026-02-21T08:23:24.2957474Z         %88 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:23:24.2957777Z         %89 = tt.splat %87 : i32 -> tensor<512xi32>
2026-02-21T08:23:24.2957981Z         %90 = arith.addi %89, %88 : tensor<512xi32>
2026-02-21T08:23:24.2958192Z         %91 = arith.cmpi slt, %90, %cst_3 : tensor<512xi32>
2026-02-21T08:23:24.2958456Z         %92 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:23:24.2958713Z         %93 = arith.muli %92, %cst_0 : tensor<8x1xi32>
2026-02-21T08:23:24.2958974Z         %94 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:23:24.2959261Z         %95 = tt.broadcast %93 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:23:24.2959542Z         %96 = tt.broadcast %94 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:23:24.2959792Z         %97 = arith.addi %95, %96 : tensor<8x512xi32>
2026-02-21T08:23:24.2960032Z         %98 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2960367Z         %99 = tt.addptr %98, %97 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:23:24.2960677Z         %100 = tt.expand_dims %91 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:23:24.2960988Z         %101 = tt.broadcast %100 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:23:24.2961316Z         %102 = tt.load %99, %101, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2961706Z         %103 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:24.2962012Z         %104 = arith.extf %102 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:23:24.2962284Z         %105 = tt.broadcast %103 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:23:24.2962542Z         %106 = arith.subf %104, %105 : tensor<8x512xf32>
2026-02-21T08:23:24.2962930Z         %107 = tt.extern_elementwise %106 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:23:24.2963377Z         %108 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:24.2963680Z         %109 = tt.broadcast %108 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:23:24.2963930Z         %110 = arith.divf %107, %109 : tensor<8x512xf32>
2026-02-21T08:23:24.2964185Z         %111 = arith.truncf %110 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:23:24.2964471Z         %112 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2964776Z         %113 = tt.addptr %112, %97 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:23:24.2965065Z         tt.store %113, %111, %101 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2965288Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:23:24.2965493Z         %114 = arith.muli %c512_i32, %c2_i32 : i32
2026-02-21T08:23:24.2965695Z         %115 = arith.addi %arg3, %114 : i32
2026-02-21T08:23:24.2965949Z         %116 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:23:24.2966224Z         %117 = tt.splat %115 : i32 -> tensor<512xi32>
2026-02-21T08:23:24.2966452Z         %118 = arith.addi %117, %116 : tensor<512xi32>
2026-02-21T08:23:24.2966695Z         %119 = arith.cmpi slt, %118, %cst_3 : tensor<512xi32>
2026-02-21T08:23:24.2966976Z         %120 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:23:24.2967245Z         %121 = arith.muli %120, %cst_0 : tensor<8x1xi32>
2026-02-21T08:23:24.2967508Z         %122 = tt.expand_dims %118 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:23:24.2967808Z         %123 = tt.broadcast %121 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:23:24.2968074Z         %124 = tt.broadcast %122 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:23:24.2968321Z         %125 = arith.addi %123, %124 : tensor<8x512xi32>
2026-02-21T08:23:24.2968566Z         %126 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2968851Z         %127 = tt.addptr %126, %125 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:23:24.2969208Z         %128 = tt.expand_dims %119 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:23:24.2969494Z         %129 = tt.broadcast %128 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:23:24.2969807Z         %130 = tt.load %127, %129, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2970138Z         %131 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:24.2970417Z         %132 = arith.extf %130 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:23:24.2970678Z         %133 = tt.broadcast %131 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:23:24.2970915Z         %134 = arith.subf %132, %133 : tensor<8x512xf32>
2026-02-21T08:23:24.2971294Z         %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:23:24.2971805Z         %136 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:24.2972096Z         %137 = tt.broadcast %136 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:23:24.2972340Z         %138 = arith.divf %135, %137 : tensor<8x512xf32>
2026-02-21T08:23:24.2972577Z         %139 = arith.truncf %138 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:23:24.2972851Z         %140 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2973131Z         %141 = tt.addptr %140, %125 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:23:24.2973405Z         tt.store %141, %139, %129 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2973623Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T08:23:24.2973814Z         %142 = arith.muli %c512_i32, %c3_i32 : i32
2026-02-21T08:23:24.2974009Z         %143 = arith.addi %arg3, %142 : i32
2026-02-21T08:23:24.2974238Z         %144 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:23:24.2974500Z         %145 = tt.splat %143 : i32 -> tensor<512xi32>
2026-02-21T08:23:24.2974703Z         %146 = arith.addi %145, %144 : tensor<512xi32>
2026-02-21T08:23:24.2974925Z         %147 = arith.cmpi slt, %146, %cst_3 : tensor<512xi32>
2026-02-21T08:23:24.2975191Z         %148 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:23:24.2975449Z         %149 = arith.muli %148, %cst_0 : tensor<8x1xi32>
2026-02-21T08:23:24.2975714Z         %150 = tt.expand_dims %146 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:23:24.2976006Z         %151 = tt.broadcast %149 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:23:24.2976273Z         %152 = tt.broadcast %150 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:23:24.2976513Z         %153 = arith.addi %151, %152 : tensor<8x512xi32>
2026-02-21T08:23:24.2976755Z         %154 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2977047Z         %155 = tt.addptr %154, %153 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:23:24.2977349Z         %156 = tt.expand_dims %147 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:23:24.2977644Z         %157 = tt.broadcast %156 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:23:24.2977948Z         %158 = tt.load %155, %157, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2978279Z         %159 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:24.2978571Z         %160 = arith.extf %158 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:23:24.2978829Z         %161 = tt.broadcast %159 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:23:24.2979070Z         %162 = arith.subf %160, %161 : tensor<8x512xf32>
2026-02-21T08:23:24.2979442Z         %163 = tt.extern_elementwise %162 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:23:24.2979869Z         %164 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:24.2980200Z         %165 = tt.broadcast %164 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:23:24.2980455Z         %166 = arith.divf %163, %165 : tensor<8x512xf32>
2026-02-21T08:23:24.2980696Z         %167 = arith.truncf %166 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:23:24.2980964Z         %168 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2981247Z         %169 = tt.addptr %168, %153 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:23:24.2981511Z         tt.store %169, %167, %157 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2981804Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:23:24.2982070Z       %34 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:23:24.2982333Z       %35 = tt.splat %c4096_i32_7 : i32 -> tensor<512xi32>
2026-02-21T08:23:24.2982599Z       %36 = arith.addi %35, %34 : tensor<512xi32>
2026-02-21T08:23:24.2982813Z       %37 = arith.cmpi slt, %36, %cst_3 : tensor<512xi32>
2026-02-21T08:23:24.2983075Z       %38 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:23:24.2983332Z       %39 = arith.muli %38, %cst_0 : tensor<8x1xi32>
2026-02-21T08:23:24.2983591Z       %40 = tt.expand_dims %36 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:23:24.2983883Z       %41 = tt.broadcast %39 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:23:24.2984139Z       %42 = tt.broadcast %40 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:23:24.2984379Z       %43 = arith.addi %41, %42 : tensor<8x512xi32>
2026-02-21T08:23:24.2984612Z       %44 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2984887Z       %45 = tt.addptr %44, %43 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:23:24.2985173Z       %46 = tt.expand_dims %37 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:23:24.2985459Z       %47 = tt.broadcast %46 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:23:24.2985756Z       %48 = tt.load %45, %47, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2986071Z       %49 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:24.2986353Z       %50 = arith.extf %48 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:23:24.2986599Z       %51 = tt.broadcast %49 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:23:24.2986830Z       %52 = arith.subf %50, %51 : tensor<8x512xf32>
2026-02-21T08:23:24.2987185Z       %53 = tt.extern_elementwise %52 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:23:24.2987593Z       %54 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:24.2987872Z       %55 = tt.broadcast %54 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:23:24.2988096Z       %56 = arith.divf %53, %55 : tensor<8x512xf32>
2026-02-21T08:23:24.2988329Z       %57 = arith.truncf %56 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:23:24.2988583Z       %58 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2988855Z       %59 = tt.addptr %58, %43 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:23:24.2989112Z       tt.store %59, %57, %47 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:23:24.2989388Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:23:24.2989644Z     tt.return
2026-02-21T08:23:24.2989770Z   }
2026-02-21T08:23:24.2989898Z }
2026-02-21T08:23:24.2989965Z 
2026-02-21T08:23:24.2990015Z {-#
2026-02-21T08:23:24.2990151Z   external_resources: {
2026-02-21T08:23:24.2990305Z     mlir_reproducer: {
2026-02-21T08:23:24.2994683Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:23:24.2999093Z       disable_threading: false,
2026-02-21T08:23:24.2999268Z       verify_each: true
2026-02-21T08:23:24.2999411Z     }
2026-02-21T08:23:24.2999538Z   }
2026-02-21T08:23:24.2999648Z #-}
2026-02-21T08:23:24.3000077Z /tmp/torchinductor_root/ay/cay7nz7nggw5j73svyjfng2qcf2be64lex7hhfxux7fbsi3w3ldy.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:23:24.3001252Z /tmp/torchinductor_root/ay/cay7nz7nggw5j73svyjfng2qcf2be64lex7hhfxux7fbsi3w3ldy.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:23:24.3002271Z [34s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:23:24.3003368Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 512], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'last'], maxnreg=32, num_sm_multiplier=64, num_stages=7, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[4, 4], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:23:24.3004341Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:23:24.3004609Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:23:30.5339278Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 15.6 configs/s
2026-02-21T08:23:30.5347885Z [40s] Adaptive compile timeout: 30s (90% percentile=4.5s, bounds=[30.0s, 30s])
2026-02-21T08:23:30.9958592Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 2123.1 configs/s
2026-02-21T08:23:31.0412114Z [40s] Initial random population of 100, 5 starting points: 
2026-02-21T08:23:31.0416617Z error=6
2026-02-21T08:23:31.0418622Z timeout=3
2026-02-21T08:23:31.0423776Z ok=91
2026-02-21T08:23:31.0425361Z min=0.0307
2026-02-21T08:23:31.0425523Z mid=0.4117
2026-02-21T08:23:31.0425649Z max=28.4201
2026-02-21T08:23:31.0425797Z best={'block_sizes': [1, 8192],
2026-02-21T08:23:31.0426026Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:23:31.0426270Z  'load_eviction_policies': ['', 'last'],
2026-02-21T08:23:31.0426450Z  'maxnreg': 32,
2026-02-21T08:23:31.0426600Z  'num_sm_multiplier': 64,
2026-02-21T08:23:31.0427130Z  'num_stages': 7,
2026-02-21T08:23:31.0427269Z  'num_warps': 4,
2026-02-21T08:23:31.0431938Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:23:31.0435592Z  'range_flattens': [None, True],
2026-02-21T08:23:31.0435895Z  'range_multi_buffers': [False, True],
2026-02-21T08:23:31.0436119Z  'range_num_stages': [1, 4],
2026-02-21T08:23:31.0441460Z  'range_unroll_factors': [1, 4],
2026-02-21T08:23:31.0443398Z  'range_warp_specializes': [True, None]}
2026-02-21T08:23:31.0443714Z [40s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:23:32.1619009Z [41s] Generation 1 starting: 84 neighbors, 5 active search path(s)
2026-02-21T08:23:38.9858839Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 7.0 configs/s
2026-02-21T08:23:41.0983236Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:23:41.0985887Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:23:41.0986462Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:23:41.0986665Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:23:41.0988402Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:23:41.0988632Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:23:41.0988902Z     %cst = arith.constant dense<0.000000e+00> : tensor<8x1024xf32>
2026-02-21T08:23:41.0989208Z     %cst_0 = arith.constant dense<0xFC00> : tensor<8x1024xf16>
2026-02-21T08:23:41.0994073Z     %cst_1 = arith.constant dense<4224> : tensor<1024xi32>
2026-02-21T08:23:41.0999140Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<8xf32>
2026-02-21T08:23:41.1003746Z     %cst_3 = arith.constant dense<0xFF800000> : tensor<8xf32>
2026-02-21T08:23:41.1005564Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:23:41.1005765Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:23:41.1005968Z     %c4224_i32 = arith.constant 4224 : i32
2026-02-21T08:23:41.1006170Z     %c4224_i64 = arith.constant 4224 : i64
2026-02-21T08:23:41.1006365Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:23:41.1006688Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4224_i32], [%c4224_i64, %c1_i64] : <f16>, <tensor<8x1024xf16>>
2026-02-21T08:23:41.1007122Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4224_i32], [%c4224_i64, %c1_i64] : <f16>, <tensor<8x1024xf16>>
2026-02-21T08:23:41.1007475Z     %2 = tt.get_program_id x : i32
2026-02-21T08:23:41.1007696Z     scf.for %arg2 = %2 to %c512_i32 step %c9472_i32  : i32 {
2026-02-21T08:23:41.1007914Z       %3 = arith.muli %arg2, %c8_i32 : i32
2026-02-21T08:23:41.1008112Z       %c4096_i32_4 = arith.constant 4096 : i32
2026-02-21T08:23:41.1008308Z       %c2048_i32 = arith.constant 2048 : i32
2026-02-21T08:23:41.1008683Z       %4:2 = scf.for %arg3 = %c0_i32 to %c4096_i32_4 step %c2048_i32 iter_args(%arg4 = %cst_3, %arg5 = %cst_2) -> (tensor<8xf32>, tensor<8xf32>)  : i32 {
2026-02-21T08:23:41.1009118Z         %42 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T08:23:41.1009386Z         %43 = tt.splat %arg3 : i32 -> tensor<1024xi32>
2026-02-21T08:23:41.1009605Z         %44 = arith.addi %43, %42 : tensor<1024xi32>
2026-02-21T08:23:41.1009829Z         %45 = arith.cmpi slt, %44, %cst_1 : tensor<1024xi32>
2026-02-21T08:23:41.1010132Z         %46 = tt.descriptor_load %0[%3, %arg3] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T08:23:41.1010486Z         %47 = tt.expand_dims %45 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T08:23:41.1010781Z         %48 = tt.broadcast %47 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T08:23:41.1011064Z         %49 = arith.select %48, %46, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T08:23:41.1011337Z         %50 = arith.extf %49 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T08:23:41.1011634Z         %51 = "tt.reduce"(%50) <{axis = 1 : i32}> ({
2026-02-21T08:23:41.1011840Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:23:41.1012037Z           %98 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:23:41.1012577Z           tt.reduce.return %98 : f32
2026-02-21T08:23:41.1012767Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T08:23:41.1012999Z         %52 = arith.truncf %51 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:23:41.1013240Z         %53 = arith.extf %52 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:23:41.1013475Z         %54 = arith.cmpf ogt, %arg4, %53 : tensor<8xf32>
2026-02-21T08:23:41.1013706Z         %55 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32>
2026-02-21T08:23:41.1013915Z         %56 = arith.ori %54, %55 : tensor<8xi1>
2026-02-21T08:23:41.1014150Z         %57 = arith.select %56, %arg4, %53 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:23:41.1014382Z         %58 = arith.subf %arg4, %57 : tensor<8xf32>
2026-02-21T08:23:41.1014752Z         %59 = tt.extern_elementwise %58 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:23:41.1015213Z         %60 = arith.mulf %arg5, %59 : tensor<8xf32>
2026-02-21T08:23:41.1015479Z         %61 = tt.expand_dims %57 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:41.1015770Z         %62 = arith.extf %46 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T08:23:41.1016051Z         %63 = tt.broadcast %61 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T08:23:41.1016309Z         %64 = arith.subf %62, %63 : tensor<8x1024xf32>
2026-02-21T08:23:41.1016697Z         %65 = tt.extern_elementwise %64 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T08:23:41.1017142Z         %66 = arith.select %48, %65, %cst : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T08:23:41.1017422Z         %67 = "tt.reduce"(%66) <{axis = 1 : i32}> ({
2026-02-21T08:23:41.1017628Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:23:41.1017838Z           %98 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:23:41.1018037Z           tt.reduce.return %98 : f32
2026-02-21T08:23:41.1018259Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T08:23:41.1018469Z         %68 = arith.addf %60, %67 : tensor<8xf32>
2026-02-21T08:23:41.1018677Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:23:41.1018871Z         %69 = arith.muli %c1024_i32, %c1_i32 : i32
2026-02-21T08:23:41.1019075Z         %70 = arith.addi %arg3, %69 : i32
2026-02-21T08:23:41.1019326Z         %71 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T08:23:41.1019588Z         %72 = tt.splat %70 : i32 -> tensor<1024xi32>
2026-02-21T08:23:41.1019802Z         %73 = arith.addi %72, %71 : tensor<1024xi32>
2026-02-21T08:23:41.1020023Z         %74 = arith.cmpi slt, %73, %cst_1 : tensor<1024xi32>
2026-02-21T08:23:41.1020338Z         %75 = tt.descriptor_load %0[%3, %70] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T08:23:41.1020691Z         %76 = tt.expand_dims %74 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T08:23:41.1021002Z         %77 = tt.broadcast %76 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T08:23:41.1021296Z         %78 = arith.select %77, %75, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T08:23:41.1021640Z         %79 = arith.extf %78 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T08:23:41.1021887Z         %80 = "tt.reduce"(%79) <{axis = 1 : i32}> ({
2026-02-21T08:23:41.1022083Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:23:41.1022286Z           %98 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:23:41.1022484Z           tt.reduce.return %98 : f32
2026-02-21T08:23:41.1022687Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T08:23:41.1022927Z         %81 = arith.truncf %80 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:23:41.1023181Z         %82 = arith.extf %81 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:23:41.1023430Z         %83 = arith.cmpf ogt, %57, %82 : tensor<8xf32>
2026-02-21T08:23:41.1023638Z         %84 = arith.cmpf une, %57, %57 : tensor<8xf32>
2026-02-21T08:23:41.1023846Z         %85 = arith.ori %83, %84 : tensor<8xi1>
2026-02-21T08:23:41.1024141Z         %86 = arith.select %85, %57, %82 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:23:41.1024370Z         %87 = arith.subf %57, %86 : tensor<8xf32>
2026-02-21T08:23:41.1024724Z         %88 = tt.extern_elementwise %87 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:23:41.1025071Z         %89 = arith.mulf %68, %88 : tensor<8xf32>
2026-02-21T08:23:41.1025316Z         %90 = tt.expand_dims %86 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:41.1025594Z         %91 = arith.extf %75 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T08:23:41.1025853Z         %92 = tt.broadcast %90 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T08:23:41.1026085Z         %93 = arith.subf %91, %92 : tensor<8x1024xf32>
2026-02-21T08:23:41.1026502Z         %94 = tt.extern_elementwise %93 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T08:23:41.1026918Z         %95 = arith.select %77, %94, %cst : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T08:23:41.1027164Z         %96 = "tt.reduce"(%95) <{axis = 1 : i32}> ({
2026-02-21T08:23:41.1027359Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:23:41.1027536Z           %98 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:23:41.1027728Z           tt.reduce.return %98 : f32
2026-02-21T08:23:41.1027920Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T08:23:41.1028113Z         %97 = arith.addf %89, %96 : tensor<8xf32>
2026-02-21T08:23:41.1028354Z         scf.yield %86, %97 : tensor<8xf32>, tensor<8xf32>
2026-02-21T08:23:41.1028573Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:23:41.1028828Z       %5 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T08:23:41.1029087Z       %6 = tt.splat %c4096_i32_4 : i32 -> tensor<1024xi32>
2026-02-21T08:23:41.1029304Z       %7 = arith.addi %6, %5 : tensor<1024xi32>
2026-02-21T08:23:41.1029514Z       %8 = arith.cmpi slt, %7, %cst_1 : tensor<1024xi32>
2026-02-21T08:23:41.1029832Z       %9 = tt.descriptor_load %0[%3, %c4096_i32_4] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T08:23:41.1030194Z       %10 = tt.expand_dims %8 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T08:23:41.1030484Z       %11 = tt.broadcast %10 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T08:23:41.1030761Z       %12 = arith.select %11, %9, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T08:23:41.1031031Z       %13 = arith.extf %12 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T08:23:41.1031263Z       %14 = "tt.reduce"(%13) <{axis = 1 : i32}> ({
2026-02-21T08:23:41.1031453Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:23:41.1031659Z         %42 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:23:41.1031855Z         tt.reduce.return %42 : f32
2026-02-21T08:23:41.1032035Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T08:23:41.1032263Z       %15 = arith.truncf %14 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:23:41.1032493Z       %16 = arith.extf %15 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:23:41.1032719Z       %17 = arith.cmpf ogt, %4#0, %16 : tensor<8xf32>
2026-02-21T08:23:41.1032931Z       %18 = arith.cmpf une, %4#0, %4#0 : tensor<8xf32>
2026-02-21T08:23:41.1033136Z       %19 = arith.ori %17, %18 : tensor<8xi1>
2026-02-21T08:23:41.1033360Z       %20 = arith.select %19, %4#0, %16 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:23:41.1033584Z       %21 = arith.subf %4#0, %20 : tensor<8xf32>
2026-02-21T08:23:41.1033937Z       %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:23:41.1034282Z       %23 = arith.mulf %4#1, %22 : tensor<8xf32>
2026-02-21T08:23:41.1034531Z       %24 = tt.expand_dims %20 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:41.1034814Z       %25 = arith.extf %9 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T08:23:41.1035118Z       %26 = tt.broadcast %24 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T08:23:41.1035351Z       %27 = arith.subf %25, %26 : tensor<8x1024xf32>
2026-02-21T08:23:41.1035709Z       %28 = tt.extern_elementwise %27 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T08:23:41.1036118Z       %29 = arith.select %11, %28, %cst : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T08:23:41.1036359Z       %30 = "tt.reduce"(%29) <{axis = 1 : i32}> ({
2026-02-21T08:23:41.1036550Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:23:41.1036731Z         %42 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:23:41.1036914Z         tt.reduce.return %42 : f32
2026-02-21T08:23:41.1037100Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T08:23:41.1037290Z       %31 = arith.addf %23, %30 : tensor<8xf32>
2026-02-21T08:23:41.1037487Z       %c4096_i32_5 = arith.constant 4096 : i32
2026-02-21T08:23:41.1037720Z       %c2048_i32_6 = arith.constant 2048 : i32
2026-02-21T08:23:41.1037961Z       scf.for %arg3 = %c0_i32 to %c4096_i32_5 step %c2048_i32_6  : i32 {
2026-02-21T08:23:41.1038292Z         %42 = tt.descriptor_load %0[%3, %arg3] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T08:23:41.1038627Z         %43 = tt.expand_dims %20 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:41.1038918Z         %44 = arith.extf %42 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T08:23:41.1039168Z         %45 = tt.broadcast %43 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T08:23:41.1039406Z         %46 = arith.subf %44, %45 : tensor<8x1024xf32>
2026-02-21T08:23:41.1039770Z         %47 = tt.extern_elementwise %46 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T08:23:41.1040182Z         %48 = tt.expand_dims %31 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:41.1040470Z         %49 = tt.broadcast %48 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T08:23:41.1040704Z         %50 = arith.divf %47, %49 : tensor<8x1024xf32>
2026-02-21T08:23:41.1040946Z         %51 = arith.truncf %50 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T08:23:41.1041265Z         tt.descriptor_store %1[%3, %arg3], %51 : !tt.tensordesc<tensor<8x1024xf16>>, tensor<8x1024xf16>
2026-02-21T08:23:41.1041581Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:23:41.1041780Z         %52 = arith.muli %c1024_i32, %c1_i32 : i32
2026-02-21T08:23:41.1041971Z         %53 = arith.addi %arg3, %52 : i32
2026-02-21T08:23:41.1042246Z         %54 = tt.descriptor_load %0[%3, %53] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T08:23:41.1042576Z         %55 = tt.expand_dims %20 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:41.1042863Z         %56 = arith.extf %54 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T08:23:41.1043114Z         %57 = tt.broadcast %55 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T08:23:41.1043354Z         %58 = arith.subf %56, %57 : tensor<8x1024xf32>
2026-02-21T08:23:41.1043728Z         %59 = tt.extern_elementwise %58 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T08:23:41.1044138Z         %60 = tt.expand_dims %31 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:41.1044423Z         %61 = tt.broadcast %60 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T08:23:41.1044651Z         %62 = arith.divf %59, %61 : tensor<8x1024xf32>
2026-02-21T08:23:41.1044891Z         %63 = arith.truncf %62 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T08:23:41.1045206Z         tt.descriptor_store %1[%3, %53], %63 : !tt.tensordesc<tensor<8x1024xf16>>, tensor<8x1024xf16>
2026-02-21T08:23:41.1045488Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:23:41.1045794Z       %32 = tt.descriptor_load %0[%3, %c4096_i32_5] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T08:23:41.1046222Z       %33 = tt.expand_dims %20 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:41.1046522Z       %34 = arith.extf %32 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T08:23:41.1046777Z       %35 = tt.broadcast %33 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T08:23:41.1047018Z       %36 = arith.subf %34, %35 : tensor<8x1024xf32>
2026-02-21T08:23:41.1047393Z       %37 = tt.extern_elementwise %36 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T08:23:41.1047808Z       %38 = tt.expand_dims %31 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:23:41.1048098Z       %39 = tt.broadcast %38 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T08:23:41.1048331Z       %40 = arith.divf %37, %39 : tensor<8x1024xf32>
2026-02-21T08:23:41.1048576Z       %41 = arith.truncf %40 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T08:23:41.1048963Z       tt.descriptor_store %1[%3, %c4096_i32_5], %41 : !tt.tensordesc<tensor<8x1024xf16>>, tensor<8x1024xf16>
2026-02-21T08:23:41.1049334Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:23:41.1049600Z     tt.return
2026-02-21T08:23:41.1049730Z   }
2026-02-21T08:23:41.1049862Z }
2026-02-21T08:23:41.1049934Z 
2026-02-21T08:23:41.1049986Z {-#
2026-02-21T08:23:41.1050127Z   external_resources: {
2026-02-21T08:23:41.1050291Z     mlir_reproducer: {
2026-02-21T08:23:41.1054755Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:23:41.1059312Z       disable_threading: false,
2026-02-21T08:23:41.1059487Z       verify_each: true
2026-02-21T08:23:41.1059631Z     }
2026-02-21T08:23:41.1059759Z   }
2026-02-21T08:23:41.1059875Z #-}
2026-02-21T08:23:41.1060311Z /tmp/torchinductor_root/7t/c7toaxgqpp4rbcfrjffndnwgmleknkeskrx3lz6pdu55gkgvczik.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:23:41.1061490Z /tmp/torchinductor_root/7t/c7toaxgqpp4rbcfrjffndnwgmleknkeskrx3lz6pdu55gkgvczik.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:23:41.1062549Z [50s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:23:41.1063660Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', ''], maxnreg=32, num_sm_multiplier=64, num_stages=5, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:23:41.1064653Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:23:41.1064906Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:23:44.0956861Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 17.0 configs/s
2026-02-21T08:23:48.1268576Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 272.8 configs/s
2026-02-21T08:23:48.3921986Z [58s] Generation 1 complete: 
2026-02-21T08:23:48.3926699Z error=2
2026-02-21T08:23:48.3931897Z ok=88
2026-02-21T08:23:48.3933962Z min=0.0266
2026-02-21T08:23:48.3934119Z mid=0.0389
2026-02-21T08:23:48.3934249Z max=0.1720
2026-02-21T08:23:48.3934387Z best={'block_sizes': [1, 8192],
2026-02-21T08:23:48.3934626Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:23:48.3934862Z  'load_eviction_policies': ['', 'last'],
2026-02-21T08:23:48.3935054Z  'num_stages': 7,
2026-02-21T08:23:48.3935202Z  'num_warps': 4,
2026-02-21T08:23:48.3935339Z  'pid_type': 'flat',
2026-02-21T08:23:48.3935499Z  'range_flattens': [None, True],
2026-02-21T08:23:48.3935674Z  'range_multi_buffers': [None, True],
2026-02-21T08:23:48.3935861Z  'range_num_stages': [0, 4],
2026-02-21T08:23:48.3936022Z  'range_unroll_factors': [0, 0],
2026-02-21T08:23:48.3936202Z  'range_warp_specializes': [None, True]}
2026-02-21T08:23:48.3936431Z [58s] Fitting surrogate: 190 points, 190 targets
2026-02-21T08:23:49.4434000Z [59s] Generation 2 starting: 76 neighbors, 5 active search path(s)
2026-02-21T08:24:16.7875983Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 0.7 configs/s
2026-02-21T08:24:21.2249583Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:24:21.2254059Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:24:21.2255567Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x256xf16>
2026-02-21T08:24:21.2255857Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:24:21.2256060Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:24:21.2256246Z     %c592_i32 = arith.constant 592 : i32
2026-02-21T08:24:21.2256468Z     %cst_0 = arith.constant dense<4224> : tensor<16x1xi32>
2026-02-21T08:24:21.2256756Z     %cst_1 = arith.constant dense<0.000000e+00> : tensor<16x256xf32>
2026-02-21T08:24:21.2257050Z     %cst_2 = arith.constant dense<0xFC00> : tensor<16x256xf16>
2026-02-21T08:24:21.2257294Z     %cst_3 = arith.constant dense<4224> : tensor<256xi32>
2026-02-21T08:24:21.2257547Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<16xf32>
2026-02-21T08:24:21.2257795Z     %cst_5 = arith.constant dense<0xFF800000> : tensor<16xf32>
2026-02-21T08:24:21.2258027Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:24:21.2258223Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:24:21.2258408Z     %c4224_i32 = arith.constant 4224 : i32
2026-02-21T08:24:21.2258598Z     %c4224_i64 = arith.constant 4224 : i64
2026-02-21T08:24:21.2258776Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:24:21.2259103Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4224_i32], [%c4224_i64, %c1_i64] : <f16>, <tensor<16x256xf16>>
2026-02-21T08:24:21.2259427Z     %1 = tt.get_program_id x : i32
2026-02-21T08:24:21.2259642Z     scf.for %arg2 = %1 to %c256_i32 step %c592_i32  : i32 {
2026-02-21T08:24:21.2260231Z       %2 = arith.muli %arg2, %c16_i32 : i32
2026-02-21T08:24:21.2260469Z       %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:24:21.2260724Z       %4 = tt.splat %2 : i32 -> tensor<16xi32>
2026-02-21T08:24:21.2260917Z       %5 = arith.addi %4, %3 : tensor<16xi32>
2026-02-21T08:24:21.2261114Z       %c4096_i32_6 = arith.constant 4096 : i32
2026-02-21T08:24:21.2261300Z       %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:24:21.2262015Z       %6:2 = scf.for %arg3 = %c0_i32 to %c4096_i32_6 step %c1024_i32 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<16xf32>, tensor<16xf32>)  : i32 {
2026-02-21T08:24:21.2262448Z         %60 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:24:21.2262707Z         %61 = tt.splat %arg3 : i32 -> tensor<256xi32>
2026-02-21T08:24:21.2262924Z         %62 = arith.addi %61, %60 : tensor<256xi32>
2026-02-21T08:24:21.2263144Z         %63 = arith.cmpi slt, %62, %cst_3 : tensor<256xi32>
2026-02-21T08:24:21.2263630Z         %64 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<16x256xf16>> -> tensor<16x256xf16>
2026-02-21T08:24:21.2263996Z         %65 = tt.expand_dims %63 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:24:21.2264293Z         %66 = tt.broadcast %65 : tensor<1x256xi1> -> tensor<16x256xi1>
2026-02-21T08:24:21.2264582Z         %67 = arith.select %66, %64, %cst_2 : tensor<16x256xi1>, tensor<16x256xf16>
2026-02-21T08:24:21.2264863Z         %68 = arith.extf %67 : tensor<16x256xf16> to tensor<16x256xf32>
2026-02-21T08:24:21.2265102Z         %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({
2026-02-21T08:24:21.2265293Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:21.2265488Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:24:21.2265698Z           tt.reduce.return %174 : f32
2026-02-21T08:24:21.2265886Z         }) : (tensor<16x256xf32>) -> tensor<16xf32>
2026-02-21T08:24:21.2266162Z         %70 = arith.truncf %69 : tensor<16xf32> to tensor<16xf16>
2026-02-21T08:24:21.2266413Z         %71 = arith.extf %70 : tensor<16xf16> to tensor<16xf32>
2026-02-21T08:24:21.2266641Z         %72 = arith.cmpf ogt, %arg4, %71 : tensor<16xf32>
2026-02-21T08:24:21.2266871Z         %73 = arith.cmpf une, %arg4, %arg4 : tensor<16xf32>
2026-02-21T08:24:21.2267081Z         %74 = arith.ori %72, %73 : tensor<16xi1>
2026-02-21T08:24:21.2267321Z         %75 = arith.select %74, %arg4, %71 : tensor<16xi1>, tensor<16xf32>
2026-02-21T08:24:21.2267568Z         %76 = arith.subf %arg4, %75 : tensor<16xf32>
2026-02-21T08:24:21.2267935Z         %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T08:24:21.2268309Z         %78 = arith.mulf %arg5, %77 : tensor<16xf32>
2026-02-21T08:24:21.2268558Z         %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:24:21.2268857Z         %80 = arith.extf %64 : tensor<16x256xf16> to tensor<16x256xf32>
2026-02-21T08:24:21.2269156Z         %81 = tt.broadcast %79 : tensor<16x1xf32> -> tensor<16x256xf32>
2026-02-21T08:24:21.2269391Z         %82 = arith.subf %80, %81 : tensor<16x256xf32>
2026-02-21T08:24:21.2269759Z         %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32>
2026-02-21T08:24:21.2270169Z         %84 = arith.select %66, %83, %cst_1 : tensor<16x256xi1>, tensor<16x256xf32>
2026-02-21T08:24:21.2270422Z         %85 = "tt.reduce"(%84) <{axis = 1 : i32}> ({
2026-02-21T08:24:21.2270612Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:21.2270801Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:24:21.2270995Z           tt.reduce.return %174 : f32
2026-02-21T08:24:21.2271180Z         }) : (tensor<16x256xf32>) -> tensor<16xf32>
2026-02-21T08:24:21.2271382Z         %86 = arith.addf %78, %85 : tensor<16xf32>
2026-02-21T08:24:21.2271618Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:24:21.2271819Z         %87 = arith.muli %c256_i32, %c1_i32 : i32
2026-02-21T08:24:21.2272099Z         %88 = arith.addi %arg3, %87 : i32
2026-02-21T08:24:21.2272334Z         %89 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:24:21.2272589Z         %90 = tt.splat %88 : i32 -> tensor<256xi32>
2026-02-21T08:24:21.2272782Z         %91 = arith.addi %90, %89 : tensor<256xi32>
2026-02-21T08:24:21.2272998Z         %92 = arith.cmpi slt, %91, %cst_3 : tensor<256xi32>
2026-02-21T08:24:21.2273295Z         %93 = tt.descriptor_load %0[%2, %88] : !tt.tensordesc<tensor<16x256xf16>> -> tensor<16x256xf16>
2026-02-21T08:24:21.2273640Z         %94 = tt.expand_dims %92 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:24:21.2273925Z         %95 = tt.broadcast %94 : tensor<1x256xi1> -> tensor<16x256xi1>
2026-02-21T08:24:21.2274198Z         %96 = arith.select %95, %93, %cst_2 : tensor<16x256xi1>, tensor<16x256xf16>
2026-02-21T08:24:21.2274543Z         %97 = arith.extf %96 : tensor<16x256xf16> to tensor<16x256xf32>
2026-02-21T08:24:21.2274778Z         %98 = "tt.reduce"(%97) <{axis = 1 : i32}> ({
2026-02-21T08:24:21.2274969Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:21.2275151Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:24:21.2275344Z           tt.reduce.return %174 : f32
2026-02-21T08:24:21.2275525Z         }) : (tensor<16x256xf32>) -> tensor<16xf32>
2026-02-21T08:24:21.2275750Z         %99 = arith.truncf %98 : tensor<16xf32> to tensor<16xf16>
2026-02-21T08:24:21.2275993Z         %100 = arith.extf %99 : tensor<16xf16> to tensor<16xf32>
2026-02-21T08:24:21.2276221Z         %101 = arith.cmpf ogt, %75, %100 : tensor<16xf32>
2026-02-21T08:24:21.2276443Z         %102 = arith.cmpf une, %75, %75 : tensor<16xf32>
2026-02-21T08:24:21.2276647Z         %103 = arith.ori %101, %102 : tensor<16xi1>
2026-02-21T08:24:21.2276883Z         %104 = arith.select %103, %75, %100 : tensor<16xi1>, tensor<16xf32>
2026-02-21T08:24:21.2277126Z         %105 = arith.subf %75, %104 : tensor<16xf32>
2026-02-21T08:24:21.2277514Z         %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T08:24:21.2277907Z         %107 = arith.mulf %86, %106 : tensor<16xf32>
2026-02-21T08:24:21.2278168Z         %108 = tt.expand_dims %104 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:24:21.2278496Z         %109 = arith.extf %93 : tensor<16x256xf16> to tensor<16x256xf32>
2026-02-21T08:24:21.2278767Z         %110 = tt.broadcast %108 : tensor<16x1xf32> -> tensor<16x256xf32>
2026-02-21T08:24:21.2279028Z         %111 = arith.subf %109, %110 : tensor<16x256xf32>
2026-02-21T08:24:21.2279428Z         %112 = tt.extern_elementwise %111 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32>
2026-02-21T08:24:21.2279866Z         %113 = arith.select %95, %112, %cst_1 : tensor<16x256xi1>, tensor<16x256xf32>
2026-02-21T08:24:21.2280143Z         %114 = "tt.reduce"(%113) <{axis = 1 : i32}> ({
2026-02-21T08:24:21.2280341Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:21.2280530Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:24:21.2280727Z           tt.reduce.return %174 : f32
2026-02-21T08:24:21.2280920Z         }) : (tensor<16x256xf32>) -> tensor<16xf32>
2026-02-21T08:24:21.2281130Z         %115 = arith.addf %107, %114 : tensor<16xf32>
2026-02-21T08:24:21.2281324Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:24:21.2281519Z         %116 = arith.muli %c256_i32, %c2_i32 : i32
2026-02-21T08:24:21.2281753Z         %117 = arith.addi %arg3, %116 : i32
2026-02-21T08:24:21.2281992Z         %118 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:24:21.2282242Z         %119 = tt.splat %117 : i32 -> tensor<256xi32>
2026-02-21T08:24:21.2282454Z         %120 = arith.addi %119, %118 : tensor<256xi32>
2026-02-21T08:24:21.2282669Z         %121 = arith.cmpi slt, %120, %cst_3 : tensor<256xi32>
2026-02-21T08:24:21.2282987Z         %122 = tt.descriptor_load %0[%2, %117] : !tt.tensordesc<tensor<16x256xf16>> -> tensor<16x256xf16>
2026-02-21T08:24:21.2283404Z         %123 = tt.expand_dims %121 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:24:21.2283698Z         %124 = tt.broadcast %123 : tensor<1x256xi1> -> tensor<16x256xi1>
2026-02-21T08:24:21.2283982Z         %125 = arith.select %124, %122, %cst_2 : tensor<16x256xi1>, tensor<16x256xf16>
2026-02-21T08:24:21.2284267Z         %126 = arith.extf %125 : tensor<16x256xf16> to tensor<16x256xf32>
2026-02-21T08:24:21.2284510Z         %127 = "tt.reduce"(%126) <{axis = 1 : i32}> ({
2026-02-21T08:24:21.2284705Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:21.2284884Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:24:21.2285078Z           tt.reduce.return %174 : f32
2026-02-21T08:24:21.2285260Z         }) : (tensor<16x256xf32>) -> tensor<16xf32>
2026-02-21T08:24:21.2285493Z         %128 = arith.truncf %127 : tensor<16xf32> to tensor<16xf16>
2026-02-21T08:24:21.2285793Z         %129 = arith.extf %128 : tensor<16xf16> to tensor<16xf32>
2026-02-21T08:24:21.2286034Z         %130 = arith.cmpf ogt, %104, %129 : tensor<16xf32>
2026-02-21T08:24:21.2286256Z         %131 = arith.cmpf une, %104, %104 : tensor<16xf32>
2026-02-21T08:24:21.2286459Z         %132 = arith.ori %130, %131 : tensor<16xi1>
2026-02-21T08:24:21.2286698Z         %133 = arith.select %132, %104, %129 : tensor<16xi1>, tensor<16xf32>
2026-02-21T08:24:21.2286940Z         %134 = arith.subf %104, %133 : tensor<16xf32>
2026-02-21T08:24:21.2287299Z         %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T08:24:21.2287661Z         %136 = arith.mulf %115, %135 : tensor<16xf32>
2026-02-21T08:24:21.2287920Z         %137 = tt.expand_dims %133 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:24:21.2288214Z         %138 = arith.extf %122 : tensor<16x256xf16> to tensor<16x256xf32>
2026-02-21T08:24:21.2288480Z         %139 = tt.broadcast %137 : tensor<16x1xf32> -> tensor<16x256xf32>
2026-02-21T08:24:21.2288734Z         %140 = arith.subf %138, %139 : tensor<16x256xf32>
2026-02-21T08:24:21.2289099Z         %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32>
2026-02-21T08:24:21.2289520Z         %142 = arith.select %124, %141, %cst_1 : tensor<16x256xi1>, tensor<16x256xf32>
2026-02-21T08:24:21.2289782Z         %143 = "tt.reduce"(%142) <{axis = 1 : i32}> ({
2026-02-21T08:24:21.2289970Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:21.2290152Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:24:21.2290336Z           tt.reduce.return %174 : f32
2026-02-21T08:24:21.2290523Z         }) : (tensor<16x256xf32>) -> tensor<16xf32>
2026-02-21T08:24:21.2290717Z         %144 = arith.addf %136, %143 : tensor<16xf32>
2026-02-21T08:24:21.2290915Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T08:24:21.2291104Z         %145 = arith.muli %c256_i32, %c3_i32 : i32
2026-02-21T08:24:21.2291306Z         %146 = arith.addi %arg3, %145 : i32
2026-02-21T08:24:21.2291581Z         %147 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:24:21.2291835Z         %148 = tt.splat %146 : i32 -> tensor<256xi32>
2026-02-21T08:24:21.2292043Z         %149 = arith.addi %148, %147 : tensor<256xi32>
2026-02-21T08:24:21.2292259Z         %150 = arith.cmpi slt, %149, %cst_3 : tensor<256xi32>
2026-02-21T08:24:21.2292572Z         %151 = tt.descriptor_load %0[%2, %146] : !tt.tensordesc<tensor<16x256xf16>> -> tensor<16x256xf16>
2026-02-21T08:24:21.2292916Z         %152 = tt.expand_dims %150 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:24:21.2293222Z         %153 = tt.broadcast %152 : tensor<1x256xi1> -> tensor<16x256xi1>
2026-02-21T08:24:21.2293516Z         %154 = arith.select %153, %151, %cst_2 : tensor<16x256xi1>, tensor<16x256xf16>
2026-02-21T08:24:21.2293811Z         %155 = arith.extf %154 : tensor<16x256xf16> to tensor<16x256xf32>
2026-02-21T08:24:21.2294153Z         %156 = "tt.reduce"(%155) <{axis = 1 : i32}> ({
2026-02-21T08:24:21.2294340Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:21.2294528Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:24:21.2294715Z           tt.reduce.return %174 : f32
2026-02-21T08:24:21.2294903Z         }) : (tensor<16x256xf32>) -> tensor<16xf32>
2026-02-21T08:24:21.2295137Z         %157 = arith.truncf %156 : tensor<16xf32> to tensor<16xf16>
2026-02-21T08:24:21.2295387Z         %158 = arith.extf %157 : tensor<16xf16> to tensor<16xf32>
2026-02-21T08:24:21.2295629Z         %159 = arith.cmpf ogt, %133, %158 : tensor<16xf32>
2026-02-21T08:24:21.2295845Z         %160 = arith.cmpf une, %133, %133 : tensor<16xf32>
2026-02-21T08:24:21.2296058Z         %161 = arith.ori %159, %160 : tensor<16xi1>
2026-02-21T08:24:21.2296288Z         %162 = arith.select %161, %133, %158 : tensor<16xi1>, tensor<16xf32>
2026-02-21T08:24:21.2296595Z         %163 = arith.subf %133, %162 : tensor<16xf32>
2026-02-21T08:24:21.2296963Z         %164 = tt.extern_elementwise %163 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T08:24:21.2297318Z         %165 = arith.mulf %144, %164 : tensor<16xf32>
2026-02-21T08:24:21.2297575Z         %166 = tt.expand_dims %162 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:24:21.2297865Z         %167 = arith.extf %151 : tensor<16x256xf16> to tensor<16x256xf32>
2026-02-21T08:24:21.2298136Z         %168 = tt.broadcast %166 : tensor<16x1xf32> -> tensor<16x256xf32>
2026-02-21T08:24:21.2298390Z         %169 = arith.subf %167, %168 : tensor<16x256xf32>
2026-02-21T08:24:21.2298776Z         %170 = tt.extern_elementwise %169 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32>
2026-02-21T08:24:21.2299213Z         %171 = arith.select %153, %170, %cst_1 : tensor<16x256xi1>, tensor<16x256xf32>
2026-02-21T08:24:21.2299484Z         %172 = "tt.reduce"(%171) <{axis = 1 : i32}> ({
2026-02-21T08:24:21.2299690Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:21.2299875Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:24:21.2300076Z           tt.reduce.return %174 : f32
2026-02-21T08:24:21.2300271Z         }) : (tensor<16x256xf32>) -> tensor<16xf32>
2026-02-21T08:24:21.2300477Z         %173 = arith.addf %165, %172 : tensor<16xf32>
2026-02-21T08:24:21.2300706Z         scf.yield %162, %173 : tensor<16xf32>, tensor<16xf32>
2026-02-21T08:24:21.2300928Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:24:21.2301179Z       %7 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:24:21.2301446Z       %8 = tt.splat %c4096_i32_6 : i32 -> tensor<256xi32>
2026-02-21T08:24:21.2301696Z       %9 = arith.addi %8, %7 : tensor<256xi32>
2026-02-21T08:24:21.2301918Z       %10 = arith.cmpi slt, %9, %cst_3 : tensor<256xi32>
2026-02-21T08:24:21.2302247Z       %11 = tt.descriptor_load %0[%2, %c4096_i32_6] : !tt.tensordesc<tensor<16x256xf16>> -> tensor<16x256xf16>
2026-02-21T08:24:21.2302624Z       %12 = tt.expand_dims %10 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:24:21.2302920Z       %13 = tt.broadcast %12 : tensor<1x256xi1> -> tensor<16x256xi1>
2026-02-21T08:24:21.2303206Z       %14 = arith.select %13, %11, %cst_2 : tensor<16x256xi1>, tensor<16x256xf16>
2026-02-21T08:24:21.2303488Z       %15 = arith.extf %14 : tensor<16x256xf16> to tensor<16x256xf32>
2026-02-21T08:24:21.2303733Z       %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({
2026-02-21T08:24:21.2303942Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:24:21.2304130Z         %60 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:24:21.2304336Z         tt.reduce.return %60 : f32
2026-02-21T08:24:21.2304527Z       }) : (tensor<16x256xf32>) -> tensor<16xf32>
2026-02-21T08:24:21.2304764Z       %17 = arith.truncf %16 : tensor<16xf32> to tensor<16xf16>
2026-02-21T08:24:21.2305016Z       %18 = arith.extf %17 : tensor<16xf16> to tensor<16xf32>
2026-02-21T08:24:21.2305316Z       %19 = arith.cmpf ogt, %6#0, %18 : tensor<16xf32>
2026-02-21T08:24:21.2305545Z       %20 = arith.cmpf une, %6#0, %6#0 : tensor<16xf32>
2026-02-21T08:24:21.2305754Z       %21 = arith.ori %19, %20 : tensor<16xi1>
2026-02-21T08:24:21.2306002Z       %22 = arith.select %21, %6#0, %18 : tensor<16xi1>, tensor<16xf32>
2026-02-21T08:24:21.2306249Z       %23 = arith.subf %6#0, %22 : tensor<16xf32>
2026-02-21T08:24:21.2306628Z       %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T08:24:21.2306976Z       %25 = arith.mulf %6#1, %24 : tensor<16xf32>
2026-02-21T08:24:21.2307227Z       %26 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:24:21.2307514Z       %27 = arith.extf %11 : tensor<16x256xf16> to tensor<16x256xf32>
2026-02-21T08:24:21.2307768Z       %28 = tt.broadcast %26 : tensor<16x1xf32> -> tensor<16x256xf32>
2026-02-21T08:24:21.2308052Z       %29 = arith.subf %27, %28 : tensor<16x256xf32>
2026-02-21T08:24:21.2308410Z       %30 = tt.extern_elementwise %29 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32>
2026-02-21T08:24:21.2308810Z       %31 = arith.select %13, %30, %cst_1 : tensor<16x256xi1>, tensor<16x256xf32>
2026-02-21T08:24:21.2309059Z       %32 = "tt.reduce"(%31) <{axis = 1 : i32}> ({
2026-02-21T08:24:21.2309244Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:24:21.2309426Z         %60 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:24:21.2309607Z         tt.reduce.return %60 : f32
2026-02-21T08:24:21.2309791Z       }) : (tensor<16x256xf32>) -> tensor<16xf32>
2026-02-21T08:24:21.2309981Z       %33 = arith.addf %25, %32 : tensor<16xf32>
2026-02-21T08:24:21.2310180Z       %c4096_i32_7 = arith.constant 4096 : i32
2026-02-21T08:24:21.2310367Z       %c1024_i32_8 = arith.constant 1024 : i32
2026-02-21T08:24:21.2310604Z       scf.for %arg3 = %c0_i32 to %c4096_i32_7 step %c1024_i32_8  : i32 {
2026-02-21T08:24:21.2310889Z         %60 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:24:21.2311139Z         %61 = tt.splat %arg3 : i32 -> tensor<256xi32>
2026-02-21T08:24:21.2311348Z         %62 = arith.addi %61, %60 : tensor<256xi32>
2026-02-21T08:24:21.2311598Z         %63 = arith.cmpi slt, %62, %cst_3 : tensor<256xi32>
2026-02-21T08:24:21.2311870Z         %64 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T08:24:21.2312131Z         %65 = arith.muli %64, %cst_0 : tensor<16x1xi32>
2026-02-21T08:24:21.2312395Z         %66 = tt.expand_dims %62 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32>
2026-02-21T08:24:21.2312689Z         %67 = tt.broadcast %65 : tensor<16x1xi32> -> tensor<16x256xi32>
2026-02-21T08:24:21.2312947Z         %68 = tt.broadcast %66 : tensor<1x256xi32> -> tensor<16x256xi32>
2026-02-21T08:24:21.2313185Z         %69 = arith.addi %67, %68 : tensor<16x256xi32>
2026-02-21T08:24:21.2313423Z         %70 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2313709Z         %71 = tt.addptr %70, %69 : tensor<16x256x!tt.ptr<f16>>, tensor<16x256xi32>
2026-02-21T08:24:21.2314008Z         %72 = tt.expand_dims %63 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:24:21.2314287Z         %73 = tt.broadcast %72 : tensor<1x256xi1> -> tensor<16x256xi1>
2026-02-21T08:24:21.2314589Z         %74 = tt.load %71, %73, %cst evictionPolicy = evict_first : tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2314912Z         %75 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:24:21.2315199Z         %76 = arith.extf %74 : tensor<16x256xf16> to tensor<16x256xf32>
2026-02-21T08:24:21.2315455Z         %77 = tt.broadcast %75 : tensor<16x1xf32> -> tensor<16x256xf32>
2026-02-21T08:24:21.2315694Z         %78 = arith.subf %76, %77 : tensor<16x256xf32>
2026-02-21T08:24:21.2316065Z         %79 = tt.extern_elementwise %78 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32>
2026-02-21T08:24:21.2316518Z         %80 = tt.expand_dims %33 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:24:21.2316805Z         %81 = tt.broadcast %80 : tensor<16x1xf32> -> tensor<16x256xf32>
2026-02-21T08:24:21.2317034Z         %82 = arith.divf %79, %81 : tensor<16x256xf32>
2026-02-21T08:24:21.2317270Z         %83 = arith.truncf %82 : tensor<16x256xf32> to tensor<16x256xf16>
2026-02-21T08:24:21.2317546Z         %84 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2317819Z         %85 = tt.addptr %84, %69 : tensor<16x256x!tt.ptr<f16>>, tensor<16x256xi32>
2026-02-21T08:24:21.2318086Z         tt.store %85, %83, %73 : tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2318299Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:24:21.2318495Z         %86 = arith.muli %c256_i32, %c1_i32 : i32
2026-02-21T08:24:21.2318722Z         %87 = arith.addi %arg3, %86 : i32
2026-02-21T08:24:21.2319011Z         %88 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:24:21.2319260Z         %89 = tt.splat %87 : i32 -> tensor<256xi32>
2026-02-21T08:24:21.2319479Z         %90 = arith.addi %89, %88 : tensor<256xi32>
2026-02-21T08:24:21.2319696Z         %91 = arith.cmpi slt, %90, %cst_3 : tensor<256xi32>
2026-02-21T08:24:21.2319970Z         %92 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T08:24:21.2320237Z         %93 = arith.muli %92, %cst_0 : tensor<16x1xi32>
2026-02-21T08:24:21.2320502Z         %94 = tt.expand_dims %90 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32>
2026-02-21T08:24:21.2320789Z         %95 = tt.broadcast %93 : tensor<16x1xi32> -> tensor<16x256xi32>
2026-02-21T08:24:21.2321042Z         %96 = tt.broadcast %94 : tensor<1x256xi32> -> tensor<16x256xi32>
2026-02-21T08:24:21.2321280Z         %97 = arith.addi %95, %96 : tensor<16x256xi32>
2026-02-21T08:24:21.2321513Z         %98 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2321834Z         %99 = tt.addptr %98, %97 : tensor<16x256x!tt.ptr<f16>>, tensor<16x256xi32>
2026-02-21T08:24:21.2322128Z         %100 = tt.expand_dims %91 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:24:21.2322427Z         %101 = tt.broadcast %100 : tensor<1x256xi1> -> tensor<16x256xi1>
2026-02-21T08:24:21.2322741Z         %102 = tt.load %99, %101, %cst evictionPolicy = evict_first : tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2323071Z         %103 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:24:21.2323366Z         %104 = arith.extf %102 : tensor<16x256xf16> to tensor<16x256xf32>
2026-02-21T08:24:21.2323632Z         %105 = tt.broadcast %103 : tensor<16x1xf32> -> tensor<16x256xf32>
2026-02-21T08:24:21.2323884Z         %106 = arith.subf %104, %105 : tensor<16x256xf32>
2026-02-21T08:24:21.2324265Z         %107 = tt.extern_elementwise %106 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32>
2026-02-21T08:24:21.2324690Z         %108 = tt.expand_dims %33 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:24:21.2324988Z         %109 = tt.broadcast %108 : tensor<16x1xf32> -> tensor<16x256xf32>
2026-02-21T08:24:21.2325230Z         %110 = arith.divf %107, %109 : tensor<16x256xf32>
2026-02-21T08:24:21.2325477Z         %111 = arith.truncf %110 : tensor<16x256xf32> to tensor<16x256xf16>
2026-02-21T08:24:21.2325750Z         %112 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2326040Z         %113 = tt.addptr %112, %97 : tensor<16x256x!tt.ptr<f16>>, tensor<16x256xi32>
2026-02-21T08:24:21.2326311Z         tt.store %113, %111, %101 : tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2326522Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:24:21.2326714Z         %114 = arith.muli %c256_i32, %c2_i32 : i32
2026-02-21T08:24:21.2326902Z         %115 = arith.addi %arg3, %114 : i32
2026-02-21T08:24:21.2327141Z         %116 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:24:21.2327440Z         %117 = tt.splat %115 : i32 -> tensor<256xi32>
2026-02-21T08:24:21.2327647Z         %118 = arith.addi %117, %116 : tensor<256xi32>
2026-02-21T08:24:21.2327868Z         %119 = arith.cmpi slt, %118, %cst_3 : tensor<256xi32>
2026-02-21T08:24:21.2328125Z         %120 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T08:24:21.2328396Z         %121 = arith.muli %120, %cst_0 : tensor<16x1xi32>
2026-02-21T08:24:21.2328653Z         %122 = tt.expand_dims %118 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32>
2026-02-21T08:24:21.2328950Z         %123 = tt.broadcast %121 : tensor<16x1xi32> -> tensor<16x256xi32>
2026-02-21T08:24:21.2329222Z         %124 = tt.broadcast %122 : tensor<1x256xi32> -> tensor<16x256xi32>
2026-02-21T08:24:21.2329461Z         %125 = arith.addi %123, %124 : tensor<16x256xi32>
2026-02-21T08:24:21.2329764Z         %126 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2330053Z         %127 = tt.addptr %126, %125 : tensor<16x256x!tt.ptr<f16>>, tensor<16x256xi32>
2026-02-21T08:24:21.2330362Z         %128 = tt.expand_dims %119 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:24:21.2330652Z         %129 = tt.broadcast %128 : tensor<1x256xi1> -> tensor<16x256xi1>
2026-02-21T08:24:21.2330970Z         %130 = tt.load %127, %129, %cst evictionPolicy = evict_first : tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2331311Z         %131 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:24:21.2331645Z         %132 = arith.extf %130 : tensor<16x256xf16> to tensor<16x256xf32>
2026-02-21T08:24:21.2331920Z         %133 = tt.broadcast %131 : tensor<16x1xf32> -> tensor<16x256xf32>
2026-02-21T08:24:21.2332160Z         %134 = arith.subf %132, %133 : tensor<16x256xf32>
2026-02-21T08:24:21.2332543Z         %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32>
2026-02-21T08:24:21.2332973Z         %136 = tt.expand_dims %33 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:24:21.2333267Z         %137 = tt.broadcast %136 : tensor<16x1xf32> -> tensor<16x256xf32>
2026-02-21T08:24:21.2333514Z         %138 = arith.divf %135, %137 : tensor<16x256xf32>
2026-02-21T08:24:21.2333754Z         %139 = arith.truncf %138 : tensor<16x256xf32> to tensor<16x256xf16>
2026-02-21T08:24:21.2334030Z         %140 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2334309Z         %141 = tt.addptr %140, %125 : tensor<16x256x!tt.ptr<f16>>, tensor<16x256xi32>
2026-02-21T08:24:21.2334586Z         tt.store %141, %139, %129 : tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2334805Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T08:24:21.2334993Z         %142 = arith.muli %c256_i32, %c3_i32 : i32
2026-02-21T08:24:21.2335191Z         %143 = arith.addi %arg3, %142 : i32
2026-02-21T08:24:21.2335420Z         %144 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:24:21.2335678Z         %145 = tt.splat %143 : i32 -> tensor<256xi32>
2026-02-21T08:24:21.2335881Z         %146 = arith.addi %145, %144 : tensor<256xi32>
2026-02-21T08:24:21.2336101Z         %147 = arith.cmpi slt, %146, %cst_3 : tensor<256xi32>
2026-02-21T08:24:21.2336366Z         %148 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T08:24:21.2336627Z         %149 = arith.muli %148, %cst_0 : tensor<16x1xi32>
2026-02-21T08:24:21.2336888Z         %150 = tt.expand_dims %146 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32>
2026-02-21T08:24:21.2337180Z         %151 = tt.broadcast %149 : tensor<16x1xi32> -> tensor<16x256xi32>
2026-02-21T08:24:21.2337450Z         %152 = tt.broadcast %150 : tensor<1x256xi32> -> tensor<16x256xi32>
2026-02-21T08:24:21.2337696Z         %153 = arith.addi %151, %152 : tensor<16x256xi32>
2026-02-21T08:24:21.2337935Z         %154 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2338302Z         %155 = tt.addptr %154, %153 : tensor<16x256x!tt.ptr<f16>>, tensor<16x256xi32>
2026-02-21T08:24:21.2338602Z         %156 = tt.expand_dims %147 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:24:21.2338894Z         %157 = tt.broadcast %156 : tensor<1x256xi1> -> tensor<16x256xi1>
2026-02-21T08:24:21.2339202Z         %158 = tt.load %155, %157, %cst evictionPolicy = evict_first : tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2339539Z         %159 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:24:21.2339835Z         %160 = arith.extf %158 : tensor<16x256xf16> to tensor<16x256xf32>
2026-02-21T08:24:21.2340109Z         %161 = tt.broadcast %159 : tensor<16x1xf32> -> tensor<16x256xf32>
2026-02-21T08:24:21.2340368Z         %162 = arith.subf %160, %161 : tensor<16x256xf32>
2026-02-21T08:24:21.2340809Z         %163 = tt.extern_elementwise %162 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32>
2026-02-21T08:24:21.2341249Z         %164 = tt.expand_dims %33 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:24:21.2341604Z         %165 = tt.broadcast %164 : tensor<16x1xf32> -> tensor<16x256xf32>
2026-02-21T08:24:21.2341858Z         %166 = arith.divf %163, %165 : tensor<16x256xf32>
2026-02-21T08:24:21.2342114Z         %167 = arith.truncf %166 : tensor<16x256xf32> to tensor<16x256xf16>
2026-02-21T08:24:21.2342397Z         %168 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2342699Z         %169 = tt.addptr %168, %153 : tensor<16x256x!tt.ptr<f16>>, tensor<16x256xi32>
2026-02-21T08:24:21.2342981Z         tt.store %169, %167, %157 : tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2343220Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:24:21.2343474Z       %34 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:24:21.2343745Z       %35 = tt.splat %c4096_i32_7 : i32 -> tensor<256xi32>
2026-02-21T08:24:21.2343972Z       %36 = arith.addi %35, %34 : tensor<256xi32>
2026-02-21T08:24:21.2344186Z       %37 = arith.cmpi slt, %36, %cst_3 : tensor<256xi32>
2026-02-21T08:24:21.2344458Z       %38 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T08:24:21.2344725Z       %39 = arith.muli %38, %cst_0 : tensor<16x1xi32>
2026-02-21T08:24:21.2344994Z       %40 = tt.expand_dims %36 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32>
2026-02-21T08:24:21.2345304Z       %41 = tt.broadcast %39 : tensor<16x1xi32> -> tensor<16x256xi32>
2026-02-21T08:24:21.2345578Z       %42 = tt.broadcast %40 : tensor<1x256xi32> -> tensor<16x256xi32>
2026-02-21T08:24:21.2345826Z       %43 = arith.addi %41, %42 : tensor<16x256xi32>
2026-02-21T08:24:21.2346073Z       %44 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2346365Z       %45 = tt.addptr %44, %43 : tensor<16x256x!tt.ptr<f16>>, tensor<16x256xi32>
2026-02-21T08:24:21.2346688Z       %46 = tt.expand_dims %37 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:24:21.2346984Z       %47 = tt.broadcast %46 : tensor<1x256xi1> -> tensor<16x256xi1>
2026-02-21T08:24:21.2347302Z       %48 = tt.load %45, %47, %cst evictionPolicy = evict_first : tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2347634Z       %49 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:24:21.2347933Z       %50 = arith.extf %48 : tensor<16x256xf16> to tensor<16x256xf32>
2026-02-21T08:24:21.2348185Z       %51 = tt.broadcast %49 : tensor<16x1xf32> -> tensor<16x256xf32>
2026-02-21T08:24:21.2348422Z       %52 = arith.subf %50, %51 : tensor<16x256xf32>
2026-02-21T08:24:21.2348789Z       %53 = tt.extern_elementwise %52 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32>
2026-02-21T08:24:21.2349189Z       %54 = tt.expand_dims %33 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T08:24:21.2349524Z       %55 = tt.broadcast %54 : tensor<16x1xf32> -> tensor<16x256xf32>
2026-02-21T08:24:21.2349753Z       %56 = arith.divf %53, %55 : tensor<16x256xf32>
2026-02-21T08:24:21.2349987Z       %57 = arith.truncf %56 : tensor<16x256xf32> to tensor<16x256xf16>
2026-02-21T08:24:21.2350255Z       %58 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2350524Z       %59 = tt.addptr %58, %43 : tensor<16x256x!tt.ptr<f16>>, tensor<16x256xi32>
2026-02-21T08:24:21.2350789Z       tt.store %59, %57, %47 : tensor<16x256x!tt.ptr<f16>>
2026-02-21T08:24:21.2350997Z     } {tt.warp_specialize}
2026-02-21T08:24:21.2351159Z     tt.return
2026-02-21T08:24:21.2351284Z   }
2026-02-21T08:24:21.2351409Z }
2026-02-21T08:24:21.2351477Z 
2026-02-21T08:24:21.2351526Z {-#
2026-02-21T08:24:21.2351707Z   external_resources: {
2026-02-21T08:24:21.2351870Z     mlir_reproducer: {
2026-02-21T08:24:21.2356240Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:24:21.2360615Z       disable_threading: false,
2026-02-21T08:24:21.2360786Z       verify_each: true
2026-02-21T08:24:21.2360930Z     }
2026-02-21T08:24:21.2361056Z   }
2026-02-21T08:24:21.2361168Z #-}
2026-02-21T08:24:21.2361636Z /tmp/torchinductor_root/y6/cy6zcomat6qj4f462goe36wtmj6q2ss5acr34lcigmttspyclgjf.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:24:21.2362833Z /tmp/torchinductor_root/y6/cy6zcomat6qj4f462goe36wtmj6q2ss5acr34lcigmttspyclgjf.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:24:21.2363797Z [91s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:24:21.2364859Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 256], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=32, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:24:21.2365817Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:24:21.2366120Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:24:21.5928923Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 16.6 configs/s
2026-02-21T08:24:25.0448346Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 294.1 configs/s
2026-02-21T08:24:25.3202480Z [95s] Generation 2 complete:
2026-02-21T08:24:25.3203739Z error=1
2026-02-21T08:24:25.3203892Z ok=80
2026-02-21T08:24:25.3204026Z min=0.0266
2026-02-21T08:24:25.3204173Z mid=0.0389
2026-02-21T08:24:25.3204305Z max=0.2602
2026-02-21T08:24:25.3204467Z best={'block_sizes': [1, 8192],
2026-02-21T08:24:25.3204769Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:24:25.3209613Z  'load_eviction_policies': ['', 'last'],
2026-02-21T08:24:25.3213474Z  'num_stages': 7,
2026-02-21T08:24:25.3218473Z  'num_warps': 4,
2026-02-21T08:24:25.3219514Z  'pid_type': 'flat',
2026-02-21T08:24:25.3219737Z  'range_flattens': [None, True],
2026-02-21T08:24:25.3219964Z  'range_multi_buffers': [None, True],
2026-02-21T08:24:25.3220183Z  'range_num_stages': [0, 4],
2026-02-21T08:24:25.3220389Z  'range_unroll_factors': [0, 0],
2026-02-21T08:24:25.3220612Z  'range_warp_specializes': [None, True]}
2026-02-21T08:24:25.3224953Z [95s] Fitting surrogate: 271 points, 271 targets
2026-02-21T08:24:26.4773610Z [96s] Generation 3 starting: 73 neighbors, 5 active search path(s)
2026-02-21T08:25:00.3980243Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 0.9 configs/s
2026-02-21T08:25:05.1009789Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 16.3 configs/s
2026-02-21T08:25:07.6613752Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 489.7 configs/s
2026-02-21T08:25:07.8429670Z [137s] Generation 3 complete: 
2026-02-21T08:25:07.8434996Z ok=79
2026-02-21T08:25:07.8440518Z min=0.0205
2026-02-21T08:25:07.8442627Z mid=0.0348
2026-02-21T08:25:07.8442794Z max=0.2365
2026-02-21T08:25:07.8442954Z best={'block_sizes': [1, 8192],
2026-02-21T08:25:07.8443234Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:25:07.8443509Z  'load_eviction_policies': ['', ''],
2026-02-21T08:25:07.8443711Z  'num_sm_multiplier': 32,
2026-02-21T08:25:07.8443879Z  'num_stages': 6,
2026-02-21T08:25:07.8444057Z  'num_warps': 1,
2026-02-21T08:25:07.8444233Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:25:07.8444435Z  'range_flattens': [True, True],
2026-02-21T08:25:07.8444629Z  'range_multi_buffers': [False, None],
2026-02-21T08:25:07.8444819Z  'range_num_stages': [3, 1],
2026-02-21T08:25:07.8445003Z  'range_unroll_factors': [0, 2],
2026-02-21T08:25:07.8445194Z  'range_warp_specializes': [True, None]}
2026-02-21T08:25:07.8449747Z [137s] Fitting surrogate: 350 points, 350 targets
2026-02-21T08:25:09.0011020Z [138s] Generation 4 starting: 77 neighbors, 5 active search path(s)
2026-02-21T08:25:17.0341234Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80/80 10.1 configs/s
2026-02-21T08:25:21.0924122Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:25:21.0929397Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:25:21.0930081Z     %cst = arith.constant dense<0.000000e+00> : tensor<8x512xf16>
2026-02-21T08:25:21.0930408Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:25:21.0930637Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:25:21.0930848Z     %c592_i32 = arith.constant 592 : i32
2026-02-21T08:25:21.0931116Z     %cst_0 = arith.constant dense<4224> : tensor<8x1xi32>
2026-02-21T08:25:21.0931435Z     %cst_1 = arith.constant dense<0.000000e+00> : tensor<8x512xf32>
2026-02-21T08:25:21.0932063Z     %cst_2 = arith.constant dense<0xFC00> : tensor<8x512xf16>
2026-02-21T08:25:21.0932824Z     %cst_3 = arith.constant dense<4224> : tensor<512xi32>
2026-02-21T08:25:21.0933123Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32>
2026-02-21T08:25:21.0933436Z     %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32>
2026-02-21T08:25:21.0933701Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:25:21.0933939Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:25:21.0934159Z     %c4224_i32 = arith.constant 4224 : i32
2026-02-21T08:25:21.0934378Z     %c4224_i64 = arith.constant 4224 : i64
2026-02-21T08:25:21.0934583Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:25:21.0934971Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4224_i32], [%c4224_i64, %c1_i64] : <f16>, <tensor<8x512xf16>>
2026-02-21T08:25:21.0935375Z     %1 = tt.get_program_id x : i32
2026-02-21T08:25:21.0935629Z     scf.for %arg2 = %1 to %c512_i32 step %c592_i32  : i32 {
2026-02-21T08:25:21.0936067Z       %2 = arith.muli %arg2, %c8_i32 : i32
2026-02-21T08:25:21.0936366Z       %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:25:21.0936684Z       %4 = tt.splat %2 : i32 -> tensor<8xi32>
2026-02-21T08:25:21.0936920Z       %5 = arith.addi %4, %3 : tensor<8xi32>
2026-02-21T08:25:21.0937160Z       %c4096_i32_6 = arith.constant 4096 : i32
2026-02-21T08:25:21.0937390Z       %c2048_i32 = arith.constant 2048 : i32
2026-02-21T08:25:21.0937871Z       %6:2 = scf.for %arg3 = %c0_i32 to %c4096_i32_6 step %c2048_i32 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<8xf32>, tensor<8xf32>)  : i32 {
2026-02-21T08:25:21.0938409Z         %60 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:21.0938734Z         %61 = tt.splat %arg3 : i32 -> tensor<512xi32>
2026-02-21T08:25:21.0939005Z         %62 = arith.addi %61, %60 : tensor<512xi32>
2026-02-21T08:25:21.0939274Z         %63 = arith.cmpi slt, %62, %cst_3 : tensor<512xi32>
2026-02-21T08:25:21.0939666Z         %64 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:25:21.0940105Z         %65 = tt.expand_dims %63 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:21.0940466Z         %66 = tt.broadcast %65 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:25:21.0940819Z         %67 = arith.select %66, %64, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T08:25:21.0941162Z         %68 = arith.extf %67 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:25:21.0941457Z         %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({
2026-02-21T08:25:21.0941785Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:21.0942027Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:25:21.0942271Z           tt.reduce.return %174 : f32
2026-02-21T08:25:21.0942501Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:25:21.0942776Z         %70 = arith.truncf %69 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:25:21.0943070Z         %71 = arith.extf %70 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:25:21.0943364Z         %72 = arith.cmpf ogt, %arg4, %71 : tensor<8xf32>
2026-02-21T08:25:21.0943639Z         %73 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32>
2026-02-21T08:25:21.0943910Z         %74 = arith.ori %72, %73 : tensor<8xi1>
2026-02-21T08:25:21.0944201Z         %75 = arith.select %74, %arg4, %71 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:25:21.0944487Z         %76 = arith.subf %arg4, %75 : tensor<8xf32>
2026-02-21T08:25:21.0944937Z         %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:25:21.0945374Z         %78 = arith.mulf %arg5, %77 : tensor<8xf32>
2026-02-21T08:25:21.0945684Z         %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:21.0946029Z         %80 = arith.extf %64 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:25:21.0946346Z         %81 = tt.broadcast %79 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:25:21.0946742Z         %82 = arith.subf %80, %81 : tensor<8x512xf32>
2026-02-21T08:25:21.0947184Z         %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:25:21.0947694Z         %84 = arith.select %66, %83, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T08:25:21.0948004Z         %85 = "tt.reduce"(%84) <{axis = 1 : i32}> ({
2026-02-21T08:25:21.0948251Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:21.0948486Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:25:21.0948718Z           tt.reduce.return %174 : f32
2026-02-21T08:25:21.0948954Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:25:21.0949190Z         %86 = arith.addf %78, %85 : tensor<8xf32>
2026-02-21T08:25:21.0949432Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:25:21.0949658Z         %87 = arith.muli %c512_i32, %c1_i32 : i32
2026-02-21T08:25:21.0950017Z         %88 = arith.addi %arg3, %87 : i32
2026-02-21T08:25:21.0950311Z         %89 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:21.0950624Z         %90 = tt.splat %88 : i32 -> tensor<512xi32>
2026-02-21T08:25:21.0950875Z         %91 = arith.addi %90, %89 : tensor<512xi32>
2026-02-21T08:25:21.0951133Z         %92 = arith.cmpi slt, %91, %cst_3 : tensor<512xi32>
2026-02-21T08:25:21.0951504Z         %93 = tt.descriptor_load %0[%2, %88] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:25:21.0951974Z         %94 = tt.expand_dims %92 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:21.0952341Z         %95 = tt.broadcast %94 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:25:21.0952682Z         %96 = arith.select %95, %93, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T08:25:21.0953017Z         %97 = arith.extf %96 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:25:21.0953304Z         %98 = "tt.reduce"(%97) <{axis = 1 : i32}> ({
2026-02-21T08:25:21.0953543Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:21.0953779Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:25:21.0954017Z           tt.reduce.return %174 : f32
2026-02-21T08:25:21.0954259Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:25:21.0954527Z         %99 = arith.truncf %98 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:25:21.0954834Z         %100 = arith.extf %99 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:25:21.0955124Z         %101 = arith.cmpf ogt, %75, %100 : tensor<8xf32>
2026-02-21T08:25:21.0955390Z         %102 = arith.cmpf une, %75, %75 : tensor<8xf32>
2026-02-21T08:25:21.0955652Z         %103 = arith.ori %101, %102 : tensor<8xi1>
2026-02-21T08:25:21.0955940Z         %104 = arith.select %103, %75, %100 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:25:21.0956234Z         %105 = arith.subf %75, %104 : tensor<8xf32>
2026-02-21T08:25:21.0956682Z         %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:25:21.0957130Z         %107 = arith.mulf %86, %106 : tensor<8xf32>
2026-02-21T08:25:21.0957443Z         %108 = tt.expand_dims %104 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:21.0957800Z         %109 = arith.extf %93 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:25:21.0958132Z         %110 = tt.broadcast %108 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:25:21.0958426Z         %111 = arith.subf %109, %110 : tensor<8x512xf32>
2026-02-21T08:25:21.0958892Z         %112 = tt.extern_elementwise %111 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:25:21.0959408Z         %113 = arith.select %95, %112, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T08:25:21.0959721Z         %114 = "tt.reduce"(%113) <{axis = 1 : i32}> ({
2026-02-21T08:25:21.0959967Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:21.0960189Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:25:21.0960510Z           tt.reduce.return %174 : f32
2026-02-21T08:25:21.0960732Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:25:21.0960981Z         %115 = arith.addf %107, %114 : tensor<8xf32>
2026-02-21T08:25:21.0961225Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:25:21.0961457Z         %116 = arith.muli %c512_i32, %c2_i32 : i32
2026-02-21T08:25:21.0961770Z         %117 = arith.addi %arg3, %116 : i32
2026-02-21T08:25:21.0962057Z         %118 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:21.0962445Z         %119 = tt.splat %117 : i32 -> tensor<512xi32>
2026-02-21T08:25:21.0962698Z         %120 = arith.addi %119, %118 : tensor<512xi32>
2026-02-21T08:25:21.0962972Z         %121 = arith.cmpi slt, %120, %cst_3 : tensor<512xi32>
2026-02-21T08:25:21.0963343Z         %122 = tt.descriptor_load %0[%2, %117] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:25:21.0963853Z         %123 = tt.expand_dims %121 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:21.0964239Z         %124 = tt.broadcast %123 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:25:21.0964598Z         %125 = arith.select %124, %122, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T08:25:21.0964966Z         %126 = arith.extf %125 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:25:21.0965262Z         %127 = "tt.reduce"(%126) <{axis = 1 : i32}> ({
2026-02-21T08:25:21.0965516Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:21.0965757Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:25:21.0966001Z           tt.reduce.return %174 : f32
2026-02-21T08:25:21.0966243Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:25:21.0966520Z         %128 = arith.truncf %127 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:25:21.0966836Z         %129 = arith.extf %128 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:25:21.0967130Z         %130 = arith.cmpf ogt, %104, %129 : tensor<8xf32>
2026-02-21T08:25:21.0967415Z         %131 = arith.cmpf une, %104, %104 : tensor<8xf32>
2026-02-21T08:25:21.0967682Z         %132 = arith.ori %130, %131 : tensor<8xi1>
2026-02-21T08:25:21.0967979Z         %133 = arith.select %132, %104, %129 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:25:21.0968285Z         %134 = arith.subf %104, %133 : tensor<8xf32>
2026-02-21T08:25:21.0968730Z         %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:25:21.0969198Z         %136 = arith.mulf %115, %135 : tensor<8xf32>
2026-02-21T08:25:21.0969516Z         %137 = tt.expand_dims %133 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:21.0969895Z         %138 = arith.extf %122 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:25:21.0970231Z         %139 = tt.broadcast %137 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:25:21.0970540Z         %140 = arith.subf %138, %139 : tensor<8x512xf32>
2026-02-21T08:25:21.0971012Z         %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:25:21.0971581Z         %142 = arith.select %124, %141, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T08:25:21.0971920Z         %143 = "tt.reduce"(%142) <{axis = 1 : i32}> ({
2026-02-21T08:25:21.0972151Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:21.0972360Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:25:21.0972584Z           tt.reduce.return %174 : f32
2026-02-21T08:25:21.0972798Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:25:21.0973039Z         %144 = arith.addf %136, %143 : tensor<8xf32>
2026-02-21T08:25:21.0973266Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T08:25:21.0973490Z         %145 = arith.muli %c512_i32, %c3_i32 : i32
2026-02-21T08:25:21.0973714Z         %146 = arith.addi %arg3, %145 : i32
2026-02-21T08:25:21.0973995Z         %147 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:21.0974410Z         %148 = tt.splat %146 : i32 -> tensor<512xi32>
2026-02-21T08:25:21.0974653Z         %149 = arith.addi %148, %147 : tensor<512xi32>
2026-02-21T08:25:21.0974920Z         %150 = arith.cmpi slt, %149, %cst_3 : tensor<512xi32>
2026-02-21T08:25:21.0975270Z         %151 = tt.descriptor_load %0[%2, %146] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:25:21.0975690Z         %152 = tt.expand_dims %150 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:21.0976045Z         %153 = tt.broadcast %152 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:25:21.0976372Z         %154 = arith.select %153, %151, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T08:25:21.0976711Z         %155 = arith.extf %154 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:25:21.0976982Z         %156 = "tt.reduce"(%155) <{axis = 1 : i32}> ({
2026-02-21T08:25:21.0977276Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:21.0977494Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:25:21.0977719Z           tt.reduce.return %174 : f32
2026-02-21T08:25:21.0977936Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:25:21.0978191Z         %157 = arith.truncf %156 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:25:21.0978483Z         %158 = arith.extf %157 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:25:21.0978749Z         %159 = arith.cmpf ogt, %133, %158 : tensor<8xf32>
2026-02-21T08:25:21.0979012Z         %160 = arith.cmpf une, %133, %133 : tensor<8xf32>
2026-02-21T08:25:21.0979250Z         %161 = arith.ori %159, %160 : tensor<8xi1>
2026-02-21T08:25:21.0979532Z         %162 = arith.select %161, %133, %158 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:25:21.0979816Z         %163 = arith.subf %133, %162 : tensor<8xf32>
2026-02-21T08:25:21.0980244Z         %164 = tt.extern_elementwise %163 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:25:21.0980679Z         %165 = arith.mulf %144, %164 : tensor<8xf32>
2026-02-21T08:25:21.0980967Z         %166 = tt.expand_dims %162 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:21.0981314Z         %167 = arith.extf %151 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:25:21.0981658Z         %168 = tt.broadcast %166 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:25:21.0981944Z         %169 = arith.subf %167, %168 : tensor<8x512xf32>
2026-02-21T08:25:21.0982385Z         %170 = tt.extern_elementwise %169 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:25:21.0982911Z         %171 = arith.select %153, %170, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T08:25:21.0983213Z         %172 = "tt.reduce"(%171) <{axis = 1 : i32}> ({
2026-02-21T08:25:21.0983432Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:21.0983651Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:25:21.0983874Z           tt.reduce.return %174 : f32
2026-02-21T08:25:21.0984095Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:25:21.0984334Z         %173 = arith.addf %165, %172 : tensor<8xf32>
2026-02-21T08:25:21.0984591Z         scf.yield %162, %173 : tensor<8xf32>, tensor<8xf32>
2026-02-21T08:25:21.0984880Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:25:21.0985155Z       %7 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:21.0985463Z       %8 = tt.splat %c4096_i32_6 : i32 -> tensor<512xi32>
2026-02-21T08:25:21.0985700Z       %9 = arith.addi %8, %7 : tensor<512xi32>
2026-02-21T08:25:21.0985931Z       %10 = arith.cmpi slt, %9, %cst_3 : tensor<512xi32>
2026-02-21T08:25:21.0986271Z       %11 = tt.descriptor_load %0[%2, %c4096_i32_6] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:25:21.0986648Z       %12 = tt.expand_dims %10 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:21.0986956Z       %13 = tt.broadcast %12 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:25:21.0987310Z       %14 = arith.select %13, %11, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T08:25:21.0987611Z       %15 = arith.extf %14 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:25:21.0987864Z       %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({
2026-02-21T08:25:21.0988071Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:25:21.0988271Z         %60 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:25:21.0988474Z         tt.reduce.return %60 : f32
2026-02-21T08:25:21.0988682Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:25:21.0988914Z       %17 = arith.truncf %16 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:25:21.0989175Z       %18 = arith.extf %17 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:25:21.0989413Z       %19 = arith.cmpf ogt, %6#0, %18 : tensor<8xf32>
2026-02-21T08:25:21.0989650Z       %20 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32>
2026-02-21T08:25:21.0989930Z       %21 = arith.ori %19, %20 : tensor<8xi1>
2026-02-21T08:25:21.0990169Z       %22 = arith.select %21, %6#0, %18 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:25:21.0990416Z       %23 = arith.subf %6#0, %22 : tensor<8xf32>
2026-02-21T08:25:21.0990789Z       %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:25:21.0991178Z       %25 = arith.mulf %6#1, %24 : tensor<8xf32>
2026-02-21T08:25:21.0991436Z       %26 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:21.0991797Z       %27 = arith.extf %11 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:25:21.0992071Z       %28 = tt.broadcast %26 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:25:21.0992315Z       %29 = arith.subf %27, %28 : tensor<8x512xf32>
2026-02-21T08:25:21.0992702Z       %30 = tt.extern_elementwise %29 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:25:21.0993136Z       %31 = arith.select %13, %30, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T08:25:21.0993407Z       %32 = "tt.reduce"(%31) <{axis = 1 : i32}> ({
2026-02-21T08:25:21.0993616Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:25:21.0993804Z         %60 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:25:21.0994009Z         tt.reduce.return %60 : f32
2026-02-21T08:25:21.0994205Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:25:21.0994421Z       %33 = arith.addf %25, %32 : tensor<8xf32>
2026-02-21T08:25:21.0994627Z       %c4096_i32_7 = arith.constant 4096 : i32
2026-02-21T08:25:21.0994841Z       %c2048_i32_8 = arith.constant 2048 : i32
2026-02-21T08:25:21.0995090Z       scf.for %arg3 = %c0_i32 to %c4096_i32_7 step %c2048_i32_8  : i32 {
2026-02-21T08:25:21.0995398Z         %60 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:21.0995676Z         %61 = tt.splat %arg3 : i32 -> tensor<512xi32>
2026-02-21T08:25:21.0995898Z         %62 = arith.addi %61, %60 : tensor<512xi32>
2026-02-21T08:25:21.0996137Z         %63 = arith.cmpi slt, %62, %cst_3 : tensor<512xi32>
2026-02-21T08:25:21.0996415Z         %64 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:25:21.0996699Z         %65 = arith.muli %64, %cst_0 : tensor<8x1xi32>
2026-02-21T08:25:21.0996974Z         %66 = tt.expand_dims %62 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:25:21.0997298Z         %67 = tt.broadcast %65 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:25:21.0997585Z         %68 = tt.broadcast %66 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:25:21.0997842Z         %69 = arith.addi %67, %68 : tensor<8x512xi32>
2026-02-21T08:25:21.0998110Z         %70 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.0998409Z         %71 = tt.addptr %70, %69 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:25:21.0998733Z         %72 = tt.expand_dims %63 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:21.0999105Z         %73 = tt.broadcast %72 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:25:21.0999425Z         %74 = tt.load %71, %73, %cst evictionPolicy = evict_first : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.0999776Z         %75 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:21.1000076Z         %76 = arith.extf %74 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:25:21.1000352Z         %77 = tt.broadcast %75 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:25:21.1000600Z         %78 = arith.subf %76, %77 : tensor<8x512xf32>
2026-02-21T08:25:21.1001001Z         %79 = tt.extern_elementwise %78 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:25:21.1001446Z         %80 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:21.1001849Z         %81 = tt.broadcast %80 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:25:21.1002107Z         %82 = arith.divf %79, %81 : tensor<8x512xf32>
2026-02-21T08:25:21.1002355Z         %83 = arith.truncf %82 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:25:21.1002642Z         %84 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.1002940Z         %85 = tt.addptr %84, %69 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:25:21.1003217Z         tt.store %85, %83, %73 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.1003450Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:25:21.1003654Z         %86 = arith.muli %c512_i32, %c1_i32 : i32
2026-02-21T08:25:21.1003869Z         %87 = arith.addi %arg3, %86 : i32
2026-02-21T08:25:21.1004116Z         %88 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:21.1004385Z         %89 = tt.splat %87 : i32 -> tensor<512xi32>
2026-02-21T08:25:21.1004602Z         %90 = arith.addi %89, %88 : tensor<512xi32>
2026-02-21T08:25:21.1004830Z         %91 = arith.cmpi slt, %90, %cst_3 : tensor<512xi32>
2026-02-21T08:25:21.1005110Z         %92 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:25:21.1005388Z         %93 = arith.muli %92, %cst_0 : tensor<8x1xi32>
2026-02-21T08:25:21.1005670Z         %94 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:25:21.1005975Z         %95 = tt.broadcast %93 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:25:21.1006255Z         %96 = tt.broadcast %94 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:25:21.1006509Z         %97 = arith.addi %95, %96 : tensor<8x512xi32>
2026-02-21T08:25:21.1006753Z         %98 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.1007047Z         %99 = tt.addptr %98, %97 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:25:21.1007361Z         %100 = tt.expand_dims %91 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:21.1007683Z         %101 = tt.broadcast %100 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:25:21.1008014Z         %102 = tt.load %99, %101, %cst evictionPolicy = evict_first : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.1008374Z         %103 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:21.1008686Z         %104 = arith.extf %102 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:25:21.1008965Z         %105 = tt.broadcast %103 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:25:21.1009231Z         %106 = arith.subf %104, %105 : tensor<8x512xf32>
2026-02-21T08:25:21.1009635Z         %107 = tt.extern_elementwise %106 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:25:21.1010084Z         %108 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:21.1010391Z         %109 = tt.broadcast %108 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:25:21.1010650Z         %110 = arith.divf %107, %109 : tensor<8x512xf32>
2026-02-21T08:25:21.1010979Z         %111 = arith.truncf %110 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:25:21.1011269Z         %112 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.1011633Z         %113 = tt.addptr %112, %97 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:25:21.1011926Z         tt.store %113, %111, %101 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.1012175Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:25:21.1012388Z         %114 = arith.muli %c512_i32, %c2_i32 : i32
2026-02-21T08:25:21.1012603Z         %115 = arith.addi %arg3, %114 : i32
2026-02-21T08:25:21.1012879Z         %116 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:21.1013156Z         %117 = tt.splat %115 : i32 -> tensor<512xi32>
2026-02-21T08:25:21.1013385Z         %118 = arith.addi %117, %116 : tensor<512xi32>
2026-02-21T08:25:21.1013681Z         %119 = arith.cmpi slt, %118, %cst_3 : tensor<512xi32>
2026-02-21T08:25:21.1013976Z         %120 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:25:21.1014266Z         %121 = arith.muli %120, %cst_0 : tensor<8x1xi32>
2026-02-21T08:25:21.1014549Z         %122 = tt.expand_dims %118 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:25:21.1014875Z         %123 = tt.broadcast %121 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:25:21.1015165Z         %124 = tt.broadcast %122 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:25:21.1015433Z         %125 = arith.addi %123, %124 : tensor<8x512xi32>
2026-02-21T08:25:21.1015689Z         %126 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.1016000Z         %127 = tt.addptr %126, %125 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:25:21.1016335Z         %128 = tt.expand_dims %119 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:21.1016648Z         %129 = tt.broadcast %128 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:25:21.1016991Z         %130 = tt.load %127, %129, %cst evictionPolicy = evict_first : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.1017343Z         %131 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:21.1017652Z         %132 = arith.extf %130 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:25:21.1017936Z         %133 = tt.broadcast %131 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:25:21.1018188Z         %134 = arith.subf %132, %133 : tensor<8x512xf32>
2026-02-21T08:25:21.1018595Z         %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:25:21.1019038Z         %136 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:21.1019350Z         %137 = tt.broadcast %136 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:25:21.1019617Z         %138 = arith.divf %135, %137 : tensor<8x512xf32>
2026-02-21T08:25:21.1019871Z         %139 = arith.truncf %138 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:25:21.1020165Z         %140 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.1020466Z         %141 = tt.addptr %140, %125 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:25:21.1020758Z         tt.store %141, %139, %129 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.1020981Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T08:25:21.1021192Z         %142 = arith.muli %c512_i32, %c3_i32 : i32
2026-02-21T08:25:21.1021403Z         %143 = arith.addi %arg3, %142 : i32
2026-02-21T08:25:21.1021702Z         %144 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:21.1021981Z         %145 = tt.splat %143 : i32 -> tensor<512xi32>
2026-02-21T08:25:21.1022199Z         %146 = arith.addi %145, %144 : tensor<512xi32>
2026-02-21T08:25:21.1022445Z         %147 = arith.cmpi slt, %146, %cst_3 : tensor<512xi32>
2026-02-21T08:25:21.1022830Z         %148 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:25:21.1023118Z         %149 = arith.muli %148, %cst_0 : tensor<8x1xi32>
2026-02-21T08:25:21.1023405Z         %150 = tt.expand_dims %146 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:25:21.1023723Z         %151 = tt.broadcast %149 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:25:21.1024011Z         %152 = tt.broadcast %150 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:25:21.1024268Z         %153 = arith.addi %151, %152 : tensor<8x512xi32>
2026-02-21T08:25:21.1024535Z         %154 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.1024842Z         %155 = tt.addptr %154, %153 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:25:21.1025175Z         %156 = tt.expand_dims %147 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:21.1025566Z         %157 = tt.broadcast %156 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:25:21.1025899Z         %158 = tt.load %155, %157, %cst evictionPolicy = evict_first : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.1026256Z         %159 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:21.1026556Z         %160 = arith.extf %158 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:25:21.1026846Z         %161 = tt.broadcast %159 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:25:21.1027112Z         %162 = arith.subf %160, %161 : tensor<8x512xf32>
2026-02-21T08:25:21.1027512Z         %163 = tt.extern_elementwise %162 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:25:21.1027963Z         %164 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:21.1028261Z         %165 = tt.broadcast %164 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:25:21.1028531Z         %166 = arith.divf %163, %165 : tensor<8x512xf32>
2026-02-21T08:25:21.1028785Z         %167 = arith.truncf %166 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:25:21.1029083Z         %168 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.1029398Z         %169 = tt.addptr %168, %153 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:25:21.1029692Z         tt.store %169, %167, %157 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.1029934Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:25:21.1030188Z       %34 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:21.1030475Z       %35 = tt.splat %c4096_i32_7 : i32 -> tensor<512xi32>
2026-02-21T08:25:21.1030707Z       %36 = arith.addi %35, %34 : tensor<512xi32>
2026-02-21T08:25:21.1030934Z       %37 = arith.cmpi slt, %36, %cst_3 : tensor<512xi32>
2026-02-21T08:25:21.1031216Z       %38 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:25:21.1031487Z       %39 = arith.muli %38, %cst_0 : tensor<8x1xi32>
2026-02-21T08:25:21.1031807Z       %40 = tt.expand_dims %36 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:25:21.1032119Z       %41 = tt.broadcast %39 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:25:21.1032398Z       %42 = tt.broadcast %40 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:25:21.1032652Z       %43 = arith.addi %41, %42 : tensor<8x512xi32>
2026-02-21T08:25:21.1032899Z       %44 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.1033195Z       %45 = tt.addptr %44, %43 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:25:21.1033506Z       %46 = tt.expand_dims %37 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:21.1033815Z       %47 = tt.broadcast %46 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T08:25:21.1034122Z       %48 = tt.load %45, %47, %cst evictionPolicy = evict_first : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.1034470Z       %49 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:21.1034826Z       %50 = arith.extf %48 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:25:21.1035091Z       %51 = tt.broadcast %49 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:25:21.1035336Z       %52 = arith.subf %50, %51 : tensor<8x512xf32>
2026-02-21T08:25:21.1035713Z       %53 = tt.extern_elementwise %52 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:25:21.1036153Z       %54 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:21.1036448Z       %55 = tt.broadcast %54 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:25:21.1036687Z       %56 = arith.divf %53, %55 : tensor<8x512xf32>
2026-02-21T08:25:21.1036935Z       %57 = arith.truncf %56 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:25:21.1037208Z       %58 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.1037554Z       %59 = tt.addptr %58, %43 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:25:21.1037827Z       tt.store %59, %57, %47 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:25:21.1038085Z     } {tt.disallow_acc_multi_buffer, tt.warp_specialize}
2026-02-21T08:25:21.1038305Z     tt.return
2026-02-21T08:25:21.1038442Z   }
2026-02-21T08:25:21.1038582Z }
2026-02-21T08:25:21.1038658Z 
2026-02-21T08:25:21.1038712Z {-#
2026-02-21T08:25:21.1038855Z   external_resources: {
2026-02-21T08:25:21.1039025Z     mlir_reproducer: {
2026-02-21T08:25:21.1043733Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:25:21.1048674Z       disable_threading: false,
2026-02-21T08:25:21.1048864Z       verify_each: true
2026-02-21T08:25:21.1049032Z     }
2026-02-21T08:25:21.1049161Z   }
2026-02-21T08:25:21.1049311Z #-}
2026-02-21T08:25:21.1049771Z /tmp/torchinductor_root/jk/cjkllnsh5e55uktu24aq3faymfp2bujo2g4hnlxlayzgzmxzeasa.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:25:21.1051067Z /tmp/torchinductor_root/jk/cjkllnsh5e55uktu24aq3faymfp2bujo2g4hnlxlayzgzmxzeasa.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:25:21.1052250Z [150s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:25:21.1053411Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 512], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=32, num_sm_multiplier=4, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:25:21.1054453Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:25:21.1054747Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:25:21.9434546Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 80/80 16.4 configs/s
2026-02-21T08:25:23.8270452Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 538.9         
2026-02-21T08:25:23.8270967Z                                                                   configs/s     
2026-02-21T08:25:23.9905395Z [153s] Generation 4 complete: 
2026-02-21T08:25:23.9907129Z error=1
2026-02-21T08:25:23.9907311Z ok=81
2026-02-21T08:25:23.9907472Z min=0.0204
2026-02-21T08:25:23.9907629Z mid=0.0368
2026-02-21T08:25:23.9907782Z max=0.3369
2026-02-21T08:25:23.9907955Z best={'block_sizes': [1, 8192],
2026-02-21T08:25:23.9908240Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:25:23.9908546Z  'load_eviction_policies': ['', ''],
2026-02-21T08:25:23.9908744Z  'num_sm_multiplier': 32,
2026-02-21T08:25:23.9908914Z  'num_stages': 5,
2026-02-21T08:25:23.9909063Z  'num_warps': 1,
2026-02-21T08:25:23.9909246Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:25:23.9909469Z  'range_flattens': [True, True],
2026-02-21T08:25:23.9909665Z  'range_multi_buffers': [False, None],
2026-02-21T08:25:23.9909874Z  'range_num_stages': [3, 0],
2026-02-21T08:25:23.9910111Z  'range_unroll_factors': [0, 2],
2026-02-21T08:25:23.9910332Z  'range_warp_specializes': [True, None]}
2026-02-21T08:25:23.9927257Z [153s] Fitting surrogate: 432 points, 432 targets
2026-02-21T08:25:24.9571192Z [154s] Generation 5 starting: 72 neighbors, 5 active search path(s)
2026-02-21T08:25:52.6549731Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 0.8 configs/s
2026-02-21T08:25:56.1437607Z module {
2026-02-21T08:25:56.1442754Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:25:56.1447324Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:25:56.1452016Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:25:56.1454227Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:25:56.1454507Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:25:56.1460412Z     %cst = arith.constant dense<4224> : tensor<8x1xi32>
2026-02-21T08:25:56.1465214Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<8xf32>
2026-02-21T08:25:56.1467460Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<8xf32>
2026-02-21T08:25:56.1467775Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:25:56.1472525Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:25:56.1477070Z     %c4224_i32 = arith.constant 4224 : i32
2026-02-21T08:25:56.1478697Z     %c4224_i64 = arith.constant 4224 : i64
2026-02-21T08:25:56.1478977Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:25:56.1483782Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4224_i32], [%c4224_i64, %c1_i64] : <f16>, <tensor<8x128xf16>>
2026-02-21T08:25:56.1487513Z     %1 = tt.get_program_id x : i32
2026-02-21T08:25:56.1488935Z     %2 = arith.addi %1, %c1_i32 : i32
2026-02-21T08:25:56.1489169Z     %3 = arith.minsi %2, %c512_i32 : i32
2026-02-21T08:25:56.1489386Z     scf.for %arg2 = %1 to %3 step %c1_i32  : i32 {
2026-02-21T08:25:56.1489587Z       %4 = arith.muli %arg2, %c8_i32 : i32
2026-02-21T08:25:56.1489844Z       %5 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:25:56.1490478Z       %6 = tt.splat %4 : i32 -> tensor<8xi32>
2026-02-21T08:25:56.1490673Z       %7 = arith.addi %6, %5 : tensor<8xi32>
2026-02-21T08:25:56.1490873Z       %c4096_i32_2 = arith.constant 4096 : i32
2026-02-21T08:25:56.1491061Z       %c512_i32_3 = arith.constant 512 : i32
2026-02-21T08:25:56.1491443Z       %8:2 = scf.for %arg3 = %c0_i32 to %c4096_i32_2 step %c512_i32_3 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<8xf32>, tensor<8xf32>)  : i32 {
2026-02-21T08:25:56.1491977Z         %50 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc<tensor<8x128xf16>> -> tensor<8x128xf16>
2026-02-21T08:25:56.1492309Z         %51 = arith.extf %50 : tensor<8x128xf16> to tensor<8x128xf32>
2026-02-21T08:25:56.1492551Z         %52 = "tt.reduce"(%51) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.1492819Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:56.1493017Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:25:56.1493379Z           tt.reduce.return %128 : f32
2026-02-21T08:25:56.1493579Z         }) : (tensor<8x128xf32>) -> tensor<8xf32>
2026-02-21T08:25:56.1494101Z         %53 = arith.truncf %52 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:25:56.1494370Z         %54 = arith.extf %53 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:25:56.1498524Z         %55 = arith.cmpf ogt, %arg4, %54 : tensor<8xf32>
2026-02-21T08:25:56.1501364Z         %56 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32>
2026-02-21T08:25:56.1501768Z         %57 = arith.ori %55, %56 : tensor<8xi1>
2026-02-21T08:25:56.1502047Z         %58 = arith.select %57, %arg4, %54 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:25:56.1502302Z         %59 = arith.subf %arg4, %58 : tensor<8xf32>
2026-02-21T08:25:56.1502697Z         %60 = tt.extern_elementwise %59 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:25:56.1503080Z         %61 = arith.mulf %arg5, %60 : tensor<8xf32>
2026-02-21T08:25:56.1503356Z         %62 = tt.expand_dims %58 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:56.1503680Z         %63 = tt.broadcast %62 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T08:25:56.1503924Z         %64 = arith.subf %51, %63 : tensor<8x128xf32>
2026-02-21T08:25:56.1504310Z         %65 = tt.extern_elementwise %64 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32>
2026-02-21T08:25:56.1504695Z         %66 = "tt.reduce"(%65) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.1504910Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:56.1505111Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:25:56.1505310Z           tt.reduce.return %128 : f32
2026-02-21T08:25:56.1505511Z         }) : (tensor<8x128xf32>) -> tensor<8xf32>
2026-02-21T08:25:56.1505716Z         %67 = arith.addf %61, %66 : tensor<8xf32>
2026-02-21T08:25:56.1505930Z         %c1_i32_6 = arith.constant 1 : i32
2026-02-21T08:25:56.1506126Z         %68 = arith.muli %c128_i32, %c1_i32_6 : i32
2026-02-21T08:25:56.1506331Z         %69 = arith.addi %arg3, %68 : i32
2026-02-21T08:25:56.1508875Z         %70 = tt.descriptor_load %0[%4, %69] : !tt.tensordesc<tensor<8x128xf16>> -> tensor<8x128xf16>
2026-02-21T08:25:56.1509194Z         %71 = arith.extf %70 : tensor<8x128xf16> to tensor<8x128xf32>
2026-02-21T08:25:56.1509460Z         %72 = "tt.reduce"(%71) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.1509665Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:56.1509852Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:25:56.1510062Z           tt.reduce.return %128 : f32
2026-02-21T08:25:56.1510247Z         }) : (tensor<8x128xf32>) -> tensor<8xf32>
2026-02-21T08:25:56.1510481Z         %73 = arith.truncf %72 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:25:56.1510725Z         %74 = arith.extf %73 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:25:56.1510964Z         %75 = arith.cmpf ogt, %58, %74 : tensor<8xf32>
2026-02-21T08:25:56.1511182Z         %76 = arith.cmpf une, %58, %58 : tensor<8xf32>
2026-02-21T08:25:56.1511644Z         %77 = arith.ori %75, %76 : tensor<8xi1>
2026-02-21T08:25:56.1511879Z         %78 = arith.select %77, %58, %74 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:25:56.1512105Z         %79 = arith.subf %58, %78 : tensor<8xf32>
2026-02-21T08:25:56.1512458Z         %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:25:56.1512805Z         %81 = arith.mulf %67, %80 : tensor<8xf32>
2026-02-21T08:25:56.1513055Z         %82 = tt.expand_dims %78 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:56.1513344Z         %83 = tt.broadcast %82 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T08:25:56.1513575Z         %84 = arith.subf %71, %83 : tensor<8x128xf32>
2026-02-21T08:25:56.1513938Z         %85 = tt.extern_elementwise %84 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32>
2026-02-21T08:25:56.1514370Z         %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.1514570Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:56.1514748Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:25:56.1514967Z           tt.reduce.return %128 : f32
2026-02-21T08:25:56.1515161Z         }) : (tensor<8x128xf32>) -> tensor<8xf32>
2026-02-21T08:25:56.1515356Z         %87 = arith.addf %81, %86 : tensor<8xf32>
2026-02-21T08:25:56.1515549Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:25:56.1515732Z         %88 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T08:25:56.1515924Z         %89 = arith.addi %arg3, %88 : i32
2026-02-21T08:25:56.1516189Z         %90 = tt.descriptor_load %0[%4, %89] : !tt.tensordesc<tensor<8x128xf16>> -> tensor<8x128xf16>
2026-02-21T08:25:56.1516497Z         %91 = arith.extf %90 : tensor<8x128xf16> to tensor<8x128xf32>
2026-02-21T08:25:56.1516724Z         %92 = "tt.reduce"(%91) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.1516909Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:56.1517094Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:25:56.1517276Z           tt.reduce.return %128 : f32
2026-02-21T08:25:56.1517458Z         }) : (tensor<8x128xf32>) -> tensor<8xf32>
2026-02-21T08:25:56.1517671Z         %93 = arith.truncf %92 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:25:56.1517915Z         %94 = arith.extf %93 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:25:56.1518138Z         %95 = arith.cmpf ogt, %78, %94 : tensor<8xf32>
2026-02-21T08:25:56.1518341Z         %96 = arith.cmpf une, %78, %78 : tensor<8xf32>
2026-02-21T08:25:56.1518545Z         %97 = arith.ori %95, %96 : tensor<8xi1>
2026-02-21T08:25:56.1518763Z         %98 = arith.select %97, %78, %94 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:25:56.1518991Z         %99 = arith.subf %78, %98 : tensor<8xf32>
2026-02-21T08:25:56.1519333Z         %100 = tt.extern_elementwise %99 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:25:56.1519701Z         %101 = arith.mulf %87, %100 : tensor<8xf32>
2026-02-21T08:25:56.1519963Z         %102 = tt.expand_dims %98 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:56.1520253Z         %103 = tt.broadcast %102 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T08:25:56.1520501Z         %104 = arith.subf %91, %103 : tensor<8x128xf32>
2026-02-21T08:25:56.1520868Z         %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32>
2026-02-21T08:25:56.1521246Z         %106 = "tt.reduce"(%105) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.1521436Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:56.1521653Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:25:56.1521847Z           tt.reduce.return %128 : f32
2026-02-21T08:25:56.1522026Z         }) : (tensor<8x128xf32>) -> tensor<8xf32>
2026-02-21T08:25:56.1522230Z         %107 = arith.addf %101, %106 : tensor<8xf32>
2026-02-21T08:25:56.1522426Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T08:25:56.1522686Z         %108 = arith.muli %c128_i32, %c3_i32 : i32
2026-02-21T08:25:56.1522875Z         %109 = arith.addi %arg3, %108 : i32
2026-02-21T08:25:56.1523156Z         %110 = tt.descriptor_load %0[%4, %109] : !tt.tensordesc<tensor<8x128xf16>> -> tensor<8x128xf16>
2026-02-21T08:25:56.1523480Z         %111 = arith.extf %110 : tensor<8x128xf16> to tensor<8x128xf32>
2026-02-21T08:25:56.1523708Z         %112 = "tt.reduce"(%111) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.1523904Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:56.1524082Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:25:56.1524274Z           tt.reduce.return %128 : f32
2026-02-21T08:25:56.1524453Z         }) : (tensor<8x128xf32>) -> tensor<8xf32>
2026-02-21T08:25:56.1524680Z         %113 = arith.truncf %112 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:25:56.1524929Z         %114 = arith.extf %113 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:25:56.1525204Z         %115 = arith.cmpf ogt, %98, %114 : tensor<8xf32>
2026-02-21T08:25:56.1525429Z         %116 = arith.cmpf une, %98, %98 : tensor<8xf32>
2026-02-21T08:25:56.1525672Z         %117 = arith.ori %115, %116 : tensor<8xi1>
2026-02-21T08:25:56.1525901Z         %118 = arith.select %117, %98, %114 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:25:56.1526144Z         %119 = arith.subf %98, %118 : tensor<8xf32>
2026-02-21T08:25:56.1526504Z         %120 = tt.extern_elementwise %119 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:25:56.1526858Z         %121 = arith.mulf %107, %120 : tensor<8xf32>
2026-02-21T08:25:56.1527118Z         %122 = tt.expand_dims %118 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:56.1527409Z         %123 = tt.broadcast %122 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T08:25:56.1527656Z         %124 = arith.subf %111, %123 : tensor<8x128xf32>
2026-02-21T08:25:56.1528016Z         %125 = tt.extern_elementwise %124 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32>
2026-02-21T08:25:56.1528387Z         %126 = "tt.reduce"(%125) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.1528586Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:56.1528764Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:25:56.1528957Z           tt.reduce.return %128 : f32
2026-02-21T08:25:56.1529136Z         }) : (tensor<8x128xf32>) -> tensor<8xf32>
2026-02-21T08:25:56.1529335Z         %127 = arith.addf %121, %126 : tensor<8xf32>
2026-02-21T08:25:56.1529548Z         scf.yield %118, %127 : tensor<8xf32>, tensor<8xf32>
2026-02-21T08:25:56.1529785Z       } {tt.num_stages = 1 : i32}
2026-02-21T08:25:56.1530075Z       %9 = tt.descriptor_load %0[%4, %c4096_i32_2] : !tt.tensordesc<tensor<8x128xf16>> -> tensor<8x128xf16>
2026-02-21T08:25:56.1530407Z       %10 = arith.extf %9 : tensor<8x128xf16> to tensor<8x128xf32>
2026-02-21T08:25:56.1530632Z       %11 = "tt.reduce"(%10) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.1530825Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:25:56.1531003Z         %50 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:25:56.1531198Z         tt.reduce.return %50 : f32
2026-02-21T08:25:56.1531377Z       }) : (tensor<8x128xf32>) -> tensor<8xf32>
2026-02-21T08:25:56.1531638Z       %12 = arith.truncf %11 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:25:56.1531872Z       %13 = arith.extf %12 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:25:56.1532099Z       %14 = arith.cmpf ogt, %8#0, %13 : tensor<8xf32>
2026-02-21T08:25:56.1532314Z       %15 = arith.cmpf une, %8#0, %8#0 : tensor<8xf32>
2026-02-21T08:25:56.1532511Z       %16 = arith.ori %14, %15 : tensor<8xi1>
2026-02-21T08:25:56.1532740Z       %17 = arith.select %16, %8#0, %13 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:25:56.1532966Z       %18 = arith.subf %8#0, %17 : tensor<8xf32>
2026-02-21T08:25:56.1533318Z       %19 = tt.extern_elementwise %18 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:25:56.1533720Z       %20 = arith.mulf %8#1, %19 : tensor<8xf32>
2026-02-21T08:25:56.1533968Z       %21 = tt.expand_dims %17 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:56.1534253Z       %22 = tt.broadcast %21 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T08:25:56.1534478Z       %23 = arith.subf %10, %22 : tensor<8x128xf32>
2026-02-21T08:25:56.1534846Z       %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32>
2026-02-21T08:25:56.1535199Z       %25 = "tt.reduce"(%24) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.1535394Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:25:56.1535568Z         %50 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:25:56.1535758Z         tt.reduce.return %50 : f32
2026-02-21T08:25:56.1535944Z       }) : (tensor<8x128xf32>) -> tensor<8xf32>
2026-02-21T08:25:56.1536134Z       %26 = arith.addf %20, %25 : tensor<8xf32>
2026-02-21T08:25:56.1536395Z       %c4096_i32_4 = arith.constant 4096 : i32
2026-02-21T08:25:56.1536589Z       %c512_i32_5 = arith.constant 512 : i32
2026-02-21T08:25:56.1536828Z       scf.for %arg3 = %c0_i32 to %c4096_i32_4 step %c512_i32_5  : i32 {
2026-02-21T08:25:56.1537108Z         %50 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:25:56.1537377Z         %51 = tt.splat %arg3 : i32 -> tensor<128xi32>
2026-02-21T08:25:56.1537588Z         %52 = arith.addi %51, %50 : tensor<128xi32>
2026-02-21T08:25:56.1537832Z         %53 = tt.expand_dims %7 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:25:56.1538100Z         %54 = arith.muli %53, %cst : tensor<8x1xi32>
2026-02-21T08:25:56.1538358Z         %55 = tt.expand_dims %52 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:25:56.1538658Z         %56 = tt.broadcast %54 : tensor<8x1xi32> -> tensor<8x128xi32>
2026-02-21T08:25:56.1538921Z         %57 = tt.broadcast %55 : tensor<1x128xi32> -> tensor<8x128xi32>
2026-02-21T08:25:56.1539167Z         %58 = arith.addi %56, %57 : tensor<8x128xi32>
2026-02-21T08:25:56.1539413Z         %59 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1539692Z         %60 = tt.addptr %59, %58 : tensor<8x128x!tt.ptr<f16>>, tensor<8x128xi32>
2026-02-21T08:25:56.1539997Z         %61 = tt.load %60 evictionPolicy = evict_first : tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1540305Z         %62 = tt.expand_dims %17 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:56.1540598Z         %63 = arith.extf %61 : tensor<8x128xf16> to tensor<8x128xf32>
2026-02-21T08:25:56.1540873Z         %64 = tt.broadcast %62 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T08:25:56.1541115Z         %65 = arith.subf %63, %64 : tensor<8x128xf32>
2026-02-21T08:25:56.1541510Z         %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32>
2026-02-21T08:25:56.1541979Z         %67 = tt.expand_dims %26 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:56.1542280Z         %68 = tt.broadcast %67 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T08:25:56.1542518Z         %69 = arith.divf %66, %68 : tensor<8x128xf32>
2026-02-21T08:25:56.1542766Z         %70 = arith.truncf %69 : tensor<8x128xf32> to tensor<8x128xf16>
2026-02-21T08:25:56.1543044Z         %71 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1543327Z         %72 = tt.addptr %71, %58 : tensor<8x128x!tt.ptr<f16>>, tensor<8x128xi32>
2026-02-21T08:25:56.1543592Z         tt.store %72, %70 : tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1543807Z         %c1_i32_6 = arith.constant 1 : i32
2026-02-21T08:25:56.1544014Z         %73 = arith.muli %c128_i32, %c1_i32_6 : i32
2026-02-21T08:25:56.1544212Z         %74 = arith.addi %arg3, %73 : i32
2026-02-21T08:25:56.1544459Z         %75 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:25:56.1544721Z         %76 = tt.splat %74 : i32 -> tensor<128xi32>
2026-02-21T08:25:56.1544979Z         %77 = arith.addi %76, %75 : tensor<128xi32>
2026-02-21T08:25:56.1545235Z         %78 = tt.expand_dims %7 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:25:56.1545501Z         %79 = arith.muli %78, %cst : tensor<8x1xi32>
2026-02-21T08:25:56.1545772Z         %80 = tt.expand_dims %77 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:25:56.1546069Z         %81 = tt.broadcast %79 : tensor<8x1xi32> -> tensor<8x128xi32>
2026-02-21T08:25:56.1546343Z         %82 = tt.broadcast %80 : tensor<1x128xi32> -> tensor<8x128xi32>
2026-02-21T08:25:56.1546588Z         %83 = arith.addi %81, %82 : tensor<8x128xi32>
2026-02-21T08:25:56.1546829Z         %84 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1547118Z         %85 = tt.addptr %84, %83 : tensor<8x128x!tt.ptr<f16>>, tensor<8x128xi32>
2026-02-21T08:25:56.1547496Z         %86 = tt.load %85 evictionPolicy = evict_first : tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1547823Z         %87 = tt.expand_dims %17 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:56.1548119Z         %88 = arith.extf %86 : tensor<8x128xf16> to tensor<8x128xf32>
2026-02-21T08:25:56.1548392Z         %89 = tt.broadcast %87 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T08:25:56.1548625Z         %90 = arith.subf %88, %89 : tensor<8x128xf32>
2026-02-21T08:25:56.1548978Z         %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32>
2026-02-21T08:25:56.1549384Z         %92 = tt.expand_dims %26 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:56.1549656Z         %93 = tt.broadcast %92 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T08:25:56.1549887Z         %94 = arith.divf %91, %93 : tensor<8x128xf32>
2026-02-21T08:25:56.1550120Z         %95 = arith.truncf %94 : tensor<8x128xf32> to tensor<8x128xf16>
2026-02-21T08:25:56.1550384Z         %96 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1550661Z         %97 = tt.addptr %96, %83 : tensor<8x128x!tt.ptr<f16>>, tensor<8x128xi32>
2026-02-21T08:25:56.1550909Z         tt.store %97, %95 : tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1551117Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:25:56.1551304Z         %98 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T08:25:56.1551493Z         %99 = arith.addi %arg3, %98 : i32
2026-02-21T08:25:56.1551789Z         %100 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:25:56.1552039Z         %101 = tt.splat %99 : i32 -> tensor<128xi32>
2026-02-21T08:25:56.1552248Z         %102 = arith.addi %101, %100 : tensor<128xi32>
2026-02-21T08:25:56.1552493Z         %103 = tt.expand_dims %7 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:25:56.1552753Z         %104 = arith.muli %103, %cst : tensor<8x1xi32>
2026-02-21T08:25:56.1553018Z         %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:25:56.1553320Z         %106 = tt.broadcast %104 : tensor<8x1xi32> -> tensor<8x128xi32>
2026-02-21T08:25:56.1553591Z         %107 = tt.broadcast %105 : tensor<1x128xi32> -> tensor<8x128xi32>
2026-02-21T08:25:56.1553834Z         %108 = arith.addi %106, %107 : tensor<8x128xi32>
2026-02-21T08:25:56.1554079Z         %109 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1554354Z         %110 = tt.addptr %109, %108 : tensor<8x128x!tt.ptr<f16>>, tensor<8x128xi32>
2026-02-21T08:25:56.1554671Z         %111 = tt.load %110 evictionPolicy = evict_first : tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1554988Z         %112 = tt.expand_dims %17 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:56.1555270Z         %113 = arith.extf %111 : tensor<8x128xf16> to tensor<8x128xf32>
2026-02-21T08:25:56.1555541Z         %114 = tt.broadcast %112 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T08:25:56.1555790Z         %115 = arith.subf %113, %114 : tensor<8x128xf32>
2026-02-21T08:25:56.1556229Z         %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32>
2026-02-21T08:25:56.1556648Z         %117 = tt.expand_dims %26 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:56.1556926Z         %118 = tt.broadcast %117 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T08:25:56.1557167Z         %119 = arith.divf %116, %118 : tensor<8x128xf32>
2026-02-21T08:25:56.1557402Z         %120 = arith.truncf %119 : tensor<8x128xf32> to tensor<8x128xf16>
2026-02-21T08:25:56.1557675Z         %121 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1557956Z         %122 = tt.addptr %121, %108 : tensor<8x128x!tt.ptr<f16>>, tensor<8x128xi32>
2026-02-21T08:25:56.1558224Z         tt.store %122, %120 : tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1558435Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T08:25:56.1558672Z         %123 = arith.muli %c128_i32, %c3_i32 : i32
2026-02-21T08:25:56.1558874Z         %124 = arith.addi %arg3, %123 : i32
2026-02-21T08:25:56.1559103Z         %125 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:25:56.1559364Z         %126 = tt.splat %124 : i32 -> tensor<128xi32>
2026-02-21T08:25:56.1559564Z         %127 = arith.addi %126, %125 : tensor<128xi32>
2026-02-21T08:25:56.1559816Z         %128 = tt.expand_dims %7 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:25:56.1560083Z         %129 = arith.muli %128, %cst : tensor<8x1xi32>
2026-02-21T08:25:56.1560343Z         %130 = tt.expand_dims %127 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:25:56.1560642Z         %131 = tt.broadcast %129 : tensor<8x1xi32> -> tensor<8x128xi32>
2026-02-21T08:25:56.1560904Z         %132 = tt.broadcast %130 : tensor<1x128xi32> -> tensor<8x128xi32>
2026-02-21T08:25:56.1561153Z         %133 = arith.addi %131, %132 : tensor<8x128xi32>
2026-02-21T08:25:56.1561391Z         %134 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1561715Z         %135 = tt.addptr %134, %133 : tensor<8x128x!tt.ptr<f16>>, tensor<8x128xi32>
2026-02-21T08:25:56.1562025Z         %136 = tt.load %135 evictionPolicy = evict_first : tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1562340Z         %137 = tt.expand_dims %17 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:56.1562639Z         %138 = arith.extf %136 : tensor<8x128xf16> to tensor<8x128xf32>
2026-02-21T08:25:56.1562907Z         %139 = tt.broadcast %137 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T08:25:56.1563153Z         %140 = arith.subf %138, %139 : tensor<8x128xf32>
2026-02-21T08:25:56.1563540Z         %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32>
2026-02-21T08:25:56.1563963Z         %142 = tt.expand_dims %26 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:56.1564264Z         %143 = tt.broadcast %142 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T08:25:56.1564505Z         %144 = arith.divf %141, %143 : tensor<8x128xf32>
2026-02-21T08:25:56.1564751Z         %145 = arith.truncf %144 : tensor<8x128xf32> to tensor<8x128xf16>
2026-02-21T08:25:56.1565027Z         %146 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1565323Z         %147 = tt.addptr %146, %133 : tensor<8x128x!tt.ptr<f16>>, tensor<8x128xi32>
2026-02-21T08:25:56.1565595Z         tt.store %147, %145 : tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1565807Z       } {tt.num_stages = 1 : i32}
2026-02-21T08:25:56.1566043Z       %27 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:25:56.1566312Z       %28 = tt.splat %c4096_i32_4 : i32 -> tensor<128xi32>
2026-02-21T08:25:56.1566536Z       %29 = arith.addi %28, %27 : tensor<128xi32>
2026-02-21T08:25:56.1566786Z       %30 = tt.expand_dims %7 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:25:56.1567093Z       %31 = arith.muli %30, %cst : tensor<8x1xi32>
2026-02-21T08:25:56.1567351Z       %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:25:56.1567635Z       %33 = tt.broadcast %31 : tensor<8x1xi32> -> tensor<8x128xi32>
2026-02-21T08:25:56.1567895Z       %34 = tt.broadcast %32 : tensor<1x128xi32> -> tensor<8x128xi32>
2026-02-21T08:25:56.1568122Z       %35 = arith.addi %33, %34 : tensor<8x128xi32>
2026-02-21T08:25:56.1568358Z       %36 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1568625Z       %37 = tt.addptr %36, %35 : tensor<8x128x!tt.ptr<f16>>, tensor<8x128xi32>
2026-02-21T08:25:56.1568926Z       %38 = tt.load %37 evictionPolicy = evict_first : tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1569239Z       %39 = tt.expand_dims %17 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:56.1569567Z       %40 = arith.extf %38 : tensor<8x128xf16> to tensor<8x128xf32>
2026-02-21T08:25:56.1569827Z       %41 = tt.broadcast %39 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T08:25:56.1570047Z       %42 = arith.subf %40, %41 : tensor<8x128xf32>
2026-02-21T08:25:56.1570404Z       %43 = tt.extern_elementwise %42 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32>
2026-02-21T08:25:56.1570805Z       %44 = tt.expand_dims %26 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:25:56.1571074Z       %45 = tt.broadcast %44 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T08:25:56.1571304Z       %46 = arith.divf %43, %45 : tensor<8x128xf32>
2026-02-21T08:25:56.1571525Z       %47 = arith.truncf %46 : tensor<8x128xf32> to tensor<8x128xf16>
2026-02-21T08:25:56.1571813Z       %48 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1572073Z       %49 = tt.addptr %48, %35 : tensor<8x128x!tt.ptr<f16>>, tensor<8x128xi32>
2026-02-21T08:25:56.1572330Z       tt.store %49, %47 : tensor<8x128x!tt.ptr<f16>>
2026-02-21T08:25:56.1572589Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.warp_specialize}
2026-02-21T08:25:56.1572811Z     tt.return
2026-02-21T08:25:56.1572946Z   }
2026-02-21T08:25:56.1573068Z }
2026-02-21T08:25:56.1573143Z 
2026-02-21T08:25:56.1573194Z {-#
2026-02-21T08:25:56.1573323Z   external_resources: {
2026-02-21T08:25:56.1573487Z     mlir_reproducer: {
2026-02-21T08:25:56.1577768Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:25:56.1582298Z       disable_threading: false,
2026-02-21T08:25:56.1582485Z       verify_each: true
2026-02-21T08:25:56.1582633Z     }
2026-02-21T08:25:56.1582767Z   }
2026-02-21T08:25:56.1582879Z #-}
2026-02-21T08:25:56.1583316Z /tmp/torchinductor_root/pc/cpcb2g72dy7psfa7tvoq6f36rmz52skrwegzxl3uatputjkalgjh.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:25:56.1584552Z /tmp/torchinductor_root/pc/cpcb2g72dy7psfa7tvoq6f36rmz52skrwegzxl3uatputjkalgjh.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:25:56.1585611Z [185s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:25:56.1586746Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[False, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:25:56.1587752Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:25:56.1588019Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:25:56.4579139Z module {
2026-02-21T08:25:56.4586012Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:25:56.4590388Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:25:56.4594461Z     %cst = arith.constant dense<0.000000e+00> : tensor<32x512xf16>
2026-02-21T08:25:56.4596874Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:25:56.4597162Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:25:56.4601912Z     %c592_i32 = arith.constant 592 : i32
2026-02-21T08:25:56.4603607Z     %cst_0 = arith.constant dense<4224> : tensor<32x1xi32>
2026-02-21T08:25:56.4603982Z     %cst_1 = arith.constant dense<0.000000e+00> : tensor<32x512xf32>
2026-02-21T08:25:56.4609899Z     %cst_2 = arith.constant dense<0xFC00> : tensor<32x512xf16>
2026-02-21T08:25:56.4614076Z     %cst_3 = arith.constant dense<4224> : tensor<512xi32>
2026-02-21T08:25:56.4618656Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<32xf32>
2026-02-21T08:25:56.4622647Z     %cst_5 = arith.constant dense<0xFF800000> : tensor<32xf32>
2026-02-21T08:25:56.4625865Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:25:56.4630368Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:25:56.4634232Z     %c4224_i32 = arith.constant 4224 : i32
2026-02-21T08:25:56.4638808Z     %c4224_i64 = arith.constant 4224 : i64
2026-02-21T08:25:56.4639131Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:25:56.4639497Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4224_i32], [%c4224_i64, %c1_i64] : <f16>, <tensor<32x512xf16>>
2026-02-21T08:25:56.4640110Z     %1 = tt.get_program_id x : i32
2026-02-21T08:25:56.4645874Z     scf.for %arg2 = %1 to %c128_i32 step %c592_i32  : i32 {
2026-02-21T08:25:56.4650001Z       %2 = arith.muli %arg2, %c32_i32 : i32
2026-02-21T08:25:56.4654475Z       %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32>
2026-02-21T08:25:56.4659519Z       %4 = tt.splat %2 : i32 -> tensor<32xi32>
2026-02-21T08:25:56.4664553Z       %5 = arith.addi %4, %3 : tensor<32xi32>
2026-02-21T08:25:56.4669660Z       %c4096_i32_6 = arith.constant 4096 : i32
2026-02-21T08:25:56.4674688Z       %c2048_i32 = arith.constant 2048 : i32
2026-02-21T08:25:56.4678489Z       %6:2 = scf.for %arg3 = %c0_i32 to %c4096_i32_6 step %c2048_i32 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<32xf32>, tensor<32xf32>)  : i32 {
2026-02-21T08:25:56.4679292Z         %60 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:56.4679592Z         %61 = tt.splat %arg3 : i32 -> tensor<512xi32>
2026-02-21T08:25:56.4679810Z         %62 = arith.addi %61, %60 : tensor<512xi32>
2026-02-21T08:25:56.4680050Z         %63 = arith.cmpi slt, %62, %cst_3 : tensor<512xi32>
2026-02-21T08:25:56.4680376Z         %64 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<32x512xf16>> -> tensor<32x512xf16>
2026-02-21T08:25:56.4680726Z         %65 = tt.expand_dims %63 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:56.4681025Z         %66 = tt.broadcast %65 : tensor<1x512xi1> -> tensor<32x512xi1>
2026-02-21T08:25:56.4681309Z         %67 = arith.select %66, %64, %cst_2 : tensor<32x512xi1>, tensor<32x512xf16>
2026-02-21T08:25:56.4681667Z         %68 = arith.extf %67 : tensor<32x512xf16> to tensor<32x512xf32>
2026-02-21T08:25:56.4681981Z         %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.4682194Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:56.4682396Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:25:56.4682594Z           tt.reduce.return %174 : f32
2026-02-21T08:25:56.4682791Z         }) : (tensor<32x512xf32>) -> tensor<32xf32>
2026-02-21T08:25:56.4683016Z         %70 = arith.truncf %69 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:25:56.4683265Z         %71 = arith.extf %70 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:25:56.4683498Z         %72 = arith.cmpf ogt, %arg4, %71 : tensor<32xf32>
2026-02-21T08:25:56.4683732Z         %73 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32>
2026-02-21T08:25:56.4683949Z         %74 = arith.ori %72, %73 : tensor<32xi1>
2026-02-21T08:25:56.4684184Z         %75 = arith.select %74, %arg4, %71 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:25:56.4684435Z         %76 = arith.subf %arg4, %75 : tensor<32xf32>
2026-02-21T08:25:56.4684802Z         %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:25:56.4685169Z         %78 = arith.mulf %arg5, %77 : tensor<32xf32>
2026-02-21T08:25:56.4685421Z         %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:25:56.4685726Z         %80 = arith.extf %64 : tensor<32x512xf16> to tensor<32x512xf32>
2026-02-21T08:25:56.4685993Z         %81 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x512xf32>
2026-02-21T08:25:56.4686226Z         %82 = arith.subf %80, %81 : tensor<32x512xf32>
2026-02-21T08:25:56.4686597Z         %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32>
2026-02-21T08:25:56.4686998Z         %84 = arith.select %66, %83, %cst_1 : tensor<32x512xi1>, tensor<32x512xf32>
2026-02-21T08:25:56.4687254Z         %85 = "tt.reduce"(%84) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.4687451Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:56.4687635Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:25:56.4687828Z           tt.reduce.return %174 : f32
2026-02-21T08:25:56.4688013Z         }) : (tensor<32x512xf32>) -> tensor<32xf32>
2026-02-21T08:25:56.4688217Z         %86 = arith.addf %78, %85 : tensor<32xf32>
2026-02-21T08:25:56.4688410Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:25:56.4688609Z         %87 = arith.muli %c512_i32, %c1_i32 : i32
2026-02-21T08:25:56.4688850Z         %88 = arith.addi %arg3, %87 : i32
2026-02-21T08:25:56.4689088Z         %89 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:56.4689347Z         %90 = tt.splat %88 : i32 -> tensor<512xi32>
2026-02-21T08:25:56.4689541Z         %91 = arith.addi %90, %89 : tensor<512xi32>
2026-02-21T08:25:56.4689759Z         %92 = arith.cmpi slt, %91, %cst_3 : tensor<512xi32>
2026-02-21T08:25:56.4690056Z         %93 = tt.descriptor_load %0[%2, %88] : !tt.tensordesc<tensor<32x512xf16>> -> tensor<32x512xf16>
2026-02-21T08:25:56.4690482Z         %94 = tt.expand_dims %92 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:56.4690780Z         %95 = tt.broadcast %94 : tensor<1x512xi1> -> tensor<32x512xi1>
2026-02-21T08:25:56.4691050Z         %96 = arith.select %95, %93, %cst_2 : tensor<32x512xi1>, tensor<32x512xf16>
2026-02-21T08:25:56.4691331Z         %97 = arith.extf %96 : tensor<32x512xf16> to tensor<32x512xf32>
2026-02-21T08:25:56.4691592Z         %98 = "tt.reduce"(%97) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.4691786Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:56.4691967Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:25:56.4692160Z           tt.reduce.return %174 : f32
2026-02-21T08:25:56.4692339Z         }) : (tensor<32x512xf32>) -> tensor<32xf32>
2026-02-21T08:25:56.4692562Z         %99 = arith.truncf %98 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:25:56.4692807Z         %100 = arith.extf %99 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:25:56.4693098Z         %101 = arith.cmpf ogt, %75, %100 : tensor<32xf32>
2026-02-21T08:25:56.4693327Z         %102 = arith.cmpf une, %75, %75 : tensor<32xf32>
2026-02-21T08:25:56.4693527Z         %103 = arith.ori %101, %102 : tensor<32xi1>
2026-02-21T08:25:56.4693766Z         %104 = arith.select %103, %75, %100 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:25:56.4694005Z         %105 = arith.subf %75, %104 : tensor<32xf32>
2026-02-21T08:25:56.4694374Z         %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:25:56.4694738Z         %107 = arith.mulf %86, %106 : tensor<32xf32>
2026-02-21T08:25:56.4694992Z         %108 = tt.expand_dims %104 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:25:56.4695293Z         %109 = arith.extf %93 : tensor<32x512xf16> to tensor<32x512xf32>
2026-02-21T08:25:56.4695561Z         %110 = tt.broadcast %108 : tensor<32x1xf32> -> tensor<32x512xf32>
2026-02-21T08:25:56.4695816Z         %111 = arith.subf %109, %110 : tensor<32x512xf32>
2026-02-21T08:25:56.4696194Z         %112 = tt.extern_elementwise %111 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32>
2026-02-21T08:25:56.4696603Z         %113 = arith.select %95, %112, %cst_1 : tensor<32x512xi1>, tensor<32x512xf32>
2026-02-21T08:25:56.4696864Z         %114 = "tt.reduce"(%113) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.4697052Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:56.4697239Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:25:56.4697426Z           tt.reduce.return %174 : f32
2026-02-21T08:25:56.4697620Z         }) : (tensor<32x512xf32>) -> tensor<32xf32>
2026-02-21T08:25:56.4697828Z         %115 = arith.addf %107, %114 : tensor<32xf32>
2026-02-21T08:25:56.4698020Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:25:56.4698213Z         %116 = arith.muli %c512_i32, %c2_i32 : i32
2026-02-21T08:25:56.4698405Z         %117 = arith.addi %arg3, %116 : i32
2026-02-21T08:25:56.4698642Z         %118 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:56.4698891Z         %119 = tt.splat %117 : i32 -> tensor<512xi32>
2026-02-21T08:25:56.4699102Z         %120 = arith.addi %119, %118 : tensor<512xi32>
2026-02-21T08:25:56.4699334Z         %121 = arith.cmpi slt, %120, %cst_3 : tensor<512xi32>
2026-02-21T08:25:56.4699652Z         %122 = tt.descriptor_load %0[%2, %117] : !tt.tensordesc<tensor<32x512xf16>> -> tensor<32x512xf16>
2026-02-21T08:25:56.4700024Z         %123 = tt.expand_dims %121 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:56.4700331Z         %124 = tt.broadcast %123 : tensor<1x512xi1> -> tensor<32x512xi1>
2026-02-21T08:25:56.4700630Z         %125 = arith.select %124, %122, %cst_2 : tensor<32x512xi1>, tensor<32x512xf16>
2026-02-21T08:25:56.4700931Z         %126 = arith.extf %125 : tensor<32x512xf16> to tensor<32x512xf32>
2026-02-21T08:25:56.4701184Z         %127 = "tt.reduce"(%126) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.4701449Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:56.4701691Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:25:56.4701897Z           tt.reduce.return %174 : f32
2026-02-21T08:25:56.4702091Z         }) : (tensor<32x512xf32>) -> tensor<32xf32>
2026-02-21T08:25:56.4702335Z         %128 = arith.truncf %127 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:25:56.4702596Z         %129 = arith.extf %128 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:25:56.4702853Z         %130 = arith.cmpf ogt, %104, %129 : tensor<32xf32>
2026-02-21T08:25:56.4703094Z         %131 = arith.cmpf une, %104, %104 : tensor<32xf32>
2026-02-21T08:25:56.4703311Z         %132 = arith.ori %130, %131 : tensor<32xi1>
2026-02-21T08:25:56.4703562Z         %133 = arith.select %132, %104, %129 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:25:56.4703817Z         %134 = arith.subf %104, %133 : tensor<32xf32>
2026-02-21T08:25:56.4704257Z         %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:25:56.4704643Z         %136 = arith.mulf %115, %135 : tensor<32xf32>
2026-02-21T08:25:56.4704905Z         %137 = tt.expand_dims %133 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:25:56.4705222Z         %138 = arith.extf %122 : tensor<32x512xf16> to tensor<32x512xf32>
2026-02-21T08:25:56.4705497Z         %139 = tt.broadcast %137 : tensor<32x1xf32> -> tensor<32x512xf32>
2026-02-21T08:25:56.4705757Z         %140 = arith.subf %138, %139 : tensor<32x512xf32>
2026-02-21T08:25:56.4706146Z         %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32>
2026-02-21T08:25:56.4706642Z         %142 = arith.select %124, %141, %cst_1 : tensor<32x512xi1>, tensor<32x512xf32>
2026-02-21T08:25:56.4706906Z         %143 = "tt.reduce"(%142) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.4707097Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:56.4707285Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:25:56.4707468Z           tt.reduce.return %174 : f32
2026-02-21T08:25:56.4707658Z         }) : (tensor<32x512xf32>) -> tensor<32xf32>
2026-02-21T08:25:56.4707854Z         %144 = arith.addf %136, %143 : tensor<32xf32>
2026-02-21T08:25:56.4708051Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T08:25:56.4708241Z         %145 = arith.muli %c512_i32, %c3_i32 : i32
2026-02-21T08:25:56.4708432Z         %146 = arith.addi %arg3, %145 : i32
2026-02-21T08:25:56.4708666Z         %147 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:56.4708917Z         %148 = tt.splat %146 : i32 -> tensor<512xi32>
2026-02-21T08:25:56.4709128Z         %149 = arith.addi %148, %147 : tensor<512xi32>
2026-02-21T08:25:56.4709343Z         %150 = arith.cmpi slt, %149, %cst_3 : tensor<512xi32>
2026-02-21T08:25:56.4709654Z         %151 = tt.descriptor_load %0[%2, %146] : !tt.tensordesc<tensor<32x512xf16>> -> tensor<32x512xf16>
2026-02-21T08:25:56.4710010Z         %152 = tt.expand_dims %150 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:56.4710304Z         %153 = tt.broadcast %152 : tensor<1x512xi1> -> tensor<32x512xi1>
2026-02-21T08:25:56.4710594Z         %154 = arith.select %153, %151, %cst_2 : tensor<32x512xi1>, tensor<32x512xf16>
2026-02-21T08:25:56.4710883Z         %155 = arith.extf %154 : tensor<32x512xf16> to tensor<32x512xf32>
2026-02-21T08:25:56.4711123Z         %156 = "tt.reduce"(%155) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.4711312Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:56.4711499Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:25:56.4711765Z           tt.reduce.return %174 : f32
2026-02-21T08:25:56.4711944Z         }) : (tensor<32x512xf32>) -> tensor<32xf32>
2026-02-21T08:25:56.4712176Z         %157 = arith.truncf %156 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:25:56.4712424Z         %158 = arith.extf %157 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:25:56.4712711Z         %159 = arith.cmpf ogt, %133, %158 : tensor<32xf32>
2026-02-21T08:25:56.4712927Z         %160 = arith.cmpf une, %133, %133 : tensor<32xf32>
2026-02-21T08:25:56.4713142Z         %161 = arith.ori %159, %160 : tensor<32xi1>
2026-02-21T08:25:56.4713378Z         %162 = arith.select %161, %133, %158 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:25:56.4713616Z         %163 = arith.subf %133, %162 : tensor<32xf32>
2026-02-21T08:25:56.4713983Z         %164 = tt.extern_elementwise %163 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:25:56.4714338Z         %165 = arith.mulf %144, %164 : tensor<32xf32>
2026-02-21T08:25:56.4714596Z         %166 = tt.expand_dims %162 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:25:56.4714891Z         %167 = arith.extf %151 : tensor<32x512xf16> to tensor<32x512xf32>
2026-02-21T08:25:56.4715216Z         %168 = tt.broadcast %166 : tensor<32x1xf32> -> tensor<32x512xf32>
2026-02-21T08:25:56.4715467Z         %169 = arith.subf %167, %168 : tensor<32x512xf32>
2026-02-21T08:25:56.4715837Z         %170 = tt.extern_elementwise %169 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32>
2026-02-21T08:25:56.4716265Z         %171 = arith.select %153, %170, %cst_1 : tensor<32x512xi1>, tensor<32x512xf32>
2026-02-21T08:25:56.4716524Z         %172 = "tt.reduce"(%171) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.4716722Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:25:56.4716907Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:25:56.4717092Z           tt.reduce.return %174 : f32
2026-02-21T08:25:56.4717279Z         }) : (tensor<32x512xf32>) -> tensor<32xf32>
2026-02-21T08:25:56.4717475Z         %173 = arith.addf %165, %172 : tensor<32xf32>
2026-02-21T08:25:56.4717697Z         scf.yield %162, %173 : tensor<32xf32>, tensor<32xf32>
2026-02-21T08:25:56.4717905Z       } {tt.num_stages = 1 : i32}
2026-02-21T08:25:56.4718134Z       %7 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:56.4718396Z       %8 = tt.splat %c4096_i32_6 : i32 -> tensor<512xi32>
2026-02-21T08:25:56.4718600Z       %9 = arith.addi %8, %7 : tensor<512xi32>
2026-02-21T08:25:56.4718813Z       %10 = arith.cmpi slt, %9, %cst_3 : tensor<512xi32>
2026-02-21T08:25:56.4719125Z       %11 = tt.descriptor_load %0[%2, %c4096_i32_6] : !tt.tensordesc<tensor<32x512xf16>> -> tensor<32x512xf16>
2026-02-21T08:25:56.4719486Z       %12 = tt.expand_dims %10 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:56.4719765Z       %13 = tt.broadcast %12 : tensor<1x512xi1> -> tensor<32x512xi1>
2026-02-21T08:25:56.4720043Z       %14 = arith.select %13, %11, %cst_2 : tensor<32x512xi1>, tensor<32x512xf16>
2026-02-21T08:25:56.4720324Z       %15 = arith.extf %14 : tensor<32x512xf16> to tensor<32x512xf32>
2026-02-21T08:25:56.4720555Z       %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.4720760Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:25:56.4720946Z         %60 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:25:56.4721145Z         tt.reduce.return %60 : f32
2026-02-21T08:25:56.4721334Z       }) : (tensor<32x512xf32>) -> tensor<32xf32>
2026-02-21T08:25:56.4721612Z       %17 = arith.truncf %16 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:25:56.4721866Z       %18 = arith.extf %17 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:25:56.4722093Z       %19 = arith.cmpf ogt, %6#0, %18 : tensor<32xf32>
2026-02-21T08:25:56.4722320Z       %20 = arith.cmpf une, %6#0, %6#0 : tensor<32xf32>
2026-02-21T08:25:56.4722528Z       %21 = arith.ori %19, %20 : tensor<32xi1>
2026-02-21T08:25:56.4722775Z       %22 = arith.select %21, %6#0, %18 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:25:56.4723015Z       %23 = arith.subf %6#0, %22 : tensor<32xf32>
2026-02-21T08:25:56.4723408Z       %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:25:56.4723794Z       %25 = arith.mulf %6#1, %24 : tensor<32xf32>
2026-02-21T08:25:56.4724122Z       %26 = tt.expand_dims %22 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:25:56.4724410Z       %27 = arith.extf %11 : tensor<32x512xf16> to tensor<32x512xf32>
2026-02-21T08:25:56.4724662Z       %28 = tt.broadcast %26 : tensor<32x1xf32> -> tensor<32x512xf32>
2026-02-21T08:25:56.4724899Z       %29 = arith.subf %27, %28 : tensor<32x512xf32>
2026-02-21T08:25:56.4725259Z       %30 = tt.extern_elementwise %29 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32>
2026-02-21T08:25:56.4725667Z       %31 = arith.select %13, %30, %cst_1 : tensor<32x512xi1>, tensor<32x512xf32>
2026-02-21T08:25:56.4725919Z       %32 = "tt.reduce"(%31) <{axis = 1 : i32}> ({
2026-02-21T08:25:56.4726105Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:25:56.4726290Z         %60 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:25:56.4726529Z         tt.reduce.return %60 : f32
2026-02-21T08:25:56.4726724Z       }) : (tensor<32x512xf32>) -> tensor<32xf32>
2026-02-21T08:25:56.4726914Z       %33 = arith.addf %25, %32 : tensor<32xf32>
2026-02-21T08:25:56.4727113Z       %c4096_i32_7 = arith.constant 4096 : i32
2026-02-21T08:25:56.4727308Z       %c2048_i32_8 = arith.constant 2048 : i32
2026-02-21T08:25:56.4727538Z       scf.for %arg3 = %c0_i32 to %c4096_i32_7 step %c2048_i32_8  : i32 {
2026-02-21T08:25:56.4727822Z         %60 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:56.4728075Z         %61 = tt.splat %arg3 : i32 -> tensor<512xi32>
2026-02-21T08:25:56.4728282Z         %62 = arith.addi %61, %60 : tensor<512xi32>
2026-02-21T08:25:56.4728490Z         %63 = arith.cmpi slt, %62, %cst_3 : tensor<512xi32>
2026-02-21T08:25:56.4728756Z         %64 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:25:56.4729026Z         %65 = arith.muli %64, %cst_0 : tensor<32x1xi32>
2026-02-21T08:25:56.4729282Z         %66 = tt.expand_dims %62 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:25:56.4729580Z         %67 = tt.broadcast %65 : tensor<32x1xi32> -> tensor<32x512xi32>
2026-02-21T08:25:56.4729837Z         %68 = tt.broadcast %66 : tensor<1x512xi32> -> tensor<32x512xi32>
2026-02-21T08:25:56.4730078Z         %69 = arith.addi %67, %68 : tensor<32x512xi32>
2026-02-21T08:25:56.4730311Z         %70 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4730596Z         %71 = tt.addptr %70, %69 : tensor<32x512x!tt.ptr<f16>>, tensor<32x512xi32>
2026-02-21T08:25:56.4730891Z         %72 = tt.expand_dims %63 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:56.4731170Z         %73 = tt.broadcast %72 : tensor<1x512xi1> -> tensor<32x512xi1>
2026-02-21T08:25:56.4731473Z         %74 = tt.load %71, %73, %cst evictionPolicy = evict_first : tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4731838Z         %75 = tt.expand_dims %22 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:25:56.4732129Z         %76 = arith.extf %74 : tensor<32x512xf16> to tensor<32x512xf32>
2026-02-21T08:25:56.4732392Z         %77 = tt.broadcast %75 : tensor<32x1xf32> -> tensor<32x512xf32>
2026-02-21T08:25:56.4732629Z         %78 = arith.subf %76, %77 : tensor<32x512xf32>
2026-02-21T08:25:56.4733001Z         %79 = tt.extern_elementwise %78 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32>
2026-02-21T08:25:56.4733410Z         %80 = tt.expand_dims %33 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:25:56.4733704Z         %81 = tt.broadcast %80 : tensor<32x1xf32> -> tensor<32x512xf32>
2026-02-21T08:25:56.4733940Z         %82 = arith.divf %79, %81 : tensor<32x512xf32>
2026-02-21T08:25:56.4734181Z         %83 = arith.truncf %82 : tensor<32x512xf32> to tensor<32x512xf16>
2026-02-21T08:25:56.4734461Z         %84 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4734745Z         %85 = tt.addptr %84, %69 : tensor<32x512x!tt.ptr<f16>>, tensor<32x512xi32>
2026-02-21T08:25:56.4735089Z         tt.store %85, %83, %73 : tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4735299Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:25:56.4735496Z         %86 = arith.muli %c512_i32, %c1_i32 : i32
2026-02-21T08:25:56.4735684Z         %87 = arith.addi %arg3, %86 : i32
2026-02-21T08:25:56.4735922Z         %88 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:56.4736181Z         %89 = tt.splat %87 : i32 -> tensor<512xi32>
2026-02-21T08:25:56.4736380Z         %90 = arith.addi %89, %88 : tensor<512xi32>
2026-02-21T08:25:56.4736600Z         %91 = arith.cmpi slt, %90, %cst_3 : tensor<512xi32>
2026-02-21T08:25:56.4736857Z         %92 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:25:56.4737122Z         %93 = arith.muli %92, %cst_0 : tensor<32x1xi32>
2026-02-21T08:25:56.4737433Z         %94 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:25:56.4737720Z         %95 = tt.broadcast %93 : tensor<32x1xi32> -> tensor<32x512xi32>
2026-02-21T08:25:56.4737983Z         %96 = tt.broadcast %94 : tensor<1x512xi32> -> tensor<32x512xi32>
2026-02-21T08:25:56.4738210Z         %97 = arith.addi %95, %96 : tensor<32x512xi32>
2026-02-21T08:25:56.4738448Z         %98 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4738721Z         %99 = tt.addptr %98, %97 : tensor<32x512x!tt.ptr<f16>>, tensor<32x512xi32>
2026-02-21T08:25:56.4739021Z         %100 = tt.expand_dims %91 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:56.4739321Z         %101 = tt.broadcast %100 : tensor<1x512xi1> -> tensor<32x512xi1>
2026-02-21T08:25:56.4739628Z         %102 = tt.load %99, %101, %cst evictionPolicy = evict_first : tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4739967Z         %103 = tt.expand_dims %22 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:25:56.4740259Z         %104 = arith.extf %102 : tensor<32x512xf16> to tensor<32x512xf32>
2026-02-21T08:25:56.4740532Z         %105 = tt.broadcast %103 : tensor<32x1xf32> -> tensor<32x512xf32>
2026-02-21T08:25:56.4740775Z         %106 = arith.subf %104, %105 : tensor<32x512xf32>
2026-02-21T08:25:56.4741153Z         %107 = tt.extern_elementwise %106 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32>
2026-02-21T08:25:56.4741588Z         %108 = tt.expand_dims %33 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:25:56.4741879Z         %109 = tt.broadcast %108 : tensor<32x1xf32> -> tensor<32x512xf32>
2026-02-21T08:25:56.4742129Z         %110 = arith.divf %107, %109 : tensor<32x512xf32>
2026-02-21T08:25:56.4742370Z         %111 = arith.truncf %110 : tensor<32x512xf32> to tensor<32x512xf16>
2026-02-21T08:25:56.4742655Z         %112 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4742958Z         %113 = tt.addptr %112, %97 : tensor<32x512x!tt.ptr<f16>>, tensor<32x512xi32>
2026-02-21T08:25:56.4743245Z         tt.store %113, %111, %101 : tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4743476Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:25:56.4743672Z         %114 = arith.muli %c512_i32, %c2_i32 : i32
2026-02-21T08:25:56.4743880Z         %115 = arith.addi %arg3, %114 : i32
2026-02-21T08:25:56.4744121Z         %116 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:56.4744393Z         %117 = tt.splat %115 : i32 -> tensor<512xi32>
2026-02-21T08:25:56.4744612Z         %118 = arith.addi %117, %116 : tensor<512xi32>
2026-02-21T08:25:56.4744837Z         %119 = arith.cmpi slt, %118, %cst_3 : tensor<512xi32>
2026-02-21T08:25:56.4745120Z         %120 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:25:56.4745394Z         %121 = arith.muli %120, %cst_0 : tensor<32x1xi32>
2026-02-21T08:25:56.4745678Z         %122 = tt.expand_dims %118 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:25:56.4746038Z         %123 = tt.broadcast %121 : tensor<32x1xi32> -> tensor<32x512xi32>
2026-02-21T08:25:56.4746330Z         %124 = tt.broadcast %122 : tensor<1x512xi32> -> tensor<32x512xi32>
2026-02-21T08:25:56.4746595Z         %125 = arith.addi %123, %124 : tensor<32x512xi32>
2026-02-21T08:25:56.4746849Z         %126 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4747157Z         %127 = tt.addptr %126, %125 : tensor<32x512x!tt.ptr<f16>>, tensor<32x512xi32>
2026-02-21T08:25:56.4747484Z         %128 = tt.expand_dims %119 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:56.4747796Z         %129 = tt.broadcast %128 : tensor<1x512xi1> -> tensor<32x512xi1>
2026-02-21T08:25:56.4748134Z         %130 = tt.load %127, %129, %cst evictionPolicy = evict_first : tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4748485Z         %131 = tt.expand_dims %22 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:25:56.4748849Z         %132 = arith.extf %130 : tensor<32x512xf16> to tensor<32x512xf32>
2026-02-21T08:25:56.4749133Z         %133 = tt.broadcast %131 : tensor<32x1xf32> -> tensor<32x512xf32>
2026-02-21T08:25:56.4749392Z         %134 = arith.subf %132, %133 : tensor<32x512xf32>
2026-02-21T08:25:56.4749781Z         %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32>
2026-02-21T08:25:56.4750225Z         %136 = tt.expand_dims %33 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:25:56.4750537Z         %137 = tt.broadcast %136 : tensor<32x1xf32> -> tensor<32x512xf32>
2026-02-21T08:25:56.4750793Z         %138 = arith.divf %135, %137 : tensor<32x512xf32>
2026-02-21T08:25:56.4751040Z         %139 = arith.truncf %138 : tensor<32x512xf32> to tensor<32x512xf16>
2026-02-21T08:25:56.4751309Z         %140 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4751633Z         %141 = tt.addptr %140, %125 : tensor<32x512x!tt.ptr<f16>>, tensor<32x512xi32>
2026-02-21T08:25:56.4751915Z         tt.store %141, %139, %129 : tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4752130Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T08:25:56.4752328Z         %142 = arith.muli %c512_i32, %c3_i32 : i32
2026-02-21T08:25:56.4752520Z         %143 = arith.addi %arg3, %142 : i32
2026-02-21T08:25:56.4752755Z         %144 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:56.4753008Z         %145 = tt.splat %143 : i32 -> tensor<512xi32>
2026-02-21T08:25:56.4753218Z         %146 = arith.addi %145, %144 : tensor<512xi32>
2026-02-21T08:25:56.4753444Z         %147 = arith.cmpi slt, %146, %cst_3 : tensor<512xi32>
2026-02-21T08:25:56.4753705Z         %148 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:25:56.4753975Z         %149 = arith.muli %148, %cst_0 : tensor<32x1xi32>
2026-02-21T08:25:56.4754235Z         %150 = tt.expand_dims %146 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:25:56.4754534Z         %151 = tt.broadcast %149 : tensor<32x1xi32> -> tensor<32x512xi32>
2026-02-21T08:25:56.4754801Z         %152 = tt.broadcast %150 : tensor<1x512xi32> -> tensor<32x512xi32>
2026-02-21T08:25:56.4755053Z         %153 = arith.addi %151, %152 : tensor<32x512xi32>
2026-02-21T08:25:56.4755302Z         %154 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4755588Z         %155 = tt.addptr %154, %153 : tensor<32x512x!tt.ptr<f16>>, tensor<32x512xi32>
2026-02-21T08:25:56.4755903Z         %156 = tt.expand_dims %147 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:56.4756193Z         %157 = tt.broadcast %156 : tensor<1x512xi1> -> tensor<32x512xi1>
2026-02-21T08:25:56.4756514Z         %158 = tt.load %155, %157, %cst evictionPolicy = evict_first : tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4756855Z         %159 = tt.expand_dims %22 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:25:56.4757190Z         %160 = arith.extf %158 : tensor<32x512xf16> to tensor<32x512xf32>
2026-02-21T08:25:56.4757460Z         %161 = tt.broadcast %159 : tensor<32x1xf32> -> tensor<32x512xf32>
2026-02-21T08:25:56.4757692Z         %162 = arith.subf %160, %161 : tensor<32x512xf32>
2026-02-21T08:25:56.4758072Z         %163 = tt.extern_elementwise %162 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32>
2026-02-21T08:25:56.4758498Z         %164 = tt.expand_dims %33 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:25:56.4758789Z         %165 = tt.broadcast %164 : tensor<32x1xf32> -> tensor<32x512xf32>
2026-02-21T08:25:56.4759031Z         %166 = arith.divf %163, %165 : tensor<32x512xf32>
2026-02-21T08:25:56.4759266Z         %167 = arith.truncf %166 : tensor<32x512xf32> to tensor<32x512xf16>
2026-02-21T08:25:56.4759543Z         %168 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4759889Z         %169 = tt.addptr %168, %153 : tensor<32x512x!tt.ptr<f16>>, tensor<32x512xi32>
2026-02-21T08:25:56.4760171Z         tt.store %169, %167, %157 : tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4760390Z       } {tt.num_stages = 1 : i32}
2026-02-21T08:25:56.4760610Z       %34 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:25:56.4760874Z       %35 = tt.splat %c4096_i32_7 : i32 -> tensor<512xi32>
2026-02-21T08:25:56.4761087Z       %36 = arith.addi %35, %34 : tensor<512xi32>
2026-02-21T08:25:56.4761300Z       %37 = arith.cmpi slt, %36, %cst_3 : tensor<512xi32>
2026-02-21T08:25:56.4761612Z       %38 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:25:56.4761882Z       %39 = arith.muli %38, %cst_0 : tensor<32x1xi32>
2026-02-21T08:25:56.4762137Z       %40 = tt.expand_dims %36 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:25:56.4762419Z       %41 = tt.broadcast %39 : tensor<32x1xi32> -> tensor<32x512xi32>
2026-02-21T08:25:56.4762695Z       %42 = tt.broadcast %40 : tensor<1x512xi32> -> tensor<32x512xi32>
2026-02-21T08:25:56.4762929Z       %43 = arith.addi %41, %42 : tensor<32x512xi32>
2026-02-21T08:25:56.4763170Z       %44 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4763445Z       %45 = tt.addptr %44, %43 : tensor<32x512x!tt.ptr<f16>>, tensor<32x512xi32>
2026-02-21T08:25:56.4763744Z       %46 = tt.expand_dims %37 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T08:25:56.4764036Z       %47 = tt.broadcast %46 : tensor<1x512xi1> -> tensor<32x512xi1>
2026-02-21T08:25:56.4764328Z       %48 = tt.load %45, %47, %cst evictionPolicy = evict_first : tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4764653Z       %49 = tt.expand_dims %22 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:25:56.4764928Z       %50 = arith.extf %48 : tensor<32x512xf16> to tensor<32x512xf32>
2026-02-21T08:25:56.4765188Z       %51 = tt.broadcast %49 : tensor<32x1xf32> -> tensor<32x512xf32>
2026-02-21T08:25:56.4765423Z       %52 = arith.subf %50, %51 : tensor<32x512xf32>
2026-02-21T08:25:56.4765780Z       %53 = tt.extern_elementwise %52 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x512xf32>) -> tensor<32x512xf32>
2026-02-21T08:25:56.4766186Z       %54 = tt.expand_dims %33 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:25:56.4766460Z       %55 = tt.broadcast %54 : tensor<32x1xf32> -> tensor<32x512xf32>
2026-02-21T08:25:56.4766695Z       %56 = arith.divf %53, %55 : tensor<32x512xf32>
2026-02-21T08:25:56.4766923Z       %57 = arith.truncf %56 : tensor<32x512xf32> to tensor<32x512xf16>
2026-02-21T08:25:56.4767191Z       %58 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4767465Z       %59 = tt.addptr %58, %43 : tensor<32x512x!tt.ptr<f16>>, tensor<32x512xi32>
2026-02-21T08:25:56.4767721Z       tt.store %59, %57, %47 : tensor<32x512x!tt.ptr<f16>>
2026-02-21T08:25:56.4767944Z     } {tt.flatten, tt.warp_specialize}
2026-02-21T08:25:56.4768170Z     tt.return
2026-02-21T08:25:56.4768310Z   }
2026-02-21T08:25:56.4768433Z }
2026-02-21T08:25:56.4768513Z 
2026-02-21T08:25:56.4768565Z {-#
2026-02-21T08:25:56.4768706Z   external_resources: {
2026-02-21T08:25:56.4768869Z     mlir_reproducer: {
2026-02-21T08:25:56.4773359Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:25:56.4777683Z       disable_threading: false,
2026-02-21T08:25:56.4777856Z       verify_each: true
2026-02-21T08:25:56.4777996Z     }
2026-02-21T08:25:56.4778118Z   }
2026-02-21T08:25:56.4778231Z #-}
2026-02-21T08:25:56.4778656Z /tmp/torchinductor_root/ey/ceyvajt67oz5df4svoktjonn76nenfn6xt77ywdibvxxhau3er2c.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:25:56.4782510Z /tmp/torchinductor_root/ey/ceyvajt67oz5df4svoktjonn76nenfn6xt77ywdibvxxhau3er2c.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:25:56.4783485Z [186s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:25:56.4784539Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 512], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=3, num_warps=32, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:25:56.4785542Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:25:56.4785807Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:25:57.3746739Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 16.2 configs/s
2026-02-21T08:25:59.0335009Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 611.7 configs/s
2026-02-21T08:25:59.1784393Z [188s] Generation 5 complete: 
2026-02-21T08:25:59.1789294Z error=2
2026-02-21T08:25:59.1793683Z ok=75
2026-02-21T08:25:59.1795573Z min=0.0204
2026-02-21T08:25:59.1795736Z mid=0.0368
2026-02-21T08:25:59.1795879Z max=1.0691
2026-02-21T08:25:59.1796017Z best={'block_sizes': [1, 8192],
2026-02-21T08:25:59.1796277Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:25:59.1796535Z  'load_eviction_policies': ['', ''],
2026-02-21T08:25:59.1796718Z  'num_sm_multiplier': 32,
2026-02-21T08:25:59.1796883Z  'num_stages': 5,
2026-02-21T08:25:59.1797024Z  'num_warps': 1,
2026-02-21T08:25:59.1797183Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:25:59.1797372Z  'range_flattens': [True, True],
2026-02-21T08:25:59.1797549Z  'range_multi_buffers': [False, None],
2026-02-21T08:25:59.1797725Z  'range_num_stages': [3, 0],
2026-02-21T08:25:59.1797894Z  'range_unroll_factors': [0, 2],
2026-02-21T08:25:59.1798075Z  'range_warp_specializes': [True, None]}
2026-02-21T08:25:59.1807540Z [188s] Fitting surrogate: 509 points, 509 targets
2026-02-21T08:25:59.9929507Z [189s] Generation 6 starting: 47 neighbors, 4 active search path(s)
2026-02-21T08:26:05.1812132Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50/50 6.0 configs/s
2026-02-21T08:26:08.1834372Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 50/50 16.9 configs/s
2026-02-21T08:26:09.8789935Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 597.5 configs/s
2026-02-21T08:26:10.0097090Z [199s] Generation 6 complete: 
2026-02-21T08:26:10.0098792Z ok=52
2026-02-21T08:26:10.0098964Z min=0.0204
2026-02-21T08:26:10.0099092Z mid=0.0307
2026-02-21T08:26:10.0099222Z max=0.0890
2026-02-21T08:26:10.0099357Z best={'block_sizes': [1, 8192],
2026-02-21T08:26:10.0099614Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:26:10.0099869Z  'load_eviction_policies': ['', ''],
2026-02-21T08:26:10.0100046Z  'num_stages': 6,
2026-02-21T08:26:10.0100193Z  'num_warps': 1,
2026-02-21T08:26:10.0100361Z  'pid_type': 'flat',
2026-02-21T08:26:10.0100540Z  'range_flattens': [None, True],
2026-02-21T08:26:10.0100713Z  'range_multi_buffers': [None, None],
2026-02-21T08:26:10.0100897Z  'range_num_stages': [0, 4],
2026-02-21T08:26:10.0101059Z  'range_unroll_factors': [0, 0],
2026-02-21T08:26:10.0101243Z  'range_warp_specializes': [None, True]}
2026-02-21T08:26:10.0117813Z [199s] Fitting surrogate: 561 points, 561 targets
2026-02-21T08:26:10.6860679Z [200s] Generation 7 starting: 40 neighbors, 3 active search path(s)
2026-02-21T08:26:13.5418832Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41/41 16.4 configs/s
2026-02-21T08:26:16.0178707Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 16.9 configs/s
2026-02-21T08:26:17.3021736Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 786.4 configs/s
2026-02-21T08:26:17.4081108Z [207s] Generation 7 complete: 
2026-02-21T08:26:17.4086191Z ok=44
2026-02-21T08:26:17.4090541Z min=0.0204
2026-02-21T08:26:17.4095530Z mid=0.0326
2026-02-21T08:26:17.4096901Z max=0.0891
2026-02-21T08:26:17.4097087Z best={'block_sizes': [1, 8192],
2026-02-21T08:26:17.4097339Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:26:17.4097600Z  'load_eviction_policies': ['', ''],
2026-02-21T08:26:17.4097784Z  'num_sm_multiplier': 32,
2026-02-21T08:26:17.4097950Z  'num_stages': 7,
2026-02-21T08:26:17.4098087Z  'num_warps': 1,
2026-02-21T08:26:17.4098339Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:26:17.4098541Z  'range_flattens': [None, True],
2026-02-21T08:26:17.4098717Z  'range_multi_buffers': [False, None],
2026-02-21T08:26:17.4100314Z  'range_num_stages': [3, 1],
2026-02-21T08:26:17.4100534Z  'range_unroll_factors': [0, 2],
2026-02-21T08:26:17.4100782Z  'range_warp_specializes': [True, None]}
2026-02-21T08:26:17.4105452Z [207s] Fitting surrogate: 605 points, 605 targets
2026-02-21T08:26:17.6843509Z [207s] Autotuning complete in 207.5s after searching 585 configs.
2026-02-21T08:26:17.6843939Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:26:17.6844993Z     @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', ''], num_sm_multiplier=32, num_stages=7, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:26:17.6845889Z 
2026-02-21T08:26:17.6846141Z [207s] Code of selected kernel: /tmp/torchinductor_root/di/cdidzf2wwarfgduqe66h5bkfocvrxgivwwf2hmp4pjyehufbn26t.py
2026-02-21T08:26:18.6302495Z WARNING:tritonbench.utils.triton_op:Completed input ID 31:
2026-02-21T08:26:18.6302751Z (M, N)
2026-02-21T08:26:18.6302890Z ------------
2026-02-21T08:26:18.6303023Z (4096, 4224)
2026-02-21T08:26:18.6303105Z 
2026-02-21T08:26:18.6310116Z  35%|███▌      | 7/20 [17:16<35:59, 166.15s/it]
2026-02-21T08:26:18.6310116Z WARNING:tritonbench.utils.triton_op:Running input ID 36:
2026-02-21T08:26:18.6310474Z (M, N)
2026-02-21T08:26:18.6310608Z ------------
2026-02-21T08:26:18.6310759Z (4096, 4864)
2026-02-21T08:26:18.6314951Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T08:26:19.8997384Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T08:26:21.4220262Z INFO:tritonbench.utils.triton_op:Took 2.19ms to get benchmark function for torch_compile_softmax
2026-02-21T08:26:24.9112193Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:26:24.9113663Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:26:24.9113920Z               'dtype': 'torch.float16',
2026-02-21T08:26:24.9114179Z               'shape': (4096, 4864),
2026-02-21T08:26:24.9114384Z               'stride': (4864, 1)},),
2026-02-21T08:26:24.9118860Z   'kwargs': {}}
2026-02-21T08:26:24.9133659Z INFO:tritonbench.utils.triton_op:Took 2.33ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T08:26:25.0920435Z [0s] Autotune random seed: 2134816249
2026-02-21T08:26:25.2294253Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:26:58.9054532Z [33s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None])
2026-02-21T08:26:59.1317273Z [33s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False])
2026-02-21T08:26:59.2770522Z [34s] Timeout after 30s compiling Config(block_sizes=[256, 4096], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=128, num_sm_multiplier=128, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 2], range_unroll_factors=[3, 0], range_warp_specializes=[None, True])
2026-02-21T08:26:59.2786611Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s
2026-02-21T08:27:05.7565579Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 15.5 configs/s
2026-02-21T08:27:05.7576247Z [40s] Adaptive compile timeout: 30s (90% percentile=5.2s, bounds=[30.0s, 30s])
2026-02-21T08:27:06.2271517Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 2085.1 configs/s
2026-02-21T08:27:06.2730294Z [41s] Initial random population of 100, 5 starting points: 
2026-02-21T08:27:06.2733414Z error=5
2026-02-21T08:27:06.2739190Z timeout=3
2026-02-21T08:27:06.2743820Z ok=92
2026-02-21T08:27:06.2745703Z min=0.0328
2026-02-21T08:27:06.2745862Z mid=0.4394
2026-02-21T08:27:06.2745998Z max=32.5366
2026-02-21T08:27:06.2746142Z best={'block_sizes': [1, 8192],
2026-02-21T08:27:06.2746381Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:27:06.2746620Z  'load_eviction_policies': ['', 'last'],
2026-02-21T08:27:06.2746800Z  'maxnreg': 32,
2026-02-21T08:27:06.2746954Z  'num_sm_multiplier': 64,
2026-02-21T08:27:06.2747109Z  'num_stages': 7,
2026-02-21T08:27:06.2747254Z  'num_warps': 4,
2026-02-21T08:27:06.2747409Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:27:06.2747609Z  'range_flattens': [None, True],
2026-02-21T08:27:06.2747784Z  'range_multi_buffers': [False, True],
2026-02-21T08:27:06.2747975Z  'range_num_stages': [1, 4],
2026-02-21T08:27:06.2748155Z  'range_unroll_factors': [1, 4],
2026-02-21T08:27:06.2748356Z  'range_warp_specializes': [True, None]}
2026-02-21T08:27:06.2748668Z [41s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:27:07.3784260Z [42s] Generation 1 starting: 85 neighbors, 5 active search path(s)
2026-02-21T08:27:20.0870688Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 1.3 configs/s
2026-02-21T08:27:21.3578074Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:27:21.3582979Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:27:21.3583540Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:27:21.3584226Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:27:21.3584550Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:27:21.3584756Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:27:21.3585053Z     %cst = arith.constant dense<0.000000e+00> : tensor<8x1024xf32>
2026-02-21T08:27:21.3585359Z     %cst_0 = arith.constant dense<0xFC00> : tensor<8x1024xf16>
2026-02-21T08:27:21.3585611Z     %cst_1 = arith.constant dense<4864> : tensor<1024xi32>
2026-02-21T08:27:21.3585867Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<8xf32>
2026-02-21T08:27:21.3586148Z     %cst_3 = arith.constant dense<0xFF800000> : tensor<8xf32>
2026-02-21T08:27:21.3586370Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:27:21.3586547Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:27:21.3586735Z     %c4864_i32 = arith.constant 4864 : i32
2026-02-21T08:27:21.3586915Z     %c4864_i64 = arith.constant 4864 : i64
2026-02-21T08:27:21.3587089Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:27:21.3587410Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4864_i32], [%c4864_i64, %c1_i64] : <f16>, <tensor<8x1024xf16>>
2026-02-21T08:27:21.3587869Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4864_i32], [%c4864_i64, %c1_i64] : <f16>, <tensor<8x1024xf16>>
2026-02-21T08:27:21.3588522Z     %2 = tt.get_program_id x : i32
2026-02-21T08:27:21.3588743Z     scf.for %arg2 = %2 to %c512_i32 step %c9472_i32  : i32 {
2026-02-21T08:27:21.3588976Z       %3 = arith.muli %arg2, %c8_i32 : i32
2026-02-21T08:27:21.3589184Z       %c4096_i32_4 = arith.constant 4096 : i32
2026-02-21T08:27:21.3589382Z       %c2048_i32 = arith.constant 2048 : i32
2026-02-21T08:27:21.3589786Z       %4:2 = scf.for %arg3 = %c0_i32 to %c4096_i32_4 step %c2048_i32 iter_args(%arg4 = %cst_3, %arg5 = %cst_2) -> (tensor<8xf32>, tensor<8xf32>)  : i32 {
2026-02-21T08:27:21.3590230Z         %42 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T08:27:21.3590516Z         %43 = tt.splat %arg3 : i32 -> tensor<1024xi32>
2026-02-21T08:27:21.3590747Z         %44 = arith.addi %43, %42 : tensor<1024xi32>
2026-02-21T08:27:21.3590976Z         %45 = arith.cmpi slt, %44, %cst_1 : tensor<1024xi32>
2026-02-21T08:27:21.3591459Z         %46 = tt.descriptor_load %0[%3, %arg3] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T08:27:21.3593921Z         %47 = tt.expand_dims %45 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T08:27:21.3594243Z         %48 = tt.broadcast %47 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T08:27:21.3594536Z         %49 = arith.select %48, %46, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T08:27:21.3594839Z         %50 = arith.extf %49 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T08:27:21.3595100Z         %51 = "tt.reduce"(%50) <{axis = 1 : i32}> ({
2026-02-21T08:27:21.3595308Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:27:21.3595543Z           %98 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:27:21.3595749Z           tt.reduce.return %98 : f32
2026-02-21T08:27:21.3595957Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T08:27:21.3596192Z         %52 = arith.truncf %51 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:27:21.3596455Z         %53 = arith.extf %52 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:27:21.3596711Z         %54 = arith.cmpf ogt, %arg4, %53 : tensor<8xf32>
2026-02-21T08:27:21.3596943Z         %55 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32>
2026-02-21T08:27:21.3597170Z         %56 = arith.ori %54, %55 : tensor<8xi1>
2026-02-21T08:27:21.3597407Z         %57 = arith.select %56, %arg4, %53 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:27:21.3597660Z         %58 = arith.subf %arg4, %57 : tensor<8xf32>
2026-02-21T08:27:21.3598036Z         %59 = tt.extern_elementwise %58 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:27:21.3598404Z         %60 = arith.mulf %arg5, %59 : tensor<8xf32>
2026-02-21T08:27:21.3598655Z         %61 = tt.expand_dims %57 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:27:21.3598936Z         %62 = arith.extf %46 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T08:27:21.3599200Z         %63 = tt.broadcast %61 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T08:27:21.3599440Z         %64 = arith.subf %62, %63 : tensor<8x1024xf32>
2026-02-21T08:27:21.3599811Z         %65 = tt.extern_elementwise %64 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T08:27:21.3600224Z         %66 = arith.select %48, %65, %cst : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T08:27:21.3600473Z         %67 = "tt.reduce"(%66) <{axis = 1 : i32}> ({
2026-02-21T08:27:21.3600669Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:27:21.3600850Z           %98 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:27:21.3601044Z           tt.reduce.return %98 : f32
2026-02-21T08:27:21.3601226Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T08:27:21.3601430Z         %68 = arith.addf %60, %67 : tensor<8xf32>
2026-02-21T08:27:21.3601679Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:27:21.3601878Z         %69 = arith.muli %c1024_i32, %c1_i32 : i32
2026-02-21T08:27:21.3602079Z         %70 = arith.addi %arg3, %69 : i32
2026-02-21T08:27:21.3602440Z         %71 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T08:27:21.3602707Z         %72 = tt.splat %70 : i32 -> tensor<1024xi32>
2026-02-21T08:27:21.3602913Z         %73 = arith.addi %72, %71 : tensor<1024xi32>
2026-02-21T08:27:21.3603136Z         %74 = arith.cmpi slt, %73, %cst_1 : tensor<1024xi32>
2026-02-21T08:27:21.3603438Z         %75 = tt.descriptor_load %0[%3, %70] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T08:27:21.3603794Z         %76 = tt.expand_dims %74 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T08:27:21.3604097Z         %77 = tt.broadcast %76 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T08:27:21.3604372Z         %78 = arith.select %77, %75, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T08:27:21.3604655Z         %79 = arith.extf %78 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T08:27:21.3604884Z         %80 = "tt.reduce"(%79) <{axis = 1 : i32}> ({
2026-02-21T08:27:21.3605157Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:27:21.3605342Z           %98 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:27:21.3605538Z           tt.reduce.return %98 : f32
2026-02-21T08:27:21.3605726Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T08:27:21.3605939Z         %81 = arith.truncf %80 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:27:21.3606179Z         %82 = arith.extf %81 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:27:21.3606399Z         %83 = arith.cmpf ogt, %57, %82 : tensor<8xf32>
2026-02-21T08:27:21.3606614Z         %84 = arith.cmpf une, %57, %57 : tensor<8xf32>
2026-02-21T08:27:21.3606809Z         %85 = arith.ori %83, %84 : tensor<8xi1>
2026-02-21T08:27:21.3607047Z         %86 = arith.select %85, %57, %82 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:27:21.3607280Z         %87 = arith.subf %57, %86 : tensor<8xf32>
2026-02-21T08:27:21.3607631Z         %88 = tt.extern_elementwise %87 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:27:21.3608005Z         %89 = arith.mulf %68, %88 : tensor<8xf32>
2026-02-21T08:27:21.3608253Z         %90 = tt.expand_dims %86 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:27:21.3608547Z         %91 = arith.extf %75 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T08:27:21.3608805Z         %92 = tt.broadcast %90 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T08:27:21.3609039Z         %93 = arith.subf %91, %92 : tensor<8x1024xf32>
2026-02-21T08:27:21.3609404Z         %94 = tt.extern_elementwise %93 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T08:27:21.3609807Z         %95 = arith.select %77, %94, %cst : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T08:27:21.3610057Z         %96 = "tt.reduce"(%95) <{axis = 1 : i32}> ({
2026-02-21T08:27:21.3610242Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:27:21.3610428Z           %98 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:27:21.3610620Z           tt.reduce.return %98 : f32
2026-02-21T08:27:21.3610803Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T08:27:21.3611003Z         %97 = arith.addf %89, %96 : tensor<8xf32>
2026-02-21T08:27:21.3611209Z         scf.yield %86, %97 : tensor<8xf32>, tensor<8xf32>
2026-02-21T08:27:21.3611425Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:27:21.3611692Z       %5 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T08:27:21.3611963Z       %6 = tt.splat %c4096_i32_4 : i32 -> tensor<1024xi32>
2026-02-21T08:27:21.3612173Z       %7 = arith.addi %6, %5 : tensor<1024xi32>
2026-02-21T08:27:21.3612388Z       %8 = arith.cmpi slt, %7, %cst_1 : tensor<1024xi32>
2026-02-21T08:27:21.3612706Z       %9 = tt.descriptor_load %0[%3, %c4096_i32_4] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T08:27:21.3613062Z       %10 = tt.expand_dims %8 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T08:27:21.3613358Z       %11 = tt.broadcast %10 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T08:27:21.3613699Z       %12 = arith.select %11, %9, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T08:27:21.3613975Z       %13 = arith.extf %12 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T08:27:21.3614205Z       %14 = "tt.reduce"(%13) <{axis = 1 : i32}> ({
2026-02-21T08:27:21.3614393Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:27:21.3614579Z         %42 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:27:21.3614767Z         tt.reduce.return %42 : f32
2026-02-21T08:27:21.3614953Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T08:27:21.3615165Z       %15 = arith.truncf %14 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:27:21.3615406Z       %16 = arith.extf %15 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:27:21.3615627Z       %17 = arith.cmpf ogt, %4#0, %16 : tensor<8xf32>
2026-02-21T08:27:21.3615846Z       %18 = arith.cmpf une, %4#0, %4#0 : tensor<8xf32>
2026-02-21T08:27:21.3616114Z       %19 = arith.ori %17, %18 : tensor<8xi1>
2026-02-21T08:27:21.3616336Z       %20 = arith.select %19, %4#0, %16 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:27:21.3616566Z       %21 = arith.subf %4#0, %20 : tensor<8xf32>
2026-02-21T08:27:21.3616912Z       %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:27:21.3617269Z       %23 = arith.mulf %4#1, %22 : tensor<8xf32>
2026-02-21T08:27:21.3617514Z       %24 = tt.expand_dims %20 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:27:21.3617790Z       %25 = arith.extf %9 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T08:27:21.3618050Z       %26 = tt.broadcast %24 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T08:27:21.3618283Z       %27 = arith.subf %25, %26 : tensor<8x1024xf32>
2026-02-21T08:27:21.3618676Z       %28 = tt.extern_elementwise %27 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T08:27:21.3619087Z       %29 = arith.select %11, %28, %cst : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T08:27:21.3619329Z       %30 = "tt.reduce"(%29) <{axis = 1 : i32}> ({
2026-02-21T08:27:21.3619523Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:27:21.3619699Z         %42 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:27:21.3619892Z         tt.reduce.return %42 : f32
2026-02-21T08:27:21.3620083Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T08:27:21.3620280Z       %31 = arith.addf %23, %30 : tensor<8xf32>
2026-02-21T08:27:21.3620478Z       %c4096_i32_5 = arith.constant 4096 : i32
2026-02-21T08:27:21.3620669Z       %c2048_i32_6 = arith.constant 2048 : i32
2026-02-21T08:27:21.3620915Z       scf.for %arg3 = %c0_i32 to %c4096_i32_5 step %c2048_i32_6  : i32 {
2026-02-21T08:27:21.3621249Z         %42 = tt.descriptor_load %0[%3, %arg3] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T08:27:21.3621650Z         %43 = tt.expand_dims %20 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:27:21.3621946Z         %44 = arith.extf %42 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T08:27:21.3622203Z         %45 = tt.broadcast %43 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T08:27:21.3622440Z         %46 = arith.subf %44, %45 : tensor<8x1024xf32>
2026-02-21T08:27:21.3622804Z         %47 = tt.extern_elementwise %46 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T08:27:21.3623222Z         %48 = tt.expand_dims %31 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:27:21.3623508Z         %49 = tt.broadcast %48 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T08:27:21.3623736Z         %50 = arith.divf %47, %49 : tensor<8x1024xf32>
2026-02-21T08:27:21.3623974Z         %51 = arith.truncf %50 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T08:27:21.3624299Z         tt.descriptor_store %1[%3, %arg3], %51 : !tt.tensordesc<tensor<8x1024xf16>>, tensor<8x1024xf16>
2026-02-21T08:27:21.3624652Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:27:21.3624842Z         %52 = arith.muli %c1024_i32, %c1_i32 : i32
2026-02-21T08:27:21.3625039Z         %53 = arith.addi %arg3, %52 : i32
2026-02-21T08:27:21.3625314Z         %54 = tt.descriptor_load %0[%3, %53] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T08:27:21.3625645Z         %55 = tt.expand_dims %20 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:27:21.3625932Z         %56 = arith.extf %54 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T08:27:21.3626184Z         %57 = tt.broadcast %55 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T08:27:21.3626418Z         %58 = arith.subf %56, %57 : tensor<8x1024xf32>
2026-02-21T08:27:21.3626782Z         %59 = tt.extern_elementwise %58 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T08:27:21.3627276Z         %60 = tt.expand_dims %31 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:27:21.3627565Z         %61 = tt.broadcast %60 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T08:27:21.3627794Z         %62 = arith.divf %59, %61 : tensor<8x1024xf32>
2026-02-21T08:27:21.3628031Z         %63 = arith.truncf %62 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T08:27:21.3628341Z         tt.descriptor_store %1[%3, %53], %63 : !tt.tensordesc<tensor<8x1024xf16>>, tensor<8x1024xf16>
2026-02-21T08:27:21.3628628Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:27:21.3628931Z       %32 = tt.descriptor_load %0[%3, %c4096_i32_5] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T08:27:21.3629274Z       %33 = tt.expand_dims %20 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:27:21.3629561Z       %34 = arith.extf %32 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T08:27:21.3629811Z       %35 = tt.broadcast %33 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T08:27:21.3630049Z       %36 = arith.subf %34, %35 : tensor<8x1024xf32>
2026-02-21T08:27:21.3630412Z       %37 = tt.extern_elementwise %36 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T08:27:21.3630825Z       %38 = tt.expand_dims %31 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:27:21.3631106Z       %39 = tt.broadcast %38 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T08:27:21.3631333Z       %40 = arith.divf %37, %39 : tensor<8x1024xf32>
2026-02-21T08:27:21.3631598Z       %41 = arith.truncf %40 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T08:27:21.3631926Z       tt.descriptor_store %1[%3, %c4096_i32_5], %41 : !tt.tensordesc<tensor<8x1024xf16>>, tensor<8x1024xf16>
2026-02-21T08:27:21.3632297Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:27:21.3632553Z     tt.return
2026-02-21T08:27:21.3632678Z   }
2026-02-21T08:27:21.3632805Z }
2026-02-21T08:27:21.3632876Z 
2026-02-21T08:27:21.3632926Z {-#
2026-02-21T08:27:21.3633065Z   external_resources: {
2026-02-21T08:27:21.3633224Z     mlir_reproducer: {
2026-02-21T08:27:21.3637715Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:27:21.3642477Z       disable_threading: false,
2026-02-21T08:27:21.3642650Z       verify_each: true
2026-02-21T08:27:21.3642867Z     }
2026-02-21T08:27:21.3642994Z   }
2026-02-21T08:27:21.3643105Z #-}
2026-02-21T08:27:21.3643540Z /tmp/torchinductor_root/62/c62yhclaekpli4npmlpzx47qarjsmheeqhqutpmunqmhec3r3562.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:27:21.3644740Z /tmp/torchinductor_root/62/c62yhclaekpli4npmlpzx47qarjsmheeqhqutpmunqmhec3r3562.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:27:21.3645714Z [56s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:27:21.3646839Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', ''], maxnreg=32, num_sm_multiplier=64, num_stages=5, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:27:21.3647844Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:27:21.3648098Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:27:25.4406640Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 16.8 configs/s
2026-02-21T08:27:29.5487647Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 246.9 configs/s
2026-02-21T08:27:29.8192315Z [64s] Generation 1 complete: 
2026-02-21T08:27:29.8196710Z error=1
2026-02-21T08:27:29.8198132Z ok=90
2026-02-21T08:27:29.8198319Z min=0.0285
2026-02-21T08:27:29.8198456Z mid=0.0389
2026-02-21T08:27:29.8198599Z max=0.6842
2026-02-21T08:27:29.8198770Z best={'block_sizes': [1, 8192],
2026-02-21T08:27:29.8199020Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:27:29.8199263Z  'load_eviction_policies': ['', 'last'],
2026-02-21T08:27:29.8199444Z  'num_stages': 7,
2026-02-21T08:27:29.8199590Z  'num_warps': 4,
2026-02-21T08:27:29.8199729Z  'pid_type': 'flat',
2026-02-21T08:27:29.8199894Z  'range_flattens': [None, True],
2026-02-21T08:27:29.8200073Z  'range_multi_buffers': [None, None],
2026-02-21T08:27:29.8200265Z  'range_num_stages': [0, 4],
2026-02-21T08:27:29.8200431Z  'range_unroll_factors': [0, 0],
2026-02-21T08:27:29.8200633Z  'range_warp_specializes': [None, True]}
2026-02-21T08:27:29.8208041Z [64s] Fitting surrogate: 191 points, 191 targets
2026-02-21T08:27:30.7752839Z [65s] Generation 2 starting: 68 neighbors, 5 active search path(s)
2026-02-21T08:27:39.2376929Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 3.2 configs/s
2026-02-21T08:27:43.4099212Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 17.0 configs/s
2026-02-21T08:27:46.6335858Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 351.9 configs/s
2026-02-21T08:27:46.8432569Z [81s] Generation 2 complete: 
2026-02-21T08:27:46.8437602Z error=1
2026-02-21T08:27:46.8439429Z ok=72
2026-02-21T08:27:46.8439596Z min=0.0245
2026-02-21T08:27:46.8439734Z mid=0.0328
2026-02-21T08:27:46.8439851Z max=0.2530
2026-02-21T08:27:46.8439995Z best={'block_sizes': [2, 8192],
2026-02-21T08:27:46.8440260Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:27:46.8440536Z  'load_eviction_policies': ['', ''],
2026-02-21T08:27:46.8440711Z  'num_stages': 4,
2026-02-21T08:27:46.8440858Z  'num_warps': 4,
2026-02-21T08:27:46.8441000Z  'pid_type': 'flat',
2026-02-21T08:27:46.8441165Z  'range_flattens': [None, None],
2026-02-21T08:27:46.8441342Z  'range_multi_buffers': [None, False],
2026-02-21T08:27:46.8442132Z  'range_num_stages': [0, 4],
2026-02-21T08:27:46.8442333Z  'range_unroll_factors': [0, 0],
2026-02-21T08:27:46.8442511Z  'range_warp_specializes': [None, True]}
2026-02-21T08:27:46.8448543Z [81s] Fitting surrogate: 264 points, 264 targets
2026-02-21T08:27:47.7545288Z [82s] Generation 3 starting: 60 neighbors, 5 active search path(s)
2026-02-21T08:27:54.5200987Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 4.6 configs/s
2026-02-21T08:27:58.2022007Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 61/61 16.8 configs/s
2026-02-21T08:28:01.4215721Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 315.3 configs/s
2026-02-21T08:28:01.6546290Z [96s] Generation 3 complete: 
2026-02-21T08:28:01.6550623Z ok=66
2026-02-21T08:28:01.6554634Z min=0.0205
2026-02-21T08:28:01.6558520Z mid=0.0287
2026-02-21T08:28:01.6560141Z max=1.0701
2026-02-21T08:28:01.6560395Z best={'block_sizes': [1, 8192],
2026-02-21T08:28:01.6565208Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:28:01.6569501Z  'load_eviction_policies': ['', ''],
2026-02-21T08:28:01.6574001Z  'num_stages': 5,
2026-02-21T08:28:01.6578404Z  'num_warps': 4,
2026-02-21T08:28:01.6582297Z  'pid_type': 'flat',
2026-02-21T08:28:01.6586303Z  'range_flattens': [None, True],
2026-02-21T08:28:01.6587955Z  'range_multi_buffers': [None, False],
2026-02-21T08:28:01.6588212Z  'range_num_stages': [0, 4],
2026-02-21T08:28:01.6588404Z  'range_unroll_factors': [0, 0],
2026-02-21T08:28:01.6588607Z  'range_warp_specializes': [None, True]}
2026-02-21T08:28:01.6588822Z [96s] Fitting surrogate: 330 points, 330 targets
2026-02-21T08:28:02.3506560Z [97s] Generation 4 starting: 47 neighbors, 4 active search path(s)
2026-02-21T08:28:27.9282882Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50/50 0.4 configs/s
2026-02-21T08:28:30.9909524Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 50/50 16.6 configs/s
2026-02-21T08:28:33.2987426Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 566.1 configs/s
2026-02-21T08:28:33.4554324Z [128s] Generation 4 complete: 
2026-02-21T08:28:33.4558420Z ok=51
2026-02-21T08:28:33.4563426Z min=0.0204
2026-02-21T08:28:33.4565681Z mid=0.0287
2026-02-21T08:28:33.4565911Z max=0.1659
2026-02-21T08:28:33.4570081Z best={'block_sizes': [1, 8192],
2026-02-21T08:28:33.4571825Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:28:33.4572189Z  'load_eviction_policies': ['', ''],
2026-02-21T08:28:33.4576138Z  'num_stages': 5,
2026-02-21T08:28:33.4581226Z  'num_warps': 4,
2026-02-21T08:28:33.4582693Z  'pid_type': 'flat',
2026-02-21T08:28:33.4582899Z  'range_flattens': [None, False],
2026-02-21T08:28:33.4583091Z  'range_multi_buffers': [None, False],
2026-02-21T08:28:33.4583285Z  'range_num_stages': [0, 1],
2026-02-21T08:28:33.4584115Z  'range_unroll_factors': [0, 2],
2026-02-21T08:28:33.4584306Z  'range_warp_specializes': [None, None]}
2026-02-21T08:28:33.4584593Z [128s] Fitting surrogate: 381 points, 381 targets
2026-02-21T08:28:34.2035630Z [128s] Generation 5 starting: 44 neighbors, 4 active search path(s)
2026-02-21T08:28:41.3889625Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47/47 2.9 configs/s
2026-02-21T08:28:44.2529695Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 47/47 16.7 configs/s
2026-02-21T08:28:45.9415510Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 600.1 configs/s
2026-02-21T08:28:46.0863050Z [140s] Generation 5 complete: 
2026-02-21T08:28:46.0867242Z ok=49
2026-02-21T08:28:46.0872077Z min=0.0205
2026-02-21T08:28:46.0876238Z mid=0.0326
2026-02-21T08:28:46.0881136Z max=0.1065
2026-02-21T08:28:46.0881368Z best={'block_sizes': [1, 8192],
2026-02-21T08:28:46.0886933Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:28:46.0890627Z  'load_eviction_policies': ['', ''],
2026-02-21T08:28:46.0895147Z  'num_stages': 5,
2026-02-21T08:28:46.0895424Z  'num_warps': 4,
2026-02-21T08:28:46.0895636Z  'pid_type': 'flat',
2026-02-21T08:28:46.0895858Z  'range_flattens': [None, False],
2026-02-21T08:28:46.0896077Z  'range_multi_buffers': [None, False],
2026-02-21T08:28:46.0896274Z  'range_num_stages': [0, 1],
2026-02-21T08:28:46.0896466Z  'range_unroll_factors': [0, 2],
2026-02-21T08:28:46.0896655Z  'range_warp_specializes': [None, None]}
2026-02-21T08:28:46.0896899Z [140s] Fitting surrogate: 430 points, 430 targets
2026-02-21T08:28:46.6408450Z [141s] Generation 6 starting: 31 neighbors, 3 active search path(s)
2026-02-21T08:28:50.6095873Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 4.7 configs/s
2026-02-21T08:28:52.5758976Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 32/32 16.6 configs/s
2026-02-21T08:28:54.4097380Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 553.0 configs/s
2026-02-21T08:28:54.5706697Z [149s] Generation 6 complete: 
2026-02-21T08:28:54.5710908Z ok=35
2026-02-21T08:28:54.5712809Z min=0.0204
2026-02-21T08:28:54.5712977Z mid=0.0246
2026-02-21T08:28:54.5713100Z max=0.0635
2026-02-21T08:28:54.5713248Z best={'block_sizes': [1, 8192],
2026-02-21T08:28:54.5713592Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:28:54.5713891Z  'load_eviction_policies': ['', ''],
2026-02-21T08:28:54.5718601Z  'num_stages': 5,
2026-02-21T08:28:54.5722562Z  'num_warps': 4,
2026-02-21T08:28:54.5727266Z  'pid_type': 'flat',
2026-02-21T08:28:54.5727542Z  'range_flattens': [None, True],
2026-02-21T08:28:54.5727771Z  'range_multi_buffers': [None, False],
2026-02-21T08:28:54.5728016Z  'range_num_stages': [0, 1],
2026-02-21T08:28:54.5728602Z  'range_unroll_factors': [0, 3],
2026-02-21T08:28:54.5733022Z  'range_warp_specializes': [None, None]}
2026-02-21T08:28:54.5735149Z [149s] Fitting surrogate: 465 points, 465 targets
2026-02-21T08:28:55.0108739Z [149s] Generation 7 starting: 23 neighbors, 2 active search path(s)
2026-02-21T08:28:57.4394144Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 18.5 configs/s
2026-02-21T08:28:58.8552347Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 16.8 configs/s
2026-02-21T08:29:00.2635712Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 719.8 configs/s
2026-02-21T08:29:00.3931246Z [155s] Generation 7 complete: 
2026-02-21T08:29:00.3933429Z ok=25
2026-02-21T08:29:00.3938735Z min=0.0204
2026-02-21T08:29:00.3943377Z mid=0.0206
2026-02-21T08:29:00.3948502Z max=0.0390
2026-02-21T08:29:00.3953463Z best={'block_sizes': [1, 8192],
2026-02-21T08:29:00.3956105Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:29:00.3956496Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:29:00.3960296Z  'num_stages': 6,
2026-02-21T08:29:00.3964824Z  'num_warps': 1,
2026-02-21T08:29:00.3966961Z  'pid_type': 'flat',
2026-02-21T08:29:00.3967232Z  'range_flattens': [None, True],
2026-02-21T08:29:00.3967443Z  'range_multi_buffers': [None, False],
2026-02-21T08:29:00.3972149Z  'range_num_stages': [0, 1],
2026-02-21T08:29:00.3974071Z  'range_unroll_factors': [0, 0],
2026-02-21T08:29:00.3974307Z  'range_warp_specializes': [None, True]}
2026-02-21T08:29:00.3974605Z [155s] Fitting surrogate: 490 points, 490 targets
2026-02-21T08:29:00.8072647Z [155s] Generation 8 starting: 19 neighbors, 2 active search path(s)
2026-02-21T08:29:02.9635080Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 5.7 configs/s
2026-02-21T08:29:04.1130959Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 19/19 17.2 configs/s
2026-02-21T08:29:06.0211869Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 756.4 configs/s
2026-02-21T08:29:06.1461143Z [160s] Generation 8 complete: 
2026-02-21T08:29:06.1465988Z ok=21
2026-02-21T08:29:06.1471094Z min=0.0204
2026-02-21T08:29:06.1476428Z mid=0.0205
2026-02-21T08:29:06.1478991Z max=0.0287
2026-02-21T08:29:06.1479176Z best={'block_sizes': [1, 8192],
2026-02-21T08:29:06.1479440Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:29:06.1479692Z  'load_eviction_policies': ['', ''],
2026-02-21T08:29:06.1479875Z  'num_stages': 5,
2026-02-21T08:29:06.1480014Z  'num_warps': 4,
2026-02-21T08:29:06.1480162Z  'pid_type': 'flat',
2026-02-21T08:29:06.1480317Z  'range_flattens': [None, True],
2026-02-21T08:29:06.1480506Z  'range_multi_buffers': [None, False],
2026-02-21T08:29:06.1480690Z  'range_num_stages': [0, 1],
2026-02-21T08:29:06.1480903Z  'range_unroll_factors': [0, 0],
2026-02-21T08:29:06.1481276Z  'range_warp_specializes': [None, True]}
2026-02-21T08:29:06.1499204Z [160s] Fitting surrogate: 511 points, 511 targets
2026-02-21T08:29:06.6647504Z [161s] Generation 9 starting: 23 neighbors, 2 active search path(s)
2026-02-21T08:29:09.1997694Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 11.4 configs/s
2026-02-21T08:29:10.5808958Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 17.2 configs/s
2026-02-21T08:29:12.0972149Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 668.0 configs/s
2026-02-21T08:29:12.2221363Z [166s] Generation 9 complete: 
2026-02-21T08:29:12.2225867Z ok=25
2026-02-21T08:29:12.2230313Z min=0.0204
2026-02-21T08:29:12.2235318Z mid=0.0205
2026-02-21T08:29:12.2239948Z max=0.0307
2026-02-21T08:29:12.2244426Z best={'block_sizes': [1, 8192],
2026-02-21T08:29:12.2249113Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:29:12.2249854Z  'load_eviction_policies': ['', ''],
2026-02-21T08:29:12.2250056Z  'num_stages': 5,
2026-02-21T08:29:12.2250214Z  'num_warps': 2,
2026-02-21T08:29:12.2250365Z  'pid_type': 'flat',
2026-02-21T08:29:12.2250524Z  'range_flattens': [None, True],
2026-02-21T08:29:12.2250720Z  'range_multi_buffers': [None, None],
2026-02-21T08:29:12.2250907Z  'range_num_stages': [0, 1],
2026-02-21T08:29:12.2251082Z  'range_unroll_factors': [0, 0],
2026-02-21T08:29:12.2251266Z  'range_warp_specializes': [None, True]}
2026-02-21T08:29:12.2251480Z [166s] Fitting surrogate: 536 points, 536 targets
2026-02-21T08:29:12.7189685Z [167s] Generation 10 starting: 19 neighbors, 2 active search path(s)
2026-02-21T08:29:14.2767497Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 20.0 configs/s
2026-02-21T08:29:15.4198744Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 17.3 configs/s
2026-02-21T08:29:16.6312858Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 833.8 configs/s
2026-02-21T08:29:16.7350644Z [171s] Generation 10 complete: 
2026-02-21T08:29:16.7355106Z ok=21
2026-02-21T08:29:16.7359580Z min=0.0205
2026-02-21T08:29:16.7364034Z mid=0.0205
2026-02-21T08:29:16.7365410Z max=0.0368
2026-02-21T08:29:16.7365596Z best={'block_sizes': [1, 8192],
2026-02-21T08:29:16.7365869Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:29:16.7366153Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:29:16.7366349Z  'num_stages': 8,
2026-02-21T08:29:16.7366502Z  'num_warps': 1,
2026-02-21T08:29:16.7366647Z  'pid_type': 'flat',
2026-02-21T08:29:16.7366816Z  'range_flattens': [None, None],
2026-02-21T08:29:16.7367001Z  'range_multi_buffers': [None, False],
2026-02-21T08:29:16.7367197Z  'range_num_stages': [0, 1],
2026-02-21T08:29:16.7367364Z  'range_unroll_factors': [0, 0],
2026-02-21T08:29:16.7367584Z  'range_warp_specializes': [None, True]}
2026-02-21T08:29:16.7369623Z [171s] Fitting surrogate: 557 points, 557 targets
2026-02-21T08:29:17.2456765Z [172s] Generation 11 starting: 18 neighbors, 2 active search path(s)
2026-02-21T08:29:18.5697645Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 12.4 configs/s
2026-02-21T08:29:19.6413021Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 17.5 configs/s
2026-02-21T08:29:20.7667064Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 897.9 configs/s
2026-02-21T08:29:20.8649958Z [175s] Generation 11 complete: 
2026-02-21T08:29:20.8654475Z ok=20
2026-02-21T08:29:20.8658808Z min=0.0204
2026-02-21T08:29:20.8660394Z mid=0.0205
2026-02-21T08:29:20.8660580Z max=0.0369
2026-02-21T08:29:20.8660773Z best={'block_sizes': [1, 8192],
2026-02-21T08:29:20.8661075Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:29:20.8661349Z  'load_eviction_policies': ['', ''],
2026-02-21T08:29:20.8661620Z  'num_stages': 5,
2026-02-21T08:29:20.8661775Z  'num_warps': 2,
2026-02-21T08:29:20.8661974Z  'pid_type': 'flat',
2026-02-21T08:29:20.8662159Z  'range_flattens': [None, True],
2026-02-21T08:29:20.8662359Z  'range_multi_buffers': [None, False],
2026-02-21T08:29:20.8662563Z  'range_num_stages': [0, 2],
2026-02-21T08:29:20.8662756Z  'range_unroll_factors': [0, 0],
2026-02-21T08:29:20.8662983Z  'range_warp_specializes': [None, True]}
2026-02-21T08:29:20.8670750Z [175s] Fitting surrogate: 577 points, 577 targets
2026-02-21T08:29:21.4056584Z [176s] Generation 12 starting: 21 neighbors, 2 active search path(s)
2026-02-21T08:29:23.0070725Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 11.2 configs/s
2026-02-21T08:29:24.2285520Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 21/21 17.9 configs/s
2026-02-21T08:29:25.6879189Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 694.3 configs/s
2026-02-21T08:29:25.8118174Z [180s] Generation 12 complete: 
2026-02-21T08:29:25.8121990Z ok=23
2026-02-21T08:29:25.8123875Z min=0.0205
2026-02-21T08:29:25.8124062Z mid=0.0205
2026-02-21T08:29:25.8124195Z max=0.0286
2026-02-21T08:29:25.8124354Z best={'block_sizes': [1, 8192],
2026-02-21T08:29:25.8124613Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:29:25.8124872Z  'load_eviction_policies': ['', ''],
2026-02-21T08:29:25.8125061Z  'num_stages': 5,
2026-02-21T08:29:25.8125204Z  'num_warps': 1,
2026-02-21T08:29:25.8125356Z  'pid_type': 'flat',
2026-02-21T08:29:25.8125514Z  'range_flattens': [None, True],
2026-02-21T08:29:25.8125706Z  'range_multi_buffers': [None, False],
2026-02-21T08:29:25.8125892Z  'range_num_stages': [0, 2],
2026-02-21T08:29:25.8126070Z  'range_unroll_factors': [0, 0],
2026-02-21T08:29:25.8126581Z  'range_warp_specializes': [None, True]}
2026-02-21T08:29:25.8139407Z [180s] Fitting surrogate: 600 points, 600 targets
2026-02-21T08:29:26.1826752Z [180s] Generation 13 starting: 7 neighbors, 1 active search path(s)
2026-02-21T08:29:26.9621527Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7/7 7.4 configs/s
2026-02-21T08:29:27.3775073Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 7/7 19.2 configs/s
2026-02-21T08:29:27.8807563Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1974.1 configs/s
2026-02-21T08:29:27.9300451Z [182s] Generation 13 complete: 
2026-02-21T08:29:27.9305896Z ok=8
2026-02-21T08:29:27.9310539Z min=0.0204
2026-02-21T08:29:27.9312849Z mid=0.0205
2026-02-21T08:29:27.9313050Z max=0.0287
2026-02-21T08:29:27.9317414Z best={'block_sizes': [1, 8192],
2026-02-21T08:29:27.9321766Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:29:27.9323848Z  'load_eviction_policies': ['', ''],
2026-02-21T08:29:27.9324069Z  'num_stages': 5,
2026-02-21T08:29:27.9324229Z  'num_warps': 1,
2026-02-21T08:29:27.9324375Z  'pid_type': 'flat',
2026-02-21T08:29:27.9324546Z  'range_flattens': [None, True],
2026-02-21T08:29:27.9324733Z  'range_multi_buffers': [None, False],
2026-02-21T08:29:27.9324931Z  'range_num_stages': [0, 2],
2026-02-21T08:29:27.9325109Z  'range_unroll_factors': [0, 0],
2026-02-21T08:29:27.9325290Z  'range_warp_specializes': [None, True]}
2026-02-21T08:29:27.9325593Z [182s] Fitting surrogate: 608 points, 608 targets
2026-02-21T08:29:28.2980493Z [183s] Generation 14 starting: 9 neighbors, 1 active search path(s)
2026-02-21T08:29:29.0548796Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9/9 25.4 configs/s
2026-02-21T08:29:29.5948146Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 9/9 18.3 configs/s
2026-02-21T08:29:30.2258734Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1584.9 configs/s
2026-02-21T08:29:30.2903651Z [185s] Generation 14 complete: 
2026-02-21T08:29:30.2909429Z ok=10
2026-02-21T08:29:30.2911085Z min=0.0205
2026-02-21T08:29:30.2911285Z mid=0.0205
2026-02-21T08:29:30.2916370Z max=0.0267
2026-02-21T08:29:30.2921613Z best={'block_sizes': [1, 8192],
2026-02-21T08:29:30.2925284Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:29:30.2929360Z  'load_eviction_policies': ['', ''],
2026-02-21T08:29:30.2933499Z  'num_stages': 5,
2026-02-21T08:29:30.2935099Z  'num_warps': 1,
2026-02-21T08:29:30.2935294Z  'pid_type': 'flat',
2026-02-21T08:29:30.2935461Z  'range_flattens': [None, True],
2026-02-21T08:29:30.2935654Z  'range_multi_buffers': [None, False],
2026-02-21T08:29:30.2935839Z  'range_num_stages': [0, 2],
2026-02-21T08:29:30.2936013Z  'range_unroll_factors': [0, 0],
2026-02-21T08:29:30.2936222Z  'range_warp_specializes': [None, True]}
2026-02-21T08:29:30.2936532Z [185s] Fitting surrogate: 618 points, 618 targets
2026-02-21T08:29:30.6456977Z [185s] Generation 15 starting: 5 neighbors, 1 active search path(s)
2026-02-21T08:29:31.3426657Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5/5 6.7 configs/s
2026-02-21T08:29:31.6408000Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 5/5 20.4 configs/s
2026-02-21T08:29:32.0064339Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2696.2 configs/s
2026-02-21T08:29:32.0472098Z [186s] Generation 15 complete: 
2026-02-21T08:29:32.0476591Z ok=6
2026-02-21T08:29:32.0480432Z min=0.0205
2026-02-21T08:29:32.0481945Z mid=0.0205
2026-02-21T08:29:32.0482113Z max=0.0205
2026-02-21T08:29:32.0482257Z best={'block_sizes': [1, 8192],
2026-02-21T08:29:32.0482537Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:29:32.0483159Z  'load_eviction_policies': ['', ''],
2026-02-21T08:29:32.0483383Z  'num_stages': 4,
2026-02-21T08:29:32.0483529Z  'num_warps': 1,
2026-02-21T08:29:32.0483683Z  'pid_type': 'flat',
2026-02-21T08:29:32.0483839Z  'range_flattens': [None, True],
2026-02-21T08:29:32.0484027Z  'range_multi_buffers': [None, True],
2026-02-21T08:29:32.0484221Z  'range_num_stages': [0, 2],
2026-02-21T08:29:32.0484386Z  'range_unroll_factors': [0, 0],
2026-02-21T08:29:32.0484581Z  'range_warp_specializes': [None, True]}
2026-02-21T08:29:32.0495435Z [186s] Fitting surrogate: 624 points, 624 targets
2026-02-21T08:29:32.4092329Z [187s] Generation 16 starting: 9 neighbors, 1 active search path(s)
2026-02-21T08:29:33.6464556Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9/9 17.3 configs/s
2026-02-21T08:29:34.1854314Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 9/9 18.3 configs/s
2026-02-21T08:29:34.8038621Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1616.8 configs/s
2026-02-21T08:29:34.8618405Z [189s] Generation 16 complete: 
2026-02-21T08:29:34.8620037Z ok=10
2026-02-21T08:29:34.8620218Z min=0.0205
2026-02-21T08:29:34.8620356Z mid=0.0205
2026-02-21T08:29:34.8620547Z max=0.0205
2026-02-21T08:29:34.8620695Z best={'block_sizes': [1, 8192],
2026-02-21T08:29:34.8625533Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:29:34.8629844Z  'load_eviction_policies': ['', ''],
2026-02-21T08:29:34.8631181Z  'num_stages': 4,
2026-02-21T08:29:34.8631359Z  'num_warps': 1,
2026-02-21T08:29:34.8631518Z  'pid_type': 'flat',
2026-02-21T08:29:34.8631861Z  'range_flattens': [None, True],
2026-02-21T08:29:34.8632050Z  'range_multi_buffers': [None, True],
2026-02-21T08:29:34.8632236Z  'range_num_stages': [0, 2],
2026-02-21T08:29:34.8632411Z  'range_unroll_factors': [0, 0],
2026-02-21T08:29:34.8632588Z  'range_warp_specializes': [None, True]}
2026-02-21T08:29:34.8644332Z [189s] Fitting surrogate: 634 points, 634 targets
2026-02-21T08:29:35.1366618Z [189s] Autotuning complete in 189.9s after searching 599 configs.
2026-02-21T08:29:35.1368197Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:29:35.1369147Z     @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', ''], num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:29:35.1369960Z 
2026-02-21T08:29:35.1370223Z [189s] Code of selected kernel: /tmp/torchinductor_root/eh/cehz4xnjf6rgufwtihsrd5tyojkz32iuatxzbunrtbcbt4wvl4e2.py
2026-02-21T08:29:36.1874272Z WARNING:tritonbench.utils.triton_op:Completed input ID 36:
2026-02-21T08:29:36.1879197Z (M, N)
2026-02-21T08:29:36.1880525Z ------------
2026-02-21T08:29:36.1881028Z (4096, 4864)
2026-02-21T08:29:36.1881146Z 
2026-02-21T08:29:36.1888749Z  40%|████      | 8/20 [20:33<35:13, 176.15s/it]
2026-02-21T08:29:36.1888749Z WARNING:tritonbench.utils.triton_op:Running input ID 41:
2026-02-21T08:29:36.1892841Z (M, N)
2026-02-21T08:29:36.1894758Z ------------
2026-02-21T08:29:36.1894934Z (4096, 5504)
2026-02-21T08:29:36.1895279Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T08:29:37.4282155Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T08:29:38.9433523Z INFO:tritonbench.utils.triton_op:Took 2.34ms to get benchmark function for torch_compile_softmax
2026-02-21T08:29:42.3567385Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:29:42.3570942Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:29:42.3574172Z               'dtype': 'torch.float16',
2026-02-21T08:29:42.3577979Z               'shape': (4096, 5504),
2026-02-21T08:29:42.3581331Z               'stride': (5504, 1)},),
2026-02-21T08:29:42.3581859Z   'kwargs': {}}
2026-02-21T08:29:42.3590456Z INFO:tritonbench.utils.triton_op:Took 2.51ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T08:29:42.5354538Z [0s] Autotune random seed: 2134816249
2026-02-21T08:29:42.6730945Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:30:16.6290577Z [33s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None])
2026-02-21T08:30:16.9085997Z [34s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False])
2026-02-21T08:30:17.0848109Z [34s] Timeout after 30s compiling Config(block_sizes=[256, 4096], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=128, num_sm_multiplier=128, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 2], range_unroll_factors=[3, 0], range_warp_specializes=[None, True])
2026-02-21T08:30:17.0868463Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s
2026-02-21T08:30:20.0670555Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:30:20.0672843Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:30:20.0673819Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:30:20.0674116Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:30:20.0674325Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:30:20.0678110Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:30:20.0682225Z     %cst = arith.constant dense<5504> : tensor<32x1xi32>
2026-02-21T08:30:20.0687101Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<32xf32>
2026-02-21T08:30:20.0691880Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<32xf32>
2026-02-21T08:30:20.0692270Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:30:20.0695267Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:30:20.0695583Z     %c5504_i32 = arith.constant 5504 : i32
2026-02-21T08:30:20.0695826Z     %c5504_i64 = arith.constant 5504 : i64
2026-02-21T08:30:20.0703528Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:30:20.0705743Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c5504_i32], [%c5504_i64, %c1_i64] : <f16>, <tensor<32x8xf16>>
2026-02-21T08:30:20.0706345Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c5504_i32], [%c5504_i64, %c1_i64] : <f16>, <tensor<32x8xf16>>
2026-02-21T08:30:20.0706790Z     %2 = tt.get_program_id x : i32
2026-02-21T08:30:20.0710642Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:30:20.0715324Z     %4 = arith.minsi %3, %c128_i32 : i32
2026-02-21T08:30:20.0717373Z     scf.for %arg2 = %2 to %4 step %c1_i32  : i32 {
2026-02-21T08:30:20.0717652Z       %5 = arith.muli %arg2, %c32_i32 : i32
2026-02-21T08:30:20.0717927Z       %6 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32>
2026-02-21T08:30:20.0718227Z       %7 = tt.splat %5 : i32 -> tensor<32xi32>
2026-02-21T08:30:20.0718444Z       %8 = arith.addi %7, %6 : tensor<32xi32>
2026-02-21T08:30:20.0718673Z       %c5496_i32 = arith.constant 5496 : i32
2026-02-21T08:30:20.0718886Z       %c24_i32 = arith.constant 24 : i32
2026-02-21T08:30:20.0719336Z       %9:2 = scf.for %arg3 = %c0_i32 to %c5496_i32 step %c24_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<32xf32>, tensor<32xf32>)  : i32 {
2026-02-21T08:30:20.0719824Z         %49 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:30:20.0720115Z         %50 = tt.splat %arg3 : i32 -> tensor<8xi32>
2026-02-21T08:30:20.0720357Z         %51 = arith.addi %50, %49 : tensor<8xi32>
2026-02-21T08:30:20.0720756Z         %52 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:30:20.0721086Z         %53 = arith.muli %52, %cst : tensor<32x1xi32>
2026-02-21T08:30:20.0724235Z         %54 = tt.expand_dims %51 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:30:20.0728298Z         %55 = tt.broadcast %53 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:30:20.0732192Z         %56 = tt.broadcast %54 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:30:20.0736367Z         %57 = arith.addi %55, %56 : tensor<32x8xi32>
2026-02-21T08:30:20.0739994Z         %58 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:30:20.0743789Z         %59 = tt.addptr %58, %57 : tensor<32x8x!tt.ptr<f16>>, tensor<32x8xi32>
2026-02-21T08:30:20.0747502Z         %60 = tt.load %59 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:30:20.0747887Z         %61 = arith.extf %60 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:30:20.0748154Z         %62 = "tt.reduce"(%61) <{axis = 1 : i32}> ({
2026-02-21T08:30:20.0748370Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:20.0748585Z           %140 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:30:20.0748792Z           tt.reduce.return %140 : f32
2026-02-21T08:30:20.0749005Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:30:20.0749244Z         %63 = arith.truncf %62 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:30:20.0749514Z         %64 = arith.extf %63 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:30:20.0753228Z         %65 = arith.cmpf ogt, %arg4, %64 : tensor<32xf32>
2026-02-21T08:30:20.0756227Z         %66 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32>
2026-02-21T08:30:20.0756830Z         %67 = arith.ori %65, %66 : tensor<32xi1>
2026-02-21T08:30:20.0757105Z         %68 = arith.select %67, %arg4, %64 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:30:20.0757374Z         %69 = arith.subf %arg4, %68 : tensor<32xf32>
2026-02-21T08:30:20.0757821Z         %70 = tt.extern_elementwise %69 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:30:20.0758213Z         %71 = arith.mulf %arg5, %70 : tensor<32xf32>
2026-02-21T08:30:20.0758497Z         %72 = tt.expand_dims %68 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:20.0758816Z         %73 = tt.broadcast %72 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:30:20.0759067Z         %74 = arith.subf %61, %73 : tensor<32x8xf32>
2026-02-21T08:30:20.0759548Z         %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:30:20.0759945Z         %76 = "tt.reduce"(%75) <{axis = 1 : i32}> ({
2026-02-21T08:30:20.0760154Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:20.0760363Z           %140 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:30:20.0760578Z           tt.reduce.return %140 : f32
2026-02-21T08:30:20.0760780Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:30:20.0761001Z         %77 = arith.addf %71, %76 : tensor<32xf32>
2026-02-21T08:30:20.0761211Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T08:30:20.0761425Z         %78 = arith.muli %c8_i32, %c1_i32_4 : i32
2026-02-21T08:30:20.0761703Z         %79 = arith.addi %arg3, %78 : i32
2026-02-21T08:30:20.0761957Z         %80 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:30:20.0762230Z         %81 = tt.splat %79 : i32 -> tensor<8xi32>
2026-02-21T08:30:20.0762439Z         %82 = arith.addi %81, %80 : tensor<8xi32>
2026-02-21T08:30:20.0762720Z         %83 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:30:20.0763009Z         %84 = arith.muli %83, %cst : tensor<32x1xi32>
2026-02-21T08:30:20.0763287Z         %85 = tt.expand_dims %82 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:30:20.0763595Z         %86 = tt.broadcast %84 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:30:20.0763887Z         %87 = tt.broadcast %85 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:30:20.0764147Z         %88 = arith.addi %86, %87 : tensor<32x8xi32>
2026-02-21T08:30:20.0764404Z         %89 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:30:20.0764706Z         %90 = tt.addptr %89, %88 : tensor<32x8x!tt.ptr<f16>>, tensor<32x8xi32>
2026-02-21T08:30:20.0765021Z         %91 = tt.load %90 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:30:20.0765330Z         %92 = arith.extf %91 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:30:20.0765569Z         %93 = "tt.reduce"(%92) <{axis = 1 : i32}> ({
2026-02-21T08:30:20.0765783Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:20.0765987Z           %140 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:30:20.0766190Z           tt.reduce.return %140 : f32
2026-02-21T08:30:20.0766392Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:30:20.0766625Z         %94 = arith.truncf %93 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:30:20.0766889Z         %95 = arith.extf %94 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:30:20.0767139Z         %96 = arith.cmpf ogt, %68, %95 : tensor<32xf32>
2026-02-21T08:30:20.0767372Z         %97 = arith.cmpf une, %68, %68 : tensor<32xf32>
2026-02-21T08:30:20.0767592Z         %98 = arith.ori %96, %97 : tensor<32xi1>
2026-02-21T08:30:20.0767833Z         %99 = arith.select %98, %68, %95 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:30:20.0768088Z         %100 = arith.subf %68, %99 : tensor<32xf32>
2026-02-21T08:30:20.0768480Z         %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:30:20.0768952Z         %102 = arith.mulf %77, %101 : tensor<32xf32>
2026-02-21T08:30:20.0769224Z         %103 = tt.expand_dims %99 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:20.0769551Z         %104 = tt.broadcast %103 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:30:20.0769814Z         %105 = arith.subf %92, %104 : tensor<32x8xf32>
2026-02-21T08:30:20.0770201Z         %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:30:20.0770604Z         %107 = "tt.reduce"(%106) <{axis = 1 : i32}> ({
2026-02-21T08:30:20.0770813Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:20.0771017Z           %140 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:30:20.0771218Z           tt.reduce.return %140 : f32
2026-02-21T08:30:20.0771418Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:30:20.0771751Z         %108 = arith.addf %102, %107 : tensor<32xf32>
2026-02-21T08:30:20.0771966Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:30:20.0772194Z         %109 = arith.muli %c8_i32, %c2_i32 : i32
2026-02-21T08:30:20.0772415Z         %110 = arith.addi %arg3, %109 : i32
2026-02-21T08:30:20.0772684Z         %111 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:30:20.0772969Z         %112 = tt.splat %110 : i32 -> tensor<8xi32>
2026-02-21T08:30:20.0773211Z         %113 = arith.addi %112, %111 : tensor<8xi32>
2026-02-21T08:30:20.0773504Z         %114 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:30:20.0773797Z         %115 = arith.muli %114, %cst : tensor<32x1xi32>
2026-02-21T08:30:20.0774112Z         %116 = tt.expand_dims %113 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:30:20.0774449Z         %117 = tt.broadcast %115 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:30:20.0774760Z         %118 = tt.broadcast %116 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:30:20.0775035Z         %119 = arith.addi %117, %118 : tensor<32x8xi32>
2026-02-21T08:30:20.0775320Z         %120 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:30:20.0775654Z         %121 = tt.addptr %120, %119 : tensor<32x8x!tt.ptr<f16>>, tensor<32x8xi32>
2026-02-21T08:30:20.0776013Z         %122 = tt.load %121 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:30:20.0776364Z         %123 = arith.extf %122 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:30:20.0776634Z         %124 = "tt.reduce"(%123) <{axis = 1 : i32}> ({
2026-02-21T08:30:20.0776861Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:20.0777069Z           %140 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:30:20.0777298Z           tt.reduce.return %140 : f32
2026-02-21T08:30:20.0777513Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:30:20.0777770Z         %125 = arith.truncf %124 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:30:20.0778069Z         %126 = arith.extf %125 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:30:20.0778341Z         %127 = arith.cmpf ogt, %99, %126 : tensor<32xf32>
2026-02-21T08:30:20.0778631Z         %128 = arith.cmpf une, %99, %99 : tensor<32xf32>
2026-02-21T08:30:20.0778876Z         %129 = arith.ori %127, %128 : tensor<32xi1>
2026-02-21T08:30:20.0779148Z         %130 = arith.select %129, %99, %126 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:30:20.0779433Z         %131 = arith.subf %99, %130 : tensor<32xf32>
2026-02-21T08:30:20.0779842Z         %132 = tt.extern_elementwise %131 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:30:20.0780240Z         %133 = arith.mulf %108, %132 : tensor<32xf32>
2026-02-21T08:30:20.0780515Z         %134 = tt.expand_dims %130 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:20.0780836Z         %135 = tt.broadcast %134 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:30:20.0781164Z         %136 = arith.subf %123, %135 : tensor<32x8xf32>
2026-02-21T08:30:20.0781606Z         %137 = tt.extern_elementwise %136 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:30:20.0782038Z         %138 = "tt.reduce"(%137) <{axis = 1 : i32}> ({
2026-02-21T08:30:20.0782256Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:20.0782475Z           %140 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:30:20.0782694Z           tt.reduce.return %140 : f32
2026-02-21T08:30:20.0782912Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:30:20.0783149Z         %139 = arith.addf %133, %138 : tensor<32xf32>
2026-02-21T08:30:20.0783383Z         scf.yield %130, %139 : tensor<32xf32>, tensor<32xf32>
2026-02-21T08:30:20.0783657Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:30:20.0783942Z       %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:30:20.0784311Z       %11 = tt.splat %c5496_i32 : i32 -> tensor<8xi32>
2026-02-21T08:30:20.0784530Z       %12 = arith.addi %11, %10 : tensor<8xi32>
2026-02-21T08:30:20.0784797Z       %13 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:30:20.0785080Z       %14 = arith.muli %13, %cst : tensor<32x1xi32>
2026-02-21T08:30:20.0785349Z       %15 = tt.expand_dims %12 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:30:20.0785655Z       %16 = tt.broadcast %14 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:30:20.0785923Z       %17 = tt.broadcast %15 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:30:20.0786174Z       %18 = arith.addi %16, %17 : tensor<32x8xi32>
2026-02-21T08:30:20.0786425Z       %19 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:30:20.0786721Z       %20 = tt.addptr %19, %18 : tensor<32x8x!tt.ptr<f16>>, tensor<32x8xi32>
2026-02-21T08:30:20.0787043Z       %21 = tt.load %20 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:30:20.0787347Z       %22 = arith.extf %21 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:30:20.0787590Z       %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({
2026-02-21T08:30:20.0787795Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:30:20.0787992Z         %49 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:30:20.0788193Z         tt.reduce.return %49 : f32
2026-02-21T08:30:20.0788398Z       }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:30:20.0788639Z       %24 = arith.truncf %23 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:30:20.0788894Z       %25 = arith.extf %24 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:30:20.0789141Z       %26 = arith.cmpf ogt, %9#0, %25 : tensor<32xf32>
2026-02-21T08:30:20.0789370Z       %27 = arith.cmpf une, %9#0, %9#0 : tensor<32xf32>
2026-02-21T08:30:20.0789590Z       %28 = arith.ori %26, %27 : tensor<32xi1>
2026-02-21T08:30:20.0789829Z       %29 = arith.select %28, %9#0, %25 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:30:20.0790086Z       %30 = arith.subf %9#0, %29 : tensor<32xf32>
2026-02-21T08:30:20.0790477Z       %31 = tt.extern_elementwise %30 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:30:20.0790855Z       %32 = arith.mulf %9#1, %31 : tensor<32xf32>
2026-02-21T08:30:20.0791130Z       %33 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:20.0791443Z       %34 = tt.broadcast %33 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:30:20.0791751Z       %35 = arith.subf %22, %34 : tensor<32x8xf32>
2026-02-21T08:30:20.0792164Z       %36 = tt.extern_elementwise %35 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:30:20.0792584Z       %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({
2026-02-21T08:30:20.0792810Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:30:20.0793021Z         %49 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:30:20.0793230Z         tt.reduce.return %49 : f32
2026-02-21T08:30:20.0793485Z       }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:30:20.0793700Z       %38 = arith.addf %32, %37 : tensor<32xf32>
2026-02-21T08:30:20.0793904Z       %c5496_i32_2 = arith.constant 5496 : i32
2026-02-21T08:30:20.0794113Z       %c24_i32_3 = arith.constant 24 : i32
2026-02-21T08:30:20.0794361Z       scf.for %arg3 = %c0_i32 to %c5496_i32_2 step %c24_i32_3  : i32 {
2026-02-21T08:30:20.0794703Z         %49 = tt.descriptor_load %0[%5, %arg3] : !tt.tensordesc<tensor<32x8xf16>> -> tensor<32x8xf16>
2026-02-21T08:30:20.0795070Z         %50 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:20.0795372Z         %51 = arith.extf %49 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:30:20.0795651Z         %52 = tt.broadcast %50 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:30:20.0795897Z         %53 = arith.subf %51, %52 : tensor<32x8xf32>
2026-02-21T08:30:20.0796352Z         %54 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:30:20.0796799Z         %55 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:20.0797099Z         %56 = tt.broadcast %55 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:30:20.0797351Z         %57 = arith.divf %54, %56 : tensor<32x8xf32>
2026-02-21T08:30:20.0797590Z         %58 = arith.truncf %57 : tensor<32x8xf32> to tensor<32x8xf16>
2026-02-21T08:30:20.0797926Z         tt.descriptor_store %1[%5, %arg3], %58 : !tt.tensordesc<tensor<32x8xf16>>, tensor<32x8xf16>
2026-02-21T08:30:20.0798236Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T08:30:20.0798441Z         %59 = arith.muli %c8_i32, %c1_i32_4 : i32
2026-02-21T08:30:20.0798648Z         %60 = arith.addi %arg3, %59 : i32
2026-02-21T08:30:20.0798932Z         %61 = tt.descriptor_load %0[%5, %60] : !tt.tensordesc<tensor<32x8xf16>> -> tensor<32x8xf16>
2026-02-21T08:30:20.0799287Z         %62 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:20.0799588Z         %63 = arith.extf %61 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:30:20.0799855Z         %64 = tt.broadcast %62 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:30:20.0800103Z         %65 = arith.subf %63, %64 : tensor<32x8xf32>
2026-02-21T08:30:20.0800482Z         %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:30:20.0800920Z         %67 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:20.0801220Z         %68 = tt.broadcast %67 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:30:20.0801483Z         %69 = arith.divf %66, %68 : tensor<32x8xf32>
2026-02-21T08:30:20.0801793Z         %70 = arith.truncf %69 : tensor<32x8xf32> to tensor<32x8xf16>
2026-02-21T08:30:20.0802141Z         tt.descriptor_store %1[%5, %60], %70 : !tt.tensordesc<tensor<32x8xf16>>, tensor<32x8xf16>
2026-02-21T08:30:20.0802473Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:30:20.0802671Z         %71 = arith.muli %c8_i32, %c2_i32 : i32
2026-02-21T08:30:20.0802877Z         %72 = arith.addi %arg3, %71 : i32
2026-02-21T08:30:20.0803155Z         %73 = tt.descriptor_load %0[%5, %72] : !tt.tensordesc<tensor<32x8xf16>> -> tensor<32x8xf16>
2026-02-21T08:30:20.0803512Z         %74 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:20.0803820Z         %75 = arith.extf %73 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:30:20.0804106Z         %76 = tt.broadcast %74 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:30:20.0804375Z         %77 = arith.subf %75, %76 : tensor<32x8xf32>
2026-02-21T08:30:20.0804785Z         %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:30:20.0805282Z         %79 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:20.0805673Z         %80 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:30:20.0805931Z         %81 = arith.divf %78, %80 : tensor<32x8xf32>
2026-02-21T08:30:20.0806191Z         %82 = arith.truncf %81 : tensor<32x8xf32> to tensor<32x8xf16>
2026-02-21T08:30:20.0806535Z         tt.descriptor_store %1[%5, %72], %82 : !tt.tensordesc<tensor<32x8xf16>>, tensor<32x8xf16>
2026-02-21T08:30:20.0806907Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:30:20.0807276Z       %39 = tt.descriptor_load %0[%5, %c5496_i32_2] : !tt.tensordesc<tensor<32x8xf16>> -> tensor<32x8xf16>
2026-02-21T08:30:20.0807706Z       %40 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:20.0808034Z       %41 = arith.extf %39 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:30:20.0808321Z       %42 = tt.broadcast %40 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:30:20.0808647Z       %43 = arith.subf %41, %42 : tensor<32x8xf32>
2026-02-21T08:30:20.0809056Z       %44 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:30:20.0809542Z       %45 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:20.0809886Z       %46 = tt.broadcast %45 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:30:20.0810144Z       %47 = arith.divf %44, %46 : tensor<32x8xf32>
2026-02-21T08:30:20.0810409Z       %48 = arith.truncf %47 : tensor<32x8xf32> to tensor<32x8xf16>
2026-02-21T08:30:20.0810772Z       tt.descriptor_store %1[%5, %c5496_i32_2], %48 : !tt.tensordesc<tensor<32x8xf16>>, tensor<32x8xf16>
2026-02-21T08:30:20.0811143Z     } {tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:30:20.0811365Z     tt.return
2026-02-21T08:30:20.0811526Z   }
2026-02-21T08:30:20.0811695Z }
2026-02-21T08:30:20.0811779Z 
2026-02-21T08:30:20.0811833Z {-#
2026-02-21T08:30:20.0811980Z   external_resources: {
2026-02-21T08:30:20.0812152Z     mlir_reproducer: {
2026-02-21T08:30:20.0816860Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:30:20.0821802Z       disable_threading: false,
2026-02-21T08:30:20.0822001Z       verify_each: true
2026-02-21T08:30:20.0822180Z     }
2026-02-21T08:30:20.0822315Z   }
2026-02-21T08:30:20.0822514Z #-}
2026-02-21T08:30:20.0823010Z /tmp/torchinductor_root/2b/c2bvaufoxps2wu5oj3tnplvh5juo7e4hyjyxdfl3yacmcu3dh5ru.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:30:20.0824300Z /tmp/torchinductor_root/2b/c2bvaufoxps2wu5oj3tnplvh5juo7e4hyjyxdfl3yacmcu3dh5ru.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:30:20.0825346Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:30:20.0826594Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 8], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[2, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:30:20.0827666Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:30:20.0827949Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:30:21.5886565Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:30:21.5891926Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:30:21.5894114Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:30:21.5894452Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:30:21.5898139Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:30:21.5898595Z     %cst = arith.constant dense<5504> : tensor<32x1xi32>
2026-02-21T08:30:21.5899015Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<32xf32>
2026-02-21T08:30:21.5899467Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<32xf32>
2026-02-21T08:30:21.5899830Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:30:21.5900139Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:30:21.5900399Z     %c5504_i32 = arith.constant 5504 : i32
2026-02-21T08:30:21.5900640Z     %c5504_i64 = arith.constant 5504 : i64
2026-02-21T08:30:21.5900888Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:30:21.5901312Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c5504_i32], [%c5504_i64, %c1_i64] : <f16>, <tensor<32x32xf16>>
2026-02-21T08:30:21.5901848Z     %1 = tt.get_program_id x : i32
2026-02-21T08:30:21.5902136Z     scf.for %arg2 = %1 to %c128_i32 step %c9472_i32  : i32 {
2026-02-21T08:30:21.5902432Z       %2 = arith.muli %arg2, %c32_i32 : i32
2026-02-21T08:30:21.5902738Z       %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32>
2026-02-21T08:30:21.5903108Z       %4 = tt.splat %2 : i32 -> tensor<32xi32>
2026-02-21T08:30:21.5903393Z       %5 = arith.addi %4, %3 : tensor<32xi32>
2026-02-21T08:30:21.5903653Z       %c5472_i32 = arith.constant 5472 : i32
2026-02-21T08:30:21.5903888Z       %c96_i32 = arith.constant 96 : i32
2026-02-21T08:30:21.5904371Z       %6:2 = scf.for %arg3 = %c0_i32 to %c5472_i32 step %c96_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<32xf32>, tensor<32xf32>)  : i32 {
2026-02-21T08:30:21.5905000Z         %47 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:30:21.5905451Z         %48 = arith.extf %47 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:30:21.5905767Z         %49 = "tt.reduce"(%48) <{axis = 1 : i32}> ({
2026-02-21T08:30:21.5906013Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:21.5906263Z           %105 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:30:21.5906509Z           tt.reduce.return %105 : f32
2026-02-21T08:30:21.5906754Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:30:21.5907053Z         %50 = arith.truncf %49 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:30:21.5907816Z         %51 = arith.extf %50 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:30:21.5908145Z         %52 = arith.cmpf ogt, %arg4, %51 : tensor<32xf32>
2026-02-21T08:30:21.5908451Z         %53 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32>
2026-02-21T08:30:21.5908756Z         %54 = arith.ori %52, %53 : tensor<32xi1>
2026-02-21T08:30:21.5909037Z         %55 = arith.select %54, %arg4, %51 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:30:21.5909334Z         %56 = arith.subf %arg4, %55 : tensor<32xf32>
2026-02-21T08:30:21.5909833Z         %57 = tt.extern_elementwise %56 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:30:21.5910302Z         %58 = arith.mulf %arg5, %57 : tensor<32xf32>
2026-02-21T08:30:21.5910659Z         %59 = tt.expand_dims %55 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:21.5911186Z         %60 = tt.broadcast %59 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:30:21.5911500Z         %61 = arith.subf %48, %60 : tensor<32x32xf32>
2026-02-21T08:30:21.5912032Z         %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:30:21.5912477Z         %63 = "tt.reduce"(%62) <{axis = 1 : i32}> ({
2026-02-21T08:30:21.5912703Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:21.5912914Z           %105 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:30:21.5913136Z           tt.reduce.return %105 : f32
2026-02-21T08:30:21.5913349Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:30:21.5913582Z         %64 = arith.addf %58, %63 : tensor<32xf32>
2026-02-21T08:30:21.5913801Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:30:21.5914024Z         %65 = arith.muli %c32_i32, %c1_i32 : i32
2026-02-21T08:30:21.5914253Z         %66 = arith.addi %arg3, %65 : i32
2026-02-21T08:30:21.5914608Z         %67 = tt.descriptor_load %0[%2, %66] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:30:21.5915013Z         %68 = arith.extf %67 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:30:21.5915301Z         %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({
2026-02-21T08:30:21.5915539Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:21.5915767Z           %105 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:30:21.5916009Z           tt.reduce.return %105 : f32
2026-02-21T08:30:21.5916268Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:30:21.5916544Z         %70 = arith.truncf %69 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:30:21.5916829Z         %71 = arith.extf %70 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:30:21.5917097Z         %72 = arith.cmpf ogt, %55, %71 : tensor<32xf32>
2026-02-21T08:30:21.5917340Z         %73 = arith.cmpf une, %55, %55 : tensor<32xf32>
2026-02-21T08:30:21.5917578Z         %74 = arith.ori %72, %73 : tensor<32xi1>
2026-02-21T08:30:21.5917841Z         %75 = arith.select %74, %55, %71 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:30:21.5918118Z         %76 = arith.subf %55, %75 : tensor<32xf32>
2026-02-21T08:30:21.5918522Z         %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:30:21.5918980Z         %78 = arith.mulf %64, %77 : tensor<32xf32>
2026-02-21T08:30:21.5919289Z         %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:21.5919678Z         %80 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:30:21.5919969Z         %81 = arith.subf %68, %80 : tensor<32x32xf32>
2026-02-21T08:30:21.5920386Z         %82 = tt.extern_elementwise %81 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:30:21.5920802Z         %83 = "tt.reduce"(%82) <{axis = 1 : i32}> ({
2026-02-21T08:30:21.5921026Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:21.5921240Z           %105 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:30:21.5921597Z           tt.reduce.return %105 : f32
2026-02-21T08:30:21.5921817Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:30:21.5922041Z         %84 = arith.addf %78, %83 : tensor<32xf32>
2026-02-21T08:30:21.5922269Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:30:21.5922483Z         %85 = arith.muli %c32_i32, %c2_i32 : i32
2026-02-21T08:30:21.5922704Z         %86 = arith.addi %arg3, %85 : i32
2026-02-21T08:30:21.5923022Z         %87 = tt.descriptor_load %0[%2, %86] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:30:21.5923379Z         %88 = arith.extf %87 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:30:21.5923646Z         %89 = "tt.reduce"(%88) <{axis = 1 : i32}> ({
2026-02-21T08:30:21.5923861Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:21.5924083Z           %105 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:30:21.5924301Z           tt.reduce.return %105 : f32
2026-02-21T08:30:21.5924597Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:30:21.5924852Z         %90 = arith.truncf %89 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:30:21.5925137Z         %91 = arith.extf %90 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:30:21.5925406Z         %92 = arith.cmpf ogt, %75, %91 : tensor<32xf32>
2026-02-21T08:30:21.5925649Z         %93 = arith.cmpf une, %75, %75 : tensor<32xf32>
2026-02-21T08:30:21.5925887Z         %94 = arith.ori %92, %93 : tensor<32xi1>
2026-02-21T08:30:21.5926152Z         %95 = arith.select %94, %75, %91 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:30:21.5926424Z         %96 = arith.subf %75, %95 : tensor<32xf32>
2026-02-21T08:30:21.5926833Z         %97 = tt.extern_elementwise %96 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:30:21.5927249Z         %98 = arith.mulf %84, %97 : tensor<32xf32>
2026-02-21T08:30:21.5927545Z         %99 = tt.expand_dims %95 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:21.5927891Z         %100 = tt.broadcast %99 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:30:21.5928178Z         %101 = arith.subf %88, %100 : tensor<32x32xf32>
2026-02-21T08:30:21.5928605Z         %102 = tt.extern_elementwise %101 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:30:21.5929048Z         %103 = "tt.reduce"(%102) <{axis = 1 : i32}> ({
2026-02-21T08:30:21.5929276Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:21.5929482Z           %105 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:30:21.5929709Z           tt.reduce.return %105 : f32
2026-02-21T08:30:21.5929922Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:30:21.5930163Z         %104 = arith.addf %98, %103 : tensor<32xf32>
2026-02-21T08:30:21.5930420Z         scf.yield %95, %104 : tensor<32xf32>, tensor<32xf32>
2026-02-21T08:30:21.5930681Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:30:21.5931032Z       %7 = tt.descriptor_load %0[%2, %c5472_i32] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:30:21.5931426Z       %8 = arith.extf %7 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:30:21.5931748Z       %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({
2026-02-21T08:30:21.5931979Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:30:21.5932210Z         %47 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:30:21.5932438Z         tt.reduce.return %47 : f32
2026-02-21T08:30:21.5932674Z       }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:30:21.5932943Z       %10 = arith.truncf %9 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:30:21.5933243Z       %11 = arith.extf %10 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:30:21.5933537Z       %12 = arith.cmpf ogt, %6#0, %11 : tensor<32xf32>
2026-02-21T08:30:21.5933797Z       %13 = arith.cmpf une, %6#0, %6#0 : tensor<32xf32>
2026-02-21T08:30:21.5934050Z       %14 = arith.ori %12, %13 : tensor<32xi1>
2026-02-21T08:30:21.5934329Z       %15 = arith.select %14, %6#0, %11 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:30:21.5934696Z       %16 = arith.subf %6#0, %15 : tensor<32xf32>
2026-02-21T08:30:21.5935096Z       %17 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:30:21.5935509Z       %18 = arith.mulf %6#1, %17 : tensor<32xf32>
2026-02-21T08:30:21.5935801Z       %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:21.5936134Z       %20 = tt.broadcast %19 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:30:21.5936405Z       %21 = arith.subf %8, %20 : tensor<32x32xf32>
2026-02-21T08:30:21.5936816Z       %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:30:21.5937234Z       %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({
2026-02-21T08:30:21.5937455Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:30:21.5937719Z         %47 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:30:21.5937943Z         tt.reduce.return %47 : f32
2026-02-21T08:30:21.5938153Z       }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:30:21.5938381Z       %24 = arith.addf %18, %23 : tensor<32xf32>
2026-02-21T08:30:21.5938605Z       %c5472_i32_2 = arith.constant 5472 : i32
2026-02-21T08:30:21.5938834Z       %c96_i32_3 = arith.constant 96 : i32
2026-02-21T08:30:21.5939098Z       scf.for %arg3 = %c0_i32 to %c5472_i32_2 step %c96_i32_3  : i32 {
2026-02-21T08:30:21.5939384Z         %47 = tt.splat %arg3 : i32 -> tensor<32xi32>
2026-02-21T08:30:21.5939624Z         %48 = arith.addi %47, %3 : tensor<32xi32>
2026-02-21T08:30:21.5939911Z         %49 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:30:21.5940220Z         %50 = arith.muli %49, %cst : tensor<32x1xi32>
2026-02-21T08:30:21.5940509Z         %51 = tt.expand_dims %48 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:30:21.5940848Z         %52 = tt.broadcast %50 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:30:21.5941156Z         %53 = tt.broadcast %51 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:30:21.5941424Z         %54 = arith.addi %52, %53 : tensor<32x32xi32>
2026-02-21T08:30:21.5941757Z         %55 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:30:21.5942076Z         %56 = tt.addptr %55, %54 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:30:21.5942424Z         %57 = tt.load %56 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:30:21.5942775Z         %58 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:21.5943105Z         %59 = arith.extf %57 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:30:21.5943404Z         %60 = tt.broadcast %58 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:30:21.5943671Z         %61 = arith.subf %59, %60 : tensor<32x32xf32>
2026-02-21T08:30:21.5944093Z         %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:30:21.5944571Z         %63 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:21.5944905Z         %64 = tt.broadcast %63 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:30:21.5945179Z         %65 = arith.divf %62, %64 : tensor<32x32xf32>
2026-02-21T08:30:21.5945449Z         %66 = arith.truncf %65 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:30:21.5945769Z         %67 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:30:21.5946090Z         %68 = tt.addptr %67, %54 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:30:21.5946390Z         tt.store %68, %66 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:30:21.5946619Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:30:21.5946843Z         %69 = arith.muli %c32_i32, %c1_i32 : i32
2026-02-21T08:30:21.5947068Z         %70 = arith.addi %arg3, %69 : i32
2026-02-21T08:30:21.5947400Z         %71 = tt.splat %70 : i32 -> tensor<32xi32>
2026-02-21T08:30:21.5947633Z         %72 = arith.addi %71, %3 : tensor<32xi32>
2026-02-21T08:30:21.5947914Z         %73 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:30:21.5948218Z         %74 = arith.muli %73, %cst : tensor<32x1xi32>
2026-02-21T08:30:21.5948509Z         %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:30:21.5948843Z         %76 = tt.broadcast %74 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:30:21.5949145Z         %77 = tt.broadcast %75 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:30:21.5949409Z         %78 = arith.addi %76, %77 : tensor<32x32xi32>
2026-02-21T08:30:21.5949688Z         %79 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:30:21.5950001Z         %80 = tt.addptr %79, %78 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:30:21.5950462Z         %81 = tt.load %80 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:30:21.5950821Z         %82 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:21.5951154Z         %83 = arith.extf %81 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:30:21.5951452Z         %84 = tt.broadcast %82 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:30:21.5951773Z         %85 = arith.subf %83, %84 : tensor<32x32xf32>
2026-02-21T08:30:21.5952196Z         %86 = tt.extern_elementwise %85 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:30:21.5952671Z         %87 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:21.5953006Z         %88 = tt.broadcast %87 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:30:21.5953277Z         %89 = arith.divf %86, %88 : tensor<32x32xf32>
2026-02-21T08:30:21.5953547Z         %90 = arith.truncf %89 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:30:21.5953861Z         %91 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:30:21.5954177Z         %92 = tt.addptr %91, %78 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:30:21.5954473Z         tt.store %92, %90 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:30:21.5954703Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:30:21.5954924Z         %93 = arith.muli %c32_i32, %c2_i32 : i32
2026-02-21T08:30:21.5955146Z         %94 = arith.addi %arg3, %93 : i32
2026-02-21T08:30:21.5955364Z         %95 = tt.splat %94 : i32 -> tensor<32xi32>
2026-02-21T08:30:21.5955596Z         %96 = arith.addi %95, %3 : tensor<32xi32>
2026-02-21T08:30:21.5955873Z         %97 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:30:21.5956175Z         %98 = arith.muli %97, %cst : tensor<32x1xi32>
2026-02-21T08:30:21.5956464Z         %99 = tt.expand_dims %96 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:30:21.5956801Z         %100 = tt.broadcast %98 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:30:21.5957111Z         %101 = tt.broadcast %99 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:30:21.5957388Z         %102 = arith.addi %100, %101 : tensor<32x32xi32>
2026-02-21T08:30:21.5957671Z         %103 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:30:21.5958000Z         %104 = tt.addptr %103, %102 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:30:21.5958365Z         %105 = tt.load %104 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:30:21.5958730Z         %106 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:21.5959078Z         %107 = arith.extf %105 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:30:21.5959390Z         %108 = tt.broadcast %106 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:30:21.5959676Z         %109 = arith.subf %107, %108 : tensor<32x32xf32>
2026-02-21T08:30:21.5960197Z         %110 = tt.extern_elementwise %109 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:30:21.5960698Z         %111 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:21.5961049Z         %112 = tt.broadcast %111 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:30:21.5961346Z         %113 = arith.divf %110, %112 : tensor<32x32xf32>
2026-02-21T08:30:21.5961667Z         %114 = arith.truncf %113 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:30:21.5961999Z         %115 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:30:21.5962330Z         %116 = tt.addptr %115, %102 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:30:21.5962647Z         tt.store %116, %114 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:30:21.5962899Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:30:21.5963186Z       %25 = tt.splat %c5472_i32_2 : i32 -> tensor<32xi32>
2026-02-21T08:30:21.5963424Z       %26 = arith.addi %25, %3 : tensor<32xi32>
2026-02-21T08:30:21.5963684Z       %27 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:30:21.5963972Z       %28 = arith.muli %27, %cst : tensor<32x1xi32>
2026-02-21T08:30:21.5964242Z       %29 = tt.expand_dims %26 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:30:21.5964556Z       %30 = tt.broadcast %28 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:30:21.5964831Z       %31 = tt.broadcast %29 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:30:21.5965084Z       %32 = arith.addi %30, %31 : tensor<32x32xi32>
2026-02-21T08:30:21.5965337Z       %33 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:30:21.5965627Z       %34 = tt.addptr %33, %32 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:30:21.5965950Z       %35 = tt.load %34 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:30:21.5966280Z       %36 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:21.5966587Z       %37 = arith.extf %35 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:30:21.5966855Z       %38 = tt.broadcast %36 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:30:21.5967104Z       %39 = arith.subf %37, %38 : tensor<32x32xf32>
2026-02-21T08:30:21.5967496Z       %40 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:30:21.5967937Z       %41 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:30:21.5968240Z       %42 = tt.broadcast %41 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:30:21.5968480Z       %43 = arith.divf %40, %42 : tensor<32x32xf32>
2026-02-21T08:30:21.5968727Z       %44 = arith.truncf %43 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:30:21.5969017Z       %45 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:30:21.5969305Z       %46 = tt.addptr %45, %32 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:30:21.5969580Z       tt.store %46, %44 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:30:21.5969868Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:30:21.5970140Z     tt.return
2026-02-21T08:30:21.5970272Z   }
2026-02-21T08:30:21.5970405Z }
2026-02-21T08:30:21.5970478Z 
2026-02-21T08:30:21.5970539Z {-#
2026-02-21T08:30:21.5970674Z   external_resources: {
2026-02-21T08:30:21.5970848Z     mlir_reproducer: {
2026-02-21T08:30:21.5975848Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:30:21.5981078Z       disable_threading: false,
2026-02-21T08:30:21.5981271Z       verify_each: true
2026-02-21T08:30:21.5981447Z     }
2026-02-21T08:30:21.5981630Z   }
2026-02-21T08:30:21.5981766Z #-}
2026-02-21T08:30:21.5982260Z /tmp/torchinductor_root/7j/c7jycigb7ntky37gbhleowgtonxpngp7phyc36yil7xkkkov35se.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:30:21.5983664Z /tmp/torchinductor_root/7j/c7jycigb7ntky37gbhleowgtonxpngp7phyc36yil7xkkkov35se.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:30:21.5984823Z [38s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:30:21.5986050Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=128, num_sm_multiplier=64, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[3, 3], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:30:21.5987161Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:30:21.5987456Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:30:23.8663056Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.8 configs/s
2026-02-21T08:30:23.8670805Z [41s] Adaptive compile timeout: 30s (90% percentile=5.4s, bounds=[30.0s, 30s])
2026-02-21T08:30:24.4816397Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1602.7 configs/s
2026-02-21T08:30:24.5390249Z [41s] Initial random population of 100, 5 starting points: 
2026-02-21T08:30:24.5394523Z error=8
2026-02-21T08:30:24.5399276Z timeout=3
2026-02-21T08:30:24.5400903Z ok=89
2026-02-21T08:30:24.5401140Z min=0.0428
2026-02-21T08:30:24.5406826Z mid=0.5037
2026-02-21T08:30:24.5408288Z max=37.0422
2026-02-21T08:30:24.5408510Z best={'block_sizes': [2, 1024],
2026-02-21T08:30:24.5408807Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:30:24.5409097Z  'load_eviction_policies': ['first', ''],
2026-02-21T08:30:24.5409305Z  'num_sm_multiplier': 64,
2026-02-21T08:30:24.5409473Z  'num_stages': 5,
2026-02-21T08:30:24.5409611Z  'num_warps': 1,
2026-02-21T08:30:24.5409777Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:30:24.5409970Z  'range_flattens': [True, True],
2026-02-21T08:30:24.5410733Z  'range_multi_buffers': [False, None],
2026-02-21T08:30:24.5410930Z  'range_num_stages': [3, 1],
2026-02-21T08:30:24.5411106Z  'range_unroll_factors': [0, 2],
2026-02-21T08:30:24.5411303Z  'range_warp_specializes': [True, None]}
2026-02-21T08:30:24.5411529Z [41s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:30:25.7564913Z [43s] Generation 1 starting: 85 neighbors, 5 active search path(s)
2026-02-21T08:30:55.2814611Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 0.4 configs/s
2026-02-21T08:31:00.7982873Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 16.4 configs/s
2026-02-21T08:31:06.2420535Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 211.6         
2026-02-21T08:31:06.2422179Z                                                                   configs/s     
2026-02-21T08:31:06.5732666Z [83s] Generation 1 complete: 
2026-02-21T08:31:06.5734490Z ok=91
2026-02-21T08:31:06.5734660Z min=0.0328
2026-02-21T08:31:06.5735330Z mid=0.0452
2026-02-21T08:31:06.5735487Z max=2.5037
2026-02-21T08:31:06.5735631Z best={'block_sizes': [1, 8192],
2026-02-21T08:31:06.5735865Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:31:06.5736094Z  'load_eviction_policies': ['', 'last'],
2026-02-21T08:31:06.5736278Z  'maxnreg': 128,
2026-02-21T08:31:06.5736422Z  'num_sm_multiplier': 64,
2026-02-21T08:31:06.5736581Z  'num_stages': 7,
2026-02-21T08:31:06.5736716Z  'num_warps': 2,
2026-02-21T08:31:06.5736873Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:31:06.5737062Z  'range_flattens': [None, True],
2026-02-21T08:31:06.5737240Z  'range_multi_buffers': [False, True],
2026-02-21T08:31:06.5737424Z  'range_num_stages': [1, 4],
2026-02-21T08:31:06.5737585Z  'range_unroll_factors': [0, 4],
2026-02-21T08:31:06.5737763Z  'range_warp_specializes': [True, None]}
2026-02-21T08:31:06.5749272Z [83s] Fitting surrogate: 191 points, 191 targets
2026-02-21T08:31:07.7229717Z [85s] Generation 2 starting: 79 neighbors, 5 active search path(s)
2026-02-21T08:31:16.6902904Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 15.9 configs/s
2026-02-21T08:31:21.6965485Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 83/83 16.7 configs/s
2026-02-21T08:31:26.9741321Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 192.4         
2026-02-21T08:31:26.9745360Z                                                                   configs/s     
2026-02-21T08:31:27.3045046Z [104s] Generation 2 complete: 
2026-02-21T08:31:27.3049964Z ok=85
2026-02-21T08:31:27.3051776Z min=0.0307
2026-02-21T08:31:27.3051998Z mid=0.0369
2026-02-21T08:31:27.3057122Z max=0.1556
2026-02-21T08:31:27.3058736Z best={'block_sizes': [1, 8192],
2026-02-21T08:31:27.3059065Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:31:27.3059366Z  'load_eviction_policies': ['last', ''],
2026-02-21T08:31:27.3059592Z  'num_sm_multiplier': 32,
2026-02-21T08:31:27.3059771Z  'num_stages': 5,
2026-02-21T08:31:27.3059965Z  'num_warps': 1,
2026-02-21T08:31:27.3060156Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:31:27.3060357Z  'range_flattens': [None, True],
2026-02-21T08:31:27.3060551Z  'range_multi_buffers': [False, None],
2026-02-21T08:31:27.3060738Z  'range_num_stages': [3, 1],
2026-02-21T08:31:27.3060919Z  'range_unroll_factors': [0, 2],
2026-02-21T08:31:27.3061101Z  'range_warp_specializes': [True, None]}
2026-02-21T08:31:27.3065080Z [104s] Fitting surrogate: 276 points, 276 targets
2026-02-21T08:31:28.3404201Z [105s] Generation 3 starting: 71 neighbors, 5 active search path(s)
2026-02-21T08:31:35.6376846Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 4.0 configs/s
2026-02-21T08:31:40.0829799Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.8 configs/s
2026-02-21T08:31:45.2341321Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 211.4         
2026-02-21T08:31:45.2342347Z                                                                   configs/s     
2026-02-21T08:31:45.5908794Z [122s] Generation 3 complete: 
2026-02-21T08:31:45.5913762Z ok=77
2026-02-21T08:31:45.5915915Z min=0.0246
2026-02-21T08:31:45.5916078Z mid=0.0348
2026-02-21T08:31:45.5916203Z max=0.1578
2026-02-21T08:31:45.5916349Z best={'block_sizes': [2, 8192],
2026-02-21T08:31:45.5916575Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:31:45.5916813Z  'load_eviction_policies': ['', ''],
2026-02-21T08:31:45.5916987Z  'maxnreg': 128,
2026-02-21T08:31:45.5917140Z  'num_sm_multiplier': 8,
2026-02-21T08:31:45.5917303Z  'num_stages': 6,
2026-02-21T08:31:45.5917440Z  'num_warps': 1,
2026-02-21T08:31:45.5917598Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:31:45.5917788Z  'range_flattens': [True, None],
2026-02-21T08:31:45.5917971Z  'range_multi_buffers': [None, None],
2026-02-21T08:31:45.5918151Z  'range_num_stages': [4, 2],
2026-02-21T08:31:45.5918321Z  'range_unroll_factors': [0, 1],
2026-02-21T08:31:45.5918498Z  'range_warp_specializes': [True, None]}
2026-02-21T08:31:45.5926820Z [122s] Fitting surrogate: 353 points, 353 targets
2026-02-21T08:31:46.6060882Z [123s] Generation 4 starting: 70 neighbors, 5 active search path(s)
2026-02-21T08:32:02.2878470Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 0.9 configs/s
2026-02-21T08:32:06.4161529Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 71/71 17.4 configs/s
2026-02-21T08:32:10.9468738Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 224.3         
2026-02-21T08:32:10.9470065Z                                                                   configs/s     
2026-02-21T08:32:11.2883219Z [148s] Generation 4 complete: 
2026-02-21T08:32:11.2887614Z error=3
2026-02-21T08:32:11.2888986Z ok=72
2026-02-21T08:32:11.2889148Z min=0.0245
2026-02-21T08:32:11.2889287Z mid=0.0307
2026-02-21T08:32:11.2889407Z max=0.1925
2026-02-21T08:32:11.2889555Z best={'block_sizes': [1, 8192],
2026-02-21T08:32:11.2889863Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:32:11.2890159Z  'load_eviction_policies': ['', ''],
2026-02-21T08:32:11.2890332Z  'num_stages': 4,
2026-02-21T08:32:11.2890477Z  'num_warps': 1,
2026-02-21T08:32:11.2890618Z  'pid_type': 'flat',
2026-02-21T08:32:11.2890779Z  'range_flattens': [None, True],
2026-02-21T08:32:11.2890960Z  'range_multi_buffers': [None, False],
2026-02-21T08:32:11.2891142Z  'range_num_stages': [0, 3],
2026-02-21T08:32:11.2891310Z  'range_unroll_factors': [0, 0],
2026-02-21T08:32:11.2891485Z  'range_warp_specializes': [None, True]}
2026-02-21T08:32:11.2901143Z [148s] Fitting surrogate: 428 points, 428 targets
2026-02-21T08:32:12.7543577Z [150s] Generation 5 starting: 76 neighbors, 5 active search path(s)
2026-02-21T08:32:20.1498707Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 8.0 configs/s
2026-02-21T08:32:24.8694469Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 16.7 configs/s
2026-02-21T08:32:29.4952183Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 220.0         
2026-02-21T08:32:29.4953169Z                                                                   configs/s     
2026-02-21T08:32:29.8448139Z [167s] Generation 5 complete: 
2026-02-21T08:32:29.8452174Z ok=82
2026-02-21T08:32:29.8456084Z min=0.0244
2026-02-21T08:32:29.8460398Z mid=0.0266
2026-02-21T08:32:29.8463554Z max=0.1823
2026-02-21T08:32:29.8468084Z best={'block_sizes': [1, 8192],
2026-02-21T08:32:29.8472626Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:32:29.8476423Z  'load_eviction_policies': ['', ''],
2026-02-21T08:32:29.8476722Z  'num_stages': 4,
2026-02-21T08:32:29.8476916Z  'num_warps': 1,
2026-02-21T08:32:29.8477096Z  'pid_type': 'flat',
2026-02-21T08:32:29.8477275Z  'range_flattens': [None, True],
2026-02-21T08:32:29.8477470Z  'range_multi_buffers': [None, False],
2026-02-21T08:32:29.8477654Z  'range_num_stages': [0, 3],
2026-02-21T08:32:29.8477833Z  'range_unroll_factors': [0, 0],
2026-02-21T08:32:29.8482544Z  'range_warp_specializes': [None, True]}
2026-02-21T08:32:29.8486436Z [167s] Fitting surrogate: 510 points, 510 targets
2026-02-21T08:32:30.7372515Z [168s] Generation 6 starting: 50 neighbors, 4 active search path(s)
2026-02-21T08:32:36.3183609Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 7.3 configs/s
2026-02-21T08:32:39.4849510Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 16.6 configs/s
2026-02-21T08:32:43.6665510Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 280.4         
2026-02-21T08:32:43.6666936Z                                                                   configs/s     
2026-02-21T08:32:43.9616359Z [181s] Generation 6 complete: 
2026-02-21T08:32:43.9618323Z ok=55
2026-02-21T08:32:43.9618498Z min=0.0244
2026-02-21T08:32:43.9618628Z mid=0.0246
2026-02-21T08:32:43.9618757Z max=0.0572
2026-02-21T08:32:43.9618896Z best={'block_sizes': [1, 8192],
2026-02-21T08:32:43.9619133Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:32:43.9619752Z  'load_eviction_policies': ['', ''],
2026-02-21T08:32:43.9619963Z  'num_stages': 4,
2026-02-21T08:32:43.9620110Z  'num_warps': 1,
2026-02-21T08:32:43.9620249Z  'pid_type': 'flat',
2026-02-21T08:32:43.9620412Z  'range_flattens': [None, True],
2026-02-21T08:32:43.9620589Z  'range_multi_buffers': [None, False],
2026-02-21T08:32:43.9620781Z  'range_num_stages': [0, 3],
2026-02-21T08:32:43.9620945Z  'range_unroll_factors': [0, 0],
2026-02-21T08:32:43.9621127Z  'range_warp_specializes': [None, True]}
2026-02-21T08:32:43.9633503Z [181s] Fitting surrogate: 565 points, 565 targets
2026-02-21T08:32:44.8067460Z [182s] Generation 7 starting: 45 neighbors, 4 active search path(s)
2026-02-21T08:32:50.5239472Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46/46 5.8 configs/s
2026-02-21T08:32:53.3219162Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 46/46 16.7 configs/s
2026-02-21T08:32:56.7595097Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 295.7         
2026-02-21T08:32:56.7596398Z                                                                   configs/s     
2026-02-21T08:32:57.0324596Z [194s] Generation 7 complete: 
2026-02-21T08:32:57.0326189Z ok=50
2026-02-21T08:32:57.0326361Z min=0.0244
2026-02-21T08:32:57.0326559Z mid=0.0246
2026-02-21T08:32:57.0326697Z max=0.0513
2026-02-21T08:32:57.0330748Z best={'block_sizes': [1, 8192],
2026-02-21T08:32:57.0334389Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:32:57.0336143Z  'load_eviction_policies': ['', ''],
2026-02-21T08:32:57.0336385Z  'num_stages': 5,
2026-02-21T08:32:57.0336536Z  'num_warps': 1,
2026-02-21T08:32:57.0336692Z  'pid_type': 'flat',
2026-02-21T08:32:57.0336850Z  'range_flattens': [None, None],
2026-02-21T08:32:57.0337039Z  'range_multi_buffers': [None, None],
2026-02-21T08:32:57.0337222Z  'range_num_stages': [0, 1],
2026-02-21T08:32:57.0337397Z  'range_unroll_factors': [0, 0],
2026-02-21T08:32:57.0337587Z  'range_warp_specializes': [None, True]}
2026-02-21T08:32:57.0347418Z [194s] Fitting surrogate: 615 points, 615 targets
2026-02-21T08:32:57.6281774Z [194s] Generation 8 starting: 28 neighbors, 3 active search path(s)
2026-02-21T08:33:00.7543139Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 28/28 9.7 configs/s
2026-02-21T08:33:02.4495697Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 28/28 17.0 configs/s
2026-02-21T08:33:04.5083604Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 492.7         
2026-02-21T08:33:04.5084969Z                                                                   configs/s     
2026-02-21T08:33:04.6845462Z [202s] Generation 8 complete: 
2026-02-21T08:33:04.6849031Z ok=31
2026-02-21T08:33:04.6850408Z min=0.0244
2026-02-21T08:33:04.6850568Z mid=0.0245
2026-02-21T08:33:04.6850701Z max=0.0307
2026-02-21T08:33:04.6850839Z best={'block_sizes': [1, 8192],
2026-02-21T08:33:04.6851101Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:33:04.6851360Z  'load_eviction_policies': ['', ''],
2026-02-21T08:33:04.6852360Z  'num_stages': 4,
2026-02-21T08:33:04.6852539Z  'num_warps': 2,
2026-02-21T08:33:04.6852682Z  'pid_type': 'flat',
2026-02-21T08:33:04.6852846Z  'range_flattens': [None, True],
2026-02-21T08:33:04.6853023Z  'range_multi_buffers': [None, False],
2026-02-21T08:33:04.6853214Z  'range_num_stages': [0, 2],
2026-02-21T08:33:04.6853377Z  'range_unroll_factors': [0, 0],
2026-02-21T08:33:04.6853662Z  'range_warp_specializes': [None, True]}
2026-02-21T08:33:04.6866490Z [202s] Fitting surrogate: 646 points, 646 targets
2026-02-21T08:33:05.2258683Z [202s] Generation 9 starting: 21 neighbors, 2 active search path(s)
2026-02-21T08:33:08.2644633Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 4.2 configs/s
2026-02-21T08:33:09.5995735Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 22/22 17.0 configs/s
2026-02-21T08:33:11.6419390Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 687.4         
2026-02-21T08:33:11.6419943Z                                                                   configs/s     
2026-02-21T08:33:11.7663100Z [209s] Generation 9 complete: 
2026-02-21T08:33:11.7663303Z ok=23
2026-02-21T08:33:11.7663454Z min=0.0245
2026-02-21T08:33:11.7663587Z mid=0.0246
2026-02-21T08:33:11.7663724Z max=0.0512
2026-02-21T08:33:11.7663861Z best={'block_sizes': [1, 8192],
2026-02-21T08:33:11.7664090Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:33:11.7664330Z  'load_eviction_policies': ['', ''],
2026-02-21T08:33:11.7664513Z  'num_stages': 4,
2026-02-21T08:33:11.7664665Z  'num_warps': 1,
2026-02-21T08:33:11.7664805Z  'pid_type': 'flat',
2026-02-21T08:33:11.7664972Z  'range_flattens': [None, True],
2026-02-21T08:33:11.7665151Z  'range_multi_buffers': [None, False],
2026-02-21T08:33:11.7665345Z  'range_num_stages': [0, 2],
2026-02-21T08:33:11.7665511Z  'range_unroll_factors': [0, 0],
2026-02-21T08:33:11.7682361Z  'range_warp_specializes': [None, True]}
2026-02-21T08:33:11.7682629Z [209s] Fitting surrogate: 669 points, 669 targets
2026-02-21T08:33:12.2673976Z [209s] Generation 10 starting: 17 neighbors, 2 active search path(s)
2026-02-21T08:33:13.7344625Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 15.7 configs/s
2026-02-21T08:33:14.7438428Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.6 configs/s
2026-02-21T08:33:15.9267565Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 852.6         
2026-02-21T08:33:15.9271242Z                                                                   configs/s     
2026-02-21T08:33:16.0240162Z [213s] Generation 10 complete: 
2026-02-21T08:33:16.0242102Z ok=19
2026-02-21T08:33:16.0242272Z min=0.0245
2026-02-21T08:33:16.0242410Z mid=0.0245
2026-02-21T08:33:16.0242529Z max=0.0430
2026-02-21T08:33:16.0242673Z best={'block_sizes': [1, 8192],
2026-02-21T08:33:16.0242901Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:33:16.0243138Z  'load_eviction_policies': ['', ''],
2026-02-21T08:33:16.0243315Z  'num_stages': 4,
2026-02-21T08:33:16.0243486Z  'num_warps': 1,
2026-02-21T08:33:16.0243970Z  'pid_type': 'flat',
2026-02-21T08:33:16.0244137Z  'range_flattens': [None, True],
2026-02-21T08:33:16.0244315Z  'range_multi_buffers': [None, False],
2026-02-21T08:33:16.0244506Z  'range_num_stages': [0, 2],
2026-02-21T08:33:16.0244678Z  'range_unroll_factors': [0, 0],
2026-02-21T08:33:16.0244855Z  'range_warp_specializes': [None, True]}
2026-02-21T08:33:16.0271209Z [213s] Fitting surrogate: 688 points, 688 targets
2026-02-21T08:33:16.4563121Z [213s] Generation 11 starting: 11 neighbors, 2 active search path(s)
2026-02-21T08:33:18.0750604Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 3.5 configs/s
2026-02-21T08:33:18.7197450Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 18.4 configs/s
2026-02-21T08:33:19.5849013Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1160.3        
2026-02-21T08:33:19.5852701Z                                                                   configs/s     
2026-02-21T08:33:19.6596718Z [216s] Generation 11 complete: 
2026-02-21T08:33:19.6600267Z ok=13
2026-02-21T08:33:19.6603509Z min=0.0245
2026-02-21T08:33:19.6607971Z mid=0.0245
2026-02-21T08:33:19.6611850Z max=0.0308
2026-02-21T08:33:19.6613283Z best={'block_sizes': [1, 8192],
2026-02-21T08:33:19.6613612Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:33:19.6613845Z  'load_eviction_policies': ['', ''],
2026-02-21T08:33:19.6618882Z  'num_stages': 4,
2026-02-21T08:33:19.6619143Z  'num_warps': 4,
2026-02-21T08:33:19.6619332Z  'pid_type': 'flat',
2026-02-21T08:33:19.6619519Z  'range_flattens': [None, True],
2026-02-21T08:33:19.6623504Z  'range_multi_buffers': [None, False],
2026-02-21T08:33:19.6627226Z  'range_num_stages': [0, 3],
2026-02-21T08:33:19.6632174Z  'range_unroll_factors': [0, 0],
2026-02-21T08:33:19.6636593Z  'range_warp_specializes': [None, True]}
2026-02-21T08:33:19.6638620Z [216s] Fitting surrogate: 701 points, 701 targets
2026-02-21T08:33:20.0708949Z [217s] Generation 12 starting: 11 neighbors, 1 active search path(s)
2026-02-21T08:33:21.5947041Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 2.6 configs/s
2026-02-21T08:33:22.2405376Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 18.4 configs/s
2026-02-21T08:33:22.9648438Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1382.9        
2026-02-21T08:33:22.9649635Z                                                                   configs/s     
2026-02-21T08:33:23.0350874Z [220s] Generation 12 complete: 
2026-02-21T08:33:23.0356164Z ok=12
2026-02-21T08:33:23.0360691Z min=0.0245
2026-02-21T08:33:23.0362211Z mid=0.0245
2026-02-21T08:33:23.0362390Z max=0.0369
2026-02-21T08:33:23.0362596Z best={'block_sizes': [1, 8192],
2026-02-21T08:33:23.0362855Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:33:23.0365218Z  'load_eviction_policies': ['', ''],
2026-02-21T08:33:23.0365499Z  'num_stages': 4,
2026-02-21T08:33:23.0370603Z  'num_warps': 4,
2026-02-21T08:33:23.0372926Z  'pid_type': 'flat',
2026-02-21T08:33:23.0373204Z  'range_flattens': [None, True],
2026-02-21T08:33:23.0373421Z  'range_multi_buffers': [None, False],
2026-02-21T08:33:23.0377633Z  'range_num_stages': [0, 3],
2026-02-21T08:33:23.0377957Z  'range_unroll_factors': [0, 0],
2026-02-21T08:33:23.0378190Z  'range_warp_specializes': [None, True]}
2026-02-21T08:33:23.0382635Z [220s] Fitting surrogate: 713 points, 713 targets
2026-02-21T08:33:23.3240718Z [220s] Autotuning complete in 220.7s after searching 680 configs.
2026-02-21T08:33:23.3245302Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:33:23.3247508Z     @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', ''], num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:33:23.3248380Z 
2026-02-21T08:33:23.3253946Z [220s] Code of selected kernel: /tmp/torchinductor_root/yl/cyl2mzswb3juxdo7sy6qfkfuw5h2iowrkcl3hdh555nmuczs77xm.py
2026-02-21T08:33:24.1308493Z WARNING:tritonbench.utils.triton_op:Completed input ID 41:
2026-02-21T08:33:24.1312616Z (M, N)
2026-02-21T08:33:24.1317748Z ------------
2026-02-21T08:33:24.1319989Z (4096, 5504)
2026-02-21T08:33:24.1320128Z 
2026-02-21T08:33:24.1320701Z  45%|████▌     | 9/20 [24:21<35:15, 192.34s/it]
2026-02-21T08:33:24.1320701Z WARNING:tritonbench.utils.triton_op:Running input ID 46:
2026-02-21T08:33:24.1325033Z (M, N)
2026-02-21T08:33:24.1326724Z ------------
2026-02-21T08:33:24.1326902Z (4096, 6144)
2026-02-21T08:33:24.1327181Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T08:33:25.3789876Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T08:33:26.8741449Z INFO:tritonbench.utils.triton_op:Took 2.36ms to get benchmark function for torch_compile_softmax
2026-02-21T08:33:32.0115784Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:33:32.0117113Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:33:32.0117346Z               'dtype': 'torch.float16',
2026-02-21T08:33:32.0123352Z               'shape': (4096, 6144),
2026-02-21T08:33:32.0124821Z               'stride': (6144, 1)},),
2026-02-21T08:33:32.0125043Z   'kwargs': {}}
2026-02-21T08:33:32.0136898Z INFO:tritonbench.utils.triton_op:Took 2.61ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T08:33:32.1929058Z [0s] Autotune random seed: 2134816249
2026-02-21T08:33:32.3307341Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:34:06.5704872Z [34s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None])
2026-02-21T08:34:06.9115833Z [34s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False])
2026-02-21T08:34:07.1523504Z [34s] Timeout after 30s compiling Config(block_sizes=[256, 4096], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=128, num_sm_multiplier=128, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 2], range_unroll_factors=[3, 0], range_warp_specializes=[None, True])
2026-02-21T08:34:07.1544619Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s
2026-02-21T08:34:13.9626470Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.8 configs/s
2026-02-21T08:34:13.9635637Z [41s] Adaptive compile timeout: 30s (90% percentile=6.4s, bounds=[30.0s, 30s])
2026-02-21T08:34:14.3677484Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 2405.1 configs/s
2026-02-21T08:34:14.4115229Z [42s] Initial random population of 100, 5 starting points: 
2026-02-21T08:34:14.4116500Z error=6
2026-02-21T08:34:14.4116660Z timeout=3
2026-02-21T08:34:14.4116787Z ok=91
2026-02-21T08:34:14.4116917Z min=0.0348
2026-02-21T08:34:14.4117041Z mid=0.5385
2026-02-21T08:34:14.4117170Z max=41.0603
2026-02-21T08:34:14.4117317Z best={'block_sizes': [1, 8192],
2026-02-21T08:34:14.4117549Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:34:14.4117787Z  'load_eviction_policies': ['', 'last'],
2026-02-21T08:34:14.4117965Z  'maxnreg': 32,
2026-02-21T08:34:14.4118148Z  'num_sm_multiplier': 64,
2026-02-21T08:34:14.4118699Z  'num_stages': 7,
2026-02-21T08:34:14.4118850Z  'num_warps': 4,
2026-02-21T08:34:14.4119009Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:34:14.4119214Z  'range_flattens': [None, True],
2026-02-21T08:34:14.4119396Z  'range_multi_buffers': [False, True],
2026-02-21T08:34:14.4119595Z  'range_num_stages': [1, 4],
2026-02-21T08:34:14.4119775Z  'range_unroll_factors': [1, 4],
2026-02-21T08:34:14.4119953Z  'range_warp_specializes': [True, None]}
2026-02-21T08:34:14.4130874Z [42s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:34:15.5611286Z [43s] Generation 1 starting: 83 neighbors, 5 active search path(s)
2026-02-21T08:34:23.0337266Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 12.3 configs/s
2026-02-21T08:34:28.2389255Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 87/87 16.9 configs/s
2026-02-21T08:34:33.3714749Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 221.9 configs/s
2026-02-21T08:34:33.6693900Z [61s] Generation 1 complete: 
2026-02-21T08:34:33.6698073Z ok=89
2026-02-21T08:34:33.6702032Z min=0.0307
2026-02-21T08:34:33.6704154Z mid=0.0409
2026-02-21T08:34:33.6704367Z max=2.1657
2026-02-21T08:34:33.6709305Z best={'block_sizes': [1, 8192],
2026-02-21T08:34:33.6711225Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:34:33.6711495Z  'load_eviction_policies': ['', 'last'],
2026-02-21T08:34:33.6711940Z  'num_stages': 7,
2026-02-21T08:34:33.6712091Z  'num_warps': 4,
2026-02-21T08:34:33.6712229Z  'pid_type': 'flat',
2026-02-21T08:34:33.6712393Z  'range_flattens': [None, True],
2026-02-21T08:34:33.6712572Z  'range_multi_buffers': [None, None],
2026-02-21T08:34:33.6712762Z  'range_num_stages': [0, 4],
2026-02-21T08:34:33.6712925Z  'range_unroll_factors': [0, 0],
2026-02-21T08:34:33.6713109Z  'range_warp_specializes': [None, True]}
2026-02-21T08:34:33.6713359Z [61s] Fitting surrogate: 189 points, 189 targets
2026-02-21T08:34:34.5585348Z [62s] Generation 2 starting: 66 neighbors, 5 active search path(s)
2026-02-21T08:34:44.4898003Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 2.4 configs/s
2026-02-21T08:34:48.5769418Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 16.8 configs/s
2026-02-21T08:34:52.7458340Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 243.2 configs/s
2026-02-21T08:34:53.0448377Z [80s] Generation 2 complete: 
2026-02-21T08:34:53.0451529Z ok=71
2026-02-21T08:34:53.0454889Z min=0.0287
2026-02-21T08:34:53.0459296Z mid=0.0368
2026-02-21T08:34:53.0460701Z max=0.5110
2026-02-21T08:34:53.0460911Z best={'block_sizes': [2, 8192],
2026-02-21T08:34:53.0461187Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:34:53.0461509Z  'load_eviction_policies': ['', ''],
2026-02-21T08:34:53.0462175Z  'num_stages': 4,
2026-02-21T08:34:53.0462329Z  'num_warps': 2,
2026-02-21T08:34:53.0462473Z  'pid_type': 'flat',
2026-02-21T08:34:53.0462638Z  'range_flattens': [None, False],
2026-02-21T08:34:53.0462827Z  'range_multi_buffers': [None, False],
2026-02-21T08:34:53.0463026Z  'range_num_stages': [0, 4],
2026-02-21T08:34:53.0463201Z  'range_unroll_factors': [0, 0],
2026-02-21T08:34:53.0463382Z  'range_warp_specializes': [None, True]}
2026-02-21T08:34:53.0463717Z [80s] Fitting surrogate: 260 points, 260 targets
2026-02-21T08:34:53.8768935Z [81s] Generation 3 starting: 59 neighbors, 5 active search path(s)
2026-02-21T08:35:00.3396708Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 24.3 configs/s
2026-02-21T08:35:04.0781879Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 62/62 16.8 configs/s
2026-02-21T08:35:08.6769869Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 242.6 configs/s
2026-02-21T08:35:08.9585995Z [96s] Generation 3 complete: 
2026-02-21T08:35:08.9587393Z ok=65
2026-02-21T08:35:08.9587595Z min=0.0246
2026-02-21T08:35:08.9587761Z mid=0.0328
2026-02-21T08:35:08.9587919Z max=0.0522
2026-02-21T08:35:08.9588096Z best={'block_sizes': [1, 8192],
2026-02-21T08:35:08.9588396Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:35:08.9588710Z  'load_eviction_policies': ['', ''],
2026-02-21T08:35:08.9588901Z  'num_stages': 4,
2026-02-21T08:35:08.9589074Z  'num_warps': 1,
2026-02-21T08:35:08.9589250Z  'pid_type': 'flat',
2026-02-21T08:35:08.9589424Z  'range_flattens': [None, None],
2026-02-21T08:35:08.9589620Z  'range_multi_buffers': [None, False],
2026-02-21T08:35:08.9589819Z  'range_num_stages': [0, 4],
2026-02-21T08:35:08.9589998Z  'range_unroll_factors': [0, 0],
2026-02-21T08:35:08.9591804Z  'range_warp_specializes': [None, True]}
2026-02-21T08:35:08.9599078Z [96s] Fitting surrogate: 325 points, 325 targets
2026-02-21T08:35:09.5562205Z [97s] Generation 4 starting: 38 neighbors, 3 active search path(s)
2026-02-21T08:35:15.4060892Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 2.2 configs/s
2026-02-21T08:35:17.7699134Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 39/39 16.8 configs/s
2026-02-21T08:35:20.2157974Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 414.5 configs/s
2026-02-21T08:35:20.4049369Z [108s] Generation 4 complete: 
2026-02-21T08:35:20.4053714Z ok=41
2026-02-21T08:35:20.4055147Z min=0.0265
2026-02-21T08:35:20.4055314Z mid=0.0328
2026-02-21T08:35:20.4055435Z max=0.1455
2026-02-21T08:35:20.4055583Z best={'block_sizes': [1, 8192],
2026-02-21T08:35:20.4055855Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:35:20.4056128Z  'load_eviction_policies': ['', ''],
2026-02-21T08:35:20.4056348Z  'num_stages': 4,
2026-02-21T08:35:20.4056503Z  'num_warps': 1,
2026-02-21T08:35:20.4056655Z  'pid_type': 'flat',
2026-02-21T08:35:20.4056808Z  'range_flattens': [None, None],
2026-02-21T08:35:20.4056990Z  'range_multi_buffers': [None, False],
2026-02-21T08:35:20.4057172Z  'range_num_stages': [0, 4],
2026-02-21T08:35:20.4057339Z  'range_unroll_factors': [0, 0],
2026-02-21T08:35:20.4057522Z  'range_warp_specializes': [None, True]}
2026-02-21T08:35:20.4068192Z [108s] Fitting surrogate: 366 points, 366 targets
2026-02-21T08:35:20.8142649Z [108s] Generation 5 starting: 17 neighbors, 2 active search path(s)
2026-02-21T08:35:22.9669009Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 12.5 configs/s
2026-02-21T08:35:24.1240096Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 19/19 17.1 configs/s
2026-02-21T08:35:25.4197815Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 777.4 configs/s
2026-02-21T08:35:25.5192874Z [113s] Generation 5 complete: 
2026-02-21T08:35:25.5196740Z ok=20
2026-02-21T08:35:25.5199869Z min=0.0266
2026-02-21T08:35:25.5203721Z mid=0.0307
2026-02-21T08:35:25.5208163Z max=0.0492
2026-02-21T08:35:25.5212554Z best={'block_sizes': [1, 8192],
2026-02-21T08:35:25.5214148Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:35:25.5214455Z  'load_eviction_policies': ['', ''],
2026-02-21T08:35:25.5214633Z  'num_stages': 4,
2026-02-21T08:35:25.5214786Z  'num_warps': 1,
2026-02-21T08:35:25.5214928Z  'pid_type': 'flat',
2026-02-21T08:35:25.5215171Z  'range_flattens': [None, None],
2026-02-21T08:35:25.5217353Z  'range_multi_buffers': [None, False],
2026-02-21T08:35:25.5217587Z  'range_num_stages': [0, 4],
2026-02-21T08:35:25.5217762Z  'range_unroll_factors': [0, 0],
2026-02-21T08:35:25.5217954Z  'range_warp_specializes': [None, True]}
2026-02-21T08:35:25.5218575Z [113s] Fitting surrogate: 386 points, 386 targets
2026-02-21T08:35:25.9678413Z [113s] Generation 6 starting: 21 neighbors, 2 active search path(s)
2026-02-21T08:35:27.8818751Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 15.7 configs/s
2026-02-21T08:35:29.1580428Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 17.0 configs/s
2026-02-21T08:35:30.7485850Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 635.7 configs/s
2026-02-21T08:35:30.8688654Z [118s] Generation 6 complete: 
2026-02-21T08:35:30.8693053Z ok=24
2026-02-21T08:35:30.8695128Z min=0.0265
2026-02-21T08:35:30.8695305Z mid=0.0266
2026-02-21T08:35:30.8695439Z max=0.0471
2026-02-21T08:35:30.8695580Z best={'block_sizes': [1, 8192],
2026-02-21T08:35:30.8695868Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:35:30.8696161Z  'load_eviction_policies': ['', ''],
2026-02-21T08:35:30.8696381Z  'num_stages': 5,
2026-02-21T08:35:30.8696539Z  'num_warps': 4,
2026-02-21T08:35:30.8696693Z  'pid_type': 'flat',
2026-02-21T08:35:30.8696860Z  'range_flattens': [None, True],
2026-02-21T08:35:30.8697038Z  'range_multi_buffers': [None, True],
2026-02-21T08:35:30.8697234Z  'range_num_stages': [0, 2],
2026-02-21T08:35:30.8697404Z  'range_unroll_factors': [0, 2],
2026-02-21T08:35:30.8697592Z  'range_warp_specializes': [None, None]}
2026-02-21T08:35:30.8706167Z [118s] Fitting surrogate: 410 points, 410 targets
2026-02-21T08:35:31.2966701Z [118s] Generation 7 starting: 20 neighbors, 2 active search path(s)
2026-02-21T08:35:33.4670839Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 7.9 configs/s
2026-02-21T08:35:34.8633508Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 15.5 configs/s
2026-02-21T08:35:36.8602077Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 657.9 configs/s
2026-02-21T08:35:36.9871804Z [124s] Generation 7 complete: 
2026-02-21T08:35:36.9872954Z ok=22
2026-02-21T08:35:36.9873114Z min=0.0266
2026-02-21T08:35:36.9873256Z mid=0.0266
2026-02-21T08:35:36.9873376Z max=0.0369
2026-02-21T08:35:36.9873522Z best={'block_sizes': [1, 8192],
2026-02-21T08:35:36.9873769Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:35:36.9874027Z  'load_eviction_policies': ['', ''],
2026-02-21T08:35:36.9874204Z  'num_stages': 3,
2026-02-21T08:35:36.9874353Z  'num_warps': 1,
2026-02-21T08:35:36.9874492Z  'pid_type': 'flat',
2026-02-21T08:35:36.9874656Z  'range_flattens': [None, True],
2026-02-21T08:35:36.9874839Z  'range_multi_buffers': [None, False],
2026-02-21T08:35:36.9875019Z  'range_num_stages': [0, 3],
2026-02-21T08:35:36.9875187Z  'range_unroll_factors': [0, 0],
2026-02-21T08:35:36.9875361Z  'range_warp_specializes': [None, True]}
2026-02-21T08:35:36.9887752Z [124s] Fitting surrogate: 432 points, 432 targets
2026-02-21T08:35:37.2673091Z [124s] Generation 8 starting: 10 neighbors, 1 active search path(s)
2026-02-21T08:35:38.6410634Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 10.7 configs/s
2026-02-21T08:35:39.2549530Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 17.6 configs/s
2026-02-21T08:35:40.0787204Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1216.1 configs/s
2026-02-21T08:35:40.1514645Z [127s] Generation 8 complete: 
2026-02-21T08:35:40.1519010Z ok=12
2026-02-21T08:35:40.1520413Z min=0.0266
2026-02-21T08:35:40.1520579Z mid=0.0266
2026-02-21T08:35:40.1520705Z max=0.0307
2026-02-21T08:35:40.1520852Z best={'block_sizes': [1, 8192],
2026-02-21T08:35:40.1521121Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:35:40.1521406Z  'load_eviction_policies': ['', ''],
2026-02-21T08:35:40.1521818Z  'num_stages': 5,
2026-02-21T08:35:40.1522328Z  'num_warps': 4,
2026-02-21T08:35:40.1522483Z  'pid_type': 'flat',
2026-02-21T08:35:40.1522656Z  'range_flattens': [None, None],
2026-02-21T08:35:40.1522839Z  'range_multi_buffers': [None, True],
2026-02-21T08:35:40.1523038Z  'range_num_stages': [0, 2],
2026-02-21T08:35:40.1523218Z  'range_unroll_factors': [0, 1],
2026-02-21T08:35:40.1523400Z  'range_warp_specializes': [None, None]}
2026-02-21T08:35:40.1540557Z [127s] Fitting surrogate: 444 points, 444 targets
2026-02-21T08:35:40.4575853Z [128s] Generation 9 starting: 10 neighbors, 1 active search path(s)
2026-02-21T08:35:41.7570980Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 13.6 configs/s
2026-02-21T08:35:42.3582072Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 18.0 configs/s
2026-02-21T08:35:42.8393018Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2057.0 configs/s
2026-02-21T08:35:42.8898361Z [130s] Generation 9 complete: 
2026-02-21T08:35:42.8902931Z ok=11
2026-02-21T08:35:42.8905867Z min=0.0266
2026-02-21T08:35:42.8906044Z mid=0.0327
2026-02-21T08:35:42.8906175Z max=0.1455
2026-02-21T08:35:42.8906331Z best={'block_sizes': [1, 8192],
2026-02-21T08:35:42.8906616Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:35:42.8906917Z  'load_eviction_policies': ['', ''],
2026-02-21T08:35:42.8907105Z  'num_stages': 5,
2026-02-21T08:35:42.8907263Z  'num_warps': 4,
2026-02-21T08:35:42.8907423Z  'pid_type': 'flat',
2026-02-21T08:35:42.8907587Z  'range_flattens': [None, None],
2026-02-21T08:35:42.8907782Z  'range_multi_buffers': [None, True],
2026-02-21T08:35:42.8907976Z  'range_num_stages': [0, 2],
2026-02-21T08:35:42.8908157Z  'range_unroll_factors': [0, 1],
2026-02-21T08:35:42.8908346Z  'range_warp_specializes': [None, None]}
2026-02-21T08:35:42.8924287Z [130s] Fitting surrogate: 455 points, 455 targets
2026-02-21T08:35:43.1810496Z [130s] Generation 10 starting: 7 neighbors, 1 active search path(s)
2026-02-21T08:35:44.2867005Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7/7 10.8 configs/s
2026-02-21T08:35:44.7139581Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 7/7 18.5 configs/s
2026-02-21T08:35:45.2619458Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1809.8 configs/s
2026-02-21T08:35:45.3135660Z [132s] Generation 10 complete: 
2026-02-21T08:35:45.3139583Z ok=8
2026-02-21T08:35:45.3144119Z min=0.0266
2026-02-21T08:35:45.3145852Z mid=0.0266
2026-02-21T08:35:45.3146058Z max=0.0328
2026-02-21T08:35:45.3151493Z best={'block_sizes': [1, 8192],
2026-02-21T08:35:45.3155583Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:35:45.3159706Z  'load_eviction_policies': ['', ''],
2026-02-21T08:35:45.3163566Z  'num_stages': 5,
2026-02-21T08:35:45.3175717Z  'num_warps': 1,
2026-02-21T08:35:45.3176065Z  'pid_type': 'flat',
2026-02-21T08:35:45.3176239Z  'range_flattens': [None, None],
2026-02-21T08:35:45.3176442Z  'range_multi_buffers': [None, True],
2026-02-21T08:35:45.3176645Z  'range_num_stages': [0, 2],
2026-02-21T08:35:45.3176819Z  'range_unroll_factors': [0, 0],
2026-02-21T08:35:45.3177018Z  'range_warp_specializes': [None, True]}
2026-02-21T08:35:45.3177252Z [132s] Fitting surrogate: 463 points, 463 targets
2026-02-21T08:35:45.5785388Z [133s] Generation 11 starting: 9 neighbors, 1 active search path(s)
2026-02-21T08:35:47.2425671Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 4.6 configs/s
2026-02-21T08:35:47.8486348Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 10/10 17.9 configs/s
2026-02-21T08:35:48.5234409Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1480.7 configs/s
2026-02-21T08:35:48.5866202Z [136s] Generation 11 complete: 
2026-02-21T08:35:48.5870559Z ok=11
2026-02-21T08:35:48.5872130Z min=0.0266
2026-02-21T08:35:48.5872296Z mid=0.0266
2026-02-21T08:35:48.5872418Z max=0.0389
2026-02-21T08:35:48.5872562Z best={'block_sizes': [1, 8192],
2026-02-21T08:35:48.5872832Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:35:48.5873107Z  'load_eviction_policies': ['', ''],
2026-02-21T08:35:48.5873280Z  'num_stages': 5,
2026-02-21T08:35:48.5873425Z  'num_warps': 1,
2026-02-21T08:35:48.5873565Z  'pid_type': 'flat',
2026-02-21T08:35:48.5873723Z  'range_flattens': [None, None],
2026-02-21T08:35:48.5873896Z  'range_multi_buffers': [None, True],
2026-02-21T08:35:48.5874084Z  'range_num_stages': [0, 2],
2026-02-21T08:35:48.5874250Z  'range_unroll_factors': [0, 0],
2026-02-21T08:35:48.5874423Z  'range_warp_specializes': [None, True]}
2026-02-21T08:35:48.5884017Z [136s] Fitting surrogate: 474 points, 474 targets
2026-02-21T08:35:48.7678437Z [136s] Autotuning complete in 136.4s after searching 451 configs.
2026-02-21T08:35:48.7678943Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:35:48.7679992Z     @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', ''], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:35:48.7680882Z 
2026-02-21T08:35:48.7681154Z [136s] Code of selected kernel: /tmp/torchinductor_root/34/c34nendb26c5gje5v6qbm6aunt2taswx5hcv7ajfpjlnjbo4gcdo.py
2026-02-21T08:35:49.7088697Z WARNING:tritonbench.utils.triton_op:Completed input ID 46:
2026-02-21T08:35:49.7090509Z (M, N)
2026-02-21T08:35:49.7090684Z ------------
2026-02-21T08:35:49.7090827Z (4096, 6144)
2026-02-21T08:35:49.7090987Z 
2026-02-21T08:35:49.7104613Z  50%|█████     | 10/20 [26:47<29:39, 177.90s/it]
2026-02-21T08:35:49.7104613Z WARNING:tritonbench.utils.triton_op:Running input ID 51:
2026-02-21T08:35:49.7106154Z (M, N)
2026-02-21T08:35:49.7106329Z ------------
2026-02-21T08:35:49.7106477Z (4096, 6784)
2026-02-21T08:35:49.7106851Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for naive_softmax
2026-02-21T08:35:50.9126277Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T08:35:52.4112509Z INFO:tritonbench.utils.triton_op:Took 2.50ms to get benchmark function for torch_compile_softmax
2026-02-21T08:35:55.9221188Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:35:55.9225492Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:35:55.9228718Z               'dtype': 'torch.float16',
2026-02-21T08:35:55.9231962Z               'shape': (4096, 6784),
2026-02-21T08:35:55.9236488Z               'stride': (6784, 1)},),
2026-02-21T08:35:55.9238348Z   'kwargs': {}}
2026-02-21T08:35:55.9243588Z INFO:tritonbench.utils.triton_op:Took 2.38ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T08:35:56.1011820Z [0s] Autotune random seed: 2134816249
2026-02-21T08:35:56.2402290Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:36:31.0616532Z [34s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None])
2026-02-21T08:36:31.4522547Z [35s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False])
2026-02-21T08:36:31.7120601Z [35s] Timeout after 30s compiling Config(block_sizes=[256, 4096], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=128, num_sm_multiplier=128, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 2], range_unroll_factors=[3, 0], range_warp_specializes=[None, True])
2026-02-21T08:36:31.7137060Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s
2026-02-21T08:36:38.7049031Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.4 configs/s
2026-02-21T08:36:38.7058740Z [42s] Adaptive compile timeout: 30s (90% percentile=7.3s, bounds=[30.0s, 30s])
2026-02-21T08:36:39.4884928Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1261.2 configs/s
2026-02-21T08:36:39.5599454Z [43s] Initial random population of 100, 5 starting points: 
2026-02-21T08:36:39.5601107Z error=6
2026-02-21T08:36:39.5601269Z timeout=3
2026-02-21T08:36:39.5601403Z ok=91
2026-02-21T08:36:39.5601525Z min=0.0492
2026-02-21T08:36:39.5601952Z mid=0.6514
2026-02-21T08:36:39.5602076Z max=45.4574
2026-02-21T08:36:39.5602230Z best={'block_sizes': [2, 1024],
2026-02-21T08:36:39.5602502Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:36:39.5602799Z  'load_eviction_policies': ['first', ''],
2026-02-21T08:36:39.5602998Z  'num_sm_multiplier': 64,
2026-02-21T08:36:39.5603157Z  'num_stages': 5,
2026-02-21T08:36:39.5603303Z  'num_warps': 1,
2026-02-21T08:36:39.5603460Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:36:39.5603660Z  'range_flattens': [True, True],
2026-02-21T08:36:39.5603837Z  'range_multi_buffers': [False, None],
2026-02-21T08:36:39.5604032Z  'range_num_stages': [3, 1],
2026-02-21T08:36:39.5604199Z  'range_unroll_factors': [0, 2],
2026-02-21T08:36:39.5604403Z  'range_warp_specializes': [True, None]}
2026-02-21T08:36:39.5613414Z [43s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:36:40.6639704Z [44s] Generation 1 starting: 80 neighbors, 5 active search path(s)
2026-02-21T08:37:06.7963521Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 0.6 configs/s
2026-02-21T08:37:11.9294633Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 16.7 configs/s
2026-02-21T08:37:18.4073461Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 173.5 configs/s
2026-02-21T08:37:18.7708940Z [82s] Generation 1 complete: 
2026-02-21T08:37:18.7713019Z ok=86
2026-02-21T08:37:18.7714419Z min=0.0410
2026-02-21T08:37:18.7714594Z mid=0.0512
2026-02-21T08:37:18.7714732Z max=2.4760
2026-02-21T08:37:18.7714877Z best={'block_sizes': [2, 4096],
2026-02-21T08:37:18.7715695Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:37:18.7716024Z  'load_eviction_policies': ['first', ''],
2026-02-21T08:37:18.7716230Z  'num_sm_multiplier': 64,
2026-02-21T08:37:18.7716399Z  'num_stages': 5,
2026-02-21T08:37:18.7716551Z  'num_warps': 4,
2026-02-21T08:37:18.7716715Z  'pid_type': 'persistent_blocked',
2026-02-21T08:37:18.7716916Z  'range_flattens': [True, True],
2026-02-21T08:37:18.7717100Z  'range_multi_buffers': [None, None],
2026-02-21T08:37:18.7717298Z  'range_num_stages': [3, 1],
2026-02-21T08:37:18.7717478Z  'range_unroll_factors': [0, 2],
2026-02-21T08:37:18.7717663Z  'range_warp_specializes': [True, None]}
2026-02-21T08:37:18.7723677Z [82s] Fitting surrogate: 186 points, 186 targets
2026-02-21T08:37:19.7710214Z [83s] Generation 2 starting: 68 neighbors, 5 active search path(s)
2026-02-21T08:37:39.8361353Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 0.8 configs/s
2026-02-21T08:37:44.0770892Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 71/71 16.9 configs/s
2026-02-21T08:37:48.4698164Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 230.6 configs/s
2026-02-21T08:37:48.7441860Z [112s] Generation 2 complete: 
2026-02-21T08:37:48.7446028Z ok=74
2026-02-21T08:37:48.7450427Z min=0.0328
2026-02-21T08:37:48.7455492Z mid=0.0450
2026-02-21T08:37:48.7457497Z max=0.4588
2026-02-21T08:37:48.7457677Z best={'block_sizes': [1, 8192],
2026-02-21T08:37:48.7457957Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:37:48.7458254Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:37:48.7458465Z  'num_stages': 4,
2026-02-21T08:37:48.7458616Z  'num_warps': 16,
2026-02-21T08:37:48.7458757Z  'pid_type': 'flat',
2026-02-21T08:37:48.7458922Z  'range_flattens': [None, False],
2026-02-21T08:37:48.7459105Z  'range_multi_buffers': [None, False],
2026-02-21T08:37:48.7459296Z  'range_num_stages': [0, 3],
2026-02-21T08:37:48.7459482Z  'range_unroll_factors': [0, 0],
2026-02-21T08:37:48.7459682Z  'range_warp_specializes': [None, True]}
2026-02-21T08:37:48.7461403Z [112s] Fitting surrogate: 260 points, 260 targets
2026-02-21T08:37:49.5764443Z [113s] Generation 3 starting: 61 neighbors, 5 active search path(s)
2026-02-21T08:37:55.2811717Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 8.8 configs/s
2026-02-21T08:37:58.9934724Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 62/62 16.9 configs/s
2026-02-21T08:38:01.6359132Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 462.6 configs/s
2026-02-21T08:38:01.7915558Z [125s] Generation 3 complete: 
2026-02-21T08:38:01.7919841Z ok=66
2026-02-21T08:38:01.7923597Z min=0.0266
2026-02-21T08:38:01.7928226Z mid=0.0410
2026-02-21T08:38:01.7930222Z max=0.0655
2026-02-21T08:38:01.7930425Z best={'block_sizes': [1, 8192],
2026-02-21T08:38:01.7930731Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:38:01.7931379Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:38:01.7931782Z  'num_stages': 4,
2026-02-21T08:38:01.7931951Z  'num_warps': 4,
2026-02-21T08:38:01.7932097Z  'pid_type': 'flat',
2026-02-21T08:38:01.7932268Z  'range_flattens': [None, False],
2026-02-21T08:38:01.7932448Z  'range_multi_buffers': [None, False],
2026-02-21T08:38:01.7932641Z  'range_num_stages': [0, 3],
2026-02-21T08:38:01.7932814Z  'range_unroll_factors': [0, 0],
2026-02-21T08:38:01.7933089Z  'range_warp_specializes': [None, True]}
2026-02-21T08:38:01.7933304Z [125s] Fitting surrogate: 326 points, 326 targets
2026-02-21T08:38:02.6800582Z [126s] Generation 4 starting: 58 neighbors, 5 active search path(s)
2026-02-21T08:38:08.6009109Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60/60 35.5 configs/s
2026-02-21T08:38:12.2119006Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 60/60 16.8 configs/s
2026-02-21T08:38:15.9497348Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 271.3 configs/s
2026-02-21T08:38:16.1915016Z [139s] Generation 4 complete: 
2026-02-21T08:38:16.1915285Z ok=63
2026-02-21T08:38:16.1915463Z min=0.0266
2026-02-21T08:38:16.1915604Z mid=0.0369
2026-02-21T08:38:16.1915770Z max=0.0746
2026-02-21T08:38:16.1915918Z best={'block_sizes': [1, 8192],
2026-02-21T08:38:16.1916222Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:38:16.1916566Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:38:16.1916770Z  'num_stages': 4,
2026-02-21T08:38:16.1916916Z  'num_warps': 1,
2026-02-21T08:38:16.1917056Z  'pid_type': 'flat',
2026-02-21T08:38:16.1917218Z  'range_flattens': [None, False],
2026-02-21T08:38:16.1917401Z  'range_multi_buffers': [None, False],
2026-02-21T08:38:16.1917584Z  'range_num_stages': [0, 3],
2026-02-21T08:38:16.1917773Z  'range_unroll_factors': [0, 0],
2026-02-21T08:38:16.1917963Z  'range_warp_specializes': [None, True]}
2026-02-21T08:38:16.1934882Z [139s] Fitting surrogate: 389 points, 389 targets
2026-02-21T08:38:16.7837837Z [140s] Generation 5 starting: 32 neighbors, 3 active search path(s)
2026-02-21T08:38:20.8386120Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 3.0 configs/s
2026-02-21T08:38:22.8952922Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 34/34 16.9 configs/s
2026-02-21T08:38:25.2256988Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 434.6 configs/s
2026-02-21T08:38:25.3995512Z [149s] Generation 5 complete: 
2026-02-21T08:38:25.3997581Z ok=35
2026-02-21T08:38:25.3997799Z min=0.0266
2026-02-21T08:38:25.3998029Z mid=0.0285
2026-02-21T08:38:25.3998199Z max=0.0553
2026-02-21T08:38:25.3998372Z best={'block_sizes': [1, 8192],
2026-02-21T08:38:25.4002079Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:38:25.4006232Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:38:25.4007630Z  'num_stages': 4,
2026-02-21T08:38:25.4007805Z  'num_warps': 1,
2026-02-21T08:38:25.4007965Z  'pid_type': 'flat',
2026-02-21T08:38:25.4008127Z  'range_flattens': [None, False],
2026-02-21T08:38:25.4008318Z  'range_multi_buffers': [None, False],
2026-02-21T08:38:25.4008510Z  'range_num_stages': [0, 3],
2026-02-21T08:38:25.4008678Z  'range_unroll_factors': [0, 0],
2026-02-21T08:38:25.4008865Z  'range_warp_specializes': [None, True]}
2026-02-21T08:38:25.4011440Z [149s] Fitting surrogate: 424 points, 424 targets
2026-02-21T08:38:25.7799645Z [149s] Generation 6 starting: 16 neighbors, 2 active search path(s)
2026-02-21T08:38:28.0391429Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 4.9 configs/s
2026-02-21T08:38:29.0022873Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 16/16 17.4 configs/s
2026-02-21T08:38:30.1717952Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 861.9 configs/s
2026-02-21T08:38:30.2616661Z [154s] Generation 6 complete: 
2026-02-21T08:38:30.2620919Z ok=18
2026-02-21T08:38:30.2624915Z min=0.0266
2026-02-21T08:38:30.2629339Z mid=0.0266
2026-02-21T08:38:30.2633817Z max=0.0471
2026-02-21T08:38:30.2637844Z best={'block_sizes': [1, 8192],
2026-02-21T08:38:30.2642490Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:38:30.2642849Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:38:30.2647325Z  'num_stages': 5,
2026-02-21T08:38:30.2648883Z  'num_warps': 4,
2026-02-21T08:38:30.2649077Z  'pid_type': 'flat',
2026-02-21T08:38:30.2649262Z  'range_flattens': [None, False],
2026-02-21T08:38:30.2649466Z  'range_multi_buffers': [None, None],
2026-02-21T08:38:30.2649661Z  'range_num_stages': [0, 0],
2026-02-21T08:38:30.2649864Z  'range_unroll_factors': [0, 0],
2026-02-21T08:38:30.2650067Z  'range_warp_specializes': [None, True]}
2026-02-21T08:38:30.2650376Z [154s] Fitting surrogate: 442 points, 442 targets
2026-02-21T08:38:30.4986275Z [154s] Generation 7 starting: 6 neighbors, 1 active search path(s)
2026-02-21T08:38:31.5266962Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6/6 3.9 configs/s
2026-02-21T08:38:31.8846988Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━━ 6/6 19.5 configs/s
2026-02-21T08:38:32.3648171Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2056.5 configs/s
2026-02-21T08:38:32.4121317Z [156s] Generation 7 complete: 
2026-02-21T08:38:32.4122934Z ok=7
2026-02-21T08:38:32.4123095Z min=0.0266
2026-02-21T08:38:32.4123287Z mid=0.0266
2026-02-21T08:38:32.4123421Z max=0.0328
2026-02-21T08:38:32.4128435Z best={'block_sizes': [1, 8192],
2026-02-21T08:38:32.4130120Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:38:32.4130462Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:38:32.4130656Z  'num_stages': 5,
2026-02-21T08:38:32.4130806Z  'num_warps': 4,
2026-02-21T08:38:32.4130951Z  'pid_type': 'flat',
2026-02-21T08:38:32.4131116Z  'range_flattens': [None, False],
2026-02-21T08:38:32.4131294Z  'range_multi_buffers': [None, None],
2026-02-21T08:38:32.4131485Z  'range_num_stages': [0, 0],
2026-02-21T08:38:32.4131738Z  'range_unroll_factors': [0, 0],
2026-02-21T08:38:32.4131919Z  'range_warp_specializes': [None, True]}
2026-02-21T08:38:32.4140810Z [156s] Fitting surrogate: 449 points, 449 targets
2026-02-21T08:38:32.5831782Z [156s] Autotuning complete in 156.3s after searching 431 configs.
2026-02-21T08:38:32.5832189Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:38:32.5833185Z     @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=5, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:38:32.5836879Z 
2026-02-21T08:38:32.5837287Z [156s] Code of selected kernel: /tmp/torchinductor_root/ia/cia5lcrlnyyqtlkltxwguhwcq43qd6izyhmsy7zouruem2pnjllu.py
2026-02-21T08:38:33.5729832Z WARNING:tritonbench.utils.triton_op:Completed input ID 51:
2026-02-21T08:38:33.5734322Z (M, N)
2026-02-21T08:38:33.5738812Z ------------
2026-02-21T08:38:33.5743302Z (4096, 6784)
2026-02-21T08:38:33.5746975Z 
2026-02-21T08:38:33.5751298Z  55%|█████▌    | 11/20 [29:31<26:02, 173.61s/it]
2026-02-21T08:38:33.5751298Z WARNING:tritonbench.utils.triton_op:Running input ID 56:
2026-02-21T08:38:33.5755326Z (M, N)
2026-02-21T08:38:33.5755565Z ------------
2026-02-21T08:38:33.5755768Z (4096, 7424)
2026-02-21T08:38:33.5756388Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T08:38:34.7448771Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T08:38:36.2672571Z INFO:tritonbench.utils.triton_op:Took 2.19ms to get benchmark function for torch_compile_softmax
2026-02-21T08:38:39.7742364Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:38:39.7744304Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:38:39.7744585Z               'dtype': 'torch.float16',
2026-02-21T08:38:39.7744794Z               'shape': (4096, 7424),
2026-02-21T08:38:39.7749107Z               'stride': (7424, 1)},),
2026-02-21T08:38:39.7752298Z   'kwargs': {}}
2026-02-21T08:38:39.7762672Z INFO:tritonbench.utils.triton_op:Took 2.17ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T08:38:39.9508107Z [0s] Autotune random seed: 2134816249
2026-02-21T08:38:40.0884635Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:39:15.3859240Z [35s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None])
2026-02-21T08:39:15.8440229Z [35s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False])
2026-02-21T08:39:16.0989151Z [36s] Timeout after 30s compiling Config(block_sizes=[256, 4096], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=128, num_sm_multiplier=128, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 2], range_unroll_factors=[3, 0], range_warp_specializes=[None, True])
2026-02-21T08:39:16.1007610Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s
2026-02-21T08:39:19.2154771Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:39:19.2157566Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:39:19.2158038Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:39:19.2158237Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:39:19.2158416Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:39:19.2158595Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:39:19.2158822Z     %cst = arith.constant dense<7424> : tensor<32x1xi32>
2026-02-21T08:39:19.2159464Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<32xf32>
2026-02-21T08:39:19.2159729Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<32xf32>
2026-02-21T08:39:19.2159944Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:39:19.2160135Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:39:19.2160319Z     %c7424_i32 = arith.constant 7424 : i32
2026-02-21T08:39:19.2160504Z     %c7424_i64 = arith.constant 7424 : i64
2026-02-21T08:39:19.2160682Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:39:19.2161003Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c7424_i32], [%c7424_i64, %c1_i64] : <f16>, <tensor<32x8xf16>>
2026-02-21T08:39:19.2161447Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c7424_i32], [%c7424_i64, %c1_i64] : <f16>, <tensor<32x8xf16>>
2026-02-21T08:39:19.2161934Z     %2 = tt.get_program_id x : i32
2026-02-21T08:39:19.2162124Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:39:19.2162456Z     %4 = arith.minsi %3, %c128_i32 : i32
2026-02-21T08:39:19.2162679Z     scf.for %arg2 = %2 to %4 step %c1_i32  : i32 {
2026-02-21T08:39:19.2162889Z       %5 = arith.muli %arg2, %c32_i32 : i32
2026-02-21T08:39:19.2163184Z       %6 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32>
2026-02-21T08:39:19.2163438Z       %7 = tt.splat %5 : i32 -> tensor<32xi32>
2026-02-21T08:39:19.2163646Z       %8 = arith.addi %7, %6 : tensor<32xi32>
2026-02-21T08:39:19.2163834Z       %c7416_i32 = arith.constant 7416 : i32
2026-02-21T08:39:19.2164024Z       %c24_i32 = arith.constant 24 : i32
2026-02-21T08:39:19.2164382Z       %9:2 = scf.for %arg3 = %c0_i32 to %c7416_i32 step %c24_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<32xf32>, tensor<32xf32>)  : i32 {
2026-02-21T08:39:19.2164790Z         %49 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:39:19.2165044Z         %50 = tt.splat %arg3 : i32 -> tensor<8xi32>
2026-02-21T08:39:19.2165244Z         %51 = arith.addi %50, %49 : tensor<8xi32>
2026-02-21T08:39:19.2165510Z         %52 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:39:19.2165766Z         %53 = arith.muli %52, %cst : tensor<32x1xi32>
2026-02-21T08:39:19.2166019Z         %54 = tt.expand_dims %51 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:39:19.2166300Z         %55 = tt.broadcast %53 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:39:19.2166561Z         %56 = tt.broadcast %54 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:39:19.2166797Z         %57 = arith.addi %55, %56 : tensor<32x8xi32>
2026-02-21T08:39:19.2167030Z         %58 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:39:19.2167307Z         %59 = tt.addptr %58, %57 : tensor<32x8x!tt.ptr<f16>>, tensor<32x8xi32>
2026-02-21T08:39:19.2167594Z         %60 = tt.load %59 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:39:19.2167878Z         %61 = arith.extf %60 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:39:19.2168109Z         %62 = "tt.reduce"(%61) <{axis = 1 : i32}> ({
2026-02-21T08:39:19.2168301Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:39:19.2168497Z           %140 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:39:19.2168689Z           tt.reduce.return %140 : f32
2026-02-21T08:39:19.2168877Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:39:19.2169094Z         %63 = arith.truncf %62 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:39:19.2169334Z         %64 = arith.extf %63 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:39:19.2169560Z         %65 = arith.cmpf ogt, %arg4, %64 : tensor<32xf32>
2026-02-21T08:39:19.2169788Z         %66 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32>
2026-02-21T08:39:19.2170005Z         %67 = arith.ori %65, %66 : tensor<32xi1>
2026-02-21T08:39:19.2170231Z         %68 = arith.select %67, %arg4, %64 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:39:19.2170474Z         %69 = arith.subf %arg4, %68 : tensor<32xf32>
2026-02-21T08:39:19.2170831Z         %70 = tt.extern_elementwise %69 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:39:19.2171281Z         %71 = arith.mulf %arg5, %70 : tensor<32xf32>
2026-02-21T08:39:19.2171570Z         %72 = tt.expand_dims %68 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:19.2171865Z         %73 = tt.broadcast %72 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:39:19.2172113Z         %74 = arith.subf %61, %73 : tensor<32x8xf32>
2026-02-21T08:39:19.2172464Z         %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:39:19.2172827Z         %76 = "tt.reduce"(%75) <{axis = 1 : i32}> ({
2026-02-21T08:39:19.2173017Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:39:19.2173208Z           %140 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:39:19.2173403Z           tt.reduce.return %140 : f32
2026-02-21T08:39:19.2173649Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:39:19.2173855Z         %77 = arith.addf %71, %76 : tensor<32xf32>
2026-02-21T08:39:19.2174046Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T08:39:19.2174243Z         %78 = arith.muli %c8_i32, %c1_i32_4 : i32
2026-02-21T08:39:19.2174432Z         %79 = arith.addi %arg3, %78 : i32
2026-02-21T08:39:19.2174666Z         %80 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:39:19.2174909Z         %81 = tt.splat %79 : i32 -> tensor<8xi32>
2026-02-21T08:39:19.2175108Z         %82 = arith.addi %81, %80 : tensor<8xi32>
2026-02-21T08:39:19.2175387Z         %83 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:39:19.2175646Z         %84 = arith.muli %83, %cst : tensor<32x1xi32>
2026-02-21T08:39:19.2175896Z         %85 = tt.expand_dims %82 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:39:19.2176170Z         %86 = tt.broadcast %84 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:39:19.2176431Z         %87 = tt.broadcast %85 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:39:19.2176665Z         %88 = arith.addi %86, %87 : tensor<32x8xi32>
2026-02-21T08:39:19.2176896Z         %89 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:39:19.2177167Z         %90 = tt.addptr %89, %88 : tensor<32x8x!tt.ptr<f16>>, tensor<32x8xi32>
2026-02-21T08:39:19.2177455Z         %91 = tt.load %90 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:39:19.2177741Z         %92 = arith.extf %91 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:39:19.2177963Z         %93 = "tt.reduce"(%92) <{axis = 1 : i32}> ({
2026-02-21T08:39:19.2178157Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:39:19.2178342Z           %140 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:39:19.2178530Z           tt.reduce.return %140 : f32
2026-02-21T08:39:19.2178719Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:39:19.2178937Z         %94 = arith.truncf %93 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:39:19.2179185Z         %95 = arith.extf %94 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:39:19.2179411Z         %96 = arith.cmpf ogt, %68, %95 : tensor<32xf32>
2026-02-21T08:39:19.2179631Z         %97 = arith.cmpf une, %68, %68 : tensor<32xf32>
2026-02-21T08:39:19.2179830Z         %98 = arith.ori %96, %97 : tensor<32xi1>
2026-02-21T08:39:19.2180063Z         %99 = arith.select %98, %68, %95 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:39:19.2180302Z         %100 = arith.subf %68, %99 : tensor<32xf32>
2026-02-21T08:39:19.2180663Z         %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:39:19.2181027Z         %102 = arith.mulf %77, %101 : tensor<32xf32>
2026-02-21T08:39:19.2181277Z         %103 = tt.expand_dims %99 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:19.2181603Z         %104 = tt.broadcast %103 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:39:19.2181909Z         %105 = arith.subf %92, %104 : tensor<32x8xf32>
2026-02-21T08:39:19.2182263Z         %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:39:19.2182627Z         %107 = "tt.reduce"(%106) <{axis = 1 : i32}> ({
2026-02-21T08:39:19.2182817Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:39:19.2183006Z           %140 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:39:19.2183192Z           tt.reduce.return %140 : f32
2026-02-21T08:39:19.2183380Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:39:19.2183585Z         %108 = arith.addf %102, %107 : tensor<32xf32>
2026-02-21T08:39:19.2183780Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:39:19.2183972Z         %109 = arith.muli %c8_i32, %c2_i32 : i32
2026-02-21T08:39:19.2184160Z         %110 = arith.addi %arg3, %109 : i32
2026-02-21T08:39:19.2184463Z         %111 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:39:19.2184714Z         %112 = tt.splat %110 : i32 -> tensor<8xi32>
2026-02-21T08:39:19.2184922Z         %113 = arith.addi %112, %111 : tensor<8xi32>
2026-02-21T08:39:19.2185179Z         %114 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:39:19.2185446Z         %115 = arith.muli %114, %cst : tensor<32x1xi32>
2026-02-21T08:39:19.2185703Z         %116 = tt.expand_dims %113 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:39:19.2185990Z         %117 = tt.broadcast %115 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:39:19.2186258Z         %118 = tt.broadcast %116 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:39:19.2186490Z         %119 = arith.addi %117, %118 : tensor<32x8xi32>
2026-02-21T08:39:19.2186734Z         %120 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:39:19.2187024Z         %121 = tt.addptr %120, %119 : tensor<32x8x!tt.ptr<f16>>, tensor<32x8xi32>
2026-02-21T08:39:19.2187331Z         %122 = tt.load %121 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:39:19.2187638Z         %123 = arith.extf %122 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:39:19.2187866Z         %124 = "tt.reduce"(%123) <{axis = 1 : i32}> ({
2026-02-21T08:39:19.2188064Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:39:19.2188244Z           %140 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:39:19.2188440Z           tt.reduce.return %140 : f32
2026-02-21T08:39:19.2188628Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:39:19.2188849Z         %125 = arith.truncf %124 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:39:19.2189100Z         %126 = arith.extf %125 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:39:19.2189333Z         %127 = arith.cmpf ogt, %99, %126 : tensor<32xf32>
2026-02-21T08:39:19.2189558Z         %128 = arith.cmpf une, %99, %99 : tensor<32xf32>
2026-02-21T08:39:19.2189761Z         %129 = arith.ori %127, %128 : tensor<32xi1>
2026-02-21T08:39:19.2190004Z         %130 = arith.select %129, %99, %126 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:39:19.2190250Z         %131 = arith.subf %99, %130 : tensor<32xf32>
2026-02-21T08:39:19.2190603Z         %132 = tt.extern_elementwise %131 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:39:19.2190970Z         %133 = arith.mulf %108, %132 : tensor<32xf32>
2026-02-21T08:39:19.2191224Z         %134 = tt.expand_dims %130 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:19.2191521Z         %135 = tt.broadcast %134 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:39:19.2191792Z         %136 = arith.subf %123, %135 : tensor<32x8xf32>
2026-02-21T08:39:19.2192171Z         %137 = tt.extern_elementwise %136 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:39:19.2192561Z         %138 = "tt.reduce"(%137) <{axis = 1 : i32}> ({
2026-02-21T08:39:19.2192762Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:39:19.2193025Z           %140 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:39:19.2193220Z           tt.reduce.return %140 : f32
2026-02-21T08:39:19.2193422Z         }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:39:19.2193630Z         %139 = arith.addf %133, %138 : tensor<32xf32>
2026-02-21T08:39:19.2193864Z         scf.yield %130, %139 : tensor<32xf32>, tensor<32xf32>
2026-02-21T08:39:19.2194128Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:39:19.2194400Z       %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:39:19.2194661Z       %11 = tt.splat %c7416_i32 : i32 -> tensor<8xi32>
2026-02-21T08:39:19.2194871Z       %12 = arith.addi %11, %10 : tensor<8xi32>
2026-02-21T08:39:19.2195133Z       %13 = tt.expand_dims %8 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:39:19.2195406Z       %14 = arith.muli %13, %cst : tensor<32x1xi32>
2026-02-21T08:39:19.2195765Z       %15 = tt.expand_dims %12 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:39:19.2196067Z       %16 = tt.broadcast %14 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:39:19.2196329Z       %17 = tt.broadcast %15 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:39:19.2196570Z       %18 = arith.addi %16, %17 : tensor<32x8xi32>
2026-02-21T08:39:19.2196815Z       %19 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:39:19.2197099Z       %20 = tt.addptr %19, %18 : tensor<32x8x!tt.ptr<f16>>, tensor<32x8xi32>
2026-02-21T08:39:19.2197407Z       %21 = tt.load %20 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f16>>
2026-02-21T08:39:19.2197699Z       %22 = arith.extf %21 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:39:19.2197965Z       %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({
2026-02-21T08:39:19.2198168Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:39:19.2198355Z         %49 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:39:19.2198566Z         tt.reduce.return %49 : f32
2026-02-21T08:39:19.2198759Z       }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:39:19.2198994Z       %24 = arith.truncf %23 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:39:19.2199244Z       %25 = arith.extf %24 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:39:19.2199485Z       %26 = arith.cmpf ogt, %9#0, %25 : tensor<32xf32>
2026-02-21T08:39:19.2199714Z       %27 = arith.cmpf une, %9#0, %9#0 : tensor<32xf32>
2026-02-21T08:39:19.2199943Z       %28 = arith.ori %26, %27 : tensor<32xi1>
2026-02-21T08:39:19.2200195Z       %29 = arith.select %28, %9#0, %25 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:39:19.2200427Z       %30 = arith.subf %9#0, %29 : tensor<32xf32>
2026-02-21T08:39:19.2200789Z       %31 = tt.extern_elementwise %30 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:39:19.2201144Z       %32 = arith.mulf %9#1, %31 : tensor<32xf32>
2026-02-21T08:39:19.2201401Z       %33 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:19.2201718Z       %34 = tt.broadcast %33 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:39:19.2201944Z       %35 = arith.subf %22, %34 : tensor<32x8xf32>
2026-02-21T08:39:19.2202301Z       %36 = tt.extern_elementwise %35 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:39:19.2202654Z       %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({
2026-02-21T08:39:19.2202848Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:39:19.2203023Z         %49 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:39:19.2203217Z         tt.reduce.return %49 : f32
2026-02-21T08:39:19.2203407Z       }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:39:19.2203600Z       %38 = arith.addf %32, %37 : tensor<32xf32>
2026-02-21T08:39:19.2203803Z       %c7416_i32_2 = arith.constant 7416 : i32
2026-02-21T08:39:19.2203992Z       %c24_i32_3 = arith.constant 24 : i32
2026-02-21T08:39:19.2204227Z       scf.for %arg3 = %c0_i32 to %c7416_i32_2 step %c24_i32_3  : i32 {
2026-02-21T08:39:19.2204615Z         %49 = tt.descriptor_load %0[%5, %arg3] : !tt.tensordesc<tensor<32x8xf16>> -> tensor<32x8xf16>
2026-02-21T08:39:19.2204966Z         %50 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:19.2205259Z         %51 = arith.extf %49 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:39:19.2205513Z         %52 = tt.broadcast %50 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:39:19.2205751Z         %53 = arith.subf %51, %52 : tensor<32x8xf32>
2026-02-21T08:39:19.2206114Z         %54 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:39:19.2206524Z         %55 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:19.2206808Z         %56 = tt.broadcast %55 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:39:19.2207093Z         %57 = arith.divf %54, %56 : tensor<32x8xf32>
2026-02-21T08:39:19.2207328Z         %58 = arith.truncf %57 : tensor<32x8xf32> to tensor<32x8xf16>
2026-02-21T08:39:19.2207633Z         tt.descriptor_store %1[%5, %arg3], %58 : !tt.tensordesc<tensor<32x8xf16>>, tensor<32x8xf16>
2026-02-21T08:39:19.2207919Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T08:39:19.2208109Z         %59 = arith.muli %c8_i32, %c1_i32_4 : i32
2026-02-21T08:39:19.2208300Z         %60 = arith.addi %arg3, %59 : i32
2026-02-21T08:39:19.2208560Z         %61 = tt.descriptor_load %0[%5, %60] : !tt.tensordesc<tensor<32x8xf16>> -> tensor<32x8xf16>
2026-02-21T08:39:19.2208890Z         %62 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:19.2209174Z         %63 = arith.extf %61 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:39:19.2209418Z         %64 = tt.broadcast %62 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:39:19.2209649Z         %65 = arith.subf %63, %64 : tensor<32x8xf32>
2026-02-21T08:39:19.2210006Z         %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:39:19.2210419Z         %67 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:19.2210709Z         %68 = tt.broadcast %67 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:39:19.2210931Z         %69 = arith.divf %66, %68 : tensor<32x8xf32>
2026-02-21T08:39:19.2211160Z         %70 = arith.truncf %69 : tensor<32x8xf32> to tensor<32x8xf16>
2026-02-21T08:39:19.2211456Z         tt.descriptor_store %1[%5, %60], %70 : !tt.tensordesc<tensor<32x8xf16>>, tensor<32x8xf16>
2026-02-21T08:39:19.2211770Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:39:19.2211957Z         %71 = arith.muli %c8_i32, %c2_i32 : i32
2026-02-21T08:39:19.2212149Z         %72 = arith.addi %arg3, %71 : i32
2026-02-21T08:39:19.2212416Z         %73 = tt.descriptor_load %0[%5, %72] : !tt.tensordesc<tensor<32x8xf16>> -> tensor<32x8xf16>
2026-02-21T08:39:19.2212744Z         %74 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:19.2213027Z         %75 = arith.extf %73 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:39:19.2213272Z         %76 = tt.broadcast %74 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:39:19.2213505Z         %77 = arith.subf %75, %76 : tensor<32x8xf32>
2026-02-21T08:39:19.2213863Z         %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:39:19.2214263Z         %79 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:19.2214550Z         %80 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:39:19.2214777Z         %81 = arith.divf %78, %80 : tensor<32x8xf32>
2026-02-21T08:39:19.2215016Z         %82 = arith.truncf %81 : tensor<32x8xf32> to tensor<32x8xf16>
2026-02-21T08:39:19.2215315Z         tt.descriptor_store %1[%5, %72], %82 : !tt.tensordesc<tensor<32x8xf16>>, tensor<32x8xf16>
2026-02-21T08:39:19.2215682Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:39:19.2216006Z       %39 = tt.descriptor_load %0[%5, %c7416_i32_2] : !tt.tensordesc<tensor<32x8xf16>> -> tensor<32x8xf16>
2026-02-21T08:39:19.2216346Z       %40 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:19.2216633Z       %41 = arith.extf %39 : tensor<32x8xf16> to tensor<32x8xf32>
2026-02-21T08:39:19.2216875Z       %42 = tt.broadcast %40 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:39:19.2217105Z       %43 = arith.subf %41, %42 : tensor<32x8xf32>
2026-02-21T08:39:19.2217461Z       %44 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:39:19.2217861Z       %45 = tt.expand_dims %38 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:19.2218201Z       %46 = tt.broadcast %45 : tensor<32x1xf32> -> tensor<32x8xf32>
2026-02-21T08:39:19.2218426Z       %47 = arith.divf %44, %46 : tensor<32x8xf32>
2026-02-21T08:39:19.2218652Z       %48 = arith.truncf %47 : tensor<32x8xf32> to tensor<32x8xf16>
2026-02-21T08:39:19.2218963Z       tt.descriptor_store %1[%5, %c7416_i32_2], %48 : !tt.tensordesc<tensor<32x8xf16>>, tensor<32x8xf16>
2026-02-21T08:39:19.2219271Z     } {tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:39:19.2219468Z     tt.return
2026-02-21T08:39:19.2219595Z   }
2026-02-21T08:39:19.2219723Z }
2026-02-21T08:39:19.2219791Z 
2026-02-21T08:39:19.2219840Z {-#
2026-02-21T08:39:19.2219974Z   external_resources: {
2026-02-21T08:39:19.2220130Z     mlir_reproducer: {
2026-02-21T08:39:19.2224513Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:39:19.2228939Z       disable_threading: false,
2026-02-21T08:39:19.2229102Z       verify_each: true
2026-02-21T08:39:19.2229250Z     }
2026-02-21T08:39:19.2229370Z   }
2026-02-21T08:39:19.2229480Z #-}
2026-02-21T08:39:19.2229900Z /tmp/torchinductor_root/2w/c2wcir4gm3rdqiowhzt2k5g2mukrxiva6udznsaoahozanh2bh7g.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:39:19.2231083Z /tmp/torchinductor_root/2w/c2wcir4gm3rdqiowhzt2k5g2mukrxiva6udznsaoahozanh2bh7g.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:39:19.2232134Z [39s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:39:19.2233222Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 8], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[2, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:39:19.2234201Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:39:19.2234460Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:39:20.7489287Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:39:20.7493852Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:39:20.7498883Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:39:20.7500930Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:39:20.7501189Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:39:20.7501423Z     %cst = arith.constant dense<7424> : tensor<32x1xi32>
2026-02-21T08:39:20.7509285Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<32xf32>
2026-02-21T08:39:20.7509547Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<32xf32>
2026-02-21T08:39:20.7509775Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:39:20.7509970Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:39:20.7510151Z     %c7424_i32 = arith.constant 7424 : i32
2026-02-21T08:39:20.7510336Z     %c7424_i64 = arith.constant 7424 : i64
2026-02-21T08:39:20.7510528Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:39:20.7510859Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c7424_i32], [%c7424_i64, %c1_i64] : <f16>, <tensor<32x32xf16>>
2026-02-21T08:39:20.7511178Z     %1 = tt.get_program_id x : i32
2026-02-21T08:39:20.7511395Z     scf.for %arg2 = %1 to %c128_i32 step %c9472_i32  : i32 {
2026-02-21T08:39:20.7511684Z       %2 = arith.muli %arg2, %c32_i32 : i32
2026-02-21T08:39:20.7511931Z       %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32>
2026-02-21T08:39:20.7512189Z       %4 = tt.splat %2 : i32 -> tensor<32xi32>
2026-02-21T08:39:20.7512382Z       %5 = arith.addi %4, %3 : tensor<32xi32>
2026-02-21T08:39:20.7512583Z       %c7392_i32 = arith.constant 7392 : i32
2026-02-21T08:39:20.7512770Z       %c96_i32 = arith.constant 96 : i32
2026-02-21T08:39:20.7513150Z       %6:2 = scf.for %arg3 = %c0_i32 to %c7392_i32 step %c96_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<32xf32>, tensor<32xf32>)  : i32 {
2026-02-21T08:39:20.7513616Z         %47 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:39:20.7513952Z         %48 = arith.extf %47 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:39:20.7514195Z         %49 = "tt.reduce"(%48) <{axis = 1 : i32}> ({
2026-02-21T08:39:20.7514389Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:39:20.7514580Z           %105 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:39:20.7514771Z           tt.reduce.return %105 : f32
2026-02-21T08:39:20.7514962Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:39:20.7515185Z         %50 = arith.truncf %49 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:39:20.7515433Z         %51 = arith.extf %50 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:39:20.7515673Z         %52 = arith.cmpf ogt, %arg4, %51 : tensor<32xf32>
2026-02-21T08:39:20.7515900Z         %53 = arith.cmpf une, %arg4, %arg4 : tensor<32xf32>
2026-02-21T08:39:20.7516164Z         %54 = arith.ori %52, %53 : tensor<32xi1>
2026-02-21T08:39:20.7516711Z         %55 = arith.select %54, %arg4, %51 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:39:20.7516951Z         %56 = arith.subf %arg4, %55 : tensor<32xf32>
2026-02-21T08:39:20.7517318Z         %57 = tt.extern_elementwise %56 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:39:20.7517678Z         %58 = arith.mulf %arg5, %57 : tensor<32xf32>
2026-02-21T08:39:20.7517930Z         %59 = tt.expand_dims %55 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:20.7518226Z         %60 = tt.broadcast %59 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:39:20.7518459Z         %61 = arith.subf %48, %60 : tensor<32x32xf32>
2026-02-21T08:39:20.7518823Z         %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:39:20.7519187Z         %63 = "tt.reduce"(%62) <{axis = 1 : i32}> ({
2026-02-21T08:39:20.7519447Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:39:20.7519642Z           %105 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:39:20.7519830Z           tt.reduce.return %105 : f32
2026-02-21T08:39:20.7520026Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:39:20.7520226Z         %64 = arith.addf %58, %63 : tensor<32xf32>
2026-02-21T08:39:20.7520431Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:39:20.7520626Z         %65 = arith.muli %c32_i32, %c1_i32 : i32
2026-02-21T08:39:20.7520833Z         %66 = arith.addi %arg3, %65 : i32
2026-02-21T08:39:20.7521121Z         %67 = tt.descriptor_load %0[%2, %66] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:39:20.7521440Z         %68 = arith.extf %67 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:39:20.7521715Z         %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({
2026-02-21T08:39:20.7521898Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:39:20.7522087Z           %105 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:39:20.7522279Z           tt.reduce.return %105 : f32
2026-02-21T08:39:20.7522467Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:39:20.7522693Z         %70 = arith.truncf %69 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:39:20.7522930Z         %71 = arith.extf %70 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:39:20.7523162Z         %72 = arith.cmpf ogt, %55, %71 : tensor<32xf32>
2026-02-21T08:39:20.7523373Z         %73 = arith.cmpf une, %55, %55 : tensor<32xf32>
2026-02-21T08:39:20.7523581Z         %74 = arith.ori %72, %73 : tensor<32xi1>
2026-02-21T08:39:20.7523804Z         %75 = arith.select %74, %55, %71 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:39:20.7524041Z         %76 = arith.subf %55, %75 : tensor<32xf32>
2026-02-21T08:39:20.7524391Z         %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:39:20.7524738Z         %78 = arith.mulf %64, %77 : tensor<32xf32>
2026-02-21T08:39:20.7524998Z         %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:20.7525282Z         %80 = tt.broadcast %79 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:39:20.7525519Z         %81 = arith.subf %68, %80 : tensor<32x32xf32>
2026-02-21T08:39:20.7525883Z         %82 = tt.extern_elementwise %81 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:39:20.7526238Z         %83 = "tt.reduce"(%82) <{axis = 1 : i32}> ({
2026-02-21T08:39:20.7526433Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:39:20.7526610Z           %105 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:39:20.7526797Z           tt.reduce.return %105 : f32
2026-02-21T08:39:20.7526976Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:39:20.7527172Z         %84 = arith.addf %78, %83 : tensor<32xf32>
2026-02-21T08:39:20.7527356Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:39:20.7527551Z         %85 = arith.muli %c32_i32, %c2_i32 : i32
2026-02-21T08:39:20.7527809Z         %86 = arith.addi %arg3, %85 : i32
2026-02-21T08:39:20.7528073Z         %87 = tt.descriptor_load %0[%2, %86] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:39:20.7528385Z         %88 = arith.extf %87 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:39:20.7528608Z         %89 = "tt.reduce"(%88) <{axis = 1 : i32}> ({
2026-02-21T08:39:20.7528795Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:39:20.7528973Z           %105 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:39:20.7529167Z           tt.reduce.return %105 : f32
2026-02-21T08:39:20.7529351Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:39:20.7529563Z         %90 = arith.truncf %89 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:39:20.7529804Z         %91 = arith.extf %90 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:39:20.7530027Z         %92 = arith.cmpf ogt, %75, %91 : tensor<32xf32>
2026-02-21T08:39:20.7530288Z         %93 = arith.cmpf une, %75, %75 : tensor<32xf32>
2026-02-21T08:39:20.7530493Z         %94 = arith.ori %92, %93 : tensor<32xi1>
2026-02-21T08:39:20.7530727Z         %95 = arith.select %94, %75, %91 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:39:20.7530962Z         %96 = arith.subf %75, %95 : tensor<32xf32>
2026-02-21T08:39:20.7531306Z         %97 = tt.extern_elementwise %96 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:39:20.7531699Z         %98 = arith.mulf %84, %97 : tensor<32xf32>
2026-02-21T08:39:20.7531950Z         %99 = tt.expand_dims %95 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:20.7532252Z         %100 = tt.broadcast %99 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:39:20.7532494Z         %101 = arith.subf %88, %100 : tensor<32x32xf32>
2026-02-21T08:39:20.7532873Z         %102 = tt.extern_elementwise %101 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:39:20.7533255Z         %103 = "tt.reduce"(%102) <{axis = 1 : i32}> ({
2026-02-21T08:39:20.7533447Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:39:20.7533631Z           %105 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:39:20.7533816Z           tt.reduce.return %105 : f32
2026-02-21T08:39:20.7534003Z         }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:39:20.7534199Z         %104 = arith.addf %98, %103 : tensor<32xf32>
2026-02-21T08:39:20.7534421Z         scf.yield %95, %104 : tensor<32xf32>, tensor<32xf32>
2026-02-21T08:39:20.7534639Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:39:20.7534929Z       %7 = tt.descriptor_load %0[%2, %c7392_i32] : !tt.tensordesc<tensor<32x32xf16>> -> tensor<32x32xf16>
2026-02-21T08:39:20.7535255Z       %8 = arith.extf %7 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:39:20.7535475Z       %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({
2026-02-21T08:39:20.7535668Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:39:20.7535849Z         %47 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:39:20.7536045Z         tt.reduce.return %47 : f32
2026-02-21T08:39:20.7536234Z       }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:39:20.7536449Z       %10 = arith.truncf %9 : tensor<32xf32> to tensor<32xf16>
2026-02-21T08:39:20.7536691Z       %11 = arith.extf %10 : tensor<32xf16> to tensor<32xf32>
2026-02-21T08:39:20.7536913Z       %12 = arith.cmpf ogt, %6#0, %11 : tensor<32xf32>
2026-02-21T08:39:20.7537130Z       %13 = arith.cmpf une, %6#0, %6#0 : tensor<32xf32>
2026-02-21T08:39:20.7537329Z       %14 = arith.ori %12, %13 : tensor<32xi1>
2026-02-21T08:39:20.7537558Z       %15 = arith.select %14, %6#0, %11 : tensor<32xi1>, tensor<32xf32>
2026-02-21T08:39:20.7537794Z       %16 = arith.subf %6#0, %15 : tensor<32xf32>
2026-02-21T08:39:20.7538140Z       %17 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32xf32>) -> tensor<32xf32>
2026-02-21T08:39:20.7538496Z       %18 = arith.mulf %6#1, %17 : tensor<32xf32>
2026-02-21T08:39:20.7538804Z       %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:20.7539090Z       %20 = tt.broadcast %19 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:39:20.7539325Z       %21 = arith.subf %8, %20 : tensor<32x32xf32>
2026-02-21T08:39:20.7539702Z       %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:39:20.7540079Z       %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({
2026-02-21T08:39:20.7540272Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:39:20.7540463Z         %47 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:39:20.7540653Z         tt.reduce.return %47 : f32
2026-02-21T08:39:20.7540849Z       }) : (tensor<32x32xf32>) -> tensor<32xf32>
2026-02-21T08:39:20.7541048Z       %24 = arith.addf %18, %23 : tensor<32xf32>
2026-02-21T08:39:20.7541257Z       %c7392_i32_2 = arith.constant 7392 : i32
2026-02-21T08:39:20.7541520Z       %c96_i32_3 = arith.constant 96 : i32
2026-02-21T08:39:20.7541795Z       scf.for %arg3 = %c0_i32 to %c7392_i32_2 step %c96_i32_3  : i32 {
2026-02-21T08:39:20.7542057Z         %47 = tt.splat %arg3 : i32 -> tensor<32xi32>
2026-02-21T08:39:20.7542267Z         %48 = arith.addi %47, %3 : tensor<32xi32>
2026-02-21T08:39:20.7542530Z         %49 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:39:20.7542805Z         %50 = arith.muli %49, %cst : tensor<32x1xi32>
2026-02-21T08:39:20.7543074Z         %51 = tt.expand_dims %48 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:39:20.7543377Z         %52 = tt.broadcast %50 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:39:20.7543643Z         %53 = tt.broadcast %51 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:39:20.7543890Z         %54 = arith.addi %52, %53 : tensor<32x32xi32>
2026-02-21T08:39:20.7544137Z         %55 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:39:20.7544440Z         %56 = tt.addptr %55, %54 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:39:20.7544756Z         %57 = tt.load %56 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:39:20.7545086Z         %58 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:20.7545388Z         %59 = arith.extf %57 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:39:20.7545650Z         %60 = tt.broadcast %58 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:39:20.7545895Z         %61 = arith.subf %59, %60 : tensor<32x32xf32>
2026-02-21T08:39:20.7546272Z         %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:39:20.7546704Z         %63 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:20.7547002Z         %64 = tt.broadcast %63 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:39:20.7547247Z         %65 = arith.divf %62, %64 : tensor<32x32xf32>
2026-02-21T08:39:20.7547479Z         %66 = arith.truncf %65 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:39:20.7547739Z         %67 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:39:20.7548016Z         %68 = tt.addptr %67, %54 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:39:20.7548267Z         tt.store %68, %66 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:39:20.7548474Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:39:20.7548665Z         %69 = arith.muli %c32_i32, %c1_i32 : i32
2026-02-21T08:39:20.7548851Z         %70 = arith.addi %arg3, %69 : i32
2026-02-21T08:39:20.7549046Z         %71 = tt.splat %70 : i32 -> tensor<32xi32>
2026-02-21T08:39:20.7549240Z         %72 = arith.addi %71, %3 : tensor<32xi32>
2026-02-21T08:39:20.7549485Z         %73 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:39:20.7549744Z         %74 = arith.muli %73, %cst : tensor<32x1xi32>
2026-02-21T08:39:20.7550045Z         %75 = tt.expand_dims %72 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:39:20.7550328Z         %76 = tt.broadcast %74 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:39:20.7550577Z         %77 = tt.broadcast %75 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:39:20.7550811Z         %78 = arith.addi %76, %77 : tensor<32x32xi32>
2026-02-21T08:39:20.7551038Z         %79 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:39:20.7551310Z         %80 = tt.addptr %79, %78 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:39:20.7551628Z         %81 = tt.load %80 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:39:20.7551935Z         %82 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:20.7552218Z         %83 = arith.extf %81 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:39:20.7552537Z         %84 = tt.broadcast %82 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:39:20.7552771Z         %85 = arith.subf %83, %84 : tensor<32x32xf32>
2026-02-21T08:39:20.7553123Z         %86 = tt.extern_elementwise %85 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:39:20.7553535Z         %87 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:20.7553819Z         %88 = tt.broadcast %87 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:39:20.7554044Z         %89 = arith.divf %86, %88 : tensor<32x32xf32>
2026-02-21T08:39:20.7554279Z         %90 = arith.truncf %89 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:39:20.7554535Z         %91 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:39:20.7554807Z         %92 = tt.addptr %91, %78 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:39:20.7555049Z         tt.store %92, %90 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:39:20.7555259Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:39:20.7555456Z         %93 = arith.muli %c32_i32, %c2_i32 : i32
2026-02-21T08:39:20.7555637Z         %94 = arith.addi %arg3, %93 : i32
2026-02-21T08:39:20.7555828Z         %95 = tt.splat %94 : i32 -> tensor<32xi32>
2026-02-21T08:39:20.7556021Z         %96 = arith.addi %95, %3 : tensor<32xi32>
2026-02-21T08:39:20.7556264Z         %97 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:39:20.7556514Z         %98 = arith.muli %97, %cst : tensor<32x1xi32>
2026-02-21T08:39:20.7556762Z         %99 = tt.expand_dims %96 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:39:20.7557050Z         %100 = tt.broadcast %98 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:39:20.7557309Z         %101 = tt.broadcast %99 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:39:20.7557560Z         %102 = arith.addi %100, %101 : tensor<32x32xi32>
2026-02-21T08:39:20.7557816Z         %103 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:39:20.7558110Z         %104 = tt.addptr %103, %102 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:39:20.7558424Z         %105 = tt.load %104 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:39:20.7558734Z         %106 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:20.7559029Z         %107 = arith.extf %105 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:39:20.7559284Z         %108 = tt.broadcast %106 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:39:20.7559530Z         %109 = arith.subf %107, %108 : tensor<32x32xf32>
2026-02-21T08:39:20.7559898Z         %110 = tt.extern_elementwise %109 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:39:20.7560319Z         %111 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:20.7560613Z         %112 = tt.broadcast %111 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:39:20.7560903Z         %113 = arith.divf %110, %112 : tensor<32x32xf32>
2026-02-21T08:39:20.7561150Z         %114 = arith.truncf %113 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:39:20.7561419Z         %115 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:39:20.7561738Z         %116 = tt.addptr %115, %102 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:39:20.7562003Z         tt.store %116, %114 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:39:20.7562213Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:39:20.7562431Z       %25 = tt.splat %c7392_i32_2 : i32 -> tensor<32xi32>
2026-02-21T08:39:20.7562637Z       %26 = arith.addi %25, %3 : tensor<32xi32>
2026-02-21T08:39:20.7562885Z       %27 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:39:20.7563138Z       %28 = arith.muli %27, %cst : tensor<32x1xi32>
2026-02-21T08:39:20.7563440Z       %29 = tt.expand_dims %26 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:39:20.7563732Z       %30 = tt.broadcast %28 : tensor<32x1xi32> -> tensor<32x32xi32>
2026-02-21T08:39:20.7563982Z       %31 = tt.broadcast %29 : tensor<1x32xi32> -> tensor<32x32xi32>
2026-02-21T08:39:20.7564211Z       %32 = arith.addi %30, %31 : tensor<32x32xi32>
2026-02-21T08:39:20.7564438Z       %33 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:39:20.7564714Z       %34 = tt.addptr %33, %32 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:39:20.7565003Z       %35 = tt.load %34 evictionPolicy = evict_first : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:39:20.7565311Z       %36 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:20.7565595Z       %37 = arith.extf %35 : tensor<32x32xf16> to tensor<32x32xf32>
2026-02-21T08:39:20.7565842Z       %38 = tt.broadcast %36 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:39:20.7566074Z       %39 = arith.subf %37, %38 : tensor<32x32xf32>
2026-02-21T08:39:20.7566429Z       %40 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x32xf32>) -> tensor<32x32xf32>
2026-02-21T08:39:20.7566839Z       %41 = tt.expand_dims %24 {axis = 1 : i32} : tensor<32xf32> -> tensor<32x1xf32>
2026-02-21T08:39:20.7567117Z       %42 = tt.broadcast %41 : tensor<32x1xf32> -> tensor<32x32xf32>
2026-02-21T08:39:20.7567337Z       %43 = arith.divf %40, %42 : tensor<32x32xf32>
2026-02-21T08:39:20.7567565Z       %44 = arith.truncf %43 : tensor<32x32xf32> to tensor<32x32xf16>
2026-02-21T08:39:20.7567820Z       %45 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:39:20.7568090Z       %46 = tt.addptr %45, %32 : tensor<32x32x!tt.ptr<f16>>, tensor<32x32xi32>
2026-02-21T08:39:20.7568338Z       tt.store %46, %44 : tensor<32x32x!tt.ptr<f16>>
2026-02-21T08:39:20.7568611Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:39:20.7568865Z     tt.return
2026-02-21T08:39:20.7568992Z   }
2026-02-21T08:39:20.7569115Z }
2026-02-21T08:39:20.7569185Z 
2026-02-21T08:39:20.7569235Z {-#
2026-02-21T08:39:20.7569372Z   external_resources: {
2026-02-21T08:39:20.7569527Z     mlir_reproducer: {
2026-02-21T08:39:20.7573851Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:39:20.7578378Z       disable_threading: false,
2026-02-21T08:39:20.7578562Z       verify_each: true
2026-02-21T08:39:20.7578708Z     }
2026-02-21T08:39:20.7578842Z   }
2026-02-21T08:39:20.7578963Z #-}
2026-02-21T08:39:20.7579414Z /tmp/torchinductor_root/3o/c3ouqpfz652r7o4j7juljqku7fknds53vpazrynxeu6dtcsako3r.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:39:20.7580644Z /tmp/torchinductor_root/3o/c3ouqpfz652r7o4j7juljqku7fknds53vpazrynxeu6dtcsako3r.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:39:20.7581700Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:39:20.7582795Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=128, num_sm_multiplier=64, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[3, 3], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:39:20.7583800Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:39:20.7584066Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:39:23.2835722Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.0 configs/s
2026-02-21T08:39:23.2848638Z [43s] Adaptive compile timeout: 30s (90% percentile=7.7s, bounds=[30.0s, 30s])
2026-02-21T08:39:23.9814689Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1409.8 configs/s
2026-02-21T08:39:24.0386171Z [43s] Initial random population of 100, 5 starting points: 
2026-02-21T08:39:24.0390036Z error=8
2026-02-21T08:39:24.0395694Z timeout=3
2026-02-21T08:39:24.0400068Z ok=89
2026-02-21T08:39:24.0404611Z min=0.0532
2026-02-21T08:39:24.0406761Z mid=0.7076
2026-02-21T08:39:24.0406926Z max=49.5391
2026-02-21T08:39:24.0407089Z best={'block_sizes': [2, 1024],
2026-02-21T08:39:24.0407368Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:39:24.0407643Z  'load_eviction_policies': ['first', ''],
2026-02-21T08:39:24.0407839Z  'num_sm_multiplier': 64,
2026-02-21T08:39:24.0407996Z  'num_stages': 5,
2026-02-21T08:39:24.0408138Z  'num_warps': 1,
2026-02-21T08:39:24.0408290Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:39:24.0408489Z  'range_flattens': [True, True],
2026-02-21T08:39:24.0408659Z  'range_multi_buffers': [False, None],
2026-02-21T08:39:24.0408845Z  'range_num_stages': [3, 1],
2026-02-21T08:39:24.0409016Z  'range_unroll_factors': [0, 2],
2026-02-21T08:39:24.0409191Z  'range_warp_specializes': [True, None]}
2026-02-21T08:39:24.0409409Z [43s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:39:25.1250847Z [45s] Generation 1 starting: 79 neighbors, 5 active search path(s)
2026-02-21T08:39:34.1079827Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 5.4 configs/s
2026-02-21T08:39:38.9799634Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 82/82 17.0 configs/s
2026-02-21T08:39:43.3214923Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 232.8 configs/s
2026-02-21T08:39:43.5702801Z [63s] Generation 1 complete: 
2026-02-21T08:39:43.5703128Z ok=85
2026-02-21T08:39:43.5703322Z min=0.0389
2026-02-21T08:39:43.5703582Z mid=0.0553
2026-02-21T08:39:43.5703765Z max=2.1699
2026-02-21T08:39:43.5703959Z best={'block_sizes': [1, 8192],
2026-02-21T08:39:43.5704338Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:39:43.5706666Z  'load_eviction_policies': ['', 'first'],
2026-02-21T08:39:43.5706923Z  'num_stages': 1,
2026-02-21T08:39:43.5707442Z  'num_warps': 4,
2026-02-21T08:39:43.5707626Z  'pid_type': 'flat',
2026-02-21T08:39:43.5707794Z  'range_flattens': [None, False],
2026-02-21T08:39:43.5707976Z  'range_multi_buffers': [None, None],
2026-02-21T08:39:43.5708164Z  'range_num_stages': [0, 4],
2026-02-21T08:39:43.5708330Z  'range_unroll_factors': [0, 1],
2026-02-21T08:39:43.5708517Z  'range_warp_specializes': [None, False]}
2026-02-21T08:39:43.5715378Z [63s] Fitting surrogate: 185 points, 185 targets
2026-02-21T08:39:44.5422992Z [64s] Generation 2 starting: 67 neighbors, 5 active search path(s)
2026-02-21T08:39:55.2078246Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 1.7 configs/s
2026-02-21T08:39:59.3226729Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 17.2 configs/s
2026-02-21T08:40:02.1747084Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 429.1 configs/s
2026-02-21T08:40:02.3254177Z [82s] Generation 2 complete: 
2026-02-21T08:40:02.3256162Z error=1
2026-02-21T08:40:02.3256343Z ok=72
2026-02-21T08:40:02.3256509Z min=0.0307
2026-02-21T08:40:02.3256668Z mid=0.0471
2026-02-21T08:40:02.3256809Z max=0.2642
2026-02-21T08:40:02.3256991Z best={'block_sizes': [1, 8192],
2026-02-21T08:40:02.3257268Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:40:02.3257585Z  'load_eviction_policies': ['', ''],
2026-02-21T08:40:02.3257806Z  'num_stages': 1,
2026-02-21T08:40:02.3257981Z  'num_warps': 4,
2026-02-21T08:40:02.3258144Z  'pid_type': 'flat',
2026-02-21T08:40:02.3258319Z  'range_flattens': [None, False],
2026-02-21T08:40:02.3258522Z  'range_multi_buffers': [None, None],
2026-02-21T08:40:02.3258724Z  'range_num_stages': [0, 4],
2026-02-21T08:40:02.3258907Z  'range_unroll_factors': [0, 1],
2026-02-21T08:40:02.3259092Z  'range_warp_specializes': [None, False]}
2026-02-21T08:40:02.3269788Z [82s] Fitting surrogate: 258 points, 258 targets
2026-02-21T08:40:03.3199280Z [83s] Generation 3 starting: 68 neighbors, 5 active search path(s)
2026-02-21T08:40:12.0047170Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 2.1 configs/s
2026-02-21T08:40:16.1532070Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 16.8 configs/s
2026-02-21T08:40:21.1595885Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 202.5 configs/s
2026-02-21T08:40:21.4876988Z [101s] Generation 3 complete: 
2026-02-21T08:40:21.4882130Z ok=73
2026-02-21T08:40:21.4884154Z min=0.0307
2026-02-21T08:40:21.4884361Z mid=0.0430
2026-02-21T08:40:21.4889280Z max=1.1116
2026-02-21T08:40:21.4893840Z best={'block_sizes': [1, 8192],
2026-02-21T08:40:21.4898598Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:40:21.4902961Z  'load_eviction_policies': ['', ''],
2026-02-21T08:40:21.4904632Z  'num_stages': 1,
2026-02-21T08:40:21.4904905Z  'num_warps': 4,
2026-02-21T08:40:21.4909599Z  'pid_type': 'flat',
2026-02-21T08:40:21.4913983Z  'range_flattens': [None, False],
2026-02-21T08:40:21.4917975Z  'range_multi_buffers': [None, None],
2026-02-21T08:40:21.4923027Z  'range_num_stages': [0, 4],
2026-02-21T08:40:21.4924662Z  'range_unroll_factors': [0, 1],
2026-02-21T08:40:21.4924912Z  'range_warp_specializes': [None, False]}
2026-02-21T08:40:21.4925138Z [101s] Fitting surrogate: 331 points, 331 targets
2026-02-21T08:40:22.3171875Z [102s] Generation 4 starting: 61 neighbors, 5 active search path(s)
2026-02-21T08:40:28.2963549Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 9.1 configs/s
2026-02-21T08:40:32.0635518Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 63/63 16.9 configs/s
2026-02-21T08:40:36.0445698Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 254.7 configs/s
2026-02-21T08:40:36.3158270Z [116s] Generation 4 complete: 
2026-02-21T08:40:36.3159824Z ok=67
2026-02-21T08:40:36.3160039Z min=0.0306
2026-02-21T08:40:36.3160201Z mid=0.0389
2026-02-21T08:40:36.3160372Z max=0.1238
2026-02-21T08:40:36.3160552Z best={'block_sizes': [1, 8192],
2026-02-21T08:40:36.3160807Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:40:36.3161069Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:40:36.3161285Z  'num_stages': 6,
2026-02-21T08:40:36.3161447Z  'num_warps': 1,
2026-02-21T08:40:36.3161808Z  'pid_type': 'flat',
2026-02-21T08:40:36.3162000Z  'range_flattens': [None, False],
2026-02-21T08:40:36.3162185Z  'range_multi_buffers': [None, None],
2026-02-21T08:40:36.3162377Z  'range_num_stages': [0, 2],
2026-02-21T08:40:36.3162548Z  'range_unroll_factors': [0, 0],
2026-02-21T08:40:36.3162736Z  'range_warp_specializes': [None, True]}
2026-02-21T08:40:36.3174792Z [116s] Fitting surrogate: 398 points, 398 targets
2026-02-21T08:40:37.0302909Z [116s] Generation 5 starting: 50 neighbors, 4 active search path(s)
2026-02-21T08:40:42.1162602Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50/50 25.4 configs/s
2026-02-21T08:40:45.1005891Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 50/50 17.0 configs/s
2026-02-21T08:40:48.5252271Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 296.1 configs/s
2026-02-21T08:40:48.7650216Z [128s] Generation 5 complete: 
2026-02-21T08:40:48.7651733Z ok=54
2026-02-21T08:40:48.7651912Z min=0.0288
2026-02-21T08:40:48.7652056Z mid=0.0327
2026-02-21T08:40:48.7652182Z max=0.0881
2026-02-21T08:40:48.7652336Z best={'block_sizes': [1, 8192],
2026-02-21T08:40:48.7652576Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:40:48.7652835Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:40:48.7653032Z  'num_stages': 6,
2026-02-21T08:40:48.7653187Z  'num_warps': 1,
2026-02-21T08:40:48.7653370Z  'pid_type': 'flat',
2026-02-21T08:40:48.7653562Z  'range_flattens': [None, False],
2026-02-21T08:40:48.7653757Z  'range_multi_buffers': [None, None],
2026-02-21T08:40:48.7653950Z  'range_num_stages': [0, 2],
2026-02-21T08:40:48.7654130Z  'range_unroll_factors': [0, 0],
2026-02-21T08:40:48.7654314Z  'range_warp_specializes': [None, True]}
2026-02-21T08:40:48.7672488Z [128s] Fitting surrogate: 452 points, 452 targets
2026-02-21T08:40:49.4179097Z [129s] Generation 6 starting: 44 neighbors, 4 active search path(s)
2026-02-21T08:40:53.6628547Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45/45 10.0 configs/s
2026-02-21T08:40:56.3453402Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 45/45 17.1 configs/s
2026-02-21T08:40:59.4493624Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 326.7 configs/s
2026-02-21T08:40:59.6803399Z [139s] Generation 6 complete: 
2026-02-21T08:40:59.6804259Z ok=48
2026-02-21T08:40:59.6804391Z min=0.0287
2026-02-21T08:40:59.6804523Z mid=0.0307
2026-02-21T08:40:59.6804641Z max=0.0779
2026-02-21T08:40:59.6804784Z best={'block_sizes': [1, 8192],
2026-02-21T08:40:59.6805021Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:40:59.6805280Z  'load_eviction_policies': ['', ''],
2026-02-21T08:40:59.6805467Z  'num_sm_multiplier': 32,
2026-02-21T08:40:59.6805624Z  'num_stages': 6,
2026-02-21T08:40:59.6805769Z  'num_warps': 2,
2026-02-21T08:40:59.6805919Z  'pid_type': 'persistent_blocked',
2026-02-21T08:40:59.6806104Z  'range_flattens': [True, True],
2026-02-21T08:40:59.6806277Z  'range_multi_buffers': [None, None],
2026-02-21T08:40:59.6806461Z  'range_num_stages': [3, 1],
2026-02-21T08:40:59.6806621Z  'range_unroll_factors': [0, 2],
2026-02-21T08:40:59.6806800Z  'range_warp_specializes': [True, None]}
2026-02-21T08:40:59.6820328Z [139s] Fitting surrogate: 500 points, 500 targets
2026-02-21T08:41:00.1652597Z [140s] Generation 7 starting: 27 neighbors, 2 active search path(s)
2026-02-21T08:41:03.8458607Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 9.3 configs/s
2026-02-21T08:41:05.4652929Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 27/27 17.1 configs/s
2026-02-21T08:41:07.5385657Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 487.8 configs/s
2026-02-21T08:41:07.6958347Z [147s] Generation 7 complete: 
2026-02-21T08:41:07.6963676Z ok=29
2026-02-21T08:41:07.6965558Z min=0.0307
2026-02-21T08:41:07.6965716Z mid=0.0307
2026-02-21T08:41:07.6965849Z max=0.0471
2026-02-21T08:41:07.6965988Z best={'block_sizes': [1, 8192],
2026-02-21T08:41:07.6966249Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:41:07.6966505Z  'load_eviction_policies': ['', ''],
2026-02-21T08:41:07.6966702Z  'num_sm_multiplier': 32,
2026-02-21T08:41:07.6966899Z  'num_stages': 6,
2026-02-21T08:41:07.6967069Z  'num_warps': 2,
2026-02-21T08:41:07.6967234Z  'pid_type': 'persistent_blocked',
2026-02-21T08:41:07.6967417Z  'range_flattens': [True, True],
2026-02-21T08:41:07.6967603Z  'range_multi_buffers': [None, None],
2026-02-21T08:41:07.6967781Z  'range_num_stages': [3, 1],
2026-02-21T08:41:07.6967950Z  'range_unroll_factors': [0, 2],
2026-02-21T08:41:07.6968126Z  'range_warp_specializes': [True, None]}
2026-02-21T08:41:07.6976468Z [147s] Fitting surrogate: 529 points, 529 targets
2026-02-21T08:41:08.0579554Z [147s] Generation 8 starting: 8 neighbors, 1 active search path(s)
2026-02-21T08:41:09.6672771Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9/9 4.4 configs/s
2026-02-21T08:41:10.2008011Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 18.6 configs/s
2026-02-21T08:41:10.7120956Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1929.6 configs/s
2026-02-21T08:41:10.7618642Z [150s] Generation 8 complete: 
2026-02-21T08:41:10.7618906Z ok=10
2026-02-21T08:41:10.7619091Z min=0.0287
2026-02-21T08:41:10.7619258Z mid=0.0389
2026-02-21T08:41:10.7619389Z max=0.0553
2026-02-21T08:41:10.7619557Z best={'block_sizes': [1, 8192],
2026-02-21T08:41:10.7619817Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:41:10.7620083Z  'load_eviction_policies': ['', ''],
2026-02-21T08:41:10.7620269Z  'num_sm_multiplier': 32,
2026-02-21T08:41:10.7620424Z  'num_stages': 6,
2026-02-21T08:41:10.7620565Z  'num_warps': 2,
2026-02-21T08:41:10.7620711Z  'pid_type': 'persistent_blocked',
2026-02-21T08:41:10.7620897Z  'range_flattens': [True, True],
2026-02-21T08:41:10.7621070Z  'range_multi_buffers': [None, None],
2026-02-21T08:41:10.7621251Z  'range_num_stages': [3, 1],
2026-02-21T08:41:10.7621411Z  'range_unroll_factors': [0, 2],
2026-02-21T08:41:10.7621650Z  'range_warp_specializes': [True, None]}
2026-02-21T08:41:10.7643445Z [150s] Fitting surrogate: 539 points, 539 targets
2026-02-21T08:41:11.1121894Z [151s] Generation 9 starting: 10 neighbors, 1 active search path(s)
2026-02-21T08:41:12.2671301Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 4.6 configs/s
2026-02-21T08:41:12.8534527Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 18.6 configs/s
2026-02-21T08:41:14.2401942Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1151.9 configs/s
2026-02-21T08:41:14.3133492Z [154s] Generation 9 complete: 
2026-02-21T08:41:14.3135047Z ok=12
2026-02-21T08:41:14.3135250Z min=0.0307
2026-02-21T08:41:14.3139880Z mid=0.0307
2026-02-21T08:41:14.3144405Z max=0.0389
2026-02-21T08:41:14.3148811Z best={'block_sizes': [1, 8192],
2026-02-21T08:41:14.3150362Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:41:14.3150726Z  'load_eviction_policies': ['', ''],
2026-02-21T08:41:14.3150951Z  'num_sm_multiplier': 32,
2026-02-21T08:41:14.3156177Z  'num_stages': 6,
2026-02-21T08:41:14.3159352Z  'num_warps': 2,
2026-02-21T08:41:14.3161488Z  'pid_type': 'persistent_blocked',
2026-02-21T08:41:14.3161791Z  'range_flattens': [True, True],
2026-02-21T08:41:14.3161995Z  'range_multi_buffers': [None, None],
2026-02-21T08:41:14.3162179Z  'range_num_stages': [3, 1],
2026-02-21T08:41:14.3162347Z  'range_unroll_factors': [0, 2],
2026-02-21T08:41:14.3162524Z  'range_warp_specializes': [True, None]}
2026-02-21T08:41:14.3162813Z [154s] Fitting surrogate: 551 points, 551 targets
2026-02-21T08:41:14.7353679Z [154s] Generation 10 starting: 12 neighbors, 1 active search path(s)
2026-02-21T08:41:16.3736876Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 12.7 configs/s
2026-02-21T08:41:17.0934380Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 17.8 configs/s
2026-02-21T08:41:18.1373291Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 959.9 configs/s
2026-02-21T08:41:18.2190296Z [158s] Generation 10 complete: 
2026-02-21T08:41:18.2194678Z ok=14
2026-02-21T08:41:18.2199050Z min=0.0307
2026-02-21T08:41:18.2200906Z mid=0.0307
2026-02-21T08:41:18.2201096Z max=0.0431
2026-02-21T08:41:18.2201275Z best={'block_sizes': [1, 8192],
2026-02-21T08:41:18.2201774Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:41:18.2202090Z  'load_eviction_policies': ['', ''],
2026-02-21T08:41:18.2202287Z  'num_sm_multiplier': 32,
2026-02-21T08:41:18.2202460Z  'num_stages': 6,
2026-02-21T08:41:18.2202623Z  'num_warps': 2,
2026-02-21T08:41:18.2202819Z  'pid_type': 'persistent_blocked',
2026-02-21T08:41:18.2203023Z  'range_flattens': [True, True],
2026-02-21T08:41:18.2203219Z  'range_multi_buffers': [None, None],
2026-02-21T08:41:18.2203422Z  'range_num_stages': [3, 1],
2026-02-21T08:41:18.2203647Z  'range_unroll_factors': [0, 2],
2026-02-21T08:41:18.2204184Z  'range_warp_specializes': [True, None]}
2026-02-21T08:41:18.2218744Z [158s] Fitting surrogate: 565 points, 565 targets
2026-02-21T08:41:18.4975428Z [158s] Autotuning complete in 158.4s after searching 534 configs.
2026-02-21T08:41:18.4977450Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:41:18.4978454Z     @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', ''], num_sm_multiplier=32, num_stages=6, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[None, None], range_num_stages=[3, 1], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:41:18.4979311Z 
2026-02-21T08:41:18.4979562Z [158s] Code of selected kernel: /tmp/torchinductor_root/gk/cgkn344xjwop3j7ywcryqion4i3hhvpupzxrejegoxquuyzp5mdx.py
2026-02-21T08:41:19.5940000Z WARNING:tritonbench.utils.triton_op:Completed input ID 56:
2026-02-21T08:41:19.5943842Z (M, N)
2026-02-21T08:41:19.5948159Z ------------
2026-02-21T08:41:19.5949790Z (4096, 7424)
2026-02-21T08:41:19.5949983Z 
2026-02-21T08:41:19.5955845Z  60%|██████    | 12/20 [32:17<22:50, 171.30s/it]
2026-02-21T08:41:19.5955845Z WARNING:tritonbench.utils.triton_op:Running input ID 61:
2026-02-21T08:41:19.5959709Z (M, N)
2026-02-21T08:41:19.5962882Z ------------
2026-02-21T08:41:19.5963135Z (4096, 8064)
2026-02-21T08:41:19.5966269Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T08:41:20.7793105Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T08:41:22.2562985Z INFO:tritonbench.utils.triton_op:Took 2.24ms to get benchmark function for torch_compile_softmax
2026-02-21T08:41:25.7738979Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:41:25.7743113Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:41:25.7744455Z               'dtype': 'torch.float16',
2026-02-21T08:41:25.7744720Z               'shape': (4096, 8064),
2026-02-21T08:41:25.7744925Z               'stride': (8064, 1)},),
2026-02-21T08:41:25.7745095Z   'kwargs': {}}
2026-02-21T08:41:25.7761918Z INFO:tritonbench.utils.triton_op:Took 2.55ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T08:41:25.9583340Z [0s] Autotune random seed: 2134816249
2026-02-21T08:41:26.0970599Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:42:01.2694309Z [35s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None])
2026-02-21T08:42:01.7723093Z [35s] Timeout after 30s compiling Config(block_sizes=[1024, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[False, False])
2026-02-21T08:42:02.0769492Z [35s] Timeout after 30s compiling Config(block_sizes=[256, 4096], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=128, num_sm_multiplier=128, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 2], range_unroll_factors=[3, 0], range_warp_specializes=[None, True])
2026-02-21T08:42:02.0781119Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s
2026-02-21T08:42:09.4254296Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 13.7 configs/s
2026-02-21T08:42:09.4264542Z [43s] Adaptive compile timeout: 30s (90% percentile=8.3s, bounds=[30.0s, 30s])
2026-02-21T08:42:10.0163388Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1651.3 configs/s
2026-02-21T08:42:10.0670994Z [43s] Initial random population of 100, 5 starting points: 
2026-02-21T08:42:10.0672457Z error=6
2026-02-21T08:42:10.0672624Z timeout=3
2026-02-21T08:42:10.0672760Z ok=91
2026-02-21T08:42:10.0672883Z min=0.0532
2026-02-21T08:42:10.0673024Z mid=0.8255
2026-02-21T08:42:10.0673152Z max=53.8153
2026-02-21T08:42:10.0673301Z best={'block_sizes': [2, 1024],
2026-02-21T08:42:10.0673570Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:42:10.0673849Z  'load_eviction_policies': ['first', ''],
2026-02-21T08:42:10.0674044Z  'num_sm_multiplier': 64,
2026-02-21T08:42:10.0674203Z  'num_stages': 5,
2026-02-21T08:42:10.0674346Z  'num_warps': 1,
2026-02-21T08:42:10.0674501Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:42:10.0675023Z  'range_flattens': [True, True],
2026-02-21T08:42:10.0675225Z  'range_multi_buffers': [False, None],
2026-02-21T08:42:10.0675414Z  'range_num_stages': [3, 1],
2026-02-21T08:42:10.0675578Z  'range_unroll_factors': [0, 2],
2026-02-21T08:42:10.0675763Z  'range_warp_specializes': [True, None]}
2026-02-21T08:42:10.0687104Z [43s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:42:11.0331435Z [44s] Generation 1 starting: 79 neighbors, 5 active search path(s)
2026-02-21T08:42:23.5399350Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 1.8 configs/s
2026-02-21T08:42:28.5581883Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 16.9 configs/s
2026-02-21T08:42:33.6492212Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 198.7 configs/s
2026-02-21T08:42:33.9479726Z [67s] Generation 1 complete: 
2026-02-21T08:42:33.9481718Z ok=85
2026-02-21T08:42:33.9481985Z min=0.0389
2026-02-21T08:42:33.9482223Z mid=0.0552
2026-02-21T08:42:33.9482384Z max=2.5283
2026-02-21T08:42:33.9482594Z best={'block_sizes': [1, 8192],
2026-02-21T08:42:33.9482881Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:42:33.9483202Z  'load_eviction_policies': ['', 'first'],
2026-02-21T08:42:33.9483455Z  'num_stages': 1,
2026-02-21T08:42:33.9483636Z  'num_warps': 4,
2026-02-21T08:42:33.9483857Z  'pid_type': 'flat',
2026-02-21T08:42:33.9484053Z  'range_flattens': [None, None],
2026-02-21T08:42:33.9484303Z  'range_multi_buffers': [None, None],
2026-02-21T08:42:33.9484532Z  'range_num_stages': [0, 4],
2026-02-21T08:42:33.9484770Z  'range_unroll_factors': [0, 1],
2026-02-21T08:42:33.9484990Z  'range_warp_specializes': [None, False]}
2026-02-21T08:42:33.9491457Z [67s] Fitting surrogate: 185 points, 185 targets
2026-02-21T08:42:34.8167932Z [68s] Generation 2 starting: 60 neighbors, 5 active search path(s)
2026-02-21T08:43:00.0769308Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 0.7 configs/s
2026-02-21T08:43:03.7909316Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 63/63 17.2 configs/s
2026-02-21T08:43:05.7981780Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 502.4 configs/s
2026-02-21T08:43:05.9228747Z [99s] Generation 2 complete: 
2026-02-21T08:43:05.9233124Z error=1
2026-02-21T08:43:05.9234567Z ok=65
2026-02-21T08:43:05.9234782Z min=0.0307
2026-02-21T08:43:05.9234995Z mid=0.0491
2026-02-21T08:43:05.9235163Z max=0.8058
2026-02-21T08:43:05.9235374Z best={'block_sizes': [1, 8192],
2026-02-21T08:43:05.9235674Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:43:05.9236013Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:43:05.9236249Z  'num_stages': 1,
2026-02-21T08:43:05.9236466Z  'num_warps': 4,
2026-02-21T08:43:05.9236690Z  'pid_type': 'flat',
2026-02-21T08:43:05.9236915Z  'range_flattens': [None, None],
2026-02-21T08:43:05.9237481Z  'range_multi_buffers': [None, None],
2026-02-21T08:43:05.9237710Z  'range_num_stages': [0, 4],
2026-02-21T08:43:05.9237950Z  'range_unroll_factors': [0, 1],
2026-02-21T08:43:05.9238172Z  'range_warp_specializes': [None, False]}
2026-02-21T08:43:05.9241120Z [99s] Fitting surrogate: 251 points, 251 targets
2026-02-21T08:43:06.8374267Z [100s] Generation 3 starting: 60 neighbors, 5 active search path(s)
2026-02-21T08:43:13.9979690Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 5.5 configs/s
2026-02-21T08:43:17.6435285Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 61/61 16.9 configs/s
2026-02-21T08:43:22.0600434Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 229.5 configs/s
2026-02-21T08:43:22.3604167Z [116s] Generation 3 complete: 
2026-02-21T08:43:22.3609131Z ok=65
2026-02-21T08:43:22.3614196Z min=0.0307
2026-02-21T08:43:22.3618454Z mid=0.0410
2026-02-21T08:43:22.3620464Z max=1.4491
2026-02-21T08:43:22.3620693Z best={'block_sizes': [1, 8192],
2026-02-21T08:43:22.3621071Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:43:22.3626114Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:43:22.3628212Z  'num_stages': 1,
2026-02-21T08:43:22.3628454Z  'num_warps': 4,
2026-02-21T08:43:22.3628691Z  'pid_type': 'flat',
2026-02-21T08:43:22.3628905Z  'range_flattens': [None, None],
2026-02-21T08:43:22.3629176Z  'range_multi_buffers': [None, None],
2026-02-21T08:43:22.3629414Z  'range_num_stages': [0, 4],
2026-02-21T08:43:22.3629662Z  'range_unroll_factors': [0, 1],
2026-02-21T08:43:22.3629893Z  'range_warp_specializes': [None, False]}
2026-02-21T08:43:22.3630198Z [116s] Fitting surrogate: 316 points, 316 targets
2026-02-21T08:43:23.1902284Z [117s] Generation 4 starting: 51 neighbors, 5 active search path(s)
2026-02-21T08:43:31.5280175Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 10.7 configs/s
2026-02-21T08:43:34.8953008Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 15.9 configs/s
2026-02-21T08:43:38.3426946Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 293.8 configs/s
2026-02-21T08:43:38.5797673Z [132s] Generation 4 complete: 
2026-02-21T08:43:38.5799281Z ok=56
2026-02-21T08:43:38.5799579Z min=0.0307
2026-02-21T08:43:38.5804207Z mid=0.0389
2026-02-21T08:43:38.5809180Z max=0.1351
2026-02-21T08:43:38.5810629Z best={'block_sizes': [1, 8192],
2026-02-21T08:43:38.5810995Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:43:38.5811313Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:43:38.5811818Z  'num_stages': 1,
2026-02-21T08:43:38.5812015Z  'num_warps': 4,
2026-02-21T08:43:38.5812231Z  'pid_type': 'flat',
2026-02-21T08:43:38.5812472Z  'range_flattens': [None, None],
2026-02-21T08:43:38.5813093Z  'range_multi_buffers': [None, None],
2026-02-21T08:43:38.5813352Z  'range_num_stages': [0, 4],
2026-02-21T08:43:38.5813562Z  'range_unroll_factors': [0, 1],
2026-02-21T08:43:38.5813830Z  'range_warp_specializes': [None, False]}
2026-02-21T08:43:38.5815216Z [132s] Fitting surrogate: 372 points, 372 targets
2026-02-21T08:43:39.8043397Z [133s] Generation 5 starting: 35 neighbors, 4 active search path(s)
2026-02-21T08:43:45.1032452Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 29.0 configs/s
2026-02-21T08:43:47.2861005Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 16.3 configs/s
2026-02-21T08:43:50.1993547Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 348.0 configs/s
2026-02-21T08:43:50.4089743Z [144s] Generation 5 complete: 
2026-02-21T08:43:50.4094147Z ok=40
2026-02-21T08:43:50.4097529Z min=0.0307
2026-02-21T08:43:50.4102586Z mid=0.0327
2026-02-21T08:43:50.4104484Z max=0.0820
2026-02-21T08:43:50.4104711Z best={'block_sizes': [1, 8192],
2026-02-21T08:43:50.4105051Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:43:50.4105423Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:43:50.4105700Z  'num_stages': 1,
2026-02-21T08:43:50.4110456Z  'num_warps': 4,
2026-02-21T08:43:50.4114476Z  'pid_type': 'flat',
2026-02-21T08:43:50.4119195Z  'range_flattens': [None, None],
2026-02-21T08:43:50.4119480Z  'range_multi_buffers': [None, None],
2026-02-21T08:43:50.4119735Z  'range_num_stages': [0, 4],
2026-02-21T08:43:50.4119989Z  'range_unroll_factors': [0, 1],
2026-02-21T08:43:50.4126220Z  'range_warp_specializes': [None, False]}
2026-02-21T08:43:50.4126582Z [144s] Fitting surrogate: 412 points, 412 targets
2026-02-21T08:43:50.7726550Z [144s] Generation 6 starting: 18 neighbors, 2 active search path(s)
2026-02-21T08:43:54.9089339Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 7.5 configs/s
2026-02-21T08:43:56.1478111Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 19/19 15.8 configs/s
2026-02-21T08:43:57.3760913Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 817.3 configs/s
2026-02-21T08:43:57.4665602Z [151s] Generation 6 complete: 
2026-02-21T08:43:57.4667486Z ok=21
2026-02-21T08:43:57.4667713Z min=0.0307
2026-02-21T08:43:57.4667928Z mid=0.0329
2026-02-21T08:43:57.4668103Z max=0.1801
2026-02-21T08:43:57.4668326Z best={'block_sizes': [1, 8192],
2026-02-21T08:43:57.4668639Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:43:57.4668929Z  'load_eviction_policies': ['', ''],
2026-02-21T08:43:57.4669191Z  'num_stages': 2,
2026-02-21T08:43:57.4669382Z  'num_warps': 4,
2026-02-21T08:43:57.4669604Z  'pid_type': 'flat',
2026-02-21T08:43:57.4669814Z  'range_flattens': [None, False],
2026-02-21T08:43:57.4670119Z  'range_multi_buffers': [None, None],
2026-02-21T08:43:57.4670367Z  'range_num_stages': [0, 1],
2026-02-21T08:43:57.4670619Z  'range_unroll_factors': [0, 1],
2026-02-21T08:43:57.4670851Z  'range_warp_specializes': [None, True]}
2026-02-21T08:43:57.4688784Z [151s] Fitting surrogate: 433 points, 433 targets
2026-02-21T08:43:57.9013689Z [151s] Generation 7 starting: 24 neighbors, 2 active search path(s)
2026-02-21T08:44:01.2284580Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24/24 18.2 configs/s
2026-02-21T08:44:02.6956845Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 24/24 16.9 configs/s
2026-02-21T08:44:04.3523021Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 610.1 configs/s
2026-02-21T08:44:04.4727861Z [158s] Generation 7 complete: 
2026-02-21T08:44:04.4728168Z ok=26
2026-02-21T08:44:04.4728374Z min=0.0307
2026-02-21T08:44:04.4728582Z mid=0.0308
2026-02-21T08:44:04.4729091Z max=0.1004
2026-02-21T08:44:04.4729272Z best={'block_sizes': [1, 8192],
2026-02-21T08:44:04.4729564Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:44:04.4729839Z  'load_eviction_policies': ['', ''],
2026-02-21T08:44:04.4730088Z  'num_stages': 2,
2026-02-21T08:44:04.4730277Z  'num_warps': 4,
2026-02-21T08:44:04.4730494Z  'pid_type': 'flat',
2026-02-21T08:44:04.4730701Z  'range_flattens': [None, False],
2026-02-21T08:44:04.4730959Z  'range_multi_buffers': [None, None],
2026-02-21T08:44:04.4731214Z  'range_num_stages': [0, 0],
2026-02-21T08:44:04.4731424Z  'range_unroll_factors': [0, 1],
2026-02-21T08:44:04.4731976Z  'range_warp_specializes': [None, True]}
2026-02-21T08:44:04.4749591Z [158s] Fitting surrogate: 459 points, 459 targets
2026-02-21T08:44:04.7711095Z [158s] Generation 8 starting: 11 neighbors, 1 active search path(s)
2026-02-21T08:44:07.6667982Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 2.7 configs/s
2026-02-21T08:44:09.0947983Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 11/11 7.5 configs/s
2026-02-21T08:44:09.9985265Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1107.3 configs/s
2026-02-21T08:44:10.0709044Z [163s] Generation 8 complete: 
2026-02-21T08:44:10.0710546Z ok=13
2026-02-21T08:44:10.0710785Z min=0.0307
2026-02-21T08:44:10.0710964Z mid=0.0327
2026-02-21T08:44:10.0711162Z max=0.0696
2026-02-21T08:44:10.0711345Z best={'block_sizes': [1, 8192],
2026-02-21T08:44:10.0712478Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:44:10.0712759Z  'load_eviction_policies': ['', ''],
2026-02-21T08:44:10.0713006Z  'num_stages': 2,
2026-02-21T08:44:10.0713199Z  'num_warps': 4,
2026-02-21T08:44:10.0713409Z  'pid_type': 'flat',
2026-02-21T08:44:10.0715569Z  'range_flattens': [None, False],
2026-02-21T08:44:10.0716019Z  'range_multi_buffers': [None, None],
2026-02-21T08:44:10.0720861Z  'range_num_stages': [0, 0],
2026-02-21T08:44:10.0726545Z  'range_unroll_factors': [0, 1],
2026-02-21T08:44:10.0730654Z  'range_warp_specializes': [None, True]}
2026-02-21T08:44:10.0735304Z [163s] Fitting surrogate: 472 points, 472 targets
2026-02-21T08:44:10.3331217Z [164s] Generation 9 starting: 7 neighbors, 1 active search path(s)
2026-02-21T08:44:12.8013152Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7/7 90.8 configs/s
2026-02-21T08:44:13.2280646Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━━ 7/7 18.5 configs/s
2026-02-21T08:44:13.8256385Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1658.0 configs/s
2026-02-21T08:44:13.8802025Z [167s] Generation 9 complete: 
2026-02-21T08:44:13.8807229Z ok=9
2026-02-21T08:44:13.8809446Z min=0.0307
2026-02-21T08:44:13.8809694Z mid=0.0326
2026-02-21T08:44:13.8812193Z max=0.0553
2026-02-21T08:44:13.8812461Z best={'block_sizes': [1, 8192],
2026-02-21T08:44:13.8812789Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:44:13.8813064Z  'load_eviction_policies': ['', ''],
2026-02-21T08:44:13.8813312Z  'num_stages': 2,
2026-02-21T08:44:13.8813494Z  'num_warps': 4,
2026-02-21T08:44:13.8813708Z  'pid_type': 'flat',
2026-02-21T08:44:13.8813909Z  'range_flattens': [None, False],
2026-02-21T08:44:13.8814163Z  'range_multi_buffers': [None, None],
2026-02-21T08:44:13.8814417Z  'range_num_stages': [0, 0],
2026-02-21T08:44:13.8814629Z  'range_unroll_factors': [0, 1],
2026-02-21T08:44:13.8814882Z  'range_warp_specializes': [None, True]}
2026-02-21T08:44:13.8818854Z [167s] Fitting surrogate: 481 points, 481 targets
2026-02-21T08:44:14.1955028Z [168s] Generation 10 starting: 12 neighbors, 1 active search path(s)
2026-02-21T08:44:16.7643263Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 7.8 configs/s
2026-02-21T08:44:17.4969302Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 17.5 configs/s
2026-02-21T08:44:18.5478596Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 954.6 configs/s
2026-02-21T08:44:18.6353147Z [172s] Generation 10 complete: 
2026-02-21T08:44:18.6353568Z ok=14
2026-02-21T08:44:18.6353794Z min=0.0307
2026-02-21T08:44:18.6354037Z mid=0.0308
2026-02-21T08:44:18.6354223Z max=0.0451
2026-02-21T08:44:18.6354445Z best={'block_sizes': [1, 8192],
2026-02-21T08:44:18.6354735Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:44:18.6355003Z  'load_eviction_policies': ['', ''],
2026-02-21T08:44:18.6355253Z  'num_stages': 2,
2026-02-21T08:44:18.6355445Z  'num_warps': 4,
2026-02-21T08:44:18.6355658Z  'pid_type': 'flat',
2026-02-21T08:44:18.6355855Z  'range_flattens': [None, False],
2026-02-21T08:44:18.6356112Z  'range_multi_buffers': [None, None],
2026-02-21T08:44:18.6358722Z  'range_num_stages': [0, 0],
2026-02-21T08:44:18.6359453Z  'range_unroll_factors': [0, 1],
2026-02-21T08:44:18.6359758Z  'range_warp_specializes': [None, True]}
2026-02-21T08:44:18.6367305Z [172s] Fitting surrogate: 495 points, 495 targets
2026-02-21T08:44:18.9433002Z [172s] Generation 11 starting: 11 neighbors, 1 active search path(s)
2026-02-21T08:44:21.2717907Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 9.1 configs/s
2026-02-21T08:44:22.0769133Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 15.7 configs/s
2026-02-21T08:44:22.9985179Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1088.0 configs/s
2026-02-21T08:44:23.0720138Z [176s] Generation 11 complete: 
2026-02-21T08:44:23.0722091Z ok=13
2026-02-21T08:44:23.0722357Z min=0.0307
2026-02-21T08:44:23.0722572Z mid=0.0369
2026-02-21T08:44:23.0722740Z max=0.0593
2026-02-21T08:44:23.0725449Z best={'block_sizes': [1, 8192],
2026-02-21T08:44:23.0725840Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:44:23.0726172Z  'load_eviction_policies': ['', ''],
2026-02-21T08:44:23.0726412Z  'num_stages': 2,
2026-02-21T08:44:23.0726644Z  'num_warps': 4,
2026-02-21T08:44:23.0726841Z  'pid_type': 'flat',
2026-02-21T08:44:23.0727084Z  'range_flattens': [None, False],
2026-02-21T08:44:23.0727354Z  'range_multi_buffers': [None, None],
2026-02-21T08:44:23.0727592Z  'range_num_stages': [0, 0],
2026-02-21T08:44:23.0727850Z  'range_unroll_factors': [0, 1],
2026-02-21T08:44:23.0728084Z  'range_warp_specializes': [None, True]}
2026-02-21T08:44:23.0738566Z [176s] Fitting surrogate: 508 points, 508 targets
2026-02-21T08:44:23.4468665Z [177s] Generation 12 starting: 8 neighbors, 1 active search path(s)
2026-02-21T08:44:25.8739518Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8/8 23.3 configs/s
2026-02-21T08:44:26.3623572Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 8/8 18.1 configs/s
2026-02-21T08:44:27.1250192Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1308.0 configs/s
2026-02-21T08:44:27.1898478Z [181s] Generation 12 complete: 
2026-02-21T08:44:27.1898876Z ok=10
2026-02-21T08:44:27.1899077Z min=0.0307
2026-02-21T08:44:27.1899305Z mid=0.0308
2026-02-21T08:44:27.1899474Z max=0.0451
2026-02-21T08:44:27.1899718Z best={'block_sizes': [1, 8192],
2026-02-21T08:44:27.1900064Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:44:27.1900378Z  'load_eviction_policies': ['', ''],
2026-02-21T08:44:27.1900654Z  'num_stages': 2,
2026-02-21T08:44:27.1900852Z  'num_warps': 4,
2026-02-21T08:44:27.1901078Z  'pid_type': 'flat',
2026-02-21T08:44:27.1901289Z  'range_flattens': [None, False],
2026-02-21T08:44:27.1901591Z  'range_multi_buffers': [None, None],
2026-02-21T08:44:27.1901851Z  'range_num_stages': [0, 0],
2026-02-21T08:44:27.1902389Z  'range_unroll_factors': [0, 1],
2026-02-21T08:44:27.1902671Z  'range_warp_specializes': [None, True]}
2026-02-21T08:44:27.1920065Z [181s] Fitting surrogate: 518 points, 518 targets
2026-02-21T08:44:27.5485601Z [181s] Generation 13 starting: 9 neighbors, 1 active search path(s)
2026-02-21T08:44:31.0976568Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 3.7 configs/s
2026-02-21T08:44:31.6976207Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 10/10 18.1 configs/s
2026-02-21T08:44:32.3774941Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1462.1 configs/s
2026-02-21T08:44:32.4382112Z [186s] Generation 13 complete: 
2026-02-21T08:44:32.4386653Z ok=11
2026-02-21T08:44:32.4388822Z min=0.0307
2026-02-21T08:44:32.4389064Z mid=0.0308
2026-02-21T08:44:32.4389238Z max=0.0593
2026-02-21T08:44:32.4389451Z best={'block_sizes': [1, 8192],
2026-02-21T08:44:32.4389788Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:44:32.4390079Z  'load_eviction_policies': ['', ''],
2026-02-21T08:44:32.4390332Z  'num_stages': 2,
2026-02-21T08:44:32.4390515Z  'num_warps': 4,
2026-02-21T08:44:32.4390742Z  'pid_type': 'flat',
2026-02-21T08:44:32.4390942Z  'range_flattens': [None, False],
2026-02-21T08:44:32.4391186Z  'range_multi_buffers': [None, None],
2026-02-21T08:44:32.4391407Z  'range_num_stages': [0, 0],
2026-02-21T08:44:32.4391699Z  'range_unroll_factors': [0, 1],
2026-02-21T08:44:32.4391953Z  'range_warp_specializes': [None, True]}
2026-02-21T08:44:32.4397835Z [186s] Fitting surrogate: 529 points, 529 targets
2026-02-21T08:44:32.8353591Z [186s] Generation 14 starting: 11 neighbors, 1 active search path(s)
2026-02-21T08:44:35.3046960Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 5.6 configs/s
2026-02-21T08:44:35.9618259Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 18.0 configs/s
2026-02-21T08:44:36.9341362Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1030.6 configs/s
2026-02-21T08:44:37.0136965Z [190s] Generation 14 complete: 
2026-02-21T08:44:37.0138694Z ok=13
2026-02-21T08:44:37.0138916Z min=0.0307
2026-02-21T08:44:37.0139126Z mid=0.0308
2026-02-21T08:44:37.0139294Z max=0.0451
2026-02-21T08:44:37.0139504Z best={'block_sizes': [1, 8192],
2026-02-21T08:44:37.0139771Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:44:37.0140068Z  'load_eviction_policies': ['', ''],
2026-02-21T08:44:37.0140284Z  'num_stages': 2,
2026-02-21T08:44:37.0140495Z  'num_warps': 4,
2026-02-21T08:44:37.0140675Z  'pid_type': 'flat',
2026-02-21T08:44:37.0140911Z  'range_flattens': [None, False],
2026-02-21T08:44:37.0146103Z  'range_multi_buffers': [None, None],
2026-02-21T08:44:37.0147634Z  'range_num_stages': [0, 0],
2026-02-21T08:44:37.0147905Z  'range_unroll_factors': [0, 1],
2026-02-21T08:44:37.0148192Z  'range_warp_specializes': [None, True]}
2026-02-21T08:44:37.0153714Z [190s] Fitting surrogate: 542 points, 542 targets
2026-02-21T08:44:37.3120178Z [191s] Autotuning complete in 191.2s after searching 507 configs.
2026-02-21T08:44:37.3120785Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:44:37.3122114Z     @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', ''], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:44:37.3122999Z 
2026-02-21T08:44:37.3123366Z [191s] Code of selected kernel: /tmp/torchinductor_root/hq/chqj2t3kr6zihvxuaucr5c7hpwcwekhixqtqvxw23ih2hwb5gstv.py
2026-02-21T08:44:37.9169457Z WARNING:tritonbench.utils.triton_op:Completed input ID 61:
2026-02-21T08:44:37.9170025Z (M, N)
2026-02-21T08:44:37.9170257Z ------------
2026-02-21T08:44:37.9171176Z (4096, 8064)
2026-02-21T08:44:37.9171388Z 
2026-02-21T08:44:37.9179023Z  65%|██████▌   | 13/20 [35:35<20:56, 179.49s/it]
2026-02-21T08:44:37.9179023Z WARNING:tritonbench.utils.triton_op:Running input ID 66:
2026-02-21T08:44:37.9179544Z (M, N)
2026-02-21T08:44:37.9182464Z ------------
2026-02-21T08:44:37.9182693Z (4096, 8704)
2026-02-21T08:44:37.9183149Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T08:44:39.0623312Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T08:44:40.3452842Z INFO:tritonbench.utils.triton_op:Took 2.39ms to get benchmark function for torch_compile_softmax
2026-02-21T08:44:44.9857176Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:44:44.9862258Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:44:44.9865795Z               'dtype': 'torch.float16',
2026-02-21T08:44:44.9867953Z               'shape': (4096, 8704),
2026-02-21T08:44:44.9868318Z               'stride': (8704, 1)},),
2026-02-21T08:44:44.9873882Z   'kwargs': {}}
2026-02-21T08:44:44.9890955Z INFO:tritonbench.utils.triton_op:Took 3.59ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T08:44:45.1783753Z [0s] Autotune random seed: 2134816249
2026-02-21T08:44:45.3255016Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:45:24.9641265Z [39s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[4, 1], range_warp_specializes=[None, None])
2026-02-21T08:45:24.9660846Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s
2026-02-21T08:45:25.8215922Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:45:25.8216631Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:45:25.8218851Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:45:25.8219091Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:45:25.8219378Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:45:25.8224108Z     %cst = arith.constant dense<8704> : tensor<8x1xi32>
2026-02-21T08:45:25.8229226Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<8xf32>
2026-02-21T08:45:25.8234645Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<8xf32>
2026-02-21T08:45:25.8236728Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:45:25.8237016Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:45:25.8237295Z     %c8704_i32 = arith.constant 8704 : i32
2026-02-21T08:45:25.8237553Z     %c8704_i64 = arith.constant 8704 : i64
2026-02-21T08:45:25.8237774Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:45:25.8238502Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8704_i32], [%c8704_i64, %c1_i64] : <f16>, <tensor<8x512xf16>>
2026-02-21T08:45:25.8238869Z     %1 = tt.get_program_id x : i32
2026-02-21T08:45:25.8239153Z     scf.for %arg2 = %1 to %c512_i32 step %c9472_i32  : i32 {
2026-02-21T08:45:25.8239411Z       %2 = arith.muli %arg2, %c8_i32 : i32
2026-02-21T08:45:25.8239708Z       %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:45:25.8240049Z       %4 = tt.splat %2 : i32 -> tensor<8xi32>
2026-02-21T08:45:25.8240292Z       %5 = arith.addi %4, %3 : tensor<8xi32>
2026-02-21T08:45:25.8240561Z       %c8192_i32 = arith.constant 8192 : i32
2026-02-21T08:45:25.8240791Z       %c2048_i32 = arith.constant 2048 : i32
2026-02-21T08:45:25.8241229Z       %6:2 = scf.for %arg3 = %c0_i32 to %c8192_i32 step %c2048_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<8xf32>, tensor<8xf32>)  : i32 {
2026-02-21T08:45:25.8242251Z         %48 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:45:25.8242657Z         %49 = arith.extf %48 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:45:25.8242965Z         %50 = "tt.reduce"(%49) <{axis = 1 : i32}> ({
2026-02-21T08:45:25.8243206Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:45:25.8243464Z           %126 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:45:25.8243702Z           tt.reduce.return %126 : f32
2026-02-21T08:45:25.8243960Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:45:25.8244222Z         %51 = arith.truncf %50 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:45:25.8244531Z         %52 = arith.extf %51 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:45:25.8244827Z         %53 = arith.cmpf ogt, %arg4, %52 : tensor<8xf32>
2026-02-21T08:45:25.8245089Z         %54 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32>
2026-02-21T08:45:25.8245367Z         %55 = arith.ori %53, %54 : tensor<8xi1>
2026-02-21T08:45:25.8245645Z         %56 = arith.select %55, %arg4, %52 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:45:25.8245948Z         %57 = arith.subf %arg4, %56 : tensor<8xf32>
2026-02-21T08:45:25.8246347Z         %58 = tt.extern_elementwise %57 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:45:25.8246769Z         %59 = arith.mulf %arg5, %58 : tensor<8xf32>
2026-02-21T08:45:25.8247082Z         %60 = tt.expand_dims %56 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:45:25.8247405Z         %61 = tt.broadcast %60 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:45:25.8247704Z         %62 = arith.subf %49, %61 : tensor<8x512xf32>
2026-02-21T08:45:25.8248106Z         %63 = tt.extern_elementwise %62 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:45:25.8248550Z         %64 = "tt.reduce"(%63) <{axis = 1 : i32}> ({
2026-02-21T08:45:25.8248805Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:45:25.8249028Z           %126 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:45:25.8249288Z           tt.reduce.return %126 : f32
2026-02-21T08:45:25.8249519Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:45:25.8249784Z         %65 = arith.addf %59, %64 : tensor<8xf32>
2026-02-21T08:45:25.8250014Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:45:25.8250278Z         %66 = arith.muli %c512_i32, %c1_i32 : i32
2026-02-21T08:45:25.8250509Z         %67 = arith.addi %arg3, %66 : i32
2026-02-21T08:45:25.8250853Z         %68 = tt.descriptor_load %0[%2, %67] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:45:25.8251234Z         %69 = arith.extf %68 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:45:25.8251506Z         %70 = "tt.reduce"(%69) <{axis = 1 : i32}> ({
2026-02-21T08:45:25.8251798Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:45:25.8252024Z           %126 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:45:25.8252284Z           tt.reduce.return %126 : f32
2026-02-21T08:45:25.8252596Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:45:25.8252887Z         %71 = arith.truncf %70 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:45:25.8253200Z         %72 = arith.extf %71 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:45:25.8253544Z         %73 = arith.cmpf ogt, %56, %72 : tensor<8xf32>
2026-02-21T08:45:25.8253798Z         %74 = arith.cmpf une, %56, %56 : tensor<8xf32>
2026-02-21T08:45:25.8254067Z         %75 = arith.ori %73, %74 : tensor<8xi1>
2026-02-21T08:45:25.8254335Z         %76 = arith.select %75, %56, %72 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:45:25.8254633Z         %77 = arith.subf %56, %76 : tensor<8xf32>
2026-02-21T08:45:25.8255041Z         %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:45:25.8255429Z         %79 = arith.mulf %65, %78 : tensor<8xf32>
2026-02-21T08:45:25.8255807Z         %80 = tt.expand_dims %76 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:45:25.8256131Z         %81 = tt.broadcast %80 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:45:25.8256428Z         %82 = arith.subf %69, %81 : tensor<8x512xf32>
2026-02-21T08:45:25.8256830Z         %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:45:25.8257255Z         %84 = "tt.reduce"(%83) <{axis = 1 : i32}> ({
2026-02-21T08:45:25.8257514Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:45:25.8257737Z           %126 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:45:25.8257993Z           tt.reduce.return %126 : f32
2026-02-21T08:45:25.8258214Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:45:25.8258475Z         %85 = arith.addf %79, %84 : tensor<8xf32>
2026-02-21T08:45:25.8258709Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:45:25.8258964Z         %86 = arith.muli %c512_i32, %c2_i32 : i32
2026-02-21T08:45:25.8259221Z         %87 = arith.addi %arg3, %86 : i32
2026-02-21T08:45:25.8259533Z         %88 = tt.descriptor_load %0[%2, %87] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:45:25.8259910Z         %89 = arith.extf %88 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:45:25.8260174Z         %90 = "tt.reduce"(%89) <{axis = 1 : i32}> ({
2026-02-21T08:45:25.8260473Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:45:25.8260691Z           %126 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:45:25.8260948Z           tt.reduce.return %126 : f32
2026-02-21T08:45:25.8261194Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:45:25.8261452Z         %91 = arith.truncf %90 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:45:25.8261801Z         %92 = arith.extf %91 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:45:25.8262071Z         %93 = arith.cmpf ogt, %76, %92 : tensor<8xf32>
2026-02-21T08:45:25.8262345Z         %94 = arith.cmpf une, %76, %76 : tensor<8xf32>
2026-02-21T08:45:25.8262586Z         %95 = arith.ori %93, %94 : tensor<8xi1>
2026-02-21T08:45:25.8262878Z         %96 = arith.select %95, %76, %92 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:45:25.8263174Z         %97 = arith.subf %76, %96 : tensor<8xf32>
2026-02-21T08:45:25.8263563Z         %98 = tt.extern_elementwise %97 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:45:25.8263976Z         %99 = arith.mulf %85, %98 : tensor<8xf32>
2026-02-21T08:45:25.8264264Z         %100 = tt.expand_dims %96 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:45:25.8264623Z         %101 = tt.broadcast %100 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:45:25.8264905Z         %102 = arith.subf %89, %101 : tensor<8x512xf32>
2026-02-21T08:45:25.8265343Z         %103 = tt.extern_elementwise %102 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:45:25.8265782Z         %104 = "tt.reduce"(%103) <{axis = 1 : i32}> ({
2026-02-21T08:45:25.8266087Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:45:25.8266337Z           %126 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:45:25.8266566Z           tt.reduce.return %126 : f32
2026-02-21T08:45:25.8266820Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:45:25.8267058Z         %105 = arith.addf %99, %104 : tensor<8xf32>
2026-02-21T08:45:25.8267327Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T08:45:25.8267585Z         %106 = arith.muli %c512_i32, %c3_i32 : i32
2026-02-21T08:45:25.8267823Z         %107 = arith.addi %arg3, %106 : i32
2026-02-21T08:45:25.8268172Z         %108 = tt.descriptor_load %0[%2, %107] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:45:25.8268532Z         %109 = arith.extf %108 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:45:25.8268831Z         %110 = "tt.reduce"(%109) <{axis = 1 : i32}> ({
2026-02-21T08:45:25.8269058Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:45:25.8269361Z           %126 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:45:25.8269618Z           tt.reduce.return %126 : f32
2026-02-21T08:45:25.8269843Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:45:25.8270132Z         %111 = arith.truncf %110 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:45:25.8270416Z         %112 = arith.extf %111 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:45:25.8270709Z         %113 = arith.cmpf ogt, %96, %112 : tensor<8xf32>
2026-02-21T08:45:25.8270964Z         %114 = arith.cmpf une, %96, %96 : tensor<8xf32>
2026-02-21T08:45:25.8271234Z         %115 = arith.ori %113, %114 : tensor<8xi1>
2026-02-21T08:45:25.8271575Z         %116 = arith.select %115, %96, %112 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:45:25.8271865Z         %117 = arith.subf %96, %116 : tensor<8xf32>
2026-02-21T08:45:25.8272314Z         %118 = tt.extern_elementwise %117 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:45:25.8272737Z         %119 = arith.mulf %105, %118 : tensor<8xf32>
2026-02-21T08:45:25.8273079Z         %120 = tt.expand_dims %116 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:45:25.8273433Z         %121 = tt.broadcast %120 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:45:25.8273765Z         %122 = arith.subf %109, %121 : tensor<8x512xf32>
2026-02-21T08:45:25.8274228Z         %123 = tt.extern_elementwise %122 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:45:25.8274659Z         %124 = "tt.reduce"(%123) <{axis = 1 : i32}> ({
2026-02-21T08:45:25.8274929Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:45:25.8275160Z           %126 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:45:25.8275432Z           tt.reduce.return %126 : f32
2026-02-21T08:45:25.8275663Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:45:25.8275941Z         %125 = arith.addf %119, %124 : tensor<8xf32>
2026-02-21T08:45:25.8276239Z         scf.yield %116, %125 : tensor<8xf32>, tensor<8xf32>
2026-02-21T08:45:25.8276537Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:45:25.8276938Z       %7 = tt.descriptor_load %0[%2, %c8192_i32] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:45:25.8277317Z       %8 = arith.extf %7 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:45:25.8277616Z       %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({
2026-02-21T08:45:25.8277851Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:45:25.8278111Z         %48 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:45:25.8278375Z         tt.reduce.return %48 : f32
2026-02-21T08:45:25.8278610Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:45:25.8278905Z       %10 = arith.truncf %9 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:45:25.8279197Z       %11 = arith.extf %10 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:45:25.8279499Z       %12 = arith.cmpf ogt, %6#0, %11 : tensor<8xf32>
2026-02-21T08:45:25.8279765Z       %13 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32>
2026-02-21T08:45:25.8280134Z       %14 = arith.ori %12, %13 : tensor<8xi1>
2026-02-21T08:45:25.8280433Z       %15 = arith.select %14, %6#0, %11 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:45:25.8280755Z       %16 = arith.subf %6#0, %15 : tensor<8xf32>
2026-02-21T08:45:25.8281341Z       %17 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:45:25.8281986Z       %18 = arith.mulf %6#1, %17 : tensor<8xf32>
2026-02-21T08:45:25.8282334Z       %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:45:25.8282655Z       %20 = tt.broadcast %19 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:45:25.8282954Z       %21 = arith.subf %8, %20 : tensor<8x512xf32>
2026-02-21T08:45:25.8283382Z       %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:45:25.8283872Z       %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({
2026-02-21T08:45:25.8284151Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:45:25.8284373Z         %48 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:45:25.8284639Z         tt.reduce.return %48 : f32
2026-02-21T08:45:25.8284865Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:45:25.8285129Z       %24 = arith.addf %18, %23 : tensor<8xf32>
2026-02-21T08:45:25.8285390Z       %c8192_i32_2 = arith.constant 8192 : i32
2026-02-21T08:45:25.8285663Z       %c2048_i32_3 = arith.constant 2048 : i32
2026-02-21T08:45:25.8286089Z       scf.for %arg3 = %c0_i32 to %c8192_i32_2 step %c2048_i32_3  : i32 {
2026-02-21T08:45:25.8286590Z         %48 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:45:25.8287064Z         %49 = tt.splat %arg3 : i32 -> tensor<512xi32>
2026-02-21T08:45:25.8287313Z         %50 = arith.addi %49, %48 : tensor<512xi32>
2026-02-21T08:45:25.8287636Z         %51 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:45:25.8287971Z         %52 = arith.muli %51, %cst : tensor<8x1xi32>
2026-02-21T08:45:25.8288270Z         %53 = tt.expand_dims %50 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:45:25.8288631Z         %54 = tt.broadcast %52 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:45:25.8288929Z         %55 = tt.broadcast %53 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:45:25.8289232Z         %56 = arith.addi %54, %55 : tensor<8x512xi32>
2026-02-21T08:45:25.8289510Z         %57 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8289857Z         %58 = tt.addptr %57, %56 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:45:25.8290226Z         %59 = tt.load %58 evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8290563Z         %60 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:45:25.8290912Z         %61 = arith.extf %59 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:45:25.8291351Z         %62 = tt.broadcast %60 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:45:25.8291854Z         %63 = arith.subf %61, %62 : tensor<8x512xf32>
2026-02-21T08:45:25.8292623Z         %64 = tt.extern_elementwise %63 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:45:25.8293316Z         %65 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:45:25.8293819Z         %66 = tt.broadcast %65 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:45:25.8294209Z         %67 = arith.divf %64, %66 : tensor<8x512xf32>
2026-02-21T08:45:25.8294636Z         %68 = arith.truncf %67 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:45:25.8295084Z         %69 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8295583Z         %70 = tt.addptr %69, %56 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:45:25.8296053Z         tt.store %70, %68 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8296591Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:45:25.8296939Z         %71 = arith.muli %c512_i32, %c1_i32 : i32
2026-02-21T08:45:25.8297276Z         %72 = arith.addi %arg3, %71 : i32
2026-02-21T08:45:25.8297703Z         %73 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:45:25.8298113Z         %74 = tt.splat %72 : i32 -> tensor<512xi32>
2026-02-21T08:45:25.8298496Z         %75 = arith.addi %74, %73 : tensor<512xi32>
2026-02-21T08:45:25.8298976Z         %76 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:45:25.8299443Z         %77 = arith.muli %76, %cst : tensor<8x1xi32>
2026-02-21T08:45:25.8299952Z         %78 = tt.expand_dims %75 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:45:25.8300452Z         %79 = tt.broadcast %77 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:45:25.8301098Z         %80 = tt.broadcast %78 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:45:25.8301583Z         %81 = arith.addi %79, %80 : tensor<8x512xi32>
2026-02-21T08:45:25.8302038Z         %82 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8302480Z         %83 = tt.addptr %82, %81 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:45:25.8302991Z         %84 = tt.load %83 evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8303542Z         %85 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:45:25.8304013Z         %86 = arith.extf %84 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:45:25.8304487Z         %87 = tt.broadcast %85 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:45:25.8304911Z         %88 = arith.subf %86, %87 : tensor<8x512xf32>
2026-02-21T08:45:25.8305518Z         %89 = tt.extern_elementwise %88 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:45:25.8306250Z         %90 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:45:25.8306732Z         %91 = tt.broadcast %90 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:45:25.8307155Z         %92 = arith.divf %89, %91 : tensor<8x512xf32>
2026-02-21T08:45:25.8307548Z         %93 = arith.truncf %92 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:45:25.8308033Z         %94 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8308522Z         %95 = tt.addptr %94, %81 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:45:25.8308954Z         tt.store %95, %93 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8309331Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:45:25.8309639Z         %96 = arith.muli %c512_i32, %c2_i32 : i32
2026-02-21T08:45:25.8309969Z         %97 = arith.addi %arg3, %96 : i32
2026-02-21T08:45:25.8310239Z         %98 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:45:25.8310554Z         %99 = tt.splat %97 : i32 -> tensor<512xi32>
2026-02-21T08:45:25.8310833Z         %100 = arith.addi %99, %98 : tensor<512xi32>
2026-02-21T08:45:25.8311119Z         %101 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:45:25.8311454Z         %102 = arith.muli %101, %cst : tensor<8x1xi32>
2026-02-21T08:45:25.8311801Z         %103 = tt.expand_dims %100 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:45:25.8312176Z         %104 = tt.broadcast %102 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:45:25.8312486Z         %105 = tt.broadcast %103 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:45:25.8312802Z         %106 = arith.addi %104, %105 : tensor<8x512xi32>
2026-02-21T08:45:25.8313109Z         %107 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8313429Z         %108 = tt.addptr %107, %106 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:45:25.8313806Z         %109 = tt.load %108 evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8314270Z         %110 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:45:25.8314627Z         %111 = arith.extf %109 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:45:25.8314958Z         %112 = tt.broadcast %110 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:45:25.8315236Z         %113 = arith.subf %111, %112 : tensor<8x512xf32>
2026-02-21T08:45:25.8315677Z         %114 = tt.extern_elementwise %113 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:45:25.8316150Z         %115 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:45:25.8316526Z         %116 = tt.broadcast %115 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:45:25.8316821Z         %117 = arith.divf %114, %116 : tensor<8x512xf32>
2026-02-21T08:45:25.8317231Z         %118 = arith.truncf %117 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:45:25.8317599Z         %119 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8317943Z         %120 = tt.addptr %119, %106 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:45:25.8318287Z         tt.store %120, %118 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8318546Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T08:45:25.8318822Z         %121 = arith.muli %c512_i32, %c3_i32 : i32
2026-02-21T08:45:25.8319066Z         %122 = arith.addi %arg3, %121 : i32
2026-02-21T08:45:25.8319388Z         %123 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:45:25.8319729Z         %124 = tt.splat %122 : i32 -> tensor<512xi32>
2026-02-21T08:45:25.8319990Z         %125 = arith.addi %124, %123 : tensor<512xi32>
2026-02-21T08:45:25.8320328Z         %126 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:45:25.8320647Z         %127 = arith.muli %126, %cst : tensor<8x1xi32>
2026-02-21T08:45:25.8321002Z         %128 = tt.expand_dims %125 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:45:25.8321380Z         %129 = tt.broadcast %127 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:45:25.8321737Z         %130 = tt.broadcast %128 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:45:25.8322064Z         %131 = arith.addi %129, %130 : tensor<8x512xi32>
2026-02-21T08:45:25.8322362Z         %132 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8322731Z         %133 = tt.addptr %132, %131 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:45:25.8323099Z         %134 = tt.load %133 evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8323501Z         %135 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:45:25.8323869Z         %136 = arith.extf %134 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:45:25.8324188Z         %137 = tt.broadcast %135 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:45:25.8324520Z         %138 = arith.subf %136, %137 : tensor<8x512xf32>
2026-02-21T08:45:25.8324933Z         %139 = tt.extern_elementwise %138 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:45:25.8325411Z         %140 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:45:25.8325762Z         %141 = tt.broadcast %140 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:45:25.8326040Z         %142 = arith.divf %139, %141 : tensor<8x512xf32>
2026-02-21T08:45:25.8326341Z         %143 = arith.truncf %142 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:45:25.8326654Z         %144 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8327002Z         %145 = tt.addptr %144, %131 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:45:25.8327300Z         tt.store %145, %143 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8327604Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:45:25.8327996Z       %25 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:45:25.8328299Z       %26 = tt.splat %c8192_i32_2 : i32 -> tensor<512xi32>
2026-02-21T08:45:25.8328577Z       %27 = arith.addi %26, %25 : tensor<512xi32>
2026-02-21T08:45:25.8328871Z       %28 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:45:25.8329190Z       %29 = arith.muli %28, %cst : tensor<8x1xi32>
2026-02-21T08:45:25.8329479Z       %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:45:25.8329833Z       %31 = tt.broadcast %29 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:45:25.8330154Z       %32 = tt.broadcast %30 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:45:25.8330419Z       %33 = arith.addi %31, %32 : tensor<8x512xi32>
2026-02-21T08:45:25.8330790Z       %34 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8331102Z       %35 = tt.addptr %34, %33 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:45:25.8331460Z       %36 = tt.load %35 evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8331813Z       %37 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:45:25.8332159Z       %38 = arith.extf %36 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:45:25.8332476Z       %39 = tt.broadcast %37 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:45:25.8332803Z       %40 = arith.subf %38, %39 : tensor<8x512xf32>
2026-02-21T08:45:25.8333232Z       %41 = tt.extern_elementwise %40 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:45:25.8333675Z       %42 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:45:25.8334013Z       %43 = tt.broadcast %42 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:45:25.8334311Z       %44 = arith.divf %41, %43 : tensor<8x512xf32>
2026-02-21T08:45:25.8334577Z       %45 = arith.truncf %44 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:45:25.8334903Z       %46 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8335205Z       %47 = tt.addptr %46, %33 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:45:25.8335523Z       tt.store %47, %45 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:45:25.8335832Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:45:25.8336152Z     tt.return
2026-02-21T08:45:25.8336355Z   }
2026-02-21T08:45:25.8336517Z }
2026-02-21T08:45:25.8336609Z 
2026-02-21T08:45:25.8336709Z {-#
2026-02-21T08:45:25.8336878Z   external_resources: {
2026-02-21T08:45:25.8337110Z     mlir_reproducer: {
2026-02-21T08:45:25.8341491Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:45:25.8346103Z       disable_threading: false,
2026-02-21T08:45:25.8346338Z       verify_each: true
2026-02-21T08:45:25.8346521Z     }
2026-02-21T08:45:25.8346708Z   }
2026-02-21T08:45:25.8346862Z #-}
2026-02-21T08:45:25.8347360Z /tmp/torchinductor_root/ho/cho633uoi2rsxfmseduqhol3myx3aexbxpany4zvj3jhnwb2qnfz.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:45:25.8348691Z /tmp/torchinductor_root/ho/cho633uoi2rsxfmseduqhol3myx3aexbxpany4zvj3jhnwb2qnfz.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:45:25.8349773Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:45:25.8350948Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 512], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'last'], maxnreg=32, num_sm_multiplier=64, num_stages=7, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[4, 4], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:45:25.8352031Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:45:25.8352338Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:45:30.5708996Z module attributes {ttg.maxnreg = 256 : i32} {
2026-02-21T08:45:30.5713823Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:45:30.5718438Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:45:30.5722155Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:45:30.5726229Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:45:30.5731031Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:45:30.5733234Z     %cst = arith.constant dense<8704> : tensor<128x1xi32>
2026-02-21T08:45:30.5733587Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<128xf32>
2026-02-21T08:45:30.5733941Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<128xf32>
2026-02-21T08:45:30.5734220Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:45:30.5734526Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:45:30.5734776Z     %c8704_i32 = arith.constant 8704 : i32
2026-02-21T08:45:30.5735031Z     %c8704_i64 = arith.constant 8704 : i64
2026-02-21T08:45:30.5735281Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:45:30.5735646Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8704_i32], [%c8704_i64, %c1_i64] : <f16>, <tensor<128x256xf16>>
2026-02-21T08:45:30.5736041Z     %1 = tt.get_program_id x : i32
2026-02-21T08:45:30.5736259Z     %2 = arith.addi %1, %c1_i32 : i32
2026-02-21T08:45:30.5736510Z     %3 = arith.minsi %2, %c32_i32 : i32
2026-02-21T08:45:30.5736772Z     scf.for %arg2 = %1 to %3 step %c1_i32  : i32 {
2026-02-21T08:45:30.5737102Z       %4 = arith.muli %arg2, %c128_i32 : i32
2026-02-21T08:45:30.5737378Z       %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:45:30.5737709Z       %6 = tt.splat %4 : i32 -> tensor<128xi32>
2026-02-21T08:45:30.5737971Z       %7 = arith.addi %6, %5 : tensor<128xi32>
2026-02-21T08:45:30.5738209Z       %c8448_i32 = arith.constant 8448 : i32
2026-02-21T08:45:30.5738817Z       %c768_i32 = arith.constant 768 : i32
2026-02-21T08:45:30.5739238Z       %8:2 = scf.for %arg3 = %c0_i32 to %c8448_i32 step %c768_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<128xf32>, tensor<128xf32>)  : i32 {
2026-02-21T08:45:30.5739789Z         %48 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:45:30.5740215Z         %49 = arith.extf %48 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:45:30.5740514Z         %50 = "tt.reduce"(%49) <{axis = 1 : i32}> ({
2026-02-21T08:45:30.5740795Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:45:30.5741039Z           %106 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:45:30.5741322Z           tt.reduce.return %106 : f32
2026-02-21T08:45:30.5741634Z         }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:45:30.5741954Z         %51 = arith.truncf %50 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:45:30.5742454Z         %52 = arith.extf %51 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:45:30.5742758Z         %53 = arith.cmpf ogt, %arg4, %52 : tensor<128xf32>
2026-02-21T08:45:30.5743085Z         %54 = arith.cmpf une, %arg4, %arg4 : tensor<128xf32>
2026-02-21T08:45:30.5743353Z         %55 = arith.ori %53, %54 : tensor<128xi1>
2026-02-21T08:45:30.5743682Z         %56 = arith.select %55, %arg4, %52 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:45:30.5744078Z         %57 = arith.subf %arg4, %56 : tensor<128xf32>
2026-02-21T08:45:30.5744537Z         %58 = tt.extern_elementwise %57 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:45:30.5744964Z         %59 = arith.mulf %arg5, %58 : tensor<128xf32>
2026-02-21T08:45:30.5745307Z         %60 = tt.expand_dims %56 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:45:30.5745668Z         %61 = tt.broadcast %60 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:45:30.5746005Z         %62 = arith.subf %49, %61 : tensor<128x256xf32>
2026-02-21T08:45:30.5746443Z         %63 = tt.extern_elementwise %62 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:45:30.5746914Z         %64 = "tt.reduce"(%63) <{axis = 1 : i32}> ({
2026-02-21T08:45:30.5747192Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:45:30.5747430Z           %106 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:45:30.5747706Z           tt.reduce.return %106 : f32
2026-02-21T08:45:30.5747946Z         }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:45:30.5748230Z         %65 = arith.addf %59, %64 : tensor<128xf32>
2026-02-21T08:45:30.5748476Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T08:45:30.5748758Z         %66 = arith.muli %c256_i32, %c1_i32_4 : i32
2026-02-21T08:45:30.5749030Z         %67 = arith.addi %arg3, %66 : i32
2026-02-21T08:45:30.5749376Z         %68 = tt.descriptor_load %0[%4, %67] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:45:30.5749799Z         %69 = arith.extf %68 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:45:30.5750093Z         %70 = "tt.reduce"(%69) <{axis = 1 : i32}> ({
2026-02-21T08:45:30.5750359Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:45:30.5750596Z           %106 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:45:30.5750876Z           tt.reduce.return %106 : f32
2026-02-21T08:45:30.5751126Z         }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:45:30.5751395Z         %71 = arith.truncf %70 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:45:30.5751753Z         %72 = arith.extf %71 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:45:30.5752024Z         %73 = arith.cmpf ogt, %56, %72 : tensor<128xf32>
2026-02-21T08:45:30.5752305Z         %74 = arith.cmpf une, %56, %56 : tensor<128xf32>
2026-02-21T08:45:30.5752546Z         %75 = arith.ori %73, %74 : tensor<128xi1>
2026-02-21T08:45:30.5752848Z         %76 = arith.select %75, %56, %72 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:45:30.5753231Z         %77 = arith.subf %56, %76 : tensor<128xf32>
2026-02-21T08:45:30.5753629Z         %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:45:30.5754053Z         %79 = arith.mulf %65, %78 : tensor<128xf32>
2026-02-21T08:45:30.5754347Z         %80 = tt.expand_dims %76 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:45:30.5754710Z         %81 = tt.broadcast %80 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:45:30.5754995Z         %82 = arith.subf %69, %81 : tensor<128x256xf32>
2026-02-21T08:45:30.5755430Z         %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:45:30.5755872Z         %84 = "tt.reduce"(%83) <{axis = 1 : i32}> ({
2026-02-21T08:45:30.5756104Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:45:30.5756436Z           %106 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:45:30.5756664Z           tt.reduce.return %106 : f32
2026-02-21T08:45:30.5756927Z         }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:45:30.5757165Z         %85 = arith.addf %79, %84 : tensor<128xf32>
2026-02-21T08:45:30.5757429Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:45:30.5757694Z         %86 = arith.muli %c256_i32, %c2_i32 : i32
2026-02-21T08:45:30.5757929Z         %87 = arith.addi %arg3, %86 : i32
2026-02-21T08:45:30.5758271Z         %88 = tt.descriptor_load %0[%4, %87] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:45:30.5758630Z         %89 = arith.extf %88 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:45:30.5758935Z         %90 = "tt.reduce"(%89) <{axis = 1 : i32}> ({
2026-02-21T08:45:30.5759161Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:45:30.5759406Z           %106 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:45:30.5759664Z           tt.reduce.return %106 : f32
2026-02-21T08:45:30.5759893Z         }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:45:30.5760187Z         %91 = arith.truncf %90 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:45:30.5760472Z         %92 = arith.extf %91 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:45:30.5760771Z         %93 = arith.cmpf ogt, %76, %92 : tensor<128xf32>
2026-02-21T08:45:30.5761024Z         %94 = arith.cmpf une, %76, %76 : tensor<128xf32>
2026-02-21T08:45:30.5761296Z         %95 = arith.ori %93, %94 : tensor<128xi1>
2026-02-21T08:45:30.5761632Z         %96 = arith.select %95, %76, %92 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:45:30.5761905Z         %97 = arith.subf %76, %96 : tensor<128xf32>
2026-02-21T08:45:30.5762329Z         %98 = tt.extern_elementwise %97 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:45:30.5762723Z         %99 = arith.mulf %85, %98 : tensor<128xf32>
2026-02-21T08:45:30.5763053Z         %100 = tt.expand_dims %96 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:45:30.5763403Z         %101 = tt.broadcast %100 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:45:30.5763731Z         %102 = arith.subf %89, %101 : tensor<128x256xf32>
2026-02-21T08:45:30.5764179Z         %103 = tt.extern_elementwise %102 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:45:30.5764599Z         %104 = "tt.reduce"(%103) <{axis = 1 : i32}> ({
2026-02-21T08:45:30.5764868Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:45:30.5765096Z           %106 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:45:30.5765353Z           tt.reduce.return %106 : f32
2026-02-21T08:45:30.5765582Z         }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:45:30.5765853Z         %105 = arith.addf %99, %104 : tensor<128xf32>
2026-02-21T08:45:30.5766144Z         scf.yield %96, %105 : tensor<128xf32>, tensor<128xf32>
2026-02-21T08:45:30.5766465Z       } {tt.num_stages = 1 : i32}
2026-02-21T08:45:30.5766823Z       %9 = tt.descriptor_load %0[%4, %c8448_i32] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:45:30.5767200Z       %10 = arith.extf %9 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:45:30.5767502Z       %11 = "tt.reduce"(%10) <{axis = 1 : i32}> ({
2026-02-21T08:45:30.5767736Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:45:30.5767993Z         %48 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:45:30.5768250Z         tt.reduce.return %48 : f32
2026-02-21T08:45:30.5768475Z       }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:45:30.5768768Z       %12 = arith.truncf %11 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:45:30.5769052Z       %13 = arith.extf %12 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:45:30.5769359Z       %14 = arith.cmpf ogt, %8#0, %13 : tensor<128xf32>
2026-02-21T08:45:30.5769616Z       %15 = arith.cmpf une, %8#0, %8#0 : tensor<128xf32>
2026-02-21T08:45:30.5769961Z       %16 = arith.ori %14, %15 : tensor<128xi1>
2026-02-21T08:45:30.5770258Z       %17 = arith.select %16, %8#0, %13 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:45:30.5770535Z       %18 = arith.subf %8#0, %17 : tensor<128xf32>
2026-02-21T08:45:30.5770956Z       %19 = tt.extern_elementwise %18 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:45:30.5771343Z       %20 = arith.mulf %8#1, %19 : tensor<128xf32>
2026-02-21T08:45:30.5771712Z       %21 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:45:30.5772046Z       %22 = tt.broadcast %21 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:45:30.5772350Z       %23 = arith.subf %10, %22 : tensor<128x256xf32>
2026-02-21T08:45:30.5772782Z       %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:45:30.5773192Z       %25 = "tt.reduce"(%24) <{axis = 1 : i32}> ({
2026-02-21T08:45:30.5773468Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:45:30.5773692Z         %48 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:45:30.5773943Z         tt.reduce.return %48 : f32
2026-02-21T08:45:30.5774164Z       }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:45:30.5774430Z       %26 = arith.addf %20, %25 : tensor<128xf32>
2026-02-21T08:45:30.5774693Z       %c8448_i32_2 = arith.constant 8448 : i32
2026-02-21T08:45:30.5774923Z       %c768_i32_3 = arith.constant 768 : i32
2026-02-21T08:45:30.5775218Z       scf.for %arg3 = %c0_i32 to %c8448_i32_2 step %c768_i32_3  : i32 {
2026-02-21T08:45:30.5775542Z         %48 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:45:30.5775868Z         %49 = tt.splat %arg3 : i32 -> tensor<256xi32>
2026-02-21T08:45:30.5776109Z         %50 = arith.addi %49, %48 : tensor<256xi32>
2026-02-21T08:45:30.5776477Z         %51 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:45:30.5776896Z         %52 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:45:30.5777230Z         %53 = arith.extf %51 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:45:30.5777564Z         %54 = tt.broadcast %52 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:45:30.5777894Z         %55 = arith.subf %53, %54 : tensor<128x256xf32>
2026-02-21T08:45:30.5778330Z         %56 = tt.extern_elementwise %55 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:45:30.5778791Z         %57 = tt.expand_dims %26 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:45:30.5779151Z         %58 = tt.broadcast %57 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:45:30.5779437Z         %59 = arith.divf %56, %58 : tensor<128x256xf32>
2026-02-21T08:45:30.5779755Z         %60 = arith.truncf %59 : tensor<128x256xf32> to tensor<128x256xf16>
2026-02-21T08:45:30.5780244Z         %61 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:45:30.5780552Z         %62 = arith.muli %61, %cst : tensor<128x1xi32>
2026-02-21T08:45:30.5780877Z         %63 = tt.expand_dims %50 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32>
2026-02-21T08:45:30.5781208Z         %64 = tt.broadcast %62 : tensor<128x1xi32> -> tensor<128x256xi32>
2026-02-21T08:45:30.5781583Z         %65 = tt.broadcast %63 : tensor<1x256xi32> -> tensor<128x256xi32>
2026-02-21T08:45:30.5781889Z         %66 = arith.addi %64, %65 : tensor<128x256xi32>
2026-02-21T08:45:30.5782171Z         %67 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:45:30.5782527Z         %68 = tt.addptr %67, %66 : tensor<128x256x!tt.ptr<f16>>, tensor<128x256xi32>
2026-02-21T08:45:30.5782834Z         tt.store %68, %60 : tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:45:30.5783115Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T08:45:30.5783414Z         %69 = arith.muli %c256_i32, %c1_i32_4 : i32
2026-02-21T08:45:30.5783681Z         %70 = arith.addi %arg3, %69 : i32
2026-02-21T08:45:30.5783955Z         %71 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:45:30.5784294Z         %72 = tt.splat %70 : i32 -> tensor<256xi32>
2026-02-21T08:45:30.5784581Z         %73 = arith.addi %72, %71 : tensor<256xi32>
2026-02-21T08:45:30.5784929Z         %74 = tt.descriptor_load %0[%4, %70] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:45:30.5785363Z         %75 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:45:30.5785710Z         %76 = arith.extf %74 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:45:30.5786059Z         %77 = tt.broadcast %75 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:45:30.5786384Z         %78 = arith.subf %76, %77 : tensor<128x256xf32>
2026-02-21T08:45:30.5786817Z         %79 = tt.extern_elementwise %78 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:45:30.5787339Z         %80 = tt.expand_dims %26 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:45:30.5787686Z         %81 = tt.broadcast %80 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:45:30.5788013Z         %82 = arith.divf %79, %81 : tensor<128x256xf32>
2026-02-21T08:45:30.5788304Z         %83 = arith.truncf %82 : tensor<128x256xf32> to tensor<128x256xf16>
2026-02-21T08:45:30.5788679Z         %84 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:45:30.5789026Z         %85 = arith.muli %84, %cst : tensor<128x1xi32>
2026-02-21T08:45:30.5789333Z         %86 = tt.expand_dims %73 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32>
2026-02-21T08:45:30.5789702Z         %87 = tt.broadcast %85 : tensor<128x1xi32> -> tensor<128x256xi32>
2026-02-21T08:45:30.5790017Z         %88 = tt.broadcast %86 : tensor<1x256xi32> -> tensor<128x256xi32>
2026-02-21T08:45:30.5790341Z         %89 = arith.addi %87, %88 : tensor<128x256xi32>
2026-02-21T08:45:30.5790662Z         %90 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:45:30.5791000Z         %91 = tt.addptr %90, %89 : tensor<128x256x!tt.ptr<f16>>, tensor<128x256xi32>
2026-02-21T08:45:30.5791341Z         tt.store %91, %83 : tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:45:30.5791619Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:45:30.5791891Z         %92 = arith.muli %c256_i32, %c2_i32 : i32
2026-02-21T08:45:30.5792134Z         %93 = arith.addi %arg3, %92 : i32
2026-02-21T08:45:30.5792444Z         %94 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:45:30.5792772Z         %95 = tt.splat %93 : i32 -> tensor<256xi32>
2026-02-21T08:45:30.5793032Z         %96 = arith.addi %95, %94 : tensor<256xi32>
2026-02-21T08:45:30.5793387Z         %97 = tt.descriptor_load %0[%4, %93] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:45:30.5793825Z         %98 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:45:30.5794177Z         %99 = arith.extf %97 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:45:30.5794496Z         %100 = tt.broadcast %98 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:45:30.5794817Z         %101 = arith.subf %99, %100 : tensor<128x256xf32>
2026-02-21T08:45:30.5795281Z         %102 = tt.extern_elementwise %101 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:45:30.5795756Z         %103 = tt.expand_dims %26 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:45:30.5796125Z         %104 = tt.broadcast %103 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:45:30.5796421Z         %105 = arith.divf %102, %104 : tensor<128x256xf32>
2026-02-21T08:45:30.5796748Z         %106 = arith.truncf %105 : tensor<128x256xf32> to tensor<128x256xf16>
2026-02-21T08:45:30.5797178Z         %107 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:45:30.5797496Z         %108 = arith.muli %107, %cst : tensor<128x1xi32>
2026-02-21T08:45:30.5797830Z         %109 = tt.expand_dims %96 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32>
2026-02-21T08:45:30.5798174Z         %110 = tt.broadcast %108 : tensor<128x1xi32> -> tensor<128x256xi32>
2026-02-21T08:45:30.5798524Z         %111 = tt.broadcast %109 : tensor<1x256xi32> -> tensor<128x256xi32>
2026-02-21T08:45:30.5798817Z         %112 = arith.addi %110, %111 : tensor<128x256xi32>
2026-02-21T08:45:30.5799139Z         %113 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:45:30.5799511Z         %114 = tt.addptr %113, %112 : tensor<128x256x!tt.ptr<f16>>, tensor<128x256xi32>
2026-02-21T08:45:30.5799835Z         tt.store %114, %106 : tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:45:30.5800115Z       } {tt.num_stages = 1 : i32}
2026-02-21T08:45:30.5800391Z       %27 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:45:30.5800731Z       %28 = tt.splat %c8448_i32_2 : i32 -> tensor<256xi32>
2026-02-21T08:45:30.5800990Z       %29 = arith.addi %28, %27 : tensor<256xi32>
2026-02-21T08:45:30.5801374Z       %30 = tt.descriptor_load %0[%4, %c8448_i32_2] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:45:30.5801848Z       %31 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:45:30.5802182Z       %32 = arith.extf %30 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:45:30.5802523Z       %33 = tt.broadcast %31 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:45:30.5802813Z       %34 = arith.subf %32, %33 : tensor<128x256xf32>
2026-02-21T08:45:30.5803270Z       %35 = tt.extern_elementwise %34 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:45:30.5803767Z       %36 = tt.expand_dims %26 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:45:30.5804109Z       %37 = tt.broadcast %36 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:45:30.5804425Z       %38 = arith.divf %35, %37 : tensor<128x256xf32>
2026-02-21T08:45:30.5804712Z       %39 = arith.truncf %38 : tensor<128x256xf32> to tensor<128x256xf16>
2026-02-21T08:45:30.5805068Z       %40 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:45:30.5805406Z       %41 = arith.muli %40, %cst : tensor<128x1xi32>
2026-02-21T08:45:30.5805704Z       %42 = tt.expand_dims %29 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32>
2026-02-21T08:45:30.5806062Z       %43 = tt.broadcast %41 : tensor<128x1xi32> -> tensor<128x256xi32>
2026-02-21T08:45:30.5806369Z       %44 = tt.broadcast %42 : tensor<1x256xi32> -> tensor<128x256xi32>
2026-02-21T08:45:30.5806673Z       %45 = arith.addi %43, %44 : tensor<128x256xi32>
2026-02-21T08:45:30.5806985Z       %46 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:45:30.5807399Z       %47 = tt.addptr %46, %45 : tensor<128x256x!tt.ptr<f16>>, tensor<128x256xi32>
2026-02-21T08:45:30.5807724Z       tt.store %47, %39 : tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:45:30.5807987Z     } {tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:45:30.5808246Z     tt.return
2026-02-21T08:45:30.5808417Z   }
2026-02-21T08:45:30.5808607Z }
2026-02-21T08:45:30.5808700Z 
2026-02-21T08:45:30.5808771Z {-#
2026-02-21T08:45:30.5808969Z   external_resources: {
2026-02-21T08:45:30.5809162Z     mlir_reproducer: {
2026-02-21T08:45:30.5813612Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:45:30.5818195Z       disable_threading: false,
2026-02-21T08:45:30.5818442Z       verify_each: true
2026-02-21T08:45:30.5818632Z     }
2026-02-21T08:45:30.5818822Z   }
2026-02-21T08:45:30.5818977Z #-}
2026-02-21T08:45:30.5819470Z /tmp/torchinductor_root/xo/cxoakznh6e5gx4h7szyr7cv3z4x2fdmhship7mdqmxtbzwj433yn.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:45:30.5820707Z /tmp/torchinductor_root/xo/cxoakznh6e5gx4h7szyr7cv3z4x2fdmhship7mdqmxtbzwj433yn.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:45:30.5821787Z [45s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:45:30.5822957Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=256, num_sm_multiplier=16, num_stages=8, num_warps=16, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:45:30.5824007Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:45:30.5824304Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:45:34.2292598Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 10.8 configs/s
2026-02-21T08:45:34.2303439Z [48s] Adaptive compile timeout: 30s (90% percentile=9.9s, bounds=[30.0s, 30s])
2026-02-21T08:45:35.0098310Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1267.7 configs/s
2026-02-21T08:45:35.0735799Z [49s] Initial random population of 100, 5 starting points: 
2026-02-21T08:45:35.0740031Z error=6
2026-02-21T08:45:35.0744232Z timeout=1
2026-02-21T08:45:35.0748733Z ok=93
2026-02-21T08:45:35.0754023Z min=0.0451
2026-02-21T08:45:35.0758856Z mid=0.7834
2026-02-21T08:45:35.0764253Z max=44.0392
2026-02-21T08:45:35.0764593Z best={'block_sizes': [1, 16384],
2026-02-21T08:45:35.0764918Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:45:35.0770746Z  'load_eviction_policies': ['last', ''],
2026-02-21T08:45:35.0775283Z  'num_sm_multiplier': 8,
2026-02-21T08:45:35.0779315Z  'num_stages': 3,
2026-02-21T08:45:35.0784487Z  'num_warps': 1,
2026-02-21T08:45:35.0789068Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:45:35.0793087Z  'range_flattens': [False, None],
2026-02-21T08:45:35.0799254Z  'range_multi_buffers': [True, True],
2026-02-21T08:45:35.0804190Z  'range_num_stages': [1, 2],
2026-02-21T08:45:35.0804525Z  'range_unroll_factors': [0, 1],
2026-02-21T08:45:35.0804818Z  'range_warp_specializes': [True, None]}
2026-02-21T08:45:35.0810074Z [49s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:45:36.1323570Z [50s] Generation 1 starting: 77 neighbors, 5 active search path(s)
2026-02-21T08:45:51.1329191Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 3.8 configs/s
2026-02-21T08:45:56.6581288Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 82/82 14.9 configs/s
2026-02-21T08:46:01.4863499Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 209.4 configs/s
2026-02-21T08:46:01.7538173Z [76s] Generation 1 complete: 
2026-02-21T08:46:01.7540564Z ok=83
2026-02-21T08:46:01.7540787Z min=0.0430
2026-02-21T08:46:01.7540993Z mid=0.0594
2026-02-21T08:46:01.7541199Z max=0.2048
2026-02-21T08:46:01.7541425Z best={'block_sizes': [1, 16384],
2026-02-21T08:46:01.7541961Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:46:01.7542283Z  'load_eviction_policies': ['last', ''],
2026-02-21T08:46:01.7542539Z  'num_stages': 3,
2026-02-21T08:46:01.7542724Z  'num_warps': 1,
2026-02-21T08:46:01.7542938Z  'pid_type': 'flat',
2026-02-21T08:46:01.7543136Z  'range_flattens': [None, None],
2026-02-21T08:46:01.7543388Z  'range_multi_buffers': [None, None],
2026-02-21T08:46:01.7543612Z  'range_num_stages': [0, 2],
2026-02-21T08:46:01.7543845Z  'range_unroll_factors': [0, 1],
2026-02-21T08:46:01.7544064Z  'range_warp_specializes': [None, True]}
2026-02-21T08:46:01.7552681Z [76s] Fitting surrogate: 183 points, 183 targets
2026-02-21T08:46:02.5674232Z [77s] Generation 2 starting: 55 neighbors, 5 active search path(s)
2026-02-21T08:46:12.6312966Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57/57 3.3 configs/s
2026-02-21T08:46:16.6946706Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 57/57 14.1 configs/s
2026-02-21T08:46:19.6157860Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 443.3 configs/s
2026-02-21T08:46:19.7583692Z [94s] Generation 2 complete: 
2026-02-21T08:46:19.7587749Z ok=60
2026-02-21T08:46:19.7592170Z min=0.0348
2026-02-21T08:46:19.7593946Z mid=0.0553
2026-02-21T08:46:19.7594202Z max=0.1639
2026-02-21T08:46:19.7594395Z best={'block_sizes': [1, 16384],
2026-02-21T08:46:19.7594707Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:46:19.7594992Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:46:19.7595255Z  'num_stages': 3,
2026-02-21T08:46:19.7595437Z  'num_warps': 1,
2026-02-21T08:46:19.7595638Z  'pid_type': 'flat',
2026-02-21T08:46:19.7604909Z  'range_flattens': [None, None],
2026-02-21T08:46:19.7605151Z  'range_multi_buffers': [None, None],
2026-02-21T08:46:19.7605410Z  'range_num_stages': [0, 2],
2026-02-21T08:46:19.7605969Z  'range_unroll_factors': [0, 1],
2026-02-21T08:46:19.7606189Z  'range_warp_specializes': [None, True]}
2026-02-21T08:46:19.7606862Z [94s] Fitting surrogate: 243 points, 243 targets
2026-02-21T08:46:20.4056457Z [95s] Generation 3 starting: 43 neighbors, 4 active search path(s)
2026-02-21T08:46:27.4099548Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45/45 3.7 configs/s
2026-02-21T08:46:30.6686903Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 45/45 13.9 configs/s
2026-02-21T08:46:32.2426505Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 638.3 configs/s
2026-02-21T08:46:32.3579741Z [107s] Generation 3 complete: 
2026-02-21T08:46:32.3581124Z ok=47
2026-02-21T08:46:32.3581374Z min=0.0347
2026-02-21T08:46:32.3581835Z mid=0.0553
2026-02-21T08:46:32.3582039Z max=1.2422
2026-02-21T08:46:32.3582630Z best={'block_sizes': [1, 16384],
2026-02-21T08:46:32.3582965Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:46:32.3583249Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:46:32.3583504Z  'num_stages': 3,
2026-02-21T08:46:32.3583678Z  'num_warps': 1,
2026-02-21T08:46:32.3583885Z  'pid_type': 'flat',
2026-02-21T08:46:32.3584105Z  'range_flattens': [None, None],
2026-02-21T08:46:32.3584315Z  'range_multi_buffers': [None, None],
2026-02-21T08:46:32.3584557Z  'range_num_stages': [0, 2],
2026-02-21T08:46:32.3584763Z  'range_unroll_factors': [0, 1],
2026-02-21T08:46:32.3585006Z  'range_warp_specializes': [None, True]}
2026-02-21T08:46:32.3595841Z [107s] Fitting surrogate: 290 points, 290 targets
2026-02-21T08:46:32.8943245Z [107s] Generation 4 starting: 32 neighbors, 3 active search path(s)
2026-02-21T08:46:37.7933535Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 7.2 configs/s
2026-02-21T08:46:39.7433973Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 33/33 17.3 configs/s
2026-02-21T08:46:41.0894253Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 745.2 configs/s
2026-02-21T08:46:41.1841615Z [115s] Generation 4 complete: 
2026-02-21T08:46:41.1845502Z ok=35
2026-02-21T08:46:41.1849804Z min=0.0348
2026-02-21T08:46:41.1851972Z mid=0.0532
2026-02-21T08:46:41.1852216Z max=0.1537
2026-02-21T08:46:41.1852478Z best={'block_sizes': [1, 16384],
2026-02-21T08:46:41.1852791Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:46:41.1853104Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:46:41.1853378Z  'num_stages': 3,
2026-02-21T08:46:41.1853602Z  'num_warps': 2,
2026-02-21T08:46:41.1853854Z  'pid_type': 'flat',
2026-02-21T08:46:41.1854052Z  'range_flattens': [None, None],
2026-02-21T08:46:41.1854303Z  'range_multi_buffers': [None, False],
2026-02-21T08:46:41.1854508Z  'range_num_stages': [0, 1],
2026-02-21T08:46:41.1854794Z  'range_unroll_factors': [0, 1],
2026-02-21T08:46:41.1855086Z  'range_warp_specializes': [None, True]}
2026-02-21T08:46:41.1855391Z [115s] Fitting surrogate: 325 points, 325 targets
2026-02-21T08:46:41.6091043Z [116s] Generation 5 starting: 20 neighbors, 2 active search path(s)
2026-02-21T08:46:45.3620077Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 6.7 configs/s
2026-02-21T08:46:46.6725485Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 22/22 17.4 configs/s
2026-02-21T08:46:47.7327004Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 940.7 configs/s
2026-02-21T08:46:47.8141277Z [122s] Generation 5 complete: 
2026-02-21T08:46:47.8145142Z ok=23
2026-02-21T08:46:47.8148941Z min=0.0348
2026-02-21T08:46:47.8152854Z mid=0.0513
2026-02-21T08:46:47.8154425Z max=0.1331
2026-02-21T08:46:47.8154638Z best={'block_sizes': [1, 16384],
2026-02-21T08:46:47.8155280Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:46:47.8155708Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:46:47.8155963Z  'num_stages': 3,
2026-02-21T08:46:47.8156143Z  'num_warps': 2,
2026-02-21T08:46:47.8156345Z  'pid_type': 'flat',
2026-02-21T08:46:47.8156565Z  'range_flattens': [None, None],
2026-02-21T08:46:47.8156781Z  'range_multi_buffers': [None, False],
2026-02-21T08:46:47.8157028Z  'range_num_stages': [0, 1],
2026-02-21T08:46:47.8157231Z  'range_unroll_factors': [0, 1],
2026-02-21T08:46:47.8157478Z  'range_warp_specializes': [None, True]}
2026-02-21T08:46:47.8157805Z [122s] Fitting surrogate: 348 points, 348 targets
2026-02-21T08:46:48.1229737Z [122s] Generation 6 starting: 13 neighbors, 1 active search path(s)
2026-02-21T08:46:51.1131458Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 4.2 configs/s
2026-02-21T08:46:51.8949192Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 13/13 17.7 configs/s
2026-02-21T08:46:52.8434473Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1050.5 configs/s
2026-02-21T08:46:52.9182969Z [127s] Generation 6 complete: 
2026-02-21T08:46:52.9186903Z ok=15
2026-02-21T08:46:52.9191484Z min=0.0348
2026-02-21T08:46:52.9196438Z mid=0.0471
2026-02-21T08:46:52.9198354Z max=0.0798
2026-02-21T08:46:52.9198654Z best={'block_sizes': [1, 16384],
2026-02-21T08:46:52.9204439Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:46:52.9206493Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:46:52.9206795Z  'num_stages': 3,
2026-02-21T08:46:52.9206987Z  'num_warps': 2,
2026-02-21T08:46:52.9207202Z  'pid_type': 'flat',
2026-02-21T08:46:52.9207406Z  'range_flattens': [None, None],
2026-02-21T08:46:52.9207655Z  'range_multi_buffers': [None, False],
2026-02-21T08:46:52.9207909Z  'range_num_stages': [0, 1],
2026-02-21T08:46:52.9208124Z  'range_unroll_factors': [0, 1],
2026-02-21T08:46:52.9208382Z  'range_warp_specializes': [None, True]}
2026-02-21T08:46:52.9208641Z [127s] Fitting surrogate: 363 points, 363 targets
2026-02-21T08:46:53.0962996Z [127s] Autotuning complete in 127.8s after searching 351 configs.
2026-02-21T08:46:53.0967058Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:46:53.0972277Z     @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:46:53.0973369Z 
2026-02-21T08:46:53.0977922Z [127s] Code of selected kernel: /tmp/torchinductor_root/gr/cgrf4vzaiybqq2vtl4f4ipkrbsrfxtqdt5p3nddu75heazgfy3hn.py
2026-02-21T08:46:53.7931333Z WARNING:tritonbench.utils.triton_op:Completed input ID 66:
2026-02-21T08:46:53.7933789Z (M, N)
2026-02-21T08:46:53.7934024Z ------------
2026-02-21T08:46:53.7934207Z (4096, 8704)
2026-02-21T08:46:53.7934398Z 
2026-02-21T08:46:53.7946577Z  70%|███████   | 14/20 [37:51<16:37, 166.31s/it]WARNING:tritonbench.utils.triton_op:Running input ID 71:
2026-02-21T08:46:53.7950687Z (M, N)
2026-02-21T08:46:53.7952189Z ------------
2026-02-21T08:46:53.7952435Z (4096, 9344)
2026-02-21T08:46:53.7952804Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T08:46:55.0184371Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T08:46:56.3716248Z INFO:tritonbench.utils.triton_op:Took 2.60ms to get benchmark function for torch_compile_softmax
2026-02-21T08:47:00.0684098Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:47:00.0689597Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:47:00.0693675Z               'dtype': 'torch.float16',
2026-02-21T08:47:00.0696960Z               'shape': (4096, 9344),
2026-02-21T08:47:00.0699356Z               'stride': (9344, 1)},),
2026-02-21T08:47:00.0699680Z   'kwargs': {}}
2026-02-21T08:47:00.0704539Z INFO:tritonbench.utils.triton_op:Took 2.40ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T08:47:00.2490805Z [0s] Autotune random seed: 2134816249
2026-02-21T08:47:00.3925338Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:47:39.6882572Z [39s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[4, 1], range_warp_specializes=[None, None])
2026-02-21T08:47:39.6897224Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s
2026-02-21T08:47:42.0299828Z module {
2026-02-21T08:47:42.0304570Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:47:42.0309600Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:47:42.0310962Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:47:42.0311242Z     %c148_i32 = arith.constant 148 : i32
2026-02-21T08:47:42.0311905Z     %cst = arith.constant dense<9344> : tensor<128x1xi32>
2026-02-21T08:47:42.0312227Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<128xf32>
2026-02-21T08:47:42.0312566Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<128xf32>
2026-02-21T08:47:42.0312856Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:47:42.0313087Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:47:42.0313320Z     %c9344_i32 = arith.constant 9344 : i32
2026-02-21T08:47:42.0313539Z     %c9344_i64 = arith.constant 9344 : i64
2026-02-21T08:47:42.0313804Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:47:42.0314185Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c9344_i32], [%c9344_i64, %c1_i64] : <f16>, <tensor<128x128xf16>>
2026-02-21T08:47:42.0314574Z     %1 = tt.get_program_id x : i32
2026-02-21T08:47:42.0314855Z     scf.for %arg2 = %1 to %c32_i32 step %c148_i32  : i32 {
2026-02-21T08:47:42.0315117Z       %2 = arith.muli %arg2, %c128_i32 : i32
2026-02-21T08:47:42.0315445Z       %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:47:42.0315758Z       %4 = tt.splat %2 : i32 -> tensor<128xi32>
2026-02-21T08:47:42.0316037Z       %5 = arith.addi %4, %3 : tensor<128xi32>
2026-02-21T08:47:42.0316276Z       %c9216_i32 = arith.constant 9216 : i32
2026-02-21T08:47:42.0316562Z       %c512_i32 = arith.constant 512 : i32
2026-02-21T08:47:42.0317040Z       %6:2 = scf.for %arg3 = %c0_i32 to %c9216_i32 step %c512_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<128xf32>, tensor<128xf32>)  : i32 {
2026-02-21T08:47:42.0317497Z         %55 = tt.splat %arg3 : i32 -> tensor<128xi32>
2026-02-21T08:47:42.0318146Z         %56 = arith.addi %55, %3 : tensor<128xi32>
2026-02-21T08:47:42.0318475Z         %57 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:47:42.0318848Z         %58 = arith.muli %57, %cst : tensor<128x1xi32>
2026-02-21T08:47:42.0319166Z         %59 = tt.expand_dims %56 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:47:42.0319544Z         %60 = tt.broadcast %58 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0319897Z         %61 = tt.broadcast %59 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0320193Z         %62 = arith.addi %60, %61 : tensor<128x128xi32>
2026-02-21T08:47:42.0320510Z         %63 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0320831Z         %64 = tt.addptr %63, %62 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:47:42.0321318Z         %65 = tt.load %64 evictionPolicy = evict_first : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0321694Z         %66 = arith.extf %65 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:47:42.0321951Z         %67 = "tt.reduce"(%66) <{axis = 1 : i32}> ({
2026-02-21T08:47:42.0322161Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:47:42.0322359Z           %173 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:47:42.0322570Z           tt.reduce.return %173 : f32
2026-02-21T08:47:42.0322855Z         }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:47:42.0323093Z         %68 = arith.truncf %67 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:47:42.0323363Z         %69 = arith.extf %68 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:47:42.0323611Z         %70 = arith.cmpf ogt, %arg4, %69 : tensor<128xf32>
2026-02-21T08:47:42.0323855Z         %71 = arith.cmpf une, %arg4, %arg4 : tensor<128xf32>
2026-02-21T08:47:42.0324084Z         %72 = arith.ori %70, %71 : tensor<128xi1>
2026-02-21T08:47:42.0324339Z         %73 = arith.select %72, %arg4, %69 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:47:42.0324606Z         %74 = arith.subf %arg4, %73 : tensor<128xf32>
2026-02-21T08:47:42.0324987Z         %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:47:42.0325373Z         %76 = arith.mulf %arg5, %75 : tensor<128xf32>
2026-02-21T08:47:42.0325638Z         %77 = tt.expand_dims %73 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:42.0325952Z         %78 = tt.broadcast %77 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:47:42.0326191Z         %79 = arith.subf %66, %78 : tensor<128x128xf32>
2026-02-21T08:47:42.0326567Z         %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:47:42.0326942Z         %81 = "tt.reduce"(%80) <{axis = 1 : i32}> ({
2026-02-21T08:47:42.0327133Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:47:42.0327322Z           %173 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:47:42.0327506Z           tt.reduce.return %173 : f32
2026-02-21T08:47:42.0327701Z         }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:47:42.0327895Z         %82 = arith.addf %76, %81 : tensor<128xf32>
2026-02-21T08:47:42.0328093Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:47:42.0328284Z         %83 = arith.muli %c128_i32, %c1_i32 : i32
2026-02-21T08:47:42.0328468Z         %84 = arith.addi %arg3, %83 : i32
2026-02-21T08:47:42.0328659Z         %85 = tt.splat %84 : i32 -> tensor<128xi32>
2026-02-21T08:47:42.0328855Z         %86 = arith.addi %85, %3 : tensor<128xi32>
2026-02-21T08:47:42.0329106Z         %87 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:47:42.0329367Z         %88 = arith.muli %87, %cst : tensor<128x1xi32>
2026-02-21T08:47:42.0329634Z         %89 = tt.expand_dims %86 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:47:42.0330016Z         %90 = tt.broadcast %88 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0330280Z         %91 = tt.broadcast %89 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0330522Z         %92 = arith.addi %90, %91 : tensor<128x128xi32>
2026-02-21T08:47:42.0330757Z         %93 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0331045Z         %94 = tt.addptr %93, %92 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:47:42.0331347Z         %95 = tt.load %94 evictionPolicy = evict_first : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0331683Z         %96 = arith.extf %95 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:47:42.0331924Z         %97 = "tt.reduce"(%96) <{axis = 1 : i32}> ({
2026-02-21T08:47:42.0332114Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:47:42.0332307Z           %173 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:47:42.0332561Z           tt.reduce.return %173 : f32
2026-02-21T08:47:42.0332768Z         }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:47:42.0332996Z         %98 = arith.truncf %97 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:47:42.0333257Z         %99 = arith.extf %98 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:47:42.0333495Z         %100 = arith.cmpf ogt, %73, %99 : tensor<128xf32>
2026-02-21T08:47:42.0333709Z         %101 = arith.cmpf une, %73, %73 : tensor<128xf32>
2026-02-21T08:47:42.0333925Z         %102 = arith.ori %100, %101 : tensor<128xi1>
2026-02-21T08:47:42.0334165Z         %103 = arith.select %102, %73, %99 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:47:42.0334411Z         %104 = arith.subf %73, %103 : tensor<128xf32>
2026-02-21T08:47:42.0334773Z         %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:47:42.0335141Z         %106 = arith.mulf %82, %105 : tensor<128xf32>
2026-02-21T08:47:42.0335410Z         %107 = tt.expand_dims %103 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:42.0335714Z         %108 = tt.broadcast %107 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:47:42.0335967Z         %109 = arith.subf %96, %108 : tensor<128x128xf32>
2026-02-21T08:47:42.0336341Z         %110 = tt.extern_elementwise %109 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:47:42.0336724Z         %111 = "tt.reduce"(%110) <{axis = 1 : i32}> ({
2026-02-21T08:47:42.0336924Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:47:42.0337103Z           %173 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:47:42.0337295Z           tt.reduce.return %173 : f32
2026-02-21T08:47:42.0337484Z         }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:47:42.0337691Z         %112 = arith.addf %106, %111 : tensor<128xf32>
2026-02-21T08:47:42.0337884Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:47:42.0338079Z         %113 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T08:47:42.0338271Z         %114 = arith.addi %arg3, %113 : i32
2026-02-21T08:47:42.0338469Z         %115 = tt.splat %114 : i32 -> tensor<128xi32>
2026-02-21T08:47:42.0338676Z         %116 = arith.addi %115, %3 : tensor<128xi32>
2026-02-21T08:47:42.0338922Z         %117 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:47:42.0339194Z         %118 = arith.muli %117, %cst : tensor<128x1xi32>
2026-02-21T08:47:42.0339452Z         %119 = tt.expand_dims %116 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:47:42.0339760Z         %120 = tt.broadcast %118 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0340036Z         %121 = tt.broadcast %119 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0340280Z         %122 = arith.addi %120, %121 : tensor<128x128xi32>
2026-02-21T08:47:42.0340526Z         %123 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0340816Z         %124 = tt.addptr %123, %122 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:47:42.0341199Z         %125 = tt.load %124 evictionPolicy = evict_first : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0341497Z         %126 = arith.extf %125 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:47:42.0341794Z         %127 = "tt.reduce"(%126) <{axis = 1 : i32}> ({
2026-02-21T08:47:42.0341988Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:47:42.0342169Z           %173 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:47:42.0342367Z           tt.reduce.return %173 : f32
2026-02-21T08:47:42.0342554Z         }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:47:42.0342786Z         %128 = arith.truncf %127 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:47:42.0343040Z         %129 = arith.extf %128 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:47:42.0343286Z         %130 = arith.cmpf ogt, %103, %129 : tensor<128xf32>
2026-02-21T08:47:42.0343581Z         %131 = arith.cmpf une, %103, %103 : tensor<128xf32>
2026-02-21T08:47:42.0343801Z         %132 = arith.ori %130, %131 : tensor<128xi1>
2026-02-21T08:47:42.0344053Z         %133 = arith.select %132, %103, %129 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:47:42.0344309Z         %134 = arith.subf %103, %133 : tensor<128xf32>
2026-02-21T08:47:42.0344691Z         %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:47:42.0345069Z         %136 = arith.mulf %112, %135 : tensor<128xf32>
2026-02-21T08:47:42.0345347Z         %137 = tt.expand_dims %133 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:42.0345669Z         %138 = tt.broadcast %137 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:47:42.0345926Z         %139 = arith.subf %126, %138 : tensor<128x128xf32>
2026-02-21T08:47:42.0346325Z         %140 = tt.extern_elementwise %139 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:47:42.0346717Z         %141 = "tt.reduce"(%140) <{axis = 1 : i32}> ({
2026-02-21T08:47:42.0346936Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:47:42.0347140Z           %173 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:47:42.0347335Z           tt.reduce.return %173 : f32
2026-02-21T08:47:42.0347538Z         }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:47:42.0347745Z         %142 = arith.addf %136, %141 : tensor<128xf32>
2026-02-21T08:47:42.0347951Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T08:47:42.0348146Z         %143 = arith.muli %c128_i32, %c3_i32 : i32
2026-02-21T08:47:42.0348353Z         %144 = arith.addi %arg3, %143 : i32
2026-02-21T08:47:42.0348553Z         %145 = tt.splat %144 : i32 -> tensor<128xi32>
2026-02-21T08:47:42.0348769Z         %146 = arith.addi %145, %3 : tensor<128xi32>
2026-02-21T08:47:42.0349035Z         %147 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:47:42.0349319Z         %148 = arith.muli %147, %cst : tensor<128x1xi32>
2026-02-21T08:47:42.0349595Z         %149 = tt.expand_dims %146 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:47:42.0349906Z         %150 = tt.broadcast %148 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0350195Z         %151 = tt.broadcast %149 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0350448Z         %152 = arith.addi %150, %151 : tensor<128x128xi32>
2026-02-21T08:47:42.0350711Z         %153 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0351018Z         %154 = tt.addptr %153, %152 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:47:42.0351343Z         %155 = tt.load %154 evictionPolicy = evict_first : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0351685Z         %156 = arith.extf %155 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:47:42.0351922Z         %157 = "tt.reduce"(%156) <{axis = 1 : i32}> ({
2026-02-21T08:47:42.0352176Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:47:42.0352364Z           %173 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:47:42.0352553Z           tt.reduce.return %173 : f32
2026-02-21T08:47:42.0352746Z         }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:47:42.0352970Z         %158 = arith.truncf %157 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:47:42.0353227Z         %159 = arith.extf %158 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:47:42.0353459Z         %160 = arith.cmpf ogt, %133, %159 : tensor<128xf32>
2026-02-21T08:47:42.0353684Z         %161 = arith.cmpf une, %133, %133 : tensor<128xf32>
2026-02-21T08:47:42.0353894Z         %162 = arith.ori %160, %161 : tensor<128xi1>
2026-02-21T08:47:42.0354134Z         %163 = arith.select %162, %133, %159 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:47:42.0354382Z         %164 = arith.subf %133, %163 : tensor<128xf32>
2026-02-21T08:47:42.0354828Z         %165 = tt.extern_elementwise %164 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:47:42.0355200Z         %166 = arith.mulf %142, %165 : tensor<128xf32>
2026-02-21T08:47:42.0355457Z         %167 = tt.expand_dims %163 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:42.0355762Z         %168 = tt.broadcast %167 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:47:42.0356018Z         %169 = arith.subf %156, %168 : tensor<128x128xf32>
2026-02-21T08:47:42.0356388Z         %170 = tt.extern_elementwise %169 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:47:42.0356764Z         %171 = "tt.reduce"(%170) <{axis = 1 : i32}> ({
2026-02-21T08:47:42.0356952Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:47:42.0357141Z           %173 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:47:42.0357327Z           tt.reduce.return %173 : f32
2026-02-21T08:47:42.0357522Z         }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:47:42.0357732Z         %172 = arith.addf %166, %171 : tensor<128xf32>
2026-02-21T08:47:42.0357954Z         scf.yield %163, %172 : tensor<128xf32>, tensor<128xf32>
2026-02-21T08:47:42.0358234Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:47:42.0358498Z       %7 = tt.splat %c9216_i32 : i32 -> tensor<128xi32>
2026-02-21T08:47:42.0358726Z       %8 = arith.addi %7, %3 : tensor<128xi32>
2026-02-21T08:47:42.0358966Z       %9 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:47:42.0359268Z       %10 = arith.muli %9, %cst : tensor<128x1xi32>
2026-02-21T08:47:42.0359522Z       %11 = tt.expand_dims %8 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:47:42.0359809Z       %12 = tt.broadcast %10 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0360091Z       %13 = tt.broadcast %11 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0360335Z       %14 = arith.addi %12, %13 : tensor<128x128xi32>
2026-02-21T08:47:42.0360589Z       %15 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0360881Z       %16 = tt.addptr %15, %14 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:47:42.0361208Z       %17 = tt.load %16 evictionPolicy = evict_first : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0361516Z       %18 = arith.extf %17 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:47:42.0361786Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:47:42.0361992Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:47:42.0362182Z         %55 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:47:42.0362386Z         tt.reduce.return %55 : f32
2026-02-21T08:47:42.0362578Z       }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:47:42.0362818Z       %20 = arith.truncf %19 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:47:42.0363085Z       %21 = arith.extf %20 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:47:42.0363409Z       %22 = arith.cmpf ogt, %6#0, %21 : tensor<128xf32>
2026-02-21T08:47:42.0363640Z       %23 = arith.cmpf une, %6#0, %6#0 : tensor<128xf32>
2026-02-21T08:47:42.0363854Z       %24 = arith.ori %22, %23 : tensor<128xi1>
2026-02-21T08:47:42.0364099Z       %25 = arith.select %24, %6#0, %21 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:47:42.0364343Z       %26 = arith.subf %6#0, %25 : tensor<128xf32>
2026-02-21T08:47:42.0364720Z       %27 = tt.extern_elementwise %26 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:47:42.0365093Z       %28 = arith.mulf %6#1, %27 : tensor<128xf32>
2026-02-21T08:47:42.0365348Z       %29 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:42.0365658Z       %30 = tt.broadcast %29 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:47:42.0365904Z       %31 = arith.subf %18, %30 : tensor<128x128xf32>
2026-02-21T08:47:42.0366344Z       %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:47:42.0366732Z       %33 = "tt.reduce"(%32) <{axis = 1 : i32}> ({
2026-02-21T08:47:42.0366927Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:47:42.0367117Z         %55 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:47:42.0367310Z         tt.reduce.return %55 : f32
2026-02-21T08:47:42.0367507Z       }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:47:42.0367711Z       %34 = arith.addf %28, %33 : tensor<128xf32>
2026-02-21T08:47:42.0367921Z       %c9216_i32_2 = arith.constant 9216 : i32
2026-02-21T08:47:42.0368122Z       %c512_i32_3 = arith.constant 512 : i32
2026-02-21T08:47:42.0368365Z       scf.for %arg3 = %c0_i32 to %c9216_i32_2 step %c512_i32_3  : i32 {
2026-02-21T08:47:42.0368627Z         %55 = tt.splat %arg3 : i32 -> tensor<128xi32>
2026-02-21T08:47:42.0368838Z         %56 = arith.addi %55, %3 : tensor<128xi32>
2026-02-21T08:47:42.0369161Z         %57 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<128x128xf16>> -> tensor<128x128xf16>
2026-02-21T08:47:42.0369512Z         %58 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:42.0369810Z         %59 = arith.extf %57 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:47:42.0370078Z         %60 = tt.broadcast %58 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:47:42.0370319Z         %61 = arith.subf %59, %60 : tensor<128x128xf32>
2026-02-21T08:47:42.0370696Z         %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:47:42.0371117Z         %63 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:42.0371415Z         %64 = tt.broadcast %63 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:47:42.0371687Z         %65 = arith.divf %62, %64 : tensor<128x128xf32>
2026-02-21T08:47:42.0371942Z         %66 = arith.truncf %65 : tensor<128x128xf32> to tensor<128x128xf16>
2026-02-21T08:47:42.0372237Z         %67 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:47:42.0372499Z         %68 = arith.muli %67, %cst : tensor<128x1xi32>
2026-02-21T08:47:42.0372755Z         %69 = tt.expand_dims %56 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:47:42.0373038Z         %70 = tt.broadcast %68 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0373307Z         %71 = tt.broadcast %69 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0373546Z         %72 = arith.addi %70, %71 : tensor<128x128xi32>
2026-02-21T08:47:42.0373780Z         %73 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0374066Z         %74 = tt.addptr %73, %72 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:47:42.0374326Z         tt.store %74, %66 : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0374538Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:47:42.0374789Z         %75 = arith.muli %c128_i32, %c1_i32 : i32
2026-02-21T08:47:42.0374984Z         %76 = arith.addi %arg3, %75 : i32
2026-02-21T08:47:42.0375177Z         %77 = tt.splat %76 : i32 -> tensor<128xi32>
2026-02-21T08:47:42.0375373Z         %78 = arith.addi %77, %3 : tensor<128xi32>
2026-02-21T08:47:42.0375662Z         %79 = tt.descriptor_load %0[%2, %76] : !tt.tensordesc<tensor<128x128xf16>> -> tensor<128x128xf16>
2026-02-21T08:47:42.0376001Z         %80 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:42.0376288Z         %81 = arith.extf %79 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:47:42.0376545Z         %82 = tt.broadcast %80 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:47:42.0376789Z         %83 = arith.subf %81, %82 : tensor<128x128xf32>
2026-02-21T08:47:42.0377215Z         %84 = tt.extern_elementwise %83 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:47:42.0377633Z         %85 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:42.0377929Z         %86 = tt.broadcast %85 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:47:42.0378166Z         %87 = arith.divf %84, %86 : tensor<128x128xf32>
2026-02-21T08:47:42.0378407Z         %88 = arith.truncf %87 : tensor<128x128xf32> to tensor<128x128xf16>
2026-02-21T08:47:42.0378700Z         %89 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:47:42.0378965Z         %90 = arith.muli %89, %cst : tensor<128x1xi32>
2026-02-21T08:47:42.0379223Z         %91 = tt.expand_dims %78 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:47:42.0379510Z         %92 = tt.broadcast %90 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0379779Z         %93 = tt.broadcast %91 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0380015Z         %94 = arith.addi %92, %93 : tensor<128x128xi32>
2026-02-21T08:47:42.0380259Z         %95 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0380542Z         %96 = tt.addptr %95, %94 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:47:42.0380802Z         tt.store %96, %88 : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0381010Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:47:42.0381195Z         %97 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T08:47:42.0381390Z         %98 = arith.addi %arg3, %97 : i32
2026-02-21T08:47:42.0381603Z         %99 = tt.splat %98 : i32 -> tensor<128xi32>
2026-02-21T08:47:42.0381811Z         %100 = arith.addi %99, %3 : tensor<128xi32>
2026-02-21T08:47:42.0382105Z         %101 = tt.descriptor_load %0[%2, %98] : !tt.tensordesc<tensor<128x128xf16>> -> tensor<128x128xf16>
2026-02-21T08:47:42.0382445Z         %102 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:42.0382745Z         %103 = arith.extf %101 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:47:42.0383018Z         %104 = tt.broadcast %102 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:47:42.0383271Z         %105 = arith.subf %103, %104 : tensor<128x128xf32>
2026-02-21T08:47:42.0383654Z         %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:47:42.0384073Z         %107 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:42.0384375Z         %108 = tt.broadcast %107 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:47:42.0384626Z         %109 = arith.divf %106, %108 : tensor<128x128xf32>
2026-02-21T08:47:42.0384884Z         %110 = arith.truncf %109 : tensor<128x128xf32> to tensor<128x128xf16>
2026-02-21T08:47:42.0385180Z         %111 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:47:42.0385466Z         %112 = arith.muli %111, %cst : tensor<128x1xi32>
2026-02-21T08:47:42.0385799Z         %113 = tt.expand_dims %100 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:47:42.0386090Z         %114 = tt.broadcast %112 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0386363Z         %115 = tt.broadcast %113 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0386606Z         %116 = arith.addi %114, %115 : tensor<128x128xi32>
2026-02-21T08:47:42.0386853Z         %117 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0387145Z         %118 = tt.addptr %117, %116 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:47:42.0387411Z         tt.store %118, %110 : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0387623Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T08:47:42.0387809Z         %119 = arith.muli %c128_i32, %c3_i32 : i32
2026-02-21T08:47:42.0388008Z         %120 = arith.addi %arg3, %119 : i32
2026-02-21T08:47:42.0388254Z         %121 = tt.splat %120 : i32 -> tensor<128xi32>
2026-02-21T08:47:42.0388471Z         %122 = arith.addi %121, %3 : tensor<128xi32>
2026-02-21T08:47:42.0388763Z         %123 = tt.descriptor_load %0[%2, %120] : !tt.tensordesc<tensor<128x128xf16>> -> tensor<128x128xf16>
2026-02-21T08:47:42.0389119Z         %124 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:42.0389416Z         %125 = arith.extf %123 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:47:42.0389682Z         %126 = tt.broadcast %124 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:47:42.0389930Z         %127 = arith.subf %125, %126 : tensor<128x128xf32>
2026-02-21T08:47:42.0390307Z         %128 = tt.extern_elementwise %127 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:47:42.0390739Z         %129 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:42.0391040Z         %130 = tt.broadcast %129 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:47:42.0391282Z         %131 = arith.divf %128, %130 : tensor<128x128xf32>
2026-02-21T08:47:42.0391575Z         %132 = arith.truncf %131 : tensor<128x128xf32> to tensor<128x128xf16>
2026-02-21T08:47:42.0391867Z         %133 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:47:42.0392135Z         %134 = arith.muli %133, %cst : tensor<128x1xi32>
2026-02-21T08:47:42.0392397Z         %135 = tt.expand_dims %122 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:47:42.0392688Z         %136 = tt.broadcast %134 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0392959Z         %137 = tt.broadcast %135 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0393200Z         %138 = arith.addi %136, %137 : tensor<128x128xi32>
2026-02-21T08:47:42.0393444Z         %139 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0393733Z         %140 = tt.addptr %139, %138 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:47:42.0394000Z         tt.store %140, %132 : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0394261Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:47:42.0394519Z       %35 = tt.splat %c9216_i32_2 : i32 -> tensor<128xi32>
2026-02-21T08:47:42.0394736Z       %36 = arith.addi %35, %3 : tensor<128xi32>
2026-02-21T08:47:42.0395034Z       %37 = tt.descriptor_load %0[%2, %c9216_i32_2] : !tt.tensordesc<tensor<128x128xf16>> -> tensor<128x128xf16>
2026-02-21T08:47:42.0395390Z       %38 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:42.0395688Z       %39 = arith.extf %37 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:47:42.0395950Z       %40 = tt.broadcast %38 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:47:42.0396192Z       %41 = arith.subf %39, %40 : tensor<128x128xf32>
2026-02-21T08:47:42.0396575Z       %42 = tt.extern_elementwise %41 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:47:42.0397083Z       %43 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:42.0397379Z       %44 = tt.broadcast %43 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:47:42.0397615Z       %45 = arith.divf %42, %44 : tensor<128x128xf32>
2026-02-21T08:47:42.0397857Z       %46 = arith.truncf %45 : tensor<128x128xf32> to tensor<128x128xf16>
2026-02-21T08:47:42.0398143Z       %47 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:47:42.0398413Z       %48 = arith.muli %47, %cst : tensor<128x1xi32>
2026-02-21T08:47:42.0398674Z       %49 = tt.expand_dims %36 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:47:42.0398963Z       %50 = tt.broadcast %48 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0399287Z       %51 = tt.broadcast %49 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:47:42.0399523Z       %52 = arith.addi %50, %51 : tensor<128x128xi32>
2026-02-21T08:47:42.0399763Z       %53 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0400036Z       %54 = tt.addptr %53, %52 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:47:42.0400296Z       tt.store %54, %46 : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:47:42.0400571Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:47:42.0400819Z     tt.return
2026-02-21T08:47:42.0400953Z   }
2026-02-21T08:47:42.0401076Z }
2026-02-21T08:47:42.0401144Z 
2026-02-21T08:47:42.0401204Z {-#
2026-02-21T08:47:42.0401330Z   external_resources: {
2026-02-21T08:47:42.0401493Z     mlir_reproducer: {
2026-02-21T08:47:42.0405999Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:47:42.0410704Z       disable_threading: false,
2026-02-21T08:47:42.0410925Z       verify_each: true
2026-02-21T08:47:42.0411098Z     }
2026-02-21T08:47:42.0411242Z   }
2026-02-21T08:47:42.0411392Z #-}
2026-02-21T08:47:42.0411923Z /tmp/torchinductor_root/a2/ca2sar7nma3f3dfsksjyc2hlyfuz4jhavdzoiwo4fl6jczqwnmxf.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:47:42.0413311Z /tmp/torchinductor_root/a2/ca2sar7nma3f3dfsksjyc2hlyfuz4jhavdzoiwo4fl6jczqwnmxf.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:47:42.0414443Z [41s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:47:42.0415551Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 128], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', ''], num_sm_multiplier=1, num_stages=7, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[1, 2], range_unroll_factors=[1, 4], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:47:42.0416512Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:47:42.0416825Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:47:43.9958214Z module attributes {ttg.maxnreg = 256 : i32} {
2026-02-21T08:47:43.9958736Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:47:43.9959237Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:47:43.9959444Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:47:43.9959644Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:47:43.9959822Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:47:43.9960035Z     %cst = arith.constant dense<9344> : tensor<128x1xi32>
2026-02-21T08:47:43.9960298Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<128x256xf32>
2026-02-21T08:47:43.9960567Z     %cst_1 = arith.constant dense<0xFC00> : tensor<128x256xf16>
2026-02-21T08:47:43.9960812Z     %cst_2 = arith.constant dense<9344> : tensor<256xi32>
2026-02-21T08:47:43.9961086Z     %cst_3 = arith.constant dense<0.000000e+00> : tensor<128xf32>
2026-02-21T08:47:43.9961389Z     %cst_4 = arith.constant dense<0xFF800000> : tensor<128xf32>
2026-02-21T08:47:43.9961818Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:47:43.9962052Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:47:43.9962323Z     %c9344_i32 = arith.constant 9344 : i32
2026-02-21T08:47:43.9962598Z     %c9344_i64 = arith.constant 9344 : i64
2026-02-21T08:47:43.9962860Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:47:43.9963347Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c9344_i32], [%c9344_i64, %c1_i64] : <f16>, <tensor<128x256xf16>>
2026-02-21T08:47:43.9963851Z     %1 = tt.get_program_id x : i32
2026-02-21T08:47:43.9964117Z     %2 = arith.addi %1, %c1_i32 : i32
2026-02-21T08:47:43.9964394Z     %3 = arith.minsi %2, %c32_i32 : i32
2026-02-21T08:47:43.9964690Z     scf.for %arg2 = %1 to %3 step %c1_i32  : i32 {
2026-02-21T08:47:43.9965016Z       %4 = arith.muli %arg2, %c128_i32 : i32
2026-02-21T08:47:43.9965398Z       %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:47:43.9965817Z       %6 = tt.splat %4 : i32 -> tensor<128xi32>
2026-02-21T08:47:43.9966116Z       %7 = arith.addi %6, %5 : tensor<128xi32>
2026-02-21T08:47:43.9966378Z       %c9216_i32 = arith.constant 9216 : i32
2026-02-21T08:47:43.9966576Z       %c768_i32 = arith.constant 768 : i32
2026-02-21T08:47:43.9966942Z       %8:2 = scf.for %arg3 = %c0_i32 to %c9216_i32 step %c768_i32 iter_args(%arg4 = %cst_4, %arg5 = %cst_3) -> (tensor<128xf32>, tensor<128xf32>)  : i32 {
2026-02-21T08:47:43.9967405Z         %60 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:47:43.9967732Z         %61 = tt.splat %arg3 : i32 -> tensor<256xi32>
2026-02-21T08:47:43.9967941Z         %62 = arith.addi %61, %60 : tensor<256xi32>
2026-02-21T08:47:43.9968162Z         %63 = arith.cmpi slt, %62, %cst_2 : tensor<256xi32>
2026-02-21T08:47:43.9968482Z         %64 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:47:43.9969156Z         %65 = tt.expand_dims %63 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:47:43.9969476Z         %66 = tt.broadcast %65 : tensor<1x256xi1> -> tensor<128x256xi1>
2026-02-21T08:47:43.9969763Z         %67 = arith.select %66, %64, %cst_1 : tensor<128x256xi1>, tensor<128x256xf16>
2026-02-21T08:47:43.9977051Z         %68 = arith.extf %67 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:47:43.9977547Z         %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({
2026-02-21T08:47:43.9977918Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:47:43.9978272Z           %145 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:47:43.9978634Z           tt.reduce.return %145 : f32
2026-02-21T08:47:43.9978982Z         }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:47:43.9979347Z         %70 = arith.truncf %69 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:47:43.9979978Z         %71 = arith.extf %70 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:47:43.9980240Z         %72 = arith.cmpf ogt, %arg4, %71 : tensor<128xf32>
2026-02-21T08:47:43.9980495Z         %73 = arith.cmpf une, %arg4, %arg4 : tensor<128xf32>
2026-02-21T08:47:43.9980742Z         %74 = arith.ori %72, %73 : tensor<128xi1>
2026-02-21T08:47:43.9981032Z         %75 = arith.select %74, %arg4, %71 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:47:43.9981274Z         %76 = arith.subf %arg4, %75 : tensor<128xf32>
2026-02-21T08:47:43.9981707Z         %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:47:43.9982134Z         %78 = arith.mulf %arg5, %77 : tensor<128xf32>
2026-02-21T08:47:43.9982400Z         %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:43.9982712Z         %80 = arith.extf %64 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:47:43.9982982Z         %81 = tt.broadcast %79 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:47:43.9983237Z         %82 = arith.subf %80, %81 : tensor<128x256xf32>
2026-02-21T08:47:43.9983607Z         %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:47:43.9984029Z         %84 = arith.select %66, %83, %cst_0 : tensor<128x256xi1>, tensor<128x256xf32>
2026-02-21T08:47:43.9984281Z         %85 = "tt.reduce"(%84) <{axis = 1 : i32}> ({
2026-02-21T08:47:43.9984507Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:47:43.9984740Z           %145 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:47:43.9984926Z           tt.reduce.return %145 : f32
2026-02-21T08:47:43.9985117Z         }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:47:43.9985318Z         %86 = arith.addf %78, %85 : tensor<128xf32>
2026-02-21T08:47:43.9985604Z         %c1_i32_7 = arith.constant 1 : i32
2026-02-21T08:47:43.9985882Z         %87 = arith.muli %c256_i32, %c1_i32_7 : i32
2026-02-21T08:47:43.9986167Z         %88 = arith.addi %arg3, %87 : i32
2026-02-21T08:47:43.9986536Z         %89 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:47:43.9986924Z         %90 = tt.splat %88 : i32 -> tensor<256xi32>
2026-02-21T08:47:43.9987229Z         %91 = arith.addi %90, %89 : tensor<256xi32>
2026-02-21T08:47:43.9987560Z         %92 = arith.cmpi slt, %91, %cst_2 : tensor<256xi32>
2026-02-21T08:47:43.9988069Z         %93 = tt.descriptor_load %0[%4, %88] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:47:43.9988656Z         %94 = tt.expand_dims %92 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:47:43.9989108Z         %95 = tt.broadcast %94 : tensor<1x256xi1> -> tensor<128x256xi1>
2026-02-21T08:47:43.9989452Z         %96 = arith.select %95, %93, %cst_1 : tensor<128x256xi1>, tensor<128x256xf16>
2026-02-21T08:47:43.9989737Z         %97 = arith.extf %96 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:47:43.9989984Z         %98 = "tt.reduce"(%97) <{axis = 1 : i32}> ({
2026-02-21T08:47:43.9990299Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:47:43.9990494Z           %145 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:47:43.9990698Z           tt.reduce.return %145 : f32
2026-02-21T08:47:43.9990886Z         }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:47:43.9991121Z         %99 = arith.truncf %98 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:47:43.9991376Z         %100 = arith.extf %99 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:47:43.9991680Z         %101 = arith.cmpf ogt, %75, %100 : tensor<128xf32>
2026-02-21T08:47:43.9991937Z         %102 = arith.cmpf une, %75, %75 : tensor<128xf32>
2026-02-21T08:47:43.9992164Z         %103 = arith.ori %101, %102 : tensor<128xi1>
2026-02-21T08:47:43.9992415Z         %104 = arith.select %103, %75, %100 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:47:43.9992666Z         %105 = arith.subf %75, %104 : tensor<128xf32>
2026-02-21T08:47:43.9993136Z         %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:47:43.9993516Z         %107 = arith.mulf %86, %106 : tensor<128xf32>
2026-02-21T08:47:43.9993786Z         %108 = tt.expand_dims %104 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:43.9994094Z         %109 = arith.extf %93 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:47:43.9994385Z         %110 = tt.broadcast %108 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:47:43.9994687Z         %111 = arith.subf %109, %110 : tensor<128x256xf32>
2026-02-21T08:47:43.9995071Z         %112 = tt.extern_elementwise %111 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:47:43.9995566Z         %113 = arith.select %95, %112, %cst_0 : tensor<128x256xi1>, tensor<128x256xf32>
2026-02-21T08:47:43.9995936Z         %114 = "tt.reduce"(%113) <{axis = 1 : i32}> ({
2026-02-21T08:47:43.9996191Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:47:43.9996386Z           %145 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:47:43.9996575Z           tt.reduce.return %145 : f32
2026-02-21T08:47:43.9996774Z         }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:47:43.9996978Z         %115 = arith.addf %107, %114 : tensor<128xf32>
2026-02-21T08:47:43.9997183Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:47:43.9997386Z         %116 = arith.muli %c256_i32, %c2_i32 : i32
2026-02-21T08:47:43.9997597Z         %117 = arith.addi %arg3, %116 : i32
2026-02-21T08:47:43.9997841Z         %118 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:47:43.9998097Z         %119 = tt.splat %117 : i32 -> tensor<256xi32>
2026-02-21T08:47:43.9998301Z         %120 = arith.addi %119, %118 : tensor<256xi32>
2026-02-21T08:47:43.9998547Z         %121 = arith.cmpi slt, %120, %cst_2 : tensor<256xi32>
2026-02-21T08:47:43.9999019Z         %122 = tt.descriptor_load %0[%4, %117] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:47:43.9999577Z         %123 = tt.expand_dims %121 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:47:44.0000032Z         %124 = tt.broadcast %123 : tensor<1x256xi1> -> tensor<128x256xi1>
2026-02-21T08:47:44.0000494Z         %125 = arith.select %124, %122, %cst_1 : tensor<128x256xi1>, tensor<128x256xf16>
2026-02-21T08:47:44.0000961Z         %126 = arith.extf %125 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:47:44.0001350Z         %127 = "tt.reduce"(%126) <{axis = 1 : i32}> ({
2026-02-21T08:47:44.0001695Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:47:44.0001979Z           %145 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:47:44.0002270Z           tt.reduce.return %145 : f32
2026-02-21T08:47:44.0002555Z         }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:47:44.0002887Z         %128 = arith.truncf %127 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:47:44.0003255Z         %129 = arith.extf %128 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:47:44.0003770Z         %130 = arith.cmpf ogt, %104, %129 : tensor<128xf32>
2026-02-21T08:47:44.0004104Z         %131 = arith.cmpf une, %104, %104 : tensor<128xf32>
2026-02-21T08:47:44.0004422Z         %132 = arith.ori %130, %131 : tensor<128xi1>
2026-02-21T08:47:44.0004781Z         %133 = arith.select %132, %104, %129 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:47:44.0005166Z         %134 = arith.subf %104, %133 : tensor<128xf32>
2026-02-21T08:47:44.0005729Z         %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:47:44.0006266Z         %136 = arith.mulf %115, %135 : tensor<128xf32>
2026-02-21T08:47:44.0006648Z         %137 = tt.expand_dims %133 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:44.0007073Z         %138 = arith.extf %122 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:47:44.0007587Z         %139 = tt.broadcast %137 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:47:44.0007942Z         %140 = arith.subf %138, %139 : tensor<128x256xf32>
2026-02-21T08:47:44.0008392Z         %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:47:44.0008826Z         %142 = arith.select %124, %141, %cst_0 : tensor<128x256xi1>, tensor<128x256xf32>
2026-02-21T08:47:44.0009084Z         %143 = "tt.reduce"(%142) <{axis = 1 : i32}> ({
2026-02-21T08:47:44.0009301Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:47:44.0009551Z           %145 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:47:44.0009814Z           tt.reduce.return %145 : f32
2026-02-21T08:47:44.0010077Z         }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:47:44.0010355Z         %144 = arith.addf %136, %143 : tensor<128xf32>
2026-02-21T08:47:44.0010654Z         scf.yield %133, %144 : tensor<128xf32>, tensor<128xf32>
2026-02-21T08:47:44.0010935Z       } {tt.num_stages = 1 : i32}
2026-02-21T08:47:44.0011168Z       %9 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:47:44.0011422Z       %10 = tt.splat %c9216_i32 : i32 -> tensor<256xi32>
2026-02-21T08:47:44.0011683Z       %11 = arith.addi %10, %9 : tensor<256xi32>
2026-02-21T08:47:44.0011893Z       %12 = arith.cmpi slt, %11, %cst_2 : tensor<256xi32>
2026-02-21T08:47:44.0012210Z       %13 = tt.descriptor_load %0[%4, %c9216_i32] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:47:44.0012576Z       %14 = tt.expand_dims %12 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:47:44.0012858Z       %15 = tt.broadcast %14 : tensor<1x256xi1> -> tensor<128x256xi1>
2026-02-21T08:47:44.0013140Z       %16 = arith.select %15, %13, %cst_1 : tensor<128x256xi1>, tensor<128x256xf16>
2026-02-21T08:47:44.0013415Z       %17 = arith.extf %16 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:47:44.0013657Z       %18 = "tt.reduce"(%17) <{axis = 1 : i32}> ({
2026-02-21T08:47:44.0013861Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:47:44.0014043Z         %60 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:47:44.0014239Z         tt.reduce.return %60 : f32
2026-02-21T08:47:44.0014418Z       }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:47:44.0014645Z       %19 = arith.truncf %18 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:47:44.0014884Z       %20 = arith.extf %19 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:47:44.0015121Z       %21 = arith.cmpf ogt, %8#0, %20 : tensor<128xf32>
2026-02-21T08:47:44.0015340Z       %22 = arith.cmpf une, %8#0, %8#0 : tensor<128xf32>
2026-02-21T08:47:44.0015542Z       %23 = arith.ori %21, %22 : tensor<128xi1>
2026-02-21T08:47:44.0015773Z       %24 = arith.select %23, %8#0, %20 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:47:44.0016003Z       %25 = arith.subf %8#0, %24 : tensor<128xf32>
2026-02-21T08:47:44.0016367Z       %26 = tt.extern_elementwise %25 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:47:44.0016789Z       %27 = arith.mulf %8#1, %26 : tensor<128xf32>
2026-02-21T08:47:44.0017057Z       %28 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:44.0017365Z       %29 = arith.extf %13 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:47:44.0017635Z       %30 = tt.broadcast %28 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:47:44.0017891Z       %31 = arith.subf %29, %30 : tensor<128x256xf32>
2026-02-21T08:47:44.0018269Z       %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:47:44.0018703Z       %33 = arith.select %15, %32, %cst_0 : tensor<128x256xi1>, tensor<128x256xf32>
2026-02-21T08:47:44.0018969Z       %34 = "tt.reduce"(%33) <{axis = 1 : i32}> ({
2026-02-21T08:47:44.0019164Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:47:44.0019439Z         %60 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:47:44.0019640Z         tt.reduce.return %60 : f32
2026-02-21T08:47:44.0019839Z       }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:47:44.0020042Z       %35 = arith.addf %27, %34 : tensor<128xf32>
2026-02-21T08:47:44.0020254Z       %c9216_i32_5 = arith.constant 9216 : i32
2026-02-21T08:47:44.0020453Z       %c768_i32_6 = arith.constant 768 : i32
2026-02-21T08:47:44.0020699Z       scf.for %arg3 = %c0_i32 to %c9216_i32_5 step %c768_i32_6  : i32 {
2026-02-21T08:47:44.0020999Z         %60 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:47:44.0021266Z         %61 = tt.splat %arg3 : i32 -> tensor<256xi32>
2026-02-21T08:47:44.0021485Z         %62 = arith.addi %61, %60 : tensor<256xi32>
2026-02-21T08:47:44.0021749Z         %63 = arith.cmpi slt, %62, %cst_2 : tensor<256xi32>
2026-02-21T08:47:44.0022075Z         %64 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:47:44.0022437Z         %65 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:44.0022746Z         %66 = arith.extf %64 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:47:44.0023028Z         %67 = tt.broadcast %65 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:47:44.0023276Z         %68 = arith.subf %66, %67 : tensor<128x256xf32>
2026-02-21T08:47:44.0023669Z         %69 = tt.extern_elementwise %68 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:47:44.0024102Z         %70 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:44.0024410Z         %71 = tt.broadcast %70 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:47:44.0024668Z         %72 = arith.divf %69, %71 : tensor<128x256xf32>
2026-02-21T08:47:44.0024915Z         %73 = arith.truncf %72 : tensor<128x256xf32> to tensor<128x256xf16>
2026-02-21T08:47:44.0025221Z         %74 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:47:44.0025498Z         %75 = arith.muli %74, %cst : tensor<128x1xi32>
2026-02-21T08:47:44.0025764Z         %76 = tt.expand_dims %62 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32>
2026-02-21T08:47:44.0026063Z         %77 = tt.broadcast %75 : tensor<128x1xi32> -> tensor<128x256xi32>
2026-02-21T08:47:44.0026342Z         %78 = tt.broadcast %76 : tensor<1x256xi32> -> tensor<128x256xi32>
2026-02-21T08:47:44.0026593Z         %79 = arith.addi %77, %78 : tensor<128x256xi32>
2026-02-21T08:47:44.0026842Z         %80 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:47:44.0027126Z         %81 = tt.addptr %80, %79 : tensor<128x256x!tt.ptr<f16>>, tensor<128x256xi32>
2026-02-21T08:47:44.0027420Z         %82 = tt.expand_dims %63 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:47:44.0027711Z         %83 = tt.broadcast %82 : tensor<1x256xi1> -> tensor<128x256xi1>
2026-02-21T08:47:44.0028017Z         tt.store %81, %73, %83 : tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:47:44.0028231Z         %c1_i32_7 = arith.constant 1 : i32
2026-02-21T08:47:44.0028428Z         %84 = arith.muli %c256_i32, %c1_i32_7 : i32
2026-02-21T08:47:44.0028620Z         %85 = arith.addi %arg3, %84 : i32
2026-02-21T08:47:44.0028854Z         %86 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:47:44.0029098Z         %87 = tt.splat %85 : i32 -> tensor<256xi32>
2026-02-21T08:47:44.0029296Z         %88 = arith.addi %87, %86 : tensor<256xi32>
2026-02-21T08:47:44.0029501Z         %89 = arith.cmpi slt, %88, %cst_2 : tensor<256xi32>
2026-02-21T08:47:44.0029804Z         %90 = tt.descriptor_load %0[%4, %85] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:47:44.0030153Z         %91 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:44.0030485Z         %92 = arith.extf %90 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:47:44.0030752Z         %93 = tt.broadcast %91 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:47:44.0030984Z         %94 = arith.subf %92, %93 : tensor<128x256xf32>
2026-02-21T08:47:44.0031359Z         %95 = tt.extern_elementwise %94 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:47:44.0031816Z         %96 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:44.0032104Z         %97 = tt.broadcast %96 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:47:44.0032348Z         %98 = arith.divf %95, %97 : tensor<128x256xf32>
2026-02-21T08:47:44.0032582Z         %99 = arith.truncf %98 : tensor<128x256xf32> to tensor<128x256xf16>
2026-02-21T08:47:44.0032882Z         %100 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:47:44.0033160Z         %101 = arith.muli %100, %cst : tensor<128x1xi32>
2026-02-21T08:47:44.0033423Z         %102 = tt.expand_dims %88 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32>
2026-02-21T08:47:44.0033733Z         %103 = tt.broadcast %101 : tensor<128x1xi32> -> tensor<128x256xi32>
2026-02-21T08:47:44.0034006Z         %104 = tt.broadcast %102 : tensor<1x256xi32> -> tensor<128x256xi32>
2026-02-21T08:47:44.0034263Z         %105 = arith.addi %103, %104 : tensor<128x256xi32>
2026-02-21T08:47:44.0034510Z         %106 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:47:44.0034810Z         %107 = tt.addptr %106, %105 : tensor<128x256x!tt.ptr<f16>>, tensor<128x256xi32>
2026-02-21T08:47:44.0035122Z         %108 = tt.expand_dims %89 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:47:44.0035408Z         %109 = tt.broadcast %108 : tensor<1x256xi1> -> tensor<128x256xi1>
2026-02-21T08:47:44.0035669Z         tt.store %107, %99, %109 : tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:47:44.0035884Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:47:44.0036081Z         %110 = arith.muli %c256_i32, %c2_i32 : i32
2026-02-21T08:47:44.0036274Z         %111 = arith.addi %arg3, %110 : i32
2026-02-21T08:47:44.0036512Z         %112 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:47:44.0036772Z         %113 = tt.splat %111 : i32 -> tensor<256xi32>
2026-02-21T08:47:44.0036977Z         %114 = arith.addi %113, %112 : tensor<256xi32>
2026-02-21T08:47:44.0037201Z         %115 = arith.cmpi slt, %114, %cst_2 : tensor<256xi32>
2026-02-21T08:47:44.0037505Z         %116 = tt.descriptor_load %0[%4, %111] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:47:44.0037865Z         %117 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:44.0038165Z         %118 = arith.extf %116 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:47:44.0038436Z         %119 = tt.broadcast %117 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:47:44.0038693Z         %120 = arith.subf %118, %119 : tensor<128x256xf32>
2026-02-21T08:47:44.0039123Z         %121 = tt.extern_elementwise %120 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:47:44.0039557Z         %122 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:44.0039847Z         %123 = tt.broadcast %122 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:47:44.0040101Z         %124 = arith.divf %121, %123 : tensor<128x256xf32>
2026-02-21T08:47:44.0040352Z         %125 = arith.truncf %124 : tensor<128x256xf32> to tensor<128x256xf16>
2026-02-21T08:47:44.0040643Z         %126 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:47:44.0040914Z         %127 = arith.muli %126, %cst : tensor<128x1xi32>
2026-02-21T08:47:44.0041173Z         %128 = tt.expand_dims %114 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32>
2026-02-21T08:47:44.0041532Z         %129 = tt.broadcast %127 : tensor<128x1xi32> -> tensor<128x256xi32>
2026-02-21T08:47:44.0041845Z         %130 = tt.broadcast %128 : tensor<1x256xi32> -> tensor<128x256xi32>
2026-02-21T08:47:44.0042084Z         %131 = arith.addi %129, %130 : tensor<128x256xi32>
2026-02-21T08:47:44.0042331Z         %132 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:47:44.0042636Z         %133 = tt.addptr %132, %131 : tensor<128x256x!tt.ptr<f16>>, tensor<128x256xi32>
2026-02-21T08:47:44.0042956Z         %134 = tt.expand_dims %115 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:47:44.0043251Z         %135 = tt.broadcast %134 : tensor<1x256xi1> -> tensor<128x256xi1>
2026-02-21T08:47:44.0043518Z         tt.store %133, %125, %135 : tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:47:44.0043747Z       } {tt.num_stages = 1 : i32}
2026-02-21T08:47:44.0043968Z       %36 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:47:44.0044229Z       %37 = tt.splat %c9216_i32_5 : i32 -> tensor<256xi32>
2026-02-21T08:47:44.0044441Z       %38 = arith.addi %37, %36 : tensor<256xi32>
2026-02-21T08:47:44.0044658Z       %39 = arith.cmpi slt, %38, %cst_2 : tensor<256xi32>
2026-02-21T08:47:44.0044969Z       %40 = tt.descriptor_load %0[%4, %c9216_i32_5] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:47:44.0045333Z       %41 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:44.0045631Z       %42 = arith.extf %40 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:47:44.0045894Z       %43 = tt.broadcast %41 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:47:44.0046137Z       %44 = arith.subf %42, %43 : tensor<128x256xf32>
2026-02-21T08:47:44.0046504Z       %45 = tt.extern_elementwise %44 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:47:44.0046922Z       %46 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:47:44.0047238Z       %47 = tt.broadcast %46 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:47:44.0047474Z       %48 = arith.divf %45, %47 : tensor<128x256xf32>
2026-02-21T08:47:44.0047713Z       %49 = arith.truncf %48 : tensor<128x256xf32> to tensor<128x256xf16>
2026-02-21T08:47:44.0047993Z       %50 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:47:44.0048258Z       %51 = arith.muli %50, %cst : tensor<128x1xi32>
2026-02-21T08:47:44.0048509Z       %52 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32>
2026-02-21T08:47:44.0048788Z       %53 = tt.broadcast %51 : tensor<128x1xi32> -> tensor<128x256xi32>
2026-02-21T08:47:44.0049049Z       %54 = tt.broadcast %52 : tensor<1x256xi32> -> tensor<128x256xi32>
2026-02-21T08:47:44.0049279Z       %55 = arith.addi %53, %54 : tensor<128x256xi32>
2026-02-21T08:47:44.0049517Z       %56 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:47:44.0049801Z       %57 = tt.addptr %56, %55 : tensor<128x256x!tt.ptr<f16>>, tensor<128x256xi32>
2026-02-21T08:47:44.0050140Z       %58 = tt.expand_dims %39 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:47:44.0050427Z       %59 = tt.broadcast %58 : tensor<1x256xi1> -> tensor<128x256xi1>
2026-02-21T08:47:44.0050671Z       tt.store %57, %49, %59 : tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:47:44.0050899Z     } {tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:47:44.0051088Z     tt.return
2026-02-21T08:47:44.0051220Z   }
2026-02-21T08:47:44.0051337Z }
2026-02-21T08:47:44.0051410Z 
2026-02-21T08:47:44.0051460Z {-#
2026-02-21T08:47:44.0051642Z   external_resources: {
2026-02-21T08:47:44.0051795Z     mlir_reproducer: {
2026-02-21T08:47:44.0056165Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:47:44.0060599Z       disable_threading: false,
2026-02-21T08:47:44.0060782Z       verify_each: true
2026-02-21T08:47:44.0060931Z     }
2026-02-21T08:47:44.0061063Z   }
2026-02-21T08:47:44.0061181Z #-}
2026-02-21T08:47:44.0061676Z /tmp/torchinductor_root/yd/cydwhqvmvwim7d7qkg4nbhz4g2n7u3a42tsph3mw7voszwmou3lu.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:47:44.0062935Z /tmp/torchinductor_root/yd/cydwhqvmvwim7d7qkg4nbhz4g2n7u3a42tsph3mw7voszwmou3lu.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:47:44.0063961Z [43s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:47:44.0065118Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=256, num_sm_multiplier=16, num_stages=8, num_warps=16, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:47:44.0066165Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:47:44.0066433Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:47:47.3807386Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 13.0 configs/s
2026-02-21T08:47:47.3820374Z [46s] Adaptive compile timeout: 30s (90% percentile=9.6s, bounds=[30.0s, 30s])
2026-02-21T08:47:48.1964516Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1209.5 configs/s
2026-02-21T08:47:48.2603460Z [47s] Initial random population of 100, 5 starting points: 
2026-02-21T08:47:48.2605561Z error=6
2026-02-21T08:47:48.2605756Z timeout=1
2026-02-21T08:47:48.2611129Z ok=93
2026-02-21T08:47:48.2615883Z min=0.0497
2026-02-21T08:47:48.2617370Z mid=0.8457
2026-02-21T08:47:48.2617531Z max=46.4589
2026-02-21T08:47:48.2617688Z best={'block_sizes': [1, 16384],
2026-02-21T08:47:48.2617917Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:47:48.2618153Z  'load_eviction_policies': ['last', ''],
2026-02-21T08:47:48.2618339Z  'num_sm_multiplier': 8,
2026-02-21T08:47:48.2618502Z  'num_stages': 3,
2026-02-21T08:47:48.2618645Z  'num_warps': 1,
2026-02-21T08:47:48.2619152Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:47:48.2619386Z  'range_flattens': [False, None],
2026-02-21T08:47:48.2619564Z  'range_multi_buffers': [True, True],
2026-02-21T08:47:48.2619751Z  'range_num_stages': [1, 2],
2026-02-21T08:47:48.2619913Z  'range_unroll_factors': [0, 1],
2026-02-21T08:47:48.2620096Z  'range_warp_specializes': [True, None]}
2026-02-21T08:47:48.2624289Z [47s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:47:49.3126373Z [48s] Generation 1 starting: 81 neighbors, 5 active search path(s)
2026-02-21T08:48:02.4108844Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 6.7 configs/s
2026-02-21T08:48:07.5485381Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 16.9 configs/s
2026-02-21T08:48:10.9633889Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 295.5 configs/s
2026-02-21T08:48:11.1696831Z [70s] Generation 1 complete: 
2026-02-21T08:48:11.1701307Z ok=87
2026-02-21T08:48:11.1706852Z min=0.0390
2026-02-21T08:48:11.1711024Z mid=0.0615
2026-02-21T08:48:11.1712552Z max=0.2046
2026-02-21T08:48:11.1712765Z best={'block_sizes': [1, 16384],
2026-02-21T08:48:11.1713090Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:48:11.1715259Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:48:11.1715502Z  'num_sm_multiplier': 8,
2026-02-21T08:48:11.1715669Z  'num_stages': 3,
2026-02-21T08:48:11.1715821Z  'num_warps': 1,
2026-02-21T08:48:11.1715982Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:48:11.1716192Z  'range_flattens': [False, None],
2026-02-21T08:48:11.1716378Z  'range_multi_buffers': [True, True],
2026-02-21T08:48:11.1716573Z  'range_num_stages': [1, 2],
2026-02-21T08:48:11.1716742Z  'range_unroll_factors': [1, 1],
2026-02-21T08:48:11.1716930Z  'range_warp_specializes': [True, None]}
2026-02-21T08:48:11.1717225Z [70s] Fitting surrogate: 187 points, 187 targets
2026-02-21T08:48:12.1804020Z [71s] Generation 2 starting: 74 neighbors, 5 active search path(s)
2026-02-21T08:48:25.2385678Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 7.5 configs/s
2026-02-21T08:48:31.0180013Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 13.6 configs/s
2026-02-21T08:48:34.4705359Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 292.6 configs/s
2026-02-21T08:48:34.6738638Z [94s] Generation 2 complete: 
2026-02-21T08:48:34.6738912Z ok=79
2026-02-21T08:48:34.6739079Z min=0.0389
2026-02-21T08:48:34.6739209Z mid=0.0594
2026-02-21T08:48:34.6739338Z max=0.3748
2026-02-21T08:48:34.6739476Z best={'block_sizes': [1, 16384],
2026-02-21T08:48:34.6739745Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:48:34.6739981Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:48:34.6740204Z  'num_stages': 3,
2026-02-21T08:48:34.6740786Z  'num_warps': 1,
2026-02-21T08:48:34.7121070Z  'pid_type': 'flat',
2026-02-21T08:48:34.7121249Z  'range_flattens': [None, None],
2026-02-21T08:48:34.7121430Z  'range_multi_buffers': [None, True],
2026-02-21T08:48:34.7121858Z  'range_num_stages': [0, 2],
2026-02-21T08:48:34.7122025Z  'range_unroll_factors': [0, 1],
2026-02-21T08:48:34.7122213Z  'range_warp_specializes': [None, True]}
2026-02-21T08:48:34.7122441Z [94s] Fitting surrogate: 266 points, 266 targets
2026-02-21T08:48:35.4615497Z [95s] Generation 3 starting: 56 neighbors, 4 active search path(s)
2026-02-21T08:48:45.7695229Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 4.3 configs/s
2026-02-21T08:48:49.3014684Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 16.9 configs/s
2026-02-21T08:48:53.1186971Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 264.6 configs/s
2026-02-21T08:48:53.3430627Z [112s] Generation 3 complete: 
2026-02-21T08:48:53.3434631Z ok=61
2026-02-21T08:48:53.3436741Z min=0.0369
2026-02-21T08:48:53.3436907Z mid=0.0532
2026-02-21T08:48:53.3437044Z max=0.6912
2026-02-21T08:48:53.3437199Z best={'block_sizes': [1, 16384],
2026-02-21T08:48:53.3437465Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:48:53.3437755Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:48:53.3437950Z  'num_sm_multiplier': 32,
2026-02-21T08:48:53.3438115Z  'num_stages': 5,
2026-02-21T08:48:53.3438252Z  'num_warps': 2,
2026-02-21T08:48:53.3438415Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:48:53.3438604Z  'range_flattens': [True, True],
2026-02-21T08:48:53.3438785Z  'range_multi_buffers': [False, None],
2026-02-21T08:48:53.3438971Z  'range_num_stages': [3, 2],
2026-02-21T08:48:53.3439135Z  'range_unroll_factors': [0, 2],
2026-02-21T08:48:53.3439317Z  'range_warp_specializes': [True, None]}
2026-02-21T08:48:53.3448108Z [112s] Fitting surrogate: 327 points, 327 targets
2026-02-21T08:48:53.9445761Z [113s] Generation 4 starting: 37 neighbors, 2 active search path(s)
2026-02-21T08:49:03.3421183Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 2.8 configs/s
2026-02-21T08:49:05.7228050Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 39/39 16.7 configs/s
2026-02-21T08:49:07.8250692Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 479.3 configs/s
2026-02-21T08:49:07.9782881Z [127s] Generation 4 complete: 
2026-02-21T08:49:07.9783237Z ok=40
2026-02-21T08:49:07.9783458Z min=0.0368
2026-02-21T08:49:07.9783625Z mid=0.0513
2026-02-21T08:49:07.9783808Z max=0.6871
2026-02-21T08:49:07.9783994Z best={'block_sizes': [1, 16384],
2026-02-21T08:49:07.9784293Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:49:07.9784594Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:49:07.9784845Z  'num_sm_multiplier': 32,
2026-02-21T08:49:07.9785033Z  'num_stages': 5,
2026-02-21T08:49:07.9785176Z  'num_warps': 4,
2026-02-21T08:49:07.9785344Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:49:07.9785600Z  'range_flattens': [True, True],
2026-02-21T08:49:07.9785872Z  'range_multi_buffers': [False, None],
2026-02-21T08:49:07.9786146Z  'range_num_stages': [3, 2],
2026-02-21T08:49:07.9786383Z  'range_unroll_factors': [0, 2],
2026-02-21T08:49:07.9786658Z  'range_warp_specializes': [True, None]}
2026-02-21T08:49:07.9806097Z [127s] Fitting surrogate: 367 points, 367 targets
2026-02-21T08:49:08.6287486Z [128s] Generation 5 starting: 38 neighbors, 2 active search path(s)
2026-02-21T08:49:15.3934103Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40/40 7.0 configs/s
2026-02-21T08:49:17.8500119Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 40/40 16.6 configs/s
2026-02-21T08:49:20.9142818Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 330.5 configs/s
2026-02-21T08:49:21.1283363Z [140s] Generation 5 complete: 
2026-02-21T08:49:21.1287610Z ok=41
2026-02-21T08:49:21.1289298Z min=0.0368
2026-02-21T08:49:21.1289514Z mid=0.0471
2026-02-21T08:49:21.1295328Z max=0.2550
2026-02-21T08:49:21.1300209Z best={'block_sizes': [1, 16384],
2026-02-21T08:49:21.1302313Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:49:21.1302601Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:49:21.1302799Z  'num_sm_multiplier': 32,
2026-02-21T08:49:21.1302969Z  'num_stages': 5,
2026-02-21T08:49:21.1303108Z  'num_warps': 1,
2026-02-21T08:49:21.1303273Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:49:21.1303466Z  'range_flattens': [True, True],
2026-02-21T08:49:21.1303648Z  'range_multi_buffers': [False, True],
2026-02-21T08:49:21.1303841Z  'range_num_stages': [3, 2],
2026-02-21T08:49:21.1304006Z  'range_unroll_factors': [0, 3],
2026-02-21T08:49:21.1304655Z  'range_warp_specializes': [True, None]}
2026-02-21T08:49:21.1304886Z [140s] Fitting surrogate: 408 points, 408 targets
2026-02-21T08:49:21.4886343Z [141s] Generation 6 starting: 16 neighbors, 1 active search path(s)
2026-02-21T08:49:24.5440345Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 5.7 configs/s
2026-02-21T08:49:25.5205522Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 16/16 17.2 configs/s
2026-02-21T08:49:26.9769639Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 690.0 configs/s
2026-02-21T08:49:27.0896231Z [146s] Generation 6 complete: 
2026-02-21T08:49:27.0901278Z ok=18
2026-02-21T08:49:27.0905970Z min=0.0368
2026-02-21T08:49:27.0910685Z mid=0.0389
2026-02-21T08:49:27.0914878Z max=0.0532
2026-02-21T08:49:27.0917120Z best={'block_sizes': [1, 16384],
2026-02-21T08:49:27.0917436Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:49:27.0917727Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:49:27.0917926Z  'num_sm_multiplier': 32,
2026-02-21T08:49:27.0918092Z  'num_stages': 5,
2026-02-21T08:49:27.0918234Z  'num_warps': 1,
2026-02-21T08:49:27.0918390Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:49:27.0918587Z  'range_flattens': [True, True],
2026-02-21T08:49:27.0918760Z  'range_multi_buffers': [False, True],
2026-02-21T08:49:27.0918946Z  'range_num_stages': [3, 2],
2026-02-21T08:49:27.0919105Z  'range_unroll_factors': [0, 3],
2026-02-21T08:49:27.0919285Z  'range_warp_specializes': [True, None]}
2026-02-21T08:49:27.0919495Z [146s] Fitting surrogate: 426 points, 426 targets
2026-02-21T08:49:27.4177499Z [147s] Generation 7 starting: 12 neighbors, 1 active search path(s)
2026-02-21T08:49:29.9281495Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 9.9 configs/s
2026-02-21T08:49:30.7225701Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 13/13 17.4 configs/s
2026-02-21T08:49:31.6850194Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1040.2 configs/s
2026-02-21T08:49:31.7596539Z [151s] Generation 7 complete: 
2026-02-21T08:49:31.7598647Z ok=14
2026-02-21T08:49:31.7598861Z min=0.0368
2026-02-21T08:49:31.7603756Z mid=0.0389
2026-02-21T08:49:31.7605742Z max=0.0584
2026-02-21T08:49:31.7605923Z best={'block_sizes': [1, 16384],
2026-02-21T08:49:31.7606195Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:49:31.7606474Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:49:31.7606669Z  'num_stages': 4,
2026-02-21T08:49:31.7606825Z  'num_warps': 8,
2026-02-21T08:49:31.7606971Z  'pid_type': 'flat',
2026-02-21T08:49:31.7607139Z  'range_flattens': [None, True],
2026-02-21T08:49:31.7607330Z  'range_multi_buffers': [None, True],
2026-02-21T08:49:31.7607534Z  'range_num_stages': [0, 2],
2026-02-21T08:49:31.7608109Z  'range_unroll_factors': [0, 1],
2026-02-21T08:49:31.7608321Z  'range_warp_specializes': [None, True]}
2026-02-21T08:49:31.7622913Z [151s] Fitting surrogate: 440 points, 440 targets
2026-02-21T08:49:32.1029558Z [151s] Generation 8 starting: 12 neighbors, 1 active search path(s)
2026-02-21T08:49:34.6972166Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 7.7 configs/s
2026-02-21T08:49:35.4999706Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 13/13 17.1 configs/s
2026-02-21T08:49:36.6383299Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 879.3 configs/s
2026-02-21T08:49:36.7310791Z [156s] Generation 8 complete: 
2026-02-21T08:49:36.7312019Z ok=14
2026-02-21T08:49:36.7312158Z min=0.0369
2026-02-21T08:49:36.7312311Z mid=0.0389
2026-02-21T08:49:36.7312434Z max=0.0532
2026-02-21T08:49:36.7312599Z best={'block_sizes': [1, 16384],
2026-02-21T08:49:36.7312910Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:49:36.7313736Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:49:36.7313937Z  'num_sm_multiplier': 32,
2026-02-21T08:49:36.7314103Z  'num_stages': 4,
2026-02-21T08:49:36.7314248Z  'num_warps': 8,
2026-02-21T08:49:36.7314405Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:49:36.7314602Z  'range_flattens': [False, True],
2026-02-21T08:49:36.7314790Z  'range_multi_buffers': [False, True],
2026-02-21T08:49:36.7314972Z  'range_num_stages': [0, 2],
2026-02-21T08:49:36.7315145Z  'range_unroll_factors': [1, 1],
2026-02-21T08:49:36.7315319Z  'range_warp_specializes': [True, None]}
2026-02-21T08:49:36.7321165Z [156s] Fitting surrogate: 454 points, 454 targets
2026-02-21T08:49:37.1104946Z [156s] Generation 9 starting: 19 neighbors, 1 active search path(s)
2026-02-21T08:49:45.8761107Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 2.0 configs/s
2026-02-21T08:49:47.2701470Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 19/19 13.9 configs/s
2026-02-21T08:49:48.3116886Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 962.8 configs/s
2026-02-21T08:49:48.3946366Z [168s] Generation 9 complete: 
2026-02-21T08:49:48.3946605Z ok=20
2026-02-21T08:49:48.3946743Z min=0.0369
2026-02-21T08:49:48.3946869Z mid=0.0410
2026-02-21T08:49:48.3946997Z max=0.1352
2026-02-21T08:49:48.3947131Z best={'block_sizes': [1, 16384],
2026-02-21T08:49:48.3947389Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:49:48.3947647Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:49:48.3947849Z  'num_sm_multiplier': 32,
2026-02-21T08:49:48.3948009Z  'num_stages': 4,
2026-02-21T08:49:48.3948160Z  'num_warps': 8,
2026-02-21T08:49:48.3948322Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:49:48.3948513Z  'range_flattens': [False, True],
2026-02-21T08:49:48.3949184Z  'range_multi_buffers': [False, True],
2026-02-21T08:49:48.3962488Z  'range_num_stages': [0, 2],
2026-02-21T08:49:48.3962720Z  'range_unroll_factors': [1, 1],
2026-02-21T08:49:48.3962913Z  'range_warp_specializes': [True, None]}
2026-02-21T08:49:48.3963138Z [168s] Fitting surrogate: 474 points, 474 targets
2026-02-21T08:49:48.5966868Z [168s] Autotuning complete in 168.2s after searching 462 configs.
2026-02-21T08:49:48.5969294Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:49:48.5970353Z     @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], num_sm_multiplier=32, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, True], range_num_stages=[0, 2], range_unroll_factors=[1, 1], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:49:48.5971292Z 
2026-02-21T08:49:48.5972280Z [168s] Code of selected kernel: /tmp/torchinductor_root/im/cimneque2jlxruvi7ivem6r5lzucrbfserm5ratl2hy54aqmngsj.py
2026-02-21T08:49:49.6363822Z WARNING:tritonbench.utils.triton_op:Completed input ID 71:
2026-02-21T08:49:49.6365487Z (M, N)
2026-02-21T08:49:49.6365650Z ------------
2026-02-21T08:49:49.6365805Z (4096, 9344)
2026-02-21T08:49:49.6365885Z 
2026-02-21T08:49:49.6376196Z  75%|███████▌  | 15/20 [40:47<14:05, 169.19s/it]
2026-02-21T08:49:49.6376196Z WARNING:tritonbench.utils.triton_op:Running input ID 77:
2026-02-21T08:49:49.6378502Z (M, N)
2026-02-21T08:49:49.6378721Z -------------
2026-02-21T08:49:49.6382188Z (4096, 10112)
2026-02-21T08:49:49.6382545Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T08:49:50.7998756Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T08:49:52.1452584Z INFO:tritonbench.utils.triton_op:Took 2.32ms to get benchmark function for torch_compile_softmax
2026-02-21T08:49:55.9499367Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:49:55.9501075Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:49:55.9501299Z               'dtype': 'torch.float16',
2026-02-21T08:49:55.9501496Z               'shape': (4096, 10112),
2026-02-21T08:49:55.9501965Z               'stride': (10112, 1)},),
2026-02-21T08:49:55.9502142Z   'kwargs': {}}
2026-02-21T08:49:55.9518570Z INFO:tritonbench.utils.triton_op:Took 2.63ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T08:49:56.1330568Z [0s] Autotune random seed: 2134816249
2026-02-21T08:49:56.2818441Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:50:36.3632328Z [40s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[4, 1], range_warp_specializes=[None, None])
2026-02-21T08:50:36.3648897Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.3 configs/s
2026-02-21T08:50:40.7471245Z module attributes {ttg.maxnreg = 256 : i32} {
2026-02-21T08:50:40.7475458Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:50:40.7476710Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:50:40.7477103Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:50:40.7477538Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:50:40.7477914Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:50:40.7478299Z     %cst = arith.constant dense<10112> : tensor<128x1xi32>
2026-02-21T08:50:40.7478806Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<128x256xf32>
2026-02-21T08:50:40.7479334Z     %cst_1 = arith.constant dense<0xFC00> : tensor<128x256xf16>
2026-02-21T08:50:40.7480119Z     %cst_2 = arith.constant dense<10112> : tensor<256xi32>
2026-02-21T08:50:40.7480610Z     %cst_3 = arith.constant dense<0.000000e+00> : tensor<128xf32>
2026-02-21T08:50:40.7481083Z     %cst_4 = arith.constant dense<0xFF800000> : tensor<128xf32>
2026-02-21T08:50:40.7481523Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:50:40.7481912Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:50:40.7482303Z     %c10112_i32 = arith.constant 10112 : i32
2026-02-21T08:50:40.7482695Z     %c10112_i64 = arith.constant 10112 : i64
2026-02-21T08:50:40.7483035Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:50:40.7483639Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c10112_i32], [%c10112_i64, %c1_i64] : <f16>, <tensor<128x256xf16>>
2026-02-21T08:50:40.7484234Z     %1 = tt.get_program_id x : i32
2026-02-21T08:50:40.7484598Z     %2 = arith.addi %1, %c1_i32 : i32
2026-02-21T08:50:40.7484901Z     %3 = arith.minsi %2, %c32_i32 : i32
2026-02-21T08:50:40.7485467Z     scf.for %arg2 = %1 to %3 step %c1_i32  : i32 {
2026-02-21T08:50:40.7485844Z       %4 = arith.muli %arg2, %c128_i32 : i32
2026-02-21T08:50:40.7486299Z       %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:50:40.7486789Z       %6 = tt.splat %4 : i32 -> tensor<128xi32>
2026-02-21T08:50:40.7487151Z       %7 = arith.addi %6, %5 : tensor<128xi32>
2026-02-21T08:50:40.7487535Z       %c9984_i32 = arith.constant 9984 : i32
2026-02-21T08:50:40.7487851Z       %c768_i32 = arith.constant 768 : i32
2026-02-21T08:50:40.7488549Z       %8:2 = scf.for %arg3 = %c0_i32 to %c9984_i32 step %c768_i32 iter_args(%arg4 = %cst_4, %arg5 = %cst_3) -> (tensor<128xf32>, tensor<128xf32>)  : i32 {
2026-02-21T08:50:40.7489330Z         %60 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:50:40.7489797Z         %61 = tt.splat %arg3 : i32 -> tensor<256xi32>
2026-02-21T08:50:40.7490203Z         %62 = arith.addi %61, %60 : tensor<256xi32>
2026-02-21T08:50:40.7490586Z         %63 = arith.cmpi slt, %62, %cst_2 : tensor<256xi32>
2026-02-21T08:50:40.7491163Z         %64 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:50:40.7491843Z         %65 = tt.expand_dims %63 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:50:40.7492419Z         %66 = tt.broadcast %65 : tensor<1x256xi1> -> tensor<128x256xi1>
2026-02-21T08:50:40.7492988Z         %67 = arith.select %66, %64, %cst_1 : tensor<128x256xi1>, tensor<128x256xf16>
2026-02-21T08:50:40.7493489Z         %68 = arith.extf %67 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:50:40.7494003Z         %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({
2026-02-21T08:50:40.7494375Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:50:40.7494770Z           %145 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:50:40.7495149Z           tt.reduce.return %145 : f32
2026-02-21T08:50:40.7495553Z         }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:50:40.7496034Z         %70 = arith.truncf %69 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:50:40.7496513Z         %71 = arith.extf %70 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:50:40.7497000Z         %72 = arith.cmpf ogt, %arg4, %71 : tensor<128xf32>
2026-02-21T08:50:40.7497436Z         %73 = arith.cmpf une, %arg4, %arg4 : tensor<128xf32>
2026-02-21T08:50:40.7497875Z         %74 = arith.ori %72, %73 : tensor<128xi1>
2026-02-21T08:50:40.7498323Z         %75 = arith.select %74, %arg4, %71 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:50:40.7498835Z         %76 = arith.subf %arg4, %75 : tensor<128xf32>
2026-02-21T08:50:40.7499553Z         %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:50:40.7500253Z         %78 = arith.mulf %arg5, %77 : tensor<128xf32>
2026-02-21T08:50:40.7500772Z         %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:50:40.7501323Z         %80 = arith.extf %64 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:50:40.7502054Z         %81 = tt.broadcast %79 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:50:40.7502463Z         %82 = arith.subf %80, %81 : tensor<128x256xf32>
2026-02-21T08:50:40.7503149Z         %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:50:40.7503943Z         %84 = arith.select %66, %83, %cst_0 : tensor<128x256xi1>, tensor<128x256xf32>
2026-02-21T08:50:40.7504358Z         %85 = "tt.reduce"(%84) <{axis = 1 : i32}> ({
2026-02-21T08:50:40.7504670Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:50:40.7504955Z           %145 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:50:40.7505258Z           tt.reduce.return %145 : f32
2026-02-21T08:50:40.7505563Z         }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:50:40.7505878Z         %86 = arith.addf %78, %85 : tensor<128xf32>
2026-02-21T08:50:40.7506312Z         %c1_i32_7 = arith.constant 1 : i32
2026-02-21T08:50:40.7506611Z         %87 = arith.muli %c256_i32, %c1_i32_7 : i32
2026-02-21T08:50:40.7506921Z         %88 = arith.addi %arg3, %87 : i32
2026-02-21T08:50:40.7507281Z         %89 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:50:40.7507692Z         %90 = tt.splat %88 : i32 -> tensor<256xi32>
2026-02-21T08:50:40.7508007Z         %91 = arith.addi %90, %89 : tensor<256xi32>
2026-02-21T08:50:40.7508354Z         %92 = arith.cmpi slt, %91, %cst_2 : tensor<256xi32>
2026-02-21T08:50:40.7508858Z         %93 = tt.descriptor_load %0[%4, %88] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:50:40.7509420Z         %94 = tt.expand_dims %92 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:50:40.7509898Z         %95 = tt.broadcast %94 : tensor<1x256xi1> -> tensor<128x256xi1>
2026-02-21T08:50:40.7510350Z         %96 = arith.select %95, %93, %cst_1 : tensor<128x256xi1>, tensor<128x256xf16>
2026-02-21T08:50:40.7510826Z         %97 = arith.extf %96 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:50:40.7511210Z         %98 = "tt.reduce"(%97) <{axis = 1 : i32}> ({
2026-02-21T08:50:40.7511504Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:50:40.7511832Z           %145 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:50:40.7512142Z           tt.reduce.return %145 : f32
2026-02-21T08:50:40.7512447Z         }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:50:40.7512804Z         %99 = arith.truncf %98 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:50:40.7513219Z         %100 = arith.extf %99 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:50:40.7513605Z         %101 = arith.cmpf ogt, %75, %100 : tensor<128xf32>
2026-02-21T08:50:40.7513959Z         %102 = arith.cmpf une, %75, %75 : tensor<128xf32>
2026-02-21T08:50:40.7514299Z         %103 = arith.ori %101, %102 : tensor<128xi1>
2026-02-21T08:50:40.7514687Z         %104 = arith.select %103, %75, %100 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:50:40.7515098Z         %105 = arith.subf %75, %104 : tensor<128xf32>
2026-02-21T08:50:40.7515697Z         %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:50:40.7516280Z         %107 = arith.mulf %86, %106 : tensor<128xf32>
2026-02-21T08:50:40.7516701Z         %108 = tt.expand_dims %104 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:50:40.7517186Z         %109 = arith.extf %93 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:50:40.7517638Z         %110 = tt.broadcast %108 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:50:40.7518042Z         %111 = arith.subf %109, %110 : tensor<128x256xf32>
2026-02-21T08:50:40.7518686Z         %112 = tt.extern_elementwise %111 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:50:40.7519413Z         %113 = arith.select %95, %112, %cst_0 : tensor<128x256xi1>, tensor<128x256xf32>
2026-02-21T08:50:40.7519957Z         %114 = "tt.reduce"(%113) <{axis = 1 : i32}> ({
2026-02-21T08:50:40.7520261Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:50:40.7520541Z           %145 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:50:40.7520840Z           tt.reduce.return %145 : f32
2026-02-21T08:50:40.7521134Z         }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:50:40.7521462Z         %115 = arith.addf %107, %114 : tensor<128xf32>
2026-02-21T08:50:40.7521809Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:50:40.7522117Z         %116 = arith.muli %c256_i32, %c2_i32 : i32
2026-02-21T08:50:40.7522430Z         %117 = arith.addi %arg3, %116 : i32
2026-02-21T08:50:40.7522804Z         %118 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:50:40.7523219Z         %119 = tt.splat %117 : i32 -> tensor<256xi32>
2026-02-21T08:50:40.7523662Z         %120 = arith.addi %119, %118 : tensor<256xi32>
2026-02-21T08:50:40.7524026Z         %121 = arith.cmpi slt, %120, %cst_2 : tensor<256xi32>
2026-02-21T08:50:40.7524532Z         %122 = tt.descriptor_load %0[%4, %117] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:50:40.7525106Z         %123 = tt.expand_dims %121 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:50:40.7525598Z         %124 = tt.broadcast %123 : tensor<1x256xi1> -> tensor<128x256xi1>
2026-02-21T08:50:40.7526073Z         %125 = arith.select %124, %122, %cst_1 : tensor<128x256xi1>, tensor<128x256xf16>
2026-02-21T08:50:40.7526560Z         %126 = arith.extf %125 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:50:40.7526950Z         %127 = "tt.reduce"(%126) <{axis = 1 : i32}> ({
2026-02-21T08:50:40.7527259Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:50:40.7527555Z           %145 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:50:40.7527858Z           tt.reduce.return %145 : f32
2026-02-21T08:50:40.7528164Z         }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:50:40.7528530Z         %128 = arith.truncf %127 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:50:40.7528949Z         %129 = arith.extf %128 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:50:40.7529321Z         %130 = arith.cmpf ogt, %104, %129 : tensor<128xf32>
2026-02-21T08:50:40.7529682Z         %131 = arith.cmpf une, %104, %104 : tensor<128xf32>
2026-02-21T08:50:40.7530021Z         %132 = arith.ori %130, %131 : tensor<128xi1>
2026-02-21T08:50:40.7530411Z         %133 = arith.select %132, %104, %129 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:50:40.7530820Z         %134 = arith.subf %104, %133 : tensor<128xf32>
2026-02-21T08:50:40.7531420Z         %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:50:40.7532080Z         %136 = arith.mulf %115, %135 : tensor<128xf32>
2026-02-21T08:50:40.7532500Z         %137 = tt.expand_dims %133 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:50:40.7533005Z         %138 = arith.extf %122 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:50:40.7533458Z         %139 = tt.broadcast %137 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:50:40.7533845Z         %140 = arith.subf %138, %139 : tensor<128x256xf32>
2026-02-21T08:50:40.7534489Z         %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:50:40.7535213Z         %142 = arith.select %124, %141, %cst_0 : tensor<128x256xi1>, tensor<128x256xf32>
2026-02-21T08:50:40.7535672Z         %143 = "tt.reduce"(%142) <{axis = 1 : i32}> ({
2026-02-21T08:50:40.7535996Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:50:40.7536297Z           %145 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:50:40.7536620Z           tt.reduce.return %145 : f32
2026-02-21T08:50:40.7536934Z         }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:50:40.7537291Z         %144 = arith.addf %136, %143 : tensor<128xf32>
2026-02-21T08:50:40.7541161Z         scf.yield %133, %144 : tensor<128xf32>, tensor<128xf32>
2026-02-21T08:50:40.7541527Z       } {tt.num_stages = 1 : i32}
2026-02-21T08:50:40.7541925Z       %9 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:50:40.7542361Z       %10 = tt.splat %c9984_i32 : i32 -> tensor<256xi32>
2026-02-21T08:50:40.7542728Z       %11 = arith.addi %10, %9 : tensor<256xi32>
2026-02-21T08:50:40.7543077Z       %12 = arith.cmpi slt, %11, %cst_2 : tensor<256xi32>
2026-02-21T08:50:40.7543624Z       %13 = tt.descriptor_load %0[%4, %c9984_i32] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:50:40.7544246Z       %14 = tt.expand_dims %12 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:50:40.7544722Z       %15 = tt.broadcast %14 : tensor<1x256xi1> -> tensor<128x256xi1>
2026-02-21T08:50:40.7545299Z       %16 = arith.select %15, %13, %cst_1 : tensor<128x256xi1>, tensor<128x256xf16>
2026-02-21T08:50:40.7545760Z       %17 = arith.extf %16 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:50:40.7546142Z       %18 = "tt.reduce"(%17) <{axis = 1 : i32}> ({
2026-02-21T08:50:40.7546440Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:50:40.7546734Z         %60 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:50:40.7547032Z         tt.reduce.return %60 : f32
2026-02-21T08:50:40.7547330Z       }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:50:40.7547688Z       %19 = arith.truncf %18 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:50:40.7548088Z       %20 = arith.extf %19 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:50:40.7548507Z       %21 = arith.cmpf ogt, %8#0, %20 : tensor<128xf32>
2026-02-21T08:50:40.7548861Z       %22 = arith.cmpf une, %8#0, %8#0 : tensor<128xf32>
2026-02-21T08:50:40.7549186Z       %23 = arith.ori %21, %22 : tensor<128xi1>
2026-02-21T08:50:40.7549562Z       %24 = arith.select %23, %8#0, %20 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:50:40.7549929Z       %25 = arith.subf %8#0, %24 : tensor<128xf32>
2026-02-21T08:50:40.7550516Z       %26 = tt.extern_elementwise %25 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:50:40.7551115Z       %27 = arith.mulf %8#1, %26 : tensor<128xf32>
2026-02-21T08:50:40.7551514Z       %28 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:50:40.7552039Z       %29 = arith.extf %13 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:50:40.7552467Z       %30 = tt.broadcast %28 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:50:40.7552862Z       %31 = arith.subf %29, %30 : tensor<128x256xf32>
2026-02-21T08:50:40.7553491Z       %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:50:40.7554195Z       %33 = arith.select %15, %32, %cst_0 : tensor<128x256xi1>, tensor<128x256xf32>
2026-02-21T08:50:40.7554599Z       %34 = "tt.reduce"(%33) <{axis = 1 : i32}> ({
2026-02-21T08:50:40.7554898Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:50:40.7555183Z         %60 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:50:40.7555474Z         tt.reduce.return %60 : f32
2026-02-21T08:50:40.7555770Z       }) : (tensor<128x256xf32>) -> tensor<128xf32>
2026-02-21T08:50:40.7556093Z       %35 = arith.addf %27, %34 : tensor<128xf32>
2026-02-21T08:50:40.7556402Z       %c9984_i32_5 = arith.constant 9984 : i32
2026-02-21T08:50:40.7556712Z       %c768_i32_6 = arith.constant 768 : i32
2026-02-21T08:50:40.7557070Z       scf.for %arg3 = %c0_i32 to %c9984_i32_5 step %c768_i32_6  : i32 {
2026-02-21T08:50:40.7557529Z         %60 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:50:40.7557937Z         %61 = tt.splat %arg3 : i32 -> tensor<256xi32>
2026-02-21T08:50:40.7558269Z         %62 = arith.addi %61, %60 : tensor<256xi32>
2026-02-21T08:50:40.7558610Z         %63 = arith.cmpi slt, %62, %cst_2 : tensor<256xi32>
2026-02-21T08:50:40.7559224Z         %64 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:50:40.7559819Z         %65 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:50:40.7560364Z         %66 = arith.extf %64 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:50:40.7560871Z         %67 = tt.broadcast %65 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:50:40.7561316Z         %68 = arith.subf %66, %67 : tensor<128x256xf32>
2026-02-21T08:50:40.7562064Z         %69 = tt.extern_elementwise %68 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:50:40.7562858Z         %70 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:50:40.7563375Z         %71 = tt.broadcast %70 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:50:40.7563910Z         %72 = arith.divf %69, %71 : tensor<128x256xf32>
2026-02-21T08:50:40.7564349Z         %73 = arith.truncf %72 : tensor<128x256xf32> to tensor<128x256xf16>
2026-02-21T08:50:40.7564916Z         %74 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:50:40.7565442Z         %75 = arith.muli %74, %cst : tensor<128x1xi32>
2026-02-21T08:50:40.7565903Z         %76 = tt.expand_dims %62 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32>
2026-02-21T08:50:40.7566459Z         %77 = tt.broadcast %75 : tensor<128x1xi32> -> tensor<128x256xi32>
2026-02-21T08:50:40.7566944Z         %78 = tt.broadcast %76 : tensor<1x256xi32> -> tensor<128x256xi32>
2026-02-21T08:50:40.7567413Z         %79 = arith.addi %77, %78 : tensor<128x256xi32>
2026-02-21T08:50:40.7567849Z         %80 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:50:40.7568395Z         %81 = tt.addptr %80, %79 : tensor<128x256x!tt.ptr<f16>>, tensor<128x256xi32>
2026-02-21T08:50:40.7568972Z         %82 = tt.expand_dims %63 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:50:40.7569497Z         %83 = tt.broadcast %82 : tensor<1x256xi1> -> tensor<128x256xi1>
2026-02-21T08:50:40.7569986Z         tt.store %81, %73, %83 : tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:50:40.7570378Z         %c1_i32_7 = arith.constant 1 : i32
2026-02-21T08:50:40.7570772Z         %84 = arith.muli %c256_i32, %c1_i32_7 : i32
2026-02-21T08:50:40.7571129Z         %85 = arith.addi %arg3, %84 : i32
2026-02-21T08:50:40.7571608Z         %86 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:50:40.7572093Z         %87 = tt.splat %85 : i32 -> tensor<256xi32>
2026-02-21T08:50:40.7572463Z         %88 = arith.addi %87, %86 : tensor<256xi32>
2026-02-21T08:50:40.7572888Z         %89 = arith.cmpi slt, %88, %cst_2 : tensor<256xi32>
2026-02-21T08:50:40.7573429Z         %90 = tt.descriptor_load %0[%4, %85] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:50:40.7574090Z         %91 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:50:40.7574625Z         %92 = arith.extf %90 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:50:40.7575136Z         %93 = tt.broadcast %91 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:50:40.7575591Z         %94 = arith.subf %92, %93 : tensor<128x256xf32>
2026-02-21T08:50:40.7576256Z         %95 = tt.extern_elementwise %94 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:50:40.7577041Z         %96 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:50:40.7577565Z         %97 = tt.broadcast %96 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:50:40.7578038Z         %98 = arith.divf %95, %97 : tensor<128x256xf32>
2026-02-21T08:50:40.7578508Z         %99 = arith.truncf %98 : tensor<128x256xf32> to tensor<128x256xf16>
2026-02-21T08:50:40.7579042Z         %100 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:50:40.7579708Z         %101 = arith.muli %100, %cst : tensor<128x1xi32>
2026-02-21T08:50:40.7580210Z         %102 = tt.expand_dims %88 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32>
2026-02-21T08:50:40.7580815Z         %103 = tt.broadcast %101 : tensor<128x1xi32> -> tensor<128x256xi32>
2026-02-21T08:50:40.7581320Z         %104 = tt.broadcast %102 : tensor<1x256xi32> -> tensor<128x256xi32>
2026-02-21T08:50:40.7581871Z         %105 = arith.addi %103, %104 : tensor<128x256xi32>
2026-02-21T08:50:40.7582380Z         %106 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:50:40.7582934Z         %107 = tt.addptr %106, %105 : tensor<128x256x!tt.ptr<f16>>, tensor<128x256xi32>
2026-02-21T08:50:40.7583560Z         %108 = tt.expand_dims %89 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:50:40.7584118Z         %109 = tt.broadcast %108 : tensor<1x256xi1> -> tensor<128x256xi1>
2026-02-21T08:50:40.7584747Z         tt.store %107, %99, %109 : tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:50:40.7585201Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:50:40.7585572Z         %110 = arith.muli %c256_i32, %c2_i32 : i32
2026-02-21T08:50:40.7585987Z         %111 = arith.addi %arg3, %110 : i32
2026-02-21T08:50:40.7586430Z         %112 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:50:40.7586954Z         %113 = tt.splat %111 : i32 -> tensor<256xi32>
2026-02-21T08:50:40.7587353Z         %114 = arith.addi %113, %112 : tensor<256xi32>
2026-02-21T08:50:40.7587787Z         %115 = arith.cmpi slt, %114, %cst_2 : tensor<256xi32>
2026-02-21T08:50:40.7588375Z         %116 = tt.descriptor_load %0[%4, %111] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:50:40.7588999Z         %117 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:50:40.7589538Z         %118 = arith.extf %116 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:50:40.7590034Z         %119 = tt.broadcast %117 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:50:40.7590485Z         %120 = arith.subf %118, %119 : tensor<128x256xf32>
2026-02-21T08:50:40.7591149Z         %121 = tt.extern_elementwise %120 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:50:40.7591987Z         %122 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:50:40.7592572Z         %123 = tt.broadcast %122 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:50:40.7593027Z         %124 = arith.divf %121, %123 : tensor<128x256xf32>
2026-02-21T08:50:40.7593514Z         %125 = arith.truncf %124 : tensor<128x256xf32> to tensor<128x256xf16>
2026-02-21T08:50:40.7594050Z         %126 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:50:40.7594569Z         %127 = arith.muli %126, %cst : tensor<128x1xi32>
2026-02-21T08:50:40.7595088Z         %128 = tt.expand_dims %114 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32>
2026-02-21T08:50:40.7595626Z         %129 = tt.broadcast %127 : tensor<128x1xi32> -> tensor<128x256xi32>
2026-02-21T08:50:40.7596154Z         %130 = tt.broadcast %128 : tensor<1x256xi32> -> tensor<128x256xi32>
2026-02-21T08:50:40.7596610Z         %131 = arith.addi %129, %130 : tensor<128x256xi32>
2026-02-21T08:50:40.7597094Z         %132 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:50:40.7597628Z         %133 = tt.addptr %132, %131 : tensor<128x256x!tt.ptr<f16>>, tensor<128x256xi32>
2026-02-21T08:50:40.7598232Z         %134 = tt.expand_dims %115 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:50:40.7598790Z         %135 = tt.broadcast %134 : tensor<1x256xi1> -> tensor<128x256xi1>
2026-02-21T08:50:40.7599245Z         tt.store %133, %125, %135 : tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:50:40.7599645Z       } {tt.num_stages = 1 : i32}
2026-02-21T08:50:40.7600049Z       %36 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:50:40.7600667Z       %37 = tt.splat %c9984_i32_5 : i32 -> tensor<256xi32>
2026-02-21T08:50:40.7601096Z       %38 = arith.addi %37, %36 : tensor<256xi32>
2026-02-21T08:50:40.7601486Z       %39 = arith.cmpi slt, %38, %cst_2 : tensor<256xi32>
2026-02-21T08:50:40.7602151Z       %40 = tt.descriptor_load %0[%4, %c9984_i32_5] : !tt.tensordesc<tensor<128x256xf16>> -> tensor<128x256xf16>
2026-02-21T08:50:40.7602806Z       %41 = tt.expand_dims %24 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:50:40.7603362Z       %42 = arith.extf %40 : tensor<128x256xf16> to tensor<128x256xf32>
2026-02-21T08:50:40.7603829Z       %43 = tt.broadcast %41 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:50:40.7604295Z       %44 = arith.subf %42, %43 : tensor<128x256xf32>
2026-02-21T08:50:40.7605121Z       %45 = tt.extern_elementwise %44 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x256xf32>) -> tensor<128x256xf32>
2026-02-21T08:50:40.7605883Z       %46 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:50:40.7606439Z       %47 = tt.broadcast %46 : tensor<128x1xf32> -> tensor<128x256xf32>
2026-02-21T08:50:40.7606873Z       %48 = arith.divf %45, %47 : tensor<128x256xf32>
2026-02-21T08:50:40.7607338Z       %49 = arith.truncf %48 : tensor<128x256xf32> to tensor<128x256xf16>
2026-02-21T08:50:40.7607882Z       %50 = tt.expand_dims %7 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:50:40.7608365Z       %51 = arith.muli %50, %cst : tensor<128x1xi32>
2026-02-21T08:50:40.7608860Z       %52 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32> -> tensor<1x256xi32>
2026-02-21T08:50:40.7609376Z       %53 = tt.broadcast %51 : tensor<128x1xi32> -> tensor<128x256xi32>
2026-02-21T08:50:40.7609891Z       %54 = tt.broadcast %52 : tensor<1x256xi32> -> tensor<128x256xi32>
2026-02-21T08:50:40.7610325Z       %55 = arith.addi %53, %54 : tensor<128x256xi32>
2026-02-21T08:50:40.7610795Z       %56 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:50:40.7611332Z       %57 = tt.addptr %56, %55 : tensor<128x256x!tt.ptr<f16>>, tensor<128x256xi32>
2026-02-21T08:50:40.7611904Z       %58 = tt.expand_dims %39 {axis = 0 : i32} : tensor<256xi1> -> tensor<1x256xi1>
2026-02-21T08:50:40.7612312Z       %59 = tt.broadcast %58 : tensor<1x256xi1> -> tensor<128x256xi1>
2026-02-21T08:50:40.7612602Z       tt.store %57, %49, %59 : tensor<128x256x!tt.ptr<f16>>
2026-02-21T08:50:40.7612901Z     } {tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:50:40.7613134Z     tt.return
2026-02-21T08:50:40.7613328Z   }
2026-02-21T08:50:40.7613514Z }
2026-02-21T08:50:40.7613604Z 
2026-02-21T08:50:40.7613676Z {-#
2026-02-21T08:50:40.7613852Z   external_resources: {
2026-02-21T08:50:40.7614049Z     mlir_reproducer: {
2026-02-21T08:50:40.7621764Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:50:40.7630250Z       disable_threading: false,
2026-02-21T08:50:40.7630587Z       verify_each: true
2026-02-21T08:50:40.7630908Z     }
2026-02-21T08:50:40.7631133Z   }
2026-02-21T08:50:40.7631386Z #-}
2026-02-21T08:50:40.7632240Z /tmp/torchinductor_root/ev/cevvux4li7rnsh5wz7iwurivhxq66lvzlqufnv4e3fb3pym4xoo2.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:50:40.7634457Z /tmp/torchinductor_root/ev/cevvux4li7rnsh5wz7iwurivhxq66lvzlqufnv4e3fb3pym4xoo2.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:50:40.7636291Z [44s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:50:40.7638258Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=256, num_sm_multiplier=16, num_stages=8, num_warps=16, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:50:40.7640038Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:50:40.7640543Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:50:44.2124910Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 12.8 configs/s
2026-02-21T08:50:44.2135573Z [47s] Adaptive compile timeout: 30s (90% percentile=10.3s, bounds=[30.0s, 30s])
2026-02-21T08:50:45.0400154Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1191.8 configs/s
2026-02-21T08:50:45.1036833Z [48s] Initial random population of 100, 5 starting points: 
2026-02-21T08:50:45.1042155Z error=5
2026-02-21T08:50:45.1047135Z timeout=1
2026-02-21T08:50:45.1051479Z ok=94
2026-02-21T08:50:45.1053120Z min=0.0511
2026-02-21T08:50:45.1053367Z mid=0.9227
2026-02-21T08:50:45.1053539Z max=49.1755
2026-02-21T08:50:45.1053758Z best={'block_sizes': [1, 16384],
2026-02-21T08:50:45.1054031Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:50:45.1054336Z  'load_eviction_policies': ['last', ''],
2026-02-21T08:50:45.1054611Z  'num_sm_multiplier': 8,
2026-02-21T08:50:45.1054853Z  'num_stages': 3,
2026-02-21T08:50:45.1055029Z  'num_warps': 1,
2026-02-21T08:50:45.1055249Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:50:45.1055479Z  'range_flattens': [False, None],
2026-02-21T08:50:45.1055729Z  'range_multi_buffers': [True, True],
2026-02-21T08:50:45.1055978Z  'range_num_stages': [1, 2],
2026-02-21T08:50:45.1056187Z  'range_unroll_factors': [0, 1],
2026-02-21T08:50:45.1056435Z  'range_warp_specializes': [True, None]}
2026-02-21T08:50:45.1056688Z [48s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:50:46.1488572Z [49s] Generation 1 starting: 80 neighbors, 5 active search path(s)
2026-02-21T08:51:00.3485347Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 1.2 configs/s
2026-02-21T08:51:09.2137669Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 84/84 9.5 configs/s
2026-02-21T08:51:14.7880826Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 181.3 configs/s
2026-02-21T08:51:15.0754172Z [78s] Generation 1 complete: 
2026-02-21T08:51:15.0756013Z ok=86
2026-02-21T08:51:15.0756274Z min=0.0471
2026-02-21T08:51:15.0756460Z mid=0.0655
2026-02-21T08:51:15.0756662Z max=0.3850
2026-02-21T08:51:15.0756854Z best={'block_sizes': [1, 16384],
2026-02-21T08:51:15.0757179Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:51:15.0757466Z  'load_eviction_policies': ['last', ''],
2026-02-21T08:51:15.0757743Z  'num_sm_multiplier': 16,
2026-02-21T08:51:15.0757955Z  'num_stages': 3,
2026-02-21T08:51:15.0758177Z  'num_warps': 1,
2026-02-21T08:51:15.0758393Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:51:15.0758671Z  'range_flattens': [False, True],
2026-02-21T08:51:15.0758922Z  'range_multi_buffers': [True, True],
2026-02-21T08:51:15.0759150Z  'range_num_stages': [1, 2],
2026-02-21T08:51:15.0759356Z  'range_unroll_factors': [0, 1],
2026-02-21T08:51:15.0759903Z  'range_warp_specializes': [True, None]}
2026-02-21T08:51:15.0770986Z [78s] Fitting surrogate: 186 points, 186 targets
2026-02-21T08:51:16.1222051Z [79s] Generation 2 starting: 73 neighbors, 5 active search path(s)
2026-02-21T08:51:27.5429367Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 2.1 configs/s
2026-02-21T08:51:31.9270827Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 17.1 configs/s
2026-02-21T08:51:38.4827259Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 172.4 configs/s
2026-02-21T08:51:38.8119817Z [102s] Generation 2 complete: 
2026-02-21T08:51:38.8123933Z ok=78
2026-02-21T08:51:38.8129031Z min=0.0451
2026-02-21T08:51:38.8130542Z mid=0.0573
2026-02-21T08:51:38.8130837Z max=0.2806
2026-02-21T08:51:38.8135696Z best={'block_sizes': [1, 16384],
2026-02-21T08:51:38.8140296Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:51:38.8144819Z  'load_eviction_policies': ['last', ''],
2026-02-21T08:51:38.8149321Z  'num_sm_multiplier': 32,
2026-02-21T08:51:38.8153313Z  'num_stages': 4,
2026-02-21T08:51:38.8158598Z  'num_warps': 1,
2026-02-21T08:51:38.8163183Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:51:38.8165125Z  'range_flattens': [False, True],
2026-02-21T08:51:38.8165511Z  'range_multi_buffers': [True, True],
2026-02-21T08:51:38.8165795Z  'range_num_stages': [1, 2],
2026-02-21T08:51:38.8166085Z  'range_unroll_factors': [0, 1],
2026-02-21T08:51:38.8166349Z  'range_warp_specializes': [True, None]}
2026-02-21T08:51:38.8166785Z [102s] Fitting surrogate: 264 points, 264 targets
2026-02-21T08:51:39.7419093Z [103s] Generation 3 starting: 72 neighbors, 5 active search path(s)
2026-02-21T08:51:53.1800953Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 8.8 configs/s
2026-02-21T08:51:58.9119454Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 13.7 configs/s
2026-02-21T08:52:03.2253310Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 234.2 configs/s
2026-02-21T08:52:03.4755945Z [127s] Generation 3 complete: 
2026-02-21T08:52:03.4760004Z ok=78
2026-02-21T08:52:03.4760305Z min=0.0470
2026-02-21T08:52:03.4760529Z mid=0.0616
2026-02-21T08:52:03.4760759Z max=0.2089
2026-02-21T08:52:03.4761019Z best={'block_sizes': [1, 16384],
2026-02-21T08:52:03.4761328Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:52:03.4762308Z  'load_eviction_policies': ['last', ''],
2026-02-21T08:52:03.4762635Z  'num_sm_multiplier': 32,
2026-02-21T08:52:03.4762918Z  'num_stages': 4,
2026-02-21T08:52:03.4763121Z  'num_warps': 1,
2026-02-21T08:52:03.4763350Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:52:03.4763589Z  'range_flattens': [False, None],
2026-02-21T08:52:03.4763849Z  'range_multi_buffers': [True, True],
2026-02-21T08:52:03.4764113Z  'range_num_stages': [1, 2],
2026-02-21T08:52:03.4764761Z  'range_unroll_factors': [0, 1],
2026-02-21T08:52:03.4765075Z  'range_warp_specializes': [True, None]}
2026-02-21T08:52:03.4771054Z [127s] Fitting surrogate: 342 points, 342 targets
2026-02-21T08:52:04.5167703Z [128s] Generation 4 starting: 73 neighbors, 5 active search path(s)
2026-02-21T08:52:23.2237078Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 1.3 configs/s
2026-02-21T08:52:27.8813897Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 16.9 configs/s
2026-02-21T08:52:31.5934521Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 272.1 configs/s
2026-02-21T08:52:31.8340663Z [155s] Generation 4 complete: 
2026-02-21T08:52:31.8345074Z ok=79
2026-02-21T08:52:31.8350800Z min=0.0389
2026-02-21T08:52:31.8352678Z mid=0.0554
2026-02-21T08:52:31.8353012Z max=0.2560
2026-02-21T08:52:31.8353680Z best={'block_sizes': [1, 16384],
2026-02-21T08:52:31.8354100Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:52:31.8360378Z  'load_eviction_policies': ['', ''],
2026-02-21T08:52:31.8363260Z  'num_sm_multiplier': 64,
2026-02-21T08:52:31.8363622Z  'num_stages': 5,
2026-02-21T08:52:31.8368228Z  'num_warps': 2,
2026-02-21T08:52:31.8370597Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:52:31.8370907Z  'range_flattens': [False, True],
2026-02-21T08:52:31.8371191Z  'range_multi_buffers': [False, False],
2026-02-21T08:52:31.8371443Z  'range_num_stages': [3, 1],
2026-02-21T08:52:31.8371801Z  'range_unroll_factors': [0, 2],
2026-02-21T08:52:31.8372068Z  'range_warp_specializes': [True, None]}
2026-02-21T08:52:31.8372411Z [155s] Fitting surrogate: 421 points, 421 targets
2026-02-21T08:52:33.0531055Z [156s] Generation 5 starting: 76 neighbors, 5 active search path(s)
2026-02-21T08:52:51.8480958Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81/81 2.8 configs/s
2026-02-21T08:52:56.7467809Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 81/81 16.7 configs/s
2026-02-21T08:53:00.9418410Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 304.5 configs/s
2026-02-21T08:53:01.1560866Z [184s] Generation 5 complete: 
2026-02-21T08:53:01.1565051Z ok=82
2026-02-21T08:53:01.1569611Z min=0.0369
2026-02-21T08:53:01.1574267Z mid=0.0615
2026-02-21T08:53:01.1578928Z max=0.3031
2026-02-21T08:53:01.1582947Z best={'block_sizes': [1, 16384],
2026-02-21T08:53:01.1586991Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:53:01.1588302Z  'load_eviction_policies': ['', ''],
2026-02-21T08:53:01.1588578Z  'num_sm_multiplier': 32,
2026-02-21T08:53:01.1588764Z  'num_stages': 5,
2026-02-21T08:53:01.1588969Z  'num_warps': 8,
2026-02-21T08:53:01.1589172Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:53:01.1589445Z  'range_flattens': [False, True],
2026-02-21T08:53:01.1589711Z  'range_multi_buffers': [False, False],
2026-02-21T08:53:01.1590387Z  'range_num_stages': [3, 1],
2026-02-21T08:53:01.1590600Z  'range_unroll_factors': [0, 1],
2026-02-21T08:53:01.1590861Z  'range_warp_specializes': [True, None]}
2026-02-21T08:53:01.1591144Z [184s] Fitting surrogate: 503 points, 503 targets
2026-02-21T08:53:02.0529588Z [185s] Generation 6 starting: 56 neighbors, 4 active search path(s)
2026-02-21T08:53:14.7028309Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60/60 3.9 configs/s
2026-02-21T08:53:18.6869741Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 60/60 15.2 configs/s
2026-02-21T08:53:20.8043378Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 475.3 configs/s
2026-02-21T08:53:20.9367445Z [204s] Generation 6 complete: 
2026-02-21T08:53:20.9373416Z error=1
2026-02-21T08:53:20.9374968Z ok=60
2026-02-21T08:53:20.9375621Z min=0.0369
2026-02-21T08:53:20.9378366Z mid=0.0614
2026-02-21T08:53:20.9378608Z max=0.2376
2026-02-21T08:53:20.9378796Z best={'block_sizes': [1, 16384],
2026-02-21T08:53:20.9379095Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:53:20.9379365Z  'load_eviction_policies': ['', ''],
2026-02-21T08:53:20.9379618Z  'num_sm_multiplier': 32,
2026-02-21T08:53:20.9379844Z  'num_stages': 5,
2026-02-21T08:53:20.9380025Z  'num_warps': 8,
2026-02-21T08:53:20.9380250Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:53:20.9380481Z  'range_flattens': [False, True],
2026-02-21T08:53:20.9380732Z  'range_multi_buffers': [True, False],
2026-02-21T08:53:20.9380952Z  'range_num_stages': [3, 1],
2026-02-21T08:53:20.9381184Z  'range_unroll_factors': [0, 1],
2026-02-21T08:53:20.9381400Z  'range_warp_specializes': [True, None]}
2026-02-21T08:53:20.9392712Z [204s] Fitting surrogate: 564 points, 564 targets
2026-02-21T08:53:21.7409622Z [205s] Generation 7 starting: 44 neighbors, 3 active search path(s)
2026-02-21T08:53:32.8663061Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47/47 1.4 configs/s
2026-02-21T08:53:35.6734080Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 47/47 17.0 configs/s
2026-02-21T08:53:37.6774860Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 503.4 configs/s
2026-02-21T08:53:37.8091223Z [221s] Generation 7 complete: 
2026-02-21T08:53:37.8092845Z ok=48
2026-02-21T08:53:37.8093063Z min=0.0369
2026-02-21T08:53:37.8093278Z mid=0.0554
2026-02-21T08:53:37.8093445Z max=0.2560
2026-02-21T08:53:37.8093659Z best={'block_sizes': [1, 16384],
2026-02-21T08:53:37.8093934Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:53:37.8094238Z  'load_eviction_policies': ['', ''],
2026-02-21T08:53:37.8094473Z  'num_sm_multiplier': 32,
2026-02-21T08:53:37.8094704Z  'num_stages': 5,
2026-02-21T08:53:37.8094911Z  'num_warps': 8,
2026-02-21T08:53:37.8095155Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:53:37.8095441Z  'range_flattens': [False, True],
2026-02-21T08:53:37.8095662Z  'range_multi_buffers': [True, False],
2026-02-21T08:53:37.8095912Z  'range_num_stages': [3, 1],
2026-02-21T08:53:37.8096121Z  'range_unroll_factors': [0, 1],
2026-02-21T08:53:37.8096373Z  'range_warp_specializes': [True, None]}
2026-02-21T08:53:37.8114561Z [221s] Fitting surrogate: 612 points, 612 targets
2026-02-21T08:53:38.2034254Z [221s] Generation 8 starting: 10 neighbors, 1 active search path(s)
2026-02-21T08:53:42.5589477Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 3.3 configs/s
2026-02-21T08:53:43.2287872Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 15.9 configs/s
2026-02-21T08:53:44.0741063Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1176.6 configs/s
2026-02-21T08:53:44.1392341Z [227s] Generation 8 complete: 
2026-02-21T08:53:44.1394426Z ok=12
2026-02-21T08:53:44.1394740Z min=0.0389
2026-02-21T08:53:44.1400728Z mid=0.0409
2026-02-21T08:53:44.1402582Z max=0.0798
2026-02-21T08:53:44.1402839Z best={'block_sizes': [1, 16384],
2026-02-21T08:53:44.1403125Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:53:44.1403440Z  'load_eviction_policies': ['', ''],
2026-02-21T08:53:44.1403672Z  'num_sm_multiplier': 32,
2026-02-21T08:53:44.1403909Z  'num_stages': 5,
2026-02-21T08:53:44.1404087Z  'num_warps': 8,
2026-02-21T08:53:44.1404321Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:53:44.1404586Z  'range_flattens': [False, True],
2026-02-21T08:53:44.1404810Z  'range_multi_buffers': [True, False],
2026-02-21T08:53:44.1405045Z  'range_num_stages': [3, 1],
2026-02-21T08:53:44.1405254Z  'range_unroll_factors': [0, 1],
2026-02-21T08:53:44.1405508Z  'range_warp_specializes': [True, None]}
2026-02-21T08:53:44.1417577Z [227s] Fitting surrogate: 624 points, 624 targets
2026-02-21T08:53:44.4843439Z [228s] Generation 9 starting: 6 neighbors, 1 active search path(s)
2026-02-21T08:53:46.4425825Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6/6 3.0 configs/s
2026-02-21T08:53:46.8484579Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━━ 6/6 16.5 configs/s
2026-02-21T08:53:47.4898630Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1544.1 configs/s
2026-02-21T08:53:47.5447848Z [231s] Generation 9 complete: 
2026-02-21T08:53:47.5451814Z ok=8
2026-02-21T08:53:47.5453127Z min=0.0389
2026-02-21T08:53:47.5453417Z mid=0.0408
2026-02-21T08:53:47.5458910Z max=0.0430
2026-02-21T08:53:47.5460332Z best={'block_sizes': [1, 16384],
2026-02-21T08:53:47.5460674Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:53:47.5460981Z  'load_eviction_policies': ['', ''],
2026-02-21T08:53:47.5461216Z  'num_sm_multiplier': 32,
2026-02-21T08:53:47.5461471Z  'num_stages': 5,
2026-02-21T08:53:47.5461732Z  'num_warps': 8,
2026-02-21T08:53:47.5461961Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:53:47.5462202Z  'range_flattens': [False, True],
2026-02-21T08:53:47.5462452Z  'range_multi_buffers': [True, False],
2026-02-21T08:53:47.5462677Z  'range_num_stages': [3, 1],
2026-02-21T08:53:47.5462911Z  'range_unroll_factors': [0, 1],
2026-02-21T08:53:47.5463130Z  'range_warp_specializes': [True, None]}
2026-02-21T08:53:47.5473310Z [231s] Fitting surrogate: 632 points, 632 targets
2026-02-21T08:53:47.8378046Z [231s] Autotuning complete in 231.6s after searching 616 configs.
2026-02-21T08:53:47.8383369Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:53:47.8389289Z     @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', ''], num_sm_multiplier=32, num_stages=5, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, False], range_num_stages=[3, 1], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:53:47.8390385Z 
2026-02-21T08:53:47.8390907Z [231s] Code of selected kernel: /tmp/torchinductor_root/dq/cdqq7zp6bqkjpqkfmt7mzu4ydfou4qpeunr74tt4laftwcjdwbex.py
2026-02-21T08:53:48.9456211Z WARNING:tritonbench.utils.triton_op:Completed input ID 77:
2026-02-21T08:53:48.9458197Z (M, N)
2026-02-21T08:53:48.9458563Z -------------
2026-02-21T08:53:48.9458829Z (4096, 10112)
2026-02-21T08:53:48.9463801Z 
2026-02-21T08:53:48.9468591Z  80%|████████  | 16/20 [44:46<12:41, 190.29s/it]WARNING:tritonbench.utils.triton_op:Running input ID 82:
2026-02-21T08:53:48.9469338Z (M, N)
2026-02-21T08:53:48.9469535Z -------------
2026-02-21T08:53:48.9469792Z (4096, 10752)
2026-02-21T08:53:48.9470153Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for naive_softmax
2026-02-21T08:53:50.1476764Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T08:53:51.4878186Z INFO:tritonbench.utils.triton_op:Took 2.33ms to get benchmark function for torch_compile_softmax
2026-02-21T08:53:56.2149995Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:53:56.2151858Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:53:56.2152210Z               'dtype': 'torch.float16',
2026-02-21T08:53:56.2152523Z               'shape': (4096, 10752),
2026-02-21T08:53:56.2152783Z               'stride': (10752, 1)},),
2026-02-21T08:53:56.2153068Z   'kwargs': {}}
2026-02-21T08:53:56.2176817Z INFO:tritonbench.utils.triton_op:Took 3.06ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T08:53:56.4000372Z [0s] Autotune random seed: 2134816249
2026-02-21T08:53:59.0300278Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:54:40.3355677Z [41s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[4, 1], range_warp_specializes=[None, None])
2026-02-21T08:54:40.3371997Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.3 configs/s
2026-02-21T08:54:41.4620093Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:54:41.4622361Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:54:41.4622998Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:54:41.4628559Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:54:41.4629948Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:54:41.4630303Z     %cst = arith.constant dense<10752> : tensor<8x1xi32>
2026-02-21T08:54:41.4630643Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<8xf32>
2026-02-21T08:54:41.4630988Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<8xf32>
2026-02-21T08:54:41.4631271Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:54:41.4631503Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:54:41.4631843Z     %c10752_i32 = arith.constant 10752 : i32
2026-02-21T08:54:41.4632076Z     %c10752_i64 = arith.constant 10752 : i64
2026-02-21T08:54:41.4632338Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:54:41.4632705Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c10752_i32], [%c10752_i64, %c1_i64] : <f16>, <tensor<8x512xf16>>
2026-02-21T08:54:41.4633108Z     %1 = tt.get_program_id x : i32
2026-02-21T08:54:41.4633394Z     scf.for %arg2 = %1 to %c512_i32 step %c9472_i32  : i32 {
2026-02-21T08:54:41.4633657Z       %2 = arith.muli %arg2, %c8_i32 : i32
2026-02-21T08:54:41.4633957Z       %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:54:41.4634248Z       %4 = tt.splat %2 : i32 -> tensor<8xi32>
2026-02-21T08:54:41.4634559Z       %5 = arith.addi %4, %3 : tensor<8xi32>
2026-02-21T08:54:41.4634786Z       %c10240_i32 = arith.constant 10240 : i32
2026-02-21T08:54:41.4635054Z       %c2048_i32 = arith.constant 2048 : i32
2026-02-21T08:54:41.4635502Z       %6:2 = scf.for %arg3 = %c0_i32 to %c10240_i32 step %c2048_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<8xf32>, tensor<8xf32>)  : i32 {
2026-02-21T08:54:41.4636016Z         %48 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:54:41.4636416Z         %49 = arith.extf %48 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:54:41.4636695Z         %50 = "tt.reduce"(%49) <{axis = 1 : i32}> ({
2026-02-21T08:54:41.4636968Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:54:41.4637207Z           %126 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:54:41.4637474Z           tt.reduce.return %126 : f32
2026-02-21T08:54:41.4637732Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:54:41.4638326Z         %51 = arith.truncf %50 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:54:41.4638679Z         %52 = arith.extf %51 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:54:41.4638977Z         %53 = arith.cmpf ogt, %arg4, %52 : tensor<8xf32>
2026-02-21T08:54:41.4639255Z         %54 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32>
2026-02-21T08:54:41.4639540Z         %55 = arith.ori %53, %54 : tensor<8xi1>
2026-02-21T08:54:41.4639838Z         %56 = arith.select %55, %arg4, %52 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:54:41.4640113Z         %57 = arith.subf %arg4, %56 : tensor<8xf32>
2026-02-21T08:54:41.4640540Z         %58 = tt.extern_elementwise %57 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:54:41.4640938Z         %59 = arith.mulf %arg5, %58 : tensor<8xf32>
2026-02-21T08:54:41.4641259Z         %60 = tt.expand_dims %56 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:54:41.4641768Z         %61 = tt.broadcast %60 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:54:41.4642076Z         %62 = arith.subf %49, %61 : tensor<8x512xf32>
2026-02-21T08:54:41.4642511Z         %63 = tt.extern_elementwise %62 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:54:41.4642921Z         %64 = "tt.reduce"(%63) <{axis = 1 : i32}> ({
2026-02-21T08:54:41.4643178Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:54:41.4643403Z           %126 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:54:41.4643667Z           tt.reduce.return %126 : f32
2026-02-21T08:54:41.4643892Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:54:41.4644158Z         %65 = arith.addf %59, %64 : tensor<8xf32>
2026-02-21T08:54:41.4644418Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:54:41.4644642Z         %66 = arith.muli %c512_i32, %c1_i32 : i32
2026-02-21T08:54:41.4644916Z         %67 = arith.addi %arg3, %66 : i32
2026-02-21T08:54:41.4645231Z         %68 = tt.descriptor_load %0[%2, %67] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:54:41.4645609Z         %69 = arith.extf %68 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:54:41.4645874Z         %70 = "tt.reduce"(%69) <{axis = 1 : i32}> ({
2026-02-21T08:54:41.4646128Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:54:41.4646370Z           %126 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:54:41.4646602Z           tt.reduce.return %126 : f32
2026-02-21T08:54:41.4646849Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:54:41.4647106Z         %71 = arith.truncf %70 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:54:41.4647411Z         %72 = arith.extf %71 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:54:41.4647677Z         %73 = arith.cmpf ogt, %56, %72 : tensor<8xf32>
2026-02-21T08:54:41.4647955Z         %74 = arith.cmpf une, %56, %56 : tensor<8xf32>
2026-02-21T08:54:41.4648222Z         %75 = arith.ori %73, %74 : tensor<8xi1>
2026-02-21T08:54:41.4648492Z         %76 = arith.select %75, %56, %72 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:54:41.4648792Z         %77 = arith.subf %56, %76 : tensor<8xf32>
2026-02-21T08:54:41.4649179Z         %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:54:41.4649597Z         %79 = arith.mulf %65, %78 : tensor<8xf32>
2026-02-21T08:54:41.4649883Z         %80 = tt.expand_dims %76 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:54:41.4650254Z         %81 = tt.broadcast %80 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:54:41.4650556Z         %82 = arith.subf %69, %81 : tensor<8x512xf32>
2026-02-21T08:54:41.4650959Z         %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:54:41.4651393Z         %84 = "tt.reduce"(%83) <{axis = 1 : i32}> ({
2026-02-21T08:54:41.4651662Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:54:41.4652003Z           %126 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:54:41.4652232Z           tt.reduce.return %126 : f32
2026-02-21T08:54:41.4652484Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:54:41.4652750Z         %85 = arith.addf %79, %84 : tensor<8xf32>
2026-02-21T08:54:41.4652981Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:54:41.4653235Z         %86 = arith.muli %c512_i32, %c2_i32 : i32
2026-02-21T08:54:41.4653462Z         %87 = arith.addi %arg3, %86 : i32
2026-02-21T08:54:41.4653797Z         %88 = tt.descriptor_load %0[%2, %87] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:54:41.4654149Z         %89 = arith.extf %88 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:54:41.4654443Z         %90 = "tt.reduce"(%89) <{axis = 1 : i32}> ({
2026-02-21T08:54:41.4654697Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:54:41.4654917Z           %126 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:54:41.4655246Z           tt.reduce.return %126 : f32
2026-02-21T08:54:41.4655474Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:54:41.4655759Z         %91 = arith.truncf %90 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:54:41.4656034Z         %92 = arith.extf %91 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:54:41.4656326Z         %93 = arith.cmpf ogt, %76, %92 : tensor<8xf32>
2026-02-21T08:54:41.4656607Z         %94 = arith.cmpf une, %76, %76 : tensor<8xf32>
2026-02-21T08:54:41.4656845Z         %95 = arith.ori %93, %94 : tensor<8xi1>
2026-02-21T08:54:41.4657132Z         %96 = arith.select %95, %76, %92 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:54:41.4657395Z         %97 = arith.subf %76, %96 : tensor<8xf32>
2026-02-21T08:54:41.4657836Z         %98 = tt.extern_elementwise %97 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:54:41.4658226Z         %99 = arith.mulf %85, %98 : tensor<8xf32>
2026-02-21T08:54:41.4658543Z         %100 = tt.expand_dims %96 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:54:41.4658901Z         %101 = tt.broadcast %100 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:54:41.4659182Z         %102 = arith.subf %89, %101 : tensor<8x512xf32>
2026-02-21T08:54:41.4659621Z         %103 = tt.extern_elementwise %102 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:54:41.4660028Z         %104 = "tt.reduce"(%103) <{axis = 1 : i32}> ({
2026-02-21T08:54:41.4660288Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:54:41.4660510Z           %126 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:54:41.4660767Z           tt.reduce.return %126 : f32
2026-02-21T08:54:41.4661009Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:54:41.4661243Z         %105 = arith.addf %99, %104 : tensor<8xf32>
2026-02-21T08:54:41.4661506Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T08:54:41.4661780Z         %106 = arith.muli %c512_i32, %c3_i32 : i32
2026-02-21T08:54:41.4662053Z         %107 = arith.addi %arg3, %106 : i32
2026-02-21T08:54:41.4662371Z         %108 = tt.descriptor_load %0[%2, %107] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:54:41.4662762Z         %109 = arith.extf %108 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:54:41.4663069Z         %110 = "tt.reduce"(%109) <{axis = 1 : i32}> ({
2026-02-21T08:54:41.4663304Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:54:41.4663559Z           %126 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:54:41.4663806Z           tt.reduce.return %126 : f32
2026-02-21T08:54:41.4664048Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:54:41.4664315Z         %111 = arith.truncf %110 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:54:41.4664649Z         %112 = arith.extf %111 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:54:41.4664965Z         %113 = arith.cmpf ogt, %96, %112 : tensor<8xf32>
2026-02-21T08:54:41.4665239Z         %114 = arith.cmpf une, %96, %96 : tensor<8xf32>
2026-02-21T08:54:41.4665601Z         %115 = arith.ori %113, %114 : tensor<8xi1>
2026-02-21T08:54:41.4665888Z         %116 = arith.select %115, %96, %112 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:54:41.4666209Z         %117 = arith.subf %96, %116 : tensor<8xf32>
2026-02-21T08:54:41.4666624Z         %118 = tt.extern_elementwise %117 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:54:41.4667076Z         %119 = arith.mulf %105, %118 : tensor<8xf32>
2026-02-21T08:54:41.4667417Z         %120 = tt.expand_dims %116 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:54:41.4667767Z         %121 = tt.broadcast %120 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:54:41.4668087Z         %122 = arith.subf %109, %121 : tensor<8x512xf32>
2026-02-21T08:54:41.4668515Z         %123 = tt.extern_elementwise %122 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:54:41.4669035Z         %124 = "tt.reduce"(%123) <{axis = 1 : i32}> ({
2026-02-21T08:54:41.4669277Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:54:41.4669534Z           %126 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:54:41.4669792Z           tt.reduce.return %126 : f32
2026-02-21T08:54:41.4670025Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:54:41.4670279Z         %125 = arith.addf %119, %124 : tensor<8xf32>
2026-02-21T08:54:41.4670546Z         scf.yield %116, %125 : tensor<8xf32>, tensor<8xf32>
2026-02-21T08:54:41.4670869Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:54:41.4671247Z       %7 = tt.descriptor_load %0[%2, %c10240_i32] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T08:54:41.4671685Z       %8 = arith.extf %7 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:54:41.4671994Z       %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({
2026-02-21T08:54:41.4672234Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:54:41.4672496Z         %48 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:54:41.4672738Z         tt.reduce.return %48 : f32
2026-02-21T08:54:41.4673012Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:54:41.4673284Z       %10 = arith.truncf %9 : tensor<8xf32> to tensor<8xf16>
2026-02-21T08:54:41.4673612Z       %11 = arith.extf %10 : tensor<8xf16> to tensor<8xf32>
2026-02-21T08:54:41.4673900Z       %12 = arith.cmpf ogt, %6#0, %11 : tensor<8xf32>
2026-02-21T08:54:41.4674155Z       %13 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32>
2026-02-21T08:54:41.4674422Z       %14 = arith.ori %12, %13 : tensor<8xi1>
2026-02-21T08:54:41.4674683Z       %15 = arith.select %14, %6#0, %11 : tensor<8xi1>, tensor<8xf32>
2026-02-21T08:54:41.4674975Z       %16 = arith.subf %6#0, %15 : tensor<8xf32>
2026-02-21T08:54:41.4675362Z       %17 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T08:54:41.4675781Z       %18 = arith.mulf %6#1, %17 : tensor<8xf32>
2026-02-21T08:54:41.4676099Z       %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:54:41.4676418Z       %20 = tt.broadcast %19 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:54:41.4676719Z       %21 = arith.subf %8, %20 : tensor<8x512xf32>
2026-02-21T08:54:41.4677112Z       %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:54:41.4677534Z       %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({
2026-02-21T08:54:41.4677762Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:54:41.4678010Z         %48 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:54:41.4678265Z         tt.reduce.return %48 : f32
2026-02-21T08:54:41.4678489Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:54:41.4678748Z       %24 = arith.addf %18, %23 : tensor<8xf32>
2026-02-21T08:54:41.4678985Z       %c10240_i32_2 = arith.constant 10240 : i32
2026-02-21T08:54:41.4679259Z       %c2048_i32_3 = arith.constant 2048 : i32
2026-02-21T08:54:41.4679607Z       scf.for %arg3 = %c0_i32 to %c10240_i32_2 step %c2048_i32_3  : i32 {
2026-02-21T08:54:41.4679969Z         %48 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:54:41.4680301Z         %49 = tt.splat %arg3 : i32 -> tensor<512xi32>
2026-02-21T08:54:41.4680551Z         %50 = arith.addi %49, %48 : tensor<512xi32>
2026-02-21T08:54:41.4680872Z         %51 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:54:41.4681176Z         %52 = arith.muli %51, %cst : tensor<8x1xi32>
2026-02-21T08:54:41.4681500Z         %53 = tt.expand_dims %50 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:54:41.4681857Z         %54 = tt.broadcast %52 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:54:41.4682187Z         %55 = tt.broadcast %53 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:54:41.4682484Z         %56 = arith.addi %54, %55 : tensor<8x512xi32>
2026-02-21T08:54:41.4682836Z         %57 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4683176Z         %58 = tt.addptr %57, %56 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:54:41.4683511Z         %59 = tt.load %58 evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4683875Z         %60 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:54:41.4684191Z         %61 = arith.extf %59 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:54:41.4684511Z         %62 = tt.broadcast %60 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:54:41.4684810Z         %63 = arith.subf %61, %62 : tensor<8x512xf32>
2026-02-21T08:54:41.4685213Z         %64 = tt.extern_elementwise %63 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:54:41.4685684Z         %65 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:54:41.4686003Z         %66 = tt.broadcast %65 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:54:41.4686302Z         %67 = arith.divf %64, %66 : tensor<8x512xf32>
2026-02-21T08:54:41.4686595Z         %68 = arith.truncf %67 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:54:41.4686897Z         %69 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4687237Z         %70 = tt.addptr %69, %56 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:54:41.4687531Z         tt.store %70, %68 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4687801Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:54:41.4688033Z         %71 = arith.muli %c512_i32, %c1_i32 : i32
2026-02-21T08:54:41.4688291Z         %72 = arith.addi %arg3, %71 : i32
2026-02-21T08:54:41.4688595Z         %73 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:54:41.4688883Z         %74 = tt.splat %72 : i32 -> tensor<512xi32>
2026-02-21T08:54:41.4689152Z         %75 = arith.addi %74, %73 : tensor<512xi32>
2026-02-21T08:54:41.4689438Z         %76 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:54:41.4689760Z         %77 = arith.muli %76, %cst : tensor<8x1xi32>
2026-02-21T08:54:41.4690051Z         %78 = tt.expand_dims %75 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:54:41.4690401Z         %79 = tt.broadcast %77 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:54:41.4690728Z         %80 = tt.broadcast %78 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:54:41.4690999Z         %81 = arith.addi %79, %80 : tensor<8x512xi32>
2026-02-21T08:54:41.4691298Z         %82 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4691635Z         %83 = tt.addptr %82, %81 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:54:41.4691999Z         %84 = tt.load %83 evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4692337Z         %85 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:54:41.4692742Z         %86 = arith.extf %84 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:54:41.4693066Z         %87 = tt.broadcast %85 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:54:41.4693334Z         %88 = arith.subf %86, %87 : tensor<8x512xf32>
2026-02-21T08:54:41.4693770Z         %89 = tt.extern_elementwise %88 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:54:41.4694215Z         %90 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:54:41.4694571Z         %91 = tt.broadcast %90 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:54:41.4694866Z         %92 = arith.divf %89, %91 : tensor<8x512xf32>
2026-02-21T08:54:41.4695141Z         %93 = arith.truncf %92 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:54:41.4695476Z         %94 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4695841Z         %95 = tt.addptr %94, %81 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:54:41.4696165Z         tt.store %95, %93 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4696409Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:54:41.4696667Z         %96 = arith.muli %c512_i32, %c2_i32 : i32
2026-02-21T08:54:41.4696933Z         %97 = arith.addi %arg3, %96 : i32
2026-02-21T08:54:41.4697200Z         %98 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:54:41.4697518Z         %99 = tt.splat %97 : i32 -> tensor<512xi32>
2026-02-21T08:54:41.4697764Z         %100 = arith.addi %99, %98 : tensor<512xi32>
2026-02-21T08:54:41.4698076Z         %101 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:54:41.4698375Z         %102 = arith.muli %101, %cst : tensor<8x1xi32>
2026-02-21T08:54:41.4698708Z         %103 = tt.expand_dims %100 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:54:41.4699074Z         %104 = tt.broadcast %102 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:54:41.4699377Z         %105 = tt.broadcast %103 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:54:41.4699689Z         %106 = arith.addi %104, %105 : tensor<8x512xi32>
2026-02-21T08:54:41.4699969Z         %107 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4700310Z         %108 = tt.addptr %107, %106 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:54:41.4700656Z         %109 = tt.load %108 evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4701027Z         %110 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:54:41.4701384Z         %111 = arith.extf %109 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:54:41.4701721Z         %112 = tt.broadcast %110 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:54:41.4702030Z         %113 = arith.subf %111, %112 : tensor<8x512xf32>
2026-02-21T08:54:41.4702444Z         %114 = tt.extern_elementwise %113 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:54:41.4702943Z         %115 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:54:41.4703296Z         %116 = tt.broadcast %115 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:54:41.4703576Z         %117 = arith.divf %114, %116 : tensor<8x512xf32>
2026-02-21T08:54:41.4703878Z         %118 = arith.truncf %117 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:54:41.4704189Z         %119 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4704545Z         %120 = tt.addptr %119, %106 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:54:41.4704869Z         tt.store %120, %118 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4705119Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T08:54:41.4705376Z         %121 = arith.muli %c512_i32, %c3_i32 : i32
2026-02-21T08:54:41.4705610Z         %122 = arith.addi %arg3, %121 : i32
2026-02-21T08:54:41.4705980Z         %123 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:54:41.4706270Z         %124 = tt.splat %122 : i32 -> tensor<512xi32>
2026-02-21T08:54:41.4706547Z         %125 = arith.addi %124, %123 : tensor<512xi32>
2026-02-21T08:54:41.4706837Z         %126 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:54:41.4707163Z         %127 = arith.muli %126, %cst : tensor<8x1xi32>
2026-02-21T08:54:41.4707501Z         %128 = tt.expand_dims %125 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:54:41.4707835Z         %129 = tt.broadcast %127 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:54:41.4708169Z         %130 = tt.broadcast %128 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:54:41.4708449Z         %131 = arith.addi %129, %130 : tensor<8x512xi32>
2026-02-21T08:54:41.4708759Z         %132 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4709184Z         %133 = tt.addptr %132, %131 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:54:41.4709553Z         %134 = tt.load %133 evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4709953Z         %135 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:54:41.4710298Z         %136 = arith.extf %134 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:54:41.4710645Z         %137 = tt.broadcast %135 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:54:41.4710969Z         %138 = arith.subf %136, %137 : tensor<8x512xf32>
2026-02-21T08:54:41.4711404Z         %139 = tt.extern_elementwise %138 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:54:41.4711947Z         %140 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:54:41.4712289Z         %141 = tt.broadcast %140 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:54:41.4712616Z         %142 = arith.divf %139, %141 : tensor<8x512xf32>
2026-02-21T08:54:41.4712941Z         %143 = arith.truncf %142 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:54:41.4713272Z         %144 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4713645Z         %145 = tt.addptr %144, %131 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:54:41.4713964Z         tt.store %145, %143 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4714290Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:54:41.4714621Z       %25 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:54:41.4714976Z       %26 = tt.splat %c10240_i32_2 : i32 -> tensor<512xi32>
2026-02-21T08:54:41.4715267Z       %27 = arith.addi %26, %25 : tensor<512xi32>
2026-02-21T08:54:41.4715564Z       %28 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T08:54:41.4715905Z       %29 = arith.muli %28, %cst : tensor<8x1xi32>
2026-02-21T08:54:41.4716214Z       %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:54:41.4716587Z       %31 = tt.broadcast %29 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T08:54:41.4716899Z       %32 = tt.broadcast %30 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T08:54:41.4717211Z       %33 = arith.addi %31, %32 : tensor<8x512xi32>
2026-02-21T08:54:41.4717523Z       %34 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4717853Z       %35 = tt.addptr %34, %33 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:54:41.4718228Z       %36 = tt.load %35 evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4718579Z       %37 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:54:41.4718926Z       %38 = arith.extf %36 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T08:54:41.4719240Z       %39 = tt.broadcast %37 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:54:41.4763796Z       %40 = arith.subf %38, %39 : tensor<8x512xf32>
2026-02-21T08:54:41.4764220Z       %41 = tt.extern_elementwise %40 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:54:41.4764670Z       %42 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T08:54:41.4765009Z       %43 = tt.broadcast %42 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T08:54:41.4765275Z       %44 = arith.divf %41, %43 : tensor<8x512xf32>
2026-02-21T08:54:41.4765596Z       %45 = arith.truncf %44 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T08:54:41.4765925Z       %46 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4766231Z       %47 = tt.addptr %46, %33 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T08:54:41.4766544Z       tt.store %47, %45 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T08:54:41.4767002Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:54:41.4767330Z     tt.return
2026-02-21T08:54:41.4767503Z   }
2026-02-21T08:54:41.4767697Z }
2026-02-21T08:54:41.4767791Z 
2026-02-21T08:54:41.4767894Z {-#
2026-02-21T08:54:41.4768068Z   external_resources: {
2026-02-21T08:54:41.4768292Z     mlir_reproducer: {
2026-02-21T08:54:41.4772699Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:54:41.4777209Z       disable_threading: false,
2026-02-21T08:54:41.4777451Z       verify_each: true
2026-02-21T08:54:41.4777635Z     }
2026-02-21T08:54:41.4777827Z   }
2026-02-21T08:54:41.4777984Z #-}
2026-02-21T08:54:41.4778480Z /tmp/torchinductor_root/ng/cng5e4yetnouvd2hbraksannbxwtzpiyha37lrnoy6ervzlsattq.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:54:41.4779743Z /tmp/torchinductor_root/ng/cng5e4yetnouvd2hbraksannbxwtzpiyha37lrnoy6ervzlsattq.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:54:41.4780766Z [42s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:54:41.4781944Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 512], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'last'], maxnreg=32, num_sm_multiplier=64, num_stages=7, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[4, 4], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:54:41.4783057Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:54:41.4783352Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:54:49.5574703Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 10.9 configs/s
2026-02-21T08:54:49.5585870Z [50s] Adaptive compile timeout: 30s (90% percentile=11.0s, bounds=[30.0s, 30s])
2026-02-21T08:54:50.4239809Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1136.9 configs/s
2026-02-21T08:54:50.4892210Z [51s] Initial random population of 100, 5 starting points: 
2026-02-21T08:54:50.4894399Z error=6
2026-02-21T08:54:50.4894698Z timeout=1
2026-02-21T08:54:50.4894915Z ok=93
2026-02-21T08:54:50.4895118Z min=0.0553
2026-02-21T08:54:50.4895288Z mid=1.0046
2026-02-21T08:54:50.4895483Z max=53.8255
2026-02-21T08:54:50.4895703Z best={'block_sizes': [1, 16384],
2026-02-21T08:54:50.4896010Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:54:50.4896281Z  'load_eviction_policies': ['last', ''],
2026-02-21T08:54:50.4896540Z  'num_sm_multiplier': 8,
2026-02-21T08:54:50.4896738Z  'num_stages': 3,
2026-02-21T08:54:50.4896951Z  'num_warps': 1,
2026-02-21T08:54:50.4897150Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:54:50.4897413Z  'range_flattens': [False, None],
2026-02-21T08:54:50.4897706Z  'range_multi_buffers': [True, True],
2026-02-21T08:54:50.4897974Z  'range_num_stages': [1, 2],
2026-02-21T08:54:50.4898216Z  'range_unroll_factors': [0, 1],
2026-02-21T08:54:50.4898438Z  'range_warp_specializes': [True, None]}
2026-02-21T08:54:50.4907217Z [51s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:54:51.5351664Z [52s] Generation 1 starting: 79 neighbors, 5 active search path(s)
2026-02-21T08:55:03.6071947Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 11.0 configs/s
2026-02-21T08:55:08.5832776Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 17.0 configs/s
2026-02-21T08:55:15.0529043Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 175.1 configs/s
2026-02-21T08:55:15.3387370Z [76s] Generation 1 complete: 
2026-02-21T08:55:15.3389314Z ok=85
2026-02-21T08:55:15.3389555Z min=0.0512
2026-02-21T08:55:15.3389805Z mid=0.0696
2026-02-21T08:55:15.3394597Z max=0.3931
2026-02-21T08:55:15.3398147Z best={'block_sizes': [1, 16384],
2026-02-21T08:55:15.3400565Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:55:15.3400912Z  'load_eviction_policies': ['last', ''],
2026-02-21T08:55:15.3401204Z  'num_stages': 3,
2026-02-21T08:55:15.3401463Z  'num_warps': 1,
2026-02-21T08:55:15.3401920Z  'pid_type': 'flat',
2026-02-21T08:55:15.3402120Z  'range_flattens': [None, None],
2026-02-21T08:55:15.3402369Z  'range_multi_buffers': [None, True],
2026-02-21T08:55:15.3406071Z  'range_num_stages': [0, 2],
2026-02-21T08:55:15.3410762Z  'range_unroll_factors': [0, 1],
2026-02-21T08:55:15.3412893Z  'range_warp_specializes': [None, True]}
2026-02-21T08:55:15.3413267Z [76s] Fitting surrogate: 185 points, 185 targets
2026-02-21T08:55:16.1277215Z [77s] Generation 2 starting: 61 neighbors, 5 active search path(s)
2026-02-21T08:55:27.5110597Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 17.1 configs/s
2026-02-21T08:55:31.2267729Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 63/63 17.2 configs/s
2026-02-21T08:55:33.4591357Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 449.9 configs/s
2026-02-21T08:55:33.5918105Z [94s] Generation 2 complete: 
2026-02-21T08:55:33.5920079Z ok=66
2026-02-21T08:55:33.5920330Z min=0.0410
2026-02-21T08:55:33.5920560Z mid=0.0675
2026-02-21T08:55:33.5920756Z max=0.7946
2026-02-21T08:55:33.5920992Z best={'block_sizes': [1, 16384],
2026-02-21T08:55:33.5921320Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:55:33.5921731Z  'load_eviction_policies': ['', ''],
2026-02-21T08:55:33.5921957Z  'num_stages': 3,
2026-02-21T08:55:33.5922211Z  'num_warps': 2,
2026-02-21T08:55:33.5922454Z  'pid_type': 'flat',
2026-02-21T08:55:33.5922654Z  'range_flattens': [None, None],
2026-02-21T08:55:33.5922919Z  'range_multi_buffers': [None, True],
2026-02-21T08:55:33.5923148Z  'range_num_stages': [0, 2],
2026-02-21T08:55:33.5923377Z  'range_unroll_factors': [0, 1],
2026-02-21T08:55:33.5923600Z  'range_warp_specializes': [None, True]}
2026-02-21T08:55:33.5935663Z [94s] Fitting surrogate: 251 points, 251 targets
2026-02-21T08:55:34.4207816Z [95s] Generation 3 starting: 56 neighbors, 4 active search path(s)
2026-02-21T08:55:45.1062402Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 4.1 configs/s
2026-02-21T08:55:49.3252470Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 14.1 configs/s
2026-02-21T08:55:51.0009540Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 599.6 configs/s
2026-02-21T08:55:51.1128437Z [112s] Generation 3 complete: 
2026-02-21T08:55:51.1130134Z ok=61
2026-02-21T08:55:51.1130407Z min=0.0409
2026-02-21T08:55:51.1130666Z mid=0.0655
2026-02-21T08:55:51.1130872Z max=0.4711
2026-02-21T08:55:51.1131128Z best={'block_sizes': [1, 16384],
2026-02-21T08:55:51.1131423Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:55:51.1132134Z  'load_eviction_policies': ['', ''],
2026-02-21T08:55:51.1133936Z  'num_stages': 3,
2026-02-21T08:55:51.1134213Z  'num_warps': 4,
2026-02-21T08:55:51.1139814Z  'pid_type': 'flat',
2026-02-21T08:55:51.1144538Z  'range_flattens': [None, None],
2026-02-21T08:55:51.1144906Z  'range_multi_buffers': [None, True],
2026-02-21T08:55:51.1149299Z  'range_num_stages': [0, 2],
2026-02-21T08:55:51.1149663Z  'range_unroll_factors': [0, 1],
2026-02-21T08:55:51.1149928Z  'range_warp_specializes': [None, True]}
2026-02-21T08:55:51.1155372Z [112s] Fitting surrogate: 312 points, 312 targets
2026-02-21T08:55:51.8470920Z [112s] Generation 4 starting: 48 neighbors, 3 active search path(s)
2026-02-21T08:56:01.9910726Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 11.2 configs/s
2026-02-21T08:56:05.0315831Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 17.0 configs/s
2026-02-21T08:56:07.6408852Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 386.2 configs/s
2026-02-21T08:56:07.7917549Z [128s] Generation 4 complete: 
2026-02-21T08:56:07.7918575Z ok=52
2026-02-21T08:56:07.7918774Z min=0.0409
2026-02-21T08:56:07.7918946Z mid=0.0614
2026-02-21T08:56:07.7919134Z max=0.2724
2026-02-21T08:56:07.7919315Z best={'block_sizes': [1, 16384],
2026-02-21T08:56:07.7919584Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:56:07.7919837Z  'load_eviction_policies': ['', ''],
2026-02-21T08:56:07.7920092Z  'num_stages': 3,
2026-02-21T08:56:07.7920271Z  'num_warps': 4,
2026-02-21T08:56:07.7920479Z  'pid_type': 'flat',
2026-02-21T08:56:07.7920706Z  'range_flattens': [None, None],
2026-02-21T08:56:07.7920922Z  'range_multi_buffers': [None, True],
2026-02-21T08:56:07.7921174Z  'range_num_stages': [0, 1],
2026-02-21T08:56:07.7921383Z  'range_unroll_factors': [0, 1],
2026-02-21T08:56:07.7921893Z  'range_warp_specializes': [None, True]}
2026-02-21T08:56:07.7935451Z [128s] Fitting surrogate: 364 points, 364 targets
2026-02-21T08:56:08.2964802Z [129s] Generation 5 starting: 31 neighbors, 2 active search path(s)
2026-02-21T08:56:16.1704884Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 6.5 configs/s
2026-02-21T08:56:18.1100513Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 32/32 16.9 configs/s
2026-02-21T08:56:20.8652835Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 497.5 configs/s
2026-02-21T08:56:20.9839707Z [142s] Generation 5 complete: 
2026-02-21T08:56:20.9840097Z ok=34
2026-02-21T08:56:20.9840348Z min=0.0410
2026-02-21T08:56:20.9840560Z mid=0.0594
2026-02-21T08:56:20.9840782Z max=0.5181
2026-02-21T08:56:20.9840994Z best={'block_sizes': [1, 16384],
2026-02-21T08:56:20.9841306Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:56:20.9841682Z  'load_eviction_policies': ['', ''],
2026-02-21T08:56:20.9841907Z  'num_stages': 3,
2026-02-21T08:56:20.9842126Z  'num_warps': 4,
2026-02-21T08:56:20.9842315Z  'pid_type': 'flat',
2026-02-21T08:56:20.9842581Z  'range_flattens': [None, None],
2026-02-21T08:56:20.9842819Z  'range_multi_buffers': [None, True],
2026-02-21T08:56:20.9843076Z  'range_num_stages': [0, 1],
2026-02-21T08:56:20.9843277Z  'range_unroll_factors': [0, 1],
2026-02-21T08:56:20.9843534Z  'range_warp_specializes': [None, True]}
2026-02-21T08:56:20.9853015Z [142s] Fitting surrogate: 398 points, 398 targets
2026-02-21T08:56:21.4074037Z [142s] Generation 6 starting: 24 neighbors, 2 active search path(s)
2026-02-21T08:56:27.2540588Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26/26 1.5 configs/s
2026-02-21T08:56:28.8999835Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 26/26 16.2 configs/s
2026-02-21T08:56:30.3933190Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 670.4 configs/s
2026-02-21T08:56:30.4922587Z [151s] Generation 6 complete: 
2026-02-21T08:56:30.4926294Z ok=27
2026-02-21T08:56:30.4929513Z min=0.0409
2026-02-21T08:56:30.4933438Z mid=0.0594
2026-02-21T08:56:30.4937880Z max=0.1885
2026-02-21T08:56:30.4939470Z best={'block_sizes': [1, 16384],
2026-02-21T08:56:30.4939851Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:56:30.4943888Z  'load_eviction_policies': ['', ''],
2026-02-21T08:56:30.4944263Z  'num_stages': 4,
2026-02-21T08:56:30.4944483Z  'num_warps': 1,
2026-02-21T08:56:30.4949483Z  'pid_type': 'flat',
2026-02-21T08:56:30.4952825Z  'range_flattens': [None, None],
2026-02-21T08:56:30.4956034Z  'range_multi_buffers': [None, True],
2026-02-21T08:56:30.4957973Z  'range_num_stages': [0, 1],
2026-02-21T08:56:30.4958257Z  'range_unroll_factors': [0, 0],
2026-02-21T08:56:30.4958495Z  'range_warp_specializes': [None, True]}
2026-02-21T08:56:30.4958862Z [151s] Fitting surrogate: 425 points, 425 targets
2026-02-21T08:56:30.7775049Z [151s] Generation 7 starting: 11 neighbors, 1 active search path(s)
2026-02-21T08:56:36.5980817Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 0.8 configs/s
2026-02-21T08:56:37.3105771Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 12/12 18.0 configs/s
2026-02-21T08:56:38.3217529Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 986.5 configs/s
2026-02-21T08:56:38.3956749Z [159s] Generation 7 complete: 
2026-02-21T08:56:38.3960554Z ok=13
2026-02-21T08:56:38.3964479Z min=0.0409
2026-02-21T08:56:38.3966473Z mid=0.0409
2026-02-21T08:56:38.3966701Z max=0.3767
2026-02-21T08:56:38.3966885Z best={'block_sizes': [1, 16384],
2026-02-21T08:56:38.3967197Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:56:38.3967461Z  'load_eviction_policies': ['', ''],
2026-02-21T08:56:38.3967699Z  'num_stages': 4,
2026-02-21T08:56:38.3967963Z  'num_warps': 1,
2026-02-21T08:56:38.3972239Z  'pid_type': 'flat',
2026-02-21T08:56:38.3976795Z  'range_flattens': [None, None],
2026-02-21T08:56:38.3978578Z  'range_multi_buffers': [None, True],
2026-02-21T08:56:38.3978898Z  'range_num_stages': [0, 1],
2026-02-21T08:56:38.3979169Z  'range_unroll_factors': [0, 0],
2026-02-21T08:56:38.3979408Z  'range_warp_specializes': [None, True]}
2026-02-21T08:56:38.3979829Z [159s] Fitting surrogate: 438 points, 438 targets
2026-02-21T08:56:38.5687845Z [159s] Autotuning complete in 159.7s after searching 426 configs.
2026-02-21T08:56:38.5688303Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:56:38.5693132Z     @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', ''], num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:56:38.5694018Z 
2026-02-21T08:56:38.5694322Z [159s] Code of selected kernel: /tmp/torchinductor_root/v4/cv4c7vlts4ihrxwoqka3tpcvarlgtl5v757t2iold6n2j3k6uap6.py
2026-02-21T08:56:39.6588638Z WARNING:tritonbench.utils.triton_op:Completed input ID 82:
2026-02-21T08:56:39.6594747Z (M, N)
2026-02-21T08:56:39.6596881Z -------------
2026-02-21T08:56:39.6597124Z (4096, 10752)
2026-02-21T08:56:39.6597264Z 
2026-02-21T08:56:39.6603002Z  85%|████████▌ | 17/20 [47:37<09:13, 184.41s/it]
2026-02-21T08:56:39.6603002Z WARNING:tritonbench.utils.triton_op:Running input ID 87:
2026-02-21T08:56:39.6607247Z (M, N)
2026-02-21T08:56:39.6612236Z -------------
2026-02-21T08:56:39.6614213Z (4096, 11392)
2026-02-21T08:56:39.6614722Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T08:56:40.8571319Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T08:56:42.2133646Z INFO:tritonbench.utils.triton_op:Took 2.29ms to get benchmark function for torch_compile_softmax
2026-02-21T08:56:46.0290711Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:56:46.0295078Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:56:46.0296610Z               'dtype': 'torch.float16',
2026-02-21T08:56:46.0296926Z               'shape': (4096, 11392),
2026-02-21T08:56:46.0297156Z               'stride': (11392, 1)},),
2026-02-21T08:56:46.0297479Z   'kwargs': {}}
2026-02-21T08:56:46.0310362Z INFO:tritonbench.utils.triton_op:Took 2.25ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T08:56:46.2069041Z [0s] Autotune random seed: 2134816249
2026-02-21T08:56:46.3509883Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:57:26.1987076Z [39s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[4, 1], range_warp_specializes=[None, None])
2026-02-21T08:57:26.2005421Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.3 configs/s
2026-02-21T08:57:28.7032390Z module {
2026-02-21T08:57:28.7034597Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:57:28.7035158Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:57:28.7035401Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:57:28.7035665Z     %c148_i32 = arith.constant 148 : i32
2026-02-21T08:57:28.7035927Z     %cst = arith.constant dense<11392> : tensor<128x1xi32>
2026-02-21T08:57:28.7036260Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<128xf32>
2026-02-21T08:57:28.7036563Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<128xf32>
2026-02-21T08:57:28.7036854Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:57:28.7037110Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:57:28.7037718Z     %c11392_i32 = arith.constant 11392 : i32
2026-02-21T08:57:28.7038017Z     %c11392_i64 = arith.constant 11392 : i64
2026-02-21T08:57:28.7038244Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:57:28.7038639Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c11392_i32], [%c11392_i64, %c1_i64] : <f16>, <tensor<128x128xf16>>
2026-02-21T08:57:28.7039017Z     %1 = tt.get_program_id x : i32
2026-02-21T08:57:28.7039298Z     scf.for %arg2 = %1 to %c32_i32 step %c148_i32  : i32 {
2026-02-21T08:57:28.7039596Z       %2 = arith.muli %arg2, %c128_i32 : i32
2026-02-21T08:57:28.7039887Z       %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:57:28.7040232Z       %4 = tt.splat %2 : i32 -> tensor<128xi32>
2026-02-21T08:57:28.7040478Z       %5 = arith.addi %4, %3 : tensor<128xi32>
2026-02-21T08:57:28.7040750Z       %c11264_i32 = arith.constant 11264 : i32
2026-02-21T08:57:28.7041021Z       %c512_i32 = arith.constant 512 : i32
2026-02-21T08:57:28.7041513Z       %6:2 = scf.for %arg3 = %c0_i32 to %c11264_i32 step %c512_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<128xf32>, tensor<128xf32>)  : i32 {
2026-02-21T08:57:28.7042058Z         %55 = tt.splat %arg3 : i32 -> tensor<128xi32>
2026-02-21T08:57:28.7042321Z         %56 = arith.addi %55, %3 : tensor<128xi32>
2026-02-21T08:57:28.7042724Z         %57 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:57:28.7043070Z         %58 = arith.muli %57, %cst : tensor<128x1xi32>
2026-02-21T08:57:28.7043381Z         %59 = tt.expand_dims %56 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:57:28.7043751Z         %60 = tt.broadcast %58 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7044076Z         %61 = tt.broadcast %59 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7044404Z         %62 = arith.addi %60, %61 : tensor<128x128xi32>
2026-02-21T08:57:28.7044737Z         %63 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7045094Z         %64 = tt.addptr %63, %62 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:57:28.7045489Z         %65 = tt.load %64 evictionPolicy = evict_first : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7045841Z         %66 = arith.extf %65 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:57:28.7046159Z         %67 = "tt.reduce"(%66) <{axis = 1 : i32}> ({
2026-02-21T08:57:28.7046403Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:57:28.7046674Z           %173 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:57:28.7046945Z           tt.reduce.return %173 : f32
2026-02-21T08:57:28.7047188Z         }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:57:28.7047501Z         %68 = arith.truncf %67 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:57:28.7047808Z         %69 = arith.extf %68 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:57:28.7048130Z         %70 = arith.cmpf ogt, %arg4, %69 : tensor<128xf32>
2026-02-21T08:57:28.7048417Z         %71 = arith.cmpf une, %arg4, %arg4 : tensor<128xf32>
2026-02-21T08:57:28.7048870Z         %72 = arith.ori %70, %71 : tensor<128xi1>
2026-02-21T08:57:28.7049157Z         %73 = arith.select %72, %arg4, %69 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:57:28.7049460Z         %74 = arith.subf %arg4, %73 : tensor<128xf32>
2026-02-21T08:57:28.7049926Z         %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:57:28.7050337Z         %76 = arith.mulf %arg5, %75 : tensor<128xf32>
2026-02-21T08:57:28.7050660Z         %77 = tt.expand_dims %73 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:57:28.7050995Z         %78 = tt.broadcast %77 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:57:28.7051311Z         %79 = arith.subf %66, %78 : tensor<128x128xf32>
2026-02-21T08:57:28.7051904Z         %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:57:28.7052326Z         %81 = "tt.reduce"(%80) <{axis = 1 : i32}> ({
2026-02-21T08:57:28.7052595Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:57:28.7052821Z           %173 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:57:28.7053088Z           tt.reduce.return %173 : f32
2026-02-21T08:57:28.7053320Z         }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:57:28.7053590Z         %82 = arith.addf %76, %81 : tensor<128xf32>
2026-02-21T08:57:28.7053854Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:57:28.7054101Z         %83 = arith.muli %c128_i32, %c1_i32 : i32
2026-02-21T08:57:28.7054363Z         %84 = arith.addi %arg3, %83 : i32
2026-02-21T08:57:28.7054590Z         %85 = tt.splat %84 : i32 -> tensor<128xi32>
2026-02-21T08:57:28.7054856Z         %86 = arith.addi %85, %3 : tensor<128xi32>
2026-02-21T08:57:28.7055139Z         %87 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:57:28.7055476Z         %88 = arith.muli %87, %cst : tensor<128x1xi32>
2026-02-21T08:57:28.7055793Z         %89 = tt.expand_dims %86 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:57:28.7056118Z         %90 = tt.broadcast %88 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7056447Z         %91 = tt.broadcast %89 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7056729Z         %92 = arith.addi %90, %91 : tensor<128x128xi32>
2026-02-21T08:57:28.7057036Z         %93 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7057359Z         %94 = tt.addptr %93, %92 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:57:28.7057729Z         %95 = tt.load %94 evictionPolicy = evict_first : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7058088Z         %96 = arith.extf %95 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:57:28.7058361Z         %97 = "tt.reduce"(%96) <{axis = 1 : i32}> ({
2026-02-21T08:57:28.7058653Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:57:28.7058883Z           %173 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:57:28.7059175Z           tt.reduce.return %173 : f32
2026-02-21T08:57:28.7059374Z         }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:57:28.7059670Z         %98 = arith.truncf %97 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:57:28.7059985Z         %99 = arith.extf %98 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:57:28.7060257Z         %100 = arith.cmpf ogt, %73, %99 : tensor<128xf32>
2026-02-21T08:57:28.7060543Z         %101 = arith.cmpf une, %73, %73 : tensor<128xf32>
2026-02-21T08:57:28.7060791Z         %102 = arith.ori %100, %101 : tensor<128xi1>
2026-02-21T08:57:28.7061100Z         %103 = arith.select %102, %73, %99 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:57:28.7061387Z         %104 = arith.subf %73, %103 : tensor<128xf32>
2026-02-21T08:57:28.7061852Z         %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:57:28.7062368Z         %106 = arith.mulf %82, %105 : tensor<128xf32>
2026-02-21T08:57:28.7062641Z         %107 = tt.expand_dims %103 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:57:28.7063016Z         %108 = tt.broadcast %107 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:57:28.7063313Z         %109 = arith.subf %96, %108 : tensor<128x128xf32>
2026-02-21T08:57:28.7063765Z         %110 = tt.extern_elementwise %109 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:57:28.7064216Z         %111 = "tt.reduce"(%110) <{axis = 1 : i32}> ({
2026-02-21T08:57:28.7064459Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:57:28.7064708Z           %173 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:57:28.7064938Z           tt.reduce.return %173 : f32
2026-02-21T08:57:28.7065196Z         }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:57:28.7065625Z         %112 = arith.addf %106, %111 : tensor<128xf32>
2026-02-21T08:57:28.7065896Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:57:28.7066127Z         %113 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T08:57:28.7066390Z         %114 = arith.addi %arg3, %113 : i32
2026-02-21T08:57:28.7066627Z         %115 = tt.splat %114 : i32 -> tensor<128xi32>
2026-02-21T08:57:28.7066904Z         %116 = arith.addi %115, %3 : tensor<128xi32>
2026-02-21T08:57:28.7067199Z         %117 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:57:28.7067539Z         %118 = arith.muli %117, %cst : tensor<128x1xi32>
2026-02-21T08:57:28.7067842Z         %119 = tt.expand_dims %116 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:57:28.7068211Z         %120 = tt.broadcast %118 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7068557Z         %121 = tt.broadcast %119 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7068849Z         %122 = arith.addi %120, %121 : tensor<128x128xi32>
2026-02-21T08:57:28.7069161Z         %123 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7069495Z         %124 = tt.addptr %123, %122 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:57:28.7069883Z         %125 = tt.load %124 evictionPolicy = evict_first : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7070249Z         %126 = arith.extf %125 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:57:28.7070528Z         %127 = "tt.reduce"(%126) <{axis = 1 : i32}> ({
2026-02-21T08:57:28.7070789Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:57:28.7071013Z           %173 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:57:28.7071271Z           tt.reduce.return %173 : f32
2026-02-21T08:57:28.7071498Z         }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:57:28.7071829Z         %128 = arith.truncf %127 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:57:28.7072158Z         %129 = arith.extf %128 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:57:28.7072443Z         %130 = arith.cmpf ogt, %103, %129 : tensor<128xf32>
2026-02-21T08:57:28.7072735Z         %131 = arith.cmpf une, %103, %103 : tensor<128xf32>
2026-02-21T08:57:28.7072988Z         %132 = arith.ori %130, %131 : tensor<128xi1>
2026-02-21T08:57:28.7073306Z         %133 = arith.select %132, %103, %129 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:57:28.7073602Z         %134 = arith.subf %103, %133 : tensor<128xf32>
2026-02-21T08:57:28.7074043Z         %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:57:28.7074472Z         %136 = arith.mulf %112, %135 : tensor<128xf32>
2026-02-21T08:57:28.7074774Z         %137 = tt.expand_dims %133 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:57:28.7075139Z         %138 = tt.broadcast %137 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:57:28.7075431Z         %139 = arith.subf %126, %138 : tensor<128x128xf32>
2026-02-21T08:57:28.7075946Z         %140 = tt.extern_elementwise %139 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:57:28.7076385Z         %141 = "tt.reduce"(%140) <{axis = 1 : i32}> ({
2026-02-21T08:57:28.7076619Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:57:28.7076874Z           %173 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:57:28.7077105Z           tt.reduce.return %173 : f32
2026-02-21T08:57:28.7077363Z         }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:57:28.7077605Z         %142 = arith.addf %136, %141 : tensor<128xf32>
2026-02-21T08:57:28.7077870Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T08:57:28.7078100Z         %143 = arith.muli %c128_i32, %c3_i32 : i32
2026-02-21T08:57:28.7078389Z         %144 = arith.addi %arg3, %143 : i32
2026-02-21T08:57:28.7078657Z         %145 = tt.splat %144 : i32 -> tensor<128xi32>
2026-02-21T08:57:28.7078994Z         %146 = arith.addi %145, %3 : tensor<128xi32>
2026-02-21T08:57:28.7079327Z         %147 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:57:28.7079645Z         %148 = arith.muli %147, %cst : tensor<128x1xi32>
2026-02-21T08:57:28.7079980Z         %149 = tt.expand_dims %146 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:57:28.7080320Z         %150 = tt.broadcast %148 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7080665Z         %151 = tt.broadcast %149 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7080984Z         %152 = arith.addi %150, %151 : tensor<128x128xi32>
2026-02-21T08:57:28.7081271Z         %153 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7081664Z         %154 = tt.addptr %153, %152 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:57:28.7082027Z         %155 = tt.load %154 evictionPolicy = evict_first : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7082400Z         %156 = arith.extf %155 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:57:28.7082707Z         %157 = "tt.reduce"(%156) <{axis = 1 : i32}> ({
2026-02-21T08:57:28.7082942Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:57:28.7083167Z           %173 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T08:57:28.7083401Z           tt.reduce.return %173 : f32
2026-02-21T08:57:28.7083656Z         }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:57:28.7083921Z         %158 = arith.truncf %157 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:57:28.7084262Z         %159 = arith.extf %158 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:57:28.7084551Z         %160 = arith.cmpf ogt, %133, %159 : tensor<128xf32>
2026-02-21T08:57:28.7084853Z         %161 = arith.cmpf une, %133, %133 : tensor<128xf32>
2026-02-21T08:57:28.7085151Z         %162 = arith.ori %160, %161 : tensor<128xi1>
2026-02-21T08:57:28.7085448Z         %163 = arith.select %162, %133, %159 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:57:28.7085780Z         %164 = arith.subf %133, %163 : tensor<128xf32>
2026-02-21T08:57:28.7086209Z         %165 = tt.extern_elementwise %164 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:57:28.7086662Z         %166 = arith.mulf %142, %165 : tensor<128xf32>
2026-02-21T08:57:28.7087010Z         %167 = tt.expand_dims %163 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:57:28.7087371Z         %168 = tt.broadcast %167 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:57:28.7087709Z         %169 = arith.subf %156, %168 : tensor<128x128xf32>
2026-02-21T08:57:28.7088156Z         %170 = tt.extern_elementwise %169 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:57:28.7088619Z         %171 = "tt.reduce"(%170) <{axis = 1 : i32}> ({
2026-02-21T08:57:28.7088863Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:57:28.7089143Z           %173 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:57:28.7089482Z           tt.reduce.return %173 : f32
2026-02-21T08:57:28.7089718Z         }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:57:28.7090000Z         %172 = arith.addf %166, %171 : tensor<128xf32>
2026-02-21T08:57:28.7090276Z         scf.yield %163, %172 : tensor<128xf32>, tensor<128xf32>
2026-02-21T08:57:28.7090642Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:57:28.7090974Z       %7 = tt.splat %c11264_i32 : i32 -> tensor<128xi32>
2026-02-21T08:57:28.7091267Z       %8 = arith.addi %7, %3 : tensor<128xi32>
2026-02-21T08:57:28.7091637Z       %9 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:57:28.7091954Z       %10 = arith.muli %9, %cst : tensor<128x1xi32>
2026-02-21T08:57:28.7092297Z       %11 = tt.expand_dims %8 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:57:28.7092695Z       %12 = tt.broadcast %10 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7093056Z       %13 = tt.broadcast %11 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7093366Z       %14 = arith.addi %12, %13 : tensor<128x128xi32>
2026-02-21T08:57:28.7093674Z       %15 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7094029Z       %16 = tt.addptr %15, %14 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:57:28.7094372Z       %17 = tt.load %16 evictionPolicy = evict_first : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7094741Z       %18 = arith.extf %17 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:57:28.7095016Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:57:28.7095277Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:57:28.7095504Z         %55 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T08:57:28.7095771Z         tt.reduce.return %55 : f32
2026-02-21T08:57:28.7096033Z       }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:57:28.7096302Z       %20 = arith.truncf %19 : tensor<128xf32> to tensor<128xf16>
2026-02-21T08:57:28.7096628Z       %21 = arith.extf %20 : tensor<128xf16> to tensor<128xf32>
2026-02-21T08:57:28.7096898Z       %22 = arith.cmpf ogt, %6#0, %21 : tensor<128xf32>
2026-02-21T08:57:28.7097182Z       %23 = arith.cmpf une, %6#0, %6#0 : tensor<128xf32>
2026-02-21T08:57:28.7097424Z       %24 = arith.ori %22, %23 : tensor<128xi1>
2026-02-21T08:57:28.7097725Z       %25 = arith.select %24, %6#0, %21 : tensor<128xi1>, tensor<128xf32>
2026-02-21T08:57:28.7098031Z       %26 = arith.subf %6#0, %25 : tensor<128xf32>
2026-02-21T08:57:28.7098435Z       %27 = tt.extern_elementwise %26 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T08:57:28.7098862Z       %28 = arith.mulf %6#1, %27 : tensor<128xf32>
2026-02-21T08:57:28.7099151Z       %29 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:57:28.7099520Z       %30 = tt.broadcast %29 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:57:28.7099823Z       %31 = arith.subf %18, %30 : tensor<128x128xf32>
2026-02-21T08:57:28.7100232Z       %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:57:28.7100660Z       %33 = "tt.reduce"(%32) <{axis = 1 : i32}> ({
2026-02-21T08:57:28.7100878Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T08:57:28.7101106Z         %55 = arith.addf %arg3, %arg4 : f32
2026-02-21T08:57:28.7101332Z         tt.reduce.return %55 : f32
2026-02-21T08:57:28.7101615Z       }) : (tensor<128x128xf32>) -> tensor<128xf32>
2026-02-21T08:57:28.7101880Z       %34 = arith.addf %28, %33 : tensor<128xf32>
2026-02-21T08:57:28.7102124Z       %c11264_i32_2 = arith.constant 11264 : i32
2026-02-21T08:57:28.7102360Z       %c512_i32_3 = arith.constant 512 : i32
2026-02-21T08:57:28.7102630Z       scf.for %arg3 = %c0_i32 to %c11264_i32_2 step %c512_i32_3  : i32 {
2026-02-21T08:57:28.7102947Z         %55 = tt.splat %arg3 : i32 -> tensor<128xi32>
2026-02-21T08:57:28.7103255Z         %56 = arith.addi %55, %3 : tensor<128xi32>
2026-02-21T08:57:28.7103619Z         %57 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<128x128xf16>> -> tensor<128x128xf16>
2026-02-21T08:57:28.7104039Z         %58 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:57:28.7104370Z         %59 = arith.extf %57 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:57:28.7104713Z         %60 = tt.broadcast %58 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:57:28.7105003Z         %61 = arith.subf %59, %60 : tensor<128x128xf32>
2026-02-21T08:57:28.7105442Z         %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:57:28.7105899Z         %63 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:57:28.7106323Z         %64 = tt.broadcast %63 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:57:28.7106632Z         %65 = arith.divf %62, %64 : tensor<128x128xf32>
2026-02-21T08:57:28.7106905Z         %66 = arith.truncf %65 : tensor<128x128xf32> to tensor<128x128xf16>
2026-02-21T08:57:28.7107255Z         %67 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:57:28.7107556Z         %68 = arith.muli %67, %cst : tensor<128x1xi32>
2026-02-21T08:57:28.7107879Z         %69 = tt.expand_dims %56 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:57:28.7108232Z         %70 = tt.broadcast %68 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7108533Z         %71 = tt.broadcast %69 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7108838Z         %72 = arith.addi %70, %71 : tensor<128x128xi32>
2026-02-21T08:57:28.7109119Z         %73 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7109473Z         %74 = tt.addptr %73, %72 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:57:28.7109777Z         tt.store %74, %66 : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7110054Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:57:28.7110313Z         %75 = arith.muli %c128_i32, %c1_i32 : i32
2026-02-21T08:57:28.7110548Z         %76 = arith.addi %arg3, %75 : i32
2026-02-21T08:57:28.7110807Z         %77 = tt.splat %76 : i32 -> tensor<128xi32>
2026-02-21T08:57:28.7111044Z         %78 = arith.addi %77, %3 : tensor<128xi32>
2026-02-21T08:57:28.7111403Z         %79 = tt.descriptor_load %0[%2, %76] : !tt.tensordesc<tensor<128x128xf16>> -> tensor<128x128xf16>
2026-02-21T08:57:28.7111818Z         %80 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:57:28.7112177Z         %81 = arith.extf %79 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:57:28.7112508Z         %82 = tt.broadcast %80 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:57:28.7112791Z         %83 = arith.subf %81, %82 : tensor<128x128xf32>
2026-02-21T08:57:28.7113236Z         %84 = tt.extern_elementwise %83 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:57:28.7113697Z         %85 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:57:28.7114058Z         %86 = tt.broadcast %85 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:57:28.7114372Z         %87 = arith.divf %84, %86 : tensor<128x128xf32>
2026-02-21T08:57:28.7114649Z         %88 = arith.truncf %87 : tensor<128x128xf32> to tensor<128x128xf16>
2026-02-21T08:57:28.7115001Z         %89 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:57:28.7115302Z         %90 = arith.muli %89, %cst : tensor<128x1xi32>
2026-02-21T08:57:28.7115619Z         %91 = tt.expand_dims %78 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:57:28.7115946Z         %92 = tt.broadcast %90 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7116327Z         %93 = tt.broadcast %91 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7116628Z         %94 = arith.addi %92, %93 : tensor<128x128xi32>
2026-02-21T08:57:28.7116897Z         %95 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7117222Z         %96 = tt.addptr %95, %94 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:57:28.7117523Z         tt.store %96, %88 : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7117799Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:57:28.7118028Z         %97 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T08:57:28.7118286Z         %98 = arith.addi %arg3, %97 : i32
2026-02-21T08:57:28.7118543Z         %99 = tt.splat %98 : i32 -> tensor<128xi32>
2026-02-21T08:57:28.7118792Z         %100 = arith.addi %99, %3 : tensor<128xi32>
2026-02-21T08:57:28.7119211Z         %101 = tt.descriptor_load %0[%2, %98] : !tt.tensordesc<tensor<128x128xf16>> -> tensor<128x128xf16>
2026-02-21T08:57:28.7119608Z         %102 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:57:28.7119971Z         %103 = arith.extf %101 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:57:28.7120286Z         %104 = tt.broadcast %102 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:57:28.7120605Z         %105 = arith.subf %103, %104 : tensor<128x128xf32>
2026-02-21T08:57:28.7121059Z         %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:57:28.7121562Z         %107 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:57:28.7121924Z         %108 = tt.broadcast %107 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:57:28.7122213Z         %109 = arith.divf %106, %108 : tensor<128x128xf32>
2026-02-21T08:57:28.7122533Z         %110 = arith.truncf %109 : tensor<128x128xf32> to tensor<128x128xf16>
2026-02-21T08:57:28.7122899Z         %111 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:57:28.7123210Z         %112 = arith.muli %111, %cst : tensor<128x1xi32>
2026-02-21T08:57:28.7123543Z         %113 = tt.expand_dims %100 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:57:28.7123879Z         %114 = tt.broadcast %112 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7124214Z         %115 = tt.broadcast %113 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7124497Z         %116 = arith.addi %114, %115 : tensor<128x128xi32>
2026-02-21T08:57:28.7124812Z         %117 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7125174Z         %118 = tt.addptr %117, %116 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:57:28.7125488Z         tt.store %118, %110 : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7125764Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T08:57:28.7125996Z         %119 = arith.muli %c128_i32, %c3_i32 : i32
2026-02-21T08:57:28.7126262Z         %120 = arith.addi %arg3, %119 : i32
2026-02-21T08:57:28.7126502Z         %121 = tt.splat %120 : i32 -> tensor<128xi32>
2026-02-21T08:57:28.7126773Z         %122 = arith.addi %121, %3 : tensor<128xi32>
2026-02-21T08:57:28.7127139Z         %123 = tt.descriptor_load %0[%2, %120] : !tt.tensordesc<tensor<128x128xf16>> -> tensor<128x128xf16>
2026-02-21T08:57:28.7127546Z         %124 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:57:28.7127926Z         %125 = arith.extf %123 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:57:28.7128253Z         %126 = tt.broadcast %124 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:57:28.7128582Z         %127 = arith.subf %125, %126 : tensor<128x128xf32>
2026-02-21T08:57:28.7129059Z         %128 = tt.extern_elementwise %127 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:57:28.7129625Z         %129 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:57:28.7130012Z         %130 = tt.broadcast %129 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:57:28.7130315Z         %131 = arith.divf %128, %130 : tensor<128x128xf32>
2026-02-21T08:57:28.7130650Z         %132 = arith.truncf %131 : tensor<128x128xf32> to tensor<128x128xf16>
2026-02-21T08:57:28.7131031Z         %133 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:57:28.7131320Z         %134 = arith.muli %133, %cst : tensor<128x1xi32>
2026-02-21T08:57:28.7131692Z         %135 = tt.expand_dims %122 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:57:28.7132048Z         %136 = tt.broadcast %134 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7132397Z         %137 = tt.broadcast %135 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7132752Z         %138 = arith.addi %136, %137 : tensor<128x128xi32>
2026-02-21T08:57:28.7133081Z         %139 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7133452Z         %140 = tt.addptr %139, %138 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:57:28.7133778Z         tt.store %140, %132 : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7134122Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:57:28.7134445Z       %35 = tt.splat %c11264_i32_2 : i32 -> tensor<128xi32>
2026-02-21T08:57:28.7134749Z       %36 = arith.addi %35, %3 : tensor<128xi32>
2026-02-21T08:57:28.7135094Z       %37 = tt.descriptor_load %0[%2, %c11264_i32_2] : !tt.tensordesc<tensor<128x128xf16>> -> tensor<128x128xf16>
2026-02-21T08:57:28.7135521Z       %38 = tt.expand_dims %25 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:57:28.7135883Z       %39 = arith.extf %37 : tensor<128x128xf16> to tensor<128x128xf32>
2026-02-21T08:57:28.7136191Z       %40 = tt.broadcast %38 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:57:28.7136499Z       %41 = arith.subf %39, %40 : tensor<128x128xf32>
2026-02-21T08:57:28.7136904Z       %42 = tt.extern_elementwise %41 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x128xf32>) -> tensor<128x128xf32>
2026-02-21T08:57:28.7137365Z       %43 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T08:57:28.7137726Z       %44 = tt.broadcast %43 : tensor<128x1xf32> -> tensor<128x128xf32>
2026-02-21T08:57:28.7138000Z       %45 = arith.divf %42, %44 : tensor<128x128xf32>
2026-02-21T08:57:28.7138302Z       %46 = arith.truncf %45 : tensor<128x128xf32> to tensor<128x128xf16>
2026-02-21T08:57:28.7138620Z       %47 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:57:28.7138939Z       %48 = arith.muli %47, %cst : tensor<128x1xi32>
2026-02-21T08:57:28.7139229Z       %49 = tt.expand_dims %36 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T08:57:28.7139570Z       %50 = tt.broadcast %48 : tensor<128x1xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7139895Z       %51 = tt.broadcast %49 : tensor<1x128xi32> -> tensor<128x128xi32>
2026-02-21T08:57:28.7140166Z       %52 = arith.addi %50, %51 : tensor<128x128xi32>
2026-02-21T08:57:28.7140466Z       %53 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7140785Z       %54 = tt.addptr %53, %52 : tensor<128x128x!tt.ptr<f16>>, tensor<128x128xi32>
2026-02-21T08:57:28.7141108Z       tt.store %54, %46 : tensor<128x128x!tt.ptr<f16>>
2026-02-21T08:57:28.7141445Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:57:28.7141765Z     tt.return
2026-02-21T08:57:28.7141967Z   }
2026-02-21T08:57:28.7142132Z }
2026-02-21T08:57:28.7142222Z 
2026-02-21T08:57:28.7142324Z {-#
2026-02-21T08:57:28.7142496Z   external_resources: {
2026-02-21T08:57:28.7142725Z     mlir_reproducer: {
2026-02-21T08:57:28.7147112Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:57:28.7151745Z       disable_threading: false,
2026-02-21T08:57:28.7152038Z       verify_each: true
2026-02-21T08:57:28.7152247Z     }
2026-02-21T08:57:28.7152468Z   }
2026-02-21T08:57:28.7152662Z #-}
2026-02-21T08:57:28.7153271Z /tmp/torchinductor_root/yi/cyilusxyo4s4jj75drhhgdbymmwkq5c346veo2nlrpr3gu3qs3ra.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:57:28.7154675Z /tmp/torchinductor_root/yi/cyilusxyo4s4jj75drhhgdbymmwkq5c346veo2nlrpr3gu3qs3ra.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:57:28.7155784Z [42s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:57:28.7157035Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 128], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', ''], num_sm_multiplier=1, num_stages=7, num_warps=32, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[1, 2], range_unroll_factors=[1, 4], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:57:28.7158130Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:57:28.7158521Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:57:34.3836938Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 12.3 configs/s
2026-02-21T08:57:34.3847264Z [48s] Adaptive compile timeout: 30s (90% percentile=11.0s, bounds=[30.0s, 30s])
2026-02-21T08:57:35.2905637Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1086.0 configs/s
2026-02-21T08:57:35.3583346Z [49s] Initial random population of 100, 5 starting points: 
2026-02-21T08:57:35.3587809Z error=6
2026-02-21T08:57:35.3592145Z timeout=1
2026-02-21T08:57:35.3593642Z ok=93
2026-02-21T08:57:35.3593954Z min=0.0615
2026-02-21T08:57:35.3594201Z mid=1.0433
2026-02-21T08:57:35.3594425Z max=58.3782
2026-02-21T08:57:35.3594683Z best={'block_sizes': [1, 16384],
2026-02-21T08:57:35.3595003Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:57:35.3595747Z  'load_eviction_policies': ['last', ''],
2026-02-21T08:57:35.3595984Z  'num_sm_multiplier': 8,
2026-02-21T08:57:35.3596220Z  'num_stages': 3,
2026-02-21T08:57:35.3596401Z  'num_warps': 1,
2026-02-21T08:57:35.3596632Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:57:35.3596883Z  'range_flattens': [False, None],
2026-02-21T08:57:35.3597139Z  'range_multi_buffers': [True, True],
2026-02-21T08:57:35.3597411Z  'range_num_stages': [1, 2],
2026-02-21T08:57:35.3597623Z  'range_unroll_factors': [0, 1],
2026-02-21T08:57:35.3597875Z  'range_warp_specializes': [True, None]}
2026-02-21T08:57:35.3603302Z [49s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:57:36.4409307Z [50s] Generation 1 starting: 81 neighbors, 5 active search path(s)
2026-02-21T08:57:49.5170075Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 19.1 configs/s
2026-02-21T08:57:54.5569740Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 17.0 configs/s
2026-02-21T08:58:02.3473442Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 129.6 configs/s
2026-02-21T08:58:02.7252096Z [76s] Generation 1 complete: 
2026-02-21T08:58:02.7254108Z ok=87
2026-02-21T08:58:02.7254377Z min=0.0574
2026-02-21T08:58:02.7254602Z mid=0.0737
2026-02-21T08:58:02.7254780Z max=0.4055
2026-02-21T08:58:02.7254998Z best={'block_sizes': [1, 16384],
2026-02-21T08:58:02.7255290Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:58:02.7255610Z  'load_eviction_policies': ['last', ''],
2026-02-21T08:58:02.7255836Z  'num_stages': 3,
2026-02-21T08:58:02.7256050Z  'num_warps': 1,
2026-02-21T08:58:02.7256247Z  'pid_type': 'flat',
2026-02-21T08:58:02.7256496Z  'range_flattens': [None, None],
2026-02-21T08:58:02.7256727Z  'range_multi_buffers': [None, True],
2026-02-21T08:58:02.7256996Z  'range_num_stages': [0, 2],
2026-02-21T08:58:02.7257256Z  'range_unroll_factors': [0, 1],
2026-02-21T08:58:02.7257534Z  'range_warp_specializes': [None, True]}
2026-02-21T08:58:02.7268793Z [76s] Fitting surrogate: 187 points, 187 targets
2026-02-21T08:58:03.6574084Z [77s] Generation 2 starting: 64 neighbors, 5 active search path(s)
2026-02-21T08:58:18.5251230Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 1.5 configs/s
2026-02-21T08:58:22.5372963Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 17.2 configs/s
2026-02-21T08:58:28.3390764Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 174.1 configs/s
2026-02-21T08:58:28.6258539Z [102s] Generation 2 complete: 
2026-02-21T08:58:28.6262638Z ok=70
2026-02-21T08:58:28.6267845Z min=0.0573
2026-02-21T08:58:28.6269315Z mid=0.0737
2026-02-21T08:58:28.6275115Z max=1.2350
2026-02-21T08:58:28.6279638Z best={'block_sizes': [1, 16384],
2026-02-21T08:58:28.6283615Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:58:28.6284054Z  'load_eviction_policies': ['last', ''],
2026-02-21T08:58:28.6284321Z  'num_stages': 3,
2026-02-21T08:58:28.6288729Z  'num_warps': 1,
2026-02-21T08:58:28.6293337Z  'pid_type': 'flat',
2026-02-21T08:58:28.6298378Z  'range_flattens': [None, None],
2026-02-21T08:58:28.6303255Z  'range_multi_buffers': [None, True],
2026-02-21T08:58:28.6307914Z  'range_num_stages': [0, 2],
2026-02-21T08:58:28.6309449Z  'range_unroll_factors': [0, 1],
2026-02-21T08:58:28.6309721Z  'range_warp_specializes': [None, True]}
2026-02-21T08:58:28.6310009Z [102s] Fitting surrogate: 257 points, 257 targets
2026-02-21T08:58:29.5177074Z [103s] Generation 3 starting: 65 neighbors, 5 active search path(s)
2026-02-21T08:59:08.0859624Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66/66 0.1 configs/s
2026-02-21T08:59:11.9543532Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 66/66 17.3 configs/s
2026-02-21T08:59:16.7408095Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 210.8 configs/s
2026-02-21T08:59:16.9991295Z [150s] Generation 3 complete: 
2026-02-21T08:59:16.9992770Z ok=70
2026-02-21T08:59:16.9992945Z min=0.0450
2026-02-21T08:59:16.9993148Z mid=0.0614
2026-02-21T08:59:16.9993313Z max=0.2437
2026-02-21T08:59:16.9993526Z best={'block_sizes': [1, 16384],
2026-02-21T08:59:16.9993785Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:59:16.9994081Z  'load_eviction_policies': ['', ''],
2026-02-21T08:59:16.9994326Z  'num_stages': 4,
2026-02-21T08:59:16.9994509Z  'num_warps': 8,
2026-02-21T08:59:16.9994716Z  'pid_type': 'flat',
2026-02-21T08:59:16.9994916Z  'range_flattens': [None, None],
2026-02-21T08:59:16.9995155Z  'range_multi_buffers': [None, True],
2026-02-21T08:59:16.9995374Z  'range_num_stages': [0, 0],
2026-02-21T08:59:16.9995612Z  'range_unroll_factors': [0, 0],
2026-02-21T08:59:17.0009065Z  'range_warp_specializes': [None, True]}
2026-02-21T08:59:17.0009509Z [150s] Fitting surrogate: 327 points, 327 targets
2026-02-21T08:59:17.7726304Z [151s] Generation 4 starting: 51 neighbors, 4 active search path(s)
2026-02-21T08:59:27.3479497Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 3.3 configs/s
2026-02-21T08:59:30.4705034Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 17.2 configs/s
2026-02-21T08:59:33.8948245Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 294.3 configs/s
2026-02-21T08:59:34.0952968Z [167s] Generation 4 complete: 
2026-02-21T08:59:34.0954691Z ok=56
2026-02-21T08:59:34.0954951Z min=0.0410
2026-02-21T08:59:34.0955144Z mid=0.0594
2026-02-21T08:59:34.0955354Z max=0.1946
2026-02-21T08:59:34.0955550Z best={'block_sizes': [1, 16384],
2026-02-21T08:59:34.0955875Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:59:34.0956208Z  'load_eviction_policies': ['', ''],
2026-02-21T08:59:34.0956490Z  'num_stages': 3,
2026-02-21T08:59:34.0956772Z  'num_warps': 2,
2026-02-21T08:59:34.0961851Z  'pid_type': 'flat',
2026-02-21T08:59:34.0963829Z  'range_flattens': [None, False],
2026-02-21T08:59:34.0964213Z  'range_multi_buffers': [None, True],
2026-02-21T08:59:34.0969254Z  'range_num_stages': [0, 0],
2026-02-21T08:59:34.0971856Z  'range_unroll_factors': [0, 0],
2026-02-21T08:59:34.0972186Z  'range_warp_specializes': [None, True]}
2026-02-21T08:59:34.0972535Z [167s] Fitting surrogate: 383 points, 383 targets
2026-02-21T08:59:34.8218339Z [168s] Generation 5 starting: 42 neighbors, 3 active search path(s)
2026-02-21T08:59:45.3503791Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44/44 1.6 configs/s
2026-02-21T08:59:47.9690580Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 44/44 16.7 configs/s
2026-02-21T08:59:51.6857883Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 357.5 configs/s
2026-02-21T08:59:51.8691887Z [185s] Generation 5 complete: 
2026-02-21T08:59:51.8692209Z ok=46
2026-02-21T08:59:51.8692421Z min=0.0419
2026-02-21T08:59:51.8692757Z mid=0.0574
2026-02-21T08:59:51.8692952Z max=0.1004
2026-02-21T08:59:51.8693136Z best={'block_sizes': [1, 16384],
2026-02-21T08:59:51.8693454Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T08:59:51.8693753Z  'load_eviction_policies': ['', ''],
2026-02-21T08:59:51.8698415Z  'num_stages': 3,
2026-02-21T08:59:51.8702572Z  'num_warps': 4,
2026-02-21T08:59:51.8706660Z  'pid_type': 'flat',
2026-02-21T08:59:51.8710876Z  'range_flattens': [None, False],
2026-02-21T08:59:51.8714963Z  'range_multi_buffers': [None, True],
2026-02-21T08:59:51.8719561Z  'range_num_stages': [0, 0],
2026-02-21T08:59:51.8720924Z  'range_unroll_factors': [0, 0],
2026-02-21T08:59:51.8721197Z  'range_warp_specializes': [None, True]}
2026-02-21T08:59:51.8722032Z [185s] Fitting surrogate: 429 points, 429 targets
2026-02-21T08:59:52.5780798Z [186s] Generation 6 starting: 42 neighbors, 3 active search path(s)
2026-02-21T09:00:02.0660308Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43/43 1.7 configs/s
2026-02-21T09:00:04.5942267Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 43/43 17.3 configs/s
2026-02-21T09:00:07.0213027Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 414.5 configs/s
2026-02-21T09:00:07.1743816Z [200s] Generation 6 complete: 
2026-02-21T09:00:07.1748217Z error=1
2026-02-21T09:00:07.1749837Z ok=44
2026-02-21T09:00:07.1750122Z min=0.0409
2026-02-21T09:00:07.1755398Z mid=0.0574
2026-02-21T09:00:07.1757152Z max=0.1290
2026-02-21T09:00:07.1757416Z best={'block_sizes': [1, 16384],
2026-02-21T09:00:07.1757802Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:00:07.1761846Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:00:07.1763261Z  'num_stages': 5,
2026-02-21T09:00:07.1763523Z  'num_warps': 4,
2026-02-21T09:00:07.1763720Z  'pid_type': 'flat',
2026-02-21T09:00:07.1763952Z  'range_flattens': [None, True],
2026-02-21T09:00:07.1764175Z  'range_multi_buffers': [None, None],
2026-02-21T09:00:07.1764431Z  'range_num_stages': [0, 2],
2026-02-21T09:00:07.1764642Z  'range_unroll_factors': [0, 0],
2026-02-21T09:00:07.1764893Z  'range_warp_specializes': [None, True]}
2026-02-21T09:00:07.1765221Z [200s] Fitting surrogate: 474 points, 474 targets
2026-02-21T09:00:07.4627304Z [201s] Generation 7 starting: 10 neighbors, 1 active search path(s)
2026-02-21T09:00:09.6617639Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 6.1 configs/s
2026-02-21T09:00:10.2611464Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 18.1 configs/s
2026-02-21T09:00:11.0408651Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1269.7 configs/s
2026-02-21T09:00:11.1026430Z [204s] Generation 7 complete: 
2026-02-21T09:00:11.1030777Z ok=11
2026-02-21T09:00:11.1034865Z min=0.0428
2026-02-21T09:00:11.1036914Z mid=0.0451
2026-02-21T09:00:11.1037125Z max=0.0655
2026-02-21T09:00:11.1037350Z best={'block_sizes': [1, 16384],
2026-02-21T09:00:11.1037648Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:00:11.1037976Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:00:11.1038207Z  'num_stages': 5,
2026-02-21T09:00:11.1038415Z  'num_warps': 1,
2026-02-21T09:00:11.1038622Z  'pid_type': 'flat',
2026-02-21T09:00:11.1038823Z  'range_flattens': [None, True],
2026-02-21T09:00:11.1039064Z  'range_multi_buffers': [None, None],
2026-02-21T09:00:11.1039284Z  'range_num_stages': [0, 2],
2026-02-21T09:00:11.1039513Z  'range_unroll_factors': [0, 0],
2026-02-21T09:00:11.1039726Z  'range_warp_specializes': [None, True]}
2026-02-21T09:00:11.1050292Z [204s] Fitting surrogate: 485 points, 485 targets
2026-02-21T09:00:11.2722746Z [204s] Autotuning complete in 204.9s after searching 468 configs.
2026-02-21T09:00:11.2724878Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:00:11.2725914Z     @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T09:00:11.2726825Z 
2026-02-21T09:00:11.2727129Z [204s] Code of selected kernel: /tmp/torchinductor_root/vy/cvyqpaemgsb3xa23z4dnhvblqyfc45ml3zbow5nujzu3wnhoqarl.py
2026-02-21T09:00:12.4054699Z WARNING:tritonbench.utils.triton_op:Completed input ID 87:
2026-02-21T09:00:12.4059176Z (M, N)
2026-02-21T09:00:12.4064730Z -------------
2026-02-21T09:00:12.4069287Z (4096, 11392)
2026-02-21T09:00:12.4069537Z 
2026-02-21T09:00:12.4070108Z  90%|█████████ | 18/20 [51:09<06:25, 192.92s/it]WARNING:tritonbench.utils.triton_op:Running input ID 92:
2026-02-21T09:00:12.4070491Z (M, N)
2026-02-21T09:00:12.4070668Z -------------
2026-02-21T09:00:12.4070866Z (4096, 12032)
2026-02-21T09:00:12.4071176Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T09:00:13.5777690Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T09:00:14.9553898Z INFO:tritonbench.utils.triton_op:Took 2.49ms to get benchmark function for torch_compile_softmax
2026-02-21T09:00:18.9631315Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:00:18.9635592Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:00:18.9640219Z               'dtype': 'torch.float16',
2026-02-21T09:00:18.9644179Z               'shape': (4096, 12032),
2026-02-21T09:00:18.9646235Z               'stride': (12032, 1)},),
2026-02-21T09:00:18.9646573Z   'kwargs': {}}
2026-02-21T09:00:18.9659892Z INFO:tritonbench.utils.triton_op:Took 3.14ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:00:19.1442127Z [0s] Autotune random seed: 2134816249
2026-02-21T09:00:19.2909662Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:01:01.0775325Z [41s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[4, 1], range_warp_specializes=[None, None])
2026-02-21T09:01:01.0779124Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.2 configs/s
2026-02-21T09:01:09.3442610Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 12.1 configs/s
2026-02-21T09:01:09.3451232Z [50s] Adaptive compile timeout: 30s (90% percentile=13.2s, bounds=[30.0s, 30s])
2026-02-21T09:01:10.2630089Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1072.3 configs/s
2026-02-21T09:01:10.3304352Z [51s] Initial random population of 100, 5 starting points: 
2026-02-21T09:01:10.3305102Z error=5
2026-02-21T09:01:10.3305333Z timeout=1
2026-02-21T09:01:10.3305561Z ok=94
2026-02-21T09:01:10.3305766Z min=0.0634
2026-02-21T09:01:10.3305952Z mid=1.1162
2026-02-21T09:01:10.3306170Z max=61.4842
2026-02-21T09:01:10.3306385Z best={'block_sizes': [1, 16384],
2026-02-21T09:01:10.3306707Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:01:10.3307017Z  'load_eviction_policies': ['last', ''],
2026-02-21T09:01:10.3307322Z  'num_sm_multiplier': 8,
2026-02-21T09:01:10.3307531Z  'num_stages': 3,
2026-02-21T09:01:10.3307758Z  'num_warps': 1,
2026-02-21T09:01:10.3307995Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:01:10.3308280Z  'range_flattens': [False, None],
2026-02-21T09:01:10.3308955Z  'range_multi_buffers': [True, True],
2026-02-21T09:01:10.3309214Z  'range_num_stages': [1, 2],
2026-02-21T09:01:10.3309491Z  'range_unroll_factors': [0, 1],
2026-02-21T09:01:10.3309715Z  'range_warp_specializes': [True, None]}
2026-02-21T09:01:10.3318336Z [51s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:01:11.5251778Z [52s] Generation 1 starting: 80 neighbors, 5 active search path(s)
2026-02-21T09:01:28.0558651Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 1.1 configs/s
2026-02-21T09:01:33.1090337Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 17.0 configs/s
2026-02-21T09:01:35.5074182Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 418.8 configs/s
2026-02-21T09:01:35.6416453Z [76s] Generation 1 complete: 
2026-02-21T09:01:35.6418077Z ok=86
2026-02-21T09:01:35.6418861Z min=0.0450
2026-02-21T09:01:35.6419110Z mid=0.0758
2026-02-21T09:01:35.6419347Z max=0.3871
2026-02-21T09:01:35.6419561Z best={'block_sizes': [1, 16384],
2026-02-21T09:01:35.6419912Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:01:35.6420206Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:01:35.6420461Z  'num_sm_multiplier': 8,
2026-02-21T09:01:35.6420670Z  'num_stages': 3,
2026-02-21T09:01:35.6420885Z  'num_warps': 1,
2026-02-21T09:01:35.6421117Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:01:35.6421353Z  'range_flattens': [False, None],
2026-02-21T09:01:35.6421879Z  'range_multi_buffers': [True, True],
2026-02-21T09:01:35.6422115Z  'range_num_stages': [1, 2],
2026-02-21T09:01:35.6422364Z  'range_unroll_factors': [1, 1],
2026-02-21T09:01:35.6422585Z  'range_warp_specializes': [True, None]}
2026-02-21T09:01:35.6431325Z [76s] Fitting surrogate: 186 points, 186 targets
2026-02-21T09:01:36.6461214Z [77s] Generation 2 starting: 74 neighbors, 5 active search path(s)
2026-02-21T09:01:48.2962642Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 10.8 configs/s
2026-02-21T09:01:52.7255019Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 17.1 configs/s
2026-02-21T09:01:56.6919658Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 313.7 configs/s
2026-02-21T09:01:56.8817605Z [97s] Generation 2 complete: 
2026-02-21T09:01:56.8819710Z ok=79
2026-02-21T09:01:56.8819924Z min=0.0430
2026-02-21T09:01:56.8820130Z mid=0.0655
2026-02-21T09:01:56.8820294Z max=0.2899
2026-02-21T09:01:56.8820510Z best={'block_sizes': [1, 16384],
2026-02-21T09:01:56.8820785Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:01:56.8821089Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:01:56.8821320Z  'num_stages': 3,
2026-02-21T09:01:56.8821523Z  'num_warps': 1,
2026-02-21T09:01:56.8821911Z  'pid_type': 'flat',
2026-02-21T09:01:56.8822193Z  'range_flattens': [None, None],
2026-02-21T09:01:56.8822430Z  'range_multi_buffers': [None, True],
2026-02-21T09:01:56.8822720Z  'range_num_stages': [0, 2],
2026-02-21T09:01:56.8825778Z  'range_unroll_factors': [0, 1],
2026-02-21T09:01:56.8827892Z  'range_warp_specializes': [None, True]}
2026-02-21T09:01:56.8830362Z [97s] Fitting surrogate: 265 points, 265 targets
2026-02-21T09:01:57.8227965Z [98s] Generation 3 starting: 68 neighbors, 5 active search path(s)
2026-02-21T09:02:13.8172242Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 1.1 configs/s
2026-02-21T09:02:17.8195188Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 17.2 configs/s
2026-02-21T09:02:22.9667128Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 196.2 configs/s
2026-02-21T09:02:23.2406936Z [123s] Generation 3 complete: 
2026-02-21T09:02:23.2410714Z error=1
2026-02-21T09:02:23.2415760Z ok=72
2026-02-21T09:02:23.2419867Z min=0.0430
2026-02-21T09:02:23.2421787Z mid=0.0613
2026-02-21T09:02:23.2422031Z max=0.5242
2026-02-21T09:02:23.2422219Z best={'block_sizes': [1, 16384],
2026-02-21T09:02:23.2422536Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:02:23.2422815Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:02:23.2423078Z  'num_stages': 3,
2026-02-21T09:02:23.2423292Z  'num_warps': 1,
2026-02-21T09:02:23.2423476Z  'pid_type': 'flat',
2026-02-21T09:02:23.2423700Z  'range_flattens': [None, None],
2026-02-21T09:02:23.2423923Z  'range_multi_buffers': [None, True],
2026-02-21T09:02:23.2424176Z  'range_num_stages': [0, 2],
2026-02-21T09:02:23.2424377Z  'range_unroll_factors': [0, 1],
2026-02-21T09:02:23.2424629Z  'range_warp_specializes': [None, True]}
2026-02-21T09:02:23.2426395Z [123s] Fitting surrogate: 338 points, 338 targets
2026-02-21T09:02:24.0184706Z [124s] Generation 4 starting: 49 neighbors, 5 active search path(s)
2026-02-21T09:02:33.7161081Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49/49 1.5 configs/s
2026-02-21T09:02:36.6228703Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 49/49 17.1 configs/s
2026-02-21T09:02:40.9405815Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 234.1 configs/s
2026-02-21T09:02:41.1901469Z [141s] Generation 4 complete: 
2026-02-21T09:02:41.1905810Z ok=54
2026-02-21T09:02:41.1910263Z min=0.0429
2026-02-21T09:02:41.1910602Z mid=0.0594
2026-02-21T09:02:41.1910813Z max=0.1105
2026-02-21T09:02:41.1911051Z best={'block_sizes': [1, 16384],
2026-02-21T09:02:41.1911371Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:02:41.1916002Z  'load_eviction_policies': ['', ''],
2026-02-21T09:02:41.1920515Z  'num_stages': 5,
2026-02-21T09:02:41.1924910Z  'num_warps': 2,
2026-02-21T09:02:41.1926420Z  'pid_type': 'flat',
2026-02-21T09:02:41.1926715Z  'range_flattens': [None, True],
2026-02-21T09:02:41.1927018Z  'range_multi_buffers': [None, None],
2026-02-21T09:02:41.1927251Z  'range_num_stages': [0, 1],
2026-02-21T09:02:41.1927568Z  'range_unroll_factors': [0, 1],
2026-02-21T09:02:41.1932411Z  'range_warp_specializes': [None, True]}
2026-02-21T09:02:41.1937344Z [141s] Fitting surrogate: 392 points, 392 targets
2026-02-21T09:02:41.6580651Z [142s] Generation 5 starting: 24 neighbors, 3 active search path(s)
2026-02-21T09:02:48.9620598Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26/26 0.7 configs/s
2026-02-21T09:02:50.4957879Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 26/26 17.5 configs/s
2026-02-21T09:02:52.7753190Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 442.0 configs/s
2026-02-21T09:02:52.9228353Z [153s] Generation 5 complete: 
2026-02-21T09:02:52.9228724Z ok=28
2026-02-21T09:02:52.9229727Z min=0.0430
2026-02-21T09:02:52.9234155Z mid=0.0430
2026-02-21T09:02:52.9236125Z max=0.7054
2026-02-21T09:02:52.9236390Z best={'block_sizes': [1, 16384],
2026-02-21T09:02:52.9236704Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:02:52.9237044Z  'load_eviction_policies': ['', ''],
2026-02-21T09:02:52.9237278Z  'num_stages': 5,
2026-02-21T09:02:52.9237494Z  'num_warps': 2,
2026-02-21T09:02:52.9237687Z  'pid_type': 'flat',
2026-02-21T09:02:52.9237923Z  'range_flattens': [None, True],
2026-02-21T09:02:52.9238148Z  'range_multi_buffers': [None, None],
2026-02-21T09:02:52.9238414Z  'range_num_stages': [0, 1],
2026-02-21T09:02:52.9238660Z  'range_unroll_factors': [0, 0],
2026-02-21T09:02:52.9238881Z  'range_warp_specializes': [None, True]}
2026-02-21T09:02:52.9245479Z [153s] Fitting surrogate: 420 points, 420 targets
2026-02-21T09:02:53.0748127Z [153s] Autotuning complete in 153.8s after searching 402 configs.
2026-02-21T09:02:53.0748937Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:02:53.0749981Z     @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', ''], num_stages=5, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T09:02:53.0750804Z 
2026-02-21T09:02:53.0751113Z [153s] Code of selected kernel: /tmp/torchinductor_root/nv/cnvxfp2hrfiggj3jsxrsbinlsyuo22orr5g7hz2573inp7sojikm.py
2026-02-21T09:02:54.1278763Z WARNING:tritonbench.utils.triton_op:Completed input ID 92:
2026-02-21T09:02:54.1283050Z (M, N)
2026-02-21T09:02:54.1287005Z -------------
2026-02-21T09:02:54.1288385Z (4096, 12032)
2026-02-21T09:02:54.1288544Z 
2026-02-21T09:02:54.1289099Z  95%|█████████▌| 19/20 [53:51<03:03, 183.55s/it]
2026-02-21T09:02:54.1289099Z WARNING:tritonbench.utils.triton_op:Running input ID 97:
2026-02-21T09:02:54.1294220Z (M, N)
2026-02-21T09:02:54.1295626Z -------------
2026-02-21T09:02:54.1295862Z (4096, 12672)
2026-02-21T09:02:54.1296272Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T09:02:55.3133179Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T09:02:56.6876721Z INFO:tritonbench.utils.triton_op:Took 2.29ms to get benchmark function for torch_compile_softmax
2026-02-21T09:03:00.3480996Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:03:00.3485057Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:03:00.3490073Z               'dtype': 'torch.float16',
2026-02-21T09:03:00.3493318Z               'shape': (4096, 12672),
2026-02-21T09:03:00.3494727Z               'stride': (12672, 1)},),
2026-02-21T09:03:00.3495026Z   'kwargs': {}}
2026-02-21T09:03:00.3503258Z INFO:tritonbench.utils.triton_op:Took 2.43ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:03:00.5296026Z [0s] Autotune random seed: 2134816249
2026-02-21T09:03:00.6725034Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:03:43.8873704Z [43s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[4, 1], range_warp_specializes=[None, None])
2026-02-21T09:03:43.8891976Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.2 configs/s
2026-02-21T09:03:44.0846396Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T09:03:44.0848073Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:03:44.0848697Z     %cst = arith.constant dense<0.000000e+00> : tensor<8x512xf16>
2026-02-21T09:03:44.0849307Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:03:44.0849574Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:03:44.0849805Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T09:03:44.0850096Z     %cst_0 = arith.constant dense<12672> : tensor<8x1xi32>
2026-02-21T09:03:44.0850391Z     %cst_1 = arith.constant dense<0.000000e+00> : tensor<8x512xf32>
2026-02-21T09:03:44.0850720Z     %cst_2 = arith.constant dense<0xFC00> : tensor<8x512xf16>
2026-02-21T09:03:44.0851029Z     %cst_3 = arith.constant dense<12672> : tensor<512xi32>
2026-02-21T09:03:44.0851313Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32>
2026-02-21T09:03:44.0851839Z     %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32>
2026-02-21T09:03:44.0852097Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:03:44.0852381Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:03:44.0852763Z     %c12672_i32 = arith.constant 12672 : i32
2026-02-21T09:03:44.0853033Z     %c12672_i64 = arith.constant 12672 : i64
2026-02-21T09:03:44.0853257Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T09:03:44.0853647Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c12672_i32], [%c12672_i64, %c1_i64] : <f16>, <tensor<8x512xf16>>
2026-02-21T09:03:44.0854043Z     %1 = tt.get_program_id x : i32
2026-02-21T09:03:44.0854292Z     scf.for %arg2 = %1 to %c512_i32 step %c9472_i32  : i32 {
2026-02-21T09:03:44.0854577Z       %2 = arith.muli %arg2, %c8_i32 : i32
2026-02-21T09:03:44.0854843Z       %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T09:03:44.0855157Z       %4 = tt.splat %2 : i32 -> tensor<8xi32>
2026-02-21T09:03:44.0855395Z       %5 = arith.addi %4, %3 : tensor<8xi32>
2026-02-21T09:03:44.0855651Z       %c12288_i32 = arith.constant 12288 : i32
2026-02-21T09:03:44.0855919Z       %c2048_i32 = arith.constant 2048 : i32
2026-02-21T09:03:44.0856328Z       %6:2 = scf.for %arg3 = %c0_i32 to %c12288_i32 step %c2048_i32 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<8xf32>, tensor<8xf32>)  : i32 {
2026-02-21T09:03:44.0856816Z         %60 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T09:03:44.0857116Z         %61 = tt.splat %arg3 : i32 -> tensor<512xi32>
2026-02-21T09:03:44.0857388Z         %62 = arith.addi %61, %60 : tensor<512xi32>
2026-02-21T09:03:44.0857671Z         %63 = arith.cmpi slt, %62, %cst_3 : tensor<512xi32>
2026-02-21T09:03:44.0858014Z         %64 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T09:03:44.0858426Z         %65 = tt.expand_dims %63 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T09:03:44.0858756Z         %66 = tt.broadcast %65 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T09:03:44.0859096Z         %67 = arith.select %66, %64, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T09:03:44.0859412Z         %68 = arith.extf %67 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T09:03:44.0859715Z         %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({
2026-02-21T09:03:44.0859989Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:03:44.0860218Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:03:44.0860484Z           tt.reduce.return %174 : f32
2026-02-21T09:03:44.0860709Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T09:03:44.0860992Z         %70 = arith.truncf %69 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:03:44.0861269Z         %71 = arith.extf %70 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:03:44.0861605Z         %72 = arith.cmpf ogt, %arg4, %71 : tensor<8xf32>
2026-02-21T09:03:44.0861900Z         %73 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32>
2026-02-21T09:03:44.0862158Z         %74 = arith.ori %72, %73 : tensor<8xi1>
2026-02-21T09:03:44.0862455Z         %75 = arith.select %74, %arg4, %71 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:03:44.0862737Z         %76 = arith.subf %arg4, %75 : tensor<8xf32>
2026-02-21T09:03:44.0863217Z         %77 = tt.extern_elementwise %76 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:03:44.0863727Z         %78 = arith.mulf %arg5, %77 : tensor<8xf32>
2026-02-21T09:03:44.0864045Z         %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:03:44.0864372Z         %80 = arith.extf %64 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T09:03:44.0864697Z         %81 = tt.broadcast %79 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T09:03:44.0864973Z         %82 = arith.subf %80, %81 : tensor<8x512xf32>
2026-02-21T09:03:44.0865405Z         %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T09:03:44.0865849Z         %84 = arith.select %66, %83, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T09:03:44.0866168Z         %85 = "tt.reduce"(%84) <{axis = 1 : i32}> ({
2026-02-21T09:03:44.0866496Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:03:44.0866720Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:03:44.0866977Z           tt.reduce.return %174 : f32
2026-02-21T09:03:44.0867216Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T09:03:44.0867493Z         %86 = arith.addf %78, %85 : tensor<8xf32>
2026-02-21T09:03:44.0867724Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T09:03:44.0867978Z         %87 = arith.muli %c512_i32, %c1_i32 : i32
2026-02-21T09:03:44.0868233Z         %88 = arith.addi %arg3, %87 : i32
2026-02-21T09:03:44.0868511Z         %89 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T09:03:44.0868832Z         %90 = tt.splat %88 : i32 -> tensor<512xi32>
2026-02-21T09:03:44.0869072Z         %91 = arith.addi %90, %89 : tensor<512xi32>
2026-02-21T09:03:44.0869359Z         %92 = arith.cmpi slt, %91, %cst_3 : tensor<512xi32>
2026-02-21T09:03:44.0869691Z         %93 = tt.descriptor_load %0[%2, %88] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T09:03:44.0870100Z         %94 = tt.expand_dims %92 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T09:03:44.0870464Z         %95 = tt.broadcast %94 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T09:03:44.0870773Z         %96 = arith.select %95, %93, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T09:03:44.0871116Z         %97 = arith.extf %96 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T09:03:44.0871381Z         %98 = "tt.reduce"(%97) <{axis = 1 : i32}> ({
2026-02-21T09:03:44.0871687Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:03:44.0871906Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:03:44.0872169Z           tt.reduce.return %174 : f32
2026-02-21T09:03:44.0872418Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T09:03:44.0872677Z         %99 = arith.truncf %98 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:03:44.0872991Z         %100 = arith.extf %99 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:03:44.0873272Z         %101 = arith.cmpf ogt, %75, %100 : tensor<8xf32>
2026-02-21T09:03:44.0873562Z         %102 = arith.cmpf une, %75, %75 : tensor<8xf32>
2026-02-21T09:03:44.0873828Z         %103 = arith.ori %101, %102 : tensor<8xi1>
2026-02-21T09:03:44.0874137Z         %104 = arith.select %103, %75, %100 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:03:44.0874466Z         %105 = arith.subf %75, %104 : tensor<8xf32>
2026-02-21T09:03:44.0874892Z         %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:03:44.0875342Z         %107 = arith.mulf %86, %106 : tensor<8xf32>
2026-02-21T09:03:44.0875654Z         %108 = tt.expand_dims %104 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:03:44.0876033Z         %109 = arith.extf %93 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T09:03:44.0876355Z         %110 = tt.broadcast %108 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T09:03:44.0876684Z         %111 = arith.subf %109, %110 : tensor<8x512xf32>
2026-02-21T09:03:44.0877267Z         %112 = tt.extern_elementwise %111 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T09:03:44.0877742Z         %113 = arith.select %95, %112, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T09:03:44.0878085Z         %114 = "tt.reduce"(%113) <{axis = 1 : i32}> ({
2026-02-21T09:03:44.0878356Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:03:44.0878623Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:03:44.0878888Z           tt.reduce.return %174 : f32
2026-02-21T09:03:44.0879125Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T09:03:44.0879409Z         %115 = arith.addf %107, %114 : tensor<8xf32>
2026-02-21T09:03:44.0879655Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:03:44.0879929Z         %116 = arith.muli %c512_i32, %c2_i32 : i32
2026-02-21T09:03:44.0880173Z         %117 = arith.addi %arg3, %116 : i32
2026-02-21T09:03:44.0880542Z         %118 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T09:03:44.0880851Z         %119 = tt.splat %117 : i32 -> tensor<512xi32>
2026-02-21T09:03:44.0881135Z         %120 = arith.addi %119, %118 : tensor<512xi32>
2026-02-21T09:03:44.0881428Z         %121 = arith.cmpi slt, %120, %cst_3 : tensor<512xi32>
2026-02-21T09:03:44.0881819Z         %122 = tt.descriptor_load %0[%2, %117] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T09:03:44.0882253Z         %123 = tt.expand_dims %121 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T09:03:44.0882607Z         %124 = tt.broadcast %123 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T09:03:44.0882971Z         %125 = arith.select %124, %122, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T09:03:44.0883340Z         %126 = arith.extf %125 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T09:03:44.0883631Z         %127 = "tt.reduce"(%126) <{axis = 1 : i32}> ({
2026-02-21T09:03:44.0883908Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:03:44.0884143Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:03:44.0884412Z           tt.reduce.return %174 : f32
2026-02-21T09:03:44.0884635Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T09:03:44.0884925Z         %128 = arith.truncf %127 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:03:44.0885211Z         %129 = arith.extf %128 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:03:44.0885507Z         %130 = arith.cmpf ogt, %104, %129 : tensor<8xf32>
2026-02-21T09:03:44.0885798Z         %131 = arith.cmpf une, %104, %104 : tensor<8xf32>
2026-02-21T09:03:44.0886043Z         %132 = arith.ori %130, %131 : tensor<8xi1>
2026-02-21T09:03:44.0886345Z         %133 = arith.select %132, %104, %129 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:03:44.0886625Z         %134 = arith.subf %104, %133 : tensor<8xf32>
2026-02-21T09:03:44.0887048Z         %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:03:44.0887474Z         %136 = arith.mulf %115, %135 : tensor<8xf32>
2026-02-21T09:03:44.0887766Z         %137 = tt.expand_dims %133 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:03:44.0888122Z         %138 = arith.extf %122 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T09:03:44.0888424Z         %139 = tt.broadcast %137 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T09:03:44.0888727Z         %140 = arith.subf %138, %139 : tensor<8x512xf32>
2026-02-21T09:03:44.0889139Z         %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T09:03:44.0889619Z         %142 = arith.select %124, %141, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T09:03:44.0889945Z         %143 = "tt.reduce"(%142) <{axis = 1 : i32}> ({
2026-02-21T09:03:44.0890175Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:03:44.0890425Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:03:44.0890721Z           tt.reduce.return %174 : f32
2026-02-21T09:03:44.0890972Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T09:03:44.0891210Z         %144 = arith.addf %136, %143 : tensor<8xf32>
2026-02-21T09:03:44.0891474Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T09:03:44.0891764Z         %145 = arith.muli %c512_i32, %c3_i32 : i32
2026-02-21T09:03:44.0891997Z         %146 = arith.addi %arg3, %145 : i32
2026-02-21T09:03:44.0892296Z         %147 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T09:03:44.0892590Z         %148 = tt.splat %146 : i32 -> tensor<512xi32>
2026-02-21T09:03:44.0892862Z         %149 = arith.addi %148, %147 : tensor<512xi32>
2026-02-21T09:03:44.0893122Z         %150 = arith.cmpi slt, %149, %cst_3 : tensor<512xi32>
2026-02-21T09:03:44.0893487Z         %151 = tt.descriptor_load %0[%2, %146] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T09:03:44.0893958Z         %152 = tt.expand_dims %150 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T09:03:44.0894298Z         %153 = tt.broadcast %152 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T09:03:44.0894650Z         %154 = arith.select %153, %151, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T09:03:44.0894973Z         %155 = arith.extf %154 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T09:03:44.0895271Z         %156 = "tt.reduce"(%155) <{axis = 1 : i32}> ({
2026-02-21T09:03:44.0895502Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:03:44.0895753Z           %174 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:03:44.0896013Z           tt.reduce.return %174 : f32
2026-02-21T09:03:44.0896236Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T09:03:44.0896524Z         %157 = arith.truncf %156 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:03:44.0896805Z         %158 = arith.extf %157 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:03:44.0897102Z         %159 = arith.cmpf ogt, %133, %158 : tensor<8xf32>
2026-02-21T09:03:44.0897364Z         %160 = arith.cmpf une, %133, %133 : tensor<8xf32>
2026-02-21T09:03:44.0897637Z         %161 = arith.ori %159, %160 : tensor<8xi1>
2026-02-21T09:03:44.0897940Z         %162 = arith.select %161, %133, %158 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:03:44.0898218Z         %163 = arith.subf %133, %162 : tensor<8xf32>
2026-02-21T09:03:44.0898639Z         %164 = tt.extern_elementwise %163 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:03:44.0899036Z         %165 = arith.mulf %144, %164 : tensor<8xf32>
2026-02-21T09:03:44.0899353Z         %166 = tt.expand_dims %162 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:03:44.0899681Z         %167 = arith.extf %151 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T09:03:44.0900012Z         %168 = tt.broadcast %166 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T09:03:44.0900316Z         %169 = arith.subf %167, %168 : tensor<8x512xf32>
2026-02-21T09:03:44.0900731Z         %170 = tt.extern_elementwise %169 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T09:03:44.0901234Z         %171 = arith.select %153, %170, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T09:03:44.0901529Z         %172 = "tt.reduce"(%171) <{axis = 1 : i32}> ({
2026-02-21T09:03:44.0901818Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:03:44.0902067Z           %174 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:03:44.0902292Z           tt.reduce.return %174 : f32
2026-02-21T09:03:44.0902543Z         }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T09:03:44.0902776Z         %173 = arith.addf %165, %172 : tensor<8xf32>
2026-02-21T09:03:44.0903065Z         scf.yield %162, %173 : tensor<8xf32>, tensor<8xf32>
2026-02-21T09:03:44.0903354Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T09:03:44.0903690Z       %7 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T09:03:44.0904079Z       %8 = tt.splat %c12288_i32 : i32 -> tensor<512xi32>
2026-02-21T09:03:44.0904323Z       %9 = arith.addi %8, %7 : tensor<512xi32>
2026-02-21T09:03:44.0904598Z       %10 = arith.cmpi slt, %9, %cst_3 : tensor<512xi32>
2026-02-21T09:03:44.0904951Z       %11 = tt.descriptor_load %0[%2, %c12288_i32] : !tt.tensordesc<tensor<8x512xf16>> -> tensor<8x512xf16>
2026-02-21T09:03:44.0905371Z       %12 = tt.expand_dims %10 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T09:03:44.0905698Z       %13 = tt.broadcast %12 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T09:03:44.0906041Z       %14 = arith.select %13, %11, %cst_2 : tensor<8x512xi1>, tensor<8x512xf16>
2026-02-21T09:03:44.0906380Z       %15 = arith.extf %14 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T09:03:44.0906645Z       %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({
2026-02-21T09:03:44.0906906Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:03:44.0907129Z         %60 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:03:44.0907443Z         tt.reduce.return %60 : f32
2026-02-21T09:03:44.0907672Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T09:03:44.0907955Z       %17 = arith.truncf %16 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:03:44.0908230Z       %18 = arith.extf %17 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:03:44.0908520Z       %19 = arith.cmpf ogt, %6#0, %18 : tensor<8xf32>
2026-02-21T09:03:44.0908803Z       %20 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32>
2026-02-21T09:03:44.0909043Z       %21 = arith.ori %19, %20 : tensor<8xi1>
2026-02-21T09:03:44.0909331Z       %22 = arith.select %21, %6#0, %18 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:03:44.0909599Z       %23 = arith.subf %6#0, %22 : tensor<8xf32>
2026-02-21T09:03:44.0910016Z       %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:03:44.0910434Z       %25 = arith.mulf %6#1, %24 : tensor<8xf32>
2026-02-21T09:03:44.0910725Z       %26 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:03:44.0911074Z       %27 = arith.extf %11 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T09:03:44.0911369Z       %28 = tt.broadcast %26 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T09:03:44.0911692Z       %29 = arith.subf %27, %28 : tensor<8x512xf32>
2026-02-21T09:03:44.0912091Z       %30 = tt.extern_elementwise %29 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T09:03:44.0912564Z       %31 = arith.select %13, %30, %cst_1 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T09:03:44.0912872Z       %32 = "tt.reduce"(%31) <{axis = 1 : i32}> ({
2026-02-21T09:03:44.0913101Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:03:44.0913348Z         %60 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:03:44.0913577Z         tt.reduce.return %60 : f32
2026-02-21T09:03:44.0913824Z       }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T09:03:44.0914057Z       %33 = arith.addf %25, %32 : tensor<8xf32>
2026-02-21T09:03:44.0914323Z       %c12288_i32_6 = arith.constant 12288 : i32
2026-02-21T09:03:44.0914589Z       %c2048_i32_7 = arith.constant 2048 : i32
2026-02-21T09:03:44.0914866Z       scf.for %arg3 = %c0_i32 to %c12288_i32_6 step %c2048_i32_7  : i32 {
2026-02-21T09:03:44.0915219Z         %60 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T09:03:44.0915513Z         %61 = tt.splat %arg3 : i32 -> tensor<512xi32>
2026-02-21T09:03:44.0915784Z         %62 = arith.addi %61, %60 : tensor<512xi32>
2026-02-21T09:03:44.0916042Z         %63 = arith.cmpi slt, %62, %cst_3 : tensor<512xi32>
2026-02-21T09:03:44.0916404Z         %64 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:03:44.0916728Z         %65 = arith.muli %64, %cst_0 : tensor<8x1xi32>
2026-02-21T09:03:44.0917053Z         %66 = tt.expand_dims %62 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T09:03:44.0917388Z         %67 = tt.broadcast %65 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T09:03:44.0917821Z         %68 = tt.broadcast %66 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T09:03:44.0918104Z         %69 = arith.addi %67, %68 : tensor<8x512xi32>
2026-02-21T09:03:44.0918429Z         %70 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0918796Z         %71 = tt.addptr %70, %69 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T09:03:44.0919149Z         %72 = tt.expand_dims %63 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T09:03:44.0919529Z         %73 = tt.broadcast %72 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T09:03:44.0919876Z         %74 = tt.load %71, %73, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0920282Z         %75 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:03:44.0920619Z         %76 = arith.extf %74 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T09:03:44.0921036Z         %77 = tt.broadcast %75 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T09:03:44.0921355Z         %78 = arith.subf %76, %77 : tensor<8x512xf32>
2026-02-21T09:03:44.0921846Z         %79 = tt.extern_elementwise %78 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T09:03:44.0922353Z         %80 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:03:44.0922689Z         %81 = tt.broadcast %80 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T09:03:44.0923008Z         %82 = arith.divf %79, %81 : tensor<8x512xf32>
2026-02-21T09:03:44.0923322Z         %83 = arith.truncf %82 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T09:03:44.0923645Z         %84 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0924002Z         %85 = tt.addptr %84, %69 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T09:03:44.0924321Z         tt.store %85, %83, %73 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0924620Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T09:03:44.0924864Z         %86 = arith.muli %c512_i32, %c1_i32 : i32
2026-02-21T09:03:44.0925139Z         %87 = arith.addi %arg3, %86 : i32
2026-02-21T09:03:44.0925447Z         %88 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T09:03:44.0925753Z         %89 = tt.splat %87 : i32 -> tensor<512xi32>
2026-02-21T09:03:44.0926021Z         %90 = arith.addi %89, %88 : tensor<512xi32>
2026-02-21T09:03:44.0926275Z         %91 = arith.cmpi slt, %90, %cst_3 : tensor<512xi32>
2026-02-21T09:03:44.0926602Z         %92 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:03:44.0926901Z         %93 = arith.muli %92, %cst_0 : tensor<8x1xi32>
2026-02-21T09:03:44.0927221Z         %94 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T09:03:44.0927575Z         %95 = tt.broadcast %93 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T09:03:44.0927872Z         %96 = tt.broadcast %94 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T09:03:44.0928169Z         %97 = arith.addi %95, %96 : tensor<8x512xi32>
2026-02-21T09:03:44.0928441Z         %98 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0928771Z         %99 = tt.addptr %98, %97 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T09:03:44.0929103Z         %100 = tt.expand_dims %91 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T09:03:44.0929465Z         %101 = tt.broadcast %100 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T09:03:44.0929836Z         %102 = tt.load %99, %101, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0930199Z         %103 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:03:44.0930553Z         %104 = arith.extf %102 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T09:03:44.0930855Z         %105 = tt.broadcast %103 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T09:03:44.0931223Z         %106 = arith.subf %104, %105 : tensor<8x512xf32>
2026-02-21T09:03:44.0931714Z         %107 = tt.extern_elementwise %106 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T09:03:44.0932167Z         %108 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:03:44.0932519Z         %109 = tt.broadcast %108 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T09:03:44.0932810Z         %110 = arith.divf %107, %109 : tensor<8x512xf32>
2026-02-21T09:03:44.0933115Z         %111 = arith.truncf %110 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T09:03:44.0933423Z         %112 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0933772Z         %113 = tt.addptr %112, %97 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T09:03:44.0934107Z         tt.store %113, %111, %101 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0934412Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:03:44.0934671Z         %114 = arith.muli %c512_i32, %c2_i32 : i32
2026-02-21T09:03:44.0934907Z         %115 = arith.addi %arg3, %114 : i32
2026-02-21T09:03:44.0935203Z         %116 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T09:03:44.0935491Z         %117 = tt.splat %115 : i32 -> tensor<512xi32>
2026-02-21T09:03:44.0935759Z         %118 = arith.addi %117, %116 : tensor<512xi32>
2026-02-21T09:03:44.0936045Z         %119 = arith.cmpi slt, %118, %cst_3 : tensor<512xi32>
2026-02-21T09:03:44.0936353Z         %120 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:03:44.0936688Z         %121 = arith.muli %120, %cst_0 : tensor<8x1xi32>
2026-02-21T09:03:44.0936990Z         %122 = tt.expand_dims %118 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T09:03:44.0937355Z         %123 = tt.broadcast %121 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T09:03:44.0937682Z         %124 = tt.broadcast %122 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T09:03:44.0937966Z         %125 = arith.addi %123, %124 : tensor<8x512xi32>
2026-02-21T09:03:44.0938277Z         %126 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0938601Z         %127 = tt.addptr %126, %125 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T09:03:44.0938971Z         %128 = tt.expand_dims %119 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T09:03:44.0939306Z         %129 = tt.broadcast %128 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T09:03:44.0939683Z         %130 = tt.load %127, %129, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0940073Z         %131 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:03:44.0940396Z         %132 = arith.extf %130 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T09:03:44.0940726Z         %133 = tt.broadcast %131 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T09:03:44.0941009Z         %134 = arith.subf %132, %133 : tensor<8x512xf32>
2026-02-21T09:03:44.0941445Z         %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T09:03:44.0941962Z         %136 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:03:44.0942285Z         %137 = tt.broadcast %136 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T09:03:44.0942588Z         %138 = arith.divf %135, %137 : tensor<8x512xf32>
2026-02-21T09:03:44.0942865Z         %139 = arith.truncf %138 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T09:03:44.0943200Z         %140 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0943518Z         %141 = tt.addptr %140, %125 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T09:03:44.0943844Z         tt.store %141, %139, %129 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0944129Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T09:03:44.0944444Z         %142 = arith.muli %c512_i32, %c3_i32 : i32
2026-02-21T09:03:44.0944712Z         %143 = arith.addi %arg3, %142 : i32
2026-02-21T09:03:44.0944987Z         %144 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T09:03:44.0945308Z         %145 = tt.splat %143 : i32 -> tensor<512xi32>
2026-02-21T09:03:44.0945562Z         %146 = arith.addi %145, %144 : tensor<512xi32>
2026-02-21T09:03:44.0945848Z         %147 = arith.cmpi slt, %146, %cst_3 : tensor<512xi32>
2026-02-21T09:03:44.0946177Z         %148 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:03:44.0946478Z         %149 = arith.muli %148, %cst_0 : tensor<8x1xi32>
2026-02-21T09:03:44.0946809Z         %150 = tt.expand_dims %146 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T09:03:44.0947144Z         %151 = tt.broadcast %149 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T09:03:44.0947542Z         %152 = tt.broadcast %150 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T09:03:44.0947825Z         %153 = arith.addi %151, %152 : tensor<8x512xi32>
2026-02-21T09:03:44.0948132Z         %154 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0948475Z         %155 = tt.addptr %154, %153 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T09:03:44.0948828Z         %156 = tt.expand_dims %147 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T09:03:44.0949186Z         %157 = tt.broadcast %156 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T09:03:44.0949534Z         %158 = tt.load %155, %157, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0949926Z         %159 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:03:44.0950269Z         %160 = arith.extf %158 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T09:03:44.0950566Z         %161 = tt.broadcast %159 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T09:03:44.0950872Z         %162 = arith.subf %160, %161 : tensor<8x512xf32>
2026-02-21T09:03:44.0951282Z         %163 = tt.extern_elementwise %162 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T09:03:44.0951776Z         %164 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:03:44.0952121Z         %165 = tt.broadcast %164 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T09:03:44.0952401Z         %166 = arith.divf %163, %165 : tensor<8x512xf32>
2026-02-21T09:03:44.0952705Z         %167 = arith.truncf %166 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T09:03:44.0953014Z         %168 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0953360Z         %169 = tt.addptr %168, %153 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T09:03:44.0953670Z         tt.store %169, %167, %157 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0953990Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T09:03:44.0954330Z       %34 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T09:03:44.0954631Z       %35 = tt.splat %c12288_i32_6 : i32 -> tensor<512xi32>
2026-02-21T09:03:44.0954915Z       %36 = arith.addi %35, %34 : tensor<512xi32>
2026-02-21T09:03:44.0955168Z       %37 = arith.cmpi slt, %36, %cst_3 : tensor<512xi32>
2026-02-21T09:03:44.0955490Z       %38 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:03:44.0955792Z       %39 = arith.muli %38, %cst_0 : tensor<8x1xi32>
2026-02-21T09:03:44.0956113Z       %40 = tt.expand_dims %36 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T09:03:44.0956464Z       %41 = tt.broadcast %39 : tensor<8x1xi32> -> tensor<8x512xi32>
2026-02-21T09:03:44.0956758Z       %42 = tt.broadcast %40 : tensor<1x512xi32> -> tensor<8x512xi32>
2026-02-21T09:03:44.0957061Z       %43 = arith.addi %41, %42 : tensor<8x512xi32>
2026-02-21T09:03:44.0957335Z       %44 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0957733Z       %45 = tt.addptr %44, %43 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T09:03:44.0958063Z       %46 = tt.expand_dims %37 {axis = 0 : i32} : tensor<512xi1> -> tensor<1x512xi1>
2026-02-21T09:03:44.0958415Z       %47 = tt.broadcast %46 : tensor<1x512xi1> -> tensor<8x512xi1>
2026-02-21T09:03:44.0958760Z       %48 = tt.load %45, %47, %cst evictionPolicy = evict_last : tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0959112Z       %49 = tt.expand_dims %22 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:03:44.0959455Z       %50 = arith.extf %48 : tensor<8x512xf16> to tensor<8x512xf32>
2026-02-21T09:03:44.0959740Z       %51 = tt.broadcast %49 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T09:03:44.0960034Z       %52 = arith.subf %50, %51 : tensor<8x512xf32>
2026-02-21T09:03:44.0960533Z       %53 = tt.extern_elementwise %52 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T09:03:44.0961003Z       %54 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:03:44.0961357Z       %55 = tt.broadcast %54 : tensor<8x1xf32> -> tensor<8x512xf32>
2026-02-21T09:03:44.0961668Z       %56 = arith.divf %53, %55 : tensor<8x512xf32>
2026-02-21T09:03:44.0961973Z       %57 = arith.truncf %56 : tensor<8x512xf32> to tensor<8x512xf16>
2026-02-21T09:03:44.0962286Z       %58 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0962647Z       %59 = tt.addptr %58, %43 : tensor<8x512x!tt.ptr<f16>>, tensor<8x512xi32>
2026-02-21T09:03:44.0962983Z       tt.store %59, %57, %47 : tensor<8x512x!tt.ptr<f16>>
2026-02-21T09:03:44.0963319Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T09:03:44.0963656Z     tt.return
2026-02-21T09:03:44.0963836Z   }
2026-02-21T09:03:44.0964035Z }
2026-02-21T09:03:44.0964129Z 
2026-02-21T09:03:44.0964203Z {-#
2026-02-21T09:03:44.0964413Z   external_resources: {
2026-02-21T09:03:44.0964638Z     mlir_reproducer: {
2026-02-21T09:03:44.0969187Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T09:03:44.0973749Z       disable_threading: false,
2026-02-21T09:03:44.0973986Z       verify_each: true
2026-02-21T09:03:44.0974178Z     }
2026-02-21T09:03:44.0974377Z   }
2026-02-21T09:03:44.0974587Z #-}
2026-02-21T09:03:44.0975080Z /tmp/torchinductor_root/cb/ccb5nm5euemrmlemnx3l2mj5qkox75vbvwlyuk34mtkf6mmguhb5.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:03:44.0976329Z /tmp/torchinductor_root/cb/ccb5nm5euemrmlemnx3l2mj5qkox75vbvwlyuk34mtkf6mmguhb5.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:03:44.0977365Z [43s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:03:44.0978591Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 512], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'last'], maxnreg=32, num_sm_multiplier=64, num_stages=7, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[4, 4], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T09:03:44.0979647Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:03:44.0979949Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:03:52.3143519Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 11.9 configs/s
2026-02-21T09:03:52.3158020Z [51s] Adaptive compile timeout: 30s (90% percentile=13.8s, bounds=[30.0s, 30s])
2026-02-21T09:03:53.2758177Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1023.1 configs/s
2026-02-21T09:03:53.3444072Z [52s] Initial random population of 100, 5 starting points: 
2026-02-21T09:03:53.3448387Z error=7
2026-02-21T09:03:53.3452093Z timeout=1
2026-02-21T09:03:53.3453483Z ok=92
2026-02-21T09:03:53.3453751Z min=0.0674
2026-02-21T09:03:53.3453943Z mid=1.1777
2026-02-21T09:03:53.3454169Z max=62.4026
2026-02-21T09:03:53.3454413Z best={'block_sizes': [1, 16384],
2026-02-21T09:03:53.3454740Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:03:53.3455033Z  'load_eviction_policies': ['last', ''],
2026-02-21T09:03:53.3455295Z  'num_sm_multiplier': 8,
2026-02-21T09:03:53.3455494Z  'num_stages': 3,
2026-02-21T09:03:53.3455715Z  'num_warps': 1,
2026-02-21T09:03:53.3455963Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:03:53.3456198Z  'range_flattens': [False, None],
2026-02-21T09:03:53.3456445Z  'range_multi_buffers': [True, True],
2026-02-21T09:03:53.3456672Z  'range_num_stages': [1, 2],
2026-02-21T09:03:53.3456921Z  'range_unroll_factors': [0, 1],
2026-02-21T09:03:53.3457142Z  'range_warp_specializes': [True, None]}
2026-02-21T09:03:53.3457814Z [52s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:03:54.4345269Z [53s] Generation 1 starting: 79 neighbors, 5 active search path(s)
2026-02-21T09:04:08.8620286Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 3.8 configs/s
2026-02-21T09:04:13.8129963Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 17.1 configs/s
2026-02-21T09:04:15.0791702Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 787.0 configs/s
2026-02-21T09:04:15.1600818Z [74s] Generation 1 complete: 
2026-02-21T09:04:15.1605218Z ok=85
2026-02-21T09:04:15.1609104Z min=0.0471
2026-02-21T09:04:15.1613607Z mid=0.0820
2026-02-21T09:04:15.1615092Z max=0.4300
2026-02-21T09:04:15.1615349Z best={'block_sizes': [1, 16384],
2026-02-21T09:04:15.1615669Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:04:15.1615987Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:04:15.1616245Z  'num_sm_multiplier': 8,
2026-02-21T09:04:15.1616489Z  'num_stages': 3,
2026-02-21T09:04:15.1616675Z  'num_warps': 1,
2026-02-21T09:04:15.1616901Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:04:15.1617167Z  'range_flattens': [False, None],
2026-02-21T09:04:15.1617763Z  'range_multi_buffers': [True, True],
2026-02-21T09:04:15.1618004Z  'range_num_stages': [1, 2],
2026-02-21T09:04:15.1618247Z  'range_unroll_factors': [1, 1],
2026-02-21T09:04:15.1618471Z  'range_warp_specializes': [True, None]}
2026-02-21T09:04:15.1618869Z [74s] Fitting surrogate: 185 points, 185 targets
2026-02-21T09:04:16.0374417Z [75s] Generation 2 starting: 62 neighbors, 5 active search path(s)
2026-02-21T09:04:52.0737589Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64/64 0.4 configs/s
2026-02-21T09:04:55.8862805Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 64/64 17.0 configs/s
2026-02-21T09:04:58.5915293Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 371.9 configs/s
2026-02-21T09:04:58.7452218Z [118s] Generation 2 complete: 
2026-02-21T09:04:58.7454093Z error=1
2026-02-21T09:04:58.7454298Z ok=66
2026-02-21T09:04:58.7454908Z min=0.0491
2026-02-21T09:04:58.7455096Z mid=0.0757
2026-02-21T09:04:58.7455291Z max=2.0634
2026-02-21T09:04:58.7455567Z best={'block_sizes': [1, 16384],
2026-02-21T09:04:58.7459831Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:04:58.7464142Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:04:58.7466259Z  'num_sm_multiplier': 8,
2026-02-21T09:04:58.7466517Z  'num_stages': 3,
2026-02-21T09:04:58.7466739Z  'num_warps': 1,
2026-02-21T09:04:58.7466931Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:04:58.7467209Z  'range_flattens': [False, None],
2026-02-21T09:04:58.7467440Z  'range_multi_buffers': [True, True],
2026-02-21T09:04:58.7467697Z  'range_num_stages': [1, 2],
2026-02-21T09:04:58.7467909Z  'range_unroll_factors': [1, 1],
2026-02-21T09:04:58.7468168Z  'range_warp_specializes': [True, None]}
2026-02-21T09:04:58.7468458Z [118s] Fitting surrogate: 252 points, 252 targets
2026-02-21T09:04:59.6050070Z [118s] Generation 3 starting: 62 neighbors, 5 active search path(s)
2026-02-21T09:05:11.4449171Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66/66 11.6 configs/s
2026-02-21T09:05:15.2157340Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 66/66 17.7 configs/s
2026-02-21T09:05:16.5088310Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 772.9 configs/s
2026-02-21T09:05:16.5950706Z [135s] Generation 3 complete: 
2026-02-21T09:05:16.5953977Z error=2
2026-02-21T09:05:16.5958372Z ok=65
2026-02-21T09:05:16.5962250Z min=0.0472
2026-02-21T09:05:16.5966789Z mid=0.0799
2026-02-21T09:05:16.5971219Z max=0.2785
2026-02-21T09:05:16.5976292Z best={'block_sizes': [1, 16384],
2026-02-21T09:05:16.5980956Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:05:16.5981950Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:05:16.5982246Z  'num_sm_multiplier': 8,
2026-02-21T09:05:16.5982485Z  'num_stages': 3,
2026-02-21T09:05:16.5982710Z  'num_warps': 1,
2026-02-21T09:05:16.5982940Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:05:16.5983174Z  'range_flattens': [False, None],
2026-02-21T09:05:16.5983435Z  'range_multi_buffers': [True, True],
2026-02-21T09:05:16.5983657Z  'range_num_stages': [1, 2],
2026-02-21T09:05:16.5983891Z  'range_unroll_factors': [1, 1],
2026-02-21T09:05:16.5984107Z  'range_warp_specializes': [True, None]}
2026-02-21T09:05:16.5984384Z [135s] Fitting surrogate: 319 points, 319 targets
2026-02-21T09:05:17.1823918Z [136s] Generation 4 starting: 38 neighbors, 3 active search path(s)
2026-02-21T09:05:24.1080261Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41/41 7.7 configs/s
2026-02-21T09:05:26.5232952Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 17.3 configs/s
2026-02-21T09:05:26.5240822Z [145s] Generation 4 complete: 
2026-02-21T09:05:26.5245515Z ok=42
2026-02-21T09:05:26.5250178Z min=0.0472
2026-02-21T09:05:26.5251858Z mid=0.0841
2026-02-21T09:05:26.5252489Z max=0.4608
2026-02-21T09:05:26.5252722Z best={'block_sizes': [1, 16384],
2026-02-21T09:05:26.5253014Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:05:26.5253347Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:05:26.5253610Z  'num_sm_multiplier': 8,
2026-02-21T09:05:26.5253870Z  'num_stages': 3,
2026-02-21T09:05:26.5254081Z  'num_warps': 1,
2026-02-21T09:05:26.5254347Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:05:26.5254595Z  'range_flattens': [False, None],
2026-02-21T09:05:26.5254861Z  'range_multi_buffers': [True, True],
2026-02-21T09:05:26.5255125Z  'range_num_stages': [1, 2],
2026-02-21T09:05:26.5255348Z  'range_unroll_factors': [1, 1],
2026-02-21T09:05:26.5255609Z  'range_warp_specializes': [True, None]}
2026-02-21T09:05:26.5256788Z [145s] Fitting surrogate: 361 points, 361 targets
2026-02-21T09:05:27.1254513Z [146s] Generation 5 starting: 38 neighbors, 3 active search path(s)
2026-02-21T09:05:38.6421256Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41/41 2.2 configs/s
2026-02-21T09:05:41.0785436Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 17.1 configs/s
2026-02-21T09:05:41.6746949Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1636.1 configs/s
2026-02-21T09:05:41.7259838Z [161s] Generation 5 complete: 
2026-02-21T09:05:41.7264271Z ok=42
2026-02-21T09:05:41.7269187Z min=0.0471
2026-02-21T09:05:41.7273622Z mid=0.0820
2026-02-21T09:05:41.7282014Z max=0.4690
2026-02-21T09:05:41.7282190Z best={'block_sizes': [1, 16384],
2026-02-21T09:05:41.7282440Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:05:41.7282684Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:05:41.7282892Z  'num_sm_multiplier': 8,
2026-02-21T09:05:41.7283064Z  'num_stages': 3,
2026-02-21T09:05:41.7283207Z  'num_warps': 1,
2026-02-21T09:05:41.7283402Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:05:41.7283614Z  'range_flattens': [False, None],
2026-02-21T09:05:41.7283807Z  'range_multi_buffers': [True, True],
2026-02-21T09:05:41.7283989Z  'range_num_stages': [1, 2],
2026-02-21T09:05:41.7284163Z  'range_unroll_factors': [1, 1],
2026-02-21T09:05:41.7284344Z  'range_warp_specializes': [True, None]}
2026-02-21T09:05:41.7284559Z [161s] Fitting surrogate: 403 points, 403 targets
2026-02-21T09:05:42.2910183Z [161s] Generation 6 starting: 35 neighbors, 3 active search path(s)
2026-02-21T09:05:49.5501646Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 3.3 configs/s
2026-02-21T09:05:51.7390117Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 37/37 17.3 configs/s
2026-02-21T09:05:53.9808965Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 740.1 configs/s
2026-02-21T09:05:54.0722286Z [173s] Generation 6 complete: 
2026-02-21T09:05:54.0724226Z ok=39
2026-02-21T09:05:54.0724741Z min=0.0471
2026-02-21T09:05:54.0724880Z mid=0.0779
2026-02-21T09:05:54.0725002Z max=0.1885
2026-02-21T09:05:54.0725148Z best={'block_sizes': [1, 16384],
2026-02-21T09:05:54.0725389Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:05:54.0725630Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:05:54.0725832Z  'num_sm_multiplier': 8,
2026-02-21T09:05:54.0725991Z  'num_stages': 3,
2026-02-21T09:05:54.0726140Z  'num_warps': 1,
2026-02-21T09:05:54.0726297Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:05:54.0726502Z  'range_flattens': [False, None],
2026-02-21T09:05:54.0726685Z  'range_multi_buffers': [True, True],
2026-02-21T09:05:54.0726881Z  'range_num_stages': [1, 2],
2026-02-21T09:05:54.0727045Z  'range_unroll_factors': [1, 1],
2026-02-21T09:05:54.0727234Z  'range_warp_specializes': [True, None]}
2026-02-21T09:05:54.0740286Z [173s] Fitting surrogate: 442 points, 442 targets
2026-02-21T09:05:54.5853106Z [173s] Generation 7 starting: 31 neighbors, 3 active search path(s)
2026-02-21T09:06:03.7412341Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 1.8 configs/s
2026-02-21T09:06:05.7522185Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 34/34 17.3 configs/s
2026-02-21T09:06:06.6093739Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1154.8 configs/s
2026-02-21T09:06:06.6747921Z [186s] Generation 7 complete: 
2026-02-21T09:06:06.6749967Z ok=35
2026-02-21T09:06:06.6750170Z min=0.0471
2026-02-21T09:06:06.6755765Z mid=0.0799
2026-02-21T09:06:06.6760416Z max=0.8529
2026-02-21T09:06:06.6762537Z best={'block_sizes': [1, 16384],
2026-02-21T09:06:06.6762831Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:06:06.6763091Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:06:06.6763307Z  'num_sm_multiplier': 8,
2026-02-21T09:06:06.6763515Z  'num_stages': 3,
2026-02-21T09:06:06.6763694Z  'num_warps': 1,
2026-02-21T09:06:06.6763864Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:06:06.6764062Z  'range_flattens': [False, None],
2026-02-21T09:06:06.6764253Z  'range_multi_buffers': [True, True],
2026-02-21T09:06:06.6764438Z  'range_num_stages': [1, 2],
2026-02-21T09:06:06.6764619Z  'range_unroll_factors': [1, 1],
2026-02-21T09:06:06.6764803Z  'range_warp_specializes': [True, None]}
2026-02-21T09:06:06.6765031Z [186s] Fitting surrogate: 477 points, 477 targets
2026-02-21T09:06:07.0738038Z [186s] Generation 8 starting: 24 neighbors, 2 active search path(s)
2026-02-21T09:06:12.2128665Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26/26 7.7 configs/s
2026-02-21T09:06:13.7539117Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 26/26 17.4 configs/s
2026-02-21T09:06:13.9319086Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 5051.4 configs/s
2026-02-21T09:06:13.9646547Z [193s] Generation 8 complete: 
2026-02-21T09:06:13.9651006Z ok=27
2026-02-21T09:06:13.9652492Z min=0.0471
2026-02-21T09:06:13.9652662Z mid=0.0830
2026-02-21T09:06:13.9652792Z max=0.4035
2026-02-21T09:06:13.9652951Z best={'block_sizes': [1, 16384],
2026-02-21T09:06:13.9653190Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:06:13.9653445Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:06:13.9653642Z  'num_sm_multiplier': 8,
2026-02-21T09:06:13.9653812Z  'num_stages': 3,
2026-02-21T09:06:13.9653952Z  'num_warps': 1,
2026-02-21T09:06:13.9654118Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:06:13.9654310Z  'range_flattens': [False, None],
2026-02-21T09:06:13.9654500Z  'range_multi_buffers': [True, True],
2026-02-21T09:06:13.9654695Z  'range_num_stages': [1, 2],
2026-02-21T09:06:13.9654861Z  'range_unroll_factors': [1, 1],
2026-02-21T09:06:13.9655049Z  'range_warp_specializes': [True, None]}
2026-02-21T09:06:13.9664189Z [193s] Fitting surrogate: 504 points, 504 targets
2026-02-21T09:06:14.4964959Z [193s] Generation 9 starting: 24 neighbors, 2 active search path(s)
2026-02-21T09:06:19.5566492Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26/26 4.7 configs/s
2026-02-21T09:06:21.0958690Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 26/26 17.4 configs/s
2026-02-21T09:06:21.4956635Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2389.9 configs/s
2026-02-21T09:06:21.5402135Z [200s] Generation 9 complete: 
2026-02-21T09:06:21.5406564Z ok=27
2026-02-21T09:06:21.5410421Z min=0.0472
2026-02-21T09:06:21.5412575Z mid=0.0819
2026-02-21T09:06:21.5412794Z max=0.2621
2026-02-21T09:06:21.5417511Z best={'block_sizes': [1, 16384],
2026-02-21T09:06:21.5421445Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:06:21.5426389Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:06:21.5430340Z  'num_sm_multiplier': 8,
2026-02-21T09:06:21.5434796Z  'num_stages': 3,
2026-02-21T09:06:21.5436154Z  'num_warps': 1,
2026-02-21T09:06:21.5436379Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:06:21.5436607Z  'range_flattens': [False, None],
2026-02-21T09:06:21.5436798Z  'range_multi_buffers': [True, True],
2026-02-21T09:06:21.5437001Z  'range_num_stages': [1, 2],
2026-02-21T09:06:21.5437182Z  'range_unroll_factors': [1, 1],
2026-02-21T09:06:21.5437379Z  'range_warp_specializes': [True, None]}
2026-02-21T09:06:21.5437682Z [200s] Fitting surrogate: 531 points, 531 targets
2026-02-21T09:06:22.0761158Z [201s] Generation 10 starting: 25 neighbors, 2 active search path(s)
2026-02-21T09:06:29.0514388Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 2.6 configs/s
2026-02-21T09:06:30.6542605Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 27/27 17.3 configs/s
2026-02-21T09:06:30.9432560Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3247.1 configs/s
2026-02-21T09:06:30.9819669Z [210s] Generation 10 complete: 
2026-02-21T09:06:30.9825322Z ok=28
2026-02-21T09:06:30.9829345Z min=0.0471
2026-02-21T09:06:30.9830788Z mid=0.0820
2026-02-21T09:06:30.9830954Z max=0.2642
2026-02-21T09:06:30.9831117Z best={'block_sizes': [1, 16384],
2026-02-21T09:06:30.9831366Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:06:30.9831686Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:06:30.9831897Z  'num_sm_multiplier': 8,
2026-02-21T09:06:30.9832062Z  'num_stages': 3,
2026-02-21T09:06:30.9832218Z  'num_warps': 1,
2026-02-21T09:06:30.9832379Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:06:30.9832587Z  'range_flattens': [False, None],
2026-02-21T09:06:30.9832775Z  'range_multi_buffers': [True, True],
2026-02-21T09:06:30.9832972Z  'range_num_stages': [1, 2],
2026-02-21T09:06:30.9833165Z  'range_unroll_factors': [1, 1],
2026-02-21T09:06:30.9833367Z  'range_warp_specializes': [True, None]}
2026-02-21T09:06:30.9840514Z [210s] Fitting surrogate: 559 points, 559 targets
2026-02-21T09:06:31.3663079Z [210s] Generation 11 starting: 11 neighbors, 1 active search path(s)
2026-02-21T09:06:33.9350515Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 9.1 configs/s
2026-02-21T09:06:34.6429847Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 18.1 configs/s
2026-02-21T09:06:34.9283487Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3273.4 configs/s
2026-02-21T09:06:34.9664767Z [214s] Generation 11 complete: 
2026-02-21T09:06:34.9669077Z ok=13
2026-02-21T09:06:34.9673414Z min=0.0472
2026-02-21T09:06:34.9674752Z mid=0.0841
2026-02-21T09:06:34.9674919Z max=0.1188
2026-02-21T09:06:34.9675062Z best={'block_sizes': [1, 16384],
2026-02-21T09:06:34.9675336Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:06:34.9675956Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:06:34.9676157Z  'num_sm_multiplier': 8,
2026-02-21T09:06:34.9676321Z  'num_stages': 3,
2026-02-21T09:06:34.9676461Z  'num_warps': 1,
2026-02-21T09:06:34.9676622Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:06:34.9676814Z  'range_flattens': [False, None],
2026-02-21T09:06:34.9676999Z  'range_multi_buffers': [True, True],
2026-02-21T09:06:34.9677179Z  'range_num_stages': [1, 2],
2026-02-21T09:06:34.9677350Z  'range_unroll_factors': [1, 1],
2026-02-21T09:06:34.9677527Z  'range_warp_specializes': [True, None]}
2026-02-21T09:06:34.9684190Z [214s] Fitting surrogate: 572 points, 572 targets
2026-02-21T09:06:35.3578101Z [214s] Generation 12 starting: 10 neighbors, 1 active search path(s)
2026-02-21T09:06:37.7486899Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 12.1 configs/s
2026-02-21T09:06:38.3553069Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 10/10 17.9 configs/s
2026-02-21T09:06:39.4166286Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 931.0 configs/s
2026-02-21T09:06:39.4938802Z [218s] Generation 12 complete: 
2026-02-21T09:06:39.4939171Z ok=12
2026-02-21T09:06:39.4939344Z min=0.0492
2026-02-21T09:06:39.4939521Z mid=0.0656
2026-02-21T09:06:39.4939661Z max=0.0819
2026-02-21T09:06:39.4939802Z best={'block_sizes': [1, 16384],
2026-02-21T09:06:39.4940025Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:06:39.4940261Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:06:39.4940462Z  'num_sm_multiplier': 8,
2026-02-21T09:06:39.4940618Z  'num_stages': 3,
2026-02-21T09:06:39.4940762Z  'num_warps': 1,
2026-02-21T09:06:39.4940916Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:06:39.4941112Z  'range_flattens': [False, None],
2026-02-21T09:06:39.4941321Z  'range_multi_buffers': [True, True],
2026-02-21T09:06:39.4941523Z  'range_num_stages': [1, 2],
2026-02-21T09:06:39.4941770Z  'range_unroll_factors': [1, 1],
2026-02-21T09:06:39.4956416Z  'range_warp_specializes': [True, None]}
2026-02-21T09:06:39.4956676Z [218s] Fitting surrogate: 584 points, 584 targets
2026-02-21T09:06:39.8923283Z [219s] Generation 13 starting: 8 neighbors, 1 active search path(s)
2026-02-21T09:06:42.4528762Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8/8 5.5 configs/s
2026-02-21T09:06:42.9275971Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 8/8 18.8 configs/s
2026-02-21T09:06:43.3177506Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2425.0 configs/s
2026-02-21T09:06:43.3610161Z [222s] Generation 13 complete: 
2026-02-21T09:06:43.3614531Z ok=10
2026-02-21T09:06:43.3618926Z min=0.0492
2026-02-21T09:06:43.3620438Z mid=0.0789
2026-02-21T09:06:43.3620667Z max=0.1638
2026-02-21T09:06:43.3626442Z best={'block_sizes': [1, 16384],
2026-02-21T09:06:43.3628554Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:06:43.3628829Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:06:43.3629024Z  'num_sm_multiplier': 8,
2026-02-21T09:06:43.3629193Z  'num_stages': 3,
2026-02-21T09:06:43.3629332Z  'num_warps': 1,
2026-02-21T09:06:43.3629496Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:06:43.3629690Z  'range_flattens': [False, None],
2026-02-21T09:06:43.3629876Z  'range_multi_buffers': [True, True],
2026-02-21T09:06:43.3630056Z  'range_num_stages': [1, 2],
2026-02-21T09:06:43.3630229Z  'range_unroll_factors': [1, 1],
2026-02-21T09:06:43.3630403Z  'range_warp_specializes': [True, None]}
2026-02-21T09:06:43.3630699Z [222s] Fitting surrogate: 594 points, 594 targets
2026-02-21T09:06:43.6448541Z [222s] Autotuning complete in 223.0s after searching 575 configs.
2026-02-21T09:06:43.6448887Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:06:43.6450255Z     @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_sm_multiplier=8, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 2], range_unroll_factors=[1, 1], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T09:06:43.6451152Z 
2026-02-21T09:06:43.6451398Z [222s] Code of selected kernel: /tmp/torchinductor_root/sa/csaksjz5whuhhqvf6nkuvevnby6jtj627r5wbemjjclexvgnaj33.py
2026-02-21T09:06:44.5165236Z WARNING:tritonbench.utils.triton_op:Completed input ID 97:
2026-02-21T09:06:44.5165603Z (M, N)
2026-02-21T09:06:44.5165782Z -------------
2026-02-21T09:06:44.5165943Z (4096, 12672)
2026-02-21T09:06:44.5166032Z 
2026-02-21T09:06:44.5166392Z 100%|██████████| 20/20 [57:42<00:00, 197.61s/it]
2026-02-21T09:06:44.5166654Z 100%|██████████| 20/20 [57:42<00:00, 173.10s/it]
2026-02-21T09:06:44.5205140Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmp7ibdmmxl.csv
2026-02-21T09:06:46.3381407Z        (M, N)    triton_softmax-speedup    triton_softmax-accuracy    torch_compile_softmax-speedup    torch_compile_softmax-accuracy    helion_softmax_tritonbench-speedup    helion_softmax_tritonbench-accuracy
2026-02-21T09:06:46.3383249Z -------------  ------------------------  -------------------------  -------------------------------  --------------------------------  ------------------------------------  -------------------------------------
2026-02-21T09:06:46.3383923Z   (4096, 256)                  0.927277                          1                         1.50605                                  1                               1.6272                                       1
2026-02-21T09:06:46.3384479Z   (4096, 896)                  1.85317                           1                         1.43865                                  1                               2.22536                                      1
2026-02-21T09:06:46.3384978Z  (4096, 1536)                  3.53206                           1                         2.22187                                  1                               4.50606                                      1
2026-02-21T09:06:46.3385446Z  (4096, 2176)                  2.35532                           1                         1.5938                                   1                               4.63032                                      1
2026-02-21T09:06:46.3385922Z  (4096, 2816)                  2.41506                           1                         1.64293                                  1                               4.06486                                      1
2026-02-21T09:06:46.3386393Z  (4096, 3584)                  2.67651                           1                         1.54743                                  1                               3.49984                                      1
2026-02-21T09:06:46.3386861Z  (4096, 4224)                  3.73652                           1                         1.97417                                  1                               5.01296                                      1
2026-02-21T09:06:46.3387347Z  (4096, 4864)                  3.79746                           1                         1.86131                                  1                               4.97742                                      1
2026-02-21T09:06:46.3387810Z  (4096, 5504)                  4.08316                           1                         1.88111                                  1                               4.88052                                      1
2026-02-21T09:06:46.3388257Z  (4096, 6144)                  4.13037                           1                         2.09229                                  1                               4.47552                                      1
2026-02-21T09:06:46.3388725Z  (4096, 6784)                  4.28609                           1                         1.67387                                  1                               4.58957                                      1
2026-02-21T09:06:46.3389559Z  (4096, 7424)                  4.91419                           1                         1.86457                                  1                               4.83533                                      1
2026-02-21T09:06:46.3390034Z  (4096, 8064)                  4.83901                           1                         1.78632                                  1                               4.77175                                      1
2026-02-21T09:06:46.3390510Z  (4096, 8704)                  2.68658                           1                         1.91263                                  1                               4.13071                                      1
2026-02-21T09:06:46.3391136Z  (4096, 9344)                  1.74183                           1                         0.986338                                 1                               2.44567                                      1
2026-02-21T09:06:46.3391675Z (4096, 10112)                  1.74482                           1                         0.949176                                 1                               2.40527                                      1
2026-02-21T09:06:46.3392150Z (4096, 10752)                  1.73486                           1                         1.06122                                  1                               2.25434                                      1
2026-02-21T09:06:46.3392626Z (4096, 11392)                  1.74539                           1                         0.864912                                 1                               2.25341                                      1
2026-02-21T09:06:46.3393099Z (4096, 12032)                  1.7451                            1                         0.843066                                 1                               2.20091                                      1
2026-02-21T09:06:46.3393573Z (4096, 12672)                  1.7628                            1                         0.834627                                 1                               2.05979                                      1
2026-02-21T09:06:46.3394052Z       average                  2.83538                           1                         1.52682                                  1                               3.59234                                      1
2026-02-21T09:08:54.4878461Z Using num_inputs=20 for softmax
2026-02-21T09:08:54.4998627Z Running softmax benchmark with Helion implementation...
2026-02-21T09:08:54.5000316Z 
2026-02-21T09:08:54.7235003Z Equally-spaced-k mode: Selected 20 equally spaced inputs (total available: 98)
2026-02-21T09:08:54.7236984Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 5, 10, 15, 20, 26, 31, 36, 41, 46, 51, 56, 61, 66, 71, 77, 82, 87, 92, 97]
2026-02-21T09:08:54.7243691Z 
2026-02-21T09:08:54.7251174Z   0%|          | 0/20 [00:00<?, ?it/s]WARNING:tritonbench.utils.triton_op:Running input ID 0:
2026-02-21T09:08:54.7252229Z (M, N)
2026-02-21T09:08:54.7252369Z -----------
2026-02-21T09:08:54.7252517Z (4096, 256)
2026-02-21T09:08:54.7252806Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T09:08:56.3809000Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for triton_softmax
2026-02-21T09:08:58.1393229Z INFO:tritonbench.utils.triton_op:Took 38.29ms to get benchmark function for torch_compile_softmax
2026-02-21T09:08:59.6085900Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:08:59.6090379Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:08:59.6095865Z               'dtype': 'torch.float16',
2026-02-21T09:08:59.6100587Z               'shape': (4096, 256),
2026-02-21T09:08:59.6105334Z               'stride': (256, 1)},),
2026-02-21T09:08:59.6110603Z   'kwargs': {}}
2026-02-21T09:08:59.6112249Z INFO:tritonbench.utils.triton_op:Took 0.79ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:08:59.8490816Z [0s] Autotune random seed: 2138408546
2026-02-21T09:08:59.8775141Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:09:23.9217699Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.4 configs/s
2026-02-21T09:09:25.5305681Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T09:09:25.5307155Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:09:25.5307674Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T09:09:25.5307875Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:09:25.5308070Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:09:25.5308265Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T09:09:25.5308857Z     %cst = arith.constant dense<0.000000e+00> : tensor<128xf32>
2026-02-21T09:09:25.5309176Z     %cst_0 = arith.constant dense<0xFF800000> : tensor<128xf32>
2026-02-21T09:09:25.5309414Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:09:25.5309612Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:09:25.5309803Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:09:25.5310000Z     %c256_i64 = arith.constant 256 : i64
2026-02-21T09:09:25.5310188Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T09:09:25.5310509Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c256_i32], [%c256_i64, %c1_i64] : <f16>, <tensor<128x16xf16>>
2026-02-21T09:09:25.5310981Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c256_i32], [%c256_i64, %c1_i64] : <f16>, <tensor<128x16xf16>>
2026-02-21T09:09:25.5311321Z     %2 = tt.get_program_id x : i32
2026-02-21T09:09:25.5311720Z     scf.for %arg2 = %2 to %c32_i32 step %c9472_i32  : i32 {
2026-02-21T09:09:25.5311971Z       %3 = arith.muli %arg2, %c128_i32 : i32
2026-02-21T09:09:25.5312172Z       %c240_i32 = arith.constant 240 : i32
2026-02-21T09:09:25.5312376Z       %c48_i32 = arith.constant 48 : i32
2026-02-21T09:09:25.5312749Z       %4:2 = scf.for %arg3 = %c0_i32 to %c240_i32 step %c48_i32 iter_args(%arg4 = %cst_0, %arg5 = %cst) -> (tensor<128xf32>, tensor<128xf32>)  : i32 {
2026-02-21T09:09:25.5313225Z         %33 = tt.descriptor_load %0[%3, %arg3] : !tt.tensordesc<tensor<128x16xf16>> -> tensor<128x16xf16>
2026-02-21T09:09:25.5313555Z         %34 = arith.extf %33 : tensor<128x16xf16> to tensor<128x16xf32>
2026-02-21T09:09:25.5313835Z         %35 = "tt.reduce"(%34) <{axis = 1 : i32}> ({
2026-02-21T09:09:25.5314035Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:09:25.5314220Z           %91 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:09:25.5314417Z           tt.reduce.return %91 : f32
2026-02-21T09:09:25.5314611Z         }) : (tensor<128x16xf32>) -> tensor<128xf32>
2026-02-21T09:09:25.5314836Z         %36 = arith.truncf %35 : tensor<128xf32> to tensor<128xf16>
2026-02-21T09:09:25.5315093Z         %37 = arith.extf %36 : tensor<128xf16> to tensor<128xf32>
2026-02-21T09:09:25.5315486Z         %38 = arith.cmpf ogt, %arg4, %37 : tensor<128xf32>
2026-02-21T09:09:25.5315722Z         %39 = arith.cmpf une, %arg4, %arg4 : tensor<128xf32>
2026-02-21T09:09:25.5315940Z         %40 = arith.ori %38, %39 : tensor<128xi1>
2026-02-21T09:09:25.5316182Z         %41 = arith.select %40, %arg4, %37 : tensor<128xi1>, tensor<128xf32>
2026-02-21T09:09:25.5316436Z         %42 = arith.subf %arg4, %41 : tensor<128xf32>
2026-02-21T09:09:25.5316804Z         %43 = tt.extern_elementwise %42 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T09:09:25.5317175Z         %44 = arith.mulf %arg5, %43 : tensor<128xf32>
2026-02-21T09:09:25.5317437Z         %45 = tt.expand_dims %41 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T09:09:25.5317741Z         %46 = tt.broadcast %45 : tensor<128x1xf32> -> tensor<128x16xf32>
2026-02-21T09:09:25.5317981Z         %47 = arith.subf %34, %46 : tensor<128x16xf32>
2026-02-21T09:09:25.5318375Z         %48 = tt.extern_elementwise %47 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x16xf32>) -> tensor<128x16xf32>
2026-02-21T09:09:25.5318744Z         %49 = "tt.reduce"(%48) <{axis = 1 : i32}> ({
2026-02-21T09:09:25.5318936Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:09:25.5319125Z           %91 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:09:25.5319315Z           tt.reduce.return %91 : f32
2026-02-21T09:09:25.5319499Z         }) : (tensor<128x16xf32>) -> tensor<128xf32>
2026-02-21T09:09:25.5319705Z         %50 = arith.addf %44, %49 : tensor<128xf32>
2026-02-21T09:09:25.5319896Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T09:09:25.5320087Z         %51 = arith.muli %c16_i32, %c1_i32 : i32
2026-02-21T09:09:25.5320273Z         %52 = arith.addi %arg3, %51 : i32
2026-02-21T09:09:25.5320549Z         %53 = tt.descriptor_load %0[%3, %52] : !tt.tensordesc<tensor<128x16xf16>> -> tensor<128x16xf16>
2026-02-21T09:09:25.5320937Z         %54 = arith.extf %53 : tensor<128x16xf16> to tensor<128x16xf32>
2026-02-21T09:09:25.5321169Z         %55 = "tt.reduce"(%54) <{axis = 1 : i32}> ({
2026-02-21T09:09:25.5321363Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:09:25.5321577Z           %91 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:09:25.5321773Z           tt.reduce.return %91 : f32
2026-02-21T09:09:25.5321956Z         }) : (tensor<128x16xf32>) -> tensor<128xf32>
2026-02-21T09:09:25.5322185Z         %56 = arith.truncf %55 : tensor<128xf32> to tensor<128xf16>
2026-02-21T09:09:25.5322428Z         %57 = arith.extf %56 : tensor<128xf16> to tensor<128xf32>
2026-02-21T09:09:25.5322664Z         %58 = arith.cmpf ogt, %41, %57 : tensor<128xf32>
2026-02-21T09:09:25.5322884Z         %59 = arith.cmpf une, %41, %41 : tensor<128xf32>
2026-02-21T09:09:25.5323089Z         %60 = arith.ori %58, %59 : tensor<128xi1>
2026-02-21T09:09:25.5323326Z         %61 = arith.select %60, %41, %57 : tensor<128xi1>, tensor<128xf32>
2026-02-21T09:09:25.5323560Z         %62 = arith.subf %41, %61 : tensor<128xf32>
2026-02-21T09:09:25.5323920Z         %63 = tt.extern_elementwise %62 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T09:09:25.5324281Z         %64 = arith.mulf %50, %63 : tensor<128xf32>
2026-02-21T09:09:25.5324529Z         %65 = tt.expand_dims %61 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T09:09:25.5324830Z         %66 = tt.broadcast %65 : tensor<128x1xf32> -> tensor<128x16xf32>
2026-02-21T09:09:25.5325068Z         %67 = arith.subf %54, %66 : tensor<128x16xf32>
2026-02-21T09:09:25.5325429Z         %68 = tt.extern_elementwise %67 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x16xf32>) -> tensor<128x16xf32>
2026-02-21T09:09:25.5325781Z         %69 = "tt.reduce"(%68) <{axis = 1 : i32}> ({
2026-02-21T09:09:25.5325973Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:09:25.5326162Z           %91 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:09:25.5326419Z           tt.reduce.return %91 : f32
2026-02-21T09:09:25.5326616Z         }) : (tensor<128x16xf32>) -> tensor<128xf32>
2026-02-21T09:09:25.5326817Z         %70 = arith.addf %64, %69 : tensor<128xf32>
2026-02-21T09:09:25.5327020Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:09:25.5327210Z         %71 = arith.muli %c16_i32, %c2_i32 : i32
2026-02-21T09:09:25.5327409Z         %72 = arith.addi %arg3, %71 : i32
2026-02-21T09:09:25.5327687Z         %73 = tt.descriptor_load %0[%3, %72] : !tt.tensordesc<tensor<128x16xf16>> -> tensor<128x16xf16>
2026-02-21T09:09:25.5328009Z         %74 = arith.extf %73 : tensor<128x16xf16> to tensor<128x16xf32>
2026-02-21T09:09:25.5328247Z         %75 = "tt.reduce"(%74) <{axis = 1 : i32}> ({
2026-02-21T09:09:25.5328438Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:09:25.5328633Z           %91 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:09:25.5328827Z           tt.reduce.return %91 : f32
2026-02-21T09:09:25.5329026Z         }) : (tensor<128x16xf32>) -> tensor<128xf32>
2026-02-21T09:09:25.5329254Z         %76 = arith.truncf %75 : tensor<128xf32> to tensor<128xf16>
2026-02-21T09:09:25.5329515Z         %77 = arith.extf %76 : tensor<128xf16> to tensor<128xf32>
2026-02-21T09:09:25.5329758Z         %78 = arith.cmpf ogt, %61, %77 : tensor<128xf32>
2026-02-21T09:09:25.5329980Z         %79 = arith.cmpf une, %61, %61 : tensor<128xf32>
2026-02-21T09:09:25.5330193Z         %80 = arith.ori %78, %79 : tensor<128xi1>
2026-02-21T09:09:25.5330425Z         %81 = arith.select %80, %61, %77 : tensor<128xi1>, tensor<128xf32>
2026-02-21T09:09:25.5330668Z         %82 = arith.subf %61, %81 : tensor<128xf32>
2026-02-21T09:09:25.5331019Z         %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T09:09:25.5331388Z         %84 = arith.mulf %70, %83 : tensor<128xf32>
2026-02-21T09:09:25.5331729Z         %85 = tt.expand_dims %81 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T09:09:25.5332022Z         %86 = tt.broadcast %85 : tensor<128x1xf32> -> tensor<128x16xf32>
2026-02-21T09:09:25.5332262Z         %87 = arith.subf %74, %86 : tensor<128x16xf32>
2026-02-21T09:09:25.5332615Z         %88 = tt.extern_elementwise %87 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x16xf32>) -> tensor<128x16xf32>
2026-02-21T09:09:25.5332980Z         %89 = "tt.reduce"(%88) <{axis = 1 : i32}> ({
2026-02-21T09:09:25.5333174Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:09:25.5333351Z           %91 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:09:25.5333542Z           tt.reduce.return %91 : f32
2026-02-21T09:09:25.5333725Z         }) : (tensor<128x16xf32>) -> tensor<128xf32>
2026-02-21T09:09:25.5333922Z         %90 = arith.addf %84, %89 : tensor<128xf32>
2026-02-21T09:09:25.5334157Z         scf.yield %81, %90 : tensor<128xf32>, tensor<128xf32>
2026-02-21T09:09:25.5334381Z       } {tt.num_stages = 1 : i32}
2026-02-21T09:09:25.5334675Z       %5 = tt.descriptor_load %0[%3, %c240_i32] : !tt.tensordesc<tensor<128x16xf16>> -> tensor<128x16xf16>
2026-02-21T09:09:25.5335018Z       %6 = arith.extf %5 : tensor<128x16xf16> to tensor<128x16xf32>
2026-02-21T09:09:25.5335259Z       %7 = "tt.reduce"(%6) <{axis = 1 : i32}> ({
2026-02-21T09:09:25.5335451Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:09:25.5335643Z         %33 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:09:25.5335834Z         tt.reduce.return %33 : f32
2026-02-21T09:09:25.5336031Z       }) : (tensor<128x16xf32>) -> tensor<128xf32>
2026-02-21T09:09:25.5336256Z       %8 = arith.truncf %7 : tensor<128xf32> to tensor<128xf16>
2026-02-21T09:09:25.5336509Z       %9 = arith.extf %8 : tensor<128xf16> to tensor<128xf32>
2026-02-21T09:09:25.5336745Z       %10 = arith.cmpf ogt, %4#0, %9 : tensor<128xf32>
2026-02-21T09:09:25.5336968Z       %11 = arith.cmpf une, %4#0, %4#0 : tensor<128xf32>
2026-02-21T09:09:25.5337187Z       %12 = arith.ori %10, %11 : tensor<128xi1>
2026-02-21T09:09:25.5337423Z       %13 = arith.select %12, %4#0, %9 : tensor<128xi1>, tensor<128xf32>
2026-02-21T09:09:25.5337792Z       %14 = arith.subf %4#0, %13 : tensor<128xf32>
2026-02-21T09:09:25.5338158Z       %15 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128xf32>) -> tensor<128xf32>
2026-02-21T09:09:25.5338531Z       %16 = arith.mulf %4#1, %15 : tensor<128xf32>
2026-02-21T09:09:25.5338798Z       %17 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T09:09:25.5339101Z       %18 = tt.broadcast %17 : tensor<128x1xf32> -> tensor<128x16xf32>
2026-02-21T09:09:25.5339349Z       %19 = arith.subf %6, %18 : tensor<128x16xf32>
2026-02-21T09:09:25.5339725Z       %20 = tt.extern_elementwise %19 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x16xf32>) -> tensor<128x16xf32>
2026-02-21T09:09:25.5340108Z       %21 = "tt.reduce"(%20) <{axis = 1 : i32}> ({
2026-02-21T09:09:25.5340312Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:09:25.5340498Z         %33 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:09:25.5340702Z         tt.reduce.return %33 : f32
2026-02-21T09:09:25.5340889Z       }) : (tensor<128x16xf32>) -> tensor<128xf32>
2026-02-21T09:09:25.5341094Z       %22 = arith.addf %16, %21 : tensor<128xf32>
2026-02-21T09:09:25.5341292Z       %c240_i32_1 = arith.constant 240 : i32
2026-02-21T09:09:25.5341495Z       %c48_i32_2 = arith.constant 48 : i32
2026-02-21T09:09:25.5341771Z       scf.for %arg3 = %c0_i32 to %c240_i32_1 step %c48_i32_2  : i32 {
2026-02-21T09:09:25.5342104Z         %33 = tt.descriptor_load %0[%3, %arg3] : !tt.tensordesc<tensor<128x16xf16>> -> tensor<128x16xf16>
2026-02-21T09:09:25.5342473Z         %34 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T09:09:25.5342781Z         %35 = arith.extf %33 : tensor<128x16xf16> to tensor<128x16xf32>
2026-02-21T09:09:25.5343059Z         %36 = tt.broadcast %34 : tensor<128x1xf32> -> tensor<128x16xf32>
2026-02-21T09:09:25.5343361Z         %37 = arith.subf %35, %36 : tensor<128x16xf32>
2026-02-21T09:09:25.5343733Z         %38 = tt.extern_elementwise %37 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x16xf32>) -> tensor<128x16xf32>
2026-02-21T09:09:25.5344151Z         %39 = tt.expand_dims %22 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T09:09:25.5344442Z         %40 = tt.broadcast %39 : tensor<128x1xf32> -> tensor<128x16xf32>
2026-02-21T09:09:25.5344691Z         %41 = arith.divf %38, %40 : tensor<128x16xf32>
2026-02-21T09:09:25.5344927Z         %42 = arith.truncf %41 : tensor<128x16xf32> to tensor<128x16xf16>
2026-02-21T09:09:25.5345250Z         tt.descriptor_store %1[%3, %arg3], %42 : !tt.tensordesc<tensor<128x16xf16>>, tensor<128x16xf16>
2026-02-21T09:09:25.5345560Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T09:09:25.5345746Z         %43 = arith.muli %c16_i32, %c1_i32 : i32
2026-02-21T09:09:25.5345939Z         %44 = arith.addi %arg3, %43 : i32
2026-02-21T09:09:25.5346207Z         %45 = tt.descriptor_load %0[%3, %44] : !tt.tensordesc<tensor<128x16xf16>> -> tensor<128x16xf16>
2026-02-21T09:09:25.5346547Z         %46 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T09:09:25.5346833Z         %47 = arith.extf %45 : tensor<128x16xf16> to tensor<128x16xf32>
2026-02-21T09:09:25.5347086Z         %48 = tt.broadcast %46 : tensor<128x1xf32> -> tensor<128x16xf32>
2026-02-21T09:09:25.5347324Z         %49 = arith.subf %47, %48 : tensor<128x16xf32>
2026-02-21T09:09:25.5347684Z         %50 = tt.extern_elementwise %49 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x16xf32>) -> tensor<128x16xf32>
2026-02-21T09:09:25.5348101Z         %51 = tt.expand_dims %22 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T09:09:25.5348395Z         %52 = tt.broadcast %51 : tensor<128x1xf32> -> tensor<128x16xf32>
2026-02-21T09:09:25.5348629Z         %53 = arith.divf %50, %52 : tensor<128x16xf32>
2026-02-21T09:09:25.5348866Z         %54 = arith.truncf %53 : tensor<128x16xf32> to tensor<128x16xf16>
2026-02-21T09:09:25.5349234Z         tt.descriptor_store %1[%3, %44], %54 : !tt.tensordesc<tensor<128x16xf16>>, tensor<128x16xf16>
2026-02-21T09:09:25.5349521Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:09:25.5349706Z         %55 = arith.muli %c16_i32, %c2_i32 : i32
2026-02-21T09:09:25.5349899Z         %56 = arith.addi %arg3, %55 : i32
2026-02-21T09:09:25.5350164Z         %57 = tt.descriptor_load %0[%3, %56] : !tt.tensordesc<tensor<128x16xf16>> -> tensor<128x16xf16>
2026-02-21T09:09:25.5350492Z         %58 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T09:09:25.5350778Z         %59 = arith.extf %57 : tensor<128x16xf16> to tensor<128x16xf32>
2026-02-21T09:09:25.5351030Z         %60 = tt.broadcast %58 : tensor<128x1xf32> -> tensor<128x16xf32>
2026-02-21T09:09:25.5351262Z         %61 = arith.subf %59, %60 : tensor<128x16xf32>
2026-02-21T09:09:25.5351674Z         %62 = tt.extern_elementwise %61 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x16xf32>) -> tensor<128x16xf32>
2026-02-21T09:09:25.5352091Z         %63 = tt.expand_dims %22 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T09:09:25.5352377Z         %64 = tt.broadcast %63 : tensor<128x1xf32> -> tensor<128x16xf32>
2026-02-21T09:09:25.5352603Z         %65 = arith.divf %62, %64 : tensor<128x16xf32>
2026-02-21T09:09:25.5352838Z         %66 = arith.truncf %65 : tensor<128x16xf32> to tensor<128x16xf16>
2026-02-21T09:09:25.5353145Z         tt.descriptor_store %1[%3, %56], %66 : !tt.tensordesc<tensor<128x16xf16>>, tensor<128x16xf16>
2026-02-21T09:09:25.5353426Z       } {tt.num_stages = 1 : i32}
2026-02-21T09:09:25.5353707Z       %23 = tt.descriptor_load %0[%3, %c240_i32_1] : !tt.tensordesc<tensor<128x16xf16>> -> tensor<128x16xf16>
2026-02-21T09:09:25.5354057Z       %24 = tt.expand_dims %13 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T09:09:25.5354402Z       %25 = arith.extf %23 : tensor<128x16xf16> to tensor<128x16xf32>
2026-02-21T09:09:25.5354664Z       %26 = tt.broadcast %24 : tensor<128x1xf32> -> tensor<128x16xf32>
2026-02-21T09:09:25.5354903Z       %27 = arith.subf %25, %26 : tensor<128x16xf32>
2026-02-21T09:09:25.5355261Z       %28 = tt.extern_elementwise %27 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x16xf32>) -> tensor<128x16xf32>
2026-02-21T09:09:25.5355672Z       %29 = tt.expand_dims %22 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
2026-02-21T09:09:25.5355963Z       %30 = tt.broadcast %29 : tensor<128x1xf32> -> tensor<128x16xf32>
2026-02-21T09:09:25.5356189Z       %31 = arith.divf %28, %30 : tensor<128x16xf32>
2026-02-21T09:09:25.5356430Z       %32 = arith.truncf %31 : tensor<128x16xf32> to tensor<128x16xf16>
2026-02-21T09:09:25.5356755Z       tt.descriptor_store %1[%3, %c240_i32_1], %32 : !tt.tensordesc<tensor<128x16xf16>>, tensor<128x16xf16>
2026-02-21T09:09:25.5357154Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T09:09:25.5357447Z     tt.return
2026-02-21T09:09:25.5357576Z   }
2026-02-21T09:09:25.5357703Z }
2026-02-21T09:09:25.5357770Z 
2026-02-21T09:09:25.5357820Z {-#
2026-02-21T09:09:25.5357951Z   external_resources: {
2026-02-21T09:09:25.5358108Z     mlir_reproducer: {
2026-02-21T09:09:25.5362519Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T09:09:25.5366969Z       disable_threading: false,
2026-02-21T09:09:25.5367134Z       verify_each: true
2026-02-21T09:09:25.5367289Z     }
2026-02-21T09:09:25.5367412Z   }
2026-02-21T09:09:25.5367522Z #-}
2026-02-21T09:09:25.5367945Z /tmp/torchinductor_root/gh/cghsz2gbngaiott6x5qitkdnthg7svazopk76hcnafw7y6j7mcjy.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:09:25.5369145Z /tmp/torchinductor_root/gh/cghsz2gbngaiott6x5qitkdnthg7svazopk76hcnafw7y6j7mcjy.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:09:25.5370125Z [25s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:09:25.5371311Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], maxnreg=32, num_sm_multiplier=64, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[1, 1], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T09:09:25.5372389Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:09:25.5372643Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:09:29.6949885Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 17.5 configs/s
2026-02-21T09:09:29.6960638Z [29s] Adaptive compile timeout: 30s (90% percentile=1.9s, bounds=[30.0s, 60s])
2026-02-21T09:09:30.2080681Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1966.4 configs/s
2026-02-21T09:09:30.2622884Z [30s] Initial random population of 100, 5 starting points: 
2026-02-21T09:09:30.2626578Z error=6
2026-02-21T09:09:30.2631720Z ok=94
2026-02-21T09:09:30.2634781Z min=0.0082
2026-02-21T09:09:30.2639265Z mid=0.0410
2026-02-21T09:09:30.2643050Z max=5.7293
2026-02-21T09:09:30.2644694Z best={'block_sizes': [4, 64],
2026-02-21T09:09:30.2644948Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:09:30.2645172Z  'load_eviction_policies': ['first', 'last'],
2026-02-21T09:09:30.2645369Z  'num_stages': 6,
2026-02-21T09:09:30.2645510Z  'num_warps': 2,
2026-02-21T09:09:30.2645663Z  'pid_type': 'flat',
2026-02-21T09:09:30.2645823Z  'range_flattens': [None, True],
2026-02-21T09:09:30.2646007Z  'range_multi_buffers': [None, False],
2026-02-21T09:09:30.2646199Z  'range_num_stages': [0, 0],
2026-02-21T09:09:30.2646362Z  'range_unroll_factors': [0, 0],
2026-02-21T09:09:30.2646549Z  'range_warp_specializes': [None, True]}
2026-02-21T09:09:30.2646876Z [30s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:09:31.4714609Z [31s] Generation 1 starting: 88 neighbors, 5 active search path(s)
2026-02-21T09:09:34.5019606Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 41.8 configs/s
2026-02-21T09:09:40.3688521Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 15.8 configs/s
2026-02-21T09:09:43.9366883Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 300.3 configs/s
2026-02-21T09:09:44.2836482Z [44s] Generation 1 complete: 
2026-02-21T09:09:44.2838116Z ok=94
2026-02-21T09:09:44.2838293Z min=0.0081
2026-02-21T09:09:44.2838421Z mid=0.0083
2026-02-21T09:09:44.2838550Z max=0.1045
2026-02-21T09:09:44.2838689Z best={'block_sizes': [4, 256],
2026-02-21T09:09:44.2838906Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:09:44.2839122Z  'load_eviction_policies': ['first', 'last'],
2026-02-21T09:09:44.2839315Z  'num_stages': 6,
2026-02-21T09:09:44.2839453Z  'num_warps': 1,
2026-02-21T09:09:44.2839598Z  'pid_type': 'flat',
2026-02-21T09:09:44.2840091Z  'range_flattens': [None, True],
2026-02-21T09:09:44.2840315Z  'range_multi_buffers': [None, None],
2026-02-21T09:09:44.2840509Z  'range_num_stages': [0, 0],
2026-02-21T09:09:44.2840678Z  'range_unroll_factors': [0, 0],
2026-02-21T09:09:44.2840871Z  'range_warp_specializes': [None, True]}
2026-02-21T09:09:44.2851197Z [44s] Fitting surrogate: 194 points, 194 targets
2026-02-21T09:09:45.3382249Z [45s] Generation 2 starting: 73 neighbors, 5 active search path(s)
2026-02-21T09:09:48.2370869Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 14.6 configs/s
2026-02-21T09:09:52.8535083Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.2 configs/s
2026-02-21T09:09:56.1650093Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 328.1 configs/s
2026-02-21T09:09:56.4695112Z [56s] Generation 2 complete: 
2026-02-21T09:09:56.4698188Z ok=78
2026-02-21T09:09:56.4702114Z min=0.0062
2026-02-21T09:09:56.4705959Z mid=0.0082
2026-02-21T09:09:56.4707588Z max=0.0348
2026-02-21T09:09:56.4707805Z best={'block_sizes': [4, 256],
2026-02-21T09:09:56.4713615Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:09:56.4718012Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:09:56.4722367Z  'num_stages': 3,
2026-02-21T09:09:56.4725817Z  'num_warps': 1,
2026-02-21T09:09:56.4729244Z  'pid_type': 'flat',
2026-02-21T09:09:56.4733037Z  'range_flattens': [None, True],
2026-02-21T09:09:56.4737952Z  'range_multi_buffers': [None, True],
2026-02-21T09:09:56.4742352Z  'range_num_stages': [0, 0],
2026-02-21T09:09:56.4743660Z  'range_unroll_factors': [0, 0],
2026-02-21T09:09:56.4743910Z  'range_warp_specializes': [None, True]}
2026-02-21T09:09:56.4744199Z [56s] Fitting surrogate: 272 points, 272 targets
2026-02-21T09:09:57.4476336Z [57s] Generation 3 starting: 63 neighbors, 5 active search path(s)
2026-02-21T09:09:59.6350663Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 66/66 109.8 configs/s
2026-02-21T09:10:03.7739959Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 66/66 16.1 configs/s
2026-02-21T09:10:07.0550346Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 336.6 configs/s
2026-02-21T09:10:07.3819968Z [67s] Generation 3 complete: 
2026-02-21T09:10:07.3822051Z ok=68
2026-02-21T09:10:07.3829241Z min=0.0062
2026-02-21T09:10:07.3835086Z mid=0.0082
2026-02-21T09:10:07.3840383Z max=0.0164
2026-02-21T09:10:07.3844720Z best={'block_sizes': [4, 256],
2026-02-21T09:10:07.3849501Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:10:07.3849860Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:10:07.3850090Z  'num_stages': 4,
2026-02-21T09:10:07.3855152Z  'num_warps': 1,
2026-02-21T09:10:07.3857356Z  'pid_type': 'flat',
2026-02-21T09:10:07.3857569Z  'range_flattens': [None, True],
2026-02-21T09:10:07.3857819Z  'range_multi_buffers': [None, True],
2026-02-21T09:10:07.3858490Z  'range_num_stages': [0, 4],
2026-02-21T09:10:07.3858678Z  'range_unroll_factors': [0, 0],
2026-02-21T09:10:07.3858878Z  'range_warp_specializes': [None, True]}
2026-02-21T09:10:07.3859192Z [67s] Fitting surrogate: 340 points, 340 targets
2026-02-21T09:10:08.1969515Z [68s] Generation 4 starting: 58 neighbors, 5 active search path(s)
2026-02-21T09:10:10.2752319Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60/60 85.8 configs/s
2026-02-21T09:10:14.0334632Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 60/60 16.1 configs/s
2026-02-21T09:10:17.3706909Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 326.9 configs/s
2026-02-21T09:10:17.6706522Z [77s] Generation 4 complete: 
2026-02-21T09:10:17.6710856Z ok=63
2026-02-21T09:10:17.6715350Z min=0.0062
2026-02-21T09:10:17.6719760Z mid=0.0081
2026-02-21T09:10:17.6721884Z max=0.0102
2026-02-21T09:10:17.6722178Z best={'block_sizes': [4, 256],
2026-02-21T09:10:17.6726384Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:10:17.6727722Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:10:17.6727960Z  'num_stages': 4,
2026-02-21T09:10:17.6728103Z  'num_warps': 1,
2026-02-21T09:10:17.6728250Z  'pid_type': 'flat',
2026-02-21T09:10:17.6728414Z  'range_flattens': [None, True],
2026-02-21T09:10:17.6728592Z  'range_multi_buffers': [None, True],
2026-02-21T09:10:17.6728782Z  'range_num_stages': [0, 4],
2026-02-21T09:10:17.6728944Z  'range_unroll_factors': [0, 0],
2026-02-21T09:10:17.6729126Z  'range_warp_specializes': [None, True]}
2026-02-21T09:10:17.6729415Z [77s] Fitting surrogate: 403 points, 403 targets
2026-02-21T09:10:18.5847506Z [78s] Generation 5 starting: 62 neighbors, 5 active search path(s)
2026-02-21T09:10:20.9560974Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64/64 83.7 configs/s
2026-02-21T09:10:24.9232996Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 64/64 16.3 configs/s
2026-02-21T09:10:28.3169303Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 318.2 configs/s
2026-02-21T09:10:28.6658002Z [88s] Generation 5 complete: 
2026-02-21T09:10:28.6662023Z ok=67
2026-02-21T09:10:28.6663163Z min=0.0062
2026-02-21T09:10:28.6663332Z mid=0.0062
2026-02-21T09:10:28.6663464Z max=0.0163
2026-02-21T09:10:28.6663603Z best={'block_sizes': [4, 256],
2026-02-21T09:10:28.6663820Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:10:28.6664047Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:10:28.6664242Z  'num_stages': 7,
2026-02-21T09:10:28.6664380Z  'num_warps': 1,
2026-02-21T09:10:28.6664523Z  'pid_type': 'flat',
2026-02-21T09:10:28.6664680Z  'range_flattens': [None, True],
2026-02-21T09:10:28.6664863Z  'range_multi_buffers': [None, None],
2026-02-21T09:10:28.6665106Z  'range_num_stages': [0, 0],
2026-02-21T09:10:28.6665280Z  'range_unroll_factors': [0, 1],
2026-02-21T09:10:28.6665467Z  'range_warp_specializes': [None, True]}
2026-02-21T09:10:28.6674168Z [88s] Fitting surrogate: 470 points, 470 targets
2026-02-21T09:10:29.5749900Z [89s] Generation 6 starting: 61 neighbors, 5 active search path(s)
2026-02-21T09:10:31.6266966Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 42.5 configs/s
2026-02-21T09:10:35.4802861Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 62/62 16.3 configs/s
2026-02-21T09:10:38.8673392Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 317.9 configs/s
2026-02-21T09:10:39.1798041Z [99s] Generation 6 complete: 
2026-02-21T09:10:39.1802303Z ok=66
2026-02-21T09:10:39.1806758Z min=0.0062
2026-02-21T09:10:39.1812133Z mid=0.0080
2026-02-21T09:10:39.1816468Z max=0.0143
2026-02-21T09:10:39.1818191Z best={'block_sizes': [4, 256],
2026-02-21T09:10:39.1818878Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:10:39.1819126Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:10:39.1819325Z  'num_stages': 7,
2026-02-21T09:10:39.1819466Z  'num_warps': 1,
2026-02-21T09:10:39.1819614Z  'pid_type': 'flat',
2026-02-21T09:10:39.1819774Z  'range_flattens': [None, True],
2026-02-21T09:10:39.1819962Z  'range_multi_buffers': [None, None],
2026-02-21T09:10:39.1820150Z  'range_num_stages': [0, 0],
2026-02-21T09:10:39.1820322Z  'range_unroll_factors': [0, 1],
2026-02-21T09:10:39.1820503Z  'range_warp_specializes': [None, True]}
2026-02-21T09:10:39.1820773Z [99s] Fitting surrogate: 536 points, 536 targets
2026-02-21T09:10:40.0533261Z [100s] Generation 7 starting: 53 neighbors, 4 active search path(s)
2026-02-21T09:10:41.8706036Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56/56 86.1 configs/s
2026-02-21T09:10:45.3338030Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 56/56 16.4 configs/s
2026-02-21T09:10:48.2256521Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 378.1 configs/s
2026-02-21T09:10:48.5049297Z [108s] Generation 7 complete: 
2026-02-21T09:10:48.5050908Z ok=58
2026-02-21T09:10:48.5051088Z min=0.0062
2026-02-21T09:10:48.5051227Z mid=0.0062
2026-02-21T09:10:48.5051363Z max=0.0143
2026-02-21T09:10:48.5051508Z best={'block_sizes': [4, 256],
2026-02-21T09:10:48.5052083Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:10:48.5052378Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:10:48.5052581Z  'num_stages': 7,
2026-02-21T09:10:48.5052738Z  'num_warps': 2,
2026-02-21T09:10:48.5052886Z  'pid_type': 'flat',
2026-02-21T09:10:48.5053058Z  'range_flattens': [None, True],
2026-02-21T09:10:48.5053244Z  'range_multi_buffers': [None, None],
2026-02-21T09:10:48.5053510Z  'range_num_stages': [0, 0],
2026-02-21T09:10:48.5053695Z  'range_unroll_factors': [0, 1],
2026-02-21T09:10:48.5053885Z  'range_warp_specializes': [None, True]}
2026-02-21T09:10:48.5065616Z [108s] Fitting surrogate: 594 points, 594 targets
2026-02-21T09:10:48.9429267Z [109s] Generation 8 starting: 12 neighbors, 1 active search path(s)
2026-02-21T09:10:49.4447848Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 34.7 configs/s
2026-02-21T09:10:50.1819474Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 12/12 17.3 configs/s
2026-02-21T09:10:50.8089587Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1614.6 configs/s
2026-02-21T09:10:50.8769151Z [110s] Generation 8 complete: 
2026-02-21T09:10:50.8773591Z ok=13
2026-02-21T09:10:50.8777989Z min=0.0062
2026-02-21T09:10:50.8780104Z mid=0.0062
2026-02-21T09:10:50.8780317Z max=0.0081
2026-02-21T09:10:50.8785717Z best={'block_sizes': [4, 256],
2026-02-21T09:10:50.8789052Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:10:50.8794014Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:10:50.8798281Z  'num_stages': 7,
2026-02-21T09:10:50.8799576Z  'num_warps': 2,
2026-02-21T09:10:50.8799761Z  'pid_type': 'flat',
2026-02-21T09:10:50.8799940Z  'range_flattens': [None, True],
2026-02-21T09:10:50.8800121Z  'range_multi_buffers': [None, None],
2026-02-21T09:10:50.8800313Z  'range_num_stages': [0, 0],
2026-02-21T09:10:50.8800486Z  'range_unroll_factors': [0, 1],
2026-02-21T09:10:50.8800662Z  'range_warp_specializes': [None, True]}
2026-02-21T09:10:50.8800959Z [111s] Fitting surrogate: 607 points, 607 targets
2026-02-21T09:10:51.2915657Z [111s] Generation 9 starting: 11 neighbors, 1 active search path(s)
2026-02-21T09:10:51.8188003Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 36.5 configs/s
2026-02-21T09:10:52.4968052Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 11/11 17.4 configs/s
2026-02-21T09:10:53.0794189Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1735.4 configs/s
2026-02-21T09:10:53.1456019Z [113s] Generation 9 complete: 
2026-02-21T09:10:53.1459889Z ok=13
2026-02-21T09:10:53.1464778Z min=0.0061
2026-02-21T09:10:53.1466572Z mid=0.0062
2026-02-21T09:10:53.1466742Z max=0.0123
2026-02-21T09:10:53.1466887Z best={'block_sizes': [4, 256],
2026-02-21T09:10:53.1467148Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:10:53.1467412Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:10:53.1467608Z  'num_stages': 7,
2026-02-21T09:10:53.1467762Z  'num_warps': 4,
2026-02-21T09:10:53.1467906Z  'pid_type': 'flat',
2026-02-21T09:10:53.1468079Z  'range_flattens': [None, True],
2026-02-21T09:10:53.1468261Z  'range_multi_buffers': [None, None],
2026-02-21T09:10:53.1468451Z  'range_num_stages': [0, 0],
2026-02-21T09:10:53.1468649Z  'range_unroll_factors': [0, 1],
2026-02-21T09:10:53.1468833Z  'range_warp_specializes': [None, True]}
2026-02-21T09:10:53.1473024Z [113s] Fitting surrogate: 620 points, 620 targets
2026-02-21T09:10:53.4231841Z [113s] Autotuning complete in 113.5s after searching 597 configs.
2026-02-21T09:10:53.4232273Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:10:53.4233290Z     @helion.kernel(config=helion.Config(block_sizes=[4, 256], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=7, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T09:10:53.4234158Z 
2026-02-21T09:10:53.4234418Z [113s] Code of selected kernel: /tmp/torchinductor_root/fk/cfksxei47ym7ts5y5qdyesq4ngw3aqn56gffr2islwommctdkqzr.py
2026-02-21T09:10:53.4440495Z from __future__ import annotations
2026-02-21T09:10:53.4440720Z 
2026-02-21T09:10:53.4445608Z import torch
2026-02-21T09:10:53.4450729Z import triton
2026-02-21T09:10:53.4455836Z import triton.language as tl
2026-02-21T09:10:53.4460401Z from torch._inductor.runtime import triton_helpers
2026-02-21T09:10:53.4462040Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T09:10:53.4462418Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:10:53.4466954Z 
2026-02-21T09:10:53.4470448Z _BLOCK_SIZE_0 = tl.constexpr(4)
2026-02-21T09:10:53.4475210Z _BLOCK_SIZE_1 = tl.constexpr(256)
2026-02-21T09:10:53.4479668Z 
2026-02-21T09:10:53.4482514Z @triton.jit
2026-02-21T09:10:53.4484921Z def _helion_softmax_two_pass(x, out):
2026-02-21T09:10:53.4485279Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:10:53.4489296Z     pid_0 = tl.program_id(0)
2026-02-21T09:10:53.4493868Z     offset_0 = pid_0 * _BLOCK_SIZE_0
2026-02-21T09:10:53.4495437Z     indices_0 = (offset_0 + tl.arange(0, _BLOCK_SIZE_0)).to(tl.int32)
2026-02-21T09:10:53.4496116Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:10:53.4496414Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T09:10:53.4496687Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:10:53.4496944Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T09:10:53.4497199Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:10:53.4497476Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T09:10:53.4497726Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T09:10:53.4497961Z     # src[softmax.py:82-89]: ...
2026-02-21T09:10:53.4498269Z     for offset_2 in tl.range(0, 256, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=True, flatten=True):
2026-02-21T09:10:53.4498741Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:10:53.4498988Z         mi_copy = mi
2026-02-21T09:10:53.4499131Z         di_copy = di
2026-02-21T09:10:53.4499282Z         mi_copy_0 = mi_copy
2026-02-21T09:10:53.4499433Z         di_copy_0 = di_copy
2026-02-21T09:10:53.4499620Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T09:10:53.4499953Z         values = tl.load(x + (indices_0[:, None] * 256 + indices_2[None, :] * 1), None, eviction_policy='evict_first')
2026-02-21T09:10:53.4500313Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T09:10:53.4500568Z         local_amax = tl.cast(tl.max(values, 1), tl.float16)
2026-02-21T09:10:53.4500819Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T09:10:53.4501058Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T09:10:53.4501266Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T09:10:53.4501528Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:10:53.4501845Z         v_2 = mi_copy_0 - v_1
2026-02-21T09:10:53.4502038Z         v_3 = libdevice.exp(v_2)
2026-02-21T09:10:53.4502219Z         v_4 = di_copy_0 * v_3
2026-02-21T09:10:53.4502411Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T09:10:53.4502628Z         subscript = v_1[:, None]
2026-02-21T09:10:53.4502808Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T09:10:53.4503005Z         v_6 = v_5 - subscript
2026-02-21T09:10:53.4503220Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:10:53.4503488Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T09:10:53.4503699Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T09:10:53.4503889Z         v_7 = libdevice.exp(v_6)
2026-02-21T09:10:53.4504077Z         sum_1 = tl.cast(tl.sum(v_7, 1), tl.float32)
2026-02-21T09:10:53.4504268Z         di = v_4 + sum_1
2026-02-21T09:10:53.4504452Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T09:10:53.4504622Z         mi = v_1
2026-02-21T09:10:53.4504831Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:10:53.4505098Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T09:10:53.4505390Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:10:53.4505794Z     for offset_2 in tl.range(0, 256, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=True, flatten=True):
2026-02-21T09:10:53.4506152Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:10:53.4506383Z         mi_copy_1 = mi
2026-02-21T09:10:53.4506527Z         di_copy_1 = di
2026-02-21T09:10:53.4506681Z         mi_copy_1_0 = mi_copy_1
2026-02-21T09:10:53.4506843Z         di_copy_1_0 = di_copy_1
2026-02-21T09:10:53.4507030Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T09:10:53.4507363Z         values_1 = tl.load(x + (indices_0[:, None] * 256 + indices_2[None, :] * 1), None, eviction_policy='evict_first')
2026-02-21T09:10:53.4507851Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:10:53.4508134Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T09:10:53.4508322Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T09:10:53.4508508Z         v_10 = v_9 - subscript_1
2026-02-21T09:10:53.4508676Z         v_11 = libdevice.exp(v_10)
2026-02-21T09:10:53.4508857Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T09:10:53.4509040Z         v_12 = v_11 / subscript_2
2026-02-21T09:10:53.4509210Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T09:10:53.4509459Z         tl.store(out + (indices_0[:, None] * 256 + indices_2[None, :] * 1), v_13, None)
2026-02-21T09:10:53.4509653Z 
2026-02-21T09:10:53.4509780Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T09:10:53.4510017Z     """
2026-02-21T09:10:53.4510216Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T09:10:53.4510594Z     This version uses fewer passes but is less numerically stable.
2026-02-21T09:10:53.4510818Z     Args:
2026-02-21T09:10:53.4510975Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T09:10:53.4511173Z     Returns:
2026-02-21T09:10:53.4511348Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T09:10:53.4511620Z     """
2026-02-21T09:10:53.4511757Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T09:10:53.4511939Z     m, n = x.size()
2026-02-21T09:10:53.4512102Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T09:10:53.4512310Z     out = torch.empty_like(x)
2026-02-21T09:10:53.4512545Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:10:53.4512778Z     _BLOCK_SIZE_0 = 4
2026-02-21T09:10:53.4512998Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:10:53.4513313Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:10:53.4513635Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:10:53.4513873Z     # src[softmax.py:79-92]: ...
2026-02-21T09:10:53.4514191Z     _launcher(_helion_softmax_two_pass, (triton.cdiv(4096, _BLOCK_SIZE_0),), x, out, num_warps=4, num_stages=7)
2026-02-21T09:10:53.4514524Z     # src[softmax.py:93]: return out
2026-02-21T09:10:53.4514688Z     return out
2026-02-21T09:10:54.0853893Z WARNING:tritonbench.utils.triton_op:Completed input ID 0:
2026-02-21T09:10:54.0857722Z (M, N)
2026-02-21T09:10:54.0862995Z -----------
2026-02-21T09:10:54.0867384Z (4096, 256)
2026-02-21T09:10:54.0867544Z 
2026-02-21T09:10:54.0872734Z   5%|▌         | 1/20 [01:59<37:47, 119.36s/it]WARNING:tritonbench.utils.triton_op:Running input ID 5:
2026-02-21T09:10:54.0876178Z (M, N)
2026-02-21T09:10:54.0876429Z -----------
2026-02-21T09:10:54.0876598Z (4096, 896)
2026-02-21T09:10:54.0876883Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T09:10:55.7142604Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T09:10:57.0473996Z INFO:tritonbench.utils.triton_op:Took 2.49ms to get benchmark function for torch_compile_softmax
2026-02-21T09:10:58.0773184Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:10:58.0777576Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:10:58.0779169Z               'dtype': 'torch.float16',
2026-02-21T09:10:58.0779458Z               'shape': (4096, 896),
2026-02-21T09:10:58.0785225Z               'stride': (896, 1)},),
2026-02-21T09:10:58.0787416Z   'kwargs': {}}
2026-02-21T09:10:58.0787819Z INFO:tritonbench.utils.triton_op:Took 1.85ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:10:58.2521767Z [0s] Autotune random seed: 2138408546
2026-02-21T09:10:58.2773509Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:11:30.4749298Z [32s] Timeout after 30s compiling Config(block_sizes=[256, 512], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=16, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[True, True], range_num_stages=[3, 4], range_unroll_factors=[0, 2], range_warp_specializes=[True, None])
2026-02-21T09:11:30.4767479Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.0 configs/s
2026-02-21T09:11:34.1974490Z module {
2026-02-21T09:11:34.1976880Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:11:34.1977446Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:11:34.1977684Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:11:34.1977912Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:11:34.1978161Z     %cst = arith.constant dense<896> : tensor<64x1xi32>
2026-02-21T09:11:34.1978856Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<64xf32>
2026-02-21T09:11:34.1979194Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<64xf32>
2026-02-21T09:11:34.1979461Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:11:34.1979687Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:11:34.1979901Z     %c896_i32 = arith.constant 896 : i32
2026-02-21T09:11:34.1980099Z     %c896_i64 = arith.constant 896 : i64
2026-02-21T09:11:34.1980290Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T09:11:34.1980633Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c896_i32], [%c896_i64, %c1_i64] : <f16>, <tensor<64x128xf16>>
2026-02-21T09:11:34.1981100Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c896_i32], [%c896_i64, %c1_i64] : <f16>, <tensor<64x128xf16>>
2026-02-21T09:11:34.1981455Z     %2 = tt.get_program_id x : i32
2026-02-21T09:11:34.1981748Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T09:11:34.1981959Z     %4 = arith.minsi %3, %c64_i32 : i32
2026-02-21T09:11:34.1982201Z     scf.for %arg2 = %2 to %4 step %c1_i32  : i32 {
2026-02-21T09:11:34.1982445Z       %5 = arith.muli %arg2, %c64_i32 : i32
2026-02-21T09:11:34.1982719Z       %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T09:11:34.1983012Z       %7 = tt.splat %5 : i32 -> tensor<64xi32>
2026-02-21T09:11:34.1983245Z       %8 = arith.addi %7, %6 : tensor<64xi32>
2026-02-21T09:11:34.1983463Z       %c768_i32 = arith.constant 768 : i32
2026-02-21T09:11:34.1983685Z       %c384_i32 = arith.constant 384 : i32
2026-02-21T09:11:34.1984109Z       %9:2 = scf.for %arg3 = %c0_i32 to %c768_i32 step %c384_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<64xf32>, tensor<64xf32>)  : i32 {
2026-02-21T09:11:34.1984586Z         %49 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:11:34.1984911Z         %50 = tt.splat %arg3 : i32 -> tensor<128xi32>
2026-02-21T09:11:34.1985152Z         %51 = arith.addi %50, %49 : tensor<128xi32>
2026-02-21T09:11:34.1985455Z         %52 = tt.expand_dims %8 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
2026-02-21T09:11:34.1985819Z         %53 = arith.muli %52, %cst : tensor<64x1xi32>
2026-02-21T09:11:34.1986119Z         %54 = tt.expand_dims %51 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:11:34.1986456Z         %55 = tt.broadcast %53 : tensor<64x1xi32> -> tensor<64x128xi32>
2026-02-21T09:11:34.1986745Z         %56 = tt.broadcast %54 : tensor<1x128xi32> -> tensor<64x128xi32>
2026-02-21T09:11:34.1986996Z         %57 = arith.addi %55, %56 : tensor<64x128xi32>
2026-02-21T09:11:34.1987258Z         %58 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<64x128x!tt.ptr<f16>>
2026-02-21T09:11:34.1987570Z         %59 = tt.addptr %58, %57 : tensor<64x128x!tt.ptr<f16>>, tensor<64x128xi32>
2026-02-21T09:11:34.1987846Z         %60 = tt.load %59 : tensor<64x128x!tt.ptr<f16>>
2026-02-21T09:11:34.1988102Z         %61 = arith.extf %60 : tensor<64x128xf16> to tensor<64x128xf32>
2026-02-21T09:11:34.1988351Z         %62 = "tt.reduce"(%61) <{axis = 1 : i32}> ({
2026-02-21T09:11:34.1988716Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:11:34.1988922Z           %140 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:11:34.1989141Z           tt.reduce.return %140 : f32
2026-02-21T09:11:34.1989353Z         }) : (tensor<64x128xf32>) -> tensor<64xf32>
2026-02-21T09:11:34.1989594Z         %63 = arith.truncf %62 : tensor<64xf32> to tensor<64xf16>
2026-02-21T09:11:34.1989864Z         %64 = arith.extf %63 : tensor<64xf16> to tensor<64xf32>
2026-02-21T09:11:34.1990113Z         %65 = arith.cmpf ogt, %arg4, %64 : tensor<64xf32>
2026-02-21T09:11:34.1990370Z         %66 = arith.cmpf une, %arg4, %arg4 : tensor<64xf32>
2026-02-21T09:11:34.1990607Z         %67 = arith.ori %65, %66 : tensor<64xi1>
2026-02-21T09:11:34.1990875Z         %68 = arith.select %67, %arg4, %64 : tensor<64xi1>, tensor<64xf32>
2026-02-21T09:11:34.1991145Z         %69 = arith.subf %arg4, %68 : tensor<64xf32>
2026-02-21T09:11:34.1991626Z         %70 = tt.extern_elementwise %69 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64xf32>) -> tensor<64xf32>
2026-02-21T09:11:34.1992030Z         %71 = arith.mulf %arg5, %70 : tensor<64xf32>
2026-02-21T09:11:34.1992304Z         %72 = tt.expand_dims %68 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32>
2026-02-21T09:11:34.1992629Z         %73 = tt.broadcast %72 : tensor<64x1xf32> -> tensor<64x128xf32>
2026-02-21T09:11:34.1992887Z         %74 = arith.subf %61, %73 : tensor<64x128xf32>
2026-02-21T09:11:34.1993283Z         %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32>
2026-02-21T09:11:34.1993702Z         %76 = "tt.reduce"(%75) <{axis = 1 : i32}> ({
2026-02-21T09:11:34.1993923Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:11:34.1994144Z           %140 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:11:34.1994363Z           tt.reduce.return %140 : f32
2026-02-21T09:11:34.1994586Z         }) : (tensor<64x128xf32>) -> tensor<64xf32>
2026-02-21T09:11:34.1994817Z         %77 = arith.addf %71, %76 : tensor<64xf32>
2026-02-21T09:11:34.1995052Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T09:11:34.1995278Z         %78 = arith.muli %c128_i32, %c1_i32_4 : i32
2026-02-21T09:11:34.1995495Z         %79 = arith.addi %arg3, %78 : i32
2026-02-21T09:11:34.1995769Z         %80 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:11:34.1996055Z         %81 = tt.splat %79 : i32 -> tensor<128xi32>
2026-02-21T09:11:34.1996289Z         %82 = arith.addi %81, %80 : tensor<128xi32>
2026-02-21T09:11:34.1996569Z         %83 = tt.expand_dims %8 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
2026-02-21T09:11:34.1996874Z         %84 = arith.muli %83, %cst : tensor<64x1xi32>
2026-02-21T09:11:34.1997175Z         %85 = tt.expand_dims %82 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:11:34.1997485Z         %86 = tt.broadcast %84 : tensor<64x1xi32> -> tensor<64x128xi32>
2026-02-21T09:11:34.1997769Z         %87 = tt.broadcast %85 : tensor<1x128xi32> -> tensor<64x128xi32>
2026-02-21T09:11:34.1998023Z         %88 = arith.addi %86, %87 : tensor<64x128xi32>
2026-02-21T09:11:34.1998281Z         %89 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<64x128x!tt.ptr<f16>>
2026-02-21T09:11:34.1998578Z         %90 = tt.addptr %89, %88 : tensor<64x128x!tt.ptr<f16>>, tensor<64x128xi32>
2026-02-21T09:11:34.1998858Z         %91 = tt.load %90 : tensor<64x128x!tt.ptr<f16>>
2026-02-21T09:11:34.1999113Z         %92 = arith.extf %91 : tensor<64x128xf16> to tensor<64x128xf32>
2026-02-21T09:11:34.1999357Z         %93 = "tt.reduce"(%92) <{axis = 1 : i32}> ({
2026-02-21T09:11:34.1999579Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:11:34.1999789Z           %140 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:11:34.2000015Z           tt.reduce.return %140 : f32
2026-02-21T09:11:34.2000225Z         }) : (tensor<64x128xf32>) -> tensor<64xf32>
2026-02-21T09:11:34.2000488Z         %94 = arith.truncf %93 : tensor<64xf32> to tensor<64xf16>
2026-02-21T09:11:34.2000774Z         %95 = arith.extf %94 : tensor<64xf16> to tensor<64xf32>
2026-02-21T09:11:34.2001100Z         %96 = arith.cmpf ogt, %68, %95 : tensor<64xf32>
2026-02-21T09:11:34.2001348Z         %97 = arith.cmpf une, %68, %68 : tensor<64xf32>
2026-02-21T09:11:34.2001608Z         %98 = arith.ori %96, %97 : tensor<64xi1>
2026-02-21T09:11:34.2001874Z         %99 = arith.select %98, %68, %95 : tensor<64xi1>, tensor<64xf32>
2026-02-21T09:11:34.2002122Z         %100 = arith.subf %68, %99 : tensor<64xf32>
2026-02-21T09:11:34.2002514Z         %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64xf32>) -> tensor<64xf32>
2026-02-21T09:11:34.2002913Z         %102 = arith.mulf %77, %101 : tensor<64xf32>
2026-02-21T09:11:34.2003192Z         %103 = tt.expand_dims %99 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32>
2026-02-21T09:11:34.2003507Z         %104 = tt.broadcast %103 : tensor<64x1xf32> -> tensor<64x128xf32>
2026-02-21T09:11:34.2003824Z         %105 = arith.subf %92, %104 : tensor<64x128xf32>
2026-02-21T09:11:34.2004245Z         %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32>
2026-02-21T09:11:34.2004704Z         %107 = "tt.reduce"(%106) <{axis = 1 : i32}> ({
2026-02-21T09:11:34.2004946Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:11:34.2005163Z           %140 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:11:34.2005382Z           tt.reduce.return %140 : f32
2026-02-21T09:11:34.2005602Z         }) : (tensor<64x128xf32>) -> tensor<64xf32>
2026-02-21T09:11:34.2005846Z         %108 = arith.addf %102, %107 : tensor<64xf32>
2026-02-21T09:11:34.2006079Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:11:34.2006302Z         %109 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T09:11:34.2006544Z         %110 = arith.addi %arg3, %109 : i32
2026-02-21T09:11:34.2006824Z         %111 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:11:34.2007142Z         %112 = tt.splat %110 : i32 -> tensor<128xi32>
2026-02-21T09:11:34.2007389Z         %113 = arith.addi %112, %111 : tensor<128xi32>
2026-02-21T09:11:34.2007717Z         %114 = tt.expand_dims %8 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
2026-02-21T09:11:34.2008045Z         %115 = arith.muli %114, %cst : tensor<64x1xi32>
2026-02-21T09:11:34.2008372Z         %116 = tt.expand_dims %113 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:11:34.2008737Z         %117 = tt.broadcast %115 : tensor<64x1xi32> -> tensor<64x128xi32>
2026-02-21T09:11:34.2009055Z         %118 = tt.broadcast %116 : tensor<1x128xi32> -> tensor<64x128xi32>
2026-02-21T09:11:34.2009348Z         %119 = arith.addi %117, %118 : tensor<64x128xi32>
2026-02-21T09:11:34.2009635Z         %120 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<64x128x!tt.ptr<f16>>
2026-02-21T09:11:34.2009973Z         %121 = tt.addptr %120, %119 : tensor<64x128x!tt.ptr<f16>>, tensor<64x128xi32>
2026-02-21T09:11:34.2010251Z         %122 = tt.load %121 : tensor<64x128x!tt.ptr<f16>>
2026-02-21T09:11:34.2010503Z         %123 = arith.extf %122 : tensor<64x128xf16> to tensor<64x128xf32>
2026-02-21T09:11:34.2010747Z         %124 = "tt.reduce"(%123) <{axis = 1 : i32}> ({
2026-02-21T09:11:34.2010948Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:11:34.2011134Z           %140 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:11:34.2011335Z           tt.reduce.return %140 : f32
2026-02-21T09:11:34.2011520Z         }) : (tensor<64x128xf32>) -> tensor<64xf32>
2026-02-21T09:11:34.2011789Z         %125 = arith.truncf %124 : tensor<64xf32> to tensor<64xf16>
2026-02-21T09:11:34.2012035Z         %126 = arith.extf %125 : tensor<64xf16> to tensor<64xf32>
2026-02-21T09:11:34.2012283Z         %127 = arith.cmpf ogt, %99, %126 : tensor<64xf32>
2026-02-21T09:11:34.2012518Z         %128 = arith.cmpf une, %99, %99 : tensor<64xf32>
2026-02-21T09:11:34.2012733Z         %129 = arith.ori %127, %128 : tensor<64xi1>
2026-02-21T09:11:34.2012992Z         %130 = arith.select %129, %99, %126 : tensor<64xi1>, tensor<64xf32>
2026-02-21T09:11:34.2013287Z         %131 = arith.subf %99, %130 : tensor<64xf32>
2026-02-21T09:11:34.2013654Z         %132 = tt.extern_elementwise %131 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64xf32>) -> tensor<64xf32>
2026-02-21T09:11:34.2014022Z         %133 = arith.mulf %108, %132 : tensor<64xf32>
2026-02-21T09:11:34.2014279Z         %134 = tt.expand_dims %130 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32>
2026-02-21T09:11:34.2014580Z         %135 = tt.broadcast %134 : tensor<64x1xf32> -> tensor<64x128xf32>
2026-02-21T09:11:34.2014822Z         %136 = arith.subf %123, %135 : tensor<64x128xf32>
2026-02-21T09:11:34.2015199Z         %137 = tt.extern_elementwise %136 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32>
2026-02-21T09:11:34.2015569Z         %138 = "tt.reduce"(%137) <{axis = 1 : i32}> ({
2026-02-21T09:11:34.2015845Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:11:34.2016036Z           %140 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:11:34.2016220Z           tt.reduce.return %140 : f32
2026-02-21T09:11:34.2016409Z         }) : (tensor<64x128xf32>) -> tensor<64xf32>
2026-02-21T09:11:34.2016609Z         %139 = arith.addf %133, %138 : tensor<64xf32>
2026-02-21T09:11:34.2016835Z         scf.yield %130, %139 : tensor<64xf32>, tensor<64xf32>
2026-02-21T09:11:34.2017054Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:11:34.2017309Z       %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:11:34.2017577Z       %11 = tt.splat %c768_i32 : i32 -> tensor<128xi32>
2026-02-21T09:11:34.2017786Z       %12 = arith.addi %11, %10 : tensor<128xi32>
2026-02-21T09:11:34.2018049Z       %13 = tt.expand_dims %8 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
2026-02-21T09:11:34.2018309Z       %14 = arith.muli %13, %cst : tensor<64x1xi32>
2026-02-21T09:11:34.2018580Z       %15 = tt.expand_dims %12 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:11:34.2018873Z       %16 = tt.broadcast %14 : tensor<64x1xi32> -> tensor<64x128xi32>
2026-02-21T09:11:34.2019147Z       %17 = tt.broadcast %15 : tensor<1x128xi32> -> tensor<64x128xi32>
2026-02-21T09:11:34.2019385Z       %18 = arith.addi %16, %17 : tensor<64x128xi32>
2026-02-21T09:11:34.2019616Z       %19 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<64x128x!tt.ptr<f16>>
2026-02-21T09:11:34.2019897Z       %20 = tt.addptr %19, %18 : tensor<64x128x!tt.ptr<f16>>, tensor<64x128xi32>
2026-02-21T09:11:34.2020141Z       %21 = tt.load %20 : tensor<64x128x!tt.ptr<f16>>
2026-02-21T09:11:34.2020375Z       %22 = arith.extf %21 : tensor<64x128xf16> to tensor<64x128xf32>
2026-02-21T09:11:34.2020601Z       %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({
2026-02-21T09:11:34.2020799Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:11:34.2020988Z         %49 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:11:34.2021175Z         tt.reduce.return %49 : f32
2026-02-21T09:11:34.2021365Z       }) : (tensor<64x128xf32>) -> tensor<64xf32>
2026-02-21T09:11:34.2021631Z       %24 = arith.truncf %23 : tensor<64xf32> to tensor<64xf16>
2026-02-21T09:11:34.2021893Z       %25 = arith.extf %24 : tensor<64xf16> to tensor<64xf32>
2026-02-21T09:11:34.2022130Z       %26 = arith.cmpf ogt, %9#0, %25 : tensor<64xf32>
2026-02-21T09:11:34.2022364Z       %27 = arith.cmpf une, %9#0, %9#0 : tensor<64xf32>
2026-02-21T09:11:34.2022592Z       %28 = arith.ori %26, %27 : tensor<64xi1>
2026-02-21T09:11:34.2022815Z       %29 = arith.select %28, %9#0, %25 : tensor<64xi1>, tensor<64xf32>
2026-02-21T09:11:34.2023049Z       %30 = arith.subf %9#0, %29 : tensor<64xf32>
2026-02-21T09:11:34.2023391Z       %31 = tt.extern_elementwise %30 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64xf32>) -> tensor<64xf32>
2026-02-21T09:11:34.2023740Z       %32 = arith.mulf %9#1, %31 : tensor<64xf32>
2026-02-21T09:11:34.2023984Z       %33 = tt.expand_dims %29 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32>
2026-02-21T09:11:34.2024327Z       %34 = tt.broadcast %33 : tensor<64x1xf32> -> tensor<64x128xf32>
2026-02-21T09:11:34.2024564Z       %35 = arith.subf %22, %34 : tensor<64x128xf32>
2026-02-21T09:11:34.2024914Z       %36 = tt.extern_elementwise %35 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32>
2026-02-21T09:11:34.2025271Z       %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({
2026-02-21T09:11:34.2025456Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:11:34.2025638Z         %49 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:11:34.2025818Z         tt.reduce.return %49 : f32
2026-02-21T09:11:34.2026002Z       }) : (tensor<64x128xf32>) -> tensor<64xf32>
2026-02-21T09:11:34.2026195Z       %38 = arith.addf %32, %37 : tensor<64xf32>
2026-02-21T09:11:34.2026383Z       %c768_i32_2 = arith.constant 768 : i32
2026-02-21T09:11:34.2026575Z       %c384_i32_3 = arith.constant 384 : i32
2026-02-21T09:11:34.2026853Z       scf.for %arg3 = %c0_i32 to %c768_i32_2 step %c384_i32_3  : i32 {
2026-02-21T09:11:34.2027193Z         %49 = tt.descriptor_load %0[%5, %arg3] : !tt.tensordesc<tensor<64x128xf16>> -> tensor<64x128xf16>
2026-02-21T09:11:34.2027540Z         %50 = tt.expand_dims %29 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32>
2026-02-21T09:11:34.2027839Z         %51 = arith.extf %49 : tensor<64x128xf16> to tensor<64x128xf32>
2026-02-21T09:11:34.2028107Z         %52 = tt.broadcast %50 : tensor<64x1xf32> -> tensor<64x128xf32>
2026-02-21T09:11:34.2028343Z         %53 = arith.subf %51, %52 : tensor<64x128xf32>
2026-02-21T09:11:34.2028716Z         %54 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32>
2026-02-21T09:11:34.2029127Z         %55 = tt.expand_dims %38 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32>
2026-02-21T09:11:34.2029421Z         %56 = tt.broadcast %55 : tensor<64x1xf32> -> tensor<64x128xf32>
2026-02-21T09:11:34.2029666Z         %57 = arith.divf %54, %56 : tensor<64x128xf32>
2026-02-21T09:11:34.2029903Z         %58 = arith.truncf %57 : tensor<64x128xf32> to tensor<64x128xf16>
2026-02-21T09:11:34.2030234Z         tt.descriptor_store %1[%5, %arg3], %58 : !tt.tensordesc<tensor<64x128xf16>>, tensor<64x128xf16>
2026-02-21T09:11:34.2030535Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T09:11:34.2030736Z         %59 = arith.muli %c128_i32, %c1_i32_4 : i32
2026-02-21T09:11:34.2030927Z         %60 = arith.addi %arg3, %59 : i32
2026-02-21T09:11:34.2031202Z         %61 = tt.descriptor_load %0[%5, %60] : !tt.tensordesc<tensor<64x128xf16>> -> tensor<64x128xf16>
2026-02-21T09:11:34.2031583Z         %62 = tt.expand_dims %29 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32>
2026-02-21T09:11:34.2031884Z         %63 = arith.extf %61 : tensor<64x128xf16> to tensor<64x128xf32>
2026-02-21T09:11:34.2032161Z         %64 = tt.broadcast %62 : tensor<64x1xf32> -> tensor<64x128xf32>
2026-02-21T09:11:34.2032411Z         %65 = arith.subf %63, %64 : tensor<64x128xf32>
2026-02-21T09:11:34.2032821Z         %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32>
2026-02-21T09:11:34.2033233Z         %67 = tt.expand_dims %38 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32>
2026-02-21T09:11:34.2033518Z         %68 = tt.broadcast %67 : tensor<64x1xf32> -> tensor<64x128xf32>
2026-02-21T09:11:34.2033771Z         %69 = arith.divf %66, %68 : tensor<64x128xf32>
2026-02-21T09:11:34.2034018Z         %70 = arith.truncf %69 : tensor<64x128xf32> to tensor<64x128xf16>
2026-02-21T09:11:34.2034358Z         tt.descriptor_store %1[%5, %60], %70 : !tt.tensordesc<tensor<64x128xf16>>, tensor<64x128xf16>
2026-02-21T09:11:34.2034658Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:11:34.2034866Z         %71 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T09:11:34.2035073Z         %72 = arith.addi %arg3, %71 : i32
2026-02-21T09:11:34.2035357Z         %73 = tt.descriptor_load %0[%5, %72] : !tt.tensordesc<tensor<64x128xf16>> -> tensor<64x128xf16>
2026-02-21T09:11:34.2035775Z         %74 = tt.expand_dims %29 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32>
2026-02-21T09:11:34.2036073Z         %75 = arith.extf %73 : tensor<64x128xf16> to tensor<64x128xf32>
2026-02-21T09:11:34.2036355Z         %76 = tt.broadcast %74 : tensor<64x1xf32> -> tensor<64x128xf32>
2026-02-21T09:11:34.2036600Z         %77 = arith.subf %75, %76 : tensor<64x128xf32>
2026-02-21T09:11:34.2037015Z         %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32>
2026-02-21T09:11:34.2037458Z         %79 = tt.expand_dims %38 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32>
2026-02-21T09:11:34.2037755Z         %80 = tt.broadcast %79 : tensor<64x1xf32> -> tensor<64x128xf32>
2026-02-21T09:11:34.2038011Z         %81 = arith.divf %78, %80 : tensor<64x128xf32>
2026-02-21T09:11:34.2038264Z         %82 = arith.truncf %81 : tensor<64x128xf32> to tensor<64x128xf16>
2026-02-21T09:11:34.2038666Z         tt.descriptor_store %1[%5, %72], %82 : !tt.tensordesc<tensor<64x128xf16>>, tensor<64x128xf16>
2026-02-21T09:11:34.2038983Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:11:34.2039307Z       %39 = tt.descriptor_load %0[%5, %c768_i32_2] : !tt.tensordesc<tensor<64x128xf16>> -> tensor<64x128xf16>
2026-02-21T09:11:34.2039700Z       %40 = tt.expand_dims %29 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32>
2026-02-21T09:11:34.2040023Z       %41 = arith.extf %39 : tensor<64x128xf16> to tensor<64x128xf32>
2026-02-21T09:11:34.2040305Z       %42 = tt.broadcast %40 : tensor<64x1xf32> -> tensor<64x128xf32>
2026-02-21T09:11:34.2040566Z       %43 = arith.subf %41, %42 : tensor<64x128xf32>
2026-02-21T09:11:34.2040962Z       %44 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32>
2026-02-21T09:11:34.2041414Z       %45 = tt.expand_dims %38 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32>
2026-02-21T09:11:34.2041742Z       %46 = tt.broadcast %45 : tensor<64x1xf32> -> tensor<64x128xf32>
2026-02-21T09:11:34.2041982Z       %47 = arith.divf %44, %46 : tensor<64x128xf32>
2026-02-21T09:11:34.2042222Z       %48 = arith.truncf %47 : tensor<64x128xf32> to tensor<64x128xf16>
2026-02-21T09:11:34.2042574Z       tt.descriptor_store %1[%5, %c768_i32_2], %48 : !tt.tensordesc<tensor<64x128xf16>>, tensor<64x128xf16>
2026-02-21T09:11:34.2042931Z     } {tt.loop_unroll_factor = 1 : i32, tt.warp_specialize}
2026-02-21T09:11:34.2043152Z     tt.return
2026-02-21T09:11:34.2043296Z   }
2026-02-21T09:11:34.2043425Z }
2026-02-21T09:11:34.2043507Z 
2026-02-21T09:11:34.2043561Z {-#
2026-02-21T09:11:34.2043705Z   external_resources: {
2026-02-21T09:11:34.2043874Z     mlir_reproducer: {
2026-02-21T09:11:34.2048752Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T09:11:34.2053497Z       disable_threading: false,
2026-02-21T09:11:34.2053672Z       verify_each: true
2026-02-21T09:11:34.2053813Z     }
2026-02-21T09:11:34.2053943Z   }
2026-02-21T09:11:34.2054063Z #-}
2026-02-21T09:11:34.2054501Z /tmp/torchinductor_root/yd/cyd3dfdak4qyi5lrmdhe7jxqrsagzqh3pbfiqueeizcj7exphius.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:11:34.2055748Z /tmp/torchinductor_root/yd/cyd3dfdak4qyi5lrmdhe7jxqrsagzqh3pbfiqueeizcj7exphius.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:11:34.2056724Z [35s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:11:34.2057782Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_sm_multiplier=8, num_stages=6, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T09:11:34.2063035Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:11:34.2063349Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:11:36.3871183Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 17.0 configs/s
2026-02-21T09:11:36.3881791Z [38s] Adaptive compile timeout: 30s (90% percentile=1.3s, bounds=[30.0s, 30s])
2026-02-21T09:11:36.5495516Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 5895.2 configs/s
2026-02-21T09:11:36.5758818Z [38s] Initial random population of 100, 5 starting points: 
2026-02-21T09:11:36.5762532Z error=5
2026-02-21T09:11:36.5767013Z timeout=1
2026-02-21T09:11:36.5768619Z ok=94
2026-02-21T09:11:36.5768836Z min=0.0123
2026-02-21T09:11:36.5774916Z mid=0.0921
2026-02-21T09:11:36.5779318Z max=20.7043
2026-02-21T09:11:36.5780623Z best={'block_sizes': [4, 128],
2026-02-21T09:11:36.5780891Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:11:36.5781125Z  'load_eviction_policies': ['last', ''],
2026-02-21T09:11:36.5781315Z  'maxnreg': 128,
2026-02-21T09:11:36.5781467Z  'num_sm_multiplier': 8,
2026-02-21T09:11:36.5781748Z  'num_stages': 3,
2026-02-21T09:11:36.5781901Z  'num_warps': 1,
2026-02-21T09:11:36.5782062Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:11:36.5782252Z  'range_flattens': [True, True],
2026-02-21T09:11:36.5782432Z  'range_multi_buffers': [True, True],
2026-02-21T09:11:36.5782616Z  'range_num_stages': [3, 2],
2026-02-21T09:11:36.5782777Z  'range_unroll_factors': [1, 2],
2026-02-21T09:11:36.5782958Z  'range_warp_specializes': [True, None]}
2026-02-21T09:11:36.5783169Z [38s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:11:37.9837829Z [39s] Generation 1 starting: 101 neighbors, 5 active search path(s)
2026-02-21T09:11:42.4781195Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 106/106 16.5 configs/s
2026-02-21T09:11:42.8082187Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T09:11:42.8085857Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:11:42.8086734Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:11:42.8087245Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:11:42.8087451Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:11:42.8087636Z     %c1184_i32 = arith.constant 1184 : i32
2026-02-21T09:11:42.8087856Z     %cst = arith.constant dense<896> : tensor<16x1xi32>
2026-02-21T09:11:42.8088102Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<16xf32>
2026-02-21T09:11:42.8088364Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<16xf32>
2026-02-21T09:11:42.8088576Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:11:42.8088763Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:11:42.8095061Z     %c896_i32 = arith.constant 896 : i32
2026-02-21T09:11:42.8099859Z     %c896_i64 = arith.constant 896 : i64
2026-02-21T09:11:42.8103607Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T09:11:42.8107133Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c896_i32], [%c896_i64, %c1_i64] : <f16>, <tensor<16x128xf16>>
2026-02-21T09:11:42.8110415Z     %1 = tt.get_program_id x : i32
2026-02-21T09:11:42.8114308Z     scf.for %arg2 = %1 to %c256_i32 step %c1184_i32  : i32 {
2026-02-21T09:11:42.8118351Z       %2 = arith.muli %arg2, %c16_i32 : i32
2026-02-21T09:11:42.8121323Z       %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T09:11:42.8121863Z       %4 = tt.splat %2 : i32 -> tensor<16xi32>
2026-02-21T09:11:42.8122081Z       %5 = arith.addi %4, %3 : tensor<16xi32>
2026-02-21T09:11:42.8122278Z       %c768_i32 = arith.constant 768 : i32
2026-02-21T09:11:42.8122483Z       %c256_i32_2 = arith.constant 256 : i32
2026-02-21T09:11:42.8122866Z       %6:2 = scf.for %arg3 = %c0_i32 to %c768_i32 step %c256_i32_2 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<16xf32>, tensor<16xf32>)  : i32 {
2026-02-21T09:11:42.8123336Z         %48 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:11:42.8123687Z         %49 = arith.extf %48 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:11:42.8123931Z         %50 = "tt.reduce"(%49) <{axis = 1 : i32}> ({
2026-02-21T09:11:42.8124138Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:11:42.8124339Z           %86 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:11:42.8124538Z           tt.reduce.return %86 : f32
2026-02-21T09:11:42.8124742Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:11:42.8124968Z         %51 = arith.truncf %50 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:11:42.8125213Z         %52 = arith.extf %51 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:11:42.8125443Z         %53 = arith.cmpf ogt, %arg4, %52 : tensor<16xf32>
2026-02-21T09:11:42.8125677Z         %54 = arith.cmpf une, %arg4, %arg4 : tensor<16xf32>
2026-02-21T09:11:42.8125890Z         %55 = arith.ori %53, %54 : tensor<16xi1>
2026-02-21T09:11:42.8126126Z         %56 = arith.select %55, %arg4, %52 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:11:42.8126377Z         %57 = arith.subf %arg4, %56 : tensor<16xf32>
2026-02-21T09:11:42.8126746Z         %58 = tt.extern_elementwise %57 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:11:42.8127111Z         %59 = arith.mulf %arg5, %58 : tensor<16xf32>
2026-02-21T09:11:42.8127363Z         %60 = tt.expand_dims %56 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:11:42.8127658Z         %61 = tt.broadcast %60 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:11:42.8127900Z         %62 = arith.subf %49, %61 : tensor<16x128xf32>
2026-02-21T09:11:42.8128265Z         %63 = tt.extern_elementwise %62 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:11:42.8128630Z         %64 = "tt.reduce"(%63) <{axis = 1 : i32}> ({
2026-02-21T09:11:42.8128821Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:11:42.8129011Z           %86 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:11:42.8129198Z           tt.reduce.return %86 : f32
2026-02-21T09:11:42.8129610Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:11:42.8129820Z         %65 = arith.addf %59, %64 : tensor<16xf32>
2026-02-21T09:11:42.8130013Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T09:11:42.8130207Z         %66 = arith.muli %c128_i32, %c1_i32 : i32
2026-02-21T09:11:42.8130395Z         %67 = arith.addi %arg3, %66 : i32
2026-02-21T09:11:42.8130678Z         %68 = tt.descriptor_load %0[%2, %67] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:11:42.8130989Z         %69 = arith.extf %68 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:11:42.8131222Z         %70 = "tt.reduce"(%69) <{axis = 1 : i32}> ({
2026-02-21T09:11:42.8131415Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:11:42.8131634Z           %86 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:11:42.8131828Z           tt.reduce.return %86 : f32
2026-02-21T09:11:42.8132009Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:11:42.8132302Z         %71 = arith.truncf %70 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:11:42.8132544Z         %72 = arith.extf %71 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:11:42.8132777Z         %73 = arith.cmpf ogt, %56, %72 : tensor<16xf32>
2026-02-21T09:11:42.8132990Z         %74 = arith.cmpf une, %56, %56 : tensor<16xf32>
2026-02-21T09:11:42.8133196Z         %75 = arith.ori %73, %74 : tensor<16xi1>
2026-02-21T09:11:42.8133429Z         %76 = arith.select %75, %56, %72 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:11:42.8133657Z         %77 = arith.subf %56, %76 : tensor<16xf32>
2026-02-21T09:11:42.8134009Z         %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:11:42.8134359Z         %79 = arith.mulf %65, %78 : tensor<16xf32>
2026-02-21T09:11:42.8134613Z         %80 = tt.expand_dims %76 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:11:42.8134910Z         %81 = tt.broadcast %80 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:11:42.8135144Z         %82 = arith.subf %69, %81 : tensor<16x128xf32>
2026-02-21T09:11:42.8135509Z         %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:11:42.8135868Z         %84 = "tt.reduce"(%83) <{axis = 1 : i32}> ({
2026-02-21T09:11:42.8136063Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:11:42.8136241Z           %86 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:11:42.8136430Z           tt.reduce.return %86 : f32
2026-02-21T09:11:42.8136622Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:11:42.8136816Z         %85 = arith.addf %79, %84 : tensor<16xf32>
2026-02-21T09:11:42.8137039Z         scf.yield %76, %85 : tensor<16xf32>, tensor<16xf32>
2026-02-21T09:11:42.8137248Z       } {tt.num_stages = 1 : i32}
2026-02-21T09:11:42.8137537Z       %7 = tt.descriptor_load %0[%2, %c768_i32] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:11:42.8137863Z       %8 = arith.extf %7 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:11:42.8138094Z       %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({
2026-02-21T09:11:42.8138285Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:11:42.8138465Z         %48 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:11:42.8138672Z         tt.reduce.return %48 : f32
2026-02-21T09:11:42.8138857Z       }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:11:42.8139089Z       %10 = arith.truncf %9 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:11:42.8139331Z       %11 = arith.extf %10 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:11:42.8139565Z       %12 = arith.cmpf ogt, %6#0, %11 : tensor<16xf32>
2026-02-21T09:11:42.8139783Z       %13 = arith.cmpf une, %6#0, %6#0 : tensor<16xf32>
2026-02-21T09:11:42.8139995Z       %14 = arith.ori %12, %13 : tensor<16xi1>
2026-02-21T09:11:42.8140229Z       %15 = arith.select %14, %6#0, %11 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:11:42.8140467Z       %16 = arith.subf %6#0, %15 : tensor<16xf32>
2026-02-21T09:11:42.8140895Z       %17 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:11:42.8141286Z       %18 = arith.mulf %6#1, %17 : tensor<16xf32>
2026-02-21T09:11:42.8141576Z       %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:11:42.8141882Z       %20 = tt.broadcast %19 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:11:42.8142115Z       %21 = arith.subf %8, %20 : tensor<16x128xf32>
2026-02-21T09:11:42.8142486Z       %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:11:42.8142851Z       %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({
2026-02-21T09:11:42.8143061Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:11:42.8143251Z         %48 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:11:42.8143492Z         tt.reduce.return %48 : f32
2026-02-21T09:11:42.8143698Z       }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:11:42.8143889Z       %24 = arith.addf %18, %23 : tensor<16xf32>
2026-02-21T09:11:42.8144086Z       %c768_i32_3 = arith.constant 768 : i32
2026-02-21T09:11:42.8144270Z       %c256_i32_4 = arith.constant 256 : i32
2026-02-21T09:11:42.8144500Z       scf.for %arg3 = %c0_i32 to %c768_i32_3 step %c256_i32_4  : i32 {
2026-02-21T09:11:42.8144780Z         %48 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:11:42.8145033Z         %49 = tt.splat %arg3 : i32 -> tensor<128xi32>
2026-02-21T09:11:42.8145242Z         %50 = arith.addi %49, %48 : tensor<128xi32>
2026-02-21T09:11:42.8145486Z         %51 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:11:42.8145750Z         %52 = arith.muli %51, %cst : tensor<16x1xi32>
2026-02-21T09:11:42.8145998Z         %53 = tt.expand_dims %50 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:11:42.8146298Z         %54 = tt.broadcast %52 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:11:42.8146567Z         %55 = tt.broadcast %53 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:11:42.8146801Z         %56 = arith.addi %54, %55 : tensor<16x128xi32>
2026-02-21T09:11:42.8147046Z         %57 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:11:42.8147325Z         %58 = tt.addptr %57, %56 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:11:42.8147582Z         %59 = tt.load %58 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:11:42.8147829Z         %60 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:11:42.8148114Z         %61 = arith.extf %59 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:11:42.8148373Z         %62 = tt.broadcast %60 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:11:42.8148603Z         %63 = arith.subf %61, %62 : tensor<16x128xf32>
2026-02-21T09:11:42.8148974Z         %64 = tt.extern_elementwise %63 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:11:42.8149382Z         %65 = tt.expand_dims %24 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:11:42.8149665Z         %66 = tt.broadcast %65 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:11:42.8149899Z         %67 = arith.divf %64, %66 : tensor<16x128xf32>
2026-02-21T09:11:42.8150129Z         %68 = arith.truncf %67 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:11:42.8150399Z         %69 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:11:42.8150668Z         %70 = tt.addptr %69, %56 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:11:42.8150923Z         tt.store %70, %68 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:11:42.8151125Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T09:11:42.8151321Z         %71 = arith.muli %c128_i32, %c1_i32 : i32
2026-02-21T09:11:42.8151518Z         %72 = arith.addi %arg3, %71 : i32
2026-02-21T09:11:42.8151834Z         %73 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:11:42.8152089Z         %74 = tt.splat %72 : i32 -> tensor<128xi32>
2026-02-21T09:11:42.8152289Z         %75 = arith.addi %74, %73 : tensor<128xi32>
2026-02-21T09:11:42.8152547Z         %76 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:11:42.8152807Z         %77 = arith.muli %76, %cst : tensor<16x1xi32>
2026-02-21T09:11:42.8153070Z         %78 = tt.expand_dims %75 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:11:42.8153369Z         %79 = tt.broadcast %77 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:11:42.8153629Z         %80 = tt.broadcast %78 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:11:42.8153883Z         %81 = arith.addi %79, %80 : tensor<16x128xi32>
2026-02-21T09:11:42.8154127Z         %82 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:11:42.8154504Z         %83 = tt.addptr %82, %81 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:11:42.8154759Z         %84 = tt.load %83 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:11:42.8155014Z         %85 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:11:42.8155307Z         %86 = arith.extf %84 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:11:42.8155566Z         %87 = tt.broadcast %85 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:11:42.8155803Z         %88 = arith.subf %86, %87 : tensor<16x128xf32>
2026-02-21T09:11:42.8156173Z         %89 = tt.extern_elementwise %88 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:11:42.8156596Z         %90 = tt.expand_dims %24 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:11:42.8156885Z         %91 = tt.broadcast %90 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:11:42.8157117Z         %92 = arith.divf %89, %91 : tensor<16x128xf32>
2026-02-21T09:11:42.8157360Z         %93 = arith.truncf %92 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:11:42.8157629Z         %94 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:11:42.8157909Z         %95 = tt.addptr %94, %81 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:11:42.8158164Z         tt.store %95, %93 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:11:42.8158372Z       } {tt.num_stages = 1 : i32}
2026-02-21T09:11:42.8158600Z       %25 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:11:42.8158856Z       %26 = tt.splat %c768_i32_3 : i32 -> tensor<128xi32>
2026-02-21T09:11:42.8159075Z       %27 = arith.addi %26, %25 : tensor<128xi32>
2026-02-21T09:11:42.8159321Z       %28 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:11:42.8159582Z       %29 = arith.muli %28, %cst : tensor<16x1xi32>
2026-02-21T09:11:42.8159835Z       %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:11:42.8160128Z       %31 = tt.broadcast %29 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:11:42.8160395Z       %32 = tt.broadcast %30 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:11:42.8160623Z       %33 = arith.addi %31, %32 : tensor<16x128xi32>
2026-02-21T09:11:42.8160862Z       %34 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:11:42.8161136Z       %35 = tt.addptr %34, %33 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:11:42.8161395Z       %36 = tt.load %35 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:11:42.8161712Z       %37 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:11:42.8162003Z       %38 = arith.extf %36 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:11:42.8162263Z       %39 = tt.broadcast %37 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:11:42.8162504Z       %40 = arith.subf %38, %39 : tensor<16x128xf32>
2026-02-21T09:11:42.8162942Z       %41 = tt.extern_elementwise %40 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:11:42.8163370Z       %42 = tt.expand_dims %24 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:11:42.8163671Z       %43 = tt.broadcast %42 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:11:42.8163919Z       %44 = arith.divf %41, %43 : tensor<16x128xf32>
2026-02-21T09:11:42.8164156Z       %45 = arith.truncf %44 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:11:42.8164436Z       %46 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:11:42.8164718Z       %47 = tt.addptr %46, %33 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:11:42.8164986Z       tt.store %47, %45 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:11:42.8165264Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T09:11:42.8165584Z     tt.return
2026-02-21T09:11:42.8165726Z   }
2026-02-21T09:11:42.8165849Z }
2026-02-21T09:11:42.8165923Z 
2026-02-21T09:11:42.8165983Z {-#
2026-02-21T09:11:42.8166115Z   external_resources: {
2026-02-21T09:11:42.8166286Z     mlir_reproducer: {
2026-02-21T09:11:42.8170834Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T09:11:42.8175300Z       disable_threading: false,
2026-02-21T09:11:42.8175468Z       verify_each: true
2026-02-21T09:11:42.8175618Z     }
2026-02-21T09:11:42.8175734Z   }
2026-02-21T09:11:42.8175853Z #-}
2026-02-21T09:11:42.8176284Z /tmp/torchinductor_root/rc/crcdf4rqqxqatr7f3tzmzu3irll77hqde3z57327bjf64sdz6cal.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:11:42.8177464Z /tmp/torchinductor_root/rc/crcdf4rqqxqatr7f3tzmzu3irll77hqde3z57327bjf64sdz6cal.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:11:42.8178435Z [44s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:11:42.8179507Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', ''], maxnreg=128, num_sm_multiplier=8, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, True], range_num_stages=[3, 2], range_unroll_factors=[1, 2], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T09:11:42.8180512Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:11:42.8180772Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:11:43.6424763Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T09:11:43.6426719Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:11:43.6427199Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:11:43.6427406Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:11:43.6427595Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:11:43.6428091Z     %c1184_i32 = arith.constant 1184 : i32
2026-02-21T09:11:43.6428340Z     %cst = arith.constant dense<896> : tensor<8x1xi32>
2026-02-21T09:11:43.6428595Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<8xf32>
2026-02-21T09:11:43.6428847Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<8xf32>
2026-02-21T09:11:43.6429069Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:11:43.6429254Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:11:43.6429434Z     %c896_i32 = arith.constant 896 : i32
2026-02-21T09:11:43.6429622Z     %c896_i64 = arith.constant 896 : i64
2026-02-21T09:11:43.6429796Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T09:11:43.6430108Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c896_i32], [%c896_i64, %c1_i64] : <f16>, <tensor<8x128xf16>>
2026-02-21T09:11:43.6430421Z     %1 = tt.get_program_id x : i32
2026-02-21T09:11:43.6430638Z     scf.for %arg2 = %1 to %c512_i32 step %c1184_i32  : i32 {
2026-02-21T09:11:43.6430870Z       %2 = arith.muli %arg2, %c8_i32 : i32
2026-02-21T09:11:43.6431105Z       %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T09:11:43.6431365Z       %4 = tt.splat %2 : i32 -> tensor<8xi32>
2026-02-21T09:11:43.6431635Z       %5 = arith.addi %4, %3 : tensor<8xi32>
2026-02-21T09:11:43.6431834Z       %c768_i32 = arith.constant 768 : i32
2026-02-21T09:11:43.6432016Z       %c256_i32 = arith.constant 256 : i32
2026-02-21T09:11:43.6432383Z       %6:2 = scf.for %arg3 = %c0_i32 to %c768_i32 step %c256_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<8xf32>, tensor<8xf32>)  : i32 {
2026-02-21T09:11:43.6432850Z         %48 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<8x128xf16>> -> tensor<8x128xf16>
2026-02-21T09:11:43.6433172Z         %49 = arith.extf %48 : tensor<8x128xf16> to tensor<8x128xf32>
2026-02-21T09:11:43.6433412Z         %50 = "tt.reduce"(%49) <{axis = 1 : i32}> ({
2026-02-21T09:11:43.6433608Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:11:43.6433809Z           %86 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:11:43.6434008Z           tt.reduce.return %86 : f32
2026-02-21T09:11:43.6434201Z         }) : (tensor<8x128xf32>) -> tensor<8xf32>
2026-02-21T09:11:43.6434462Z         %51 = arith.truncf %50 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:11:43.6434706Z         %52 = arith.extf %51 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:11:43.6434933Z         %53 = arith.cmpf ogt, %arg4, %52 : tensor<8xf32>
2026-02-21T09:11:43.6435162Z         %54 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32>
2026-02-21T09:11:43.6435374Z         %55 = arith.ori %53, %54 : tensor<8xi1>
2026-02-21T09:11:43.6435612Z         %56 = arith.select %55, %arg4, %52 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:11:43.6435857Z         %57 = arith.subf %arg4, %56 : tensor<8xf32>
2026-02-21T09:11:43.6436214Z         %58 = tt.extern_elementwise %57 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:11:43.6436578Z         %59 = arith.mulf %arg5, %58 : tensor<8xf32>
2026-02-21T09:11:43.6436975Z         %60 = tt.expand_dims %56 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:11:43.6437268Z         %61 = tt.broadcast %60 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T09:11:43.6437499Z         %62 = arith.subf %49, %61 : tensor<8x128xf32>
2026-02-21T09:11:43.6437862Z         %63 = tt.extern_elementwise %62 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32>
2026-02-21T09:11:43.6438234Z         %64 = "tt.reduce"(%63) <{axis = 1 : i32}> ({
2026-02-21T09:11:43.6438426Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:11:43.6438620Z           %86 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:11:43.6438804Z           tt.reduce.return %86 : f32
2026-02-21T09:11:43.6438995Z         }) : (tensor<8x128xf32>) -> tensor<8xf32>
2026-02-21T09:11:43.6439194Z         %65 = arith.addf %59, %64 : tensor<8xf32>
2026-02-21T09:11:43.6439396Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T09:11:43.6439665Z         %66 = arith.muli %c128_i32, %c1_i32 : i32
2026-02-21T09:11:43.6439855Z         %67 = arith.addi %arg3, %66 : i32
2026-02-21T09:11:43.6440131Z         %68 = tt.descriptor_load %0[%2, %67] : !tt.tensordesc<tensor<8x128xf16>> -> tensor<8x128xf16>
2026-02-21T09:11:43.6440438Z         %69 = arith.extf %68 : tensor<8x128xf16> to tensor<8x128xf32>
2026-02-21T09:11:43.6440672Z         %70 = "tt.reduce"(%69) <{axis = 1 : i32}> ({
2026-02-21T09:11:43.6440859Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:11:43.6441045Z           %86 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:11:43.6441242Z           tt.reduce.return %86 : f32
2026-02-21T09:11:43.6441424Z         }) : (tensor<8x128xf32>) -> tensor<8xf32>
2026-02-21T09:11:43.6441696Z         %71 = arith.truncf %70 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:11:43.6441936Z         %72 = arith.extf %71 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:11:43.6442163Z         %73 = arith.cmpf ogt, %56, %72 : tensor<8xf32>
2026-02-21T09:11:43.6442376Z         %74 = arith.cmpf une, %56, %56 : tensor<8xf32>
2026-02-21T09:11:43.6442585Z         %75 = arith.ori %73, %74 : tensor<8xi1>
2026-02-21T09:11:43.6442807Z         %76 = arith.select %75, %56, %72 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:11:43.6443039Z         %77 = arith.subf %56, %76 : tensor<8xf32>
2026-02-21T09:11:43.6443392Z         %78 = tt.extern_elementwise %77 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:11:43.6443743Z         %79 = arith.mulf %65, %78 : tensor<8xf32>
2026-02-21T09:11:43.6443995Z         %80 = tt.expand_dims %76 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:11:43.6444277Z         %81 = tt.broadcast %80 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T09:11:43.6444516Z         %82 = arith.subf %69, %81 : tensor<8x128xf32>
2026-02-21T09:11:43.6444880Z         %83 = tt.extern_elementwise %82 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32>
2026-02-21T09:11:43.6445242Z         %84 = "tt.reduce"(%83) <{axis = 1 : i32}> ({
2026-02-21T09:11:43.6445440Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:11:43.6445619Z           %86 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:11:43.6445806Z           tt.reduce.return %86 : f32
2026-02-21T09:11:43.6445984Z         }) : (tensor<8x128xf32>) -> tensor<8xf32>
2026-02-21T09:11:43.6446209Z         %85 = arith.addf %79, %84 : tensor<8xf32>
2026-02-21T09:11:43.6446418Z         scf.yield %76, %85 : tensor<8xf32>, tensor<8xf32>
2026-02-21T09:11:43.6446636Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:11:43.6446928Z       %7 = tt.descriptor_load %0[%2, %c768_i32] : !tt.tensordesc<tensor<8x128xf16>> -> tensor<8x128xf16>
2026-02-21T09:11:43.6447244Z       %8 = arith.extf %7 : tensor<8x128xf16> to tensor<8x128xf32>
2026-02-21T09:11:43.6447476Z       %9 = "tt.reduce"(%8) <{axis = 1 : i32}> ({
2026-02-21T09:11:43.6447663Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:11:43.6447853Z         %48 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:11:43.6448117Z         tt.reduce.return %48 : f32
2026-02-21T09:11:43.6448305Z       }) : (tensor<8x128xf32>) -> tensor<8xf32>
2026-02-21T09:11:43.6448518Z       %10 = arith.truncf %9 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:11:43.6448758Z       %11 = arith.extf %10 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:11:43.6448989Z       %12 = arith.cmpf ogt, %6#0, %11 : tensor<8xf32>
2026-02-21T09:11:43.6449195Z       %13 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32>
2026-02-21T09:11:43.6449397Z       %14 = arith.ori %12, %13 : tensor<8xi1>
2026-02-21T09:11:43.6449613Z       %15 = arith.select %14, %6#0, %11 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:11:43.6449844Z       %16 = arith.subf %6#0, %15 : tensor<8xf32>
2026-02-21T09:11:43.6450184Z       %17 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:11:43.6450599Z       %18 = arith.mulf %6#1, %17 : tensor<8xf32>
2026-02-21T09:11:43.6450853Z       %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:11:43.6451132Z       %20 = tt.broadcast %19 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T09:11:43.6451366Z       %21 = arith.subf %8, %20 : tensor<8x128xf32>
2026-02-21T09:11:43.6451747Z       %22 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32>
2026-02-21T09:11:43.6452109Z       %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({
2026-02-21T09:11:43.6452303Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:11:43.6452475Z         %48 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:11:43.6452661Z         tt.reduce.return %48 : f32
2026-02-21T09:11:43.6452839Z       }) : (tensor<8x128xf32>) -> tensor<8xf32>
2026-02-21T09:11:43.6453034Z       %24 = arith.addf %18, %23 : tensor<8xf32>
2026-02-21T09:11:43.6453220Z       %c768_i32_2 = arith.constant 768 : i32
2026-02-21T09:11:43.6453413Z       %c256_i32_3 = arith.constant 256 : i32
2026-02-21T09:11:43.6453638Z       scf.for %arg3 = %c0_i32 to %c768_i32_2 step %c256_i32_3  : i32 {
2026-02-21T09:11:43.6453926Z         %48 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:11:43.6454188Z         %49 = tt.splat %arg3 : i32 -> tensor<128xi32>
2026-02-21T09:11:43.6454393Z         %50 = arith.addi %49, %48 : tensor<128xi32>
2026-02-21T09:11:43.6454650Z         %51 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:11:43.6454906Z         %52 = arith.muli %51, %cst : tensor<8x1xi32>
2026-02-21T09:11:43.6455167Z         %53 = tt.expand_dims %50 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:11:43.6455457Z         %54 = tt.broadcast %52 : tensor<8x1xi32> -> tensor<8x128xi32>
2026-02-21T09:11:43.6455724Z         %55 = tt.broadcast %53 : tensor<1x128xi32> -> tensor<8x128xi32>
2026-02-21T09:11:43.6455965Z         %56 = arith.addi %54, %55 : tensor<8x128xi32>
2026-02-21T09:11:43.6456203Z         %57 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x128x!tt.ptr<f16>>
2026-02-21T09:11:43.6456487Z         %58 = tt.addptr %57, %56 : tensor<8x128x!tt.ptr<f16>>, tensor<8x128xi32>
2026-02-21T09:11:43.6456735Z         %59 = tt.load %58 : tensor<8x128x!tt.ptr<f16>>
2026-02-21T09:11:43.6456995Z         %60 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:11:43.6457286Z         %61 = arith.extf %59 : tensor<8x128xf16> to tensor<8x128xf32>
2026-02-21T09:11:43.6457536Z         %62 = tt.broadcast %60 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T09:11:43.6457772Z         %63 = arith.subf %61, %62 : tensor<8x128xf32>
2026-02-21T09:11:43.6458130Z         %64 = tt.extern_elementwise %63 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32>
2026-02-21T09:11:43.6458553Z         %65 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:11:43.6458841Z         %66 = tt.broadcast %65 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T09:11:43.6459129Z         %67 = arith.divf %64, %66 : tensor<8x128xf32>
2026-02-21T09:11:43.6459361Z         %68 = arith.truncf %67 : tensor<8x128xf32> to tensor<8x128xf16>
2026-02-21T09:11:43.6459624Z         %69 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x128x!tt.ptr<f16>>
2026-02-21T09:11:43.6459897Z         %70 = tt.addptr %69, %56 : tensor<8x128x!tt.ptr<f16>>, tensor<8x128xi32>
2026-02-21T09:11:43.6460143Z         tt.store %70, %68 : tensor<8x128x!tt.ptr<f16>>
2026-02-21T09:11:43.6460347Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T09:11:43.6460533Z         %71 = arith.muli %c128_i32, %c1_i32 : i32
2026-02-21T09:11:43.6460732Z         %72 = arith.addi %arg3, %71 : i32
2026-02-21T09:11:43.6460965Z         %73 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:11:43.6461204Z         %74 = tt.splat %72 : i32 -> tensor<128xi32>
2026-02-21T09:11:43.6461406Z         %75 = arith.addi %74, %73 : tensor<128xi32>
2026-02-21T09:11:43.6461756Z         %76 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:11:43.6462024Z         %77 = arith.muli %76, %cst : tensor<8x1xi32>
2026-02-21T09:11:43.6462275Z         %78 = tt.expand_dims %75 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:11:43.6462568Z         %79 = tt.broadcast %77 : tensor<8x1xi32> -> tensor<8x128xi32>
2026-02-21T09:11:43.6462831Z         %80 = tt.broadcast %78 : tensor<1x128xi32> -> tensor<8x128xi32>
2026-02-21T09:11:43.6463061Z         %81 = arith.addi %79, %80 : tensor<8x128xi32>
2026-02-21T09:11:43.6463295Z         %82 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x128x!tt.ptr<f16>>
2026-02-21T09:11:43.6463567Z         %83 = tt.addptr %82, %81 : tensor<8x128x!tt.ptr<f16>>, tensor<8x128xi32>
2026-02-21T09:11:43.6463819Z         %84 = tt.load %83 : tensor<8x128x!tt.ptr<f16>>
2026-02-21T09:11:43.6464073Z         %85 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:11:43.6464348Z         %86 = arith.extf %84 : tensor<8x128xf16> to tensor<8x128xf32>
2026-02-21T09:11:43.6464609Z         %87 = tt.broadcast %85 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T09:11:43.6464835Z         %88 = arith.subf %86, %87 : tensor<8x128xf32>
2026-02-21T09:11:43.6465200Z         %89 = tt.extern_elementwise %88 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32>
2026-02-21T09:11:43.6465598Z         %90 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:11:43.6465878Z         %91 = tt.broadcast %90 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T09:11:43.6466107Z         %92 = arith.divf %89, %91 : tensor<8x128xf32>
2026-02-21T09:11:43.6466333Z         %93 = arith.truncf %92 : tensor<8x128xf32> to tensor<8x128xf16>
2026-02-21T09:11:43.6466601Z         %94 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x128x!tt.ptr<f16>>
2026-02-21T09:11:43.6466868Z         %95 = tt.addptr %94, %81 : tensor<8x128x!tt.ptr<f16>>, tensor<8x128xi32>
2026-02-21T09:11:43.6467124Z         tt.store %95, %93 : tensor<8x128x!tt.ptr<f16>>
2026-02-21T09:11:43.6467335Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:11:43.6467587Z       %25 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:11:43.6467862Z       %26 = tt.splat %c768_i32_2 : i32 -> tensor<128xi32>
2026-02-21T09:11:43.6468078Z       %27 = arith.addi %26, %25 : tensor<128xi32>
2026-02-21T09:11:43.6468335Z       %28 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:11:43.6468596Z       %29 = arith.muli %28, %cst : tensor<8x1xi32>
2026-02-21T09:11:43.6468862Z       %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:11:43.6469158Z       %31 = tt.broadcast %29 : tensor<8x1xi32> -> tensor<8x128xi32>
2026-02-21T09:11:43.6469434Z       %32 = tt.broadcast %30 : tensor<1x128xi32> -> tensor<8x128xi32>
2026-02-21T09:11:43.6469677Z       %33 = arith.addi %31, %32 : tensor<8x128xi32>
2026-02-21T09:11:43.6470009Z       %34 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x128x!tt.ptr<f16>>
2026-02-21T09:11:43.6470317Z       %35 = tt.addptr %34, %33 : tensor<8x128x!tt.ptr<f16>>, tensor<8x128xi32>
2026-02-21T09:11:43.6470587Z       %36 = tt.load %35 : tensor<8x128x!tt.ptr<f16>>
2026-02-21T09:11:43.6470845Z       %37 = tt.expand_dims %15 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:11:43.6471143Z       %38 = arith.extf %36 : tensor<8x128xf16> to tensor<8x128xf32>
2026-02-21T09:11:43.6471405Z       %39 = tt.broadcast %37 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T09:11:43.6471689Z       %40 = arith.subf %38, %39 : tensor<8x128xf32>
2026-02-21T09:11:43.6472070Z       %41 = tt.extern_elementwise %40 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x128xf32>) -> tensor<8x128xf32>
2026-02-21T09:11:43.6472509Z       %42 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:11:43.6472859Z       %43 = tt.broadcast %42 : tensor<8x1xf32> -> tensor<8x128xf32>
2026-02-21T09:11:43.6473096Z       %44 = arith.divf %41, %43 : tensor<8x128xf32>
2026-02-21T09:11:43.6473335Z       %45 = arith.truncf %44 : tensor<8x128xf32> to tensor<8x128xf16>
2026-02-21T09:11:43.6473605Z       %46 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x128x!tt.ptr<f16>>
2026-02-21T09:11:43.6473894Z       %47 = tt.addptr %46, %33 : tensor<8x128x!tt.ptr<f16>>, tensor<8x128xi32>
2026-02-21T09:11:43.6474166Z       tt.store %47, %45 : tensor<8x128x!tt.ptr<f16>>
2026-02-21T09:11:43.6474393Z     } {tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T09:11:43.6474601Z     tt.return
2026-02-21T09:11:43.6474731Z   }
2026-02-21T09:11:43.6474865Z }
2026-02-21T09:11:43.6474937Z 
2026-02-21T09:11:43.6474988Z {-#
2026-02-21T09:11:43.6475129Z   external_resources: {
2026-02-21T09:11:43.6475291Z     mlir_reproducer: {
2026-02-21T09:11:43.6479649Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T09:11:43.6484129Z       disable_threading: false,
2026-02-21T09:11:43.6484297Z       verify_each: true
2026-02-21T09:11:43.6484447Z     }
2026-02-21T09:11:43.6484571Z   }
2026-02-21T09:11:43.6484684Z #-}
2026-02-21T09:11:43.6485113Z /tmp/torchinductor_root/sn/csnj4e6rue3cqjy3o56b4y3yibltaqdjhwbfymxbfefijdc47pey.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:11:43.6486341Z /tmp/torchinductor_root/sn/csnj4e6rue3cqjy3o56b4y3yibltaqdjhwbfymxbfefijdc47pey.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:11:43.6487329Z [45s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:11:43.6488391Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', ''], maxnreg=128, num_sm_multiplier=8, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[3, 2], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T09:11:43.6489379Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:11:43.6489644Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:11:48.9920136Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 106/106 16.4 configs/s
2026-02-21T09:11:52.6177046Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 297.0 configs/s
2026-02-21T09:11:52.9190566Z [54s] Generation 1 complete: 
2026-02-21T09:11:52.9194894Z error=2
2026-02-21T09:11:52.9195106Z ok=105
2026-02-21T09:11:52.9199249Z min=0.0104
2026-02-21T09:11:52.9204521Z mid=0.0184
2026-02-21T09:11:52.9209065Z max=0.0799
2026-02-21T09:11:52.9210488Z best={'block_sizes': [4, 512],
2026-02-21T09:11:52.9210759Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:11:52.9211019Z  'load_eviction_policies': ['last', 'first'],
2026-02-21T09:11:52.9211214Z  'maxnreg': 128,
2026-02-21T09:11:52.9211380Z  'num_sm_multiplier': 8,
2026-02-21T09:11:52.9211791Z  'num_stages': 3,
2026-02-21T09:11:52.9211967Z  'num_warps': 1,
2026-02-21T09:11:52.9212131Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:11:52.9212342Z  'range_flattens': [True, True],
2026-02-21T09:11:52.9212534Z  'range_multi_buffers': [True, True],
2026-02-21T09:11:52.9212723Z  'range_num_stages': [4, 2],
2026-02-21T09:11:52.9212902Z  'range_unroll_factors': [1, 2],
2026-02-21T09:11:52.9213089Z  'range_warp_specializes': [True, None]}
2026-02-21T09:11:52.9213320Z [54s] Fitting surrogate: 207 points, 207 targets
2026-02-21T09:11:54.0393306Z [55s] Generation 2 starting: 87 neighbors, 5 active search path(s)
2026-02-21T09:11:58.5828449Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91/91 11.2 configs/s
2026-02-21T09:12:04.1852872Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 91/91 16.4 configs/s
2026-02-21T09:12:08.5305197Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 247.5 configs/s
2026-02-21T09:12:08.9029463Z [70s] Generation 2 complete: 
2026-02-21T09:12:08.9033722Z ok=93
2026-02-21T09:12:08.9037605Z min=0.0102
2026-02-21T09:12:08.9042052Z mid=0.0143
2026-02-21T09:12:08.9046600Z max=0.0409
2026-02-21T09:12:08.9051758Z best={'block_sizes': [4, 1024],
2026-02-21T09:12:08.9056038Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:12:08.9057555Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:12:08.9057789Z  'maxnreg': 128,
2026-02-21T09:12:08.9058030Z  'num_sm_multiplier': 16,
2026-02-21T09:12:08.9058202Z  'num_stages': 3,
2026-02-21T09:12:08.9058342Z  'num_warps': 4,
2026-02-21T09:12:08.9058508Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:12:08.9058733Z  'range_flattens': [True, True],
2026-02-21T09:12:08.9062809Z  'range_multi_buffers': [True, True],
2026-02-21T09:12:08.9064215Z  'range_num_stages': [4, 2],
2026-02-21T09:12:08.9064441Z  'range_unroll_factors': [1, 2],
2026-02-21T09:12:08.9064654Z  'range_warp_specializes': [True, None]}
2026-02-21T09:12:08.9065305Z [70s] Fitting surrogate: 300 points, 300 targets
2026-02-21T09:12:10.0837659Z [71s] Generation 3 starting: 89 neighbors, 5 active search path(s)
2026-02-21T09:12:13.9430178Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 93/93 30.9 configs/s
2026-02-21T09:12:19.7871984Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 93/93 16.0 configs/s
2026-02-21T09:12:23.8160127Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 253.9 configs/s
2026-02-21T09:12:24.2108224Z [85s] Generation 3 complete: 
2026-02-21T09:12:24.2109894Z error=2
2026-02-21T09:12:24.2110054Z ok=93
2026-02-21T09:12:24.2110177Z min=0.0102
2026-02-21T09:12:24.2110311Z mid=0.0123
2026-02-21T09:12:24.2110431Z max=0.0389
2026-02-21T09:12:24.2110576Z best={'block_sizes': [1, 1024],
2026-02-21T09:12:24.2111242Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:12:24.2111833Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:12:24.2112040Z  'num_stages': 2,
2026-02-21T09:12:24.2112180Z  'num_warps': 1,
2026-02-21T09:12:24.2112331Z  'pid_type': 'flat',
2026-02-21T09:12:24.2112486Z  'range_flattens': [None, True],
2026-02-21T09:12:24.2112671Z  'range_multi_buffers': [None, False],
2026-02-21T09:12:24.2112853Z  'range_num_stages': [0, 2],
2026-02-21T09:12:24.2113027Z  'range_unroll_factors': [0, 4],
2026-02-21T09:12:24.2113211Z  'range_warp_specializes': [None, None]}
2026-02-21T09:12:24.2124843Z [85s] Fitting surrogate: 395 points, 395 targets
2026-02-21T09:12:25.3177933Z [87s] Generation 4 starting: 76 neighbors, 5 active search path(s)
2026-02-21T09:12:28.6727758Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 57.7 configs/s
2026-02-21T09:12:33.5182732Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 16.2 configs/s
2026-02-21T09:12:37.9187306Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 243.9 configs/s
2026-02-21T09:12:38.3142675Z [100s] Generation 4 complete: 
2026-02-21T09:12:38.3147044Z ok=81
2026-02-21T09:12:38.3150237Z min=0.0102
2026-02-21T09:12:38.3154231Z mid=0.0112
2026-02-21T09:12:38.3156228Z max=0.0307
2026-02-21T09:12:38.3156403Z best={'block_sizes': [1, 1024],
2026-02-21T09:12:38.3156689Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:12:38.3156977Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:12:38.3157186Z  'num_stages': 2,
2026-02-21T09:12:38.3157347Z  'num_warps': 1,
2026-02-21T09:12:38.3157489Z  'pid_type': 'flat',
2026-02-21T09:12:38.3157654Z  'range_flattens': [None, True],
2026-02-21T09:12:38.3157834Z  'range_multi_buffers': [None, False],
2026-02-21T09:12:38.3158026Z  'range_num_stages': [0, 2],
2026-02-21T09:12:38.3158215Z  'range_unroll_factors': [0, 4],
2026-02-21T09:12:38.3158414Z  'range_warp_specializes': [None, None]}
2026-02-21T09:12:38.3162547Z [100s] Fitting surrogate: 476 points, 476 targets
2026-02-21T09:12:39.1232639Z [100s] Generation 5 starting: 52 neighbors, 4 active search path(s)
2026-02-21T09:12:41.7201291Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 19.7 configs/s
2026-02-21T09:12:45.0027032Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 16.4 configs/s
2026-02-21T09:12:47.7658801Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 370.2 configs/s
2026-02-21T09:12:48.0314445Z [109s] Generation 5 complete: 
2026-02-21T09:12:48.0318758Z ok=56
2026-02-21T09:12:48.0322196Z min=0.0083
2026-02-21T09:12:48.0326615Z mid=0.0102
2026-02-21T09:12:48.0331816Z max=0.0512
2026-02-21T09:12:48.0336221Z best={'block_sizes': [1, 1024],
2026-02-21T09:12:48.0338240Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:12:48.0338920Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:12:48.0339111Z  'num_stages': 2,
2026-02-21T09:12:48.0339261Z  'num_warps': 1,
2026-02-21T09:12:48.0339399Z  'pid_type': 'flat',
2026-02-21T09:12:48.0339561Z  'range_flattens': [None, True],
2026-02-21T09:12:48.0339746Z  'range_multi_buffers': [None, False],
2026-02-21T09:12:48.0339930Z  'range_num_stages': [0, 2],
2026-02-21T09:12:48.0340101Z  'range_unroll_factors': [0, 4],
2026-02-21T09:12:48.0340275Z  'range_warp_specializes': [None, None]}
2026-02-21T09:12:48.0340499Z [109s] Fitting surrogate: 532 points, 532 targets
2026-02-21T09:12:48.7523820Z [110s] Generation 6 starting: 41 neighbors, 3 active search path(s)
2026-02-21T09:12:50.8437112Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41/41 32.2 configs/s
2026-02-21T09:12:53.3897048Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 16.4 configs/s
2026-02-21T09:12:55.4399811Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 498.2 configs/s
2026-02-21T09:12:55.6362953Z [117s] Generation 6 complete: 
2026-02-21T09:12:55.6364350Z ok=44
2026-02-21T09:12:55.6364523Z min=0.0102
2026-02-21T09:12:55.6364671Z mid=0.0102
2026-02-21T09:12:55.6364803Z max=0.0164
2026-02-21T09:12:55.6364963Z best={'block_sizes': [1, 1024],
2026-02-21T09:12:55.6365226Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:12:55.6365508Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:12:55.6365704Z  'num_stages': 2,
2026-02-21T09:12:55.6365856Z  'num_warps': 1,
2026-02-21T09:12:55.6366008Z  'pid_type': 'flat',
2026-02-21T09:12:55.6366170Z  'range_flattens': [None, True],
2026-02-21T09:12:55.6366359Z  'range_multi_buffers': [None, False],
2026-02-21T09:12:55.6366549Z  'range_num_stages': [0, 2],
2026-02-21T09:12:55.6366727Z  'range_unroll_factors': [0, 4],
2026-02-21T09:12:55.6366941Z  'range_warp_specializes': [None, None]}
2026-02-21T09:12:55.6382142Z [117s] Fitting surrogate: 576 points, 576 targets
2026-02-21T09:12:56.2326676Z [117s] Generation 7 starting: 27 neighbors, 2 active search path(s)
2026-02-21T09:12:57.4996463Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 42.1 configs/s
2026-02-21T09:12:59.1798977Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 27/27 16.5 configs/s
2026-02-21T09:13:00.6538329Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 691.2 configs/s
2026-02-21T09:13:00.7970951Z [122s] Generation 7 complete: 
2026-02-21T09:13:00.7974125Z ok=30
2026-02-21T09:13:00.7978487Z min=0.0092
2026-02-21T09:13:00.7982111Z mid=0.0102
2026-02-21T09:13:00.7986984Z max=0.0164
2026-02-21T09:13:00.7989121Z best={'block_sizes': [1, 1024],
2026-02-21T09:13:00.7989433Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:13:00.7990178Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:13:00.7990389Z  'num_stages': 2,
2026-02-21T09:13:00.7991502Z  'num_warps': 1,
2026-02-21T09:13:00.7991725Z  'pid_type': 'flat',
2026-02-21T09:13:00.7991927Z  'range_flattens': [None, True],
2026-02-21T09:13:00.7992149Z  'range_multi_buffers': [None, False],
2026-02-21T09:13:00.7992345Z  'range_num_stages': [0, 2],
2026-02-21T09:13:00.7992519Z  'range_unroll_factors': [0, 4],
2026-02-21T09:13:00.7992723Z  'range_warp_specializes': [None, None]}
2026-02-21T09:13:00.7994837Z [122s] Fitting surrogate: 606 points, 606 targets
2026-02-21T09:13:01.3496875Z [123s] Generation 8 starting: 23 neighbors, 2 active search path(s)
2026-02-21T09:13:02.6586338Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 46.4 configs/s
2026-02-21T09:13:04.0751126Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 16.7 configs/s
2026-02-21T09:13:05.3883864Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 775.5 configs/s
2026-02-21T09:13:05.5173879Z [127s] Generation 8 complete: 
2026-02-21T09:13:05.5178276Z ok=26
2026-02-21T09:13:05.5179641Z min=0.0102
2026-02-21T09:13:05.5179814Z mid=0.0102
2026-02-21T09:13:05.5179954Z max=0.0143
2026-02-21T09:13:05.5180096Z best={'block_sizes': [1, 1024],
2026-02-21T09:13:05.5180356Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:13:05.5180620Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:13:05.5180819Z  'num_stages': 2,
2026-02-21T09:13:05.5180957Z  'num_warps': 1,
2026-02-21T09:13:05.5181102Z  'pid_type': 'flat',
2026-02-21T09:13:05.5181254Z  'range_flattens': [None, True],
2026-02-21T09:13:05.5181437Z  'range_multi_buffers': [None, False],
2026-02-21T09:13:05.5181861Z  'range_num_stages': [0, 2],
2026-02-21T09:13:05.5182035Z  'range_unroll_factors': [0, 4],
2026-02-21T09:13:05.5182238Z  'range_warp_specializes': [None, None]}
2026-02-21T09:13:05.5194030Z [127s] Fitting surrogate: 632 points, 632 targets
2026-02-21T09:13:05.8070723Z [127s] Autotuning complete in 127.5s after searching 611 configs.
2026-02-21T09:13:05.8071036Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:13:05.8072173Z     @helion.kernel(config=helion.Config(block_sizes=[1, 1024], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]), static_shapes=True)
2026-02-21T09:13:05.8072997Z 
2026-02-21T09:13:05.8073249Z [127s] Code of selected kernel: /tmp/torchinductor_root/sr/csrpjvbpym57p7lvzdhfstpheoqr7wm4hgy2bqbykun7hp3dddxd.py
2026-02-21T09:13:05.8298339Z from __future__ import annotations
2026-02-21T09:13:05.8301279Z 
2026-02-21T09:13:05.8305591Z import torch
2026-02-21T09:13:05.8310749Z import triton
2026-02-21T09:13:05.8312433Z import triton.language as tl
2026-02-21T09:13:05.8312692Z from torch._inductor.runtime import triton_helpers
2026-02-21T09:13:05.8312961Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T09:13:05.8313254Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:13:05.8313429Z 
2026-02-21T09:13:05.8313508Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T09:13:05.8313687Z _BLOCK_SIZE_1 = tl.constexpr(1024)
2026-02-21T09:13:05.8313800Z 
2026-02-21T09:13:05.8313865Z @triton.jit
2026-02-21T09:13:05.8314008Z def _helion_softmax_two_pass(x, out):
2026-02-21T09:13:05.8314264Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:13:05.8314510Z     pid_0 = tl.program_id(0)
2026-02-21T09:13:05.8314678Z     offset_0 = pid_0
2026-02-21T09:13:05.8314849Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T09:13:05.8315143Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:13:05.8315778Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T09:13:05.8316044Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:13:05.8316307Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T09:13:05.8316559Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:13:05.8316835Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T09:13:05.8317096Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T09:13:05.8317323Z     # src[softmax.py:82-89]: ...
2026-02-21T09:13:05.8317689Z     for offset_2 in tl.range(0, 896, _BLOCK_SIZE_1, loop_unroll_factor=4, num_stages=1, disallow_acc_multi_buffer=True, flatten=True):
2026-02-21T09:13:05.8318097Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:13:05.8318338Z         mask_1 = indices_2 < 896
2026-02-21T09:13:05.8318612Z         mi_copy = mi
2026-02-21T09:13:05.8318776Z         di_copy = di
2026-02-21T09:13:05.8318924Z         mi_copy_0 = mi_copy
2026-02-21T09:13:05.8319101Z         di_copy_0 = di_copy
2026-02-21T09:13:05.8319297Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T09:13:05.8319667Z         values = tl.load(x + (indices_0[:, None] * 896 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first')
2026-02-21T09:13:05.8320064Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T09:13:05.8320466Z         _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16))
2026-02-21T09:13:05.8320859Z         local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16)
2026-02-21T09:13:05.8321119Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T09:13:05.8321353Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T09:13:05.8321663Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T09:13:05.8321924Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:13:05.8322168Z         v_2 = mi_copy_0 - v_1
2026-02-21T09:13:05.8322336Z         v_3 = libdevice.exp(v_2)
2026-02-21T09:13:05.8322507Z         v_4 = di_copy_0 * v_3
2026-02-21T09:13:05.8322689Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T09:13:05.8322915Z         subscript = v_1[:, None]
2026-02-21T09:13:05.8323093Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T09:13:05.8323270Z         v_6 = v_5 - subscript
2026-02-21T09:13:05.8323485Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:13:05.8323742Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T09:13:05.8323961Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T09:13:05.8324150Z         v_7 = libdevice.exp(v_6)
2026-02-21T09:13:05.8324467Z         _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32))
2026-02-21T09:13:05.8324827Z         sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32)
2026-02-21T09:13:05.8325019Z         di = v_4 + sum_1
2026-02-21T09:13:05.8325180Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T09:13:05.8325351Z         mi = v_1
2026-02-21T09:13:05.8325555Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:13:05.8325823Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T09:13:05.8326109Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:13:05.8326555Z     for offset_2 in tl.range(0, 896, _BLOCK_SIZE_1, loop_unroll_factor=4, num_stages=1, disallow_acc_multi_buffer=True, flatten=True):
2026-02-21T09:13:05.8326958Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:13:05.8327191Z         mask_2 = indices_2 < 896
2026-02-21T09:13:05.8327352Z         mi_copy_1 = mi
2026-02-21T09:13:05.8327584Z         di_copy_1 = di
2026-02-21T09:13:05.8327737Z         mi_copy_1_0 = mi_copy_1
2026-02-21T09:13:05.8327899Z         di_copy_1_0 = di_copy_1
2026-02-21T09:13:05.8328088Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T09:13:05.8328445Z         values_1 = tl.load(x + (indices_0[:, None] * 896 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first')
2026-02-21T09:13:05.8328877Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:13:05.8329149Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T09:13:05.8329345Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T09:13:05.8329531Z         v_10 = v_9 - subscript_1
2026-02-21T09:13:05.8329700Z         v_11 = libdevice.exp(v_10)
2026-02-21T09:13:05.8329883Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T09:13:05.8330062Z         v_12 = v_11 / subscript_2
2026-02-21T09:13:05.8330307Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T09:13:05.8330579Z         tl.store(out + (indices_0[:, None] * 896 + indices_2[None, :] * 1), v_13, mask_2[None, :])
2026-02-21T09:13:05.8330801Z 
2026-02-21T09:13:05.8330928Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T09:13:05.8331165Z     """
2026-02-21T09:13:05.8331365Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T09:13:05.8331715Z     This version uses fewer passes but is less numerically stable.
2026-02-21T09:13:05.8331931Z     Args:
2026-02-21T09:13:05.8332096Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T09:13:05.8332289Z     Returns:
2026-02-21T09:13:05.8332476Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T09:13:05.8332689Z     """
2026-02-21T09:13:05.8332835Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T09:13:05.8333019Z     m, n = x.size()
2026-02-21T09:13:05.8333187Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T09:13:05.8333417Z     out = torch.empty_like(x)
2026-02-21T09:13:05.8333655Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:13:05.8333990Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:13:05.8334316Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:13:05.8334567Z     # src[softmax.py:79-92]: ...
2026-02-21T09:13:05.8334828Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=1, num_stages=2)
2026-02-21T09:13:05.8335119Z     # src[softmax.py:93]: return out
2026-02-21T09:13:05.8335302Z     return out
2026-02-21T09:13:06.5882130Z WARNING:tritonbench.utils.triton_op:Completed input ID 5:
2026-02-21T09:13:06.5883875Z (M, N)
2026-02-21T09:13:06.5884048Z -----------
2026-02-21T09:13:06.5884181Z (4096, 896)
2026-02-21T09:13:06.5884256Z 
2026-02-21T09:13:06.5889110Z  10%|█         | 2/20 [04:11<38:07, 127.09s/it]WARNING:tritonbench.utils.triton_op:Running input ID 10:
2026-02-21T09:13:06.5893302Z (M, N)
2026-02-21T09:13:06.5898325Z ------------
2026-02-21T09:13:06.5902707Z (4096, 1536)
2026-02-21T09:13:06.5907280Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T09:13:08.0589372Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T09:13:09.5184751Z INFO:tritonbench.utils.triton_op:Took 2.16ms to get benchmark function for torch_compile_softmax
2026-02-21T09:13:10.6219436Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:13:10.6221430Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:13:10.6221849Z               'dtype': 'torch.float16',
2026-02-21T09:13:10.6222037Z               'shape': (4096, 1536),
2026-02-21T09:13:10.6222216Z               'stride': (1536, 1)},),
2026-02-21T09:13:10.6222389Z   'kwargs': {}}
2026-02-21T09:13:10.6236824Z INFO:tritonbench.utils.triton_op:Took 1.92ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:13:10.7955389Z [0s] Autotune random seed: 2138408546
2026-02-21T09:13:10.8200863Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:13:43.0809084Z [32s] Timeout after 30s compiling Config(block_sizes=[512, 1024], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None])
2026-02-21T09:13:43.5386981Z [32s] Timeout after 30s compiling Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None])
2026-02-21T09:13:43.5396967Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.0 configs/s
2026-02-21T09:13:49.4175661Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 17.1 configs/s
2026-02-21T09:13:49.4186336Z [38s] Adaptive compile timeout: 30s (90% percentile=2.8s, bounds=[30.0s, 30s])
2026-02-21T09:13:50.1077990Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1448.2 configs/s
2026-02-21T09:13:50.1688244Z [39s] Initial random population of 100, 5 starting points: 
2026-02-21T09:13:50.1692580Z error=7
2026-02-21T09:13:50.1693855Z timeout=2
2026-02-21T09:13:50.1694017Z ok=91
2026-02-21T09:13:50.1694141Z min=0.0225
2026-02-21T09:13:50.1694275Z mid=0.1536
2026-02-21T09:13:50.1694397Z max=35.8574
2026-02-21T09:13:50.1694547Z best={'block_sizes': [32, 512],
2026-02-21T09:13:50.1694798Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:13:50.1695058Z  'load_eviction_policies': ['', 'first'],
2026-02-21T09:13:50.1695276Z  'num_stages': 3,
2026-02-21T09:13:50.1695432Z  'num_warps': 32,
2026-02-21T09:13:50.1695577Z  'pid_type': 'flat',
2026-02-21T09:13:50.1695736Z  'range_flattens': [None, False],
2026-02-21T09:13:50.1695919Z  'range_multi_buffers': [None, None],
2026-02-21T09:13:50.1696099Z  'range_num_stages': [0, 1],
2026-02-21T09:13:50.1696270Z  'range_unroll_factors': [0, 1],
2026-02-21T09:13:50.1696445Z  'range_warp_specializes': [None, False]}
2026-02-21T09:13:50.1710710Z [39s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:13:51.4984015Z [40s] Generation 1 starting: 98 neighbors, 5 active search path(s)
2026-02-21T09:14:24.1054854Z [73s] Timeout after 30s compiling Config(block_sizes=[16, 512], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['first', 'last'], num_sm_multiplier=4, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, True], range_num_stages=[2, 1], range_unroll_factors=[4, 3], range_warp_specializes=[None, False])
2026-02-21T09:14:24.1072366Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 102/102 0.6 configs/s
2026-02-21T09:14:30.0251486Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 102/102 17.4 configs/s
2026-02-21T09:14:31.7495919Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 588.8 configs/s
2026-02-21T09:14:31.8962250Z [81s] Generation 1 complete: 
2026-02-21T09:14:31.8963960Z error=4
2026-02-21T09:14:31.8964130Z timeout=1
2026-02-21T09:14:31.8964261Z ok=99
2026-02-21T09:14:31.8964401Z min=0.0143
2026-02-21T09:14:31.8964531Z mid=0.0267
2026-02-21T09:14:31.8964667Z max=0.1085
2026-02-21T09:14:31.8964808Z best={'block_sizes': [16, 256],
2026-02-21T09:14:31.8965024Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:14:31.8965236Z  'load_eviction_policies': ['', 'last'],
2026-02-21T09:14:31.8965423Z  'num_stages': 2,
2026-02-21T09:14:31.8965592Z  'num_warps': 16,
2026-02-21T09:14:31.8966065Z  'pid_type': 'flat',
2026-02-21T09:14:31.8966227Z  'range_flattens': [None, True],
2026-02-21T09:14:31.8966400Z  'range_multi_buffers': [None, True],
2026-02-21T09:14:31.8966588Z  'range_num_stages': [0, 1],
2026-02-21T09:14:31.8966753Z  'range_unroll_factors': [0, 3],
2026-02-21T09:14:31.8966939Z  'range_warp_specializes': [None, False]}
2026-02-21T09:14:31.8977718Z [81s] Fitting surrogate: 204 points, 204 targets
2026-02-21T09:14:33.0148034Z [82s] Generation 2 starting: 87 neighbors, 5 active search path(s)
2026-02-21T09:14:38.6470853Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 6.9 configs/s
2026-02-21T09:14:43.9114906Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 17.0 configs/s
2026-02-21T09:14:47.5768347Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 278.0 configs/s
2026-02-21T09:14:47.8686548Z [97s] Generation 2 complete: 
2026-02-21T09:14:47.8690643Z error=3
2026-02-21T09:14:47.8695023Z ok=90
2026-02-21T09:14:47.8696488Z min=0.0143
2026-02-21T09:14:47.8696658Z mid=0.0205
2026-02-21T09:14:47.8696836Z max=0.0942
2026-02-21T09:14:47.8696988Z best={'block_sizes': [16, 256],
2026-02-21T09:14:47.8697229Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:14:47.8697452Z  'load_eviction_policies': ['', 'last'],
2026-02-21T09:14:47.8697646Z  'num_stages': 2,
2026-02-21T09:14:47.8697792Z  'num_warps': 16,
2026-02-21T09:14:47.8697942Z  'pid_type': 'flat',
2026-02-21T09:14:47.8698114Z  'range_flattens': [None, True],
2026-02-21T09:14:47.8698297Z  'range_multi_buffers': [None, True],
2026-02-21T09:14:47.8698493Z  'range_num_stages': [0, 1],
2026-02-21T09:14:47.8698662Z  'range_unroll_factors': [0, 3],
2026-02-21T09:14:47.8698855Z  'range_warp_specializes': [None, False]}
2026-02-21T09:14:47.8701333Z [97s] Fitting surrogate: 297 points, 297 targets
2026-02-21T09:14:49.1202006Z [98s] Generation 3 starting: 88 neighbors, 5 active search path(s)
2026-02-21T09:14:53.7005438Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88/88 14.7 configs/s
2026-02-21T09:14:58.8533157Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 88/88 17.2 configs/s
2026-02-21T09:15:03.1168076Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 239.9 configs/s
2026-02-21T09:15:03.4606159Z [112s] Generation 3 complete: 
2026-02-21T09:15:03.4607903Z error=4
2026-02-21T09:15:03.4608059Z ok=90
2026-02-21T09:15:03.4608185Z min=0.0143
2026-02-21T09:15:03.4608320Z mid=0.0184
2026-02-21T09:15:03.4608439Z max=0.0716
2026-02-21T09:15:03.4608586Z best={'block_sizes': [16, 256],
2026-02-21T09:15:03.4608793Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:15:03.4609009Z  'load_eviction_policies': ['', 'last'],
2026-02-21T09:15:03.4609184Z  'num_stages': 1,
2026-02-21T09:15:03.4609365Z  'num_warps': 16,
2026-02-21T09:15:03.4609515Z  'pid_type': 'flat',
2026-02-21T09:15:03.4609677Z  'range_flattens': [None, True],
2026-02-21T09:15:03.4609860Z  'range_multi_buffers': [None, None],
2026-02-21T09:15:03.4610040Z  'range_num_stages': [0, 1],
2026-02-21T09:15:03.4610207Z  'range_unroll_factors': [0, 3],
2026-02-21T09:15:03.4610383Z  'range_warp_specializes': [None, False]}
2026-02-21T09:15:03.4624735Z [112s] Fitting surrogate: 391 points, 391 targets
2026-02-21T09:15:04.4576758Z [113s] Generation 4 starting: 71 neighbors, 5 active search path(s)
2026-02-21T09:15:08.4866140Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 12.6 configs/s
2026-02-21T09:15:13.0659980Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 16.5 configs/s
2026-02-21T09:15:16.5543043Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 292.5 configs/s
2026-02-21T09:15:16.8499882Z [126s] Generation 4 complete: 
2026-02-21T09:15:16.8504019Z ok=77
2026-02-21T09:15:16.8505900Z min=0.0143
2026-02-21T09:15:16.8506103Z mid=0.0164
2026-02-21T09:15:16.8506243Z max=0.0389
2026-02-21T09:15:16.8506388Z best={'block_sizes': [32, 256],
2026-02-21T09:15:16.8506619Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:15:16.8506840Z  'load_eviction_policies': ['', 'first'],
2026-02-21T09:15:16.8507039Z  'num_stages': 4,
2026-02-21T09:15:16.8507185Z  'num_warps': 32,
2026-02-21T09:15:16.8507335Z  'pid_type': 'flat',
2026-02-21T09:15:16.8507501Z  'range_flattens': [None, False],
2026-02-21T09:15:16.8507698Z  'range_multi_buffers': [None, None],
2026-02-21T09:15:16.8507887Z  'range_num_stages': [0, 1],
2026-02-21T09:15:16.8508069Z  'range_unroll_factors': [0, 1],
2026-02-21T09:15:16.8508260Z  'range_warp_specializes': [None, False]}
2026-02-21T09:15:16.8514475Z [126s] Fitting surrogate: 468 points, 468 targets
2026-02-21T09:15:17.9311302Z [127s] Generation 5 starting: 69 neighbors, 5 active search path(s)
2026-02-21T09:15:22.1706902Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 11.5 configs/s
2026-02-21T09:15:26.4579173Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 16.5 configs/s
2026-02-21T09:15:28.4272699Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 517.6 configs/s
2026-02-21T09:15:28.6088532Z [137s] Generation 5 complete: 
2026-02-21T09:15:28.6092801Z ok=74
2026-02-21T09:15:28.6096837Z min=0.0102
2026-02-21T09:15:28.6097078Z mid=0.0164
2026-02-21T09:15:28.6097232Z max=0.0471
2026-02-21T09:15:28.6097373Z best={'block_sizes': [1, 2048],
2026-02-21T09:15:28.6097630Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:15:28.6097901Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:15:28.6103361Z  'num_stages': 4,
2026-02-21T09:15:28.6105530Z  'num_warps': 2,
2026-02-21T09:15:28.6105797Z  'pid_type': 'flat',
2026-02-21T09:15:28.6109686Z  'range_flattens': [None, True],
2026-02-21T09:15:28.6112869Z  'range_multi_buffers': [None, None],
2026-02-21T09:15:28.6116127Z  'range_num_stages': [0, 3],
2026-02-21T09:15:28.6121062Z  'range_unroll_factors': [0, 0],
2026-02-21T09:15:28.6125370Z  'range_warp_specializes': [None, False]}
2026-02-21T09:15:28.6129483Z [137s] Fitting surrogate: 542 points, 542 targets
2026-02-21T09:15:29.3717402Z [138s] Generation 6 starting: 39 neighbors, 3 active search path(s)
2026-02-21T09:15:31.5818662Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 20.4 configs/s
2026-02-21T09:15:33.9703356Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 39/39 16.6 configs/s
2026-02-21T09:15:35.7459081Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 574.5 configs/s
2026-02-21T09:15:35.9118752Z [145s] Generation 6 complete: 
2026-02-21T09:15:35.9123035Z ok=42
2026-02-21T09:15:35.9124806Z min=0.0102
2026-02-21T09:15:35.9124974Z mid=0.0143
2026-02-21T09:15:35.9125097Z max=0.0205
2026-02-21T09:15:35.9125303Z best={'block_sizes': [1, 2048],
2026-02-21T09:15:35.9125555Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:15:35.9129012Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:15:35.9132258Z  'num_stages': 4,
2026-02-21T09:15:35.9136243Z  'num_warps': 2,
2026-02-21T09:15:35.9140195Z  'pid_type': 'flat',
2026-02-21T09:15:35.9140465Z  'range_flattens': [None, True],
2026-02-21T09:15:35.9140683Z  'range_multi_buffers': [None, None],
2026-02-21T09:15:35.9144675Z  'range_num_stages': [0, 3],
2026-02-21T09:15:35.9149062Z  'range_unroll_factors': [0, 0],
2026-02-21T09:15:35.9153396Z  'range_warp_specializes': [None, False]}
2026-02-21T09:15:35.9157794Z [145s] Fitting surrogate: 584 points, 584 targets
2026-02-21T09:15:36.5860340Z [145s] Generation 7 starting: 32 neighbors, 3 active search path(s)
2026-02-21T09:15:38.4718160Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 35.4 configs/s
2026-02-21T09:15:40.4575886Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 32/32 16.5 configs/s
2026-02-21T09:15:41.8450439Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 733.2 configs/s
2026-02-21T09:15:41.9755496Z [151s] Generation 7 complete: 
2026-02-21T09:15:41.9759756Z ok=35
2026-02-21T09:15:41.9764149Z min=0.0102
2026-02-21T09:15:41.9768476Z mid=0.0143
2026-02-21T09:15:41.9770468Z max=0.0245
2026-02-21T09:15:41.9770652Z best={'block_sizes': [1, 2048],
2026-02-21T09:15:41.9770906Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:15:41.9771181Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:15:41.9771383Z  'num_stages': 4,
2026-02-21T09:15:41.9771524Z  'num_warps': 2,
2026-02-21T09:15:41.9772148Z  'pid_type': 'flat',
2026-02-21T09:15:41.9772343Z  'range_flattens': [None, True],
2026-02-21T09:15:41.9772540Z  'range_multi_buffers': [None, None],
2026-02-21T09:15:41.9772727Z  'range_num_stages': [0, 3],
2026-02-21T09:15:41.9772905Z  'range_unroll_factors': [0, 0],
2026-02-21T09:15:41.9773087Z  'range_warp_specializes': [None, False]}
2026-02-21T09:15:41.9774448Z [151s] Fitting surrogate: 619 points, 619 targets
2026-02-21T09:15:42.5528535Z [151s] Generation 8 starting: 20 neighbors, 2 active search path(s)
2026-02-21T09:15:44.2934811Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 10.9 configs/s
2026-02-21T09:15:45.5174285Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 20/20 17.0 configs/s
2026-02-21T09:15:46.5777914Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 957.3 configs/s
2026-02-21T09:15:46.6769588Z [155s] Generation 8 complete: 
2026-02-21T09:15:46.6771350Z ok=23
2026-02-21T09:15:46.6771765Z min=0.0102
2026-02-21T09:15:46.6771937Z mid=0.0143
2026-02-21T09:15:46.6772103Z max=0.0246
2026-02-21T09:15:46.6772273Z best={'block_sizes': [1, 2048],
2026-02-21T09:15:46.6772548Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:15:46.6772864Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:15:46.6773082Z  'num_stages': 4,
2026-02-21T09:15:46.6773247Z  'num_warps': 2,
2026-02-21T09:15:46.6773415Z  'pid_type': 'flat',
2026-02-21T09:15:46.6773603Z  'range_flattens': [None, True],
2026-02-21T09:15:46.6773814Z  'range_multi_buffers': [None, None],
2026-02-21T09:15:46.6774027Z  'range_num_stages': [0, 3],
2026-02-21T09:15:46.6774203Z  'range_unroll_factors': [0, 0],
2026-02-21T09:15:46.6774413Z  'range_warp_specializes': [None, False]}
2026-02-21T09:15:46.6787887Z [155s] Fitting surrogate: 642 points, 642 targets
2026-02-21T09:15:47.2108383Z [156s] Generation 9 starting: 22 neighbors, 2 active search path(s)
2026-02-21T09:15:48.4005950Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 40.1 configs/s
2026-02-21T09:15:49.7509363Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 22/22 16.8 configs/s
2026-02-21T09:15:50.9633808Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 837.8 configs/s
2026-02-21T09:15:51.0846715Z [160s] Generation 9 complete: 
2026-02-21T09:15:51.0852308Z ok=25
2026-02-21T09:15:51.0857265Z min=0.0102
2026-02-21T09:15:51.0859660Z mid=0.0143
2026-02-21T09:15:51.0859872Z max=0.0184
2026-02-21T09:15:51.0860033Z best={'block_sizes': [1, 2048],
2026-02-21T09:15:51.0860325Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:15:51.0860610Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:15:51.0860824Z  'num_stages': 4,
2026-02-21T09:15:51.0860974Z  'num_warps': 2,
2026-02-21T09:15:51.0861133Z  'pid_type': 'flat',
2026-02-21T09:15:51.0861814Z  'range_flattens': [None, True],
2026-02-21T09:15:51.0862019Z  'range_multi_buffers': [None, None],
2026-02-21T09:15:51.0862228Z  'range_num_stages': [0, 3],
2026-02-21T09:15:51.0862404Z  'range_unroll_factors': [0, 0],
2026-02-21T09:15:51.0862606Z  'range_warp_specializes': [None, False]}
2026-02-21T09:15:51.0866885Z [160s] Fitting surrogate: 667 points, 667 targets
2026-02-21T09:15:51.5686130Z [160s] Generation 10 starting: 11 neighbors, 1 active search path(s)
2026-02-21T09:15:52.6069573Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 23.5 configs/s
2026-02-21T09:15:53.2620033Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 18.1 configs/s
2026-02-21T09:15:53.4724157Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 4625.7 configs/s
2026-02-21T09:15:53.5022695Z [162s] Generation 10 complete: 
2026-02-21T09:15:53.5024263Z ok=13
2026-02-21T09:15:53.5024534Z min=0.0102
2026-02-21T09:15:53.5029596Z mid=0.0184
2026-02-21T09:15:53.5031277Z max=0.0348
2026-02-21T09:15:53.5031507Z best={'block_sizes': [1, 2048],
2026-02-21T09:15:53.5032009Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:15:53.5032310Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:15:53.5032527Z  'num_stages': 4,
2026-02-21T09:15:53.5032663Z  'num_warps': 2,
2026-02-21T09:15:53.5032809Z  'pid_type': 'flat',
2026-02-21T09:15:53.5032962Z  'range_flattens': [None, True],
2026-02-21T09:15:53.5033144Z  'range_multi_buffers': [None, None],
2026-02-21T09:15:53.5033324Z  'range_num_stages': [0, 3],
2026-02-21T09:15:53.5033491Z  'range_unroll_factors': [0, 0],
2026-02-21T09:15:53.5033668Z  'range_warp_specializes': [None, False]}
2026-02-21T09:15:53.5043064Z [162s] Fitting surrogate: 680 points, 680 targets
2026-02-21T09:15:53.7777446Z [162s] Autotuning complete in 163.0s after searching 645 configs.
2026-02-21T09:15:53.7777786Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:15:53.7778771Z     @helion.kernel(config=helion.Config(block_sizes=[1, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[None, False]), static_shapes=True)
2026-02-21T09:15:53.7779603Z 
2026-02-21T09:15:53.7779858Z [162s] Code of selected kernel: /tmp/torchinductor_root/rq/crqwxm54fi4dqrcw7a7smoqs52tonjm5ukqqar7yfqzsg6sdozrg.py
2026-02-21T09:15:53.7998319Z from __future__ import annotations
2026-02-21T09:15:53.8000219Z 
2026-02-21T09:15:53.8000387Z import torch
2026-02-21T09:15:53.8000567Z import triton
2026-02-21T09:15:53.8000722Z import triton.language as tl
2026-02-21T09:15:53.8000942Z from torch._inductor.runtime import triton_helpers
2026-02-21T09:15:53.8001223Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T09:15:53.8002014Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:15:53.8002193Z 
2026-02-21T09:15:53.8002266Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T09:15:53.8002460Z _BLOCK_SIZE_1 = tl.constexpr(2048)
2026-02-21T09:15:53.8002579Z 
2026-02-21T09:15:53.8002646Z @triton.jit
2026-02-21T09:15:53.8002790Z def _helion_softmax_two_pass(x, out):
2026-02-21T09:15:53.8003043Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:15:53.8003286Z     pid_0 = tl.program_id(0)
2026-02-21T09:15:53.8003454Z     offset_0 = pid_0
2026-02-21T09:15:53.8003621Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T09:15:53.8003905Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:15:53.8004197Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T09:15:53.8004457Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:15:53.8004834Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T09:15:53.8005087Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:15:53.8005363Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T09:15:53.8005608Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T09:15:53.8005854Z     # src[softmax.py:82-89]: ...
2026-02-21T09:15:53.8006150Z     for offset_2 in tl.range(0, 1536, _BLOCK_SIZE_1, warp_specialize=False, num_stages=3, flatten=True):
2026-02-21T09:15:53.8006505Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:15:53.8006769Z         mask_1 = indices_2 < 1536
2026-02-21T09:15:53.8006939Z         mi_copy = mi
2026-02-21T09:15:53.8007081Z         di_copy = di
2026-02-21T09:15:53.8007230Z         mi_copy_0 = mi_copy
2026-02-21T09:15:53.8007382Z         di_copy_0 = di_copy
2026-02-21T09:15:53.8007573Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T09:15:53.8007943Z         values = tl.load(x + (indices_0[:, None] * 1536 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first')
2026-02-21T09:15:53.8008332Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T09:15:53.8008743Z         _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16))
2026-02-21T09:15:53.8009131Z         local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16)
2026-02-21T09:15:53.8009399Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T09:15:53.8009631Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T09:15:53.8009838Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T09:15:53.8010088Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:15:53.8010326Z         v_2 = mi_copy_0 - v_1
2026-02-21T09:15:53.8010498Z         v_3 = libdevice.exp(v_2)
2026-02-21T09:15:53.8010667Z         v_4 = di_copy_0 * v_3
2026-02-21T09:15:53.8010858Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T09:15:53.8011054Z         subscript = v_1[:, None]
2026-02-21T09:15:53.8011228Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T09:15:53.8011402Z         v_6 = v_5 - subscript
2026-02-21T09:15:53.8011694Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:15:53.8011956Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T09:15:53.8012178Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T09:15:53.8012370Z         v_7 = libdevice.exp(v_6)
2026-02-21T09:15:53.8012686Z         _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32))
2026-02-21T09:15:53.8013050Z         sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32)
2026-02-21T09:15:53.8013248Z         di = v_4 + sum_1
2026-02-21T09:15:53.8013423Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T09:15:53.8013691Z         mi = v_1
2026-02-21T09:15:53.8013899Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:15:53.8014181Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T09:15:53.8014479Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:15:53.8014882Z     for offset_2 in tl.range(0, 1536, _BLOCK_SIZE_1, warp_specialize=False, num_stages=3, flatten=True):
2026-02-21T09:15:53.8015227Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:15:53.8015475Z         mask_2 = indices_2 < 1536
2026-02-21T09:15:53.8015644Z         mi_copy_1 = mi
2026-02-21T09:15:53.8015805Z         di_copy_1 = di
2026-02-21T09:15:53.8015963Z         mi_copy_1_0 = mi_copy_1
2026-02-21T09:15:53.8016124Z         di_copy_1_0 = di_copy_1
2026-02-21T09:15:53.8016315Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T09:15:53.8016740Z         values_1 = tl.load(x + (indices_0[:, None] * 1536 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first')
2026-02-21T09:15:53.8017184Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:15:53.8017460Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T09:15:53.8017654Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T09:15:53.8017841Z         v_10 = v_9 - subscript_1
2026-02-21T09:15:53.8018009Z         v_11 = libdevice.exp(v_10)
2026-02-21T09:15:53.8018191Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T09:15:53.8018366Z         v_12 = v_11 / subscript_2
2026-02-21T09:15:53.8018542Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T09:15:53.8018810Z         tl.store(out + (indices_0[:, None] * 1536 + indices_2[None, :] * 1), v_13, mask_2[None, :])
2026-02-21T09:15:53.8019030Z 
2026-02-21T09:15:53.8019159Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T09:15:53.8019402Z     """
2026-02-21T09:15:53.8019607Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T09:15:53.8019915Z     This version uses fewer passes but is less numerically stable.
2026-02-21T09:15:53.8020129Z     Args:
2026-02-21T09:15:53.8020296Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T09:15:53.8020485Z     Returns:
2026-02-21T09:15:53.8020666Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T09:15:53.8020870Z     """
2026-02-21T09:15:53.8021013Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T09:15:53.8021186Z     m, n = x.size()
2026-02-21T09:15:53.8021354Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T09:15:53.8021603Z     out = torch.empty_like(x)
2026-02-21T09:15:53.8021824Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:15:53.8022138Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:15:53.8022446Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:15:53.8022686Z     # src[softmax.py:79-92]: ...
2026-02-21T09:15:53.8022935Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=2, num_stages=4)
2026-02-21T09:15:53.8023208Z     # src[softmax.py:93]: return out
2026-02-21T09:15:53.8023378Z     return out
2026-02-21T09:15:54.8772251Z WARNING:tritonbench.utils.triton_op:Completed input ID 10:
2026-02-21T09:15:54.8776584Z (M, N)
2026-02-21T09:15:54.8781212Z ------------
2026-02-21T09:15:54.8782600Z (4096, 1536)
2026-02-21T09:15:54.8782721Z 
2026-02-21T09:15:54.8783245Z  15%|█▌        | 3/20 [07:00<41:20, 145.90s/it]WARNING:tritonbench.utils.triton_op:Running input ID 15:
2026-02-21T09:15:54.8787751Z (M, N)
2026-02-21T09:15:54.8792145Z ------------
2026-02-21T09:15:54.8794025Z (4096, 2176)
2026-02-21T09:15:54.8794335Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T09:15:56.2953052Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T09:15:57.6638814Z INFO:tritonbench.utils.triton_op:Took 2.49ms to get benchmark function for torch_compile_softmax
2026-02-21T09:16:02.4500955Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:16:02.4502712Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:16:02.4502934Z               'dtype': 'torch.float16',
2026-02-21T09:16:02.4503135Z               'shape': (4096, 2176),
2026-02-21T09:16:02.4503316Z               'stride': (2176, 1)},),
2026-02-21T09:16:02.4503496Z   'kwargs': {}}
2026-02-21T09:16:02.4522953Z INFO:tritonbench.utils.triton_op:Took 2.43ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:16:02.8604451Z [0s] Autotune random seed: 2138408546
2026-02-21T09:16:02.8868412Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:16:35.3679518Z [32s] Timeout after 30s compiling Config(block_sizes=[256, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None])
2026-02-21T09:16:35.8899428Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.9 configs/s
2026-02-21T09:16:37.8999170Z module {
2026-02-21T09:16:37.9003205Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:16:37.9004496Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:16:37.9004734Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:16:37.9004919Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:16:37.9005105Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:16:37.9005336Z     %cst = arith.constant dense<2176> : tensor<16x1xi32>
2026-02-21T09:16:37.9005607Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<16xf32>
2026-02-21T09:16:37.9005860Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<16xf32>
2026-02-21T09:16:37.9006080Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:16:37.9006263Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:16:37.9006450Z     %c2176_i32 = arith.constant 2176 : i32
2026-02-21T09:16:37.9006634Z     %c2176_i64 = arith.constant 2176 : i64
2026-02-21T09:16:37.9006806Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T09:16:37.9007125Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c2176_i32], [%c2176_i64, %c1_i64] : <f16>, <tensor<16x128xf16>>
2026-02-21T09:16:37.9007486Z     %1 = tt.get_program_id x : i32
2026-02-21T09:16:37.9007661Z     %2 = arith.addi %1, %c1_i32 : i32
2026-02-21T09:16:37.9007842Z     %3 = arith.minsi %2, %c256_i32 : i32
2026-02-21T09:16:37.9008034Z     scf.for %arg2 = %1 to %3 step %c1_i32  : i32 {
2026-02-21T09:16:37.9008246Z       %4 = arith.muli %arg2, %c16_i32 : i32
2026-02-21T09:16:37.9008483Z       %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T09:16:37.9008734Z       %6 = tt.splat %4 : i32 -> tensor<16xi32>
2026-02-21T09:16:37.9008932Z       %7 = arith.addi %6, %5 : tensor<16xi32>
2026-02-21T09:16:37.9009118Z       %c2048_i32 = arith.constant 2048 : i32
2026-02-21T09:16:37.9009307Z       %c512_i32 = arith.constant 512 : i32
2026-02-21T09:16:37.9009677Z       %8:2 = scf.for %arg3 = %c0_i32 to %c2048_i32 step %c512_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<16xf32>, tensor<16xf32>)  : i32 {
2026-02-21T09:16:37.9010157Z         %50 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:16:37.9010488Z         %51 = arith.extf %50 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:16:37.9010749Z         %52 = "tt.reduce"(%51) <{axis = 1 : i32}> ({
2026-02-21T09:16:37.9010950Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:16:37.9011147Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:16:37.9011863Z           tt.reduce.return %128 : f32
2026-02-21T09:16:37.9012062Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:16:37.9012288Z         %53 = arith.truncf %52 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:16:37.9012538Z         %54 = arith.extf %53 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:16:37.9012797Z         %55 = arith.cmpf ogt, %arg4, %54 : tensor<16xf32>
2026-02-21T09:16:37.9013032Z         %56 = arith.cmpf une, %arg4, %arg4 : tensor<16xf32>
2026-02-21T09:16:37.9013260Z         %57 = arith.ori %55, %56 : tensor<16xi1>
2026-02-21T09:16:37.9013506Z         %58 = arith.select %57, %arg4, %54 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:16:37.9013761Z         %59 = arith.subf %arg4, %58 : tensor<16xf32>
2026-02-21T09:16:37.9014135Z         %60 = tt.extern_elementwise %59 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:16:37.9014651Z         %61 = arith.mulf %arg5, %60 : tensor<16xf32>
2026-02-21T09:16:37.9014928Z         %62 = tt.expand_dims %58 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:16:37.9015243Z         %63 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:16:37.9015498Z         %64 = arith.subf %51, %63 : tensor<16x128xf32>
2026-02-21T09:16:37.9015882Z         %65 = tt.extern_elementwise %64 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:16:37.9016271Z         %66 = "tt.reduce"(%65) <{axis = 1 : i32}> ({
2026-02-21T09:16:37.9016467Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:16:37.9016665Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:16:37.9016859Z           tt.reduce.return %128 : f32
2026-02-21T09:16:37.9017060Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:16:37.9017277Z         %67 = arith.addf %61, %66 : tensor<16xf32>
2026-02-21T09:16:37.9017481Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T09:16:37.9017690Z         %68 = arith.muli %c128_i32, %c1_i32_4 : i32
2026-02-21T09:16:37.9017890Z         %69 = arith.addi %arg3, %68 : i32
2026-02-21T09:16:37.9018185Z         %70 = tt.descriptor_load %0[%4, %69] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:16:37.9018516Z         %71 = arith.extf %70 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:16:37.9018760Z         %72 = "tt.reduce"(%71) <{axis = 1 : i32}> ({
2026-02-21T09:16:37.9018961Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:16:37.9019159Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:16:37.9019368Z           tt.reduce.return %128 : f32
2026-02-21T09:16:37.9019565Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:16:37.9019801Z         %73 = arith.truncf %72 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:16:37.9020054Z         %74 = arith.extf %73 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:16:37.9020302Z         %75 = arith.cmpf ogt, %58, %74 : tensor<16xf32>
2026-02-21T09:16:37.9020531Z         %76 = arith.cmpf une, %58, %58 : tensor<16xf32>
2026-02-21T09:16:37.9020738Z         %77 = arith.ori %75, %76 : tensor<16xi1>
2026-02-21T09:16:37.9020982Z         %78 = arith.select %77, %58, %74 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:16:37.9021227Z         %79 = arith.subf %58, %78 : tensor<16xf32>
2026-02-21T09:16:37.9021618Z         %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:16:37.9021967Z         %81 = arith.mulf %67, %80 : tensor<16xf32>
2026-02-21T09:16:37.9022219Z         %82 = tt.expand_dims %78 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:16:37.9022517Z         %83 = tt.broadcast %82 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:16:37.9022751Z         %84 = arith.subf %71, %83 : tensor<16x128xf32>
2026-02-21T09:16:37.9023117Z         %85 = tt.extern_elementwise %84 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:16:37.9023563Z         %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({
2026-02-21T09:16:37.9023759Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:16:37.9023938Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:16:37.9024136Z           tt.reduce.return %128 : f32
2026-02-21T09:16:37.9024328Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:16:37.9024529Z         %87 = arith.addf %81, %86 : tensor<16xf32>
2026-02-21T09:16:37.9024727Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:16:37.9024914Z         %88 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T09:16:37.9025112Z         %89 = arith.addi %arg3, %88 : i32
2026-02-21T09:16:37.9025382Z         %90 = tt.descriptor_load %0[%4, %89] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:16:37.9025704Z         %91 = arith.extf %90 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:16:37.9026006Z         %92 = "tt.reduce"(%91) <{axis = 1 : i32}> ({
2026-02-21T09:16:37.9026197Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:16:37.9026384Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:16:37.9026569Z           tt.reduce.return %128 : f32
2026-02-21T09:16:37.9026756Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:16:37.9026972Z         %93 = arith.truncf %92 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:16:37.9027214Z         %94 = arith.extf %93 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:16:37.9027441Z         %95 = arith.cmpf ogt, %78, %94 : tensor<16xf32>
2026-02-21T09:16:37.9027645Z         %96 = arith.cmpf une, %78, %78 : tensor<16xf32>
2026-02-21T09:16:37.9027851Z         %97 = arith.ori %95, %96 : tensor<16xi1>
2026-02-21T09:16:37.9028075Z         %98 = arith.select %97, %78, %94 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:16:37.9028309Z         %99 = arith.subf %78, %98 : tensor<16xf32>
2026-02-21T09:16:37.9028665Z         %100 = tt.extern_elementwise %99 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:16:37.9029032Z         %101 = arith.mulf %87, %100 : tensor<16xf32>
2026-02-21T09:16:37.9029289Z         %102 = tt.expand_dims %98 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:16:37.9029583Z         %103 = tt.broadcast %102 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:16:37.9029837Z         %104 = arith.subf %91, %103 : tensor<16x128xf32>
2026-02-21T09:16:37.9030207Z         %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:16:37.9030584Z         %106 = "tt.reduce"(%105) <{axis = 1 : i32}> ({
2026-02-21T09:16:37.9030772Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:16:37.9030958Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:16:37.9031151Z           tt.reduce.return %128 : f32
2026-02-21T09:16:37.9031337Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:16:37.9031590Z         %107 = arith.addf %101, %106 : tensor<16xf32>
2026-02-21T09:16:37.9031793Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T09:16:37.9031989Z         %108 = arith.muli %c128_i32, %c3_i32 : i32
2026-02-21T09:16:37.9032176Z         %109 = arith.addi %arg3, %108 : i32
2026-02-21T09:16:37.9032464Z         %110 = tt.descriptor_load %0[%4, %109] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:16:37.9032795Z         %111 = arith.extf %110 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:16:37.9033028Z         %112 = "tt.reduce"(%111) <{axis = 1 : i32}> ({
2026-02-21T09:16:37.9033223Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:16:37.9033403Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:16:37.9033599Z           tt.reduce.return %128 : f32
2026-02-21T09:16:37.9033780Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:16:37.9034019Z         %113 = arith.truncf %112 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:16:37.9034343Z         %114 = arith.extf %113 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:16:37.9034572Z         %115 = arith.cmpf ogt, %98, %114 : tensor<16xf32>
2026-02-21T09:16:37.9034797Z         %116 = arith.cmpf une, %98, %98 : tensor<16xf32>
2026-02-21T09:16:37.9034998Z         %117 = arith.ori %115, %116 : tensor<16xi1>
2026-02-21T09:16:37.9035237Z         %118 = arith.select %117, %98, %114 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:16:37.9035472Z         %119 = arith.subf %98, %118 : tensor<16xf32>
2026-02-21T09:16:37.9035836Z         %120 = tt.extern_elementwise %119 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:16:37.9036205Z         %121 = arith.mulf %107, %120 : tensor<16xf32>
2026-02-21T09:16:37.9036461Z         %122 = tt.expand_dims %118 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:16:37.9036817Z         %123 = tt.broadcast %122 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:16:37.9037070Z         %124 = arith.subf %111, %123 : tensor<16x128xf32>
2026-02-21T09:16:37.9037482Z         %125 = tt.extern_elementwise %124 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:16:37.9037853Z         %126 = "tt.reduce"(%125) <{axis = 1 : i32}> ({
2026-02-21T09:16:37.9038048Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:16:37.9038225Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:16:37.9038416Z           tt.reduce.return %128 : f32
2026-02-21T09:16:37.9038596Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:16:37.9038797Z         %127 = arith.addf %121, %126 : tensor<16xf32>
2026-02-21T09:16:37.9039085Z         scf.yield %118, %127 : tensor<16xf32>, tensor<16xf32>
2026-02-21T09:16:37.9039309Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:16:37.9039617Z       %9 = tt.descriptor_load %0[%4, %c2048_i32] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:16:37.9039944Z       %10 = arith.extf %9 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:16:37.9040179Z       %11 = "tt.reduce"(%10) <{axis = 1 : i32}> ({
2026-02-21T09:16:37.9040368Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:16:37.9040554Z         %50 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:16:37.9040739Z         tt.reduce.return %50 : f32
2026-02-21T09:16:37.9040931Z       }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:16:37.9041164Z       %12 = arith.truncf %11 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:16:37.9041399Z       %13 = arith.extf %12 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:16:37.9041673Z       %14 = arith.cmpf ogt, %8#0, %13 : tensor<16xf32>
2026-02-21T09:16:37.9041885Z       %15 = arith.cmpf une, %8#0, %8#0 : tensor<16xf32>
2026-02-21T09:16:37.9042091Z       %16 = arith.ori %14, %15 : tensor<16xi1>
2026-02-21T09:16:37.9042319Z       %17 = arith.select %16, %8#0, %13 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:16:37.9042558Z       %18 = arith.subf %8#0, %17 : tensor<16xf32>
2026-02-21T09:16:37.9042910Z       %19 = tt.extern_elementwise %18 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:16:37.9043257Z       %20 = arith.mulf %8#1, %19 : tensor<16xf32>
2026-02-21T09:16:37.9043511Z       %21 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:16:37.9043795Z       %22 = tt.broadcast %21 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:16:37.9044039Z       %23 = arith.subf %10, %22 : tensor<16x128xf32>
2026-02-21T09:16:37.9044400Z       %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:16:37.9044775Z       %25 = "tt.reduce"(%24) <{axis = 1 : i32}> ({
2026-02-21T09:16:37.9044975Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:16:37.9045149Z         %50 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:16:37.9045446Z         tt.reduce.return %50 : f32
2026-02-21T09:16:37.9045627Z       }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:16:37.9045836Z       %26 = arith.addf %20, %25 : tensor<16xf32>
2026-02-21T09:16:37.9046029Z       %c2048_i32_2 = arith.constant 2048 : i32
2026-02-21T09:16:37.9046227Z       %c512_i32_3 = arith.constant 512 : i32
2026-02-21T09:16:37.9046463Z       scf.for %arg3 = %c0_i32 to %c2048_i32_2 step %c512_i32_3  : i32 {
2026-02-21T09:16:37.9046746Z         %50 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:16:37.9047012Z         %51 = tt.splat %arg3 : i32 -> tensor<128xi32>
2026-02-21T09:16:37.9047215Z         %52 = arith.addi %51, %50 : tensor<128xi32>
2026-02-21T09:16:37.9047468Z         %53 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:16:37.9047733Z         %54 = arith.muli %53, %cst : tensor<16x1xi32>
2026-02-21T09:16:37.9048049Z         %55 = tt.expand_dims %52 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:16:37.9048349Z         %56 = tt.broadcast %54 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:16:37.9048609Z         %57 = tt.broadcast %55 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:16:37.9048848Z         %58 = arith.addi %56, %57 : tensor<16x128xi32>
2026-02-21T09:16:37.9049082Z         %59 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9049367Z         %60 = tt.addptr %59, %58 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:16:37.9049666Z         %61 = tt.load %60 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9049978Z         %62 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:16:37.9050268Z         %63 = arith.extf %61 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:16:37.9050523Z         %64 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:16:37.9050763Z         %65 = arith.subf %63, %64 : tensor<16x128xf32>
2026-02-21T09:16:37.9051129Z         %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:16:37.9051574Z         %67 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:16:37.9051858Z         %68 = tt.broadcast %67 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:16:37.9052084Z         %69 = arith.divf %66, %68 : tensor<16x128xf32>
2026-02-21T09:16:37.9052320Z         %70 = arith.truncf %69 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:16:37.9052589Z         %71 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9052870Z         %72 = tt.addptr %71, %58 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:16:37.9053125Z         tt.store %72, %70 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9053338Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T09:16:37.9053538Z         %73 = arith.muli %c128_i32, %c1_i32_4 : i32
2026-02-21T09:16:37.9053729Z         %74 = arith.addi %arg3, %73 : i32
2026-02-21T09:16:37.9053965Z         %75 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:16:37.9054208Z         %76 = tt.splat %74 : i32 -> tensor<128xi32>
2026-02-21T09:16:37.9054412Z         %77 = arith.addi %76, %75 : tensor<128xi32>
2026-02-21T09:16:37.9054654Z         %78 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:16:37.9054921Z         %79 = arith.muli %78, %cst : tensor<16x1xi32>
2026-02-21T09:16:37.9055179Z         %80 = tt.expand_dims %77 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:16:37.9055465Z         %81 = tt.broadcast %79 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:16:37.9055728Z         %82 = tt.broadcast %80 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:16:37.9055960Z         %83 = arith.addi %81, %82 : tensor<16x128xi32>
2026-02-21T09:16:37.9056205Z         %84 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9056551Z         %85 = tt.addptr %84, %83 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:16:37.9056863Z         %86 = tt.load %85 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9057206Z         %87 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:16:37.9057500Z         %88 = arith.extf %86 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:16:37.9057776Z         %89 = tt.broadcast %87 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:16:37.9058020Z         %90 = arith.subf %88, %89 : tensor<16x128xf32>
2026-02-21T09:16:37.9058421Z         %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:16:37.9058869Z         %92 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:16:37.9059228Z         %93 = tt.broadcast %92 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:16:37.9059481Z         %94 = arith.divf %91, %93 : tensor<16x128xf32>
2026-02-21T09:16:37.9059723Z         %95 = arith.truncf %94 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:16:37.9060008Z         %96 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9060299Z         %97 = tt.addptr %96, %83 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:16:37.9060566Z         tt.store %97, %95 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9060786Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:16:37.9060983Z         %98 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T09:16:37.9061189Z         %99 = arith.addi %arg3, %98 : i32
2026-02-21T09:16:37.9061431Z         %100 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:16:37.9061729Z         %101 = tt.splat %99 : i32 -> tensor<128xi32>
2026-02-21T09:16:37.9061944Z         %102 = arith.addi %101, %100 : tensor<128xi32>
2026-02-21T09:16:37.9062216Z         %103 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:16:37.9062498Z         %104 = arith.muli %103, %cst : tensor<16x1xi32>
2026-02-21T09:16:37.9062773Z         %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:16:37.9063091Z         %106 = tt.broadcast %104 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:16:37.9063371Z         %107 = tt.broadcast %105 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:16:37.9063637Z         %108 = arith.addi %106, %107 : tensor<16x128xi32>
2026-02-21T09:16:37.9063896Z         %109 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9064192Z         %110 = tt.addptr %109, %108 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:16:37.9064524Z         %111 = tt.load %110 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9064854Z         %112 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:16:37.9065172Z         %113 = arith.extf %111 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:16:37.9065498Z         %114 = tt.broadcast %112 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:16:37.9065750Z         %115 = arith.subf %113, %114 : tensor<16x128xf32>
2026-02-21T09:16:37.9066128Z         %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:16:37.9066540Z         %117 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:16:37.9066830Z         %118 = tt.broadcast %117 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:16:37.9067068Z         %119 = arith.divf %116, %118 : tensor<16x128xf32>
2026-02-21T09:16:37.9067313Z         %120 = arith.truncf %119 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:16:37.9067596Z         %121 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9067933Z         %122 = tt.addptr %121, %108 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:16:37.9068203Z         tt.store %122, %120 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9068407Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T09:16:37.9068602Z         %123 = arith.muli %c128_i32, %c3_i32 : i32
2026-02-21T09:16:37.9068791Z         %124 = arith.addi %arg3, %123 : i32
2026-02-21T09:16:37.9069031Z         %125 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:16:37.9069286Z         %126 = tt.splat %124 : i32 -> tensor<128xi32>
2026-02-21T09:16:37.9069490Z         %127 = arith.addi %126, %125 : tensor<128xi32>
2026-02-21T09:16:37.9069748Z         %128 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:16:37.9070012Z         %129 = arith.muli %128, %cst : tensor<16x1xi32>
2026-02-21T09:16:37.9070338Z         %130 = tt.expand_dims %127 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:16:37.9070638Z         %131 = tt.broadcast %129 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:16:37.9070916Z         %132 = tt.broadcast %130 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:16:37.9071162Z         %133 = arith.addi %131, %132 : tensor<16x128xi32>
2026-02-21T09:16:37.9071396Z         %134 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9071719Z         %135 = tt.addptr %134, %133 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:16:37.9072026Z         %136 = tt.load %135 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9072345Z         %137 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:16:37.9072642Z         %138 = arith.extf %136 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:16:37.9072908Z         %139 = tt.broadcast %137 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:16:37.9073157Z         %140 = arith.subf %138, %139 : tensor<16x128xf32>
2026-02-21T09:16:37.9073535Z         %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:16:37.9073963Z         %142 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:16:37.9074259Z         %143 = tt.broadcast %142 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:16:37.9074500Z         %144 = arith.divf %141, %143 : tensor<16x128xf32>
2026-02-21T09:16:37.9074748Z         %145 = arith.truncf %144 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:16:37.9075019Z         %146 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9075308Z         %147 = tt.addptr %146, %133 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:16:37.9075565Z         tt.store %147, %145 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9075784Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:16:37.9076032Z       %27 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:16:37.9076289Z       %28 = tt.splat %c2048_i32_2 : i32 -> tensor<128xi32>
2026-02-21T09:16:37.9076506Z       %29 = arith.addi %28, %27 : tensor<128xi32>
2026-02-21T09:16:37.9076751Z       %30 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:16:37.9077008Z       %31 = arith.muli %30, %cst : tensor<16x1xi32>
2026-02-21T09:16:37.9077258Z       %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:16:37.9077549Z       %33 = tt.broadcast %31 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:16:37.9077815Z       %34 = tt.broadcast %32 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:16:37.9078049Z       %35 = arith.addi %33, %34 : tensor<16x128xi32>
2026-02-21T09:16:37.9078285Z       %36 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9078555Z       %37 = tt.addptr %36, %35 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:16:37.9078911Z       %38 = tt.load %37 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9079208Z       %39 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:16:37.9079493Z       %40 = arith.extf %38 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:16:37.9079756Z       %41 = tt.broadcast %39 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:16:37.9079982Z       %42 = arith.subf %40, %41 : tensor<16x128xf32>
2026-02-21T09:16:37.9080347Z       %43 = tt.extern_elementwise %42 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:16:37.9080748Z       %44 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:16:37.9081030Z       %45 = tt.broadcast %44 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:16:37.9081337Z       %46 = arith.divf %43, %45 : tensor<16x128xf32>
2026-02-21T09:16:37.9081602Z       %47 = arith.truncf %46 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:16:37.9081872Z       %48 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9082139Z       %49 = tt.addptr %48, %35 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:16:37.9082396Z       tt.store %49, %47 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:16:37.9082689Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T09:16:37.9082972Z     tt.return
2026-02-21T09:16:37.9083108Z   }
2026-02-21T09:16:37.9083230Z }
2026-02-21T09:16:37.9083300Z 
2026-02-21T09:16:37.9083362Z {-#
2026-02-21T09:16:37.9083492Z   external_resources: {
2026-02-21T09:16:37.9083659Z     mlir_reproducer: {
2026-02-21T09:16:37.9087953Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T09:16:37.9092358Z       disable_threading: false,
2026-02-21T09:16:37.9092531Z       verify_each: true
2026-02-21T09:16:37.9092671Z     }
2026-02-21T09:16:37.9092793Z   }
2026-02-21T09:16:37.9092904Z #-}
2026-02-21T09:16:37.9093334Z /tmp/torchinductor_root/jy/cjyrwyokcoajngeke66pu5qatfgz2ltezf3ru4xoqva3brfvaa4b.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:16:37.9094532Z /tmp/torchinductor_root/jy/cjyrwyokcoajngeke66pu5qatfgz2ltezf3ru4xoqva3brfvaa4b.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:16:37.9095553Z [35s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:16:37.9096587Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_sm_multiplier=32, num_stages=3, num_warps=32, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[2, 3], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T09:16:37.9097524Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:16:37.9097823Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:16:42.0624354Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 16.1 configs/s
2026-02-21T09:16:42.0635299Z [39s] Adaptive compile timeout: 30s (90% percentile=2.8s, bounds=[30.0s, 30s])
2026-02-21T09:16:42.0637356Z [39s] Initial random population of 100, 5 starting points: 
2026-02-21T09:16:42.0637592Z error=11
2026-02-21T09:16:42.0637726Z timeout=1
2026-02-21T09:16:42.0637861Z ok=88
2026-02-21T09:16:42.0638000Z min=0.0123
2026-02-21T09:16:42.0638146Z mid=0.1992
2026-02-21T09:16:42.0638271Z max=50.0265
2026-02-21T09:16:42.0638424Z best={'block_sizes': [1, 4096],
2026-02-21T09:16:42.0638665Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:16:42.0638933Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:16:42.0639132Z  'num_stages': 5,
2026-02-21T09:16:42.0639278Z  'num_warps': 1,
2026-02-21T09:16:42.0639432Z  'pid_type': 'flat',
2026-02-21T09:16:42.0639596Z  'range_flattens': [None, False],
2026-02-21T09:16:42.0639843Z  'range_multi_buffers': [None, False],
2026-02-21T09:16:42.0640033Z  'range_num_stages': [0, 1],
2026-02-21T09:16:42.0640214Z  'range_unroll_factors': [0, 0],
2026-02-21T09:16:42.0640399Z  'range_warp_specializes': [None, False]}
2026-02-21T09:16:42.0651055Z [39s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:16:43.3329623Z [40s] Generation 1 starting: 89 neighbors, 5 active search path(s)
2026-02-21T09:16:51.4028980Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 7.6 configs/s
2026-02-21T09:16:57.0053154Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 16.5 configs/s
2026-02-21T09:16:57.5299309Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1909.3 configs/s
2026-02-21T09:16:57.5860903Z [54s] Generation 1 complete: 
2026-02-21T09:16:57.5865213Z ok=94
2026-02-21T09:16:57.5869213Z min=0.0123
2026-02-21T09:16:57.5873748Z mid=0.0266
2026-02-21T09:16:57.5878090Z max=0.1680
2026-02-21T09:16:57.5882143Z best={'block_sizes': [1, 4096],
2026-02-21T09:16:57.5885853Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:16:57.5889618Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:16:57.5892754Z  'num_stages': 5,
2026-02-21T09:16:57.5894875Z  'num_warps': 1,
2026-02-21T09:16:57.5895061Z  'pid_type': 'flat',
2026-02-21T09:16:57.5895237Z  'range_flattens': [None, False],
2026-02-21T09:16:57.5895438Z  'range_multi_buffers': [None, False],
2026-02-21T09:16:57.5895627Z  'range_num_stages': [0, 1],
2026-02-21T09:16:57.5895809Z  'range_unroll_factors': [0, 0],
2026-02-21T09:16:57.5895990Z  'range_warp_specializes': [None, False]}
2026-02-21T09:16:57.5896217Z [54s] Fitting surrogate: 194 points, 194 targets
2026-02-21T09:16:58.9084740Z [56s] Generation 2 starting: 79 neighbors, 5 active search path(s)
2026-02-21T09:17:05.1562143Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80/80 19.7 configs/s
2026-02-21T09:17:10.0788037Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 80/80 16.4 configs/s
2026-02-21T09:17:11.2564869Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 862.1 configs/s
2026-02-21T09:17:11.3626388Z [68s] Generation 2 complete: 
2026-02-21T09:17:11.3632045Z ok=84
2026-02-21T09:17:11.3638122Z min=0.0123
2026-02-21T09:17:11.3638330Z mid=0.0205
2026-02-21T09:17:11.3642816Z max=0.0430
2026-02-21T09:17:11.3647707Z best={'block_sizes': [1, 4096],
2026-02-21T09:17:11.3652041Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:17:11.3656781Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:17:11.3657057Z  'num_stages': 5,
2026-02-21T09:17:11.3662384Z  'num_warps': 1,
2026-02-21T09:17:11.3667400Z  'pid_type': 'flat',
2026-02-21T09:17:11.3667692Z  'range_flattens': [None, False],
2026-02-21T09:17:11.3668301Z  'range_multi_buffers': [None, False],
2026-02-21T09:17:11.3672924Z  'range_num_stages': [0, 1],
2026-02-21T09:17:11.3674555Z  'range_unroll_factors': [0, 1],
2026-02-21T09:17:11.3674793Z  'range_warp_specializes': [None, False]}
2026-02-21T09:17:11.3675110Z [68s] Fitting surrogate: 278 points, 278 targets
2026-02-21T09:17:12.3566378Z [69s] Generation 3 starting: 64 neighbors, 5 active search path(s)
2026-02-21T09:17:21.1596600Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66/66 4.9 configs/s
2026-02-21T09:17:25.2235623Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 66/66 16.4 configs/s
2026-02-21T09:17:26.5949342Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 891.0 configs/s
2026-02-21T09:17:26.6975640Z [83s] Generation 3 complete: 
2026-02-21T09:17:26.6979272Z ok=69
2026-02-21T09:17:26.6983561Z min=0.0123
2026-02-21T09:17:26.6988402Z mid=0.0186
2026-02-21T09:17:26.6988666Z max=0.1475
2026-02-21T09:17:26.6988835Z best={'block_sizes': [1, 4096],
2026-02-21T09:17:26.6989074Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:17:26.6989307Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:17:26.6989500Z  'num_stages': 5,
2026-02-21T09:17:26.6989642Z  'num_warps': 1,
2026-02-21T09:17:26.6989789Z  'pid_type': 'flat',
2026-02-21T09:17:26.6989947Z  'range_flattens': [None, False],
2026-02-21T09:17:26.6990134Z  'range_multi_buffers': [None, None],
2026-02-21T09:17:26.6990318Z  'range_num_stages': [0, 1],
2026-02-21T09:17:26.6990494Z  'range_unroll_factors': [0, 1],
2026-02-21T09:17:26.6990683Z  'range_warp_specializes': [None, False]}
2026-02-21T09:17:26.6994255Z [83s] Fitting surrogate: 347 points, 347 targets
2026-02-21T09:17:27.4893058Z [84s] Generation 4 starting: 50 neighbors, 4 active search path(s)
2026-02-21T09:17:30.9225909Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 19.8 configs/s
2026-02-21T09:17:34.1079660Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 16.5 configs/s
2026-02-21T09:17:35.7545721Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 618.2 configs/s
2026-02-21T09:17:35.9068028Z [93s] Generation 4 complete: 
2026-02-21T09:17:35.9072337Z ok=55
2026-02-21T09:17:35.9075622Z min=0.0123
2026-02-21T09:17:35.9080014Z mid=0.0164
2026-02-21T09:17:35.9084325Z max=0.0470
2026-02-21T09:17:35.9088222Z best={'block_sizes': [1, 4096],
2026-02-21T09:17:35.9092607Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:17:35.9096583Z  'load_eviction_policies': ['', ''],
2026-02-21T09:17:35.9096876Z  'num_stages': 1,
2026-02-21T09:17:35.9097056Z  'num_warps': 1,
2026-02-21T09:17:35.9097236Z  'pid_type': 'flat',
2026-02-21T09:17:35.9097439Z  'range_flattens': [None, False],
2026-02-21T09:17:35.9097977Z  'range_multi_buffers': [None, None],
2026-02-21T09:17:35.9098264Z  'range_num_stages': [0, 0],
2026-02-21T09:17:35.9098468Z  'range_unroll_factors': [0, 1],
2026-02-21T09:17:35.9098661Z  'range_warp_specializes': [None, False]}
2026-02-21T09:17:35.9103053Z [93s] Fitting surrogate: 402 points, 402 targets
2026-02-21T09:17:36.5034524Z [93s] Generation 5 starting: 37 neighbors, 4 active search path(s)
2026-02-21T09:17:41.7366734Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 4.3 configs/s
2026-02-21T09:17:44.1450313Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 39/39 16.5 configs/s
2026-02-21T09:17:45.2680083Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 906.5 configs/s
2026-02-21T09:17:45.3832153Z [102s] Generation 5 complete: 
2026-02-21T09:17:45.3835163Z ok=41
2026-02-21T09:17:45.3835368Z min=0.0123
2026-02-21T09:17:45.3835582Z mid=0.0185
2026-02-21T09:17:45.3835714Z max=0.0471
2026-02-21T09:17:45.3836247Z best={'block_sizes': [1, 4096],
2026-02-21T09:17:45.3841023Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:17:45.3841376Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:17:45.3841678Z  'num_stages': 6,
2026-02-21T09:17:45.3847724Z  'num_warps': 1,
2026-02-21T09:17:45.3849996Z  'pid_type': 'flat',
2026-02-21T09:17:45.3850203Z  'range_flattens': [None, False],
2026-02-21T09:17:45.3850410Z  'range_multi_buffers': [None, None],
2026-02-21T09:17:45.3854230Z  'range_num_stages': [0, 1],
2026-02-21T09:17:45.3856583Z  'range_unroll_factors': [0, 0],
2026-02-21T09:17:45.3860128Z  'range_warp_specializes': [None, False]}
2026-02-21T09:17:45.3860363Z [102s] Fitting surrogate: 443 points, 443 targets
2026-02-21T09:17:46.1060320Z [103s] Generation 6 starting: 31 neighbors, 3 active search path(s)
2026-02-21T09:17:49.4091514Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 5.0 configs/s
2026-02-21T09:17:51.6310487Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 32/32 14.6 configs/s
2026-02-21T09:17:52.7368933Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 917.5 configs/s
2026-02-21T09:17:52.8451468Z [109s] Generation 6 complete: 
2026-02-21T09:17:52.8455591Z ok=34
2026-02-21T09:17:52.8459983Z min=0.0123
2026-02-21T09:17:52.8464415Z mid=0.0143
2026-02-21T09:17:52.8465999Z max=0.1495
2026-02-21T09:17:52.8466259Z best={'block_sizes': [1, 4096],
2026-02-21T09:17:52.8466509Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:17:52.8466779Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:17:52.8467006Z  'num_stages': 6,
2026-02-21T09:17:52.8467170Z  'num_warps': 1,
2026-02-21T09:17:52.8467320Z  'pid_type': 'flat',
2026-02-21T09:17:52.8467484Z  'range_flattens': [None, False],
2026-02-21T09:17:52.8467735Z  'range_multi_buffers': [None, None],
2026-02-21T09:17:52.8467959Z  'range_num_stages': [0, 1],
2026-02-21T09:17:52.8468161Z  'range_unroll_factors': [0, 0],
2026-02-21T09:17:52.8468772Z  'range_warp_specializes': [None, False]}
2026-02-21T09:17:52.8468986Z [109s] Fitting surrogate: 477 points, 477 targets
2026-02-21T09:17:53.2341396Z [110s] Generation 7 starting: 21 neighbors, 2 active search path(s)
2026-02-21T09:17:55.1299271Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 18.0 configs/s
2026-02-21T09:17:56.5005386Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 22/22 16.6 configs/s
2026-02-21T09:17:57.5744413Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 945.8 configs/s
2026-02-21T09:17:57.6747493Z [114s] Generation 7 complete: 
2026-02-21T09:17:57.6752224Z ok=24
2026-02-21T09:17:57.6756149Z min=0.0123
2026-02-21T09:17:57.6758028Z mid=0.0164
2026-02-21T09:17:57.6758197Z max=0.0327
2026-02-21T09:17:57.6758343Z best={'block_sizes': [1, 4096],
2026-02-21T09:17:57.6758964Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:17:57.6759254Z  'load_eviction_policies': ['', ''],
2026-02-21T09:17:57.6759443Z  'num_stages': 2,
2026-02-21T09:17:57.6759588Z  'num_warps': 1,
2026-02-21T09:17:57.6759742Z  'pid_type': 'flat',
2026-02-21T09:17:57.6759902Z  'range_flattens': [None, False],
2026-02-21T09:17:57.6760089Z  'range_multi_buffers': [None, None],
2026-02-21T09:17:57.6760272Z  'range_num_stages': [0, 0],
2026-02-21T09:17:57.6760447Z  'range_unroll_factors': [0, 0],
2026-02-21T09:17:57.6760634Z  'range_warp_specializes': [None, False]}
2026-02-21T09:17:57.6764782Z [114s] Fitting surrogate: 501 points, 501 targets
2026-02-21T09:17:58.1760293Z [115s] Generation 8 starting: 20 neighbors, 2 active search path(s)
2026-02-21T09:18:07.9923220Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 1.0 configs/s
2026-02-21T09:18:09.2824760Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 16.8 configs/s
2026-02-21T09:18:10.1123021Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1219.6 configs/s
2026-02-21T09:18:10.1969051Z [127s] Generation 8 complete: 
2026-02-21T09:18:10.1973987Z ok=22
2026-02-21T09:18:10.1976193Z min=0.0123
2026-02-21T09:18:10.1976352Z mid=0.0143
2026-02-21T09:18:10.1976482Z max=0.0676
2026-02-21T09:18:10.1976616Z best={'block_sizes': [1, 4096],
2026-02-21T09:18:10.1976871Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:18:10.1977122Z  'load_eviction_policies': ['', ''],
2026-02-21T09:18:10.1977307Z  'num_stages': 2,
2026-02-21T09:18:10.1977445Z  'num_warps': 1,
2026-02-21T09:18:10.1977593Z  'pid_type': 'flat',
2026-02-21T09:18:10.1977754Z  'range_flattens': [None, False],
2026-02-21T09:18:10.1977932Z  'range_multi_buffers': [None, None],
2026-02-21T09:18:10.1978117Z  'range_num_stages': [0, 0],
2026-02-21T09:18:10.1978281Z  'range_unroll_factors': [0, 0],
2026-02-21T09:18:10.1978499Z  'range_warp_specializes': [None, False]}
2026-02-21T09:18:10.1986633Z [127s] Fitting surrogate: 523 points, 523 targets
2026-02-21T09:18:10.6819013Z [127s] Generation 9 starting: 11 neighbors, 1 active search path(s)
2026-02-21T09:18:13.2981085Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 4.4 configs/s
2026-02-21T09:18:13.9799505Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 11/11 17.3 configs/s
2026-02-21T09:18:14.5741151Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1696.3 configs/s
2026-02-21T09:18:14.6370461Z [131s] Generation 9 complete: 
2026-02-21T09:18:14.6374801Z ok=13
2026-02-21T09:18:14.6376221Z min=0.0123
2026-02-21T09:18:14.6376422Z mid=0.0123
2026-02-21T09:18:14.6379289Z max=0.0204
2026-02-21T09:18:14.6379475Z best={'block_sizes': [1, 4096],
2026-02-21T09:18:14.6379737Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:18:14.6380353Z  'load_eviction_policies': ['', ''],
2026-02-21T09:18:14.6380534Z  'num_stages': 8,
2026-02-21T09:18:14.6380685Z  'num_warps': 1,
2026-02-21T09:18:14.6380828Z  'pid_type': 'flat',
2026-02-21T09:18:14.6380989Z  'range_flattens': [None, True],
2026-02-21T09:18:14.6381171Z  'range_multi_buffers': [None, False],
2026-02-21T09:18:14.6381355Z  'range_num_stages': [0, 3],
2026-02-21T09:18:14.6381532Z  'range_unroll_factors': [0, 2],
2026-02-21T09:18:14.6381788Z  'range_warp_specializes': [None, None]}
2026-02-21T09:18:14.6386504Z [131s] Fitting surrogate: 536 points, 536 targets
2026-02-21T09:18:14.9126231Z [132s] Autotuning complete in 132.0s after searching 514 configs.
2026-02-21T09:18:14.9130502Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:18:14.9135901Z     @helion.kernel(config=helion.Config(block_sizes=[1, 4096], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', ''], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 2], range_warp_specializes=[None, None]), static_shapes=True)
2026-02-21T09:18:14.9136809Z 
2026-02-21T09:18:14.9137138Z [132s] Code of selected kernel: /tmp/torchinductor_root/mo/cmoutw7hy3xdywgwxgkivyc2c2otyaqfmanyliv46l4mv3edno3n.py
2026-02-21T09:18:14.9350481Z from __future__ import annotations
2026-02-21T09:18:14.9355418Z 
2026-02-21T09:18:14.9356963Z import torch
2026-02-21T09:18:14.9357141Z import triton
2026-02-21T09:18:14.9357311Z import triton.language as tl
2026-02-21T09:18:14.9357528Z from torch._inductor.runtime import triton_helpers
2026-02-21T09:18:14.9357810Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T09:18:14.9358113Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:18:14.9358302Z 
2026-02-21T09:18:14.9358375Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T09:18:14.9358567Z _BLOCK_SIZE_1 = tl.constexpr(4096)
2026-02-21T09:18:14.9358711Z 
2026-02-21T09:18:14.9358770Z @triton.jit
2026-02-21T09:18:14.9358930Z def _helion_softmax_two_pass(x, out):
2026-02-21T09:18:14.9359194Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:18:14.9359461Z     pid_0 = tl.program_id(0)
2026-02-21T09:18:14.9359635Z     offset_0 = pid_0
2026-02-21T09:18:14.9359812Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T09:18:14.9360107Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:18:14.9360410Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T09:18:14.9360693Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:18:14.9360947Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T09:18:14.9361215Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:18:14.9361498Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T09:18:14.9361967Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T09:18:14.9362220Z     # src[softmax.py:82-89]: ...
2026-02-21T09:18:14.9362593Z     for offset_2 in tl.range(0, 2176, _BLOCK_SIZE_1, loop_unroll_factor=2, num_stages=1, disallow_acc_multi_buffer=True, flatten=True):
2026-02-21T09:18:14.9363036Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:18:14.9363277Z         mask_1 = indices_2 < 2176
2026-02-21T09:18:14.9363453Z         mi_copy = mi
2026-02-21T09:18:14.9363606Z         di_copy = di
2026-02-21T09:18:14.9363752Z         mi_copy_0 = mi_copy
2026-02-21T09:18:14.9363922Z         di_copy_0 = di_copy
2026-02-21T09:18:14.9364109Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T09:18:14.9364439Z         values = tl.load(x + (indices_0[:, None] * 2176 + indices_2[None, :] * 1), mask_1[None, :], other=0)
2026-02-21T09:18:14.9364785Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T09:18:14.9365197Z         _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16))
2026-02-21T09:18:14.9365848Z         local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16)
2026-02-21T09:18:14.9366105Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T09:18:14.9366348Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T09:18:14.9366557Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T09:18:14.9366824Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:18:14.9367059Z         v_2 = mi_copy_0 - v_1
2026-02-21T09:18:14.9367234Z         v_3 = libdevice.exp(v_2)
2026-02-21T09:18:14.9367413Z         v_4 = di_copy_0 * v_3
2026-02-21T09:18:14.9367603Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T09:18:14.9367819Z         subscript = v_1[:, None]
2026-02-21T09:18:14.9367990Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T09:18:14.9368257Z         v_6 = v_5 - subscript
2026-02-21T09:18:14.9368476Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:18:14.9368752Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T09:18:14.9368966Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T09:18:14.9369163Z         v_7 = libdevice.exp(v_6)
2026-02-21T09:18:14.9369492Z         _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32))
2026-02-21T09:18:14.9369847Z         sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32)
2026-02-21T09:18:14.9370052Z         di = v_4 + sum_1
2026-02-21T09:18:14.9370210Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T09:18:14.9370387Z         mi = v_1
2026-02-21T09:18:14.9370584Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:18:14.9370852Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T09:18:14.9371151Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:18:14.9371644Z     for offset_2 in tl.range(0, 2176, _BLOCK_SIZE_1, loop_unroll_factor=2, num_stages=1, disallow_acc_multi_buffer=True, flatten=True):
2026-02-21T09:18:14.9372054Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:18:14.9372282Z         mask_2 = indices_2 < 2176
2026-02-21T09:18:14.9372453Z         mi_copy_1 = mi
2026-02-21T09:18:14.9372599Z         di_copy_1 = di
2026-02-21T09:18:14.9372774Z         mi_copy_1_0 = mi_copy_1
2026-02-21T09:18:14.9372943Z         di_copy_1_0 = di_copy_1
2026-02-21T09:18:14.9373124Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T09:18:14.9373438Z         values_1 = tl.load(x + (indices_0[:, None] * 2176 + indices_2[None, :] * 1), mask_2[None, :], other=0)
2026-02-21T09:18:14.9373812Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:18:14.9374095Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T09:18:14.9374290Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T09:18:14.9374467Z         v_10 = v_9 - subscript_1
2026-02-21T09:18:14.9374640Z         v_11 = libdevice.exp(v_10)
2026-02-21T09:18:14.9374815Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T09:18:14.9375000Z         v_12 = v_11 / subscript_2
2026-02-21T09:18:14.9375167Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T09:18:14.9375437Z         tl.store(out + (indices_0[:, None] * 2176 + indices_2[None, :] * 1), v_13, mask_2[None, :])
2026-02-21T09:18:14.9375648Z 
2026-02-21T09:18:14.9375778Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T09:18:14.9376005Z     """
2026-02-21T09:18:14.9376210Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T09:18:14.9376511Z     This version uses fewer passes but is less numerically stable.
2026-02-21T09:18:14.9376731Z     Args:
2026-02-21T09:18:14.9376892Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T09:18:14.9377158Z     Returns:
2026-02-21T09:18:14.9377332Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T09:18:14.9377546Z     """
2026-02-21T09:18:14.9377687Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T09:18:14.9377861Z     m, n = x.size()
2026-02-21T09:18:14.9378032Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T09:18:14.9378229Z     out = torch.empty_like(x)
2026-02-21T09:18:14.9378459Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:18:14.9378768Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:18:14.9379088Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:18:14.9379331Z     # src[softmax.py:79-92]: ...
2026-02-21T09:18:14.9379582Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=1, num_stages=8)
2026-02-21T09:18:14.9379913Z     # src[softmax.py:93]: return out
2026-02-21T09:18:14.9380086Z     return out
2026-02-21T09:18:15.6558877Z WARNING:tritonbench.utils.triton_op:Completed input ID 15:
2026-02-21T09:18:15.6563304Z (M, N)
2026-02-21T09:18:15.6564814Z ------------
2026-02-21T09:18:15.6564996Z (4096, 2176)
2026-02-21T09:18:15.6565077Z 
2026-02-21T09:18:15.6565522Z  20%|██        | 4/20 [09:20<38:22, 143.88s/it]WARNING:tritonbench.utils.triton_op:Running input ID 20:
2026-02-21T09:18:15.6569238Z (M, N)
2026-02-21T09:18:15.6571342Z ------------
2026-02-21T09:18:15.6571525Z (4096, 2816)
2026-02-21T09:18:15.6571937Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T09:18:17.0443344Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T09:18:18.5828636Z INFO:tritonbench.utils.triton_op:Took 2.36ms to get benchmark function for torch_compile_softmax
2026-02-21T09:18:19.8082121Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:18:19.8083826Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:18:19.8084105Z               'dtype': 'torch.float16',
2026-02-21T09:18:19.8084330Z               'shape': (4096, 2816),
2026-02-21T09:18:19.8084537Z               'stride': (2816, 1)},),
2026-02-21T09:18:19.8084725Z   'kwargs': {}}
2026-02-21T09:18:19.8100853Z INFO:tritonbench.utils.triton_op:Took 2.19ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:18:19.9838357Z [0s] Autotune random seed: 2138408546
2026-02-21T09:18:20.0086892Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:18:52.7239067Z [32s] Timeout after 30s compiling Config(block_sizes=[256, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None])
2026-02-21T09:18:53.4617318Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s
2026-02-21T09:18:59.8265820Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 15.7 configs/s
2026-02-21T09:18:59.8268974Z [39s] Adaptive compile timeout: 30s (90% percentile=3.5s, bounds=[30.0s, 30s])
2026-02-21T09:18:59.8280431Z [39s] Initial random population of 100, 5 starting points: 
2026-02-21T09:18:59.8280808Z error=10
2026-02-21T09:18:59.8280960Z timeout=1
2026-02-21T09:18:59.8281137Z ok=89
2026-02-21T09:18:59.8281293Z min=0.0163
2026-02-21T09:18:59.8281458Z mid=0.2415
2026-02-21T09:18:59.8281710Z max=65.2831
2026-02-21T09:18:59.8281907Z best={'block_sizes': [1, 4096],
2026-02-21T09:18:59.8282164Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:18:59.8282431Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:18:59.8282640Z  'num_stages': 5,
2026-02-21T09:18:59.8282797Z  'num_warps': 1,
2026-02-21T09:18:59.8283368Z  'pid_type': 'flat',
2026-02-21T09:18:59.8283543Z  'range_flattens': [None, False],
2026-02-21T09:18:59.8283756Z  'range_multi_buffers': [None, False],
2026-02-21T09:18:59.8283950Z  'range_num_stages': [0, 1],
2026-02-21T09:18:59.8284118Z  'range_unroll_factors': [0, 0],
2026-02-21T09:18:59.8284300Z  'range_warp_specializes': [None, False]}
2026-02-21T09:18:59.8298996Z [39s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:19:01.2690632Z [41s] Generation 1 starting: 87 neighbors, 5 active search path(s)
2026-02-21T09:19:22.1731120Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88/88 1.3 configs/s
2026-02-21T09:19:27.5274011Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 88/88 16.6 configs/s
2026-02-21T09:19:28.0594849Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1885.3 configs/s
2026-02-21T09:19:28.1159155Z [68s] Generation 1 complete: 
2026-02-21T09:19:28.1163192Z ok=92
2026-02-21T09:19:28.1168213Z min=0.0143
2026-02-21T09:19:28.1172532Z mid=0.0288
2026-02-21T09:19:28.1173826Z max=0.3052
2026-02-21T09:19:28.1174008Z best={'block_sizes': [1, 4096],
2026-02-21T09:19:28.1174256Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:19:28.1174520Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:19:28.1174706Z  'num_stages': 4,
2026-02-21T09:19:28.1174850Z  'num_warps': 4,
2026-02-21T09:19:28.1174986Z  'pid_type': 'flat',
2026-02-21T09:19:28.1175145Z  'range_flattens': [None, False],
2026-02-21T09:19:28.1175322Z  'range_multi_buffers': [None, False],
2026-02-21T09:19:28.1175544Z  'range_num_stages': [0, 1],
2026-02-21T09:19:28.1175721Z  'range_unroll_factors': [0, 0],
2026-02-21T09:19:28.1175900Z  'range_warp_specializes': [None, False]}
2026-02-21T09:19:28.1179726Z [68s] Fitting surrogate: 192 points, 192 targets
2026-02-21T09:19:29.3190374Z [69s] Generation 2 starting: 87 neighbors, 5 active search path(s)
2026-02-21T09:19:37.4572746Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 10.3 configs/s
2026-02-21T09:19:42.8625889Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 16.6 configs/s
2026-02-21T09:19:44.0826998Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 830.5 configs/s
2026-02-21T09:19:44.1887936Z [84s] Generation 2 complete: 
2026-02-21T09:19:44.1892326Z error=2
2026-02-21T09:19:44.1894014Z ok=91
2026-02-21T09:19:44.1894223Z min=0.0143
2026-02-21T09:19:44.1894363Z mid=0.0246
2026-02-21T09:19:44.1898138Z max=0.0859
2026-02-21T09:19:44.1902103Z best={'block_sizes': [1, 4096],
2026-02-21T09:19:44.1903650Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:19:44.1903942Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:19:44.1904130Z  'num_stages': 4,
2026-02-21T09:19:44.1904318Z  'num_warps': 2,
2026-02-21T09:19:44.1904475Z  'pid_type': 'flat',
2026-02-21T09:19:44.1904636Z  'range_flattens': [None, True],
2026-02-21T09:19:44.1904814Z  'range_multi_buffers': [None, False],
2026-02-21T09:19:44.1905005Z  'range_num_stages': [0, 1],
2026-02-21T09:19:44.1905166Z  'range_unroll_factors': [0, 0],
2026-02-21T09:19:44.1905347Z  'range_warp_specializes': [None, False]}
2026-02-21T09:19:44.1915074Z [84s] Fitting surrogate: 285 points, 285 targets
2026-02-21T09:19:44.9785122Z [84s] Generation 3 starting: 62 neighbors, 5 active search path(s)
2026-02-21T09:19:53.0929590Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64/64 4.6 configs/s
2026-02-21T09:19:56.6508747Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 64/64 18.2 configs/s
2026-02-21T09:19:58.6649493Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 504.4 configs/s
2026-02-21T09:19:58.8374473Z [98s] Generation 3 complete: 
2026-02-21T09:19:58.8375870Z error=6
2026-02-21T09:19:58.8378653Z ok=61
2026-02-21T09:19:58.8378828Z min=0.0143
2026-02-21T09:19:58.8379022Z mid=0.0206
2026-02-21T09:19:58.8379164Z max=0.2008
2026-02-21T09:19:58.8379307Z best={'block_sizes': [1, 4096],
2026-02-21T09:19:58.8384034Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:19:58.8388633Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:19:58.8392067Z  'num_stages': 4,
2026-02-21T09:19:58.8396298Z  'num_warps': 2,
2026-02-21T09:19:58.8397733Z  'pid_type': 'flat',
2026-02-21T09:19:58.8397997Z  'range_flattens': [None, True],
2026-02-21T09:19:58.8402838Z  'range_multi_buffers': [None, False],
2026-02-21T09:19:58.8404710Z  'range_num_stages': [0, 1],
2026-02-21T09:19:58.8404956Z  'range_unroll_factors': [0, 0],
2026-02-21T09:19:58.8405155Z  'range_warp_specializes': [None, False]}
2026-02-21T09:19:58.8405459Z [98s] Fitting surrogate: 352 points, 352 targets
2026-02-21T09:19:59.5362688Z [99s] Generation 4 starting: 44 neighbors, 4 active search path(s)
2026-02-21T09:20:02.5737972Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47/47 19.1 configs/s
2026-02-21T09:20:05.4237408Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 47/47 16.7 configs/s
2026-02-21T09:20:07.8596595Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 418.5 configs/s
2026-02-21T09:20:08.0720253Z [108s] Generation 4 complete: 
2026-02-21T09:20:08.0722234Z ok=49
2026-02-21T09:20:08.0722408Z min=0.0143
2026-02-21T09:20:08.0722538Z mid=0.0184
2026-02-21T09:20:08.0722665Z max=0.0246
2026-02-21T09:20:08.0722801Z best={'block_sizes': [1, 4096],
2026-02-21T09:20:08.0723058Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:20:08.0723323Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:20:08.0723507Z  'num_stages': 4,
2026-02-21T09:20:08.0723679Z  'num_warps': 2,
2026-02-21T09:20:08.0723838Z  'pid_type': 'flat',
2026-02-21T09:20:08.0724001Z  'range_flattens': [None, True],
2026-02-21T09:20:08.0724175Z  'range_multi_buffers': [None, False],
2026-02-21T09:20:08.0724364Z  'range_num_stages': [0, 1],
2026-02-21T09:20:08.0724524Z  'range_unroll_factors': [0, 0],
2026-02-21T09:20:08.0724711Z  'range_warp_specializes': [None, False]}
2026-02-21T09:20:08.0735415Z [108s] Fitting surrogate: 401 points, 401 targets
2026-02-21T09:20:08.6899161Z [108s] Generation 5 starting: 35 neighbors, 3 active search path(s)
2026-02-21T09:20:11.7622981Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 9.0 configs/s
2026-02-21T09:20:14.0299028Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 37/37 16.6 configs/s
2026-02-21T09:20:16.0217475Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 510.9 configs/s
2026-02-21T09:20:16.1911917Z [116s] Generation 5 complete: 
2026-02-21T09:20:16.1915968Z ok=39
2026-02-21T09:20:16.1919806Z min=0.0143
2026-02-21T09:20:16.1923768Z mid=0.0164
2026-02-21T09:20:16.1928206Z max=0.0287
2026-02-21T09:20:16.1929690Z best={'block_sizes': [1, 4096],
2026-02-21T09:20:16.1929958Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:20:16.1930199Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:20:16.1930391Z  'num_stages': 8,
2026-02-21T09:20:16.1930532Z  'num_warps': 2,
2026-02-21T09:20:16.1930680Z  'pid_type': 'flat',
2026-02-21T09:20:16.1930834Z  'range_flattens': [None, False],
2026-02-21T09:20:16.1931020Z  'range_multi_buffers': [None, False],
2026-02-21T09:20:16.1931201Z  'range_num_stages': [0, 3],
2026-02-21T09:20:16.1931374Z  'range_unroll_factors': [0, 1],
2026-02-21T09:20:16.1931641Z  'range_warp_specializes': [None, False]}
2026-02-21T09:20:16.1931876Z [116s] Fitting surrogate: 440 points, 440 targets
2026-02-21T09:20:16.6053039Z [116s] Generation 6 starting: 20 neighbors, 2 active search path(s)
2026-02-21T09:20:47.0660654Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 0.3 configs/s
2026-02-21T09:20:48.3686608Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 16.7 configs/s
2026-02-21T09:20:49.4054545Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1356.1 configs/s
2026-02-21T09:20:49.4792893Z [149s] Generation 6 complete: 
2026-02-21T09:20:49.4797875Z ok=22
2026-02-21T09:20:49.4802364Z min=0.0143
2026-02-21T09:20:49.4803815Z mid=0.0184
2026-02-21T09:20:49.4804009Z max=0.1003
2026-02-21T09:20:49.4804189Z best={'block_sizes': [1, 4096],
2026-02-21T09:20:49.4804448Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:20:49.4804694Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:20:49.4804894Z  'num_stages': 8,
2026-02-21T09:20:49.4805039Z  'num_warps': 2,
2026-02-21T09:20:49.4805240Z  'pid_type': 'flat',
2026-02-21T09:20:49.4805402Z  'range_flattens': [None, False],
2026-02-21T09:20:49.4805583Z  'range_multi_buffers': [None, False],
2026-02-21T09:20:49.4805772Z  'range_num_stages': [0, 3],
2026-02-21T09:20:49.4805938Z  'range_unroll_factors': [0, 1],
2026-02-21T09:20:49.4806121Z  'range_warp_specializes': [None, False]}
2026-02-21T09:20:49.4806339Z [149s] Fitting surrogate: 462 points, 462 targets
2026-02-21T09:20:49.7533094Z [149s] Generation 7 starting: 11 neighbors, 1 active search path(s)
2026-02-21T09:20:51.6623482Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 5.0 configs/s
2026-02-21T09:20:52.3409488Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 11/11 17.4 configs/s
2026-02-21T09:20:53.0279242Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1464.7 configs/s
2026-02-21T09:20:53.0942922Z [153s] Generation 7 complete: 
2026-02-21T09:20:53.0947852Z ok=13
2026-02-21T09:20:53.0952768Z min=0.0143
2026-02-21T09:20:53.0954390Z mid=0.0162
2026-02-21T09:20:53.0954603Z max=0.0225
2026-02-21T09:20:53.0959859Z best={'block_sizes': [1, 4096],
2026-02-21T09:20:53.0964338Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:20:53.0968680Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:20:53.0973584Z  'num_stages': 8,
2026-02-21T09:20:53.0977858Z  'num_warps': 2,
2026-02-21T09:20:53.0981789Z  'pid_type': 'flat',
2026-02-21T09:20:53.0982069Z  'range_flattens': [None, False],
2026-02-21T09:20:53.0982305Z  'range_multi_buffers': [None, False],
2026-02-21T09:20:53.0986314Z  'range_num_stages': [0, 3],
2026-02-21T09:20:53.0990069Z  'range_unroll_factors': [0, 1],
2026-02-21T09:20:53.0995002Z  'range_warp_specializes': [None, False]}
2026-02-21T09:20:53.0999312Z [153s] Fitting surrogate: 475 points, 475 targets
2026-02-21T09:20:53.3937427Z [153s] Generation 8 starting: 12 neighbors, 1 active search path(s)
2026-02-21T09:20:54.7390409Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 9.2 configs/s
2026-02-21T09:20:55.4741238Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 12/12 17.4 configs/s
2026-02-21T09:20:56.2692095Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1268.8 configs/s
2026-02-21T09:20:56.3449272Z [156s] Generation 8 complete: 
2026-02-21T09:20:56.3450977Z ok=14
2026-02-21T09:20:56.3451148Z min=0.0143
2026-02-21T09:20:56.3451279Z mid=0.0143
2026-02-21T09:20:56.3451411Z max=0.0205
2026-02-21T09:20:56.3451611Z best={'block_sizes': [1, 4096],
2026-02-21T09:20:56.3451864Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:20:56.3452149Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:20:56.3452347Z  'num_stages': 8,
2026-02-21T09:20:56.3452513Z  'num_warps': 2,
2026-02-21T09:20:56.3452987Z  'pid_type': 'flat',
2026-02-21T09:20:56.3453190Z  'range_flattens': [None, False],
2026-02-21T09:20:56.3453378Z  'range_multi_buffers': [None, False],
2026-02-21T09:20:56.3453577Z  'range_num_stages': [0, 3],
2026-02-21T09:20:56.3453745Z  'range_unroll_factors': [0, 1],
2026-02-21T09:20:56.3453933Z  'range_warp_specializes': [None, False]}
2026-02-21T09:20:56.3465500Z [156s] Fitting surrogate: 489 points, 489 targets
2026-02-21T09:20:56.6299205Z [156s] Generation 9 starting: 7 neighbors, 1 active search path(s)
2026-02-21T09:20:57.1363307Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7/7 38.6 configs/s
2026-02-21T09:20:57.5642006Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━━ 7/7 18.4 configs/s
2026-02-21T09:20:58.0703318Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1976.0         
2026-02-21T09:20:58.0703742Z                                                                  configs/s      
2026-02-21T09:20:58.1216982Z [158s] Generation 9 complete: 
2026-02-21T09:20:58.1220303Z ok=9
2026-02-21T09:20:58.1223137Z min=0.0143
2026-02-21T09:20:58.1227430Z mid=0.0143
2026-02-21T09:20:58.1230900Z max=0.0184
2026-02-21T09:20:58.1233400Z best={'block_sizes': [1, 4096],
2026-02-21T09:20:58.1238641Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:20:58.1240013Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:20:58.1240230Z  'num_stages': 8,
2026-02-21T09:20:58.1240378Z  'num_warps': 2,
2026-02-21T09:20:58.1240519Z  'pid_type': 'flat',
2026-02-21T09:20:58.1240684Z  'range_flattens': [None, False],
2026-02-21T09:20:58.1240862Z  'range_multi_buffers': [None, False],
2026-02-21T09:20:58.1241050Z  'range_num_stages': [0, 3],
2026-02-21T09:20:58.1241215Z  'range_unroll_factors': [0, 1],
2026-02-21T09:20:58.1241391Z  'range_warp_specializes': [None, False]}
2026-02-21T09:20:58.1241673Z [158s] Fitting surrogate: 498 points, 498 targets
2026-02-21T09:20:58.3076784Z [158s] Autotuning complete in 158.3s after searching 475 configs.
2026-02-21T09:20:58.3077169Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:20:58.3078086Z     @helion.kernel(config=helion.Config(block_sizes=[1, 4096], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=8, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, False]), static_shapes=True)
2026-02-21T09:20:58.3078891Z 
2026-02-21T09:20:58.3079141Z [158s] Code of selected kernel: /tmp/torchinductor_root/nn/cnn3xdqwqruzw2ye3vyccovtcwlzrcws36zg4c6mxkzxps7xq3ie.py
2026-02-21T09:20:58.3307506Z from __future__ import annotations
2026-02-21T09:20:58.3307731Z 
2026-02-21T09:20:58.3312597Z import torch
2026-02-21T09:20:58.3314673Z import triton
2026-02-21T09:20:58.3314901Z import triton.language as tl
2026-02-21T09:20:58.3318988Z from torch._inductor.runtime import triton_helpers
2026-02-21T09:20:58.3322988Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T09:20:58.3324503Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:20:58.3324712Z 
2026-02-21T09:20:58.3324788Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T09:20:58.3324982Z _BLOCK_SIZE_1 = tl.constexpr(4096)
2026-02-21T09:20:58.3325098Z 
2026-02-21T09:20:58.3325169Z @triton.jit
2026-02-21T09:20:58.3325324Z def _helion_softmax_two_pass(x, out):
2026-02-21T09:20:58.3325585Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:20:58.3325833Z     pid_0 = tl.program_id(0)
2026-02-21T09:20:58.3326002Z     offset_0 = pid_0
2026-02-21T09:20:58.3326171Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T09:20:58.3326454Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:20:58.3326752Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T09:20:58.3327016Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:20:58.3327564Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T09:20:58.3327838Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:20:58.3328114Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T09:20:58.3328358Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T09:20:58.3328592Z     # src[softmax.py:82-89]: ...
2026-02-21T09:20:58.3329000Z     for offset_2 in tl.range(0, 2816, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=False, num_stages=3, disallow_acc_multi_buffer=True, flatten=False):
2026-02-21T09:20:58.3329454Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:20:58.3329690Z         mask_1 = indices_2 < 2816
2026-02-21T09:20:58.3329852Z         mi_copy = mi
2026-02-21T09:20:58.3329996Z         di_copy = di
2026-02-21T09:20:58.3330136Z         mi_copy_0 = mi_copy
2026-02-21T09:20:58.3330293Z         di_copy_0 = di_copy
2026-02-21T09:20:58.3330474Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T09:20:58.3330843Z         values = tl.load(x + (indices_0[:, None] * 2816 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_last')
2026-02-21T09:20:58.3331229Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T09:20:58.3331861Z         _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16))
2026-02-21T09:20:58.3332265Z         local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16)
2026-02-21T09:20:58.3332519Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T09:20:58.3332757Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T09:20:58.3332968Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T09:20:58.3333222Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:20:58.3333463Z         v_2 = mi_copy_0 - v_1
2026-02-21T09:20:58.3333632Z         v_3 = libdevice.exp(v_2)
2026-02-21T09:20:58.3333804Z         v_4 = di_copy_0 * v_3
2026-02-21T09:20:58.3333984Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T09:20:58.3334187Z         subscript = v_1[:, None]
2026-02-21T09:20:58.3334364Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T09:20:58.3334538Z         v_6 = v_5 - subscript
2026-02-21T09:20:58.3334749Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:20:58.3335006Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T09:20:58.3335224Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T09:20:58.3335404Z         v_7 = libdevice.exp(v_6)
2026-02-21T09:20:58.3335727Z         _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32))
2026-02-21T09:20:58.3336076Z         sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32)
2026-02-21T09:20:58.3336277Z         di = v_4 + sum_1
2026-02-21T09:20:58.3336444Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T09:20:58.3336705Z         mi = v_1
2026-02-21T09:20:58.3336916Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:20:58.3337186Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T09:20:58.3337527Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:20:58.3338045Z     for offset_2 in tl.range(0, 2816, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=False, num_stages=3, disallow_acc_multi_buffer=True, flatten=False):
2026-02-21T09:20:58.3338509Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:20:58.3338747Z         mask_2 = indices_2 < 2816
2026-02-21T09:20:58.3338909Z         mi_copy_1 = mi
2026-02-21T09:20:58.3339059Z         di_copy_1 = di
2026-02-21T09:20:58.3339207Z         mi_copy_1_0 = mi_copy_1
2026-02-21T09:20:58.3339375Z         di_copy_1_0 = di_copy_1
2026-02-21T09:20:58.3339634Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T09:20:58.3339995Z         values_1 = tl.load(x + (indices_0[:, None] * 2816 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_last')
2026-02-21T09:20:58.3340421Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:20:58.3340698Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T09:20:58.3340895Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T09:20:58.3341075Z         v_10 = v_9 - subscript_1
2026-02-21T09:20:58.3341250Z         v_11 = libdevice.exp(v_10)
2026-02-21T09:20:58.3341433Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T09:20:58.3341649Z         v_12 = v_11 / subscript_2
2026-02-21T09:20:58.3341826Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T09:20:58.3342089Z         tl.store(out + (indices_0[:, None] * 2816 + indices_2[None, :] * 1), v_13, mask_2[None, :])
2026-02-21T09:20:58.3342303Z 
2026-02-21T09:20:58.3342433Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T09:20:58.3342662Z     """
2026-02-21T09:20:58.3342869Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T09:20:58.3343177Z     This version uses fewer passes but is less numerically stable.
2026-02-21T09:20:58.3343391Z     Args:
2026-02-21T09:20:58.3343555Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T09:20:58.3343748Z     Returns:
2026-02-21T09:20:58.3343932Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T09:20:58.3344135Z     """
2026-02-21T09:20:58.3344277Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T09:20:58.3344450Z     m, n = x.size()
2026-02-21T09:20:58.3344620Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T09:20:58.3344820Z     out = torch.empty_like(x)
2026-02-21T09:20:58.3345034Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:20:58.3345351Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:20:58.3345655Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:20:58.3345888Z     # src[softmax.py:79-92]: ...
2026-02-21T09:20:58.3346132Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=2, num_stages=8)
2026-02-21T09:20:58.3346401Z     # src[softmax.py:93]: return out
2026-02-21T09:20:58.3346570Z     return out
2026-02-21T09:20:59.1741499Z WARNING:tritonbench.utils.triton_op:Completed input ID 20:
2026-02-21T09:20:59.1745798Z (M, N)
2026-02-21T09:20:59.1747239Z ------------
2026-02-21T09:20:59.1747412Z (4096, 2816)
2026-02-21T09:20:59.1747500Z 
2026-02-21T09:20:59.1753773Z  25%|██▌       | 5/20 [12:04<37:44, 150.96s/it]WARNING:tritonbench.utils.triton_op:Running input ID 26:
2026-02-21T09:20:59.1755204Z (M, N)
2026-02-21T09:20:59.1755367Z ------------
2026-02-21T09:20:59.1755516Z (4096, 3584)
2026-02-21T09:20:59.1755876Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T09:21:00.5442325Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T09:21:02.1035269Z INFO:tritonbench.utils.triton_op:Took 2.25ms to get benchmark function for torch_compile_softmax
2026-02-21T09:21:03.3692562Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:21:03.3696105Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:21:03.3700547Z               'dtype': 'torch.float16',
2026-02-21T09:21:03.3702248Z               'shape': (4096, 3584),
2026-02-21T09:21:03.3702517Z               'stride': (3584, 1)},),
2026-02-21T09:21:03.3706658Z   'kwargs': {}}
2026-02-21T09:21:03.3713961Z INFO:tritonbench.utils.triton_op:Took 2.28ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:21:03.5449274Z [0s] Autotune random seed: 2138408546
2026-02-21T09:21:03.5697677Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:21:36.0581636Z [32s] Timeout after 30s compiling Config(block_sizes=[256, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None])
2026-02-21T09:21:36.8746842Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s
2026-02-21T09:21:43.5669403Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.9 configs/s
2026-02-21T09:21:43.5690033Z [39s] Adaptive compile timeout: 30s (90% percentile=4.1s, bounds=[30.0s, 30s])
2026-02-21T09:21:43.6875233Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 7881.2 configs/s
2026-02-21T09:21:43.7119083Z [40s] Initial random population of 100, 5 starting points: 
2026-02-21T09:21:43.7120952Z error=10
2026-02-21T09:21:43.7121177Z timeout=1
2026-02-21T09:21:43.7121368Z ok=89
2026-02-21T09:21:43.7121719Z min=0.0205
2026-02-21T09:21:43.7121895Z mid=0.3227
2026-02-21T09:21:43.7122072Z max=82.3716
2026-02-21T09:21:43.7122264Z best={'block_sizes': [1, 4096],
2026-02-21T09:21:43.7122575Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:21:43.7122894Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:21:43.7123105Z  'num_stages': 5,
2026-02-21T09:21:43.7123258Z  'num_warps': 1,
2026-02-21T09:21:43.7123417Z  'pid_type': 'flat',
2026-02-21T09:21:43.7123584Z  'range_flattens': [None, False],
2026-02-21T09:21:43.7123818Z  'range_multi_buffers': [None, False],
2026-02-21T09:21:43.7124020Z  'range_num_stages': [0, 1],
2026-02-21T09:21:43.7124209Z  'range_unroll_factors': [0, 0],
2026-02-21T09:21:43.7124404Z  'range_warp_specializes': [None, False]}
2026-02-21T09:21:43.7128833Z [40s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:21:45.2526608Z [41s] Generation 1 starting: 95 neighbors, 5 active search path(s)
2026-02-21T09:22:16.2279006Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99/99 0.9 configs/s
2026-02-21T09:22:22.2749839Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 99/99 16.5 configs/s
2026-02-21T09:22:23.7496404Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 688.5         
2026-02-21T09:22:23.7500580Z                                                                   configs/s     
2026-02-21T09:22:23.8641865Z [80s] Generation 1 complete: 
2026-02-21T09:22:23.8644359Z ok=101
2026-02-21T09:22:23.8649485Z min=0.0184
2026-02-21T09:22:23.8649748Z mid=0.0328
2026-02-21T09:22:23.8649922Z max=0.4095
2026-02-21T09:22:23.8650102Z best={'block_sizes': [1, 4096],
2026-02-21T09:22:23.8650382Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:22:23.8650684Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:22:23.8655597Z  'num_stages': 5,
2026-02-21T09:22:23.8657465Z  'num_warps': 2,
2026-02-21T09:22:23.8657699Z  'pid_type': 'flat',
2026-02-21T09:22:23.8658284Z  'range_flattens': [None, None],
2026-02-21T09:22:23.8658472Z  'range_multi_buffers': [None, False],
2026-02-21T09:22:23.8658671Z  'range_num_stages': [0, 1],
2026-02-21T09:22:23.8658837Z  'range_unroll_factors': [0, 1],
2026-02-21T09:22:23.8659042Z  'range_warp_specializes': [None, False]}
2026-02-21T09:22:23.8671256Z [80s] Fitting surrogate: 201 points, 201 targets
2026-02-21T09:22:24.9737212Z [81s] Generation 2 starting: 83 neighbors, 5 active search path(s)
2026-02-21T09:22:33.9571279Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 6.0 configs/s
2026-02-21T09:22:39.1416547Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 16.5 configs/s
2026-02-21T09:22:43.2170540Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 269.5         
2026-02-21T09:22:43.2174515Z                                                                   configs/s     
2026-02-21T09:22:43.4844282Z [99s] Generation 2 complete: 
2026-02-21T09:22:43.4848995Z ok=88
2026-02-21T09:22:43.4850204Z min=0.0184
2026-02-21T09:22:43.4850411Z mid=0.0284
2026-02-21T09:22:43.4850544Z max=0.1289
2026-02-21T09:22:43.4854444Z best={'block_sizes': [1, 4096],
2026-02-21T09:22:43.4858513Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:22:43.4860036Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:22:43.4860301Z  'num_stages': 5,
2026-02-21T09:22:43.4862589Z  'num_warps': 4,
2026-02-21T09:22:43.4862783Z  'pid_type': 'flat',
2026-02-21T09:22:43.4862961Z  'range_flattens': [None, None],
2026-02-21T09:22:43.4863151Z  'range_multi_buffers': [None, None],
2026-02-21T09:22:43.4863349Z  'range_num_stages': [0, 1],
2026-02-21T09:22:43.4863519Z  'range_unroll_factors': [0, 0],
2026-02-21T09:22:43.4863711Z  'range_warp_specializes': [None, False]}
2026-02-21T09:22:43.4864010Z [99s] Fitting surrogate: 289 points, 289 targets
2026-02-21T09:22:44.5739688Z [101s] Generation 3 starting: 80 neighbors, 5 active search path(s)
2026-02-21T09:22:53.1328670Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 5.4 configs/s
2026-02-21T09:22:58.2464200Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 16.6 configs/s
2026-02-21T09:23:02.0314753Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 268.8         
2026-02-21T09:23:02.0316095Z                                                                   configs/s     
2026-02-21T09:23:02.3292290Z [118s] Generation 3 complete: 
2026-02-21T09:23:02.3298803Z ok=86
2026-02-21T09:23:02.3302264Z min=0.0184
2026-02-21T09:23:02.3306071Z mid=0.0247
2026-02-21T09:23:02.3309934Z max=0.0984
2026-02-21T09:23:02.3314570Z best={'block_sizes': [1, 4096],
2026-02-21T09:23:02.3318232Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:23:02.3319447Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:23:02.3319646Z  'num_stages': 5,
2026-02-21T09:23:02.3319813Z  'num_warps': 4,
2026-02-21T09:23:02.3325378Z  'pid_type': 'flat',
2026-02-21T09:23:02.3329221Z  'range_flattens': [None, None],
2026-02-21T09:23:02.3330782Z  'range_multi_buffers': [None, None],
2026-02-21T09:23:02.3331069Z  'range_num_stages': [0, 1],
2026-02-21T09:23:02.3335960Z  'range_unroll_factors': [0, 0],
2026-02-21T09:23:02.3340229Z  'range_warp_specializes': [None, False]}
2026-02-21T09:23:02.3341961Z [118s] Fitting surrogate: 375 points, 375 targets
2026-02-21T09:23:03.2583784Z [119s] Generation 4 starting: 64 neighbors, 5 active search path(s)
2026-02-21T09:23:08.5678956Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 5.2 configs/s
2026-02-21T09:23:12.8793365Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 15.2 configs/s
2026-02-21T09:23:16.9722642Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 249.0         
2026-02-21T09:23:16.9724145Z                                                                   configs/s     
2026-02-21T09:23:17.3022154Z [133s] Generation 4 complete: 
2026-02-21T09:23:17.3026290Z ok=70
2026-02-21T09:23:17.3029716Z min=0.0184
2026-02-21T09:23:17.3031069Z mid=0.0225
2026-02-21T09:23:17.3031226Z max=0.0572
2026-02-21T09:23:17.3031374Z best={'block_sizes': [1, 4096],
2026-02-21T09:23:17.3031810Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:23:17.3032083Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:23:17.3032266Z  'num_stages': 5,
2026-02-21T09:23:17.3032412Z  'num_warps': 4,
2026-02-21T09:23:17.3032550Z  'pid_type': 'flat',
2026-02-21T09:23:17.3032715Z  'range_flattens': [None, False],
2026-02-21T09:23:17.3032895Z  'range_multi_buffers': [None, True],
2026-02-21T09:23:17.3033084Z  'range_num_stages': [0, 1],
2026-02-21T09:23:17.3033253Z  'range_unroll_factors': [0, 0],
2026-02-21T09:23:17.3033431Z  'range_warp_specializes': [None, False]}
2026-02-21T09:23:17.3040285Z [133s] Fitting surrogate: 445 points, 445 targets
2026-02-21T09:23:18.1269144Z [134s] Generation 5 starting: 54 neighbors, 5 active search path(s)
2026-02-21T09:23:22.1827512Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56/56 8.4 configs/s
2026-02-21T09:23:25.5892685Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 56/56 16.6 configs/s
2026-02-21T09:23:29.1866501Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 307.9         
2026-02-21T09:23:29.1867323Z                                                                   configs/s     
2026-02-21T09:23:29.4498439Z [145s] Generation 5 complete: 
2026-02-21T09:23:29.4503327Z ok=59
2026-02-21T09:23:29.4507675Z min=0.0184
2026-02-21T09:23:29.4512115Z mid=0.0184
2026-02-21T09:23:29.4516498Z max=0.0573
2026-02-21T09:23:29.4521453Z best={'block_sizes': [1, 4096],
2026-02-21T09:23:29.4525786Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:23:29.4526981Z  'load_eviction_policies': ['', ''],
2026-02-21T09:23:29.4527196Z  'maxnreg': 64,
2026-02-21T09:23:29.4527358Z  'num_sm_multiplier': 16,
2026-02-21T09:23:29.4527520Z  'num_stages': 3,
2026-02-21T09:23:29.4527697Z  'num_warps': 1,
2026-02-21T09:23:29.4527875Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:23:29.4528076Z  'range_flattens': [None, None],
2026-02-21T09:23:29.4528251Z  'range_multi_buffers': [True, None],
2026-02-21T09:23:29.4528442Z  'range_num_stages': [4, 0],
2026-02-21T09:23:29.4528604Z  'range_unroll_factors': [0, 1],
2026-02-21T09:23:29.4528784Z  'range_warp_specializes': [True, None]}
2026-02-21T09:23:29.4529001Z [145s] Fitting surrogate: 504 points, 504 targets
2026-02-21T09:23:30.3020418Z [146s] Generation 6 starting: 53 neighbors, 4 active search path(s)
2026-02-21T09:23:49.1764097Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56/56 0.9 configs/s
2026-02-21T09:23:52.5177138Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 56/56 17.0 configs/s
2026-02-21T09:23:55.0053712Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 409.4         
2026-02-21T09:23:55.0057241Z                                                                   configs/s     
2026-02-21T09:23:55.2058578Z [171s] Generation 6 complete: 
2026-02-21T09:23:55.2062242Z ok=57
2026-02-21T09:23:55.2066538Z min=0.0184
2026-02-21T09:23:55.2067956Z mid=0.0185
2026-02-21T09:23:55.2068130Z max=0.5161
2026-02-21T09:23:55.2068275Z best={'block_sizes': [1, 4096],
2026-02-21T09:23:55.2068537Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:23:55.2068789Z  'load_eviction_policies': ['', ''],
2026-02-21T09:23:55.2068979Z  'maxnreg': 64,
2026-02-21T09:23:55.2069139Z  'num_sm_multiplier': 16,
2026-02-21T09:23:55.2069302Z  'num_stages': 3,
2026-02-21T09:23:55.2069448Z  'num_warps': 1,
2026-02-21T09:23:55.2069607Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:23:55.2069806Z  'range_flattens': [None, None],
2026-02-21T09:23:55.2069986Z  'range_multi_buffers': [True, None],
2026-02-21T09:23:55.2070179Z  'range_num_stages': [4, 0],
2026-02-21T09:23:55.2070349Z  'range_unroll_factors': [0, 1],
2026-02-21T09:23:55.2077641Z  'range_warp_specializes': [True, None]}
2026-02-21T09:23:55.2080795Z [171s] Fitting surrogate: 561 points, 561 targets
2026-02-21T09:23:55.9880090Z [172s] Generation 7 starting: 41 neighbors, 3 active search path(s)
2026-02-21T09:23:58.9578985Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41/41 19.5 configs/s
2026-02-21T09:24:01.4216505Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 16.9 configs/s
2026-02-21T09:24:04.1204317Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 427.6         
2026-02-21T09:24:04.1208128Z                                                                   configs/s     
2026-02-21T09:24:04.3258719Z [180s] Generation 7 complete: 
2026-02-21T09:24:04.3263044Z ok=45
2026-02-21T09:24:04.3264589Z min=0.0165
2026-02-21T09:24:04.3264795Z mid=0.0184
2026-02-21T09:24:04.3269684Z max=0.0880
2026-02-21T09:24:04.3273509Z best={'block_sizes': [1, 4096],
2026-02-21T09:24:04.3278223Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T09:24:04.3281868Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:24:04.3285353Z  'num_stages': 8,
2026-02-21T09:24:04.3289115Z  'num_warps': 2,
2026-02-21T09:24:04.3290478Z  'pid_type': 'flat',
2026-02-21T09:24:04.3290711Z  'range_flattens': [None, None],
2026-02-21T09:24:04.3290905Z  'range_multi_buffers': [None, None],
2026-02-21T09:24:04.3291105Z  'range_num_stages': [0, 2],
2026-02-21T09:24:04.3291277Z  'range_unroll_factors': [0, 0],
2026-02-21T09:24:04.3291471Z  'range_warp_specializes': [None, False]}
2026-02-21T09:24:04.3291837Z [180s] Fitting surrogate: 606 points, 606 targets
2026-02-21T09:24:04.8592792Z [181s] Generation 8 starting: 19 neighbors, 2 active search path(s)
2026-02-21T09:24:06.4489716Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 22.4 configs/s
2026-02-21T09:24:07.6207930Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 19/19 16.8 configs/s
2026-02-21T09:24:08.8975294Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 792.9         
2026-02-21T09:24:08.8978384Z                                                                   configs/s     
2026-02-21T09:24:09.0096936Z [185s] Generation 8 complete: 
2026-02-21T09:24:09.0101160Z ok=21
2026-02-21T09:24:09.0102640Z min=0.0164
2026-02-21T09:24:09.0102825Z mid=0.0184
2026-02-21T09:24:09.0102952Z max=0.0245
2026-02-21T09:24:09.0103112Z best={'block_sizes': [1, 4096],
2026-02-21T09:24:09.0103340Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:24:09.0103562Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:24:09.0103758Z  'num_stages': 8,
2026-02-21T09:24:09.0103901Z  'num_warps': 2,
2026-02-21T09:24:09.0104052Z  'pid_type': 'flat',
2026-02-21T09:24:09.0104210Z  'range_flattens': [None, None],
2026-02-21T09:24:09.0104397Z  'range_multi_buffers': [None, None],
2026-02-21T09:24:09.0104581Z  'range_num_stages': [0, 2],
2026-02-21T09:24:09.0104755Z  'range_unroll_factors': [0, 0],
2026-02-21T09:24:09.0104936Z  'range_warp_specializes': [None, False]}
2026-02-21T09:24:09.0117382Z [185s] Fitting surrogate: 627 points, 627 targets
2026-02-21T09:24:09.5656944Z [185s] Generation 9 starting: 23 neighbors, 2 active search path(s)
2026-02-21T09:24:11.0508916Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 43.4 configs/s
2026-02-21T09:24:12.4511401Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 16.9 configs/s
2026-02-21T09:24:13.9012572Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 699.1         
2026-02-21T09:24:13.9015958Z                                                                   configs/s     
2026-02-21T09:24:14.0193088Z [190s] Generation 9 complete: 
2026-02-21T09:24:14.0197388Z ok=25
2026-02-21T09:24:14.0201976Z min=0.0164
2026-02-21T09:24:14.0205716Z mid=0.0184
2026-02-21T09:24:14.0207623Z max=0.0307
2026-02-21T09:24:14.0207806Z best={'block_sizes': [1, 4096],
2026-02-21T09:24:14.0208038Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:24:14.0208276Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:24:14.0208847Z  'num_stages': 8,
2026-02-21T09:24:14.0209044Z  'num_warps': 2,
2026-02-21T09:24:14.0209191Z  'pid_type': 'flat',
2026-02-21T09:24:14.0209362Z  'range_flattens': [None, None],
2026-02-21T09:24:14.0209546Z  'range_multi_buffers': [None, None],
2026-02-21T09:24:14.0209740Z  'range_num_stages': [0, 1],
2026-02-21T09:24:14.0209905Z  'range_unroll_factors': [0, 0],
2026-02-21T09:24:14.0210096Z  'range_warp_specializes': [None, False]}
2026-02-21T09:24:14.0215371Z [190s] Fitting surrogate: 652 points, 652 targets
2026-02-21T09:24:14.5417730Z [190s] Generation 10 starting: 22 neighbors, 2 active search path(s)
2026-02-21T09:24:16.2879475Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 14.4 configs/s
2026-02-21T09:24:17.6200976Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 22/22 17.1 configs/s
2026-02-21T09:24:19.0120032Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 728.7         
2026-02-21T09:24:19.0124246Z                                                                   configs/s     
2026-02-21T09:24:19.1327326Z [195s] Generation 10 complete: 
2026-02-21T09:24:19.1329105Z ok=24
2026-02-21T09:24:19.1329332Z min=0.0174
2026-02-21T09:24:19.1329515Z mid=0.0184
2026-02-21T09:24:19.1329678Z max=0.0266
2026-02-21T09:24:19.1329864Z best={'block_sizes': [1, 4096],
2026-02-21T09:24:19.1330099Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:24:19.1330367Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:24:19.1330589Z  'num_stages': 8,
2026-02-21T09:24:19.1330746Z  'num_warps': 2,
2026-02-21T09:24:19.1330926Z  'pid_type': 'flat',
2026-02-21T09:24:19.1331105Z  'range_flattens': [None, None],
2026-02-21T09:24:19.1331334Z  'range_multi_buffers': [None, False],
2026-02-21T09:24:19.1331810Z  'range_num_stages': [0, 1],
2026-02-21T09:24:19.1332022Z  'range_unroll_factors': [0, 0],
2026-02-21T09:24:19.1332250Z  'range_warp_specializes': [None, False]}
2026-02-21T09:24:19.1346160Z [195s] Fitting surrogate: 676 points, 676 targets
2026-02-21T09:24:19.6746480Z [196s] Generation 11 starting: 20 neighbors, 2 active search path(s)
2026-02-21T09:24:21.1345705Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 15.0 configs/s
2026-02-21T09:24:22.3406089Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 17.2 configs/s
2026-02-21T09:24:23.6700644Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 762.5         
2026-02-21T09:24:23.6704447Z                                                                   configs/s     
2026-02-21T09:24:23.7848210Z [200s] Generation 11 complete: 
2026-02-21T09:24:23.7849423Z ok=22
2026-02-21T09:24:23.7849610Z min=0.0164
2026-02-21T09:24:23.7849748Z mid=0.0184
2026-02-21T09:24:23.7849883Z max=0.0205
2026-02-21T09:24:23.7850029Z best={'block_sizes': [1, 4096],
2026-02-21T09:24:23.7850277Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:24:23.7850530Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:24:23.7850745Z  'num_stages': 8,
2026-02-21T09:24:23.7851361Z  'num_warps': 2,
2026-02-21T09:24:23.7851635Z  'pid_type': 'flat',
2026-02-21T09:24:23.7851815Z  'range_flattens': [None, None],
2026-02-21T09:24:23.7852000Z  'range_multi_buffers': [None, False],
2026-02-21T09:24:23.7852204Z  'range_num_stages': [0, 1],
2026-02-21T09:24:23.7852375Z  'range_unroll_factors': [0, 0],
2026-02-21T09:24:23.7852582Z  'range_warp_specializes': [None, False]}
2026-02-21T09:24:23.7869831Z [200s] Fitting surrogate: 698 points, 698 targets
2026-02-21T09:24:24.1868283Z [200s] Generation 12 starting: 11 neighbors, 1 active search path(s)
2026-02-21T09:24:25.0279482Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 22.6 configs/s
2026-02-21T09:24:25.6961916Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 17.7 configs/s
2026-02-21T09:24:26.7628909Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1406.2        
2026-02-21T09:24:26.7629248Z                                                                   configs/s     
2026-02-21T09:24:26.8301992Z [203s] Generation 12 complete: 
2026-02-21T09:24:26.8306206Z ok=12
2026-02-21T09:24:26.8307687Z min=0.0164
2026-02-21T09:24:26.8307845Z mid=0.0184
2026-02-21T09:24:26.8307976Z max=0.0184
2026-02-21T09:24:26.8308113Z best={'block_sizes': [1, 4096],
2026-02-21T09:24:26.8308349Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:24:26.8308589Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:24:26.8308785Z  'num_stages': 8,
2026-02-21T09:24:26.8308924Z  'num_warps': 2,
2026-02-21T09:24:26.8309072Z  'pid_type': 'flat',
2026-02-21T09:24:26.8309234Z  'range_flattens': [None, None],
2026-02-21T09:24:26.8309408Z  'range_multi_buffers': [None, False],
2026-02-21T09:24:26.8309595Z  'range_num_stages': [0, 2],
2026-02-21T09:24:26.8309757Z  'range_unroll_factors': [0, 1],
2026-02-21T09:24:26.8309940Z  'range_warp_specializes': [None, False]}
2026-02-21T09:24:26.8318085Z [203s] Fitting surrogate: 710 points, 710 targets
2026-02-21T09:24:27.2515877Z [203s] Generation 13 starting: 13 neighbors, 1 active search path(s)
2026-02-21T09:24:28.3629019Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 22.7 configs/s
2026-02-21T09:24:29.1580155Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 13/13 17.3 configs/s
2026-02-21T09:24:29.9922629Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1208.2 configs/s
2026-02-21T09:24:30.0704400Z [206s] Generation 13 complete: 
2026-02-21T09:24:30.0704672Z ok=14
2026-02-21T09:24:30.0708931Z min=0.0164
2026-02-21T09:24:30.0712346Z mid=0.0184
2026-02-21T09:24:30.0714864Z max=0.0184
2026-02-21T09:24:30.0718110Z best={'block_sizes': [1, 4096],
2026-02-21T09:24:30.0722204Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:24:30.0726534Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:24:30.0730812Z  'num_stages': 8,
2026-02-21T09:24:30.0732219Z  'num_warps': 2,
2026-02-21T09:24:30.0732785Z  'pid_type': 'flat',
2026-02-21T09:24:30.0732965Z  'range_flattens': [None, None],
2026-02-21T09:24:30.0733155Z  'range_multi_buffers': [None, False],
2026-02-21T09:24:30.0733339Z  'range_num_stages': [0, 2],
2026-02-21T09:24:30.0733508Z  'range_unroll_factors': [0, 1],
2026-02-21T09:24:30.0733684Z  'range_warp_specializes': [None, False]}
2026-02-21T09:24:30.0738645Z [206s] Fitting surrogate: 724 points, 724 targets
2026-02-21T09:24:30.4767123Z [206s] Generation 14 starting: 11 neighbors, 1 active search path(s)
2026-02-21T09:24:31.5668939Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 13.8 configs/s
2026-02-21T09:24:32.2395388Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 17.5 configs/s
2026-02-21T09:24:32.8329652Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1687.5 configs/s
2026-02-21T09:24:32.8909979Z [209s] Generation 14 complete: 
2026-02-21T09:24:32.8914055Z ok=12
2026-02-21T09:24:32.8918488Z min=0.0165
2026-02-21T09:24:32.8919862Z mid=0.0184
2026-02-21T09:24:32.8920026Z max=0.0267
2026-02-21T09:24:32.8920173Z best={'block_sizes': [1, 4096],
2026-02-21T09:24:32.8920432Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:24:32.8920693Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:24:32.8920903Z  'num_stages': 8,
2026-02-21T09:24:32.8921042Z  'num_warps': 2,
2026-02-21T09:24:32.8921193Z  'pid_type': 'flat',
2026-02-21T09:24:32.8921351Z  'range_flattens': [None, None],
2026-02-21T09:24:32.8921912Z  'range_multi_buffers': [None, False],
2026-02-21T09:24:32.8922117Z  'range_num_stages': [0, 1],
2026-02-21T09:24:32.8922285Z  'range_unroll_factors': [0, 1],
2026-02-21T09:24:32.8922491Z  'range_warp_specializes': [None, False]}
2026-02-21T09:24:32.8937682Z [209s] Fitting surrogate: 736 points, 736 targets
2026-02-21T09:24:33.2827662Z [209s] Generation 15 starting: 10 neighbors, 1 active search path(s)
2026-02-21T09:24:34.3542774Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 13.0 configs/s
2026-02-21T09:24:34.9638281Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 10/10 17.7 configs/s
2026-02-21T09:24:35.5554785Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1692.3 configs/s
2026-02-21T09:24:35.6153467Z [212s] Generation 15 complete: 
2026-02-21T09:24:35.6157134Z ok=11
2026-02-21T09:24:35.6161803Z min=0.0164
2026-02-21T09:24:35.6165977Z mid=0.0183
2026-02-21T09:24:35.6170406Z max=0.0266
2026-02-21T09:24:35.6173655Z best={'block_sizes': [1, 4096],
2026-02-21T09:24:35.6175920Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:24:35.6176209Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:24:35.6176459Z  'num_stages': 8,
2026-02-21T09:24:35.6180241Z  'num_warps': 2,
2026-02-21T09:24:35.6184675Z  'pid_type': 'flat',
2026-02-21T09:24:35.6184921Z  'range_flattens': [None, None],
2026-02-21T09:24:35.6189062Z  'range_multi_buffers': [None, True],
2026-02-21T09:24:35.6192191Z  'range_num_stages': [0, 1],
2026-02-21T09:24:35.6196171Z  'range_unroll_factors': [0, 1],
2026-02-21T09:24:35.6196446Z  'range_warp_specializes': [None, False]}
2026-02-21T09:24:35.6200520Z [212s] Fitting surrogate: 747 points, 747 targets
2026-02-21T09:24:36.0031265Z [212s] Generation 16 starting: 10 neighbors, 1 active search path(s)
2026-02-21T09:24:37.0281189Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 15.5 configs/s
2026-02-21T09:24:37.6399834Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 10/10 17.7 configs/s
2026-02-21T09:24:38.2419996Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1664.6 configs/s
2026-02-21T09:24:38.3025150Z [214s] Generation 16 complete: 
2026-02-21T09:24:38.3029335Z ok=11
2026-02-21T09:24:38.3033702Z min=0.0183
2026-02-21T09:24:38.3038058Z mid=0.0184
2026-02-21T09:24:38.3041165Z max=0.0246
2026-02-21T09:24:38.3044434Z best={'block_sizes': [1, 4096],
2026-02-21T09:24:38.3048397Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:24:38.3048755Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:24:38.3048975Z  'num_stages': 8,
2026-02-21T09:24:38.3053311Z  'num_warps': 2,
2026-02-21T09:24:38.3057763Z  'pid_type': 'flat',
2026-02-21T09:24:38.3062035Z  'range_flattens': [None, None],
2026-02-21T09:24:38.3066899Z  'range_multi_buffers': [None, True],
2026-02-21T09:24:38.3071833Z  'range_num_stages': [0, 1],
2026-02-21T09:24:38.3076104Z  'range_unroll_factors': [0, 1],
2026-02-21T09:24:38.3080475Z  'range_warp_specializes': [None, False]}
2026-02-21T09:24:38.3083634Z [214s] Fitting surrogate: 758 points, 758 targets
2026-02-21T09:24:38.6770946Z [215s] Generation 17 starting: 10 neighbors, 1 active search path(s)
2026-02-21T09:24:39.7776942Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 12.6 configs/s
2026-02-21T09:24:40.3849674Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 10/10 17.8 configs/s
2026-02-21T09:24:40.9894978Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1657.1 configs/s
2026-02-21T09:24:41.0480347Z [217s] Generation 17 complete: 
2026-02-21T09:24:41.0483586Z ok=11
2026-02-21T09:24:41.0488597Z min=0.0184
2026-02-21T09:24:41.0492913Z mid=0.0184
2026-02-21T09:24:41.0494352Z max=0.0266
2026-02-21T09:24:41.0494544Z best={'block_sizes': [1, 4096],
2026-02-21T09:24:41.0494786Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:24:41.0495049Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:24:41.0495256Z  'num_stages': 8,
2026-02-21T09:24:41.0495402Z  'num_warps': 2,
2026-02-21T09:24:41.0495585Z  'pid_type': 'flat',
2026-02-21T09:24:41.0495763Z  'range_flattens': [None, None],
2026-02-21T09:24:41.0495961Z  'range_multi_buffers': [None, True],
2026-02-21T09:24:41.0496143Z  'range_num_stages': [0, 1],
2026-02-21T09:24:41.0496316Z  'range_unroll_factors': [0, 1],
2026-02-21T09:24:41.0496495Z  'range_warp_specializes': [None, False]}
2026-02-21T09:24:41.0509251Z [217s] Fitting surrogate: 769 points, 769 targets
2026-02-21T09:24:41.3276946Z [217s] Autotuning complete in 217.8s after searching 734 configs.
2026-02-21T09:24:41.3281362Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:24:41.3283454Z     @helion.kernel(config=helion.Config(block_sizes=[1, 4096], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=8, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, False]), static_shapes=True)
2026-02-21T09:24:41.3284358Z 
2026-02-21T09:24:41.3284646Z [217s] Code of selected kernel: /tmp/torchinductor_root/ww/cww7kwfj4efxgrw7h2zuolovdqiaferwiztrkeu6jzgpxtlfzzv4.py
2026-02-21T09:24:41.3505982Z from __future__ import annotations
2026-02-21T09:24:41.3509776Z 
2026-02-21T09:24:41.3511819Z import torch
2026-02-21T09:24:41.3511997Z import triton
2026-02-21T09:24:41.3512157Z import triton.language as tl
2026-02-21T09:24:41.3512363Z from torch._inductor.runtime import triton_helpers
2026-02-21T09:24:41.3512632Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T09:24:41.3512917Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:24:41.3513101Z 
2026-02-21T09:24:41.3513171Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T09:24:41.3513348Z _BLOCK_SIZE_1 = tl.constexpr(4096)
2026-02-21T09:24:41.3513471Z 
2026-02-21T09:24:41.3513526Z @triton.jit
2026-02-21T09:24:41.3513677Z def _helion_softmax_two_pass(x, out):
2026-02-21T09:24:41.3513940Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:24:41.3514511Z     pid_0 = tl.program_id(0)
2026-02-21T09:24:41.3514682Z     offset_0 = pid_0
2026-02-21T09:24:41.3518425Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T09:24:41.3518788Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:24:41.3522899Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T09:24:41.3527724Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:24:41.3529473Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T09:24:41.3529824Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:24:41.3534778Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T09:24:41.3536883Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T09:24:41.3537203Z     # src[softmax.py:82-89]: ...
2026-02-21T09:24:41.3542117Z     for offset_2 in tl.range(0, 3584, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=False, num_stages=1, disallow_acc_multi_buffer=False):
2026-02-21T09:24:41.3546263Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:24:41.3546605Z         mask_1 = indices_2 < 3584
2026-02-21T09:24:41.3546814Z         mi_copy = mi
2026-02-21T09:24:41.3551267Z         di_copy = di
2026-02-21T09:24:41.3555414Z         mi_copy_0 = mi_copy
2026-02-21T09:24:41.3560069Z         di_copy_0 = di_copy
2026-02-21T09:24:41.3565485Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T09:24:41.3567134Z         values = tl.load(x + (indices_0[:, None] * 3584 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first')
2026-02-21T09:24:41.3567587Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T09:24:41.3568028Z         _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16))
2026-02-21T09:24:41.3568458Z         local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16)
2026-02-21T09:24:41.3568741Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T09:24:41.3568996Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T09:24:41.3569217Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T09:24:41.3569491Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:24:41.3569729Z         v_2 = mi_copy_0 - v_1
2026-02-21T09:24:41.3569904Z         v_3 = libdevice.exp(v_2)
2026-02-21T09:24:41.3570071Z         v_4 = di_copy_0 * v_3
2026-02-21T09:24:41.3570267Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T09:24:41.3570475Z         subscript = v_1[:, None]
2026-02-21T09:24:41.3570649Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T09:24:41.3570835Z         v_6 = v_5 - subscript
2026-02-21T09:24:41.3571045Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:24:41.3571318Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T09:24:41.3571611Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T09:24:41.3571813Z         v_7 = libdevice.exp(v_6)
2026-02-21T09:24:41.3572148Z         _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32))
2026-02-21T09:24:41.3572505Z         sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32)
2026-02-21T09:24:41.3572714Z         di = v_4 + sum_1
2026-02-21T09:24:41.3572876Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T09:24:41.3573057Z         mi = v_1
2026-02-21T09:24:41.3573253Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:24:41.3573534Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T09:24:41.3573827Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:24:41.3574313Z     for offset_2 in tl.range(0, 3584, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=False, num_stages=1, disallow_acc_multi_buffer=False):
2026-02-21T09:24:41.3574977Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:24:41.3575212Z         mask_2 = indices_2 < 3584
2026-02-21T09:24:41.3575387Z         mi_copy_1 = mi
2026-02-21T09:24:41.3575532Z         di_copy_1 = di
2026-02-21T09:24:41.3575686Z         mi_copy_1_0 = mi_copy_1
2026-02-21T09:24:41.3575850Z         di_copy_1_0 = di_copy_1
2026-02-21T09:24:41.3576040Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T09:24:41.3576412Z         values_1 = tl.load(x + (indices_0[:, None] * 3584 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first')
2026-02-21T09:24:41.3576864Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:24:41.3577146Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T09:24:41.3577328Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T09:24:41.3577587Z         v_10 = v_9 - subscript_1
2026-02-21T09:24:41.3577769Z         v_11 = libdevice.exp(v_10)
2026-02-21T09:24:41.3577945Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T09:24:41.3578129Z         v_12 = v_11 / subscript_2
2026-02-21T09:24:41.3578297Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T09:24:41.3578568Z         tl.store(out + (indices_0[:, None] * 3584 + indices_2[None, :] * 1), v_13, mask_2[None, :])
2026-02-21T09:24:41.3578783Z 
2026-02-21T09:24:41.3578910Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T09:24:41.3579146Z     """
2026-02-21T09:24:41.3579358Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T09:24:41.3579660Z     This version uses fewer passes but is less numerically stable.
2026-02-21T09:24:41.3579891Z     Args:
2026-02-21T09:24:41.3580049Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T09:24:41.3580247Z     Returns:
2026-02-21T09:24:41.3580423Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T09:24:41.3580635Z     """
2026-02-21T09:24:41.3580769Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T09:24:41.3580948Z     m, n = x.size()
2026-02-21T09:24:41.3581118Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T09:24:41.3581311Z     out = torch.empty_like(x)
2026-02-21T09:24:41.3581577Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:24:41.3581885Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:24:41.3582199Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:24:41.3582434Z     # src[softmax.py:79-92]: ...
2026-02-21T09:24:41.3582714Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=2, num_stages=8)
2026-02-21T09:24:41.3582998Z     # src[softmax.py:93]: return out
2026-02-21T09:24:41.3583173Z     return out
2026-02-21T09:24:42.1946110Z WARNING:tritonbench.utils.triton_op:Completed input ID 26:
2026-02-21T09:24:42.1949816Z (M, N)
2026-02-21T09:24:42.1954887Z ------------
2026-02-21T09:24:42.1956366Z (4096, 3584)
2026-02-21T09:24:42.1956501Z 
2026-02-21T09:24:42.1957002Z  30%|███       | 6/20 [15:47<40:56, 175.46s/it]
2026-02-21T09:24:42.1957002Z WARNING:tritonbench.utils.triton_op:Running input ID 31:
2026-02-21T09:24:42.1961970Z (M, N)
2026-02-21T09:24:42.1966309Z ------------
2026-02-21T09:24:42.1967685Z (4096, 4224)
2026-02-21T09:24:42.1968002Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T09:24:43.4733746Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T09:24:44.8452609Z INFO:tritonbench.utils.triton_op:Took 2.38ms to get benchmark function for torch_compile_softmax
2026-02-21T09:24:46.1441507Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:24:46.1443790Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:24:46.1444014Z               'dtype': 'torch.float16',
2026-02-21T09:24:46.1444249Z               'shape': (4096, 4224),
2026-02-21T09:24:46.1444777Z               'stride': (4224, 1)},),
2026-02-21T09:24:46.1444986Z   'kwargs': {}}
2026-02-21T09:24:46.1479603Z INFO:tritonbench.utils.triton_op:Took 3.87ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:24:46.3211275Z [0s] Autotune random seed: 2138408546
2026-02-21T09:24:46.3456935Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:25:20.4595738Z [34s] Timeout after 30s compiling Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None])
2026-02-21T09:25:20.4609701Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s
2026-02-21T09:25:22.4264197Z module {
2026-02-21T09:25:22.4270836Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:25:22.4271689Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:25:22.4271922Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:25:22.4272133Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:25:22.4272359Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:25:22.4272579Z     %cst = arith.constant dense<4224> : tensor<16x1xi32>
2026-02-21T09:25:22.4272852Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<16xf32>
2026-02-21T09:25:22.4273123Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<16xf32>
2026-02-21T09:25:22.4273790Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:25:22.4273983Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:25:22.4274185Z     %c4224_i32 = arith.constant 4224 : i32
2026-02-21T09:25:22.4274445Z     %c4224_i64 = arith.constant 4224 : i64
2026-02-21T09:25:22.4276420Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T09:25:22.4276849Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4224_i32], [%c4224_i64, %c1_i64] : <f16>, <tensor<16x128xf16>>
2026-02-21T09:25:22.4280476Z     %1 = tt.get_program_id x : i32
2026-02-21T09:25:22.4280776Z     %2 = arith.addi %1, %c1_i32 : i32
2026-02-21T09:25:22.4281007Z     %3 = arith.minsi %2, %c256_i32 : i32
2026-02-21T09:25:22.4281269Z     scf.for %arg2 = %1 to %3 step %c1_i32  : i32 {
2026-02-21T09:25:22.4284028Z       %4 = arith.muli %arg2, %c16_i32 : i32
2026-02-21T09:25:22.4284324Z       %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T09:25:22.4284630Z       %6 = tt.splat %4 : i32 -> tensor<16xi32>
2026-02-21T09:25:22.4284880Z       %7 = arith.addi %6, %5 : tensor<16xi32>
2026-02-21T09:25:22.4287585Z       %c4096_i32_2 = arith.constant 4096 : i32
2026-02-21T09:25:22.4287821Z       %c512_i32 = arith.constant 512 : i32
2026-02-21T09:25:22.4288274Z       %8:2 = scf.for %arg3 = %c0_i32 to %c4096_i32_2 step %c512_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<16xf32>, tensor<16xf32>)  : i32 {
2026-02-21T09:25:22.4288881Z         %50 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:25:22.4289305Z         %51 = arith.extf %50 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:25:22.4289609Z         %52 = "tt.reduce"(%51) <{axis = 1 : i32}> ({
2026-02-21T09:25:22.4289862Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:25:22.4290099Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:25:22.4290327Z           tt.reduce.return %128 : f32
2026-02-21T09:25:22.4290541Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:25:22.4290808Z         %53 = arith.truncf %52 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:25:22.4291095Z         %54 = arith.extf %53 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:25:22.4291364Z         %55 = arith.cmpf ogt, %arg4, %54 : tensor<16xf32>
2026-02-21T09:25:22.4292001Z         %56 = arith.cmpf une, %arg4, %arg4 : tensor<16xf32>
2026-02-21T09:25:22.4292268Z         %57 = arith.ori %55, %56 : tensor<16xi1>
2026-02-21T09:25:22.4292547Z         %58 = arith.select %57, %arg4, %54 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:25:22.4292825Z         %59 = arith.subf %arg4, %58 : tensor<16xf32>
2026-02-21T09:25:22.4293255Z         %60 = tt.extern_elementwise %59 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:25:22.4293687Z         %61 = arith.mulf %arg5, %60 : tensor<16xf32>
2026-02-21T09:25:22.4293981Z         %62 = tt.expand_dims %58 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:25:22.4294328Z         %63 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:25:22.4294608Z         %64 = arith.subf %51, %63 : tensor<16x128xf32>
2026-02-21T09:25:22.4295122Z         %65 = tt.extern_elementwise %64 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:25:22.4295557Z         %66 = "tt.reduce"(%65) <{axis = 1 : i32}> ({
2026-02-21T09:25:22.4295779Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:25:22.4296001Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:25:22.4296223Z           tt.reduce.return %128 : f32
2026-02-21T09:25:22.4296449Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:25:22.4296680Z         %67 = arith.addf %61, %66 : tensor<16xf32>
2026-02-21T09:25:22.4296915Z         %c1_i32_5 = arith.constant 1 : i32
2026-02-21T09:25:22.4297143Z         %68 = arith.muli %c128_i32, %c1_i32_5 : i32
2026-02-21T09:25:22.4297365Z         %69 = arith.addi %arg3, %68 : i32
2026-02-21T09:25:22.4297694Z         %70 = tt.descriptor_load %0[%4, %69] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:25:22.4298064Z         %71 = arith.extf %70 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:25:22.4298336Z         %72 = "tt.reduce"(%71) <{axis = 1 : i32}> ({
2026-02-21T09:25:22.4298563Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:25:22.4298775Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:25:22.4298995Z           tt.reduce.return %128 : f32
2026-02-21T09:25:22.4299197Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:25:22.4299453Z         %73 = arith.truncf %72 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:25:22.4299720Z         %74 = arith.extf %73 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:25:22.4299983Z         %75 = arith.cmpf ogt, %58, %74 : tensor<16xf32>
2026-02-21T09:25:22.4300216Z         %76 = arith.cmpf une, %58, %58 : tensor<16xf32>
2026-02-21T09:25:22.4300448Z         %77 = arith.ori %75, %76 : tensor<16xi1>
2026-02-21T09:25:22.4300707Z         %78 = arith.select %77, %58, %74 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:25:22.4300979Z         %79 = arith.subf %58, %78 : tensor<16xf32>
2026-02-21T09:25:22.4301396Z         %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:25:22.4301854Z         %81 = arith.mulf %67, %80 : tensor<16xf32>
2026-02-21T09:25:22.4302144Z         %82 = tt.expand_dims %78 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:25:22.4302476Z         %83 = tt.broadcast %82 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:25:22.4302753Z         %84 = arith.subf %71, %83 : tensor<16x128xf32>
2026-02-21T09:25:22.4303169Z         %85 = tt.extern_elementwise %84 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:25:22.4303549Z         %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({
2026-02-21T09:25:22.4303753Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:25:22.4303943Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:25:22.4304148Z           tt.reduce.return %128 : f32
2026-02-21T09:25:22.4304344Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:25:22.4304636Z         %87 = arith.addf %81, %86 : tensor<16xf32>
2026-02-21T09:25:22.4304847Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:25:22.4305048Z         %88 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T09:25:22.4305256Z         %89 = arith.addi %arg3, %88 : i32
2026-02-21T09:25:22.4305547Z         %90 = tt.descriptor_load %0[%4, %89] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:25:22.4305893Z         %91 = arith.extf %90 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:25:22.4306137Z         %92 = "tt.reduce"(%91) <{axis = 1 : i32}> ({
2026-02-21T09:25:22.4306347Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:25:22.4306549Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:25:22.4306753Z           tt.reduce.return %128 : f32
2026-02-21T09:25:22.4306955Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:25:22.4307189Z         %93 = arith.truncf %92 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:25:22.4307509Z         %94 = arith.extf %93 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:25:22.4307753Z         %95 = arith.cmpf ogt, %78, %94 : tensor<16xf32>
2026-02-21T09:25:22.4307980Z         %96 = arith.cmpf une, %78, %78 : tensor<16xf32>
2026-02-21T09:25:22.4308190Z         %97 = arith.ori %95, %96 : tensor<16xi1>
2026-02-21T09:25:22.4308436Z         %98 = arith.select %97, %78, %94 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:25:22.4308690Z         %99 = arith.subf %78, %98 : tensor<16xf32>
2026-02-21T09:25:22.4309066Z         %100 = tt.extern_elementwise %99 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:25:22.4309463Z         %101 = arith.mulf %87, %100 : tensor<16xf32>
2026-02-21T09:25:22.4309754Z         %102 = tt.expand_dims %98 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:25:22.4310112Z         %103 = tt.broadcast %102 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:25:22.4310405Z         %104 = arith.subf %91, %103 : tensor<16x128xf32>
2026-02-21T09:25:22.4310808Z         %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:25:22.4311214Z         %106 = "tt.reduce"(%105) <{axis = 1 : i32}> ({
2026-02-21T09:25:22.4311417Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:25:22.4311652Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:25:22.4311852Z           tt.reduce.return %128 : f32
2026-02-21T09:25:22.4312059Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:25:22.4312283Z         %107 = arith.addf %101, %106 : tensor<16xf32>
2026-02-21T09:25:22.4312493Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T09:25:22.4312702Z         %108 = arith.muli %c128_i32, %c3_i32 : i32
2026-02-21T09:25:22.4312904Z         %109 = arith.addi %arg3, %108 : i32
2026-02-21T09:25:22.4313218Z         %110 = tt.descriptor_load %0[%4, %109] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:25:22.4313570Z         %111 = arith.extf %110 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:25:22.4313835Z         %112 = "tt.reduce"(%111) <{axis = 1 : i32}> ({
2026-02-21T09:25:22.4314045Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:25:22.4314246Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:25:22.4314468Z           tt.reduce.return %128 : f32
2026-02-21T09:25:22.4314667Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:25:22.4314936Z         %113 = arith.truncf %112 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:25:22.4315224Z         %114 = arith.extf %113 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:25:22.4315503Z         %115 = arith.cmpf ogt, %98, %114 : tensor<16xf32>
2026-02-21T09:25:22.4315757Z         %116 = arith.cmpf une, %98, %98 : tensor<16xf32>
2026-02-21T09:25:22.4316000Z         %117 = arith.ori %115, %116 : tensor<16xi1>
2026-02-21T09:25:22.4316285Z         %118 = arith.select %117, %98, %114 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:25:22.4316595Z         %119 = arith.subf %98, %118 : tensor<16xf32>
2026-02-21T09:25:22.4316981Z         %120 = tt.extern_elementwise %119 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:25:22.4317368Z         %121 = arith.mulf %107, %120 : tensor<16xf32>
2026-02-21T09:25:22.4317652Z         %122 = tt.expand_dims %118 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:25:22.4318004Z         %123 = tt.broadcast %122 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:25:22.4318282Z         %124 = arith.subf %111, %123 : tensor<16x128xf32>
2026-02-21T09:25:22.4318705Z         %125 = tt.extern_elementwise %124 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:25:22.4319101Z         %126 = "tt.reduce"(%125) <{axis = 1 : i32}> ({
2026-02-21T09:25:22.4319330Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:25:22.4319592Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:25:22.4319822Z           tt.reduce.return %128 : f32
2026-02-21T09:25:22.4320039Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:25:22.4320270Z         %127 = arith.addf %121, %126 : tensor<16xf32>
2026-02-21T09:25:22.4320528Z         scf.yield %118, %127 : tensor<16xf32>, tensor<16xf32>
2026-02-21T09:25:22.4320778Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:25:22.4321134Z       %9 = tt.descriptor_load %0[%4, %c4096_i32_2] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:25:22.4321516Z       %10 = arith.extf %9 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:25:22.4321839Z       %11 = "tt.reduce"(%10) <{axis = 1 : i32}> ({
2026-02-21T09:25:22.4322064Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:25:22.4322271Z         %50 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:25:22.4322496Z         tt.reduce.return %50 : f32
2026-02-21T09:25:22.4322708Z       }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:25:22.4322975Z       %12 = arith.truncf %11 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:25:22.4323275Z       %13 = arith.extf %12 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:25:22.4323550Z       %14 = arith.cmpf ogt, %8#0, %13 : tensor<16xf32>
2026-02-21T09:25:22.4323792Z       %15 = arith.cmpf une, %8#0, %8#0 : tensor<16xf32>
2026-02-21T09:25:22.4324029Z       %16 = arith.ori %14, %15 : tensor<16xi1>
2026-02-21T09:25:22.4324293Z       %17 = arith.select %16, %8#0, %13 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:25:22.4324557Z       %18 = arith.subf %8#0, %17 : tensor<16xf32>
2026-02-21T09:25:22.4324969Z       %19 = tt.extern_elementwise %18 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:25:22.4325377Z       %20 = arith.mulf %8#1, %19 : tensor<16xf32>
2026-02-21T09:25:22.4325668Z       %21 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:25:22.4326002Z       %22 = tt.broadcast %21 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:25:22.4326278Z       %23 = arith.subf %10, %22 : tensor<16x128xf32>
2026-02-21T09:25:22.4326699Z       %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:25:22.4327108Z       %25 = "tt.reduce"(%24) <{axis = 1 : i32}> ({
2026-02-21T09:25:22.4327332Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:25:22.4327531Z         %50 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:25:22.4327745Z         tt.reduce.return %50 : f32
2026-02-21T09:25:22.4327948Z       }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:25:22.4328178Z       %26 = arith.addf %20, %25 : tensor<16xf32>
2026-02-21T09:25:22.4328401Z       %c4096_i32_3 = arith.constant 4096 : i32
2026-02-21T09:25:22.4328620Z       %c512_i32_4 = arith.constant 512 : i32
2026-02-21T09:25:22.4328886Z       scf.for %arg3 = %c0_i32 to %c4096_i32_3 step %c512_i32_4  : i32 {
2026-02-21T09:25:22.4329212Z         %50 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:25:22.4329645Z         %51 = tt.splat %arg3 : i32 -> tensor<128xi32>
2026-02-21T09:25:22.4329882Z         %52 = arith.addi %51, %50 : tensor<128xi32>
2026-02-21T09:25:22.4330176Z         %53 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:25:22.4330482Z         %54 = arith.muli %53, %cst : tensor<16x1xi32>
2026-02-21T09:25:22.4330765Z         %55 = tt.expand_dims %52 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:25:22.4331084Z         %56 = tt.broadcast %54 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:25:22.4331365Z         %57 = tt.broadcast %55 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:25:22.4331676Z         %58 = arith.addi %56, %57 : tensor<16x128xi32>
2026-02-21T09:25:22.4331930Z         %59 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4332299Z         %60 = tt.addptr %59, %58 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:25:22.4332632Z         %61 = tt.load %60 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4332962Z         %62 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:25:22.4333269Z         %63 = arith.extf %61 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:25:22.4333541Z         %64 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:25:22.4333792Z         %65 = arith.subf %63, %64 : tensor<16x128xf32>
2026-02-21T09:25:22.4334190Z         %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:25:22.4334628Z         %67 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:25:22.4334932Z         %68 = tt.broadcast %67 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:25:22.4335182Z         %69 = arith.divf %66, %68 : tensor<16x128xf32>
2026-02-21T09:25:22.4335443Z         %70 = arith.truncf %69 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:25:22.4335728Z         %71 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4336030Z         %72 = tt.addptr %71, %58 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:25:22.4336310Z         tt.store %72, %70 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4336531Z         %c1_i32_5 = arith.constant 1 : i32
2026-02-21T09:25:22.4336739Z         %73 = arith.muli %c128_i32, %c1_i32_5 : i32
2026-02-21T09:25:22.4336943Z         %74 = arith.addi %arg3, %73 : i32
2026-02-21T09:25:22.4337194Z         %75 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:25:22.4337458Z         %76 = tt.splat %74 : i32 -> tensor<128xi32>
2026-02-21T09:25:22.4337672Z         %77 = arith.addi %76, %75 : tensor<128xi32>
2026-02-21T09:25:22.4337941Z         %78 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:25:22.4338214Z         %79 = arith.muli %78, %cst : tensor<16x1xi32>
2026-02-21T09:25:22.4338490Z         %80 = tt.expand_dims %77 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:25:22.4338796Z         %81 = tt.broadcast %79 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:25:22.4339076Z         %82 = tt.broadcast %80 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:25:22.4339325Z         %83 = arith.addi %81, %82 : tensor<16x128xi32>
2026-02-21T09:25:22.4339568Z         %84 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4339866Z         %85 = tt.addptr %84, %83 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:25:22.4340181Z         %86 = tt.load %85 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4340515Z         %87 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:25:22.4340816Z         %88 = arith.extf %86 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:25:22.4341154Z         %89 = tt.broadcast %87 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:25:22.4341407Z         %90 = arith.subf %88, %89 : tensor<16x128xf32>
2026-02-21T09:25:22.4341860Z         %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:25:22.4342308Z         %92 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:25:22.4342606Z         %93 = tt.broadcast %92 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:25:22.4342857Z         %94 = arith.divf %91, %93 : tensor<16x128xf32>
2026-02-21T09:25:22.4343108Z         %95 = arith.truncf %94 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:25:22.4343390Z         %96 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4348491Z         %97 = tt.addptr %96, %83 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:25:22.4348821Z         tt.store %97, %95 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4349052Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:25:22.4349255Z         %98 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T09:25:22.4349468Z         %99 = arith.addi %arg3, %98 : i32
2026-02-21T09:25:22.4351352Z         %100 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:25:22.4351680Z         %101 = tt.splat %99 : i32 -> tensor<128xi32>
2026-02-21T09:25:22.4351909Z         %102 = arith.addi %101, %100 : tensor<128xi32>
2026-02-21T09:25:22.4352179Z         %103 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:25:22.4352492Z         %104 = arith.muli %103, %cst : tensor<16x1xi32>
2026-02-21T09:25:22.4354419Z         %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:25:22.4354771Z         %106 = tt.broadcast %104 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:25:22.4355090Z         %107 = tt.broadcast %105 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:25:22.4355374Z         %108 = arith.addi %106, %107 : tensor<16x128xi32>
2026-02-21T09:25:22.4355659Z         %109 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4355984Z         %110 = tt.addptr %109, %108 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:25:22.4356358Z         %111 = tt.load %110 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4356723Z         %112 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:25:22.4357068Z         %113 = arith.extf %111 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:25:22.4357383Z         %114 = tt.broadcast %112 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:25:22.4357662Z         %115 = arith.subf %113, %114 : tensor<16x128xf32>
2026-02-21T09:25:22.4358102Z         %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:25:22.4358586Z         %117 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:25:22.4358928Z         %118 = tt.broadcast %117 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:25:22.4359210Z         %119 = arith.divf %116, %118 : tensor<16x128xf32>
2026-02-21T09:25:22.4359486Z         %120 = arith.truncf %119 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:25:22.4359815Z         %121 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4360140Z         %122 = tt.addptr %121, %108 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:25:22.4360448Z         tt.store %122, %120 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4360682Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T09:25:22.4360906Z         %123 = arith.muli %c128_i32, %c3_i32 : i32
2026-02-21T09:25:22.4361132Z         %124 = arith.addi %arg3, %123 : i32
2026-02-21T09:25:22.4361402Z         %125 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:25:22.4361803Z         %126 = tt.splat %124 : i32 -> tensor<128xi32>
2026-02-21T09:25:22.4362037Z         %127 = arith.addi %126, %125 : tensor<128xi32>
2026-02-21T09:25:22.4362312Z         %128 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:25:22.4362597Z         %129 = arith.muli %128, %cst : tensor<16x1xi32>
2026-02-21T09:25:22.4362881Z         %130 = tt.expand_dims %127 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:25:22.4363208Z         %131 = tt.broadcast %129 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:25:22.4363496Z         %132 = tt.broadcast %130 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:25:22.4363766Z         %133 = arith.addi %131, %132 : tensor<16x128xi32>
2026-02-21T09:25:22.4364024Z         %134 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4364396Z         %135 = tt.addptr %134, %133 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:25:22.4364734Z         %136 = tt.load %135 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4365067Z         %137 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:25:22.4365382Z         %138 = arith.extf %136 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:25:22.4365661Z         %139 = tt.broadcast %137 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:25:22.4365928Z         %140 = arith.subf %138, %139 : tensor<16x128xf32>
2026-02-21T09:25:22.4366326Z         %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:25:22.4366775Z         %142 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:25:22.4367085Z         %143 = tt.broadcast %142 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:25:22.4367344Z         %144 = arith.divf %141, %143 : tensor<16x128xf32>
2026-02-21T09:25:22.4367606Z         %145 = arith.truncf %144 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:25:22.4367896Z         %146 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4368205Z         %147 = tt.addptr %146, %133 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:25:22.4368509Z         tt.store %147, %145 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4368732Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:25:22.4368993Z       %27 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:25:22.4369267Z       %28 = tt.splat %c4096_i32_3 : i32 -> tensor<128xi32>
2026-02-21T09:25:22.4369496Z       %29 = arith.addi %28, %27 : tensor<128xi32>
2026-02-21T09:25:22.4369764Z       %30 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:25:22.4370037Z       %31 = arith.muli %30, %cst : tensor<16x1xi32>
2026-02-21T09:25:22.4370321Z       %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:25:22.4370626Z       %33 = tt.broadcast %31 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:25:22.4370913Z       %34 = tt.broadcast %32 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:25:22.4371165Z       %35 = arith.addi %33, %34 : tensor<16x128xi32>
2026-02-21T09:25:22.4371422Z       %36 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4371764Z       %37 = tt.addptr %36, %35 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:25:22.4372079Z       %38 = tt.load %37 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4372413Z       %39 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:25:22.4372708Z       %40 = arith.extf %38 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:25:22.4372987Z       %41 = tt.broadcast %39 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:25:22.4373303Z       %42 = arith.subf %40, %41 : tensor<16x128xf32>
2026-02-21T09:25:22.4373691Z       %43 = tt.extern_elementwise %42 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:25:22.4374134Z       %44 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:25:22.4374429Z       %45 = tt.broadcast %44 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:25:22.4374699Z       %46 = arith.divf %43, %45 : tensor<16x128xf32>
2026-02-21T09:25:22.4374962Z       %47 = arith.truncf %46 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:25:22.4375272Z       %48 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4375594Z       %49 = tt.addptr %48, %35 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:25:22.4375883Z       tt.store %49, %47 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:25:22.4376309Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T09:25:22.4376636Z     tt.return
2026-02-21T09:25:22.4376791Z   }
2026-02-21T09:25:22.4376929Z }
2026-02-21T09:25:22.4377014Z 
2026-02-21T09:25:22.4377071Z {-#
2026-02-21T09:25:22.4377226Z   external_resources: {
2026-02-21T09:25:22.4377408Z     mlir_reproducer: {
2026-02-21T09:25:22.4382545Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T09:25:22.4388079Z       disable_threading: false,
2026-02-21T09:25:22.4388271Z       verify_each: true
2026-02-21T09:25:22.4388425Z     }
2026-02-21T09:25:22.4388558Z   }
2026-02-21T09:25:22.4388678Z #-}
2026-02-21T09:25:22.4389186Z /tmp/torchinductor_root/w2/cw2csammbsly2xzje4frurab3f4fx7byjvsr2fjttwpoqvu6choy.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:25:22.4390568Z /tmp/torchinductor_root/w2/cw2csammbsly2xzje4frurab3f4fx7byjvsr2fjttwpoqvu6choy.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:25:22.4391717Z [36s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:25:22.4392900Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_sm_multiplier=32, num_stages=3, num_warps=32, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[2, 3], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T09:25:22.4394016Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:25:22.4394291Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:25:22.8462606Z module {
2026-02-21T09:25:22.8465085Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:25:22.8465571Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:25:22.8465880Z     %cst = arith.constant dense<0.000000e+00> : tensor<8x1024xf16>
2026-02-21T09:25:22.8470860Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T09:25:22.8477400Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:25:22.8479522Z     %c592_i32 = arith.constant 592 : i32
2026-02-21T09:25:22.8479864Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<8x1024xf32>
2026-02-21T09:25:22.8485449Z     %cst_1 = arith.constant dense<0xFC00> : tensor<8x1024xf16>
2026-02-21T09:25:22.8487758Z     %cst_2 = arith.constant dense<4224> : tensor<8x1xi32>
2026-02-21T09:25:22.8488114Z     %cst_3 = arith.constant dense<4224> : tensor<1024xi32>
2026-02-21T09:25:22.8488394Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32>
2026-02-21T09:25:22.8494806Z     %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32>
2026-02-21T09:25:22.8496996Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:25:22.8497285Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:25:22.8502943Z     %c4224_i32 = arith.constant 4224 : i32
2026-02-21T09:25:22.8505086Z     %c4224_i64 = arith.constant 4224 : i64
2026-02-21T09:25:22.8505379Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T09:25:22.8510861Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4224_i32], [%c4224_i64, %c1_i64] : <f16>, <tensor<8x1024xf16>>
2026-02-21T09:25:22.8512574Z     %1 = tt.get_program_id x : i32
2026-02-21T09:25:22.8512832Z     scf.for %arg2 = %1 to %c512_i32 step %c592_i32  : i32 {
2026-02-21T09:25:22.8513078Z       %2 = arith.muli %arg2, %c8_i32 : i32
2026-02-21T09:25:22.8513313Z       %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T09:25:22.8513569Z       %4 = tt.splat %2 : i32 -> tensor<8xi32>
2026-02-21T09:25:22.8513768Z       %5 = arith.addi %4, %3 : tensor<8xi32>
2026-02-21T09:25:22.8513954Z       %c3072_i32 = arith.constant 3072 : i32
2026-02-21T09:25:22.8514147Z       %c3072_i32_6 = arith.constant 3072 : i32
2026-02-21T09:25:22.8514381Z       %6 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:25:22.8514659Z       %7 = tt.splat %c0_i32 : i32 -> tensor<1024xi32>
2026-02-21T09:25:22.8514885Z       %8 = arith.addi %7, %6 : tensor<1024xi32>
2026-02-21T09:25:22.8515101Z       %9 = arith.cmpi slt, %8, %cst_3 : tensor<1024xi32>
2026-02-21T09:25:22.8515371Z       %10 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:25:22.8515630Z       %11 = arith.muli %10, %cst_2 : tensor<8x1xi32>
2026-02-21T09:25:22.8515893Z       %12 = tt.expand_dims %8 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:25:22.8516179Z       %13 = tt.broadcast %11 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:25:22.8516445Z       %14 = tt.broadcast %12 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:25:22.8516674Z       %15 = arith.addi %13, %14 : tensor<8x1024xi32>
2026-02-21T09:25:22.8516919Z       %16 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:25:22.8517194Z       %17 = tt.addptr %16, %15 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:25:22.8517508Z       %18 = tt.expand_dims %9 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:25:22.8518075Z       %19 = tt.broadcast %18 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:25:22.8518327Z       %20 = tt.load %17, %19, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:25:22.8518597Z       %21 = arith.select %19, %20, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:25:22.8518933Z       %22 = arith.extf %21 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:25:22.8519160Z       %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({
2026-02-21T09:25:22.8519359Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:25:22.8519547Z         %192 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:25:22.8519746Z         tt.reduce.return %192 : f32
2026-02-21T09:25:22.8519928Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:25:22.8520153Z       %24 = arith.truncf %23 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:25:22.8520386Z       %25 = arith.extf %24 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:25:22.8520693Z       %26 = arith.cmpf ogt, %cst_5, %25 : tensor<8xf32>
2026-02-21T09:25:22.8520928Z       %27 = arith.cmpf une, %cst_5, %cst_5 : tensor<8xf32>
2026-02-21T09:25:22.8521137Z       %28 = arith.ori %26, %27 : tensor<8xi1>
2026-02-21T09:25:22.8521372Z       %29 = arith.select %28, %cst_5, %25 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:25:22.8521668Z       %30 = arith.subf %cst_5, %29 : tensor<8xf32>
2026-02-21T09:25:22.8522031Z       %31 = tt.extern_elementwise %30 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:25:22.8522385Z       %32 = arith.mulf %cst_4, %31 : tensor<8xf32>
2026-02-21T09:25:22.8522644Z       %33 = tt.expand_dims %29 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:25:22.8522936Z       %34 = arith.extf %20 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:25:22.8523196Z       %35 = tt.broadcast %33 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:25:22.8523447Z       %36 = arith.subf %34, %35 : tensor<8x1024xf32>
2026-02-21T09:25:22.8523815Z       %37 = tt.extern_elementwise %36 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:25:22.8524239Z       %38 = arith.select %19, %37, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:25:22.8524503Z       %39 = "tt.reduce"(%38) <{axis = 1 : i32}> ({
2026-02-21T09:25:22.8524697Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:25:22.8524893Z         %192 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:25:22.8525083Z         tt.reduce.return %192 : f32
2026-02-21T09:25:22.8525272Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:25:22.8525464Z       %40 = arith.addf %32, %39 : tensor<8xf32>
2026-02-21T09:25:22.8525660Z       %c1_i32 = arith.constant 1 : i32
2026-02-21T09:25:22.8525843Z       %41 = arith.muli %c1024_i32, %c1_i32 : i32
2026-02-21T09:25:22.8526036Z       %42 = arith.addi %c0_i32, %41 : i32
2026-02-21T09:25:22.8526276Z       %43 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:25:22.8526524Z       %44 = tt.splat %42 : i32 -> tensor<1024xi32>
2026-02-21T09:25:22.8526730Z       %45 = arith.addi %44, %43 : tensor<1024xi32>
2026-02-21T09:25:22.8526939Z       %46 = arith.cmpi slt, %45, %cst_3 : tensor<1024xi32>
2026-02-21T09:25:22.8527199Z       %47 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:25:22.8527450Z       %48 = arith.muli %47, %cst_2 : tensor<8x1xi32>
2026-02-21T09:25:22.8527717Z       %49 = tt.expand_dims %45 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:25:22.8528015Z       %50 = tt.broadcast %48 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:25:22.8528274Z       %51 = tt.broadcast %49 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:25:22.8528512Z       %52 = arith.addi %50, %51 : tensor<8x1024xi32>
2026-02-21T09:25:22.8528747Z       %53 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:25:22.8529026Z       %54 = tt.addptr %53, %52 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:25:22.8529387Z       %55 = tt.expand_dims %46 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:25:22.8529670Z       %56 = tt.broadcast %55 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:25:22.8529922Z       %57 = tt.load %54, %56, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:25:22.8530180Z       %58 = arith.select %56, %57, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:25:22.8530455Z       %59 = arith.extf %58 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:25:22.8530678Z       %60 = "tt.reduce"(%59) <{axis = 1 : i32}> ({
2026-02-21T09:25:22.8530871Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:25:22.8531059Z         %192 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:25:22.8531248Z         tt.reduce.return %192 : f32
2026-02-21T09:25:22.8531437Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:25:22.8531744Z       %61 = arith.truncf %60 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:25:22.8531987Z       %62 = arith.extf %61 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:25:22.8532200Z       %63 = arith.cmpf ogt, %29, %62 : tensor<8xf32>
2026-02-21T09:25:22.8532408Z       %64 = arith.cmpf une, %29, %29 : tensor<8xf32>
2026-02-21T09:25:22.8532602Z       %65 = arith.ori %63, %64 : tensor<8xi1>
2026-02-21T09:25:22.8532827Z       %66 = arith.select %65, %29, %62 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:25:22.8533059Z       %67 = arith.subf %29, %66 : tensor<8xf32>
2026-02-21T09:25:22.8534781Z       %68 = tt.extern_elementwise %67 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:25:22.8535140Z       %69 = arith.mulf %40, %68 : tensor<8xf32>
2026-02-21T09:25:22.8535391Z       %70 = tt.expand_dims %66 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:25:22.8535689Z       %71 = arith.extf %57 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:25:22.8535963Z       %72 = tt.broadcast %70 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:25:22.8536199Z       %73 = arith.subf %71, %72 : tensor<8x1024xf32>
2026-02-21T09:25:22.8536581Z       %74 = tt.extern_elementwise %73 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:25:22.8537005Z       %75 = arith.select %56, %74, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:25:22.8537272Z       %76 = "tt.reduce"(%75) <{axis = 1 : i32}> ({
2026-02-21T09:25:22.8537467Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:25:22.8537656Z         %192 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:25:22.8537854Z         tt.reduce.return %192 : f32
2026-02-21T09:25:22.8538042Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:25:22.8538990Z       %77 = arith.addf %69, %76 : tensor<8xf32>
2026-02-21T09:25:22.8539186Z       %c2_i32 = arith.constant 2 : i32
2026-02-21T09:25:22.8539381Z       %78 = arith.muli %c1024_i32, %c2_i32 : i32
2026-02-21T09:25:22.8539577Z       %79 = arith.addi %c0_i32, %78 : i32
2026-02-21T09:25:22.8539829Z       %80 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:25:22.8540088Z       %81 = tt.splat %79 : i32 -> tensor<1024xi32>
2026-02-21T09:25:22.8540297Z       %82 = arith.addi %81, %80 : tensor<1024xi32>
2026-02-21T09:25:22.8540524Z       %83 = arith.cmpi slt, %82, %cst_3 : tensor<1024xi32>
2026-02-21T09:25:22.8540786Z       %84 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:25:22.8541053Z       %85 = arith.muli %84, %cst_2 : tensor<8x1xi32>
2026-02-21T09:25:22.8541315Z       %86 = tt.expand_dims %82 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:25:22.8541662Z       %87 = tt.broadcast %85 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:25:22.8544853Z       %88 = tt.broadcast %86 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:25:22.8545079Z       %89 = arith.addi %87, %88 : tensor<8x1024xi32>
2026-02-21T09:25:22.8545317Z       %90 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:25:22.8545649Z       %91 = tt.addptr %90, %89 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:25:22.8548289Z       %92 = tt.expand_dims %83 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:25:22.8548570Z       %93 = tt.broadcast %92 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:25:22.8548822Z       %94 = tt.load %91, %93, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:25:22.8549085Z       %95 = arith.select %93, %94, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:25:22.8549353Z       %96 = arith.extf %95 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:25:22.8549584Z       %97 = "tt.reduce"(%96) <{axis = 1 : i32}> ({
2026-02-21T09:25:22.8549770Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:25:22.8549957Z         %192 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:25:22.8550141Z         tt.reduce.return %192 : f32
2026-02-21T09:25:22.8550403Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:25:22.8550628Z       %98 = arith.truncf %97 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:25:22.8550855Z       %99 = arith.extf %98 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:25:22.8551079Z       %100 = arith.cmpf ogt, %66, %99 : tensor<8xf32>
2026-02-21T09:25:22.8551287Z       %101 = arith.cmpf une, %66, %66 : tensor<8xf32>
2026-02-21T09:25:22.8551495Z       %102 = arith.ori %100, %101 : tensor<8xi1>
2026-02-21T09:25:22.8551750Z       %103 = arith.select %102, %66, %99 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:25:22.8551995Z       %104 = arith.subf %66, %103 : tensor<8xf32>
2026-02-21T09:25:22.8552364Z       %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:25:22.8552719Z       %106 = arith.mulf %77, %105 : tensor<8xf32>
2026-02-21T09:25:22.8552976Z       %107 = tt.expand_dims %103 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:25:22.8553273Z       %108 = arith.extf %94 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:25:22.8553545Z       %109 = tt.broadcast %107 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:25:22.8553790Z       %110 = arith.subf %108, %109 : tensor<8x1024xf32>
2026-02-21T09:25:22.8554170Z       %111 = tt.extern_elementwise %110 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:25:22.8554591Z       %112 = arith.select %93, %111, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:25:22.8554847Z       %113 = "tt.reduce"(%112) <{axis = 1 : i32}> ({
2026-02-21T09:25:22.8555045Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:25:22.8555224Z         %192 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:25:22.8555415Z         tt.reduce.return %192 : f32
2026-02-21T09:25:22.8555595Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:25:22.8555801Z       %114 = arith.addf %106, %113 : tensor<8xf32>
2026-02-21T09:25:22.8556185Z       %115:2 = scf.for %arg3 = %c3072_i32 to %c4224_i32 step %c1024_i32 iter_args(%arg4 = %103, %arg5 = %114) -> (tensor<8xf32>, tensor<8xf32>)  : i32 {
2026-02-21T09:25:22.8556604Z         %192 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:25:22.8556874Z         %193 = tt.splat %arg3 : i32 -> tensor<1024xi32>
2026-02-21T09:25:22.8557086Z         %194 = arith.addi %193, %192 : tensor<1024xi32>
2026-02-21T09:25:22.8557319Z         %195 = arith.cmpi slt, %194, %cst_3 : tensor<1024xi32>
2026-02-21T09:25:22.8557588Z         %196 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:25:22.8557864Z         %197 = arith.muli %196, %cst_2 : tensor<8x1xi32>
2026-02-21T09:25:22.8558138Z         %198 = tt.expand_dims %194 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:25:22.8558439Z         %199 = tt.broadcast %197 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:25:22.8558721Z         %200 = tt.broadcast %198 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:25:22.8559020Z         %201 = arith.addi %199, %200 : tensor<8x1024xi32>
2026-02-21T09:25:22.8559270Z         %202 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:25:22.8559560Z         %203 = tt.addptr %202, %201 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:25:22.8559872Z         %204 = tt.expand_dims %195 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:25:22.8560177Z         %205 = tt.broadcast %204 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:25:22.8560431Z         %206 = tt.load %203, %205, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:25:22.8560703Z         %207 = arith.select %205, %206, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:25:22.8561009Z         %208 = arith.extf %207 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:25:22.8561273Z         %209 = "tt.reduce"(%208) <{axis = 1 : i32}> ({
2026-02-21T09:25:22.8561585Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:25:22.8561782Z           %227 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:25:22.8561990Z           tt.reduce.return %227 : f32
2026-02-21T09:25:22.8562180Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:25:22.8562433Z         %210 = arith.truncf %209 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:25:22.8562699Z         %211 = arith.extf %210 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:25:22.8562963Z         %212 = arith.cmpf ogt, %arg4, %211 : tensor<8xf32>
2026-02-21T09:25:22.8563220Z         %213 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32>
2026-02-21T09:25:22.8563453Z         %214 = arith.ori %212, %213 : tensor<8xi1>
2026-02-21T09:25:22.8563719Z         %215 = arith.select %214, %arg4, %211 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:25:22.8563985Z         %216 = arith.subf %arg4, %215 : tensor<8xf32>
2026-02-21T09:25:22.8564387Z         %217 = tt.extern_elementwise %216 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:25:22.8564780Z         %218 = arith.mulf %arg5, %217 : tensor<8xf32>
2026-02-21T09:25:22.8565064Z         %219 = tt.expand_dims %215 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:25:22.8565386Z         %220 = arith.extf %206 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:25:22.8565674Z         %221 = tt.broadcast %219 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:25:22.8565944Z         %222 = arith.subf %220, %221 : tensor<8x1024xf32>
2026-02-21T09:25:22.8566344Z         %223 = tt.extern_elementwise %222 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:25:22.8566807Z         %224 = arith.select %205, %223, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:25:22.8567093Z         %225 = "tt.reduce"(%224) <{axis = 1 : i32}> ({
2026-02-21T09:25:22.8567302Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:25:22.8567514Z           %227 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:25:22.8567722Z           tt.reduce.return %227 : f32
2026-02-21T09:25:22.8567933Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:25:22.8568150Z         %226 = arith.addf %218, %225 : tensor<8xf32>
2026-02-21T09:25:22.8568394Z         scf.yield %215, %226 : tensor<8xf32>, tensor<8xf32>
2026-02-21T09:25:22.8568636Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:25:22.8568852Z       %c3072_i32_7 = arith.constant 3072 : i32
2026-02-21T09:25:22.8569079Z       %c3072_i32_8 = arith.constant 3072 : i32
2026-02-21T09:25:22.8569342Z       %116 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:25:22.8569634Z       %117 = tt.splat %c0_i32 : i32 -> tensor<1024xi32>
2026-02-21T09:25:22.8569859Z       %118 = arith.addi %117, %116 : tensor<1024xi32>
2026-02-21T09:25:22.8570104Z       %119 = arith.cmpi slt, %118, %cst_3 : tensor<1024xi32>
2026-02-21T09:25:22.8570454Z       %120 = tt.descriptor_load %0[%2, %c0_i32] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:25:22.8570874Z       %121 = tt.expand_dims %115#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:25:22.8571196Z       %122 = arith.extf %120 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:25:22.8571480Z       %123 = tt.broadcast %121 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:25:22.8571802Z       %124 = arith.subf %122, %123 : tensor<8x1024xf32>
2026-02-21T09:25:22.8572202Z       %125 = tt.extern_elementwise %124 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:25:22.8572652Z       %126 = tt.expand_dims %115#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:25:22.8572961Z       %127 = tt.broadcast %126 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:25:22.8573213Z       %128 = arith.divf %125, %127 : tensor<8x1024xf32>
2026-02-21T09:25:22.8573518Z       %129 = arith.truncf %128 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:25:22.8573826Z       %130 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:25:22.8574117Z       %131 = arith.muli %130, %cst_2 : tensor<8x1xi32>
2026-02-21T09:25:22.8574382Z       %132 = tt.expand_dims %118 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:25:22.8574672Z       %133 = tt.broadcast %131 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:25:22.8574950Z       %134 = tt.broadcast %132 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:25:22.8575206Z       %135 = arith.addi %133, %134 : tensor<8x1024xi32>
2026-02-21T09:25:22.8575467Z       %136 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:25:22.8575776Z       %137 = tt.addptr %136, %135 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:25:22.8576111Z       %138 = tt.expand_dims %119 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:25:22.8576445Z       %139 = tt.broadcast %138 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:25:22.8576719Z       tt.store %137, %129, %139 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:25:22.8576958Z       %c1_i32_9 = arith.constant 1 : i32
2026-02-21T09:25:22.8577163Z       %140 = arith.muli %c1024_i32, %c1_i32_9 : i32
2026-02-21T09:25:22.8577360Z       %141 = arith.addi %c0_i32, %140 : i32
2026-02-21T09:25:22.8577595Z       %142 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:25:22.8577855Z       %143 = tt.splat %141 : i32 -> tensor<1024xi32>
2026-02-21T09:25:22.8578067Z       %144 = arith.addi %143, %142 : tensor<1024xi32>
2026-02-21T09:25:22.8578284Z       %145 = arith.cmpi slt, %144, %cst_3 : tensor<1024xi32>
2026-02-21T09:25:22.8578594Z       %146 = tt.descriptor_load %0[%2, %141] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:25:22.8578938Z       %147 = tt.expand_dims %115#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:25:22.8579236Z       %148 = arith.extf %146 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:25:22.8579502Z       %149 = tt.broadcast %147 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:25:22.8579745Z       %150 = arith.subf %148, %149 : tensor<8x1024xf32>
2026-02-21T09:25:22.8580127Z       %151 = tt.extern_elementwise %150 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:25:22.8580551Z       %152 = tt.expand_dims %115#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:25:22.8580846Z       %153 = tt.broadcast %152 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:25:22.8581079Z       %154 = arith.divf %151, %153 : tensor<8x1024xf32>
2026-02-21T09:25:22.8581324Z       %155 = arith.truncf %154 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:25:22.8581648Z       %156 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:25:22.8581902Z       %157 = arith.muli %156, %cst_2 : tensor<8x1xi32>
2026-02-21T09:25:22.8582174Z       %158 = tt.expand_dims %144 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:25:22.8582510Z       %159 = tt.broadcast %157 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:25:22.8582777Z       %160 = tt.broadcast %158 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:25:22.8583015Z       %161 = arith.addi %159, %160 : tensor<8x1024xi32>
2026-02-21T09:25:22.8583258Z       %162 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:25:22.8583547Z       %163 = tt.addptr %162, %161 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:25:22.8583848Z       %164 = tt.expand_dims %145 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:25:22.8584144Z       %165 = tt.broadcast %164 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:25:22.8584391Z       tt.store %163, %155, %165 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:25:22.8584612Z       %c2_i32_10 = arith.constant 2 : i32
2026-02-21T09:25:22.8584856Z       %166 = arith.muli %c1024_i32, %c2_i32_10 : i32
2026-02-21T09:25:22.8585051Z       %167 = arith.addi %c0_i32, %166 : i32
2026-02-21T09:25:22.8585291Z       %168 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:25:22.8585541Z       %169 = tt.splat %167 : i32 -> tensor<1024xi32>
2026-02-21T09:25:22.8585754Z       %170 = arith.addi %169, %168 : tensor<1024xi32>
2026-02-21T09:25:22.8585971Z       %171 = arith.cmpi slt, %170, %cst_3 : tensor<1024xi32>
2026-02-21T09:25:22.8586278Z       %172 = tt.descriptor_load %0[%2, %167] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:25:22.8586630Z       %173 = tt.expand_dims %115#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:25:22.8586920Z       %174 = arith.extf %172 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:25:22.8587189Z       %175 = tt.broadcast %173 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:25:22.8587421Z       %176 = arith.subf %174, %175 : tensor<8x1024xf32>
2026-02-21T09:25:22.8587804Z       %177 = tt.extern_elementwise %176 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:25:22.8588226Z       %178 = tt.expand_dims %115#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:25:22.8588518Z       %179 = tt.broadcast %178 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:25:22.8588759Z       %180 = arith.divf %177, %179 : tensor<8x1024xf32>
2026-02-21T09:25:22.8588997Z       %181 = arith.truncf %180 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:25:22.8589286Z       %182 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:25:22.8589537Z       %183 = arith.muli %182, %cst_2 : tensor<8x1xi32>
2026-02-21T09:25:22.8589802Z       %184 = tt.expand_dims %170 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:25:22.8590096Z       %185 = tt.broadcast %183 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:25:22.8590358Z       %186 = tt.broadcast %184 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:25:22.8590608Z       %187 = arith.addi %185, %186 : tensor<8x1024xi32>
2026-02-21T09:25:22.8590845Z       %188 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:25:22.8591133Z       %189 = tt.addptr %188, %187 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:25:22.8591449Z       %190 = tt.expand_dims %171 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:25:22.8591797Z       %191 = tt.broadcast %190 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:25:22.8592063Z       tt.store %189, %181, %191 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:25:22.8592328Z       scf.for %arg3 = %c3072_i32_7 to %c4224_i32 step %c1024_i32  : i32 {
2026-02-21T09:25:22.8592629Z         %192 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:25:22.8592900Z         %193 = tt.splat %arg3 : i32 -> tensor<1024xi32>
2026-02-21T09:25:22.8593129Z         %194 = arith.addi %193, %192 : tensor<1024xi32>
2026-02-21T09:25:22.8593406Z         %195 = arith.cmpi slt, %194, %cst_3 : tensor<1024xi32>
2026-02-21T09:25:22.8593735Z         %196 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:25:22.8594106Z         %197 = tt.expand_dims %115#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:25:22.8594409Z         %198 = arith.extf %196 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:25:22.8594690Z         %199 = tt.broadcast %197 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:25:22.8594938Z         %200 = arith.subf %198, %199 : tensor<8x1024xf32>
2026-02-21T09:25:22.8595333Z         %201 = tt.extern_elementwise %200 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:25:22.8595779Z         %202 = tt.expand_dims %115#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:25:22.8596146Z         %203 = tt.broadcast %202 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:25:22.8596409Z         %204 = arith.divf %201, %203 : tensor<8x1024xf32>
2026-02-21T09:25:22.8596661Z         %205 = arith.truncf %204 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:25:22.8596972Z         %206 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:25:22.8597251Z         %207 = arith.muli %206, %cst_2 : tensor<8x1xi32>
2026-02-21T09:25:22.8597533Z         %208 = tt.expand_dims %194 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:25:22.8597851Z         %209 = tt.broadcast %207 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:25:22.8598134Z         %210 = tt.broadcast %208 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:25:22.8598406Z         %211 = arith.addi %209, %210 : tensor<8x1024xi32>
2026-02-21T09:25:22.8598664Z         %212 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:25:22.8598984Z         %213 = tt.addptr %212, %211 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:25:22.8599321Z         %214 = tt.expand_dims %195 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:25:22.8599628Z         %215 = tt.broadcast %214 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:25:22.8599903Z         tt.store %213, %205, %215 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:25:22.8600139Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:25:22.8600531Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T09:25:22.8600880Z     tt.return
2026-02-21T09:25:22.8601017Z   }
2026-02-21T09:25:22.8601147Z }
2026-02-21T09:25:22.8601218Z 
2026-02-21T09:25:22.8601270Z {-#
2026-02-21T09:25:22.8601408Z   external_resources: {
2026-02-21T09:25:22.8601616Z     mlir_reproducer: {
2026-02-21T09:25:22.8605991Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T09:25:22.8610512Z       disable_threading: false,
2026-02-21T09:25:22.8610683Z       verify_each: true
2026-02-21T09:25:22.8610837Z     }
2026-02-21T09:25:22.8610957Z   }
2026-02-21T09:25:22.8611080Z #-}
2026-02-21T09:25:22.8611603Z /tmp/torchinductor_root/cu/ccudrra547xcwkchef5tpeaz2v4byqm5ndbgc52wz75djktnqouq.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:25:22.8612834Z /tmp/torchinductor_root/cu/ccudrra547xcwkchef5tpeaz2v4byqm5ndbgc52wz75djktnqouq.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:25:22.8613821Z [36s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:25:22.8614904Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=4, num_warps=32, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[4, 0], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T09:25:22.8615858Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:25:22.8616120Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:25:27.2348468Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.7 configs/s
2026-02-21T09:25:27.2358336Z [40s] Adaptive compile timeout: 30s (90% percentile=5.5s, bounds=[30.0s, 30s])
2026-02-21T09:25:28.5194394Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 779.9 configs/s
2026-02-21T09:25:28.6053511Z [42s] Initial random population of 100, 5 starting points: 
2026-02-21T09:25:28.6054850Z error=14
2026-02-21T09:25:28.6055014Z timeout=1
2026-02-21T09:25:28.6055140Z ok=85
2026-02-21T09:25:28.6055274Z min=0.0398
2026-02-21T09:25:28.6055400Z mid=0.3544
2026-02-21T09:25:28.6055539Z max=98.1514
2026-02-21T09:25:28.6055685Z best={'block_sizes': [4, 512],
2026-02-21T09:25:28.6055917Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:25:28.6056159Z  'load_eviction_policies': ['last', 'first'],
2026-02-21T09:25:28.6056376Z  'maxnreg': 64,
2026-02-21T09:25:28.6056834Z  'num_sm_multiplier': 8,
2026-02-21T09:25:28.6057010Z  'num_stages': 8,
2026-02-21T09:25:28.6057160Z  'num_warps': 8,
2026-02-21T09:25:28.6057312Z  'pid_type': 'persistent_blocked',
2026-02-21T09:25:28.6057502Z  'range_flattens': [False, None],
2026-02-21T09:25:28.6057680Z  'range_multi_buffers': [False, None],
2026-02-21T09:25:28.6057865Z  'range_num_stages': [4, 3],
2026-02-21T09:25:28.6058027Z  'range_unroll_factors': [2, 1],
2026-02-21T09:25:28.6058211Z  'range_warp_specializes': [False, True]}
2026-02-21T09:25:28.6068681Z [42s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:25:30.2294779Z [43s] Generation 1 starting: 92 neighbors, 5 active search path(s)
2026-02-21T09:25:41.0489524Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/95 3.9 configs/s
2026-02-21T09:25:46.7327387Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 95/95 16.8 configs/s
2026-02-21T09:25:49.7685374Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 333.6 configs/s
2026-02-21T09:25:49.9702082Z [63s] Generation 1 complete: 
2026-02-21T09:25:49.9706072Z ok=98
2026-02-21T09:25:49.9710432Z min=0.0307
2026-02-21T09:25:49.9714972Z mid=0.0460
2026-02-21T09:25:49.9716484Z max=0.2191
2026-02-21T09:25:49.9716661Z best={'block_sizes': [2, 8192],
2026-02-21T09:25:49.9716920Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:25:49.9717180Z  'load_eviction_policies': ['first', 'last'],
2026-02-21T09:25:49.9717380Z  'num_stages': 1,
2026-02-21T09:25:49.9717521Z  'num_warps': 1,
2026-02-21T09:25:49.9717668Z  'pid_type': 'flat',
2026-02-21T09:25:49.9717825Z  'range_flattens': [None, True],
2026-02-21T09:25:49.9718013Z  'range_multi_buffers': [None, True],
2026-02-21T09:25:49.9718193Z  'range_num_stages': [0, 4],
2026-02-21T09:25:49.9718365Z  'range_unroll_factors': [0, 0],
2026-02-21T09:25:49.9718572Z  'range_warp_specializes': [None, True]}
2026-02-21T09:25:49.9718870Z [63s] Fitting surrogate: 198 points, 198 targets
2026-02-21T09:25:51.1321352Z [64s] Generation 2 starting: 82 neighbors, 5 active search path(s)
2026-02-21T09:25:59.7321130Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 12.5 configs/s
2026-02-21T09:26:05.3403065Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 15.4 configs/s
2026-02-21T09:26:06.7677683Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 706.8 configs/s
2026-02-21T09:26:06.8755644Z [80s] Generation 2 complete: 
2026-02-21T09:26:06.8760494Z ok=88
2026-02-21T09:26:06.8765039Z min=0.0205
2026-02-21T09:26:06.8766522Z mid=0.0348
2026-02-21T09:26:06.8766691Z max=0.1106
2026-02-21T09:26:06.8766834Z best={'block_sizes': [1, 8192],
2026-02-21T09:26:06.8767109Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:26:06.8767440Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:26:06.8767652Z  'num_stages': 7,
2026-02-21T09:26:06.8767801Z  'num_warps': 4,
2026-02-21T09:26:06.8767940Z  'pid_type': 'flat',
2026-02-21T09:26:06.8768099Z  'range_flattens': [None, True],
2026-02-21T09:26:06.8768271Z  'range_multi_buffers': [None, None],
2026-02-21T09:26:06.8768456Z  'range_num_stages': [0, 3],
2026-02-21T09:26:06.8768616Z  'range_unroll_factors': [0, 3],
2026-02-21T09:26:06.8768797Z  'range_warp_specializes': [None, False]}
2026-02-21T09:26:06.8769014Z [80s] Fitting surrogate: 286 points, 286 targets
2026-02-21T09:26:07.8681831Z [81s] Generation 3 starting: 67 neighbors, 5 active search path(s)
2026-02-21T09:26:22.6211420Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 1.6 configs/s
2026-02-21T09:26:26.9243014Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 71/71 16.7 configs/s
2026-02-21T09:26:28.9599201Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 497.8 configs/s
2026-02-21T09:26:29.1198266Z [102s] Generation 3 complete: 
2026-02-21T09:26:29.1199635Z ok=72
2026-02-21T09:26:29.1199805Z min=0.0204
2026-02-21T09:26:29.1199952Z mid=0.0327
2026-02-21T09:26:29.1200090Z max=0.2712
2026-02-21T09:26:29.1200239Z best={'block_sizes': [1, 8192],
2026-02-21T09:26:29.1200529Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:26:29.1200828Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:26:29.1201034Z  'num_stages': 7,
2026-02-21T09:26:29.1201179Z  'num_warps': 4,
2026-02-21T09:26:29.1201331Z  'pid_type': 'flat',
2026-02-21T09:26:29.1201492Z  'range_flattens': [None, True],
2026-02-21T09:26:29.1201913Z  'range_multi_buffers': [None, None],
2026-02-21T09:26:29.1202119Z  'range_num_stages': [0, 3],
2026-02-21T09:26:29.1202292Z  'range_unroll_factors': [0, 3],
2026-02-21T09:26:29.1202828Z  'range_warp_specializes': [None, False]}
2026-02-21T09:26:29.1240876Z [102s] Fitting surrogate: 358 points, 358 targets
2026-02-21T09:26:30.0554517Z [103s] Generation 4 starting: 56 neighbors, 5 active search path(s)
2026-02-21T09:26:39.0748264Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58/58 1.7 configs/s
2026-02-21T09:26:42.5765503Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 58/58 16.8 configs/s
2026-02-21T09:26:44.0580239Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 684.0 configs/s
2026-02-21T09:26:44.1728270Z [117s] Generation 4 complete: 
2026-02-21T09:26:44.1730233Z ok=61
2026-02-21T09:26:44.1730403Z min=0.0204
2026-02-21T09:26:44.1730534Z mid=0.0328
2026-02-21T09:26:44.1730663Z max=0.1576
2026-02-21T09:26:44.1730809Z best={'block_sizes': [1, 8192],
2026-02-21T09:26:44.1731071Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:26:44.1731396Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:26:44.1731675Z  'num_stages': 7,
2026-02-21T09:26:44.1731823Z  'num_warps': 4,
2026-02-21T09:26:44.1731967Z  'pid_type': 'flat',
2026-02-21T09:26:44.1732133Z  'range_flattens': [None, True],
2026-02-21T09:26:44.1732309Z  'range_multi_buffers': [None, None],
2026-02-21T09:26:44.1732497Z  'range_num_stages': [0, 3],
2026-02-21T09:26:44.1732669Z  'range_unroll_factors': [0, 3],
2026-02-21T09:26:44.1732850Z  'range_warp_specializes': [None, False]}
2026-02-21T09:26:44.1749664Z [117s] Fitting surrogate: 419 points, 419 targets
2026-02-21T09:26:44.9382571Z [118s] Generation 5 starting: 18 neighbors, 1 active search path(s)
2026-02-21T09:26:49.7827219Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 3.0 configs/s
2026-02-21T09:26:50.9355287Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 19/19 17.1 configs/s
2026-02-21T09:26:50.9359812Z [124s] Generation 5 complete: 
2026-02-21T09:26:50.9361911Z ok=20
2026-02-21T09:26:50.9362149Z min=0.0204
2026-02-21T09:26:50.9364344Z mid=0.0409
2026-02-21T09:26:50.9369498Z max=0.0552
2026-02-21T09:26:50.9371395Z best={'block_sizes': [1, 8192],
2026-02-21T09:26:50.9371788Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:26:50.9372096Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:26:50.9372309Z  'num_stages': 7,
2026-02-21T09:26:50.9372467Z  'num_warps': 4,
2026-02-21T09:26:50.9372626Z  'pid_type': 'flat',
2026-02-21T09:26:50.9372793Z  'range_flattens': [None, True],
2026-02-21T09:26:50.9372987Z  'range_multi_buffers': [None, None],
2026-02-21T09:26:50.9373186Z  'range_num_stages': [0, 3],
2026-02-21T09:26:50.9373358Z  'range_unroll_factors': [0, 3],
2026-02-21T09:26:50.9373551Z  'range_warp_specializes': [None, False]}
2026-02-21T09:26:50.9373778Z [124s] Fitting surrogate: 439 points, 439 targets
2026-02-21T09:26:51.3155925Z [124s] Generation 6 starting: 19 neighbors, 1 active search path(s)
2026-02-21T09:26:55.0395782Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 2.7 configs/s
2026-02-21T09:26:56.2345304Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 20/20 17.4 configs/s
2026-02-21T09:26:56.2351819Z [129s] Generation 6 complete: 
2026-02-21T09:26:56.2354886Z ok=21
2026-02-21T09:26:56.2355137Z min=0.0204
2026-02-21T09:26:56.2355281Z mid=0.0451
2026-02-21T09:26:56.2355432Z max=0.1044
2026-02-21T09:26:56.2359508Z best={'block_sizes': [1, 8192],
2026-02-21T09:26:56.2364892Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:26:56.2366276Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:26:56.2366501Z  'num_stages': 7,
2026-02-21T09:26:56.2366655Z  'num_warps': 4,
2026-02-21T09:26:56.2366799Z  'pid_type': 'flat',
2026-02-21T09:26:56.2366964Z  'range_flattens': [None, True],
2026-02-21T09:26:56.2367148Z  'range_multi_buffers': [None, None],
2026-02-21T09:26:56.2367343Z  'range_num_stages': [0, 3],
2026-02-21T09:26:56.2367894Z  'range_unroll_factors': [0, 3],
2026-02-21T09:26:56.2368110Z  'range_warp_specializes': [None, False]}
2026-02-21T09:26:56.2368341Z [129s] Fitting surrogate: 460 points, 460 targets
2026-02-21T09:26:56.4002873Z [130s] Autotuning complete in 130.1s after searching 448 configs.
2026-02-21T09:26:56.4005212Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:26:56.4006188Z     @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=7, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]), static_shapes=True)
2026-02-21T09:26:56.4007025Z 
2026-02-21T09:26:56.4007275Z [130s] Code of selected kernel: /tmp/torchinductor_root/bl/cbllivasagu25kv3xae4hgfckwjsmqt3hem5fxlwoeb2l3uruoyh.py
2026-02-21T09:26:56.4262493Z from __future__ import annotations
2026-02-21T09:26:56.4264269Z 
2026-02-21T09:26:56.4264432Z import torch
2026-02-21T09:26:56.4264597Z import triton
2026-02-21T09:26:56.4264760Z import triton.language as tl
2026-02-21T09:26:56.4264971Z from torch._inductor.runtime import triton_helpers
2026-02-21T09:26:56.4265245Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T09:26:56.4265546Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:26:56.4265722Z 
2026-02-21T09:26:56.4265793Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T09:26:56.4265977Z _BLOCK_SIZE_1 = tl.constexpr(8192)
2026-02-21T09:26:56.4266092Z 
2026-02-21T09:26:56.4266148Z @triton.jit
2026-02-21T09:26:56.4266296Z def _helion_softmax_two_pass(x, out):
2026-02-21T09:26:56.4266543Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:26:56.4266796Z     pid_0 = tl.program_id(0)
2026-02-21T09:26:56.4266958Z     offset_0 = pid_0
2026-02-21T09:26:56.4267129Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T09:26:56.4267427Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:26:56.4267715Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T09:26:56.4267983Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:26:56.4268229Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T09:26:56.4268484Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:26:56.4268751Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T09:26:56.4269005Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T09:26:56.4269237Z     # src[softmax.py:82-89]: ...
2026-02-21T09:26:56.4269595Z     for offset_2 in tl.range(0, 4224, _BLOCK_SIZE_1, loop_unroll_factor=3, warp_specialize=False, num_stages=1, flatten=True):
2026-02-21T09:26:56.4269993Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:26:56.4270239Z         mask_1 = indices_2 < 4224
2026-02-21T09:26:56.4270745Z         mi_copy = mi
2026-02-21T09:26:56.4270894Z         di_copy = di
2026-02-21T09:26:56.4271037Z         mi_copy_0 = mi_copy
2026-02-21T09:26:56.4271197Z         di_copy_0 = di_copy
2026-02-21T09:26:56.4271378Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T09:26:56.4272018Z         values = tl.load(x + (indices_0[:, None] * 4224 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first')
2026-02-21T09:26:56.4272418Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T09:26:56.4272817Z         _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16))
2026-02-21T09:26:56.4273213Z         local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16)
2026-02-21T09:26:56.4273470Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T09:26:56.4273825Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T09:26:56.4274048Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T09:26:56.4274308Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:26:56.4274550Z         v_2 = mi_copy_0 - v_1
2026-02-21T09:26:56.4274719Z         v_3 = libdevice.exp(v_2)
2026-02-21T09:26:56.4274891Z         v_4 = di_copy_0 * v_3
2026-02-21T09:26:56.4275077Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T09:26:56.4275283Z         subscript = v_1[:, None]
2026-02-21T09:26:56.4275453Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T09:26:56.4275636Z         v_6 = v_5 - subscript
2026-02-21T09:26:56.4275849Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:26:56.4276106Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T09:26:56.4276325Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T09:26:56.4276507Z         v_7 = libdevice.exp(v_6)
2026-02-21T09:26:56.4276832Z         _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32))
2026-02-21T09:26:56.4277183Z         sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32)
2026-02-21T09:26:56.4277384Z         di = v_4 + sum_1
2026-02-21T09:26:56.4277554Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T09:26:56.4277727Z         mi = v_1
2026-02-21T09:26:56.4277959Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:26:56.4278223Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T09:26:56.4278518Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:26:56.4278942Z     for offset_2 in tl.range(0, 4224, _BLOCK_SIZE_1, loop_unroll_factor=3, warp_specialize=False, num_stages=1, flatten=True):
2026-02-21T09:26:56.4279337Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:26:56.4279572Z         mask_2 = indices_2 < 4224
2026-02-21T09:26:56.4279740Z         mi_copy_1 = mi
2026-02-21T09:26:56.4279896Z         di_copy_1 = di
2026-02-21T09:26:56.4280041Z         mi_copy_1_0 = mi_copy_1
2026-02-21T09:26:56.4280209Z         di_copy_1_0 = di_copy_1
2026-02-21T09:26:56.4280389Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T09:26:56.4280758Z         values_1 = tl.load(x + (indices_0[:, None] * 4224 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first')
2026-02-21T09:26:56.4281187Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:26:56.4281455Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T09:26:56.4281726Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T09:26:56.4281906Z         v_10 = v_9 - subscript_1
2026-02-21T09:26:56.4282083Z         v_11 = libdevice.exp(v_10)
2026-02-21T09:26:56.4282259Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T09:26:56.4282445Z         v_12 = v_11 / subscript_2
2026-02-21T09:26:56.4282616Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T09:26:56.4282974Z         tl.store(out + (indices_0[:, None] * 4224 + indices_2[None, :] * 1), v_13, mask_2[None, :])
2026-02-21T09:26:56.4283182Z 
2026-02-21T09:26:56.4283315Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T09:26:56.4283541Z     """
2026-02-21T09:26:56.4283746Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T09:26:56.4284047Z     This version uses fewer passes but is less numerically stable.
2026-02-21T09:26:56.4284268Z     Args:
2026-02-21T09:26:56.4284427Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T09:26:56.4284625Z     Returns:
2026-02-21T09:26:56.4284807Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T09:26:56.4285010Z     """
2026-02-21T09:26:56.4285154Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T09:26:56.4285327Z     m, n = x.size()
2026-02-21T09:26:56.4285560Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T09:26:56.4285769Z     out = torch.empty_like(x)
2026-02-21T09:26:56.4286001Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:26:56.4286315Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:26:56.4286628Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:26:56.4286873Z     # src[softmax.py:79-92]: ...
2026-02-21T09:26:56.4287124Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=7)
2026-02-21T09:26:56.4287402Z     # src[softmax.py:93]: return out
2026-02-21T09:26:56.4287573Z     return out
2026-02-21T09:26:56.9256576Z WARNING:tritonbench.utils.triton_op:Completed input ID 31:
2026-02-21T09:26:56.9258544Z (M, N)
2026-02-21T09:26:56.9258707Z ------------
2026-02-21T09:26:56.9258859Z (4096, 4224)
2026-02-21T09:26:56.9265458Z 
2026-02-21T09:26:56.9266103Z  35%|███▌      | 7/20 [18:02<35:07, 162.15s/it]
2026-02-21T09:26:56.9266103Z WARNING:tritonbench.utils.triton_op:Running input ID 36:
2026-02-21T09:26:56.9267822Z (M, N)
2026-02-21T09:26:56.9268028Z ------------
2026-02-21T09:26:56.9273964Z (4096, 4864)
2026-02-21T09:26:56.9278636Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T09:26:58.1964418Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T09:26:59.6430358Z INFO:tritonbench.utils.triton_op:Took 2.48ms to get benchmark function for torch_compile_softmax
2026-02-21T09:27:00.9660899Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:27:00.9664519Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:27:00.9668836Z               'dtype': 'torch.float16',
2026-02-21T09:27:00.9673269Z               'shape': (4096, 4864),
2026-02-21T09:27:00.9677110Z               'stride': (4864, 1)},),
2026-02-21T09:27:00.9681158Z   'kwargs': {}}
2026-02-21T09:27:00.9681721Z INFO:tritonbench.utils.triton_op:Took 2.32ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:27:01.1463148Z [0s] Autotune random seed: 2138408546
2026-02-21T09:27:01.1721833Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:27:35.8884921Z [34s] Timeout after 30s compiling Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None])
2026-02-21T09:27:35.8902999Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s
2026-02-21T09:27:38.2047113Z module {
2026-02-21T09:27:38.2051130Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:27:38.2051924Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:27:38.2057238Z     %cst = arith.constant dense<0.000000e+00> : tensor<8x1024xf16>
2026-02-21T09:27:38.2061128Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T09:27:38.2065003Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:27:38.2067008Z     %c592_i32 = arith.constant 592 : i32
2026-02-21T09:27:38.2067262Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<8x1024xf32>
2026-02-21T09:27:38.2067541Z     %cst_1 = arith.constant dense<0xFC00> : tensor<8x1024xf16>
2026-02-21T09:27:38.2067797Z     %cst_2 = arith.constant dense<4864> : tensor<8x1xi32>
2026-02-21T09:27:38.2068027Z     %cst_3 = arith.constant dense<4864> : tensor<1024xi32>
2026-02-21T09:27:38.2068276Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32>
2026-02-21T09:27:38.2068528Z     %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32>
2026-02-21T09:27:38.2068745Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:27:38.2068934Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:27:38.2069403Z     %c4864_i32 = arith.constant 4864 : i32
2026-02-21T09:27:38.2069673Z     %c4864_i64 = arith.constant 4864 : i64
2026-02-21T09:27:38.2069852Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T09:27:38.2070175Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4864_i32], [%c4864_i64, %c1_i64] : <f16>, <tensor<8x1024xf16>>
2026-02-21T09:27:38.2070509Z     %1 = tt.get_program_id x : i32
2026-02-21T09:27:38.2070721Z     scf.for %arg2 = %1 to %c512_i32 step %c592_i32  : i32 {
2026-02-21T09:27:38.2070948Z       %2 = arith.muli %arg2, %c8_i32 : i32
2026-02-21T09:27:38.2071173Z       %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T09:27:38.2071427Z       %4 = tt.splat %2 : i32 -> tensor<8xi32>
2026-02-21T09:27:38.2071687Z       %5 = arith.addi %4, %3 : tensor<8xi32>
2026-02-21T09:27:38.2071879Z       %c3072_i32 = arith.constant 3072 : i32
2026-02-21T09:27:38.2072079Z       %c3072_i32_6 = arith.constant 3072 : i32
2026-02-21T09:27:38.2072324Z       %6 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:27:38.2072599Z       %7 = tt.splat %c0_i32 : i32 -> tensor<1024xi32>
2026-02-21T09:27:38.2072807Z       %8 = arith.addi %7, %6 : tensor<1024xi32>
2026-02-21T09:27:38.2073025Z       %9 = arith.cmpi slt, %8, %cst_3 : tensor<1024xi32>
2026-02-21T09:27:38.2073282Z       %10 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:27:38.2073547Z       %11 = arith.muli %10, %cst_2 : tensor<8x1xi32>
2026-02-21T09:27:38.2073809Z       %12 = tt.expand_dims %8 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:27:38.2074095Z       %13 = tt.broadcast %11 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:27:38.2074368Z       %14 = tt.broadcast %12 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:27:38.2074605Z       %15 = arith.addi %13, %14 : tensor<8x1024xi32>
2026-02-21T09:27:38.2074851Z       %16 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:27:38.2075127Z       %17 = tt.addptr %16, %15 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:27:38.2075425Z       %18 = tt.expand_dims %9 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:27:38.2075717Z       %19 = tt.broadcast %18 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:27:38.2075964Z       %20 = tt.load %17, %19, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:27:38.2076227Z       %21 = arith.select %19, %20, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:27:38.2076499Z       %22 = arith.extf %21 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:27:38.2076731Z       %23 = "tt.reduce"(%22) <{axis = 1 : i32}> ({
2026-02-21T09:27:38.2076920Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:27:38.2077110Z         %192 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:27:38.2077304Z         tt.reduce.return %192 : f32
2026-02-21T09:27:38.2077485Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:27:38.2077706Z       %24 = arith.truncf %23 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:27:38.2078097Z       %25 = arith.extf %24 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:27:38.2078326Z       %26 = arith.cmpf ogt, %cst_5, %25 : tensor<8xf32>
2026-02-21T09:27:38.2078547Z       %27 = arith.cmpf une, %cst_5, %cst_5 : tensor<8xf32>
2026-02-21T09:27:38.2078762Z       %28 = arith.ori %26, %27 : tensor<8xi1>
2026-02-21T09:27:38.2078990Z       %29 = arith.select %28, %cst_5, %25 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:27:38.2079222Z       %30 = arith.subf %cst_5, %29 : tensor<8xf32>
2026-02-21T09:27:38.2079586Z       %31 = tt.extern_elementwise %30 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:27:38.2079936Z       %32 = arith.mulf %cst_4, %31 : tensor<8xf32>
2026-02-21T09:27:38.2080185Z       %33 = tt.expand_dims %29 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:27:38.2080463Z       %34 = arith.extf %20 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:27:38.2080784Z       %35 = tt.broadcast %33 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:27:38.2081026Z       %36 = arith.subf %34, %35 : tensor<8x1024xf32>
2026-02-21T09:27:38.2081389Z       %37 = tt.extern_elementwise %36 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:27:38.2081840Z       %38 = arith.select %19, %37, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:27:38.2082079Z       %39 = "tt.reduce"(%38) <{axis = 1 : i32}> ({
2026-02-21T09:27:38.2082274Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:27:38.2082459Z         %192 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:27:38.2082644Z         tt.reduce.return %192 : f32
2026-02-21T09:27:38.2082836Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:27:38.2083034Z       %40 = arith.addf %32, %39 : tensor<8xf32>
2026-02-21T09:27:38.2083234Z       %c1_i32 = arith.constant 1 : i32
2026-02-21T09:27:38.2083424Z       %41 = arith.muli %c1024_i32, %c1_i32 : i32
2026-02-21T09:27:38.2083629Z       %42 = arith.addi %c0_i32, %41 : i32
2026-02-21T09:27:38.2083861Z       %43 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:27:38.2084121Z       %44 = tt.splat %42 : i32 -> tensor<1024xi32>
2026-02-21T09:27:38.2084323Z       %45 = arith.addi %44, %43 : tensor<1024xi32>
2026-02-21T09:27:38.2084531Z       %46 = arith.cmpi slt, %45, %cst_3 : tensor<1024xi32>
2026-02-21T09:27:38.2084790Z       %47 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:27:38.2085042Z       %48 = arith.muli %47, %cst_2 : tensor<8x1xi32>
2026-02-21T09:27:38.2085303Z       %49 = tt.expand_dims %45 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:27:38.2085590Z       %50 = tt.broadcast %48 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:27:38.2085858Z       %51 = tt.broadcast %49 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:27:38.2086098Z       %52 = arith.addi %50, %51 : tensor<8x1024xi32>
2026-02-21T09:27:38.2086336Z       %53 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:27:38.2086615Z       %54 = tt.addptr %53, %52 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:27:38.2086907Z       %55 = tt.expand_dims %46 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:27:38.2087198Z       %56 = tt.broadcast %55 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:27:38.2087452Z       %57 = tt.load %54, %56, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:27:38.2087707Z       %58 = arith.select %56, %57, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:27:38.2088012Z       %59 = arith.extf %58 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:27:38.2088235Z       %60 = "tt.reduce"(%59) <{axis = 1 : i32}> ({
2026-02-21T09:27:38.2088428Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:27:38.2088607Z         %192 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:27:38.2088802Z         tt.reduce.return %192 : f32
2026-02-21T09:27:38.2089045Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:27:38.2089270Z       %61 = arith.truncf %60 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:27:38.2089508Z       %62 = arith.extf %61 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:27:38.2089724Z       %63 = arith.cmpf ogt, %29, %62 : tensor<8xf32>
2026-02-21T09:27:38.2089932Z       %64 = arith.cmpf une, %29, %29 : tensor<8xf32>
2026-02-21T09:27:38.2090127Z       %65 = arith.ori %63, %64 : tensor<8xi1>
2026-02-21T09:27:38.2090345Z       %66 = arith.select %65, %29, %62 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:27:38.2090568Z       %67 = arith.subf %29, %66 : tensor<8xf32>
2026-02-21T09:27:38.2090919Z       %68 = tt.extern_elementwise %67 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:27:38.2091277Z       %69 = arith.mulf %40, %68 : tensor<8xf32>
2026-02-21T09:27:38.2091514Z       %70 = tt.expand_dims %66 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:27:38.2091895Z       %71 = arith.extf %57 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:27:38.2092151Z       %72 = tt.broadcast %70 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:27:38.2092383Z       %73 = arith.subf %71, %72 : tensor<8x1024xf32>
2026-02-21T09:27:38.2092751Z       %74 = tt.extern_elementwise %73 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:27:38.2093186Z       %75 = arith.select %56, %74, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:27:38.2093435Z       %76 = "tt.reduce"(%75) <{axis = 1 : i32}> ({
2026-02-21T09:27:38.2093628Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:27:38.2093803Z         %192 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:27:38.2093998Z         tt.reduce.return %192 : f32
2026-02-21T09:27:38.2094182Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:27:38.2094385Z       %77 = arith.addf %69, %76 : tensor<8xf32>
2026-02-21T09:27:38.2094576Z       %c2_i32 = arith.constant 2 : i32
2026-02-21T09:27:38.2094766Z       %78 = arith.muli %c1024_i32, %c2_i32 : i32
2026-02-21T09:27:38.2094960Z       %79 = arith.addi %c0_i32, %78 : i32
2026-02-21T09:27:38.2095188Z       %80 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:27:38.2095440Z       %81 = tt.splat %79 : i32 -> tensor<1024xi32>
2026-02-21T09:27:38.2095635Z       %82 = arith.addi %81, %80 : tensor<1024xi32>
2026-02-21T09:27:38.2095850Z       %83 = arith.cmpi slt, %82, %cst_3 : tensor<1024xi32>
2026-02-21T09:27:38.2096103Z       %84 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:27:38.2096363Z       %85 = arith.muli %84, %cst_2 : tensor<8x1xi32>
2026-02-21T09:27:38.2096621Z       %86 = tt.expand_dims %82 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:27:38.2096909Z       %87 = tt.broadcast %85 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:27:38.2097182Z       %88 = tt.broadcast %86 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:27:38.2097426Z       %89 = arith.addi %87, %88 : tensor<8x1024xi32>
2026-02-21T09:27:38.2097675Z       %90 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:27:38.2097960Z       %91 = tt.addptr %90, %89 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:27:38.2098274Z       %92 = tt.expand_dims %83 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:27:38.2098575Z       %93 = tt.broadcast %92 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:27:38.2098832Z       %94 = tt.load %91, %93, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:27:38.2099107Z       %95 = arith.select %93, %94, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:27:38.2099385Z       %96 = arith.extf %95 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:27:38.2099623Z       %97 = "tt.reduce"(%96) <{axis = 1 : i32}> ({
2026-02-21T09:27:38.2099816Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:27:38.2100008Z         %192 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:27:38.2100260Z         tt.reduce.return %192 : f32
2026-02-21T09:27:38.2100445Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:27:38.2100677Z       %98 = arith.truncf %97 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:27:38.2100919Z       %99 = arith.extf %98 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:27:38.2101155Z       %100 = arith.cmpf ogt, %66, %99 : tensor<8xf32>
2026-02-21T09:27:38.2101375Z       %101 = arith.cmpf une, %66, %66 : tensor<8xf32>
2026-02-21T09:27:38.2101662Z       %102 = arith.ori %100, %101 : tensor<8xi1>
2026-02-21T09:27:38.2101910Z       %103 = arith.select %102, %66, %99 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:27:38.2102158Z       %104 = arith.subf %66, %103 : tensor<8xf32>
2026-02-21T09:27:38.2102544Z       %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:27:38.2102984Z       %106 = arith.mulf %77, %105 : tensor<8xf32>
2026-02-21T09:27:38.2103255Z       %107 = tt.expand_dims %103 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:27:38.2103557Z       %108 = arith.extf %94 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:27:38.2103840Z       %109 = tt.broadcast %107 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:27:38.2104101Z       %110 = arith.subf %108, %109 : tensor<8x1024xf32>
2026-02-21T09:27:38.2104486Z       %111 = tt.extern_elementwise %110 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:27:38.2104928Z       %112 = arith.select %93, %111, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:27:38.2105221Z       %113 = "tt.reduce"(%112) <{axis = 1 : i32}> ({
2026-02-21T09:27:38.2105420Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:27:38.2105609Z         %192 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:27:38.2105793Z         tt.reduce.return %192 : f32
2026-02-21T09:27:38.2105980Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:27:38.2106174Z       %114 = arith.addf %106, %113 : tensor<8xf32>
2026-02-21T09:27:38.2106549Z       %115:2 = scf.for %arg3 = %c3072_i32 to %c4864_i32 step %c1024_i32 iter_args(%arg4 = %103, %arg5 = %114) -> (tensor<8xf32>, tensor<8xf32>)  : i32 {
2026-02-21T09:27:38.2106960Z         %192 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:27:38.2107227Z         %193 = tt.splat %arg3 : i32 -> tensor<1024xi32>
2026-02-21T09:27:38.2107442Z         %194 = arith.addi %193, %192 : tensor<1024xi32>
2026-02-21T09:27:38.2107659Z         %195 = arith.cmpi slt, %194, %cst_3 : tensor<1024xi32>
2026-02-21T09:27:38.2107931Z         %196 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:27:38.2108194Z         %197 = arith.muli %196, %cst_2 : tensor<8x1xi32>
2026-02-21T09:27:38.2108462Z         %198 = tt.expand_dims %194 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:27:38.2108759Z         %199 = tt.broadcast %197 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:27:38.2109037Z         %200 = tt.broadcast %198 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:27:38.2109293Z         %201 = arith.addi %199, %200 : tensor<8x1024xi32>
2026-02-21T09:27:38.2109532Z         %202 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:27:38.2109822Z         %203 = tt.addptr %202, %201 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:27:38.2110128Z         %204 = tt.expand_dims %195 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:27:38.2110433Z         %205 = tt.broadcast %204 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:27:38.2110701Z         %206 = tt.load %203, %205, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:27:38.2110973Z         %207 = arith.select %205, %206, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:27:38.2111266Z         %208 = arith.extf %207 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:27:38.2111621Z         %209 = "tt.reduce"(%208) <{axis = 1 : i32}> ({
2026-02-21T09:27:38.2111826Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:27:38.2112013Z           %227 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:27:38.2112216Z           tt.reduce.return %227 : f32
2026-02-21T09:27:38.2112412Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:27:38.2112634Z         %210 = arith.truncf %209 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:27:38.2112885Z         %211 = arith.extf %210 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:27:38.2113118Z         %212 = arith.cmpf ogt, %arg4, %211 : tensor<8xf32>
2026-02-21T09:27:38.2113348Z         %213 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32>
2026-02-21T09:27:38.2113556Z         %214 = arith.ori %212, %213 : tensor<8xi1>
2026-02-21T09:27:38.2113795Z         %215 = arith.select %214, %arg4, %211 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:27:38.2114037Z         %216 = arith.subf %arg4, %215 : tensor<8xf32>
2026-02-21T09:27:38.2114463Z         %217 = tt.extern_elementwise %216 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:27:38.2114831Z         %218 = arith.mulf %arg5, %217 : tensor<8xf32>
2026-02-21T09:27:38.2115080Z         %219 = tt.expand_dims %215 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:27:38.2115371Z         %220 = arith.extf %206 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:27:38.2115635Z         %221 = tt.broadcast %219 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:27:38.2115883Z         %222 = arith.subf %220, %221 : tensor<8x1024xf32>
2026-02-21T09:27:38.2116263Z         %223 = tt.extern_elementwise %222 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:27:38.2116686Z         %224 = arith.select %205, %223, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:27:38.2116956Z         %225 = "tt.reduce"(%224) <{axis = 1 : i32}> ({
2026-02-21T09:27:38.2117148Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:27:38.2117336Z           %227 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:27:38.2117523Z           tt.reduce.return %227 : f32
2026-02-21T09:27:38.2117713Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:27:38.2117923Z         %226 = arith.addf %218, %225 : tensor<8xf32>
2026-02-21T09:27:38.2118138Z         scf.yield %215, %226 : tensor<8xf32>, tensor<8xf32>
2026-02-21T09:27:38.2118360Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:27:38.2118556Z       %c3072_i32_7 = arith.constant 3072 : i32
2026-02-21T09:27:38.2118753Z       %c3072_i32_8 = arith.constant 3072 : i32
2026-02-21T09:27:38.2118989Z       %116 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:27:38.2119260Z       %117 = tt.splat %c0_i32 : i32 -> tensor<1024xi32>
2026-02-21T09:27:38.2119475Z       %118 = arith.addi %117, %116 : tensor<1024xi32>
2026-02-21T09:27:38.2119696Z       %119 = arith.cmpi slt, %118, %cst_3 : tensor<1024xi32>
2026-02-21T09:27:38.2120017Z       %120 = tt.descriptor_load %0[%2, %c0_i32] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:27:38.2120363Z       %121 = tt.expand_dims %115#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:27:38.2120658Z       %122 = arith.extf %120 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:27:38.2120919Z       %123 = tt.broadcast %121 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:27:38.2121165Z       %124 = arith.subf %122, %123 : tensor<8x1024xf32>
2026-02-21T09:27:38.2121576Z       %125 = tt.extern_elementwise %124 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:27:38.2121990Z       %126 = tt.expand_dims %115#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:27:38.2122285Z       %127 = tt.broadcast %126 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:27:38.2122523Z       %128 = arith.divf %125, %127 : tensor<8x1024xf32>
2026-02-21T09:27:38.2122816Z       %129 = arith.truncf %128 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:27:38.2123106Z       %130 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:27:38.2123362Z       %131 = arith.muli %130, %cst_2 : tensor<8x1xi32>
2026-02-21T09:27:38.2123629Z       %132 = tt.expand_dims %118 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:27:38.2123922Z       %133 = tt.broadcast %131 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:27:38.2124193Z       %134 = tt.broadcast %132 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:27:38.2124430Z       %135 = arith.addi %133, %134 : tensor<8x1024xi32>
2026-02-21T09:27:38.2124671Z       %136 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:27:38.2124962Z       %137 = tt.addptr %136, %135 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:27:38.2125310Z       %138 = tt.expand_dims %119 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:27:38.2125611Z       %139 = tt.broadcast %138 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:27:38.2125862Z       tt.store %137, %129, %139 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:27:38.2126082Z       %c1_i32_9 = arith.constant 1 : i32
2026-02-21T09:27:38.2126278Z       %140 = arith.muli %c1024_i32, %c1_i32_9 : i32
2026-02-21T09:27:38.2126467Z       %141 = arith.addi %c0_i32, %140 : i32
2026-02-21T09:27:38.2126704Z       %142 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:27:38.2126956Z       %143 = tt.splat %141 : i32 -> tensor<1024xi32>
2026-02-21T09:27:38.2127164Z       %144 = arith.addi %143, %142 : tensor<1024xi32>
2026-02-21T09:27:38.2127375Z       %145 = arith.cmpi slt, %144, %cst_3 : tensor<1024xi32>
2026-02-21T09:27:38.2127685Z       %146 = tt.descriptor_load %0[%2, %141] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:27:38.2128028Z       %147 = tt.expand_dims %115#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:27:38.2128313Z       %148 = arith.extf %146 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:27:38.2128574Z       %149 = tt.broadcast %147 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:27:38.2128809Z       %150 = arith.subf %148, %149 : tensor<8x1024xf32>
2026-02-21T09:27:38.2129185Z       %151 = tt.extern_elementwise %150 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:27:38.2129595Z       %152 = tt.expand_dims %115#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:27:38.2129891Z       %153 = tt.broadcast %152 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:27:38.2130135Z       %154 = arith.divf %151, %153 : tensor<8x1024xf32>
2026-02-21T09:27:38.2130367Z       %155 = arith.truncf %154 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:27:38.2130652Z       %156 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:27:38.2130908Z       %157 = arith.muli %156, %cst_2 : tensor<8x1xi32>
2026-02-21T09:27:38.2131176Z       %158 = tt.expand_dims %144 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:27:38.2131473Z       %159 = tt.broadcast %157 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:27:38.2131768Z       %160 = tt.broadcast %158 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:27:38.2132017Z       %161 = arith.addi %159, %160 : tensor<8x1024xi32>
2026-02-21T09:27:38.2132257Z       %162 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:27:38.2132547Z       %163 = tt.addptr %162, %161 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:27:38.2132852Z       %164 = tt.expand_dims %145 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:27:38.2133157Z       %165 = tt.broadcast %164 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:27:38.2133418Z       tt.store %163, %155, %165 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:27:38.2133685Z       %c2_i32_10 = arith.constant 2 : i32
2026-02-21T09:27:38.2133893Z       %166 = arith.muli %c1024_i32, %c2_i32_10 : i32
2026-02-21T09:27:38.2134091Z       %167 = arith.addi %c0_i32, %166 : i32
2026-02-21T09:27:38.2134329Z       %168 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:27:38.2134578Z       %169 = tt.splat %167 : i32 -> tensor<1024xi32>
2026-02-21T09:27:38.2134789Z       %170 = arith.addi %169, %168 : tensor<1024xi32>
2026-02-21T09:27:38.2135013Z       %171 = arith.cmpi slt, %170, %cst_3 : tensor<1024xi32>
2026-02-21T09:27:38.2135313Z       %172 = tt.descriptor_load %0[%2, %167] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:27:38.2135663Z       %173 = tt.expand_dims %115#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:27:38.2135954Z       %174 = arith.extf %172 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:27:38.2136265Z       %175 = tt.broadcast %173 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:27:38.2136510Z       %176 = arith.subf %174, %175 : tensor<8x1024xf32>
2026-02-21T09:27:38.2136892Z       %177 = tt.extern_elementwise %176 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:27:38.2137318Z       %178 = tt.expand_dims %115#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:27:38.2137608Z       %179 = tt.broadcast %178 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:27:38.2137855Z       %180 = arith.divf %177, %179 : tensor<8x1024xf32>
2026-02-21T09:27:38.2138093Z       %181 = arith.truncf %180 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:27:38.2138387Z       %182 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:27:38.2138653Z       %183 = arith.muli %182, %cst_2 : tensor<8x1xi32>
2026-02-21T09:27:38.2138915Z       %184 = tt.expand_dims %170 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:27:38.2139221Z       %185 = tt.broadcast %183 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:27:38.2139484Z       %186 = tt.broadcast %184 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:27:38.2139730Z       %187 = arith.addi %185, %186 : tensor<8x1024xi32>
2026-02-21T09:27:38.2139967Z       %188 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:27:38.2140261Z       %189 = tt.addptr %188, %187 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:27:38.2140602Z       %190 = tt.expand_dims %171 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:27:38.2140893Z       %191 = tt.broadcast %190 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:27:38.2141159Z       tt.store %189, %181, %191 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:27:38.2141422Z       scf.for %arg3 = %c3072_i32_7 to %c4864_i32 step %c1024_i32  : i32 {
2026-02-21T09:27:38.2141757Z         %192 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:27:38.2142043Z         %193 = tt.splat %arg3 : i32 -> tensor<1024xi32>
2026-02-21T09:27:38.2142263Z         %194 = arith.addi %193, %192 : tensor<1024xi32>
2026-02-21T09:27:38.2142502Z         %195 = arith.cmpi slt, %194, %cst_3 : tensor<1024xi32>
2026-02-21T09:27:38.2142829Z         %196 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:27:38.2143207Z         %197 = tt.expand_dims %115#0 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:27:38.2143513Z         %198 = arith.extf %196 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:27:38.2143797Z         %199 = tt.broadcast %197 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:27:38.2144051Z         %200 = arith.subf %198, %199 : tensor<8x1024xf32>
2026-02-21T09:27:38.2144448Z         %201 = tt.extern_elementwise %200 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:27:38.2144899Z         %202 = tt.expand_dims %115#1 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:27:38.2145251Z         %203 = tt.broadcast %202 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:27:38.2145509Z         %204 = arith.divf %201, %203 : tensor<8x1024xf32>
2026-02-21T09:27:38.2145775Z         %205 = arith.truncf %204 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:27:38.2146075Z         %206 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:27:38.2146367Z         %207 = arith.muli %206, %cst_2 : tensor<8x1xi32>
2026-02-21T09:27:38.2146642Z         %208 = tt.expand_dims %194 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:27:38.2146958Z         %209 = tt.broadcast %207 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:27:38.2147262Z         %210 = tt.broadcast %208 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:27:38.2147516Z         %211 = arith.addi %209, %210 : tensor<8x1024xi32>
2026-02-21T09:27:38.2147836Z         %212 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:27:38.2148142Z         %213 = tt.addptr %212, %211 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:27:38.2148478Z         %214 = tt.expand_dims %195 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:27:38.2148790Z         %215 = tt.broadcast %214 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:27:38.2149069Z         tt.store %213, %205, %215 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:27:38.2149308Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:27:38.2149679Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T09:27:38.2150046Z     tt.return
2026-02-21T09:27:38.2150178Z   }
2026-02-21T09:27:38.2150311Z }
2026-02-21T09:27:38.2150383Z 
2026-02-21T09:27:38.2150435Z {-#
2026-02-21T09:27:38.2150587Z   external_resources: {
2026-02-21T09:27:38.2150750Z     mlir_reproducer: {
2026-02-21T09:27:38.2155136Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T09:27:38.2159540Z       disable_threading: false,
2026-02-21T09:27:38.2159715Z       verify_each: true
2026-02-21T09:27:38.2159857Z     }
2026-02-21T09:27:38.2159982Z   }
2026-02-21T09:27:38.2160094Z #-}
2026-02-21T09:27:38.2160521Z /tmp/torchinductor_root/2e/c2e47ra7xtiv7wxu5whsmkdqrvjbmzr2ti3rxjnzfqisyyhz3f4t.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:27:38.2161789Z /tmp/torchinductor_root/2e/c2e47ra7xtiv7wxu5whsmkdqrvjbmzr2ti3rxjnzfqisyyhz3f4t.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:27:38.2162763Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:27:38.2163857Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=4, num_warps=32, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[4, 0], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T09:27:38.2164816Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:27:38.2165067Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:27:42.7875901Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.4 configs/s
2026-02-21T09:27:42.7884702Z [41s] Adaptive compile timeout: 30s (90% percentile=6.3s, bounds=[30.0s, 30s])
2026-02-21T09:27:43.9119335Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 888.1 configs/s
2026-02-21T09:27:43.9911156Z [42s] Initial random population of 100, 5 starting points: 
2026-02-21T09:27:43.9914122Z error=13
2026-02-21T09:27:43.9918549Z timeout=1
2026-02-21T09:27:43.9922538Z ok=86
2026-02-21T09:27:43.9922732Z min=0.0411
2026-02-21T09:27:43.9922874Z mid=0.3808
2026-02-21T09:27:43.9922999Z max=112.9472
2026-02-21T09:27:43.9923159Z best={'block_sizes': [1, 8192],
2026-02-21T09:27:43.9923461Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:27:43.9923776Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:27:43.9923975Z  'num_sm_multiplier': 2,
2026-02-21T09:27:43.9924141Z  'num_stages': 7,
2026-02-21T09:27:43.9924278Z  'num_warps': 32,
2026-02-21T09:27:43.9924439Z  'pid_type': 'persistent_blocked',
2026-02-21T09:27:43.9924627Z  'range_flattens': [True, True],
2026-02-21T09:27:43.9924815Z  'range_multi_buffers': [False, None],
2026-02-21T09:27:43.9925006Z  'range_num_stages': [4, 3],
2026-02-21T09:27:43.9925172Z  'range_unroll_factors': [2, 3],
2026-02-21T09:27:43.9925361Z  'range_warp_specializes': [False, False]}
2026-02-21T09:27:43.9925573Z [42s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:27:45.1477926Z [43s] Generation 1 starting: 85 neighbors, 5 active search path(s)
2026-02-21T09:28:17.7578148Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 0.8 configs/s
2026-02-21T09:28:23.0744331Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 16.9 configs/s
2026-02-21T09:28:23.5508856Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2077.3 configs/s
2026-02-21T09:28:23.5975377Z [82s] Generation 1 complete: 
2026-02-21T09:28:23.5980294Z ok=91
2026-02-21T09:28:23.5984150Z min=0.0225
2026-02-21T09:28:23.5986044Z mid=0.0451
2026-02-21T09:28:23.5986214Z max=0.2663
2026-02-21T09:28:23.5986368Z best={'block_sizes': [1, 8192],
2026-02-21T09:28:23.5986617Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:28:23.5986888Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:28:23.5987085Z  'num_sm_multiplier': 64,
2026-02-21T09:28:23.5987251Z  'num_stages': 5,
2026-02-21T09:28:23.5987391Z  'num_warps': 1,
2026-02-21T09:28:23.5987550Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:28:23.5987739Z  'range_flattens': [None, False],
2026-02-21T09:28:23.5987941Z  'range_multi_buffers': [True, False],
2026-02-21T09:28:23.5988460Z  'range_num_stages': [2, 1],
2026-02-21T09:28:23.5988639Z  'range_unroll_factors': [0, 0],
2026-02-21T09:28:23.5988821Z  'range_warp_specializes': [True, None]}
2026-02-21T09:28:23.5994735Z [82s] Fitting surrogate: 191 points, 191 targets
2026-02-21T09:28:24.5784674Z [83s] Generation 2 starting: 70 neighbors, 5 active search path(s)
2026-02-21T09:28:38.0716142Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 1.1 configs/s
2026-02-21T09:28:42.3582778Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 73/73 17.2 configs/s
2026-02-21T09:28:43.8572174Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 675.6 configs/s
2026-02-21T09:28:43.9758168Z [102s] Generation 2 complete: 
2026-02-21T09:28:43.9759678Z error=2
2026-02-21T09:28:43.9759859Z ok=74
2026-02-21T09:28:43.9760021Z min=0.0204
2026-02-21T09:28:43.9760606Z mid=0.0369
2026-02-21T09:28:43.9760869Z max=0.4075
2026-02-21T09:28:43.9761092Z best={'block_sizes': [1, 8192],
2026-02-21T09:28:43.9761454Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:28:43.9762029Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:28:43.9762242Z  'num_stages': 7,
2026-02-21T09:28:43.9762413Z  'num_warps': 4,
2026-02-21T09:28:43.9762589Z  'pid_type': 'flat',
2026-02-21T09:28:43.9762778Z  'range_flattens': [None, True],
2026-02-21T09:28:43.9762979Z  'range_multi_buffers': [None, None],
2026-02-21T09:28:43.9763190Z  'range_num_stages': [0, 3],
2026-02-21T09:28:43.9763355Z  'range_unroll_factors': [0, 3],
2026-02-21T09:28:43.9763576Z  'range_warp_specializes': [None, None]}
2026-02-21T09:28:43.9785989Z [102s] Fitting surrogate: 267 points, 267 targets
2026-02-21T09:28:44.8307384Z [103s] Generation 3 starting: 65 neighbors, 5 active search path(s)
2026-02-21T09:28:51.0540009Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 16.1 configs/s
2026-02-21T09:28:55.2888934Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 16.7 configs/s
2026-02-21T09:28:57.1039746Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 559.5 configs/s
2026-02-21T09:28:57.2457578Z [116s] Generation 3 complete: 
2026-02-21T09:28:57.2459482Z ok=71
2026-02-21T09:28:57.2459682Z min=0.0204
2026-02-21T09:28:57.2464433Z mid=0.0328
2026-02-21T09:28:57.2467567Z max=0.1720
2026-02-21T09:28:57.2469735Z best={'block_sizes': [1, 8192],
2026-02-21T09:28:57.2470041Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:28:57.2470326Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:28:57.2470526Z  'num_stages': 7,
2026-02-21T09:28:57.2470677Z  'num_warps': 4,
2026-02-21T09:28:57.2470818Z  'pid_type': 'flat',
2026-02-21T09:28:57.2470980Z  'range_flattens': [None, True],
2026-02-21T09:28:57.2471192Z  'range_multi_buffers': [None, None],
2026-02-21T09:28:57.2471874Z  'range_num_stages': [0, 3],
2026-02-21T09:28:57.2472040Z  'range_unroll_factors': [0, 3],
2026-02-21T09:28:57.2472224Z  'range_warp_specializes': [None, None]}
2026-02-21T09:28:57.2476123Z [116s] Fitting surrogate: 338 points, 338 targets
2026-02-21T09:28:58.1510043Z [116s] Generation 4 starting: 64 neighbors, 5 active search path(s)
2026-02-21T09:29:03.7491099Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67/67 14.3 configs/s
2026-02-21T09:29:08.2039470Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 67/67 15.2 configs/s
2026-02-21T09:29:10.2029844Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 508.2 configs/s
2026-02-21T09:29:10.3697717Z [129s] Generation 4 complete: 
2026-02-21T09:29:10.3701481Z ok=70
2026-02-21T09:29:10.3704677Z min=0.0204
2026-02-21T09:29:10.3709028Z mid=0.0328
2026-02-21T09:29:10.3709975Z max=0.0778
2026-02-21T09:29:10.3710155Z best={'block_sizes': [1, 8192],
2026-02-21T09:29:10.3710430Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:29:10.3710723Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:29:10.3710911Z  'num_stages': 6,
2026-02-21T09:29:10.3711060Z  'num_warps': 4,
2026-02-21T09:29:10.3711200Z  'pid_type': 'flat',
2026-02-21T09:29:10.3711361Z  'range_flattens': [None, True],
2026-02-21T09:29:10.3711606Z  'range_multi_buffers': [None, False],
2026-02-21T09:29:10.3711803Z  'range_num_stages': [0, 3],
2026-02-21T09:29:10.3711969Z  'range_unroll_factors': [0, 3],
2026-02-21T09:29:10.3712153Z  'range_warp_specializes': [None, None]}
2026-02-21T09:29:10.3714931Z [129s] Fitting surrogate: 408 points, 408 targets
2026-02-21T09:29:11.0910811Z [129s] Generation 5 starting: 52 neighbors, 4 active search path(s)
2026-02-21T09:29:15.5499299Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 9.9 configs/s
2026-02-21T09:29:18.7875590Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 54/54 16.9 configs/s
2026-02-21T09:29:20.7837124Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 508.7 configs/s
2026-02-21T09:29:20.9543174Z [139s] Generation 5 complete: 
2026-02-21T09:29:20.9544844Z ok=57
2026-02-21T09:29:20.9545060Z min=0.0204
2026-02-21T09:29:20.9545231Z mid=0.0287
2026-02-21T09:29:20.9545400Z max=0.0614
2026-02-21T09:29:20.9545603Z best={'block_sizes': [1, 8192],
2026-02-21T09:29:20.9545854Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:29:20.9546139Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:29:20.9546356Z  'num_stages': 2,
2026-02-21T09:29:20.9546517Z  'num_warps': 2,
2026-02-21T09:29:20.9546681Z  'pid_type': 'flat',
2026-02-21T09:29:20.9546864Z  'range_flattens': [None, True],
2026-02-21T09:29:20.9547118Z  'range_multi_buffers': [None, None],
2026-02-21T09:29:20.9547363Z  'range_num_stages': [0, 4],
2026-02-21T09:29:20.9547593Z  'range_unroll_factors': [0, 0],
2026-02-21T09:29:20.9547809Z  'range_warp_specializes': [None, True]}
2026-02-21T09:29:20.9559649Z [139s] Fitting surrogate: 465 points, 465 targets
2026-02-21T09:29:21.6131181Z [140s] Generation 6 starting: 31 neighbors, 3 active search path(s)
2026-02-21T09:29:25.4767305Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 9.2 configs/s
2026-02-21T09:29:27.4108362Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 32/32 16.9 configs/s
2026-02-21T09:29:28.7608825Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 749.7 configs/s
2026-02-21T09:29:28.8713606Z [147s] Generation 6 complete: 
2026-02-21T09:29:28.8717909Z ok=34
2026-02-21T09:29:28.8722294Z min=0.0204
2026-02-21T09:29:28.8723823Z mid=0.0226
2026-02-21T09:29:28.8724442Z max=0.0655
2026-02-21T09:29:28.8724597Z best={'block_sizes': [1, 8192],
2026-02-21T09:29:28.8729424Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:29:28.8730849Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:29:28.8731086Z  'num_stages': 2,
2026-02-21T09:29:28.8731233Z  'num_warps': 2,
2026-02-21T09:29:28.8731384Z  'pid_type': 'flat',
2026-02-21T09:29:28.8731608Z  'range_flattens': [None, True],
2026-02-21T09:29:28.8731808Z  'range_multi_buffers': [None, False],
2026-02-21T09:29:28.8731994Z  'range_num_stages': [0, 4],
2026-02-21T09:29:28.8732178Z  'range_unroll_factors': [0, 0],
2026-02-21T09:29:28.8732373Z  'range_warp_specializes': [None, True]}
2026-02-21T09:29:28.8732593Z [147s] Fitting surrogate: 499 points, 499 targets
2026-02-21T09:29:29.4555779Z [148s] Generation 7 starting: 38 neighbors, 3 active search path(s)
2026-02-21T09:29:34.1447017Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 5.3 configs/s
2026-02-21T09:29:36.8658374Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 39/39 14.5 configs/s
2026-02-21T09:29:38.8028044Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 524.9 configs/s
2026-02-21T09:29:38.9656844Z [157s] Generation 7 complete: 
2026-02-21T09:29:38.9661185Z ok=41
2026-02-21T09:29:38.9665587Z min=0.0204
2026-02-21T09:29:38.9670784Z mid=0.0225
2026-02-21T09:29:38.9672848Z max=0.0942
2026-02-21T09:29:38.9673037Z best={'block_sizes': [1, 8192],
2026-02-21T09:29:38.9673273Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:29:38.9673528Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:29:38.9673737Z  'num_stages': 1,
2026-02-21T09:29:38.9673887Z  'num_warps': 4,
2026-02-21T09:29:38.9674031Z  'pid_type': 'flat',
2026-02-21T09:29:38.9674199Z  'range_flattens': [None, True],
2026-02-21T09:29:38.9674380Z  'range_multi_buffers': [None, True],
2026-02-21T09:29:38.9674614Z  'range_num_stages': [0, 4],
2026-02-21T09:29:38.9674785Z  'range_unroll_factors': [0, 0],
2026-02-21T09:29:38.9674966Z  'range_warp_specializes': [None, True]}
2026-02-21T09:29:38.9675188Z [157s] Fitting surrogate: 540 points, 540 targets
2026-02-21T09:29:39.7560307Z [158s] Generation 8 starting: 41 neighbors, 3 active search path(s)
2026-02-21T09:29:43.9290952Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 7.8 configs/s
2026-02-21T09:29:46.4726925Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 42/42 16.8 configs/s
2026-02-21T09:29:48.4426973Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 514.9 configs/s
2026-02-21T09:29:48.6045244Z [167s] Generation 8 complete: 
2026-02-21T09:29:48.6046614Z ok=44
2026-02-21T09:29:48.6046811Z min=0.0204
2026-02-21T09:29:48.6046987Z mid=0.0267
2026-02-21T09:29:48.6047156Z max=0.0513
2026-02-21T09:29:48.6047386Z best={'block_sizes': [1, 8192],
2026-02-21T09:29:48.6047631Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:29:48.6047892Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:29:48.6048104Z  'num_stages': 1,
2026-02-21T09:29:48.6048263Z  'num_warps': 4,
2026-02-21T09:29:48.6048423Z  'pid_type': 'flat',
2026-02-21T09:29:48.6048584Z  'range_flattens': [None, True],
2026-02-21T09:29:48.6048769Z  'range_multi_buffers': [None, True],
2026-02-21T09:29:48.6048953Z  'range_num_stages': [0, 4],
2026-02-21T09:29:48.6049126Z  'range_unroll_factors': [0, 0],
2026-02-21T09:29:48.6049325Z  'range_warp_specializes': [None, True]}
2026-02-21T09:29:48.6065202Z [167s] Fitting surrogate: 584 points, 584 targets
2026-02-21T09:29:49.1933260Z [168s] Generation 9 starting: 28 neighbors, 2 active search path(s)
2026-02-21T09:29:53.0090671Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 29/29 4.0 configs/s
2026-02-21T09:29:54.7862175Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 29/29 16.7 configs/s
2026-02-21T09:29:56.1434076Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 745.2 configs/s
2026-02-21T09:29:56.2552235Z [175s] Generation 9 complete: 
2026-02-21T09:29:56.2557299Z ok=31
2026-02-21T09:29:56.2561991Z min=0.0204
2026-02-21T09:29:56.2564978Z mid=0.0225
2026-02-21T09:29:56.2567747Z max=0.4096
2026-02-21T09:29:56.2567951Z best={'block_sizes': [1, 8192],
2026-02-21T09:29:56.2568212Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:29:56.2568503Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:29:56.2568710Z  'num_stages': 8,
2026-02-21T09:29:56.2568861Z  'num_warps': 4,
2026-02-21T09:29:56.2569003Z  'pid_type': 'flat',
2026-02-21T09:29:56.2569167Z  'range_flattens': [None, False],
2026-02-21T09:29:56.2569349Z  'range_multi_buffers': [None, False],
2026-02-21T09:29:56.2569910Z  'range_num_stages': [0, 3],
2026-02-21T09:29:56.2570108Z  'range_unroll_factors': [0, 2],
2026-02-21T09:29:56.2570299Z  'range_warp_specializes': [None, False]}
2026-02-21T09:29:56.7195334Z [175s] Fitting surrogate: 615 points, 615 targets
2026-02-21T09:29:56.7195768Z [175s] Generation 10 starting: 17 neighbors, 1 active search path(s)
2026-02-21T09:30:07.5100810Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 1.0 configs/s
2026-02-21T09:30:08.6093471Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 17.0 configs/s
2026-02-21T09:30:09.2688571Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1514.1 configs/s
2026-02-21T09:30:09.3292017Z [188s] Generation 10 complete: 
2026-02-21T09:30:09.3293791Z ok=19
2026-02-21T09:30:09.3294021Z min=0.0204
2026-02-21T09:30:09.3294192Z mid=0.0287
2026-02-21T09:30:09.3294362Z max=0.3605
2026-02-21T09:30:09.3294581Z best={'block_sizes': [1, 8192],
2026-02-21T09:30:09.3294888Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:30:09.3295170Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:30:09.3295377Z  'num_stages': 8,
2026-02-21T09:30:09.3295529Z  'num_warps': 4,
2026-02-21T09:30:09.3295682Z  'pid_type': 'flat',
2026-02-21T09:30:09.3295850Z  'range_flattens': [None, False],
2026-02-21T09:30:09.3296057Z  'range_multi_buffers': [None, False],
2026-02-21T09:30:09.3296267Z  'range_num_stages': [0, 3],
2026-02-21T09:30:09.3296461Z  'range_unroll_factors': [0, 2],
2026-02-21T09:30:09.3296674Z  'range_warp_specializes': [None, False]}
2026-02-21T09:30:09.3314328Z [188s] Fitting surrogate: 634 points, 634 targets
2026-02-21T09:30:09.7470791Z [188s] Generation 11 starting: 16 neighbors, 1 active search path(s)
2026-02-21T09:30:13.8836438Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 2.2 configs/s
2026-02-21T09:30:14.9144330Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.2 configs/s
2026-02-21T09:30:15.5684755Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1528.2 configs/s
2026-02-21T09:30:15.6286103Z [194s] Generation 11 complete: 
2026-02-21T09:30:15.6290499Z ok=18
2026-02-21T09:30:15.6294376Z min=0.0205
2026-02-21T09:30:15.6296374Z mid=0.0287
2026-02-21T09:30:15.6296533Z max=0.3411
2026-02-21T09:30:15.6296684Z best={'block_sizes': [1, 8192],
2026-02-21T09:30:15.6296936Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:30:15.6297211Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:30:15.6297407Z  'num_stages': 8,
2026-02-21T09:30:15.6297548Z  'num_warps': 4,
2026-02-21T09:30:15.6297693Z  'pid_type': 'flat',
2026-02-21T09:30:15.6297848Z  'range_flattens': [None, False],
2026-02-21T09:30:15.6298040Z  'range_multi_buffers': [None, False],
2026-02-21T09:30:15.6298583Z  'range_num_stages': [0, 3],
2026-02-21T09:30:15.6298783Z  'range_unroll_factors': [0, 2],
2026-02-21T09:30:15.6298962Z  'range_warp_specializes': [None, False]}
2026-02-21T09:30:15.6305746Z [194s] Fitting surrogate: 652 points, 652 targets
2026-02-21T09:30:16.4919398Z [195s] Generation 12 starting: 16 neighbors, 1 active search path(s)
2026-02-21T09:30:30.1789096Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 0.6 configs/s
2026-02-21T09:30:31.2280154Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 16.9 configs/s
2026-02-21T09:30:31.8769466Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1542.3        
2026-02-21T09:30:31.8770609Z                                                                   configs/s     
2026-02-21T09:30:31.9361528Z [210s] Generation 12 complete: 
2026-02-21T09:30:31.9363156Z ok=18
2026-02-21T09:30:31.9363319Z min=0.0205
2026-02-21T09:30:31.9363459Z mid=0.0287
2026-02-21T09:30:31.9363587Z max=0.2090
2026-02-21T09:30:31.9363756Z best={'block_sizes': [1, 8192],
2026-02-21T09:30:31.9364031Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:30:31.9364293Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:30:31.9364486Z  'num_stages': 8,
2026-02-21T09:30:31.9364623Z  'num_warps': 4,
2026-02-21T09:30:31.9364769Z  'pid_type': 'flat',
2026-02-21T09:30:31.9364923Z  'range_flattens': [None, False],
2026-02-21T09:30:31.9365111Z  'range_multi_buffers': [None, False],
2026-02-21T09:30:31.9365291Z  'range_num_stages': [0, 3],
2026-02-21T09:30:31.9365465Z  'range_unroll_factors': [0, 2],
2026-02-21T09:30:31.9365647Z  'range_warp_specializes': [None, False]}
2026-02-21T09:30:31.9384438Z [210s] Fitting surrogate: 670 points, 670 targets
2026-02-21T09:30:32.4060312Z [211s] Generation 13 starting: 16 neighbors, 1 active search path(s)
2026-02-21T09:30:34.3361532Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 15.3 configs/s
2026-02-21T09:30:35.3737185Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.1 configs/s
2026-02-21T09:30:36.0927090Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1393.0        
2026-02-21T09:30:36.0928291Z                                                                   configs/s     
2026-02-21T09:30:36.1606319Z [214s] Generation 13 complete: 
2026-02-21T09:30:36.1606651Z ok=18
2026-02-21T09:30:36.1611327Z min=0.0204
2026-02-21T09:30:36.1614959Z mid=0.0267
2026-02-21T09:30:36.1618739Z max=0.1106
2026-02-21T09:30:36.1621791Z best={'block_sizes': [1, 8192],
2026-02-21T09:30:36.1625693Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:30:36.1626883Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:30:36.1627083Z  'num_stages': 8,
2026-02-21T09:30:36.1627235Z  'num_warps': 4,
2026-02-21T09:30:36.1627376Z  'pid_type': 'flat',
2026-02-21T09:30:36.1627554Z  'range_flattens': [None, False],
2026-02-21T09:30:36.1627744Z  'range_multi_buffers': [None, False],
2026-02-21T09:30:36.1627965Z  'range_num_stages': [0, 3],
2026-02-21T09:30:36.1628510Z  'range_unroll_factors': [0, 2],
2026-02-21T09:30:36.1628691Z  'range_warp_specializes': [None, False]}
2026-02-21T09:30:36.1635177Z [214s] Fitting surrogate: 688 points, 688 targets
2026-02-21T09:30:36.5941139Z [215s] Generation 14 starting: 15 neighbors, 1 active search path(s)
2026-02-21T09:30:42.0092898Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 1.4 configs/s
2026-02-21T09:30:43.0049163Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.8 configs/s
2026-02-21T09:30:43.5767548Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1745.5        
2026-02-21T09:30:43.5771880Z                                                                   configs/s     
2026-02-21T09:30:43.6320103Z [222s] Generation 14 complete: 
2026-02-21T09:30:43.6324460Z ok=17
2026-02-21T09:30:43.6325793Z min=0.0204
2026-02-21T09:30:43.6325955Z mid=0.0266
2026-02-21T09:30:43.6326093Z max=0.4096
2026-02-21T09:30:43.6326599Z best={'block_sizes': [1, 8192],
2026-02-21T09:30:43.6326899Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:30:43.6327180Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:30:43.6327376Z  'num_stages': 8,
2026-02-21T09:30:43.6327525Z  'num_warps': 4,
2026-02-21T09:30:43.6327672Z  'pid_type': 'flat',
2026-02-21T09:30:43.6327840Z  'range_flattens': [None, False],
2026-02-21T09:30:43.6328023Z  'range_multi_buffers': [None, False],
2026-02-21T09:30:43.6328216Z  'range_num_stages': [0, 3],
2026-02-21T09:30:43.6328386Z  'range_unroll_factors': [0, 2],
2026-02-21T09:30:43.6328576Z  'range_warp_specializes': [None, False]}
2026-02-21T09:30:43.6348709Z [222s] Fitting surrogate: 705 points, 705 targets
2026-02-21T09:30:44.0582026Z [222s] Generation 15 starting: 15 neighbors, 1 active search path(s)
2026-02-21T09:30:47.9787153Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 2.3 configs/s
2026-02-21T09:30:48.9068739Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 17.0 configs/s
2026-02-21T09:30:49.8141407Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1110.3        
2026-02-21T09:30:49.8143059Z                                                                   configs/s     
2026-02-21T09:30:49.8902901Z [228s] Generation 15 complete: 
2026-02-21T09:30:49.8907287Z ok=17
2026-02-21T09:30:49.8911135Z min=0.0204
2026-02-21T09:30:49.8914382Z mid=0.0225
2026-02-21T09:30:49.8918262Z max=0.3405
2026-02-21T09:30:49.8918517Z best={'block_sizes': [1, 8192],
2026-02-21T09:30:49.8918808Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:30:49.8922749Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:30:49.8927712Z  'num_stages': 8,
2026-02-21T09:30:49.8929032Z  'num_warps': 4,
2026-02-21T09:30:49.8929224Z  'pid_type': 'flat',
2026-02-21T09:30:49.8929413Z  'range_flattens': [None, False],
2026-02-21T09:30:49.8929608Z  'range_multi_buffers': [None, False],
2026-02-21T09:30:49.8929833Z  'range_num_stages': [0, 3],
2026-02-21T09:30:49.8930018Z  'range_unroll_factors': [0, 2],
2026-02-21T09:30:49.8930206Z  'range_warp_specializes': [None, False]}
2026-02-21T09:30:49.8934386Z [228s] Fitting surrogate: 722 points, 722 targets
2026-02-21T09:30:50.3299130Z [229s] Generation 16 starting: 14 neighbors, 1 active search path(s)
2026-02-21T09:30:53.9480674Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 2.4 configs/s
2026-02-21T09:30:54.8069620Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 17.2 configs/s
2026-02-21T09:30:55.7027593Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1122.9        
2026-02-21T09:30:55.7031296Z                                                                   configs/s     
2026-02-21T09:30:55.7875237Z [234s] Generation 16 complete: 
2026-02-21T09:30:55.7877365Z ok=16
2026-02-21T09:30:55.7877563Z min=0.0205
2026-02-21T09:30:55.7877702Z mid=0.0225
2026-02-21T09:30:55.7877839Z max=0.0473
2026-02-21T09:30:55.7878013Z best={'block_sizes': [1, 8192],
2026-02-21T09:30:55.7878644Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:30:55.7878919Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:30:55.7879123Z  'num_stages': 8,
2026-02-21T09:30:55.7879268Z  'num_warps': 4,
2026-02-21T09:30:55.7879423Z  'pid_type': 'flat',
2026-02-21T09:30:55.7879585Z  'range_flattens': [None, False],
2026-02-21T09:30:55.7879777Z  'range_multi_buffers': [None, False],
2026-02-21T09:30:55.7879974Z  'range_num_stages': [0, 3],
2026-02-21T09:30:55.7880143Z  'range_unroll_factors': [0, 2],
2026-02-21T09:30:55.7880349Z  'range_warp_specializes': [None, False]}
2026-02-21T09:30:55.7907987Z [234s] Fitting surrogate: 738 points, 738 targets
2026-02-21T09:30:56.2800204Z [235s] Generation 17 starting: 18 neighbors, 1 active search path(s)
2026-02-21T09:30:58.2651841Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 12.4 configs/s
2026-02-21T09:30:59.4104577Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 17.3 configs/s
2026-02-21T09:31:00.1287469Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1394.8        
2026-02-21T09:31:00.1291874Z                                                                   configs/s     
2026-02-21T09:31:00.1912557Z [239s] Generation 17 complete: 
2026-02-21T09:31:00.1917447Z ok=20
2026-02-21T09:31:00.1918993Z min=0.0205
2026-02-21T09:31:00.1919197Z mid=0.0286
2026-02-21T09:31:00.1924006Z max=0.3415
2026-02-21T09:31:00.1925566Z best={'block_sizes': [1, 8192],
2026-02-21T09:31:00.1925904Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:31:00.1931116Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:31:00.1933240Z  'num_stages': 8,
2026-02-21T09:31:00.1936876Z  'num_warps': 4,
2026-02-21T09:31:00.1940262Z  'pid_type': 'flat',
2026-02-21T09:31:00.1944158Z  'range_flattens': [None, False],
2026-02-21T09:31:00.1945625Z  'range_multi_buffers': [None, False],
2026-02-21T09:31:00.1945874Z  'range_num_stages': [0, 3],
2026-02-21T09:31:00.1946085Z  'range_unroll_factors': [0, 2],
2026-02-21T09:31:00.1946280Z  'range_warp_specializes': [None, False]}
2026-02-21T09:31:00.1946597Z [239s] Fitting surrogate: 758 points, 758 targets
2026-02-21T09:31:00.6081030Z [239s] Generation 18 starting: 15 neighbors, 1 active search path(s)
2026-02-21T09:31:04.7795658Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 4.5 configs/s
2026-02-21T09:31:05.6973136Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 17.2 configs/s
2026-02-21T09:31:06.5244601Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1215.4        
2026-02-21T09:31:06.5245011Z                                                                   configs/s     
2026-02-21T09:31:06.5989092Z [245s] Generation 18 complete: 
2026-02-21T09:31:06.5993992Z ok=17
2026-02-21T09:31:06.5995419Z min=0.0204
2026-02-21T09:31:06.5995587Z mid=0.0225
2026-02-21T09:31:06.5995712Z max=0.0574
2026-02-21T09:31:06.5995864Z best={'block_sizes': [1, 8192],
2026-02-21T09:31:06.5996166Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:31:06.5996427Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:31:06.5996617Z  'num_stages': 8,
2026-02-21T09:31:06.5996754Z  'num_warps': 4,
2026-02-21T09:31:06.5996898Z  'pid_type': 'flat',
2026-02-21T09:31:06.5997053Z  'range_flattens': [None, False],
2026-02-21T09:31:06.5997235Z  'range_multi_buffers': [None, False],
2026-02-21T09:31:06.5997413Z  'range_num_stages': [0, 3],
2026-02-21T09:31:06.5997584Z  'range_unroll_factors': [0, 2],
2026-02-21T09:31:06.5997769Z  'range_warp_specializes': [None, False]}
2026-02-21T09:31:06.6019296Z [245s] Fitting surrogate: 775 points, 775 targets
2026-02-21T09:31:07.0283722Z [245s] Generation 19 starting: 15 neighbors, 1 active search path(s)
2026-02-21T09:31:08.7748318Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 11.3 configs/s
2026-02-21T09:31:09.6905488Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 17.2 configs/s
2026-02-21T09:31:11.0994483Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 992.4         
2026-02-21T09:31:11.0997953Z                                                                   configs/s     
2026-02-21T09:31:11.1848185Z [250s] Generation 19 complete: 
2026-02-21T09:31:11.1852761Z ok=17
2026-02-21T09:31:11.1854323Z min=0.0205
2026-02-21T09:31:11.1854525Z mid=0.0225
2026-02-21T09:31:11.1854651Z max=0.0327
2026-02-21T09:31:11.1854807Z best={'block_sizes': [1, 8192],
2026-02-21T09:31:11.1855056Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:31:11.1855326Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:31:11.1855520Z  'num_stages': 8,
2026-02-21T09:31:11.1855660Z  'num_warps': 4,
2026-02-21T09:31:11.1855806Z  'pid_type': 'flat',
2026-02-21T09:31:11.1855962Z  'range_flattens': [None, False],
2026-02-21T09:31:11.1856148Z  'range_multi_buffers': [None, False],
2026-02-21T09:31:11.1856327Z  'range_num_stages': [0, 3],
2026-02-21T09:31:11.1856875Z  'range_unroll_factors': [0, 2],
2026-02-21T09:31:11.1857090Z  'range_warp_specializes': [None, False]}
2026-02-21T09:31:11.1864372Z [250s] Fitting surrogate: 792 points, 792 targets
2026-02-21T09:31:11.5942326Z [250s] Generation 20 starting: 12 neighbors, 1 active search path(s)
2026-02-21T09:31:13.2869569Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 16.0 configs/s
2026-02-21T09:31:14.0287821Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 17.2 configs/s
2026-02-21T09:31:14.8678049Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1197.3        
2026-02-21T09:31:14.8679857Z                                                                   configs/s     
2026-02-21T09:31:14.9413525Z [253s] Generation 20 complete: 
2026-02-21T09:31:14.9415598Z ok=14
2026-02-21T09:31:14.9415763Z min=0.0205
2026-02-21T09:31:14.9415902Z mid=0.0225
2026-02-21T09:31:14.9416030Z max=0.0368
2026-02-21T09:31:14.9416165Z best={'block_sizes': [1, 8192],
2026-02-21T09:31:14.9416471Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:31:14.9416730Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:31:14.9416924Z  'num_stages': 8,
2026-02-21T09:31:14.9417064Z  'num_warps': 4,
2026-02-21T09:31:14.9417207Z  'pid_type': 'flat',
2026-02-21T09:31:14.9417361Z  'range_flattens': [None, False],
2026-02-21T09:31:14.9417548Z  'range_multi_buffers': [None, False],
2026-02-21T09:31:14.9417728Z  'range_num_stages': [0, 3],
2026-02-21T09:31:14.9417898Z  'range_unroll_factors': [0, 2],
2026-02-21T09:31:14.9418079Z  'range_warp_specializes': [None, False]}
2026-02-21T09:31:14.9443222Z [253s] Fitting surrogate: 806 points, 806 targets
2026-02-21T09:31:15.2299622Z [254s] Autotuning complete in 254.1s after searching 769 configs.
2026-02-21T09:31:15.2301887Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:31:15.2303020Z     @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 2], range_warp_specializes=[None, False]), static_shapes=True)
2026-02-21T09:31:15.2304235Z 
2026-02-21T09:31:15.2304503Z [254s] Code of selected kernel: /tmp/torchinductor_root/dc/cdcv4axqksvqcx43ypyvb56ys6vejja4lzjbcj5l3og76zsgyolr.py
2026-02-21T09:31:15.2526901Z from __future__ import annotations
2026-02-21T09:31:15.2531328Z 
2026-02-21T09:31:15.2535674Z import torch
2026-02-21T09:31:15.2537662Z import triton
2026-02-21T09:31:15.2537857Z import triton.language as tl
2026-02-21T09:31:15.2538091Z from torch._inductor.runtime import triton_helpers
2026-02-21T09:31:15.2538360Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T09:31:15.2538665Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:31:15.2538841Z 
2026-02-21T09:31:15.2539222Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T09:31:15.2539444Z _BLOCK_SIZE_1 = tl.constexpr(8192)
2026-02-21T09:31:15.2539567Z 
2026-02-21T09:31:15.2539642Z @triton.jit
2026-02-21T09:31:15.2539799Z def _helion_softmax_two_pass(x, out):
2026-02-21T09:31:15.2540065Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:31:15.2540318Z     pid_0 = tl.program_id(0)
2026-02-21T09:31:15.2540489Z     offset_0 = pid_0
2026-02-21T09:31:15.2540664Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T09:31:15.2540956Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:31:15.2541246Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T09:31:15.2541523Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:31:15.2541830Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T09:31:15.2542079Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:31:15.2542363Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T09:31:15.2542608Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T09:31:15.2542839Z     # src[softmax.py:82-89]: ...
2026-02-21T09:31:15.2543250Z     for offset_2 in tl.range(0, 4864, _BLOCK_SIZE_1, loop_unroll_factor=2, warp_specialize=False, num_stages=1, disallow_acc_multi_buffer=True, flatten=False):
2026-02-21T09:31:15.2543726Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:31:15.2543969Z         mask_1 = indices_2 < 4864
2026-02-21T09:31:15.2544131Z         mi_copy = mi
2026-02-21T09:31:15.2544277Z         di_copy = di
2026-02-21T09:31:15.2544417Z         mi_copy_0 = mi_copy
2026-02-21T09:31:15.2544579Z         di_copy_0 = di_copy
2026-02-21T09:31:15.2544757Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T09:31:15.2545133Z         values = tl.load(x + (indices_0[:, None] * 4864 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first')
2026-02-21T09:31:15.2545523Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T09:31:15.2545921Z         _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16))
2026-02-21T09:31:15.2546313Z         local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16)
2026-02-21T09:31:15.2546565Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T09:31:15.2546806Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T09:31:15.2547020Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T09:31:15.2547276Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:31:15.2547540Z         v_2 = mi_copy_0 - v_1
2026-02-21T09:31:15.2547708Z         v_3 = libdevice.exp(v_2)
2026-02-21T09:31:15.2547880Z         v_4 = di_copy_0 * v_3
2026-02-21T09:31:15.2548062Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T09:31:15.2548370Z         subscript = v_1[:, None]
2026-02-21T09:31:15.2548549Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T09:31:15.2548728Z         v_6 = v_5 - subscript
2026-02-21T09:31:15.2548946Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:31:15.2549204Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T09:31:15.2549421Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T09:31:15.2549602Z         v_7 = libdevice.exp(v_6)
2026-02-21T09:31:15.2549928Z         _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32))
2026-02-21T09:31:15.2550298Z         sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32)
2026-02-21T09:31:15.2550500Z         di = v_4 + sum_1
2026-02-21T09:31:15.2550676Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T09:31:15.2550852Z         mi = v_1
2026-02-21T09:31:15.2551131Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:31:15.2551399Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T09:31:15.2551802Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:31:15.2552327Z     for offset_2 in tl.range(0, 4864, _BLOCK_SIZE_1, loop_unroll_factor=2, warp_specialize=False, num_stages=1, disallow_acc_multi_buffer=True, flatten=False):
2026-02-21T09:31:15.2552789Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:31:15.2553034Z         mask_2 = indices_2 < 4864
2026-02-21T09:31:15.2553198Z         mi_copy_1 = mi
2026-02-21T09:31:15.2553350Z         di_copy_1 = di
2026-02-21T09:31:15.2553497Z         mi_copy_1_0 = mi_copy_1
2026-02-21T09:31:15.2553668Z         di_copy_1_0 = di_copy_1
2026-02-21T09:31:15.2553854Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T09:31:15.2554233Z         values_1 = tl.load(x + (indices_0[:, None] * 4864 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first')
2026-02-21T09:31:15.2554671Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:31:15.2554947Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T09:31:15.2555142Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T09:31:15.2555322Z         v_10 = v_9 - subscript_1
2026-02-21T09:31:15.2555499Z         v_11 = libdevice.exp(v_10)
2026-02-21T09:31:15.2555672Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T09:31:15.2555857Z         v_12 = v_11 / subscript_2
2026-02-21T09:31:15.2556034Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T09:31:15.2556297Z         tl.store(out + (indices_0[:, None] * 4864 + indices_2[None, :] * 1), v_13, mask_2[None, :])
2026-02-21T09:31:15.2556515Z 
2026-02-21T09:31:15.2556642Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T09:31:15.2556868Z     """
2026-02-21T09:31:15.2557077Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T09:31:15.2557378Z     This version uses fewer passes but is less numerically stable.
2026-02-21T09:31:15.2557597Z     Args:
2026-02-21T09:31:15.2557758Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T09:31:15.2557947Z     Returns:
2026-02-21T09:31:15.2558128Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T09:31:15.2558331Z     """
2026-02-21T09:31:15.2558470Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T09:31:15.2558640Z     m, n = x.size()
2026-02-21T09:31:15.2558811Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T09:31:15.2559011Z     out = torch.empty_like(x)
2026-02-21T09:31:15.2559240Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:31:15.2559554Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:31:15.2559856Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:31:15.2560096Z     # src[softmax.py:79-92]: ...
2026-02-21T09:31:15.2560404Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=8)
2026-02-21T09:31:15.2560677Z     # src[softmax.py:93]: return out
2026-02-21T09:31:15.2560842Z     return out
2026-02-21T09:31:16.0218367Z WARNING:tritonbench.utils.triton_op:Completed input ID 36:
2026-02-21T09:31:16.0218691Z (M, N)
2026-02-21T09:31:16.0223034Z ------------
2026-02-21T09:31:16.0231916Z (4096, 4864)
2026-02-21T09:31:16.0235714Z 
2026-02-21T09:31:16.0237948Z  40%|████      | 8/20 [22:21<38:36, 193.01s/it]WARNING:tritonbench.utils.triton_op:Running input ID 41:
2026-02-21T09:31:16.0238310Z (M, N)
2026-02-21T09:31:16.0241905Z ------------
2026-02-21T09:31:16.0246572Z (4096, 5504)
2026-02-21T09:31:16.0251330Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T09:31:17.2439114Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T09:31:18.7616000Z INFO:tritonbench.utils.triton_op:Took 2.31ms to get benchmark function for torch_compile_softmax
2026-02-21T09:31:20.0959611Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:31:20.0963494Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:31:20.0967900Z               'dtype': 'torch.float16',
2026-02-21T09:31:20.0971974Z               'shape': (4096, 5504),
2026-02-21T09:31:20.0975823Z               'stride': (5504, 1)},),
2026-02-21T09:31:20.0976121Z   'kwargs': {}}
2026-02-21T09:31:20.0986505Z INFO:tritonbench.utils.triton_op:Took 2.75ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:31:20.2701717Z [0s] Autotune random seed: 2138408546
2026-02-21T09:31:20.2947892Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:31:55.3686024Z [35s] Timeout after 30s compiling Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None])
2026-02-21T09:31:55.4096601Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s
2026-02-21T09:32:02.6039770Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 13.8 configs/s
2026-02-21T09:32:02.6050619Z [42s] Adaptive compile timeout: 30s (90% percentile=7.2s, bounds=[30.0s, 30s])
2026-02-21T09:32:03.4351127Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1192.4 configs/s
2026-02-21T09:32:03.4968845Z [43s] Initial random population of 100, 5 starting points: 
2026-02-21T09:32:03.4974024Z error=12
2026-02-21T09:32:03.4978448Z timeout=1
2026-02-21T09:32:03.4983201Z ok=87
2026-02-21T09:32:03.4987255Z min=0.0430
2026-02-21T09:32:03.4991950Z mid=0.4362
2026-02-21T09:32:03.4993986Z max=128.0870
2026-02-21T09:32:03.4994224Z best={'block_sizes': [1, 8192],
2026-02-21T09:32:03.4999738Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:32:03.5001936Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:32:03.5002172Z  'num_sm_multiplier': 2,
2026-02-21T09:32:03.5002341Z  'num_stages': 7,
2026-02-21T09:32:03.5002481Z  'num_warps': 32,
2026-02-21T09:32:03.5002652Z  'pid_type': 'persistent_blocked',
2026-02-21T09:32:03.5002841Z  'range_flattens': [True, True],
2026-02-21T09:32:03.5003029Z  'range_multi_buffers': [False, None],
2026-02-21T09:32:03.5003219Z  'range_num_stages': [4, 3],
2026-02-21T09:32:03.5003383Z  'range_unroll_factors': [2, 3],
2026-02-21T09:32:03.5003573Z  'range_warp_specializes': [False, False]}
2026-02-21T09:32:03.5003788Z [43s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:32:04.7300787Z [44s] Generation 1 starting: 92 neighbors, 5 active search path(s)
2026-02-21T09:32:42.9542244Z [82s] Timeout after 30s compiling Config(block_sizes=[2, 8192], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'last'], maxnreg=32, num_sm_multiplier=64, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[True, True], range_num_stages=[2, 4], range_unroll_factors=[3, 4], range_warp_specializes=[False, None])
2026-02-21T09:32:42.9555681Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 96/96 0.5 configs/s
2026-02-21T09:32:48.4066699Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 96/96 17.8 configs/s
2026-02-21T09:32:48.9079767Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1973.5         
2026-02-21T09:32:48.9084151Z                                                                  configs/s      
2026-02-21T09:32:48.9561510Z [88s] Generation 1 complete: 
2026-02-21T09:32:48.9566830Z error=4
2026-02-21T09:32:48.9568549Z timeout=1
2026-02-21T09:32:48.9568743Z ok=93
2026-02-21T09:32:48.9572784Z min=0.0246
2026-02-21T09:32:48.9577396Z mid=0.0512
2026-02-21T09:32:48.9579903Z max=0.3666
2026-02-21T09:32:48.9580122Z best={'block_sizes': [1, 8192],
2026-02-21T09:32:48.9580472Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:32:48.9580753Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:32:48.9580952Z  'num_stages': 7,
2026-02-21T09:32:48.9584469Z  'num_warps': 8,
2026-02-21T09:32:48.9585802Z  'pid_type': 'flat',
2026-02-21T09:32:48.9586071Z  'range_flattens': [None, True],
2026-02-21T09:32:48.9586275Z  'range_multi_buffers': [None, None],
2026-02-21T09:32:48.9590419Z  'range_num_stages': [0, 3],
2026-02-21T09:32:48.9594413Z  'range_unroll_factors': [0, 3],
2026-02-21T09:32:48.9598841Z  'range_warp_specializes': [None, False]}
2026-02-21T09:32:48.9602656Z [88s] Fitting surrogate: 198 points, 198 targets
2026-02-21T09:32:50.1290092Z [89s] Generation 2 starting: 78 neighbors, 5 active search path(s)
2026-02-21T09:33:18.3520177Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81/81 0.9 configs/s
2026-02-21T09:33:23.7941436Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 81/81 16.4 configs/s
2026-02-21T09:33:25.5174905Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 587.2         
2026-02-21T09:33:25.5176204Z                                                                   configs/s     
2026-02-21T09:33:25.6429952Z [125s] Generation 2 complete: 
2026-02-21T09:33:25.6434919Z ok=84
2026-02-21T09:33:25.6439318Z min=0.0245
2026-02-21T09:33:25.6440763Z mid=0.0420
2026-02-21T09:33:25.6440917Z max=1.0669
2026-02-21T09:33:25.6441064Z best={'block_sizes': [1, 8192],
2026-02-21T09:33:25.6441309Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:33:25.6441649Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:33:25.6441847Z  'num_stages': 7,
2026-02-21T09:33:25.6442000Z  'num_warps': 4,
2026-02-21T09:33:25.6442150Z  'pid_type': 'flat',
2026-02-21T09:33:25.6442307Z  'range_flattens': [None, True],
2026-02-21T09:33:25.6442523Z  'range_multi_buffers': [None, None],
2026-02-21T09:33:25.6442722Z  'range_num_stages': [0, 3],
2026-02-21T09:33:25.6442892Z  'range_unroll_factors': [0, 3],
2026-02-21T09:33:25.6443070Z  'range_warp_specializes': [None, False]}
2026-02-21T09:33:25.6447857Z [125s] Fitting surrogate: 282 points, 282 targets
2026-02-21T09:33:26.5646264Z [126s] Generation 3 starting: 65 neighbors, 5 active search path(s)
2026-02-21T09:33:59.3393779Z [159s] Timeout after 30s compiling Config(block_sizes=[4, 8192], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=8, num_stages=8, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[3, 4], range_unroll_factors=[3, 1], range_warp_specializes=[False, None])
2026-02-21T09:33:59.3411282Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 0.4 configs/s
2026-02-21T09:34:03.4509986Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 17.0 configs/s
2026-02-21T09:34:05.5382834Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 485.7         
2026-02-21T09:34:05.5384077Z                                                                   configs/s     
2026-02-21T09:34:05.7028111Z [165s] Generation 3 complete: 
2026-02-21T09:34:05.7032540Z timeout=1
2026-02-21T09:34:05.7033952Z ok=69
2026-02-21T09:34:05.7034127Z min=0.0226
2026-02-21T09:34:05.7034255Z mid=0.0369
2026-02-21T09:34:05.7034387Z max=0.2785
2026-02-21T09:34:05.7034529Z best={'block_sizes': [1, 8192],
2026-02-21T09:34:05.7034760Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T09:34:05.7035007Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:34:05.7035193Z  'num_stages': 1,
2026-02-21T09:34:05.7035341Z  'num_warps': 4,
2026-02-21T09:34:05.7035481Z  'pid_type': 'flat',
2026-02-21T09:34:05.7035643Z  'range_flattens': [None, True],
2026-02-21T09:34:05.7035816Z  'range_multi_buffers': [None, True],
2026-02-21T09:34:05.7035999Z  'range_num_stages': [0, 4],
2026-02-21T09:34:05.7036508Z  'range_unroll_factors': [0, 4],
2026-02-21T09:34:05.7036708Z  'range_warp_specializes': [None, None]}
2026-02-21T09:34:05.7044895Z [165s] Fitting surrogate: 352 points, 352 targets
2026-02-21T09:34:06.4904663Z [166s] Generation 4 starting: 50 neighbors, 4 active search path(s)
2026-02-21T09:34:13.7454526Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 3.7 configs/s
2026-02-21T09:34:16.8714810Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 16.9 configs/s
2026-02-21T09:34:18.9486500Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 487.6         
2026-02-21T09:34:18.9491999Z                                                                   configs/s     
2026-02-21T09:34:19.1199298Z [178s] Generation 4 complete: 
2026-02-21T09:34:19.1204541Z ok=54
2026-02-21T09:34:19.1206987Z min=0.0225
2026-02-21T09:34:19.1207206Z mid=0.0327
2026-02-21T09:34:19.1207347Z max=0.1516
2026-02-21T09:34:19.1207501Z best={'block_sizes': [1, 8192],
2026-02-21T09:34:19.1207820Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T09:34:19.1213424Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:34:19.1215702Z  'num_stages': 1,
2026-02-21T09:34:19.1215910Z  'num_warps': 4,
2026-02-21T09:34:19.1216078Z  'pid_type': 'flat',
2026-02-21T09:34:19.1216272Z  'range_flattens': [None, True],
2026-02-21T09:34:19.1216472Z  'range_multi_buffers': [None, True],
2026-02-21T09:34:19.1216682Z  'range_num_stages': [0, 4],
2026-02-21T09:34:19.1216857Z  'range_unroll_factors': [0, 4],
2026-02-21T09:34:19.1217059Z  'range_warp_specializes': [None, None]}
2026-02-21T09:34:19.1217371Z [178s] Fitting surrogate: 406 points, 406 targets
2026-02-21T09:34:19.7963297Z [179s] Generation 5 starting: 22 neighbors, 3 active search path(s)
2026-02-21T09:34:23.8351099Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 6.0 configs/s
2026-02-21T09:34:25.2324425Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 17.0 configs/s
2026-02-21T09:34:26.6658443Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 704.8 configs/s
2026-02-21T09:34:26.7836372Z [186s] Generation 5 complete: 
2026-02-21T09:34:26.7840267Z ok=25
2026-02-21T09:34:26.7845392Z min=0.0225
2026-02-21T09:34:26.7850619Z mid=0.0266
2026-02-21T09:34:26.7855112Z max=0.0451
2026-02-21T09:34:26.7858928Z best={'block_sizes': [1, 8192],
2026-02-21T09:34:26.7860395Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T09:34:26.7860679Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:34:26.7860879Z  'num_stages': 6,
2026-02-21T09:34:26.7861035Z  'num_warps': 4,
2026-02-21T09:34:26.7861178Z  'pid_type': 'flat',
2026-02-21T09:34:26.7861349Z  'range_flattens': [None, True],
2026-02-21T09:34:26.7861531Z  'range_multi_buffers': [None, False],
2026-02-21T09:34:26.7861799Z  'range_num_stages': [0, 3],
2026-02-21T09:34:26.7861994Z  'range_unroll_factors': [0, 2],
2026-02-21T09:34:26.7862551Z  'range_warp_specializes': [None, False]}
2026-02-21T09:34:26.7862784Z [186s] Fitting surrogate: 431 points, 431 targets
2026-02-21T09:34:27.2443784Z [186s] Generation 6 starting: 26 neighbors, 2 active search path(s)
2026-02-21T09:34:29.7967700Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26/26 27.3 configs/s
2026-02-21T09:34:31.3678451Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 26/26 17.0 configs/s
2026-02-21T09:34:33.1223756Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 577.0 configs/s
2026-02-21T09:34:33.2706903Z [192s] Generation 6 complete: 
2026-02-21T09:34:33.2712715Z ok=28
2026-02-21T09:34:33.2713265Z min=0.0225
2026-02-21T09:34:33.2713456Z mid=0.0246
2026-02-21T09:34:33.2713611Z max=0.0511
2026-02-21T09:34:33.2713765Z best={'block_sizes': [1, 8192],
2026-02-21T09:34:33.2714405Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T09:34:33.2714718Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:34:33.2714928Z  'num_stages': 6,
2026-02-21T09:34:33.2715091Z  'num_warps': 4,
2026-02-21T09:34:33.2715239Z  'pid_type': 'flat',
2026-02-21T09:34:33.2715411Z  'range_flattens': [None, True],
2026-02-21T09:34:33.2715602Z  'range_multi_buffers': [None, False],
2026-02-21T09:34:33.2715806Z  'range_num_stages': [0, 2],
2026-02-21T09:34:33.2715980Z  'range_unroll_factors': [0, 2],
2026-02-21T09:34:33.2716177Z  'range_warp_specializes': [None, False]}
2026-02-21T09:34:33.2723850Z [192s] Fitting surrogate: 459 points, 459 targets
2026-02-21T09:34:33.7238276Z [193s] Generation 7 starting: 21 neighbors, 2 active search path(s)
2026-02-21T09:34:35.9360708Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 12.4 configs/s
2026-02-21T09:34:37.1948743Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 17.3 configs/s
2026-02-21T09:34:38.7149118Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 665.7 configs/s
2026-02-21T09:34:38.8373681Z [198s] Generation 7 complete: 
2026-02-21T09:34:38.8379081Z ok=23
2026-02-21T09:34:38.8386038Z min=0.0225
2026-02-21T09:34:38.8387518Z mid=0.0245
2026-02-21T09:34:38.8387676Z max=0.0326
2026-02-21T09:34:38.8387829Z best={'block_sizes': [1, 8192],
2026-02-21T09:34:38.8388057Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T09:34:38.8388303Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:34:38.8388497Z  'num_stages': 3,
2026-02-21T09:34:38.8388637Z  'num_warps': 4,
2026-02-21T09:34:38.8388785Z  'pid_type': 'flat',
2026-02-21T09:34:38.8388940Z  'range_flattens': [None, True],
2026-02-21T09:34:38.8389124Z  'range_multi_buffers': [None, None],
2026-02-21T09:34:38.8389305Z  'range_num_stages': [0, 0],
2026-02-21T09:34:38.8389474Z  'range_unroll_factors': [0, 0],
2026-02-21T09:34:38.8389676Z  'range_warp_specializes': [None, True]}
2026-02-21T09:34:38.8390353Z [198s] Fitting surrogate: 482 points, 482 targets
2026-02-21T09:34:39.2881855Z [198s] Generation 8 starting: 18 neighbors, 2 active search path(s)
2026-02-21T09:34:41.3939149Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 10.5 configs/s
2026-02-21T09:34:42.4842337Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 18/18 17.2 configs/s
2026-02-21T09:34:43.7472351Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 799.1 configs/s
2026-02-21T09:34:43.8561805Z [203s] Generation 8 complete: 
2026-02-21T09:34:43.8565055Z ok=20
2026-02-21T09:34:43.8568943Z min=0.0226
2026-02-21T09:34:43.8573035Z mid=0.0227
2026-02-21T09:34:43.8577747Z max=0.0389
2026-02-21T09:34:43.8579372Z best={'block_sizes': [1, 8192],
2026-02-21T09:34:43.8580097Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T09:34:43.8580388Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:34:43.8585690Z  'num_stages': 3,
2026-02-21T09:34:43.8585939Z  'num_warps': 2,
2026-02-21T09:34:43.8586133Z  'pid_type': 'flat',
2026-02-21T09:34:43.8586361Z  'range_flattens': [None, True],
2026-02-21T09:34:43.8586582Z  'range_multi_buffers': [None, False],
2026-02-21T09:34:43.8586781Z  'range_num_stages': [0, 0],
2026-02-21T09:34:43.8586956Z  'range_unroll_factors': [0, 0],
2026-02-21T09:34:43.8587151Z  'range_warp_specializes': [None, True]}
2026-02-21T09:34:43.8587362Z [203s] Fitting surrogate: 502 points, 502 targets
2026-02-21T09:34:44.2294477Z [203s] Generation 9 starting: 9 neighbors, 1 active search path(s)
2026-02-21T09:34:45.4660912Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9/9 15.3 configs/s
2026-02-21T09:34:46.0085997Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 18.2 configs/s
2026-02-21T09:34:46.6495086Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1560.2 configs/s
2026-02-21T09:34:46.7098499Z [206s] Generation 9 complete: 
2026-02-21T09:34:46.7101877Z ok=10
2026-02-21T09:34:46.7106122Z min=0.0225
2026-02-21T09:34:46.7111021Z mid=0.0226
2026-02-21T09:34:46.7115428Z max=0.0226
2026-02-21T09:34:46.7118711Z best={'block_sizes': [1, 8192],
2026-02-21T09:34:46.7123288Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T09:34:46.7126722Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:34:46.7126976Z  'num_stages': 3,
2026-02-21T09:34:46.7127130Z  'num_warps': 1,
2026-02-21T09:34:46.7127275Z  'pid_type': 'flat',
2026-02-21T09:34:46.7127440Z  'range_flattens': [None, True],
2026-02-21T09:34:46.7127618Z  'range_multi_buffers': [None, False],
2026-02-21T09:34:46.7127810Z  'range_num_stages': [0, 0],
2026-02-21T09:34:46.7127984Z  'range_unroll_factors': [0, 0],
2026-02-21T09:34:46.7128179Z  'range_warp_specializes': [None, True]}
2026-02-21T09:34:46.7128405Z [206s] Fitting surrogate: 512 points, 512 targets
2026-02-21T09:34:47.0588155Z [206s] Generation 10 starting: 5 neighbors, 1 active search path(s)
2026-02-21T09:34:47.9636499Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6/6 11.0 configs/s
2026-02-21T09:34:48.3264448Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 6/6 19.2 configs/s
2026-02-21T09:34:48.7688407Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2239.6 configs/s
2026-02-21T09:34:48.8134253Z [208s] Generation 10 complete: 
2026-02-21T09:34:48.8137340Z ok=7
2026-02-21T09:34:48.8141943Z min=0.0225
2026-02-21T09:34:48.8143940Z mid=0.0225
2026-02-21T09:34:48.8144112Z max=0.0225
2026-02-21T09:34:48.8144261Z best={'block_sizes': [1, 8192],
2026-02-21T09:34:48.8149103Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T09:34:48.8151216Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:34:48.8151915Z  'num_stages': 3,
2026-02-21T09:34:48.8156129Z  'num_warps': 1,
2026-02-21T09:34:48.8160072Z  'pid_type': 'flat',
2026-02-21T09:34:48.8160360Z  'range_flattens': [None, True],
2026-02-21T09:34:48.8160593Z  'range_multi_buffers': [None, False],
2026-02-21T09:34:48.8164639Z  'range_num_stages': [0, 0],
2026-02-21T09:34:48.8167911Z  'range_unroll_factors': [0, 0],
2026-02-21T09:34:48.8172226Z  'range_warp_specializes': [None, True]}
2026-02-21T09:34:48.8177147Z [208s] Fitting surrogate: 519 points, 519 targets
2026-02-21T09:34:49.0893106Z [208s] Autotuning complete in 208.8s after searching 498 configs.
2026-02-21T09:34:49.0893530Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:34:49.0894779Z     @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T09:34:49.0895621Z 
2026-02-21T09:34:49.0895886Z [208s] Code of selected kernel: /tmp/torchinductor_root/ul/culrzb4qt45ddt4lkeeka2wtyrtxpfvhgey7flljwev7nwmsfbdj.py
2026-02-21T09:34:49.1115204Z from __future__ import annotations
2026-02-21T09:34:49.1116850Z 
2026-02-21T09:34:49.1117006Z import torch
2026-02-21T09:34:49.1117179Z import triton
2026-02-21T09:34:49.1117346Z import triton.language as tl
2026-02-21T09:34:49.1117560Z from torch._inductor.runtime import triton_helpers
2026-02-21T09:34:49.1117844Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T09:34:49.1118141Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:34:49.1118337Z 
2026-02-21T09:34:49.1118410Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T09:34:49.1118602Z _BLOCK_SIZE_1 = tl.constexpr(8192)
2026-02-21T09:34:49.1118723Z 
2026-02-21T09:34:49.1118802Z @triton.jit
2026-02-21T09:34:49.1118961Z def _helion_softmax_two_pass(x, out):
2026-02-21T09:34:49.1119221Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:34:49.1119483Z     pid_0 = tl.program_id(0)
2026-02-21T09:34:49.1119653Z     offset_0 = pid_0
2026-02-21T09:34:49.1119839Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T09:34:49.1120133Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:34:49.1120435Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T09:34:49.1120719Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:34:49.1120979Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T09:34:49.1121255Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:34:49.1121631Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T09:34:49.1121921Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T09:34:49.1122183Z     # src[softmax.py:82-89]: ...
2026-02-21T09:34:49.1122543Z     for offset_2 in tl.range(0, 5504, _BLOCK_SIZE_1, warp_specialize=True, disallow_acc_multi_buffer=True, flatten=True):
2026-02-21T09:34:49.1122975Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:34:49.1123225Z         mask_1 = indices_2 < 5504
2026-02-21T09:34:49.1123417Z         mi_copy = mi
2026-02-21T09:34:49.1123567Z         di_copy = di
2026-02-21T09:34:49.1123726Z         mi_copy_0 = mi_copy
2026-02-21T09:34:49.1123892Z         di_copy_0 = di_copy
2026-02-21T09:34:49.1124088Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T09:34:49.1124477Z         values = tl.load(x + (indices_0[:, None] * 5504 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first')
2026-02-21T09:34:49.1124874Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T09:34:49.1125306Z         _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16))
2026-02-21T09:34:49.1125950Z         local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16)
2026-02-21T09:34:49.1126207Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T09:34:49.1126453Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T09:34:49.1126662Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T09:34:49.1126925Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:34:49.1127162Z         v_2 = mi_copy_0 - v_1
2026-02-21T09:34:49.1127339Z         v_3 = libdevice.exp(v_2)
2026-02-21T09:34:49.1127513Z         v_4 = di_copy_0 * v_3
2026-02-21T09:34:49.1127698Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T09:34:49.1127902Z         subscript = v_1[:, None]
2026-02-21T09:34:49.1128074Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T09:34:49.1128320Z         v_6 = v_5 - subscript
2026-02-21T09:34:49.1128531Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:34:49.1128802Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T09:34:49.1129013Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T09:34:49.1129201Z         v_7 = libdevice.exp(v_6)
2026-02-21T09:34:49.1129525Z         _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32))
2026-02-21T09:34:49.1129882Z         sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32)
2026-02-21T09:34:49.1130087Z         di = v_4 + sum_1
2026-02-21T09:34:49.1130245Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T09:34:49.1130424Z         mi = v_1
2026-02-21T09:34:49.1130621Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:34:49.1130896Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T09:34:49.1131195Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:34:49.1131661Z     for offset_2 in tl.range(0, 5504, _BLOCK_SIZE_1, warp_specialize=True, disallow_acc_multi_buffer=True, flatten=True):
2026-02-21T09:34:49.1132054Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:34:49.1132285Z         mask_2 = indices_2 < 5504
2026-02-21T09:34:49.1132455Z         mi_copy_1 = mi
2026-02-21T09:34:49.1132600Z         di_copy_1 = di
2026-02-21T09:34:49.1132750Z         mi_copy_1_0 = mi_copy_1
2026-02-21T09:34:49.1132922Z         di_copy_1_0 = di_copy_1
2026-02-21T09:34:49.1133104Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T09:34:49.1133479Z         values_1 = tl.load(x + (indices_0[:, None] * 5504 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first')
2026-02-21T09:34:49.1133907Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:34:49.1134197Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T09:34:49.1134390Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T09:34:49.1134608Z         v_10 = v_9 - subscript_1
2026-02-21T09:34:49.1134784Z         v_11 = libdevice.exp(v_10)
2026-02-21T09:34:49.1134957Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T09:34:49.1135143Z         v_12 = v_11 / subscript_2
2026-02-21T09:34:49.1135311Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T09:34:49.1135582Z         tl.store(out + (indices_0[:, None] * 5504 + indices_2[None, :] * 1), v_13, mask_2[None, :])
2026-02-21T09:34:49.1135794Z 
2026-02-21T09:34:49.1135921Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T09:34:49.1136162Z     """
2026-02-21T09:34:49.1136370Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T09:34:49.1136666Z     This version uses fewer passes but is less numerically stable.
2026-02-21T09:34:49.1136887Z     Args:
2026-02-21T09:34:49.1137047Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T09:34:49.1137374Z     Returns:
2026-02-21T09:34:49.1137548Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T09:34:49.1137762Z     """
2026-02-21T09:34:49.1137898Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T09:34:49.1138077Z     m, n = x.size()
2026-02-21T09:34:49.1138248Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T09:34:49.1138447Z     out = torch.empty_like(x)
2026-02-21T09:34:49.1138678Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:34:49.1138990Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:34:49.1139303Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:34:49.1139536Z     # src[softmax.py:79-92]: ...
2026-02-21T09:34:49.1139790Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=3)
2026-02-21T09:34:49.1140124Z     # src[softmax.py:93]: return out
2026-02-21T09:34:49.1140295Z     return out
2026-02-21T09:34:50.2567440Z WARNING:tritonbench.utils.triton_op:Completed input ID 41:
2026-02-21T09:34:50.2571211Z (M, N)
2026-02-21T09:34:50.2575118Z ------------
2026-02-21T09:34:50.2577027Z (4096, 5504)
2026-02-21T09:34:50.2577162Z 
2026-02-21T09:34:50.2577626Z  45%|████▌     | 9/20 [25:55<36:36, 199.65s/it]WARNING:tritonbench.utils.triton_op:Running input ID 46:
2026-02-21T09:34:50.2582171Z (M, N)
2026-02-21T09:34:50.2584097Z ------------
2026-02-21T09:34:50.2584273Z (4096, 6144)
2026-02-21T09:34:50.2584606Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T09:34:51.4635234Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T09:34:52.9796018Z INFO:tritonbench.utils.triton_op:Took 2.16ms to get benchmark function for torch_compile_softmax
2026-02-21T09:34:54.2088347Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:34:54.2092638Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:34:54.2097017Z               'dtype': 'torch.float16',
2026-02-21T09:34:54.2101403Z               'shape': (4096, 6144),
2026-02-21T09:34:54.2105605Z               'stride': (6144, 1)},),
2026-02-21T09:34:54.2110698Z   'kwargs': {}}
2026-02-21T09:34:54.2112426Z INFO:tritonbench.utils.triton_op:Took 2.46ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:34:54.3839239Z [0s] Autotune random seed: 2138408546
2026-02-21T09:34:54.4087714Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:35:27.9158603Z [33s] Timeout after 30s compiling Config(block_sizes=[128, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None])
2026-02-21T09:35:30.3274906Z [35s] Timeout after 30s compiling Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None])
2026-02-21T09:35:32.4106365Z [38s] Timeout after 30s compiling Config(block_sizes=[256, 2048], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_stages=6, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, None])
2026-02-21T09:35:32.4126949Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s
2026-02-21T09:35:39.7371496Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 13.6 configs/s
2026-02-21T09:35:39.7384161Z [45s] Adaptive compile timeout: 30s (90% percentile=7.9s, bounds=[30.0s, 30s])
2026-02-21T09:35:40.4668087Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1355.7 configs/s
2026-02-21T09:35:40.5262829Z [46s] Initial random population of 100, 5 starting points: 
2026-02-21T09:35:40.5267129Z error=10
2026-02-21T09:35:40.5268623Z timeout=3
2026-02-21T09:35:40.5268964Z ok=87
2026-02-21T09:35:40.5274437Z min=0.0430
2026-02-21T09:35:40.5276415Z mid=0.4813
2026-02-21T09:35:40.5276685Z max=141.7728
2026-02-21T09:35:40.5276915Z best={'block_sizes': [1, 8192],
2026-02-21T09:35:40.5277267Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:35:40.5277607Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:35:40.5277977Z  'num_sm_multiplier': 2,
2026-02-21T09:35:40.5282685Z  'num_stages': 7,
2026-02-21T09:35:40.5285097Z  'num_warps': 32,
2026-02-21T09:35:40.5290775Z  'pid_type': 'persistent_blocked',
2026-02-21T09:35:40.5294880Z  'range_flattens': [True, True],
2026-02-21T09:35:40.5299360Z  'range_multi_buffers': [False, None],
2026-02-21T09:35:40.5300834Z  'range_num_stages': [4, 3],
2026-02-21T09:35:40.5301179Z  'range_unroll_factors': [2, 3],
2026-02-21T09:35:40.5305606Z  'range_warp_specializes': [False, False]}
2026-02-21T09:35:40.5310062Z [46s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:35:41.6772300Z [47s] Generation 1 starting: 85 neighbors, 5 active search path(s)
2026-02-21T09:35:52.1949809Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 3.4 configs/s
2026-02-21T09:35:57.5680201Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 16.9 configs/s
2026-02-21T09:36:01.9364355Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 231.7 configs/s
2026-02-21T09:36:02.2228226Z [67s] Generation 1 complete: 
2026-02-21T09:36:02.2229611Z ok=91
2026-02-21T09:36:02.2229872Z min=0.0328
2026-02-21T09:36:02.2230074Z mid=0.0471
2026-02-21T09:36:02.2230235Z max=0.2561
2026-02-21T09:36:02.2230441Z best={'block_sizes': [1, 2048],
2026-02-21T09:36:02.2230732Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:36:02.2231061Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:36:02.2231284Z  'num_stages': 5,
2026-02-21T09:36:02.2231498Z  'num_warps': 2,
2026-02-21T09:36:02.2231944Z  'pid_type': 'flat',
2026-02-21T09:36:02.2232185Z  'range_flattens': [None, False],
2026-02-21T09:36:02.2232424Z  'range_multi_buffers': [None, False],
2026-02-21T09:36:02.2232705Z  'range_num_stages': [0, 1],
2026-02-21T09:36:02.2232958Z  'range_unroll_factors': [0, 0],
2026-02-21T09:36:02.2233177Z  'range_warp_specializes': [None, False]}
2026-02-21T09:36:02.2246190Z [67s] Fitting surrogate: 191 points, 191 targets
2026-02-21T09:36:03.4733803Z [69s] Generation 2 starting: 75 neighbors, 5 active search path(s)
2026-02-21T09:36:27.1045441Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 0.8 configs/s
2026-02-21T09:36:32.4405724Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 14.9 configs/s
2026-02-21T09:36:34.0789594Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 617.5 configs/s
2026-02-21T09:36:34.2110422Z [99s] Generation 2 complete: 
2026-02-21T09:36:34.2114906Z error=2
2026-02-21T09:36:34.2116594Z ok=79
2026-02-21T09:36:34.2116854Z min=0.0246
2026-02-21T09:36:34.2117049Z mid=0.0389
2026-02-21T09:36:34.2117264Z max=0.3337
2026-02-21T09:36:34.2117491Z best={'block_sizes': [1, 8192],
2026-02-21T09:36:34.2117805Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:36:34.2118171Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:36:34.2118514Z  'num_stages': 1,
2026-02-21T09:36:34.2122420Z  'num_warps': 2,
2026-02-21T09:36:34.2124110Z  'pid_type': 'flat',
2026-02-21T09:36:34.2124852Z  'range_flattens': [None, True],
2026-02-21T09:36:34.2125149Z  'range_multi_buffers': [None, True],
2026-02-21T09:36:34.2125443Z  'range_num_stages': [0, 4],
2026-02-21T09:36:34.2125682Z  'range_unroll_factors': [0, 0],
2026-02-21T09:36:34.2125979Z  'range_warp_specializes': [None, True]}
2026-02-21T09:36:34.2126355Z [99s] Fitting surrogate: 272 points, 272 targets
2026-02-21T09:36:35.2439751Z [100s] Generation 3 starting: 68 neighbors, 5 active search path(s)
2026-02-21T09:36:49.5302953Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 1.5 configs/s
2026-02-21T09:36:53.7757689Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 16.7 configs/s
2026-02-21T09:36:57.0586826Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 309.6 configs/s
2026-02-21T09:36:57.3111252Z [122s] Generation 3 complete: 
2026-02-21T09:36:57.3112386Z ok=74
2026-02-21T09:36:57.3116972Z min=0.0246
2026-02-21T09:36:57.3118563Z mid=0.0348
2026-02-21T09:36:57.3118793Z max=0.8202
2026-02-21T09:36:57.3119029Z best={'block_sizes': [1, 8192],
2026-02-21T09:36:57.3119345Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:36:57.3119705Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:36:57.3119957Z  'num_stages': 1,
2026-02-21T09:36:57.3120183Z  'num_warps': 2,
2026-02-21T09:36:57.3120378Z  'pid_type': 'flat',
2026-02-21T09:36:57.3120625Z  'range_flattens': [None, True],
2026-02-21T09:36:57.3120892Z  'range_multi_buffers': [None, True],
2026-02-21T09:36:57.3121139Z  'range_num_stages': [0, 4],
2026-02-21T09:36:57.3121395Z  'range_unroll_factors': [0, 0],
2026-02-21T09:36:57.3121726Z  'range_warp_specializes': [None, True]}
2026-02-21T09:36:57.3129663Z [122s] Fitting surrogate: 346 points, 346 targets
2026-02-21T09:36:58.2905991Z [123s] Generation 4 starting: 58 neighbors, 5 active search path(s)
2026-02-21T09:37:24.8774984Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 0.4 configs/s
2026-02-21T09:37:28.4225667Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 16.8 configs/s
2026-02-21T09:37:32.0190194Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 345.9 configs/s
2026-02-21T09:37:32.2404809Z [157s] Generation 4 complete: 
2026-02-21T09:37:32.2409419Z error=1
2026-02-21T09:37:32.2413584Z ok=63
2026-02-21T09:37:32.2415218Z min=0.0246
2026-02-21T09:37:32.2415493Z mid=0.0328
2026-02-21T09:37:32.2415703Z max=0.3267
2026-02-21T09:37:32.2415935Z best={'block_sizes': [1, 8192],
2026-02-21T09:37:32.2416265Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:37:32.2416665Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:37:32.2416935Z  'num_stages': 7,
2026-02-21T09:37:32.2417226Z  'num_warps': 2,
2026-02-21T09:37:32.2417906Z  'pid_type': 'flat',
2026-02-21T09:37:32.2418170Z  'range_flattens': [None, None],
2026-02-21T09:37:32.2418457Z  'range_multi_buffers': [None, False],
2026-02-21T09:37:32.2418739Z  'range_num_stages': [0, 4],
2026-02-21T09:37:32.2419253Z  'range_unroll_factors': [0, 0],
2026-02-21T09:37:32.2419510Z  'range_warp_specializes': [None, True]}
2026-02-21T09:37:32.2419885Z [157s] Fitting surrogate: 410 points, 410 targets
2026-02-21T09:37:32.8735759Z [158s] Generation 5 starting: 34 neighbors, 3 active search path(s)
2026-02-21T09:37:38.4721882Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 5.7 configs/s
2026-02-21T09:37:40.6590302Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 16.8 configs/s
2026-02-21T09:37:42.3645156Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 593.8 configs/s
2026-02-21T09:37:42.4949329Z [168s] Generation 5 complete: 
2026-02-21T09:37:42.4950345Z ok=38
2026-02-21T09:37:42.4950534Z min=0.0246
2026-02-21T09:37:42.4950753Z mid=0.0328
2026-02-21T09:37:42.4950929Z max=0.0799
2026-02-21T09:37:42.4951151Z best={'block_sizes': [1, 8192],
2026-02-21T09:37:42.4951472Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:37:42.4952128Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:37:42.4952388Z  'num_stages': 7,
2026-02-21T09:37:42.4952625Z  'num_warps': 2,
2026-02-21T09:37:42.4952822Z  'pid_type': 'flat',
2026-02-21T09:37:42.4953064Z  'range_flattens': [None, None],
2026-02-21T09:37:42.4953328Z  'range_multi_buffers': [None, False],
2026-02-21T09:37:42.4953574Z  'range_num_stages': [0, 4],
2026-02-21T09:37:42.4953819Z  'range_unroll_factors': [0, 0],
2026-02-21T09:37:42.4954052Z  'range_warp_specializes': [None, True]}
2026-02-21T09:37:42.4965425Z [168s] Fitting surrogate: 448 points, 448 targets
2026-02-21T09:37:43.0046393Z [168s] Generation 6 starting: 23 neighbors, 2 active search path(s)
2026-02-21T09:37:46.8439267Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 4.6 configs/s
2026-02-21T09:37:48.2293482Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 17.1 configs/s
2026-02-21T09:37:49.7497817Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 666.4 configs/s
2026-02-21T09:37:49.8803524Z [175s] Generation 6 complete: 
2026-02-21T09:37:49.8807899Z ok=26
2026-02-21T09:37:49.8811358Z min=0.0246
2026-02-21T09:37:49.8813724Z mid=0.0266
2026-02-21T09:37:49.8813980Z max=0.0429
2026-02-21T09:37:49.8818704Z best={'block_sizes': [1, 8192],
2026-02-21T09:37:49.8820206Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:37:49.8820599Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:37:49.8820887Z  'num_stages': 7,
2026-02-21T09:37:49.8821132Z  'num_warps': 2,
2026-02-21T09:37:49.8821377Z  'pid_type': 'flat',
2026-02-21T09:37:49.8821666Z  'range_flattens': [None, None],
2026-02-21T09:37:49.8821938Z  'range_multi_buffers': [None, False],
2026-02-21T09:37:49.8822183Z  'range_num_stages': [0, 4],
2026-02-21T09:37:49.8822434Z  'range_unroll_factors': [0, 0],
2026-02-21T09:37:49.8822671Z  'range_warp_specializes': [None, True]}
2026-02-21T09:37:49.8822976Z [175s] Fitting surrogate: 474 points, 474 targets
2026-02-21T09:37:50.1749810Z [175s] Generation 7 starting: 9 neighbors, 1 active search path(s)
2026-02-21T09:37:51.8365551Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9/9 11.7 configs/s
2026-02-21T09:37:52.3719553Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 18.5 configs/s
2026-02-21T09:37:53.1120717Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1355.9 configs/s
2026-02-21T09:37:53.1825907Z [178s] Generation 7 complete: 
2026-02-21T09:37:53.1827215Z ok=11
2026-02-21T09:37:53.1827491Z min=0.0246
2026-02-21T09:37:53.1827746Z mid=0.0255
2026-02-21T09:37:53.1827974Z max=0.0287
2026-02-21T09:37:53.1828254Z best={'block_sizes': [1, 8192],
2026-02-21T09:37:53.1828614Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:37:53.1829074Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:37:53.1829362Z  'num_stages': 7,
2026-02-21T09:37:53.1829638Z  'num_warps': 2,
2026-02-21T09:37:53.1829882Z  'pid_type': 'flat',
2026-02-21T09:37:53.1830163Z  'range_flattens': [None, None],
2026-02-21T09:37:53.1830477Z  'range_multi_buffers': [None, False],
2026-02-21T09:37:53.1830770Z  'range_num_stages': [0, 3],
2026-02-21T09:37:53.1831073Z  'range_unroll_factors': [0, 1],
2026-02-21T09:37:53.1831355Z  'range_warp_specializes': [None, True]}
2026-02-21T09:37:53.1840353Z [178s] Fitting surrogate: 485 points, 485 targets
2026-02-21T09:37:53.5017425Z [179s] Generation 8 starting: 10 neighbors, 1 active search path(s)
2026-02-21T09:37:55.1352032Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 14.1 configs/s
2026-02-21T09:37:55.7433556Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 17.8 configs/s
2026-02-21T09:37:56.4072947Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1506.2 configs/s
2026-02-21T09:37:56.4713596Z [182s] Generation 8 complete: 
2026-02-21T09:37:56.4715018Z ok=11
2026-02-21T09:37:56.4715317Z min=0.0246
2026-02-21T09:37:56.4720856Z mid=0.0247
2026-02-21T09:37:56.4725558Z max=0.0369
2026-02-21T09:37:56.4727206Z best={'block_sizes': [1, 8192],
2026-02-21T09:37:56.4727610Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:37:56.4732990Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:37:56.4737858Z  'num_stages': 7,
2026-02-21T09:37:56.4740017Z  'num_warps': 4,
2026-02-21T09:37:56.4740383Z  'pid_type': 'flat',
2026-02-21T09:37:56.4740659Z  'range_flattens': [None, None],
2026-02-21T09:37:56.4744753Z  'range_multi_buffers': [None, False],
2026-02-21T09:37:56.4748810Z  'range_num_stages': [0, 3],
2026-02-21T09:37:56.4752923Z  'range_unroll_factors': [0, 1],
2026-02-21T09:37:56.4757019Z  'range_warp_specializes': [None, True]}
2026-02-21T09:37:56.4761416Z [182s] Fitting surrogate: 496 points, 496 targets
2026-02-21T09:37:56.8179646Z [182s] Generation 9 starting: 12 neighbors, 1 active search path(s)
2026-02-21T09:37:58.5249300Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 84.5 configs/s
2026-02-21T09:37:59.2318817Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 12/12 18.2 configs/s
2026-02-21T09:38:00.1049478Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1150.0 configs/s
2026-02-21T09:38:00.1837447Z [185s] Generation 9 complete: 
2026-02-21T09:38:00.1839113Z ok=13
2026-02-21T09:38:00.1839332Z min=0.0246
2026-02-21T09:38:00.1839540Z mid=0.0246
2026-02-21T09:38:00.1839712Z max=0.0327
2026-02-21T09:38:00.1839923Z best={'block_sizes': [1, 8192],
2026-02-21T09:38:00.1840180Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:38:00.1840474Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:38:00.1840731Z  'num_stages': 7,
2026-02-21T09:38:00.1840912Z  'num_warps': 2,
2026-02-21T09:38:00.1841118Z  'pid_type': 'flat',
2026-02-21T09:38:00.1841322Z  'range_flattens': [None, None],
2026-02-21T09:38:00.1841895Z  'range_multi_buffers': [None, False],
2026-02-21T09:38:00.1842130Z  'range_num_stages': [0, 3],
2026-02-21T09:38:00.1842373Z  'range_unroll_factors': [0, 1],
2026-02-21T09:38:00.1842594Z  'range_warp_specializes': [None, True]}
2026-02-21T09:38:00.1870003Z [185s] Fitting surrogate: 509 points, 509 targets
2026-02-21T09:38:00.5908799Z [186s] Generation 10 starting: 10 neighbors, 1 active search path(s)
2026-02-21T09:38:02.3770417Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 7.4 configs/s
2026-02-21T09:38:03.0261384Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 18.3 configs/s
2026-02-21T09:38:03.8289812Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1248.7 configs/s
2026-02-21T09:38:03.9038155Z [189s] Generation 10 complete: 
2026-02-21T09:38:03.9039837Z ok=12
2026-02-21T09:38:03.9040051Z min=0.0246
2026-02-21T09:38:03.9040254Z mid=0.0246
2026-02-21T09:38:03.9040420Z max=0.0287
2026-02-21T09:38:03.9040628Z best={'block_sizes': [1, 8192],
2026-02-21T09:38:03.9040879Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:38:03.9041173Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:38:03.9041402Z  'num_stages': 7,
2026-02-21T09:38:03.9041694Z  'num_warps': 2,
2026-02-21T09:38:03.9041909Z  'pid_type': 'flat',
2026-02-21T09:38:03.9042514Z  'range_flattens': [None, None],
2026-02-21T09:38:03.9042797Z  'range_multi_buffers': [None, False],
2026-02-21T09:38:03.9043046Z  'range_num_stages': [0, 3],
2026-02-21T09:38:03.9043281Z  'range_unroll_factors': [0, 1],
2026-02-21T09:38:03.9043504Z  'range_warp_specializes': [None, True]}
2026-02-21T09:38:03.9054399Z [189s] Fitting surrogate: 521 points, 521 targets
2026-02-21T09:38:04.2977308Z [189s] Generation 11 starting: 9 neighbors, 1 active search path(s)
2026-02-21T09:38:05.5441422Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9/9 10.2 configs/s
2026-02-21T09:38:06.0697829Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 9/9 18.9 configs/s
2026-02-21T09:38:06.7996180Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1372.9 configs/s
2026-02-21T09:38:06.8686475Z [192s] Generation 11 complete: 
2026-02-21T09:38:06.8688038Z ok=11
2026-02-21T09:38:06.8688293Z min=0.0246
2026-02-21T09:38:06.8688517Z mid=0.0246
2026-02-21T09:38:06.8688683Z max=0.0247
2026-02-21T09:38:06.8688894Z best={'block_sizes': [1, 8192],
2026-02-21T09:38:06.8689246Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:38:06.8693512Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:38:06.8695270Z  'num_stages': 7,
2026-02-21T09:38:06.8695548Z  'num_warps': 2,
2026-02-21T09:38:06.8699746Z  'pid_type': 'flat',
2026-02-21T09:38:06.8704334Z  'range_flattens': [None, None],
2026-02-21T09:38:06.8706204Z  'range_multi_buffers': [None, False],
2026-02-21T09:38:06.8706517Z  'range_num_stages': [0, 4],
2026-02-21T09:38:06.8706740Z  'range_unroll_factors': [0, 0],
2026-02-21T09:38:06.8707004Z  'range_warp_specializes': [None, True]}
2026-02-21T09:38:06.8714680Z [192s] Fitting surrogate: 532 points, 532 targets
2026-02-21T09:38:07.1599032Z [192s] Autotuning complete in 192.8s after searching 505 configs.
2026-02-21T09:38:07.1599805Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:38:07.1600809Z     @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=7, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T09:38:07.1601788Z 
2026-02-21T09:38:07.1602087Z [192s] Code of selected kernel: /tmp/torchinductor_root/lc/clcx5yb5v47k23m7k6353vegzyi3ciijakdyml5lbrqctuz55van.py
2026-02-21T09:38:07.1850783Z from __future__ import annotations
2026-02-21T09:38:07.1852174Z 
2026-02-21T09:38:07.1852410Z import torch
2026-02-21T09:38:07.1852658Z import triton
2026-02-21T09:38:07.1852861Z import triton.language as tl
2026-02-21T09:38:07.1853153Z from torch._inductor.runtime import triton_helpers
2026-02-21T09:38:07.1853456Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T09:38:07.1854262Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:38:07.1854460Z 
2026-02-21T09:38:07.1854581Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T09:38:07.1854799Z _BLOCK_SIZE_1 = tl.constexpr(8192)
2026-02-21T09:38:07.1854963Z 
2026-02-21T09:38:07.1855042Z @triton.jit
2026-02-21T09:38:07.1855229Z def _helion_softmax_two_pass(x, out):
2026-02-21T09:38:07.1855550Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:38:07.1855838Z     pid_0 = tl.program_id(0)
2026-02-21T09:38:07.1856069Z     offset_0 = pid_0
2026-02-21T09:38:07.1856314Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T09:38:07.1856642Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:38:07.1856993Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T09:38:07.1857295Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:38:07.1857779Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T09:38:07.1858114Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:38:07.1858424Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T09:38:07.1858749Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T09:38:07.1859018Z     # src[softmax.py:82-89]: ...
2026-02-21T09:38:07.1859426Z     for offset_2 in tl.range(0, 6144, _BLOCK_SIZE_1, warp_specialize=True, num_stages=4, disallow_acc_multi_buffer=True):
2026-02-21T09:38:07.1859880Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:38:07.1860155Z         mask_1 = indices_2 < 6144
2026-02-21T09:38:07.1860387Z         mi_copy = mi
2026-02-21T09:38:07.1860568Z         di_copy = di
2026-02-21T09:38:07.1860778Z         mi_copy_0 = mi_copy
2026-02-21T09:38:07.1860974Z         di_copy_0 = di_copy
2026-02-21T09:38:07.1861223Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T09:38:07.1861717Z         values = tl.load(x + (indices_0[:, None] * 6144 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first')
2026-02-21T09:38:07.1862180Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T09:38:07.1862659Z         _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16))
2026-02-21T09:38:07.1863092Z         local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16)
2026-02-21T09:38:07.1863420Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T09:38:07.1863701Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T09:38:07.1863992Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T09:38:07.1864315Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:38:07.1864597Z         v_2 = mi_copy_0 - v_1
2026-02-21T09:38:07.1864837Z         v_3 = libdevice.exp(v_2)
2026-02-21T09:38:07.1865042Z         v_4 = di_copy_0 * v_3
2026-02-21T09:38:07.1865298Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T09:38:07.1865538Z         subscript = v_1[:, None]
2026-02-21T09:38:07.1865783Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T09:38:07.1866001Z         v_6 = v_5 - subscript
2026-02-21T09:38:07.1866281Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:38:07.1866610Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T09:38:07.1866864Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T09:38:07.1867121Z         v_7 = libdevice.exp(v_6)
2026-02-21T09:38:07.1867482Z         _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32))
2026-02-21T09:38:07.1867903Z         sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32)
2026-02-21T09:38:07.1868139Z         di = v_4 + sum_1
2026-02-21T09:38:07.1868362Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T09:38:07.1868605Z         mi = v_1
2026-02-21T09:38:07.1868934Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:38:07.1869271Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T09:38:07.1869607Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:38:07.1870097Z     for offset_2 in tl.range(0, 6144, _BLOCK_SIZE_1, warp_specialize=True, num_stages=4, disallow_acc_multi_buffer=True):
2026-02-21T09:38:07.1870526Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:38:07.1870828Z         mask_2 = indices_2 < 6144
2026-02-21T09:38:07.1871059Z         mi_copy_1 = mi
2026-02-21T09:38:07.1871246Z         di_copy_1 = di
2026-02-21T09:38:07.1871464Z         mi_copy_1_0 = mi_copy_1
2026-02-21T09:38:07.1871701Z         di_copy_1_0 = di_copy_1
2026-02-21T09:38:07.1871952Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T09:38:07.1872425Z         values_1 = tl.load(x + (indices_0[:, None] * 6144 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first')
2026-02-21T09:38:07.1872932Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:38:07.1873276Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T09:38:07.1873498Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T09:38:07.1873756Z         v_10 = v_9 - subscript_1
2026-02-21T09:38:07.1873967Z         v_11 = libdevice.exp(v_10)
2026-02-21T09:38:07.1874213Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T09:38:07.1874436Z         v_12 = v_11 / subscript_2
2026-02-21T09:38:07.1874674Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T09:38:07.1874980Z         tl.store(out + (indices_0[:, None] * 6144 + indices_2[None, :] * 1), v_13, mask_2[None, :])
2026-02-21T09:38:07.1875239Z 
2026-02-21T09:38:07.1875387Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T09:38:07.1875685Z     """
2026-02-21T09:38:07.1875935Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T09:38:07.1876302Z     This version uses fewer passes but is less numerically stable.
2026-02-21T09:38:07.1876562Z     Args:
2026-02-21T09:38:07.1876790Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T09:38:07.1877022Z     Returns:
2026-02-21T09:38:07.1877268Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T09:38:07.1877544Z     """
2026-02-21T09:38:07.1877723Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T09:38:07.1877969Z     m, n = x.size()
2026-02-21T09:38:07.1878175Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T09:38:07.1878444Z     out = torch.empty_like(x)
2026-02-21T09:38:07.1878708Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:38:07.1879085Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:38:07.1879466Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:38:07.1879749Z     # src[softmax.py:79-92]: ...
2026-02-21T09:38:07.1880068Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=7)
2026-02-21T09:38:07.1880377Z     # src[softmax.py:93]: return out
2026-02-21T09:38:07.1880614Z     return out
2026-02-21T09:38:08.0912864Z WARNING:tritonbench.utils.triton_op:Completed input ID 46:
2026-02-21T09:38:08.0916959Z (M, N)
2026-02-21T09:38:08.0920905Z ------------
2026-02-21T09:38:08.0923119Z (4096, 6144)
2026-02-21T09:38:08.0923345Z 
2026-02-21T09:38:08.0927984Z  50%|█████     | 10/20 [29:13<33:10, 199.09s/it]
2026-02-21T09:38:08.0927984Z WARNING:tritonbench.utils.triton_op:Running input ID 51:
2026-02-21T09:38:08.0932216Z (M, N)
2026-02-21T09:38:08.0936078Z ------------
2026-02-21T09:38:08.0939933Z (4096, 6784)
2026-02-21T09:38:08.0944009Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T09:38:09.3003319Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T09:38:10.7909365Z INFO:tritonbench.utils.triton_op:Took 2.35ms to get benchmark function for torch_compile_softmax
2026-02-21T09:38:12.0944607Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:38:12.0946348Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:38:12.0946715Z               'dtype': 'torch.float16',
2026-02-21T09:38:12.0952281Z               'shape': (4096, 6784),
2026-02-21T09:38:12.0956621Z               'stride': (6784, 1)},),
2026-02-21T09:38:12.0958170Z   'kwargs': {}}
2026-02-21T09:38:12.0968572Z INFO:tritonbench.utils.triton_op:Took 2.74ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:38:12.2744377Z [0s] Autotune random seed: 2138408546
2026-02-21T09:38:12.3002456Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:38:49.3072244Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s
2026-02-21T09:38:51.3370744Z module {
2026-02-21T09:38:51.3374068Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:38:51.3375943Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:38:51.3376224Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:38:51.3376489Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:38:51.3376713Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:38:51.3377079Z     %cst = arith.constant dense<6784> : tensor<16x1xi32>
2026-02-21T09:38:51.3378720Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<16xf32>
2026-02-21T09:38:51.3379098Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<16xf32>
2026-02-21T09:38:51.3379368Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:38:51.3379630Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:38:51.3379861Z     %c6784_i32 = arith.constant 6784 : i32
2026-02-21T09:38:51.3380134Z     %c6784_i64 = arith.constant 6784 : i64
2026-02-21T09:38:51.3380392Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T09:38:51.3380756Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c6784_i32], [%c6784_i64, %c1_i64] : <f16>, <tensor<16x128xf16>>
2026-02-21T09:38:51.3381152Z     %1 = tt.get_program_id x : i32
2026-02-21T09:38:51.3381373Z     %2 = arith.addi %1, %c1_i32 : i32
2026-02-21T09:38:51.3381689Z     %3 = arith.minsi %2, %c256_i32 : i32
2026-02-21T09:38:51.3381931Z     scf.for %arg2 = %1 to %3 step %c1_i32  : i32 {
2026-02-21T09:38:51.3382207Z       %4 = arith.muli %arg2, %c16_i32 : i32
2026-02-21T09:38:51.3382510Z       %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T09:38:51.3382801Z       %6 = tt.splat %4 : i32 -> tensor<16xi32>
2026-02-21T09:38:51.3383062Z       %7 = arith.addi %6, %5 : tensor<16xi32>
2026-02-21T09:38:51.3383289Z       %c6656_i32 = arith.constant 6656 : i32
2026-02-21T09:38:51.3383537Z       %c512_i32 = arith.constant 512 : i32
2026-02-21T09:38:51.3383954Z       %8:2 = scf.for %arg3 = %c0_i32 to %c6656_i32 step %c512_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<16xf32>, tensor<16xf32>)  : i32 {
2026-02-21T09:38:51.3384491Z         %50 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:38:51.3384882Z         %51 = arith.extf %50 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:38:51.3385159Z         %52 = "tt.reduce"(%51) <{axis = 1 : i32}> ({
2026-02-21T09:38:51.3385423Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:38:51.3385689Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:38:51.3385920Z           tt.reduce.return %128 : f32
2026-02-21T09:38:51.3386174Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:38:51.3386438Z         %53 = arith.truncf %52 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:38:51.3386743Z         %54 = arith.extf %53 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:38:51.3387048Z         %55 = arith.cmpf ogt, %arg4, %54 : tensor<16xf32>
2026-02-21T09:38:51.3387633Z         %56 = arith.cmpf une, %arg4, %arg4 : tensor<16xf32>
2026-02-21T09:38:51.3387927Z         %57 = arith.ori %55, %56 : tensor<16xi1>
2026-02-21T09:38:51.3388205Z         %58 = arith.select %57, %arg4, %54 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:38:51.3388526Z         %59 = arith.subf %arg4, %58 : tensor<16xf32>
2026-02-21T09:38:51.3388937Z         %60 = tt.extern_elementwise %59 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:38:51.3389393Z         %61 = arith.mulf %arg5, %60 : tensor<16xf32>
2026-02-21T09:38:51.3389721Z         %62 = tt.expand_dims %58 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:38:51.3390135Z         %63 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:38:51.3390417Z         %64 = arith.subf %51, %63 : tensor<16x128xf32>
2026-02-21T09:38:51.3390926Z         %65 = tt.extern_elementwise %64 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:38:51.3391371Z         %66 = "tt.reduce"(%65) <{axis = 1 : i32}> ({
2026-02-21T09:38:51.3391637Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:38:51.3391893Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:38:51.3392128Z           tt.reduce.return %128 : f32
2026-02-21T09:38:51.3392388Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:38:51.3392629Z         %67 = arith.addf %61, %66 : tensor<16xf32>
2026-02-21T09:38:51.3392899Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T09:38:51.3393132Z         %68 = arith.muli %c128_i32, %c1_i32_4 : i32
2026-02-21T09:38:51.3393394Z         %69 = arith.addi %arg3, %68 : i32
2026-02-21T09:38:51.3393739Z         %70 = tt.descriptor_load %0[%4, %69] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:38:51.3394102Z         %71 = arith.extf %70 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:38:51.3394405Z         %72 = "tt.reduce"(%71) <{axis = 1 : i32}> ({
2026-02-21T09:38:51.3394638Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:38:51.3394892Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:38:51.3395126Z           tt.reduce.return %128 : f32
2026-02-21T09:38:51.3395385Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:38:51.3395678Z         %73 = arith.truncf %72 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:38:51.3395961Z         %74 = arith.extf %73 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:38:51.3396260Z         %75 = arith.cmpf ogt, %58, %74 : tensor<16xf32>
2026-02-21T09:38:51.3396511Z         %76 = arith.cmpf une, %58, %58 : tensor<16xf32>
2026-02-21T09:38:51.3396779Z         %77 = arith.ori %75, %76 : tensor<16xi1>
2026-02-21T09:38:51.3397061Z         %78 = arith.select %77, %58, %74 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:38:51.3397357Z         %79 = arith.subf %58, %78 : tensor<16xf32>
2026-02-21T09:38:51.3397776Z         %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:38:51.3398197Z         %81 = arith.mulf %67, %80 : tensor<16xf32>
2026-02-21T09:38:51.3398528Z         %82 = tt.expand_dims %78 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:38:51.3398868Z         %83 = tt.broadcast %82 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:38:51.3399184Z         %84 = arith.subf %71, %83 : tensor<16x128xf32>
2026-02-21T09:38:51.3399601Z         %85 = tt.extern_elementwise %84 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:38:51.3400041Z         %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({
2026-02-21T09:38:51.3400311Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:38:51.3400552Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:38:51.3400820Z           tt.reduce.return %128 : f32
2026-02-21T09:38:51.3401055Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:38:51.3401400Z         %87 = arith.addf %81, %86 : tensor<16xf32>
2026-02-21T09:38:51.3401670Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:38:51.3401952Z         %88 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T09:38:51.3402218Z         %89 = arith.addi %arg3, %88 : i32
2026-02-21T09:38:51.3402550Z         %90 = tt.descriptor_load %0[%4, %89] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:38:51.3402951Z         %91 = arith.extf %90 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:38:51.3403235Z         %92 = "tt.reduce"(%91) <{axis = 1 : i32}> ({
2026-02-21T09:38:51.3403508Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:38:51.3403740Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:38:51.3404009Z           tt.reduce.return %128 : f32
2026-02-21T09:38:51.3404268Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:38:51.3404544Z         %93 = arith.truncf %92 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:38:51.3404926Z         %94 = arith.extf %93 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:38:51.3405204Z         %95 = arith.cmpf ogt, %78, %94 : tensor<16xf32>
2026-02-21T09:38:51.3405496Z         %96 = arith.cmpf une, %78, %78 : tensor<16xf32>
2026-02-21T09:38:51.3405749Z         %97 = arith.ori %95, %96 : tensor<16xi1>
2026-02-21T09:38:51.3406058Z         %98 = arith.select %97, %78, %94 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:38:51.3406377Z         %99 = arith.subf %78, %98 : tensor<16xf32>
2026-02-21T09:38:51.3406793Z         %100 = tt.extern_elementwise %99 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:38:51.3407225Z         %101 = arith.mulf %87, %100 : tensor<16xf32>
2026-02-21T09:38:51.3407518Z         %102 = tt.expand_dims %98 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:38:51.3407885Z         %103 = tt.broadcast %102 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:38:51.3408176Z         %104 = arith.subf %91, %103 : tensor<16x128xf32>
2026-02-21T09:38:51.3408617Z         %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:38:51.3409063Z         %106 = "tt.reduce"(%105) <{axis = 1 : i32}> ({
2026-02-21T09:38:51.3409295Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:38:51.3409540Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:38:51.3409763Z           tt.reduce.return %128 : f32
2026-02-21T09:38:51.3410012Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:38:51.3410255Z         %107 = arith.addf %101, %106 : tensor<16xf32>
2026-02-21T09:38:51.3410513Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T09:38:51.3410760Z         %108 = arith.muli %c128_i32, %c3_i32 : i32
2026-02-21T09:38:51.3410988Z         %109 = arith.addi %arg3, %108 : i32
2026-02-21T09:38:51.3411334Z         %110 = tt.descriptor_load %0[%4, %109] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:38:51.3411725Z         %111 = arith.extf %110 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:38:51.3412032Z         %112 = "tt.reduce"(%111) <{axis = 1 : i32}> ({
2026-02-21T09:38:51.3412261Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:38:51.3412507Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:38:51.3412760Z           tt.reduce.return %128 : f32
2026-02-21T09:38:51.3412980Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:38:51.3413275Z         %113 = arith.truncf %112 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:38:51.3413565Z         %114 = arith.extf %113 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:38:51.3413867Z         %115 = arith.cmpf ogt, %98, %114 : tensor<16xf32>
2026-02-21T09:38:51.3414124Z         %116 = arith.cmpf une, %98, %98 : tensor<16xf32>
2026-02-21T09:38:51.3414393Z         %117 = arith.ori %115, %116 : tensor<16xi1>
2026-02-21T09:38:51.3414703Z         %118 = arith.select %117, %98, %114 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:38:51.3415046Z         %119 = arith.subf %98, %118 : tensor<16xf32>
2026-02-21T09:38:51.3415469Z         %120 = tt.extern_elementwise %119 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:38:51.3415867Z         %121 = arith.mulf %107, %120 : tensor<16xf32>
2026-02-21T09:38:51.3416190Z         %122 = tt.expand_dims %118 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:38:51.3416526Z         %123 = tt.broadcast %122 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:38:51.3416836Z         %124 = arith.subf %111, %123 : tensor<16x128xf32>
2026-02-21T09:38:51.3417281Z         %125 = tt.extern_elementwise %124 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:38:51.3417699Z         %126 = "tt.reduce"(%125) <{axis = 1 : i32}> ({
2026-02-21T09:38:51.3417950Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:38:51.3418222Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:38:51.3418477Z           tt.reduce.return %128 : f32
2026-02-21T09:38:51.3418730Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:38:51.3418973Z         %127 = arith.addf %121, %126 : tensor<16xf32>
2026-02-21T09:38:51.3419264Z         scf.yield %118, %127 : tensor<16xf32>, tensor<16xf32>
2026-02-21T09:38:51.3419524Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:38:51.3419888Z       %9 = tt.descriptor_load %0[%4, %c6656_i32] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:38:51.3420256Z       %10 = arith.extf %9 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:38:51.3420555Z       %11 = "tt.reduce"(%10) <{axis = 1 : i32}> ({
2026-02-21T09:38:51.3420811Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:38:51.3421032Z         %50 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:38:51.3421289Z         tt.reduce.return %50 : f32
2026-02-21T09:38:51.3421511Z       }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:38:51.3421835Z       %12 = arith.truncf %11 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:38:51.3422113Z       %13 = arith.extf %12 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:38:51.3422402Z       %14 = arith.cmpf ogt, %8#0, %13 : tensor<16xf32>
2026-02-21T09:38:51.3422656Z       %15 = arith.cmpf une, %8#0, %8#0 : tensor<16xf32>
2026-02-21T09:38:51.3422935Z       %16 = arith.ori %14, %15 : tensor<16xi1>
2026-02-21T09:38:51.3423231Z       %17 = arith.select %16, %8#0, %13 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:38:51.3423500Z       %18 = arith.subf %8#0, %17 : tensor<16xf32>
2026-02-21T09:38:51.3423916Z       %19 = tt.extern_elementwise %18 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:38:51.3424303Z       %20 = arith.mulf %8#1, %19 : tensor<16xf32>
2026-02-21T09:38:51.3424614Z       %21 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:38:51.3424964Z       %22 = tt.broadcast %21 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:38:51.3425240Z       %23 = arith.subf %10, %22 : tensor<16x128xf32>
2026-02-21T09:38:51.3425660Z       %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:38:51.3426059Z       %25 = "tt.reduce"(%24) <{axis = 1 : i32}> ({
2026-02-21T09:38:51.3426312Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:38:51.3426533Z         %50 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:38:51.3426783Z         tt.reduce.return %50 : f32
2026-02-21T09:38:51.3427031Z       }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:38:51.3427262Z       %26 = arith.addf %20, %25 : tensor<16xf32>
2026-02-21T09:38:51.3427522Z       %c6656_i32_2 = arith.constant 6656 : i32
2026-02-21T09:38:51.3427750Z       %c512_i32_3 = arith.constant 512 : i32
2026-02-21T09:38:51.3428045Z       scf.for %arg3 = %c0_i32 to %c6656_i32_2 step %c512_i32_3  : i32 {
2026-02-21T09:38:51.3428430Z         %50 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:38:51.3428756Z         %51 = tt.splat %arg3 : i32 -> tensor<128xi32>
2026-02-21T09:38:51.3429002Z         %52 = arith.addi %51, %50 : tensor<128xi32>
2026-02-21T09:38:51.3429321Z         %53 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:38:51.3429653Z         %54 = arith.muli %53, %cst : tensor<16x1xi32>
2026-02-21T09:38:51.3429958Z         %55 = tt.expand_dims %52 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:38:51.3430322Z         %56 = tt.broadcast %54 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:38:51.3430625Z         %57 = tt.broadcast %55 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:38:51.3430931Z         %58 = arith.addi %56, %57 : tensor<16x128xi32>
2026-02-21T09:38:51.3431240Z         %59 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3431678Z         %60 = tt.addptr %59, %58 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:38:51.3432112Z         %61 = tt.load %60 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3432464Z         %62 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:38:51.3432814Z         %63 = arith.extf %61 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:38:51.3433117Z         %64 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:38:51.3433425Z         %65 = arith.subf %63, %64 : tensor<16x128xf32>
2026-02-21T09:38:51.3433865Z         %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:38:51.3434320Z         %67 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:38:51.3434672Z         %68 = tt.broadcast %67 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:38:51.3434949Z         %69 = arith.divf %66, %68 : tensor<16x128xf32>
2026-02-21T09:38:51.3435255Z         %70 = arith.truncf %69 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:38:51.3435595Z         %71 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3435917Z         %72 = tt.addptr %71, %58 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:38:51.3436244Z         tt.store %72, %70 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3436492Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T09:38:51.3436748Z         %73 = arith.muli %c128_i32, %c1_i32_4 : i32
2026-02-21T09:38:51.3436980Z         %74 = arith.addi %arg3, %73 : i32
2026-02-21T09:38:51.3437283Z         %75 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:38:51.3437598Z         %76 = tt.splat %74 : i32 -> tensor<128xi32>
2026-02-21T09:38:51.3437836Z         %77 = arith.addi %76, %75 : tensor<128xi32>
2026-02-21T09:38:51.3438153Z         %78 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:38:51.3438458Z         %79 = arith.muli %78, %cst : tensor<16x1xi32>
2026-02-21T09:38:51.3438778Z         %80 = tt.expand_dims %77 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:38:51.3439109Z         %81 = tt.broadcast %79 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:38:51.3439435Z         %82 = tt.broadcast %80 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:38:51.3439735Z         %83 = arith.addi %81, %82 : tensor<16x128xi32>
2026-02-21T09:38:51.3440007Z         %84 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3440344Z         %85 = tt.addptr %84, %83 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:38:51.3440689Z         %86 = tt.load %85 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3441078Z         %87 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:38:51.3441413Z         %88 = arith.extf %86 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:38:51.3441849Z         %89 = tt.broadcast %87 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:38:51.3442155Z         %90 = arith.subf %88, %89 : tensor<16x128xf32>
2026-02-21T09:38:51.3442578Z         %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:38:51.3443081Z         %92 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:38:51.3443418Z         %93 = tt.broadcast %92 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:38:51.3443729Z         %94 = arith.divf %91, %93 : tensor<16x128xf32>
2026-02-21T09:38:51.3444042Z         %95 = arith.truncf %94 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:38:51.3444386Z         %96 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3444746Z         %97 = tt.addptr %96, %83 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:38:51.3445113Z         tt.store %97, %95 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3445401Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:38:51.3445640Z         %98 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T09:38:51.3445906Z         %99 = arith.addi %arg3, %98 : i32
2026-02-21T09:38:51.3446258Z         %100 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:38:51.3446566Z         %101 = tt.splat %99 : i32 -> tensor<128xi32>
2026-02-21T09:38:51.3446849Z         %102 = arith.addi %101, %100 : tensor<128xi32>
2026-02-21T09:38:51.3447151Z         %103 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:38:51.3447496Z         %104 = arith.muli %103, %cst : tensor<16x1xi32>
2026-02-21T09:38:51.3447819Z         %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:38:51.3448212Z         %106 = tt.broadcast %104 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:38:51.3448575Z         %107 = tt.broadcast %105 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:38:51.3448865Z         %108 = arith.addi %106, %107 : tensor<16x128xi32>
2026-02-21T09:38:51.3449175Z         %109 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3449497Z         %110 = tt.addptr %109, %108 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:38:51.3449873Z         %111 = tt.load %110 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3450260Z         %112 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:38:51.3450589Z         %113 = arith.extf %111 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:38:51.3450926Z         %114 = tt.broadcast %112 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:38:51.3451205Z         %115 = arith.subf %113, %114 : tensor<16x128xf32>
2026-02-21T09:38:51.3451686Z         %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:38:51.3452145Z         %117 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:38:51.3452506Z         %118 = tt.broadcast %117 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:38:51.3452817Z         %119 = arith.divf %116, %118 : tensor<16x128xf32>
2026-02-21T09:38:51.3453101Z         %120 = arith.truncf %119 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:38:51.3453445Z         %121 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3453774Z         %122 = tt.addptr %121, %108 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:38:51.3454109Z         tt.store %122, %120 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3454383Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T09:38:51.3454612Z         %123 = arith.muli %c128_i32, %c3_i32 : i32
2026-02-21T09:38:51.3454880Z         %124 = arith.addi %arg3, %123 : i32
2026-02-21T09:38:51.3455216Z         %125 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:38:51.3455537Z         %126 = tt.splat %124 : i32 -> tensor<128xi32>
2026-02-21T09:38:51.3455786Z         %127 = arith.addi %126, %125 : tensor<128xi32>
2026-02-21T09:38:51.3456105Z         %128 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:38:51.3456437Z         %129 = arith.muli %128, %cst : tensor<16x1xi32>
2026-02-21T09:38:51.3456737Z         %130 = tt.expand_dims %127 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:38:51.3457099Z         %131 = tt.broadcast %129 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:38:51.3457405Z         %132 = tt.broadcast %130 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:38:51.3457714Z         %133 = arith.addi %131, %132 : tensor<16x128xi32>
2026-02-21T09:38:51.3457992Z         %134 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3458383Z         %135 = tt.addptr %134, %133 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:38:51.3458758Z         %136 = tt.load %135 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3459114Z         %137 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:38:51.3459470Z         %138 = arith.extf %136 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:38:51.3459777Z         %139 = tt.broadcast %137 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:38:51.3460088Z         %140 = arith.subf %138, %139 : tensor<16x128xf32>
2026-02-21T09:38:51.3460530Z         %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:38:51.3460989Z         %142 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:38:51.3461345Z         %143 = tt.broadcast %142 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:38:51.3461666Z         %144 = arith.divf %141, %143 : tensor<16x128xf32>
2026-02-21T09:38:51.3461975Z         %145 = arith.truncf %144 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:38:51.3462281Z         %146 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3462642Z         %147 = tt.addptr %146, %133 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:38:51.3462966Z         tt.store %147, %145 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3463213Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:38:51.3463524Z       %27 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:38:51.3463824Z       %28 = tt.splat %c6656_i32_2 : i32 -> tensor<128xi32>
2026-02-21T09:38:51.3464108Z       %29 = arith.addi %28, %27 : tensor<128xi32>
2026-02-21T09:38:51.3464394Z       %30 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:38:51.3464717Z       %31 = arith.muli %30, %cst : tensor<16x1xi32>
2026-02-21T09:38:51.3465033Z       %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:38:51.3465358Z       %33 = tt.broadcast %31 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:38:51.3465686Z       %34 = tt.broadcast %32 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:38:51.3465957Z       %35 = arith.addi %33, %34 : tensor<16x128xi32>
2026-02-21T09:38:51.3466253Z       %36 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3466590Z       %37 = tt.addptr %36, %35 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:38:51.3466928Z       %38 = tt.load %37 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3467298Z       %39 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:38:51.3467622Z       %40 = arith.extf %38 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:38:51.3467947Z       %41 = tt.broadcast %39 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:38:51.3468278Z       %42 = arith.subf %40, %41 : tensor<16x128xf32>
2026-02-21T09:38:51.3468706Z       %43 = tt.extern_elementwise %42 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:38:51.3469213Z       %44 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:38:51.3469534Z       %45 = tt.broadcast %44 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:38:51.3469831Z       %46 = arith.divf %43, %45 : tensor<16x128xf32>
2026-02-21T09:38:51.3470107Z       %47 = arith.truncf %46 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:38:51.3470442Z       %48 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3470783Z       %49 = tt.addptr %48, %35 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:38:51.3471076Z       tt.store %49, %47 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:38:51.3471494Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T09:38:51.3471857Z     tt.return
2026-02-21T09:38:51.3472058Z   }
2026-02-21T09:38:51.3472224Z }
2026-02-21T09:38:51.3472345Z 
2026-02-21T09:38:51.3472416Z {-#
2026-02-21T09:38:51.3472611Z   external_resources: {
2026-02-21T09:38:51.3472809Z     mlir_reproducer: {
2026-02-21T09:38:51.3477217Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T09:38:51.3481821Z       disable_threading: false,
2026-02-21T09:38:51.3482059Z       verify_each: true
2026-02-21T09:38:51.3482245Z     }
2026-02-21T09:38:51.3482435Z   }
2026-02-21T09:38:51.3482594Z #-}
2026-02-21T09:38:51.3483085Z /tmp/torchinductor_root/7s/c7s4dthdwq5dlgog2pjjyhvd6ij663jx2xp7majmr2ptyyxqymyr.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:38:51.3484346Z /tmp/torchinductor_root/7s/c7s4dthdwq5dlgog2pjjyhvd6ij663jx2xp7majmr2ptyyxqymyr.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:38:51.3485359Z [39s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:38:51.3486467Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_sm_multiplier=32, num_stages=3, num_warps=32, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[2, 3], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T09:38:51.3487566Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:38:51.3487860Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:38:51.7437238Z module {
2026-02-21T09:38:51.7438905Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:38:51.7439460Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:38:51.7439782Z     %cst = arith.constant dense<0.000000e+00> : tensor<8x1024xf16>
2026-02-21T09:38:51.7440322Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T09:38:51.7440611Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:38:51.7440839Z     %c592_i32 = arith.constant 592 : i32
2026-02-21T09:38:51.7441134Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<8x1024xf32>
2026-02-21T09:38:51.7441447Z     %cst_1 = arith.constant dense<0xFC00> : tensor<8x1024xf16>
2026-02-21T09:38:51.7443469Z     %cst_2 = arith.constant dense<6784> : tensor<8x1xi32>
2026-02-21T09:38:51.7443775Z     %cst_3 = arith.constant dense<6784> : tensor<1024xi32>
2026-02-21T09:38:51.7444066Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32>
2026-02-21T09:38:51.7444378Z     %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32>
2026-02-21T09:38:51.7444631Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:38:51.7444886Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:38:51.7445109Z     %c6784_i32 = arith.constant 6784 : i32
2026-02-21T09:38:51.7445364Z     %c6784_i64 = arith.constant 6784 : i64
2026-02-21T09:38:51.7445627Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T09:38:51.7445982Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c6784_i32], [%c6784_i64, %c1_i64] : <f16>, <tensor<8x1024xf16>>
2026-02-21T09:38:51.7446394Z     %1 = tt.get_program_id x : i32
2026-02-21T09:38:51.7446644Z     scf.for %arg2 = %1 to %c512_i32 step %c592_i32  : i32 {
2026-02-21T09:38:51.7446943Z       %2 = arith.muli %arg2, %c8_i32 : i32
2026-02-21T09:38:51.7447230Z       %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T09:38:51.7447558Z       %4 = tt.splat %2 : i32 -> tensor<8xi32>
2026-02-21T09:38:51.7447834Z       %5 = arith.addi %4, %3 : tensor<8xi32>
2026-02-21T09:38:51.7448076Z       %c6144_i32 = arith.constant 6144 : i32
2026-02-21T09:38:51.7448366Z       %c3072_i32 = arith.constant 3072 : i32
2026-02-21T09:38:51.7448865Z       %6:2 = scf.for %arg3 = %c0_i32 to %c6144_i32 step %c3072_i32 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<8xf32>, tensor<8xf32>)  : i32 {
2026-02-21T09:38:51.7449355Z         %66 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:38:51.7449711Z         %67 = tt.splat %arg3 : i32 -> tensor<1024xi32>
2026-02-21T09:38:51.7450010Z         %68 = arith.addi %67, %66 : tensor<1024xi32>
2026-02-21T09:38:51.7450288Z         %69 = arith.cmpi slt, %68, %cst_3 : tensor<1024xi32>
2026-02-21T09:38:51.7450639Z         %70 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:38:51.7450948Z         %71 = arith.muli %70, %cst_2 : tensor<8x1xi32>
2026-02-21T09:38:51.7451292Z         %72 = tt.expand_dims %68 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:38:51.7451712Z         %73 = tt.broadcast %71 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:38:51.7452063Z         %74 = tt.broadcast %72 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:38:51.7452387Z         %75 = arith.addi %73, %74 : tensor<8x1024xi32>
2026-02-21T09:38:51.7452685Z         %76 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:38:51.7453169Z         %77 = tt.addptr %76, %75 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:38:51.7453531Z         %78 = tt.expand_dims %69 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:38:51.7453914Z         %79 = tt.broadcast %78 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:38:51.7454243Z         %80 = tt.load %77, %79, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:38:51.7454567Z         %81 = arith.select %79, %80, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:38:51.7454955Z         %82 = arith.extf %81 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:38:51.7455227Z         %83 = "tt.reduce"(%82) <{axis = 1 : i32}> ({
2026-02-21T09:38:51.7455493Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:38:51.7455724Z           %175 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:38:51.7455987Z           tt.reduce.return %175 : f32
2026-02-21T09:38:51.7456314Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:38:51.7456587Z         %84 = arith.truncf %83 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:38:51.7456888Z         %85 = arith.extf %84 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:38:51.7457154Z         %86 = arith.cmpf ogt, %arg4, %85 : tensor<8xf32>
2026-02-21T09:38:51.7457438Z         %87 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32>
2026-02-21T09:38:51.7457690Z         %88 = arith.ori %86, %87 : tensor<8xi1>
2026-02-21T09:38:51.7457999Z         %89 = arith.select %88, %arg4, %85 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:38:51.7458284Z         %90 = arith.subf %arg4, %89 : tensor<8xf32>
2026-02-21T09:38:51.7458720Z         %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:38:51.7459139Z         %92 = arith.mulf %arg5, %91 : tensor<8xf32>
2026-02-21T09:38:51.7459422Z         %93 = tt.expand_dims %89 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:38:51.7459776Z         %94 = arith.extf %80 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:38:51.7460086Z         %95 = tt.broadcast %93 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:38:51.7460390Z         %96 = arith.subf %94, %95 : tensor<8x1024xf32>
2026-02-21T09:38:51.7460816Z         %97 = tt.extern_elementwise %96 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:38:51.7461261Z         %98 = arith.select %79, %97, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:38:51.7461608Z         %99 = "tt.reduce"(%98) <{axis = 1 : i32}> ({
2026-02-21T09:38:51.7461847Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:38:51.7462096Z           %175 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:38:51.7462327Z           tt.reduce.return %175 : f32
2026-02-21T09:38:51.7462585Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:38:51.7462853Z         %100 = arith.addf %92, %99 : tensor<8xf32>
2026-02-21T09:38:51.7463091Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T09:38:51.7463355Z         %101 = arith.muli %c1024_i32, %c1_i32 : i32
2026-02-21T09:38:51.7463591Z         %102 = arith.addi %arg3, %101 : i32
2026-02-21T09:38:51.7463903Z         %103 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:38:51.7464202Z         %104 = tt.splat %102 : i32 -> tensor<1024xi32>
2026-02-21T09:38:51.7464482Z         %105 = arith.addi %104, %103 : tensor<1024xi32>
2026-02-21T09:38:51.7464773Z         %106 = arith.cmpi slt, %105, %cst_3 : tensor<1024xi32>
2026-02-21T09:38:51.7465076Z         %107 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:38:51.7465409Z         %108 = arith.muli %107, %cst_2 : tensor<8x1xi32>
2026-02-21T09:38:51.7465713Z         %109 = tt.expand_dims %105 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:38:51.7466079Z         %110 = tt.broadcast %108 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:38:51.7466454Z         %111 = tt.broadcast %109 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:38:51.7466768Z         %112 = arith.addi %110, %111 : tensor<8x1024xi32>
2026-02-21T09:38:51.7467078Z         %113 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:38:51.7467411Z         %114 = tt.addptr %113, %112 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:38:51.7467794Z         %115 = tt.expand_dims %106 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:38:51.7468139Z         %116 = tt.broadcast %115 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:38:51.7468469Z         %117 = tt.load %114, %116, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:38:51.7468814Z         %118 = arith.select %116, %117, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:38:51.7469147Z         %119 = arith.extf %118 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:38:51.7469521Z         %120 = "tt.reduce"(%119) <{axis = 1 : i32}> ({
2026-02-21T09:38:51.7469762Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:38:51.7470022Z           %175 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:38:51.7470260Z           tt.reduce.return %175 : f32
2026-02-21T09:38:51.7470517Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:38:51.7470788Z         %121 = arith.truncf %120 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:38:51.7471101Z         %122 = arith.extf %121 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:38:51.7482416Z         %123 = arith.cmpf ogt, %89, %122 : tensor<8xf32>
2026-02-21T09:38:51.7482753Z         %124 = arith.cmpf une, %89, %89 : tensor<8xf32>
2026-02-21T09:38:51.7483011Z         %125 = arith.ori %123, %124 : tensor<8xi1>
2026-02-21T09:38:51.7483317Z         %126 = arith.select %125, %89, %122 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:38:51.7483667Z         %127 = arith.subf %89, %126 : tensor<8xf32>
2026-02-21T09:38:51.7484094Z         %128 = tt.extern_elementwise %127 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:38:51.7484532Z         %129 = arith.mulf %100, %128 : tensor<8xf32>
2026-02-21T09:38:51.7484833Z         %130 = tt.expand_dims %126 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:38:51.7485176Z         %131 = arith.extf %117 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:38:51.7485488Z         %132 = tt.broadcast %130 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:38:51.7485800Z         %133 = arith.subf %131, %132 : tensor<8x1024xf32>
2026-02-21T09:38:51.7486253Z         %134 = tt.extern_elementwise %133 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:38:51.7486721Z         %135 = arith.select %116, %134, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:38:51.7487052Z         %136 = "tt.reduce"(%135) <{axis = 1 : i32}> ({
2026-02-21T09:38:51.7487290Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:38:51.7487550Z           %175 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:38:51.7487808Z           tt.reduce.return %175 : f32
2026-02-21T09:38:51.7488034Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:38:51.7488303Z         %137 = arith.addf %129, %136 : tensor<8xf32>
2026-02-21T09:38:51.7488536Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:38:51.7488790Z         %138 = arith.muli %c1024_i32, %c2_i32 : i32
2026-02-21T09:38:51.7489019Z         %139 = arith.addi %arg3, %138 : i32
2026-02-21T09:38:51.7489321Z         %140 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:38:51.7489619Z         %141 = tt.splat %139 : i32 -> tensor<1024xi32>
2026-02-21T09:38:51.7489891Z         %142 = arith.addi %141, %140 : tensor<1024xi32>
2026-02-21T09:38:51.7490176Z         %143 = arith.cmpi slt, %142, %cst_3 : tensor<1024xi32>
2026-02-21T09:38:51.7490480Z         %144 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:38:51.7490826Z         %145 = arith.muli %144, %cst_2 : tensor<8x1xi32>
2026-02-21T09:38:51.7491283Z         %146 = tt.expand_dims %142 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:38:51.7491703Z         %147 = tt.broadcast %145 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:38:51.7492054Z         %148 = tt.broadcast %146 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:38:51.7492352Z         %149 = arith.addi %147, %148 : tensor<8x1024xi32>
2026-02-21T09:38:51.7492678Z         %150 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:38:51.7493027Z         %151 = tt.addptr %150, %149 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:38:51.7493421Z         %152 = tt.expand_dims %143 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:38:51.7493777Z         %153 = tt.broadcast %152 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:38:51.7494122Z         %154 = tt.load %151, %153, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:38:51.7494545Z         %155 = arith.select %153, %154, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:38:51.7494885Z         %156 = arith.extf %155 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:38:51.7495197Z         %157 = "tt.reduce"(%156) <{axis = 1 : i32}> ({
2026-02-21T09:38:51.7495437Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:38:51.7495706Z           %175 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:38:51.7495955Z           tt.reduce.return %175 : f32
2026-02-21T09:38:51.7496224Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:38:51.7496533Z         %158 = arith.truncf %157 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:38:51.7496833Z         %159 = arith.extf %158 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:38:51.7497137Z         %160 = arith.cmpf ogt, %126, %159 : tensor<8xf32>
2026-02-21T09:38:51.7497407Z         %161 = arith.cmpf une, %126, %126 : tensor<8xf32>
2026-02-21T09:38:51.7497688Z         %162 = arith.ori %160, %161 : tensor<8xi1>
2026-02-21T09:38:51.7497975Z         %163 = arith.select %162, %126, %159 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:38:51.7498295Z         %164 = arith.subf %126, %163 : tensor<8xf32>
2026-02-21T09:38:51.7498738Z         %165 = tt.extern_elementwise %164 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:38:51.7499162Z         %166 = arith.mulf %137, %165 : tensor<8xf32>
2026-02-21T09:38:51.7499496Z         %167 = tt.expand_dims %163 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:38:51.7499818Z         %168 = arith.extf %154 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:38:51.7500148Z         %169 = tt.broadcast %167 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:38:51.7500452Z         %170 = arith.subf %168, %169 : tensor<8x1024xf32>
2026-02-21T09:38:51.7500864Z         %171 = tt.extern_elementwise %170 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:38:51.7501324Z         %172 = arith.select %153, %171, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:38:51.7501647Z         %173 = "tt.reduce"(%172) <{axis = 1 : i32}> ({
2026-02-21T09:38:51.7501908Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:38:51.7502133Z           %175 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:38:51.7502381Z           tt.reduce.return %175 : f32
2026-02-21T09:38:51.7502624Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:38:51.7502863Z         %174 = arith.addf %166, %173 : tensor<8xf32>
2026-02-21T09:38:51.7503146Z         scf.yield %163, %174 : tensor<8xf32>, tensor<8xf32>
2026-02-21T09:38:51.7503386Z       } {tt.flatten}
2026-02-21T09:38:51.7503654Z       %7 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:38:51.7503953Z       %8 = tt.splat %c6144_i32 : i32 -> tensor<1024xi32>
2026-02-21T09:38:51.7504224Z       %9 = arith.addi %8, %7 : tensor<1024xi32>
2026-02-21T09:38:51.7504474Z       %10 = arith.cmpi slt, %9, %cst_3 : tensor<1024xi32>
2026-02-21T09:38:51.7504859Z       %11 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:38:51.7505168Z       %12 = arith.muli %11, %cst_2 : tensor<8x1xi32>
2026-02-21T09:38:51.7505456Z       %13 = tt.expand_dims %9 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:38:51.7505808Z       %14 = tt.broadcast %12 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:38:51.7506107Z       %15 = tt.broadcast %13 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:38:51.7506403Z       %16 = arith.addi %14, %15 : tensor<8x1024xi32>
2026-02-21T09:38:51.7506699Z       %17 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:38:51.7507012Z       %18 = tt.addptr %17, %16 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:38:51.7507366Z       %19 = tt.expand_dims %10 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:38:51.7507738Z       %20 = tt.broadcast %19 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:38:51.7508060Z       %21 = tt.load %18, %20, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:38:51.7508355Z       %22 = arith.select %20, %21, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:38:51.7508690Z       %23 = arith.extf %22 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:38:51.7508977Z       %24 = "tt.reduce"(%23) <{axis = 1 : i32}> ({
2026-02-21T09:38:51.7509203Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:38:51.7509445Z         %66 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:38:51.7509673Z         tt.reduce.return %66 : f32
2026-02-21T09:38:51.7509920Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:38:51.7510173Z       %25 = arith.truncf %24 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:38:51.7510459Z       %26 = arith.extf %25 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:38:51.7510746Z       %27 = arith.cmpf ogt, %6#0, %26 : tensor<8xf32>
2026-02-21T09:38:51.7511001Z       %28 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32>
2026-02-21T09:38:51.7511265Z       %29 = arith.ori %27, %28 : tensor<8xi1>
2026-02-21T09:38:51.7511525Z       %30 = arith.select %29, %6#0, %26 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:38:51.7511853Z       %31 = arith.subf %6#0, %30 : tensor<8xf32>
2026-02-21T09:38:51.7512335Z       %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:38:51.7512759Z       %33 = arith.mulf %6#1, %32 : tensor<8xf32>
2026-02-21T09:38:51.7513065Z       %34 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:38:51.7513380Z       %35 = arith.extf %21 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:38:51.7513703Z       %36 = tt.broadcast %34 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:38:51.7513967Z       %37 = arith.subf %35, %36 : tensor<8x1024xf32>
2026-02-21T09:38:51.7514393Z       %38 = tt.extern_elementwise %37 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:38:51.7514864Z       %39 = arith.select %20, %38, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:38:51.7515151Z       %40 = "tt.reduce"(%39) <{axis = 1 : i32}> ({
2026-02-21T09:38:51.7515403Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:38:51.7515620Z         %66 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:38:51.7515867Z         tt.reduce.return %66 : f32
2026-02-21T09:38:51.7516089Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:38:51.7516345Z       %41 = arith.addf %33, %40 : tensor<8xf32>
2026-02-21T09:38:51.7516577Z       %c6144_i32_6 = arith.constant 6144 : i32
2026-02-21T09:38:51.7516820Z       %c3072_i32_7 = arith.constant 3072 : i32
2026-02-21T09:38:51.7517114Z       scf.for %arg3 = %c0_i32 to %c6144_i32_6 step %c3072_i32_7  : i32 {
2026-02-21T09:38:51.7517436Z         %66 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:38:51.7517775Z         %67 = tt.splat %arg3 : i32 -> tensor<1024xi32>
2026-02-21T09:38:51.7518089Z         %68 = arith.addi %67, %66 : tensor<1024xi32>
2026-02-21T09:38:51.7518359Z         %69 = arith.cmpi slt, %68, %cst_3 : tensor<1024xi32>
2026-02-21T09:38:51.7518703Z         %70 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:38:51.7519101Z         %71 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:38:51.7519445Z         %72 = arith.extf %70 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:38:51.7519741Z         %73 = tt.broadcast %71 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:38:51.7520034Z         %74 = arith.subf %72, %73 : tensor<8x1024xf32>
2026-02-21T09:38:51.7520437Z         %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:38:51.7520938Z         %76 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:38:51.7521284Z         %77 = tt.broadcast %76 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:38:51.7521579Z         %78 = arith.divf %75, %77 : tensor<8x1024xf32>
2026-02-21T09:38:51.7521900Z         %79 = arith.truncf %78 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:38:51.7522236Z         %80 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:38:51.7522525Z         %81 = arith.muli %80, %cst_2 : tensor<8x1xi32>
2026-02-21T09:38:51.7522845Z         %82 = tt.expand_dims %68 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:38:51.7523172Z         %83 = tt.broadcast %81 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:38:51.7523494Z         %84 = tt.broadcast %82 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:38:51.7523763Z         %85 = arith.addi %83, %84 : tensor<8x1024xi32>
2026-02-21T09:38:51.7524064Z         %86 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:38:51.7524409Z         %87 = tt.addptr %86, %85 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:38:51.7524739Z         %88 = tt.expand_dims %69 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:38:51.7525092Z         %89 = tt.broadcast %88 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:38:51.7525376Z         tt.store %87, %79, %89 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:38:51.7525647Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T09:38:51.7525871Z         %90 = arith.muli %c1024_i32, %c1_i32 : i32
2026-02-21T09:38:51.7526102Z         %91 = arith.addi %arg3, %90 : i32
2026-02-21T09:38:51.7526378Z         %92 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:38:51.7526698Z         %93 = tt.splat %91 : i32 -> tensor<1024xi32>
2026-02-21T09:38:51.7526940Z         %94 = arith.addi %93, %92 : tensor<1024xi32>
2026-02-21T09:38:51.7527222Z         %95 = arith.cmpi slt, %94, %cst_3 : tensor<1024xi32>
2026-02-21T09:38:51.7527563Z         %96 = tt.descriptor_load %0[%2, %91] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:38:51.7527972Z         %97 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:38:51.7528320Z         %98 = arith.extf %96 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:38:51.7528616Z         %99 = tt.broadcast %97 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:38:51.7528924Z         %100 = arith.subf %98, %99 : tensor<8x1024xf32>
2026-02-21T09:38:51.7529341Z         %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:38:51.7529835Z         %102 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:38:51.7530199Z         %103 = tt.broadcast %102 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:38:51.7530486Z         %104 = arith.divf %101, %103 : tensor<8x1024xf32>
2026-02-21T09:38:51.7530804Z         %105 = arith.truncf %104 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:38:51.7531183Z         %106 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:38:51.7531517Z         %107 = arith.muli %106, %cst_2 : tensor<8x1xi32>
2026-02-21T09:38:51.7531869Z         %108 = tt.expand_dims %94 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:38:51.7532235Z         %109 = tt.broadcast %107 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:38:51.7532575Z         %110 = tt.broadcast %108 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:38:51.7532868Z         %111 = arith.addi %109, %110 : tensor<8x1024xi32>
2026-02-21T09:38:51.7533179Z         %112 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:38:51.7533509Z         %113 = tt.addptr %112, %111 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:38:51.7533965Z         %114 = tt.expand_dims %95 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:38:51.7534325Z         %115 = tt.broadcast %114 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:38:51.7534674Z         tt.store %113, %105, %115 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:38:51.7534976Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:38:51.7535223Z         %116 = arith.muli %c1024_i32, %c2_i32 : i32
2026-02-21T09:38:51.7535501Z         %117 = arith.addi %arg3, %116 : i32
2026-02-21T09:38:51.7535789Z         %118 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:38:51.7536136Z         %119 = tt.splat %117 : i32 -> tensor<1024xi32>
2026-02-21T09:38:51.7536396Z         %120 = arith.addi %119, %118 : tensor<1024xi32>
2026-02-21T09:38:51.7536698Z         %121 = arith.cmpi slt, %120, %cst_3 : tensor<1024xi32>
2026-02-21T09:38:51.7537090Z         %122 = tt.descriptor_load %0[%2, %117] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:38:51.7537490Z         %123 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:38:51.7537866Z         %124 = arith.extf %122 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:38:51.7538193Z         %125 = tt.broadcast %123 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:38:51.7538518Z         %126 = arith.subf %124, %125 : tensor<8x1024xf32>
2026-02-21T09:38:51.7538984Z         %127 = tt.extern_elementwise %126 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:38:51.7539460Z         %128 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:38:51.7539837Z         %129 = tt.broadcast %128 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:38:51.7540104Z         %130 = arith.divf %127, %129 : tensor<8x1024xf32>
2026-02-21T09:38:51.7540424Z         %131 = arith.truncf %130 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:38:51.7540766Z         %132 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:38:51.7541120Z         %133 = arith.muli %132, %cst_2 : tensor<8x1xi32>
2026-02-21T09:38:51.7541469Z         %134 = tt.expand_dims %120 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:38:51.7541838Z         %135 = tt.broadcast %133 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:38:51.7542189Z         %136 = tt.broadcast %134 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:38:51.7542473Z         %137 = arith.addi %135, %136 : tensor<8x1024xi32>
2026-02-21T09:38:51.7542780Z         %138 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:38:51.7543132Z         %139 = tt.addptr %138, %137 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:38:51.7543479Z         %140 = tt.expand_dims %121 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:38:51.7543839Z         %141 = tt.broadcast %140 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:38:51.7544143Z         tt.store %139, %131, %141 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:38:51.7544470Z       } {tt.flatten}
2026-02-21T09:38:51.7544720Z       %42 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:38:51.7545029Z       %43 = tt.splat %c6144_i32_6 : i32 -> tensor<1024xi32>
2026-02-21T09:38:51.7545254Z       %44 = arith.addi %43, %42 : tensor<1024xi32>
2026-02-21T09:38:51.7545476Z       %45 = arith.cmpi slt, %44, %cst_3 : tensor<1024xi32>
2026-02-21T09:38:51.7545813Z       %46 = tt.descriptor_load %0[%2, %c6144_i32_6] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:38:51.7546162Z       %47 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:38:51.7546451Z       %48 = arith.extf %46 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:38:51.7546714Z       %49 = tt.broadcast %47 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:38:51.7546946Z       %50 = arith.subf %48, %49 : tensor<8x1024xf32>
2026-02-21T09:38:51.7547367Z       %51 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:38:51.7547775Z       %52 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:38:51.7548060Z       %53 = tt.broadcast %52 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:38:51.7548296Z       %54 = arith.divf %51, %53 : tensor<8x1024xf32>
2026-02-21T09:38:51.7548532Z       %55 = arith.truncf %54 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:38:51.7548816Z       %56 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:38:51.7549065Z       %57 = arith.muli %56, %cst_2 : tensor<8x1xi32>
2026-02-21T09:38:51.7549326Z       %58 = tt.expand_dims %44 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:38:51.7549609Z       %59 = tt.broadcast %57 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:38:51.7549876Z       %60 = tt.broadcast %58 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:38:51.7550118Z       %61 = arith.addi %59, %60 : tensor<8x1024xi32>
2026-02-21T09:38:51.7550349Z       %62 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:38:51.7550630Z       %63 = tt.addptr %62, %61 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:38:51.7550920Z       %64 = tt.expand_dims %45 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:38:51.7551208Z       %65 = tt.broadcast %64 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:38:51.7551454Z       tt.store %63, %55, %65 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:38:51.7551869Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T09:38:51.7552210Z     tt.return
2026-02-21T09:38:51.7552332Z   }
2026-02-21T09:38:51.7552452Z }
2026-02-21T09:38:51.7552520Z 
2026-02-21T09:38:51.7552569Z {-#
2026-02-21T09:38:51.7552699Z   external_resources: {
2026-02-21T09:38:51.7552854Z     mlir_reproducer: {
2026-02-21T09:38:51.7557142Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T09:38:51.7561569Z       disable_threading: false,
2026-02-21T09:38:51.7561737Z       verify_each: true
2026-02-21T09:38:51.7561875Z     }
2026-02-21T09:38:51.7561991Z   }
2026-02-21T09:38:51.7562100Z #-}
2026-02-21T09:38:51.7562571Z /tmp/torchinductor_root/au/cauvte7suhxwgudkpzlyh47qnfkfxs2lm5bmex7bmeotoztnmarb.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:38:51.7563735Z /tmp/torchinductor_root/au/cauvte7suhxwgudkpzlyh47qnfkfxs2lm5bmex7bmeotoztnmarb.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:38:51.7564711Z [39s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:38:51.7565750Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=4, num_warps=32, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[4, 0], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T09:38:51.7566686Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:38:51.7566938Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:38:57.4821020Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 12.1 configs/s
2026-02-21T09:38:57.4833618Z [45s] Adaptive compile timeout: 30s (90% percentile=10.5s, bounds=[30.0s, 30s])
2026-02-21T09:38:58.1523000Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1470.2 configs/s
2026-02-21T09:38:58.2057645Z [45s] Initial random population of 100, 5 starting points: 
2026-02-21T09:38:58.2059508Z error=14
2026-02-21T09:38:58.2070792Z ok=86
2026-02-21T09:38:58.2074711Z min=0.0450
2026-02-21T09:38:58.2076902Z mid=0.5674
2026-02-21T09:38:58.2077086Z max=156.2122
2026-02-21T09:38:58.2077240Z best={'block_sizes': [1, 8192],
2026-02-21T09:38:58.2077551Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:38:58.2078168Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:38:58.2078366Z  'num_sm_multiplier': 2,
2026-02-21T09:38:58.2078532Z  'num_stages': 7,
2026-02-21T09:38:58.2078669Z  'num_warps': 32,
2026-02-21T09:38:58.2078828Z  'pid_type': 'persistent_blocked',
2026-02-21T09:38:58.2079012Z  'range_flattens': [True, True],
2026-02-21T09:38:58.2079195Z  'range_multi_buffers': [False, None],
2026-02-21T09:38:58.2079376Z  'range_num_stages': [4, 3],
2026-02-21T09:38:58.2079546Z  'range_unroll_factors': [2, 3],
2026-02-21T09:38:58.2079725Z  'range_warp_specializes': [False, False]}
2026-02-21T09:38:59.3475588Z [45s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:38:59.3476016Z [47s] Generation 1 starting: 78 neighbors, 5 active search path(s)
2026-02-21T09:39:08.9813920Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 9.1 configs/s
2026-02-21T09:39:13.9214445Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 83/83 17.0 configs/s
2026-02-21T09:39:15.9897017Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 487.8 configs/s
2026-02-21T09:39:16.1384118Z [63s] Generation 1 complete: 
2026-02-21T09:39:16.1388711Z ok=84
2026-02-21T09:39:16.1392946Z min=0.0307
2026-02-21T09:39:16.1397722Z mid=0.0533
2026-02-21T09:39:16.1401997Z max=0.3523
2026-02-21T09:39:16.1407568Z best={'block_sizes': [1, 8192],
2026-02-21T09:39:16.1409321Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:39:16.1409679Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:39:16.1409881Z  'num_stages': 7,
2026-02-21T09:39:16.1410041Z  'num_warps': 8,
2026-02-21T09:39:16.1410195Z  'pid_type': 'flat',
2026-02-21T09:39:16.1410359Z  'range_flattens': [None, True],
2026-02-21T09:39:16.1410551Z  'range_multi_buffers': [None, None],
2026-02-21T09:39:16.1410743Z  'range_num_stages': [0, 3],
2026-02-21T09:39:16.1410954Z  'range_unroll_factors': [0, 3],
2026-02-21T09:39:16.1411164Z  'range_warp_specializes': [None, False]}
2026-02-21T09:39:16.1411476Z [63s] Fitting surrogate: 184 points, 184 targets
2026-02-21T09:39:17.2129306Z [64s] Generation 2 starting: 72 neighbors, 5 active search path(s)
2026-02-21T09:39:27.8572167Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 6.9 configs/s
2026-02-21T09:39:32.3902748Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 16.9 configs/s
2026-02-21T09:39:34.6995557Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 437.9 configs/s
2026-02-21T09:39:34.8660285Z [82s] Generation 2 complete: 
2026-02-21T09:39:34.8660608Z ok=78
2026-02-21T09:39:34.8664461Z min=0.0266
2026-02-21T09:39:34.8666490Z mid=0.0410
2026-02-21T09:39:34.8666659Z max=0.4158
2026-02-21T09:39:34.8666812Z best={'block_sizes': [1, 8192],
2026-02-21T09:39:34.8667094Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:39:34.8667386Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:39:34.8667574Z  'num_stages': 7,
2026-02-21T09:39:34.8667721Z  'num_warps': 4,
2026-02-21T09:39:34.8667859Z  'pid_type': 'flat',
2026-02-21T09:39:34.8668019Z  'range_flattens': [None, True],
2026-02-21T09:39:34.8668192Z  'range_multi_buffers': [None, None],
2026-02-21T09:39:34.8668379Z  'range_num_stages': [0, 3],
2026-02-21T09:39:34.8668540Z  'range_unroll_factors': [0, 3],
2026-02-21T09:39:34.8668729Z  'range_warp_specializes': [None, False]}
2026-02-21T09:39:34.8676126Z [82s] Fitting surrogate: 262 points, 262 targets
2026-02-21T09:39:35.8042851Z [83s] Generation 3 starting: 59 neighbors, 5 active search path(s)
2026-02-21T09:39:48.9159505Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 1.5 configs/s
2026-02-21T09:39:52.6877518Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 63/63 16.9 configs/s
2026-02-21T09:39:55.8447944Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 321.3 configs/s
2026-02-21T09:39:56.0796534Z [103s] Generation 3 complete: 
2026-02-21T09:39:56.0798332Z ok=65
2026-02-21T09:39:56.0798509Z min=0.0266
2026-02-21T09:39:56.0798646Z mid=0.0348
2026-02-21T09:39:56.0798781Z max=0.5059
2026-02-21T09:39:56.0798924Z best={'block_sizes': [1, 8192],
2026-02-21T09:39:56.0799190Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:39:56.0799477Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:39:56.0799674Z  'num_stages': 7,
2026-02-21T09:39:56.0799829Z  'num_warps': 4,
2026-02-21T09:39:56.0799976Z  'pid_type': 'flat',
2026-02-21T09:39:56.0800145Z  'range_flattens': [None, True],
2026-02-21T09:39:56.0800329Z  'range_multi_buffers': [None, None],
2026-02-21T09:39:56.0800524Z  'range_num_stages': [0, 3],
2026-02-21T09:39:56.0801088Z  'range_unroll_factors': [0, 3],
2026-02-21T09:39:56.0801318Z  'range_warp_specializes': [None, False]}
2026-02-21T09:39:56.0814242Z [103s] Fitting surrogate: 327 points, 327 targets
2026-02-21T09:39:56.8741277Z [104s] Generation 4 starting: 49 neighbors, 4 active search path(s)
2026-02-21T09:40:04.8681493Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 3.0 configs/s
2026-02-21T09:40:07.9249215Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 16.9 configs/s
2026-02-21T09:40:10.2020606Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 445.8 configs/s
2026-02-21T09:40:10.3711163Z [118s] Generation 4 complete: 
2026-02-21T09:40:10.3717349Z ok=54
2026-02-21T09:40:10.3718807Z min=0.0266
2026-02-21T09:40:10.3719003Z mid=0.0307
2026-02-21T09:40:10.3719137Z max=0.1577
2026-02-21T09:40:10.3719299Z best={'block_sizes': [1, 8192],
2026-02-21T09:40:10.3719597Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:40:10.3719890Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:40:10.3720100Z  'num_stages': 6,
2026-02-21T09:40:10.3720245Z  'num_warps': 1,
2026-02-21T09:40:10.3720399Z  'pid_type': 'flat',
2026-02-21T09:40:10.3720560Z  'range_flattens': [None, True],
2026-02-21T09:40:10.3720751Z  'range_multi_buffers': [None, None],
2026-02-21T09:40:10.3720941Z  'range_num_stages': [0, 4],
2026-02-21T09:40:10.3721120Z  'range_unroll_factors': [0, 0],
2026-02-21T09:40:10.3721309Z  'range_warp_specializes': [None, True]}
2026-02-21T09:40:10.3729579Z [118s] Fitting surrogate: 381 points, 381 targets
2026-02-21T09:40:10.9659047Z [118s] Generation 5 starting: 29 neighbors, 3 active search path(s)
2026-02-21T09:40:18.9878932Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 1.8 configs/s
2026-02-21T09:40:20.8060037Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 31/31 17.5 configs/s
2026-02-21T09:40:21.9470873Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 882.1 configs/s
2026-02-21T09:40:22.0452472Z [129s] Generation 5 complete: 
2026-02-21T09:40:22.0456720Z error=1
2026-02-21T09:40:22.0460838Z ok=32
2026-02-21T09:40:22.0466141Z min=0.0266
2026-02-21T09:40:22.0470672Z mid=0.0410
2026-02-21T09:40:22.0472448Z max=1.0916
2026-02-21T09:40:22.0472675Z best={'block_sizes': [1, 8192],
2026-02-21T09:40:22.0472969Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:40:22.0473270Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:40:22.0473491Z  'num_stages': 6,
2026-02-21T09:40:22.0473665Z  'num_warps': 1,
2026-02-21T09:40:22.0473836Z  'pid_type': 'flat',
2026-02-21T09:40:22.0474384Z  'range_flattens': [None, True],
2026-02-21T09:40:22.0474629Z  'range_multi_buffers': [None, None],
2026-02-21T09:40:22.0474828Z  'range_num_stages': [0, 4],
2026-02-21T09:40:22.0475027Z  'range_unroll_factors': [0, 0],
2026-02-21T09:40:22.0475625Z  'range_warp_specializes': [None, True]}
2026-02-21T09:40:22.0475851Z [129s] Fitting surrogate: 414 points, 414 targets
2026-02-21T09:40:22.5134463Z [130s] Generation 6 starting: 23 neighbors, 2 active search path(s)
2026-02-21T09:40:25.9283331Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 17.3 configs/s
2026-02-21T09:40:27.3089441Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 17.2 configs/s
2026-02-21T09:40:29.6325800Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 591.0 configs/s
2026-02-21T09:40:29.7653991Z [137s] Generation 6 complete: 
2026-02-21T09:40:29.7658960Z ok=26
2026-02-21T09:40:29.7662948Z min=0.0266
2026-02-21T09:40:29.7667482Z mid=0.0286
2026-02-21T09:40:29.7670801Z max=0.0492
2026-02-21T09:40:29.7675286Z best={'block_sizes': [1, 8192],
2026-02-21T09:40:29.7680222Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:40:29.7683624Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:40:29.7684832Z  'num_stages': 6,
2026-02-21T09:40:29.7685069Z  'num_warps': 1,
2026-02-21T09:40:29.7685260Z  'pid_type': 'flat',
2026-02-21T09:40:29.7685446Z  'range_flattens': [None, True],
2026-02-21T09:40:29.7685651Z  'range_multi_buffers': [None, None],
2026-02-21T09:40:29.7690779Z  'range_num_stages': [0, 4],
2026-02-21T09:40:29.7695268Z  'range_unroll_factors': [0, 0],
2026-02-21T09:40:29.7699706Z  'range_warp_specializes': [None, True]}
2026-02-21T09:40:29.7704502Z [137s] Fitting surrogate: 440 points, 440 targets
2026-02-21T09:40:30.1938480Z [137s] Generation 7 starting: 21 neighbors, 2 active search path(s)
2026-02-21T09:40:32.9157256Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 16.3 configs/s
2026-02-21T09:40:34.1653446Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 17.4 configs/s
2026-02-21T09:40:35.5609349Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 723.4 configs/s
2026-02-21T09:40:35.6774765Z [143s] Generation 7 complete: 
2026-02-21T09:40:35.6778816Z ok=24
2026-02-21T09:40:35.6780252Z min=0.0266
2026-02-21T09:40:35.6780407Z mid=0.0286
2026-02-21T09:40:35.6780539Z max=0.0471
2026-02-21T09:40:35.6780682Z best={'block_sizes': [1, 8192],
2026-02-21T09:40:35.6780940Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:40:35.6781207Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:40:35.6781409Z  'num_stages': 6,
2026-02-21T09:40:35.6781836Z  'num_warps': 1,
2026-02-21T09:40:35.6782002Z  'pid_type': 'flat',
2026-02-21T09:40:35.6782169Z  'range_flattens': [None, True],
2026-02-21T09:40:35.6782362Z  'range_multi_buffers': [None, None],
2026-02-21T09:40:35.6782563Z  'range_num_stages': [0, 4],
2026-02-21T09:40:35.6782775Z  'range_unroll_factors': [0, 0],
2026-02-21T09:40:35.6782985Z  'range_warp_specializes': [None, True]}
2026-02-21T09:40:35.6790650Z [143s] Fitting surrogate: 464 points, 464 targets
2026-02-21T09:40:36.1115368Z [143s] Generation 8 starting: 20 neighbors, 2 active search path(s)
2026-02-21T09:40:39.3597344Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 6.6 configs/s
2026-02-21T09:40:40.5599561Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 20/20 17.3 configs/s
2026-02-21T09:40:41.9825503Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 708.8 configs/s
2026-02-21T09:40:42.0937657Z [149s] Generation 8 complete: 
2026-02-21T09:40:42.0941464Z ok=23
2026-02-21T09:40:42.0942858Z min=0.0266
2026-02-21T09:40:42.0943048Z mid=0.0286
2026-02-21T09:40:42.0943195Z max=0.0492
2026-02-21T09:40:42.0943372Z best={'block_sizes': [1, 8192],
2026-02-21T09:40:42.0943713Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:40:42.0944354Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:40:42.0944553Z  'num_stages': 6,
2026-02-21T09:40:42.0944694Z  'num_warps': 1,
2026-02-21T09:40:42.0944845Z  'pid_type': 'flat',
2026-02-21T09:40:42.0945002Z  'range_flattens': [None, True],
2026-02-21T09:40:42.0945185Z  'range_multi_buffers': [None, None],
2026-02-21T09:40:42.0945367Z  'range_num_stages': [0, 4],
2026-02-21T09:40:42.0945544Z  'range_unroll_factors': [0, 0],
2026-02-21T09:40:42.0945725Z  'range_warp_specializes': [None, True]}
2026-02-21T09:40:42.0953449Z [149s] Fitting surrogate: 487 points, 487 targets
2026-02-21T09:40:42.5203542Z [150s] Generation 9 starting: 22 neighbors, 2 active search path(s)
2026-02-21T09:40:45.8698899Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 6.7 configs/s
2026-02-21T09:40:47.1902879Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 22/22 17.2 configs/s
2026-02-21T09:40:48.6791085Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 678.3 configs/s
2026-02-21T09:40:48.7949991Z [156s] Generation 9 complete: 
2026-02-21T09:40:48.7954322Z ok=25
2026-02-21T09:40:48.7958885Z min=0.0266
2026-02-21T09:40:48.7960252Z mid=0.0286
2026-02-21T09:40:48.7960418Z max=0.0471
2026-02-21T09:40:48.7960559Z best={'block_sizes': [1, 8192],
2026-02-21T09:40:48.7960815Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:40:48.7961083Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:40:48.7961274Z  'num_stages': 6,
2026-02-21T09:40:48.7961423Z  'num_warps': 1,
2026-02-21T09:40:48.7961650Z  'pid_type': 'flat',
2026-02-21T09:40:48.7961820Z  'range_flattens': [None, True],
2026-02-21T09:40:48.7961999Z  'range_multi_buffers': [None, None],
2026-02-21T09:40:48.7962193Z  'range_num_stages': [0, 4],
2026-02-21T09:40:48.7962379Z  'range_unroll_factors': [0, 0],
2026-02-21T09:40:48.7962574Z  'range_warp_specializes': [None, True]}
2026-02-21T09:40:48.7969481Z [156s] Fitting surrogate: 512 points, 512 targets
2026-02-21T09:40:49.3460385Z [157s] Generation 10 starting: 24 neighbors, 2 active search path(s)
2026-02-21T09:40:53.0436477Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25/25 6.4 configs/s
2026-02-21T09:40:54.5709854Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 25/25 16.9 configs/s
2026-02-21T09:40:56.0492917Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 682.4 configs/s
2026-02-21T09:40:56.1676384Z [163s] Generation 10 complete: 
2026-02-21T09:40:56.1680078Z ok=27
2026-02-21T09:40:56.1684072Z min=0.0266
2026-02-21T09:40:56.1688579Z mid=0.0286
2026-02-21T09:40:56.1689955Z max=0.0553
2026-02-21T09:40:56.1690149Z best={'block_sizes': [1, 8192],
2026-02-21T09:40:56.1690450Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:40:56.1691111Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:40:56.1691322Z  'num_stages': 6,
2026-02-21T09:40:56.1691467Z  'num_warps': 1,
2026-02-21T09:40:56.1691888Z  'pid_type': 'flat',
2026-02-21T09:40:56.1692049Z  'range_flattens': [None, True],
2026-02-21T09:40:56.1692240Z  'range_multi_buffers': [None, None],
2026-02-21T09:40:56.1692427Z  'range_num_stages': [0, 4],
2026-02-21T09:40:56.1692607Z  'range_unroll_factors': [0, 0],
2026-02-21T09:40:56.1692786Z  'range_warp_specializes': [None, True]}
2026-02-21T09:40:56.1693072Z [163s] Fitting surrogate: 539 points, 539 targets
2026-02-21T09:40:56.6609603Z [164s] Generation 11 starting: 16 neighbors, 2 active search path(s)
2026-02-21T09:40:58.6317539Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 14.6 configs/s
2026-02-21T09:40:59.6381940Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.7 configs/s
2026-02-21T09:41:00.7858426Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 876.1 configs/s
2026-02-21T09:41:00.8727517Z [168s] Generation 11 complete: 
2026-02-21T09:41:00.8732022Z ok=19
2026-02-21T09:41:00.8733548Z min=0.0266
2026-02-21T09:41:00.8733709Z mid=0.0307
2026-02-21T09:41:00.8733845Z max=0.0492
2026-02-21T09:41:00.8733985Z best={'block_sizes': [1, 8192],
2026-02-21T09:41:00.8734248Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:41:00.8734528Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:41:00.8734722Z  'num_stages': 6,
2026-02-21T09:41:00.8734894Z  'num_warps': 1,
2026-02-21T09:41:00.8735033Z  'pid_type': 'flat',
2026-02-21T09:41:00.8735195Z  'range_flattens': [None, True],
2026-02-21T09:41:00.8735370Z  'range_multi_buffers': [None, None],
2026-02-21T09:41:00.8735558Z  'range_num_stages': [0, 4],
2026-02-21T09:41:00.8735722Z  'range_unroll_factors': [0, 0],
2026-02-21T09:41:00.8735933Z  'range_warp_specializes': [None, True]}
2026-02-21T09:41:00.8744061Z [168s] Fitting surrogate: 558 points, 558 targets
2026-02-21T09:41:01.3726255Z [169s] Generation 12 starting: 23 neighbors, 2 active search path(s)
2026-02-21T09:41:06.0257684Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24/24 3.3 configs/s
2026-02-21T09:41:07.4678872Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 24/24 17.2 configs/s
2026-02-21T09:41:09.2848843Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 770.5 configs/s
2026-02-21T09:41:09.3866725Z [177s] Generation 12 complete: 
2026-02-21T09:41:09.3872320Z ok=26
2026-02-21T09:41:09.3876284Z min=0.0266
2026-02-21T09:41:09.3880162Z mid=0.0326
2026-02-21T09:41:09.3882113Z max=0.0696
2026-02-21T09:41:09.3882291Z best={'block_sizes': [1, 8192],
2026-02-21T09:41:09.3882585Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:41:09.3882865Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:41:09.3883059Z  'num_stages': 6,
2026-02-21T09:41:09.3883197Z  'num_warps': 1,
2026-02-21T09:41:09.3883342Z  'pid_type': 'flat',
2026-02-21T09:41:09.3883495Z  'range_flattens': [None, True],
2026-02-21T09:41:09.3883677Z  'range_multi_buffers': [None, None],
2026-02-21T09:41:09.3883864Z  'range_num_stages': [0, 4],
2026-02-21T09:41:09.3884027Z  'range_unroll_factors': [0, 0],
2026-02-21T09:41:09.3884207Z  'range_warp_specializes': [None, True]}
2026-02-21T09:41:09.3884415Z [177s] Fitting surrogate: 584 points, 584 targets
2026-02-21T09:41:09.9859588Z [177s] Generation 13 starting: 24 neighbors, 2 active search path(s)
2026-02-21T09:41:13.5285966Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25/25 3.7 configs/s
2026-02-21T09:41:15.0231325Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 25/25 17.2 configs/s
2026-02-21T09:41:16.4997873Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 683.5 configs/s
2026-02-21T09:41:16.6131903Z [184s] Generation 13 complete: 
2026-02-21T09:41:16.6133661Z ok=27
2026-02-21T09:41:16.6133829Z min=0.0266
2026-02-21T09:41:16.6133974Z mid=0.0286
2026-02-21T09:41:16.6134098Z max=0.0777
2026-02-21T09:41:16.6134253Z best={'block_sizes': [1, 8192],
2026-02-21T09:41:16.6134506Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:41:16.6134772Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:41:16.6134961Z  'num_stages': 6,
2026-02-21T09:41:16.6135107Z  'num_warps': 1,
2026-02-21T09:41:16.6135244Z  'pid_type': 'flat',
2026-02-21T09:41:16.6135406Z  'range_flattens': [None, True],
2026-02-21T09:41:16.6135588Z  'range_multi_buffers': [None, None],
2026-02-21T09:41:16.6135767Z  'range_num_stages': [0, 4],
2026-02-21T09:41:16.6135935Z  'range_unroll_factors': [0, 0],
2026-02-21T09:41:16.6136423Z  'range_warp_specializes': [None, True]}
2026-02-21T09:41:16.6151907Z [184s] Fitting surrogate: 611 points, 611 targets
2026-02-21T09:41:17.1315944Z [184s] Generation 14 starting: 21 neighbors, 2 active search path(s)
2026-02-21T09:41:19.8036820Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 7.6 configs/s
2026-02-21T09:41:21.0349226Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 21/21 17.7 configs/s
2026-02-21T09:41:22.5929678Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 647.7 configs/s
2026-02-21T09:41:22.7070447Z [190s] Generation 14 complete: 
2026-02-21T09:41:22.7073747Z ok=24
2026-02-21T09:41:22.7076358Z min=0.0266
2026-02-21T09:41:22.7076527Z mid=0.0286
2026-02-21T09:41:22.7076653Z max=0.0471
2026-02-21T09:41:22.7076798Z best={'block_sizes': [1, 8192],
2026-02-21T09:41:22.7077080Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:41:22.7077373Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:41:22.7077562Z  'num_stages': 6,
2026-02-21T09:41:22.7077709Z  'num_warps': 1,
2026-02-21T09:41:22.7077847Z  'pid_type': 'flat',
2026-02-21T09:41:22.7078007Z  'range_flattens': [None, True],
2026-02-21T09:41:22.7078181Z  'range_multi_buffers': [None, None],
2026-02-21T09:41:22.7078368Z  'range_num_stages': [0, 4],
2026-02-21T09:41:22.7078539Z  'range_unroll_factors': [0, 0],
2026-02-21T09:41:22.7078712Z  'range_warp_specializes': [None, True]}
2026-02-21T09:41:22.7087069Z [190s] Fitting surrogate: 635 points, 635 targets
2026-02-21T09:41:23.2565767Z [190s] Generation 15 starting: 22 neighbors, 2 active search path(s)
2026-02-21T09:41:25.8892241Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 16.1 configs/s
2026-02-21T09:41:27.2487866Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 23/23 17.5 configs/s
2026-02-21T09:41:28.6134341Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 738.3 configs/s
2026-02-21T09:41:28.7196746Z [196s] Generation 15 complete: 
2026-02-21T09:41:28.7200543Z ok=25
2026-02-21T09:41:28.7204880Z min=0.0266
2026-02-21T09:41:28.7209248Z mid=0.0307
2026-02-21T09:41:28.7211255Z max=0.0491
2026-02-21T09:41:28.7211477Z best={'block_sizes': [1, 8192],
2026-02-21T09:41:28.7216543Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:41:28.7218412Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:41:28.7218636Z  'num_stages': 6,
2026-02-21T09:41:28.7218788Z  'num_warps': 1,
2026-02-21T09:41:28.7218930Z  'pid_type': 'flat',
2026-02-21T09:41:28.7219095Z  'range_flattens': [None, True],
2026-02-21T09:41:28.7219272Z  'range_multi_buffers': [None, None],
2026-02-21T09:41:28.7219465Z  'range_num_stages': [0, 4],
2026-02-21T09:41:28.7219636Z  'range_unroll_factors': [0, 0],
2026-02-21T09:41:28.7219828Z  'range_warp_specializes': [None, True]}
2026-02-21T09:41:28.7220390Z [196s] Fitting surrogate: 660 points, 660 targets
2026-02-21T09:41:29.2377436Z [196s] Generation 16 starting: 18 neighbors, 2 active search path(s)
2026-02-21T09:41:32.1622345Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 7.6 configs/s
2026-02-21T09:41:33.3377415Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 17.7 configs/s
2026-02-21T09:41:34.4991207Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 865.2 configs/s
2026-02-21T09:41:34.5904521Z [202s] Generation 16 complete: 
2026-02-21T09:41:34.5909006Z ok=21
2026-02-21T09:41:34.5914074Z min=0.0266
2026-02-21T09:41:34.5918548Z mid=0.0328
2026-02-21T09:41:34.5920064Z max=0.0471
2026-02-21T09:41:34.5920310Z best={'block_sizes': [1, 8192],
2026-02-21T09:41:34.5920942Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:41:34.5925766Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:41:34.5929532Z  'num_stages': 6,
2026-02-21T09:41:34.5931046Z  'num_warps': 1,
2026-02-21T09:41:34.5931244Z  'pid_type': 'flat',
2026-02-21T09:41:34.5931417Z  'range_flattens': [None, True],
2026-02-21T09:41:34.5931692Z  'range_multi_buffers': [None, None],
2026-02-21T09:41:34.5931885Z  'range_num_stages': [0, 4],
2026-02-21T09:41:34.5932064Z  'range_unroll_factors': [0, 0],
2026-02-21T09:41:34.5932250Z  'range_warp_specializes': [None, True]}
2026-02-21T09:41:34.5932584Z [202s] Fitting surrogate: 681 points, 681 targets
2026-02-21T09:41:35.0833163Z [202s] Generation 17 starting: 20 neighbors, 2 active search path(s)
2026-02-21T09:41:40.7907692Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 1.2 configs/s
2026-02-21T09:41:42.0493347Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 21/21 17.3 configs/s
2026-02-21T09:41:43.1713630Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 896.3 configs/s
2026-02-21T09:41:43.2583981Z [210s] Generation 17 complete: 
2026-02-21T09:41:43.2585822Z ok=23
2026-02-21T09:41:43.2586020Z min=0.0266
2026-02-21T09:41:43.2586185Z mid=0.0286
2026-02-21T09:41:43.2586341Z max=0.4762
2026-02-21T09:41:43.2586521Z best={'block_sizes': [1, 8192],
2026-02-21T09:41:43.2586805Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:41:43.2587115Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:41:43.2587326Z  'num_stages': 6,
2026-02-21T09:41:43.2587481Z  'num_warps': 1,
2026-02-21T09:41:43.2587662Z  'pid_type': 'flat',
2026-02-21T09:41:43.2587819Z  'range_flattens': [None, True],
2026-02-21T09:41:43.2588004Z  'range_multi_buffers': [None, None],
2026-02-21T09:41:43.2588187Z  'range_num_stages': [0, 4],
2026-02-21T09:41:43.2588359Z  'range_unroll_factors': [0, 0],
2026-02-21T09:41:43.2588553Z  'range_warp_specializes': [None, True]}
2026-02-21T09:41:43.2605001Z [210s] Fitting surrogate: 704 points, 704 targets
2026-02-21T09:41:43.7900560Z [211s] Generation 18 starting: 22 neighbors, 2 active search path(s)
2026-02-21T09:41:46.2909178Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 12.1 configs/s
2026-02-21T09:41:47.6046711Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 22/22 17.3 configs/s
2026-02-21T09:41:49.2927510Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 598.7 configs/s
2026-02-21T09:41:49.4258354Z [217s] Generation 18 complete: 
2026-02-21T09:41:49.4262704Z ok=25
2026-02-21T09:41:49.4265933Z min=0.0266
2026-02-21T09:41:49.4270998Z mid=0.0286
2026-02-21T09:41:49.4275362Z max=0.0451
2026-02-21T09:41:49.4276872Z best={'block_sizes': [1, 8192],
2026-02-21T09:41:49.4277160Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:41:49.4277793Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:41:49.4278012Z  'num_stages': 6,
2026-02-21T09:41:49.4278156Z  'num_warps': 1,
2026-02-21T09:41:49.4278305Z  'pid_type': 'flat',
2026-02-21T09:41:49.4278462Z  'range_flattens': [None, True],
2026-02-21T09:41:49.4278650Z  'range_multi_buffers': [None, None],
2026-02-21T09:41:49.4278835Z  'range_num_stages': [0, 4],
2026-02-21T09:41:49.4279011Z  'range_unroll_factors': [0, 0],
2026-02-21T09:41:49.4279213Z  'range_warp_specializes': [None, True]}
2026-02-21T09:41:49.4284610Z [217s] Fitting surrogate: 729 points, 729 targets
2026-02-21T09:41:49.7051149Z [217s] Autotuning complete in 217.4s after searching 688 configs.
2026-02-21T09:41:49.7051675Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:41:49.7052714Z     @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=6, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T09:41:49.7053602Z 
2026-02-21T09:41:49.7053867Z [217s] Code of selected kernel: /tmp/torchinductor_root/fp/cfpjpq5pjqbo6r2tocqrc7cpd4xs2lpovq445vunpnr4iei6c2ry.py
2026-02-21T09:41:49.7282717Z from __future__ import annotations
2026-02-21T09:41:49.7286552Z 
2026-02-21T09:41:49.7289959Z import torch
2026-02-21T09:41:49.7291419Z import triton
2026-02-21T09:41:49.7291720Z import triton.language as tl
2026-02-21T09:41:49.7291947Z from torch._inductor.runtime import triton_helpers
2026-02-21T09:41:49.7292258Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T09:41:49.7292567Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:41:49.7292776Z 
2026-02-21T09:41:49.7292849Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T09:41:49.7293030Z _BLOCK_SIZE_1 = tl.constexpr(8192)
2026-02-21T09:41:49.7293179Z 
2026-02-21T09:41:49.7293239Z @triton.jit
2026-02-21T09:41:49.7293392Z def _helion_softmax_two_pass(x, out):
2026-02-21T09:41:49.7293643Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:41:49.7293905Z     pid_0 = tl.program_id(0)
2026-02-21T09:41:49.7294069Z     offset_0 = pid_0
2026-02-21T09:41:49.7294250Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T09:41:49.7294527Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:41:49.7294838Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T09:41:49.7295130Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:41:49.7295399Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T09:41:49.7295664Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:41:49.7295968Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T09:41:49.7296244Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T09:41:49.7296829Z     # src[softmax.py:82-89]: ...
2026-02-21T09:41:49.7297151Z     for offset_2 in tl.range(0, 6784, _BLOCK_SIZE_1, warp_specialize=True, num_stages=4, flatten=True):
2026-02-21T09:41:49.7297554Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:41:49.7297798Z         mask_1 = indices_2 < 6784
2026-02-21T09:41:49.7297975Z         mi_copy = mi
2026-02-21T09:41:49.7298122Z         di_copy = di
2026-02-21T09:41:49.7298264Z         mi_copy_0 = mi_copy
2026-02-21T09:41:49.7298433Z         di_copy_0 = di_copy
2026-02-21T09:41:49.7298611Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T09:41:49.7298983Z         values = tl.load(x + (indices_0[:, None] * 6784 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first')
2026-02-21T09:41:49.7299367Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T09:41:49.7299872Z         _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16))
2026-02-21T09:41:49.7300278Z         local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16)
2026-02-21T09:41:49.7300537Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T09:41:49.7300780Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T09:41:49.7300986Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T09:41:49.7301248Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:41:49.7301483Z         v_2 = mi_copy_0 - v_1
2026-02-21T09:41:49.7301703Z         v_3 = libdevice.exp(v_2)
2026-02-21T09:41:49.7301869Z         v_4 = di_copy_0 * v_3
2026-02-21T09:41:49.7302067Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T09:41:49.7302273Z         subscript = v_1[:, None]
2026-02-21T09:41:49.7302449Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T09:41:49.7302650Z         v_6 = v_5 - subscript
2026-02-21T09:41:49.7302866Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:41:49.7303140Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T09:41:49.7303357Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T09:41:49.7303555Z         v_7 = libdevice.exp(v_6)
2026-02-21T09:41:49.7303878Z         _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32))
2026-02-21T09:41:49.7304229Z         sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32)
2026-02-21T09:41:49.7304435Z         di = v_4 + sum_1
2026-02-21T09:41:49.7304592Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T09:41:49.7304768Z         mi = v_1
2026-02-21T09:41:49.7304962Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:41:49.7305231Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T09:41:49.7305527Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:41:49.7305908Z     for offset_2 in tl.range(0, 6784, _BLOCK_SIZE_1, warp_specialize=True, num_stages=4, flatten=True):
2026-02-21T09:41:49.7306255Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:41:49.7306480Z         mask_2 = indices_2 < 6784
2026-02-21T09:41:49.7306651Z         mi_copy_1 = mi
2026-02-21T09:41:49.7306796Z         di_copy_1 = di
2026-02-21T09:41:49.7306949Z         mi_copy_1_0 = mi_copy_1
2026-02-21T09:41:49.7307106Z         di_copy_1_0 = di_copy_1
2026-02-21T09:41:49.7307296Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T09:41:49.7307655Z         values_1 = tl.load(x + (indices_0[:, None] * 6784 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first')
2026-02-21T09:41:49.7308081Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:41:49.7308361Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T09:41:49.7308634Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T09:41:49.7308821Z         v_10 = v_9 - subscript_1
2026-02-21T09:41:49.7308988Z         v_11 = libdevice.exp(v_10)
2026-02-21T09:41:49.7309168Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T09:41:49.7309353Z         v_12 = v_11 / subscript_2
2026-02-21T09:41:49.7309522Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T09:41:49.7309795Z         tl.store(out + (indices_0[:, None] * 6784 + indices_2[None, :] * 1), v_13, mask_2[None, :])
2026-02-21T09:41:49.7310008Z 
2026-02-21T09:41:49.7310137Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T09:41:49.7310374Z     """
2026-02-21T09:41:49.7310576Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T09:41:49.7310885Z     This version uses fewer passes but is less numerically stable.
2026-02-21T09:41:49.7311110Z     Args:
2026-02-21T09:41:49.7311328Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T09:41:49.7311563Z     Returns:
2026-02-21T09:41:49.7311740Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T09:41:49.7311951Z     """
2026-02-21T09:41:49.7312086Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T09:41:49.7312269Z     m, n = x.size()
2026-02-21T09:41:49.7312434Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T09:41:49.7312639Z     out = torch.empty_like(x)
2026-02-21T09:41:49.7312870Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:41:49.7313183Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:41:49.7313504Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:41:49.7313745Z     # src[softmax.py:79-92]: ...
2026-02-21T09:41:49.7314012Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=6)
2026-02-21T09:41:49.7314284Z     # src[softmax.py:93]: return out
2026-02-21T09:41:49.7314476Z     return out
2026-02-21T09:41:50.6833239Z WARNING:tritonbench.utils.triton_op:Completed input ID 51:
2026-02-21T09:41:50.6835056Z (M, N)
2026-02-21T09:41:50.6835223Z ------------
2026-02-21T09:41:50.6835374Z (4096, 6784)
2026-02-21T09:41:50.6835453Z 
2026-02-21T09:41:50.6845245Z  55%|█████▌    | 11/20 [32:55<30:56, 206.28s/it]
2026-02-21T09:41:50.6845245Z WARNING:tritonbench.utils.triton_op:Running input ID 56:
2026-02-21T09:41:50.6846803Z (M, N)
2026-02-21T09:41:50.6847017Z ------------
2026-02-21T09:41:50.6847172Z (4096, 7424)
2026-02-21T09:41:50.6852214Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T09:41:51.8862556Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T09:41:53.3695168Z INFO:tritonbench.utils.triton_op:Took 2.33ms to get benchmark function for torch_compile_softmax
2026-02-21T09:41:54.7130296Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:41:54.7134592Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:41:54.7137832Z               'dtype': 'torch.float16',
2026-02-21T09:41:54.7142214Z               'shape': (4096, 7424),
2026-02-21T09:41:54.7143554Z               'stride': (7424, 1)},),
2026-02-21T09:41:54.7143775Z   'kwargs': {}}
2026-02-21T09:41:54.7156213Z INFO:tritonbench.utils.triton_op:Took 2.94ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:41:54.8891223Z [0s] Autotune random seed: 2138408546
2026-02-21T09:41:54.9142655Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:42:32.1148487Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s
2026-02-21T09:42:40.4993824Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 11.8 configs/s
2026-02-21T09:42:40.5004592Z [45s] Adaptive compile timeout: 30s (90% percentile=10.5s, bounds=[30.0s, 30s])
2026-02-21T09:42:41.1703655Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1466.1 configs/s
2026-02-21T09:42:41.2271369Z [46s] Initial random population of 100, 5 starting points: 
2026-02-21T09:42:41.2275834Z error=12
2026-02-21T09:42:41.2277346Z ok=88
2026-02-21T09:42:41.2277525Z min=0.0451
2026-02-21T09:42:41.2277655Z mid=0.6451
2026-02-21T09:42:41.2277786Z max=172.3709
2026-02-21T09:42:41.2277935Z best={'block_sizes': [1, 8192],
2026-02-21T09:42:41.2278200Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:42:41.2278490Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:42:41.2278702Z  'num_sm_multiplier': 2,
2026-02-21T09:42:41.2278863Z  'num_stages': 7,
2026-02-21T09:42:41.2279001Z  'num_warps': 32,
2026-02-21T09:42:41.2279157Z  'pid_type': 'persistent_blocked',
2026-02-21T09:42:41.2279339Z  'range_flattens': [True, True],
2026-02-21T09:42:41.2279517Z  'range_multi_buffers': [False, None],
2026-02-21T09:42:41.2279696Z  'range_num_stages': [4, 3],
2026-02-21T09:42:41.2279863Z  'range_unroll_factors': [2, 3],
2026-02-21T09:42:41.2280340Z  'range_warp_specializes': [False, False]}
2026-02-21T09:42:41.2291075Z [46s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:42:42.3191231Z [47s] Generation 1 starting: 76 neighbors, 5 active search path(s)
2026-02-21T09:43:02.1211244Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80/80 1.1 configs/s
2026-02-21T09:43:06.8472129Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 80/80 17.1 configs/s
2026-02-21T09:43:08.2875575Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 697.3 configs/s
2026-02-21T09:43:08.3888009Z [73s] Generation 1 complete: 
2026-02-21T09:43:08.3893200Z error=1
2026-02-21T09:43:08.3898432Z ok=81
2026-02-21T09:43:08.3902621Z min=0.0328
2026-02-21T09:43:08.3907134Z mid=0.0553
2026-02-21T09:43:08.3909505Z max=0.3358
2026-02-21T09:43:08.3909722Z best={'block_sizes': [1, 8192],
2026-02-21T09:43:08.3910106Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:43:08.3910469Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:43:08.3910700Z  'num_sm_multiplier': 4,
2026-02-21T09:43:08.3910875Z  'num_stages': 7,
2026-02-21T09:43:08.3911035Z  'num_warps': 8,
2026-02-21T09:43:08.3911377Z  'pid_type': 'persistent_blocked',
2026-02-21T09:43:08.3911721Z  'range_flattens': [False, True],
2026-02-21T09:43:08.3912036Z  'range_multi_buffers': [False, None],
2026-02-21T09:43:08.3912301Z  'range_num_stages': [4, 3],
2026-02-21T09:43:08.3912601Z  'range_unroll_factors': [2, 3],
2026-02-21T09:43:08.3912913Z  'range_warp_specializes': [False, False]}
2026-02-21T09:43:08.3913330Z [73s] Fitting surrogate: 182 points, 182 targets
2026-02-21T09:43:09.4914279Z [74s] Generation 2 starting: 71 neighbors, 5 active search path(s)
2026-02-21T09:43:18.8657849Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 10.4 configs/s
2026-02-21T09:43:23.2805853Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.9 configs/s
2026-02-21T09:43:28.5395609Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 215.9 configs/s
2026-02-21T09:43:28.8427634Z [93s] Generation 2 complete: 
2026-02-21T09:43:28.8432102Z ok=77
2026-02-21T09:43:28.8437260Z min=0.0307
2026-02-21T09:43:28.8441247Z mid=0.0431
2026-02-21T09:43:28.8443339Z max=0.4157
2026-02-21T09:43:28.8443650Z best={'block_sizes': [1, 8192],
2026-02-21T09:43:28.8448398Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:43:28.8449958Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:43:28.8450280Z  'num_stages': 7,
2026-02-21T09:43:28.8455256Z  'num_warps': 4,
2026-02-21T09:43:28.8459783Z  'pid_type': 'flat',
2026-02-21T09:43:28.8463477Z  'range_flattens': [None, True],
2026-02-21T09:43:28.8465496Z  'range_multi_buffers': [None, None],
2026-02-21T09:43:28.8465842Z  'range_num_stages': [0, 3],
2026-02-21T09:43:28.8466554Z  'range_unroll_factors': [0, 3],
2026-02-21T09:43:28.8470518Z  'range_warp_specializes': [None, False]}
2026-02-21T09:43:28.8474461Z [93s] Fitting surrogate: 259 points, 259 targets
2026-02-21T09:43:29.8156081Z [94s] Generation 3 starting: 64 neighbors, 5 active search path(s)
2026-02-21T09:43:47.5672143Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66/66 1.2 configs/s
2026-02-21T09:43:51.5429604Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 66/66 16.8 configs/s
2026-02-21T09:43:55.1715095Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 279.2 configs/s
2026-02-21T09:43:55.4186099Z [120s] Generation 3 complete: 
2026-02-21T09:43:55.4189771Z ok=69
2026-02-21T09:43:55.4194331Z min=0.0307
2026-02-21T09:43:55.4199415Z mid=0.0410
2026-02-21T09:43:55.4203352Z max=0.6092
2026-02-21T09:43:55.4208146Z best={'block_sizes': [1, 8192],
2026-02-21T09:43:55.4208872Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:43:55.4209173Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:43:55.4209373Z  'num_stages': 7,
2026-02-21T09:43:55.4209537Z  'num_warps': 4,
2026-02-21T09:43:55.4209694Z  'pid_type': 'flat',
2026-02-21T09:43:55.4209866Z  'range_flattens': [None, True],
2026-02-21T09:43:55.4210051Z  'range_multi_buffers': [None, None],
2026-02-21T09:43:55.4210248Z  'range_num_stages': [0, 3],
2026-02-21T09:43:55.4210419Z  'range_unroll_factors': [0, 3],
2026-02-21T09:43:55.4210613Z  'range_warp_specializes': [None, False]}
2026-02-21T09:43:55.4210831Z [120s] Fitting surrogate: 328 points, 328 targets
2026-02-21T09:43:56.2887391Z [121s] Generation 4 starting: 62 neighbors, 5 active search path(s)
2026-02-21T09:44:05.3317637Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 5.1 configs/s
2026-02-21T09:44:09.2168226Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 16.9 configs/s
2026-02-21T09:44:12.9197138Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 273.9 configs/s
2026-02-21T09:44:13.1792314Z [138s] Generation 4 complete: 
2026-02-21T09:44:13.1794078Z ok=68
2026-02-21T09:44:13.1794285Z min=0.0307
2026-02-21T09:44:13.1798404Z mid=0.0328
2026-02-21T09:44:13.1802885Z max=0.1454
2026-02-21T09:44:13.1806106Z best={'block_sizes': [1, 8192],
2026-02-21T09:44:13.1810639Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:44:13.1812047Z  'load_eviction_policies': ['', ''],
2026-02-21T09:44:13.1812268Z  'num_stages': 1,
2026-02-21T09:44:13.1812422Z  'num_warps': 4,
2026-02-21T09:44:13.1812568Z  'pid_type': 'flat',
2026-02-21T09:44:13.1812737Z  'range_flattens': [None, True],
2026-02-21T09:44:13.1812920Z  'range_multi_buffers': [None, None],
2026-02-21T09:44:13.1813149Z  'range_num_stages': [0, 3],
2026-02-21T09:44:13.1813329Z  'range_unroll_factors': [0, 1],
2026-02-21T09:44:13.1813517Z  'range_warp_specializes': [None, False]}
2026-02-21T09:44:13.1813743Z [138s] Fitting surrogate: 396 points, 396 targets
2026-02-21T09:44:13.9510041Z [139s] Generation 5 starting: 51 neighbors, 4 active search path(s)
2026-02-21T09:44:22.5035197Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 2.8 configs/s
2026-02-21T09:44:26.2028005Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 14.2 configs/s
2026-02-21T09:44:29.7077085Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 289.2 configs/s
2026-02-21T09:44:29.9520156Z [155s] Generation 5 complete: 
2026-02-21T09:44:29.9525450Z ok=55
2026-02-21T09:44:29.9526829Z min=0.0307
2026-02-21T09:44:29.9526995Z mid=0.0328
2026-02-21T09:44:29.9527127Z max=0.1474
2026-02-21T09:44:29.9527303Z best={'block_sizes': [1, 8192],
2026-02-21T09:44:29.9527911Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T09:44:29.9528149Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:44:29.9528340Z  'num_stages': 5,
2026-02-21T09:44:29.9528480Z  'num_warps': 4,
2026-02-21T09:44:29.9528626Z  'pid_type': 'flat',
2026-02-21T09:44:29.9528784Z  'range_flattens': [None, False],
2026-02-21T09:44:29.9528968Z  'range_multi_buffers': [None, None],
2026-02-21T09:44:29.9529154Z  'range_num_stages': [0, 2],
2026-02-21T09:44:29.9529318Z  'range_unroll_factors': [0, 1],
2026-02-21T09:44:29.9529502Z  'range_warp_specializes': [None, False]}
2026-02-21T09:44:29.9536743Z [155s] Fitting surrogate: 451 points, 451 targets
2026-02-21T09:44:30.7260040Z [155s] Generation 6 starting: 52 neighbors, 4 active search path(s)
2026-02-21T09:44:40.1558436Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 1.0 configs/s
2026-02-21T09:44:43.3287861Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 16.9 configs/s
2026-02-21T09:44:46.7268240Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 298.4 configs/s
2026-02-21T09:44:46.9578933Z [172s] Generation 6 complete: 
2026-02-21T09:44:46.9583602Z ok=56
2026-02-21T09:44:46.9585011Z min=0.0307
2026-02-21T09:44:46.9585180Z mid=0.0327
2026-02-21T09:44:46.9585304Z max=0.6860
2026-02-21T09:44:46.9585459Z best={'block_sizes': [1, 8192],
2026-02-21T09:44:46.9585714Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:44:46.9585983Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:44:46.9586168Z  'num_stages': 5,
2026-02-21T09:44:46.9586311Z  'num_warps': 4,
2026-02-21T09:44:46.9586450Z  'pid_type': 'flat',
2026-02-21T09:44:46.9586614Z  'range_flattens': [None, False],
2026-02-21T09:44:46.9586807Z  'range_multi_buffers': [None, None],
2026-02-21T09:44:46.9586991Z  'range_num_stages': [0, 2],
2026-02-21T09:44:46.9587197Z  'range_unroll_factors': [0, 1],
2026-02-21T09:44:46.9587398Z  'range_warp_specializes': [None, False]}
2026-02-21T09:44:46.9597770Z [172s] Fitting surrogate: 507 points, 507 targets
2026-02-21T09:44:47.6952763Z [172s] Generation 7 starting: 35 neighbors, 3 active search path(s)
2026-02-21T09:44:53.3498715Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 5.2 configs/s
2026-02-21T09:44:55.4962466Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 17.1 configs/s
2026-02-21T09:44:57.8564599Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 428.3 configs/s
2026-02-21T09:44:58.0225651Z [183s] Generation 7 complete: 
2026-02-21T09:44:58.0225898Z ok=38
2026-02-21T09:44:58.0230575Z min=0.0307
2026-02-21T09:44:58.0235142Z mid=0.0328
2026-02-21T09:44:58.0240186Z max=0.6861
2026-02-21T09:44:58.0241869Z best={'block_sizes': [1, 8192],
2026-02-21T09:44:58.0242181Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:44:58.0242804Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:44:58.0242994Z  'num_stages': 5,
2026-02-21T09:44:58.0243144Z  'num_warps': 4,
2026-02-21T09:44:58.0243287Z  'pid_type': 'flat',
2026-02-21T09:44:58.0243451Z  'range_flattens': [None, False],
2026-02-21T09:44:58.0243632Z  'range_multi_buffers': [None, False],
2026-02-21T09:44:58.0243822Z  'range_num_stages': [0, 3],
2026-02-21T09:44:58.0243987Z  'range_unroll_factors': [0, 1],
2026-02-21T09:44:58.0244173Z  'range_warp_specializes': [None, False]}
2026-02-21T09:44:58.0244398Z [183s] Fitting surrogate: 545 points, 545 targets
2026-02-21T09:44:58.4634572Z [183s] Generation 8 starting: 13 neighbors, 1 active search path(s)
2026-02-21T09:45:01.9451036Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 1.2 configs/s
2026-02-21T09:45:02.7334226Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 13/13 17.5 configs/s
2026-02-21T09:45:03.6063031Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1146.8 configs/s
2026-02-21T09:45:03.6801198Z [188s] Generation 8 complete: 
2026-02-21T09:45:03.6802708Z ok=14
2026-02-21T09:45:03.6802920Z min=0.0307
2026-02-21T09:45:03.6803096Z mid=0.0307
2026-02-21T09:45:03.6803329Z max=0.0656
2026-02-21T09:45:03.6807723Z best={'block_sizes': [1, 8192],
2026-02-21T09:45:03.6811206Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:45:03.6815135Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:45:03.6818996Z  'num_stages': 6,
2026-02-21T09:45:03.6823483Z  'num_warps': 4,
2026-02-21T09:45:03.6824842Z  'pid_type': 'flat',
2026-02-21T09:45:03.6825129Z  'range_flattens': [None, True],
2026-02-21T09:45:03.6825400Z  'range_multi_buffers': [None, False],
2026-02-21T09:45:03.6825635Z  'range_num_stages': [0, 3],
2026-02-21T09:45:03.6825892Z  'range_unroll_factors': [0, 1],
2026-02-21T09:45:03.6826138Z  'range_warp_specializes': [None, False]}
2026-02-21T09:45:03.6826532Z [188s] Fitting surrogate: 559 points, 559 targets
2026-02-21T09:45:04.0930037Z [189s] Generation 9 starting: 13 neighbors, 1 active search path(s)
2026-02-21T09:45:07.5811015Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 4.8 configs/s
2026-02-21T09:45:08.3675224Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 13/13 17.5 configs/s
2026-02-21T09:45:09.2723969Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1105.6 configs/s
2026-02-21T09:45:09.3432853Z [194s] Generation 9 complete: 
2026-02-21T09:45:09.3434719Z ok=14
2026-02-21T09:45:09.3434958Z min=0.0307
2026-02-21T09:45:09.3435205Z mid=0.0307
2026-02-21T09:45:09.3440078Z max=0.0635
2026-02-21T09:45:09.3441530Z best={'block_sizes': [1, 8192],
2026-02-21T09:45:09.3442124Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:45:09.3442446Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:45:09.3442708Z  'num_stages': 6,
2026-02-21T09:45:09.3442891Z  'num_warps': 4,
2026-02-21T09:45:09.3443103Z  'pid_type': 'flat',
2026-02-21T09:45:09.3443302Z  'range_flattens': [None, True],
2026-02-21T09:45:09.3443552Z  'range_multi_buffers': [None, False],
2026-02-21T09:45:09.3443779Z  'range_num_stages': [0, 3],
2026-02-21T09:45:09.3444018Z  'range_unroll_factors': [0, 1],
2026-02-21T09:45:09.3444264Z  'range_warp_specializes': [None, False]}
2026-02-21T09:45:09.3463491Z [194s] Fitting surrogate: 573 points, 573 targets
2026-02-21T09:45:09.6355063Z [194s] Autotuning complete in 194.7s after searching 552 configs.
2026-02-21T09:45:09.6357201Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:45:09.6358310Z     @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], num_stages=6, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, False]), static_shapes=True)
2026-02-21T09:45:09.6359509Z 
2026-02-21T09:45:09.6359823Z [194s] Code of selected kernel: /tmp/torchinductor_root/xa/cxazsujyi2oj7h2x3luypkwqyn3moastc6gnnus24dsfv6e5nbnw.py
2026-02-21T09:45:09.6580873Z from __future__ import annotations
2026-02-21T09:45:09.6581206Z 
2026-02-21T09:45:09.6581416Z import torch
2026-02-21T09:45:09.6581692Z import triton
2026-02-21T09:45:09.6581977Z import triton.language as tl
2026-02-21T09:45:09.6587056Z from torch._inductor.runtime import triton_helpers
2026-02-21T09:45:09.6587529Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T09:45:09.6587937Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:45:09.6588214Z 
2026-02-21T09:45:09.6588379Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T09:45:09.6588938Z _BLOCK_SIZE_1 = tl.constexpr(8192)
2026-02-21T09:45:09.6589146Z 
2026-02-21T09:45:09.6589323Z @triton.jit
2026-02-21T09:45:09.6589603Z def _helion_softmax_two_pass(x, out):
2026-02-21T09:45:09.6589998Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:45:09.6590358Z     pid_0 = tl.program_id(0)
2026-02-21T09:45:09.6590655Z     offset_0 = pid_0
2026-02-21T09:45:09.6590895Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T09:45:09.6591311Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:45:09.6591792Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T09:45:09.6592124Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:45:09.6592461Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T09:45:09.6592772Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:45:09.6593101Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T09:45:09.6593438Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T09:45:09.6593718Z     # src[softmax.py:82-89]: ...
2026-02-21T09:45:09.6594223Z     for offset_2 in tl.range(0, 7424, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=False, num_stages=3, disallow_acc_multi_buffer=True, flatten=True):
2026-02-21T09:45:09.6594760Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:45:09.6595077Z         mask_1 = indices_2 < 7424
2026-02-21T09:45:09.6595321Z         mi_copy = mi
2026-02-21T09:45:09.6595520Z         di_copy = di
2026-02-21T09:45:09.6595727Z         mi_copy_0 = mi_copy
2026-02-21T09:45:09.6595912Z         di_copy_0 = di_copy
2026-02-21T09:45:09.6596139Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T09:45:09.6596542Z         values = tl.load(x + (indices_0[:, None] * 7424 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_last')
2026-02-21T09:45:09.6596994Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T09:45:09.6597468Z         _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16))
2026-02-21T09:45:09.6597894Z         local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16)
2026-02-21T09:45:09.6598216Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T09:45:09.6598490Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T09:45:09.6598767Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T09:45:09.6599064Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:45:09.6599367Z         v_2 = mi_copy_0 - v_1
2026-02-21T09:45:09.6599592Z         v_3 = libdevice.exp(v_2)
2026-02-21T09:45:09.6599769Z         v_4 = di_copy_0 * v_3
2026-02-21T09:45:09.6600021Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T09:45:09.6600262Z         subscript = v_1[:, None]
2026-02-21T09:45:09.6600601Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T09:45:09.6600825Z         v_6 = v_5 - subscript
2026-02-21T09:45:09.6601110Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:45:09.6601429Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T09:45:09.6601740Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T09:45:09.6601988Z         v_7 = libdevice.exp(v_6)
2026-02-21T09:45:09.6602347Z         _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32))
2026-02-21T09:45:09.6602773Z         sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32)
2026-02-21T09:45:09.6603018Z         di = v_4 + sum_1
2026-02-21T09:45:09.6603250Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T09:45:09.6603460Z         mi = v_1
2026-02-21T09:45:09.6603726Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:45:09.6604132Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T09:45:09.6604468Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:45:09.6605035Z     for offset_2 in tl.range(0, 7424, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=False, num_stages=3, disallow_acc_multi_buffer=True, flatten=True):
2026-02-21T09:45:09.6605535Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:45:09.6605837Z         mask_2 = indices_2 < 7424
2026-02-21T09:45:09.6606074Z         mi_copy_1 = mi
2026-02-21T09:45:09.6606264Z         di_copy_1 = di
2026-02-21T09:45:09.6606484Z         mi_copy_1_0 = mi_copy_1
2026-02-21T09:45:09.6606688Z         di_copy_1_0 = di_copy_1
2026-02-21T09:45:09.6606948Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T09:45:09.6607351Z         values_1 = tl.load(x + (indices_0[:, None] * 7424 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_last')
2026-02-21T09:45:09.6607853Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:45:09.6608172Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T09:45:09.6608421Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T09:45:09.6608660Z         v_10 = v_9 - subscript_1
2026-02-21T09:45:09.6608869Z         v_11 = libdevice.exp(v_10)
2026-02-21T09:45:09.6609116Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T09:45:09.6609336Z         v_12 = v_11 / subscript_2
2026-02-21T09:45:09.6609567Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T09:45:09.6609883Z         tl.store(out + (indices_0[:, None] * 7424 + indices_2[None, :] * 1), v_13, mask_2[None, :])
2026-02-21T09:45:09.6610144Z 
2026-02-21T09:45:09.6610289Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T09:45:09.6610590Z     """
2026-02-21T09:45:09.6610828Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T09:45:09.6611203Z     This version uses fewer passes but is less numerically stable.
2026-02-21T09:45:09.6611461Z     Args:
2026-02-21T09:45:09.6611731Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T09:45:09.6611959Z     Returns:
2026-02-21T09:45:09.6612201Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T09:45:09.6612479Z     """
2026-02-21T09:45:09.6612664Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T09:45:09.6612920Z     m, n = x.size()
2026-02-21T09:45:09.6613149Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T09:45:09.6613419Z     out = torch.empty_like(x)
2026-02-21T09:45:09.6613684Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:45:09.6614054Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:45:09.6614429Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:45:09.6614699Z     # src[softmax.py:79-92]: ...
2026-02-21T09:45:09.6615076Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=6)
2026-02-21T09:45:09.6615417Z     # src[softmax.py:93]: return out
2026-02-21T09:45:09.6615672Z     return out
2026-02-21T09:45:10.1774765Z WARNING:tritonbench.utils.triton_op:Completed input ID 56:
2026-02-21T09:45:10.1779071Z (M, N)
2026-02-21T09:45:10.1782995Z ------------
2026-02-21T09:45:10.1787582Z (4096, 7424)
2026-02-21T09:45:10.1789295Z 
2026-02-21T09:45:10.1789890Z  60%|██████    | 12/20 [36:15<27:13, 204.22s/it]
2026-02-21T09:45:10.1789890Z WARNING:tritonbench.utils.triton_op:Running input ID 61:
2026-02-21T09:45:10.1790239Z (M, N)
2026-02-21T09:45:10.1790439Z ------------
2026-02-21T09:45:10.1790618Z (4096, 8064)
2026-02-21T09:45:10.1790995Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T09:45:11.3456929Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T09:45:12.8374712Z INFO:tritonbench.utils.triton_op:Took 2.29ms to get benchmark function for torch_compile_softmax
2026-02-21T09:45:14.1619411Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:45:14.1623440Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:45:14.1624907Z               'dtype': 'torch.float16',
2026-02-21T09:45:14.1625221Z               'shape': (4096, 8064),
2026-02-21T09:45:14.1625457Z               'stride': (8064, 1)},),
2026-02-21T09:45:14.1625712Z   'kwargs': {}}
2026-02-21T09:45:14.1640845Z INFO:tritonbench.utils.triton_op:Took 2.34ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:45:14.3378789Z [0s] Autotune random seed: 2138408546
2026-02-21T09:45:14.3630531Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:45:50.8322637Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s
2026-02-21T09:45:59.4848212Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 11.5 configs/s
2026-02-21T09:45:59.4856932Z [45s] Adaptive compile timeout: 30s (90% percentile=11.2s, bounds=[30.0s, 30s])
2026-02-21T09:46:00.0609302Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1695.9 configs/s
2026-02-21T09:46:00.1126530Z [45s] Initial random population of 100, 5 starting points: 
2026-02-21T09:46:00.1130790Z error=12
2026-02-21T09:46:00.1132384Z ok=88
2026-02-21T09:46:00.1132692Z min=0.0451
2026-02-21T09:46:00.1137462Z mid=0.6431
2026-02-21T09:46:00.1138799Z max=187.0090
2026-02-21T09:46:00.1139072Z best={'block_sizes': [1, 8192],
2026-02-21T09:46:00.1139494Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:46:00.1144679Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:46:00.1146068Z  'num_sm_multiplier': 2,
2026-02-21T09:46:00.1146352Z  'num_stages': 7,
2026-02-21T09:46:00.1146543Z  'num_warps': 32,
2026-02-21T09:46:00.1146796Z  'pid_type': 'persistent_blocked',
2026-02-21T09:46:00.1147029Z  'range_flattens': [True, True],
2026-02-21T09:46:00.1147308Z  'range_multi_buffers': [False, None],
2026-02-21T09:46:00.1147556Z  'range_num_stages': [4, 3],
2026-02-21T09:46:00.1147793Z  'range_unroll_factors': [2, 3],
2026-02-21T09:46:00.1148040Z  'range_warp_specializes': [False, False]}
2026-02-21T09:46:00.1148384Z [45s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:46:01.1359140Z [46s] Generation 1 starting: 76 neighbors, 5 active search path(s)
2026-02-21T09:46:21.2927857Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 1.4 configs/s
2026-02-21T09:46:26.0370386Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 16.8 configs/s
2026-02-21T09:46:29.0537718Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 334.6 configs/s
2026-02-21T09:46:29.2292580Z [74s] Generation 1 complete: 
2026-02-21T09:46:29.2296944Z ok=82
2026-02-21T09:46:29.2301303Z min=0.0328
2026-02-21T09:46:29.2306530Z mid=0.0553
2026-02-21T09:46:29.2310822Z max=0.4742
2026-02-21T09:46:29.2314123Z best={'block_sizes': [1, 8192],
2026-02-21T09:46:29.2316142Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:46:29.2316583Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:46:29.2316884Z  'num_sm_multiplier': 4,
2026-02-21T09:46:29.2320787Z  'num_stages': 7,
2026-02-21T09:46:29.2324535Z  'num_warps': 8,
2026-02-21T09:46:29.2329591Z  'pid_type': 'persistent_blocked',
2026-02-21T09:46:29.2332888Z  'range_flattens': [False, True],
2026-02-21T09:46:29.2336681Z  'range_multi_buffers': [False, None],
2026-02-21T09:46:29.2340970Z  'range_num_stages': [4, 3],
2026-02-21T09:46:29.2345449Z  'range_unroll_factors': [2, 3],
2026-02-21T09:46:29.2350398Z  'range_warp_specializes': [False, False]}
2026-02-21T09:46:29.2354720Z [74s] Fitting surrogate: 182 points, 182 targets
2026-02-21T09:46:30.2292022Z [75s] Generation 2 starting: 78 neighbors, 5 active search path(s)
2026-02-21T09:46:42.5202419Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 3.2 configs/s
2026-02-21T09:46:47.4642207Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 83/83 16.9 configs/s
2026-02-21T09:46:52.6621297Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 217.2 configs/s
2026-02-21T09:46:52.9411195Z [98s] Generation 2 complete: 
2026-02-21T09:46:52.9415365Z ok=84
2026-02-21T09:46:52.9419783Z min=0.0328
2026-02-21T09:46:52.9424830Z mid=0.0451
2026-02-21T09:46:52.9426356Z max=0.2847
2026-02-21T09:46:52.9426580Z best={'block_sizes': [1, 8192],
2026-02-21T09:46:52.9426909Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:46:52.9427227Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:46:52.9427491Z  'num_stages': 1,
2026-02-21T09:46:52.9427707Z  'num_warps': 8,
2026-02-21T09:46:52.9427892Z  'pid_type': 'flat',
2026-02-21T09:46:52.9428160Z  'range_flattens': [None, True],
2026-02-21T09:46:52.9428406Z  'range_multi_buffers': [None, False],
2026-02-21T09:46:52.9428660Z  'range_num_stages': [0, 4],
2026-02-21T09:46:52.9428868Z  'range_unroll_factors': [0, 0],
2026-02-21T09:46:52.9429118Z  'range_warp_specializes': [None, True]}
2026-02-21T09:46:52.9429369Z [98s] Fitting surrogate: 266 points, 266 targets
2026-02-21T09:46:53.9876345Z [99s] Generation 3 starting: 74 neighbors, 5 active search path(s)
2026-02-21T09:47:06.3188125Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 2.0 configs/s
2026-02-21T09:47:10.7616048Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.8 configs/s
2026-02-21T09:47:15.9740401Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 194.5 configs/s
2026-02-21T09:47:16.2992423Z [121s] Generation 3 complete: 
2026-02-21T09:47:16.2996554Z ok=79
2026-02-21T09:47:16.3000047Z min=0.0307
2026-02-21T09:47:16.3001673Z mid=0.0369
2026-02-21T09:47:16.3001974Z max=0.4690
2026-02-21T09:47:16.3002198Z best={'block_sizes': [1, 8192],
2026-02-21T09:47:16.3002526Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:47:16.3007140Z  'load_eviction_policies': ['', ''],
2026-02-21T09:47:16.3009127Z  'num_stages': 1,
2026-02-21T09:47:16.3009393Z  'num_warps': 4,
2026-02-21T09:47:16.3009587Z  'pid_type': 'flat',
2026-02-21T09:47:16.3009821Z  'range_flattens': [None, True],
2026-02-21T09:47:16.3010015Z  'range_multi_buffers': [None, None],
2026-02-21T09:47:16.3010269Z  'range_num_stages': [0, 4],
2026-02-21T09:47:16.3010464Z  'range_unroll_factors': [0, 1],
2026-02-21T09:47:16.3010712Z  'range_warp_specializes': [None, False]}
2026-02-21T09:47:16.3011073Z [121s] Fitting surrogate: 345 points, 345 targets
2026-02-21T09:47:17.2920750Z [122s] Generation 4 starting: 73 neighbors, 5 active search path(s)
2026-02-21T09:47:26.7697037Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 19.8 configs/s
2026-02-21T09:47:31.2046305Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.8 configs/s
2026-02-21T09:47:37.1531398Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 188.9 configs/s
2026-02-21T09:47:37.5020541Z [143s] Generation 4 complete: 
2026-02-21T09:47:37.5024652Z ok=78
2026-02-21T09:47:37.5026093Z min=0.0307
2026-02-21T09:47:37.5026336Z mid=0.0328
2026-02-21T09:47:37.5026500Z max=0.1352
2026-02-21T09:47:37.5026681Z best={'block_sizes': [1, 8192],
2026-02-21T09:47:37.5026942Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:47:37.5027249Z  'load_eviction_policies': ['', ''],
2026-02-21T09:47:37.5027494Z  'num_stages': 1,
2026-02-21T09:47:37.5027676Z  'num_warps': 4,
2026-02-21T09:47:37.5027886Z  'pid_type': 'flat',
2026-02-21T09:47:37.5028089Z  'range_flattens': [None, True],
2026-02-21T09:47:37.5028763Z  'range_multi_buffers': [None, None],
2026-02-21T09:47:37.5029027Z  'range_num_stages': [0, 4],
2026-02-21T09:47:37.5029264Z  'range_unroll_factors': [0, 1],
2026-02-21T09:47:37.5029489Z  'range_warp_specializes': [None, False]}
2026-02-21T09:47:37.5038644Z [143s] Fitting surrogate: 423 points, 423 targets
2026-02-21T09:47:38.1474305Z [143s] Generation 5 starting: 41 neighbors, 3 active search path(s)
2026-02-21T09:47:43.9479220Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41/41 4.9 configs/s
2026-02-21T09:47:46.4118104Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 16.9 configs/s
2026-02-21T09:47:49.5068546Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 327.0 configs/s
2026-02-21T09:47:49.7137354Z [155s] Generation 5 complete: 
2026-02-21T09:47:49.7142239Z ok=45
2026-02-21T09:47:49.7147109Z min=0.0307
2026-02-21T09:47:49.7148375Z mid=0.0328
2026-02-21T09:47:49.7148628Z max=0.0615
2026-02-21T09:47:49.7148817Z best={'block_sizes': [1, 8192],
2026-02-21T09:47:49.7149130Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:47:49.7149438Z  'load_eviction_policies': ['', ''],
2026-02-21T09:47:49.7149659Z  'num_stages': 1,
2026-02-21T09:47:49.7149872Z  'num_warps': 4,
2026-02-21T09:47:49.7150059Z  'pid_type': 'flat',
2026-02-21T09:47:49.7150255Z  'range_flattens': [None, True],
2026-02-21T09:47:49.7150477Z  'range_multi_buffers': [None, None],
2026-02-21T09:47:49.7150735Z  'range_num_stages': [0, 4],
2026-02-21T09:47:49.7150947Z  'range_unroll_factors': [0, 1],
2026-02-21T09:47:49.7151204Z  'range_warp_specializes': [None, False]}
2026-02-21T09:47:49.7157243Z [155s] Fitting surrogate: 468 points, 468 targets
2026-02-21T09:47:50.1138136Z [155s] Generation 6 starting: 17 neighbors, 2 active search path(s)
2026-02-21T09:47:52.4098451Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 16.8 configs/s
2026-02-21T09:47:53.4093172Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 17.8 configs/s
2026-02-21T09:47:54.9233788Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 665.1 configs/s
2026-02-21T09:47:55.0373559Z [160s] Generation 6 complete: 
2026-02-21T09:47:55.0377980Z ok=20
2026-02-21T09:47:55.0382989Z min=0.0307
2026-02-21T09:47:55.0387389Z mid=0.0328
2026-02-21T09:47:55.0391310Z max=0.0460
2026-02-21T09:47:55.0395726Z best={'block_sizes': [1, 8192],
2026-02-21T09:47:55.0400178Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:47:55.0403957Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:47:55.0404342Z  'num_stages': 2,
2026-02-21T09:47:55.0404588Z  'num_warps': 4,
2026-02-21T09:47:55.0408585Z  'pid_type': 'flat',
2026-02-21T09:47:55.0412901Z  'range_flattens': [None, True],
2026-02-21T09:47:55.0417634Z  'range_multi_buffers': [None, None],
2026-02-21T09:47:55.0421776Z  'range_num_stages': [0, 4],
2026-02-21T09:47:55.0426306Z  'range_unroll_factors': [0, 0],
2026-02-21T09:47:55.0428495Z  'range_warp_specializes': [None, True]}
2026-02-21T09:47:55.0428884Z [160s] Fitting surrogate: 488 points, 488 targets
2026-02-21T09:47:55.2055290Z [160s] Autotuning complete in 160.8s after searching 468 configs.
2026-02-21T09:47:55.2059114Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:47:55.2060409Z     @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T09:47:55.2061366Z 
2026-02-21T09:47:55.2061929Z [160s] Code of selected kernel: /tmp/torchinductor_root/qj/cqjadztvwdqy7k5dxsiy766h4h5udvc56cnlx2p24djkf67drx4o.py
2026-02-21T09:47:55.2289794Z from __future__ import annotations
2026-02-21T09:47:55.2293437Z 
2026-02-21T09:47:55.2294910Z import torch
2026-02-21T09:47:55.2295162Z import triton
2026-02-21T09:47:55.2295390Z import triton.language as tl
2026-02-21T09:47:55.2295648Z from torch._inductor.runtime import triton_helpers
2026-02-21T09:47:55.2295990Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T09:47:55.2296323Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:47:55.2296552Z 
2026-02-21T09:47:55.2296643Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T09:47:55.2296863Z _BLOCK_SIZE_1 = tl.constexpr(8192)
2026-02-21T09:47:55.2297025Z 
2026-02-21T09:47:55.2297104Z @triton.jit
2026-02-21T09:47:55.2297319Z def _helion_softmax_two_pass(x, out):
2026-02-21T09:47:55.2297611Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:47:55.2297918Z     pid_0 = tl.program_id(0)
2026-02-21T09:47:55.2298219Z     offset_0 = pid_0
2026-02-21T09:47:55.2298498Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T09:47:55.2304798Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:47:55.2309272Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T09:47:55.2311244Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:47:55.2311664Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T09:47:55.2312004Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:47:55.2312318Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T09:47:55.2312639Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T09:47:55.2312932Z     # src[softmax.py:82-89]: ...
2026-02-21T09:47:55.2313267Z     for offset_2 in tl.range(0, 8064, _BLOCK_SIZE_1, warp_specialize=True, num_stages=4, flatten=True):
2026-02-21T09:47:55.2313700Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:47:55.2314277Z         mask_1 = indices_2 < 8064
2026-02-21T09:47:55.2314517Z         mi_copy = mi
2026-02-21T09:47:55.2314702Z         di_copy = di
2026-02-21T09:47:55.2314914Z         mi_copy_0 = mi_copy
2026-02-21T09:47:55.2315110Z         di_copy_0 = di_copy
2026-02-21T09:47:55.2315363Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T09:47:55.2315804Z         values = tl.load(x + (indices_0[:, None] * 8064 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first')
2026-02-21T09:47:55.2316232Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T09:47:55.2316706Z         _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16))
2026-02-21T09:47:55.2317133Z         local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16)
2026-02-21T09:47:55.2317527Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T09:47:55.2317825Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T09:47:55.2318076Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T09:47:55.2318391Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:47:55.2318673Z         v_2 = mi_copy_0 - v_1
2026-02-21T09:47:55.2318916Z         v_3 = libdevice.exp(v_2)
2026-02-21T09:47:55.2319126Z         v_4 = di_copy_0 * v_3
2026-02-21T09:47:55.2319387Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T09:47:55.2319626Z         subscript = v_1[:, None]
2026-02-21T09:47:55.2319866Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T09:47:55.2320107Z         v_6 = v_5 - subscript
2026-02-21T09:47:55.2320356Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:47:55.2320679Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T09:47:55.2320930Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T09:47:55.2321204Z         v_7 = libdevice.exp(v_6)
2026-02-21T09:47:55.2321638Z         _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32))
2026-02-21T09:47:55.2322034Z         sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32)
2026-02-21T09:47:55.2322306Z         di = v_4 + sum_1
2026-02-21T09:47:55.2322509Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T09:47:55.2322757Z         mi = v_1
2026-02-21T09:47:55.2323001Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:47:55.2323340Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T09:47:55.2323701Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:47:55.2324127Z     for offset_2 in tl.range(0, 8064, _BLOCK_SIZE_1, warp_specialize=True, num_stages=4, flatten=True):
2026-02-21T09:47:55.2324543Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:47:55.2324819Z         mask_2 = indices_2 < 8064
2026-02-21T09:47:55.2325056Z         mi_copy_1 = mi
2026-02-21T09:47:55.2325243Z         di_copy_1 = di
2026-02-21T09:47:55.2325464Z         mi_copy_1_0 = mi_copy_1
2026-02-21T09:47:55.2325669Z         di_copy_1_0 = di_copy_1
2026-02-21T09:47:55.2325922Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T09:47:55.2326355Z         values_1 = tl.load(x + (indices_0[:, None] * 8064 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first')
2026-02-21T09:47:55.2326827Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:47:55.2327168Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T09:47:55.2327400Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T09:47:55.2327646Z         v_10 = v_9 - subscript_1
2026-02-21T09:47:55.2327851Z         v_11 = libdevice.exp(v_10)
2026-02-21T09:47:55.2328092Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T09:47:55.2328336Z         v_12 = v_11 / subscript_2
2026-02-21T09:47:55.2328631Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T09:47:55.2328974Z         tl.store(out + (indices_0[:, None] * 8064 + indices_2[None, :] * 1), v_13, mask_2[None, :])
2026-02-21T09:47:55.2329209Z 
2026-02-21T09:47:55.2329359Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T09:47:55.2329661Z     """
2026-02-21T09:47:55.2329906Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T09:47:55.2330280Z     This version uses fewer passes but is less numerically stable.
2026-02-21T09:47:55.2330569Z     Args:
2026-02-21T09:47:55.2330766Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T09:47:55.2331026Z     Returns:
2026-02-21T09:47:55.2331241Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T09:47:55.2331512Z     """
2026-02-21T09:47:55.2331727Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T09:47:55.2331967Z     m, n = x.size()
2026-02-21T09:47:55.2332227Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T09:47:55.2332504Z     out = torch.empty_like(x)
2026-02-21T09:47:55.2332799Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:47:55.2333170Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:47:55.2333552Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:47:55.2333837Z     # src[softmax.py:79-92]: ...
2026-02-21T09:47:55.2334173Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=2)
2026-02-21T09:47:55.2334503Z     # src[softmax.py:93]: return out
2026-02-21T09:47:55.2334749Z     return out
2026-02-21T09:47:56.3013641Z WARNING:tritonbench.utils.triton_op:Completed input ID 61:
2026-02-21T09:47:56.3018060Z (M, N)
2026-02-21T09:47:56.3019392Z ------------
2026-02-21T09:47:56.3019651Z (4096, 8064)
2026-02-21T09:47:56.3019823Z 
2026-02-21T09:47:56.3025633Z  65%|██████▌   | 13/20 [39:01<22:28, 192.68s/it]WARNING:tritonbench.utils.triton_op:Running input ID 66:
2026-02-21T09:47:56.3026423Z (M, N)
2026-02-21T09:47:56.3026618Z ------------
2026-02-21T09:47:56.3026876Z (4096, 8704)
2026-02-21T09:47:56.3027215Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for naive_softmax
2026-02-21T09:47:57.4606067Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T09:47:58.7412181Z INFO:tritonbench.utils.triton_op:Took 2.40ms to get benchmark function for torch_compile_softmax
2026-02-21T09:48:00.0113709Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:48:00.0117874Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:48:00.0122277Z               'dtype': 'torch.float16',
2026-02-21T09:48:00.0126570Z               'shape': (4096, 8704),
2026-02-21T09:48:00.0127990Z               'stride': (8704, 1)},),
2026-02-21T09:48:00.0128300Z   'kwargs': {}}
2026-02-21T09:48:00.0134955Z INFO:tritonbench.utils.triton_op:Took 2.33ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:48:00.1880129Z [0s] Autotune random seed: 2138408546
2026-02-21T09:48:00.2132384Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:48:35.5266458Z [35s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', ''], maxnreg=32, num_sm_multiplier=2, num_stages=8, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[3, 2], range_unroll_factors=[4, 1], range_warp_specializes=[False, False])
2026-02-21T09:48:37.9747102Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s
2026-02-21T09:48:46.7199986Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 11.3 configs/s
2026-02-21T09:48:46.7212577Z [46s] Adaptive compile timeout: 30s (90% percentile=12.3s, bounds=[30.0s, 30s])
2026-02-21T09:48:47.9035033Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 836.4 configs/s
2026-02-21T09:48:47.9779976Z [47s] Initial random population of 100, 5 starting points: 
2026-02-21T09:48:47.9783216Z error=11
2026-02-21T09:48:47.9787666Z timeout=1
2026-02-21T09:48:47.9791514Z ok=88
2026-02-21T09:48:47.9793255Z min=0.0635
2026-02-21T09:48:47.9793495Z mid=0.4302
2026-02-21T09:48:47.9796574Z max=203.7668
2026-02-21T09:48:47.9796804Z best={'block_sizes': [1, 1024],
2026-02-21T09:48:47.9797156Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:48:47.9797443Z  'load_eviction_policies': ['first', 'last'],
2026-02-21T09:48:47.9798819Z  'num_stages': 6,
2026-02-21T09:48:47.9799029Z  'num_warps': 4,
2026-02-21T09:48:47.9799235Z  'pid_type': 'flat',
2026-02-21T09:48:47.9804148Z  'range_flattens': [None, None],
2026-02-21T09:48:47.9807099Z  'range_multi_buffers': [None, True],
2026-02-21T09:48:47.9810978Z  'range_num_stages': [0, 0],
2026-02-21T09:48:47.9815092Z  'range_unroll_factors': [0, 4],
2026-02-21T09:48:47.9819179Z  'range_warp_specializes': [None, False]}
2026-02-21T09:48:47.9823365Z [47s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:48:49.7116290Z [49s] Generation 1 starting: 84 neighbors, 5 active search path(s)
2026-02-21T09:49:02.3013196Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 11.3 configs/s
2026-02-21T09:49:07.3687946Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 17.1 configs/s
2026-02-21T09:49:12.8999140Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 182.6 configs/s
2026-02-21T09:49:13.1771114Z [72s] Generation 1 complete: 
2026-02-21T09:49:13.1772571Z ok=89
2026-02-21T09:49:13.1772797Z min=0.0492
2026-02-21T09:49:13.1773011Z mid=0.0676
2026-02-21T09:49:13.1773199Z max=0.4833
2026-02-21T09:49:13.1773408Z best={'block_sizes': [2, 512],
2026-02-21T09:49:13.1773791Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:49:13.1774146Z  'load_eviction_policies': ['first', ''],
2026-02-21T09:49:13.1774369Z  'num_stages': 3,
2026-02-21T09:49:13.1774576Z  'num_warps': 1,
2026-02-21T09:49:13.1774763Z  'pid_type': 'flat',
2026-02-21T09:49:13.1774991Z  'range_flattens': [None, False],
2026-02-21T09:49:13.1775218Z  'range_multi_buffers': [None, False],
2026-02-21T09:49:13.1775472Z  'range_num_stages': [0, 3],
2026-02-21T09:49:13.1775707Z  'range_unroll_factors': [0, 1],
2026-02-21T09:49:13.1775926Z  'range_warp_specializes': [None, False]}
2026-02-21T09:49:13.1787946Z [72s] Fitting surrogate: 189 points, 189 targets
2026-02-21T09:49:14.4563931Z [74s] Generation 2 starting: 75 neighbors, 5 active search path(s)
2026-02-21T09:49:27.0144879Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 5.2 configs/s
2026-02-21T09:49:31.5472950Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 17.2 configs/s
2026-02-21T09:49:38.8004899Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 139.4 configs/s
2026-02-21T09:49:39.1875538Z [98s] Generation 2 complete: 
2026-02-21T09:49:39.1880487Z ok=80
2026-02-21T09:49:39.1884183Z min=0.0451
2026-02-21T09:49:39.1887329Z mid=0.0572
2026-02-21T09:49:39.1891902Z max=0.1965
2026-02-21T09:49:39.1895181Z best={'block_sizes': [1, 512],
2026-02-21T09:49:39.1899538Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:49:39.1903280Z  'load_eviction_policies': ['first', ''],
2026-02-21T09:49:39.1903638Z  'num_stages': 2,
2026-02-21T09:49:39.1907171Z  'num_warps': 1,
2026-02-21T09:49:39.1909214Z  'pid_type': 'flat',
2026-02-21T09:49:39.1909474Z  'range_flattens': [None, False],
2026-02-21T09:49:39.1909796Z  'range_multi_buffers': [None, False],
2026-02-21T09:49:39.1910031Z  'range_num_stages': [0, 3],
2026-02-21T09:49:39.1912959Z  'range_unroll_factors': [0, 1],
2026-02-21T09:49:39.1915468Z  'range_warp_specializes': [None, False]}
2026-02-21T09:49:39.1915761Z [98s] Fitting surrogate: 269 points, 269 targets
2026-02-21T09:49:40.1649386Z [99s] Generation 3 starting: 68 neighbors, 5 active search path(s)
2026-02-21T09:50:14.0756934Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 0.2 configs/s
2026-02-21T09:50:15.7744356Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T09:50:15.7745294Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:50:15.7751272Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:50:15.7751817Z     %cst = arith.constant dense<0.000000e+00> : tensor<8x2048xf16>
2026-02-21T09:50:15.7752159Z     %c2048_i32 = arith.constant 2048 : i32
2026-02-21T09:50:15.7752468Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:50:15.7753080Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:50:15.7753463Z     %cst_0 = arith.constant dense<8704> : tensor<8x1xi32>
2026-02-21T09:50:15.7753885Z     %cst_1 = arith.constant dense<0.000000e+00> : tensor<8x2048xf32>
2026-02-21T09:50:15.7754243Z     %cst_2 = arith.constant dense<0xFC00> : tensor<8x2048xf16>
2026-02-21T09:50:15.7754608Z     %cst_3 = arith.constant dense<8704> : tensor<2048xi32>
2026-02-21T09:50:15.7754944Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32>
2026-02-21T09:50:15.7755314Z     %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32>
2026-02-21T09:50:15.7755636Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:50:15.7755944Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:50:15.7756227Z     %c8704_i32 = arith.constant 8704 : i32
2026-02-21T09:50:15.7756544Z     %c8704_i64 = arith.constant 8704 : i64
2026-02-21T09:50:15.7756858Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T09:50:15.7757287Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8704_i32], [%c8704_i64, %c1_i64] : <f16>, <tensor<8x2048xf16>>
2026-02-21T09:50:15.7757740Z     %1 = tt.get_program_id x : i32
2026-02-21T09:50:15.7757996Z     %2 = arith.addi %1, %c1_i32 : i32
2026-02-21T09:50:15.7758282Z     %3 = arith.minsi %2, %c512_i32 : i32
2026-02-21T09:50:15.7758555Z     scf.for %arg2 = %1 to %3 step %c1_i32  : i32 {
2026-02-21T09:50:15.7758868Z       %4 = arith.muli %arg2, %c8_i32 : i32
2026-02-21T09:50:15.7759215Z       %5 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T09:50:15.7759563Z       %6 = tt.splat %4 : i32 -> tensor<8xi32>
2026-02-21T09:50:15.7759875Z       %7 = arith.addi %6, %5 : tensor<8xi32>
2026-02-21T09:50:15.7760150Z       %c8192_i32 = arith.constant 8192 : i32
2026-02-21T09:50:15.7760470Z       %c4096_i32_6 = arith.constant 4096 : i32
2026-02-21T09:50:15.7760989Z       %8:2 = scf.for %arg3 = %c0_i32 to %c8192_i32 step %c4096_i32_6 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<8xf32>, tensor<8xf32>)  : i32 {
2026-02-21T09:50:15.7761657Z         %62 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32>
2026-02-21T09:50:15.7762078Z         %63 = tt.splat %arg3 : i32 -> tensor<2048xi32>
2026-02-21T09:50:15.7762409Z         %64 = arith.addi %63, %62 : tensor<2048xi32>
2026-02-21T09:50:15.7762782Z         %65 = arith.cmpi slt, %64, %cst_3 : tensor<2048xi32>
2026-02-21T09:50:15.7763194Z         %66 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc<tensor<8x2048xf16>> -> tensor<8x2048xf16>
2026-02-21T09:50:15.7763680Z         %67 = tt.expand_dims %65 {axis = 0 : i32} : tensor<2048xi1> -> tensor<1x2048xi1>
2026-02-21T09:50:15.7764130Z         %68 = tt.broadcast %67 : tensor<1x2048xi1> -> tensor<8x2048xi1>
2026-02-21T09:50:15.7764542Z         %69 = arith.select %68, %66, %cst_2 : tensor<8x2048xi1>, tensor<8x2048xf16>
2026-02-21T09:50:15.7764967Z         %70 = arith.extf %69 : tensor<8x2048xf16> to tensor<8x2048xf32>
2026-02-21T09:50:15.7765302Z         %71 = "tt.reduce"(%70) <{axis = 1 : i32}> ({
2026-02-21T09:50:15.7765624Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:50:15.7766058Z           %118 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:50:15.7766377Z           tt.reduce.return %118 : f32
2026-02-21T09:50:15.7766649Z         }) : (tensor<8x2048xf32>) -> tensor<8xf32>
2026-02-21T09:50:15.7766993Z         %72 = arith.truncf %71 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:50:15.7767350Z         %73 = arith.extf %72 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:50:15.7767665Z         %74 = arith.cmpf ogt, %arg4, %73 : tensor<8xf32>
2026-02-21T09:50:15.7768082Z         %75 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32>
2026-02-21T09:50:15.7768420Z         %76 = arith.ori %74, %75 : tensor<8xi1>
2026-02-21T09:50:15.7768756Z         %77 = arith.select %76, %arg4, %73 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:50:15.7769123Z         %78 = arith.subf %arg4, %77 : tensor<8xf32>
2026-02-21T09:50:15.7769735Z         %79 = tt.extern_elementwise %78 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:50:15.7770230Z         %80 = arith.mulf %arg5, %79 : tensor<8xf32>
2026-02-21T09:50:15.7770621Z         %81 = tt.expand_dims %77 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:50:15.7771011Z         %82 = arith.extf %66 : tensor<8x2048xf16> to tensor<8x2048xf32>
2026-02-21T09:50:15.7771400Z         %83 = tt.broadcast %81 : tensor<8x1xf32> -> tensor<8x2048xf32>
2026-02-21T09:50:15.7771831Z         %84 = arith.subf %82, %83 : tensor<8x2048xf32>
2026-02-21T09:50:15.7772341Z         %85 = tt.extern_elementwise %84 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x2048xf32>) -> tensor<8x2048xf32>
2026-02-21T09:50:15.7772931Z         %86 = arith.select %68, %85, %cst_1 : tensor<8x2048xi1>, tensor<8x2048xf32>
2026-02-21T09:50:15.7773300Z         %87 = "tt.reduce"(%86) <{axis = 1 : i32}> ({
2026-02-21T09:50:15.7773608Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:50:15.7773875Z           %118 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:50:15.7774183Z           tt.reduce.return %118 : f32
2026-02-21T09:50:15.7774496Z         }) : (tensor<8x2048xf32>) -> tensor<8xf32>
2026-02-21T09:50:15.7774780Z         %88 = arith.addf %80, %87 : tensor<8xf32>
2026-02-21T09:50:15.7775090Z         %c1_i32_9 = arith.constant 1 : i32
2026-02-21T09:50:15.7775367Z         %89 = arith.muli %c2048_i32, %c1_i32_9 : i32
2026-02-21T09:50:15.7775671Z         %90 = arith.addi %arg3, %89 : i32
2026-02-21T09:50:15.7776002Z         %91 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32>
2026-02-21T09:50:15.7776382Z         %92 = tt.splat %90 : i32 -> tensor<2048xi32>
2026-02-21T09:50:15.7776663Z         %93 = arith.addi %92, %91 : tensor<2048xi32>
2026-02-21T09:50:15.7777018Z         %94 = arith.cmpi slt, %93, %cst_3 : tensor<2048xi32>
2026-02-21T09:50:15.7777461Z         %95 = tt.descriptor_load %0[%4, %90] : !tt.tensordesc<tensor<8x2048xf16>> -> tensor<8x2048xf16>
2026-02-21T09:50:15.7777920Z         %96 = tt.expand_dims %94 {axis = 0 : i32} : tensor<2048xi1> -> tensor<1x2048xi1>
2026-02-21T09:50:15.7778338Z         %97 = tt.broadcast %96 : tensor<1x2048xi1> -> tensor<8x2048xi1>
2026-02-21T09:50:15.7778703Z         %98 = arith.select %97, %95, %cst_2 : tensor<8x2048xi1>, tensor<8x2048xf16>
2026-02-21T09:50:15.7779101Z         %99 = arith.extf %98 : tensor<8x2048xf16> to tensor<8x2048xf32>
2026-02-21T09:50:15.7779449Z         %100 = "tt.reduce"(%99) <{axis = 1 : i32}> ({
2026-02-21T09:50:15.7779724Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:50:15.7780017Z           %118 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:50:15.7780285Z           tt.reduce.return %118 : f32
2026-02-21T09:50:15.7780573Z         }) : (tensor<8x2048xf32>) -> tensor<8xf32>
2026-02-21T09:50:15.7780887Z         %101 = arith.truncf %100 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:50:15.7781245Z         %102 = arith.extf %101 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:50:15.7781625Z         %103 = arith.cmpf ogt, %77, %102 : tensor<8xf32>
2026-02-21T09:50:15.7781926Z         %104 = arith.cmpf une, %77, %77 : tensor<8xf32>
2026-02-21T09:50:15.7782323Z         %105 = arith.ori %103, %104 : tensor<8xi1>
2026-02-21T09:50:15.7782640Z         %106 = arith.select %105, %77, %102 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:50:15.7782995Z         %107 = arith.subf %77, %106 : tensor<8xf32>
2026-02-21T09:50:15.7783459Z         %108 = tt.extern_elementwise %107 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:50:15.7783959Z         %109 = arith.mulf %88, %108 : tensor<8xf32>
2026-02-21T09:50:15.7784288Z         %110 = tt.expand_dims %106 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:50:15.7784675Z         %111 = arith.extf %95 : tensor<8x2048xf16> to tensor<8x2048xf32>
2026-02-21T09:50:15.7785055Z         %112 = tt.broadcast %110 : tensor<8x1xf32> -> tensor<8x2048xf32>
2026-02-21T09:50:15.7785386Z         %113 = arith.subf %111, %112 : tensor<8x2048xf32>
2026-02-21T09:50:15.7785960Z         %114 = tt.extern_elementwise %113 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x2048xf32>) -> tensor<8x2048xf32>
2026-02-21T09:50:15.7786520Z         %115 = arith.select %97, %114, %cst_1 : tensor<8x2048xi1>, tensor<8x2048xf32>
2026-02-21T09:50:15.7786869Z         %116 = "tt.reduce"(%115) <{axis = 1 : i32}> ({
2026-02-21T09:50:15.7787167Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:50:15.7787424Z           %118 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:50:15.7787721Z           tt.reduce.return %118 : f32
2026-02-21T09:50:15.7787983Z         }) : (tensor<8x2048xf32>) -> tensor<8xf32>
2026-02-21T09:50:15.7788295Z         %117 = arith.addf %109, %116 : tensor<8xf32>
2026-02-21T09:50:15.7788594Z         scf.yield %106, %117 : tensor<8xf32>, tensor<8xf32>
2026-02-21T09:50:15.7788920Z       } {tt.num_stages = 1 : i32}
2026-02-21T09:50:15.7789264Z       %9 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32>
2026-02-21T09:50:15.7789620Z       %10 = tt.splat %c8192_i32 : i32 -> tensor<2048xi32>
2026-02-21T09:50:15.7789951Z       %11 = arith.addi %10, %9 : tensor<2048xi32>
2026-02-21T09:50:15.7790249Z       %12 = arith.cmpi slt, %11, %cst_3 : tensor<2048xi32>
2026-02-21T09:50:15.7790690Z       %13 = tt.descriptor_load %0[%4, %c8192_i32] : !tt.tensordesc<tensor<8x2048xf16>> -> tensor<8x2048xf16>
2026-02-21T09:50:15.7791146Z       %14 = tt.expand_dims %12 {axis = 0 : i32} : tensor<2048xi1> -> tensor<1x2048xi1>
2026-02-21T09:50:15.7791593Z       %15 = tt.broadcast %14 : tensor<1x2048xi1> -> tensor<8x2048xi1>
2026-02-21T09:50:15.7791993Z       %16 = arith.select %15, %13, %cst_2 : tensor<8x2048xi1>, tensor<8x2048xf16>
2026-02-21T09:50:15.7792364Z       %17 = arith.extf %16 : tensor<8x2048xf16> to tensor<8x2048xf32>
2026-02-21T09:50:15.7792710Z       %18 = "tt.reduce"(%17) <{axis = 1 : i32}> ({
2026-02-21T09:50:15.7792977Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:50:15.7793270Z         %62 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:50:15.7793540Z         tt.reduce.return %62 : f32
2026-02-21T09:50:15.7793833Z       }) : (tensor<8x2048xf32>) -> tensor<8xf32>
2026-02-21T09:50:15.7794160Z       %19 = arith.truncf %18 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:50:15.7794483Z       %20 = arith.extf %19 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:50:15.7794813Z       %21 = arith.cmpf ogt, %8#0, %20 : tensor<8xf32>
2026-02-21T09:50:15.7795106Z       %22 = arith.cmpf une, %8#0, %8#0 : tensor<8xf32>
2026-02-21T09:50:15.7795410Z       %23 = arith.ori %21, %22 : tensor<8xi1>
2026-02-21T09:50:15.7795690Z       %24 = arith.select %23, %8#0, %20 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:50:15.7795999Z       %25 = arith.subf %8#0, %24 : tensor<8xf32>
2026-02-21T09:50:15.7796441Z       %26 = tt.extern_elementwise %25 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:50:15.7796861Z       %27 = arith.mulf %8#1, %26 : tensor<8xf32>
2026-02-21T09:50:15.7797196Z       %28 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:50:15.7797601Z       %29 = arith.extf %13 : tensor<8x2048xf16> to tensor<8x2048xf32>
2026-02-21T09:50:15.7797913Z       %30 = tt.broadcast %28 : tensor<8x1xf32> -> tensor<8x2048xf32>
2026-02-21T09:50:15.7798205Z       %31 = arith.subf %29, %30 : tensor<8x2048xf32>
2026-02-21T09:50:15.7798664Z       %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x2048xf32>) -> tensor<8x2048xf32>
2026-02-21T09:50:15.7799171Z       %33 = arith.select %15, %32, %cst_1 : tensor<8x2048xi1>, tensor<8x2048xf32>
2026-02-21T09:50:15.7799479Z       %34 = "tt.reduce"(%33) <{axis = 1 : i32}> ({
2026-02-21T09:50:15.7799780Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:50:15.7800015Z         %62 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:50:15.7800287Z         tt.reduce.return %62 : f32
2026-02-21T09:50:15.7800521Z       }) : (tensor<8x2048xf32>) -> tensor<8xf32>
2026-02-21T09:50:15.7800856Z       %35 = arith.addf %27, %34 : tensor<8xf32>
2026-02-21T09:50:15.7801134Z       %c8192_i32_7 = arith.constant 8192 : i32
2026-02-21T09:50:15.7801385Z       %c4096_i32_8 = arith.constant 4096 : i32
2026-02-21T09:50:15.7801773Z       scf.for %arg3 = %c0_i32 to %c8192_i32_7 step %c4096_i32_8  : i32 {
2026-02-21T09:50:15.7802124Z         %62 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32>
2026-02-21T09:50:15.7802470Z         %63 = tt.splat %arg3 : i32 -> tensor<2048xi32>
2026-02-21T09:50:15.7802737Z         %64 = arith.addi %63, %62 : tensor<2048xi32>
2026-02-21T09:50:15.7803039Z         %65 = arith.cmpi slt, %64, %cst_3 : tensor<2048xi32>
2026-02-21T09:50:15.7803393Z         %66 = tt.expand_dims %7 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:50:15.7803717Z         %67 = arith.muli %66, %cst_0 : tensor<8x1xi32>
2026-02-21T09:50:15.7804075Z         %68 = tt.expand_dims %64 {axis = 0 : i32} : tensor<2048xi32> -> tensor<1x2048xi32>
2026-02-21T09:50:15.7804438Z         %69 = tt.broadcast %67 : tensor<8x1xi32> -> tensor<8x2048xi32>
2026-02-21T09:50:15.7804796Z         %70 = tt.broadcast %68 : tensor<1x2048xi32> -> tensor<8x2048xi32>
2026-02-21T09:50:15.7805121Z         %71 = arith.addi %69, %70 : tensor<8x2048xi32>
2026-02-21T09:50:15.7805421Z         %72 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x2048x!tt.ptr<f16>>
2026-02-21T09:50:15.7805795Z         %73 = tt.addptr %72, %71 : tensor<8x2048x!tt.ptr<f16>>, tensor<8x2048xi32>
2026-02-21T09:50:15.7806159Z         %74 = tt.expand_dims %65 {axis = 0 : i32} : tensor<2048xi1> -> tensor<1x2048xi1>
2026-02-21T09:50:15.7806546Z         %75 = tt.broadcast %74 : tensor<1x2048xi1> -> tensor<8x2048xi1>
2026-02-21T09:50:15.7806917Z         %76 = tt.load %73, %75, %cst evictionPolicy = evict_first : tensor<8x2048x!tt.ptr<f16>>
2026-02-21T09:50:15.7807342Z         %77 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:50:15.7807718Z         %78 = arith.extf %76 : tensor<8x2048xf16> to tensor<8x2048xf32>
2026-02-21T09:50:15.7808044Z         %79 = tt.broadcast %77 : tensor<8x1xf32> -> tensor<8x2048xf32>
2026-02-21T09:50:15.7808366Z         %80 = arith.subf %78, %79 : tensor<8x2048xf32>
2026-02-21T09:50:15.7808805Z         %81 = tt.extern_elementwise %80 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x2048xf32>) -> tensor<8x2048xf32>
2026-02-21T09:50:15.7809321Z         %82 = tt.expand_dims %35 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:50:15.7809693Z         %83 = tt.broadcast %82 : tensor<8x1xf32> -> tensor<8x2048xf32>
2026-02-21T09:50:15.7809982Z         %84 = arith.divf %81, %83 : tensor<8x2048xf32>
2026-02-21T09:50:15.7810307Z         %85 = arith.truncf %84 : tensor<8x2048xf32> to tensor<8x2048xf16>
2026-02-21T09:50:15.7810636Z         %86 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x2048x!tt.ptr<f16>>
2026-02-21T09:50:15.7811002Z         %87 = tt.addptr %86, %71 : tensor<8x2048x!tt.ptr<f16>>, tensor<8x2048xi32>
2026-02-21T09:50:15.7811324Z         tt.store %87, %85, %75 : tensor<8x2048x!tt.ptr<f16>>
2026-02-21T09:50:15.7811768Z         %c1_i32_9 = arith.constant 1 : i32
2026-02-21T09:50:15.7812053Z         %88 = arith.muli %c2048_i32, %c1_i32_9 : i32
2026-02-21T09:50:15.7812304Z         %89 = arith.addi %arg3, %88 : i32
2026-02-21T09:50:15.7812626Z         %90 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32>
2026-02-21T09:50:15.7812938Z         %91 = tt.splat %89 : i32 -> tensor<2048xi32>
2026-02-21T09:50:15.7813225Z         %92 = arith.addi %91, %90 : tensor<2048xi32>
2026-02-21T09:50:15.7813500Z         %93 = arith.cmpi slt, %92, %cst_3 : tensor<2048xi32>
2026-02-21T09:50:15.7813850Z         %94 = tt.expand_dims %7 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:50:15.7814165Z         %95 = arith.muli %94, %cst_0 : tensor<8x1xi32>
2026-02-21T09:50:15.7814487Z         %96 = tt.expand_dims %92 {axis = 0 : i32} : tensor<2048xi32> -> tensor<1x2048xi32>
2026-02-21T09:50:15.7814871Z         %97 = tt.broadcast %95 : tensor<8x1xi32> -> tensor<8x2048xi32>
2026-02-21T09:50:15.7815254Z         %98 = tt.broadcast %96 : tensor<1x2048xi32> -> tensor<8x2048xi32>
2026-02-21T09:50:15.7815579Z         %99 = arith.addi %97, %98 : tensor<8x2048xi32>
2026-02-21T09:50:15.7815877Z         %100 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x2048x!tt.ptr<f16>>
2026-02-21T09:50:15.7816256Z         %101 = tt.addptr %100, %99 : tensor<8x2048x!tt.ptr<f16>>, tensor<8x2048xi32>
2026-02-21T09:50:15.7816662Z         %102 = tt.expand_dims %93 {axis = 0 : i32} : tensor<2048xi1> -> tensor<1x2048xi1>
2026-02-21T09:50:15.7817022Z         %103 = tt.broadcast %102 : tensor<1x2048xi1> -> tensor<8x2048xi1>
2026-02-21T09:50:15.7817437Z         %104 = tt.load %101, %103, %cst evictionPolicy = evict_first : tensor<8x2048x!tt.ptr<f16>>
2026-02-21T09:50:15.7817835Z         %105 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:50:15.7818219Z         %106 = arith.extf %104 : tensor<8x2048xf16> to tensor<8x2048xf32>
2026-02-21T09:50:15.7818577Z         %107 = tt.broadcast %105 : tensor<8x1xf32> -> tensor<8x2048xf32>
2026-02-21T09:50:15.7818880Z         %108 = arith.subf %106, %107 : tensor<8x2048xf32>
2026-02-21T09:50:15.7819362Z         %109 = tt.extern_elementwise %108 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x2048xf32>) -> tensor<8x2048xf32>
2026-02-21T09:50:15.7819856Z         %110 = tt.expand_dims %35 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:50:15.7820239Z         %111 = tt.broadcast %110 : tensor<8x1xf32> -> tensor<8x2048xf32>
2026-02-21T09:50:15.7820567Z         %112 = arith.divf %109, %111 : tensor<8x2048xf32>
2026-02-21T09:50:15.7820871Z         %113 = arith.truncf %112 : tensor<8x2048xf32> to tensor<8x2048xf16>
2026-02-21T09:50:15.7821239Z         %114 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x2048x!tt.ptr<f16>>
2026-02-21T09:50:15.7821614Z         %115 = tt.addptr %114, %99 : tensor<8x2048x!tt.ptr<f16>>, tensor<8x2048xi32>
2026-02-21T09:50:15.7821980Z         tt.store %115, %113, %103 : tensor<8x2048x!tt.ptr<f16>>
2026-02-21T09:50:15.7822255Z       } {tt.num_stages = 1 : i32}
2026-02-21T09:50:15.7822575Z       %36 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32>
2026-02-21T09:50:15.7822929Z       %37 = tt.splat %c8192_i32_7 : i32 -> tensor<2048xi32>
2026-02-21T09:50:15.7823203Z       %38 = arith.addi %37, %36 : tensor<2048xi32>
2026-02-21T09:50:15.7823510Z       %39 = arith.cmpi slt, %38, %cst_3 : tensor<2048xi32>
2026-02-21T09:50:15.7823834Z       %40 = tt.expand_dims %7 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:50:15.7824181Z       %41 = arith.muli %40, %cst_0 : tensor<8x1xi32>
2026-02-21T09:50:15.7824498Z       %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<2048xi32> -> tensor<1x2048xi32>
2026-02-21T09:50:15.7824885Z       %43 = tt.broadcast %41 : tensor<8x1xi32> -> tensor<8x2048xi32>
2026-02-21T09:50:15.7825238Z       %44 = tt.broadcast %42 : tensor<1x2048xi32> -> tensor<8x2048xi32>
2026-02-21T09:50:15.7825532Z       %45 = arith.addi %43, %44 : tensor<8x2048xi32>
2026-02-21T09:50:15.7825923Z       %46 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x2048x!tt.ptr<f16>>
2026-02-21T09:50:15.7826261Z       %47 = tt.addptr %46, %45 : tensor<8x2048x!tt.ptr<f16>>, tensor<8x2048xi32>
2026-02-21T09:50:15.7826647Z       %48 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2048xi1> -> tensor<1x2048xi1>
2026-02-21T09:50:15.7827000Z       %49 = tt.broadcast %48 : tensor<1x2048xi1> -> tensor<8x2048xi1>
2026-02-21T09:50:15.7827393Z       %50 = tt.load %47, %49, %cst evictionPolicy = evict_first : tensor<8x2048x!tt.ptr<f16>>
2026-02-21T09:50:15.7827810Z       %51 = tt.expand_dims %24 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:50:15.7828154Z       %52 = arith.extf %50 : tensor<8x2048xf16> to tensor<8x2048xf32>
2026-02-21T09:50:15.7828502Z       %53 = tt.broadcast %51 : tensor<8x1xf32> -> tensor<8x2048xf32>
2026-02-21T09:50:15.7828788Z       %54 = arith.subf %52, %53 : tensor<8x2048xf32>
2026-02-21T09:50:15.7829306Z       %55 = tt.extern_elementwise %54 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x2048xf32>) -> tensor<8x2048xf32>
2026-02-21T09:50:15.7829818Z       %56 = tt.expand_dims %35 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:50:15.7830157Z       %57 = tt.broadcast %56 : tensor<8x1xf32> -> tensor<8x2048xf32>
2026-02-21T09:50:15.7830470Z       %58 = arith.divf %55, %57 : tensor<8x2048xf32>
2026-02-21T09:50:15.7830763Z       %59 = arith.truncf %58 : tensor<8x2048xf32> to tensor<8x2048xf16>
2026-02-21T09:50:15.7831117Z       %60 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x2048x!tt.ptr<f16>>
2026-02-21T09:50:15.7831450Z       %61 = tt.addptr %60, %45 : tensor<8x2048x!tt.ptr<f16>>, tensor<8x2048xi32>
2026-02-21T09:50:15.7831833Z       tt.store %61, %59, %49 : tensor<8x2048x!tt.ptr<f16>>
2026-02-21T09:50:15.7832143Z     } {tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T09:50:15.7832394Z     tt.return
2026-02-21T09:50:15.7832605Z   }
2026-02-21T09:50:15.7832775Z }
2026-02-21T09:50:15.7832900Z 
2026-02-21T09:50:15.7832979Z {-#
2026-02-21T09:50:15.7833162Z   external_resources: {
2026-02-21T09:50:15.7833415Z     mlir_reproducer: {
2026-02-21T09:50:15.7838225Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T09:50:15.7843525Z       disable_threading: false,
2026-02-21T09:50:15.7843801Z       verify_each: true
2026-02-21T09:50:15.7844073Z     }
2026-02-21T09:50:15.7844285Z   }
2026-02-21T09:50:15.7844460Z #-}
2026-02-21T09:50:15.7845027Z /tmp/torchinductor_root/54/c546bkyi7sl4ccjuehpnzcndjhoufwdajeu553u5laotkp7oj7xq.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:50:15.7846512Z /tmp/torchinductor_root/54/c546bkyi7sl4ccjuehpnzcndjhoufwdajeu553u5laotkp7oj7xq.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:50:15.7847706Z [135s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:50:15.7849079Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 2048], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=8, num_stages=8, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[4, 3], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T09:50:15.7850189Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:50:15.7850517Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:50:18.1963608Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 17.2 configs/s
2026-02-21T09:50:23.2333237Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 200.9 configs/s
2026-02-21T09:50:23.5080037Z [143s] Generation 3 complete: 
2026-02-21T09:50:23.5083958Z error=2
2026-02-21T09:50:23.5088177Z ok=71
2026-02-21T09:50:23.5092695Z min=0.0451
2026-02-21T09:50:23.5097714Z mid=0.0513
2026-02-21T09:50:23.5102167Z max=0.4997
2026-02-21T09:50:23.5103631Z best={'block_sizes': [1, 512],
2026-02-21T09:50:23.5104000Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T09:50:23.5104357Z  'load_eviction_policies': ['first', ''],
2026-02-21T09:50:23.5104587Z  'num_stages': 2,
2026-02-21T09:50:23.5104810Z  'num_warps': 1,
2026-02-21T09:50:23.5104996Z  'pid_type': 'flat',
2026-02-21T09:50:23.5105223Z  'range_flattens': [None, False],
2026-02-21T09:50:23.5105448Z  'range_multi_buffers': [None, False],
2026-02-21T09:50:23.5105702Z  'range_num_stages': [0, 3],
2026-02-21T09:50:23.5105940Z  'range_unroll_factors': [0, 1],
2026-02-21T09:50:23.5106159Z  'range_warp_specializes': [None, False]}
2026-02-21T09:50:23.5106454Z [143s] Fitting surrogate: 342 points, 342 targets
2026-02-21T09:50:24.3527177Z [144s] Generation 4 starting: 59 neighbors, 5 active search path(s)
2026-02-21T09:50:52.2447243Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60/60 0.3 configs/s
2026-02-21T09:50:55.7913613Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 60/60 17.1 configs/s
2026-02-21T09:51:00.1320973Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 232.8 configs/s
2026-02-21T09:51:00.3816581Z [180s] Generation 4 complete: 
2026-02-21T09:51:00.3820845Z ok=64
2026-02-21T09:51:00.3824778Z min=0.0451
2026-02-21T09:51:00.3826398Z mid=0.0511
2026-02-21T09:51:00.3826649Z max=0.9605
2026-02-21T09:51:00.3826845Z best={'block_sizes': [1, 512],
2026-02-21T09:51:00.3827177Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:51:00.3827490Z  'load_eviction_policies': ['first', ''],
2026-02-21T09:51:00.3827752Z  'num_stages': 2,
2026-02-21T09:51:00.3827941Z  'num_warps': 1,
2026-02-21T09:51:00.3828158Z  'pid_type': 'flat',
2026-02-21T09:51:00.3828392Z  'range_flattens': [None, False],
2026-02-21T09:51:00.3828611Z  'range_multi_buffers': [None, False],
2026-02-21T09:51:00.3828863Z  'range_num_stages': [0, 3],
2026-02-21T09:51:00.3829566Z  'range_unroll_factors': [0, 1],
2026-02-21T09:51:00.3833330Z  'range_warp_specializes': [None, False]}
2026-02-21T09:51:00.3835382Z [180s] Fitting surrogate: 406 points, 406 targets
2026-02-21T09:51:01.1403830Z [180s] Generation 5 starting: 50 neighbors, 4 active search path(s)
2026-02-21T09:51:08.5171345Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 8.3 configs/s
2026-02-21T09:51:11.5599629Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 17.3 configs/s
2026-02-21T09:51:15.9164878Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 272.0 configs/s
2026-02-21T09:51:16.1313593Z [195s] Generation 5 complete: 
2026-02-21T09:51:16.1316684Z ok=55
2026-02-21T09:51:16.1319899Z min=0.0451
2026-02-21T09:51:16.1323946Z mid=0.0532
2026-02-21T09:51:16.1328602Z max=0.2764
2026-02-21T09:51:16.1330722Z best={'block_sizes': [1, 512],
2026-02-21T09:51:16.1331503Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:51:16.1336667Z  'load_eviction_policies': ['first', ''],
2026-02-21T09:51:16.1340991Z  'num_stages': 2,
2026-02-21T09:51:16.1341321Z  'num_warps': 1,
2026-02-21T09:51:16.1346136Z  'pid_type': 'flat',
2026-02-21T09:51:16.1350508Z  'range_flattens': [None, False],
2026-02-21T09:51:16.1350865Z  'range_multi_buffers': [None, False],
2026-02-21T09:51:16.1355527Z  'range_num_stages': [0, 3],
2026-02-21T09:51:16.1359927Z  'range_unroll_factors': [0, 1],
2026-02-21T09:51:16.1361702Z  'range_warp_specializes': [None, False]}
2026-02-21T09:51:16.8563014Z [195s] Fitting surrogate: 461 points, 461 targets
2026-02-21T09:51:16.8563443Z [196s] Generation 6 starting: 45 neighbors, 4 active search path(s)
2026-02-21T09:51:23.0759538Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46/46 7.3 configs/s
2026-02-21T09:51:25.7705044Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 46/46 17.4 configs/s
2026-02-21T09:51:29.6622202Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 259.7 configs/s
2026-02-21T09:51:29.8945506Z [209s] Generation 6 complete: 
2026-02-21T09:51:29.8949735Z ok=50
2026-02-21T09:51:29.8953665Z min=0.0451
2026-02-21T09:51:29.8957918Z mid=0.0471
2026-02-21T09:51:29.8959425Z max=0.0799
2026-02-21T09:51:29.8959688Z best={'block_sizes': [1, 512],
2026-02-21T09:51:29.8959997Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:51:29.8960428Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:51:29.8965485Z  'num_stages': 8,
2026-02-21T09:51:29.8967640Z  'num_warps': 1,
2026-02-21T09:51:29.8967923Z  'pid_type': 'flat',
2026-02-21T09:51:29.8972142Z  'range_flattens': [None, None],
2026-02-21T09:51:29.8973662Z  'range_multi_buffers': [None, None],
2026-02-21T09:51:29.8973936Z  'range_num_stages': [0, 3],
2026-02-21T09:51:29.8974174Z  'range_unroll_factors': [0, 1],
2026-02-21T09:51:29.8974802Z  'range_warp_specializes': [None, False]}
2026-02-21T09:51:29.8975103Z [209s] Fitting surrogate: 511 points, 511 targets
2026-02-21T09:51:30.2619258Z [210s] Generation 7 starting: 7 neighbors, 1 active search path(s)
2026-02-21T09:51:32.0076495Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7/7 3.8 configs/s
2026-02-21T09:51:32.4199140Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━━ 7/7 19.4 configs/s
2026-02-21T09:51:33.1281076Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1395.7 configs/s
2026-02-21T09:51:33.1858206Z [212s] Generation 7 complete: 
2026-02-21T09:51:33.1858514Z ok=8
2026-02-21T09:51:33.1862577Z min=0.0451
2026-02-21T09:51:33.1867054Z mid=0.0451
2026-02-21T09:51:33.1871327Z max=0.0635
2026-02-21T09:51:33.1875743Z best={'block_sizes': [1, 512],
2026-02-21T09:51:33.1878376Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T09:51:33.1878861Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:51:33.1882787Z  'num_stages': 8,
2026-02-21T09:51:33.1887191Z  'num_warps': 1,
2026-02-21T09:51:33.1891517Z  'pid_type': 'flat',
2026-02-21T09:51:33.1895398Z  'range_flattens': [None, None],
2026-02-21T09:51:33.1898683Z  'range_multi_buffers': [None, None],
2026-02-21T09:51:33.1902450Z  'range_num_stages': [0, 3],
2026-02-21T09:51:33.1907374Z  'range_unroll_factors': [0, 1],
2026-02-21T09:51:33.1911773Z  'range_warp_specializes': [None, False]}
2026-02-21T09:51:33.1913896Z [212s] Fitting surrogate: 519 points, 519 targets
2026-02-21T09:51:33.4434631Z [213s] Autotuning complete in 213.2s after searching 497 configs.
2026-02-21T09:51:33.4438521Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:51:33.4439649Z     @helion.kernel(config=helion.Config(block_sizes=[1, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, False]), static_shapes=True)
2026-02-21T09:51:33.4440524Z 
2026-02-21T09:51:33.4440812Z [213s] Code of selected kernel: /tmp/torchinductor_root/kg/ckgyyc5wosnskbkyt7nxbsquxpemydjjp6uf3lfhfvqntz2bcqqp.py
2026-02-21T09:51:33.4637249Z from __future__ import annotations
2026-02-21T09:51:33.4637527Z 
2026-02-21T09:51:33.4637706Z import torch
2026-02-21T09:51:33.4637890Z import triton
2026-02-21T09:51:33.4638171Z import triton.language as tl
2026-02-21T09:51:33.4638419Z from torch._inductor.runtime import triton_helpers
2026-02-21T09:51:33.4638779Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T09:51:33.4639119Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:51:33.4639350Z 
2026-02-21T09:51:33.4639454Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T09:51:33.4639708Z _BLOCK_SIZE_1 = tl.constexpr(512)
2026-02-21T09:51:33.4639846Z 
2026-02-21T09:51:33.4639921Z @triton.jit
2026-02-21T09:51:33.4640138Z def _helion_softmax_two_pass(x, out):
2026-02-21T09:51:33.4640424Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:51:33.4640749Z     pid_0 = tl.program_id(0)
2026-02-21T09:51:33.4640951Z     offset_0 = pid_0
2026-02-21T09:51:33.4641190Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T09:51:33.4641702Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:51:33.4642040Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T09:51:33.4642382Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:51:33.4642673Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T09:51:33.4642996Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:51:33.4643315Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T09:51:33.4643927Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T09:51:33.4644229Z     # src[softmax.py:82-89]: ...
2026-02-21T09:51:33.4644603Z     for offset_2 in tl.range(0, 8704, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=False, num_stages=3):
2026-02-21T09:51:33.4645048Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:51:33.4645309Z         mi_copy = mi
2026-02-21T09:51:33.4645514Z         di_copy = di
2026-02-21T09:51:33.4645697Z         mi_copy_0 = mi_copy
2026-02-21T09:51:33.4645919Z         di_copy_0 = di_copy
2026-02-21T09:51:33.4646138Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T09:51:33.4646544Z         values = tl.load(x + (indices_0[:, None] * 8704 + indices_2[None, :] * 1), None, eviction_policy='evict_last')
2026-02-21T09:51:33.4647018Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T09:51:33.4647382Z         local_amax = tl.cast(tl.max(values, 1), tl.float16)
2026-02-21T09:51:33.4647731Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T09:51:33.4648058Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T09:51:33.4648321Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T09:51:33.4648662Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:51:33.4648956Z         v_2 = mi_copy_0 - v_1
2026-02-21T09:51:33.4649205Z         v_3 = libdevice.exp(v_2)
2026-02-21T09:51:33.4649421Z         v_4 = di_copy_0 * v_3
2026-02-21T09:51:33.4649692Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T09:51:33.4649925Z         subscript = v_1[:, None]
2026-02-21T09:51:33.4650169Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T09:51:33.4650431Z         v_6 = v_5 - subscript
2026-02-21T09:51:33.4650694Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:51:33.4651044Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T09:51:33.4651312Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T09:51:33.4651616Z         v_7 = libdevice.exp(v_6)
2026-02-21T09:51:33.4651852Z         sum_1 = tl.cast(tl.sum(v_7, 1), tl.float32)
2026-02-21T09:51:33.4652131Z         di = v_4 + sum_1
2026-02-21T09:51:33.4652342Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T09:51:33.4652598Z         mi = v_1
2026-02-21T09:51:33.4652879Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:51:33.4653208Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T09:51:33.4653583Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:51:33.4654049Z     for offset_2 in tl.range(0, 8704, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=False, num_stages=3):
2026-02-21T09:51:33.4654497Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:51:33.4654780Z         mi_copy_1 = mi
2026-02-21T09:51:33.4654998Z         di_copy_1 = di
2026-02-21T09:51:33.4655231Z         mi_copy_1_0 = mi_copy_1
2026-02-21T09:51:33.4655434Z         di_copy_1_0 = di_copy_1
2026-02-21T09:51:33.4655677Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T09:51:33.4656036Z         values_1 = tl.load(x + (indices_0[:, None] * 8704 + indices_2[None, :] * 1), None, eviction_policy='evict_last')
2026-02-21T09:51:33.4656504Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:51:33.4656821Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T09:51:33.4657082Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T09:51:33.4657334Z         v_10 = v_9 - subscript_1
2026-02-21T09:51:33.4657549Z         v_11 = libdevice.exp(v_10)
2026-02-21T09:51:33.4657793Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T09:51:33.4658018Z         v_12 = v_11 / subscript_2
2026-02-21T09:51:33.4658258Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T09:51:33.4658634Z         tl.store(out + (indices_0[:, None] * 8704 + indices_2[None, :] * 1), v_13, None)
2026-02-21T09:51:33.4658878Z 
2026-02-21T09:51:33.4659037Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T09:51:33.4659339Z     """
2026-02-21T09:51:33.4659587Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T09:51:33.4659967Z     This version uses fewer passes but is less numerically stable.
2026-02-21T09:51:33.4660230Z     Args:
2026-02-21T09:51:33.4660458Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T09:51:33.4660690Z     Returns:
2026-02-21T09:51:33.4660936Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T09:51:33.4661182Z     """
2026-02-21T09:51:33.4661383Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T09:51:33.4661671Z     m, n = x.size()
2026-02-21T09:51:33.4661877Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T09:51:33.4662212Z     out = torch.empty_like(x)
2026-02-21T09:51:33.4662483Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:51:33.4662868Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:51:33.4663203Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:51:33.4663506Z     # src[softmax.py:79-92]: ...
2026-02-21T09:51:33.4663830Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=1, num_stages=8)
2026-02-21T09:51:33.4664140Z     # src[softmax.py:93]: return out
2026-02-21T09:51:33.4664379Z     return out
2026-02-21T09:51:34.5504665Z WARNING:tritonbench.utils.triton_op:Completed input ID 66:
2026-02-21T09:51:34.5506219Z (M, N)
2026-02-21T09:51:34.5506509Z ------------
2026-02-21T09:51:34.5506723Z (4096, 8704)
2026-02-21T09:51:34.5506918Z 
2026-02-21T09:51:34.5518132Z  70%|███████   | 14/20 [42:39<20:02, 200.40s/it]WARNING:tritonbench.utils.triton_op:Running input ID 71:
2026-02-21T09:51:34.5522790Z (M, N)
2026-02-21T09:51:34.5525950Z ------------
2026-02-21T09:51:34.5530959Z (4096, 9344)
2026-02-21T09:51:34.5532943Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T09:51:35.7681199Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T09:51:37.1276165Z INFO:tritonbench.utils.triton_op:Took 2.42ms to get benchmark function for torch_compile_softmax
2026-02-21T09:51:38.4721349Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:51:38.4725512Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:51:38.4730483Z               'dtype': 'torch.float16',
2026-02-21T09:51:38.4734314Z               'shape': (4096, 9344),
2026-02-21T09:51:38.4735888Z               'stride': (9344, 1)},),
2026-02-21T09:51:38.4736307Z   'kwargs': {}}
2026-02-21T09:51:38.4744200Z INFO:tritonbench.utils.triton_op:Took 2.45ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:51:38.6479554Z [0s] Autotune random seed: 2138408546
2026-02-21T09:51:38.6732335Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:52:14.3133198Z [35s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', ''], maxnreg=32, num_sm_multiplier=2, num_stages=8, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[3, 2], range_unroll_factors=[4, 1], range_warp_specializes=[False, False])
2026-02-21T09:52:17.2048686Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s
2026-02-21T09:52:19.2281335Z module {
2026-02-21T09:52:19.2282122Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:52:19.2282677Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:52:19.2283253Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:52:19.2283512Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:52:19.2283736Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:52:19.2284019Z     %cst = arith.constant dense<9344> : tensor<16x1xi32>
2026-02-21T09:52:19.2284319Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<16xf32>
2026-02-21T09:52:19.2284651Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<16xf32>
2026-02-21T09:52:19.2284940Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:52:19.2285168Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:52:19.2285427Z     %c9344_i32 = arith.constant 9344 : i32
2026-02-21T09:52:19.2285650Z     %c9344_i64 = arith.constant 9344 : i64
2026-02-21T09:52:19.2285898Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T09:52:19.2286372Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c9344_i32], [%c9344_i64, %c1_i64] : <f16>, <tensor<16x128xf16>>
2026-02-21T09:52:19.2286918Z     %1 = tt.get_program_id x : i32
2026-02-21T09:52:19.2287178Z     %2 = arith.addi %1, %c1_i32 : i32
2026-02-21T09:52:19.2287396Z     %3 = arith.minsi %2, %c256_i32 : i32
2026-02-21T09:52:19.2287667Z     scf.for %arg2 = %1 to %3 step %c1_i32  : i32 {
2026-02-21T09:52:19.2287911Z       %4 = arith.muli %arg2, %c16_i32 : i32
2026-02-21T09:52:19.2288218Z       %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T09:52:19.2288507Z       %6 = tt.splat %4 : i32 -> tensor<16xi32>
2026-02-21T09:52:19.2288778Z       %7 = arith.addi %6, %5 : tensor<16xi32>
2026-02-21T09:52:19.2292086Z       %c9216_i32 = arith.constant 9216 : i32
2026-02-21T09:52:19.2292360Z       %c512_i32 = arith.constant 512 : i32
2026-02-21T09:52:19.2292851Z       %8:2 = scf.for %arg3 = %c0_i32 to %c9216_i32 step %c512_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<16xf32>, tensor<16xf32>)  : i32 {
2026-02-21T09:52:19.2293390Z         %50 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:52:19.2293819Z         %51 = arith.extf %50 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:52:19.2294146Z         %52 = "tt.reduce"(%51) <{axis = 1 : i32}> ({
2026-02-21T09:52:19.2294394Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:52:19.2294667Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:52:19.2294915Z           tt.reduce.return %128 : f32
2026-02-21T09:52:19.2295209Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:52:19.2295530Z         %53 = arith.truncf %52 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:52:19.2295862Z         %54 = arith.extf %53 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:52:19.2296151Z         %55 = arith.cmpf ogt, %arg4, %54 : tensor<16xf32>
2026-02-21T09:52:19.2296466Z         %56 = arith.cmpf une, %arg4, %arg4 : tensor<16xf32>
2026-02-21T09:52:19.2296735Z         %57 = arith.ori %55, %56 : tensor<16xi1>
2026-02-21T09:52:19.2297059Z         %58 = arith.select %57, %arg4, %54 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:52:19.2297362Z         %59 = arith.subf %arg4, %58 : tensor<16xf32>
2026-02-21T09:52:19.2297821Z         %60 = tt.extern_elementwise %59 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:52:19.2298273Z         %61 = arith.mulf %arg5, %60 : tensor<16xf32>
2026-02-21T09:52:19.2298584Z         %62 = tt.expand_dims %58 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:52:19.2298971Z         %63 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:52:19.2299264Z         %64 = arith.subf %51, %63 : tensor<16x128xf32>
2026-02-21T09:52:19.2299724Z         %65 = tt.extern_elementwise %64 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:52:19.2300179Z         %66 = "tt.reduce"(%65) <{axis = 1 : i32}> ({
2026-02-21T09:52:19.2300429Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:52:19.2300692Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:52:19.2301013Z           tt.reduce.return %128 : f32
2026-02-21T09:52:19.2301275Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:52:19.2301516Z         %67 = arith.addf %61, %66 : tensor<16xf32>
2026-02-21T09:52:19.2301823Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T09:52:19.2302055Z         %68 = arith.muli %c128_i32, %c1_i32_4 : i32
2026-02-21T09:52:19.2302317Z         %69 = arith.addi %arg3, %68 : i32
2026-02-21T09:52:19.2302661Z         %70 = tt.descriptor_load %0[%4, %69] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:52:19.2303023Z         %71 = arith.extf %70 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:52:19.2303354Z         %72 = "tt.reduce"(%71) <{axis = 1 : i32}> ({
2026-02-21T09:52:19.2303583Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:52:19.2303841Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:52:19.2304076Z           tt.reduce.return %128 : f32
2026-02-21T09:52:19.2304405Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:52:19.2304696Z         %73 = arith.truncf %72 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:52:19.2304984Z         %74 = arith.extf %73 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:52:19.2305285Z         %75 = arith.cmpf ogt, %58, %74 : tensor<16xf32>
2026-02-21T09:52:19.2305555Z         %76 = arith.cmpf une, %58, %58 : tensor<16xf32>
2026-02-21T09:52:19.2305822Z         %77 = arith.ori %75, %76 : tensor<16xi1>
2026-02-21T09:52:19.2306087Z         %78 = arith.select %77, %58, %74 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:52:19.2306386Z         %79 = arith.subf %58, %78 : tensor<16xf32>
2026-02-21T09:52:19.2306803Z         %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:52:19.2307191Z         %81 = arith.mulf %67, %80 : tensor<16xf32>
2026-02-21T09:52:19.2307517Z         %82 = tt.expand_dims %78 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:52:19.2307850Z         %83 = tt.broadcast %82 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:52:19.2308162Z         %84 = arith.subf %71, %83 : tensor<16x128xf32>
2026-02-21T09:52:19.2308596Z         %85 = tt.extern_elementwise %84 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:52:19.2308994Z         %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({
2026-02-21T09:52:19.2309253Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:52:19.2309478Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:52:19.2309729Z           tt.reduce.return %128 : f32
2026-02-21T09:52:19.2309956Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:52:19.2310216Z         %87 = arith.addf %81, %86 : tensor<16xf32>
2026-02-21T09:52:19.2310455Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:52:19.2310712Z         %88 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T09:52:19.2310967Z         %89 = arith.addi %arg3, %88 : i32
2026-02-21T09:52:19.2311282Z         %90 = tt.descriptor_load %0[%4, %89] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:52:19.2311698Z         %91 = arith.extf %90 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:52:19.2311965Z         %92 = "tt.reduce"(%91) <{axis = 1 : i32}> ({
2026-02-21T09:52:19.2312220Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:52:19.2312447Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:52:19.2312705Z           tt.reduce.return %128 : f32
2026-02-21T09:52:19.2312960Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:52:19.2313217Z         %93 = arith.truncf %92 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:52:19.2313525Z         %94 = arith.extf %93 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:52:19.2313787Z         %95 = arith.cmpf ogt, %78, %94 : tensor<16xf32>
2026-02-21T09:52:19.2314062Z         %96 = arith.cmpf une, %78, %78 : tensor<16xf32>
2026-02-21T09:52:19.2314396Z         %97 = arith.ori %95, %96 : tensor<16xi1>
2026-02-21T09:52:19.2314691Z         %98 = arith.select %97, %78, %94 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:52:19.2314994Z         %99 = arith.subf %78, %98 : tensor<16xf32>
2026-02-21T09:52:19.2315400Z         %100 = tt.extern_elementwise %99 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:52:19.2315837Z         %101 = arith.mulf %87, %100 : tensor<16xf32>
2026-02-21T09:52:19.2316130Z         %102 = tt.expand_dims %98 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:52:19.2316497Z         %103 = tt.broadcast %102 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:52:19.2316788Z         %104 = arith.subf %91, %103 : tensor<16x128xf32>
2026-02-21T09:52:19.2317235Z         %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:52:19.2317754Z         %106 = "tt.reduce"(%105) <{axis = 1 : i32}> ({
2026-02-21T09:52:19.2317991Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:52:19.2318245Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:52:19.2318471Z           tt.reduce.return %128 : f32
2026-02-21T09:52:19.2318733Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:52:19.2318975Z         %107 = arith.addf %101, %106 : tensor<16xf32>
2026-02-21T09:52:19.2319244Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T09:52:19.2319501Z         %108 = arith.muli %c128_i32, %c3_i32 : i32
2026-02-21T09:52:19.2319735Z         %109 = arith.addi %arg3, %108 : i32
2026-02-21T09:52:19.2320088Z         %110 = tt.descriptor_load %0[%4, %109] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:52:19.2320455Z         %111 = arith.extf %110 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:52:19.2320763Z         %112 = "tt.reduce"(%111) <{axis = 1 : i32}> ({
2026-02-21T09:52:19.2321001Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:52:19.2321253Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:52:19.2321509Z           tt.reduce.return %128 : f32
2026-02-21T09:52:19.2321774Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:52:19.2322071Z         %113 = arith.truncf %112 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:52:19.2322369Z         %114 = arith.extf %113 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:52:19.2322675Z         %115 = arith.cmpf ogt, %98, %114 : tensor<16xf32>
2026-02-21T09:52:19.2322940Z         %116 = arith.cmpf une, %98, %98 : tensor<16xf32>
2026-02-21T09:52:19.2323213Z         %117 = arith.ori %115, %116 : tensor<16xi1>
2026-02-21T09:52:19.2323517Z         %118 = arith.select %117, %98, %114 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:52:19.2323800Z         %119 = arith.subf %98, %118 : tensor<16xf32>
2026-02-21T09:52:19.2324229Z         %120 = tt.extern_elementwise %119 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:52:19.2324636Z         %121 = arith.mulf %107, %120 : tensor<16xf32>
2026-02-21T09:52:19.2324963Z         %122 = tt.expand_dims %118 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:52:19.2325339Z         %123 = tt.broadcast %122 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:52:19.2325630Z         %124 = arith.subf %111, %123 : tensor<16x128xf32>
2026-02-21T09:52:19.2326071Z         %125 = tt.extern_elementwise %124 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:52:19.2326479Z         %126 = "tt.reduce"(%125) <{axis = 1 : i32}> ({
2026-02-21T09:52:19.2326737Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:52:19.2326958Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:52:19.2327214Z           tt.reduce.return %128 : f32
2026-02-21T09:52:19.2327463Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:52:19.2327705Z         %127 = arith.addf %121, %126 : tensor<16xf32>
2026-02-21T09:52:19.2328058Z         scf.yield %118, %127 : tensor<16xf32>, tensor<16xf32>
2026-02-21T09:52:19.2328314Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:52:19.2328676Z       %9 = tt.descriptor_load %0[%4, %c9216_i32] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T09:52:19.2329047Z       %10 = arith.extf %9 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:52:19.2329348Z       %11 = "tt.reduce"(%10) <{axis = 1 : i32}> ({
2026-02-21T09:52:19.2329606Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:52:19.2329830Z         %50 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:52:19.2330091Z         tt.reduce.return %50 : f32
2026-02-21T09:52:19.2330317Z       }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:52:19.2330609Z       %12 = arith.truncf %11 : tensor<16xf32> to tensor<16xf16>
2026-02-21T09:52:19.2330890Z       %13 = arith.extf %12 : tensor<16xf16> to tensor<16xf32>
2026-02-21T09:52:19.2331281Z       %14 = arith.cmpf ogt, %8#0, %13 : tensor<16xf32>
2026-02-21T09:52:19.2331572Z       %15 = arith.cmpf une, %8#0, %8#0 : tensor<16xf32>
2026-02-21T09:52:19.2331850Z       %16 = arith.ori %14, %15 : tensor<16xi1>
2026-02-21T09:52:19.2332152Z       %17 = arith.select %16, %8#0, %13 : tensor<16xi1>, tensor<16xf32>
2026-02-21T09:52:19.2332426Z       %18 = arith.subf %8#0, %17 : tensor<16xf32>
2026-02-21T09:52:19.2332846Z       %19 = tt.extern_elementwise %18 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T09:52:19.2333242Z       %20 = arith.mulf %8#1, %19 : tensor<16xf32>
2026-02-21T09:52:19.2333561Z       %21 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:52:19.2333940Z       %22 = tt.broadcast %21 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:52:19.2334230Z       %23 = arith.subf %10, %22 : tensor<16x128xf32>
2026-02-21T09:52:19.2334680Z       %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:52:19.2335112Z       %25 = "tt.reduce"(%24) <{axis = 1 : i32}> ({
2026-02-21T09:52:19.2335377Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:52:19.2335603Z         %50 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:52:19.2335869Z         tt.reduce.return %50 : f32
2026-02-21T09:52:19.2336128Z       }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T09:52:19.2336373Z       %26 = arith.addf %20, %25 : tensor<16xf32>
2026-02-21T09:52:19.2336648Z       %c9216_i32_2 = arith.constant 9216 : i32
2026-02-21T09:52:19.2336890Z       %c512_i32_3 = arith.constant 512 : i32
2026-02-21T09:52:19.2337201Z       scf.for %arg3 = %c0_i32 to %c9216_i32_2 step %c512_i32_3  : i32 {
2026-02-21T09:52:19.2337538Z         %50 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:52:19.2337884Z         %51 = tt.splat %arg3 : i32 -> tensor<128xi32>
2026-02-21T09:52:19.2338142Z         %52 = arith.addi %51, %50 : tensor<128xi32>
2026-02-21T09:52:19.2338475Z         %53 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:52:19.2338825Z         %54 = arith.muli %53, %cst : tensor<16x1xi32>
2026-02-21T09:52:19.2339139Z         %55 = tt.expand_dims %52 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:52:19.2339521Z         %56 = tt.broadcast %54 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:52:19.2339840Z         %57 = tt.broadcast %55 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:52:19.2340158Z         %58 = arith.addi %56, %57 : tensor<16x128xi32>
2026-02-21T09:52:19.2340516Z         %59 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2340858Z         %60 = tt.addptr %59, %58 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:52:19.2341250Z         %61 = tt.load %60 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2341649Z         %62 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:52:19.2342085Z         %63 = arith.extf %61 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:52:19.2342463Z         %64 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:52:19.2342778Z         %65 = arith.subf %63, %64 : tensor<16x128xf32>
2026-02-21T09:52:19.2343203Z         %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:52:19.2343708Z         %67 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:52:19.2344051Z         %68 = tt.broadcast %67 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:52:19.2344357Z         %69 = arith.divf %66, %68 : tensor<16x128xf32>
2026-02-21T09:52:19.2344660Z         %70 = arith.truncf %69 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:52:19.2344969Z         %71 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2345370Z         %72 = tt.addptr %71, %58 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:52:19.2345672Z         tt.store %72, %70 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2345950Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T09:52:19.2346180Z         %73 = arith.muli %c128_i32, %c1_i32_4 : i32
2026-02-21T09:52:19.2346445Z         %74 = arith.addi %arg3, %73 : i32
2026-02-21T09:52:19.2346749Z         %75 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:52:19.2347036Z         %76 = tt.splat %74 : i32 -> tensor<128xi32>
2026-02-21T09:52:19.2347301Z         %77 = arith.addi %76, %75 : tensor<128xi32>
2026-02-21T09:52:19.2347590Z         %78 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:52:19.2347915Z         %79 = arith.muli %78, %cst : tensor<16x1xi32>
2026-02-21T09:52:19.2348206Z         %80 = tt.expand_dims %77 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:52:19.2348564Z         %81 = tt.broadcast %79 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:52:19.2348917Z         %82 = tt.broadcast %80 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:52:19.2349221Z         %83 = arith.addi %81, %82 : tensor<16x128xi32>
2026-02-21T09:52:19.2349546Z         %84 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2349867Z         %85 = tt.addptr %84, %83 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:52:19.2350239Z         %86 = tt.load %85 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2350583Z         %87 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:52:19.2350925Z         %88 = arith.extf %86 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:52:19.2351258Z         %89 = tt.broadcast %87 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:52:19.2351639Z         %90 = arith.subf %88, %89 : tensor<16x128xf32>
2026-02-21T09:52:19.2352091Z         %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:52:19.2352593Z         %92 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:52:19.2352956Z         %93 = tt.broadcast %92 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:52:19.2353257Z         %94 = arith.divf %91, %93 : tensor<16x128xf32>
2026-02-21T09:52:19.2353532Z         %95 = arith.truncf %94 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:52:19.2353868Z         %96 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2354180Z         %97 = tt.addptr %96, %83 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:52:19.2354471Z         tt.store %97, %95 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2354722Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:52:19.2354981Z         %98 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T09:52:19.2355241Z         %99 = arith.addi %arg3, %98 : i32
2026-02-21T09:52:19.2355578Z         %100 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:52:19.2355896Z         %101 = tt.splat %99 : i32 -> tensor<128xi32>
2026-02-21T09:52:19.2356142Z         %102 = arith.addi %101, %100 : tensor<128xi32>
2026-02-21T09:52:19.2356464Z         %103 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:52:19.2356774Z         %104 = arith.muli %103, %cst : tensor<16x1xi32>
2026-02-21T09:52:19.2357111Z         %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:52:19.2357474Z         %106 = tt.broadcast %104 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:52:19.2357782Z         %107 = tt.broadcast %105 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:52:19.2358114Z         %108 = arith.addi %106, %107 : tensor<16x128xi32>
2026-02-21T09:52:19.2358451Z         %109 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2358805Z         %110 = tt.addptr %109, %108 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:52:19.2359185Z         %111 = tt.load %110 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2359540Z         %112 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:52:19.2359902Z         %113 = arith.extf %111 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:52:19.2360210Z         %114 = tt.broadcast %112 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:52:19.2360522Z         %115 = arith.subf %113, %114 : tensor<16x128xf32>
2026-02-21T09:52:19.2360939Z         %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:52:19.2361425Z         %117 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:52:19.2361821Z         %118 = tt.broadcast %117 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:52:19.2362106Z         %119 = arith.divf %116, %118 : tensor<16x128xf32>
2026-02-21T09:52:19.2362421Z         %120 = arith.truncf %119 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:52:19.2362738Z         %121 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2363093Z         %122 = tt.addptr %121, %108 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:52:19.2363428Z         tt.store %122, %120 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2363673Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T09:52:19.2363930Z         %123 = arith.muli %c128_i32, %c3_i32 : i32
2026-02-21T09:52:19.2364164Z         %124 = arith.addi %arg3, %123 : i32
2026-02-21T09:52:19.2364467Z         %125 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:52:19.2364761Z         %126 = tt.splat %124 : i32 -> tensor<128xi32>
2026-02-21T09:52:19.2365041Z         %127 = arith.addi %126, %125 : tensor<128xi32>
2026-02-21T09:52:19.2365358Z         %128 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:52:19.2365664Z         %129 = arith.muli %128, %cst : tensor<16x1xi32>
2026-02-21T09:52:19.2365991Z         %130 = tt.expand_dims %127 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:52:19.2366316Z         %131 = tt.broadcast %129 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:52:19.2366649Z         %132 = tt.broadcast %130 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:52:19.2366909Z         %133 = arith.addi %131, %132 : tensor<16x128xi32>
2026-02-21T09:52:19.2367213Z         %134 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2367565Z         %135 = tt.addptr %134, %133 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:52:19.2367912Z         %136 = tt.load %135 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2368283Z         %137 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:52:19.2368669Z         %138 = arith.extf %136 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:52:19.2369009Z         %139 = tt.broadcast %137 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:52:19.2369317Z         %140 = arith.subf %138, %139 : tensor<16x128xf32>
2026-02-21T09:52:19.2369734Z         %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:52:19.2370185Z         %142 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:52:19.2370516Z         %143 = tt.broadcast %142 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:52:19.2370825Z         %144 = arith.divf %141, %143 : tensor<16x128xf32>
2026-02-21T09:52:19.2371112Z         %145 = arith.truncf %144 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:52:19.2371507Z         %146 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2371900Z         %147 = tt.addptr %146, %133 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:52:19.2372201Z         tt.store %147, %145 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2372489Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:52:19.2372772Z       %27 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T09:52:19.2373112Z       %28 = tt.splat %c9216_i32_2 : i32 -> tensor<128xi32>
2026-02-21T09:52:19.2373371Z       %29 = arith.addi %28, %27 : tensor<128xi32>
2026-02-21T09:52:19.2373655Z       %30 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T09:52:19.2373974Z       %31 = arith.muli %30, %cst : tensor<16x1xi32>
2026-02-21T09:52:19.2374264Z       %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T09:52:19.2374618Z       %33 = tt.broadcast %31 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T09:52:19.2374924Z       %34 = tt.broadcast %32 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T09:52:19.2375230Z       %35 = arith.addi %33, %34 : tensor<16x128xi32>
2026-02-21T09:52:19.2375532Z       %36 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2375850Z       %37 = tt.addptr %36, %35 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:52:19.2376221Z       %38 = tt.load %37 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2376568Z       %39 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:52:19.2376923Z       %40 = arith.extf %38 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T09:52:19.2377225Z       %41 = tt.broadcast %39 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:52:19.2379383Z       %42 = arith.subf %40, %41 : tensor<16x128xf32>
2026-02-21T09:52:19.2379798Z       %43 = tt.extern_elementwise %42 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T09:52:19.2380326Z       %44 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T09:52:19.2380696Z       %45 = tt.broadcast %44 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T09:52:19.2380985Z       %46 = arith.divf %43, %45 : tensor<16x128xf32>
2026-02-21T09:52:19.2381306Z       %47 = arith.truncf %46 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T09:52:19.2381670Z       %48 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2382036Z       %49 = tt.addptr %48, %35 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T09:52:19.2382350Z       tt.store %49, %47 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T09:52:19.2382745Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T09:52:19.2383133Z     tt.return
2026-02-21T09:52:19.2383341Z   }
2026-02-21T09:52:19.2383501Z }
2026-02-21T09:52:19.2383626Z 
2026-02-21T09:52:19.2383701Z {-#
2026-02-21T09:52:19.2383909Z   external_resources: {
2026-02-21T09:52:19.2384187Z     mlir_reproducer: {
2026-02-21T09:52:19.2388931Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T09:52:19.2393429Z       disable_threading: false,
2026-02-21T09:52:19.2393664Z       verify_each: true
2026-02-21T09:52:19.2393848Z     }
2026-02-21T09:52:19.2394036Z   }
2026-02-21T09:52:19.2394195Z #-}
2026-02-21T09:52:19.2394678Z /tmp/torchinductor_root/nt/cntgwioi7k7tujhdfmow44l2zvmhvplvsustl766b4zlecuifsdz.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:52:19.2395919Z /tmp/torchinductor_root/nt/cntgwioi7k7tujhdfmow44l2zvmhvplvsustl766b4zlecuifsdz.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:52:19.2396958Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:52:19.2398087Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_sm_multiplier=32, num_stages=3, num_warps=32, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[2, 3], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T09:52:19.2399174Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:52:19.2399473Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:52:19.6310771Z module {
2026-02-21T09:52:19.6311441Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:52:19.6312232Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:52:19.6312584Z     %cst = arith.constant dense<0.000000e+00> : tensor<8x1024xf16>
2026-02-21T09:52:19.6312910Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T09:52:19.6313172Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:52:19.6313433Z     %c592_i32 = arith.constant 592 : i32
2026-02-21T09:52:19.6313722Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<8x1024xf32>
2026-02-21T09:52:19.6314324Z     %cst_1 = arith.constant dense<0xFC00> : tensor<8x1024xf16>
2026-02-21T09:52:19.6314613Z     %cst_2 = arith.constant dense<9344> : tensor<8x1xi32>
2026-02-21T09:52:19.6314925Z     %cst_3 = arith.constant dense<9344> : tensor<1024xi32>
2026-02-21T09:52:19.6315212Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32>
2026-02-21T09:52:19.6315529Z     %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32>
2026-02-21T09:52:19.6315814Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:52:19.6316037Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:52:19.6316298Z     %c9344_i32 = arith.constant 9344 : i32
2026-02-21T09:52:19.6316522Z     %c9344_i64 = arith.constant 9344 : i64
2026-02-21T09:52:19.6316773Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T09:52:19.6317134Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c9344_i32], [%c9344_i64, %c1_i64] : <f16>, <tensor<8x1024xf16>>
2026-02-21T09:52:19.6317642Z     %1 = tt.get_program_id x : i32
2026-02-21T09:52:19.6317930Z     scf.for %arg2 = %1 to %c512_i32 step %c592_i32  : i32 {
2026-02-21T09:52:19.6318191Z       %2 = arith.muli %arg2, %c8_i32 : i32
2026-02-21T09:52:19.6318495Z       %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T09:52:19.6318784Z       %4 = tt.splat %2 : i32 -> tensor<8xi32>
2026-02-21T09:52:19.6319046Z       %5 = arith.addi %4, %3 : tensor<8xi32>
2026-02-21T09:52:19.6319276Z       %c9216_i32 = arith.constant 9216 : i32
2026-02-21T09:52:19.6319538Z       %c3072_i32 = arith.constant 3072 : i32
2026-02-21T09:52:19.6319982Z       %6:2 = scf.for %arg3 = %c0_i32 to %c9216_i32 step %c3072_i32 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<8xf32>, tensor<8xf32>)  : i32 {
2026-02-21T09:52:19.6320451Z         %66 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:52:19.6320783Z         %67 = tt.splat %arg3 : i32 -> tensor<1024xi32>
2026-02-21T09:52:19.6321036Z         %68 = arith.addi %67, %66 : tensor<1024xi32>
2026-02-21T09:52:19.6321322Z         %69 = arith.cmpi slt, %68, %cst_3 : tensor<1024xi32>
2026-02-21T09:52:19.6321681Z         %70 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:52:19.6322013Z         %71 = arith.muli %70, %cst_2 : tensor<8x1xi32>
2026-02-21T09:52:19.6322344Z         %72 = tt.expand_dims %68 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:52:19.6322683Z         %73 = tt.broadcast %71 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:52:19.6323020Z         %74 = tt.broadcast %72 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:52:19.6323304Z         %75 = arith.addi %73, %74 : tensor<8x1024xi32>
2026-02-21T09:52:19.6323610Z         %76 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:52:19.6324027Z         %77 = tt.addptr %76, %75 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:52:19.6324454Z         %78 = tt.expand_dims %69 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:52:19.6324794Z         %79 = tt.broadcast %78 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:52:19.6325118Z         %80 = tt.load %77, %79, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:52:19.6325450Z         %81 = arith.select %79, %80, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:52:19.6325800Z         %82 = arith.extf %81 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:52:19.6326078Z         %83 = "tt.reduce"(%82) <{axis = 1 : i32}> ({
2026-02-21T09:52:19.6326343Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:52:19.6326576Z           %175 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:52:19.6326841Z           tt.reduce.return %175 : f32
2026-02-21T09:52:19.6327096Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:52:19.6327390Z         %84 = arith.truncf %83 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:52:19.6327669Z         %85 = arith.extf %84 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:52:19.6327965Z         %86 = arith.cmpf ogt, %arg4, %85 : tensor<8xf32>
2026-02-21T09:52:19.6328300Z         %87 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32>
2026-02-21T09:52:19.6328550Z         %88 = arith.ori %86, %87 : tensor<8xi1>
2026-02-21T09:52:19.6328849Z         %89 = arith.select %88, %arg4, %85 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:52:19.6329125Z         %90 = arith.subf %arg4, %89 : tensor<8xf32>
2026-02-21T09:52:19.6329555Z         %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:52:19.6329984Z         %92 = arith.mulf %arg5, %91 : tensor<8xf32>
2026-02-21T09:52:19.6330274Z         %93 = tt.expand_dims %89 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:52:19.6330632Z         %94 = arith.extf %80 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:52:19.6330934Z         %95 = tt.broadcast %93 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:52:19.6331303Z         %96 = arith.subf %94, %95 : tensor<8x1024xf32>
2026-02-21T09:52:19.6331745Z         %97 = tt.extern_elementwise %96 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:52:19.6332227Z         %98 = arith.select %79, %97, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:52:19.6332553Z         %99 = "tt.reduce"(%98) <{axis = 1 : i32}> ({
2026-02-21T09:52:19.6332790Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:52:19.6333063Z           %175 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:52:19.6333295Z           tt.reduce.return %175 : f32
2026-02-21T09:52:19.6333557Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:52:19.6333794Z         %100 = arith.addf %92, %99 : tensor<8xf32>
2026-02-21T09:52:19.6334057Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T09:52:19.6334319Z         %101 = arith.muli %c1024_i32, %c1_i32 : i32
2026-02-21T09:52:19.6334550Z         %102 = arith.addi %arg3, %101 : i32
2026-02-21T09:52:19.6334860Z         %103 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:52:19.6335158Z         %104 = tt.splat %102 : i32 -> tensor<1024xi32>
2026-02-21T09:52:19.6335439Z         %105 = arith.addi %104, %103 : tensor<1024xi32>
2026-02-21T09:52:19.6335701Z         %106 = arith.cmpi slt, %105, %cst_3 : tensor<1024xi32>
2026-02-21T09:52:19.6336033Z         %107 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:52:19.6336371Z         %108 = arith.muli %107, %cst_2 : tensor<8x1xi32>
2026-02-21T09:52:19.6336679Z         %109 = tt.expand_dims %105 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:52:19.6337054Z         %110 = tt.broadcast %108 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:52:19.6337365Z         %111 = tt.broadcast %109 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:52:19.6337720Z         %112 = arith.addi %110, %111 : tensor<8x1024xi32>
2026-02-21T09:52:19.6338007Z         %113 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:52:19.6338364Z         %114 = tt.addptr %113, %112 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:52:19.6338742Z         %115 = tt.expand_dims %106 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:52:19.6339082Z         %116 = tt.broadcast %115 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:52:19.6339410Z         %117 = tt.load %114, %116, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:52:19.6339726Z         %118 = arith.select %116, %117, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:52:19.6340086Z         %119 = arith.extf %118 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:52:19.6340391Z         %120 = "tt.reduce"(%119) <{axis = 1 : i32}> ({
2026-02-21T09:52:19.6340624Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:52:19.6340880Z           %175 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:52:19.6341112Z           tt.reduce.return %175 : f32
2026-02-21T09:52:19.6341366Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:52:19.6341689Z         %121 = arith.truncf %120 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:52:19.6342004Z         %122 = arith.extf %121 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:52:19.6342273Z         %123 = arith.cmpf ogt, %89, %122 : tensor<8xf32>
2026-02-21T09:52:19.6342554Z         %124 = arith.cmpf une, %89, %89 : tensor<8xf32>
2026-02-21T09:52:19.6342823Z         %125 = arith.ori %123, %124 : tensor<8xi1>
2026-02-21T09:52:19.6343100Z         %126 = arith.select %125, %89, %122 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:52:19.6343411Z         %127 = arith.subf %89, %126 : tensor<8xf32>
2026-02-21T09:52:19.6343816Z         %128 = tt.extern_elementwise %127 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:52:19.6344260Z         %129 = arith.mulf %100, %128 : tensor<8xf32>
2026-02-21T09:52:19.6344653Z         %130 = tt.expand_dims %126 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:52:19.6344992Z         %131 = arith.extf %117 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:52:19.6345332Z         %132 = tt.broadcast %130 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:52:19.6345617Z         %133 = arith.subf %131, %132 : tensor<8x1024xf32>
2026-02-21T09:52:19.6346068Z         %134 = tt.extern_elementwise %133 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:52:19.6346502Z         %135 = arith.select %116, %134, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:52:19.6346840Z         %136 = "tt.reduce"(%135) <{axis = 1 : i32}> ({
2026-02-21T09:52:19.6347104Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:52:19.6347321Z           %175 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:52:19.6347561Z           tt.reduce.return %175 : f32
2026-02-21T09:52:19.6347792Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:52:19.6348069Z         %137 = arith.addf %129, %136 : tensor<8xf32>
2026-02-21T09:52:19.6348307Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:52:19.6348571Z         %138 = arith.muli %c1024_i32, %c2_i32 : i32
2026-02-21T09:52:19.6348844Z         %139 = arith.addi %arg3, %138 : i32
2026-02-21T09:52:19.6349121Z         %140 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:52:19.6349452Z         %141 = tt.splat %139 : i32 -> tensor<1024xi32>
2026-02-21T09:52:19.6349703Z         %142 = arith.addi %141, %140 : tensor<1024xi32>
2026-02-21T09:52:19.6349991Z         %143 = arith.cmpi slt, %142, %cst_3 : tensor<1024xi32>
2026-02-21T09:52:19.6350293Z         %144 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:52:19.6350625Z         %145 = arith.muli %144, %cst_2 : tensor<8x1xi32>
2026-02-21T09:52:19.6350990Z         %146 = tt.expand_dims %142 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:52:19.6351332Z         %147 = tt.broadcast %145 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:52:19.6351720Z         %148 = tt.broadcast %146 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:52:19.6352017Z         %149 = arith.addi %147, %148 : tensor<8x1024xi32>
2026-02-21T09:52:19.6352326Z         %150 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:52:19.6352651Z         %151 = tt.addptr %150, %149 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:52:19.6353030Z         %152 = tt.expand_dims %143 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:52:19.6353400Z         %153 = tt.broadcast %152 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:52:19.6353701Z         %154 = tt.load %151, %153, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:52:19.6354044Z         %155 = arith.select %153, %154, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:52:19.6354374Z         %156 = arith.extf %155 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:52:19.6354679Z         %157 = "tt.reduce"(%156) <{axis = 1 : i32}> ({
2026-02-21T09:52:19.6354973Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:52:19.6355214Z           %175 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:52:19.6355486Z           tt.reduce.return %175 : f32
2026-02-21T09:52:19.6355716Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:52:19.6356027Z         %158 = arith.truncf %157 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:52:19.6356326Z         %159 = arith.extf %158 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:52:19.6356636Z         %160 = arith.cmpf ogt, %126, %159 : tensor<8xf32>
2026-02-21T09:52:19.6356903Z         %161 = arith.cmpf une, %126, %126 : tensor<8xf32>
2026-02-21T09:52:19.6357185Z         %162 = arith.ori %160, %161 : tensor<8xi1>
2026-02-21T09:52:19.6357500Z         %163 = arith.select %162, %126, %159 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:52:19.6357793Z         %164 = arith.subf %126, %163 : tensor<8xf32>
2026-02-21T09:52:19.6358324Z         %165 = tt.extern_elementwise %164 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:52:19.6358751Z         %166 = arith.mulf %137, %165 : tensor<8xf32>
2026-02-21T09:52:19.6359091Z         %167 = tt.expand_dims %163 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:52:19.6359463Z         %168 = arith.extf %154 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:52:19.6359788Z         %169 = tt.broadcast %167 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:52:19.6360117Z         %170 = arith.subf %168, %169 : tensor<8x1024xf32>
2026-02-21T09:52:19.6360551Z         %171 = tt.extern_elementwise %170 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:52:19.6361067Z         %172 = arith.select %153, %171, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:52:19.6361387Z         %173 = "tt.reduce"(%172) <{axis = 1 : i32}> ({
2026-02-21T09:52:19.6361699Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:52:19.6361963Z           %175 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:52:19.6362204Z           tt.reduce.return %175 : f32
2026-02-21T09:52:19.6362504Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:52:19.6362758Z         %174 = arith.addf %166, %173 : tensor<8xf32>
2026-02-21T09:52:19.6363060Z         scf.yield %163, %174 : tensor<8xf32>, tensor<8xf32>
2026-02-21T09:52:19.6363317Z       } {tt.flatten}
2026-02-21T09:52:19.6363596Z       %7 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:52:19.6363930Z       %8 = tt.splat %c9216_i32 : i32 -> tensor<1024xi32>
2026-02-21T09:52:19.6364190Z       %9 = arith.addi %8, %7 : tensor<1024xi32>
2026-02-21T09:52:19.6364483Z       %10 = arith.cmpi slt, %9, %cst_3 : tensor<1024xi32>
2026-02-21T09:52:19.6364830Z       %11 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:52:19.6365166Z       %12 = arith.muli %11, %cst_2 : tensor<8x1xi32>
2026-02-21T09:52:19.6365482Z       %13 = tt.expand_dims %9 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:52:19.6365839Z       %14 = tt.broadcast %12 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:52:19.6366173Z       %15 = tt.broadcast %13 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:52:19.6366450Z       %16 = arith.addi %14, %15 : tensor<8x1024xi32>
2026-02-21T09:52:19.6366754Z       %17 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:52:19.6367133Z       %18 = tt.addptr %17, %16 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:52:19.6367469Z       %19 = tt.expand_dims %10 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:52:19.6367824Z       %20 = tt.broadcast %19 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:52:19.6368140Z       %21 = tt.load %18, %20, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:52:19.6368446Z       %22 = arith.select %20, %21, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:52:19.6368811Z       %23 = arith.extf %22 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:52:19.6369083Z       %24 = "tt.reduce"(%23) <{axis = 1 : i32}> ({
2026-02-21T09:52:19.6369359Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:52:19.6369586Z         %66 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:52:19.6369848Z         tt.reduce.return %66 : f32
2026-02-21T09:52:19.6370100Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:52:19.6370367Z       %25 = arith.truncf %24 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:52:19.6370666Z       %26 = arith.extf %25 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:52:19.6370927Z       %27 = arith.cmpf ogt, %6#0, %26 : tensor<8xf32>
2026-02-21T09:52:19.6371204Z       %28 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32>
2026-02-21T09:52:19.6371447Z       %29 = arith.ori %27, %28 : tensor<8xi1>
2026-02-21T09:52:19.6371779Z       %30 = arith.select %29, %6#0, %26 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:52:19.6372128Z       %31 = arith.subf %6#0, %30 : tensor<8xf32>
2026-02-21T09:52:19.6372519Z       %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:52:19.6372955Z       %33 = arith.mulf %6#1, %32 : tensor<8xf32>
2026-02-21T09:52:19.6373239Z       %34 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:52:19.6373588Z       %35 = arith.extf %21 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:52:19.6373867Z       %36 = tt.broadcast %34 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:52:19.6374173Z       %37 = arith.subf %35, %36 : tensor<8x1024xf32>
2026-02-21T09:52:19.6374605Z       %38 = tt.extern_elementwise %37 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:52:19.6375052Z       %39 = arith.select %20, %38, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:52:19.6375373Z       %40 = "tt.reduce"(%39) <{axis = 1 : i32}> ({
2026-02-21T09:52:19.6375609Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:52:19.6375861Z         %66 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:52:19.6376090Z         tt.reduce.return %66 : f32
2026-02-21T09:52:19.6376348Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:52:19.6376617Z       %41 = arith.addf %33, %40 : tensor<8xf32>
2026-02-21T09:52:19.6376853Z       %c9216_i32_6 = arith.constant 9216 : i32
2026-02-21T09:52:19.6377123Z       %c3072_i32_7 = arith.constant 3072 : i32
2026-02-21T09:52:19.6377395Z       scf.for %arg3 = %c0_i32 to %c9216_i32_6 step %c3072_i32_7  : i32 {
2026-02-21T09:52:19.6377754Z         %66 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:52:19.6378059Z         %67 = tt.splat %arg3 : i32 -> tensor<1024xi32>
2026-02-21T09:52:19.6378369Z         %68 = arith.addi %67, %66 : tensor<1024xi32>
2026-02-21T09:52:19.6378664Z         %69 = arith.cmpi slt, %68, %cst_3 : tensor<1024xi32>
2026-02-21T09:52:19.6379023Z         %70 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:52:19.6379441Z         %71 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:52:19.6379768Z         %72 = arith.extf %70 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:52:19.6380098Z         %73 = tt.broadcast %71 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:52:19.6380372Z         %74 = arith.subf %72, %73 : tensor<8x1024xf32>
2026-02-21T09:52:19.6380808Z         %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:52:19.6381289Z         %76 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:52:19.6381650Z         %77 = tt.broadcast %76 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:52:19.6381955Z         %78 = arith.divf %75, %77 : tensor<8x1024xf32>
2026-02-21T09:52:19.6382234Z         %79 = arith.truncf %78 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:52:19.6382613Z         %80 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:52:19.6382934Z         %81 = arith.muli %80, %cst_2 : tensor<8x1xi32>
2026-02-21T09:52:19.6383232Z         %82 = tt.expand_dims %68 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:52:19.6383587Z         %83 = tt.broadcast %81 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:52:19.6383891Z         %84 = tt.broadcast %82 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:52:19.6384193Z         %85 = arith.addi %83, %84 : tensor<8x1024xi32>
2026-02-21T09:52:19.6384468Z         %86 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:52:19.6384817Z         %87 = tt.addptr %86, %85 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:52:19.6385203Z         %88 = tt.expand_dims %69 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:52:19.6385606Z         %89 = tt.broadcast %88 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:52:19.6385931Z         tt.store %87, %79, %89 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:52:19.6386185Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T09:52:19.6386442Z         %90 = arith.muli %c1024_i32, %c1_i32 : i32
2026-02-21T09:52:19.6386710Z         %91 = arith.addi %arg3, %90 : i32
2026-02-21T09:52:19.6386991Z         %92 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:52:19.6387316Z         %93 = tt.splat %91 : i32 -> tensor<1024xi32>
2026-02-21T09:52:19.6387557Z         %94 = arith.addi %93, %92 : tensor<1024xi32>
2026-02-21T09:52:19.6387847Z         %95 = arith.cmpi slt, %94, %cst_3 : tensor<1024xi32>
2026-02-21T09:52:19.6388187Z         %96 = tt.descriptor_load %0[%2, %91] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:52:19.6388593Z         %97 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:52:19.6388951Z         %98 = arith.extf %96 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:52:19.6389252Z         %99 = tt.broadcast %97 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:52:19.6389559Z         %100 = arith.subf %98, %99 : tensor<8x1024xf32>
2026-02-21T09:52:19.6389978Z         %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:52:19.6390474Z         %102 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:52:19.6390834Z         %103 = tt.broadcast %102 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:52:19.6391118Z         %104 = arith.divf %101, %103 : tensor<8x1024xf32>
2026-02-21T09:52:19.6391442Z         %105 = arith.truncf %104 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:52:19.6391846Z         %106 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:52:19.6392178Z         %107 = arith.muli %106, %cst_2 : tensor<8x1xi32>
2026-02-21T09:52:19.6392491Z         %108 = tt.expand_dims %94 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:52:19.6392856Z         %109 = tt.broadcast %107 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:52:19.6393200Z         %110 = tt.broadcast %108 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:52:19.6393488Z         %111 = arith.addi %109, %110 : tensor<8x1024xi32>
2026-02-21T09:52:19.6393809Z         %112 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:52:19.6394138Z         %113 = tt.addptr %112, %111 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:52:19.6394520Z         %114 = tt.expand_dims %95 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:52:19.6394856Z         %115 = tt.broadcast %114 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:52:19.6395185Z         tt.store %113, %105, %115 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:52:19.6395476Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:52:19.6395740Z         %116 = arith.muli %c1024_i32, %c2_i32 : i32
2026-02-21T09:52:19.6396001Z         %117 = arith.addi %arg3, %116 : i32
2026-02-21T09:52:19.6396281Z         %118 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:52:19.6396609Z         %119 = tt.splat %117 : i32 -> tensor<1024xi32>
2026-02-21T09:52:19.6396861Z         %120 = arith.addi %119, %118 : tensor<1024xi32>
2026-02-21T09:52:19.6397170Z         %121 = arith.cmpi slt, %120, %cst_3 : tensor<1024xi32>
2026-02-21T09:52:19.6397543Z         %122 = tt.descriptor_load %0[%2, %117] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:52:19.6397922Z         %123 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:52:19.6398282Z         %124 = arith.extf %122 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:52:19.6398586Z         %125 = tt.broadcast %123 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:52:19.6398946Z         %126 = arith.subf %124, %125 : tensor<8x1024xf32>
2026-02-21T09:52:19.6399396Z         %127 = tt.extern_elementwise %126 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:52:19.6399869Z         %128 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:52:19.6400222Z         %129 = tt.broadcast %128 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:52:19.6400516Z         %130 = arith.divf %127, %129 : tensor<8x1024xf32>
2026-02-21T09:52:19.6400847Z         %131 = arith.truncf %130 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:52:19.6401222Z         %132 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:52:19.6401569Z         %133 = arith.muli %132, %cst_2 : tensor<8x1xi32>
2026-02-21T09:52:19.6401931Z         %134 = tt.expand_dims %120 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:52:19.6402284Z         %135 = tt.broadcast %133 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:52:19.6402633Z         %136 = tt.broadcast %134 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:52:19.6402939Z         %137 = arith.addi %135, %136 : tensor<8x1024xi32>
2026-02-21T09:52:19.6403265Z         %138 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:52:19.6403639Z         %139 = tt.addptr %138, %137 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:52:19.6404011Z         %140 = tt.expand_dims %121 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:52:19.6404395Z         %141 = tt.broadcast %140 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:52:19.6404707Z         tt.store %139, %131, %141 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:52:19.6405024Z       } {tt.flatten}
2026-02-21T09:52:19.6405284Z       %42 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:52:19.6405641Z       %43 = tt.splat %c9216_i32_6 : i32 -> tensor<1024xi32>
2026-02-21T09:52:19.6405940Z       %44 = arith.addi %43, %42 : tensor<1024xi32>
2026-02-21T09:52:19.6406207Z       %45 = arith.cmpi slt, %44, %cst_3 : tensor<1024xi32>
2026-02-21T09:52:19.6406615Z       %46 = tt.descriptor_load %0[%2, %c9216_i32_6] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:52:19.6407024Z       %47 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:52:19.6407395Z       %48 = arith.extf %46 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:52:19.6407744Z       %49 = tt.broadcast %47 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:52:19.6408031Z       %50 = arith.subf %48, %49 : tensor<8x1024xf32>
2026-02-21T09:52:19.6408490Z       %51 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:52:19.6408944Z       %52 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:52:19.6409295Z       %53 = tt.broadcast %52 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:52:19.6409597Z       %54 = arith.divf %51, %53 : tensor<8x1024xf32>
2026-02-21T09:52:19.6409912Z       %55 = arith.truncf %54 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:52:19.6410264Z       %56 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:52:19.6410554Z       %57 = arith.muli %56, %cst_2 : tensor<8x1xi32>
2026-02-21T09:52:19.6410882Z       %58 = tt.expand_dims %44 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:52:19.6411208Z       %59 = tt.broadcast %57 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:52:19.6411570Z       %60 = tt.broadcast %58 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:52:19.6411844Z       %61 = arith.addi %59, %60 : tensor<8x1024xi32>
2026-02-21T09:52:19.6412142Z       %62 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:52:19.6412573Z       %63 = tt.addptr %62, %61 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:52:19.6412910Z       %64 = tt.expand_dims %45 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:52:19.6413265Z       %65 = tt.broadcast %64 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:52:19.6413552Z       tt.store %63, %55, %65 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:52:19.6413992Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T09:52:19.6414406Z     tt.return
2026-02-21T09:52:19.6414576Z   }
2026-02-21T09:52:19.6414774Z }
2026-02-21T09:52:19.6414866Z 
2026-02-21T09:52:19.6414937Z {-#
2026-02-21T09:52:19.6415137Z   external_resources: {
2026-02-21T09:52:19.6415337Z     mlir_reproducer: {
2026-02-21T09:52:19.6419692Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T09:52:19.6424314Z       disable_threading: false,
2026-02-21T09:52:19.6424521Z       verify_each: true
2026-02-21T09:52:19.6424737Z     }
2026-02-21T09:52:19.6424923Z   }
2026-02-21T09:52:19.6425078Z #-}
2026-02-21T09:52:19.6425568Z /tmp/torchinductor_root/js/cjswdicw4cghex4jyofrnzl3zuhjuy5ja2oac7iwkk7djpqzg7jc.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:52:19.6426831Z /tmp/torchinductor_root/js/cjswdicw4cghex4jyofrnzl3zuhjuy5ja2oac7iwkk7djpqzg7jc.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:52:19.6427891Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:52:19.6429014Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=4, num_warps=32, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[4, 0], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T09:52:19.6430034Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:52:19.6430341Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:52:26.2978724Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 10.9 configs/s
2026-02-21T09:52:26.2989139Z [47s] Adaptive compile timeout: 30s (90% percentile=13.3s, bounds=[30.0s, 30s])
2026-02-21T09:52:27.4009535Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 894.7 configs/s
2026-02-21T09:52:27.4748055Z [48s] Initial random population of 100, 5 starting points: 
2026-02-21T09:52:27.4752992Z error=13
2026-02-21T09:52:27.4757226Z timeout=1
2026-02-21T09:52:27.4761732Z ok=86
2026-02-21T09:52:27.4763183Z min=0.0656
2026-02-21T09:52:27.4763424Z mid=0.5489
2026-02-21T09:52:27.4763602Z max=218.3107
2026-02-21T09:52:27.4763824Z best={'block_sizes': [1, 1024],
2026-02-21T09:52:27.4764094Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:52:27.4764438Z  'load_eviction_policies': ['first', 'last'],
2026-02-21T09:52:27.4764669Z  'num_stages': 6,
2026-02-21T09:52:27.4764878Z  'num_warps': 4,
2026-02-21T09:52:27.4765060Z  'pid_type': 'flat',
2026-02-21T09:52:27.4765327Z  'range_flattens': [None, None],
2026-02-21T09:52:27.4765543Z  'range_multi_buffers': [None, True],
2026-02-21T09:52:27.4765793Z  'range_num_stages': [0, 0],
2026-02-21T09:52:27.4766030Z  'range_unroll_factors': [0, 4],
2026-02-21T09:52:27.4766253Z  'range_warp_specializes': [None, False]}
2026-02-21T09:52:27.4767957Z [48s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:52:28.5095211Z [49s] Generation 1 starting: 75 neighbors, 5 active search path(s)
2026-02-21T09:52:51.3903431Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 1.5 configs/s
2026-02-21T09:52:56.0902569Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 16.7 configs/s
2026-02-21T09:53:00.2842133Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 240.2 configs/s
2026-02-21T09:53:00.4932858Z [81s] Generation 1 complete: 
2026-02-21T09:53:00.4934508Z ok=81
2026-02-21T09:53:00.4934824Z min=0.0492
2026-02-21T09:53:00.4935084Z mid=0.0798
2026-02-21T09:53:00.4935299Z max=1.3835
2026-02-21T09:53:00.4935570Z best={'block_sizes': [1, 1024],
2026-02-21T09:53:00.4935963Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:53:00.4936410Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:53:00.4936749Z  'num_stages': 3,
2026-02-21T09:53:00.4936998Z  'num_warps': 1,
2026-02-21T09:53:00.4937273Z  'pid_type': 'flat',
2026-02-21T09:53:00.4937537Z  'range_flattens': [None, False],
2026-02-21T09:53:00.4937866Z  'range_multi_buffers': [None, True],
2026-02-21T09:53:00.4938171Z  'range_num_stages': [0, 2],
2026-02-21T09:53:00.4938470Z  'range_unroll_factors': [0, 3],
2026-02-21T09:53:00.4938691Z  'range_warp_specializes': [None, None]}
2026-02-21T09:53:00.4947570Z [81s] Fitting surrogate: 181 points, 181 targets
2026-02-21T09:53:02.0471298Z [83s] Generation 2 starting: 65 neighbors, 5 active search path(s)
2026-02-21T09:53:20.3982501Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 0.5 configs/s
2026-02-21T09:53:24.4776246Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 17.1 configs/s
2026-02-21T09:53:30.4813986Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 168.2 configs/s
2026-02-21T09:53:30.8112255Z [112s] Generation 2 complete: 
2026-02-21T09:53:30.8115641Z ok=71
2026-02-21T09:53:30.8120157Z min=0.0511
2026-02-21T09:53:30.8127409Z mid=0.0614
2026-02-21T09:53:30.8132143Z max=0.1434
2026-02-21T09:53:30.8132688Z best={'block_sizes': [1, 1024],
2026-02-21T09:53:30.8133008Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:53:30.8133393Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:53:30.8133670Z  'num_stages': 3,
2026-02-21T09:53:30.8133862Z  'num_warps': 1,
2026-02-21T09:53:30.8134090Z  'pid_type': 'flat',
2026-02-21T09:53:30.8134661Z  'range_flattens': [None, False],
2026-02-21T09:53:30.8134993Z  'range_multi_buffers': [None, True],
2026-02-21T09:53:30.8135226Z  'range_num_stages': [0, 2],
2026-02-21T09:53:30.8139428Z  'range_unroll_factors': [0, 3],
2026-02-21T09:53:30.8139754Z  'range_warp_specializes': [None, None]}
2026-02-21T09:53:30.8140051Z [112s] Fitting surrogate: 252 points, 252 targets
2026-02-21T09:53:31.8072213Z [113s] Generation 3 starting: 61 neighbors, 5 active search path(s)
2026-02-21T09:53:43.7563118Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 1.4 configs/s
2026-02-21T09:53:47.3615601Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 61/61 17.1 configs/s
2026-02-21T09:53:53.3387923Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 169.2 configs/s
2026-02-21T09:53:53.6509645Z [134s] Generation 3 complete: 
2026-02-21T09:53:53.6512628Z ok=67
2026-02-21T09:53:53.6515931Z min=0.0492
2026-02-21T09:53:53.6519057Z mid=0.0573
2026-02-21T09:53:53.6523660Z max=0.2036
2026-02-21T09:53:53.6528116Z best={'block_sizes': [1, 1024],
2026-02-21T09:53:53.6532030Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:53:53.6532430Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:53:53.6536360Z  'num_stages': 2,
2026-02-21T09:53:53.6540763Z  'num_warps': 1,
2026-02-21T09:53:53.6545625Z  'pid_type': 'flat',
2026-02-21T09:53:53.6549869Z  'range_flattens': [None, False],
2026-02-21T09:53:53.6554351Z  'range_multi_buffers': [None, True],
2026-02-21T09:53:53.6558710Z  'range_num_stages': [0, 2],
2026-02-21T09:53:53.6561812Z  'range_unroll_factors': [0, 3],
2026-02-21T09:53:53.6563861Z  'range_warp_specializes': [None, None]}
2026-02-21T09:53:53.6564607Z [134s] Fitting surrogate: 319 points, 319 targets
2026-02-21T09:53:54.3824253Z [135s] Generation 4 starting: 47 neighbors, 4 active search path(s)
2026-02-21T09:54:04.4298995Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48/48 2.7 configs/s
2026-02-21T09:54:07.2780369Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 48/48 17.1 configs/s
2026-02-21T09:54:11.6049648Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 272.8 configs/s
2026-02-21T09:54:11.8126641Z [153s] Generation 4 complete: 
2026-02-21T09:54:11.8131000Z ok=52
2026-02-21T09:54:11.8135370Z min=0.0492
2026-02-21T09:54:11.8139312Z mid=0.0553
2026-02-21T09:54:11.8143786Z max=0.2171
2026-02-21T09:54:11.8148132Z best={'block_sizes': [1, 1024],
2026-02-21T09:54:11.8152769Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:54:11.8156028Z  'load_eviction_policies': ['last', 'first'],
2026-02-21T09:54:11.8156370Z  'num_stages': 6,
2026-02-21T09:54:11.8160432Z  'num_warps': 1,
2026-02-21T09:54:11.8162198Z  'pid_type': 'flat',
2026-02-21T09:54:11.8162489Z  'range_flattens': [None, None],
2026-02-21T09:54:11.8163025Z  'range_multi_buffers': [None, True],
2026-02-21T09:54:11.8163302Z  'range_num_stages': [0, 1],
2026-02-21T09:54:11.8163524Z  'range_unroll_factors': [0, 3],
2026-02-21T09:54:11.8163795Z  'range_warp_specializes': [None, False]}
2026-02-21T09:54:11.8164072Z [153s] Fitting surrogate: 371 points, 371 targets
2026-02-21T09:54:12.4133986Z [153s] Generation 5 starting: 34 neighbors, 3 active search path(s)
2026-02-21T09:54:18.6278894Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 4.8 configs/s
2026-02-21T09:54:20.7034889Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 17.2 configs/s
2026-02-21T09:54:23.8668086Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 318.6 configs/s
2026-02-21T09:54:24.0436721Z [165s] Generation 5 complete: 
2026-02-21T09:54:24.0440958Z ok=38
2026-02-21T09:54:24.0442794Z min=0.0480
2026-02-21T09:54:24.0443052Z mid=0.0532
2026-02-21T09:54:24.0443266Z max=0.1272
2026-02-21T09:54:24.0443447Z best={'block_sizes': [1, 8192],
2026-02-21T09:54:24.0443750Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:54:24.0444029Z  'load_eviction_policies': ['last', ''],
2026-02-21T09:54:24.0444293Z  'num_stages': 1,
2026-02-21T09:54:24.0444473Z  'num_warps': 4,
2026-02-21T09:54:24.0444680Z  'pid_type': 'flat',
2026-02-21T09:54:24.0444873Z  'range_flattens': [None, True],
2026-02-21T09:54:24.0445123Z  'range_multi_buffers': [None, None],
2026-02-21T09:54:24.0445376Z  'range_num_stages': [0, 3],
2026-02-21T09:54:24.0445583Z  'range_unroll_factors': [0, 2],
2026-02-21T09:54:24.0445833Z  'range_warp_specializes': [None, False]}
2026-02-21T09:54:24.0453518Z [165s] Fitting surrogate: 409 points, 409 targets
2026-02-21T09:54:24.4942612Z [165s] Generation 6 starting: 26 neighbors, 2 active search path(s)
2026-02-21T09:54:32.1565750Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 1.5 configs/s
2026-02-21T09:54:33.7572283Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 27/27 17.3 configs/s
2026-02-21T09:54:36.1271102Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 424.1 configs/s
2026-02-21T09:54:36.2625102Z [177s] Generation 6 complete: 
2026-02-21T09:54:36.2630284Z ok=28
2026-02-21T09:54:36.2631890Z min=0.0472
2026-02-21T09:54:36.2632121Z mid=0.0532
2026-02-21T09:54:36.2632290Z max=0.2273
2026-02-21T09:54:36.2632500Z best={'block_sizes': [1, 8192],
2026-02-21T09:54:36.2632770Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:54:36.2633072Z  'load_eviction_policies': ['last', ''],
2026-02-21T09:54:36.2633628Z  'num_stages': 1,
2026-02-21T09:54:36.2633847Z  'num_warps': 4,
2026-02-21T09:54:36.2634034Z  'pid_type': 'flat',
2026-02-21T09:54:36.2634269Z  'range_flattens': [None, True],
2026-02-21T09:54:36.2634504Z  'range_multi_buffers': [None, None],
2026-02-21T09:54:36.2634866Z  'range_num_stages': [0, 3],
2026-02-21T09:54:36.2635099Z  'range_unroll_factors': [0, 2],
2026-02-21T09:54:36.2635321Z  'range_warp_specializes': [None, False]}
2026-02-21T09:54:36.2640449Z [177s] Fitting surrogate: 437 points, 437 targets
2026-02-21T09:54:36.5508845Z [177s] Generation 7 starting: 9 neighbors, 1 active search path(s)
2026-02-21T09:54:39.9020148Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9/9 2.4 configs/s
2026-02-21T09:54:40.4376117Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 18.4 configs/s
2026-02-21T09:54:41.1847572Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1319.1 configs/s
2026-02-21T09:54:41.2435829Z [182s] Generation 7 complete: 
2026-02-21T09:54:41.2440107Z ok=10
2026-02-21T09:54:41.2441841Z min=0.0490
2026-02-21T09:54:41.2442140Z mid=0.0553
2026-02-21T09:54:41.2447092Z max=0.0758
2026-02-21T09:54:41.2448463Z best={'block_sizes': [1, 8192],
2026-02-21T09:54:41.2448862Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:54:41.2453879Z  'load_eviction_policies': ['last', ''],
2026-02-21T09:54:41.2455395Z  'num_stages': 1,
2026-02-21T09:54:41.2455702Z  'num_warps': 4,
2026-02-21T09:54:41.2458684Z  'pid_type': 'flat',
2026-02-21T09:54:41.2458984Z  'range_flattens': [None, True],
2026-02-21T09:54:41.2459252Z  'range_multi_buffers': [None, None],
2026-02-21T09:54:41.2462956Z  'range_num_stages': [0, 3],
2026-02-21T09:54:41.2467276Z  'range_unroll_factors': [0, 2],
2026-02-21T09:54:41.2471446Z  'range_warp_specializes': [None, False]}
2026-02-21T09:54:41.2475911Z [182s] Fitting surrogate: 447 points, 447 targets
2026-02-21T09:54:41.4195133Z [182s] Autotuning complete in 182.7s after searching 426 configs.
2026-02-21T09:54:41.4199534Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:54:41.4201923Z     @helion.kernel(config=helion.Config(block_sizes=[1, 8192], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', ''], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 2], range_warp_specializes=[None, False]), static_shapes=True)
2026-02-21T09:54:41.4202876Z 
2026-02-21T09:54:41.4203160Z [182s] Code of selected kernel: /tmp/torchinductor_root/7k/c7k7oelmsebgubewu5tkpv7n5yseqxxi252ufqh4z6tavu5nsnni.py
2026-02-21T09:54:41.4419596Z from __future__ import annotations
2026-02-21T09:54:41.4423153Z 
2026-02-21T09:54:41.4427815Z import torch
2026-02-21T09:54:41.4429781Z import triton
2026-02-21T09:54:41.4430054Z import triton.language as tl
2026-02-21T09:54:41.4430317Z from torch._inductor.runtime import triton_helpers
2026-02-21T09:54:41.4430666Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T09:54:41.4431027Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:54:41.4431230Z 
2026-02-21T09:54:41.4431322Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T09:54:41.4431689Z _BLOCK_SIZE_1 = tl.constexpr(8192)
2026-02-21T09:54:41.4431830Z 
2026-02-21T09:54:41.4431910Z @triton.jit
2026-02-21T09:54:41.4432124Z def _helion_softmax_two_pass(x, out):
2026-02-21T09:54:41.4432416Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:54:41.4432734Z     pid_0 = tl.program_id(0)
2026-02-21T09:54:41.4432936Z     offset_0 = pid_0
2026-02-21T09:54:41.4433181Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T09:54:41.4433546Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:54:41.4433876Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T09:54:41.4434416Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:54:41.4434703Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T09:54:41.4435029Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:54:41.4435422Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T09:54:41.4435736Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T09:54:41.4436036Z     # src[softmax.py:82-89]: ...
2026-02-21T09:54:41.4436428Z     for offset_2 in tl.range(0, 9344, _BLOCK_SIZE_1, loop_unroll_factor=2, warp_specialize=False, num_stages=1, flatten=True):
2026-02-21T09:54:41.4436900Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:54:41.4437175Z         mask_1 = indices_2 < 9344
2026-02-21T09:54:41.4437410Z         mi_copy = mi
2026-02-21T09:54:41.4437591Z         di_copy = di
2026-02-21T09:54:41.4437798Z         mi_copy_0 = mi_copy
2026-02-21T09:54:41.4438043Z         di_copy_0 = di_copy
2026-02-21T09:54:41.4438262Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T09:54:41.4438771Z         values = tl.load(x + (indices_0[:, None] * 9344 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_last')
2026-02-21T09:54:41.4439235Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T09:54:41.4439685Z         _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16))
2026-02-21T09:54:41.4440146Z         local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16)
2026-02-21T09:54:41.4440447Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T09:54:41.4440749Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T09:54:41.4441022Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T09:54:41.4441325Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:54:41.4441671Z         v_2 = mi_copy_0 - v_1
2026-02-21T09:54:41.4441888Z         v_3 = libdevice.exp(v_2)
2026-02-21T09:54:41.4442124Z         v_4 = di_copy_0 * v_3
2026-02-21T09:54:41.4442354Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T09:54:41.4442632Z         subscript = v_1[:, None]
2026-02-21T09:54:41.4442845Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T09:54:41.4443091Z         v_6 = v_5 - subscript
2026-02-21T09:54:41.4443376Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:54:41.4443679Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T09:54:41.4443956Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T09:54:41.4444181Z         v_7 = libdevice.exp(v_6)
2026-02-21T09:54:41.4444567Z         _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32))
2026-02-21T09:54:41.4444957Z         sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32)
2026-02-21T09:54:41.4445225Z         di = v_4 + sum_1
2026-02-21T09:54:41.4445451Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T09:54:41.4445663Z         mi = v_1
2026-02-21T09:54:41.4445929Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:54:41.4446238Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T09:54:41.4446594Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:54:41.4447063Z     for offset_2 in tl.range(0, 9344, _BLOCK_SIZE_1, loop_unroll_factor=2, warp_specialize=False, num_stages=1, flatten=True):
2026-02-21T09:54:41.4447523Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:54:41.4447821Z         mask_2 = indices_2 < 9344
2026-02-21T09:54:41.4448025Z         mi_copy_1 = mi
2026-02-21T09:54:41.4448239Z         di_copy_1 = di
2026-02-21T09:54:41.4448430Z         mi_copy_1_0 = mi_copy_1
2026-02-21T09:54:41.4448662Z         di_copy_1_0 = di_copy_1
2026-02-21T09:54:41.4448932Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T09:54:41.4449306Z         values_1 = tl.load(x + (indices_0[:, None] * 9344 + indices_2[None, :] * 1), mask_2[None, :], other=0)
2026-02-21T09:54:41.4449759Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:54:41.4450102Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T09:54:41.4450355Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T09:54:41.4450588Z         v_10 = v_9 - subscript_1
2026-02-21T09:54:41.4450825Z         v_11 = libdevice.exp(v_10)
2026-02-21T09:54:41.4451048Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T09:54:41.4451292Z         v_12 = v_11 / subscript_2
2026-02-21T09:54:41.4451566Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T09:54:41.4451908Z         tl.store(out + (indices_0[:, None] * 9344 + indices_2[None, :] * 1), v_13, mask_2[None, :])
2026-02-21T09:54:41.4452143Z 
2026-02-21T09:54:41.4452322Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T09:54:41.4452598Z     """
2026-02-21T09:54:41.4452882Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T09:54:41.4453326Z     This version uses fewer passes but is less numerically stable.
2026-02-21T09:54:41.4453623Z     Args:
2026-02-21T09:54:41.4453828Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T09:54:41.4454097Z     Returns:
2026-02-21T09:54:41.4454321Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T09:54:41.4454608Z     """
2026-02-21T09:54:41.4454816Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T09:54:41.4455044Z     m, n = x.size()
2026-02-21T09:54:41.4455276Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T09:54:41.4455516Z     out = torch.empty_like(x)
2026-02-21T09:54:41.4455804Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:54:41.4456159Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:54:41.4456539Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:54:41.4456842Z     # src[softmax.py:79-92]: ...
2026-02-21T09:54:41.4457140Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=1)
2026-02-21T09:54:41.4457478Z     # src[softmax.py:93]: return out
2026-02-21T09:54:41.4457687Z     return out
2026-02-21T09:54:42.5312639Z WARNING:tritonbench.utils.triton_op:Completed input ID 71:
2026-02-21T09:54:42.5316845Z (M, N)
2026-02-21T09:54:42.5318288Z ------------
2026-02-21T09:54:42.5318563Z (4096, 9344)
2026-02-21T09:54:42.5318671Z 
2026-02-21T09:54:42.5325786Z  75%|███████▌  | 15/20 [45:47<16:23, 196.66s/it]WARNING:tritonbench.utils.triton_op:Running input ID 77:
2026-02-21T09:54:42.5327561Z (M, N)
2026-02-21T09:54:42.5327836Z -------------
2026-02-21T09:54:42.5331973Z (4096, 10112)
2026-02-21T09:54:42.5336443Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T09:54:43.7389299Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T09:54:45.1404267Z INFO:tritonbench.utils.triton_op:Took 2.25ms to get benchmark function for torch_compile_softmax
2026-02-21T09:54:46.4619831Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:54:46.4624367Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:54:46.4628175Z               'dtype': 'torch.float16',
2026-02-21T09:54:46.4629702Z               'shape': (4096, 10112),
2026-02-21T09:54:46.4630049Z               'stride': (10112, 1)},),
2026-02-21T09:54:46.4635015Z   'kwargs': {}}
2026-02-21T09:54:46.4639708Z INFO:tritonbench.utils.triton_op:Took 2.17ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:54:46.6356122Z [0s] Autotune random seed: 2138408546
2026-02-21T09:54:46.6600668Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:55:22.6519002Z [35s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', ''], maxnreg=32, num_sm_multiplier=2, num_stages=8, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[3, 2], range_unroll_factors=[4, 1], range_warp_specializes=[False, False])
2026-02-21T09:55:25.6869697Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s
2026-02-21T09:55:28.0705784Z module {
2026-02-21T09:55:28.0708124Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:55:28.0713051Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:55:28.0717597Z     %cst = arith.constant dense<0.000000e+00> : tensor<8x1024xf16>
2026-02-21T09:55:28.0718868Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T09:55:28.0719223Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:55:28.0719481Z     %c592_i32 = arith.constant 592 : i32
2026-02-21T09:55:28.0720149Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<8x1024xf32>
2026-02-21T09:55:28.0720526Z     %cst_1 = arith.constant dense<0xFC00> : tensor<8x1024xf16>
2026-02-21T09:55:28.0720830Z     %cst_2 = arith.constant dense<10112> : tensor<8x1xi32>
2026-02-21T09:55:28.0721157Z     %cst_3 = arith.constant dense<10112> : tensor<1024xi32>
2026-02-21T09:55:28.0721460Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32>
2026-02-21T09:55:28.0721881Z     %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32>
2026-02-21T09:55:28.0723515Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:55:28.0723792Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:55:28.0724075Z     %c10112_i32 = arith.constant 10112 : i32
2026-02-21T09:55:28.0724351Z     %c10112_i64 = arith.constant 10112 : i64
2026-02-21T09:55:28.0724581Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T09:55:28.0724999Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c10112_i32], [%c10112_i64, %c1_i64] : <f16>, <tensor<8x1024xf16>>
2026-02-21T09:55:28.0725377Z     %1 = tt.get_program_id x : i32
2026-02-21T09:55:28.0725662Z     scf.for %arg2 = %1 to %c512_i32 step %c592_i32  : i32 {
2026-02-21T09:55:28.0725923Z       %2 = arith.muli %arg2, %c8_i32 : i32
2026-02-21T09:55:28.0726251Z       %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T09:55:28.0726567Z       %4 = tt.splat %2 : i32 -> tensor<8xi32>
2026-02-21T09:55:28.0726807Z       %5 = arith.addi %4, %3 : tensor<8xi32>
2026-02-21T09:55:28.0727081Z       %c9216_i32 = arith.constant 9216 : i32
2026-02-21T09:55:28.0727317Z       %c3072_i32 = arith.constant 3072 : i32
2026-02-21T09:55:28.0727789Z       %6:2 = scf.for %arg3 = %c0_i32 to %c9216_i32 step %c3072_i32 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<8xf32>, tensor<8xf32>)  : i32 {
2026-02-21T09:55:28.0728287Z         %66 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:55:28.0728652Z         %67 = tt.splat %arg3 : i32 -> tensor<1024xi32>
2026-02-21T09:55:28.0728956Z         %68 = arith.addi %67, %66 : tensor<1024xi32>
2026-02-21T09:55:28.0729238Z         %69 = arith.cmpi slt, %68, %cst_3 : tensor<1024xi32>
2026-02-21T09:55:28.0729591Z         %70 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:55:28.0729884Z         %71 = arith.muli %70, %cst_2 : tensor<8x1xi32>
2026-02-21T09:55:28.0730239Z         %72 = tt.expand_dims %68 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:55:28.0730594Z         %73 = tt.broadcast %71 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:55:28.0730948Z         %74 = tt.broadcast %72 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:55:28.0731278Z         %75 = arith.addi %73, %74 : tensor<8x1024xi32>
2026-02-21T09:55:28.0731651Z         %76 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:55:28.0732298Z         %77 = tt.addptr %76, %75 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:55:28.0732673Z         %78 = tt.expand_dims %69 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:55:28.0733119Z         %79 = tt.broadcast %78 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:55:28.0733464Z         %80 = tt.load %77, %79, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:55:28.0733793Z         %81 = arith.select %79, %80, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:55:28.0734158Z         %82 = arith.extf %81 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:55:28.0734464Z         %83 = "tt.reduce"(%82) <{axis = 1 : i32}> ({
2026-02-21T09:55:28.0734744Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:55:28.0734991Z           %175 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:55:28.0735270Z           tt.reduce.return %175 : f32
2026-02-21T09:55:28.0735544Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:55:28.0735827Z         %84 = arith.truncf %83 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:55:28.0736159Z         %85 = arith.extf %84 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:55:28.0736515Z         %86 = arith.cmpf ogt, %arg4, %85 : tensor<8xf32>
2026-02-21T09:55:28.0736814Z         %87 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32>
2026-02-21T09:55:28.0737069Z         %88 = arith.ori %86, %87 : tensor<8xi1>
2026-02-21T09:55:28.0737371Z         %89 = arith.select %88, %arg4, %85 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:55:28.0737680Z         %90 = arith.subf %arg4, %89 : tensor<8xf32>
2026-02-21T09:55:28.0738159Z         %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:55:28.0738576Z         %92 = arith.mulf %arg5, %91 : tensor<8xf32>
2026-02-21T09:55:28.0738863Z         %93 = tt.expand_dims %89 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:55:28.0739211Z         %94 = arith.extf %80 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:55:28.0739512Z         %95 = tt.broadcast %93 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:55:28.0739830Z         %96 = arith.subf %94, %95 : tensor<8x1024xf32>
2026-02-21T09:55:28.0740269Z         %97 = tt.extern_elementwise %96 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:55:28.0740720Z         %98 = arith.select %79, %97, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:55:28.0741037Z         %99 = "tt.reduce"(%98) <{axis = 1 : i32}> ({
2026-02-21T09:55:28.0741276Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:55:28.0741527Z           %175 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:55:28.0741833Z           tt.reduce.return %175 : f32
2026-02-21T09:55:28.0742062Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:55:28.0742328Z         %100 = arith.addf %92, %99 : tensor<8xf32>
2026-02-21T09:55:28.0742560Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T09:55:28.0742809Z         %101 = arith.muli %c1024_i32, %c1_i32 : i32
2026-02-21T09:55:28.0743055Z         %102 = arith.addi %arg3, %101 : i32
2026-02-21T09:55:28.0743363Z         %103 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:55:28.0743667Z         %104 = tt.splat %102 : i32 -> tensor<1024xi32>
2026-02-21T09:55:28.0743947Z         %105 = arith.addi %104, %103 : tensor<1024xi32>
2026-02-21T09:55:28.0744236Z         %106 = arith.cmpi slt, %105, %cst_3 : tensor<1024xi32>
2026-02-21T09:55:28.0744538Z         %107 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:55:28.0744871Z         %108 = arith.muli %107, %cst_2 : tensor<8x1xi32>
2026-02-21T09:55:28.0745176Z         %109 = tt.expand_dims %105 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:55:28.0745542Z         %110 = tt.broadcast %108 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:55:28.0745877Z         %111 = tt.broadcast %109 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:55:28.0746209Z         %112 = arith.addi %110, %111 : tensor<8x1024xi32>
2026-02-21T09:55:28.0746524Z         %113 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:55:28.0746893Z         %114 = tt.addptr %113, %112 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:55:28.0747275Z         %115 = tt.expand_dims %106 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:55:28.0747618Z         %116 = tt.broadcast %115 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:55:28.0747953Z         %117 = tt.load %114, %116, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:55:28.0748299Z         %118 = arith.select %116, %117, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:55:28.0748627Z         %119 = arith.extf %118 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:55:28.0748934Z         %120 = "tt.reduce"(%119) <{axis = 1 : i32}> ({
2026-02-21T09:55:28.0749172Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:55:28.0749433Z           %175 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:55:28.0749673Z           tt.reduce.return %175 : f32
2026-02-21T09:55:28.0749986Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:55:28.0750281Z         %121 = arith.truncf %120 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:55:28.0750569Z         %122 = arith.extf %121 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:55:28.0750874Z         %123 = arith.cmpf ogt, %89, %122 : tensor<8xf32>
2026-02-21T09:55:28.0751131Z         %124 = arith.cmpf une, %89, %89 : tensor<8xf32>
2026-02-21T09:55:28.0751405Z         %125 = arith.ori %123, %124 : tensor<8xi1>
2026-02-21T09:55:28.0751700Z         %126 = arith.select %125, %89, %122 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:55:28.0752002Z         %127 = arith.subf %89, %126 : tensor<8xf32>
2026-02-21T09:55:28.0752431Z         %128 = tt.extern_elementwise %127 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:55:28.0752836Z         %129 = arith.mulf %100, %128 : tensor<8xf32>
2026-02-21T09:55:28.0753157Z         %130 = tt.expand_dims %126 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:55:28.0753488Z         %131 = arith.extf %117 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:55:28.0753821Z         %132 = tt.broadcast %130 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:55:28.0754100Z         %133 = arith.subf %131, %132 : tensor<8x1024xf32>
2026-02-21T09:55:28.0754540Z         %134 = tt.extern_elementwise %133 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:55:28.0755031Z         %135 = arith.select %116, %134, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:55:28.0755331Z         %136 = "tt.reduce"(%135) <{axis = 1 : i32}> ({
2026-02-21T09:55:28.0755593Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:55:28.0755812Z           %175 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:55:28.0756068Z           tt.reduce.return %175 : f32
2026-02-21T09:55:28.0756322Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:55:28.0756568Z         %137 = arith.addf %129, %136 : tensor<8xf32>
2026-02-21T09:55:28.0756831Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:55:28.0757059Z         %138 = arith.muli %c1024_i32, %c2_i32 : i32
2026-02-21T09:55:28.0757320Z         %139 = arith.addi %arg3, %138 : i32
2026-02-21T09:55:28.0757593Z         %140 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:55:28.0757919Z         %141 = tt.splat %139 : i32 -> tensor<1024xi32>
2026-02-21T09:55:28.0758167Z         %142 = arith.addi %141, %140 : tensor<1024xi32>
2026-02-21T09:55:28.0758461Z         %143 = arith.cmpi slt, %142, %cst_3 : tensor<1024xi32>
2026-02-21T09:55:28.0758790Z         %144 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:55:28.0759093Z         %145 = arith.muli %144, %cst_2 : tensor<8x1xi32>
2026-02-21T09:55:28.0759457Z         %146 = tt.expand_dims %142 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:55:28.0759799Z         %147 = tt.broadcast %145 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:55:28.0760172Z         %148 = tt.broadcast %146 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:55:28.0760495Z         %149 = arith.addi %147, %148 : tensor<8x1024xi32>
2026-02-21T09:55:28.0760783Z         %150 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:55:28.0761136Z         %151 = tt.addptr %150, %149 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:55:28.0761486Z         %152 = tt.expand_dims %143 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:55:28.0761878Z         %153 = tt.broadcast %152 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:55:28.0762177Z         %154 = tt.load %151, %153, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:55:28.0762520Z         %155 = arith.select %153, %154, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:55:28.0762878Z         %156 = arith.extf %155 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:55:28.0763212Z         %157 = "tt.reduce"(%156) <{axis = 1 : i32}> ({
2026-02-21T09:55:28.0763478Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:55:28.0763706Z           %175 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T09:55:28.0763971Z           tt.reduce.return %175 : f32
2026-02-21T09:55:28.0764197Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:55:28.0764497Z         %158 = arith.truncf %157 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:55:28.0764813Z         %159 = arith.extf %158 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:55:28.0765088Z         %160 = arith.cmpf ogt, %126, %159 : tensor<8xf32>
2026-02-21T09:55:28.0765377Z         %161 = arith.cmpf une, %126, %126 : tensor<8xf32>
2026-02-21T09:55:28.0765625Z         %162 = arith.ori %160, %161 : tensor<8xi1>
2026-02-21T09:55:28.0765927Z         %163 = arith.select %162, %126, %159 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:55:28.0766210Z         %164 = arith.subf %126, %163 : tensor<8xf32>
2026-02-21T09:55:28.0766645Z         %165 = tt.extern_elementwise %164 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:55:28.0767075Z         %166 = arith.mulf %137, %165 : tensor<8xf32>
2026-02-21T09:55:28.0767365Z         %167 = tt.expand_dims %163 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:55:28.0767725Z         %168 = arith.extf %154 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:55:28.0768031Z         %169 = tt.broadcast %167 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:55:28.0768341Z         %170 = arith.subf %168, %169 : tensor<8x1024xf32>
2026-02-21T09:55:28.0768786Z         %171 = tt.extern_elementwise %170 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:55:28.0769262Z         %172 = arith.select %153, %171, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:55:28.0769604Z         %173 = "tt.reduce"(%172) <{axis = 1 : i32}> ({
2026-02-21T09:55:28.0769848Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T09:55:28.0770114Z           %175 = arith.addf %arg6, %arg7 : f32
2026-02-21T09:55:28.0770351Z           tt.reduce.return %175 : f32
2026-02-21T09:55:28.0770616Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:55:28.0770901Z         %174 = arith.addf %166, %173 : tensor<8xf32>
2026-02-21T09:55:28.0771175Z         scf.yield %163, %174 : tensor<8xf32>, tensor<8xf32>
2026-02-21T09:55:28.0771461Z       } {tt.flatten}
2026-02-21T09:55:28.0771737Z       %7 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:55:28.0772085Z       %8 = tt.splat %c9216_i32 : i32 -> tensor<1024xi32>
2026-02-21T09:55:28.0773998Z       %9 = arith.addi %8, %7 : tensor<1024xi32>
2026-02-21T09:55:28.0774305Z       %10 = arith.cmpi slt, %9, %cst_3 : tensor<1024xi32>
2026-02-21T09:55:28.0774688Z       %11 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:55:28.0775025Z       %12 = arith.muli %11, %cst_2 : tensor<8x1xi32>
2026-02-21T09:55:28.0775405Z       %13 = tt.expand_dims %9 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:55:28.0775785Z       %14 = tt.broadcast %12 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:55:28.0776103Z       %15 = tt.broadcast %13 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:55:28.0776419Z       %16 = arith.addi %14, %15 : tensor<8x1024xi32>
2026-02-21T09:55:28.0776739Z       %17 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:55:28.0777084Z       %18 = tt.addptr %17, %16 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:55:28.0777463Z       %19 = tt.expand_dims %10 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:55:28.0777846Z       %20 = tt.broadcast %19 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:55:28.0778136Z       %21 = tt.load %18, %20, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:55:28.0778469Z       %22 = arith.select %20, %21, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T09:55:28.0778850Z       %23 = arith.extf %22 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:55:28.0779115Z       %24 = "tt.reduce"(%23) <{axis = 1 : i32}> ({
2026-02-21T09:55:28.0779374Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:55:28.0779598Z         %66 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T09:55:28.0779855Z         tt.reduce.return %66 : f32
2026-02-21T09:55:28.0780080Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:55:28.0780371Z       %25 = arith.truncf %24 : tensor<8xf32> to tensor<8xf16>
2026-02-21T09:55:28.0780677Z       %26 = arith.extf %25 : tensor<8xf16> to tensor<8xf32>
2026-02-21T09:55:28.0780939Z       %27 = arith.cmpf ogt, %6#0, %26 : tensor<8xf32>
2026-02-21T09:55:28.0781221Z       %28 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32>
2026-02-21T09:55:28.0781459Z       %29 = arith.ori %27, %28 : tensor<8xi1>
2026-02-21T09:55:28.0781789Z       %30 = arith.select %29, %6#0, %26 : tensor<8xi1>, tensor<8xf32>
2026-02-21T09:55:28.0782053Z       %31 = arith.subf %6#0, %30 : tensor<8xf32>
2026-02-21T09:55:28.0782477Z       %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T09:55:28.0782897Z       %33 = arith.mulf %6#1, %32 : tensor<8xf32>
2026-02-21T09:55:28.0783186Z       %34 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:55:28.0783534Z       %35 = arith.extf %21 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:55:28.0783834Z       %36 = tt.broadcast %34 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:55:28.0784135Z       %37 = arith.subf %35, %36 : tensor<8x1024xf32>
2026-02-21T09:55:28.0784534Z       %38 = tt.extern_elementwise %37 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:55:28.0785005Z       %39 = arith.select %20, %38, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T09:55:28.0785325Z       %40 = "tt.reduce"(%39) <{axis = 1 : i32}> ({
2026-02-21T09:55:28.0785559Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T09:55:28.0785812Z         %66 = arith.addf %arg3, %arg4 : f32
2026-02-21T09:55:28.0786041Z         tt.reduce.return %66 : f32
2026-02-21T09:55:28.0786264Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T09:55:28.0786498Z       %41 = arith.addf %33, %40 : tensor<8xf32>
2026-02-21T09:55:28.0786761Z       %c9216_i32_6 = arith.constant 9216 : i32
2026-02-21T09:55:28.0787025Z       %c3072_i32_7 = arith.constant 3072 : i32
2026-02-21T09:55:28.0787300Z       scf.for %arg3 = %c0_i32 to %c9216_i32_6 step %c3072_i32_7  : i32 {
2026-02-21T09:55:28.0787655Z         %66 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:55:28.0788031Z         %67 = tt.splat %arg3 : i32 -> tensor<1024xi32>
2026-02-21T09:55:28.0788337Z         %68 = arith.addi %67, %66 : tensor<1024xi32>
2026-02-21T09:55:28.0788597Z         %69 = arith.cmpi slt, %68, %cst_3 : tensor<1024xi32>
2026-02-21T09:55:28.0788976Z         %70 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:55:28.0789423Z         %71 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:55:28.0789748Z         %72 = arith.extf %70 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:55:28.0790070Z         %73 = tt.broadcast %71 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:55:28.0790343Z         %74 = arith.subf %72, %73 : tensor<8x1024xf32>
2026-02-21T09:55:28.0790776Z         %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:55:28.0791253Z         %76 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:55:28.0791617Z         %77 = tt.broadcast %76 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:55:28.0791918Z         %78 = arith.divf %75, %77 : tensor<8x1024xf32>
2026-02-21T09:55:28.0792221Z         %79 = arith.truncf %78 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:55:28.0792573Z         %80 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:55:28.0792868Z         %81 = arith.muli %80, %cst_2 : tensor<8x1xi32>
2026-02-21T09:55:28.0793195Z         %82 = tt.expand_dims %68 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:55:28.0793545Z         %83 = tt.broadcast %81 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:55:28.0793839Z         %84 = tt.broadcast %82 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:55:28.0794138Z         %85 = arith.addi %83, %84 : tensor<8x1024xi32>
2026-02-21T09:55:28.0794423Z         %86 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:55:28.0794771Z         %87 = tt.addptr %86, %85 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:55:28.0795138Z         %88 = tt.expand_dims %69 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:55:28.0795468Z         %89 = tt.broadcast %88 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:55:28.0795785Z         tt.store %87, %79, %89 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:55:28.0796040Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T09:55:28.0796299Z         %90 = arith.muli %c1024_i32, %c1_i32 : i32
2026-02-21T09:55:28.0796532Z         %91 = arith.addi %arg3, %90 : i32
2026-02-21T09:55:28.0796834Z         %92 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:55:28.0797159Z         %93 = tt.splat %91 : i32 -> tensor<1024xi32>
2026-02-21T09:55:28.0797401Z         %94 = arith.addi %93, %92 : tensor<1024xi32>
2026-02-21T09:55:28.0797685Z         %95 = arith.cmpi slt, %94, %cst_3 : tensor<1024xi32>
2026-02-21T09:55:28.0798021Z         %96 = tt.descriptor_load %0[%2, %91] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:55:28.0798405Z         %97 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:55:28.0798728Z         %98 = arith.extf %96 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:55:28.0799036Z         %99 = tt.broadcast %97 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:55:28.0799308Z         %100 = arith.subf %98, %99 : tensor<8x1024xf32>
2026-02-21T09:55:28.0799730Z         %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:55:28.0800222Z         %102 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:55:28.0800548Z         %103 = tt.broadcast %102 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:55:28.0800907Z         %104 = arith.divf %101, %103 : tensor<8x1024xf32>
2026-02-21T09:55:28.0801197Z         %105 = arith.truncf %104 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:55:28.0801627Z         %106 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:55:28.0801964Z         %107 = arith.muli %106, %cst_2 : tensor<8x1xi32>
2026-02-21T09:55:28.0802294Z         %108 = tt.expand_dims %94 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:55:28.0802659Z         %109 = tt.broadcast %107 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:55:28.0802968Z         %110 = tt.broadcast %108 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:55:28.0803287Z         %111 = arith.addi %109, %110 : tensor<8x1024xi32>
2026-02-21T09:55:28.0803599Z         %112 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:55:28.0803927Z         %113 = tt.addptr %112, %111 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:55:28.0804307Z         %114 = tt.expand_dims %95 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:55:28.0804646Z         %115 = tt.broadcast %114 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:55:28.0804968Z         tt.store %113, %105, %115 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:55:28.0805256Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T09:55:28.0805518Z         %116 = arith.muli %c1024_i32, %c2_i32 : i32
2026-02-21T09:55:28.0805779Z         %117 = arith.addi %arg3, %116 : i32
2026-02-21T09:55:28.0806055Z         %118 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:55:28.0806383Z         %119 = tt.splat %117 : i32 -> tensor<1024xi32>
2026-02-21T09:55:28.0806631Z         %120 = arith.addi %119, %118 : tensor<1024xi32>
2026-02-21T09:55:28.0806923Z         %121 = arith.cmpi slt, %120, %cst_3 : tensor<1024xi32>
2026-02-21T09:55:28.0807272Z         %122 = tt.descriptor_load %0[%2, %117] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:55:28.0807682Z         %123 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:55:28.0808018Z         %124 = arith.extf %122 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:55:28.0808327Z         %125 = tt.broadcast %123 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:55:28.0808628Z         %126 = arith.subf %124, %125 : tensor<8x1024xf32>
2026-02-21T09:55:28.0809036Z         %127 = tt.extern_elementwise %126 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:55:28.0809523Z         %128 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:55:28.0809877Z         %129 = tt.broadcast %128 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:55:28.0810156Z         %130 = arith.divf %127, %129 : tensor<8x1024xf32>
2026-02-21T09:55:28.0810474Z         %131 = arith.truncf %130 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:55:28.0810802Z         %132 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:55:28.0811130Z         %133 = arith.muli %132, %cst_2 : tensor<8x1xi32>
2026-02-21T09:55:28.0811444Z         %134 = tt.expand_dims %120 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:55:28.0811839Z         %135 = tt.broadcast %133 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:55:28.0812173Z         %136 = tt.broadcast %134 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:55:28.0812461Z         %137 = arith.addi %135, %136 : tensor<8x1024xi32>
2026-02-21T09:55:28.0812766Z         %138 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:55:28.0813107Z         %139 = tt.addptr %138, %137 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:55:28.0813489Z         %140 = tt.expand_dims %121 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:55:28.0813887Z         %141 = tt.broadcast %140 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:55:28.0814237Z         tt.store %139, %131, %141 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:55:28.0814597Z       } {tt.flatten}
2026-02-21T09:55:28.0814858Z       %42 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T09:55:28.0815206Z       %43 = tt.splat %c9216_i32_6 : i32 -> tensor<1024xi32>
2026-02-21T09:55:28.0815511Z       %44 = arith.addi %43, %42 : tensor<1024xi32>
2026-02-21T09:55:28.0815816Z       %45 = arith.cmpi slt, %44, %cst_3 : tensor<1024xi32>
2026-02-21T09:55:28.0816221Z       %46 = tt.descriptor_load %0[%2, %c9216_i32_6] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T09:55:28.0816632Z       %47 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:55:28.0816998Z       %48 = arith.extf %46 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T09:55:28.0817313Z       %49 = tt.broadcast %47 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:55:28.0817623Z       %50 = arith.subf %48, %49 : tensor<8x1024xf32>
2026-02-21T09:55:28.0818054Z       %51 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T09:55:28.0818554Z       %52 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T09:55:28.0818946Z       %53 = tt.broadcast %52 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T09:55:28.0819232Z       %54 = arith.divf %51, %53 : tensor<8x1024xf32>
2026-02-21T09:55:28.0819536Z       %55 = arith.truncf %54 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T09:55:28.0819871Z       %56 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T09:55:28.0820206Z       %57 = arith.muli %56, %cst_2 : tensor<8x1xi32>
2026-02-21T09:55:28.0820545Z       %58 = tt.expand_dims %44 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T09:55:28.0820894Z       %59 = tt.broadcast %57 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T09:55:28.0821226Z       %60 = tt.broadcast %58 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T09:55:28.0821500Z       %61 = arith.addi %59, %60 : tensor<8x1024xi32>
2026-02-21T09:55:28.0821824Z       %62 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:55:28.0822139Z       %63 = tt.addptr %62, %61 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T09:55:28.0822498Z       %64 = tt.expand_dims %45 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T09:55:28.0822850Z       %65 = tt.broadcast %64 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T09:55:28.0823137Z       tt.store %63, %55, %65 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T09:55:28.0823577Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T09:55:28.0823952Z     tt.return
2026-02-21T09:55:28.0824146Z   }
2026-02-21T09:55:28.0824313Z }
2026-02-21T09:55:28.0824428Z 
2026-02-21T09:55:28.0824499Z {-#
2026-02-21T09:55:28.0824662Z   external_resources: {
2026-02-21T09:55:28.0824865Z     mlir_reproducer: {
2026-02-21T09:55:28.0829277Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T09:55:28.0833886Z       disable_threading: false,
2026-02-21T09:55:28.0834091Z       verify_each: true
2026-02-21T09:55:28.0834301Z     }
2026-02-21T09:55:28.0834464Z   }
2026-02-21T09:55:28.0834655Z #-}
2026-02-21T09:55:28.0835119Z /tmp/torchinductor_root/oz/cozizfrq6lg2d66agxaqvnd37afoobfh5mjafjqpkqpqj4x5xmon.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:55:28.0836401Z /tmp/torchinductor_root/oz/cozizfrq6lg2d66agxaqvnd37afoobfh5mjafjqpkqpqj4x5xmon.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:55:28.0837446Z [41s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:55:28.0838562Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=4, num_warps=32, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[4, 0], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T09:55:28.0839539Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:55:28.0839858Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:55:35.0485274Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 10.6 configs/s
2026-02-21T09:55:35.0495935Z [48s] Adaptive compile timeout: 30s (90% percentile=14.3s, bounds=[30.0s, 30s])
2026-02-21T09:55:36.1672028Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 881.4 configs/s
2026-02-21T09:55:36.2382145Z [49s] Initial random population of 100, 5 starting points: 
2026-02-21T09:55:36.2386392Z error=12
2026-02-21T09:55:36.2388047Z timeout=1
2026-02-21T09:55:36.2388260Z ok=87
2026-02-21T09:55:36.2392952Z min=0.0676
2026-02-21T09:55:36.2397361Z mid=0.4945
2026-02-21T09:55:36.2398840Z max=234.3701
2026-02-21T09:55:36.2399047Z best={'block_sizes': [1, 1024],
2026-02-21T09:55:36.2399298Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:55:36.2399827Z  'load_eviction_policies': ['first', 'last'],
2026-02-21T09:55:36.2400019Z  'num_stages': 6,
2026-02-21T09:55:36.2400175Z  'num_warps': 4,
2026-02-21T09:55:36.2400321Z  'pid_type': 'flat',
2026-02-21T09:55:36.2400512Z  'range_flattens': [None, None],
2026-02-21T09:55:36.2400693Z  'range_multi_buffers': [None, True],
2026-02-21T09:55:36.2400888Z  'range_num_stages': [0, 0],
2026-02-21T09:55:36.2401059Z  'range_unroll_factors': [0, 4],
2026-02-21T09:55:36.2401252Z  'range_warp_specializes': [None, False]}
2026-02-21T09:55:36.2402844Z [49s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:55:37.2300693Z [50s] Generation 1 starting: 74 neighbors, 5 active search path(s)
2026-02-21T09:55:59.4151963Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 1.0 configs/s
2026-02-21T09:56:03.9999093Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 17.0 configs/s
2026-02-21T09:56:09.3928077Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 214.6 configs/s
2026-02-21T09:56:09.6223955Z [82s] Generation 1 complete: 
2026-02-21T09:56:09.6228834Z ok=80
2026-02-21T09:56:09.6232158Z min=0.0553
2026-02-21T09:56:09.6233970Z mid=0.0819
2026-02-21T09:56:09.6234157Z max=1.2482
2026-02-21T09:56:09.6234348Z best={'block_sizes': [1, 1024],
2026-02-21T09:56:09.6234635Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T09:56:09.6234964Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T09:56:09.6235188Z  'num_stages': 3,
2026-02-21T09:56:09.6235348Z  'num_warps': 1,
2026-02-21T09:56:09.6235517Z  'pid_type': 'flat',
2026-02-21T09:56:09.6235687Z  'range_flattens': [None, False],
2026-02-21T09:56:09.6235919Z  'range_multi_buffers': [None, False],
2026-02-21T09:56:09.6236129Z  'range_num_stages': [0, 2],
2026-02-21T09:56:09.6236346Z  'range_unroll_factors': [0, 3],
2026-02-21T09:56:09.6236556Z  'range_warp_specializes': [None, None]}
2026-02-21T09:56:09.6236987Z [82s] Fitting surrogate: 180 points, 180 targets
2026-02-21T09:56:10.6497922Z [83s] Generation 2 starting: 72 neighbors, 5 active search path(s)
2026-02-21T09:56:35.6919637Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 0.7 configs/s
2026-02-21T09:56:40.1101745Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.9 configs/s
2026-02-21T09:56:45.8972772Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 174.6 configs/s
2026-02-21T09:56:46.1792290Z [119s] Generation 2 complete: 
2026-02-21T09:56:46.1796474Z ok=78
2026-02-21T09:56:46.1800913Z min=0.0531
2026-02-21T09:56:46.1802490Z mid=0.0635
2026-02-21T09:56:46.1802654Z max=6.4338
2026-02-21T09:56:46.1802811Z best={'block_sizes': [1, 1024],
2026-02-21T09:56:46.1803083Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T09:56:46.1803320Z  'load_eviction_policies': ['last', ''],
2026-02-21T09:56:46.1803526Z  'num_stages': 5,
2026-02-21T09:56:46.1803670Z  'num_warps': 1,
2026-02-21T09:56:46.1803822Z  'pid_type': 'flat',
2026-02-21T09:56:46.1803995Z  'range_flattens': [None, False],
2026-02-21T09:56:46.1804201Z  'range_multi_buffers': [None, False],
2026-02-21T09:56:46.1804388Z  'range_num_stages': [0, 1],
2026-02-21T09:56:46.1804568Z  'range_unroll_factors': [0, 0],
2026-02-21T09:56:46.1804752Z  'range_warp_specializes': [None, False]}
2026-02-21T09:56:46.1807563Z [119s] Fitting surrogate: 258 points, 258 targets
2026-02-21T09:56:47.0162884Z [120s] Generation 3 starting: 58 neighbors, 5 active search path(s)
2026-02-21T09:57:09.3156623Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60/60 0.4 configs/s
2026-02-21T09:57:12.8436928Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 60/60 17.2 configs/s
2026-02-21T09:57:17.7613985Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 205.3 configs/s
2026-02-21T09:57:18.0278302Z [151s] Generation 3 complete: 
2026-02-21T09:57:18.0282533Z ok=63
2026-02-21T09:57:18.0286983Z min=0.0492
2026-02-21T09:57:18.0291306Z mid=0.0594
2026-02-21T09:57:18.0295874Z max=0.9093
2026-02-21T09:57:18.0297357Z best={'block_sizes': [1, 16384],
2026-02-21T09:57:18.0297603Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:57:18.0297841Z  'load_eviction_policies': ['', 'last'],
2026-02-21T09:57:18.0298028Z  'num_stages': 1,
2026-02-21T09:57:18.0298187Z  'num_warps': 4,
2026-02-21T09:57:18.0298344Z  'pid_type': 'flat',
2026-02-21T09:57:18.0298505Z  'range_flattens': [None, None],
2026-02-21T09:57:18.0298695Z  'range_multi_buffers': [None, False],
2026-02-21T09:57:18.0298880Z  'range_num_stages': [0, 4],
2026-02-21T09:57:18.0299053Z  'range_unroll_factors': [0, 1],
2026-02-21T09:57:18.0299517Z  'range_warp_specializes': [None, False]}
2026-02-21T09:57:18.7590756Z [151s] Fitting surrogate: 321 points, 321 targets
2026-02-21T09:57:18.7591133Z [152s] Generation 4 starting: 44 neighbors, 5 active search path(s)
2026-02-21T09:57:29.7989270Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46/46 2.1 configs/s
2026-02-21T09:57:32.5339804Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 46/46 17.1 configs/s
2026-02-21T09:57:35.1140730Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 532.8 configs/s
2026-02-21T09:57:35.2286136Z [168s] Generation 4 complete: 
2026-02-21T09:57:35.2288059Z ok=49
2026-02-21T09:57:35.2288224Z min=0.0389
2026-02-21T09:57:35.2288363Z mid=0.0635
2026-02-21T09:57:35.2288484Z max=0.9082
2026-02-21T09:57:35.2288635Z best={'block_sizes': [1, 16384],
2026-02-21T09:57:35.2288855Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:57:35.2289098Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:57:35.2289302Z  'num_stages': 1,
2026-02-21T09:57:35.2289441Z  'num_warps': 8,
2026-02-21T09:57:35.2289584Z  'pid_type': 'flat',
2026-02-21T09:57:35.2289738Z  'range_flattens': [None, None],
2026-02-21T09:57:35.2290209Z  'range_multi_buffers': [None, False],
2026-02-21T09:57:35.2290417Z  'range_num_stages': [0, 4],
2026-02-21T09:57:35.2290590Z  'range_unroll_factors': [0, 1],
2026-02-21T09:57:35.2290777Z  'range_warp_specializes': [None, False]}
2026-02-21T09:57:35.2301923Z [168s] Fitting surrogate: 370 points, 370 targets
2026-02-21T09:57:36.1650331Z [169s] Generation 5 starting: 61 neighbors, 5 active search path(s)
2026-02-21T09:57:47.3418738Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 8.0 configs/s
2026-02-21T09:57:51.0794447Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 63/63 17.1 configs/s
2026-02-21T09:57:54.0379201Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 341.0 configs/s
2026-02-21T09:57:54.2118418Z [187s] Generation 5 complete: 
2026-02-21T09:57:54.2122834Z ok=66
2026-02-21T09:57:54.2126826Z min=0.0389
2026-02-21T09:57:54.2131135Z mid=0.0594
2026-02-21T09:57:54.2132566Z max=0.2827
2026-02-21T09:57:54.2132770Z best={'block_sizes': [1, 16384],
2026-02-21T09:57:54.2132986Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:57:54.2133212Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:57:54.2133402Z  'num_stages': 1,
2026-02-21T09:57:54.2133544Z  'num_warps': 8,
2026-02-21T09:57:54.2133694Z  'pid_type': 'flat',
2026-02-21T09:57:54.2133851Z  'range_flattens': [None, None],
2026-02-21T09:57:54.2134040Z  'range_multi_buffers': [None, False],
2026-02-21T09:57:54.2134226Z  'range_num_stages': [0, 4],
2026-02-21T09:57:54.2134394Z  'range_unroll_factors': [0, 0],
2026-02-21T09:57:54.2134571Z  'range_warp_specializes': [None, False]}
2026-02-21T09:57:54.2138355Z [187s] Fitting surrogate: 436 points, 436 targets
2026-02-21T09:57:54.9274996Z [188s] Generation 6 starting: 46 neighbors, 4 active search path(s)
2026-02-21T09:58:03.6176098Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48/48 9.6 configs/s
2026-02-21T09:58:06.4739001Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 48/48 17.1 configs/s
2026-02-21T09:58:09.0613311Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 389.5 configs/s
2026-02-21T09:58:09.2105203Z [202s] Generation 6 complete: 
2026-02-21T09:58:09.2110035Z ok=50
2026-02-21T09:58:09.2111430Z min=0.0389
2026-02-21T09:58:09.2111829Z mid=0.0553
2026-02-21T09:58:09.2111962Z max=0.9032
2026-02-21T09:58:09.2112101Z best={'block_sizes': [1, 16384],
2026-02-21T09:58:09.2112321Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:58:09.2112537Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:58:09.2113063Z  'num_stages': 1,
2026-02-21T09:58:09.2113217Z  'num_warps': 8,
2026-02-21T09:58:09.2113377Z  'pid_type': 'flat',
2026-02-21T09:58:09.2113541Z  'range_flattens': [None, None],
2026-02-21T09:58:09.2113720Z  'range_multi_buffers': [None, False],
2026-02-21T09:58:09.2113922Z  'range_num_stages': [0, 4],
2026-02-21T09:58:09.2114177Z  'range_unroll_factors': [0, 0],
2026-02-21T09:58:09.2114360Z  'range_warp_specializes': [None, False]}
2026-02-21T09:58:09.2123889Z [202s] Fitting surrogate: 486 points, 486 targets
2026-02-21T09:58:09.7220672Z [203s] Generation 7 starting: 30 neighbors, 2 active search path(s)
2026-02-21T09:58:17.1096158Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 1.8 configs/s
2026-02-21T09:58:18.9376851Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 31/31 17.4 configs/s
2026-02-21T09:58:20.3414640Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 711.8 configs/s
2026-02-21T09:58:20.4353390Z [213s] Generation 7 complete: 
2026-02-21T09:58:20.4358334Z ok=33
2026-02-21T09:58:20.4362661Z min=0.0389
2026-02-21T09:58:20.4367044Z mid=0.0595
2026-02-21T09:58:20.4371629Z max=0.1659
2026-02-21T09:58:20.4376759Z best={'block_sizes': [1, 16384],
2026-02-21T09:58:20.4380562Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:58:20.4384208Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:58:20.4386214Z  'num_stages': 1,
2026-02-21T09:58:20.4386391Z  'num_warps': 8,
2026-02-21T09:58:20.4386553Z  'pid_type': 'flat',
2026-02-21T09:58:20.4386800Z  'range_flattens': [None, None],
2026-02-21T09:58:20.4386999Z  'range_multi_buffers': [None, False],
2026-02-21T09:58:20.4387206Z  'range_num_stages': [0, 4],
2026-02-21T09:58:20.4392106Z  'range_unroll_factors': [0, 0],
2026-02-21T09:58:20.4397141Z  'range_warp_specializes': [None, False]}
2026-02-21T09:58:20.4400889Z [213s] Fitting surrogate: 519 points, 519 targets
2026-02-21T09:58:20.8105058Z [214s] Generation 8 starting: 10 neighbors, 1 active search path(s)
2026-02-21T09:58:24.0013182Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 1.5 configs/s
2026-02-21T09:58:24.6464743Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 11/11 18.4 configs/s
2026-02-21T09:58:24.9993724Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2710.2 configs/s
2026-02-21T09:58:25.0393725Z [218s] Generation 8 complete: 
2026-02-21T09:58:25.0398071Z ok=12
2026-02-21T09:58:25.0402494Z min=0.0389
2026-02-21T09:58:25.0406786Z mid=0.0655
2026-02-21T09:58:25.0408804Z max=0.0900
2026-02-21T09:58:25.0413987Z best={'block_sizes': [1, 16384],
2026-02-21T09:58:25.0417450Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:58:25.0417744Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:58:25.0421969Z  'num_stages': 1,
2026-02-21T09:58:25.0422442Z  'num_warps': 8,
2026-02-21T09:58:25.0424801Z  'pid_type': 'flat',
2026-02-21T09:58:25.0425072Z  'range_flattens': [None, None],
2026-02-21T09:58:25.0425287Z  'range_multi_buffers': [None, False],
2026-02-21T09:58:25.0430074Z  'range_num_stages': [0, 4],
2026-02-21T09:58:25.0433130Z  'range_unroll_factors': [0, 0],
2026-02-21T09:58:25.0436803Z  'range_warp_specializes': [None, False]}
2026-02-21T09:58:25.0440430Z [218s] Fitting surrogate: 531 points, 531 targets
2026-02-21T09:58:25.4128335Z [218s] Generation 9 starting: 12 neighbors, 1 active search path(s)
2026-02-21T09:58:27.8865932Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 7.6 configs/s
2026-02-21T09:58:28.6504517Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 13/13 18.1 configs/s
2026-02-21T09:58:29.0030140Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2712.8 configs/s
2026-02-21T09:58:29.0427186Z [222s] Generation 9 complete: 
2026-02-21T09:58:29.0432100Z ok=14
2026-02-21T09:58:29.0436017Z min=0.0389
2026-02-21T09:58:29.0440389Z mid=0.0614
2026-02-21T09:58:29.0444316Z max=0.0840
2026-02-21T09:58:29.0448109Z best={'block_sizes': [1, 16384],
2026-02-21T09:58:29.0449832Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:58:29.0450127Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:58:29.0450332Z  'num_stages': 1,
2026-02-21T09:58:29.0450480Z  'num_warps': 8,
2026-02-21T09:58:29.0450635Z  'pid_type': 'flat',
2026-02-21T09:58:29.0450795Z  'range_flattens': [None, None],
2026-02-21T09:58:29.0450999Z  'range_multi_buffers': [None, False],
2026-02-21T09:58:29.0451189Z  'range_num_stages': [0, 4],
2026-02-21T09:58:29.0451351Z  'range_unroll_factors': [0, 0],
2026-02-21T09:58:29.0451624Z  'range_warp_specializes': [None, False]}
2026-02-21T09:58:29.0451947Z [222s] Fitting surrogate: 545 points, 545 targets
2026-02-21T09:58:29.3931497Z [222s] Generation 10 starting: 11 neighbors, 1 active search path(s)
2026-02-21T09:58:31.7487318Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 6.8 configs/s
2026-02-21T09:58:32.4468237Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 18.5 configs/s
2026-02-21T09:58:32.9962973Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1778.2 configs/s
2026-02-21T09:58:33.0446535Z [226s] Generation 10 complete: 
2026-02-21T09:58:33.0450854Z ok=13
2026-02-21T09:58:33.0454868Z min=0.0389
2026-02-21T09:58:33.0459561Z mid=0.0594
2026-02-21T09:58:33.0463888Z max=0.0840
2026-02-21T09:58:33.0468514Z best={'block_sizes': [1, 16384],
2026-02-21T09:58:33.0472909Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:58:33.0473210Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:58:33.0477910Z  'num_stages': 1,
2026-02-21T09:58:33.0482336Z  'num_warps': 8,
2026-02-21T09:58:33.0486739Z  'pid_type': 'flat',
2026-02-21T09:58:33.0490520Z  'range_flattens': [None, None],
2026-02-21T09:58:33.0495540Z  'range_multi_buffers': [None, False],
2026-02-21T09:58:33.0499586Z  'range_num_stages': [0, 4],
2026-02-21T09:58:33.0503026Z  'range_unroll_factors': [0, 0],
2026-02-21T09:58:33.0505761Z  'range_warp_specializes': [None, False]}
2026-02-21T09:58:33.0506116Z [226s] Fitting surrogate: 558 points, 558 targets
2026-02-21T09:58:33.4495578Z [226s] Generation 11 starting: 11 neighbors, 1 active search path(s)
2026-02-21T09:58:36.4149384Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 5.0 configs/s
2026-02-21T09:58:37.1242535Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 18.1 configs/s
2026-02-21T09:58:37.3797248Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3676.0 configs/s
2026-02-21T09:58:37.4143642Z [230s] Generation 11 complete: 
2026-02-21T09:58:37.4148270Z ok=13
2026-02-21T09:58:37.4149361Z min=0.0389
2026-02-21T09:58:37.4149545Z mid=0.0635
2026-02-21T09:58:37.4149687Z max=0.0922
2026-02-21T09:58:37.4149843Z best={'block_sizes': [1, 16384],
2026-02-21T09:58:37.4150061Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:58:37.4150556Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:58:37.4150743Z  'num_stages': 1,
2026-02-21T09:58:37.4150893Z  'num_warps': 8,
2026-02-21T09:58:37.4151032Z  'pid_type': 'flat',
2026-02-21T09:58:37.4151193Z  'range_flattens': [None, None],
2026-02-21T09:58:37.4151369Z  'range_multi_buffers': [None, False],
2026-02-21T09:58:37.4151633Z  'range_num_stages': [0, 4],
2026-02-21T09:58:37.4151800Z  'range_unroll_factors': [0, 0],
2026-02-21T09:58:37.4151986Z  'range_warp_specializes': [None, False]}
2026-02-21T09:58:37.4164971Z [230s] Fitting surrogate: 571 points, 571 targets
2026-02-21T09:58:37.8623752Z [231s] Generation 12 starting: 16 neighbors, 1 active search path(s)
2026-02-21T09:58:40.9609062Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 8.3 configs/s
2026-02-21T09:58:41.9596612Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.8 configs/s
2026-02-21T09:58:41.9601796Z [235s] Generation 12 complete: 
2026-02-21T09:58:41.9603833Z ok=18
2026-02-21T09:58:41.9604032Z min=0.0389
2026-02-21T09:58:41.9604170Z mid=0.0719
2026-02-21T09:58:41.9604290Z max=0.2233
2026-02-21T09:58:41.9604430Z best={'block_sizes': [1, 16384],
2026-02-21T09:58:41.9604642Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:58:41.9604861Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:58:41.9605041Z  'num_stages': 1,
2026-02-21T09:58:41.9605186Z  'num_warps': 8,
2026-02-21T09:58:41.9605324Z  'pid_type': 'flat',
2026-02-21T09:58:41.9605485Z  'range_flattens': [None, None],
2026-02-21T09:58:41.9605668Z  'range_multi_buffers': [None, False],
2026-02-21T09:58:41.9605849Z  'range_num_stages': [0, 4],
2026-02-21T09:58:41.9606026Z  'range_unroll_factors': [0, 0],
2026-02-21T09:58:41.9606202Z  'range_warp_specializes': [None, False]}
2026-02-21T09:58:41.9620557Z [235s] Fitting surrogate: 589 points, 589 targets
2026-02-21T09:58:42.3411379Z [235s] Generation 13 starting: 11 neighbors, 1 active search path(s)
2026-02-21T09:58:44.8638198Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 16.0 configs/s
2026-02-21T09:58:45.5725032Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 18.1 configs/s
2026-02-21T09:58:45.8317251Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3619.9 configs/s
2026-02-21T09:58:45.8667170Z [239s] Generation 13 complete: 
2026-02-21T09:58:45.8668697Z ok=13
2026-02-21T09:58:45.8668876Z min=0.0389
2026-02-21T09:58:45.8669013Z mid=0.0636
2026-02-21T09:58:45.8669152Z max=0.0942
2026-02-21T09:58:45.8669297Z best={'block_sizes': [1, 16384],
2026-02-21T09:58:45.8669536Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:58:45.8669753Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:58:45.8670250Z  'num_stages': 1,
2026-02-21T09:58:45.8670397Z  'num_warps': 8,
2026-02-21T09:58:45.8670636Z  'pid_type': 'flat',
2026-02-21T09:58:45.8670861Z  'range_flattens': [None, None],
2026-02-21T09:58:45.8671057Z  'range_multi_buffers': [None, False],
2026-02-21T09:58:45.8671257Z  'range_num_stages': [0, 4],
2026-02-21T09:58:45.8671422Z  'range_unroll_factors': [0, 0],
2026-02-21T09:58:45.8671915Z  'range_warp_specializes': [None, False]}
2026-02-21T09:58:45.8686095Z [239s] Fitting surrogate: 602 points, 602 targets
2026-02-21T09:58:46.2596600Z [239s] Generation 14 starting: 11 neighbors, 1 active search path(s)
2026-02-21T09:58:48.6542029Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 11.7 configs/s
2026-02-21T09:58:49.3620850Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 18.2 configs/s
2026-02-21T09:58:49.6216460Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3630.7 configs/s
2026-02-21T09:58:49.6566516Z [242s] Generation 14 complete: 
2026-02-21T09:58:49.6571795Z ok=13
2026-02-21T09:58:49.6575258Z min=0.0389
2026-02-21T09:58:49.6576705Z mid=0.0655
2026-02-21T09:58:49.6576871Z max=0.1352
2026-02-21T09:58:49.6577010Z best={'block_sizes': [1, 16384],
2026-02-21T09:58:49.6577232Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:58:49.6577450Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T09:58:49.6577641Z  'num_stages': 1,
2026-02-21T09:58:49.6577787Z  'num_warps': 8,
2026-02-21T09:58:49.6577923Z  'pid_type': 'flat',
2026-02-21T09:58:49.6578082Z  'range_flattens': [None, None],
2026-02-21T09:58:49.6578259Z  'range_multi_buffers': [None, False],
2026-02-21T09:58:49.6578449Z  'range_num_stages': [0, 4],
2026-02-21T09:58:49.6578614Z  'range_unroll_factors': [0, 0],
2026-02-21T09:58:49.6578807Z  'range_warp_specializes': [None, False]}
2026-02-21T09:58:49.6590311Z [242s] Fitting surrogate: 615 points, 615 targets
2026-02-21T09:58:49.9372819Z [243s] Autotuning complete in 243.3s after searching 587 configs.
2026-02-21T09:58:49.9373196Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:58:49.9378076Z     @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=1, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, False]), static_shapes=True)
2026-02-21T09:58:49.9378958Z 
2026-02-21T09:58:49.9379229Z [243s] Code of selected kernel: /tmp/torchinductor_root/vu/cvuzgumv3hnvyvaeve4ak7ugh5wvpszrpourxpqbu2qumrcgmzqb.py
2026-02-21T09:58:49.9595239Z from __future__ import annotations
2026-02-21T09:58:49.9595444Z 
2026-02-21T09:58:49.9600203Z import torch
2026-02-21T09:58:49.9603548Z import triton
2026-02-21T09:58:49.9607328Z import triton.language as tl
2026-02-21T09:58:49.9611261Z from torch._inductor.runtime import triton_helpers
2026-02-21T09:58:49.9615907Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T09:58:49.9620418Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:58:49.9621501Z 
2026-02-21T09:58:49.9621642Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T09:58:49.9621848Z _BLOCK_SIZE_1 = tl.constexpr(16384)
2026-02-21T09:58:49.9621970Z 
2026-02-21T09:58:49.9622039Z @triton.jit
2026-02-21T09:58:49.9627026Z def _helion_softmax_two_pass(x, out):
2026-02-21T09:58:49.9627364Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:58:49.9630068Z     pid_0 = tl.program_id(0)
2026-02-21T09:58:49.9630243Z     offset_0 = pid_0
2026-02-21T09:58:49.9630446Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T09:58:49.9634935Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:58:49.9638293Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T09:58:49.9642843Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:58:49.9646988Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T09:58:49.9650537Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:58:49.9654790Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T09:58:49.9656053Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T09:58:49.9656298Z     # src[softmax.py:82-89]: ...
2026-02-21T09:58:49.9656647Z     for offset_2 in tl.range(0, 10112, _BLOCK_SIZE_1, warp_specialize=False, num_stages=4, disallow_acc_multi_buffer=True):
2026-02-21T09:58:49.9657048Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:58:49.9657281Z         mask_1 = indices_2 < 10112
2026-02-21T09:58:49.9657455Z         mi_copy = mi
2026-02-21T09:58:49.9657782Z         di_copy = di
2026-02-21T09:58:49.9657944Z         mi_copy_0 = mi_copy
2026-02-21T09:58:49.9658103Z         di_copy_0 = di_copy
2026-02-21T09:58:49.9658293Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T09:58:49.9658675Z         values = tl.load(x + (indices_0[:, None] * 10112 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_last')
2026-02-21T09:58:49.9659129Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T09:58:49.9659570Z         _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16))
2026-02-21T09:58:49.9659966Z         local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16)
2026-02-21T09:58:49.9660226Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T09:58:49.9660467Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T09:58:49.9660675Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T09:58:49.9660939Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:58:49.9661177Z         v_2 = mi_copy_0 - v_1
2026-02-21T09:58:49.9661358Z         v_3 = libdevice.exp(v_2)
2026-02-21T09:58:49.9661526Z         v_4 = di_copy_0 * v_3
2026-02-21T09:58:49.9661843Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T09:58:49.9662071Z         subscript = v_1[:, None]
2026-02-21T09:58:49.9662254Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T09:58:49.9662446Z         v_6 = v_5 - subscript
2026-02-21T09:58:49.9662666Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T09:58:49.9662952Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T09:58:49.9663183Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T09:58:49.9663388Z         v_7 = libdevice.exp(v_6)
2026-02-21T09:58:49.9663731Z         _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32))
2026-02-21T09:58:49.9664106Z         sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32)
2026-02-21T09:58:49.9664323Z         di = v_4 + sum_1
2026-02-21T09:58:49.9664492Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T09:58:49.9664679Z         mi = v_1
2026-02-21T09:58:49.9664886Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T09:58:49.9665175Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T09:58:49.9665482Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:58:49.9665940Z     for offset_2 in tl.range(0, 10112, _BLOCK_SIZE_1, warp_specialize=False, num_stages=4, disallow_acc_multi_buffer=True):
2026-02-21T09:58:49.9666356Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T09:58:49.9666599Z         mask_2 = indices_2 < 10112
2026-02-21T09:58:49.9666779Z         mi_copy_1 = mi
2026-02-21T09:58:49.9666930Z         di_copy_1 = di
2026-02-21T09:58:49.9667094Z         mi_copy_1_0 = mi_copy_1
2026-02-21T09:58:49.9667306Z         di_copy_1_0 = di_copy_1
2026-02-21T09:58:49.9667503Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T09:58:49.9667894Z         values_1 = tl.load(x + (indices_0[:, None] * 10112 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_last')
2026-02-21T09:58:49.9668338Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T09:58:49.9668636Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T09:58:49.9668830Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T09:58:49.9669026Z         v_10 = v_9 - subscript_1
2026-02-21T09:58:49.9669199Z         v_11 = libdevice.exp(v_10)
2026-02-21T09:58:49.9669386Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T09:58:49.9669587Z         v_12 = v_11 / subscript_2
2026-02-21T09:58:49.9669755Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T09:58:49.9670065Z         tl.store(out + (indices_0[:, None] * 10112 + indices_2[None, :] * 1), v_13, mask_2[None, :])
2026-02-21T09:58:49.9670284Z 
2026-02-21T09:58:49.9670413Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T09:58:49.9670651Z     """
2026-02-21T09:58:49.9670852Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T09:58:49.9671197Z     This version uses fewer passes but is less numerically stable.
2026-02-21T09:58:49.9671420Z     Args:
2026-02-21T09:58:49.9671624Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T09:58:49.9671824Z     Returns:
2026-02-21T09:58:49.9671998Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T09:58:49.9672208Z     """
2026-02-21T09:58:49.9672340Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T09:58:49.9672522Z     m, n = x.size()
2026-02-21T09:58:49.9672685Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T09:58:49.9672892Z     out = torch.empty_like(x)
2026-02-21T09:58:49.9673125Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T09:58:49.9673435Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T09:58:49.9673786Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T09:58:49.9674024Z     # src[softmax.py:79-92]: ...
2026-02-21T09:58:49.9674286Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=8, num_stages=1)
2026-02-21T09:58:49.9674553Z     # src[softmax.py:93]: return out
2026-02-21T09:58:49.9674724Z     return out
2026-02-21T09:58:50.7350628Z WARNING:tritonbench.utils.triton_op:Completed input ID 77:
2026-02-21T09:58:50.7354928Z (M, N)
2026-02-21T09:58:50.7359469Z -------------
2026-02-21T09:58:50.7363362Z (4096, 10112)
2026-02-21T09:58:50.7364731Z 
2026-02-21T09:58:50.7365331Z  80%|████████  | 16/20 [49:56<14:08, 212.17s/it]WARNING:tritonbench.utils.triton_op:Running input ID 82:
2026-02-21T09:58:50.7365713Z (M, N)
2026-02-21T09:58:50.7367426Z -------------
2026-02-21T09:58:50.7367613Z (4096, 10752)
2026-02-21T09:58:50.7367902Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T09:58:51.9227403Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T09:58:53.2734245Z INFO:tritonbench.utils.triton_op:Took 2.35ms to get benchmark function for torch_compile_softmax
2026-02-21T09:58:54.5246971Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:58:54.5248538Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:58:54.5248818Z               'dtype': 'torch.float16',
2026-02-21T09:58:54.5249019Z               'shape': (4096, 10752),
2026-02-21T09:58:54.5253710Z               'stride': (10752, 1)},),
2026-02-21T09:58:54.5258070Z   'kwargs': {}}
2026-02-21T09:58:54.5269163Z INFO:tritonbench.utils.triton_op:Took 2.50ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T09:58:55.3766789Z [0s] Autotune random seed: 2138408546
2026-02-21T09:58:55.4023982Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:59:32.0400207Z [36s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', ''], maxnreg=32, num_sm_multiplier=2, num_stages=8, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[3, 2], range_unroll_factors=[4, 1], range_warp_specializes=[False, False])
2026-02-21T09:59:34.6357160Z [39s] Timeout after 30s compiling Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None])
2026-02-21T09:59:36.8459290Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s
2026-02-21T09:59:45.7658778Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 11.1 configs/s
2026-02-21T09:59:45.7670604Z [50s] Adaptive compile timeout: 30s (90% percentile=15.9s, bounds=[30.0s, 30s])
2026-02-21T09:59:47.0472604Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 772.5 configs/s
2026-02-21T09:59:47.1272177Z [51s] Initial random population of 100, 5 starting points: 
2026-02-21T09:59:47.1276413Z error=11
2026-02-21T09:59:47.1277883Z timeout=2
2026-02-21T09:59:47.1278101Z ok=87
2026-02-21T09:59:47.1278247Z min=0.0717
2026-02-21T09:59:47.1282959Z mid=0.5366
2026-02-21T09:59:47.1287606Z max=249.8283
2026-02-21T09:59:47.1289595Z best={'block_sizes': [1, 1024],
2026-02-21T09:59:47.1289870Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T09:59:47.1290124Z  'load_eviction_policies': ['first', 'last'],
2026-02-21T09:59:47.1290314Z  'num_stages': 6,
2026-02-21T09:59:47.1290479Z  'num_warps': 4,
2026-02-21T09:59:47.1290646Z  'pid_type': 'flat',
2026-02-21T09:59:47.1290813Z  'range_flattens': [None, None],
2026-02-21T09:59:47.1290996Z  'range_multi_buffers': [None, True],
2026-02-21T09:59:47.1291177Z  'range_num_stages': [0, 0],
2026-02-21T09:59:47.1291660Z  'range_unroll_factors': [0, 4],
2026-02-21T09:59:47.1291871Z  'range_warp_specializes': [None, False]}
2026-02-21T09:59:47.1292164Z [51s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:59:48.1878557Z [52s] Generation 1 starting: 74 neighbors, 5 active search path(s)
2026-02-21T10:00:03.8775537Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 4.8 configs/s
2026-02-21T10:00:08.3964167Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 17.0 configs/s
2026-02-21T10:00:13.8702571Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 184.1 configs/s
2026-02-21T10:00:14.1189164Z [78s] Generation 1 complete: 
2026-02-21T10:00:14.1193390Z ok=79
2026-02-21T10:00:14.1197809Z min=0.0594
2026-02-21T10:00:14.1202293Z mid=0.0840
2026-02-21T10:00:14.1206155Z max=0.9549
2026-02-21T10:00:14.1206415Z best={'block_sizes': [1, 1024],
2026-02-21T10:00:14.1206679Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T10:00:14.1210636Z  'load_eviction_policies': ['first', ''],
2026-02-21T10:00:14.1213776Z  'num_stages': 1,
2026-02-21T10:00:14.1218807Z  'num_warps': 1,
2026-02-21T10:00:14.1223714Z  'pid_type': 'flat',
2026-02-21T10:00:14.1228142Z  'range_flattens': [None, True],
2026-02-21T10:00:14.1231383Z  'range_multi_buffers': [None, None],
2026-02-21T10:00:14.1235067Z  'range_num_stages': [0, 4],
2026-02-21T10:00:14.1239512Z  'range_unroll_factors': [0, 1],
2026-02-21T10:00:14.1242824Z  'range_warp_specializes': [None, False]}
2026-02-21T10:00:14.1247187Z [78s] Fitting surrogate: 179 points, 179 targets
2026-02-21T10:00:15.1225159Z [79s] Generation 2 starting: 71 neighbors, 5 active search path(s)
2026-02-21T10:00:28.6057829Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 4.6 configs/s
2026-02-21T10:00:32.8727794Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 72/72 17.0 configs/s
2026-02-21T10:00:40.4511159Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 149.1 configs/s
2026-02-21T10:00:40.7874767Z [105s] Generation 2 complete: 
2026-02-21T10:00:40.7878835Z ok=77
2026-02-21T10:00:40.7880461Z min=0.0573
2026-02-21T10:00:40.7880624Z mid=0.0656
2026-02-21T10:00:40.7880747Z max=0.5509
2026-02-21T10:00:40.7880895Z best={'block_sizes': [1, 16384],
2026-02-21T10:00:40.7881144Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T10:00:40.7881411Z  'load_eviction_policies': ['first', 'last'],
2026-02-21T10:00:40.7881711Z  'num_sm_multiplier': 128,
2026-02-21T10:00:40.7881877Z  'num_stages': 5,
2026-02-21T10:00:40.7882329Z  'num_warps': 2,
2026-02-21T10:00:40.7882498Z  'pid_type': 'persistent_blocked',
2026-02-21T10:00:40.7882698Z  'range_flattens': [False, False],
2026-02-21T10:00:40.7882879Z  'range_multi_buffers': [True, False],
2026-02-21T10:00:40.7883070Z  'range_num_stages': [2, 1],
2026-02-21T10:00:40.7883326Z  'range_unroll_factors': [0, 0],
2026-02-21T10:00:40.7883526Z  'range_warp_specializes': [True, None]}
2026-02-21T10:00:40.7887916Z [105s] Fitting surrogate: 256 points, 256 targets
2026-02-21T10:00:41.7508140Z [106s] Generation 3 starting: 64 neighbors, 5 active search path(s)
2026-02-21T10:01:21.5611486Z [146s] Timeout after 30s compiling Config(block_sizes=[8, 4096], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, False])
2026-02-21T10:01:21.5621468Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 0.2 configs/s
2026-02-21T10:01:25.3387611Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 17.4 configs/s
2026-02-21T10:01:29.4390860Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 245.7 configs/s
2026-02-21T10:01:29.6504446Z [154s] Generation 3 complete: 
2026-02-21T10:01:29.6506296Z timeout=1
2026-02-21T10:01:29.6506452Z ok=68
2026-02-21T10:01:29.6506587Z min=0.0451
2026-02-21T10:01:29.6506716Z mid=0.0635
2026-02-21T10:01:29.6506847Z max=0.5408
2026-02-21T10:01:29.6506987Z best={'block_sizes': [1, 16384],
2026-02-21T10:01:29.6507250Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T10:01:29.6507532Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:01:29.6507743Z  'num_sm_multiplier': 128,
2026-02-21T10:01:29.6507913Z  'num_stages': 5,
2026-02-21T10:01:29.6508055Z  'num_warps': 2,
2026-02-21T10:01:29.6508235Z  'pid_type': 'persistent_blocked',
2026-02-21T10:01:29.6508733Z  'range_flattens': [False, False],
2026-02-21T10:01:29.6508922Z  'range_multi_buffers': [True, False],
2026-02-21T10:01:29.6509109Z  'range_num_stages': [2, 1],
2026-02-21T10:01:29.6509295Z  'range_unroll_factors': [0, 0],
2026-02-21T10:01:29.6509486Z  'range_warp_specializes': [True, None]}
2026-02-21T10:01:29.6522019Z [154s] Fitting surrogate: 325 points, 325 targets
2026-02-21T10:01:30.5574020Z [155s] Generation 4 starting: 63 neighbors, 5 active search path(s)
2026-02-21T10:01:42.0677260Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 11.9 configs/s
2026-02-21T10:01:45.8841788Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 17.2 configs/s
2026-02-21T10:01:49.9513481Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 248.2 configs/s
2026-02-21T10:01:50.1768195Z [174s] Generation 4 complete: 
2026-02-21T10:01:50.1771931Z ok=68
2026-02-21T10:01:50.1776826Z min=0.0390
2026-02-21T10:01:50.1781289Z mid=0.0594
2026-02-21T10:01:50.1783205Z max=0.1412
2026-02-21T10:01:50.1783396Z best={'block_sizes': [1, 16384],
2026-02-21T10:01:50.1783671Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T10:01:50.1784228Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:01:50.1784428Z  'num_sm_multiplier': 64,
2026-02-21T10:01:50.1784597Z  'num_stages': 5,
2026-02-21T10:01:50.1784734Z  'num_warps': 4,
2026-02-21T10:01:50.1784899Z  'pid_type': 'persistent_blocked',
2026-02-21T10:01:50.1785094Z  'range_flattens': [False, False],
2026-02-21T10:01:50.1785275Z  'range_multi_buffers': [True, False],
2026-02-21T10:01:50.1785467Z  'range_num_stages': [2, 1],
2026-02-21T10:01:50.1785633Z  'range_unroll_factors': [0, 0],
2026-02-21T10:01:50.1785817Z  'range_warp_specializes': [True, None]}
2026-02-21T10:01:50.1786115Z [174s] Fitting surrogate: 393 points, 393 targets
2026-02-21T10:01:51.0761006Z [175s] Generation 5 starting: 62 neighbors, 5 active search path(s)
2026-02-21T10:02:05.7450657Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 1.7 configs/s
2026-02-21T10:02:09.4074756Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 62/62 17.1 configs/s
2026-02-21T10:02:13.5181329Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 245.7 configs/s
2026-02-21T10:02:13.7563240Z [198s] Generation 5 complete: 
2026-02-21T10:02:13.7567465Z ok=67
2026-02-21T10:02:13.7571394Z min=0.0409
2026-02-21T10:02:13.7579546Z mid=0.0533
2026-02-21T10:02:13.7583257Z max=0.4260
2026-02-21T10:02:13.7585598Z best={'block_sizes': [1, 16384],
2026-02-21T10:02:13.7587815Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T10:02:13.7588112Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:02:13.7588332Z  'num_sm_multiplier': 64,
2026-02-21T10:02:13.7588505Z  'num_stages': 5,
2026-02-21T10:02:13.7588657Z  'num_warps': 4,
2026-02-21T10:02:13.7588819Z  'pid_type': 'persistent_blocked',
2026-02-21T10:02:13.7589001Z  'range_flattens': [False, False],
2026-02-21T10:02:13.7589190Z  'range_multi_buffers': [True, False],
2026-02-21T10:02:13.7589380Z  'range_num_stages': [2, 1],
2026-02-21T10:02:13.7589554Z  'range_unroll_factors': [0, 0],
2026-02-21T10:02:13.7589739Z  'range_warp_specializes': [True, None]}
2026-02-21T10:02:13.7589958Z [198s] Fitting surrogate: 460 points, 460 targets
2026-02-21T10:02:14.4430804Z [199s] Generation 6 starting: 42 neighbors, 4 active search path(s)
2026-02-21T10:02:23.7630457Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 2.5 configs/s
2026-02-21T10:02:26.2556609Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 42/42 17.2 configs/s
2026-02-21T10:02:30.3796779Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 296.3 configs/s
2026-02-21T10:02:30.5842757Z [215s] Generation 6 complete: 
2026-02-21T10:02:30.5846495Z ok=46
2026-02-21T10:02:30.5848017Z min=0.0389
2026-02-21T10:02:30.5848189Z mid=0.0470
2026-02-21T10:02:30.5848339Z max=0.7865
2026-02-21T10:02:30.5848504Z best={'block_sizes': [1, 16384],
2026-02-21T10:02:30.5848745Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:02:30.5848986Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:02:30.5849183Z  'num_stages': 6,
2026-02-21T10:02:30.5849325Z  'num_warps': 1,
2026-02-21T10:02:30.5849473Z  'pid_type': 'flat',
2026-02-21T10:02:30.5849632Z  'range_flattens': [None, False],
2026-02-21T10:02:30.5849821Z  'range_multi_buffers': [None, False],
2026-02-21T10:02:30.5850007Z  'range_num_stages': [0, 0],
2026-02-21T10:02:30.5850179Z  'range_unroll_factors': [0, 0],
2026-02-21T10:02:30.5850367Z  'range_warp_specializes': [None, True]}
2026-02-21T10:02:30.5859327Z [215s] Fitting surrogate: 506 points, 506 targets
2026-02-21T10:02:31.2069450Z [215s] Generation 7 starting: 32 neighbors, 3 active search path(s)
2026-02-21T10:02:43.3170181Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 1.3 configs/s
2026-02-21T10:02:45.2911120Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 33/33 17.1 configs/s
2026-02-21T10:02:46.9730319Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 597.8 configs/s
2026-02-21T10:02:47.0911313Z [231s] Generation 7 complete: 
2026-02-21T10:02:47.0916260Z ok=35
2026-02-21T10:02:47.0920672Z min=0.0409
2026-02-21T10:02:47.0923847Z mid=0.0512
2026-02-21T10:02:47.0928250Z max=0.7873
2026-02-21T10:02:47.0932778Z best={'block_sizes': [1, 16384],
2026-02-21T10:02:47.0934283Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:02:47.0934581Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:02:47.0934774Z  'num_stages': 6,
2026-02-21T10:02:47.0934940Z  'num_warps': 1,
2026-02-21T10:02:47.0935081Z  'pid_type': 'flat',
2026-02-21T10:02:47.0935249Z  'range_flattens': [None, True],
2026-02-21T10:02:47.0935427Z  'range_multi_buffers': [None, False],
2026-02-21T10:02:47.0935869Z  'range_num_stages': [0, 0],
2026-02-21T10:02:47.0936051Z  'range_unroll_factors': [0, 0],
2026-02-21T10:02:47.0936243Z  'range_warp_specializes': [None, True]}
2026-02-21T10:02:47.0936472Z [231s] Fitting surrogate: 541 points, 541 targets
2026-02-21T10:02:47.5715596Z [232s] Generation 8 starting: 19 neighbors, 2 active search path(s)
2026-02-21T10:02:54.4129430Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 2.3 configs/s
2026-02-21T10:02:55.6060739Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 20/20 17.4 configs/s
2026-02-21T10:02:56.9346476Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 754.6 configs/s
2026-02-21T10:02:57.0247580Z [241s] Generation 8 complete: 
2026-02-21T10:02:57.0252387Z ok=22
2026-02-21T10:02:57.0256692Z min=0.0409
2026-02-21T10:02:57.0261206Z mid=0.0410
2026-02-21T10:02:57.0266265Z max=0.6174
2026-02-21T10:02:57.0271180Z best={'block_sizes': [1, 16384],
2026-02-21T10:02:57.0272577Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:02:57.0272848Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:02:57.0273051Z  'num_stages': 6,
2026-02-21T10:02:57.0273197Z  'num_warps': 1,
2026-02-21T10:02:57.0273347Z  'pid_type': 'flat',
2026-02-21T10:02:57.0273514Z  'range_flattens': [None, True],
2026-02-21T10:02:57.0273689Z  'range_multi_buffers': [None, False],
2026-02-21T10:02:57.0273880Z  'range_num_stages': [0, 1],
2026-02-21T10:02:57.0274042Z  'range_unroll_factors': [0, 0],
2026-02-21T10:02:57.0274224Z  'range_warp_specializes': [None, True]}
2026-02-21T10:02:57.0274444Z [241s] Fitting surrogate: 563 points, 563 targets
2026-02-21T10:02:57.3069561Z [241s] Autotuning complete in 241.9s after searching 532 configs.
2026-02-21T10:02:57.3071262Z One can hardcode the best config and skip autotuning with:
2026-02-21T10:02:57.3072318Z     @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=6, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T10:02:57.3073265Z 
2026-02-21T10:02:57.3073521Z [241s] Code of selected kernel: /tmp/torchinductor_root/pr/cprifmp37aytfpvnokgwzudeoq5y7jzemovnyvoi4hezk62ibwks.py
2026-02-21T10:02:57.3297479Z from __future__ import annotations
2026-02-21T10:02:57.3299312Z 
2026-02-21T10:02:57.3299471Z import torch
2026-02-21T10:02:57.3299633Z import triton
2026-02-21T10:02:57.3299790Z import triton.language as tl
2026-02-21T10:02:57.3300248Z from torch._inductor.runtime import triton_helpers
2026-02-21T10:02:57.3300539Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T10:02:57.3300820Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T10:02:57.3300998Z 
2026-02-21T10:02:57.3301075Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T10:02:57.3301258Z _BLOCK_SIZE_1 = tl.constexpr(16384)
2026-02-21T10:02:57.3301375Z 
2026-02-21T10:02:57.3301432Z @triton.jit
2026-02-21T10:02:57.3301643Z def _helion_softmax_two_pass(x, out):
2026-02-21T10:02:57.3301895Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T10:02:57.3302149Z     pid_0 = tl.program_id(0)
2026-02-21T10:02:57.3302310Z     offset_0 = pid_0
2026-02-21T10:02:57.3302491Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T10:02:57.3302780Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T10:02:57.3303077Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T10:02:57.3303353Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T10:02:57.3303606Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T10:02:57.3303935Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T10:02:57.3304213Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T10:02:57.3304476Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T10:02:57.3304717Z     # src[softmax.py:82-89]: ...
2026-02-21T10:02:57.3305079Z     for offset_2 in tl.range(0, 10752, _BLOCK_SIZE_1, warp_specialize=True, num_stages=1, disallow_acc_multi_buffer=True, flatten=True):
2026-02-21T10:02:57.3305502Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T10:02:57.3305733Z         mask_1 = indices_2 < 10752
2026-02-21T10:02:57.3305907Z         mi_copy = mi
2026-02-21T10:02:57.3306046Z         di_copy = di
2026-02-21T10:02:57.3306198Z         mi_copy_0 = mi_copy
2026-02-21T10:02:57.3306359Z         di_copy_0 = di_copy
2026-02-21T10:02:57.3306538Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T10:02:57.3306916Z         values = tl.load(x + (indices_0[:, None] * 10752 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first')
2026-02-21T10:02:57.3307301Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T10:02:57.3307709Z         _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16))
2026-02-21T10:02:57.3308101Z         local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16)
2026-02-21T10:02:57.3308363Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T10:02:57.3308624Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T10:02:57.3308828Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T10:02:57.3309089Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T10:02:57.3309325Z         v_2 = mi_copy_0 - v_1
2026-02-21T10:02:57.3309549Z         v_3 = libdevice.exp(v_2)
2026-02-21T10:02:57.3309720Z         v_4 = di_copy_0 * v_3
2026-02-21T10:02:57.3309907Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T10:02:57.3310150Z         subscript = v_1[:, None]
2026-02-21T10:02:57.3310317Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T10:02:57.3310496Z         v_6 = v_5 - subscript
2026-02-21T10:02:57.3310701Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T10:02:57.3310967Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T10:02:57.3311182Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T10:02:57.3311364Z         v_7 = libdevice.exp(v_6)
2026-02-21T10:02:57.3311721Z         _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32))
2026-02-21T10:02:57.3312123Z         sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32)
2026-02-21T10:02:57.3312345Z         di = v_4 + sum_1
2026-02-21T10:02:57.3312516Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T10:02:57.3312708Z         mi = v_1
2026-02-21T10:02:57.3312922Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T10:02:57.3313214Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T10:02:57.3313534Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T10:02:57.3314018Z     for offset_2 in tl.range(0, 10752, _BLOCK_SIZE_1, warp_specialize=True, num_stages=1, disallow_acc_multi_buffer=True, flatten=True):
2026-02-21T10:02:57.3314459Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T10:02:57.3314708Z         mask_2 = indices_2 < 10752
2026-02-21T10:02:57.3314895Z         mi_copy_1 = mi
2026-02-21T10:02:57.3315055Z         di_copy_1 = di
2026-02-21T10:02:57.3315228Z         mi_copy_1_0 = mi_copy_1
2026-02-21T10:02:57.3315415Z         di_copy_1_0 = di_copy_1
2026-02-21T10:02:57.3315617Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T10:02:57.3316059Z         values_1 = tl.load(x + (indices_0[:, None] * 10752 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first')
2026-02-21T10:02:57.3316516Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T10:02:57.3316817Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T10:02:57.3317014Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T10:02:57.3317210Z         v_10 = v_9 - subscript_1
2026-02-21T10:02:57.3317392Z         v_11 = libdevice.exp(v_10)
2026-02-21T10:02:57.3317572Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T10:02:57.3317764Z         v_12 = v_11 / subscript_2
2026-02-21T10:02:57.3317941Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T10:02:57.3318232Z         tl.store(out + (indices_0[:, None] * 10752 + indices_2[None, :] * 1), v_13, mask_2[None, :])
2026-02-21T10:02:57.3318458Z 
2026-02-21T10:02:57.3318590Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T10:02:57.3318839Z     """
2026-02-21T10:02:57.3319071Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T10:02:57.3319387Z     This version uses fewer passes but is less numerically stable.
2026-02-21T10:02:57.3319620Z     Args:
2026-02-21T10:02:57.3319783Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T10:02:57.3319990Z     Returns:
2026-02-21T10:02:57.3320172Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T10:02:57.3320395Z     """
2026-02-21T10:02:57.3320535Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T10:02:57.3320719Z     m, n = x.size()
2026-02-21T10:02:57.3320896Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T10:02:57.3321105Z     out = torch.empty_like(x)
2026-02-21T10:02:57.3321332Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T10:02:57.3321675Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T10:02:57.3322012Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T10:02:57.3322245Z     # src[softmax.py:79-92]: ...
2026-02-21T10:02:57.3322530Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=6)
2026-02-21T10:02:57.3322805Z     # src[softmax.py:93]: return out
2026-02-21T10:02:57.3322971Z     return out
2026-02-21T10:02:58.2790354Z WARNING:tritonbench.utils.triton_op:Completed input ID 82:
2026-02-21T10:02:58.2794477Z (M, N)
2026-02-21T10:02:58.2798208Z -------------
2026-02-21T10:02:58.2799647Z (4096, 10752)
2026-02-21T10:02:58.2799773Z 
2026-02-21T10:02:58.2800364Z  85%|████████▌ | 17/20 [54:03<11:08, 222.81s/it]
2026-02-21T10:02:58.2800364Z WARNING:tritonbench.utils.triton_op:Running input ID 87:
2026-02-21T10:02:58.2800691Z (M, N)
2026-02-21T10:02:58.2805983Z -------------
2026-02-21T10:02:58.2807646Z (4096, 11392)
2026-02-21T10:02:58.2808054Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T10:02:59.5120131Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T10:03:00.8957588Z INFO:tritonbench.utils.triton_op:Took 2.24ms to get benchmark function for torch_compile_softmax
2026-02-21T10:03:02.2235223Z WARNING:__main__:Input tensor metadata:
2026-02-21T10:03:02.2239608Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T10:03:02.2243949Z               'dtype': 'torch.float16',
2026-02-21T10:03:02.2248783Z               'shape': (4096, 11392),
2026-02-21T10:03:02.2253163Z               'stride': (11392, 1)},),
2026-02-21T10:03:02.2256351Z   'kwargs': {}}
2026-02-21T10:03:02.2260423Z INFO:tritonbench.utils.triton_op:Took 2.49ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T10:03:02.4003460Z [0s] Autotune random seed: 2138408546
2026-02-21T10:03:02.4296242Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T10:03:39.5119503Z [37s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', ''], maxnreg=32, num_sm_multiplier=2, num_stages=8, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[3, 2], range_unroll_factors=[4, 1], range_warp_specializes=[False, False])
2026-02-21T10:03:43.9689502Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s
2026-02-21T10:03:46.0365023Z module {
2026-02-21T10:03:46.0367071Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T10:03:46.0367542Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T10:03:46.0367772Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T10:03:46.0367957Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T10:03:46.0368155Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T10:03:46.0368375Z     %cst = arith.constant dense<11392> : tensor<16x1xi32>
2026-02-21T10:03:46.0368639Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<16xf32>
2026-02-21T10:03:46.0368913Z     %cst_1 = arith.constant dense<0xFF800000> : tensor<16xf32>
2026-02-21T10:03:46.0369129Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T10:03:46.0369319Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T10:03:46.0369508Z     %c11392_i32 = arith.constant 11392 : i32
2026-02-21T10:03:46.0369703Z     %c11392_i64 = arith.constant 11392 : i64
2026-02-21T10:03:46.0369883Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T10:03:46.0370218Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c11392_i32], [%c11392_i64, %c1_i64] : <f16>, <tensor<16x128xf16>>
2026-02-21T10:03:46.0370551Z     %1 = tt.get_program_id x : i32
2026-02-21T10:03:46.0370733Z     %2 = arith.addi %1, %c1_i32 : i32
2026-02-21T10:03:46.0370920Z     %3 = arith.minsi %2, %c256_i32 : i32
2026-02-21T10:03:46.0371413Z     scf.for %arg2 = %1 to %3 step %c1_i32  : i32 {
2026-02-21T10:03:46.0372429Z       %4 = arith.muli %arg2, %c16_i32 : i32
2026-02-21T10:03:46.0372674Z       %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T10:03:46.0373066Z       %6 = tt.splat %4 : i32 -> tensor<16xi32>
2026-02-21T10:03:46.0373289Z       %7 = arith.addi %6, %5 : tensor<16xi32>
2026-02-21T10:03:46.0373496Z       %c11264_i32 = arith.constant 11264 : i32
2026-02-21T10:03:46.0373717Z       %c512_i32 = arith.constant 512 : i32
2026-02-21T10:03:46.0374167Z       %8:2 = scf.for %arg3 = %c0_i32 to %c11264_i32 step %c512_i32 iter_args(%arg4 = %cst_1, %arg5 = %cst_0) -> (tensor<16xf32>, tensor<16xf32>)  : i32 {
2026-02-21T10:03:46.0374710Z         %50 = tt.descriptor_load %0[%4, %arg3] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T10:03:46.0375183Z         %51 = arith.extf %50 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T10:03:46.0375479Z         %52 = "tt.reduce"(%51) <{axis = 1 : i32}> ({
2026-02-21T10:03:46.0375707Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T10:03:46.0379252Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T10:03:46.0384050Z           tt.reduce.return %128 : f32
2026-02-21T10:03:46.0387247Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T10:03:46.0391153Z         %53 = arith.truncf %52 : tensor<16xf32> to tensor<16xf16>
2026-02-21T10:03:46.0395920Z         %54 = arith.extf %53 : tensor<16xf16> to tensor<16xf32>
2026-02-21T10:03:46.0396277Z         %55 = arith.cmpf ogt, %arg4, %54 : tensor<16xf32>
2026-02-21T10:03:46.0396522Z         %56 = arith.cmpf une, %arg4, %arg4 : tensor<16xf32>
2026-02-21T10:03:46.0396741Z         %57 = arith.ori %55, %56 : tensor<16xi1>
2026-02-21T10:03:46.0396988Z         %58 = arith.select %57, %arg4, %54 : tensor<16xi1>, tensor<16xf32>
2026-02-21T10:03:46.0397247Z         %59 = arith.subf %arg4, %58 : tensor<16xf32>
2026-02-21T10:03:46.0397623Z         %60 = tt.extern_elementwise %59 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T10:03:46.0397992Z         %61 = arith.mulf %arg5, %60 : tensor<16xf32>
2026-02-21T10:03:46.0398251Z         %62 = tt.expand_dims %58 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T10:03:46.0398553Z         %63 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T10:03:46.0398791Z         %64 = arith.subf %51, %63 : tensor<16x128xf32>
2026-02-21T10:03:46.0399160Z         %65 = tt.extern_elementwise %64 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T10:03:46.0399517Z         %66 = "tt.reduce"(%65) <{axis = 1 : i32}> ({
2026-02-21T10:03:46.0399718Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T10:03:46.0399911Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T10:03:46.0400105Z           tt.reduce.return %128 : f32
2026-02-21T10:03:46.0400299Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T10:03:46.0400495Z         %67 = arith.addf %61, %66 : tensor<16xf32>
2026-02-21T10:03:46.0400695Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T10:03:46.0400886Z         %68 = arith.muli %c128_i32, %c1_i32_4 : i32
2026-02-21T10:03:46.0401093Z         %69 = arith.addi %arg3, %68 : i32
2026-02-21T10:03:46.0401365Z         %70 = tt.descriptor_load %0[%4, %69] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T10:03:46.0401773Z         %71 = arith.extf %70 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T10:03:46.0402015Z         %72 = "tt.reduce"(%71) <{axis = 1 : i32}> ({
2026-02-21T10:03:46.0402204Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T10:03:46.0402396Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T10:03:46.0402587Z           tt.reduce.return %128 : f32
2026-02-21T10:03:46.0402781Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T10:03:46.0403004Z         %73 = arith.truncf %72 : tensor<16xf32> to tensor<16xf16>
2026-02-21T10:03:46.0403420Z         %74 = arith.extf %73 : tensor<16xf16> to tensor<16xf32>
2026-02-21T10:03:46.0403659Z         %75 = arith.cmpf ogt, %58, %74 : tensor<16xf32>
2026-02-21T10:03:46.0403874Z         %76 = arith.cmpf une, %58, %58 : tensor<16xf32>
2026-02-21T10:03:46.0404129Z         %77 = arith.ori %75, %76 : tensor<16xi1>
2026-02-21T10:03:46.0404358Z         %78 = arith.select %77, %58, %74 : tensor<16xi1>, tensor<16xf32>
2026-02-21T10:03:46.0404599Z         %79 = arith.subf %58, %78 : tensor<16xf32>
2026-02-21T10:03:46.0404948Z         %80 = tt.extern_elementwise %79 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T10:03:46.0405317Z         %81 = arith.mulf %67, %80 : tensor<16xf32>
2026-02-21T10:03:46.0405580Z         %82 = tt.expand_dims %78 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T10:03:46.0405914Z         %83 = tt.broadcast %82 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T10:03:46.0406172Z         %84 = arith.subf %71, %83 : tensor<16x128xf32>
2026-02-21T10:03:46.0406584Z         %85 = tt.extern_elementwise %84 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T10:03:46.0406979Z         %86 = "tt.reduce"(%85) <{axis = 1 : i32}> ({
2026-02-21T10:03:46.0407184Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T10:03:46.0407374Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T10:03:46.0407575Z           tt.reduce.return %128 : f32
2026-02-21T10:03:46.0407769Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T10:03:46.0407981Z         %87 = arith.addf %81, %86 : tensor<16xf32>
2026-02-21T10:03:46.0408183Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T10:03:46.0408387Z         %88 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T10:03:46.0408587Z         %89 = arith.addi %arg3, %88 : i32
2026-02-21T10:03:46.0408886Z         %90 = tt.descriptor_load %0[%4, %89] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T10:03:46.0409232Z         %91 = arith.extf %90 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T10:03:46.0409473Z         %92 = "tt.reduce"(%91) <{axis = 1 : i32}> ({
2026-02-21T10:03:46.0409677Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T10:03:46.0409871Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T10:03:46.0410074Z           tt.reduce.return %128 : f32
2026-02-21T10:03:46.0410266Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T10:03:46.0410502Z         %93 = arith.truncf %92 : tensor<16xf32> to tensor<16xf16>
2026-02-21T10:03:46.0410763Z         %94 = arith.extf %93 : tensor<16xf16> to tensor<16xf32>
2026-02-21T10:03:46.0411001Z         %95 = arith.cmpf ogt, %78, %94 : tensor<16xf32>
2026-02-21T10:03:46.0411224Z         %96 = arith.cmpf une, %78, %78 : tensor<16xf32>
2026-02-21T10:03:46.0411432Z         %97 = arith.ori %95, %96 : tensor<16xi1>
2026-02-21T10:03:46.0411735Z         %98 = arith.select %97, %78, %94 : tensor<16xi1>, tensor<16xf32>
2026-02-21T10:03:46.0411975Z         %99 = arith.subf %78, %98 : tensor<16xf32>
2026-02-21T10:03:46.0412352Z         %100 = tt.extern_elementwise %99 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T10:03:46.0412734Z         %101 = arith.mulf %87, %100 : tensor<16xf32>
2026-02-21T10:03:46.0412995Z         %102 = tt.expand_dims %98 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T10:03:46.0413307Z         %103 = tt.broadcast %102 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T10:03:46.0413562Z         %104 = arith.subf %91, %103 : tensor<16x128xf32>
2026-02-21T10:03:46.0413932Z         %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T10:03:46.0414312Z         %106 = "tt.reduce"(%105) <{axis = 1 : i32}> ({
2026-02-21T10:03:46.0414505Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T10:03:46.0414689Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T10:03:46.0414903Z           tt.reduce.return %128 : f32
2026-02-21T10:03:46.0415093Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T10:03:46.0415295Z         %107 = arith.addf %101, %106 : tensor<16xf32>
2026-02-21T10:03:46.0415525Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T10:03:46.0415716Z         %108 = arith.muli %c128_i32, %c3_i32 : i32
2026-02-21T10:03:46.0415904Z         %109 = arith.addi %arg3, %108 : i32
2026-02-21T10:03:46.0416191Z         %110 = tt.descriptor_load %0[%4, %109] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T10:03:46.0416516Z         %111 = arith.extf %110 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T10:03:46.0416761Z         %112 = "tt.reduce"(%111) <{axis = 1 : i32}> ({
2026-02-21T10:03:46.0416948Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T10:03:46.0417168Z           %128 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T10:03:46.0417359Z           tt.reduce.return %128 : f32
2026-02-21T10:03:46.0417554Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T10:03:46.0417785Z         %113 = arith.truncf %112 : tensor<16xf32> to tensor<16xf16>
2026-02-21T10:03:46.0418060Z         %114 = arith.extf %113 : tensor<16xf16> to tensor<16xf32>
2026-02-21T10:03:46.0418305Z         %115 = arith.cmpf ogt, %98, %114 : tensor<16xf32>
2026-02-21T10:03:46.0418528Z         %116 = arith.cmpf une, %98, %98 : tensor<16xf32>
2026-02-21T10:03:46.0418742Z         %117 = arith.ori %115, %116 : tensor<16xi1>
2026-02-21T10:03:46.0418976Z         %118 = arith.select %117, %98, %114 : tensor<16xi1>, tensor<16xf32>
2026-02-21T10:03:46.0419225Z         %119 = arith.subf %98, %118 : tensor<16xf32>
2026-02-21T10:03:46.0419591Z         %120 = tt.extern_elementwise %119 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T10:03:46.0419959Z         %121 = arith.mulf %107, %120 : tensor<16xf32>
2026-02-21T10:03:46.0420224Z         %122 = tt.expand_dims %118 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T10:03:46.0420526Z         %123 = tt.broadcast %122 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T10:03:46.0420786Z         %124 = arith.subf %111, %123 : tensor<16x128xf32>
2026-02-21T10:03:46.0421167Z         %125 = tt.extern_elementwise %124 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T10:03:46.0421579Z         %126 = "tt.reduce"(%125) <{axis = 1 : i32}> ({
2026-02-21T10:03:46.0421778Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T10:03:46.0421953Z           %128 = arith.addf %arg6, %arg7 : f32
2026-02-21T10:03:46.0422144Z           tt.reduce.return %128 : f32
2026-02-21T10:03:46.0422323Z         }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T10:03:46.0422528Z         %127 = arith.addf %121, %126 : tensor<16xf32>
2026-02-21T10:03:46.0422755Z         scf.yield %118, %127 : tensor<16xf32>, tensor<16xf32>
2026-02-21T10:03:46.0422970Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T10:03:46.0423272Z       %9 = tt.descriptor_load %0[%4, %c11264_i32] : !tt.tensordesc<tensor<16x128xf16>> -> tensor<16x128xf16>
2026-02-21T10:03:46.0423602Z       %10 = arith.extf %9 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T10:03:46.0423835Z       %11 = "tt.reduce"(%10) <{axis = 1 : i32}> ({
2026-02-21T10:03:46.0424023Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T10:03:46.0424209Z         %50 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T10:03:46.0424404Z         tt.reduce.return %50 : f32
2026-02-21T10:03:46.0424581Z       }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T10:03:46.0424802Z       %12 = arith.truncf %11 : tensor<16xf32> to tensor<16xf16>
2026-02-21T10:03:46.0425036Z       %13 = arith.extf %12 : tensor<16xf16> to tensor<16xf32>
2026-02-21T10:03:46.0425263Z       %14 = arith.cmpf ogt, %8#0, %13 : tensor<16xf32>
2026-02-21T10:03:46.0425473Z       %15 = arith.cmpf une, %8#0, %8#0 : tensor<16xf32>
2026-02-21T10:03:46.0425680Z       %16 = arith.ori %14, %15 : tensor<16xi1>
2026-02-21T10:03:46.0425940Z       %17 = arith.select %16, %8#0, %13 : tensor<16xi1>, tensor<16xf32>
2026-02-21T10:03:46.0426179Z       %18 = arith.subf %8#0, %17 : tensor<16xf32>
2026-02-21T10:03:46.0426580Z       %19 = tt.extern_elementwise %18 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16xf32>) -> tensor<16xf32>
2026-02-21T10:03:46.0426925Z       %20 = arith.mulf %8#1, %19 : tensor<16xf32>
2026-02-21T10:03:46.0427174Z       %21 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T10:03:46.0427457Z       %22 = tt.broadcast %21 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T10:03:46.0427694Z       %23 = arith.subf %10, %22 : tensor<16x128xf32>
2026-02-21T10:03:46.0428056Z       %24 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T10:03:46.0428436Z       %25 = "tt.reduce"(%24) <{axis = 1 : i32}> ({
2026-02-21T10:03:46.0428634Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T10:03:46.0428810Z         %50 = arith.addf %arg3, %arg4 : f32
2026-02-21T10:03:46.0428998Z         tt.reduce.return %50 : f32
2026-02-21T10:03:46.0429207Z       }) : (tensor<16x128xf32>) -> tensor<16xf32>
2026-02-21T10:03:46.0429411Z       %26 = arith.addf %20, %25 : tensor<16xf32>
2026-02-21T10:03:46.0429615Z       %c11264_i32_2 = arith.constant 11264 : i32
2026-02-21T10:03:46.0429811Z       %c512_i32_3 = arith.constant 512 : i32
2026-02-21T10:03:46.0430070Z       scf.for %arg3 = %c0_i32 to %c11264_i32_2 step %c512_i32_3  : i32 {
2026-02-21T10:03:46.0430352Z         %50 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T10:03:46.0430614Z         %51 = tt.splat %arg3 : i32 -> tensor<128xi32>
2026-02-21T10:03:46.0430816Z         %52 = arith.addi %51, %50 : tensor<128xi32>
2026-02-21T10:03:46.0431082Z         %53 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T10:03:46.0431362Z         %54 = arith.muli %53, %cst : tensor<16x1xi32>
2026-02-21T10:03:46.0431661Z         %55 = tt.expand_dims %52 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T10:03:46.0431960Z         %56 = tt.broadcast %54 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T10:03:46.0432220Z         %57 = tt.broadcast %55 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T10:03:46.0432457Z         %58 = arith.addi %56, %57 : tensor<16x128xi32>
2026-02-21T10:03:46.0432692Z         %59 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0432981Z         %60 = tt.addptr %59, %58 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T10:03:46.0433286Z         %61 = tt.load %60 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0433597Z         %62 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T10:03:46.0433887Z         %63 = arith.extf %61 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T10:03:46.0434142Z         %64 = tt.broadcast %62 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T10:03:46.0434380Z         %65 = arith.subf %63, %64 : tensor<16x128xf32>
2026-02-21T10:03:46.0434751Z         %66 = tt.extern_elementwise %65 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T10:03:46.0435150Z         %67 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T10:03:46.0435434Z         %68 = tt.broadcast %67 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T10:03:46.0435662Z         %69 = arith.divf %66, %68 : tensor<16x128xf32>
2026-02-21T10:03:46.0435898Z         %70 = arith.truncf %69 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T10:03:46.0436164Z         %71 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0436443Z         %72 = tt.addptr %71, %58 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T10:03:46.0436698Z         tt.store %72, %70 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0436929Z         %c1_i32_4 = arith.constant 1 : i32
2026-02-21T10:03:46.0437124Z         %73 = arith.muli %c128_i32, %c1_i32_4 : i32
2026-02-21T10:03:46.0437314Z         %74 = arith.addi %arg3, %73 : i32
2026-02-21T10:03:46.0437576Z         %75 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T10:03:46.0437819Z         %76 = tt.splat %74 : i32 -> tensor<128xi32>
2026-02-21T10:03:46.0438019Z         %77 = arith.addi %76, %75 : tensor<128xi32>
2026-02-21T10:03:46.0438267Z         %78 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T10:03:46.0438526Z         %79 = arith.muli %78, %cst : tensor<16x1xi32>
2026-02-21T10:03:46.0438785Z         %80 = tt.expand_dims %77 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T10:03:46.0439077Z         %81 = tt.broadcast %79 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T10:03:46.0439370Z         %82 = tt.broadcast %80 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T10:03:46.0439604Z         %83 = arith.addi %81, %82 : tensor<16x128xi32>
2026-02-21T10:03:46.0439842Z         %84 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0440148Z         %85 = tt.addptr %84, %83 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T10:03:46.0440444Z         %86 = tt.load %85 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0440760Z         %87 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T10:03:46.0441041Z         %88 = arith.extf %86 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T10:03:46.0441301Z         %89 = tt.broadcast %87 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T10:03:46.0441565Z         %90 = arith.subf %88, %89 : tensor<16x128xf32>
2026-02-21T10:03:46.0441929Z         %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T10:03:46.0442339Z         %92 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T10:03:46.0442618Z         %93 = tt.broadcast %92 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T10:03:46.0442858Z         %94 = arith.divf %91, %93 : tensor<16x128xf32>
2026-02-21T10:03:46.0443091Z         %95 = arith.truncf %94 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T10:03:46.0443366Z         %96 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0443646Z         %97 = tt.addptr %96, %83 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T10:03:46.0443894Z         tt.store %97, %95 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0444100Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T10:03:46.0444294Z         %98 = arith.muli %c128_i32, %c2_i32 : i32
2026-02-21T10:03:46.0444496Z         %99 = arith.addi %arg3, %98 : i32
2026-02-21T10:03:46.0444730Z         %100 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T10:03:46.0444991Z         %101 = tt.splat %99 : i32 -> tensor<128xi32>
2026-02-21T10:03:46.0445200Z         %102 = arith.addi %101, %100 : tensor<128xi32>
2026-02-21T10:03:46.0445450Z         %103 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T10:03:46.0445718Z         %104 = arith.muli %103, %cst : tensor<16x1xi32>
2026-02-21T10:03:46.0445978Z         %105 = tt.expand_dims %102 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T10:03:46.0446281Z         %106 = tt.broadcast %104 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T10:03:46.0446553Z         %107 = tt.broadcast %105 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T10:03:46.0446795Z         %108 = arith.addi %106, %107 : tensor<16x128xi32>
2026-02-21T10:03:46.0447042Z         %109 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0447324Z         %110 = tt.addptr %109, %108 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T10:03:46.0447667Z         %111 = tt.load %110 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0447980Z         %112 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T10:03:46.0448300Z         %113 = arith.extf %111 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T10:03:46.0448572Z         %114 = tt.broadcast %112 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T10:03:46.0448811Z         %115 = arith.subf %113, %114 : tensor<16x128xf32>
2026-02-21T10:03:46.0449193Z         %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T10:03:46.0449636Z         %117 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T10:03:46.0449942Z         %118 = tt.broadcast %117 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T10:03:46.0450226Z         %119 = arith.divf %116, %118 : tensor<16x128xf32>
2026-02-21T10:03:46.0450481Z         %120 = arith.truncf %119 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T10:03:46.0450778Z         %121 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0451105Z         %122 = tt.addptr %121, %108 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T10:03:46.0451386Z         tt.store %122, %120 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0451632Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T10:03:46.0451835Z         %123 = arith.muli %c128_i32, %c3_i32 : i32
2026-02-21T10:03:46.0452040Z         %124 = arith.addi %arg3, %123 : i32
2026-02-21T10:03:46.0452283Z         %125 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T10:03:46.0452551Z         %126 = tt.splat %124 : i32 -> tensor<128xi32>
2026-02-21T10:03:46.0452764Z         %127 = arith.addi %126, %125 : tensor<128xi32>
2026-02-21T10:03:46.0453032Z         %128 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T10:03:46.0453308Z         %129 = arith.muli %128, %cst : tensor<16x1xi32>
2026-02-21T10:03:46.0453586Z         %130 = tt.expand_dims %127 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T10:03:46.0453902Z         %131 = tt.broadcast %129 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T10:03:46.0454183Z         %132 = tt.broadcast %130 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T10:03:46.0454466Z         %133 = arith.addi %131, %132 : tensor<16x128xi32>
2026-02-21T10:03:46.0454715Z         %134 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0455018Z         %135 = tt.addptr %134, %133 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T10:03:46.0455349Z         %136 = tt.load %135 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0455679Z         %137 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T10:03:46.0455986Z         %138 = arith.extf %136 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T10:03:46.0456271Z         %139 = tt.broadcast %137 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T10:03:46.0456535Z         %140 = arith.subf %138, %139 : tensor<16x128xf32>
2026-02-21T10:03:46.0456965Z         %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T10:03:46.0457387Z         %142 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T10:03:46.0457685Z         %143 = tt.broadcast %142 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T10:03:46.0457926Z         %144 = arith.divf %141, %143 : tensor<16x128xf32>
2026-02-21T10:03:46.0458168Z         %145 = arith.truncf %144 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T10:03:46.0458442Z         %146 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0458718Z         %147 = tt.addptr %146, %133 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T10:03:46.0459021Z         tt.store %147, %145 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0459230Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T10:03:46.0459482Z       %27 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T10:03:46.0459765Z       %28 = tt.splat %c11264_i32_2 : i32 -> tensor<128xi32>
2026-02-21T10:03:46.0459986Z       %29 = arith.addi %28, %27 : tensor<128xi32>
2026-02-21T10:03:46.0460237Z       %30 = tt.expand_dims %7 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T10:03:46.0460489Z       %31 = arith.muli %30, %cst : tensor<16x1xi32>
2026-02-21T10:03:46.0460743Z       %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
2026-02-21T10:03:46.0461022Z       %33 = tt.broadcast %31 : tensor<16x1xi32> -> tensor<16x128xi32>
2026-02-21T10:03:46.0461287Z       %34 = tt.broadcast %32 : tensor<1x128xi32> -> tensor<16x128xi32>
2026-02-21T10:03:46.0461596Z       %35 = arith.addi %33, %34 : tensor<16x128xi32>
2026-02-21T10:03:46.0461842Z       %36 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0462123Z       %37 = tt.addptr %36, %35 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T10:03:46.0462457Z       %38 = tt.load %37 evictionPolicy = evict_first : tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0462768Z       %39 = tt.expand_dims %17 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T10:03:46.0463044Z       %40 = arith.extf %38 : tensor<16x128xf16> to tensor<16x128xf32>
2026-02-21T10:03:46.0463307Z       %41 = tt.broadcast %39 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T10:03:46.0463537Z       %42 = arith.subf %40, %41 : tensor<16x128xf32>
2026-02-21T10:03:46.0463909Z       %43 = tt.extern_elementwise %42 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x128xf32>) -> tensor<16x128xf32>
2026-02-21T10:03:46.0464328Z       %44 = tt.expand_dims %26 {axis = 1 : i32} : tensor<16xf32> -> tensor<16x1xf32>
2026-02-21T10:03:46.0464608Z       %45 = tt.broadcast %44 : tensor<16x1xf32> -> tensor<16x128xf32>
2026-02-21T10:03:46.0464841Z       %46 = arith.divf %43, %45 : tensor<16x128xf32>
2026-02-21T10:03:46.0465072Z       %47 = arith.truncf %46 : tensor<16x128xf32> to tensor<16x128xf16>
2026-02-21T10:03:46.0465340Z       %48 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0465614Z       %49 = tt.addptr %48, %35 : tensor<16x128x!tt.ptr<f16>>, tensor<16x128xi32>
2026-02-21T10:03:46.0465864Z       tt.store %49, %47 : tensor<16x128x!tt.ptr<f16>>
2026-02-21T10:03:46.0466170Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T10:03:46.0466445Z     tt.return
2026-02-21T10:03:46.0466578Z   }
2026-02-21T10:03:46.0466695Z }
2026-02-21T10:03:46.0466771Z 
2026-02-21T10:03:46.0466821Z {-#
2026-02-21T10:03:46.0466946Z   external_resources: {
2026-02-21T10:03:46.0467111Z     mlir_reproducer: {
2026-02-21T10:03:46.0471481Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T10:03:46.0476094Z       disable_threading: false,
2026-02-21T10:03:46.0476277Z       verify_each: true
2026-02-21T10:03:46.0476426Z     }
2026-02-21T10:03:46.0476555Z   }
2026-02-21T10:03:46.0476672Z #-}
2026-02-21T10:03:46.0477134Z /tmp/torchinductor_root/cc/cccgmf6dglaruus34hbvdsr7nfyctkle4fuecbsoxmkrcgdutke5.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T10:03:46.0478352Z /tmp/torchinductor_root/cc/cccgmf6dglaruus34hbvdsr7nfyctkle4fuecbsoxmkrcgdutke5.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T10:03:46.0479332Z [43s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T10:03:46.0480393Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_sm_multiplier=32, num_stages=3, num_warps=32, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[2, 3], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T10:03:46.0481344Z Error: RuntimeError: PassManager::run failed
2026-02-21T10:03:46.0481633Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T10:03:53.9298852Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 9.9 configs/s
2026-02-21T10:03:53.9308427Z [51s] Adaptive compile timeout: 30s (90% percentile=16.5s, bounds=[30.0s, 30s])
2026-02-21T10:03:55.4259193Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 661.4 configs/s
2026-02-21T10:03:55.5087133Z [53s] Initial random population of 100, 5 starting points: 
2026-02-21T10:03:55.5091447Z error=12
2026-02-21T10:03:55.5092960Z timeout=1
2026-02-21T10:03:55.5093125Z ok=87
2026-02-21T10:03:55.5093251Z min=0.0757
2026-02-21T10:03:55.5093384Z mid=0.5713
2026-02-21T10:03:55.5093505Z max=267.8518
2026-02-21T10:03:55.5093659Z best={'block_sizes': [1, 4096],
2026-02-21T10:03:55.5093924Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T10:03:55.5094185Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T10:03:55.5094376Z  'num_stages': 5,
2026-02-21T10:03:55.5094511Z  'num_warps': 1,
2026-02-21T10:03:55.5094715Z  'pid_type': 'flat',
2026-02-21T10:03:55.5094877Z  'range_flattens': [None, False],
2026-02-21T10:03:55.5095076Z  'range_multi_buffers': [None, False],
2026-02-21T10:03:55.5095259Z  'range_num_stages': [0, 1],
2026-02-21T10:03:55.5099835Z  'range_unroll_factors': [0, 0],
2026-02-21T10:03:55.5101331Z  'range_warp_specializes': [None, False]}
2026-02-21T10:03:55.5101709Z [53s] Fitting surrogate: 100 points, 100 targets
2026-02-21T10:03:56.4444644Z [54s] Generation 1 starting: 73 neighbors, 5 active search path(s)
2026-02-21T10:04:12.6976664Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 3.7 configs/s
2026-02-21T10:04:17.2107432Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 17.0 configs/s
2026-02-21T10:04:22.3785047Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 194.7 configs/s
2026-02-21T10:04:22.6192151Z [80s] Generation 1 complete: 
2026-02-21T10:04:22.6196443Z ok=79
2026-02-21T10:04:22.6201419Z min=0.0614
2026-02-21T10:04:22.6205387Z mid=0.0901
2026-02-21T10:04:22.6209704Z max=0.8972
2026-02-21T10:04:22.6214060Z best={'block_sizes': [1, 4096],
2026-02-21T10:04:22.6217641Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:04:22.6220659Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:04:22.6220905Z  'num_stages': 7,
2026-02-21T10:04:22.6221061Z  'num_warps': 4,
2026-02-21T10:04:22.6221203Z  'pid_type': 'flat',
2026-02-21T10:04:22.6221368Z  'range_flattens': [None, None],
2026-02-21T10:04:22.6221648Z  'range_multi_buffers': [None, True],
2026-02-21T10:04:22.6221841Z  'range_num_stages': [0, 0],
2026-02-21T10:04:22.6222014Z  'range_unroll_factors': [0, 4],
2026-02-21T10:04:22.6222207Z  'range_warp_specializes': [None, False]}
2026-02-21T10:04:22.6222427Z [80s] Fitting surrogate: 179 points, 179 targets
2026-02-21T10:04:23.6000365Z [81s] Generation 2 starting: 69 neighbors, 5 active search path(s)
2026-02-21T10:04:38.2608885Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 4.3 configs/s
2026-02-21T10:04:42.3483323Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 17.1 configs/s
2026-02-21T10:04:42.9343800Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1662.5 configs/s
2026-02-21T10:04:42.9852518Z [100s] Generation 2 complete: 
2026-02-21T10:04:42.9856932Z ok=74
2026-02-21T10:04:42.9858486Z min=0.0421
2026-02-21T10:04:42.9858655Z mid=0.0716
2026-02-21T10:04:42.9858779Z max=0.1883
2026-02-21T10:04:42.9858927Z best={'block_sizes': [1, 16384],
2026-02-21T10:04:42.9859166Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:04:42.9859431Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:04:42.9859639Z  'num_stages': 7,
2026-02-21T10:04:42.9859846Z  'num_warps': 4,
2026-02-21T10:04:42.9860004Z  'pid_type': 'flat',
2026-02-21T10:04:42.9860165Z  'range_flattens': [None, None],
2026-02-21T10:04:42.9860379Z  'range_multi_buffers': [None, True],
2026-02-21T10:04:42.9863545Z  'range_num_stages': [0, 0],
2026-02-21T10:04:42.9868560Z  'range_unroll_factors': [0, 4],
2026-02-21T10:04:42.9868909Z  'range_warp_specializes': [None, False]}
2026-02-21T10:04:42.9869198Z [100s] Fitting surrogate: 253 points, 253 targets
2026-02-21T10:04:43.9194428Z [101s] Generation 3 starting: 62 neighbors, 5 active search path(s)
2026-02-21T10:04:56.4580254Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 4.4 configs/s
2026-02-21T10:05:00.1172688Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 62/62 17.2 configs/s
2026-02-21T10:05:01.7754284Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 603.8 configs/s
2026-02-21T10:05:01.8736201Z [119s] Generation 3 complete: 
2026-02-21T10:05:01.8738228Z ok=67
2026-02-21T10:05:01.8738448Z min=0.0429
2026-02-21T10:05:01.8738646Z mid=0.0655
2026-02-21T10:05:01.8738844Z max=0.9596
2026-02-21T10:05:01.8739034Z best={'block_sizes': [1, 16384],
2026-02-21T10:05:01.8739314Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:05:01.8739605Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:05:01.8739848Z  'num_stages': 7,
2026-02-21T10:05:01.8739995Z  'num_warps': 4,
2026-02-21T10:05:01.8740134Z  'pid_type': 'flat',
2026-02-21T10:05:01.8740295Z  'range_flattens': [None, None],
2026-02-21T10:05:01.8740470Z  'range_multi_buffers': [None, True],
2026-02-21T10:05:01.8740657Z  'range_num_stages': [0, 0],
2026-02-21T10:05:01.8740822Z  'range_unroll_factors': [0, 4],
2026-02-21T10:05:01.8741236Z  'range_warp_specializes': [None, False]}
2026-02-21T10:05:01.8751030Z [119s] Fitting surrogate: 320 points, 320 targets
2026-02-21T10:05:02.5497715Z [120s] Generation 4 starting: 41 neighbors, 4 active search path(s)
2026-02-21T10:05:12.1968073Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 2.7 configs/s
2026-02-21T10:05:14.6602893Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 42/42 17.4 configs/s
2026-02-21T10:05:16.9856099Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 432.4 configs/s
2026-02-21T10:05:17.1222820Z [134s] Generation 4 complete: 
2026-02-21T10:05:17.1227103Z ok=45
2026-02-21T10:05:17.1231443Z min=0.0410
2026-02-21T10:05:17.1232755Z mid=0.0614
2026-02-21T10:05:17.1232917Z max=0.7526
2026-02-21T10:05:17.1233078Z best={'block_sizes': [1, 16384],
2026-02-21T10:05:17.1233317Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:05:17.1233581Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:05:17.1233783Z  'num_stages': 7,
2026-02-21T10:05:17.1233932Z  'num_warps': 4,
2026-02-21T10:05:17.1234073Z  'pid_type': 'flat',
2026-02-21T10:05:17.1234238Z  'range_flattens': [None, None],
2026-02-21T10:05:17.1234669Z  'range_multi_buffers': [None, True],
2026-02-21T10:05:17.1234881Z  'range_num_stages': [0, 0],
2026-02-21T10:05:17.1235054Z  'range_unroll_factors': [0, 4],
2026-02-21T10:05:17.1235234Z  'range_warp_specializes': [None, False]}
2026-02-21T10:05:17.1239353Z [134s] Fitting surrogate: 365 points, 365 targets
2026-02-21T10:05:17.5216597Z [135s] Generation 5 starting: 21 neighbors, 2 active search path(s)
2026-02-21T10:05:22.0605943Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 3.7 configs/s
2026-02-21T10:05:23.3531014Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 22/22 17.6 configs/s
2026-02-21T10:05:24.8055765Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 689.4 configs/s
2026-02-21T10:05:24.9002593Z [142s] Generation 5 complete: 
2026-02-21T10:05:24.9007037Z ok=24
2026-02-21T10:05:24.9011334Z min=0.0409
2026-02-21T10:05:24.9012770Z mid=0.0593
2026-02-21T10:05:24.9012961Z max=0.0820
2026-02-21T10:05:24.9013119Z best={'block_sizes': [1, 16384],
2026-02-21T10:05:24.9013359Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:05:24.9013599Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:05:24.9013791Z  'num_stages': 8,
2026-02-21T10:05:24.9013932Z  'num_warps': 1,
2026-02-21T10:05:24.9014079Z  'pid_type': 'flat',
2026-02-21T10:05:24.9014232Z  'range_flattens': [None, None],
2026-02-21T10:05:24.9014416Z  'range_multi_buffers': [None, None],
2026-02-21T10:05:24.9014605Z  'range_num_stages': [0, 3],
2026-02-21T10:05:24.9014767Z  'range_unroll_factors': [0, 1],
2026-02-21T10:05:24.9014952Z  'range_warp_specializes': [None, True]}
2026-02-21T10:05:24.9017610Z [142s] Fitting surrogate: 389 points, 389 targets
2026-02-21T10:05:25.2806936Z [142s] Generation 6 starting: 17 neighbors, 2 active search path(s)
2026-02-21T10:05:28.8889913Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 9.3 configs/s
2026-02-21T10:05:29.8835672Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 17.9 configs/s
2026-02-21T10:05:31.3039487Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 704.6 configs/s
2026-02-21T10:05:31.3968314Z [148s] Generation 6 complete: 
2026-02-21T10:05:31.3972657Z ok=19
2026-02-21T10:05:31.3977079Z min=0.0410
2026-02-21T10:05:31.3981459Z mid=0.0430
2026-02-21T10:05:31.3985897Z max=0.0799
2026-02-21T10:05:31.3987351Z best={'block_sizes': [1, 16384],
2026-02-21T10:05:31.3987614Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:05:31.3988142Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:05:31.3988360Z  'num_stages': 8,
2026-02-21T10:05:31.3988513Z  'num_warps': 1,
2026-02-21T10:05:31.3988663Z  'pid_type': 'flat',
2026-02-21T10:05:31.3988818Z  'range_flattens': [None, None],
2026-02-21T10:05:31.3989003Z  'range_multi_buffers': [None, None],
2026-02-21T10:05:31.3989341Z  'range_num_stages': [0, 3],
2026-02-21T10:05:31.3989513Z  'range_unroll_factors': [0, 1],
2026-02-21T10:05:31.3989690Z  'range_warp_specializes': [None, True]}
2026-02-21T10:05:31.3989915Z [148s] Fitting surrogate: 408 points, 408 targets
2026-02-21T10:05:31.7673179Z [149s] Generation 7 starting: 13 neighbors, 2 active search path(s)
2026-02-21T10:05:35.9842158Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 1.7 configs/s
2026-02-21T10:05:36.7541055Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 13/13 18.0 configs/s
2026-02-21T10:05:37.9881379Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 809.5 configs/s
2026-02-21T10:05:38.0734950Z [155s] Generation 7 complete: 
2026-02-21T10:05:38.0736753Z ok=15
2026-02-21T10:05:38.0736923Z min=0.0409
2026-02-21T10:05:38.0737052Z mid=0.0471
2026-02-21T10:05:38.0737184Z max=0.1352
2026-02-21T10:05:38.0737581Z best={'block_sizes': [1, 16384],
2026-02-21T10:05:38.0737846Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:05:38.0738086Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:05:38.0738283Z  'num_stages': 8,
2026-02-21T10:05:38.0738421Z  'num_warps': 1,
2026-02-21T10:05:38.0738566Z  'pid_type': 'flat',
2026-02-21T10:05:38.0738719Z  'range_flattens': [None, None],
2026-02-21T10:05:38.0738897Z  'range_multi_buffers': [None, None],
2026-02-21T10:05:38.0739084Z  'range_num_stages': [0, 3],
2026-02-21T10:05:38.0739246Z  'range_unroll_factors': [0, 1],
2026-02-21T10:05:38.0739429Z  'range_warp_specializes': [None, True]}
2026-02-21T10:05:38.0749921Z [155s] Fitting surrogate: 423 points, 423 targets
2026-02-21T10:05:38.2463756Z [155s] Autotuning complete in 155.8s after searching 400 configs.
2026-02-21T10:05:38.2464149Z One can hardcode the best config and skip autotuning with:
2026-02-21T10:05:38.2465124Z     @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T10:05:38.2465973Z 
2026-02-21T10:05:38.2466232Z [155s] Code of selected kernel: /tmp/torchinductor_root/di/cdiqaqmp6hqpf6mdpxmaeajmeqiazcpvadjtb4krqe5mqsv4i3vl.py
2026-02-21T10:05:38.2689116Z from __future__ import annotations
2026-02-21T10:05:38.2691090Z 
2026-02-21T10:05:38.2691315Z import torch
2026-02-21T10:05:38.2691506Z import triton
2026-02-21T10:05:38.2691900Z import triton.language as tl
2026-02-21T10:05:38.2692145Z from torch._inductor.runtime import triton_helpers
2026-02-21T10:05:38.2692692Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T10:05:38.2692995Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T10:05:38.2693179Z 
2026-02-21T10:05:38.2693362Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T10:05:38.2693549Z _BLOCK_SIZE_1 = tl.constexpr(16384)
2026-02-21T10:05:38.2693683Z 
2026-02-21T10:05:38.2693742Z @triton.jit
2026-02-21T10:05:38.2693897Z def _helion_softmax_two_pass(x, out):
2026-02-21T10:05:38.2694170Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T10:05:38.2694441Z     pid_0 = tl.program_id(0)
2026-02-21T10:05:38.2694610Z     offset_0 = pid_0
2026-02-21T10:05:38.2694798Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T10:05:38.2695084Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T10:05:38.2695461Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T10:05:38.2695731Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T10:05:38.2695996Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T10:05:38.2696264Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T10:05:38.2696554Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T10:05:38.2696825Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T10:05:38.2697065Z     # src[softmax.py:82-89]: ...
2026-02-21T10:05:38.2697393Z     for offset_2 in tl.range(0, 11392, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=True, num_stages=3):
2026-02-21T10:05:38.2697772Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T10:05:38.2698010Z         mask_1 = indices_2 < 11392
2026-02-21T10:05:38.2698174Z         mi_copy = mi
2026-02-21T10:05:38.2698321Z         di_copy = di
2026-02-21T10:05:38.2698466Z         mi_copy_0 = mi_copy
2026-02-21T10:05:38.2698620Z         di_copy_0 = di_copy
2026-02-21T10:05:38.2698808Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T10:05:38.2699216Z         values = tl.load(x + (indices_0[:, None] * 11392 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first')
2026-02-21T10:05:38.2699613Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T10:05:38.2700015Z         _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16))
2026-02-21T10:05:38.2700413Z         local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16)
2026-02-21T10:05:38.2700680Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T10:05:38.2700913Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T10:05:38.2701129Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T10:05:38.2701385Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T10:05:38.2701672Z         v_2 = mi_copy_0 - v_1
2026-02-21T10:05:38.2701843Z         v_3 = libdevice.exp(v_2)
2026-02-21T10:05:38.2702020Z         v_4 = di_copy_0 * v_3
2026-02-21T10:05:38.2702219Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T10:05:38.2702424Z         subscript = v_1[:, None]
2026-02-21T10:05:38.2702614Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T10:05:38.2702796Z         v_6 = v_5 - subscript
2026-02-21T10:05:38.2703023Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T10:05:38.2703288Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T10:05:38.2703515Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T10:05:38.2703712Z         v_7 = libdevice.exp(v_6)
2026-02-21T10:05:38.2704026Z         _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32))
2026-02-21T10:05:38.2704388Z         sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32)
2026-02-21T10:05:38.2704585Z         di = v_4 + sum_1
2026-02-21T10:05:38.2704803Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T10:05:38.2704970Z         mi = v_1
2026-02-21T10:05:38.2705175Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T10:05:38.2705493Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T10:05:38.2705795Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T10:05:38.2706210Z     for offset_2 in tl.range(0, 11392, _BLOCK_SIZE_1, loop_unroll_factor=1, warp_specialize=True, num_stages=3):
2026-02-21T10:05:38.2706572Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T10:05:38.2706813Z         mask_2 = indices_2 < 11392
2026-02-21T10:05:38.2706984Z         mi_copy_1 = mi
2026-02-21T10:05:38.2707142Z         di_copy_1 = di
2026-02-21T10:05:38.2707291Z         mi_copy_1_0 = mi_copy_1
2026-02-21T10:05:38.2707508Z         di_copy_1_0 = di_copy_1
2026-02-21T10:05:38.2707699Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T10:05:38.2708066Z         values_1 = tl.load(x + (indices_0[:, None] * 11392 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first')
2026-02-21T10:05:38.2708511Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T10:05:38.2708785Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T10:05:38.2708977Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T10:05:38.2709157Z         v_10 = v_9 - subscript_1
2026-02-21T10:05:38.2709328Z         v_11 = libdevice.exp(v_10)
2026-02-21T10:05:38.2709506Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T10:05:38.2709682Z         v_12 = v_11 / subscript_2
2026-02-21T10:05:38.2709858Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T10:05:38.2710132Z         tl.store(out + (indices_0[:, None] * 11392 + indices_2[None, :] * 1), v_13, mask_2[None, :])
2026-02-21T10:05:38.2710354Z 
2026-02-21T10:05:38.2710481Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T10:05:38.2710708Z     """
2026-02-21T10:05:38.2710915Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T10:05:38.2711254Z     This version uses fewer passes but is less numerically stable.
2026-02-21T10:05:38.2711474Z     Args:
2026-02-21T10:05:38.2711664Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T10:05:38.2711853Z     Returns:
2026-02-21T10:05:38.2712034Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T10:05:38.2712238Z     """
2026-02-21T10:05:38.2712381Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T10:05:38.2712557Z     m, n = x.size()
2026-02-21T10:05:38.2712733Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T10:05:38.2712939Z     out = torch.empty_like(x)
2026-02-21T10:05:38.2713159Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T10:05:38.2713480Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T10:05:38.2713788Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T10:05:38.2714032Z     # src[softmax.py:79-92]: ...
2026-02-21T10:05:38.2714288Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=8)
2026-02-21T10:05:38.2714565Z     # src[softmax.py:93]: return out
2026-02-21T10:05:38.2714737Z     return out
2026-02-21T10:05:39.2720443Z WARNING:tritonbench.utils.triton_op:Completed input ID 87:
2026-02-21T10:05:39.2722317Z (M, N)
2026-02-21T10:05:39.2722481Z -------------
2026-02-21T10:05:39.2722636Z (4096, 11392)
2026-02-21T10:05:39.2722766Z 
2026-02-21T10:05:39.2734771Z  90%|█████████ | 18/20 [56:44<06:48, 204.23s/it]
2026-02-21T10:05:39.2734771Z WARNING:tritonbench.utils.triton_op:Running input ID 92:
2026-02-21T10:05:39.2738796Z (M, N)
2026-02-21T10:05:39.2743117Z -------------
2026-02-21T10:05:39.2748115Z (4096, 12032)
2026-02-21T10:05:39.2751476Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T10:05:40.4471262Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T10:05:41.8207464Z INFO:tritonbench.utils.triton_op:Took 2.34ms to get benchmark function for torch_compile_softmax
2026-02-21T10:05:43.1571417Z WARNING:__main__:Input tensor metadata:
2026-02-21T10:05:43.1575894Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T10:05:43.1579111Z               'dtype': 'torch.float16',
2026-02-21T10:05:43.1583478Z               'shape': (4096, 12032),
2026-02-21T10:05:43.1585033Z               'stride': (12032, 1)},),
2026-02-21T10:05:43.1585257Z   'kwargs': {}}
2026-02-21T10:05:43.1594307Z INFO:tritonbench.utils.triton_op:Took 2.47ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T10:05:43.3339296Z [0s] Autotune random seed: 2138408546
2026-02-21T10:05:43.3593977Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T10:06:20.4021737Z [37s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', ''], maxnreg=32, num_sm_multiplier=2, num_stages=8, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[3, 2], range_unroll_factors=[4, 1], range_warp_specializes=[False, False])
2026-02-21T10:06:23.2134804Z [39s] Timeout after 30s compiling Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None])
2026-02-21T10:06:25.2487547Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s
2026-02-21T10:06:34.6461307Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 10.5 configs/s
2026-02-21T10:06:34.6469913Z [51s] Adaptive compile timeout: 30s (90% percentile=17.6s, bounds=[30.0s, 30s])
2026-02-21T10:06:36.8640468Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 661.8 configs/s
2026-02-21T10:06:36.9525256Z [53s] Initial random population of 100, 5 starting points: 
2026-02-21T10:06:36.9529644Z error=11
2026-02-21T10:06:36.9533109Z timeout=2
2026-02-21T10:06:36.9539035Z ok=87
2026-02-21T10:06:36.9542509Z min=0.0757
2026-02-21T10:06:36.9546831Z mid=0.5504
2026-02-21T10:06:36.9550279Z max=280.9385
2026-02-21T10:06:36.9552618Z best={'block_sizes': [1, 4096],
2026-02-21T10:06:36.9553002Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T10:06:36.9553311Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T10:06:36.9557890Z  'num_stages': 5,
2026-02-21T10:06:36.9561822Z  'num_warps': 1,
2026-02-21T10:06:36.9563007Z  'pid_type': 'flat',
2026-02-21T10:06:36.9563249Z  'range_flattens': [None, False],
2026-02-21T10:06:36.9563477Z  'range_multi_buffers': [None, False],
2026-02-21T10:06:36.9563704Z  'range_num_stages': [0, 1],
2026-02-21T10:06:36.9563894Z  'range_unroll_factors': [0, 0],
2026-02-21T10:06:36.9564109Z  'range_warp_specializes': [None, False]}
2026-02-21T10:06:36.9564448Z [53s] Fitting surrogate: 100 points, 100 targets
2026-02-21T10:06:37.9882443Z [54s] Generation 1 starting: 73 neighbors, 5 active search path(s)
2026-02-21T10:07:03.4796825Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 0.5 configs/s
2026-02-21T10:07:07.9494607Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 16.9 configs/s
2026-02-21T10:07:11.6557298Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 271.1 configs/s
2026-02-21T10:07:11.8339404Z [88s] Generation 1 complete: 
2026-02-21T10:07:11.8340627Z ok=79
2026-02-21T10:07:11.8341133Z min=0.0573
2026-02-21T10:07:11.8341281Z mid=0.0882
2026-02-21T10:07:11.8341502Z max=0.7086
2026-02-21T10:07:11.8341920Z best={'block_sizes': [1, 4096],
2026-02-21T10:07:11.8342180Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T10:07:11.8342466Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T10:07:11.8342769Z  'num_sm_multiplier': 128,
2026-02-21T10:07:11.8342938Z  'num_stages': 5,
2026-02-21T10:07:11.8343096Z  'num_warps': 1,
2026-02-21T10:07:11.8343258Z  'pid_type': 'persistent_interleaved',
2026-02-21T10:07:11.8343463Z  'range_flattens': [True, False],
2026-02-21T10:07:11.8343641Z  'range_multi_buffers': [True, False],
2026-02-21T10:07:11.8343826Z  'range_num_stages': [2, 1],
2026-02-21T10:07:11.8343995Z  'range_unroll_factors': [0, 0],
2026-02-21T10:07:11.8344183Z  'range_warp_specializes': [True, None]}
2026-02-21T10:07:11.8354149Z [88s] Fitting surrogate: 179 points, 179 targets
2026-02-21T10:07:12.8814370Z [89s] Generation 2 starting: 81 neighbors, 5 active search path(s)
2026-02-21T10:07:48.3457595Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 0.3 configs/s
2026-02-21T10:07:53.2780170Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 17.2 configs/s
2026-02-21T10:07:59.2428864Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 169.1 configs/s
2026-02-21T10:07:59.5409129Z [136s] Generation 2 complete: 
2026-02-21T10:07:59.5410843Z error=1
2026-02-21T10:07:59.5411003Z ok=86
2026-02-21T10:07:59.5411130Z min=0.0555
2026-02-21T10:07:59.5411265Z mid=0.0726
2026-02-21T10:07:59.5411385Z max=1.8360
2026-02-21T10:07:59.5411527Z best={'block_sizes': [1, 4096],
2026-02-21T10:07:59.5411956Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T10:07:59.5412221Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T10:07:59.5412414Z  'num_sm_multiplier': 128,
2026-02-21T10:07:59.5412599Z  'num_stages': 5,
2026-02-21T10:07:59.5412740Z  'num_warps': 2,
2026-02-21T10:07:59.5412917Z  'pid_type': 'persistent_interleaved',
2026-02-21T10:07:59.5413118Z  'range_flattens': [True, False],
2026-02-21T10:07:59.5413297Z  'range_multi_buffers': [True, False],
2026-02-21T10:07:59.5413487Z  'range_num_stages': [2, 1],
2026-02-21T10:07:59.5413656Z  'range_unroll_factors': [0, 0],
2026-02-21T10:07:59.5413838Z  'range_warp_specializes': [None, False]}
2026-02-21T10:07:59.5430724Z [136s] Fitting surrogate: 266 points, 266 targets
2026-02-21T10:08:00.6434436Z [137s] Generation 3 starting: 73 neighbors, 5 active search path(s)
2026-02-21T10:08:17.7459972Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 3.1 configs/s
2026-02-21T10:08:22.1700943Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 17.1 configs/s
2026-02-21T10:08:25.4870347Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 303.6 configs/s
2026-02-21T10:08:25.6828700Z [162s] Generation 3 complete: 
2026-02-21T10:08:25.6833023Z ok=79
2026-02-21T10:08:25.6837386Z min=0.0430
2026-02-21T10:08:25.6841200Z mid=0.0676
2026-02-21T10:08:25.6845685Z max=0.5201
2026-02-21T10:08:25.6850240Z best={'block_sizes': [1, 16384],
2026-02-21T10:08:25.6854685Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:08:25.6856007Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:08:25.6856227Z  'num_stages': 6,
2026-02-21T10:08:25.6856377Z  'num_warps': 4,
2026-02-21T10:08:25.6856520Z  'pid_type': 'flat',
2026-02-21T10:08:25.6856689Z  'range_flattens': [None, None],
2026-02-21T10:08:25.6856875Z  'range_multi_buffers': [None, True],
2026-02-21T10:08:25.6857072Z  'range_num_stages': [0, 0],
2026-02-21T10:08:25.6857242Z  'range_unroll_factors': [0, 4],
2026-02-21T10:08:25.6857423Z  'range_warp_specializes': [None, False]}
2026-02-21T10:08:25.6857637Z [162s] Fitting surrogate: 345 points, 345 targets
2026-02-21T10:08:27.2176510Z [163s] Generation 4 starting: 50 neighbors, 4 active search path(s)
2026-02-21T10:08:38.0029543Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 4.4 configs/s
2026-02-21T10:08:41.0188822Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 17.2 configs/s
2026-02-21T10:08:44.5172814Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 288.3 configs/s
2026-02-21T10:08:44.7158410Z [181s] Generation 4 complete: 
2026-02-21T10:08:44.7160464Z ok=54
2026-02-21T10:08:44.7160688Z min=0.0430
2026-02-21T10:08:44.7165341Z mid=0.0594
2026-02-21T10:08:44.7167327Z max=0.4955
2026-02-21T10:08:44.7167517Z best={'block_sizes': [1, 16384],
2026-02-21T10:08:44.7167768Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:08:44.7168014Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:08:44.7168229Z  'num_stages': 6,
2026-02-21T10:08:44.7168411Z  'num_warps': 4,
2026-02-21T10:08:44.7168565Z  'pid_type': 'flat',
2026-02-21T10:08:44.7168728Z  'range_flattens': [None, None],
2026-02-21T10:08:44.7168907Z  'range_multi_buffers': [None, True],
2026-02-21T10:08:44.7169101Z  'range_num_stages': [0, 0],
2026-02-21T10:08:44.7169560Z  'range_unroll_factors': [0, 4],
2026-02-21T10:08:44.7169767Z  'range_warp_specializes': [None, False]}
2026-02-21T10:08:44.7176270Z [181s] Fitting surrogate: 399 points, 399 targets
2026-02-21T10:08:45.4604732Z [182s] Generation 5 starting: 46 neighbors, 4 active search path(s)
2026-02-21T10:08:57.0369977Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48/48 3.5 configs/s
2026-02-21T10:08:59.8744648Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 48/48 17.2 configs/s
2026-02-21T10:09:02.5140455Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 381.4 configs/s
2026-02-21T10:09:02.6684626Z [199s] Generation 5 complete: 
2026-02-21T10:09:02.6688955Z ok=50
2026-02-21T10:09:02.6693345Z min=0.0429
2026-02-21T10:09:02.6697763Z mid=0.0594
2026-02-21T10:09:02.6701679Z max=0.7075
2026-02-21T10:09:02.6704952Z best={'block_sizes': [1, 16384],
2026-02-21T10:09:02.6705290Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:09:02.6709488Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:09:02.6713208Z  'num_stages': 6,
2026-02-21T10:09:02.6716425Z  'num_warps': 4,
2026-02-21T10:09:02.6719704Z  'pid_type': 'flat',
2026-02-21T10:09:02.6719990Z  'range_flattens': [None, None],
2026-02-21T10:09:02.6720213Z  'range_multi_buffers': [None, True],
2026-02-21T10:09:02.6724194Z  'range_num_stages': [0, 0],
2026-02-21T10:09:02.6728508Z  'range_unroll_factors': [0, 4],
2026-02-21T10:09:02.6732298Z  'range_warp_specializes': [None, False]}
2026-02-21T10:09:02.6736725Z [199s] Fitting surrogate: 449 points, 449 targets
2026-02-21T10:09:03.1433878Z [199s] Generation 6 starting: 28 neighbors, 2 active search path(s)
2026-02-21T10:09:12.2548576Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 29/29 1.3 configs/s
2026-02-21T10:09:13.9795206Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 29/29 17.3 configs/s
2026-02-21T10:09:16.1828935Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 456.1 configs/s
2026-02-21T10:09:16.3168171Z [212s] Generation 6 complete: 
2026-02-21T10:09:16.3170022Z ok=31
2026-02-21T10:09:16.3170179Z min=0.0410
2026-02-21T10:09:16.3170316Z mid=0.0593
2026-02-21T10:09:16.3170437Z max=0.5233
2026-02-21T10:09:16.3170581Z best={'block_sizes': [1, 16384],
2026-02-21T10:09:16.3170809Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:09:16.3171056Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:09:16.3171253Z  'num_stages': 6,
2026-02-21T10:09:16.3171883Z  'num_warps': 4,
2026-02-21T10:09:16.3172049Z  'pid_type': 'flat',
2026-02-21T10:09:16.3172214Z  'range_flattens': [None, None],
2026-02-21T10:09:16.3172400Z  'range_multi_buffers': [None, True],
2026-02-21T10:09:16.3172583Z  'range_num_stages': [0, 0],
2026-02-21T10:09:16.3172755Z  'range_unroll_factors': [0, 4],
2026-02-21T10:09:16.3173030Z  'range_warp_specializes': [None, False]}
2026-02-21T10:09:16.3186077Z [212s] Fitting surrogate: 480 points, 480 targets
2026-02-21T10:09:16.7680881Z [213s] Generation 7 starting: 24 neighbors, 2 active search path(s)
2026-02-21T10:09:27.4817103Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25/25 0.5 configs/s
2026-02-21T10:09:28.9706037Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 25/25 17.3 configs/s
2026-02-21T10:09:30.8318105Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 539.3 configs/s
2026-02-21T10:09:30.9448542Z [227s] Generation 7 complete: 
2026-02-21T10:09:30.9452978Z ok=27
2026-02-21T10:09:30.9457259Z min=0.0410
2026-02-21T10:09:30.9458617Z mid=0.0573
2026-02-21T10:09:30.9458786Z max=0.1597
2026-02-21T10:09:30.9458930Z best={'block_sizes': [1, 16384],
2026-02-21T10:09:30.9459442Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:09:30.9459733Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:09:30.9459925Z  'num_stages': 6,
2026-02-21T10:09:30.9460104Z  'num_warps': 4,
2026-02-21T10:09:30.9460244Z  'pid_type': 'flat',
2026-02-21T10:09:30.9460404Z  'range_flattens': [None, None],
2026-02-21T10:09:30.9460580Z  'range_multi_buffers': [None, True],
2026-02-21T10:09:30.9460767Z  'range_num_stages': [0, 0],
2026-02-21T10:09:30.9460931Z  'range_unroll_factors': [0, 4],
2026-02-21T10:09:30.9461115Z  'range_warp_specializes': [None, False]}
2026-02-21T10:09:30.9465864Z [227s] Fitting surrogate: 507 points, 507 targets
2026-02-21T10:09:31.5479687Z [228s] Generation 8 starting: 29 neighbors, 2 active search path(s)
2026-02-21T10:09:39.4077052Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 3.5 configs/s
2026-02-21T10:09:41.2305545Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 31/31 17.4 configs/s
2026-02-21T10:09:43.2755226Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 491.8 configs/s
2026-02-21T10:09:43.4060578Z [240s] Generation 8 complete: 
2026-02-21T10:09:43.4065531Z ok=32
2026-02-21T10:09:43.4066971Z min=0.0429
2026-02-21T10:09:43.4067138Z mid=0.0431
2026-02-21T10:09:43.4067261Z max=0.1270
2026-02-21T10:09:43.4067412Z best={'block_sizes': [1, 16384],
2026-02-21T10:09:43.4067640Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:09:43.4067888Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:09:43.4068083Z  'num_stages': 6,
2026-02-21T10:09:43.4068220Z  'num_warps': 4,
2026-02-21T10:09:43.4068380Z  'pid_type': 'flat',
2026-02-21T10:09:43.4068538Z  'range_flattens': [None, None],
2026-02-21T10:09:43.4068987Z  'range_multi_buffers': [None, True],
2026-02-21T10:09:43.4069170Z  'range_num_stages': [0, 0],
2026-02-21T10:09:43.4069341Z  'range_unroll_factors': [0, 4],
2026-02-21T10:09:43.4069525Z  'range_warp_specializes': [None, False]}
2026-02-21T10:09:43.4080434Z [240s] Fitting surrogate: 539 points, 539 targets
2026-02-21T10:09:43.7844189Z [240s] Generation 9 starting: 11 neighbors, 1 active search path(s)
2026-02-21T10:10:03.2401226Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 0.2 configs/s
2026-02-21T10:10:03.9607946Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 12/12 17.8 configs/s
2026-02-21T10:10:04.6140076Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1505.8 configs/s
2026-02-21T10:10:04.6669754Z [261s] Generation 9 complete: 
2026-02-21T10:10:04.6674300Z ok=13
2026-02-21T10:10:04.6675526Z min=0.0411
2026-02-21T10:10:04.6675748Z mid=0.0594
2026-02-21T10:10:04.6675914Z max=0.4300
2026-02-21T10:10:04.6676107Z best={'block_sizes': [1, 16384],
2026-02-21T10:10:04.6676595Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:10:04.6676887Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:10:04.6677109Z  'num_stages': 6,
2026-02-21T10:10:04.6677254Z  'num_warps': 4,
2026-02-21T10:10:04.6677427Z  'pid_type': 'flat',
2026-02-21T10:10:04.6677583Z  'range_flattens': [None, None],
2026-02-21T10:10:04.6677782Z  'range_multi_buffers': [None, True],
2026-02-21T10:10:04.6677965Z  'range_num_stages': [0, 0],
2026-02-21T10:10:04.6678139Z  'range_unroll_factors': [0, 4],
2026-02-21T10:10:04.6678318Z  'range_warp_specializes': [None, False]}
2026-02-21T10:10:04.6693335Z [261s] Fitting surrogate: 552 points, 552 targets
2026-02-21T10:10:05.0334402Z [261s] Generation 10 starting: 10 neighbors, 1 active search path(s)
2026-02-21T10:10:09.5712543Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 2.0 configs/s
2026-02-21T10:10:10.1704140Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 10/10 18.1 configs/s
2026-02-21T10:10:10.8365288Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1483.6 configs/s
2026-02-21T10:10:10.8923567Z [267s] Generation 10 complete: 
2026-02-21T10:10:10.8925540Z ok=12
2026-02-21T10:10:10.8925708Z min=0.0428
2026-02-21T10:10:10.8925844Z mid=0.0430
2026-02-21T10:10:10.8925963Z max=0.7067
2026-02-21T10:10:10.8926109Z best={'block_sizes': [1, 16384],
2026-02-21T10:10:10.8926342Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:10:10.8926594Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:10:10.8926783Z  'num_stages': 6,
2026-02-21T10:10:10.8926928Z  'num_warps': 4,
2026-02-21T10:10:10.8927070Z  'pid_type': 'flat',
2026-02-21T10:10:10.8927250Z  'range_flattens': [None, None],
2026-02-21T10:10:10.8927687Z  'range_multi_buffers': [None, True],
2026-02-21T10:10:10.8927877Z  'range_num_stages': [0, 0],
2026-02-21T10:10:10.8928053Z  'range_unroll_factors': [0, 4],
2026-02-21T10:10:10.8928239Z  'range_warp_specializes': [None, False]}
2026-02-21T10:10:10.8940206Z [267s] Fitting surrogate: 564 points, 564 targets
2026-02-21T10:10:11.3583978Z [267s] Generation 11 starting: 10 neighbors, 1 active search path(s)
2026-02-21T10:10:14.7780430Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 3.7 configs/s
2026-02-21T10:10:15.3679405Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 10/10 18.5 configs/s
2026-02-21T10:10:16.9297969Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1305.3 configs/s
2026-02-21T10:10:16.9931488Z [273s] Generation 11 complete: 
2026-02-21T10:10:16.9935677Z ok=12
2026-02-21T10:10:16.9939820Z min=0.0429
2026-02-21T10:10:16.9944374Z mid=0.0430
2026-02-21T10:10:16.9948329Z max=0.4566
2026-02-21T10:10:16.9952702Z best={'block_sizes': [1, 16384],
2026-02-21T10:10:16.9954379Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:10:16.9954721Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:10:16.9955169Z  'num_stages': 6,
2026-02-21T10:10:16.9959360Z  'num_warps': 4,
2026-02-21T10:10:16.9959633Z  'pid_type': 'flat',
2026-02-21T10:10:16.9959845Z  'range_flattens': [None, None],
2026-02-21T10:10:16.9960053Z  'range_multi_buffers': [None, True],
2026-02-21T10:10:16.9964677Z  'range_num_stages': [0, 0],
2026-02-21T10:10:16.9968051Z  'range_unroll_factors': [0, 4],
2026-02-21T10:10:16.9971836Z  'range_warp_specializes': [None, False]}
2026-02-21T10:10:16.9976263Z [273s] Fitting surrogate: 576 points, 576 targets
2026-02-21T10:10:17.2729691Z [273s] Autotuning complete in 273.9s after searching 548 configs.
2026-02-21T10:10:17.2730121Z One can hardcode the best config and skip autotuning with:
2026-02-21T10:10:17.2731302Z     @helion.kernel(config=helion.Config(block_sizes=[1, 16384], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=6, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, False]), static_shapes=True)
2026-02-21T10:10:17.2732308Z 
2026-02-21T10:10:17.2732562Z [273s] Code of selected kernel: /tmp/torchinductor_root/es/cesmpf2kmfidgpmoulxbj57xbmpocavecijc4vzx5bpbfemvuaoq.py
2026-02-21T10:10:17.2951638Z from __future__ import annotations
2026-02-21T10:10:17.2955842Z 
2026-02-21T10:10:17.2960422Z import torch
2026-02-21T10:10:17.2961992Z import triton
2026-02-21T10:10:17.2962186Z import triton.language as tl
2026-02-21T10:10:17.2962414Z from torch._inductor.runtime import triton_helpers
2026-02-21T10:10:17.2962702Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T10:10:17.2963030Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T10:10:17.2963214Z 
2026-02-21T10:10:17.2963292Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T10:10:17.2963469Z _BLOCK_SIZE_1 = tl.constexpr(16384)
2026-02-21T10:10:17.2963592Z 
2026-02-21T10:10:17.2963659Z @triton.jit
2026-02-21T10:10:17.2963803Z def _helion_softmax_two_pass(x, out):
2026-02-21T10:10:17.2964060Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T10:10:17.2964307Z     pid_0 = tl.program_id(0)
2026-02-21T10:10:17.2964477Z     offset_0 = pid_0
2026-02-21T10:10:17.2964647Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T10:10:17.2964930Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T10:10:17.2965225Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T10:10:17.2965484Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T10:10:17.2965739Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T10:10:17.2966247Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T10:10:17.2966529Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T10:10:17.2966792Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T10:10:17.2967028Z     # src[softmax.py:82-89]: ...
2026-02-21T10:10:17.2967384Z     for offset_2 in tl.range(0, 12032, _BLOCK_SIZE_1, loop_unroll_factor=4, warp_specialize=False, disallow_acc_multi_buffer=False):
2026-02-21T10:10:17.2967783Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T10:10:17.2968019Z         mask_1 = indices_2 < 12032
2026-02-21T10:10:17.2968181Z         mi_copy = mi
2026-02-21T10:10:17.2968347Z         di_copy = di
2026-02-21T10:10:17.2968494Z         mi_copy_0 = mi_copy
2026-02-21T10:10:17.2968645Z         di_copy_0 = di_copy
2026-02-21T10:10:17.2968887Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T10:10:17.2969259Z         values = tl.load(x + (indices_0[:, None] * 12032 + indices_2[None, :] * 1), mask_1[None, :], other=0, eviction_policy='evict_first')
2026-02-21T10:10:17.2969657Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T10:10:17.2970113Z         _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16))
2026-02-21T10:10:17.2970500Z         local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16)
2026-02-21T10:10:17.2970765Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T10:10:17.2971000Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T10:10:17.2971215Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T10:10:17.2971477Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T10:10:17.2971782Z         v_2 = mi_copy_0 - v_1
2026-02-21T10:10:17.2971959Z         v_3 = libdevice.exp(v_2)
2026-02-21T10:10:17.2972125Z         v_4 = di_copy_0 * v_3
2026-02-21T10:10:17.2972323Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T10:10:17.2972524Z         subscript = v_1[:, None]
2026-02-21T10:10:17.2972703Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T10:10:17.2972920Z         v_6 = v_5 - subscript
2026-02-21T10:10:17.2973137Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T10:10:17.2973403Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T10:10:17.2973612Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T10:10:17.2973798Z         v_7 = libdevice.exp(v_6)
2026-02-21T10:10:17.2974109Z         _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32))
2026-02-21T10:10:17.2974466Z         sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32)
2026-02-21T10:10:17.2974660Z         di = v_4 + sum_1
2026-02-21T10:10:17.2974829Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T10:10:17.2975006Z         mi = v_1
2026-02-21T10:10:17.2975205Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T10:10:17.2975475Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T10:10:17.2975768Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T10:10:17.2976214Z     for offset_2 in tl.range(0, 12032, _BLOCK_SIZE_1, loop_unroll_factor=4, warp_specialize=False, disallow_acc_multi_buffer=False):
2026-02-21T10:10:17.2976610Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T10:10:17.2976846Z         mask_2 = indices_2 < 12032
2026-02-21T10:10:17.2977018Z         mi_copy_1 = mi
2026-02-21T10:10:17.2977161Z         di_copy_1 = di
2026-02-21T10:10:17.2977315Z         mi_copy_1_0 = mi_copy_1
2026-02-21T10:10:17.2977475Z         di_copy_1_0 = di_copy_1
2026-02-21T10:10:17.2977666Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T10:10:17.2978032Z         values_1 = tl.load(x + (indices_0[:, None] * 12032 + indices_2[None, :] * 1), mask_2[None, :], other=0, eviction_policy='evict_first')
2026-02-21T10:10:17.2978512Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T10:10:17.2978797Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T10:10:17.2978987Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T10:10:17.2979181Z         v_10 = v_9 - subscript_1
2026-02-21T10:10:17.2979377Z         v_11 = libdevice.exp(v_10)
2026-02-21T10:10:17.2979558Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T10:10:17.2979740Z         v_12 = v_11 / subscript_2
2026-02-21T10:10:17.2979925Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T10:10:17.2980199Z         tl.store(out + (indices_0[:, None] * 12032 + indices_2[None, :] * 1), v_13, mask_2[None, :])
2026-02-21T10:10:17.2980422Z 
2026-02-21T10:10:17.2980587Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T10:10:17.2980827Z     """
2026-02-21T10:10:17.2981029Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T10:10:17.2981335Z     This version uses fewer passes but is less numerically stable.
2026-02-21T10:10:17.2981585Z     Args:
2026-02-21T10:10:17.2981808Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T10:10:17.2982001Z     Returns:
2026-02-21T10:10:17.2982185Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T10:10:17.2982397Z     """
2026-02-21T10:10:17.2982532Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T10:10:17.2982715Z     m, n = x.size()
2026-02-21T10:10:17.2982882Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T10:10:17.2983089Z     out = torch.empty_like(x)
2026-02-21T10:10:17.2983310Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T10:10:17.2983630Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T10:10:17.2983934Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T10:10:17.2984173Z     # src[softmax.py:79-92]: ...
2026-02-21T10:10:17.2984429Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=4, num_stages=6)
2026-02-21T10:10:17.2984725Z     # src[softmax.py:93]: return out
2026-02-21T10:10:17.2984903Z     return out
2026-02-21T10:10:18.5000414Z WARNING:tritonbench.utils.triton_op:Completed input ID 92:
2026-02-21T10:10:18.5004674Z (M, N)
2026-02-21T10:10:18.5009195Z -------------
2026-02-21T10:10:18.5013284Z (4096, 12032)
2026-02-21T10:10:18.5013448Z 
2026-02-21T10:10:18.5018978Z  95%|█████████▌| 19/20 [1:01:23<03:46, 226.76s/it]
2026-02-21T10:10:18.5018978Z WARNING:tritonbench.utils.triton_op:Running input ID 97:
2026-02-21T10:10:18.5022927Z (M, N)
2026-02-21T10:10:18.5024612Z -------------
2026-02-21T10:10:18.5024790Z (4096, 12672)
2026-02-21T10:10:18.5025064Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for naive_softmax
2026-02-21T10:10:19.6547503Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_softmax
2026-02-21T10:10:21.0207058Z INFO:tritonbench.utils.triton_op:Took 2.09ms to get benchmark function for torch_compile_softmax
2026-02-21T10:10:22.3439481Z WARNING:__main__:Input tensor metadata:
2026-02-21T10:10:22.3443473Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T10:10:22.3445542Z               'dtype': 'torch.float16',
2026-02-21T10:10:22.3445802Z               'shape': (4096, 12672),
2026-02-21T10:10:22.3446012Z               'stride': (12672, 1)},),
2026-02-21T10:10:22.3446218Z   'kwargs': {}}
2026-02-21T10:10:22.3458069Z INFO:tritonbench.utils.triton_op:Took 1.79ms to get benchmark function for helion_softmax_tritonbench
2026-02-21T10:10:22.5184967Z [0s] Autotune random seed: 2138408546
2026-02-21T10:10:22.5433727Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T10:11:00.2231038Z [37s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', ''], maxnreg=32, num_sm_multiplier=2, num_stages=8, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[3, 2], range_unroll_factors=[4, 1], range_warp_specializes=[False, False])
2026-02-21T10:11:03.5032819Z [40s] Timeout after 30s compiling Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, None])
2026-02-21T10:11:06.3581526Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s
2026-02-21T10:11:08.8679513Z module {
2026-02-21T10:11:08.8683806Z   tt.func public @_helion_softmax_two_pass(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T10:11:08.8688445Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T10:11:08.8692413Z     %cst = arith.constant dense<0.000000e+00> : tensor<8x1024xf16>
2026-02-21T10:11:08.8696856Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T10:11:08.8699021Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T10:11:08.8699262Z     %c592_i32 = arith.constant 592 : i32
2026-02-21T10:11:08.8699500Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<8x1024xf32>
2026-02-21T10:11:08.8699776Z     %cst_1 = arith.constant dense<0xFC00> : tensor<8x1024xf16>
2026-02-21T10:11:08.8700031Z     %cst_2 = arith.constant dense<12672> : tensor<8x1xi32>
2026-02-21T10:11:08.8700265Z     %cst_3 = arith.constant dense<12672> : tensor<1024xi32>
2026-02-21T10:11:08.8700513Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<8xf32>
2026-02-21T10:11:08.8700760Z     %cst_5 = arith.constant dense<0xFF800000> : tensor<8xf32>
2026-02-21T10:11:08.8700990Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T10:11:08.8701176Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T10:11:08.8701369Z     %c12672_i32 = arith.constant 12672 : i32
2026-02-21T10:11:08.8701761Z     %c12672_i64 = arith.constant 12672 : i64
2026-02-21T10:11:08.8702193Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T10:11:08.8702529Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c12672_i32], [%c12672_i64, %c1_i64] : <f16>, <tensor<8x1024xf16>>
2026-02-21T10:11:08.8702858Z     %1 = tt.get_program_id x : i32
2026-02-21T10:11:08.8703074Z     scf.for %arg2 = %1 to %c512_i32 step %c592_i32  : i32 {
2026-02-21T10:11:08.8703288Z       %2 = arith.muli %arg2, %c8_i32 : i32
2026-02-21T10:11:08.8703520Z       %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T10:11:08.8703763Z       %4 = tt.splat %2 : i32 -> tensor<8xi32>
2026-02-21T10:11:08.8703960Z       %5 = arith.addi %4, %3 : tensor<8xi32>
2026-02-21T10:11:08.8704154Z       %c12288_i32 = arith.constant 12288 : i32
2026-02-21T10:11:08.8704339Z       %c3072_i32 = arith.constant 3072 : i32
2026-02-21T10:11:08.8704714Z       %6:2 = scf.for %arg3 = %c0_i32 to %c12288_i32 step %c3072_i32 iter_args(%arg4 = %cst_5, %arg5 = %cst_4) -> (tensor<8xf32>, tensor<8xf32>)  : i32 {
2026-02-21T10:11:08.8705141Z         %66 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T10:11:08.8705408Z         %67 = tt.splat %arg3 : i32 -> tensor<1024xi32>
2026-02-21T10:11:08.8705618Z         %68 = arith.addi %67, %66 : tensor<1024xi32>
2026-02-21T10:11:08.8705844Z         %69 = arith.cmpi slt, %68, %cst_3 : tensor<1024xi32>
2026-02-21T10:11:08.8706111Z         %70 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T10:11:08.8706366Z         %71 = arith.muli %70, %cst_2 : tensor<8x1xi32>
2026-02-21T10:11:08.8706629Z         %72 = tt.expand_dims %68 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T10:11:08.8706926Z         %73 = tt.broadcast %71 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T10:11:08.8707282Z         %74 = tt.broadcast %72 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T10:11:08.8707665Z         %75 = arith.addi %73, %74 : tensor<8x1024xi32>
2026-02-21T10:11:08.8707909Z         %76 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T10:11:08.8708231Z         %77 = tt.addptr %76, %75 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T10:11:08.8711285Z         %78 = tt.expand_dims %69 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T10:11:08.8711619Z         %79 = tt.broadcast %78 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T10:11:08.8711918Z         %80 = tt.load %77, %79, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T10:11:08.8712186Z         %81 = arith.select %79, %80, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T10:11:08.8712477Z         %82 = arith.extf %81 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T10:11:08.8712772Z         %83 = "tt.reduce"(%82) <{axis = 1 : i32}> ({
2026-02-21T10:11:08.8712968Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T10:11:08.8713168Z           %175 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T10:11:08.8713361Z           tt.reduce.return %175 : f32
2026-02-21T10:11:08.8713567Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T10:11:08.8713834Z         %84 = arith.truncf %83 : tensor<8xf32> to tensor<8xf16>
2026-02-21T10:11:08.8714076Z         %85 = arith.extf %84 : tensor<8xf16> to tensor<8xf32>
2026-02-21T10:11:08.8714308Z         %86 = arith.cmpf ogt, %arg4, %85 : tensor<8xf32>
2026-02-21T10:11:08.8714530Z         %87 = arith.cmpf une, %arg4, %arg4 : tensor<8xf32>
2026-02-21T10:11:08.8714745Z         %88 = arith.ori %86, %87 : tensor<8xi1>
2026-02-21T10:11:08.8714968Z         %89 = arith.select %88, %arg4, %85 : tensor<8xi1>, tensor<8xf32>
2026-02-21T10:11:08.8715212Z         %90 = arith.subf %arg4, %89 : tensor<8xf32>
2026-02-21T10:11:08.8715569Z         %91 = tt.extern_elementwise %90 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T10:11:08.8715934Z         %92 = arith.mulf %arg5, %91 : tensor<8xf32>
2026-02-21T10:11:08.8716191Z         %93 = tt.expand_dims %89 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T10:11:08.8716511Z         %94 = arith.extf %80 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T10:11:08.8716785Z         %95 = tt.broadcast %93 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T10:11:08.8717016Z         %96 = arith.subf %94, %95 : tensor<8x1024xf32>
2026-02-21T10:11:08.8717380Z         %97 = tt.extern_elementwise %96 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T10:11:08.8717786Z         %98 = arith.select %79, %97, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T10:11:08.8718045Z         %99 = "tt.reduce"(%98) <{axis = 1 : i32}> ({
2026-02-21T10:11:08.8718245Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T10:11:08.8718425Z           %175 = arith.addf %arg6, %arg7 : f32
2026-02-21T10:11:08.8718618Z           tt.reduce.return %175 : f32
2026-02-21T10:11:08.8718799Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T10:11:08.8719006Z         %100 = arith.addf %92, %99 : tensor<8xf32>
2026-02-21T10:11:08.8719206Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T10:11:08.8719393Z         %101 = arith.muli %c1024_i32, %c1_i32 : i32
2026-02-21T10:11:08.8719589Z         %102 = arith.addi %arg3, %101 : i32
2026-02-21T10:11:08.8719828Z         %103 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T10:11:08.8720092Z         %104 = tt.splat %102 : i32 -> tensor<1024xi32>
2026-02-21T10:11:08.8720308Z         %105 = arith.addi %104, %103 : tensor<1024xi32>
2026-02-21T10:11:08.8720544Z         %106 = arith.cmpi slt, %105, %cst_3 : tensor<1024xi32>
2026-02-21T10:11:08.8720826Z         %107 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T10:11:08.8721105Z         %108 = arith.muli %107, %cst_2 : tensor<8x1xi32>
2026-02-21T10:11:08.8721429Z         %109 = tt.expand_dims %105 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T10:11:08.8721779Z         %110 = tt.broadcast %108 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T10:11:08.8722073Z         %111 = tt.broadcast %109 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T10:11:08.8722335Z         %112 = arith.addi %110, %111 : tensor<8x1024xi32>
2026-02-21T10:11:08.8722594Z         %113 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T10:11:08.8722902Z         %114 = tt.addptr %113, %112 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T10:11:08.8723229Z         %115 = tt.expand_dims %106 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T10:11:08.8723551Z         %116 = tt.broadcast %115 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T10:11:08.8723824Z         %117 = tt.load %114, %116, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T10:11:08.8724155Z         %118 = arith.select %116, %117, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T10:11:08.8724470Z         %119 = arith.extf %118 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T10:11:08.8724723Z         %120 = "tt.reduce"(%119) <{axis = 1 : i32}> ({
2026-02-21T10:11:08.8724963Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T10:11:08.8725154Z           %175 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T10:11:08.8725358Z           tt.reduce.return %175 : f32
2026-02-21T10:11:08.8725549Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T10:11:08.8725788Z         %121 = arith.truncf %120 : tensor<8xf32> to tensor<8xf16>
2026-02-21T10:11:08.8726042Z         %122 = arith.extf %121 : tensor<8xf16> to tensor<8xf32>
2026-02-21T10:11:08.8726288Z         %123 = arith.cmpf ogt, %89, %122 : tensor<8xf32>
2026-02-21T10:11:08.8726518Z         %124 = arith.cmpf une, %89, %89 : tensor<8xf32>
2026-02-21T10:11:08.8726729Z         %125 = arith.ori %123, %124 : tensor<8xi1>
2026-02-21T10:11:08.8726972Z         %126 = arith.select %125, %89, %122 : tensor<8xi1>, tensor<8xf32>
2026-02-21T10:11:08.8727218Z         %127 = arith.subf %89, %126 : tensor<8xf32>
2026-02-21T10:11:08.8727640Z         %128 = tt.extern_elementwise %127 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T10:11:08.8728009Z         %129 = arith.mulf %100, %128 : tensor<8xf32>
2026-02-21T10:11:08.8728257Z         %130 = tt.expand_dims %126 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T10:11:08.8728557Z         %131 = arith.extf %117 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T10:11:08.8728825Z         %132 = tt.broadcast %130 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T10:11:08.8729074Z         %133 = arith.subf %131, %132 : tensor<8x1024xf32>
2026-02-21T10:11:08.8729446Z         %134 = tt.extern_elementwise %133 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T10:11:08.8729873Z         %135 = arith.select %116, %134, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T10:11:08.8730144Z         %136 = "tt.reduce"(%135) <{axis = 1 : i32}> ({
2026-02-21T10:11:08.8730333Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T10:11:08.8730519Z           %175 = arith.addf %arg6, %arg7 : f32
2026-02-21T10:11:08.8730754Z           tt.reduce.return %175 : f32
2026-02-21T10:11:08.8730938Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T10:11:08.8731143Z         %137 = arith.addf %129, %136 : tensor<8xf32>
2026-02-21T10:11:08.8731342Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T10:11:08.8731528Z         %138 = arith.muli %c1024_i32, %c2_i32 : i32
2026-02-21T10:11:08.8731772Z         %139 = arith.addi %arg3, %138 : i32
2026-02-21T10:11:08.8732017Z         %140 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T10:11:08.8732281Z         %141 = tt.splat %139 : i32 -> tensor<1024xi32>
2026-02-21T10:11:08.8732492Z         %142 = arith.addi %141, %140 : tensor<1024xi32>
2026-02-21T10:11:08.8732774Z         %143 = arith.cmpi slt, %142, %cst_3 : tensor<1024xi32>
2026-02-21T10:11:08.8733040Z         %144 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T10:11:08.8733309Z         %145 = arith.muli %144, %cst_2 : tensor<8x1xi32>
2026-02-21T10:11:08.8733580Z         %146 = tt.expand_dims %142 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T10:11:08.8733887Z         %147 = tt.broadcast %145 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T10:11:08.8734166Z         %148 = tt.broadcast %146 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T10:11:08.8734418Z         %149 = arith.addi %147, %148 : tensor<8x1024xi32>
2026-02-21T10:11:08.8734666Z         %150 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T10:11:08.8734950Z         %151 = tt.addptr %150, %149 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T10:11:08.8735302Z         %152 = tt.expand_dims %143 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T10:11:08.8735607Z         %153 = tt.broadcast %152 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T10:11:08.8735861Z         %154 = tt.load %151, %153, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T10:11:08.8736169Z         %155 = arith.select %153, %154, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T10:11:08.8736451Z         %156 = arith.extf %155 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T10:11:08.8736691Z         %157 = "tt.reduce"(%156) <{axis = 1 : i32}> ({
2026-02-21T10:11:08.8736879Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T10:11:08.8737070Z           %175 = arith.maxnumf %arg6, %arg7 : f32
2026-02-21T10:11:08.8737267Z           tt.reduce.return %175 : f32
2026-02-21T10:11:08.8737449Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T10:11:08.8737677Z         %158 = arith.truncf %157 : tensor<8xf32> to tensor<8xf16>
2026-02-21T10:11:08.8737916Z         %159 = arith.extf %158 : tensor<8xf16> to tensor<8xf32>
2026-02-21T10:11:08.8738152Z         %160 = arith.cmpf ogt, %126, %159 : tensor<8xf32>
2026-02-21T10:11:08.8738364Z         %161 = arith.cmpf une, %126, %126 : tensor<8xf32>
2026-02-21T10:11:08.8738603Z         %162 = arith.ori %160, %161 : tensor<8xi1>
2026-02-21T10:11:08.8738840Z         %163 = arith.select %162, %126, %159 : tensor<8xi1>, tensor<8xf32>
2026-02-21T10:11:08.8739077Z         %164 = arith.subf %126, %163 : tensor<8xf32>
2026-02-21T10:11:08.8739439Z         %165 = tt.extern_elementwise %164 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T10:11:08.8739796Z         %166 = arith.mulf %137, %165 : tensor<8xf32>
2026-02-21T10:11:08.8740050Z         %167 = tt.expand_dims %163 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T10:11:08.8740333Z         %168 = arith.extf %154 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T10:11:08.8740604Z         %169 = tt.broadcast %167 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T10:11:08.8740852Z         %170 = arith.subf %168, %169 : tensor<8x1024xf32>
2026-02-21T10:11:08.8741223Z         %171 = tt.extern_elementwise %170 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T10:11:08.8741678Z         %172 = arith.select %153, %171, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T10:11:08.8741937Z         %173 = "tt.reduce"(%172) <{axis = 1 : i32}> ({
2026-02-21T10:11:08.8742136Z         ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T10:11:08.8742323Z           %175 = arith.addf %arg6, %arg7 : f32
2026-02-21T10:11:08.8742508Z           tt.reduce.return %175 : f32
2026-02-21T10:11:08.8742696Z         }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T10:11:08.8742891Z         %174 = arith.addf %166, %173 : tensor<8xf32>
2026-02-21T10:11:08.8743114Z         scf.yield %163, %174 : tensor<8xf32>, tensor<8xf32>
2026-02-21T10:11:08.8743314Z       } {tt.flatten}
2026-02-21T10:11:08.8743522Z       %7 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T10:11:08.8743855Z       %8 = tt.splat %c12288_i32 : i32 -> tensor<1024xi32>
2026-02-21T10:11:08.8744074Z       %9 = arith.addi %8, %7 : tensor<1024xi32>
2026-02-21T10:11:08.8744291Z       %10 = arith.cmpi slt, %9, %cst_3 : tensor<1024xi32>
2026-02-21T10:11:08.8744548Z       %11 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T10:11:08.8744805Z       %12 = arith.muli %11, %cst_2 : tensor<8x1xi32>
2026-02-21T10:11:08.8745053Z       %13 = tt.expand_dims %9 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T10:11:08.8745357Z       %14 = tt.broadcast %12 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T10:11:08.8745625Z       %15 = tt.broadcast %13 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T10:11:08.8745871Z       %16 = arith.addi %14, %15 : tensor<8x1024xi32>
2026-02-21T10:11:08.8746144Z       %17 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T10:11:08.8746419Z       %18 = tt.addptr %17, %16 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T10:11:08.8746722Z       %19 = tt.expand_dims %10 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T10:11:08.8747008Z       %20 = tt.broadcast %19 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T10:11:08.8747292Z       %21 = tt.load %18, %20, %cst : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T10:11:08.8747559Z       %22 = arith.select %20, %21, %cst_1 : tensor<8x1024xi1>, tensor<8x1024xf16>
2026-02-21T10:11:08.8747834Z       %23 = arith.extf %22 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T10:11:08.8748068Z       %24 = "tt.reduce"(%23) <{axis = 1 : i32}> ({
2026-02-21T10:11:08.8748257Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T10:11:08.8748445Z         %66 = arith.maxnumf %arg3, %arg4 : f32
2026-02-21T10:11:08.8748633Z         tt.reduce.return %66 : f32
2026-02-21T10:11:08.8748822Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T10:11:08.8749040Z       %25 = arith.truncf %24 : tensor<8xf32> to tensor<8xf16>
2026-02-21T10:11:08.8749279Z       %26 = arith.extf %25 : tensor<8xf16> to tensor<8xf32>
2026-02-21T10:11:08.8749505Z       %27 = arith.cmpf ogt, %6#0, %26 : tensor<8xf32>
2026-02-21T10:11:08.8749752Z       %28 = arith.cmpf une, %6#0, %6#0 : tensor<8xf32>
2026-02-21T10:11:08.8749960Z       %29 = arith.ori %27, %28 : tensor<8xi1>
2026-02-21T10:11:08.8750178Z       %30 = arith.select %29, %6#0, %26 : tensor<8xi1>, tensor<8xf32>
2026-02-21T10:11:08.8750407Z       %31 = arith.subf %6#0, %30 : tensor<8xf32>
2026-02-21T10:11:08.8750757Z       %32 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8xf32>) -> tensor<8xf32>
2026-02-21T10:11:08.8751120Z       %33 = arith.mulf %6#1, %32 : tensor<8xf32>
2026-02-21T10:11:08.8751372Z       %34 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T10:11:08.8751691Z       %35 = arith.extf %21 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T10:11:08.8751959Z       %36 = tt.broadcast %34 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T10:11:08.8752192Z       %37 = arith.subf %35, %36 : tensor<8x1024xf32>
2026-02-21T10:11:08.8752577Z       %38 = tt.extern_elementwise %37 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T10:11:08.8752991Z       %39 = arith.select %20, %38, %cst_0 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T10:11:08.8753240Z       %40 = "tt.reduce"(%39) <{axis = 1 : i32}> ({
2026-02-21T10:11:08.8753440Z       ^bb0(%arg3: f32, %arg4: f32):
2026-02-21T10:11:08.8753620Z         %66 = arith.addf %arg3, %arg4 : f32
2026-02-21T10:11:08.8753812Z         tt.reduce.return %66 : f32
2026-02-21T10:11:08.8753995Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T10:11:08.8754200Z       %41 = arith.addf %33, %40 : tensor<8xf32>
2026-02-21T10:11:08.8754414Z       %c12288_i32_6 = arith.constant 12288 : i32
2026-02-21T10:11:08.8754611Z       %c3072_i32_7 = arith.constant 3072 : i32
2026-02-21T10:11:08.8754901Z       scf.for %arg3 = %c0_i32 to %c12288_i32_6 step %c3072_i32_7  : i32 {
2026-02-21T10:11:08.8755183Z         %66 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T10:11:08.8755448Z         %67 = tt.splat %arg3 : i32 -> tensor<1024xi32>
2026-02-21T10:11:08.8755653Z         %68 = arith.addi %67, %66 : tensor<1024xi32>
2026-02-21T10:11:08.8755873Z         %69 = arith.cmpi slt, %68, %cst_3 : tensor<1024xi32>
2026-02-21T10:11:08.8756185Z         %70 = tt.descriptor_load %0[%2, %arg3] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T10:11:08.8756524Z         %71 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T10:11:08.8756815Z         %72 = arith.extf %70 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T10:11:08.8757074Z         %73 = tt.broadcast %71 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T10:11:08.8757345Z         %74 = arith.subf %72, %73 : tensor<8x1024xf32>
2026-02-21T10:11:08.8757711Z         %75 = tt.extern_elementwise %74 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T10:11:08.8758126Z         %76 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T10:11:08.8758442Z         %77 = tt.broadcast %76 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T10:11:08.8758671Z         %78 = arith.divf %75, %77 : tensor<8x1024xf32>
2026-02-21T10:11:08.8758912Z         %79 = arith.truncf %78 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T10:11:08.8759192Z         %80 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T10:11:08.8759456Z         %81 = arith.muli %80, %cst_2 : tensor<8x1xi32>
2026-02-21T10:11:08.8759773Z         %82 = tt.expand_dims %68 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T10:11:08.8760058Z         %83 = tt.broadcast %81 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T10:11:08.8760321Z         %84 = tt.broadcast %82 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T10:11:08.8760553Z         %85 = arith.addi %83, %84 : tensor<8x1024xi32>
2026-02-21T10:11:08.8760821Z         %86 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T10:11:08.8761107Z         %87 = tt.addptr %86, %85 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T10:11:08.8761414Z         %88 = tt.expand_dims %69 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T10:11:08.8761745Z         %89 = tt.broadcast %88 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T10:11:08.8761990Z         tt.store %87, %79, %89 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T10:11:08.8762213Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T10:11:08.8762401Z         %90 = arith.muli %c1024_i32, %c1_i32 : i32
2026-02-21T10:11:08.8762598Z         %91 = arith.addi %arg3, %90 : i32
2026-02-21T10:11:08.8762829Z         %92 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T10:11:08.8763087Z         %93 = tt.splat %91 : i32 -> tensor<1024xi32>
2026-02-21T10:11:08.8763295Z         %94 = arith.addi %93, %92 : tensor<1024xi32>
2026-02-21T10:11:08.8763506Z         %95 = arith.cmpi slt, %94, %cst_3 : tensor<1024xi32>
2026-02-21T10:11:08.8763808Z         %96 = tt.descriptor_load %0[%2, %91] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T10:11:08.8764137Z         %97 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T10:11:08.8764421Z         %98 = arith.extf %96 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T10:11:08.8764679Z         %99 = tt.broadcast %97 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T10:11:08.8764923Z         %100 = arith.subf %98, %99 : tensor<8x1024xf32>
2026-02-21T10:11:08.8765299Z         %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T10:11:08.8765712Z         %102 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T10:11:08.8766030Z         %103 = tt.broadcast %102 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T10:11:08.8766275Z         %104 = arith.divf %101, %103 : tensor<8x1024xf32>
2026-02-21T10:11:08.8766524Z         %105 = arith.truncf %104 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T10:11:08.8766815Z         %106 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T10:11:08.8767075Z         %107 = arith.muli %106, %cst_2 : tensor<8x1xi32>
2026-02-21T10:11:08.8767340Z         %108 = tt.expand_dims %94 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T10:11:08.8767635Z         %109 = tt.broadcast %107 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T10:11:08.8767909Z         %110 = tt.broadcast %108 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T10:11:08.8768155Z         %111 = arith.addi %109, %110 : tensor<8x1024xi32>
2026-02-21T10:11:08.8768432Z         %112 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T10:11:08.8768723Z         %113 = tt.addptr %112, %111 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T10:11:08.8769028Z         %114 = tt.expand_dims %95 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T10:11:08.8769357Z         %115 = tt.broadcast %114 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T10:11:08.8769613Z         tt.store %113, %105, %115 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T10:11:08.8769838Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T10:11:08.8770035Z         %116 = arith.muli %c1024_i32, %c2_i32 : i32
2026-02-21T10:11:08.8770229Z         %117 = arith.addi %arg3, %116 : i32
2026-02-21T10:11:08.8770475Z         %118 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T10:11:08.8770735Z         %119 = tt.splat %117 : i32 -> tensor<1024xi32>
2026-02-21T10:11:08.8770951Z         %120 = arith.addi %119, %118 : tensor<1024xi32>
2026-02-21T10:11:08.8771170Z         %121 = arith.cmpi slt, %120, %cst_3 : tensor<1024xi32>
2026-02-21T10:11:08.8771485Z         %122 = tt.descriptor_load %0[%2, %117] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T10:11:08.8771913Z         %123 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T10:11:08.8772217Z         %124 = arith.extf %122 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T10:11:08.8772501Z         %125 = tt.broadcast %123 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T10:11:08.8772749Z         %126 = arith.subf %124, %125 : tensor<8x1024xf32>
2026-02-21T10:11:08.8773150Z         %127 = tt.extern_elementwise %126 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T10:11:08.8773595Z         %128 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T10:11:08.8773895Z         %129 = tt.broadcast %128 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T10:11:08.8774154Z         %130 = arith.divf %127, %129 : tensor<8x1024xf32>
2026-02-21T10:11:08.8774409Z         %131 = arith.truncf %130 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T10:11:08.8774712Z         %132 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T10:11:08.8774986Z         %133 = arith.muli %132, %cst_2 : tensor<8x1xi32>
2026-02-21T10:11:08.8775277Z         %134 = tt.expand_dims %120 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T10:11:08.8775597Z         %135 = tt.broadcast %133 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T10:11:08.8775878Z         %136 = tt.broadcast %134 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T10:11:08.8776145Z         %137 = arith.addi %135, %136 : tensor<8x1024xi32>
2026-02-21T10:11:08.8776395Z         %138 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T10:11:08.8776703Z         %139 = tt.addptr %138, %137 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T10:11:08.8777028Z         %140 = tt.expand_dims %121 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T10:11:08.8777378Z         %141 = tt.broadcast %140 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T10:11:08.8777657Z         tt.store %139, %131, %141 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T10:11:08.8777883Z       } {tt.flatten}
2026-02-21T10:11:08.8778101Z       %42 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T10:11:08.8778385Z       %43 = tt.splat %c12288_i32_6 : i32 -> tensor<1024xi32>
2026-02-21T10:11:08.8778619Z       %44 = arith.addi %43, %42 : tensor<1024xi32>
2026-02-21T10:11:08.8778976Z       %45 = arith.cmpi slt, %44, %cst_3 : tensor<1024xi32>
2026-02-21T10:11:08.8779333Z       %46 = tt.descriptor_load %0[%2, %c12288_i32_6] : !tt.tensordesc<tensor<8x1024xf16>> -> tensor<8x1024xf16>
2026-02-21T10:11:08.8779718Z       %47 = tt.expand_dims %30 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T10:11:08.8780046Z       %48 = arith.extf %46 : tensor<8x1024xf16> to tensor<8x1024xf32>
2026-02-21T10:11:08.8780313Z       %49 = tt.broadcast %47 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T10:11:08.8780544Z       %50 = arith.subf %48, %49 : tensor<8x1024xf32>
2026-02-21T10:11:08.8780916Z       %51 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T10:11:08.8781349Z       %52 = tt.expand_dims %41 {axis = 1 : i32} : tensor<8xf32> -> tensor<8x1xf32>
2026-02-21T10:11:08.8781654Z       %53 = tt.broadcast %52 : tensor<8x1xf32> -> tensor<8x1024xf32>
2026-02-21T10:11:08.8781883Z       %54 = arith.divf %51, %53 : tensor<8x1024xf32>
2026-02-21T10:11:08.8782123Z       %55 = arith.truncf %54 : tensor<8x1024xf32> to tensor<8x1024xf16>
2026-02-21T10:11:08.8782405Z       %56 = tt.expand_dims %5 {axis = 1 : i32} : tensor<8xi32> -> tensor<8x1xi32>
2026-02-21T10:11:08.8782668Z       %57 = arith.muli %56, %cst_2 : tensor<8x1xi32>
2026-02-21T10:11:08.8782931Z       %58 = tt.expand_dims %44 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T10:11:08.8783226Z       %59 = tt.broadcast %57 : tensor<8x1xi32> -> tensor<8x1024xi32>
2026-02-21T10:11:08.8783524Z       %60 = tt.broadcast %58 : tensor<1x1024xi32> -> tensor<8x1024xi32>
2026-02-21T10:11:08.8783767Z       %61 = arith.addi %59, %60 : tensor<8x1024xi32>
2026-02-21T10:11:08.8783998Z       %62 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<8x1024x!tt.ptr<f16>>
2026-02-21T10:11:08.8784275Z       %63 = tt.addptr %62, %61 : tensor<8x1024x!tt.ptr<f16>>, tensor<8x1024xi32>
2026-02-21T10:11:08.8784566Z       %64 = tt.expand_dims %45 {axis = 0 : i32} : tensor<1024xi1> -> tensor<1x1024xi1>
2026-02-21T10:11:08.8784860Z       %65 = tt.broadcast %64 : tensor<1x1024xi1> -> tensor<8x1024xi1>
2026-02-21T10:11:08.8785110Z       tt.store %63, %55, %65 : tensor<8x1024x!tt.ptr<f16>>
2026-02-21T10:11:08.8785476Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T10:11:08.8785820Z     tt.return
2026-02-21T10:11:08.8785948Z   }
2026-02-21T10:11:08.8786075Z }
2026-02-21T10:11:08.8786144Z 
2026-02-21T10:11:08.8786193Z {-#
2026-02-21T10:11:08.8786332Z   external_resources: {
2026-02-21T10:11:08.8786490Z     mlir_reproducer: {
2026-02-21T10:11:08.8790806Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T10:11:08.8795263Z       disable_threading: false,
2026-02-21T10:11:08.8795438Z       verify_each: true
2026-02-21T10:11:08.8795581Z     }
2026-02-21T10:11:08.8795707Z   }
2026-02-21T10:11:08.8795818Z #-}
2026-02-21T10:11:08.8796243Z /tmp/torchinductor_root/4d/c4dzntrkfazw2bra6phnmfryd2xwzdozc56m5imk4xkozxkd2n3l.py:20:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T10:11:08.8797438Z /tmp/torchinductor_root/4d/c4dzntrkfazw2bra6phnmfryd2xwzdozc56m5imk4xkozxkd2n3l.py:20:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T10:11:08.8798416Z [46s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T10:11:08.8799499Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=4, num_warps=32, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[4, 0], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T10:11:08.8800444Z Error: RuntimeError: PassManager::run failed
2026-02-21T10:11:08.8800696Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T10:11:16.0629448Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 10.2 configs/s
2026-02-21T10:11:16.0639878Z [53s] Adaptive compile timeout: 30s (90% percentile=19.2s, bounds=[30.0s, 30s])
2026-02-21T10:11:17.5460951Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 664.7 configs/s
2026-02-21T10:11:17.6304034Z [55s] Initial random population of 100, 5 starting points: 
2026-02-21T10:11:17.6308244Z error=12
2026-02-21T10:11:17.6309891Z timeout=2
2026-02-21T10:11:17.6310065Z ok=86
2026-02-21T10:11:17.6310211Z min=0.0901
2026-02-21T10:11:17.6310353Z mid=0.6880
2026-02-21T10:11:17.6310486Z max=293.4436
2026-02-21T10:11:17.6310634Z best={'block_sizes': [1, 1024],
2026-02-21T10:11:17.6310873Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:11:17.6311124Z  'load_eviction_policies': ['first', 'last'],
2026-02-21T10:11:17.6311313Z  'num_stages': 6,
2026-02-21T10:11:17.6311454Z  'num_warps': 4,
2026-02-21T10:11:17.6311686Z  'pid_type': 'flat',
2026-02-21T10:11:17.6311844Z  'range_flattens': [None, None],
2026-02-21T10:11:17.6312030Z  'range_multi_buffers': [None, True],
2026-02-21T10:11:17.6312212Z  'range_num_stages': [0, 0],
2026-02-21T10:11:17.6312385Z  'range_unroll_factors': [0, 4],
2026-02-21T10:11:17.6312574Z  'range_warp_specializes': [None, False]}
2026-02-21T10:11:17.6329152Z [55s] Fitting surrogate: 100 points, 100 targets
2026-02-21T10:11:18.7696217Z [56s] Generation 1 starting: 83 neighbors, 5 active search path(s)
2026-02-21T10:11:52.7590910Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 0.3 configs/s
2026-02-21T10:11:57.8599743Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 17.0 configs/s
2026-02-21T10:12:03.6672162Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 173.1         
2026-02-21T10:12:03.6673558Z                                                                   configs/s     
2026-02-21T10:12:03.9132603Z [101s] Generation 1 complete: 
2026-02-21T10:12:03.9135775Z ok=89
2026-02-21T10:12:03.9139705Z min=0.0737
2026-02-21T10:12:03.9143566Z mid=0.1127
2026-02-21T10:12:03.9148097Z max=0.7374
2026-02-21T10:12:03.9149745Z best={'block_sizes': [1, 512],
2026-02-21T10:12:03.9150078Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'pointer'],
2026-02-21T10:12:03.9154753Z  'load_eviction_policies': ['first', 'last'],
2026-02-21T10:12:03.9159111Z  'num_stages': 6,
2026-02-21T10:12:03.9160507Z  'num_warps': 1,
2026-02-21T10:12:03.9160975Z  'pid_type': 'flat',
2026-02-21T10:12:03.9161174Z  'range_flattens': [None, None],
2026-02-21T10:12:03.9161395Z  'range_multi_buffers': [None, True],
2026-02-21T10:12:03.9161817Z  'range_num_stages': [0, 0],
2026-02-21T10:12:03.9162017Z  'range_unroll_factors': [0, 4],
2026-02-21T10:12:03.9162318Z  'range_warp_specializes': [None, False]}
2026-02-21T10:12:03.9162636Z [101s] Fitting surrogate: 189 points, 189 targets
2026-02-21T10:12:04.9121884Z [102s] Generation 2 starting: 71 neighbors, 5 active search path(s)
2026-02-21T10:12:22.7382177Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 5.3 configs/s
2026-02-21T10:12:27.0999093Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 17.1 configs/s
2026-02-21T10:12:34.1513259Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 142.9         
2026-02-21T10:12:34.1513902Z                                                                   configs/s     
2026-02-21T10:12:34.4584483Z [131s] Generation 2 complete: 
2026-02-21T10:12:34.4588197Z ok=77
2026-02-21T10:12:34.4592653Z min=0.0676
2026-02-21T10:12:34.4597260Z mid=0.0840
2026-02-21T10:12:34.4597492Z max=0.2999
2026-02-21T10:12:34.4597648Z best={'block_sizes': [1, 1024],
2026-02-21T10:12:34.4598197Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:12:34.4598512Z  'load_eviction_policies': ['', 'last'],
2026-02-21T10:12:34.4598738Z  'num_stages': 3,
2026-02-21T10:12:34.4598913Z  'num_warps': 1,
2026-02-21T10:12:34.4599065Z  'pid_type': 'flat',
2026-02-21T10:12:34.4599287Z  'range_flattens': [None, False],
2026-02-21T10:12:34.4599500Z  'range_multi_buffers': [None, None],
2026-02-21T10:12:34.4605628Z  'range_num_stages': [0, 2],
2026-02-21T10:12:34.4605864Z  'range_unroll_factors': [0, 3],
2026-02-21T10:12:34.4606059Z  'range_warp_specializes': [None, None]}
2026-02-21T10:12:34.4606288Z [131s] Fitting surrogate: 266 points, 266 targets
2026-02-21T10:12:35.3041486Z [132s] Generation 3 starting: 58 neighbors, 5 active search path(s)
2026-02-21T10:12:50.2638018Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 1.8 configs/s
2026-02-21T10:12:53.9015223Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 62/62 17.3 configs/s
2026-02-21T10:13:00.6007286Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 150.4         
2026-02-21T10:13:00.6008660Z                                                                   configs/s     
2026-02-21T10:13:00.9107488Z [158s] Generation 3 complete: 
2026-02-21T10:13:00.9112268Z ok=64
2026-02-21T10:13:00.9115478Z min=0.0676
2026-02-21T10:13:00.9119948Z mid=0.0778
2026-02-21T10:13:00.9122538Z max=0.2601
2026-02-21T10:13:00.9122713Z best={'block_sizes': [1, 512],
2026-02-21T10:13:00.9122977Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T10:13:00.9123246Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T10:13:00.9123444Z  'num_stages': 3,
2026-02-21T10:13:00.9123585Z  'num_warps': 1,
2026-02-21T10:13:00.9123731Z  'pid_type': 'flat',
2026-02-21T10:13:00.9123911Z  'range_flattens': [None, True],
2026-02-21T10:13:00.9124437Z  'range_multi_buffers': [None, None],
2026-02-21T10:13:00.9124628Z  'range_num_stages': [0, 2],
2026-02-21T10:13:00.9124793Z  'range_unroll_factors': [0, 4],
2026-02-21T10:13:00.9124989Z  'range_warp_specializes': [None, None]}
2026-02-21T10:13:00.9125212Z [158s] Fitting surrogate: 330 points, 330 targets
2026-02-21T10:13:01.7151698Z [159s] Generation 4 starting: 52 neighbors, 5 active search path(s)
2026-02-21T10:13:14.5845382Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 2.8 configs/s
2026-02-21T10:13:17.7010833Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 17.3 configs/s
2026-02-21T10:13:23.5849780Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 171.2         
2026-02-21T10:13:23.5851024Z                                                                   configs/s     
2026-02-21T10:13:23.8584164Z [181s] Generation 4 complete: 
2026-02-21T10:13:23.8586149Z ok=57
2026-02-21T10:13:23.8586337Z min=0.0655
2026-02-21T10:13:23.8586483Z mid=0.0738
2026-02-21T10:13:23.8586604Z max=0.3216
2026-02-21T10:13:23.8586746Z best={'block_sizes': [1, 1024],
2026-02-21T10:13:23.8586995Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T10:13:23.8587460Z  'load_eviction_policies': ['last', 'first'],
2026-02-21T10:13:23.8587664Z  'num_stages': 8,
2026-02-21T10:13:23.8587809Z  'num_warps': 1,
2026-02-21T10:13:23.8587961Z  'pid_type': 'flat',
2026-02-21T10:13:23.8588116Z  'range_flattens': [None, None],
2026-02-21T10:13:23.8588300Z  'range_multi_buffers': [None, None],
2026-02-21T10:13:23.8588482Z  'range_num_stages': [0, 3],
2026-02-21T10:13:23.8588658Z  'range_unroll_factors': [0, 2],
2026-02-21T10:13:23.8588842Z  'range_warp_specializes': [None, None]}
2026-02-21T10:13:23.8599912Z [181s] Fitting surrogate: 387 points, 387 targets
2026-02-21T10:13:24.4364517Z [181s] Generation 5 starting: 34 neighbors, 3 active search path(s)
2026-02-21T10:13:32.8357743Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 3.1 configs/s
2026-02-21T10:13:34.8437697Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 34/34 17.3 configs/s
2026-02-21T10:13:38.5568633Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 270.8         
2026-02-21T10:13:38.5569033Z                                                                   configs/s     
2026-02-21T10:13:38.7475025Z [196s] Generation 5 complete: 
2026-02-21T10:13:38.7479935Z ok=37
2026-02-21T10:13:38.7483795Z min=0.0655
2026-02-21T10:13:38.7488760Z mid=0.0685
2026-02-21T10:13:38.7490747Z max=0.2909
2026-02-21T10:13:38.7490955Z best={'block_sizes': [1, 1024],
2026-02-21T10:13:38.7491270Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T10:13:38.7491658Z  'load_eviction_policies': ['last', 'first'],
2026-02-21T10:13:38.7491886Z  'num_stages': 8,
2026-02-21T10:13:38.7492042Z  'num_warps': 1,
2026-02-21T10:13:38.7492195Z  'pid_type': 'flat',
2026-02-21T10:13:38.7492380Z  'range_flattens': [None, None],
2026-02-21T10:13:38.7492582Z  'range_multi_buffers': [None, None],
2026-02-21T10:13:38.7492766Z  'range_num_stages': [0, 3],
2026-02-21T10:13:38.7492939Z  'range_unroll_factors': [0, 2],
2026-02-21T10:13:38.7493128Z  'range_warp_specializes': [None, None]}
2026-02-21T10:13:38.7493354Z [196s] Fitting surrogate: 424 points, 424 targets
2026-02-21T10:13:39.1828793Z [196s] Generation 6 starting: 21 neighbors, 2 active search path(s)
2026-02-21T10:13:44.6847005Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 5.3 configs/s
2026-02-21T10:13:45.9219371Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 17.6 configs/s
2026-02-21T10:13:48.1878981Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 441.3         
2026-02-21T10:13:48.1882932Z                                                                   configs/s     
2026-02-21T10:13:48.3109245Z [205s] Generation 6 complete: 
2026-02-21T10:13:48.3113411Z ok=23
2026-02-21T10:13:48.3115258Z min=0.0655
2026-02-21T10:13:48.3115482Z mid=0.0676
2026-02-21T10:13:48.3115772Z max=0.2459
2026-02-21T10:13:48.3115947Z best={'block_sizes': [1, 1024],
2026-02-21T10:13:48.3116239Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T10:13:48.3116571Z  'load_eviction_policies': ['last', 'first'],
2026-02-21T10:13:48.3116884Z  'num_stages': 8,
2026-02-21T10:13:48.3117032Z  'num_warps': 1,
2026-02-21T10:13:48.3117190Z  'pid_type': 'flat',
2026-02-21T10:13:48.3117354Z  'range_flattens': [None, None],
2026-02-21T10:13:48.3117555Z  'range_multi_buffers': [None, None],
2026-02-21T10:13:48.3117758Z  'range_num_stages': [0, 4],
2026-02-21T10:13:48.3117932Z  'range_unroll_factors': [0, 2],
2026-02-21T10:13:48.3118130Z  'range_warp_specializes': [None, None]}
2026-02-21T10:13:48.3144135Z [205s] Fitting surrogate: 447 points, 447 targets
2026-02-21T10:13:48.6066004Z [206s] Generation 7 starting: 11 neighbors, 1 active search path(s)
2026-02-21T10:13:51.5454995Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 5.7 configs/s
2026-02-21T10:13:52.1846939Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 11/11 18.6 configs/s
2026-02-21T10:13:53.5229308Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 741.6         
2026-02-21T10:13:53.5231068Z                                                                   configs/s     
2026-02-21T10:13:53.6046988Z [211s] Generation 7 complete: 
2026-02-21T10:13:53.6051925Z ok=13
2026-02-21T10:13:53.6055107Z min=0.0655
2026-02-21T10:13:53.6059564Z mid=0.0657
2026-02-21T10:13:53.6063932Z max=0.1107
2026-02-21T10:13:53.6065438Z best={'block_sizes': [1, 1024],
2026-02-21T10:13:53.6065718Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T10:13:53.6065959Z  'load_eviction_policies': ['', ''],
2026-02-21T10:13:53.6066153Z  'num_stages': 1,
2026-02-21T10:13:53.6066299Z  'num_warps': 1,
2026-02-21T10:13:53.6066452Z  'pid_type': 'flat',
2026-02-21T10:13:53.6066626Z  'range_flattens': [None, None],
2026-02-21T10:13:53.6066817Z  'range_multi_buffers': [None, None],
2026-02-21T10:13:53.6067015Z  'range_num_stages': [0, 2],
2026-02-21T10:13:53.6067181Z  'range_unroll_factors': [0, 2],
2026-02-21T10:13:53.6067369Z  'range_warp_specializes': [None, None]}
2026-02-21T10:13:53.6067689Z [211s] Fitting surrogate: 460 points, 460 targets
2026-02-21T10:13:53.7665435Z [211s] Autotuning complete in 211.2s after searching 439 configs.
2026-02-21T10:13:53.7667078Z One can hardcode the best config and skip autotuning with:
2026-02-21T10:13:53.7667985Z     @helion.kernel(config=helion.Config(block_sizes=[1, 1024], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', ''], num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[None, None]), static_shapes=True)
2026-02-21T10:13:53.7668757Z 
2026-02-21T10:13:53.7669017Z [211s] Code of selected kernel: /tmp/torchinductor_root/2i/c2iwy5mebrvu2qeluv3o7rszzw2fzncbw5e2bp6uphikash4umg5.py
2026-02-21T10:13:53.7895747Z from __future__ import annotations
2026-02-21T10:13:53.7895989Z 
2026-02-21T10:13:53.7900119Z import torch
2026-02-21T10:13:53.7904168Z import triton
2026-02-21T10:13:53.7908647Z import triton.language as tl
2026-02-21T10:13:53.7913217Z from torch._inductor.runtime import triton_helpers
2026-02-21T10:13:53.7915367Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T10:13:53.7915699Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T10:13:53.7915876Z 
2026-02-21T10:13:53.7915951Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T10:13:53.7916142Z _BLOCK_SIZE_1 = tl.constexpr(1024)
2026-02-21T10:13:53.7916259Z 
2026-02-21T10:13:53.7916329Z @triton.jit
2026-02-21T10:13:53.7916479Z def _helion_softmax_two_pass(x, out):
2026-02-21T10:13:53.7916746Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T10:13:53.7917217Z     pid_0 = tl.program_id(0)
2026-02-21T10:13:53.7917410Z     offset_0 = pid_0
2026-02-21T10:13:53.7917590Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T10:13:53.7917882Z     # src[softmax.py:80]: mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T10:13:53.7918191Z     mi = tl.full([_BLOCK_SIZE_0], float('-inf'), tl.float32)
2026-02-21T10:13:53.7918541Z     # src[softmax.py:81]: di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T10:13:53.7918804Z     di = tl.full([_BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T10:13:53.7919056Z     # src[softmax.py:82]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T10:13:53.7919326Z     # src[softmax.py:83]:     values = x[tile_m, tile_n]
2026-02-21T10:13:53.7919573Z     # src[softmax.py:84]:     local_amax = torch.amax(values, dim=1)
2026-02-21T10:13:53.7919805Z     # src[softmax.py:82-89]: ...
2026-02-21T10:13:53.7920071Z     for offset_2 in tl.range(0, 12672, _BLOCK_SIZE_1, loop_unroll_factor=2, num_stages=1):
2026-02-21T10:13:53.7920395Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T10:13:53.7920639Z         mask_1 = indices_2 < 12672
2026-02-21T10:13:53.7920804Z         mi_copy = mi
2026-02-21T10:13:53.7920950Z         di_copy = di
2026-02-21T10:13:53.7921092Z         mi_copy_0 = mi_copy
2026-02-21T10:13:53.7921294Z         di_copy_0 = di_copy
2026-02-21T10:13:53.7921476Z         # src[softmax.py:83]: values = x[tile_m, tile_n]
2026-02-21T10:13:53.7921875Z         values = tl.load(x + (indices_0[:, None] * 12672 + indices_2[None, :] * 1), mask_1[None, :], other=0)
2026-02-21T10:13:53.7922217Z         # src[softmax.py:84]: local_amax = torch.amax(values, dim=1)
2026-02-21T10:13:53.7922617Z         _mask_to = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), values, tl.full([], float('-inf'), tl.float16))
2026-02-21T10:13:53.7923018Z         local_amax = tl.cast(tl.max(_mask_to, 1), tl.float16)
2026-02-21T10:13:53.7923278Z         # src[softmax.py:85]: mi_next = torch.maximum(mi, local_amax)
2026-02-21T10:13:53.7923519Z         v_0 = tl.cast(local_amax, tl.float32)
2026-02-21T10:13:53.7923735Z         v_1 = triton_helpers.maximum(mi_copy_0, v_0)
2026-02-21T10:13:53.7923991Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T10:13:53.7924238Z         v_2 = mi_copy_0 - v_1
2026-02-21T10:13:53.7924409Z         v_3 = libdevice.exp(v_2)
2026-02-21T10:13:53.7924581Z         v_4 = di_copy_0 * v_3
2026-02-21T10:13:53.7924764Z         # src[softmax.py:87]: values - mi_next[:, None]
2026-02-21T10:13:53.7924967Z         subscript = v_1[:, None]
2026-02-21T10:13:53.7925134Z         v_5 = tl.cast(values, tl.float32)
2026-02-21T10:13:53.7925316Z         v_6 = v_5 - subscript
2026-02-21T10:13:53.7925527Z         # src[softmax.py:86]: di = di * torch.exp(mi - mi_next) + torch.exp(
2026-02-21T10:13:53.7925781Z         # src[softmax.py:87]:     values - mi_next[:, None]
2026-02-21T10:13:53.7926000Z         # src[softmax.py:88]: ).sum(dim=1)
2026-02-21T10:13:53.7926185Z         v_7 = libdevice.exp(v_6)
2026-02-21T10:13:53.7926505Z         _mask_to_1 = tl.where(tl.broadcast_to(mask_1[None, :], [_BLOCK_SIZE_0, _BLOCK_SIZE_1]), v_7, tl.full([], 0, tl.float32))
2026-02-21T10:13:53.7926904Z         sum_1 = tl.cast(tl.sum(_mask_to_1, 1), tl.float32)
2026-02-21T10:13:53.7927114Z         di = v_4 + sum_1
2026-02-21T10:13:53.7927285Z         # src[softmax.py:89]: mi = mi_next
2026-02-21T10:13:53.7927456Z         mi = v_1
2026-02-21T10:13:53.7927660Z     # src[softmax.py:90]: for tile_n in hl.tile(n, block_size=block_size_n):
2026-02-21T10:13:53.7927927Z     # src[softmax.py:91]:     values = x[tile_m, tile_n]
2026-02-21T10:13:53.7928225Z     # src[softmax.py:92]:     out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T10:13:53.7928578Z     for offset_2 in tl.range(0, 12672, _BLOCK_SIZE_1, loop_unroll_factor=2, num_stages=1):
2026-02-21T10:13:53.7928908Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_1).to(tl.int32)
2026-02-21T10:13:53.7929207Z         mask_2 = indices_2 < 12672
2026-02-21T10:13:53.7929377Z         mi_copy_1 = mi
2026-02-21T10:13:53.7929529Z         di_copy_1 = di
2026-02-21T10:13:53.7929676Z         mi_copy_1_0 = mi_copy_1
2026-02-21T10:13:53.7929845Z         di_copy_1_0 = di_copy_1
2026-02-21T10:13:53.7930028Z         # src[softmax.py:91]: values = x[tile_m, tile_n]
2026-02-21T10:13:53.7930390Z         values_1 = tl.load(x + (indices_0[:, None] * 12672 + indices_2[None, :] * 1), mask_2[None, :], other=0)
2026-02-21T10:13:53.7930784Z         # src[softmax.py:92]: out[tile_m, tile_n] = torch.exp(values - mi[:, None]) / di[:, None]
2026-02-21T10:13:53.7931065Z         subscript_1 = mi_copy_1_0[:, None]
2026-02-21T10:13:53.7931260Z         v_9 = tl.cast(values_1, tl.float32)
2026-02-21T10:13:53.7931440Z         v_10 = v_9 - subscript_1
2026-02-21T10:13:53.7931658Z         v_11 = libdevice.exp(v_10)
2026-02-21T10:13:53.7931832Z         subscript_2 = di_copy_1_0[:, None]
2026-02-21T10:13:53.7932016Z         v_12 = v_11 / subscript_2
2026-02-21T10:13:53.7932194Z         v_13 = tl.cast(v_12, tl.float16)
2026-02-21T10:13:53.7932461Z         tl.store(out + (indices_0[:, None] * 12672 + indices_2[None, :] * 1), v_13, mask_2[None, :])
2026-02-21T10:13:53.7932672Z 
2026-02-21T10:13:53.7932841Z def softmax_two_pass(x: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T10:13:53.7933075Z     """
2026-02-21T10:13:53.7933288Z     Numerically optimized Helion kernel performing softmax in two passes.
2026-02-21T10:13:53.7933592Z     This version uses fewer passes but is less numerically stable.
2026-02-21T10:13:53.7933820Z     Args:
2026-02-21T10:13:53.7933990Z         x (torch.Tensor): Input tensor of shape [m, n].
2026-02-21T10:13:53.7934183Z     Returns:
2026-02-21T10:13:53.7934370Z         torch.Tensor: Softmax output tensor of the same shape.
2026-02-21T10:13:53.7934577Z     """
2026-02-21T10:13:53.7934721Z     # src[softmax.py:75]: m, n = x.size()
2026-02-21T10:13:53.7934897Z     m, n = x.size()
2026-02-21T10:13:53.7935076Z     # src[softmax.py:76]: out = torch.empty_like(x)
2026-02-21T10:13:53.7935285Z     out = torch.empty_like(x)
2026-02-21T10:13:53.7935530Z     # src[softmax.py:79]: for tile_m in hl.tile(m, block_size=block_size_m):
2026-02-21T10:13:53.7935868Z     # src[softmax.py:80]:     mi = hl.full([tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T10:13:53.7936197Z     # src[softmax.py:81]:     di = hl.zeros([tile_m], dtype=torch.float32)
2026-02-21T10:13:53.7936450Z     # src[softmax.py:79-92]: ...
2026-02-21T10:13:53.7936716Z     _launcher(_helion_softmax_two_pass, (4096,), x, out, num_warps=1, num_stages=1)
2026-02-21T10:13:53.7937007Z     # src[softmax.py:93]: return out
2026-02-21T10:13:53.7937185Z     return out
2026-02-21T10:13:54.5308979Z WARNING:tritonbench.utils.triton_op:Completed input ID 97:
2026-02-21T10:13:54.5310916Z (M, N)
2026-02-21T10:13:54.5311091Z -------------
2026-02-21T10:13:54.5311243Z (4096, 12672)
2026-02-21T10:13:54.5311323Z 
2026-02-21T10:13:54.5311691Z 100%|██████████| 20/20 [1:04:59<00:00, 223.54s/it]
2026-02-21T10:13:54.5315005Z 100%|██████████| 20/20 [1:04:59<00:00, 194.99s/it]
2026-02-21T10:13:54.5341086Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmp4vqrkpfg.csv
2026-02-21T10:13:56.2112601Z        (M, N)    triton_softmax-speedup    triton_softmax-accuracy    torch_compile_softmax-speedup    torch_compile_softmax-accuracy    helion_softmax_tritonbench-speedup    helion_softmax_tritonbench-accuracy
2026-02-21T10:13:56.2113461Z -------------  ------------------------  -------------------------  -------------------------------  --------------------------------  ------------------------------------  -------------------------------------
2026-02-21T10:13:56.2114073Z   (4096, 256)                  0.931476                          1                         1.25056                                  1                               1.39284                                      1
2026-02-21T10:13:56.2114834Z   (4096, 896)                  1.85641                           1                         1.43552                                  1                               2.02169                                      1
2026-02-21T10:13:56.2115363Z  (4096, 1536)                  3.52879                           1                         2.21949                                  1                               4.483                                        1
2026-02-21T10:13:56.2115937Z  (4096, 2176)                  2.36344                           1                         1.98727                                  1                               4.14473                                      1
2026-02-21T10:13:56.2116410Z  (4096, 2816)                  2.41335                           1                         1.68812                                  1                               4.28562                                      1
2026-02-21T10:13:56.2116882Z  (4096, 3584)                  2.7054                            1                         1.54707                                  1                               3.36528                                      1
2026-02-21T10:13:56.2117419Z  (4096, 4224)                  3.72627                           1                         1.97747                                  1                               4.89165                                      1
2026-02-21T10:13:56.2117896Z  (4096, 4864)                  3.78826                           1                         1.85434                                  1                               5.02964                                      1
2026-02-21T10:13:56.2118362Z  (4096, 5504)                  4.1215                            1                         1.88542                                  1                               4.8864                                       1
2026-02-21T10:13:56.2118827Z  (4096, 6144)                  4.15768                           1                         2.09188                                  1                               4.54654                                      1
2026-02-21T10:13:56.2119288Z  (4096, 6784)                  4.28904                           1                         1.67317                                  1                               4.62843                                      1
2026-02-21T10:13:56.2119762Z  (4096, 7424)                  4.89155                           1                         1.86434                                  1                               4.77075                                      1
2026-02-21T10:13:56.2120231Z  (4096, 8064)                  4.84175                           1                         1.78775                                  1                               4.6481                                       1
2026-02-21T10:13:56.2120692Z  (4096, 8704)                  2.67811                           1                         1.91351                                  1                               2.96255                                      1
2026-02-21T10:13:56.2121165Z  (4096, 9344)                  1.74363                           1                         0.985811                                 1                               1.85503                                      1
2026-02-21T10:13:56.2121770Z (4096, 10112)                  1.74389                           1                         0.950067                                 1                               2.43056                                      1
2026-02-21T10:13:56.2122246Z (4096, 10752)                  1.73459                           1                         1.06085                                  1                               2.27849                                      1
2026-02-21T10:13:56.2122711Z (4096, 11392)                  1.74057                           1                         0.862269                                 1                               2.2063                                       1
2026-02-21T10:13:56.2123210Z (4096, 12032)                  1.74097                           1                         0.843324                                 1                               2.2441                                       1
2026-02-21T10:13:56.2123679Z (4096, 12672)                  1.7545                            1                         0.83287                                  1                               1.35196                                      1
2026-02-21T10:13:56.2124217Z       average                  2.83756                           1                         1.53556                                  1                               3.42118                                      1
2026-02-21T10:14:00.7122044Z ✅ Completed benchmark for kernel: softmax
2026-02-21T10:14:00.7129413Z [
2026-02-21T10:14:00.7134406Z   {
2026-02-21T10:14:00.7136381Z     "benchmark": {
2026-02-21T10:14:00.7136596Z       "name": "Helion Benchmark",
2026-02-21T10:14:00.7136783Z       "extra_info": {
2026-02-21T10:14:00.7136948Z         "device": "NVIDIA B200"
2026-02-21T10:14:00.7137110Z       }
2026-02-21T10:14:00.7137239Z     },
2026-02-21T10:14:00.7137390Z     "model": {
2026-02-21T10:14:00.7137546Z       "name": "softmax"
2026-02-21T10:14:00.7137696Z     },
2026-02-21T10:14:00.7137812Z     "metric": {
2026-02-21T10:14:00.7137961Z       "name": "triton_speedup",
2026-02-21T10:14:00.7138468Z       "benchmark_values": [
2026-02-21T10:14:00.7138659Z         0.931475984939404,
2026-02-21T10:14:00.7138808Z         1.8564143200843095,
2026-02-21T10:14:00.7138962Z         3.528789059960257,
2026-02-21T10:14:00.7139104Z         2.3634400016828625,
2026-02-21T10:14:00.7139255Z         2.4133479657797072,
2026-02-21T10:14:00.7139397Z         2.7053978669166314,
2026-02-21T10:14:00.7139548Z         3.7262709990419967,
2026-02-21T10:14:00.7139702Z         3.788257933823361,
2026-02-21T10:14:00.7139842Z         4.121495204138856,
2026-02-21T10:14:00.7139989Z         4.157681244965289,
2026-02-21T10:14:00.7140126Z         4.289042305257703,
2026-02-21T10:14:00.7140270Z         4.891546023739693,
2026-02-21T10:14:00.7140413Z         4.841745530599691,
2026-02-21T10:14:00.7140555Z         2.678114711491322,
2026-02-21T10:14:00.7140694Z         1.743631011334874,
2026-02-21T10:14:00.7140837Z         1.7438875801093323,
2026-02-21T10:14:00.7140976Z         1.7345903226492485,
2026-02-21T10:14:00.7141122Z         1.7405653952784175,
2026-02-21T10:14:00.7141275Z         1.7409664670249556,
2026-02-21T10:14:00.7141413Z         1.7544973423304502
2026-02-21T10:14:00.7141621Z       ]
2026-02-21T10:14:00.7141740Z     },
2026-02-21T10:14:00.7141867Z     "shape": [
2026-02-21T10:14:00.7141994Z       "(4096, 256)",
2026-02-21T10:14:00.7142136Z       "(4096, 896)",
2026-02-21T10:14:00.7142267Z       "(4096, 1536)",
2026-02-21T10:14:00.7142411Z       "(4096, 2176)",
2026-02-21T10:14:00.7142543Z       "(4096, 2816)",
2026-02-21T10:14:00.7142759Z       "(4096, 3584)",
2026-02-21T10:14:00.7142898Z       "(4096, 4224)",
2026-02-21T10:14:00.7143029Z       "(4096, 4864)",
2026-02-21T10:14:00.7143171Z       "(4096, 5504)",
2026-02-21T10:14:00.7143300Z       "(4096, 6144)",
2026-02-21T10:14:00.7143436Z       "(4096, 6784)",
2026-02-21T10:14:00.7143653Z       "(4096, 7424)",
2026-02-21T10:14:00.7143799Z       "(4096, 8064)",
2026-02-21T10:14:00.7143933Z       "(4096, 8704)",
2026-02-21T10:14:00.7144079Z       "(4096, 9344)",
2026-02-21T10:14:00.7144215Z       "(4096, 10112)",
2026-02-21T10:14:00.7144364Z       "(4096, 10752)",
2026-02-21T10:14:00.7144506Z       "(4096, 11392)",
2026-02-21T10:14:00.7144638Z       "(4096, 12032)",
2026-02-21T10:14:00.7144780Z       "(4096, 12672)"
2026-02-21T10:14:00.7144907Z     ]
2026-02-21T10:14:00.7145031Z   },
2026-02-21T10:14:00.7145145Z   {
2026-02-21T10:14:00.7145271Z     "benchmark": {
2026-02-21T10:14:00.7145416Z       "name": "Helion Benchmark",
2026-02-21T10:14:00.7145588Z       "extra_info": {
2026-02-21T10:14:00.7145728Z         "device": "NVIDIA B200"
2026-02-21T10:14:00.7145887Z       }
2026-02-21T10:14:00.7145998Z     },
2026-02-21T10:14:00.7146120Z     "model": {
2026-02-21T10:14:00.7146253Z       "name": "softmax"
2026-02-21T10:14:00.7146450Z     },
2026-02-21T10:14:00.7146577Z     "metric": {
2026-02-21T10:14:00.7146718Z       "name": "triton_accuracy",
2026-02-21T10:14:00.7146889Z       "benchmark_values": [
2026-02-21T10:14:00.7147033Z         1.0,
2026-02-21T10:14:00.7147159Z         1.0,
2026-02-21T10:14:00.7147282Z         1.0,
2026-02-21T10:14:00.7147469Z         1.0,
2026-02-21T10:14:00.7147583Z         1.0,
2026-02-21T10:14:00.7147705Z         1.0,
2026-02-21T10:14:00.7147818Z         1.0,
2026-02-21T10:14:00.7147939Z         1.0,
2026-02-21T10:14:00.7148052Z         1.0,
2026-02-21T10:14:00.7148172Z         1.0,
2026-02-21T10:14:00.7148294Z         1.0,
2026-02-21T10:14:00.7148409Z         1.0,
2026-02-21T10:14:00.7148532Z         1.0,
2026-02-21T10:14:00.7148647Z         1.0,
2026-02-21T10:14:00.7148771Z         1.0,
2026-02-21T10:14:00.7148886Z         1.0,
2026-02-21T10:14:00.7149010Z         1.0,
2026-02-21T10:14:00.7149123Z         1.0,
2026-02-21T10:14:00.7149244Z         1.0,
2026-02-21T10:14:00.7149362Z         1.0
2026-02-21T10:14:00.7149490Z       ]
2026-02-21T10:14:00.7149604Z     },
2026-02-21T10:14:00.7149735Z     "shape": [
2026-02-21T10:14:00.7149872Z       "(4096, 256)",
2026-02-21T10:14:00.7150005Z       "(4096, 896)",
2026-02-21T10:14:00.7150143Z       "(4096, 1536)",
2026-02-21T10:14:00.7150315Z       "(4096, 2176)",
2026-02-21T10:14:00.7150455Z       "(4096, 2816)",
2026-02-21T10:14:00.7150582Z       "(4096, 3584)",
2026-02-21T10:14:00.7150717Z       "(4096, 4224)",
2026-02-21T10:14:00.7150844Z       "(4096, 4864)",
2026-02-21T10:14:00.7150980Z       "(4096, 5504)",
2026-02-21T10:14:00.7151107Z       "(4096, 6144)",
2026-02-21T10:14:00.7151240Z       "(4096, 6784)",
2026-02-21T10:14:00.7151367Z       "(4096, 7424)",
2026-02-21T10:14:00.7151503Z       "(4096, 8064)",
2026-02-21T10:14:00.7151680Z       "(4096, 8704)",
2026-02-21T10:14:00.7151807Z       "(4096, 9344)",
2026-02-21T10:14:00.7151944Z       "(4096, 10112)",
2026-02-21T10:14:00.7152078Z       "(4096, 10752)",
2026-02-21T10:14:00.7152222Z       "(4096, 11392)",
2026-02-21T10:14:00.7152355Z       "(4096, 12032)",
2026-02-21T10:14:00.7152496Z       "(4096, 12672)"
2026-02-21T10:14:00.7152626Z     ]
2026-02-21T10:14:00.7152750Z   },
2026-02-21T10:14:00.7152862Z   {
2026-02-21T10:14:00.7152989Z     "benchmark": {
2026-02-21T10:14:00.7153136Z       "name": "Helion Benchmark",
2026-02-21T10:14:00.7153310Z       "extra_info": {
2026-02-21T10:14:00.7153454Z         "device": "NVIDIA B200"
2026-02-21T10:14:00.7153602Z       }
2026-02-21T10:14:00.7153720Z     },
2026-02-21T10:14:00.7153832Z     "model": {
2026-02-21T10:14:00.7153965Z       "name": "softmax"
2026-02-21T10:14:00.7154099Z     },
2026-02-21T10:14:00.7154222Z     "metric": {
2026-02-21T10:14:00.7154367Z       "name": "torch_compile_speedup",
2026-02-21T10:14:00.7154550Z       "benchmark_values": [
2026-02-21T10:14:00.7154699Z         1.250564387047188,
2026-02-21T10:14:00.7154851Z         1.435517213469286,
2026-02-21T10:14:00.7154997Z         2.219487315150675,
2026-02-21T10:14:00.7155138Z         1.9872697309694431,
2026-02-21T10:14:00.7155287Z         1.6881221240745872,
2026-02-21T10:14:00.7155470Z         1.5470672097488474,
2026-02-21T10:14:00.7155615Z         1.9774716427575567,
2026-02-21T10:14:00.7155751Z         1.8543425154620692,
2026-02-21T10:14:00.7155895Z         1.8854201106396877,
2026-02-21T10:14:00.7156035Z         2.0918793313501003,
2026-02-21T10:14:00.7156182Z         1.673172486995061,
2026-02-21T10:14:00.7156321Z         1.8643445788674753,
2026-02-21T10:14:00.7156470Z         1.7877475922999537,
2026-02-21T10:14:00.7156623Z         1.913510707795816,
2026-02-21T10:14:00.7156767Z         0.9858111921332556,
2026-02-21T10:14:00.7156922Z         0.9500668006873804,
2026-02-21T10:14:00.7157069Z         1.0608469534383906,
2026-02-21T10:14:00.7157219Z         0.8622689777512719,
2026-02-21T10:14:00.7157358Z         0.843324263350364,
2026-02-21T10:14:00.7157506Z         0.8328696803411412
2026-02-21T10:14:00.7157643Z       ]
2026-02-21T10:14:00.7157763Z     },
2026-02-21T10:14:00.7157911Z     "shape": [
2026-02-21T10:14:00.7158045Z       "(4096, 256)",
2026-02-21T10:14:00.7158183Z       "(4096, 896)",
2026-02-21T10:14:00.7158312Z       "(4096, 1536)",
2026-02-21T10:14:00.7158448Z       "(4096, 2176)",
2026-02-21T10:14:00.7158577Z       "(4096, 2816)",
2026-02-21T10:14:00.7158715Z       "(4096, 3584)",
2026-02-21T10:14:00.7158889Z       "(4096, 4224)",
2026-02-21T10:14:00.7159025Z       "(4096, 4864)",
2026-02-21T10:14:00.7159152Z       "(4096, 5504)",
2026-02-21T10:14:00.7159283Z       "(4096, 6144)",
2026-02-21T10:14:00.7159408Z       "(4096, 6784)",
2026-02-21T10:14:00.7159541Z       "(4096, 7424)",
2026-02-21T10:14:00.7159667Z       "(4096, 8064)",
2026-02-21T10:14:00.7159802Z       "(4096, 8704)",
2026-02-21T10:14:00.7159936Z       "(4096, 9344)",
2026-02-21T10:14:00.7160067Z       "(4096, 10112)",
2026-02-21T10:14:00.7160208Z       "(4096, 10752)",
2026-02-21T10:14:00.7160369Z       "(4096, 11392)",
2026-02-21T10:14:00.7160500Z       "(4096, 12032)",
2026-02-21T10:14:00.7160638Z       "(4096, 12672)"
2026-02-21T10:14:00.7160773Z     ]
2026-02-21T10:14:00.7160886Z   },
2026-02-21T10:14:00.7161006Z   {
2026-02-21T10:14:00.7161123Z     "benchmark": {
2026-02-21T10:14:00.7161271Z       "name": "Helion Benchmark",
2026-02-21T10:14:00.7161431Z       "extra_info": {
2026-02-21T10:14:00.7161652Z         "device": "NVIDIA B200"
2026-02-21T10:14:00.7161806Z       }
2026-02-21T10:14:00.7161926Z     },
2026-02-21T10:14:00.7162040Z     "model": {
2026-02-21T10:14:00.7162176Z       "name": "softmax"
2026-02-21T10:14:00.7162311Z     },
2026-02-21T10:14:00.7162436Z     "metric": {
2026-02-21T10:14:00.7162587Z       "name": "torch_compile_accuracy",
2026-02-21T10:14:00.7162767Z       "benchmark_values": [
2026-02-21T10:14:00.7162922Z         1.0,
2026-02-21T10:14:00.7163041Z         1.0,
2026-02-21T10:14:00.7163164Z         1.0,
2026-02-21T10:14:00.7163281Z         1.0,
2026-02-21T10:14:00.7163404Z         1.0,
2026-02-21T10:14:00.7163517Z         1.0,
2026-02-21T10:14:00.7163638Z         1.0,
2026-02-21T10:14:00.7163753Z         1.0,
2026-02-21T10:14:00.7163875Z         1.0,
2026-02-21T10:14:00.7163990Z         1.0,
2026-02-21T10:14:00.7164108Z         1.0,
2026-02-21T10:14:00.7164229Z         1.0,
2026-02-21T10:14:00.7164342Z         1.0,
2026-02-21T10:14:00.7164462Z         1.0,
2026-02-21T10:14:00.7164578Z         1.0,
2026-02-21T10:14:00.7164700Z         1.0,
2026-02-21T10:14:00.7164814Z         1.0,
2026-02-21T10:14:00.7164934Z         1.0,
2026-02-21T10:14:00.7165049Z         1.0,
2026-02-21T10:14:00.7165172Z         1.0
2026-02-21T10:14:00.7165287Z       ]
2026-02-21T10:14:00.7165407Z     },
2026-02-21T10:14:00.7165521Z     "shape": [
2026-02-21T10:14:00.7165653Z       "(4096, 256)",
2026-02-21T10:14:00.7165783Z       "(4096, 896)",
2026-02-21T10:14:00.7165921Z       "(4096, 1536)",
2026-02-21T10:14:00.7166059Z       "(4096, 2176)",
2026-02-21T10:14:00.7166190Z       "(4096, 2816)",
2026-02-21T10:14:00.7166326Z       "(4096, 3584)",
2026-02-21T10:14:00.7166452Z       "(4096, 4224)",
2026-02-21T10:14:00.7166594Z       "(4096, 4864)",
2026-02-21T10:14:00.7166725Z       "(4096, 5504)",
2026-02-21T10:14:00.7166904Z       "(4096, 6144)",
2026-02-21T10:14:00.7167031Z       "(4096, 6784)",
2026-02-21T10:14:00.7167166Z       "(4096, 7424)",
2026-02-21T10:14:00.7167295Z       "(4096, 8064)",
2026-02-21T10:14:00.7167435Z       "(4096, 8704)",
2026-02-21T10:14:00.7167587Z       "(4096, 9344)",
2026-02-21T10:14:00.7167721Z       "(4096, 10112)",
2026-02-21T10:14:00.7167870Z       "(4096, 10752)",
2026-02-21T10:14:00.7168008Z       "(4096, 11392)",
2026-02-21T10:14:00.7168152Z       "(4096, 12032)",
2026-02-21T10:14:00.7168286Z       "(4096, 12672)"
2026-02-21T10:14:00.7168427Z     ]
2026-02-21T10:14:00.7168544Z   },
2026-02-21T10:14:00.7168671Z   {
2026-02-21T10:14:00.7168790Z     "benchmark": {
2026-02-21T10:14:00.7168944Z       "name": "Helion Benchmark",
2026-02-21T10:14:00.7169111Z       "extra_info": {
2026-02-21T10:14:00.7169266Z         "device": "NVIDIA B200"
2026-02-21T10:14:00.7169429Z       }
2026-02-21T10:14:00.7169586Z     },
2026-02-21T10:14:00.7169717Z     "model": {
2026-02-21T10:14:00.7169851Z       "name": "softmax"
2026-02-21T10:14:00.7170001Z     },
2026-02-21T10:14:00.7170119Z     "metric": {
2026-02-21T10:14:00.7170267Z       "name": "helion_speedup",
2026-02-21T10:14:00.7170434Z       "benchmark_values": [
2026-02-21T10:14:00.7170634Z         1.3928367223219527,
2026-02-21T10:14:00.7170782Z         2.021686809958268,
2026-02-21T10:14:00.7170937Z         4.482999885115271,
2026-02-21T10:14:00.7171092Z         4.144729722512214,
2026-02-21T10:14:00.7171236Z         4.285616670885379,
2026-02-21T10:14:00.7171389Z         3.3652760520851452,
2026-02-21T10:14:00.7171563Z         4.891645563073838,
2026-02-21T10:14:00.7171719Z         5.029636573721792,
2026-02-21T10:14:00.7171863Z         4.886399900166625,
2026-02-21T10:14:00.7172015Z         4.5465352701655695,
2026-02-21T10:14:00.7172162Z         4.628430633856397,
2026-02-21T10:14:00.7172315Z         4.770753433747418,
2026-02-21T10:14:00.7172459Z         4.648096395332824,
2026-02-21T10:14:00.7172612Z         2.9625486935284497,
2026-02-21T10:14:00.7172769Z         1.855029253167436,
2026-02-21T10:14:00.7172914Z         2.4305577597188064,
2026-02-21T10:14:00.7173073Z         2.278486748774394,
2026-02-21T10:14:00.7173268Z         2.2062960096361914,
2026-02-21T10:14:00.7173427Z         2.2441016414386814,
2026-02-21T10:14:00.7173573Z         1.3519636111885713
2026-02-21T10:14:00.7173725Z       ]
2026-02-21T10:14:00.7173842Z     },
2026-02-21T10:14:00.7173973Z     "shape": [
2026-02-21T10:14:00.7174107Z       "(4096, 256)",
2026-02-21T10:14:00.7174257Z       "(4096, 896)",
2026-02-21T10:14:00.7174399Z       "(4096, 1536)",
2026-02-21T10:14:00.7174555Z       "(4096, 2176)",
2026-02-21T10:14:00.7174709Z       "(4096, 2816)",
2026-02-21T10:14:00.7174850Z       "(4096, 3584)",
2026-02-21T10:14:00.7175002Z       "(4096, 4224)",
2026-02-21T10:14:00.7175135Z       "(4096, 4864)",
2026-02-21T10:14:00.7175277Z       "(4096, 5504)",
2026-02-21T10:14:00.7175413Z       "(4096, 6144)",
2026-02-21T10:14:00.7175558Z       "(4096, 6784)",
2026-02-21T10:14:00.7175697Z       "(4096, 7424)",
2026-02-21T10:14:00.7175841Z       "(4096, 8064)",
2026-02-21T10:14:00.7175977Z       "(4096, 8704)",
2026-02-21T10:14:00.7176121Z       "(4096, 9344)",
2026-02-21T10:14:00.7176268Z       "(4096, 10112)",
2026-02-21T10:14:00.7176411Z       "(4096, 10752)",
2026-02-21T10:14:00.7176562Z       "(4096, 11392)",
2026-02-21T10:14:00.7176700Z       "(4096, 12032)",
2026-02-21T10:14:00.7176846Z       "(4096, 12672)"
2026-02-21T10:14:00.7176983Z     ]
2026-02-21T10:14:00.7177113Z   },
2026-02-21T10:14:00.7177230Z   {
2026-02-21T10:14:00.7177360Z     "benchmark": {
2026-02-21T10:14:00.7177509Z       "name": "Helion Benchmark",
2026-02-21T10:14:00.7177687Z       "extra_info": {
2026-02-21T10:14:00.7177834Z         "device": "NVIDIA B200"
2026-02-21T10:14:00.7177999Z       }
2026-02-21T10:14:00.7178124Z     },
2026-02-21T10:14:00.7178257Z     "model": {
2026-02-21T10:14:00.7178396Z       "name": "softmax"
2026-02-21T10:14:00.7178535Z     },
2026-02-21T10:14:00.7178703Z     "metric": {
2026-02-21T10:14:00.7178847Z       "name": "helion_accuracy",
2026-02-21T10:14:00.7179023Z       "benchmark_values": [
2026-02-21T10:14:00.7179172Z         1.0,
2026-02-21T10:14:00.7179303Z         1.0,
2026-02-21T10:14:00.7179428Z         1.0,
2026-02-21T10:14:00.7179557Z         1.0,
2026-02-21T10:14:00.7179677Z         1.0,
2026-02-21T10:14:00.7179804Z         1.0,
2026-02-21T10:14:00.7179930Z         1.0,
2026-02-21T10:14:00.7180051Z         1.0,
2026-02-21T10:14:00.7180180Z         1.0,
2026-02-21T10:14:00.7180304Z         1.0,
2026-02-21T10:14:00.7180436Z         1.0,
2026-02-21T10:14:00.7180558Z         1.0,
2026-02-21T10:14:00.7180690Z         1.0,
2026-02-21T10:14:00.7180813Z         1.0,
2026-02-21T10:14:00.7180944Z         1.0,
2026-02-21T10:14:00.7181069Z         1.0,
2026-02-21T10:14:00.7181204Z         1.0,
2026-02-21T10:14:00.7181329Z         1.0,
2026-02-21T10:14:00.7181460Z         1.0,
2026-02-21T10:14:00.7181685Z         1.0
2026-02-21T10:14:00.7181837Z       ]
2026-02-21T10:14:00.7181975Z     },
2026-02-21T10:14:00.7182100Z     "shape": [
2026-02-21T10:14:00.7182236Z       "(4096, 256)",
2026-02-21T10:14:00.7182374Z       "(4096, 896)",
2026-02-21T10:14:00.7182523Z       "(4096, 1536)",
2026-02-21T10:14:00.7182708Z       "(4096, 2176)",
2026-02-21T10:14:00.7182850Z       "(4096, 2816)",
2026-02-21T10:14:00.7182985Z       "(4096, 3584)",
2026-02-21T10:14:00.7183137Z       "(4096, 4224)",
2026-02-21T10:14:00.7183270Z       "(4096, 4864)",
2026-02-21T10:14:00.7183410Z       "(4096, 5504)",
2026-02-21T10:14:00.7183548Z       "(4096, 6144)",
2026-02-21T10:14:00.7183680Z       "(4096, 6784)",
2026-02-21T10:14:00.7183818Z       "(4096, 7424)",
2026-02-21T10:14:00.7183950Z       "(4096, 8064)",
2026-02-21T10:14:00.7184090Z       "(4096, 8704)",
2026-02-21T10:14:00.7184225Z       "(4096, 9344)",
2026-02-21T10:14:00.7184365Z       "(4096, 10112)",
2026-02-21T10:14:00.7184502Z       "(4096, 10752)",
2026-02-21T10:14:00.7184650Z       "(4096, 11392)",
2026-02-21T10:14:00.7184789Z       "(4096, 12032)",
2026-02-21T10:14:00.7184935Z       "(4096, 12672)"
2026-02-21T10:14:00.7185067Z     ]
2026-02-21T10:14:00.7185193Z   }
2026-02-21T10:14:00.7197948Z ]
2026-02-21T10:14:00.7268495Z ##[group]Run pytorch/test-infra/.github/actions/gather-benchmark-metadata@main
2026-02-21T10:14:00.7268808Z with:
2026-02-21T10:14:00.7269167Z   github-token: ***
2026-02-21T10:14:00.7269319Z   venv: .venv/bin/activate
2026-02-21T10:14:00.7269491Z   schema-version: v3
2026-02-21T10:14:00.7269630Z env:
2026-02-21T10:14:00.7269776Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T10:14:00.7269986Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:00.7270237Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T10:14:00.7270491Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:00.7270709Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:00.7270932Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:00.7271293Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T10:14:00.7271806Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T10:14:00.7272030Z ##[endgroup]
2026-02-21T10:14:00.7327680Z ##[group]Run set -eux
2026-02-21T10:14:00.7327861Z set -eux
2026-02-21T10:14:00.7328002Z
2026-02-21T10:14:00.7328157Z if [[ -z "${GITHUB_TOKEN}" ]]; then
2026-02-21T10:14:00.7328365Z   echo "Missing github-token input"
2026-02-21T10:14:00.7328554Z   exit 1
2026-02-21T10:14:00.7328682Z fi
2026-02-21T10:14:00.7329600Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T10:14:00.7329800Z env:
2026-02-21T10:14:00.7329947Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T10:14:00.7330147Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:00.7330408Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T10:14:00.7330663Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:00.7330970Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:00.7331203Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:00.7331650Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T10:14:00.7332060Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T10:14:00.7332432Z   GITHUB_TOKEN: ***
2026-02-21T10:14:00.7332581Z ##[endgroup]
2026-02-21T10:14:00.7981784Z + [[ -z *** ]]
2026-02-21T10:14:00.8062347Z ##[group]Run pytorch/test-infra/.github/actions/get-workflow-job-id@main
2026-02-21T10:14:00.8062640Z with:
2026-02-21T10:14:00.8062910Z   github-token: ***
2026-02-21T10:14:00.8063067Z env:
2026-02-21T10:14:00.8063211Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T10:14:00.8063430Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:00.8063684Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T10:14:00.8063937Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:00.8064162Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:00.8064387Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:00.8064774Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T10:14:00.8065326Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T10:14:00.8065569Z ##[endgroup]
2026-02-21T10:14:00.8074939Z ##[group]Run set -eux
2026-02-21T10:14:00.8075121Z set -eux
2026-02-21T10:14:00.8075273Z
2026-02-21T10:14:00.8075566Z python3 "${GITHUB_ACTION_PATH}/../../scripts/get_workflow_job_id.py" "${GITHUB_RUN_ID}" "${RUNNER_NAME}"
2026-02-21T10:14:00.8075984Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T10:14:00.8076186Z env:
2026-02-21T10:14:00.8076324Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T10:14:00.8076533Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:00.8076802Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T10:14:00.8077197Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:00.8077445Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:00.8077675Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:00.8078064Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T10:14:00.8078471Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T10:14:00.8078808Z   GITHUB_TOKEN: ***
2026-02-21T10:14:00.8078953Z ##[endgroup]
2026-02-21T10:14:00.8664862Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/get-workflow-job-id/../../scripts/get_workflow_job_id.py 22253280836 dgxb200-03-1005
2026-02-21T10:14:02.4019841Z setting job-id=64380329741
2026-02-21T10:14:02.4024489Z setting job-name=run-b200 (softmax) / benchmark-cu130-softmax-py3.12-b200
2026-02-21T10:14:02.4208653Z ##[group]Run set -eux
2026-02-21T10:14:02.4208832Z set -eux
2026-02-21T10:14:02.4208967Z
2026-02-21T10:14:02.4209135Z if [[ -n ".venv/bin/activate" ]]; then
2026-02-21T10:14:02.4209351Z   source ".venv/bin/activate"
2026-02-21T10:14:02.4209531Z fi
2026-02-21T10:14:02.4209659Z
2026-02-21T10:14:02.4209885Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_metadata.py" \
2026-02-21T10:14:02.4210193Z   --schema-version "${SCHEMA_VERSION}" \
2026-02-21T10:14:02.4210395Z   --repo "${REPO}" \
2026-02-21T10:14:02.4210581Z   --head-branch "${HEAD_BRANCH}" \
2026-02-21T10:14:02.4210781Z   --head-sha "${HEAD_SHA}" \
2026-02-21T10:14:02.4210973Z   --workflow-id "${WORKFLOW_RUN_ID}" \
2026-02-21T10:14:02.4211182Z   --run-attempt "${RUN_ATTEMPT}" \
2026-02-21T10:14:02.4211365Z   --job-id "${JOB_ID}" \
2026-02-21T10:14:02.4211588Z   --job-name "${JOB_NAME}"
2026-02-21T10:14:02.4211982Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T10:14:02.4212204Z env:
2026-02-21T10:14:02.4212364Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T10:14:02.4212571Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.4212838Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T10:14:02.4213094Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.4213326Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.4213558Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.4213914Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T10:14:02.4214300Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T10:14:02.4214532Z   SCHEMA_VERSION: v3
2026-02-21T10:14:02.4214697Z   REPO: pytorch/helion
2026-02-21T10:14:02.4214859Z   HEAD_BRANCH: refs/heads/main
2026-02-21T10:14:02.4215070Z   HEAD_SHA: 874a7d0cadab18218a84ad3579d329dc95c51820
2026-02-21T10:14:02.4215280Z   WORKFLOW_RUN_ID: 22253280836
2026-02-21T10:14:02.4215451Z   RUN_ATTEMPT: 1
2026-02-21T10:14:02.4215595Z   JOB_ID: 64380329741
2026-02-21T10:14:02.4215909Z   JOB_NAME: run-b200 (softmax) / benchmark-cu130-softmax-py3.12-b200
2026-02-21T10:14:02.4216166Z ##[endgroup]
2026-02-21T10:14:02.4903770Z + [[ -n .venv/bin/activate ]]
2026-02-21T10:14:02.4908661Z + source .venv/bin/activate
2026-02-21T10:14:02.4908959Z ++ '[' -z '' ']'
2026-02-21T10:14:02.4909150Z ++ '[' -n x ']'
2026-02-21T10:14:02.4909354Z ++ SCRIPT_PATH=.venv/bin/activate
2026-02-21T10:14:02.4909660Z ++ '[' .venv/bin/activate = /__w/_temp/c1f10b70-5831-4a52-a48a-1f73ca90304c.sh ']'
2026-02-21T10:14:02.4909993Z ++ deactivate nondestructive
2026-02-21T10:14:02.4910195Z ++ unset -f pydoc
2026-02-21T10:14:02.4910343Z ++ '[' -z '' ']'
2026-02-21T10:14:02.4910477Z ++ '[' -z '' ']'
2026-02-21T10:14:02.4910617Z ++ hash -r
2026-02-21T10:14:02.4910764Z ++ '[' -z '' ']'
2026-02-21T10:14:02.4910960Z ++ unset VIRTUAL_ENV
2026-02-21T10:14:02.4911138Z ++ unset VIRTUAL_ENV_PROMPT
2026-02-21T10:14:02.4911758Z ++ '[' '!' nondestructive = nondestructive ']'
2026-02-21T10:14:02.4912005Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv
2026-02-21T10:14:02.4912233Z ++ '[' linux-gnu = cygwin ']'
2026-02-21T10:14:02.4912430Z ++ '[' linux-gnu = msys ']'
2026-02-21T10:14:02.4912602Z ++ export VIRTUAL_ENV
2026-02-21T10:14:02.4912762Z ++ '[' -z '' ']'
2026-02-21T10:14:02.4912906Z ++ unset SCRIPT_PATH
2026-02-21T10:14:02.4913634Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T10:14:02.4914894Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T10:14:02.4915615Z ++ export PATH
2026-02-21T10:14:02.4915778Z ++ '[' xhelion '!=' x ']'
2026-02-21T10:14:02.4915956Z ++ VIRTUAL_ENV_PROMPT=helion
2026-02-21T10:14:02.4916146Z ++ export VIRTUAL_ENV_PROMPT
2026-02-21T10:14:02.4916311Z ++ '[' -z '' ']'
2026-02-21T10:14:02.4916455Z ++ '[' -z '' ']'
2026-02-21T10:14:02.4916597Z ++ _OLD_VIRTUAL_PS1=
2026-02-21T10:14:02.4916756Z ++ PS1='(helion) '
2026-02-21T10:14:02.4916902Z ++ export PS1
2026-02-21T10:14:02.4917049Z ++ alias pydoc
2026-02-21T10:14:02.4917196Z ++ true
2026-02-21T10:14:02.4917326Z ++ hash -r
2026-02-21T10:14:02.4918404Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/gather-benchmark-metadata/../../scripts/benchmarks/gather_metadata.py --schema-version v3 --repo pytorch/helion --head-branch refs/heads/main --head-sha 874a7d0cadab18218a84ad3579d329dc95c51820 --workflow-id 22253280836 --run-attempt 1 --job-id 64380329741 --job-name 'run-b200 (softmax) / benchmark-cu130-softmax-py3.12-b200'
2026-02-21T10:14:02.5309104Z ##[group]Run pytorch/test-infra/.github/actions/gather-runners-info@main
2026-02-21T10:14:02.5309376Z with:
2026-02-21T10:14:02.5309521Z   venv: .venv/bin/activate
2026-02-21T10:14:02.5309683Z env:
2026-02-21T10:14:02.5309826Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T10:14:02.5310025Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.5310279Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T10:14:02.5310520Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.5310743Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.5310954Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.5311326Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T10:14:02.5311763Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T10:14:02.5311985Z ##[endgroup]
2026-02-21T10:14:02.5321692Z ##[group]Run set -eux
2026-02-21T10:14:02.5321870Z set -eux
2026-02-21T10:14:02.5322013Z
2026-02-21T10:14:02.5322160Z if command -v nvidia-smi; then
2026-02-21T10:14:02.5322427Z   DEVICE_NAME=cuda
2026-02-21T10:14:02.5322593Z   nvidia-smi
2026-02-21T10:14:02.5322751Z elif command -v rocm-smi; then
2026-02-21T10:14:02.5322940Z   DEVICE_NAME=rocm
2026-02-21T10:14:02.5323094Z   rocm-smi
2026-02-21T10:14:02.5323254Z elif command -v hl-smi; then
2026-02-21T10:14:02.5323433Z   DEVICE_NAME=hpu
2026-02-21T10:14:02.5323596Z   hl-smi
2026-02-21T10:14:02.5323727Z else
2026-02-21T10:14:02.5323873Z   arch=$(uname -m)
2026-02-21T10:14:02.5324031Z
2026-02-21T10:14:02.5324160Z   case "$arch" in
2026-02-21T10:14:02.5324323Z     aarch64|arm64)
2026-02-21T10:14:02.5324488Z       DEVICE_NAME=arm64-cpu
2026-02-21T10:14:02.5324672Z       ;;
2026-02-21T10:14:02.5324806Z     *)
2026-02-21T10:14:02.5324956Z       DEVICE_NAME=cpu
2026-02-21T10:14:02.5325115Z       ;;
2026-02-21T10:14:02.5325256Z   esac
2026-02-21T10:14:02.5325387Z   lscpu
2026-02-21T10:14:02.5325522Z fi
2026-02-21T10:14:02.5325696Z echo "DEVICE_NAME=$DEVICE_NAME" >> $GITHUB_ENV
2026-02-21T10:14:02.5326002Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T10:14:02.5326201Z env:
2026-02-21T10:14:02.5326337Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T10:14:02.5326543Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.5326785Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T10:14:02.5327033Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.5327252Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.5327463Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.5327832Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T10:14:02.5328209Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T10:14:02.5328428Z ##[endgroup]
2026-02-21T10:14:02.5972699Z + command -v nvidia-smi
2026-02-21T10:14:02.5972916Z + DEVICE_NAME=cuda
2026-02-21T10:14:02.5973071Z + nvidia-smi
2026-02-21T10:14:02.5973234Z /usr/bin/nvidia-smi
2026-02-21T10:14:02.6138739Z Sat Feb 21 10:14:02 2026       
2026-02-21T10:14:02.6140501Z +-----------------------------------------------------------------------------------------+
2026-02-21T10:14:02.6140974Z | NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
2026-02-21T10:14:02.6141387Z +-----------------------------------------+------------------------+----------------------+
2026-02-21T10:14:02.6141843Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2026-02-21T10:14:02.6142653Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2026-02-21T10:14:02.6142981Z |                                         |                        |               MIG M. |
2026-02-21T10:14:02.6143278Z |=========================================+========================+======================|
2026-02-21T10:14:02.6243819Z |   0  NVIDIA B200                    Off |   00000000:52:00.0 Off |                    0 |
2026-02-21T10:14:02.6245869Z | N/A   33C    P0            191W /  750W |       0MiB / 183359MiB |      0%      Default |
2026-02-21T10:14:02.6246188Z |                                         |                        |             Disabled |
2026-02-21T10:14:02.6246488Z +-----------------------------------------+------------------------+----------------------+
2026-02-21T10:14:02.6246679Z 
2026-02-21T10:14:02.6246815Z +-----------------------------------------------------------------------------------------+
2026-02-21T10:14:02.6247126Z | Processes:                                                                              |
2026-02-21T10:14:02.6247429Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2026-02-21T10:14:02.6247932Z |        ID   ID                                                               Usage      |
2026-02-21T10:14:02.6248180Z |=========================================================================================|
2026-02-21T10:14:02.6248467Z |  No running processes found                                                             |
2026-02-21T10:14:02.6248776Z +-----------------------------------------------------------------------------------------+
2026-02-21T10:14:02.6565158Z + echo DEVICE_NAME=cuda
2026-02-21T10:14:02.6607435Z ##[group]Run set -eux
2026-02-21T10:14:02.6607624Z set -eux
2026-02-21T10:14:02.6607769Z
2026-02-21T10:14:02.6607918Z if [[ "${DEVICE_NAME}" == "cuda" ]]; then
2026-02-21T10:14:02.6608149Z   # Return the same device name as PyTorch
2026-02-21T10:14:02.6608447Z   DEVICE_TYPE=$(nvidia-smi -i 0 --query-gpu=name --format=csv,noheader)
2026-02-21T10:14:02.6608732Z elif [[ "${DEVICE_NAME}" == "rocm" ]]; then
2026-02-21T10:14:02.6609051Z   DEVICE_TYPE=$(rocminfo | grep "Marketing Name" | tail -n1 | awk -F':' '{print $2}' | xargs)
2026-02-21T10:14:02.6609363Z elif [[ "${DEVICE_NAME}" == "hpu" ]]; then
2026-02-21T10:14:02.6609717Z   DEVICE_TYPE="Intel Gaudi3 "$(hl-smi -q | grep "Product Name" | head -n 1 | awk -F ':' '{print $2}' | sed 's/^ *//')
2026-02-21T10:14:02.6610063Z elif [[ "${DEVICE_NAME}" == "cpu" ]]; then
2026-02-21T10:14:02.6610739Z   DEVICE_TYPE="$(lscpu | grep "Model name" | sed -E 's/.*Model name:[[:space:]]*//; s/Intel\(R\)//g; s/\(R\)//g; s/\(TM\)//g; s/CPU//g; s/Processor//g; s/[[:space:]]+/ /g; s/^ //; s/ $//; s/ /_/g')_$(awk -F: '/Core\(s\) per socket/ {c=$2} /Socket\(s\)/ {s=$2} END {gsub(/ /,"",c); gsub(/ /,"",s); printf "%sc", c*s}' < <(lscpu))"
2026-02-21T10:14:02.6611420Z elif [[ "${DEVICE_NAME}" == "arm64-cpu" ]]; then
2026-02-21T10:14:02.6611814Z   DEVICE_TYPE=$(lscpu | grep 'Vendor ID' | cut -f 2 -d ":" | awk '{$1=$1}1' | cut -f 2 -d " ")
2026-02-21T10:14:02.6612096Z fi
2026-02-21T10:14:02.6612270Z echo "DEVICE_TYPE=$DEVICE_TYPE" >> $GITHUB_ENV
2026-02-21T10:14:02.6612561Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T10:14:02.6612757Z env:
2026-02-21T10:14:02.6612890Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T10:14:02.6613089Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.6613330Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T10:14:02.6613570Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.6613788Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.6613996Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.6614417Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T10:14:02.6614842Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T10:14:02.6615061Z   DEVICE_NAME: cuda
2026-02-21T10:14:02.6615198Z ##[endgroup]
2026-02-21T10:14:02.7253576Z + [[ cuda == \c\u\d\a ]]
2026-02-21T10:14:02.7253835Z ++ nvidia-smi -i 0 --query-gpu=name --format=csv,noheader
2026-02-21T10:14:02.7448363Z + DEVICE_TYPE='NVIDIA B200'
2026-02-21T10:14:02.7452647Z + echo 'DEVICE_TYPE=NVIDIA B200'
2026-02-21T10:14:02.7502000Z ##[group]Run set -eux
2026-02-21T10:14:02.7502176Z set -eux
2026-02-21T10:14:02.7502310Z
2026-02-21T10:14:02.7502481Z if [[ -n ".venv/bin/activate" ]]; then
2026-02-21T10:14:02.7502685Z   source ".venv/bin/activate"
2026-02-21T10:14:02.7502862Z fi
2026-02-21T10:14:02.7502983Z
2026-02-21T10:14:02.7503182Z python3 -mpip install psutil==7.0.0 nvidia-ml-py==13.580.82
2026-02-21T10:14:02.7503538Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_runners_info.py"
2026-02-21T10:14:02.7503882Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T10:14:02.7504079Z env:
2026-02-21T10:14:02.7504276Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T10:14:02.7504488Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.7504743Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T10:14:02.7505006Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.7505241Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.7505462Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:02.7505842Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T10:14:02.7506239Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T10:14:02.7506461Z   DEVICE_NAME: cuda
2026-02-21T10:14:02.7506605Z   DEVICE_TYPE: NVIDIA B200
2026-02-21T10:14:02.7506767Z ##[endgroup]
2026-02-21T10:14:02.8065770Z + [[ -n .venv/bin/activate ]]
2026-02-21T10:14:02.8066065Z + source .venv/bin/activate
2026-02-21T10:14:02.8066259Z ++ '[' -z '' ']'
2026-02-21T10:14:02.8066420Z ++ '[' -n x ']'
2026-02-21T10:14:02.8066597Z ++ SCRIPT_PATH=.venv/bin/activate
2026-02-21T10:14:02.8066943Z ++ '[' .venv/bin/activate = /__w/_temp/6601c08a-5e44-423e-b852-79dcb39b8f99.sh ']'
2026-02-21T10:14:02.8067252Z ++ deactivate nondestructive
2026-02-21T10:14:02.8067456Z ++ unset -f pydoc
2026-02-21T10:14:02.8067624Z ++ '[' -z '' ']'
2026-02-21T10:14:02.8067776Z ++ '[' -z '' ']'
2026-02-21T10:14:02.8067935Z ++ hash -r
2026-02-21T10:14:02.8068078Z ++ '[' -z '' ']'
2026-02-21T10:14:02.8068239Z ++ unset VIRTUAL_ENV
2026-02-21T10:14:02.8068414Z ++ unset VIRTUAL_ENV_PROMPT
2026-02-21T10:14:02.8068635Z ++ '[' '!' nondestructive = nondestructive ']'
2026-02-21T10:14:02.8068877Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv
2026-02-21T10:14:02.8069119Z ++ '[' linux-gnu = cygwin ']'
2026-02-21T10:14:02.8069334Z ++ '[' linux-gnu = msys ']'
2026-02-21T10:14:02.8070932Z ++ export VIRTUAL_ENV
2026-02-21T10:14:02.8071121Z ++ '[' -z '' ']'
2026-02-21T10:14:02.8071293Z ++ unset SCRIPT_PATH
2026-02-21T10:14:02.8072035Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T10:14:02.8073257Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T10:14:02.8073990Z ++ export PATH
2026-02-21T10:14:02.8074149Z ++ '[' xhelion '!=' x ']'
2026-02-21T10:14:02.8074324Z ++ VIRTUAL_ENV_PROMPT=helion
2026-02-21T10:14:02.8074515Z ++ export VIRTUAL_ENV_PROMPT
2026-02-21T10:14:02.8075810Z ++ '[' -z '' ']'
2026-02-21T10:14:02.8076040Z ++ '[' -z '' ']'
2026-02-21T10:14:02.8076186Z ++ _OLD_VIRTUAL_PS1=
2026-02-21T10:14:02.8076348Z ++ PS1='(helion) '
2026-02-21T10:14:02.8076501Z ++ export PS1
2026-02-21T10:14:02.8076651Z ++ alias pydoc
2026-02-21T10:14:02.8076802Z ++ true
2026-02-21T10:14:02.8076939Z ++ hash -r
2026-02-21T10:14:02.8077147Z + python3 -mpip install psutil==7.0.0 nvidia-ml-py==13.580.82
2026-02-21T10:14:03.4734263Z Collecting psutil==7.0.0
2026-02-21T10:14:03.5913583Z   Downloading psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
2026-02-21T10:14:03.6134487Z Collecting nvidia-ml-py==13.580.82
2026-02-21T10:14:03.6211042Z   Downloading nvidia_ml_py-13.580.82-py3-none-any.whl.metadata (9.6 kB)
2026-02-21T10:14:03.6332930Z Downloading psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (277 kB)
2026-02-21T10:14:03.6590878Z Downloading nvidia_ml_py-13.580.82-py3-none-any.whl (49 kB)
2026-02-21T10:14:03.7437579Z Installing collected packages: nvidia-ml-py, psutil
2026-02-21T10:14:03.7445127Z   Attempting uninstall: nvidia-ml-py
2026-02-21T10:14:03.7466433Z     Found existing installation: nvidia-ml-py 13.590.48
2026-02-21T10:14:03.7478587Z     Uninstalling nvidia-ml-py-13.590.48:
2026-02-21T10:14:03.8118479Z       Successfully uninstalled nvidia-ml-py-13.590.48
2026-02-21T10:14:03.8583780Z   Attempting uninstall: psutil
2026-02-21T10:14:03.8612810Z     Found existing installation: psutil 7.2.2
2026-02-21T10:14:03.8626674Z     Uninstalling psutil-7.2.2:
2026-02-21T10:14:03.8628899Z       Successfully uninstalled psutil-7.2.2
2026-02-21T10:14:03.9789849Z 
2026-02-21T10:14:03.9823149Z Successfully installed nvidia-ml-py-13.580.82 psutil-7.0.0
2026-02-21T10:14:04.1136576Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/gather-runners-info/../../scripts/benchmarks/gather_runners_info.py
2026-02-21T10:14:05.8259606Z ##[group]Run pytorch/test-infra/.github/actions/gather-dependencies@main
2026-02-21T10:14:05.8259916Z with:
2026-02-21T10:14:05.8260063Z   venv: .venv/bin/activate
2026-02-21T10:14:05.8260214Z env:
2026-02-21T10:14:05.8260372Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T10:14:05.8260574Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:05.8260830Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T10:14:05.8261072Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:05.8261303Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:05.8261527Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:05.8261986Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T10:14:05.8262384Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T10:14:05.8262596Z   DEVICE_NAME: cuda
2026-02-21T10:14:05.8262856Z   DEVICE_TYPE: NVIDIA B200
2026-02-21T10:14:05.8263021Z ##[endgroup]
2026-02-21T10:14:05.8271912Z ##[group]Run set -eux
2026-02-21T10:14:05.8272091Z set -eux
2026-02-21T10:14:05.8272236Z
2026-02-21T10:14:05.8272395Z # TODO (huydhn): Implement this part
2026-02-21T10:14:05.8272630Z echo "dependencies={}" >> "${GITHUB_OUTPUT}"
2026-02-21T10:14:05.8272940Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T10:14:05.8273135Z env:
2026-02-21T10:14:05.8273278Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T10:14:05.8273473Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:05.8273725Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T10:14:05.8273971Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:05.8274183Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:05.8274408Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:05.8274773Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T10:14:05.8275269Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T10:14:05.8275493Z   DEVICE_NAME: cuda
2026-02-21T10:14:05.8275658Z   DEVICE_TYPE: NVIDIA B200
2026-02-21T10:14:05.8275835Z ##[endgroup]
2026-02-21T10:14:05.8819428Z + echo 'dependencies={}'
2026-02-21T10:14:05.8871311Z ##[group]Run actions/upload-artifact@v6
2026-02-21T10:14:05.8871507Z with:
2026-02-21T10:14:05.8871733Z   name: benchmark-results-b200-softmax
2026-02-21T10:14:05.8871925Z   path: test/test-reports
2026-02-21T10:14:05.8872098Z   if-no-files-found: warn
2026-02-21T10:14:05.8872265Z   compression-level: 6
2026-02-21T10:14:05.8872433Z   overwrite: false
2026-02-21T10:14:05.8872593Z   include-hidden-files: false
2026-02-21T10:14:05.8872775Z env:
2026-02-21T10:14:05.8872918Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T10:14:05.8873140Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:05.8873404Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T10:14:05.8873684Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:05.8873911Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:05.8874126Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T10:14:05.8874625Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T10:14:05.8875015Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T10:14:05.8875252Z   DEVICE_NAME: cuda
2026-02-21T10:14:05.8875405Z   DEVICE_TYPE: NVIDIA B200
2026-02-21T10:14:05.8875583Z ##[endgroup]
2026-02-21T10:14:05.8877789Z ##[command]/usr/bin/docker exec  2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T10:14:06.1183033Z With the provided path, there will be 1 file uploaded
2026-02-21T10:14:06.1188371Z Artifact name is valid!
2026-02-21T10:14:06.1188647Z Root directory input is valid!
2026-02-21T10:14:06.3901997Z Beginning upload of artifact content to blob storage
2026-02-21T10:14:06.7572908Z Uploaded bytes 1090
2026-02-21T10:14:06.8517183Z Finished uploading artifact content to blob storage!
2026-02-21T10:14:06.8518779Z SHA256 digest of uploaded artifact zip is ff7e00cf30fa3c0a5eaec5360e821f9d3995630ca5e7325469adbcc103e6907d
2026-02-21T10:14:06.8519261Z Finalizing artifact upload
2026-02-21T10:14:07.1586852Z Artifact benchmark-results-b200-softmax.zip successfully finalized. Artifact ID 5600810980
2026-02-21T10:14:07.1587426Z Artifact benchmark-results-b200-softmax has been successfully uploaded! Final size is 1090 bytes. Artifact ID is 5600810980
2026-02-21T10:14:07.1587996Z Artifact download URL: https://github.com/pytorch/helion/actions/runs/22253280836/artifacts/5600810980
2026-02-21T10:14:07.1805568Z Post job cleanup.
2026-02-21T10:14:07.1809514Z ##[command]/usr/bin/docker exec  2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T10:14:07.3989757Z (node:388805) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
2026-02-21T10:14:07.3992619Z UV_PYTHON_INSTALL_DIR is already set to /github/home/.local/share/uv/python
2026-02-21T10:14:07.3993073Z (Use `node --trace-deprecation ...` to show where the warning was created)
2026-02-21T10:14:07.4139379Z Post job cleanup.
2026-02-21T10:14:07.4142057Z ##[command]/usr/bin/docker exec  2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T10:14:07.6382994Z Post job cleanup.
2026-02-21T10:14:07.6386372Z ##[command]/usr/bin/docker exec  2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T10:14:07.8198226Z [command]/usr/bin/git version
2026-02-21T10:14:07.8236782Z git version 2.43.0
2026-02-21T10:14:07.8270711Z Temporarily overriding HOME='/__w/_temp/08fd4e3a-806a-43b1-bcb8-1e2238f47cf5' before making global git config changes
2026-02-21T10:14:07.8274876Z Adding repository directory to the temporary git global config as a safe directory
2026-02-21T10:14:07.8280462Z [command]/usr/bin/git config --global --add safe.directory /__w/helion/helion
2026-02-21T10:14:07.8320364Z Removing SSH command configuration
2026-02-21T10:14:07.8323883Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2026-02-21T10:14:07.8366053Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2026-02-21T10:14:07.8624718Z Removing HTTP extra header
2026-02-21T10:14:07.8667632Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2026-02-21T10:14:07.8689861Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2026-02-21T10:14:07.8926142Z Removing includeIf entries pointing to credentials config files
2026-02-21T10:14:07.8926561Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir:
2026-02-21T10:14:07.8956763Z includeif.gitdir:/__w/helion/helion/.git.path
2026-02-21T10:14:07.8960457Z includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path
2026-02-21T10:14:07.8964619Z includeif.gitdir:/github/workspace/.git.path
2026-02-21T10:14:07.8969668Z includeif.gitdir:/github/workspace/.git/worktrees/*.path
2026-02-21T10:14:07.8978369Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/__w/helion/helion/.git.path
2026-02-21T10:14:07.8985492Z /__w/_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config
2026-02-21T10:14:07.8997252Z [command]/usr/bin/git config --local --unset includeif.gitdir:/__w/helion/helion/.git.path /__w/_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config
2026-02-21T10:14:07.9031003Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path
2026-02-21T10:14:07.9043878Z /__w/_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config
2026-02-21T10:14:07.9055569Z [command]/usr/bin/git config --local --unset includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path /__w/_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config
2026-02-21T10:14:07.9085603Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/github/workspace/.git.path
2026-02-21T10:14:07.9094089Z /github/runner_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config
2026-02-21T10:14:07.9097996Z [command]/usr/bin/git config --local --unset includeif.gitdir:/github/workspace/.git.path /github/runner_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config
2026-02-21T10:14:07.9125628Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/github/workspace/.git/worktrees/*.path
2026-02-21T10:14:07.9145987Z /github/runner_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config
2026-02-21T10:14:07.9159803Z [command]/usr/bin/git config --local --unset includeif.gitdir:/github/workspace/.git/worktrees/*.path /github/runner_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config
2026-02-21T10:14:07.9185403Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url
2026-02-21T10:14:07.9419696Z Removing credentials config '/__w/_temp/git-credentials-99122761-28fd-49a4-807f-34b8089b58f0.config'
2026-02-21T10:14:07.9538287Z Stop and remove container: 6d984ead33f845ac9a028d8d082e23df_nvidiacuda1301develubuntu2404_d3efdf
2026-02-21T10:14:07.9541876Z ##[command]/usr/bin/docker rm --force 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66
2026-02-21T10:14:13.2385143Z 2d7de649dbc43c065ac2860f1e34584faf32fd9dbac714815c5d7907a4fecb66
2026-02-21T10:14:13.2422617Z Remove container network: github_network_1dabb68bcff447bd84adae5308b06429
2026-02-21T10:14:13.2425347Z ##[command]/usr/bin/docker network rm github_network_1dabb68bcff447bd84adae5308b06429
2026-02-21T10:14:13.3362642Z github_network_1dabb68bcff447bd84adae5308b06429
2026-02-21T10:14:13.3407047Z Evaluate and set job outputs
2026-02-21T10:14:13.3411964Z Set output 'benchmark-metadata'
2026-02-21T10:14:13.3413424Z Set output 'runners-info'
2026-02-21T10:14:13.3413931Z Set output 'dependencies'
2026-02-21T10:14:13.3414327Z Cleaning up orphan processes