2026-02-21T08:04:43.3368283Z Current runner version: '2.331.0'
2026-02-21T08:04:43.3372126Z Runner name: 'dgxb200-04-1007'
2026-02-21T08:04:43.3372643Z Runner group name: 'default'
2026-02-21T08:04:43.3373259Z Machine name: 'a3bc1758654d'
2026-02-21T08:04:43.3374906Z ##[group]GITHUB_TOKEN Permissions
2026-02-21T08:04:43.3376411Z Contents: read
2026-02-21T08:04:43.3376795Z Metadata: read
2026-02-21T08:04:43.3377191Z ##[endgroup]
2026-02-21T08:04:43.3378625Z Secret source: Actions
2026-02-21T08:04:43.3379112Z Prepare workflow directory
2026-02-21T08:04:43.3740661Z Prepare all required actions
2026-02-21T08:04:43.3769110Z Getting action download info
2026-02-21T08:04:43.8293574Z Download action repository 'actions/checkout@v6' (SHA:de0fac2e4500dabe0009e67214ff5f5447ce83dd)
2026-02-21T08:04:44.1523697Z Download action repository 'actions/setup-python@v6' (SHA:a309ff8b426b58ec0e2a45f0f869d46889d02405)
2026-02-21T08:04:44.5027040Z Download action repository 'astral-sh/setup-uv@v7' (SHA:eac588ad8def6316056a12d4907a9d4d84ff7a3b)
2026-02-21T08:04:44.8777290Z Download action repository 'pytorch/test-infra@main' (SHA:bb8f04ff3961233c844fde6533c7c6c5f0857909)
2026-02-21T08:04:45.5042756Z Download action repository 'actions/upload-artifact@v6' (SHA:b7c566a772e6b6bfb58ed0dc250532a479d7789f)
2026-02-21T08:04:45.9823539Z Getting action download info
2026-02-21T08:04:46.2022857Z Uses: pytorch/helion/.github/workflows/benchmark.yml@refs/heads/main (874a7d0cadab18218a84ad3579d329dc95c51820)
2026-02-21T08:04:46.2026539Z ##[group] Inputs
2026-02-21T08:04:46.2027208Z   runner: linux.dgx.b200
2026-02-21T08:04:46.2027895Z   python-version: 3.12
2026-02-21T08:04:46.2028570Z   image: nvidia/cuda:13.0.1-devel-ubuntu24.04
2026-02-21T08:04:46.2029387Z   runtime-version: cu130
2026-02-21T08:04:46.2030085Z   container-options: --gpus all
2026-02-21T08:04:46.2030725Z   alias: b200
2026-02-21T08:04:46.2031304Z   kernels: kl_div
2026-02-21T08:04:46.2031921Z   env-vars: 
2026-02-21T08:04:46.2032472Z   custom-args: 
2026-02-21T08:04:46.2033252Z   run_h100: true
2026-02-21T08:04:46.2033902Z   run_b200: true
2026-02-21T08:04:46.2034443Z   run_mi325x: true
2026-02-21T08:04:46.2035004Z ##[endgroup]
2026-02-21T08:04:46.2035842Z Complete job name: run-b200 (kl_div) / benchmark-cu130-kl_div-py3.12-b200
2026-02-21T08:04:46.2338073Z ##[group]Checking docker version
2026-02-21T08:04:46.2348010Z ##[command]/usr/bin/docker version --format '{{.Server.APIVersion}}'
2026-02-21T08:04:46.3358808Z '1.53'
2026-02-21T08:04:46.3379514Z Docker daemon API version: '1.53'
2026-02-21T08:04:46.3380478Z ##[command]/usr/bin/docker version --format '{{.Client.APIVersion}}'
2026-02-21T08:04:46.4254851Z '1.52'
2026-02-21T08:04:46.4278316Z Docker client API version: '1.52'
2026-02-21T08:04:46.4282964Z ##[endgroup]
2026-02-21T08:04:46.4285173Z ##[group]Clean up resources from previous jobs
2026-02-21T08:04:46.4289359Z ##[command]/usr/bin/docker ps --all --quiet --no-trunc --filter "label=f446d1"
2026-02-21T08:04:46.4416515Z ##[command]/usr/bin/docker network prune --force --filter "label=f446d1"
2026-02-21T08:04:46.4535132Z ##[endgroup]
2026-02-21T08:04:46.4535817Z ##[group]Create local container network
2026-02-21T08:04:46.4543659Z ##[command]/usr/bin/docker network create --label f446d1 github_network_635fe730ba6e422c803643553ff1a973
2026-02-21T08:04:46.9217946Z 7031020d6c0021ee991d36b442e7325c86524e11541bb3c152cea95d7e7fb78f
2026-02-21T08:04:46.9241137Z ##[endgroup]
2026-02-21T08:04:46.9262699Z ##[group]Starting job container
2026-02-21T08:04:46.9279375Z ##[command]/usr/bin/docker pull nvidia/cuda:13.0.1-devel-ubuntu24.04
2026-02-21T08:04:47.7508419Z 13.0.1-devel-ubuntu24.04: Pulling from nvidia/cuda
2026-02-21T08:04:48.0466017Z 1cd98a0b9132: Pulling fs layer
2026-02-21T08:04:48.0470152Z eea924c2c8fb: Pulling fs layer
2026-02-21T08:04:48.0474994Z afcf80b42416: Pulling fs layer
2026-02-21T08:04:48.0476203Z e93dd1223ff5: Pulling fs layer
2026-02-21T08:04:48.0476464Z 76249c7cd503: Pulling fs layer
2026-02-21T08:04:48.0476762Z c20926c42231: Pulling fs layer
2026-02-21T08:04:48.0477072Z c03b8ec8dd33: Pulling fs layer
2026-02-21T08:04:48.0477644Z d7913b78456a: Pulling fs layer
2026-02-21T08:04:48.0477901Z 8fb7ecb711ef: Pulling fs layer
2026-02-21T08:04:48.0478186Z ab7341a40ee7: Pulling fs layer
2026-02-21T08:04:48.0478409Z 401d11fb2a09: Pulling fs layer
2026-02-21T08:04:48.2432361Z 8fb7ecb711ef: Download complete
2026-02-21T08:04:48.2436722Z 1cd98a0b9132: Download complete
2026-02-21T08:04:48.3428527Z afcf80b42416: Download complete
2026-02-21T08:04:48.3431248Z c20926c42231: Download complete
2026-02-21T08:04:48.3438728Z c03b8ec8dd33: Download complete
2026-02-21T08:04:48.3444090Z d7913b78456a: Download complete
2026-02-21T08:04:48.4421893Z 401d11fb2a09: Download complete
2026-02-21T08:04:48.8422883Z 76249c7cd503: Download complete
2026-02-21T08:04:49.9423896Z ab7341a40ee7: Download complete
2026-02-21T08:04:49.9452086Z 76249c7cd503: Pull complete
2026-02-21T08:04:59.1420091Z eea924c2c8fb: Download complete
2026-02-21T08:05:01.6472411Z 401d11fb2a09: Pull complete
2026-02-21T08:05:06.1431667Z e93dd1223ff5: Download complete
2026-02-21T08:05:06.2457036Z c03b8ec8dd33: Pull complete
2026-02-21T08:05:06.2461983Z d7913b78456a: Pull complete
2026-02-21T08:05:06.2466800Z ab7341a40ee7: Pull complete
2026-02-21T08:05:22.6450886Z 8fb7ecb711ef: Pull complete
2026-02-21T08:05:22.6457911Z afcf80b42416: Pull complete
2026-02-21T08:05:22.6458366Z c20926c42231: Pull complete
2026-02-21T08:05:22.6461983Z eea924c2c8fb: Pull complete
2026-02-21T08:06:01.3445994Z e93dd1223ff5: Pull complete
2026-02-21T08:06:01.3653968Z 1cd98a0b9132: Pull complete
2026-02-21T08:06:01.3654572Z Digest: sha256:7d2f6a8c2071d911524f95061a0db363e24d27aa51ec831fcccf9e76eb72bc92
2026-02-21T08:06:01.3659936Z Status: Downloaded newer image for nvidia/cuda:13.0.1-devel-ubuntu24.04
2026-02-21T08:06:01.3660834Z docker.io/nvidia/cuda:13.0.1-devel-ubuntu24.04
2026-02-21T08:06:01.3731170Z ##[command]/usr/bin/docker create --name d92e73d387994e1e949d78541a22b449_nvidiacuda1301develubuntu2404_f71b97 --label f446d1 --workdir /__w/helion/helion --network github_network_635fe730ba6e422c803643553ff1a973 --gpus all -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/eve/_work":"/__w" -v "/home/eve/externals":"/__e":ro -v "/home/eve/_work/_temp":"/__w/_temp" -v "/home/eve/_work/_actions":"/__w/_actions" -v "/home/eve/_work/_tool":"/__w/_tool" -v "/home/eve/_work/_temp/_github_home":"/github/home" -v "/home/eve/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" nvidia/cuda:13.0.1-devel-ubuntu24.04 "-f" "/dev/null"
2026-02-21T08:06:01.4372048Z 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12
2026-02-21T08:06:01.4396375Z ##[command]/usr/bin/docker start 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12
2026-02-21T08:06:02.0761501Z 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12
2026-02-21T08:06:02.0782664Z ##[command]/usr/bin/docker ps --all --filter id=227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 --filter status=running --no-trunc --format "{{.ID}} {{.Status}}"
2026-02-21T08:06:02.0937938Z 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 Up Less than a second
2026-02-21T08:06:02.0958792Z ##[command]/usr/bin/docker inspect --format "{{range .Config.Env}}{{println .}}{{end}}" 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12
2026-02-21T08:06:02.1059325Z HOME=/github/home
2026-02-21T08:06:02.1059637Z GITHUB_ACTIONS=true
2026-02-21T08:06:02.1059865Z CI=true
2026-02-21T08:06:02.1060386Z PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T08:06:02.1060792Z NVARCH=x86_64
2026-02-21T08:06:02.1065895Z NVIDIA_REQUIRE_CUDA=cuda>=13.0 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571 brand=unknown,driver>=575,driver<576 brand=grid,driver>=575,driver<576 brand=tesla,driver>=575,driver<576 brand=nvidia,driver>=575,driver<576 brand=quadro,driver>=575,driver<576 brand=quadrortx,driver>=575,driver<576 brand=nvidiartx,driver>=575,driver<576 brand=vapps,driver>=575,driver<576 brand=vpc,driver>=575,driver<576 brand=vcs,driver>=575,driver<576 brand=vws,driver>=575,driver<576 brand=cloudgaming,driver>=575,driver<576
2026-02-21T08:06:02.1071292Z NV_CUDA_CUDART_VERSION=13.0.88-1
2026-02-21T08:06:02.1071574Z CUDA_VERSION=13.0.1
2026-02-21T08:06:02.1071932Z LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:06:02.1072378Z NVIDIA_VISIBLE_DEVICES=all
2026-02-21T08:06:02.1072670Z NVIDIA_DRIVER_CAPABILITIES=compute,utility
2026-02-21T08:06:02.1072957Z NV_CUDA_LIB_VERSION=13.0.1-1
2026-02-21T08:06:02.1073393Z NV_NVTX_VERSION=13.0.85-1
2026-02-21T08:06:02.1073628Z NV_LIBNPP_VERSION=13.0.1.2-1
2026-02-21T08:06:02.1073921Z NV_LIBNPP_PACKAGE=libnpp-13-0=13.0.1.2-1
2026-02-21T08:06:02.1074193Z NV_LIBCUSPARSE_VERSION=12.6.3.3-1
2026-02-21T08:06:02.1074563Z NV_LIBCUBLAS_PACKAGE_NAME=libcublas-13-0
2026-02-21T08:06:02.1074940Z NV_LIBCUBLAS_VERSION=13.0.2.14-1
2026-02-21T08:06:02.1075373Z NV_LIBCUBLAS_PACKAGE=libcublas-13-0=13.0.2.14-1
2026-02-21T08:06:02.1075816Z NV_LIBNCCL_PACKAGE_NAME=libnccl2
2026-02-21T08:06:02.1076162Z NV_LIBNCCL_PACKAGE_VERSION=2.28.3-1
2026-02-21T08:06:02.1076572Z NCCL_VERSION=2.28.3-1
2026-02-21T08:06:02.1076887Z NV_LIBNCCL_PACKAGE=libnccl2=2.28.3-1+cuda13.0
2026-02-21T08:06:02.1077304Z NVIDIA_PRODUCT_NAME=CUDA
2026-02-21T08:06:02.1077587Z NV_CUDA_CUDART_DEV_VERSION=13.0.88-1
2026-02-21T08:06:02.1078031Z NV_NVML_DEV_VERSION=13.0.87-1
2026-02-21T08:06:02.1078419Z NV_LIBCUSPARSE_DEV_VERSION=12.6.3.3-1
2026-02-21T08:06:02.1078746Z NV_LIBNPP_DEV_VERSION=13.0.1.2-1
2026-02-21T08:06:02.1079210Z NV_LIBNPP_DEV_PACKAGE=libnpp-dev-13-0=13.0.1.2-1
2026-02-21T08:06:02.1079661Z NV_LIBCUBLAS_DEV_VERSION=13.0.2.14-1
2026-02-21T08:06:02.1080107Z NV_LIBCUBLAS_DEV_PACKAGE_NAME=libcublas-dev-13-0
2026-02-21T08:06:02.1080568Z NV_LIBCUBLAS_DEV_PACKAGE=libcublas-dev-13-0=13.0.2.14-1
2026-02-21T08:06:02.1081026Z NV_CUDA_NSIGHT_COMPUTE_VERSION=13.0.1-1
2026-02-21T08:06:02.1081491Z NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE=cuda-nsight-compute-13-0=13.0.1-1
2026-02-21T08:06:02.1082047Z NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
2026-02-21T08:06:02.1082441Z NV_LIBNCCL_DEV_PACKAGE_VERSION=2.28.3-1
2026-02-21T08:06:02.1082794Z NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.28.3-1+cuda13.0
2026-02-21T08:06:02.1083170Z LIBRARY_PATH=/usr/local/cuda/lib64/stubs
2026-02-21T08:06:02.1089072Z ##[endgroup]
2026-02-21T08:06:02.1096761Z ##[group]Waiting for all services to be ready
2026-02-21T08:06:02.1098264Z ##[endgroup]
2026-02-21T08:06:02.1236773Z ##[group]Run echo "Detected NVIDIA image"
2026-02-21T08:06:02.1237100Z echo "Detected NVIDIA image"
2026-02-21T08:06:02.1237421Z nvidia-smi || echo "nvidia-smi not found"
2026-02-21T08:06:02.1239606Z shell: bash -l {0}
2026-02-21T08:06:02.1239972Z env:
2026-02-21T08:06:02.1240159Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:02.1240459Z ##[endgroup]
2026-02-21T08:06:02.1902180Z Detected NVIDIA image
2026-02-21T08:06:02.2177926Z Sat Feb 21 08:06:02 2026       
2026-02-21T08:06:02.2179673Z +-----------------------------------------------------------------------------------------+
2026-02-21T08:06:02.2180134Z | NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
2026-02-21T08:06:02.2180572Z +-----------------------------------------+------------------------+----------------------+
2026-02-21T08:06:02.2181140Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2026-02-21T08:06:02.2181663Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2026-02-21T08:06:02.2182272Z |                                         |                        |               MIG M. |
2026-02-21T08:06:02.2182575Z |=========================================+========================+======================|
2026-02-21T08:06:02.2264178Z |   0  NVIDIA B200                    Off |   00000000:9D:00.0 Off |                    0 |
2026-02-21T08:06:02.2264717Z | N/A   30C    P0            141W /  750W |       0MiB / 183359MiB |      0%      Default |
2026-02-21T08:06:02.2265149Z |                                         |                        |             Disabled |
2026-02-21T08:06:02.2265548Z +-----------------------------------------+------------------------+----------------------+
2026-02-21T08:06:02.2265752Z 
2026-02-21T08:06:02.2265987Z +-----------------------------------------------------------------------------------------+
2026-02-21T08:06:02.2266464Z | Processes:                                                                              |
2026-02-21T08:06:02.2266843Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2026-02-21T08:06:02.2267233Z |        ID   ID                                                               Usage      |
2026-02-21T08:06:02.2267599Z |=========================================================================================|
2026-02-21T08:06:02.2267980Z |  No running processes found                                                             |
2026-02-21T08:06:02.2268428Z +-----------------------------------------------------------------------------------------+
2026-02-21T08:06:02.2653324Z ##[group]Run set -x
2026-02-21T08:06:02.2653603Z set -x
2026-02-21T08:06:02.2653847Z apt-get update
2026-02-21T08:06:02.2654051Z apt-get install -y git
2026-02-21T08:06:02.2654379Z shell: bash -l {0}
2026-02-21T08:06:02.2654606Z env:
2026-02-21T08:06:02.2654779Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:02.2655025Z ##[endgroup]
2026-02-21T08:06:02.3337203Z + apt-get update
2026-02-21T08:06:02.4037990Z Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64  InRelease [1581 B]
2026-02-21T08:06:02.5048299Z Get:2 http://archive.ubuntu.com/ubuntu noble InRelease [256 kB]
2026-02-21T08:06:02.5217282Z Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64  Packages [1218 kB]
2026-02-21T08:06:02.7038209Z Get:4 http://security.ubuntu.com/ubuntu noble-security InRelease [126 kB]
2026-02-21T08:06:02.8853330Z Get:5 http://archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB]
2026-02-21T08:06:02.9798405Z Get:6 http://archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB]
2026-02-21T08:06:03.0879310Z Get:7 http://archive.ubuntu.com/ubuntu noble/universe amd64 Packages [19.3 MB]
2026-02-21T08:06:03.4928632Z Get:8 http://security.ubuntu.com/ubuntu noble-security/universe amd64 Packages [1207 kB]
2026-02-21T08:06:03.6780963Z Get:9 http://archive.ubuntu.com/ubuntu noble/multiverse amd64 Packages [331 kB]
2026-02-21T08:06:03.6803333Z Get:10 http://archive.ubuntu.com/ubuntu noble/restricted amd64 Packages [117 kB]
2026-02-21T08:06:03.6820022Z Get:11 http://archive.ubuntu.com/ubuntu noble/main amd64 Packages [1808 kB]
2026-02-21T08:06:03.7134140Z Get:12 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 Packages [2240 kB]
2026-02-21T08:06:03.7487782Z Get:13 http://archive.ubuntu.com/ubuntu noble-updates/restricted amd64 Packages [3381 kB]
2026-02-21T08:06:03.8063516Z Get:14 http://archive.ubuntu.com/ubuntu noble-updates/universe amd64 Packages [2016 kB]
2026-02-21T08:06:03.8349289Z Get:15 http://archive.ubuntu.com/ubuntu noble-updates/multiverse amd64 Packages [38.1 kB]
2026-02-21T08:06:03.8360917Z Get:16 http://archive.ubuntu.com/ubuntu noble-backports/universe amd64 Packages [34.6 kB]
2026-02-21T08:06:03.8367210Z Get:17 http://archive.ubuntu.com/ubuntu noble-backports/main amd64 Packages [49.5 kB]
2026-02-21T08:06:04.2925413Z Get:18 http://security.ubuntu.com/ubuntu noble-security/multiverse amd64 Packages [34.8 kB]
2026-02-21T08:06:04.2926078Z Get:19 http://security.ubuntu.com/ubuntu noble-security/restricted amd64 Packages [3196 kB]
2026-02-21T08:06:04.6281213Z Get:20 http://security.ubuntu.com/ubuntu noble-security/main amd64 Packages [1857 kB]
2026-02-21T08:06:04.8164375Z Fetched 37.5 MB in 2s (15.3 MB/s)
2026-02-21T08:06:05.5357689Z Reading package lists...
2026-02-21T08:06:05.5507659Z W: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
2026-02-21T08:06:05.5517007Z + apt-get install -y git
2026-02-21T08:06:06.2345827Z Reading package lists...
2026-02-21T08:06:06.4085499Z Building dependency tree...
2026-02-21T08:06:06.4089604Z Reading state information...
2026-02-21T08:06:06.5763568Z The following additional packages will be installed:
2026-02-21T08:06:06.5764302Z   git-man krb5-locales less libbrotli1 libbsd0 libcbor0.10 libcurl3t64-gnutls
2026-02-21T08:06:06.5764985Z   libedit2 liberror-perl libexpat1 libfido2-1 libgssapi-krb5-2 libk5crypto3
2026-02-21T08:06:06.5765442Z   libkeyutils1 libkrb5-3 libkrb5support0 libnghttp2-14 libpsl5t64 librtmp1
2026-02-21T08:06:06.5765917Z   libssh-4 libx11-6 libx11-data libxau6 libxcb1 libxdmcp6 libxext6 libxmuu1
2026-02-21T08:06:06.5767896Z   openssh-client publicsuffix xauth
2026-02-21T08:06:06.5775229Z Suggested packages:
2026-02-21T08:06:06.5775649Z   gettext-base git-daemon-run | git-daemon-sysvinit git-doc git-email git-gui
2026-02-21T08:06:06.5776553Z   gitk gitweb git-cvs git-mediawiki git-svn krb5-doc krb5-user keychain
2026-02-21T08:06:06.5776916Z   libpam-ssh monkeysphere ssh-askpass
2026-02-21T08:06:06.6160387Z The following NEW packages will be installed:
2026-02-21T08:06:06.6162433Z   git git-man krb5-locales less libbrotli1 libbsd0 libcbor0.10
2026-02-21T08:06:06.6162934Z   libcurl3t64-gnutls libedit2 liberror-perl libexpat1 libfido2-1
2026-02-21T08:06:06.6163324Z   libgssapi-krb5-2 libk5crypto3 libkeyutils1 libkrb5-3 libkrb5support0
2026-02-21T08:06:06.6163857Z   libnghttp2-14 libpsl5t64 librtmp1 libssh-4 libx11-6 libx11-data libxau6
2026-02-21T08:06:06.6164249Z   libxcb1 libxdmcp6 libxext6 libxmuu1 openssh-client publicsuffix xauth
2026-02-21T08:06:06.9720263Z 0 upgraded, 31 newly installed, 0 to remove and 86 not upgraded.
2026-02-21T08:06:06.9720968Z Need to get 8886 kB of archives.
2026-02-21T08:06:06.9721381Z After this operation, 38.0 MB of additional disk space will be used.
2026-02-21T08:06:06.9722438Z Get:1 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 krb5-locales all 1.20.1-6ubuntu2.6 [14.8 kB]
2026-02-21T08:06:07.3153653Z Get:2 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 less amd64 590-2ubuntu2.1 [142 kB]
2026-02-21T08:06:07.7774995Z Get:3 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libbsd0 amd64 0.12.1-1build1.1 [41.2 kB]
2026-02-21T08:06:07.8465778Z Get:4 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libexpat1 amd64 2.6.1-2ubuntu0.4 [88.2 kB]
2026-02-21T08:06:07.9394552Z Get:5 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libkrb5support0 amd64 1.20.1-6ubuntu2.6 [34.4 kB]
2026-02-21T08:06:07.9682883Z Get:6 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libk5crypto3 amd64 1.20.1-6ubuntu2.6 [82.0 kB]
2026-02-21T08:06:08.0239354Z Get:7 http://archive.ubuntu.com/ubuntu noble/main amd64 libkeyutils1 amd64 1.6.3-3build1 [9490 B]
2026-02-21T08:06:08.0287320Z Get:8 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libkrb5-3 amd64 1.20.1-6ubuntu2.6 [348 kB]
2026-02-21T08:06:08.1843644Z Get:9 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libgssapi-krb5-2 amd64 1.20.1-6ubuntu2.6 [143 kB]
2026-02-21T08:06:08.2343739Z Get:10 http://archive.ubuntu.com/ubuntu noble/main amd64 libcbor0.10 amd64 0.10.2-1.2ubuntu2 [25.8 kB]
2026-02-21T08:06:08.2379714Z Get:11 http://archive.ubuntu.com/ubuntu noble/main amd64 libedit2 amd64 3.1-20230828-1build1 [97.6 kB]
2026-02-21T08:06:08.2676490Z Get:12 http://archive.ubuntu.com/ubuntu noble/main amd64 libfido2-1 amd64 1.14.0-1build3 [83.5 kB]
2026-02-21T08:06:08.2807567Z Get:13 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libnghttp2-14 amd64 1.59.0-1ubuntu0.2 [74.3 kB]
2026-02-21T08:06:08.2924571Z Get:14 http://archive.ubuntu.com/ubuntu noble/main amd64 libpsl5t64 amd64 0.21.2-1.1build1 [57.1 kB]
2026-02-21T08:06:08.3023523Z Get:15 http://archive.ubuntu.com/ubuntu noble/main amd64 libxau6 amd64 1:1.0.9-1build6 [7160 B]
2026-02-21T08:06:08.3035612Z Get:16 http://archive.ubuntu.com/ubuntu noble/main amd64 libxdmcp6 amd64 1:1.1.3-0ubuntu6 [10.3 kB]
2026-02-21T08:06:08.3068365Z Get:17 http://archive.ubuntu.com/ubuntu noble/main amd64 libxcb1 amd64 1.15-1ubuntu2 [47.7 kB]
2026-02-21T08:06:08.3524776Z Get:18 http://archive.ubuntu.com/ubuntu noble/main amd64 libx11-data all 2:1.8.7-1build1 [115 kB]
2026-02-21T08:06:08.4107278Z Get:19 http://archive.ubuntu.com/ubuntu noble/main amd64 libx11-6 amd64 2:1.8.7-1build1 [650 kB]
2026-02-21T08:06:08.4849108Z Get:20 http://archive.ubuntu.com/ubuntu noble/main amd64 libxext6 amd64 2:1.3.4-1build2 [30.4 kB]
2026-02-21T08:06:08.4970868Z Get:21 http://archive.ubuntu.com/ubuntu noble/main amd64 libxmuu1 amd64 2:1.1.3-3build2 [8958 B]
2026-02-21T08:06:08.4980376Z Get:22 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 openssh-client amd64 1:9.6p1-3ubuntu13.14 [906 kB]
2026-02-21T08:06:08.5920916Z Get:23 http://archive.ubuntu.com/ubuntu noble/main amd64 publicsuffix all 20231001.0357-0.1 [129 kB]
2026-02-21T08:06:08.6032445Z Get:24 http://archive.ubuntu.com/ubuntu noble/main amd64 xauth amd64 1:1.1.2-1build1 [25.6 kB]
2026-02-21T08:06:08.6060559Z Get:25 http://archive.ubuntu.com/ubuntu noble/main amd64 libbrotli1 amd64 1.1.0-2build2 [331 kB]
2026-02-21T08:06:08.6357989Z Get:26 http://archive.ubuntu.com/ubuntu noble/main amd64 librtmp1 amd64 2.4+20151223.gitfa8646d.1-2build7 [56.3 kB]
2026-02-21T08:06:08.6408388Z Get:27 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libssh-4 amd64 0.10.6-2ubuntu0.3 [190 kB]
2026-02-21T08:06:08.6568880Z Get:28 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libcurl3t64-gnutls amd64 8.5.0-2ubuntu10.6 [333 kB]
2026-02-21T08:06:08.6830526Z Get:29 http://archive.ubuntu.com/ubuntu noble/main amd64 liberror-perl all 0.17029-2 [25.6 kB]
2026-02-21T08:06:08.6847321Z Get:30 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 git-man all 1:2.43.0-1ubuntu7.3 [1100 kB]
2026-02-21T08:06:08.7396742Z Get:31 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 git amd64 1:2.43.0-1ubuntu7.3 [3680 kB]
2026-02-21T08:06:08.9716914Z debconf: delaying package configuration, since apt-utils is not installed
2026-02-21T08:06:08.9923189Z Fetched 8886 kB in 2s (3934 kB/s)
2026-02-21T08:06:09.0055352Z Selecting previously unselected package krb5-locales.
2026-02-21T08:06:09.0118867Z (Reading database ... 15507 files and directories currently installed.)
2026-02-21T08:06:09.0119384Z Preparing to unpack .../00-krb5-locales_1.20.1-6ubuntu2.6_all.deb ...
2026-02-21T08:06:09.0129883Z Unpacking krb5-locales (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:09.0278103Z Selecting previously unselected package less.
2026-02-21T08:06:09.0286577Z Preparing to unpack .../01-less_590-2ubuntu2.1_amd64.deb ...
2026-02-21T08:06:09.0305850Z Unpacking less (590-2ubuntu2.1) ...
2026-02-21T08:06:09.0461596Z Selecting previously unselected package libbsd0:amd64.
2026-02-21T08:06:09.0470415Z Preparing to unpack .../02-libbsd0_0.12.1-1build1.1_amd64.deb ...
2026-02-21T08:06:09.0488803Z Unpacking libbsd0:amd64 (0.12.1-1build1.1) ...
2026-02-21T08:06:09.0629183Z Selecting previously unselected package libexpat1:amd64.
2026-02-21T08:06:09.0636720Z Preparing to unpack .../03-libexpat1_2.6.1-2ubuntu0.4_amd64.deb ...
2026-02-21T08:06:09.0647265Z Unpacking libexpat1:amd64 (2.6.1-2ubuntu0.4) ...
2026-02-21T08:06:09.0789025Z Selecting previously unselected package libkrb5support0:amd64.
2026-02-21T08:06:09.0797332Z Preparing to unpack .../04-libkrb5support0_1.20.1-6ubuntu2.6_amd64.deb ...
2026-02-21T08:06:09.0807427Z Unpacking libkrb5support0:amd64 (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:09.0942930Z Selecting previously unselected package libk5crypto3:amd64.
2026-02-21T08:06:09.0949473Z Preparing to unpack .../05-libk5crypto3_1.20.1-6ubuntu2.6_amd64.deb ...
2026-02-21T08:06:09.0961367Z Unpacking libk5crypto3:amd64 (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:09.1097081Z Selecting previously unselected package libkeyutils1:amd64.
2026-02-21T08:06:09.1104102Z Preparing to unpack .../06-libkeyutils1_1.6.3-3build1_amd64.deb ...
2026-02-21T08:06:09.1111341Z Unpacking libkeyutils1:amd64 (1.6.3-3build1) ...
2026-02-21T08:06:09.1235646Z Selecting previously unselected package libkrb5-3:amd64.
2026-02-21T08:06:09.1243638Z Preparing to unpack .../07-libkrb5-3_1.20.1-6ubuntu2.6_amd64.deb ...
2026-02-21T08:06:09.1255235Z Unpacking libkrb5-3:amd64 (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:09.1422641Z Selecting previously unselected package libgssapi-krb5-2:amd64.
2026-02-21T08:06:09.1430279Z Preparing to unpack .../08-libgssapi-krb5-2_1.20.1-6ubuntu2.6_amd64.deb ...
2026-02-21T08:06:09.1440274Z Unpacking libgssapi-krb5-2:amd64 (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:09.1583153Z Selecting previously unselected package libcbor0.10:amd64.
2026-02-21T08:06:09.1589448Z Preparing to unpack .../09-libcbor0.10_0.10.2-1.2ubuntu2_amd64.deb ...
2026-02-21T08:06:09.1598912Z Unpacking libcbor0.10:amd64 (0.10.2-1.2ubuntu2) ...
2026-02-21T08:06:09.1721952Z Selecting previously unselected package libedit2:amd64.
2026-02-21T08:06:09.1730425Z Preparing to unpack .../10-libedit2_3.1-20230828-1build1_amd64.deb ...
2026-02-21T08:06:09.1739574Z Unpacking libedit2:amd64 (3.1-20230828-1build1) ...
2026-02-21T08:06:09.1878029Z Selecting previously unselected package libfido2-1:amd64.
2026-02-21T08:06:09.1885345Z Preparing to unpack .../11-libfido2-1_1.14.0-1build3_amd64.deb ...
2026-02-21T08:06:09.1886070Z Unpacking libfido2-1:amd64 (1.14.0-1build3) ...
2026-02-21T08:06:09.2019500Z Selecting previously unselected package libnghttp2-14:amd64.
2026-02-21T08:06:09.2027268Z Preparing to unpack .../12-libnghttp2-14_1.59.0-1ubuntu0.2_amd64.deb ...
2026-02-21T08:06:09.2043392Z Unpacking libnghttp2-14:amd64 (1.59.0-1ubuntu0.2) ...
2026-02-21T08:06:09.2186850Z Selecting previously unselected package libpsl5t64:amd64.
2026-02-21T08:06:09.2195601Z Preparing to unpack .../13-libpsl5t64_0.21.2-1.1build1_amd64.deb ...
2026-02-21T08:06:09.2209773Z Unpacking libpsl5t64:amd64 (0.21.2-1.1build1) ...
2026-02-21T08:06:09.2335222Z Selecting previously unselected package libxau6:amd64.
2026-02-21T08:06:09.2342902Z Preparing to unpack .../14-libxau6_1%3a1.0.9-1build6_amd64.deb ...
2026-02-21T08:06:09.2353063Z Unpacking libxau6:amd64 (1:1.0.9-1build6) ...
2026-02-21T08:06:09.2466441Z Selecting previously unselected package libxdmcp6:amd64.
2026-02-21T08:06:09.2477128Z Preparing to unpack .../15-libxdmcp6_1%3a1.1.3-0ubuntu6_amd64.deb ...
2026-02-21T08:06:09.2491500Z Unpacking libxdmcp6:amd64 (1:1.1.3-0ubuntu6) ...
2026-02-21T08:06:09.2625605Z Selecting previously unselected package libxcb1:amd64.
2026-02-21T08:06:09.2637059Z Preparing to unpack .../16-libxcb1_1.15-1ubuntu2_amd64.deb ...
2026-02-21T08:06:09.2646789Z Unpacking libxcb1:amd64 (1.15-1ubuntu2) ...
2026-02-21T08:06:09.2769993Z Selecting previously unselected package libx11-data.
2026-02-21T08:06:09.2782165Z Preparing to unpack .../17-libx11-data_2%3a1.8.7-1build1_all.deb ...
2026-02-21T08:06:09.2792556Z Unpacking libx11-data (2:1.8.7-1build1) ...
2026-02-21T08:06:09.3090406Z Selecting previously unselected package libx11-6:amd64.
2026-02-21T08:06:09.3101821Z Preparing to unpack .../18-libx11-6_2%3a1.8.7-1build1_amd64.deb ...
2026-02-21T08:06:09.3111278Z Unpacking libx11-6:amd64 (2:1.8.7-1build1) ...
2026-02-21T08:06:09.3296772Z Selecting previously unselected package libxext6:amd64.
2026-02-21T08:06:09.3308891Z Preparing to unpack .../19-libxext6_2%3a1.3.4-1build2_amd64.deb ...
2026-02-21T08:06:09.3318506Z Unpacking libxext6:amd64 (2:1.3.4-1build2) ...
2026-02-21T08:06:09.3446903Z Selecting previously unselected package libxmuu1:amd64.
2026-02-21T08:06:09.3458272Z Preparing to unpack .../20-libxmuu1_2%3a1.1.3-3build2_amd64.deb ...
2026-02-21T08:06:09.3468236Z Unpacking libxmuu1:amd64 (2:1.1.3-3build2) ...
2026-02-21T08:06:09.3608361Z Selecting previously unselected package openssh-client.
2026-02-21T08:06:09.3620610Z Preparing to unpack .../21-openssh-client_1%3a9.6p1-3ubuntu13.14_amd64.deb ...
2026-02-21T08:06:09.3677216Z Unpacking openssh-client (1:9.6p1-3ubuntu13.14) ...
2026-02-21T08:06:09.3973059Z Selecting previously unselected package publicsuffix.
2026-02-21T08:06:09.3980395Z Preparing to unpack .../22-publicsuffix_20231001.0357-0.1_all.deb ...
2026-02-21T08:06:09.3989600Z Unpacking publicsuffix (20231001.0357-0.1) ...
2026-02-21T08:06:09.4116488Z Selecting previously unselected package xauth.
2026-02-21T08:06:09.4124568Z Preparing to unpack .../23-xauth_1%3a1.1.2-1build1_amd64.deb ...
2026-02-21T08:06:09.4133375Z Unpacking xauth (1:1.1.2-1build1) ...
2026-02-21T08:06:09.4260158Z Selecting previously unselected package libbrotli1:amd64.
2026-02-21T08:06:09.4266550Z Preparing to unpack .../24-libbrotli1_1.1.0-2build2_amd64.deb ...
2026-02-21T08:06:09.4277745Z Unpacking libbrotli1:amd64 (1.1.0-2build2) ...
2026-02-21T08:06:09.4425849Z Selecting previously unselected package librtmp1:amd64.
2026-02-21T08:06:09.4434267Z Preparing to unpack .../25-librtmp1_2.4+20151223.gitfa8646d.1-2build7_amd64.deb ...
2026-02-21T08:06:09.4444839Z Unpacking librtmp1:amd64 (2.4+20151223.gitfa8646d.1-2build7) ...
2026-02-21T08:06:09.4577054Z Selecting previously unselected package libssh-4:amd64.
2026-02-21T08:06:09.4585596Z Preparing to unpack .../26-libssh-4_0.10.6-2ubuntu0.3_amd64.deb ...
2026-02-21T08:06:09.4595036Z Unpacking libssh-4:amd64 (0.10.6-2ubuntu0.3) ...
2026-02-21T08:06:09.4739003Z Selecting previously unselected package libcurl3t64-gnutls:amd64.
2026-02-21T08:06:09.4749025Z Preparing to unpack .../27-libcurl3t64-gnutls_8.5.0-2ubuntu10.6_amd64.deb ...
2026-02-21T08:06:09.4757863Z Unpacking libcurl3t64-gnutls:amd64 (8.5.0-2ubuntu10.6) ...
2026-02-21T08:06:09.4904845Z Selecting previously unselected package liberror-perl.
2026-02-21T08:06:09.4914741Z Preparing to unpack .../28-liberror-perl_0.17029-2_all.deb ...
2026-02-21T08:06:09.4919774Z Unpacking liberror-perl (0.17029-2) ...
2026-02-21T08:06:09.5049869Z Selecting previously unselected package git-man.
2026-02-21T08:06:09.5057488Z Preparing to unpack .../29-git-man_1%3a2.43.0-1ubuntu7.3_all.deb ...
2026-02-21T08:06:09.5067147Z Unpacking git-man (1:2.43.0-1ubuntu7.3) ...
2026-02-21T08:06:09.5265233Z Selecting previously unselected package git.
2026-02-21T08:06:09.5267652Z Preparing to unpack .../30-git_1%3a2.43.0-1ubuntu7.3_amd64.deb ...
2026-02-21T08:06:09.5328433Z Unpacking git (1:2.43.0-1ubuntu7.3) ...
2026-02-21T08:06:09.6510527Z Setting up libexpat1:amd64 (2.6.1-2ubuntu0.4) ...
2026-02-21T08:06:09.6546357Z Setting up libxau6:amd64 (1:1.0.9-1build6) ...
2026-02-21T08:06:09.6563363Z Setting up libkeyutils1:amd64 (1.6.3-3build1) ...
2026-02-21T08:06:09.6612604Z Setting up libcbor0.10:amd64 (0.10.2-1.2ubuntu2) ...
2026-02-21T08:06:09.6644258Z Setting up libbrotli1:amd64 (1.1.0-2build2) ...
2026-02-21T08:06:09.6665359Z Setting up libpsl5t64:amd64 (0.21.2-1.1build1) ...
2026-02-21T08:06:09.6726908Z Setting up libnghttp2-14:amd64 (1.59.0-1ubuntu0.2) ...
2026-02-21T08:06:09.6743008Z Setting up less (590-2ubuntu2.1) ...
2026-02-21T08:06:09.6824734Z Setting up krb5-locales (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:09.6872876Z Setting up libkrb5support0:amd64 (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:09.6950287Z Setting up liberror-perl (0.17029-2) ...
2026-02-21T08:06:09.6969440Z Setting up libx11-data (2:1.8.7-1build1) ...
2026-02-21T08:06:09.6995074Z Setting up librtmp1:amd64 (2.4+20151223.gitfa8646d.1-2build7) ...
2026-02-21T08:06:09.7025756Z Setting up libk5crypto3:amd64 (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:09.7059736Z Setting up git-man (1:2.43.0-1ubuntu7.3) ...
2026-02-21T08:06:09.7071327Z Setting up libkrb5-3:amd64 (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:09.7080207Z Setting up libfido2-1:amd64 (1.14.0-1build3) ...
2026-02-21T08:06:09.7128228Z Setting up libbsd0:amd64 (0.12.1-1build1.1) ...
2026-02-21T08:06:09.7161499Z Setting up publicsuffix (20231001.0357-0.1) ...
2026-02-21T08:06:09.7175745Z Setting up libxdmcp6:amd64 (1:1.1.3-0ubuntu6) ...
2026-02-21T08:06:09.7195691Z Setting up libxcb1:amd64 (1.15-1ubuntu2) ...
2026-02-21T08:06:09.7254112Z Setting up libedit2:amd64 (3.1-20230828-1build1) ...
2026-02-21T08:06:09.7323351Z Setting up libgssapi-krb5-2:amd64 (1.20.1-6ubuntu2.6) ...
2026-02-21T08:06:09.7400996Z Setting up libssh-4:amd64 (0.10.6-2ubuntu0.3) ...
2026-02-21T08:06:09.7458860Z Setting up libx11-6:amd64 (2:1.8.7-1build1) ...
2026-02-21T08:06:09.7527549Z Setting up libxmuu1:amd64 (2:1.1.3-3build2) ...
2026-02-21T08:06:09.7579733Z Setting up openssh-client (1:9.6p1-3ubuntu13.14) ...
2026-02-21T08:06:09.8138336Z Setting up libcurl3t64-gnutls:amd64 (8.5.0-2ubuntu10.6) ...
2026-02-21T08:06:09.8191227Z Setting up libxext6:amd64 (2:1.3.4-1build2) ...
2026-02-21T08:06:09.8215026Z Setting up git (1:2.43.0-1ubuntu7.3) ...
2026-02-21T08:06:09.8285070Z Setting up xauth (1:1.1.2-1build1) ...
2026-02-21T08:06:09.8304123Z Processing triggers for libc-bin (2.39-0ubuntu8.5) ...
2026-02-21T08:06:09.8690768Z ##[group]Run actions/checkout@v6
2026-02-21T08:06:09.8691044Z with:
2026-02-21T08:06:09.8691308Z   repository: pytorch/helion
2026-02-21T08:06:09.8691714Z   token: ***
2026-02-21T08:06:09.8692231Z   ssh-strict: true
2026-02-21T08:06:09.8692473Z   ssh-user: git
2026-02-21T08:06:09.8692704Z   persist-credentials: true
2026-02-21T08:06:09.8692908Z   clean: true
2026-02-21T08:06:09.8693168Z   sparse-checkout-cone-mode: true
2026-02-21T08:06:09.8693391Z   fetch-depth: 1
2026-02-21T08:06:09.8693606Z   fetch-tags: false
2026-02-21T08:06:09.8693810Z   show-progress: true
2026-02-21T08:06:09.8694205Z   lfs: false
2026-02-21T08:06:09.8694427Z   submodules: false
2026-02-21T08:06:09.8694621Z   set-safe-directory: true
2026-02-21T08:06:09.8694866Z env:
2026-02-21T08:06:09.8695045Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:09.8695294Z ##[endgroup]
2026-02-21T08:06:09.8730208Z ##[command]/usr/bin/docker exec  227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T08:06:10.0486316Z Syncing repository: pytorch/helion
2026-02-21T08:06:10.0487243Z ##[group]Getting Git version info
2026-02-21T08:06:10.0487593Z Working directory is '/__w/helion/helion'
2026-02-21T08:06:10.0488001Z [command]/usr/bin/git version
2026-02-21T08:06:10.0491608Z git version 2.43.0
2026-02-21T08:06:10.0508288Z ##[endgroup]
2026-02-21T08:06:10.0519414Z Temporarily overriding HOME='/__w/_temp/d4a90a64-5a5e-461b-8f77-058ba5cbd50e' before making global git config changes
2026-02-21T08:06:10.0519978Z Adding repository directory to the temporary git global config as a safe directory
2026-02-21T08:06:10.0528469Z [command]/usr/bin/git config --global --add safe.directory /__w/helion/helion
2026-02-21T08:06:10.0547487Z Deleting the contents of '/__w/helion/helion'
2026-02-21T08:06:10.0552625Z ##[group]Initializing the repository
2026-02-21T08:06:10.0556425Z [command]/usr/bin/git init /__w/helion/helion
2026-02-21T08:06:10.0580423Z hint: Using 'master' as the name for the initial branch. This default branch name
2026-02-21T08:06:10.0590371Z hint: is subject to change. To configure the initial branch name to use in all
2026-02-21T08:06:10.0590831Z hint: of your new repositories, which will suppress this warning, call:
2026-02-21T08:06:10.0591151Z hint: 
2026-02-21T08:06:10.0591427Z hint: 	git config --global init.defaultBranch <name>
2026-02-21T08:06:10.0591708Z hint: 
2026-02-21T08:06:10.0591991Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2026-02-21T08:06:10.0592408Z hint: 'development'. The just-created branch can be renamed via this command:
2026-02-21T08:06:10.0592725Z hint: 
2026-02-21T08:06:10.0592940Z hint: 	git branch -m <name>
2026-02-21T08:06:10.0593263Z Initialized empty Git repository in /__w/helion/helion/.git/
2026-02-21T08:06:10.0593923Z [command]/usr/bin/git remote add origin https://github.com/pytorch/helion
2026-02-21T08:06:10.0621997Z ##[endgroup]
2026-02-21T08:06:10.0622376Z ##[group]Disabling automatic garbage collection
2026-02-21T08:06:10.0626603Z [command]/usr/bin/git config --local gc.auto 0
2026-02-21T08:06:10.0654043Z ##[endgroup]
2026-02-21T08:06:10.0654368Z ##[group]Setting up auth
2026-02-21T08:06:10.0654671Z Removing SSH command configuration
2026-02-21T08:06:10.0659730Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2026-02-21T08:06:10.0687422Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2026-02-21T08:06:10.0929312Z Removing HTTP extra header
2026-02-21T08:06:10.0931906Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2026-02-21T08:06:10.0952983Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2026-02-21T08:06:10.1186609Z Removing includeIf entries pointing to credentials config files
2026-02-21T08:06:10.1187046Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir:
2026-02-21T08:06:10.1211186Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url
2026-02-21T08:06:10.1436537Z [command]/usr/bin/git config --file /__w/_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config http.https://github.com/.extraheader AUTHORIZATION: basic ***
2026-02-21T08:06:10.1468617Z [command]/usr/bin/git config --local includeIf.gitdir:/__w/helion/helion/.git.path /__w/_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config
2026-02-21T08:06:10.1495797Z [command]/usr/bin/git config --local includeIf.gitdir:/__w/helion/helion/.git/worktrees/*.path /__w/_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config
2026-02-21T08:06:10.1522100Z [command]/usr/bin/git config --local includeIf.gitdir:/github/workspace/.git.path /github/runner_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config
2026-02-21T08:06:10.1548391Z [command]/usr/bin/git config --local includeIf.gitdir:/github/workspace/.git/worktrees/*.path /github/runner_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config
2026-02-21T08:06:10.1572982Z ##[endgroup]
2026-02-21T08:06:10.1573542Z ##[group]Fetching the repository
2026-02-21T08:06:10.1581253Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +874a7d0cadab18218a84ad3579d329dc95c51820:refs/remotes/origin/main
2026-02-21T08:06:10.7027859Z From https://github.com/pytorch/helion
2026-02-21T08:06:10.7028419Z  * [new ref]         874a7d0cadab18218a84ad3579d329dc95c51820 -> origin/main
2026-02-21T08:06:10.7051106Z [command]/usr/bin/git branch --list --remote origin/main
2026-02-21T08:06:10.7075046Z   origin/main
2026-02-21T08:06:10.7080151Z [command]/usr/bin/git rev-parse refs/remotes/origin/main
2026-02-21T08:06:10.7099788Z 874a7d0cadab18218a84ad3579d329dc95c51820
2026-02-21T08:06:10.7106467Z ##[endgroup]
2026-02-21T08:06:10.7106904Z ##[group]Determining the checkout info
2026-02-21T08:06:10.7107310Z ##[endgroup]
2026-02-21T08:06:10.7107620Z [command]/usr/bin/git sparse-checkout disable
2026-02-21T08:06:10.7137003Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2026-02-21T08:06:10.7158161Z ##[group]Checking out the ref
2026-02-21T08:06:10.7158722Z [command]/usr/bin/git checkout --progress --force -B main refs/remotes/origin/main
2026-02-21T08:06:10.7381140Z Switched to a new branch 'main'
2026-02-21T08:06:10.7383307Z branch 'main' set up to track 'origin/main'.
2026-02-21T08:06:10.7393489Z ##[endgroup]
2026-02-21T08:06:10.7418705Z [command]/usr/bin/git log -1 --format=%H
2026-02-21T08:06:10.7440129Z 874a7d0cadab18218a84ad3579d329dc95c51820
2026-02-21T08:06:10.7620283Z ##[group]Run actions/setup-python@v6
2026-02-21T08:06:10.7620550Z with:
2026-02-21T08:06:10.7620795Z   python-version: 3.12
2026-02-21T08:06:10.7621001Z   check-latest: false
2026-02-21T08:06:10.7621347Z   token: ***
2026-02-21T08:06:10.7621563Z   update-environment: true
2026-02-21T08:06:10.7621822Z   allow-prereleases: false
2026-02-21T08:06:10.7622350Z   freethreaded: false
2026-02-21T08:06:10.7622599Z env:
2026-02-21T08:06:10.7622889Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:10.7623148Z ##[endgroup]
2026-02-21T08:06:10.7627502Z ##[command]/usr/bin/docker exec  227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T08:06:10.9753142Z ##[group]Installed versions
2026-02-21T08:06:10.9765799Z Version 3.12 was not found in the local cache
2026-02-21T08:06:11.5393127Z Version 3.12 is available for downloading
2026-02-21T08:06:11.5393726Z Download from "https://github.com/actions/python-versions/releases/download/3.12.12-18393146713/python-3.12.12-linux-24.04-x64.tar.gz"
2026-02-21T08:06:12.1030930Z Extract downloaded archive
2026-02-21T08:06:12.1136041Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /__w/_temp/d62fbf56-d8ad-4d0d-af37-a0efc99c2fd5 -f /__w/_temp/e5950131-de28-49d4-8940-531b92dab25a
2026-02-21T08:06:13.9560102Z Execute installation script
2026-02-21T08:06:13.9655991Z Check if Python hostedtoolcache folder exist...
2026-02-21T08:06:13.9656468Z Creating Python hostedtoolcache folder...
2026-02-21T08:06:13.9665504Z Create Python 3.12.12 folder
2026-02-21T08:06:13.9675718Z Copy Python binaries to hostedtoolcache folder
2026-02-21T08:06:14.2252656Z Create additional symlinks (Required for the UsePythonVersion Azure Pipelines task and the setup-python GitHub Action)
2026-02-21T08:06:14.2291126Z Upgrading pip...
2026-02-21T08:06:15.9003864Z Looking in links: /tmp/tmp4xt3whd9
2026-02-21T08:06:15.9005741Z Requirement already satisfied: pip in /__w/_tool/Python/3.12.12/x64/lib/python3.12/site-packages (25.0.1)
2026-02-21T08:06:15.9044470Z ##[error]WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
2026-02-21T08:06:16.5040699Z ##[error]WARNING: The directory '/github/home/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.
2026-02-21T08:06:16.6650737Z Collecting pip
2026-02-21T08:06:16.7748270Z Downloading pip-26.0.1-py3-none-any.whl.metadata (4.7 kB)
2026-02-21T08:06:16.7795140Z Downloading pip-26.0.1-py3-none-any.whl (1.8 MB)
2026-02-21T08:06:16.8153552Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 88.8 MB/s eta 0:00:00
2026-02-21T08:06:16.8154998Z 
2026-02-21T08:06:16.8239742Z Installing collected packages: pip
2026-02-21T08:06:16.8243784Z Attempting uninstall: pip
2026-02-21T08:06:16.8252982Z Found existing installation: pip 25.0.1
2026-02-21T08:06:16.8425623Z Uninstalling pip-25.0.1:
2026-02-21T08:06:16.8459474Z Successfully uninstalled pip-25.0.1
2026-02-21T08:06:17.4624620Z Successfully installed pip-26.0.1
2026-02-21T08:06:17.5142795Z Create complete file
2026-02-21T08:06:17.5185301Z Successfully set up CPython (3.12.12)
2026-02-21T08:06:17.5186977Z ##[endgroup]
2026-02-21T08:06:17.5392054Z ##[group]Run astral-sh/setup-uv@v7
2026-02-21T08:06:17.5392359Z with:
2026-02-21T08:06:17.5392556Z   activate-environment: false
2026-02-21T08:06:17.5392841Z   working-directory: /home/eve/_work/helion/helion
2026-02-21T08:06:17.5393278Z   github-token: ***
2026-02-21T08:06:17.5393473Z   enable-cache: auto
2026-02-21T08:06:17.5393926Z   cache-dependency-glob: **/*requirements*.txt
**/*requirements*.in
**/*constraints*.txt
**/*constraints*.in
**/pyproject.toml
**/uv.lock
**/*.py.lock

2026-02-21T08:06:17.5394424Z   restore-cache: true
2026-02-21T08:06:17.5394654Z   save-cache: true
2026-02-21T08:06:17.5394868Z   prune-cache: true
2026-02-21T08:06:17.5395085Z   cache-python: false
2026-02-21T08:06:17.5395336Z   ignore-nothing-to-cache: false
2026-02-21T08:06:17.5395562Z   ignore-empty-workdir: false
2026-02-21T08:06:17.5395832Z   add-problem-matchers: true
2026-02-21T08:06:17.5396048Z   resolution-strategy: highest
2026-02-21T08:06:17.5396299Z env:
2026-02-21T08:06:17.5396581Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:17.5396834Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:17.5397138Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:06:17.5397492Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:17.5397799Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:17.5398045Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:17.5398517Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:06:17.5398909Z ##[endgroup]
2026-02-21T08:06:17.5405091Z ##[command]/usr/bin/docker exec  227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T08:06:17.7586923Z (node:802) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
2026-02-21T08:06:17.7591350Z (Use `node --trace-deprecation ...` to show where the warning was created)
2026-02-21T08:06:17.7666079Z Trying to find version for uv in: /__w/helion/helion/uv.toml
2026-02-21T08:06:17.7670809Z Could not find file: /__w/helion/helion/uv.toml
2026-02-21T08:06:17.7674890Z Trying to find version for uv in: /__w/helion/helion/pyproject.toml
2026-02-21T08:06:17.7675854Z Could not determine uv version from uv.toml or pyproject.toml. Falling back to latest.
2026-02-21T08:06:17.7676430Z Getting latest version from GitHub API...
2026-02-21T08:06:17.9994286Z manifest-file not provided, reading from local file.
2026-02-21T08:06:18.0036840Z manifest-file does not contain version 0.10.4, arch x86_64, platform unknown-linux-gnu. Falling back to GitHub releases.
2026-02-21T08:06:18.0038096Z Downloading uv from "https://github.com/astral-sh/uv/releases/download/0.10.4/uv-x86_64-unknown-linux-gnu.tar.gz" ...
2026-02-21T08:06:18.2892404Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /__w/_temp/64c5fce0-958c-4688-96ee-c42a87209806 -f /__w/_temp/a48647e9-acec-4d1e-9dfb-f63693d86f7a
2026-02-21T08:06:18.7027267Z Added /github/home/.local/bin to the path
2026-02-21T08:06:18.7027773Z Added /__w/_tool/uv/0.10.4/x86_64 to the path
2026-02-21T08:06:18.7028117Z Set UV_PYTHON_INSTALL_DIR to /github/home/.local/share/uv/python
2026-02-21T08:06:18.7028448Z Added /github/home/.local/share/uv/python to the path
2026-02-21T08:06:18.7037577Z Successfully installed uv version 0.10.4
2026-02-21T08:06:18.8429873Z ##[group]Run uv venv --python 3.12
2026-02-21T08:06:18.8430185Z uv venv --python 3.12
2026-02-21T08:06:18.8430549Z shell: bash -l {0}
2026-02-21T08:06:18.8430766Z env:
2026-02-21T08:06:18.8430981Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:18.8431274Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:18.8431611Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:06:18.8431990Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:18.8432314Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:18.8432575Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:18.8433030Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:06:18.8433525Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:06:18.8433790Z ##[endgroup]
2026-02-21T08:06:18.9687321Z Using CPython 3.12.12 interpreter at: /__w/_tool/Python/3.12.12/x64/bin/python3.12
2026-02-21T08:06:18.9687837Z Creating virtual environment at: .venv
2026-02-21T08:06:18.9692156Z Activate with: source .venv/bin/activate
2026-02-21T08:06:18.9752008Z ##[group]Run source .venv/bin/activate
2026-02-21T08:06:18.9752319Z source .venv/bin/activate
2026-02-21T08:06:18.9752744Z uv pip install -U "torch==2.9.*" --index-url https://download.pytorch.org/whl/cu130
2026-02-21T08:06:18.9753255Z shell: bash -l {0}
2026-02-21T08:06:18.9753452Z env:
2026-02-21T08:06:18.9753717Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:18.9754012Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:18.9754338Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:06:18.9754668Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:18.9754947Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:18.9755232Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:18.9755626Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:06:18.9756186Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:06:18.9756432Z ##[endgroup]
2026-02-21T08:06:19.7425218Z Resolved 26 packages in 677ms
2026-02-21T08:06:19.7455352Z Downloading networkx (2.0MiB)
2026-02-21T08:06:19.7455718Z Downloading sympy (6.0MiB)
2026-02-21T08:06:19.7516726Z Downloading torch (584.2MiB)
2026-02-21T08:06:19.7517053Z Downloading triton (162.6MiB)
2026-02-21T08:06:19.7619112Z Downloading nvidia-cuda-cupti (10.2MiB)
2026-02-21T08:06:19.7663527Z Downloading nvidia-cufft (204.2MiB)
2026-02-21T08:06:19.7815632Z Downloading nvidia-cuda-runtime (2.1MiB)
2026-02-21T08:06:19.7816127Z Downloading nvidia-cusolver (184.5MiB)
2026-02-21T08:06:19.7834810Z Downloading nvidia-curand (56.8MiB)
2026-02-21T08:06:19.7899546Z Downloading nvidia-cusparse (133.8MiB)
2026-02-21T08:06:19.7991568Z Downloading nvidia-nvshmem-cu13 (57.6MiB)
2026-02-21T08:06:19.8055341Z Downloading nvidia-cusparselt-cu13 (162.0MiB)
2026-02-21T08:06:19.8153455Z Downloading nvidia-cufile (1.2MiB)
2026-02-21T08:06:19.8254350Z Downloading nvidia-cuda-nvrtc (86.0MiB)
2026-02-21T08:06:19.8296238Z Downloading nvidia-nvjitlink (38.8MiB)
2026-02-21T08:06:19.8447679Z Downloading nvidia-cudnn-cu13 (332.4MiB)
2026-02-21T08:06:19.8457183Z Downloading nvidia-cublas (400.0MiB)
2026-02-21T08:06:19.8511738Z Downloading nvidia-nccl-cu13 (184.9MiB)
2026-02-21T08:06:20.1887214Z  Downloaded nvidia-cufile
2026-02-21T08:06:20.3823310Z  Downloaded nvidia-cuda-runtime
2026-02-21T08:06:20.7340746Z  Downloaded networkx
2026-02-21T08:06:21.2139641Z  Downloaded nvidia-cuda-cupti
2026-02-21T08:06:22.1827692Z  Downloaded sympy
2026-02-21T08:06:22.8033537Z  Downloaded triton
2026-02-21T08:06:24.1800704Z  Downloaded nvidia-nvjitlink
2026-02-21T08:06:24.8383938Z  Downloaded nvidia-curand
2026-02-21T08:06:24.8568612Z  Downloaded nvidia-nvshmem-cu13
2026-02-21T08:06:26.4547457Z  Downloaded nvidia-cuda-nvrtc
2026-02-21T08:06:27.2401085Z  Downloaded nvidia-cusparse
2026-02-21T08:06:27.4027116Z  Downloaded nvidia-cufft
2026-02-21T08:06:27.5774501Z  Downloaded nvidia-cusolver
2026-02-21T08:06:27.7892806Z  Downloaded nvidia-cusparselt-cu13
2026-02-21T08:06:28.0327616Z  Downloaded nvidia-nccl-cu13
2026-02-21T08:06:29.2162216Z  Downloaded nvidia-cudnn-cu13
2026-02-21T08:06:29.6663905Z  Downloaded nvidia-cublas
2026-02-21T08:06:33.0426751Z  Downloaded torch
2026-02-21T08:06:33.0431426Z Prepared 26 packages in 13.30s
2026-02-21T08:06:33.0465728Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:06:33.0466296Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:06:33.0466997Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:06:33.8675455Z Installed 26 packages in 823ms
2026-02-21T08:06:33.8677974Z  + filelock==3.20.0
2026-02-21T08:06:33.8678297Z  + fsspec==2025.12.0
2026-02-21T08:06:33.8678488Z  + jinja2==3.1.6
2026-02-21T08:06:33.8678874Z  + markupsafe==3.0.2
2026-02-21T08:06:33.8679401Z  + mpmath==1.3.0
2026-02-21T08:06:33.8679619Z  + networkx==3.6.1
2026-02-21T08:06:33.8679881Z  + nvidia-cublas==13.0.0.19
2026-02-21T08:06:33.8680145Z  + nvidia-cuda-cupti==13.0.48
2026-02-21T08:06:33.8680358Z  + nvidia-cuda-nvrtc==13.0.48
2026-02-21T08:06:33.8680629Z  + nvidia-cuda-runtime==13.0.48
2026-02-21T08:06:33.8680874Z  + nvidia-cudnn-cu13==9.13.0.50
2026-02-21T08:06:33.8681091Z  + nvidia-cufft==12.0.0.15
2026-02-21T08:06:33.8681344Z  + nvidia-cufile==1.15.0.42
2026-02-21T08:06:33.8681542Z  + nvidia-curand==10.4.0.35
2026-02-21T08:06:33.8681779Z  + nvidia-cusolver==12.0.3.29
2026-02-21T08:06:33.8682212Z  + nvidia-cusparse==12.6.2.49
2026-02-21T08:06:33.8682493Z  + nvidia-cusparselt-cu13==0.8.0
2026-02-21T08:06:33.8682730Z  + nvidia-nccl-cu13==2.27.7
2026-02-21T08:06:33.8682979Z  + nvidia-nvjitlink==13.0.39
2026-02-21T08:06:33.8683242Z  + nvidia-nvshmem-cu13==3.3.24
2026-02-21T08:06:33.8683458Z  + nvidia-nvtx==13.0.39
2026-02-21T08:06:33.8683685Z  + setuptools==70.2.0
2026-02-21T08:06:33.8683885Z  + sympy==1.14.0
2026-02-21T08:06:33.8684108Z  + torch==2.9.1+cu130
2026-02-21T08:06:33.8684275Z  + triton==3.5.1
2026-02-21T08:06:33.8684526Z  + typing-extensions==4.15.0
2026-02-21T08:06:33.8794968Z ##[group]Run source .venv/bin/activate
2026-02-21T08:06:33.8795289Z source .venv/bin/activate
2026-02-21T08:06:33.8795596Z SETUPTOOLS_SCM_PRETEND_VERSION="0.0.0" uv pip install -e .'[dev]'
2026-02-21T08:06:33.8795983Z python -c "import helion; print(helion.__name__)"
2026-02-21T08:06:33.8796445Z shell: bash -l {0}
2026-02-21T08:06:33.8796661Z env:
2026-02-21T08:06:33.8796888Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:33.8797282Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:33.8797609Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:06:33.8797904Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:33.8798200Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:33.8798456Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:33.8799012Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:06:33.8799444Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:06:33.8799748Z ##[endgroup]
2026-02-21T08:06:34.9777784Z Resolved 30 packages in 996ms
2026-02-21T08:06:34.9785668Z    Building helion @ file:///__w/helion/helion
2026-02-21T08:06:34.9838804Z Downloading pygments (1.2MiB)
2026-02-21T08:06:34.9847993Z Downloading scikit-learn (8.5MiB)
2026-02-21T08:06:34.9848330Z Downloading virtualenv (5.6MiB)
2026-02-21T08:06:34.9992904Z Downloading numpy (15.8MiB)
2026-02-21T08:06:35.0029667Z Downloading scipy (33.4MiB)
2026-02-21T08:06:35.1469628Z       Built helion @ file:///__w/helion/helion
2026-02-21T08:06:35.1711417Z  Downloaded pygments
2026-02-21T08:06:35.1782854Z  Downloaded virtualenv
2026-02-21T08:06:35.5851424Z  Downloaded scikit-learn
2026-02-21T08:06:35.6512659Z  Downloaded numpy
2026-02-21T08:06:36.0021729Z  Downloaded scipy
2026-02-21T08:06:36.0029372Z Prepared 27 packages in 1.02s
2026-02-21T08:06:36.0036547Z Uninstalled 1 package in 0.63ms
2026-02-21T08:06:36.0037101Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:06:36.0037759Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:06:36.0038391Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:06:36.1024108Z Installed 29 packages in 98ms
2026-02-21T08:06:36.1025866Z  + cfgv==3.5.0
2026-02-21T08:06:36.1026095Z  + distlib==0.4.0
2026-02-21T08:06:36.1026312Z  + expecttest==0.3.0
2026-02-21T08:06:36.1026587Z  + filecheck==1.0.3
2026-02-21T08:06:36.1031795Z  - filelock==3.20.0
2026-02-21T08:06:36.1035642Z  + filelock==3.24.3
2026-02-21T08:06:36.1037428Z  + helion==0.0.0 (from file:///__w/helion/helion)
2026-02-21T08:06:36.1037753Z  + hypothesis==6.151.9
2026-02-21T08:06:36.1037934Z  + identify==2.6.16
2026-02-21T08:06:36.1038193Z  + iniconfig==2.3.0
2026-02-21T08:06:36.1038409Z  + joblib==1.5.3
2026-02-21T08:06:36.1038617Z  + markdown-it-py==4.0.0
2026-02-21T08:06:36.1038853Z  + mdurl==0.1.2
2026-02-21T08:06:36.1039073Z  + nodeenv==1.10.0
2026-02-21T08:06:36.1039252Z  + numpy==2.4.2
2026-02-21T08:06:36.1039480Z  + packaging==26.0
2026-02-21T08:06:36.1039672Z  + platformdirs==4.9.2
2026-02-21T08:06:36.1039906Z  + pluggy==1.6.0
2026-02-21T08:06:36.1040138Z  + pre-commit==4.5.1
2026-02-21T08:06:36.1040317Z  + psutil==7.2.2
2026-02-21T08:06:36.1040525Z  + pygments==2.19.2
2026-02-21T08:06:36.1040717Z  + pytest==9.0.2
2026-02-21T08:06:36.1040944Z  + pytest-timeout==2.4.0
2026-02-21T08:06:36.1041135Z  + pyyaml==6.0.3
2026-02-21T08:06:36.1041347Z  + rich==14.3.3
2026-02-21T08:06:36.1041531Z  + scikit-learn==1.8.0
2026-02-21T08:06:36.1041744Z  + scipy==1.17.0
2026-02-21T08:06:36.1042036Z  + sortedcontainers==2.4.0
2026-02-21T08:06:36.1042250Z  + threadpoolctl==3.6.0
2026-02-21T08:06:36.1042474Z  + virtualenv==20.38.0
2026-02-21T08:06:47.3830423Z helion
2026-02-21T08:06:48.0677409Z ##[group]Run set -x
2026-02-21T08:06:48.0677651Z set -x
2026-02-21T08:06:48.0677891Z source .venv/bin/activate
2026-02-21T08:06:48.0678117Z uv pip install pip
2026-02-21T08:06:48.0678388Z uv pip install quack-kernels --no-deps
2026-02-21T08:06:48.0678702Z mkdir -p benchmarks/ && pushd benchmarks/
2026-02-21T08:06:48.0679010Z git clone https://github.com/pytorch-labs/tritonbench/
2026-02-21T08:06:48.0679331Z pushd tritonbench/
2026-02-21T08:06:48.0679708Z git submodule update --init --recursive
2026-02-21T08:06:48.0680020Z uv pip install -r requirements.txt
2026-02-21T08:06:48.0680261Z python install.py --liger
2026-02-21T08:06:48.0680539Z uv pip install -e . --no-deps
2026-02-21T08:06:48.0680812Z popd
2026-02-21T08:06:48.0681000Z popd
2026-02-21T08:06:48.0681335Z shell: bash -l {0}
2026-02-21T08:06:48.0681548Z env:
2026-02-21T08:06:48.0681836Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:48.0682158Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:48.0682492Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:06:48.0682780Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:48.0683099Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:48.0683396Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:48.0683806Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:06:48.0684297Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:06:48.0684563Z ##[endgroup]
2026-02-21T08:06:48.2204564Z + source .venv/bin/activate
2026-02-21T08:06:48.2206124Z ++ '[' -z '' ']'
2026-02-21T08:06:48.2206318Z ++ '[' -n x ']'
2026-02-21T08:06:48.2206578Z ++ SCRIPT_PATH=.venv/bin/activate
2026-02-21T08:06:48.2206942Z ++ '[' .venv/bin/activate = /__w/_temp/cca8e86e-7f80-40d3-a870-a3aad5960484.sh ']'
2026-02-21T08:06:48.2207295Z ++ deactivate nondestructive
2026-02-21T08:06:48.2207491Z ++ unset -f pydoc
2026-02-21T08:06:48.2207738Z ++ '[' -z '' ']'
2026-02-21T08:06:48.2207913Z ++ '[' -z '' ']'
2026-02-21T08:06:48.2208102Z ++ hash -r
2026-02-21T08:06:48.2208350Z ++ '[' -z '' ']'
2026-02-21T08:06:48.2208528Z ++ unset VIRTUAL_ENV
2026-02-21T08:06:48.2208745Z ++ unset VIRTUAL_ENV_PROMPT
2026-02-21T08:06:48.2209026Z ++ '[' '!' nondestructive = nondestructive ']'
2026-02-21T08:06:48.2209309Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv
2026-02-21T08:06:48.2209624Z ++ '[' linux-gnu = cygwin ']'
2026-02-21T08:06:48.2209845Z ++ '[' linux-gnu = msys ']'
2026-02-21T08:06:48.2210074Z ++ export VIRTUAL_ENV
2026-02-21T08:06:48.2210251Z ++ '[' -z '' ']'
2026-02-21T08:06:48.2210510Z ++ unset SCRIPT_PATH
2026-02-21T08:06:48.2211169Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T08:06:48.2212404Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T08:06:48.2213146Z ++ export PATH
2026-02-21T08:06:48.2213335Z ++ '[' xhelion '!=' x ']'
2026-02-21T08:06:48.2213578Z ++ VIRTUAL_ENV_PROMPT=helion
2026-02-21T08:06:48.2213819Z ++ export VIRTUAL_ENV_PROMPT
2026-02-21T08:06:48.2214054Z ++ '[' -z '' ']'
2026-02-21T08:06:48.2214230Z ++ '[' -z '' ']'
2026-02-21T08:06:48.2214451Z ++ _OLD_VIRTUAL_PS1=
2026-02-21T08:06:48.2214667Z ++ PS1='(helion) '
2026-02-21T08:06:48.2214855Z ++ export PS1
2026-02-21T08:06:48.2215074Z ++ alias pydoc
2026-02-21T08:06:48.2215243Z ++ true
2026-02-21T08:06:48.2215448Z ++ hash -r
2026-02-21T08:06:48.2215984Z + uv pip install pip
2026-02-21T08:06:48.2541632Z Resolved 1 package in 24ms
2026-02-21T08:06:48.2592780Z Downloading pip (1.7MiB)
2026-02-21T08:06:48.3065572Z  Downloaded pip
2026-02-21T08:06:48.3067111Z Prepared 1 package in 52ms
2026-02-21T08:06:48.3105218Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:06:48.3106222Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:06:48.3107279Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:06:48.3294480Z Installed 1 package in 22ms
2026-02-21T08:06:48.3294937Z  + pip==26.0.1
2026-02-21T08:06:48.3330394Z + uv pip install quack-kernels --no-deps
2026-02-21T08:06:48.6477110Z Resolved 1 package in 304ms
2026-02-21T08:06:48.7483846Z Prepared 1 package in 100ms
2026-02-21T08:06:48.7512066Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:06:48.7512842Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:06:48.7513506Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:06:48.8735399Z Installed 1 package in 124ms
2026-02-21T08:06:48.8738471Z  + quack-kernels==0.2.10
2026-02-21T08:06:48.8764043Z + mkdir -p benchmarks/
2026-02-21T08:06:48.8772308Z + pushd benchmarks/
2026-02-21T08:06:48.8772753Z + git clone https://github.com/pytorch-labs/tritonbench/
2026-02-21T08:06:48.8773148Z /__w/helion/helion/benchmarks /__w/helion/helion
2026-02-21T08:06:48.8785986Z Cloning into 'tritonbench'...
2026-02-21T08:06:50.7102251Z + pushd tritonbench/
2026-02-21T08:06:50.7102745Z /__w/helion/helion/benchmarks/tritonbench /__w/helion/helion/benchmarks /__w/helion/helion
2026-02-21T08:06:50.7103340Z + git submodule update --init --recursive
2026-02-21T08:06:50.8431387Z Submodule 'submodules/ThunderKittens' (https://github.com/HazyResearch/ThunderKittens.git) registered for path 'submodules/ThunderKittens'
2026-02-21T08:06:50.8432748Z Submodule 'submodules/aiter' (https://github.com/ROCm/aiter.git) registered for path 'submodules/aiter'
2026-02-21T08:06:50.8439161Z Submodule 'submodules/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/cutlass'
2026-02-21T08:06:50.9193504Z Submodule 'submodules/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'submodules/flash-attention'
2026-02-21T08:06:50.9198088Z Submodule 'submodules/generative-recommenders' (https://github.com/facebookresearch/generative-recommenders.git) registered for path 'submodules/generative-recommenders'
2026-02-21T08:06:50.9232344Z Submodule 'submodules/xformers' (https://github.com/facebookresearch/xformers.git) registered for path 'submodules/xformers'
2026-02-21T08:06:50.9254729Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/ThunderKittens'...
2026-02-21T08:06:54.2042365Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/aiter'...
2026-02-21T08:07:06.9085770Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/cutlass'...
2026-02-21T08:07:11.4233107Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention'...
2026-02-21T08:07:12.3114214Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/generative-recommenders'...
2026-02-21T08:07:12.7746618Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers'...
2026-02-21T08:07:14.3082408Z Submodule path 'submodules/ThunderKittens': checked out '25f7568450b412a1984a4f619fb28373df06fa1b'
2026-02-21T08:07:14.6208495Z Submodule path 'submodules/aiter': checked out '1f5b378dcc9d9b0bcd9456c8c767b7424a5e8190'
2026-02-21T08:07:14.8557980Z Submodule '3rdparty/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/aiter/3rdparty/composable_kernel'
2026-02-21T08:07:14.8582242Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/aiter/3rdparty/composable_kernel'...
2026-02-21T08:07:19.2397598Z Submodule path 'submodules/aiter/3rdparty/composable_kernel': checked out 'e31a7a4f29b371c32ea9daf9211b6ae1fed2fa40'
2026-02-21T08:07:19.7049731Z Submodule path 'submodules/cutlass': checked out 'ad7b2f5e84fcfa124cb02b91d5bd26d238c0459e'
2026-02-21T08:07:19.7866866Z Submodule path 'submodules/flash-attention': checked out '43375aab2893018dfb7950db1cfa623c14946ad6'
2026-02-21T08:07:19.7884155Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/flash-attention/csrc/composable_kernel'
2026-02-21T08:07:19.7886159Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/flash-attention/csrc/cutlass'
2026-02-21T08:07:19.7910847Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention/csrc/composable_kernel'...
2026-02-21T08:07:23.8118806Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention/csrc/cutlass'...
2026-02-21T08:07:27.8667275Z Submodule path 'submodules/flash-attention/csrc/composable_kernel': checked out 'e8709c24f403173ad21a2da907d1347957e324fb'
2026-02-21T08:07:28.3451038Z Submodule path 'submodules/flash-attention/csrc/cutlass': checked out 'b1d6e2c9b334dfa811e4183dfbd02419249e4b52'
2026-02-21T08:07:28.3711815Z Submodule path 'submodules/generative-recommenders': checked out '88512dbd71b053226bc4ef8ec1630e3db53e55e5'
2026-02-21T08:07:28.3727629Z Submodule 'generative_recommenders/ops/cpp/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass'
2026-02-21T08:07:28.3749207Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass'...
2026-02-21T08:07:32.7798311Z Submodule path 'submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass': checked out 'dc4817921edda44a549197ff3a9dcf5df0636e7b'
2026-02-21T08:07:32.9380397Z Submodule path 'submodules/xformers': checked out '8fc8ec5a4d6498ff81c0c418b89bbaf133ae3a44'
2026-02-21T08:07:32.9395740Z Submodule 'third_party/composable_kernel_tiled' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/xformers/third_party/composable_kernel_tiled'
2026-02-21T08:07:32.9396553Z Submodule 'third_party/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/xformers/third_party/cutlass'
2026-02-21T08:07:32.9398714Z Submodule 'third_party/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'submodules/xformers/third_party/flash-attention'
2026-02-21T08:07:32.9423977Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/composable_kernel_tiled'...
2026-02-21T08:07:37.1043334Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/cutlass'...
2026-02-21T08:07:40.6255788Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention'...
2026-02-21T08:07:41.6066837Z Submodule path 'submodules/xformers/third_party/composable_kernel_tiled': checked out '4f54fa30583704f34da2ac50372d524cae6bad7d'
2026-02-21T08:07:42.0968079Z Submodule path 'submodules/xformers/third_party/cutlass': checked out 'e9627ce55b42fd2599f58cd4396da9380954def0'
2026-02-21T08:07:42.1509740Z Submodule path 'submodules/xformers/third_party/flash-attention': checked out '979702c87a8713a8e0a5e9fee122b90d2ef13be5'
2026-02-21T08:07:42.1529265Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/xformers/third_party/flash-attention/csrc/composable_kernel'
2026-02-21T08:07:42.1534068Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/xformers/third_party/flash-attention/csrc/cutlass'
2026-02-21T08:07:42.1555184Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention/csrc/composable_kernel'...
2026-02-21T08:07:47.1167634Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention/csrc/cutlass'...
2026-02-21T08:07:51.3367719Z Submodule path 'submodules/xformers/third_party/flash-attention/csrc/composable_kernel': checked out '888317e698e9803c62bd38568abc9e05d7709f33'
2026-02-21T08:07:51.8245317Z Submodule path 'submodules/xformers/third_party/flash-attention/csrc/cutlass': checked out 'c506e16788cb08416a4a57e11a9067beeee29420'
2026-02-21T08:07:51.8287801Z + uv pip install -r requirements.txt
2026-02-21T08:07:51.8358686Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv
2026-02-21T08:07:51.9722735Z Resolved 30 packages in 135ms
2026-02-21T08:07:51.9779784Z Downloading fonttools (4.7MiB)
2026-02-21T08:07:51.9780008Z Downloading hf-xet (3.2MiB)
2026-02-21T08:07:51.9782280Z Downloading matplotlib (8.3MiB)
2026-02-21T08:07:51.9789336Z Downloading kiwisolver (1.4MiB)
2026-02-21T08:07:51.9789721Z Downloading transformers (10.3MiB)
2026-02-21T08:07:51.9805690Z Downloading tokenizers (3.0MiB)
2026-02-21T08:07:51.9805997Z Downloading pillow (6.7MiB)
2026-02-21T08:07:52.1314544Z  Downloaded kiwisolver
2026-02-21T08:07:52.1940728Z  Downloaded tokenizers
2026-02-21T08:07:52.1973595Z  Downloaded hf-xet
2026-02-21T08:07:52.3719922Z  Downloaded pillow
2026-02-21T08:07:52.3969043Z  Downloaded fonttools
2026-02-21T08:07:52.5168300Z  Downloaded matplotlib
2026-02-21T08:07:53.0360523Z  Downloaded transformers
2026-02-21T08:07:53.0360951Z Prepared 23 packages in 1.06s
2026-02-21T08:07:53.0400588Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:07:53.0401400Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:07:53.0402491Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:07:53.1056476Z Installed 23 packages in 69ms
2026-02-21T08:07:53.1056689Z  + certifi==2026.1.4
2026-02-21T08:07:53.1056869Z  + charset-normalizer==3.4.4
2026-02-21T08:07:53.1057040Z  + contourpy==1.3.3
2026-02-21T08:07:53.1057195Z  + cycler==0.12.1
2026-02-21T08:07:53.1057329Z  + fonttools==4.61.1
2026-02-21T08:07:53.1057479Z  + hf-xet==1.2.0
2026-02-21T08:07:53.1057622Z  + huggingface-hub==0.36.2
2026-02-21T08:07:53.1057783Z  + idna==3.11
2026-02-21T08:07:53.1057914Z  + kiwisolver==1.4.9
2026-02-21T08:07:53.1058082Z  + matplotlib==3.10.8
2026-02-21T08:07:53.1058254Z  + nvidia-ml-py==13.590.48
2026-02-21T08:07:53.1058404Z  + pillow==12.1.1
2026-02-21T08:07:53.1058547Z  + pyparsing==3.3.2
2026-02-21T08:07:53.1058691Z  + python-dateutil==2.9.0.post0
2026-02-21T08:07:53.1058856Z  + regex==2026.2.19
2026-02-21T08:07:53.1058986Z  + requests==2.32.5
2026-02-21T08:07:53.1059131Z  + safetensors==0.7.0
2026-02-21T08:07:53.1059269Z  + six==1.17.0
2026-02-21T08:07:53.1059406Z  + tabulate==0.9.0
2026-02-21T08:07:53.1059558Z  + tokenizers==0.21.4
2026-02-21T08:07:53.1059695Z  + tqdm==4.67.3
2026-02-21T08:07:53.1059836Z  + transformers==4.53.0
2026-02-21T08:07:53.1059976Z  + urllib3==2.6.3
2026-02-21T08:07:53.1163760Z + python install.py --liger
2026-02-21T08:07:58.3507675Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv
2026-02-21T08:07:58.3531332Z Audited 6 packages in 3ms
2026-02-21T08:07:58.4118462Z INFO:__main__:[tritonbench] installing liger-kernels...
2026-02-21T08:07:58.4177595Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv
2026-02-21T08:07:58.5069653Z Resolved 1 package in 88ms
2026-02-21T08:07:58.5288593Z Prepared 1 package in 21ms
2026-02-21T08:07:58.5326361Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:07:58.5326982Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:07:58.5328010Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:07:58.5355299Z Installed 1 package in 7ms
2026-02-21T08:07:58.5355616Z  + liger-kernel-nightly==0.7.0.dev20260219183429
2026-02-21T08:07:58.5386790Z INFO:__main__:[tritonbench] installation complete!
2026-02-21T08:07:58.9814125Z + uv pip install -e . --no-deps
2026-02-21T08:07:59.0245768Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv
2026-02-21T08:07:59.0277546Z Resolved 1 package in 2ms
2026-02-21T08:07:59.0288069Z    Building tritonbench @ file:///__w/helion/helion/benchmarks/tritonbench
2026-02-21T08:07:59.8034807Z       Built tritonbench @ file:///__w/helion/helion/benchmarks/tritonbench
2026-02-21T08:07:59.8056166Z Prepared 1 package in 777ms
2026-02-21T08:07:59.8058374Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:07:59.8058879Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:07:59.8059424Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:07:59.8059767Z Installed 1 package in 0.60ms
2026-02-21T08:07:59.8060041Z  + tritonbench==0.0.1 (from file:///__w/helion/helion/benchmarks/tritonbench)
2026-02-21T08:07:59.8138223Z + popd
2026-02-21T08:07:59.8138404Z + popd
2026-02-21T08:07:59.8140218Z /__w/helion/helion/benchmarks /__w/helion/helion
2026-02-21T08:07:59.8140471Z /__w/helion/helion
2026-02-21T08:07:59.8194791Z ##[group]Run rm -rf /tmp/torchinductor_*/ || true
2026-02-21T08:07:59.8195090Z rm -rf /tmp/torchinductor_*/ || true
2026-02-21T08:07:59.8195269Z 
2026-02-21T08:07:59.8195411Z source .venv/bin/activate
2026-02-21T08:07:59.8195574Z 
2026-02-21T08:07:59.8195732Z TEST_REPORTS_DIR=$(pwd)/test/test-reports
2026-02-21T08:07:59.8195935Z mkdir -p "$TEST_REPORTS_DIR"
2026-02-21T08:07:59.8196118Z echo "$TEST_REPORTS_DIR"
2026-02-21T08:07:59.8196273Z 
2026-02-21T08:07:59.8196401Z KERNEL_LIST="kl_div"
2026-02-21T08:07:59.8196583Z for kernel in ${KERNEL_LIST//,/ }; do
2026-02-21T08:07:59.8196795Z   echo "=========================================="
2026-02-21T08:07:59.8197034Z   echo "Running benchmark for kernel: $kernel"
2026-02-21T08:07:59.8197250Z   echo "=========================================="
2026-02-21T08:07:59.8197438Z 
2026-02-21T08:07:59.8197651Z   # Get available implementations and baseline for this kernel
2026-02-21T08:07:59.8198039Z   KERNEL_INFO=$(python benchmarks/run.py --list-impls-for-benchmark-ci --op $kernel | grep "^$kernel:")
2026-02-21T08:07:59.8198426Z   IMPLS=$(echo "$KERNEL_INFO" | sed -n 's/.*impls=\([^ ]*\).*/\1/p')
2026-02-21T08:07:59.8198733Z   BASELINE=$(echo "$KERNEL_INFO" | sed -n 's/.*baseline=\([^ ]*\).*/\1/p')
2026-02-21T08:07:59.8198971Z 
2026-02-21T08:07:59.8199107Z   if [[ -z "$IMPLS" ]]; then
2026-02-21T08:07:59.8199356Z     echo "Warning: No implementations found for kernel $kernel, skipping..."
2026-02-21T08:07:59.8199618Z     continue
2026-02-21T08:07:59.8199771Z   fi
2026-02-21T08:07:59.8199918Z   if [[ -z "$BASELINE" ]]; then
2026-02-21T08:07:59.8200161Z     echo "Warning: No baseline found for kernel $kernel, skipping..."
2026-02-21T08:07:59.8200393Z     continue
2026-02-21T08:07:59.8200546Z   fi
2026-02-21T08:07:59.8200691Z   echo "Using baseline: $BASELINE"
2026-02-21T08:07:59.8200923Z   echo "Available implementations for $kernel: $IMPLS"
2026-02-21T08:07:59.8201127Z 
2026-02-21T08:07:59.8201294Z   # Do autotuning but do not record the results
2026-02-21T08:07:59.8201497Z    python benchmarks/run.py \
2026-02-21T08:07:59.8201676Z       --op $kernel \
2026-02-21T08:07:59.8201898Z       --metrics speedup,accuracy \
2026-02-21T08:07:59.8202125Z       --latency-measure-mode triton_do_bench \
2026-02-21T08:07:59.8202346Z       --cudagraph \
2026-02-21T08:07:59.8202513Z       --only $IMPLS \
2026-02-21T08:07:59.8202766Z       --only-match-mode prefix-with-baseline \
2026-02-21T08:07:59.8202964Z       --baseline $BASELINE \
2026-02-21T08:07:59.8203138Z       --atol 1e-2 \
2026-02-21T08:07:59.8203300Z       --rtol 1e-2 \
2026-02-21T08:07:59.8203607Z       --input-sample-mode equally-spaced-k \
2026-02-21T08:07:59.8203810Z       --keep-going \
2026-02-21T08:07:59.8203957Z 
2026-02-21T08:07:59.8204086Z 
2026-02-21T08:07:59.8204209Z   # Relax the GPU
2026-02-21T08:07:59.8204364Z   sleep 2m
2026-02-21T08:07:59.8204493Z 
2026-02-21T08:07:59.8204645Z   # Run again with cache and record results
2026-02-21T08:07:59.8204937Z    HELION_PRINT_OUTPUT_CODE=1 HELION_ASSERT_CACHE_HIT=1 python benchmarks/run.py \
2026-02-21T08:07:59.8205207Z       --op $kernel \
2026-02-21T08:07:59.8205382Z       --metrics speedup,accuracy \
2026-02-21T08:07:59.8205583Z       --latency-measure-mode triton_do_bench \
2026-02-21T08:07:59.8205776Z       --cudagraph \
2026-02-21T08:07:59.8205927Z       --only $IMPLS \
2026-02-21T08:07:59.8206219Z       --only-match-mode prefix-with-baseline \
2026-02-21T08:07:59.8206419Z       --baseline $BASELINE \
2026-02-21T08:07:59.8206587Z       --atol 1e-2 \
2026-02-21T08:07:59.8206738Z       --rtol 1e-2 \
2026-02-21T08:07:59.8206913Z       --input-sample-mode equally-spaced-k \
2026-02-21T08:07:59.8207145Z       --output "$TEST_REPORTS_DIR/helionbench.json" \
2026-02-21T08:07:59.8207371Z       --append-to-output \
2026-02-21T08:07:59.8207559Z       --keep-going \
2026-02-21T08:07:59.8207703Z 
2026-02-21T08:07:59.8207834Z 
2026-02-21T08:07:59.8208007Z   echo "✅ Completed benchmark for kernel: $kernel"
2026-02-21T08:07:59.8208201Z done
2026-02-21T08:07:59.8208328Z 
2026-02-21T08:07:59.8208491Z if [[ ! -s "$TEST_REPORTS_DIR/helionbench.json" ]]; then
2026-02-21T08:07:59.8208738Z   echo "❌ helionbench.json is missing or empty"
2026-02-21T08:07:59.8208925Z   exit 1
2026-02-21T08:07:59.8209065Z fi
2026-02-21T08:07:59.8209217Z cat "$TEST_REPORTS_DIR/helionbench.json"
2026-02-21T08:07:59.8209531Z shell: bash -l {0}
2026-02-21T08:07:59.8209673Z env:
2026-02-21T08:07:59.8209808Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:07:59.8210009Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:07:59.8210239Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:07:59.8210483Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:07:59.8210701Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:07:59.8210929Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:07:59.8211296Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:07:59.8211682Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:07:59.8211993Z ##[endgroup]
2026-02-21T08:07:59.8805612Z /__w/helion/helion/test/test-reports
2026-02-21T08:07:59.8805936Z ==========================================
2026-02-21T08:07:59.8806194Z Running benchmark for kernel: kl_div
2026-02-21T08:07:59.8806398Z ==========================================
2026-02-21T08:08:05.3733002Z Using baseline: torch_kl_div
2026-02-21T08:08:05.3735211Z Available implementations for kl_div: helion_kl_div_tritonbench,liger_kl_div,torch_compile_kl_div
2026-02-21T08:08:10.8191226Z Using num_inputs=20 for kl_div
2026-02-21T08:08:11.6803413Z Running kl_div benchmark with Helion implementation...
2026-02-21T08:08:11.6807425Z 
2026-02-21T08:08:12.1529161Z Warning: Requested 20 inputs but only 6 available. Using all available inputs.
2026-02-21T08:08:12.1529540Z Equally-spaced-k mode: Selected 6 equally spaced inputs (total available: 6)
2026-02-21T08:08:12.1529874Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 1, 2, 3, 4, 5]
2026-02-21T08:08:12.1533956Z 
2026-02-21T08:08:12.1545756Z   0%|          | 0/6 [00:00<?, ?it/s]WARNING:tritonbench.utils.triton_op:Running input ID 0:
2026-02-21T08:08:12.1550170Z (B, T, V)
2026-02-21T08:08:12.1552013Z --------------
2026-02-21T08:08:12.1552188Z (8, 512, 4096)
2026-02-21T08:08:12.1581147Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for torch_kl_div
2026-02-21T08:08:13.4696262Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for liger_kl_div
2026-02-21T08:08:14.9705750Z INFO:tritonbench.utils.triton_op:Took 125.12ms to get benchmark function for torch_compile_kl_div
2026-02-21T08:08:18.3202244Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:08:18.3202818Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:08:18.3203008Z               'dtype': 'torch.float32',
2026-02-21T08:08:18.3203199Z               'shape': (4096, 4096),
2026-02-21T08:08:18.3203370Z               'stride': (4096, 1)},
2026-02-21T08:08:18.3203548Z             { 'device': 'cuda:0',
2026-02-21T08:08:18.3203726Z               'dtype': 'torch.float32',
2026-02-21T08:08:18.3203900Z               'shape': (4096, 4096),
2026-02-21T08:08:18.3204507Z               'stride': (4096, 1)}),
2026-02-21T08:08:18.3204704Z   'kwargs': {}}
2026-02-21T08:08:18.3204996Z INFO:tritonbench.utils.triton_op:Took 0.71ms to get benchmark function for helion_kl_div_tritonbench
2026-02-21T08:08:18.6753330Z [0s] Autotune random seed: 2134765727
2026-02-21T08:08:19.0657783Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:09:21.2832870Z [62s] Timeout after 60s compiling Config(block_sizes=[64, 512], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=32, num_sm_multiplier=128, num_stages=5, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[2, 4], range_unroll_factors=[0, 1], range_warp_specializes=[False, None])
2026-02-21T08:09:21.4923461Z [62s] Timeout after 60s compiling Config(block_sizes=[128, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', ''], num_stages=8, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None])
2026-02-21T08:09:22.2558269Z [63s] Timeout after 60s compiling Config(block_sizes=[512, 128], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', ''], num_sm_multiplier=64, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, True], range_num_stages=[4, 1], range_unroll_factors=[0, 1], range_warp_specializes=[True, None])
2026-02-21T08:09:22.2574727Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.3 configs/s
2026-02-21T08:09:22.5426724Z module {
2026-02-21T08:09:22.5427467Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:09:22.5433207Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:09:22.5433414Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:09:22.5433647Z     %cst = arith.constant dense<0.000000e+00> : tensor<512x32xf32>
2026-02-21T08:09:22.5433870Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:09:22.5434054Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:09:22.5434231Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:09:22.5434409Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:09:22.5434711Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<512x32xf32>>
2026-02-21T08:09:22.5435134Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<512x32xf32>>
2026-02-21T08:09:22.5435439Z     %2 = tt.get_program_id x : i32
2026-02-21T08:09:22.5435607Z     %3 = arith.muli %2, %c512_i32 : i32
2026-02-21T08:09:22.5435850Z     %4 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:09:22.5436382Z     %5 = tt.splat %3 : i32 -> tensor<512xi32>
2026-02-21T08:09:22.5436580Z     %6 = arith.addi %5, %4 : tensor<512xi32>
2026-02-21T08:09:22.5436894Z     %7 = scf.for %arg5 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg6 = %cst) -> (tensor<512x32xf32>)  : i32 {
2026-02-21T08:09:22.5437297Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<512x32xf32>> -> tensor<512x32xf32>
2026-02-21T08:09:22.5438104Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<512x32xf32>> -> tensor<512x32xf32>
2026-02-21T08:09:22.5438461Z       %13 = scf.if %arg3 -> (tensor<512x32xf32>) {
2026-02-21T08:09:22.5438838Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x32xf32>) -> tensor<512x32xf32>
2026-02-21T08:09:22.5439221Z         %16 = arith.subf %12, %11 : tensor<512x32xf32>
2026-02-21T08:09:22.5439714Z         %17 = arith.mulf %15, %16 : tensor<512x32xf32>
2026-02-21T08:09:22.5439938Z         %18 = arith.addf %17, %cst : tensor<512x32xf32>
2026-02-21T08:09:22.5440152Z         scf.yield %18 : tensor<512x32xf32>
2026-02-21T08:09:22.5441029Z       } else {
2026-02-21T08:09:22.5441212Z         %15 = tt.splat %arg4 : f32 -> tensor<512x32xf32>
2026-02-21T08:09:22.5441448Z         %16 = arith.cmpf ogt, %12, %15 : tensor<512x32xf32>
2026-02-21T08:09:22.5441708Z         %17 = arith.cmpf une, %12, %12 : tensor<512x32xf32>
2026-02-21T08:09:22.5441969Z         %18 = arith.ori %16, %17 : tensor<512x32xi1>
2026-02-21T08:09:22.5442206Z         %19 = arith.select %18, %12, %15 : tensor<512x32xi1>, tensor<512x32xf32>
2026-02-21T08:09:22.5442455Z         %20 = math.log %19 : tensor<512x32xf32>
2026-02-21T08:09:22.5442657Z         %21 = arith.subf %20, %11 : tensor<512x32xf32>
2026-02-21T08:09:22.5442849Z         %22 = arith.mulf %12, %21 : tensor<512x32xf32>
2026-02-21T08:09:22.5443060Z         %23 = arith.addf %22, %cst : tensor<512x32xf32>
2026-02-21T08:09:22.5443254Z         scf.yield %23 : tensor<512x32xf32>
2026-02-21T08:09:22.5443426Z       }
2026-02-21T08:09:22.5443565Z       %14 = arith.addf %arg6, %13 : tensor<512x32xf32>
2026-02-21T08:09:22.5443759Z       scf.yield %14 : tensor<512x32xf32>
2026-02-21T08:09:22.5444069Z     } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:09:22.5444397Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:09:22.5444587Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:09:22.5444755Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:09:22.5444939Z       tt.reduce.return %11 : f32
2026-02-21T08:09:22.5445113Z     }) : (tensor<512x32xf32>) -> tensor<512xf32>
2026-02-21T08:09:22.5445340Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<512x!tt.ptr<f32>>
2026-02-21T08:09:22.5445623Z     %10 = tt.addptr %9, %6 : tensor<512x!tt.ptr<f32>>, tensor<512xi32>
2026-02-21T08:09:22.5445854Z     tt.store %10, %8 : tensor<512x!tt.ptr<f32>>
2026-02-21T08:09:22.5446040Z     tt.return
2026-02-21T08:09:22.5446160Z   }
2026-02-21T08:09:22.5446281Z }
2026-02-21T08:09:22.5446346Z 
2026-02-21T08:09:22.5446394Z {-#
2026-02-21T08:09:22.5446532Z   external_resources: {
2026-02-21T08:09:22.5446687Z     mlir_reproducer: {
2026-02-21T08:09:22.5451100Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:09:22.5455636Z       disable_threading: false,
2026-02-21T08:09:22.5455818Z       verify_each: true
2026-02-21T08:09:22.5455972Z     }
2026-02-21T08:09:22.5456107Z   }
2026-02-21T08:09:22.5456223Z #-}
2026-02-21T08:09:22.5456758Z /tmp/torchinductor_root/kq/ckqgjm23ayjvdfpaxlbrhuqqydpuweref52jm26k6bgk45sg2bj3.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:09:22.5458046Z /tmp/torchinductor_root/kq/ckqgjm23ayjvdfpaxlbrhuqqydpuweref52jm26k6bgk45sg2bj3.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:09:22.5459082Z [63s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:09:22.5460167Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:09:22.5461095Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:09:22.5461346Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:09:23.1297349Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:09:23.1298040Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:09:23.1298607Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:09:23.1299988Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:09:23.1300190Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:09:23.1300407Z     %c2368_i32 = arith.constant 2368 : i32
2026-02-21T08:09:23.1306151Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x32xf32>
2026-02-21T08:09:23.1310015Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:09:23.1314642Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:09:23.1318620Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:09:23.1322501Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:09:23.1324631Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<16x32xf32>>
2026-02-21T08:09:23.1325071Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<16x32xf32>>
2026-02-21T08:09:23.1325382Z     %2 = tt.get_program_id x : i32
2026-02-21T08:09:23.1325559Z     %3 = arith.subi %c256_i32, %2 : i32
2026-02-21T08:09:23.1325745Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:09:23.1325972Z     %4 = arith.subi %c2368_i32, %c1_i32 : i32
2026-02-21T08:09:23.1326514Z     %5 = arith.addi %3, %4 : i32
2026-02-21T08:09:23.1326690Z     %6 = arith.divui %5, %c2368_i32 : i32
2026-02-21T08:09:23.1326864Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:09:23.1327041Z     %7 = arith.remsi %6, %c4_i32 : i32
2026-02-21T08:09:23.1327208Z     %8 = arith.subi %6, %7 : i32
2026-02-21T08:09:23.1327376Z     %9 = arith.muli %8, %c2368_i32 : i32
2026-02-21T08:09:23.1327544Z     %10 = arith.addi %2, %9 : i32
2026-02-21T08:09:23.1327724Z     %11 = arith.muli %c2368_i32, %c4_i32 : i32
2026-02-21T08:09:23.1327916Z     scf.for %arg5 = %2 to %10 step %11  : i32 {
2026-02-21T08:09:23.1328110Z       %12 = arith.muli %arg5, %c16_i32 : i32
2026-02-21T08:09:23.1328385Z       %13 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:09:23.1328634Z       %14 = tt.splat %12 : i32 -> tensor<16xi32>
2026-02-21T08:09:23.1328823Z       %15 = arith.addi %14, %13 : tensor<16xi32>
2026-02-21T08:09:23.1329222Z       %16 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>)  : i32 {
2026-02-21T08:09:23.1329629Z         %50 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:09:23.1329996Z         %51 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:09:23.1330286Z         %52 = scf.if %arg3 -> (tensor<16x32xf32>) {
2026-02-21T08:09:23.1330655Z           %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32>
2026-02-21T08:09:23.1331029Z           %55 = arith.subf %51, %50 : tensor<16x32xf32>
2026-02-21T08:09:23.1331231Z           %56 = arith.mulf %54, %55 : tensor<16x32xf32>
2026-02-21T08:09:23.1331495Z           %57 = arith.addf %56, %cst : tensor<16x32xf32>
2026-02-21T08:09:23.1331693Z           scf.yield %57 : tensor<16x32xf32>
2026-02-21T08:09:23.1332040Z         } else {
2026-02-21T08:09:23.1332219Z           %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32>
2026-02-21T08:09:23.1332437Z           %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32>
2026-02-21T08:09:23.1332661Z           %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32>
2026-02-21T08:09:23.1332871Z           %57 = arith.ori %55, %56 : tensor<16x32xi1>
2026-02-21T08:09:23.1333115Z           %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32>
2026-02-21T08:09:23.1333362Z           %59 = math.log %58 : tensor<16x32xf32>
2026-02-21T08:09:23.1333554Z           %60 = arith.subf %59, %50 : tensor<16x32xf32>
2026-02-21T08:09:23.1333755Z           %61 = arith.mulf %51, %60 : tensor<16x32xf32>
2026-02-21T08:09:23.1333956Z           %62 = arith.addf %61, %cst : tensor<16x32xf32>
2026-02-21T08:09:23.1334155Z           scf.yield %62 : tensor<16x32xf32>
2026-02-21T08:09:23.1334319Z         }
2026-02-21T08:09:23.1334474Z         %53 = arith.addf %arg7, %52 : tensor<16x32xf32>
2026-02-21T08:09:23.1334667Z         scf.yield %53 : tensor<16x32xf32>
2026-02-21T08:09:23.1334894Z       } {tt.disallow_acc_multi_buffer, tt.warp_specialize}
2026-02-21T08:09:23.1335121Z       %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({
2026-02-21T08:09:23.1335305Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:23.1335485Z         %50 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:09:23.1335665Z         tt.reduce.return %50 : f32
2026-02-21T08:09:23.1335851Z       }) : (tensor<16x32xf32>) -> tensor<16xf32>
2026-02-21T08:09:23.1336072Z       %18 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:09:23.1336336Z       %19 = tt.addptr %18, %15 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:09:23.1336573Z       tt.store %19, %17 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:09:23.1336765Z       %c1_i32_0 = arith.constant 1 : i32
2026-02-21T08:09:23.1336955Z       %20 = arith.muli %c2368_i32, %c1_i32_0 : i32
2026-02-21T08:09:23.1337140Z       %21 = arith.addi %arg5, %20 : i32
2026-02-21T08:09:23.1337323Z       %22 = arith.muli %21, %c16_i32 : i32
2026-02-21T08:09:23.1337617Z       %23 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:09:23.1337858Z       %24 = tt.splat %22 : i32 -> tensor<16xi32>
2026-02-21T08:09:23.1338042Z       %25 = arith.addi %24, %23 : tensor<16xi32>
2026-02-21T08:09:23.1338354Z       %26 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>)  : i32 {
2026-02-21T08:09:23.1338751Z         %50 = tt.descriptor_load %0[%22, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:09:23.1339110Z         %51 = tt.descriptor_load %1[%22, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:09:23.1339400Z         %52 = scf.if %arg3 -> (tensor<16x32xf32>) {
2026-02-21T08:09:23.1339752Z           %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32>
2026-02-21T08:09:23.1340176Z           %55 = arith.subf %51, %50 : tensor<16x32xf32>
2026-02-21T08:09:23.1340386Z           %56 = arith.mulf %54, %55 : tensor<16x32xf32>
2026-02-21T08:09:23.1340582Z           %57 = arith.addf %56, %cst : tensor<16x32xf32>
2026-02-21T08:09:23.1340775Z           scf.yield %57 : tensor<16x32xf32>
2026-02-21T08:09:23.1340938Z         } else {
2026-02-21T08:09:23.1341098Z           %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32>
2026-02-21T08:09:23.1341306Z           %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32>
2026-02-21T08:09:23.1341524Z           %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32>
2026-02-21T08:09:23.1341731Z           %57 = arith.ori %55, %56 : tensor<16x32xi1>
2026-02-21T08:09:23.1341994Z           %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32>
2026-02-21T08:09:23.1342235Z           %59 = math.log %58 : tensor<16x32xf32>
2026-02-21T08:09:23.1342423Z           %60 = arith.subf %59, %50 : tensor<16x32xf32>
2026-02-21T08:09:23.1342624Z           %61 = arith.mulf %51, %60 : tensor<16x32xf32>
2026-02-21T08:09:23.1342823Z           %62 = arith.addf %61, %cst : tensor<16x32xf32>
2026-02-21T08:09:23.1343028Z           scf.yield %62 : tensor<16x32xf32>
2026-02-21T08:09:23.1343198Z         }
2026-02-21T08:09:23.1343351Z         %53 = arith.addf %arg7, %52 : tensor<16x32xf32>
2026-02-21T08:09:23.1343555Z         scf.yield %53 : tensor<16x32xf32>
2026-02-21T08:09:23.1343770Z       } {tt.disallow_acc_multi_buffer, tt.warp_specialize}
2026-02-21T08:09:23.1343997Z       %27 = "tt.reduce"(%26) <{axis = 1 : i32}> ({
2026-02-21T08:09:23.1344189Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:23.1344375Z         %50 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:09:23.1344558Z         tt.reduce.return %50 : f32
2026-02-21T08:09:23.1344751Z       }) : (tensor<16x32xf32>) -> tensor<16xf32>
2026-02-21T08:09:23.1344986Z       %28 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:09:23.1345251Z       %29 = tt.addptr %28, %25 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:09:23.1345502Z       tt.store %29, %27 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:09:23.1345704Z       %c2_i32 = arith.constant 2 : i32
2026-02-21T08:09:23.1345900Z       %30 = arith.muli %c2368_i32, %c2_i32 : i32
2026-02-21T08:09:23.1346090Z       %31 = arith.addi %arg5, %30 : i32
2026-02-21T08:09:23.1346279Z       %32 = arith.muli %31, %c16_i32 : i32
2026-02-21T08:09:23.1346505Z       %33 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:09:23.1346753Z       %34 = tt.splat %32 : i32 -> tensor<16xi32>
2026-02-21T08:09:23.1346950Z       %35 = arith.addi %34, %33 : tensor<16xi32>
2026-02-21T08:09:23.1347257Z       %36 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>)  : i32 {
2026-02-21T08:09:23.1347676Z         %50 = tt.descriptor_load %0[%32, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:09:23.1348058Z         %51 = tt.descriptor_load %1[%32, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:09:23.1348418Z         %52 = scf.if %arg3 -> (tensor<16x32xf32>) {
2026-02-21T08:09:23.1348789Z           %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32>
2026-02-21T08:09:23.1349160Z           %55 = arith.subf %51, %50 : tensor<16x32xf32>
2026-02-21T08:09:23.1349376Z           %56 = arith.mulf %54, %55 : tensor<16x32xf32>
2026-02-21T08:09:23.1349586Z           %57 = arith.addf %56, %cst : tensor<16x32xf32>
2026-02-21T08:09:23.1349789Z           scf.yield %57 : tensor<16x32xf32>
2026-02-21T08:09:23.1349962Z         } else {
2026-02-21T08:09:23.1350132Z           %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32>
2026-02-21T08:09:23.1350358Z           %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32>
2026-02-21T08:09:23.1350589Z           %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32>
2026-02-21T08:09:23.1350800Z           %57 = arith.ori %55, %56 : tensor<16x32xi1>
2026-02-21T08:09:23.1351110Z           %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32>
2026-02-21T08:09:23.1351353Z           %59 = math.log %58 : tensor<16x32xf32>
2026-02-21T08:09:23.1351540Z           %60 = arith.subf %59, %50 : tensor<16x32xf32>
2026-02-21T08:09:23.1351738Z           %61 = arith.mulf %51, %60 : tensor<16x32xf32>
2026-02-21T08:09:23.1351974Z           %62 = arith.addf %61, %cst : tensor<16x32xf32>
2026-02-21T08:09:23.1352162Z           scf.yield %62 : tensor<16x32xf32>
2026-02-21T08:09:23.1352333Z         }
2026-02-21T08:09:23.1352498Z         %53 = arith.addf %arg7, %52 : tensor<16x32xf32>
2026-02-21T08:09:23.1352686Z         scf.yield %53 : tensor<16x32xf32>
2026-02-21T08:09:23.1352896Z       } {tt.disallow_acc_multi_buffer, tt.warp_specialize}
2026-02-21T08:09:23.1353104Z       %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({
2026-02-21T08:09:23.1353293Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:23.1353468Z         %50 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:09:23.1353647Z         tt.reduce.return %50 : f32
2026-02-21T08:09:23.1353833Z       }) : (tensor<16x32xf32>) -> tensor<16xf32>
2026-02-21T08:09:23.1354052Z       %38 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:09:23.1354308Z       %39 = tt.addptr %38, %35 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:09:23.1354535Z       tt.store %39, %37 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:09:23.1354730Z       %c3_i32 = arith.constant 3 : i32
2026-02-21T08:09:23.1354903Z       %40 = arith.muli %c2368_i32, %c3_i32 : i32
2026-02-21T08:09:23.1355087Z       %41 = arith.addi %arg5, %40 : i32
2026-02-21T08:09:23.1355263Z       %42 = arith.muli %41, %c16_i32 : i32
2026-02-21T08:09:23.1355478Z       %43 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:09:23.1355714Z       %44 = tt.splat %42 : i32 -> tensor<16xi32>
2026-02-21T08:09:23.1355895Z       %45 = arith.addi %44, %43 : tensor<16xi32>
2026-02-21T08:09:23.1356195Z       %46 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>)  : i32 {
2026-02-21T08:09:23.1356583Z         %50 = tt.descriptor_load %0[%42, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:09:23.1356945Z         %51 = tt.descriptor_load %1[%42, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:09:23.1357231Z         %52 = scf.if %arg3 -> (tensor<16x32xf32>) {
2026-02-21T08:09:23.1357580Z           %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32>
2026-02-21T08:09:23.1357941Z           %55 = arith.subf %51, %50 : tensor<16x32xf32>
2026-02-21T08:09:23.1358142Z           %56 = arith.mulf %54, %55 : tensor<16x32xf32>
2026-02-21T08:09:23.1358349Z           %57 = arith.addf %56, %cst : tensor<16x32xf32>
2026-02-21T08:09:23.1358545Z           scf.yield %57 : tensor<16x32xf32>
2026-02-21T08:09:23.1358708Z         } else {
2026-02-21T08:09:23.1358870Z           %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32>
2026-02-21T08:09:23.1359134Z           %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32>
2026-02-21T08:09:23.1359349Z           %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32>
2026-02-21T08:09:23.1359550Z           %57 = arith.ori %55, %56 : tensor<16x32xi1>
2026-02-21T08:09:23.1359785Z           %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32>
2026-02-21T08:09:23.1360022Z           %59 = math.log %58 : tensor<16x32xf32>
2026-02-21T08:09:23.1360208Z           %60 = arith.subf %59, %50 : tensor<16x32xf32>
2026-02-21T08:09:23.1360408Z           %61 = arith.mulf %51, %60 : tensor<16x32xf32>
2026-02-21T08:09:23.1360605Z           %62 = arith.addf %61, %cst : tensor<16x32xf32>
2026-02-21T08:09:23.1360800Z           scf.yield %62 : tensor<16x32xf32>
2026-02-21T08:09:23.1360963Z         }
2026-02-21T08:09:23.1361110Z         %53 = arith.addf %arg7, %52 : tensor<16x32xf32>
2026-02-21T08:09:23.1361295Z         scf.yield %53 : tensor<16x32xf32>
2026-02-21T08:09:23.1361577Z       } {tt.disallow_acc_multi_buffer, tt.warp_specialize}
2026-02-21T08:09:23.1361799Z       %47 = "tt.reduce"(%46) <{axis = 1 : i32}> ({
2026-02-21T08:09:23.1362009Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:23.1362182Z         %50 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:09:23.1362356Z         tt.reduce.return %50 : f32
2026-02-21T08:09:23.1362540Z       }) : (tensor<16x32xf32>) -> tensor<16xf32>
2026-02-21T08:09:23.1362762Z       %48 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:09:23.1363011Z       %49 = tt.addptr %48, %45 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:09:23.1363242Z       tt.store %49, %47 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:09:23.1363414Z     }
2026-02-21T08:09:23.1363581Z     scf.for %arg5 = %10 to %c256_i32 step %c2368_i32  : i32 {
2026-02-21T08:09:23.1363787Z       %12 = arith.muli %arg5, %c16_i32 : i32
2026-02-21T08:09:23.1364012Z       %13 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:09:23.1364244Z       %14 = tt.splat %12 : i32 -> tensor<16xi32>
2026-02-21T08:09:23.1364437Z       %15 = arith.addi %14, %13 : tensor<16xi32>
2026-02-21T08:09:23.1364738Z       %16 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>)  : i32 {
2026-02-21T08:09:23.1365121Z         %20 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:09:23.1365480Z         %21 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:09:23.1365757Z         %22 = scf.if %arg3 -> (tensor<16x32xf32>) {
2026-02-21T08:09:23.1366113Z           %24 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32>
2026-02-21T08:09:23.1366472Z           %25 = arith.subf %21, %20 : tensor<16x32xf32>
2026-02-21T08:09:23.1366671Z           %26 = arith.mulf %24, %25 : tensor<16x32xf32>
2026-02-21T08:09:23.1366880Z           %27 = arith.addf %26, %cst : tensor<16x32xf32>
2026-02-21T08:09:23.1367069Z           scf.yield %27 : tensor<16x32xf32>
2026-02-21T08:09:23.1367236Z         } else {
2026-02-21T08:09:23.1367387Z           %24 = tt.splat %arg4 : f32 -> tensor<16x32xf32>
2026-02-21T08:09:23.1367601Z           %25 = arith.cmpf ogt, %21, %24 : tensor<16x32xf32>
2026-02-21T08:09:23.1367814Z           %26 = arith.cmpf une, %21, %21 : tensor<16x32xf32>
2026-02-21T08:09:23.1368014Z           %27 = arith.ori %25, %26 : tensor<16x32xi1>
2026-02-21T08:09:23.1368247Z           %28 = arith.select %27, %21, %24 : tensor<16x32xi1>, tensor<16x32xf32>
2026-02-21T08:09:23.1368476Z           %29 = math.log %28 : tensor<16x32xf32>
2026-02-21T08:09:23.1368667Z           %30 = arith.subf %29, %20 : tensor<16x32xf32>
2026-02-21T08:09:23.1368859Z           %31 = arith.mulf %21, %30 : tensor<16x32xf32>
2026-02-21T08:09:23.1369061Z           %32 = arith.addf %31, %cst : tensor<16x32xf32>
2026-02-21T08:09:23.1369256Z           scf.yield %32 : tensor<16x32xf32>
2026-02-21T08:09:23.1369471Z         }
2026-02-21T08:09:23.1369616Z         %23 = arith.addf %arg7, %22 : tensor<16x32xf32>
2026-02-21T08:09:23.1369803Z         scf.yield %23 : tensor<16x32xf32>
2026-02-21T08:09:23.1370012Z       } {tt.disallow_acc_multi_buffer, tt.warp_specialize}
2026-02-21T08:09:23.1370220Z       %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({
2026-02-21T08:09:23.1370406Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:23.1370572Z         %20 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:09:23.1370754Z         tt.reduce.return %20 : f32
2026-02-21T08:09:23.1370936Z       }) : (tensor<16x32xf32>) -> tensor<16xf32>
2026-02-21T08:09:23.1371157Z       %18 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:09:23.1371420Z       %19 = tt.addptr %18, %15 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:09:23.1371651Z       tt.store %19, %17 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:09:23.1371888Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:09:23.1372102Z     tt.return
2026-02-21T08:09:23.1372243Z   }
2026-02-21T08:09:23.1372367Z }
2026-02-21T08:09:23.1372445Z 
2026-02-21T08:09:23.1372495Z {-#
2026-02-21T08:09:23.1372634Z   external_resources: {
2026-02-21T08:09:23.1372792Z     mlir_reproducer: {
2026-02-21T08:09:23.1377162Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=1}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=1}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=1}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:09:23.1381536Z       disable_threading: false,
2026-02-21T08:09:23.1381698Z       verify_each: true
2026-02-21T08:09:23.1381844Z     }
2026-02-21T08:09:23.1381988Z   }
2026-02-21T08:09:23.1382104Z #-}
2026-02-21T08:09:23.1382542Z /tmp/torchinductor_root/tt/cttkhh7z2rjgcswtho32r5mlfurqew4gn2qvsirs7o74lkakfnym.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:09:23.1383838Z /tmp/torchinductor_root/tt/cttkhh7z2rjgcswtho32r5mlfurqew4gn2qvsirs7o74lkakfnym.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:09:23.1384917Z [64s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:09:23.1386089Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=32, num_sm_multiplier=16, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, False], range_num_stages=[0, 0], range_unroll_factors=[4, 0], range_warp_specializes=[False, True]), static_shapes=True)
2026-02-21T08:09:23.1387207Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:09:23.1387463Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:09:25.6557917Z module {
2026-02-21T08:09:25.6560252Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:09:25.6560883Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:09:25.6561181Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:09:25.6561639Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:09:25.6562982Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x256xf32>
2026-02-21T08:09:25.6563223Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:09:25.6563404Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:09:25.6563589Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:09:25.6563760Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:09:25.6564073Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<16x256xf32>>
2026-02-21T08:09:25.6564505Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<16x256xf32>>
2026-02-21T08:09:25.6564802Z     %2 = tt.get_program_id x : i32
2026-02-21T08:09:25.6564979Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:09:25.6565157Z     %4 = arith.minsi %3, %c256_i32 : i32
2026-02-21T08:09:25.6569132Z     scf.for %arg5 = %2 to %4 step %c1_i32  : i32 {
2026-02-21T08:09:25.6570158Z       %5 = arith.muli %arg5, %c16_i32 : i32
2026-02-21T08:09:25.6570412Z       %6 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:09:25.6570674Z       %7 = tt.splat %5 : i32 -> tensor<16xi32>
2026-02-21T08:09:25.6570863Z       %8 = arith.addi %7, %6 : tensor<16xi32>
2026-02-21T08:09:25.6571174Z       %9 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c256_i32 iter_args(%arg7 = %cst) -> (tensor<16x256xf32>)  : i32 {
2026-02-21T08:09:25.6571582Z         %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc<tensor<16x256xf32>> -> tensor<16x256xf32>
2026-02-21T08:09:25.6572094Z         %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc<tensor<16x256xf32>> -> tensor<16x256xf32>
2026-02-21T08:09:25.6572392Z         %15 = scf.if %arg3 -> (tensor<16x256xf32>) {
2026-02-21T08:09:25.6572761Z           %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32>
2026-02-21T08:09:25.6573148Z           %18 = arith.subf %14, %13 : tensor<16x256xf32>
2026-02-21T08:09:25.6573360Z           %19 = arith.mulf %17, %18 : tensor<16x256xf32>
2026-02-21T08:09:25.6573576Z           %20 = arith.addf %19, %cst : tensor<16x256xf32>
2026-02-21T08:09:25.6573778Z           scf.yield %20 : tensor<16x256xf32>
2026-02-21T08:09:25.6573948Z         } else {
2026-02-21T08:09:25.6574119Z           %17 = tt.splat %arg4 : f32 -> tensor<16x256xf32>
2026-02-21T08:09:25.6574335Z           %18 = arith.cmpf ogt, %14, %17 : tensor<16x256xf32>
2026-02-21T08:09:25.6574559Z           %19 = arith.cmpf une, %14, %14 : tensor<16x256xf32>
2026-02-21T08:09:25.6574764Z           %20 = arith.ori %18, %19 : tensor<16x256xi1>
2026-02-21T08:09:25.6575029Z           %21 = arith.select %20, %14, %17 : tensor<16x256xi1>, tensor<16x256xf32>
2026-02-21T08:09:25.6575279Z           %22 = math.log %21 : tensor<16x256xf32>
2026-02-21T08:09:25.6575473Z           %23 = arith.subf %22, %13 : tensor<16x256xf32>
2026-02-21T08:09:25.6575687Z           %24 = arith.mulf %14, %23 : tensor<16x256xf32>
2026-02-21T08:09:25.6576166Z           %25 = arith.addf %24, %cst : tensor<16x256xf32>
2026-02-21T08:09:25.6576374Z           scf.yield %25 : tensor<16x256xf32>
2026-02-21T08:09:25.6576556Z         }
2026-02-21T08:09:25.6576711Z         %16 = arith.addf %arg7, %15 : tensor<16x256xf32>
2026-02-21T08:09:25.6576918Z         scf.yield %16 : tensor<16x256xf32>
2026-02-21T08:09:25.6577141Z       } {tt.flatten, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:09:25.6577372Z       %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({
2026-02-21T08:09:25.6577557Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:25.6577737Z         %13 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:09:25.6577918Z         tt.reduce.return %13 : f32
2026-02-21T08:09:25.6578109Z       }) : (tensor<16x256xf32>) -> tensor<16xf32>
2026-02-21T08:09:25.6578339Z       %11 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:09:25.6578677Z       %12 = tt.addptr %11, %8 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:09:25.6578916Z       tt.store %12, %10 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:09:25.6579092Z     }
2026-02-21T08:09:25.6579217Z     tt.return
2026-02-21T08:09:25.6579336Z   }
2026-02-21T08:09:25.6579458Z }
2026-02-21T08:09:25.6579523Z 
2026-02-21T08:09:25.6579571Z {-#
2026-02-21T08:09:25.6579701Z   external_resources: {
2026-02-21T08:09:25.6579861Z     mlir_reproducer: {
2026-02-21T08:09:25.6584179Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:09:25.6588587Z       disable_threading: false,
2026-02-21T08:09:25.6588754Z       verify_each: true
2026-02-21T08:09:25.6588890Z     }
2026-02-21T08:09:25.6589010Z   }
2026-02-21T08:09:25.6589116Z #-}
2026-02-21T08:09:25.6589525Z /tmp/torchinductor_root/um/cum4laz43eppv7xfzfxng5552jl2lswbog6nywjoe3ypha2sth6j.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:09:25.6590715Z /tmp/torchinductor_root/um/cum4laz43eppv7xfzfxng5552jl2lswbog6nywjoe3ypha2sth6j.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:09:25.6591684Z [66s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:09:25.6592833Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:09:25.6593769Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:09:25.6594014Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:09:26.0076310Z module {
2026-02-21T08:09:26.0077039Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:09:26.0077975Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:09:26.0078199Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:09:26.0078439Z     %cst = arith.constant dense<0.000000e+00> : tensor<32x16xf32>
2026-02-21T08:09:26.0078673Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:09:26.0078867Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:09:26.0079051Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:09:26.0079238Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:09:26.0079554Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<32x16xf32>>
2026-02-21T08:09:26.0080002Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<32x16xf32>>
2026-02-21T08:09:26.0080322Z     %2 = tt.get_program_id x : i32
2026-02-21T08:09:26.0080518Z     %3 = arith.muli %2, %c32_i32 : i32
2026-02-21T08:09:26.0080751Z     %4 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32>
2026-02-21T08:09:26.0081006Z     %5 = tt.splat %3 : i32 -> tensor<32xi32>
2026-02-21T08:09:26.0081214Z     %6 = arith.addi %5, %4 : tensor<32xi32>
2026-02-21T08:09:26.0081529Z     %7 = scf.for %arg5 = %c0_i32 to %c4096_i32 step %c16_i32 iter_args(%arg6 = %cst) -> (tensor<32x16xf32>)  : i32 {
2026-02-21T08:09:26.0082130Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<32x16xf32>> -> tensor<32x16xf32>
2026-02-21T08:09:26.0082527Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<32x16xf32>> -> tensor<32x16xf32>
2026-02-21T08:09:26.0082825Z       %13 = scf.if %arg3 -> (tensor<32x16xf32>) {
2026-02-21T08:09:26.0083216Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x16xf32>) -> tensor<32x16xf32>
2026-02-21T08:09:26.0083607Z         %16 = arith.subf %12, %11 : tensor<32x16xf32>
2026-02-21T08:09:26.0083820Z         %17 = arith.mulf %15, %16 : tensor<32x16xf32>
2026-02-21T08:09:26.0084032Z         %18 = arith.addf %17, %cst : tensor<32x16xf32>
2026-02-21T08:09:26.0084228Z         scf.yield %18 : tensor<32x16xf32>
2026-02-21T08:09:26.0084402Z       } else {
2026-02-21T08:09:26.0084556Z         %15 = tt.splat %arg4 : f32 -> tensor<32x16xf32>
2026-02-21T08:09:26.0084775Z         %16 = arith.cmpf ogt, %12, %15 : tensor<32x16xf32>
2026-02-21T08:09:26.0084985Z         %17 = arith.cmpf une, %12, %12 : tensor<32x16xf32>
2026-02-21T08:09:26.0085189Z         %18 = arith.ori %16, %17 : tensor<32x16xi1>
2026-02-21T08:09:26.0085427Z         %19 = arith.select %18, %12, %15 : tensor<32x16xi1>, tensor<32x16xf32>
2026-02-21T08:09:26.0085659Z         %20 = math.log %19 : tensor<32x16xf32>
2026-02-21T08:09:26.0085853Z         %21 = arith.subf %20, %11 : tensor<32x16xf32>
2026-02-21T08:09:26.0086046Z         %22 = arith.mulf %12, %21 : tensor<32x16xf32>
2026-02-21T08:09:26.0086249Z         %23 = arith.addf %22, %cst : tensor<32x16xf32>
2026-02-21T08:09:26.0086435Z         scf.yield %23 : tensor<32x16xf32>
2026-02-21T08:09:26.0086603Z       }
2026-02-21T08:09:26.0086751Z       %14 = arith.addf %arg6, %13 : tensor<32x16xf32>
2026-02-21T08:09:26.0087060Z       scf.yield %14 : tensor<32x16xf32>
2026-02-21T08:09:26.0087245Z     } {tt.warp_specialize}
2026-02-21T08:09:26.0087407Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:09:26.0087594Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:09:26.0087760Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:09:26.0087945Z       tt.reduce.return %11 : f32
2026-02-21T08:09:26.0088120Z     }) : (tensor<32x16xf32>) -> tensor<32xf32>
2026-02-21T08:09:26.0088349Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<32x!tt.ptr<f32>>
2026-02-21T08:09:26.0088606Z     %10 = tt.addptr %9, %6 : tensor<32x!tt.ptr<f32>>, tensor<32xi32>
2026-02-21T08:09:26.0088829Z     tt.store %10, %8 : tensor<32x!tt.ptr<f32>>
2026-02-21T08:09:26.0089008Z     tt.return
2026-02-21T08:09:26.0089126Z   }
2026-02-21T08:09:26.0089275Z }
2026-02-21T08:09:26.0089339Z 
2026-02-21T08:09:26.0089395Z {-#
2026-02-21T08:09:26.0089580Z   external_resources: {
2026-02-21T08:09:26.0089750Z     mlir_reproducer: {
2026-02-21T08:09:26.0093988Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=1}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=1}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=1}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:09:26.0098325Z       disable_threading: false,
2026-02-21T08:09:26.0098502Z       verify_each: true
2026-02-21T08:09:26.0098656Z     }
2026-02-21T08:09:26.0098792Z   }
2026-02-21T08:09:26.0098927Z #-}
2026-02-21T08:09:26.0099443Z /tmp/torchinductor_root/pj/cpjr5oe4aculykfgs75dfjbrgt34pqrvlqqderpr3fiksvzv4gvf.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:09:26.0100804Z /tmp/torchinductor_root/pj/cpjr5oe4aculykfgs75dfjbrgt34pqrvlqqderpr3fiksvzv4gvf.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:09:26.0101908Z [66s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:09:26.0103010Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', ''], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:09:26.0104033Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:09:26.0104318Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:09:26.0578281Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:09:26.0581033Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:09:26.0581934Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:09:26.0586912Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:09:26.0588769Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:09:26.0589020Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:09:26.0589259Z     %cst = arith.constant dense<4096> : tensor<512x1xi32>
2026-02-21T08:09:26.0589732Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<512x4xf32>
2026-02-21T08:09:26.0590006Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:09:26.0590196Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:09:26.0590384Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:09:26.0590568Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:09:26.0590882Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<512x4xf32>>
2026-02-21T08:09:26.0591205Z     %1 = tt.get_program_id x : i32
2026-02-21T08:09:26.0591409Z     scf.for %arg5 = %1 to %c8_i32 step %c9472_i32  : i32 {
2026-02-21T08:09:26.0591629Z       %2 = arith.muli %arg5, %c512_i32 : i32
2026-02-21T08:09:26.0591923Z       %3 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:09:26.0592187Z       %4 = tt.splat %2 : i32 -> tensor<512xi32>
2026-02-21T08:09:26.0592388Z       %5 = arith.addi %4, %3 : tensor<512xi32>
2026-02-21T08:09:26.0592582Z       %c4092_i32 = arith.constant 4092 : i32
2026-02-21T08:09:26.0592783Z       %c12_i32 = arith.constant 12 : i32
2026-02-21T08:09:26.0593091Z       %6 = scf.for %arg6 = %c0_i32 to %c4092_i32 step %c12_i32 iter_args(%arg7 = %cst_0) -> (tensor<512x4xf32>)  : i32 {
2026-02-21T08:09:26.0593462Z         %25 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:09:26.0593709Z         %26 = tt.splat %arg6 : i32 -> tensor<4xi32>
2026-02-21T08:09:26.0593922Z         %27 = arith.addi %26, %25 : tensor<4xi32>
2026-02-21T08:09:26.0594216Z         %28 = tt.descriptor_load %0[%2, %arg6] : !tt.tensordesc<tensor<512x4xf32>> -> tensor<512x4xf32>
2026-02-21T08:09:26.0594557Z         %29 = tt.expand_dims %5 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32>
2026-02-21T08:09:26.0594832Z         %30 = arith.muli %29, %cst : tensor<512x1xi32>
2026-02-21T08:09:26.0595079Z         %31 = tt.expand_dims %27 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32>
2026-02-21T08:09:26.0595368Z         %32 = tt.broadcast %30 : tensor<512x1xi32> -> tensor<512x4xi32>
2026-02-21T08:09:26.0595623Z         %33 = tt.broadcast %31 : tensor<1x4xi32> -> tensor<512x4xi32>
2026-02-21T08:09:26.0595857Z         %34 = arith.addi %32, %33 : tensor<512x4xi32>
2026-02-21T08:09:26.0596098Z         %35 = tt.splat %arg1 : !tt.ptr<f32> -> tensor<512x4x!tt.ptr<f32>>
2026-02-21T08:09:26.0596367Z         %36 = tt.addptr %35, %34 : tensor<512x4x!tt.ptr<f32>>, tensor<512x4xi32>
2026-02-21T08:09:26.0596616Z         %37 = tt.load %36 : tensor<512x4x!tt.ptr<f32>>
2026-02-21T08:09:26.0596817Z         %38 = scf.if %arg3 -> (tensor<512x4xf32>) {
2026-02-21T08:09:26.0597189Z           %74 = tt.extern_elementwise %37 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x4xf32>) -> tensor<512x4xf32>
2026-02-21T08:09:26.0597562Z           %75 = arith.subf %37, %28 : tensor<512x4xf32>
2026-02-21T08:09:26.0597805Z           %76 = arith.mulf %74, %75 : tensor<512x4xf32>
2026-02-21T08:09:26.0598012Z           %77 = arith.addf %76, %cst_0 : tensor<512x4xf32>
2026-02-21T08:09:26.0598308Z           scf.yield %77 : tensor<512x4xf32>
2026-02-21T08:09:26.0598479Z         } else {
2026-02-21T08:09:26.0598649Z           %74 = tt.splat %arg4 : f32 -> tensor<512x4xf32>
2026-02-21T08:09:26.0598866Z           %75 = arith.cmpf ogt, %37, %74 : tensor<512x4xf32>
2026-02-21T08:09:26.0599091Z           %76 = arith.cmpf une, %37, %37 : tensor<512x4xf32>
2026-02-21T08:09:26.0599305Z           %77 = arith.ori %75, %76 : tensor<512x4xi1>
2026-02-21T08:09:26.0599538Z           %78 = arith.select %77, %37, %74 : tensor<512x4xi1>, tensor<512x4xf32>
2026-02-21T08:09:26.0599785Z           %79 = math.log %78 : tensor<512x4xf32>
2026-02-21T08:09:26.0599981Z           %80 = arith.subf %79, %28 : tensor<512x4xf32>
2026-02-21T08:09:26.0600187Z           %81 = arith.mulf %37, %80 : tensor<512x4xf32>
2026-02-21T08:09:26.0600391Z           %82 = arith.addf %81, %cst_0 : tensor<512x4xf32>
2026-02-21T08:09:26.0600595Z           scf.yield %82 : tensor<512x4xf32>
2026-02-21T08:09:26.0600841Z         }
2026-02-21T08:09:26.0600992Z         %39 = arith.addf %arg7, %38 : tensor<512x4xf32>
2026-02-21T08:09:26.0601194Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:09:26.0601385Z         %40 = arith.muli %c4_i32, %c1_i32 : i32
2026-02-21T08:09:26.0601579Z         %41 = arith.addi %arg6, %40 : i32
2026-02-21T08:09:26.0601816Z         %42 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:09:26.0602098Z         %43 = tt.splat %41 : i32 -> tensor<4xi32>
2026-02-21T08:09:26.0602286Z         %44 = arith.addi %43, %42 : tensor<4xi32>
2026-02-21T08:09:26.0602560Z         %45 = tt.descriptor_load %0[%2, %41] : !tt.tensordesc<tensor<512x4xf32>> -> tensor<512x4xf32>
2026-02-21T08:09:26.0602891Z         %46 = tt.expand_dims %5 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32>
2026-02-21T08:09:26.0603144Z         %47 = arith.muli %46, %cst : tensor<512x1xi32>
2026-02-21T08:09:26.0603397Z         %48 = tt.expand_dims %44 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32>
2026-02-21T08:09:26.0603674Z         %49 = tt.broadcast %47 : tensor<512x1xi32> -> tensor<512x4xi32>
2026-02-21T08:09:26.0603930Z         %50 = tt.broadcast %48 : tensor<1x4xi32> -> tensor<512x4xi32>
2026-02-21T08:09:26.0604160Z         %51 = arith.addi %49, %50 : tensor<512x4xi32>
2026-02-21T08:09:26.0604384Z         %52 = tt.splat %arg1 : !tt.ptr<f32> -> tensor<512x4x!tt.ptr<f32>>
2026-02-21T08:09:26.0604655Z         %53 = tt.addptr %52, %51 : tensor<512x4x!tt.ptr<f32>>, tensor<512x4xi32>
2026-02-21T08:09:26.0604893Z         %54 = tt.load %53 : tensor<512x4x!tt.ptr<f32>>
2026-02-21T08:09:26.0605100Z         %55 = scf.if %arg3 -> (tensor<512x4xf32>) {
2026-02-21T08:09:26.0605447Z           %74 = tt.extern_elementwise %54 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x4xf32>) -> tensor<512x4xf32>
2026-02-21T08:09:26.0605813Z           %75 = arith.subf %54, %45 : tensor<512x4xf32>
2026-02-21T08:09:26.0606018Z           %76 = arith.mulf %74, %75 : tensor<512x4xf32>
2026-02-21T08:09:26.0606225Z           %77 = arith.addf %76, %cst_0 : tensor<512x4xf32>
2026-02-21T08:09:26.0606424Z           scf.yield %77 : tensor<512x4xf32>
2026-02-21T08:09:26.0606589Z         } else {
2026-02-21T08:09:26.0606752Z           %74 = tt.splat %arg4 : f32 -> tensor<512x4xf32>
2026-02-21T08:09:26.0606964Z           %75 = arith.cmpf ogt, %54, %74 : tensor<512x4xf32>
2026-02-21T08:09:26.0607181Z           %76 = arith.cmpf une, %54, %54 : tensor<512x4xf32>
2026-02-21T08:09:26.0607389Z           %77 = arith.ori %75, %76 : tensor<512x4xi1>
2026-02-21T08:09:26.0607617Z           %78 = arith.select %77, %54, %74 : tensor<512x4xi1>, tensor<512x4xf32>
2026-02-21T08:09:26.0607854Z           %79 = math.log %78 : tensor<512x4xf32>
2026-02-21T08:09:26.0608041Z           %80 = arith.subf %79, %45 : tensor<512x4xf32>
2026-02-21T08:09:26.0608240Z           %81 = arith.mulf %54, %80 : tensor<512x4xf32>
2026-02-21T08:09:26.0608437Z           %82 = arith.addf %81, %cst_0 : tensor<512x4xf32>
2026-02-21T08:09:26.0608635Z           scf.yield %82 : tensor<512x4xf32>
2026-02-21T08:09:26.0608863Z         }
2026-02-21T08:09:26.0609000Z         %56 = arith.addf %39, %55 : tensor<512x4xf32>
2026-02-21T08:09:26.0609190Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:09:26.0609371Z         %57 = arith.muli %c4_i32, %c2_i32 : i32
2026-02-21T08:09:26.0609554Z         %58 = arith.addi %arg6, %57 : i32
2026-02-21T08:09:26.0609764Z         %59 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:09:26.0609999Z         %60 = tt.splat %58 : i32 -> tensor<4xi32>
2026-02-21T08:09:26.0610184Z         %61 = arith.addi %60, %59 : tensor<4xi32>
2026-02-21T08:09:26.0610455Z         %62 = tt.descriptor_load %0[%2, %58] : !tt.tensordesc<tensor<512x4xf32>> -> tensor<512x4xf32>
2026-02-21T08:09:26.0610783Z         %63 = tt.expand_dims %5 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32>
2026-02-21T08:09:26.0611033Z         %64 = arith.muli %63, %cst : tensor<512x1xi32>
2026-02-21T08:09:26.0611321Z         %65 = tt.expand_dims %61 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32>
2026-02-21T08:09:26.0611596Z         %66 = tt.broadcast %64 : tensor<512x1xi32> -> tensor<512x4xi32>
2026-02-21T08:09:26.0611871Z         %67 = tt.broadcast %65 : tensor<1x4xi32> -> tensor<512x4xi32>
2026-02-21T08:09:26.0612101Z         %68 = arith.addi %66, %67 : tensor<512x4xi32>
2026-02-21T08:09:26.0612323Z         %69 = tt.splat %arg1 : !tt.ptr<f32> -> tensor<512x4x!tt.ptr<f32>>
2026-02-21T08:09:26.0612592Z         %70 = tt.addptr %69, %68 : tensor<512x4x!tt.ptr<f32>>, tensor<512x4xi32>
2026-02-21T08:09:26.0612832Z         %71 = tt.load %70 : tensor<512x4x!tt.ptr<f32>>
2026-02-21T08:09:26.0613038Z         %72 = scf.if %arg3 -> (tensor<512x4xf32>) {
2026-02-21T08:09:26.0613389Z           %74 = tt.extern_elementwise %71 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x4xf32>) -> tensor<512x4xf32>
2026-02-21T08:09:26.0613752Z           %75 = arith.subf %71, %62 : tensor<512x4xf32>
2026-02-21T08:09:26.0613955Z           %76 = arith.mulf %74, %75 : tensor<512x4xf32>
2026-02-21T08:09:26.0614156Z           %77 = arith.addf %76, %cst_0 : tensor<512x4xf32>
2026-02-21T08:09:26.0614353Z           scf.yield %77 : tensor<512x4xf32>
2026-02-21T08:09:26.0614514Z         } else {
2026-02-21T08:09:26.0614677Z           %74 = tt.splat %arg4 : f32 -> tensor<512x4xf32>
2026-02-21T08:09:26.0614887Z           %75 = arith.cmpf ogt, %71, %74 : tensor<512x4xf32>
2026-02-21T08:09:26.0615107Z           %76 = arith.cmpf une, %71, %71 : tensor<512x4xf32>
2026-02-21T08:09:26.0615313Z           %77 = arith.ori %75, %76 : tensor<512x4xi1>
2026-02-21T08:09:26.0615535Z           %78 = arith.select %77, %71, %74 : tensor<512x4xi1>, tensor<512x4xf32>
2026-02-21T08:09:26.0615769Z           %79 = math.log %78 : tensor<512x4xf32>
2026-02-21T08:09:26.0615956Z           %80 = arith.subf %79, %62 : tensor<512x4xf32>
2026-02-21T08:09:26.0616152Z           %81 = arith.mulf %71, %80 : tensor<512x4xf32>
2026-02-21T08:09:26.0616392Z           %82 = arith.addf %81, %cst_0 : tensor<512x4xf32>
2026-02-21T08:09:26.0616596Z           scf.yield %82 : tensor<512x4xf32>
2026-02-21T08:09:26.0616763Z         }
2026-02-21T08:09:26.0616900Z         %73 = arith.addf %56, %72 : tensor<512x4xf32>
2026-02-21T08:09:26.0617088Z         scf.yield %73 : tensor<512x4xf32>
2026-02-21T08:09:26.0617269Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:09:26.0617497Z       %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:09:26.0617730Z       %8 = tt.splat %c4092_i32 : i32 -> tensor<4xi32>
2026-02-21T08:09:26.0617928Z       %9 = arith.addi %8, %7 : tensor<4xi32>
2026-02-21T08:09:26.0618203Z       %10 = tt.descriptor_load %0[%2, %c4092_i32] : !tt.tensordesc<tensor<512x4xf32>> -> tensor<512x4xf32>
2026-02-21T08:09:26.0618540Z       %11 = tt.expand_dims %5 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32>
2026-02-21T08:09:26.0618799Z       %12 = arith.muli %11, %cst : tensor<512x1xi32>
2026-02-21T08:09:26.0619033Z       %13 = tt.expand_dims %9 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32>
2026-02-21T08:09:26.0619365Z       %14 = tt.broadcast %12 : tensor<512x1xi32> -> tensor<512x4xi32>
2026-02-21T08:09:26.0619606Z       %15 = tt.broadcast %13 : tensor<1x4xi32> -> tensor<512x4xi32>
2026-02-21T08:09:26.0619828Z       %16 = arith.addi %14, %15 : tensor<512x4xi32>
2026-02-21T08:09:26.0620054Z       %17 = tt.splat %arg1 : !tt.ptr<f32> -> tensor<512x4x!tt.ptr<f32>>
2026-02-21T08:09:26.0620306Z       %18 = tt.addptr %17, %16 : tensor<512x4x!tt.ptr<f32>>, tensor<512x4xi32>
2026-02-21T08:09:26.0620546Z       %19 = tt.load %18 : tensor<512x4x!tt.ptr<f32>>
2026-02-21T08:09:26.0620739Z       %20 = scf.if %arg3 -> (tensor<512x4xf32>) {
2026-02-21T08:09:26.0621086Z         %25 = tt.extern_elementwise %19 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x4xf32>) -> tensor<512x4xf32>
2026-02-21T08:09:26.0621428Z         %26 = arith.subf %19, %10 : tensor<512x4xf32>
2026-02-21T08:09:26.0621678Z         %27 = arith.mulf %25, %26 : tensor<512x4xf32>
2026-02-21T08:09:26.0621924Z         %28 = arith.addf %27, %cst_0 : tensor<512x4xf32>
2026-02-21T08:09:26.0622116Z         scf.yield %28 : tensor<512x4xf32>
2026-02-21T08:09:26.0622286Z       } else {
2026-02-21T08:09:26.0622435Z         %25 = tt.splat %arg4 : f32 -> tensor<512x4xf32>
2026-02-21T08:09:26.0622650Z         %26 = arith.cmpf ogt, %19, %25 : tensor<512x4xf32>
2026-02-21T08:09:26.0622855Z         %27 = arith.cmpf une, %19, %19 : tensor<512x4xf32>
2026-02-21T08:09:26.0623060Z         %28 = arith.ori %26, %27 : tensor<512x4xi1>
2026-02-21T08:09:26.0623282Z         %29 = arith.select %28, %19, %25 : tensor<512x4xi1>, tensor<512x4xf32>
2026-02-21T08:09:26.0623520Z         %30 = math.log %29 : tensor<512x4xf32>
2026-02-21T08:09:26.0623714Z         %31 = arith.subf %30, %10 : tensor<512x4xf32>
2026-02-21T08:09:26.0623907Z         %32 = arith.mulf %19, %31 : tensor<512x4xf32>
2026-02-21T08:09:26.0624116Z         %33 = arith.addf %32, %cst_0 : tensor<512x4xf32>
2026-02-21T08:09:26.0624306Z         scf.yield %33 : tensor<512x4xf32>
2026-02-21T08:09:26.0624488Z       }
2026-02-21T08:09:26.0624629Z       %21 = arith.addf %6, %20 : tensor<512x4xf32>
2026-02-21T08:09:26.0624827Z       %22 = "tt.reduce"(%21) <{axis = 1 : i32}> ({
2026-02-21T08:09:26.0625016Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:26.0625186Z         %25 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:09:26.0625369Z         tt.reduce.return %25 : f32
2026-02-21T08:09:26.0625546Z       }) : (tensor<512x4xf32>) -> tensor<512xf32>
2026-02-21T08:09:26.0625771Z       %23 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<512x!tt.ptr<f32>>
2026-02-21T08:09:26.0626021Z       %24 = tt.addptr %23, %5 : tensor<512x!tt.ptr<f32>>, tensor<512xi32>
2026-02-21T08:09:26.0626253Z       tt.store %24, %22 : tensor<512x!tt.ptr<f32>>
2026-02-21T08:09:26.0626511Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:09:26.0626749Z     tt.return
2026-02-21T08:09:26.0626877Z   }
2026-02-21T08:09:26.0626992Z }
2026-02-21T08:09:26.0627061Z 
2026-02-21T08:09:26.0627119Z {-#
2026-02-21T08:09:26.0627241Z   external_resources: {
2026-02-21T08:09:26.0627398Z     mlir_reproducer: {
2026-02-21T08:09:26.0631656Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:09:26.0636218Z       disable_threading: false,
2026-02-21T08:09:26.0636456Z       verify_each: true
2026-02-21T08:09:26.0636601Z     }
2026-02-21T08:09:26.0636750Z   }
2026-02-21T08:09:26.0636875Z #-}
2026-02-21T08:09:26.0637356Z /tmp/torchinductor_root/yu/cyudquc76sft6j3sqezvgqcbsdkdqd5kew3iaciebsyyk2vta6eh.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:09:26.0638594Z /tmp/torchinductor_root/yu/cyudquc76sft6j3sqezvgqcbsdkdqd5kew3iaciebsyyk2vta6eh.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:09:26.0639616Z [66s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:09:26.0640702Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 512], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', ''], maxnreg=32, num_sm_multiplier=64, num_stages=7, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[3, 2], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:09:26.0641675Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:09:26.0641957Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:09:27.0517642Z module {
2026-02-21T08:09:27.0518367Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:09:27.0518970Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:09:27.0519163Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:09:27.0519335Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:09:27.0519719Z     %cst = arith.constant dense<0.000000e+00> : tensor<64x128xf32>
2026-02-21T08:09:27.0520001Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:09:27.0520188Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:09:27.0520373Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:09:27.0520547Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:09:27.0520856Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<64x128xf32>>
2026-02-21T08:09:27.0521356Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<64x128xf32>>
2026-02-21T08:09:27.0521734Z     %2 = tt.get_program_id x : i32
2026-02-21T08:09:27.0522133Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:09:27.0522317Z     %4 = arith.minsi %3, %c64_i32 : i32
2026-02-21T08:09:27.0522517Z     scf.for %arg5 = %2 to %4 step %c1_i32  : i32 {
2026-02-21T08:09:27.0522714Z       %5 = arith.muli %arg5, %c64_i32 : i32
2026-02-21T08:09:27.0522952Z       %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:09:27.0523516Z       %7 = tt.splat %5 : i32 -> tensor<64xi32>
2026-02-21T08:09:27.0523712Z       %8 = arith.addi %7, %6 : tensor<64xi32>
2026-02-21T08:09:27.0524017Z       %9 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c128_i32 iter_args(%arg7 = %cst) -> (tensor<64x128xf32>)  : i32 {
2026-02-21T08:09:27.0524429Z         %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc<tensor<64x128xf32>> -> tensor<64x128xf32>
2026-02-21T08:09:27.0524800Z         %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc<tensor<64x128xf32>> -> tensor<64x128xf32>
2026-02-21T08:09:27.0525089Z         %15 = scf.if %arg3 -> (tensor<64x128xf32>) {
2026-02-21T08:09:27.0525466Z           %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32>
2026-02-21T08:09:27.0525832Z           %18 = arith.subf %14, %13 : tensor<64x128xf32>
2026-02-21T08:09:27.0526146Z           %19 = arith.mulf %17, %18 : tensor<64x128xf32>
2026-02-21T08:09:27.0526364Z           %20 = arith.addf %19, %cst : tensor<64x128xf32>
2026-02-21T08:09:27.0526557Z           scf.yield %20 : tensor<64x128xf32>
2026-02-21T08:09:27.0526732Z         } else {
2026-02-21T08:09:27.0526893Z           %17 = tt.splat %arg4 : f32 -> tensor<64x128xf32>
2026-02-21T08:09:27.0527119Z           %18 = arith.cmpf ogt, %14, %17 : tensor<64x128xf32>
2026-02-21T08:09:27.0527339Z           %19 = arith.cmpf une, %14, %14 : tensor<64x128xf32>
2026-02-21T08:09:27.0527554Z           %20 = arith.ori %18, %19 : tensor<64x128xi1>
2026-02-21T08:09:27.0527798Z           %21 = arith.select %20, %14, %17 : tensor<64x128xi1>, tensor<64x128xf32>
2026-02-21T08:09:27.0528037Z           %22 = math.log %21 : tensor<64x128xf32>
2026-02-21T08:09:27.0528239Z           %23 = arith.subf %22, %13 : tensor<64x128xf32>
2026-02-21T08:09:27.0528435Z           %24 = arith.mulf %14, %23 : tensor<64x128xf32>
2026-02-21T08:09:27.0528646Z           %25 = arith.addf %24, %cst : tensor<64x128xf32>
2026-02-21T08:09:27.0528840Z           scf.yield %25 : tensor<64x128xf32>
2026-02-21T08:09:27.0529010Z         }
2026-02-21T08:09:27.0529152Z         %16 = arith.addf %arg7, %15 : tensor<64x128xf32>
2026-02-21T08:09:27.0529346Z         scf.yield %16 : tensor<64x128xf32>
2026-02-21T08:09:27.0529552Z       } {tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:09:27.0529755Z       %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({
2026-02-21T08:09:27.0529945Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:27.0530120Z         %13 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:09:27.0530308Z         tt.reduce.return %13 : f32
2026-02-21T08:09:27.0530488Z       }) : (tensor<64x128xf32>) -> tensor<64xf32>
2026-02-21T08:09:27.0530720Z       %11 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
2026-02-21T08:09:27.0530978Z       %12 = tt.addptr %11, %8 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
2026-02-21T08:09:27.0531201Z       tt.store %12, %10 : tensor<64x!tt.ptr<f32>>
2026-02-21T08:09:27.0531399Z     } {tt.flatten, tt.num_stages = 4 : i32}
2026-02-21T08:09:27.0531568Z     tt.return
2026-02-21T08:09:27.0531693Z   }
2026-02-21T08:09:27.0531807Z }
2026-02-21T08:09:27.0531937Z 
2026-02-21T08:09:27.0531987Z {-#
2026-02-21T08:09:27.0532110Z   external_resources: {
2026-02-21T08:09:27.0532269Z     mlir_reproducer: {
2026-02-21T08:09:27.0536660Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:09:27.0541051Z       disable_threading: false,
2026-02-21T08:09:27.0541249Z       verify_each: true
2026-02-21T08:09:27.0541402Z     }
2026-02-21T08:09:27.0541539Z   }
2026-02-21T08:09:27.0541661Z #-}
2026-02-21T08:09:27.0542185Z /tmp/torchinductor_root/q6/cq6q5pv2cbszkz2i6st67lcbiwqdbcytfsidsup2i3mkuyuhy5nm.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:09:27.0543498Z /tmp/torchinductor_root/q6/cq6q5pv2cbszkz2i6st67lcbiwqdbcytfsidsup2i3mkuyuhy5nm.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:09:27.0544542Z [67s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:09:27.0545725Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'last'], num_sm_multiplier=8, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[True, None], range_num_stages=[4, 1], range_unroll_factors=[0, 0], range_warp_specializes=[False, True]), static_shapes=True)
2026-02-21T08:09:27.0546717Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:09:27.0547007Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:09:27.1087234Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:09:27.1087944Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:09:27.1088574Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:09:27.1088783Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:09:27.1088961Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:09:27.1089182Z     %cst = arith.constant dense<0.000000e+00> : tensor<4x4xf32>
2026-02-21T08:09:27.1089405Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:09:27.1089597Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:09:27.1089800Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:09:27.1089989Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:09:27.1090313Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<4x4xf32>>
2026-02-21T08:09:27.1090756Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<4x4xf32>>
2026-02-21T08:09:27.1091076Z     %2 = tt.get_program_id x : i32
2026-02-21T08:09:27.1091261Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:09:27.1091441Z     %4 = arith.minsi %3, %c1024_i32 : i32
2026-02-21T08:09:27.1091931Z     %5 = arith.subi %4, %2 : i32
2026-02-21T08:09:27.1092112Z     %c1_i32_0 = arith.constant 1 : i32
2026-02-21T08:09:27.1092310Z     %6 = arith.subi %c1_i32, %c1_i32_0 : i32
2026-02-21T08:09:27.1092492Z     %7 = arith.addi %5, %6 : i32
2026-02-21T08:09:27.1092671Z     %8 = arith.divui %7, %c1_i32 : i32
2026-02-21T08:09:27.1092863Z     %c4_i32_1 = arith.constant 4 : i32
2026-02-21T08:09:27.1093067Z     %9 = arith.remsi %8, %c4_i32_1 : i32
2026-02-21T08:09:27.1093260Z     %10 = arith.subi %8, %9 : i32
2026-02-21T08:09:27.1093438Z     %11 = arith.muli %10, %c1_i32 : i32
2026-02-21T08:09:27.1093633Z     %12 = arith.addi %2, %11 : i32
2026-02-21T08:09:27.1093822Z     %13 = arith.muli %c1_i32, %c4_i32_1 : i32
2026-02-21T08:09:27.1094042Z     scf.for %arg5 = %2 to %12 step %13  : i32 {
2026-02-21T08:09:27.1094248Z       %14 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:09:27.1094488Z       %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:09:27.1094807Z       %16 = tt.splat %14 : i32 -> tensor<4xi32>
2026-02-21T08:09:27.1095014Z       %17 = arith.addi %16, %15 : tensor<4xi32>
2026-02-21T08:09:27.1095345Z       %18 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:09:27.1095779Z         %52 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:09:27.1096173Z         %53 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:09:27.1096464Z         %54 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:09:27.1096842Z           %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:09:27.1097221Z           %57 = arith.subf %53, %52 : tensor<4x4xf32>
2026-02-21T08:09:27.1097420Z           %58 = arith.mulf %56, %57 : tensor<4x4xf32>
2026-02-21T08:09:27.1097632Z           %59 = arith.addf %58, %cst : tensor<4x4xf32>
2026-02-21T08:09:27.1097830Z           scf.yield %59 : tensor<4x4xf32>
2026-02-21T08:09:27.1098003Z         } else {
2026-02-21T08:09:27.1098161Z           %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:09:27.1098380Z           %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32>
2026-02-21T08:09:27.1098597Z           %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32>
2026-02-21T08:09:27.1098801Z           %59 = arith.ori %57, %58 : tensor<4x4xi1>
2026-02-21T08:09:27.1099042Z           %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:09:27.1099274Z           %61 = math.log %60 : tensor<4x4xf32>
2026-02-21T08:09:27.1099469Z           %62 = arith.subf %61, %52 : tensor<4x4xf32>
2026-02-21T08:09:27.1099662Z           %63 = arith.mulf %53, %62 : tensor<4x4xf32>
2026-02-21T08:09:27.1099863Z           %64 = arith.addf %63, %cst : tensor<4x4xf32>
2026-02-21T08:09:27.1100052Z           scf.yield %64 : tensor<4x4xf32>
2026-02-21T08:09:27.1100223Z         }
2026-02-21T08:09:27.1100372Z         %55 = arith.addf %arg7, %54 : tensor<4x4xf32>
2026-02-21T08:09:27.1100561Z         scf.yield %55 : tensor<4x4xf32>
2026-02-21T08:09:27.1100875Z       } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:09:27.1101200Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:09:27.1101392Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:27.1101566Z         %52 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:09:27.1101751Z         tt.reduce.return %52 : f32
2026-02-21T08:09:27.1101982Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:09:27.1102203Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:09:27.1102468Z       %21 = tt.addptr %20, %17 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:09:27.1102697Z       tt.store %21, %19 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:09:27.1102892Z       %c1_i32_2 = arith.constant 1 : i32
2026-02-21T08:09:27.1103077Z       %22 = arith.muli %c1_i32, %c1_i32_2 : i32
2026-02-21T08:09:27.1103329Z       %23 = arith.addi %arg5, %22 : i32
2026-02-21T08:09:27.1103545Z       %24 = arith.muli %23, %c4_i32 : i32
2026-02-21T08:09:27.1103762Z       %25 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:09:27.1104003Z       %26 = tt.splat %24 : i32 -> tensor<4xi32>
2026-02-21T08:09:27.1104191Z       %27 = arith.addi %26, %25 : tensor<4xi32>
2026-02-21T08:09:27.1104498Z       %28 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:09:27.1104893Z         %52 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:09:27.1105250Z         %53 = tt.descriptor_load %1[%24, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:09:27.1105539Z         %54 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:09:27.1105946Z           %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:09:27.1106313Z           %57 = arith.subf %53, %52 : tensor<4x4xf32>
2026-02-21T08:09:27.1106516Z           %58 = arith.mulf %56, %57 : tensor<4x4xf32>
2026-02-21T08:09:27.1106715Z           %59 = arith.addf %58, %cst : tensor<4x4xf32>
2026-02-21T08:09:27.1106912Z           scf.yield %59 : tensor<4x4xf32>
2026-02-21T08:09:27.1107075Z         } else {
2026-02-21T08:09:27.1107240Z           %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:09:27.1107451Z           %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32>
2026-02-21T08:09:27.1107667Z           %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32>
2026-02-21T08:09:27.1107874Z           %59 = arith.ori %57, %58 : tensor<4x4xi1>
2026-02-21T08:09:27.1108104Z           %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:09:27.1108346Z           %61 = math.log %60 : tensor<4x4xf32>
2026-02-21T08:09:27.1108539Z           %62 = arith.subf %61, %52 : tensor<4x4xf32>
2026-02-21T08:09:27.1108740Z           %63 = arith.mulf %53, %62 : tensor<4x4xf32>
2026-02-21T08:09:27.1108933Z           %64 = arith.addf %63, %cst : tensor<4x4xf32>
2026-02-21T08:09:27.1109127Z           scf.yield %64 : tensor<4x4xf32>
2026-02-21T08:09:27.1109296Z         }
2026-02-21T08:09:27.1109434Z         %55 = arith.addf %arg7, %54 : tensor<4x4xf32>
2026-02-21T08:09:27.1109626Z         scf.yield %55 : tensor<4x4xf32>
2026-02-21T08:09:27.1109935Z       } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:09:27.1110265Z       %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({
2026-02-21T08:09:27.1110448Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:27.1110626Z         %52 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:09:27.1110867Z         tt.reduce.return %52 : f32
2026-02-21T08:09:27.1111055Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:09:27.1111281Z       %30 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:09:27.1111541Z       %31 = tt.addptr %30, %27 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:09:27.1111804Z       tt.store %31, %29 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:09:27.1112038Z       %c2_i32 = arith.constant 2 : i32
2026-02-21T08:09:27.1112226Z       %32 = arith.muli %c1_i32, %c2_i32 : i32
2026-02-21T08:09:27.1112406Z       %33 = arith.addi %arg5, %32 : i32
2026-02-21T08:09:27.1112583Z       %34 = arith.muli %33, %c4_i32 : i32
2026-02-21T08:09:27.1112796Z       %35 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:09:27.1113034Z       %36 = tt.splat %34 : i32 -> tensor<4xi32>
2026-02-21T08:09:27.1113220Z       %37 = arith.addi %36, %35 : tensor<4xi32>
2026-02-21T08:09:27.1113517Z       %38 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:09:27.1113906Z         %52 = tt.descriptor_load %0[%34, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:09:27.1114321Z         %53 = tt.descriptor_load %1[%34, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:09:27.1114600Z         %54 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:09:27.1114948Z           %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:09:27.1115305Z           %57 = arith.subf %53, %52 : tensor<4x4xf32>
2026-02-21T08:09:27.1115505Z           %58 = arith.mulf %56, %57 : tensor<4x4xf32>
2026-02-21T08:09:27.1115697Z           %59 = arith.addf %58, %cst : tensor<4x4xf32>
2026-02-21T08:09:27.1115890Z           scf.yield %59 : tensor<4x4xf32>
2026-02-21T08:09:27.1116050Z         } else {
2026-02-21T08:09:27.1116211Z           %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:09:27.1116416Z           %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32>
2026-02-21T08:09:27.1116627Z           %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32>
2026-02-21T08:09:27.1116884Z           %59 = arith.ori %57, %58 : tensor<4x4xi1>
2026-02-21T08:09:27.1117118Z           %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:09:27.1117353Z           %61 = math.log %60 : tensor<4x4xf32>
2026-02-21T08:09:27.1117538Z           %62 = arith.subf %61, %52 : tensor<4x4xf32>
2026-02-21T08:09:27.1117733Z           %63 = arith.mulf %53, %62 : tensor<4x4xf32>
2026-02-21T08:09:27.1117922Z           %64 = arith.addf %63, %cst : tensor<4x4xf32>
2026-02-21T08:09:27.1118112Z           scf.yield %64 : tensor<4x4xf32>
2026-02-21T08:09:27.1118270Z         }
2026-02-21T08:09:27.1118412Z         %55 = arith.addf %arg7, %54 : tensor<4x4xf32>
2026-02-21T08:09:27.1118599Z         scf.yield %55 : tensor<4x4xf32>
2026-02-21T08:09:27.1118895Z       } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:09:27.1119220Z       %39 = "tt.reduce"(%38) <{axis = 1 : i32}> ({
2026-02-21T08:09:27.1119405Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:27.1119584Z         %52 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:09:27.1119758Z         tt.reduce.return %52 : f32
2026-02-21T08:09:27.1119940Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:09:27.1120159Z       %40 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:09:27.1120403Z       %41 = tt.addptr %40, %37 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:09:27.1120635Z       tt.store %41, %39 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:09:27.1120822Z       %c3_i32 = arith.constant 3 : i32
2026-02-21T08:09:27.1121004Z       %42 = arith.muli %c1_i32, %c3_i32 : i32
2026-02-21T08:09:27.1121175Z       %43 = arith.addi %arg5, %42 : i32
2026-02-21T08:09:27.1121348Z       %44 = arith.muli %43, %c4_i32 : i32
2026-02-21T08:09:27.1121555Z       %45 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:09:27.1121790Z       %46 = tt.splat %44 : i32 -> tensor<4xi32>
2026-02-21T08:09:27.1122031Z       %47 = arith.addi %46, %45 : tensor<4xi32>
2026-02-21T08:09:27.1122328Z       %48 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:09:27.1122725Z         %52 = tt.descriptor_load %0[%44, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:09:27.1123084Z         %53 = tt.descriptor_load %1[%44, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:09:27.1123375Z         %54 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:09:27.1123735Z           %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:09:27.1124090Z           %57 = arith.subf %53, %52 : tensor<4x4xf32>
2026-02-21T08:09:27.1124292Z           %58 = arith.mulf %56, %57 : tensor<4x4xf32>
2026-02-21T08:09:27.1124490Z           %59 = arith.addf %58, %cst : tensor<4x4xf32>
2026-02-21T08:09:27.1124691Z           scf.yield %59 : tensor<4x4xf32>
2026-02-21T08:09:27.1124862Z         } else {
2026-02-21T08:09:27.1125101Z           %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:09:27.1125318Z           %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32>
2026-02-21T08:09:27.1125530Z           %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32>
2026-02-21T08:09:27.1125758Z           %59 = arith.ori %57, %58 : tensor<4x4xi1>
2026-02-21T08:09:27.1125995Z           %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:09:27.1126230Z           %61 = math.log %60 : tensor<4x4xf32>
2026-02-21T08:09:27.1126426Z           %62 = arith.subf %61, %52 : tensor<4x4xf32>
2026-02-21T08:09:27.1126619Z           %63 = arith.mulf %53, %62 : tensor<4x4xf32>
2026-02-21T08:09:27.1126831Z           %64 = arith.addf %63, %cst : tensor<4x4xf32>
2026-02-21T08:09:27.1127013Z           scf.yield %64 : tensor<4x4xf32>
2026-02-21T08:09:27.1127181Z         }
2026-02-21T08:09:27.1127315Z         %55 = arith.addf %arg7, %54 : tensor<4x4xf32>
2026-02-21T08:09:27.1127556Z         scf.yield %55 : tensor<4x4xf32>
2026-02-21T08:09:27.1127862Z       } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:09:27.1128177Z       %49 = "tt.reduce"(%48) <{axis = 1 : i32}> ({
2026-02-21T08:09:27.1128365Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:27.1128533Z         %52 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:09:27.1128715Z         tt.reduce.return %52 : f32
2026-02-21T08:09:27.1128890Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:09:27.1129108Z       %50 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:09:27.1129358Z       %51 = tt.addptr %50, %47 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:09:27.1129577Z       tt.store %51, %49 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:09:27.1129768Z     } {tt.disallow_acc_multi_buffer}
2026-02-21T08:09:27.1129954Z     scf.for %arg5 = %12 to %4 step %c1_i32  : i32 {
2026-02-21T08:09:27.1130155Z       %14 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:09:27.1130370Z       %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:09:27.1130602Z       %16 = tt.splat %14 : i32 -> tensor<4xi32>
2026-02-21T08:09:27.1130791Z       %17 = arith.addi %16, %15 : tensor<4xi32>
2026-02-21T08:09:27.1131075Z       %18 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:09:27.1131449Z         %22 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:09:27.1131797Z         %23 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:09:27.1132110Z         %24 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:09:27.1132453Z           %26 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:09:27.1132805Z           %27 = arith.subf %23, %22 : tensor<4x4xf32>
2026-02-21T08:09:27.1133009Z           %28 = arith.mulf %26, %27 : tensor<4x4xf32>
2026-02-21T08:09:27.1133206Z           %29 = arith.addf %28, %cst : tensor<4x4xf32>
2026-02-21T08:09:27.1133398Z           scf.yield %29 : tensor<4x4xf32>
2026-02-21T08:09:27.1133558Z         } else {
2026-02-21T08:09:27.1133718Z           %26 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:09:27.1133925Z           %27 = arith.cmpf ogt, %23, %26 : tensor<4x4xf32>
2026-02-21T08:09:27.1134146Z           %28 = arith.cmpf une, %23, %23 : tensor<4x4xf32>
2026-02-21T08:09:27.1134356Z           %29 = arith.ori %27, %28 : tensor<4x4xi1>
2026-02-21T08:09:27.1134584Z           %30 = arith.select %29, %23, %26 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:09:27.1134819Z           %31 = math.log %30 : tensor<4x4xf32>
2026-02-21T08:09:27.1135002Z           %32 = arith.subf %31, %22 : tensor<4x4xf32>
2026-02-21T08:09:27.1135196Z           %33 = arith.mulf %23, %32 : tensor<4x4xf32>
2026-02-21T08:09:27.1135388Z           %34 = arith.addf %33, %cst : tensor<4x4xf32>
2026-02-21T08:09:27.1135635Z           scf.yield %34 : tensor<4x4xf32>
2026-02-21T08:09:27.1135800Z         }
2026-02-21T08:09:27.1135934Z         %25 = arith.addf %arg7, %24 : tensor<4x4xf32>
2026-02-21T08:09:27.1136122Z         scf.yield %25 : tensor<4x4xf32>
2026-02-21T08:09:27.1136422Z       } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:09:27.1136742Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:09:27.1136925Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:09:27.1137103Z         %22 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:09:27.1137279Z         tt.reduce.return %22 : f32
2026-02-21T08:09:27.1137459Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:09:27.1137676Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:09:27.1137921Z       %21 = tt.addptr %20, %17 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:09:27.1138198Z       tt.store %21, %19 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:09:27.1138421Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:09:27.1138623Z     tt.return
2026-02-21T08:09:27.1138742Z   }
2026-02-21T08:09:27.1138860Z }
2026-02-21T08:09:27.1138926Z 
2026-02-21T08:09:27.1138983Z {-#
2026-02-21T08:09:27.1139105Z   external_resources: {
2026-02-21T08:09:27.1139261Z     mlir_reproducer: {
2026-02-21T08:09:27.1143521Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:09:27.1147868Z       disable_threading: false,
2026-02-21T08:09:27.1148076Z       verify_each: true
2026-02-21T08:09:27.1148233Z     }
2026-02-21T08:09:27.1148358Z   }
2026-02-21T08:09:27.1148495Z #-}
2026-02-21T08:09:27.1148988Z /tmp/torchinductor_root/dc/cdco3tfp4dqsv4m67ihzdj7b22uvz6pscxnp3nmyybieeqfozo77.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:09:27.1150228Z /tmp/torchinductor_root/dc/cdco3tfp4dqsv4m67ihzdj7b22uvz6pscxnp3nmyybieeqfozo77.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:09:27.1151239Z [68s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:09:27.1152498Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'last'], maxnreg=128, num_sm_multiplier=64, num_stages=6, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[0, 3], range_unroll_factors=[4, 1], range_warp_specializes=[False, True]), static_shapes=True)
2026-02-21T08:09:27.1153557Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:09:27.1153812Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:09:28.5226629Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 16.1 configs/s
2026-02-21T08:09:28.5238123Z [69s] Adaptive compile timeout: 30s (90% percentile=2.7s, bounds=[30.0s, 60s])
2026-02-21T08:09:29.5539385Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 985.5 configs/s
2026-02-21T08:09:29.6109262Z [70s] Initial random population of 100, 5 starting points: 
2026-02-21T08:09:29.6111008Z error=9
2026-02-21T08:09:29.6111160Z timeout=3
2026-02-21T08:09:29.6111290Z ok=88
2026-02-21T08:09:29.6111410Z min=0.0482
2026-02-21T08:09:29.6111537Z mid=0.4474
2026-02-21T08:09:29.6111652Z max=23.3493
2026-02-21T08:09:29.6111799Z best={'block_sizes': [256, 1],
2026-02-21T08:09:29.6112124Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:09:29.6112382Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:09:29.6112580Z  'num_sm_multiplier': 64,
2026-02-21T08:09:29.6112734Z  'num_stages': 7,
2026-02-21T08:09:29.6112879Z  'num_warps': 2,
2026-02-21T08:09:29.6113032Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:09:29.6113228Z  'range_flattens': [False, False],
2026-02-21T08:09:29.6113413Z  'range_multi_buffers': [True, True],
2026-02-21T08:09:29.6113600Z  'range_num_stages': [1, 3],
2026-02-21T08:09:29.6113781Z  'range_unroll_factors': [4, 3],
2026-02-21T08:09:29.6113970Z  'range_warp_specializes': [False, None]}
2026-02-21T08:09:29.6130142Z [70s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:09:30.7586740Z [71s] Generation 1 starting: 85 neighbors, 5 active search path(s)
2026-02-21T08:09:39.2258333Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88/88 3.1 configs/s
2026-02-21T08:09:44.3928397Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 88/88 17.2 configs/s
2026-02-21T08:09:50.5439347Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 167.6 configs/s
2026-02-21T08:09:50.9958506Z [91s] Generation 1 complete: 
2026-02-21T08:09:50.9963243Z error=1
2026-02-21T08:09:50.9964984Z ok=89
2026-02-21T08:09:50.9965185Z min=0.0420
2026-02-21T08:09:50.9969786Z mid=0.0564
2026-02-21T08:09:50.9971160Z max=0.5583
2026-02-21T08:09:50.9971330Z best={'block_sizes': [1024, 1],
2026-02-21T08:09:50.9971597Z  'indexing': ['tensor_descriptor', 'pointer', 'pointer'],
2026-02-21T08:09:50.9972196Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:09:50.9976863Z  'num_stages': 6,
2026-02-21T08:09:50.9978527Z  'num_warps': 16,
2026-02-21T08:09:50.9978759Z  'pid_type': 'flat',
2026-02-21T08:09:50.9984036Z  'range_flattens': [None, False],
2026-02-21T08:09:50.9988036Z  'range_multi_buffers': [None, None],
2026-02-21T08:09:50.9991708Z  'range_num_stages': [0, 1],
2026-02-21T08:09:50.9995649Z  'range_unroll_factors': [0, 0],
2026-02-21T08:09:50.9997600Z  'range_warp_specializes': [None, None]}
2026-02-21T08:09:50.9997889Z [91s] Fitting surrogate: 190 points, 190 targets
2026-02-21T08:09:52.3318551Z [93s] Generation 2 starting: 83 neighbors, 5 active search path(s)
2026-02-21T08:09:56.0529604Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 16.4 configs/s
2026-02-21T08:10:01.1281008Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 17.1 configs/s
2026-02-21T08:10:08.1211338Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 148.9 configs/s
2026-02-21T08:10:08.5232435Z [109s] Generation 2 complete: 
2026-02-21T08:10:08.5232785Z ok=89
2026-02-21T08:10:08.5232986Z min=0.0420
2026-02-21T08:10:08.5233176Z mid=0.0481
2026-02-21T08:10:08.5233475Z max=0.2386
2026-02-21T08:10:08.5233743Z best={'block_sizes': [256, 1],
2026-02-21T08:10:08.5234114Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:10:08.5234494Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:10:08.5234806Z  'num_stages': 7,
2026-02-21T08:10:08.5235018Z  'num_warps': 1,
2026-02-21T08:10:08.5235241Z  'pid_type': 'flat',
2026-02-21T08:10:08.5235491Z  'range_flattens': [None, False],
2026-02-21T08:10:08.5235795Z  'range_multi_buffers': [None, None],
2026-02-21T08:10:08.5236100Z  'range_num_stages': [0, 4],
2026-02-21T08:10:08.5236722Z  'range_unroll_factors': [0, 2],
2026-02-21T08:10:08.5237049Z  'range_warp_specializes': [None, None]}
2026-02-21T08:10:08.5247759Z [109s] Fitting surrogate: 279 points, 279 targets
2026-02-21T08:10:09.4625057Z [110s] Generation 3 starting: 72 neighbors, 5 active search path(s)
2026-02-21T08:10:14.4937969Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 6.1 configs/s
2026-02-21T08:10:18.9886983Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 16.8 configs/s
2026-02-21T08:10:25.5608795Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 159.8 configs/s
2026-02-21T08:10:25.9245601Z [126s] Generation 3 complete: 
2026-02-21T08:10:25.9249682Z ok=77
2026-02-21T08:10:25.9253589Z min=0.0420
2026-02-21T08:10:25.9258057Z mid=0.0441
2026-02-21T08:10:25.9262581Z max=0.1893
2026-02-21T08:10:25.9267099Z best={'block_sizes': [1024, 1],
2026-02-21T08:10:25.9271239Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:10:25.9271625Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:10:25.9271928Z  'num_sm_multiplier': 64,
2026-02-21T08:10:25.9276701Z  'num_stages': 7,
2026-02-21T08:10:25.9280514Z  'num_warps': 1,
2026-02-21T08:10:25.9285132Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:10:25.9289428Z  'range_flattens': [False, False],
2026-02-21T08:10:25.9290965Z  'range_multi_buffers': [True, True],
2026-02-21T08:10:25.9291193Z  'range_num_stages': [0, 3],
2026-02-21T08:10:25.9291361Z  'range_unroll_factors': [0, 3],
2026-02-21T08:10:25.9291546Z  'range_warp_specializes': [True, None]}
2026-02-21T08:10:25.9291836Z [126s] Fitting surrogate: 356 points, 356 targets
2026-02-21T08:10:26.7770502Z [127s] Generation 4 starting: 48 neighbors, 4 active search path(s)
2026-02-21T08:10:29.3484173Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50/50 11.1 configs/s
2026-02-21T08:10:32.2508841Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 50/50 17.5 configs/s
2026-02-21T08:10:36.5705667Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 236.5 configs/s
2026-02-21T08:10:36.8224670Z [137s] Generation 4 complete: 
2026-02-21T08:10:36.8227023Z error=1
2026-02-21T08:10:36.8227154Z ok=52
2026-02-21T08:10:36.8227280Z min=0.0419
2026-02-21T08:10:36.8227403Z mid=0.0440
2026-02-21T08:10:36.8227516Z max=0.0707
2026-02-21T08:10:36.8227651Z best={'block_sizes': [1024, 1],
2026-02-21T08:10:36.8227881Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:10:36.8228147Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:10:36.8228337Z  'num_sm_multiplier': 64,
2026-02-21T08:10:36.8228495Z  'num_stages': 7,
2026-02-21T08:10:36.8228628Z  'num_warps': 1,
2026-02-21T08:10:36.8228784Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:10:36.8228975Z  'range_flattens': [False, False],
2026-02-21T08:10:36.8229182Z  'range_multi_buffers': [True, True],
2026-02-21T08:10:36.8229786Z  'range_num_stages': [0, 3],
2026-02-21T08:10:36.8229952Z  'range_unroll_factors': [0, 3],
2026-02-21T08:10:36.8230134Z  'range_warp_specializes': [True, None]}
2026-02-21T08:10:36.8237450Z [137s] Fitting surrogate: 409 points, 409 targets
2026-02-21T08:10:37.3216462Z [138s] Generation 5 starting: 24 neighbors, 2 active search path(s)
2026-02-21T08:10:40.7628589Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25/25 3.9 configs/s
2026-02-21T08:10:42.2705342Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 25/25 17.1 configs/s
2026-02-21T08:10:44.7157139Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 462.6 configs/s
2026-02-21T08:10:44.8720037Z [145s] Generation 5 complete: 
2026-02-21T08:10:44.8720348Z ok=27
2026-02-21T08:10:44.8720540Z min=0.0420
2026-02-21T08:10:44.8721172Z mid=0.0441
2026-02-21T08:10:44.8721386Z max=0.1976
2026-02-21T08:10:44.8721587Z best={'block_sizes': [1024, 1],
2026-02-21T08:10:44.8722258Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:10:44.8722687Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:10:44.8723008Z  'num_sm_multiplier': 64,
2026-02-21T08:10:44.8723247Z  'num_stages': 7,
2026-02-21T08:10:44.8723466Z  'num_warps': 1,
2026-02-21T08:10:44.8723710Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:10:44.8724004Z  'range_flattens': [False, False],
2026-02-21T08:10:44.8724301Z  'range_multi_buffers': [True, True],
2026-02-21T08:10:44.8724589Z  'range_num_stages': [0, 3],
2026-02-21T08:10:44.8724852Z  'range_unroll_factors': [0, 3],
2026-02-21T08:10:44.8725131Z  'range_warp_specializes': [True, None]}
2026-02-21T08:10:44.8742106Z [145s] Fitting surrogate: 436 points, 436 targets
2026-02-21T08:10:45.2098794Z [146s] Generation 6 starting: 11 neighbors, 1 active search path(s)
2026-02-21T08:10:46.6141522Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 10.5 configs/s
2026-02-21T08:10:47.2602137Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 11/11 18.4 configs/s
2026-02-21T08:10:48.2960374Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 972.6 configs/s
2026-02-21T08:10:48.3748171Z [149s] Generation 6 complete: 
2026-02-21T08:10:48.3752556Z ok=13
2026-02-21T08:10:48.3756931Z min=0.0420
2026-02-21T08:10:48.3760956Z mid=0.0420
2026-02-21T08:10:48.3763052Z max=0.0727
2026-02-21T08:10:48.3763262Z best={'block_sizes': [1024, 1],
2026-02-21T08:10:48.3768209Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:10:48.3769704Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:10:48.3769975Z  'num_sm_multiplier': 64,
2026-02-21T08:10:48.3772919Z  'num_stages': 7,
2026-02-21T08:10:48.3773145Z  'num_warps': 1,
2026-02-21T08:10:48.3777372Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:10:48.3782492Z  'range_flattens': [False, False],
2026-02-21T08:10:48.3784024Z  'range_multi_buffers': [True, True],
2026-02-21T08:10:48.3784293Z  'range_num_stages': [0, 3],
2026-02-21T08:10:48.3789543Z  'range_unroll_factors': [0, 3],
2026-02-21T08:10:48.3791126Z  'range_warp_specializes': [True, None]}
2026-02-21T08:10:48.3791418Z [149s] Fitting surrogate: 449 points, 449 targets
2026-02-21T08:10:48.5521459Z [149s] Autotuning complete in 149.5s after searching 432 configs.
2026-02-21T08:10:48.5521773Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:10:48.5522978Z     @helion.kernel(config=helion.Config(block_sizes=[1024, 1], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_sm_multiplier=64, num_stages=7, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, True], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:10:48.5524195Z 
2026-02-21T08:10:48.5524438Z [149s] Code of selected kernel: /tmp/torchinductor_root/6p/c6pg5h2vgc4upe4bcu6zqjxnou4tx4etxnen3szzszsx4spcfpfc.py
2026-02-21T08:10:49.4870238Z WARNING:tritonbench.utils.triton_op:Completed input ID 0:
2026-02-21T08:10:49.4870521Z (B, T, V)
2026-02-21T08:10:49.4870658Z --------------
2026-02-21T08:10:49.4870805Z (8, 512, 4096)
2026-02-21T08:10:49.4870950Z 
2026-02-21T08:10:49.4889042Z  17%|█▋        | 1/6 [02:37<13:06, 157.33s/it]WARNING:tritonbench.utils.triton_op:Running input ID 1:
2026-02-21T08:10:49.4893116Z (B, T, V)
2026-02-21T08:10:49.4897566Z --------------
2026-02-21T08:10:49.4899026Z (8, 512, 8192)
2026-02-21T08:10:49.4899386Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for torch_kl_div
2026-02-21T08:10:50.6456954Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for liger_kl_div
2026-02-21T08:10:51.9928881Z INFO:tritonbench.utils.triton_op:Took 2.71ms to get benchmark function for torch_compile_kl_div
2026-02-21T08:10:54.8907807Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:10:54.8911910Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:10:54.8913251Z               'dtype': 'torch.float32',
2026-02-21T08:10:54.8913469Z               'shape': (4096, 8192),
2026-02-21T08:10:54.8913652Z               'stride': (8192, 1)},
2026-02-21T08:10:54.8913819Z             { 'device': 'cuda:0',
2026-02-21T08:10:54.8913999Z               'dtype': 'torch.float32',
2026-02-21T08:10:54.8914174Z               'shape': (4096, 8192),
2026-02-21T08:10:54.8914348Z               'stride': (8192, 1)}),
2026-02-21T08:10:54.8914503Z   'kwargs': {}}
2026-02-21T08:10:54.8927652Z INFO:tritonbench.utils.triton_op:Took 2.47ms to get benchmark function for helion_kl_div_tritonbench
2026-02-21T08:10:55.1297852Z [0s] Autotune random seed: 2134765727
2026-02-21T08:10:55.2897126Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:11:27.4924418Z [32s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=32, num_sm_multiplier=128, num_stages=5, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[2, 4], range_unroll_factors=[0, 1], range_warp_specializes=[False, None])
2026-02-21T08:11:27.8236028Z [32s] Timeout after 30s compiling Config(block_sizes=[128, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', ''], num_stages=8, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None])
2026-02-21T08:11:28.2982647Z [33s] Timeout after 30s compiling Config(block_sizes=[2048, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', ''], maxnreg=64, num_sm_multiplier=128, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[None, None])
2026-02-21T08:11:28.4597333Z [33s] Timeout after 30s compiling Config(block_sizes=[512, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=128, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, None], range_num_stages=[4, 1], range_unroll_factors=[3, 4], range_warp_specializes=[False, None])
2026-02-21T08:11:28.5632963Z [33s] Timeout after 30s compiling Config(block_sizes=[1024, 16], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'first'], num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[4, 4], range_unroll_factors=[3, 0], range_warp_specializes=[False, False])
2026-02-21T08:11:28.8492042Z [33s] Timeout after 30s compiling Config(block_sizes=[512, 128], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', ''], num_sm_multiplier=64, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, True], range_num_stages=[4, 1], range_unroll_factors=[0, 1], range_warp_specializes=[True, None])
2026-02-21T08:11:28.8509287Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s
2026-02-21T08:11:29.0294922Z module {
2026-02-21T08:11:29.0299529Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:11:29.0300506Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:11:29.0300708Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:11:29.0300949Z     %cst = arith.constant dense<0.000000e+00> : tensor<512x32xf32>
2026-02-21T08:11:29.0301185Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:11:29.0301380Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:11:29.0301577Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T08:11:29.0301758Z     %c8192_i64 = arith.constant 8192 : i64
2026-02-21T08:11:29.0302057Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:11:29.0305539Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<512x32xf32>>
2026-02-21T08:11:29.0309560Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<512x32xf32>>
2026-02-21T08:11:29.0313413Z     %2 = tt.get_program_id x : i32
2026-02-21T08:11:29.0316681Z     %3 = arith.muli %2, %c512_i32 : i32
2026-02-21T08:11:29.0320756Z     %4 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:11:29.0322543Z     %5 = tt.splat %3 : i32 -> tensor<512xi32>
2026-02-21T08:11:29.0322760Z     %6 = arith.addi %5, %4 : tensor<512xi32>
2026-02-21T08:11:29.0323080Z     %7 = scf.for %arg5 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg6 = %cst) -> (tensor<512x32xf32>)  : i32 {
2026-02-21T08:11:29.0323504Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<512x32xf32>> -> tensor<512x32xf32>
2026-02-21T08:11:29.0323871Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<512x32xf32>> -> tensor<512x32xf32>
2026-02-21T08:11:29.0324154Z       %13 = scf.if %arg3 -> (tensor<512x32xf32>) {
2026-02-21T08:11:29.0324526Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x32xf32>) -> tensor<512x32xf32>
2026-02-21T08:11:29.0324894Z         %16 = arith.subf %12, %11 : tensor<512x32xf32>
2026-02-21T08:11:29.0325099Z         %17 = arith.mulf %15, %16 : tensor<512x32xf32>
2026-02-21T08:11:29.0325316Z         %18 = arith.addf %17, %cst : tensor<512x32xf32>
2026-02-21T08:11:29.0325514Z         scf.yield %18 : tensor<512x32xf32>
2026-02-21T08:11:29.0325690Z       } else {
2026-02-21T08:11:29.0325847Z         %15 = tt.splat %arg4 : f32 -> tensor<512x32xf32>
2026-02-21T08:11:29.0326072Z         %16 = arith.cmpf ogt, %12, %15 : tensor<512x32xf32>
2026-02-21T08:11:29.0326295Z         %17 = arith.cmpf une, %12, %12 : tensor<512x32xf32>
2026-02-21T08:11:29.0326501Z         %18 = arith.ori %16, %17 : tensor<512x32xi1>
2026-02-21T08:11:29.0326743Z         %19 = arith.select %18, %12, %15 : tensor<512x32xi1>, tensor<512x32xf32>
2026-02-21T08:11:29.0326982Z         %20 = math.log %19 : tensor<512x32xf32>
2026-02-21T08:11:29.0327183Z         %21 = arith.subf %20, %11 : tensor<512x32xf32>
2026-02-21T08:11:29.0327381Z         %22 = arith.mulf %12, %21 : tensor<512x32xf32>
2026-02-21T08:11:29.0327596Z         %23 = arith.addf %22, %cst : tensor<512x32xf32>
2026-02-21T08:11:29.0328046Z         scf.yield %23 : tensor<512x32xf32>
2026-02-21T08:11:29.0328218Z       }
2026-02-21T08:11:29.0328375Z       %14 = arith.addf %arg6, %13 : tensor<512x32xf32>
2026-02-21T08:11:29.0328569Z       scf.yield %14 : tensor<512x32xf32>
2026-02-21T08:11:29.0328895Z     } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:11:29.0329222Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:11:29.0329421Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:11:29.0329598Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:11:29.0329777Z       tt.reduce.return %11 : f32
2026-02-21T08:11:29.0329961Z     }) : (tensor<512x32xf32>) -> tensor<512xf32>
2026-02-21T08:11:29.0330184Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<512x!tt.ptr<f32>>
2026-02-21T08:11:29.0330450Z     %10 = tt.addptr %9, %6 : tensor<512x!tt.ptr<f32>>, tensor<512xi32>
2026-02-21T08:11:29.0330805Z     tt.store %10, %8 : tensor<512x!tt.ptr<f32>>
2026-02-21T08:11:29.0330999Z     tt.return
2026-02-21T08:11:29.0331129Z   }
2026-02-21T08:11:29.0331260Z }
2026-02-21T08:11:29.0331329Z 
2026-02-21T08:11:29.0331390Z {-#
2026-02-21T08:11:29.0331522Z   external_resources: {
2026-02-21T08:11:29.0331684Z     mlir_reproducer: {
2026-02-21T08:11:29.0335992Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:11:29.0340312Z       disable_threading: false,
2026-02-21T08:11:29.0340472Z       verify_each: true
2026-02-21T08:11:29.0340621Z     }
2026-02-21T08:11:29.0340736Z   }
2026-02-21T08:11:29.0340853Z #-}
2026-02-21T08:11:29.0341264Z /tmp/torchinductor_root/6v/c6vcceyrhox3eack2pqh6el7mbxcvppgwliuhit5zyi62nwxsnqp.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:11:29.0342539Z /tmp/torchinductor_root/6v/c6vcceyrhox3eack2pqh6el7mbxcvppgwliuhit5zyi62nwxsnqp.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:11:29.0343519Z [33s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:11:29.0344499Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:11:29.0345469Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:11:29.0345726Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:11:29.6553584Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:11:29.6558997Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:11:29.6560058Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:11:29.6560638Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:11:29.6560855Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:11:29.6561042Z     %c2368_i32 = arith.constant 2368 : i32
2026-02-21T08:11:29.6561266Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x32xf32>
2026-02-21T08:11:29.6561495Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:11:29.6561667Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:11:29.6562023Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T08:11:29.6562217Z     %c8192_i64 = arith.constant 8192 : i64
2026-02-21T08:11:29.6562394Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:11:29.6562711Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<16x32xf32>>
2026-02-21T08:11:29.6563133Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<16x32xf32>>
2026-02-21T08:11:29.6563443Z     %2 = tt.get_program_id x : i32
2026-02-21T08:11:29.6563622Z     %3 = arith.subi %c256_i32, %2 : i32
2026-02-21T08:11:29.6563809Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:11:29.6563996Z     %4 = arith.subi %c2368_i32, %c1_i32 : i32
2026-02-21T08:11:29.6564177Z     %5 = arith.addi %3, %4 : i32
2026-02-21T08:11:29.6564360Z     %6 = arith.divui %5, %c2368_i32 : i32
2026-02-21T08:11:29.6564543Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:11:29.6564721Z     %7 = arith.remsi %6, %c4_i32 : i32
2026-02-21T08:11:29.6564890Z     %8 = arith.subi %6, %7 : i32
2026-02-21T08:11:29.6565059Z     %9 = arith.muli %8, %c2368_i32 : i32
2026-02-21T08:11:29.6565231Z     %10 = arith.addi %2, %9 : i32
2026-02-21T08:11:29.6565414Z     %11 = arith.muli %c2368_i32, %c4_i32 : i32
2026-02-21T08:11:29.6565618Z     scf.for %arg5 = %2 to %10 step %11  : i32 {
2026-02-21T08:11:29.6565813Z       %12 = arith.muli %arg5, %c16_i32 : i32
2026-02-21T08:11:29.6566048Z       %13 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:11:29.6566297Z       %14 = tt.splat %12 : i32 -> tensor<16xi32>
2026-02-21T08:11:29.6566500Z       %15 = arith.addi %14, %13 : tensor<16xi32>
2026-02-21T08:11:29.6566816Z       %16 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>)  : i32 {
2026-02-21T08:11:29.6567226Z         %50 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:11:29.6567597Z         %51 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:11:29.6567885Z         %52 = scf.if %arg3 -> (tensor<16x32xf32>) {
2026-02-21T08:11:29.6568259Z           %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32>
2026-02-21T08:11:29.6568622Z           %55 = arith.subf %51, %50 : tensor<16x32xf32>
2026-02-21T08:11:29.6568836Z           %56 = arith.mulf %54, %55 : tensor<16x32xf32>
2026-02-21T08:11:29.6569051Z           %57 = arith.addf %56, %cst : tensor<16x32xf32>
2026-02-21T08:11:29.6569250Z           scf.yield %57 : tensor<16x32xf32>
2026-02-21T08:11:29.6569548Z         } else {
2026-02-21T08:11:29.6569705Z           %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32>
2026-02-21T08:11:29.6569929Z           %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32>
2026-02-21T08:11:29.6570144Z           %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32>
2026-02-21T08:11:29.6570358Z           %57 = arith.ori %55, %56 : tensor<16x32xi1>
2026-02-21T08:11:29.6570589Z           %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32>
2026-02-21T08:11:29.6570833Z           %59 = math.log %58 : tensor<16x32xf32>
2026-02-21T08:11:29.6571042Z           %60 = arith.subf %59, %50 : tensor<16x32xf32>
2026-02-21T08:11:29.6571247Z           %61 = arith.mulf %51, %60 : tensor<16x32xf32>
2026-02-21T08:11:29.6571460Z           %62 = arith.addf %61, %cst : tensor<16x32xf32>
2026-02-21T08:11:29.6571657Z           scf.yield %62 : tensor<16x32xf32>
2026-02-21T08:11:29.6571835Z         }
2026-02-21T08:11:29.6572085Z         %53 = arith.addf %arg7, %52 : tensor<16x32xf32>
2026-02-21T08:11:29.6572296Z         scf.yield %53 : tensor<16x32xf32>
2026-02-21T08:11:29.6572521Z       } {tt.disallow_acc_multi_buffer, tt.warp_specialize}
2026-02-21T08:11:29.6572744Z       %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({
2026-02-21T08:11:29.6572939Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:29.6573114Z         %50 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:11:29.6573302Z         tt.reduce.return %50 : f32
2026-02-21T08:11:29.6573486Z       }) : (tensor<16x32xf32>) -> tensor<16xf32>
2026-02-21T08:11:29.6573724Z       %18 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:11:29.6573987Z       %19 = tt.addptr %18, %15 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:11:29.6574234Z       tt.store %19, %17 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:11:29.6574439Z       %c1_i32_0 = arith.constant 1 : i32
2026-02-21T08:11:29.6574628Z       %20 = arith.muli %c2368_i32, %c1_i32_0 : i32
2026-02-21T08:11:29.6574828Z       %21 = arith.addi %arg5, %20 : i32
2026-02-21T08:11:29.6575013Z       %22 = arith.muli %21, %c16_i32 : i32
2026-02-21T08:11:29.6575249Z       %23 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:11:29.6575495Z       %24 = tt.splat %22 : i32 -> tensor<16xi32>
2026-02-21T08:11:29.6575702Z       %25 = arith.addi %24, %23 : tensor<16xi32>
2026-02-21T08:11:29.6576025Z       %26 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>)  : i32 {
2026-02-21T08:11:29.6576476Z         %50 = tt.descriptor_load %0[%22, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:11:29.6576854Z         %51 = tt.descriptor_load %1[%22, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:11:29.6577152Z         %52 = scf.if %arg3 -> (tensor<16x32xf32>) {
2026-02-21T08:11:29.6577525Z           %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32>
2026-02-21T08:11:29.6577901Z           %55 = arith.subf %51, %50 : tensor<16x32xf32>
2026-02-21T08:11:29.6578116Z           %56 = arith.mulf %54, %55 : tensor<16x32xf32>
2026-02-21T08:11:29.6578322Z           %57 = arith.addf %56, %cst : tensor<16x32xf32>
2026-02-21T08:11:29.6578527Z           scf.yield %57 : tensor<16x32xf32>
2026-02-21T08:11:29.6578699Z         } else {
2026-02-21T08:11:29.6578871Z           %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32>
2026-02-21T08:11:29.6579092Z           %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32>
2026-02-21T08:11:29.6579321Z           %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32>
2026-02-21T08:11:29.6579539Z           %57 = arith.ori %55, %56 : tensor<16x32xi1>
2026-02-21T08:11:29.6579782Z           %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32>
2026-02-21T08:11:29.6580017Z           %59 = math.log %58 : tensor<16x32xf32>
2026-02-21T08:11:29.6580205Z           %60 = arith.subf %59, %50 : tensor<16x32xf32>
2026-02-21T08:11:29.6580476Z           %61 = arith.mulf %51, %60 : tensor<16x32xf32>
2026-02-21T08:11:29.6580668Z           %62 = arith.addf %61, %cst : tensor<16x32xf32>
2026-02-21T08:11:29.6580863Z           scf.yield %62 : tensor<16x32xf32>
2026-02-21T08:11:29.6581032Z         }
2026-02-21T08:11:29.6581170Z         %53 = arith.addf %arg7, %52 : tensor<16x32xf32>
2026-02-21T08:11:29.6581361Z         scf.yield %53 : tensor<16x32xf32>
2026-02-21T08:11:29.6581565Z       } {tt.disallow_acc_multi_buffer, tt.warp_specialize}
2026-02-21T08:11:29.6581779Z       %27 = "tt.reduce"(%26) <{axis = 1 : i32}> ({
2026-02-21T08:11:29.6581992Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:29.6582173Z         %50 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:11:29.6582353Z         tt.reduce.return %50 : f32
2026-02-21T08:11:29.6582537Z       }) : (tensor<16x32xf32>) -> tensor<16xf32>
2026-02-21T08:11:29.6582766Z       %28 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:11:29.6583073Z       %29 = tt.addptr %28, %25 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:11:29.6583316Z       tt.store %29, %27 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:11:29.6583511Z       %c2_i32 = arith.constant 2 : i32
2026-02-21T08:11:29.6583697Z       %30 = arith.muli %c2368_i32, %c2_i32 : i32
2026-02-21T08:11:29.6583881Z       %31 = arith.addi %arg5, %30 : i32
2026-02-21T08:11:29.6584061Z       %32 = arith.muli %31, %c16_i32 : i32
2026-02-21T08:11:29.6584288Z       %33 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:11:29.6584530Z       %34 = tt.splat %32 : i32 -> tensor<16xi32>
2026-02-21T08:11:29.6584723Z       %35 = arith.addi %34, %33 : tensor<16xi32>
2026-02-21T08:11:29.6585024Z       %36 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>)  : i32 {
2026-02-21T08:11:29.6585420Z         %50 = tt.descriptor_load %0[%32, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:11:29.6585826Z         %51 = tt.descriptor_load %1[%32, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:11:29.6586122Z         %52 = scf.if %arg3 -> (tensor<16x32xf32>) {
2026-02-21T08:11:29.6586483Z           %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32>
2026-02-21T08:11:29.6586852Z           %55 = arith.subf %51, %50 : tensor<16x32xf32>
2026-02-21T08:11:29.6587056Z           %56 = arith.mulf %54, %55 : tensor<16x32xf32>
2026-02-21T08:11:29.6587269Z           %57 = arith.addf %56, %cst : tensor<16x32xf32>
2026-02-21T08:11:29.6587461Z           scf.yield %57 : tensor<16x32xf32>
2026-02-21T08:11:29.6587633Z         } else {
2026-02-21T08:11:29.6587792Z           %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32>
2026-02-21T08:11:29.6588013Z           %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32>
2026-02-21T08:11:29.6588239Z           %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32>
2026-02-21T08:11:29.6588450Z           %57 = arith.ori %55, %56 : tensor<16x32xi1>
2026-02-21T08:11:29.6588697Z           %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32>
2026-02-21T08:11:29.6588934Z           %59 = math.log %58 : tensor<16x32xf32>
2026-02-21T08:11:29.6589136Z           %60 = arith.subf %59, %50 : tensor<16x32xf32>
2026-02-21T08:11:29.6589332Z           %61 = arith.mulf %51, %60 : tensor<16x32xf32>
2026-02-21T08:11:29.6589540Z           %62 = arith.addf %61, %cst : tensor<16x32xf32>
2026-02-21T08:11:29.6589743Z           scf.yield %62 : tensor<16x32xf32>
2026-02-21T08:11:29.6589911Z         }
2026-02-21T08:11:29.6590063Z         %53 = arith.addf %arg7, %52 : tensor<16x32xf32>
2026-02-21T08:11:29.6590255Z         scf.yield %53 : tensor<16x32xf32>
2026-02-21T08:11:29.6590467Z       } {tt.disallow_acc_multi_buffer, tt.warp_specialize}
2026-02-21T08:11:29.6590684Z       %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({
2026-02-21T08:11:29.6590879Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:29.6591060Z         %50 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:11:29.6591289Z         tt.reduce.return %50 : f32
2026-02-21T08:11:29.6591470Z       }) : (tensor<16x32xf32>) -> tensor<16xf32>
2026-02-21T08:11:29.6591762Z       %38 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:11:29.6592051Z       %39 = tt.addptr %38, %35 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:11:29.6592276Z       tt.store %39, %37 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:11:29.6592473Z       %c3_i32 = arith.constant 3 : i32
2026-02-21T08:11:29.6592648Z       %40 = arith.muli %c2368_i32, %c3_i32 : i32
2026-02-21T08:11:29.6592835Z       %41 = arith.addi %arg5, %40 : i32
2026-02-21T08:11:29.6593015Z       %42 = arith.muli %41, %c16_i32 : i32
2026-02-21T08:11:29.6593233Z       %43 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:11:29.6593470Z       %44 = tt.splat %42 : i32 -> tensor<16xi32>
2026-02-21T08:11:29.6593653Z       %45 = arith.addi %44, %43 : tensor<16xi32>
2026-02-21T08:11:29.6594013Z       %46 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>)  : i32 {
2026-02-21T08:11:29.6594403Z         %50 = tt.descriptor_load %0[%42, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:11:29.6594764Z         %51 = tt.descriptor_load %1[%42, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:11:29.6595049Z         %52 = scf.if %arg3 -> (tensor<16x32xf32>) {
2026-02-21T08:11:29.6595399Z           %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32>
2026-02-21T08:11:29.6595760Z           %55 = arith.subf %51, %50 : tensor<16x32xf32>
2026-02-21T08:11:29.6595957Z           %56 = arith.mulf %54, %55 : tensor<16x32xf32>
2026-02-21T08:11:29.6596163Z           %57 = arith.addf %56, %cst : tensor<16x32xf32>
2026-02-21T08:11:29.6596359Z           scf.yield %57 : tensor<16x32xf32>
2026-02-21T08:11:29.6596525Z         } else {
2026-02-21T08:11:29.6596692Z           %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32>
2026-02-21T08:11:29.6596906Z           %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32>
2026-02-21T08:11:29.6597133Z           %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32>
2026-02-21T08:11:29.6597337Z           %57 = arith.ori %55, %56 : tensor<16x32xi1>
2026-02-21T08:11:29.6597571Z           %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32>
2026-02-21T08:11:29.6597810Z           %59 = math.log %58 : tensor<16x32xf32>
2026-02-21T08:11:29.6597997Z           %60 = arith.subf %59, %50 : tensor<16x32xf32>
2026-02-21T08:11:29.6598194Z           %61 = arith.mulf %51, %60 : tensor<16x32xf32>
2026-02-21T08:11:29.6598387Z           %62 = arith.addf %61, %cst : tensor<16x32xf32>
2026-02-21T08:11:29.6598582Z           scf.yield %62 : tensor<16x32xf32>
2026-02-21T08:11:29.6598744Z         }
2026-02-21T08:11:29.6598889Z         %53 = arith.addf %arg7, %52 : tensor<16x32xf32>
2026-02-21T08:11:29.6599079Z         scf.yield %53 : tensor<16x32xf32>
2026-02-21T08:11:29.6599290Z       } {tt.disallow_acc_multi_buffer, tt.warp_specialize}
2026-02-21T08:11:29.6599503Z       %47 = "tt.reduce"(%46) <{axis = 1 : i32}> ({
2026-02-21T08:11:29.6599683Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:29.6599863Z         %50 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:11:29.6600039Z         tt.reduce.return %50 : f32
2026-02-21T08:11:29.6600221Z       }) : (tensor<16x32xf32>) -> tensor<16xf32>
2026-02-21T08:11:29.6600437Z       %48 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:11:29.6600696Z       %49 = tt.addptr %48, %45 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:11:29.6600923Z       tt.store %49, %47 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:11:29.6601107Z     }
2026-02-21T08:11:29.6601272Z     scf.for %arg5 = %10 to %c256_i32 step %c2368_i32  : i32 {
2026-02-21T08:11:29.6601480Z       %12 = arith.muli %arg5, %c16_i32 : i32
2026-02-21T08:11:29.6601707Z       %13 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:11:29.6602023Z       %14 = tt.splat %12 : i32 -> tensor<16xi32>
2026-02-21T08:11:29.6602212Z       %15 = arith.addi %14, %13 : tensor<16xi32>
2026-02-21T08:11:29.6602507Z       %16 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>)  : i32 {
2026-02-21T08:11:29.6602899Z         %20 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:11:29.6603259Z         %21 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:11:29.6603537Z         %22 = scf.if %arg3 -> (tensor<16x32xf32>) {
2026-02-21T08:11:29.6603896Z           %24 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32>
2026-02-21T08:11:29.6604253Z           %25 = arith.subf %21, %20 : tensor<16x32xf32>
2026-02-21T08:11:29.6604510Z           %26 = arith.mulf %24, %25 : tensor<16x32xf32>
2026-02-21T08:11:29.6604721Z           %27 = arith.addf %26, %cst : tensor<16x32xf32>
2026-02-21T08:11:29.6604908Z           scf.yield %27 : tensor<16x32xf32>
2026-02-21T08:11:29.6605076Z         } else {
2026-02-21T08:11:29.6605230Z           %24 = tt.splat %arg4 : f32 -> tensor<16x32xf32>
2026-02-21T08:11:29.6605448Z           %25 = arith.cmpf ogt, %21, %24 : tensor<16x32xf32>
2026-02-21T08:11:29.6605661Z           %26 = arith.cmpf une, %21, %21 : tensor<16x32xf32>
2026-02-21T08:11:29.6605875Z           %27 = arith.ori %25, %26 : tensor<16x32xi1>
2026-02-21T08:11:29.6606108Z           %28 = arith.select %27, %21, %24 : tensor<16x32xi1>, tensor<16x32xf32>
2026-02-21T08:11:29.6606338Z           %29 = math.log %28 : tensor<16x32xf32>
2026-02-21T08:11:29.6606532Z           %30 = arith.subf %29, %20 : tensor<16x32xf32>
2026-02-21T08:11:29.6606725Z           %31 = arith.mulf %21, %30 : tensor<16x32xf32>
2026-02-21T08:11:29.6606934Z           %32 = arith.addf %31, %cst : tensor<16x32xf32>
2026-02-21T08:11:29.6607124Z           scf.yield %32 : tensor<16x32xf32>
2026-02-21T08:11:29.6607293Z         }
2026-02-21T08:11:29.6607432Z         %23 = arith.addf %arg7, %22 : tensor<16x32xf32>
2026-02-21T08:11:29.6607630Z         scf.yield %23 : tensor<16x32xf32>
2026-02-21T08:11:29.6607844Z       } {tt.disallow_acc_multi_buffer, tt.warp_specialize}
2026-02-21T08:11:29.6608053Z       %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({
2026-02-21T08:11:29.6608239Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:29.6608405Z         %20 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:11:29.6608583Z         tt.reduce.return %20 : f32
2026-02-21T08:11:29.6608757Z       }) : (tensor<16x32xf32>) -> tensor<16xf32>
2026-02-21T08:11:29.6608980Z       %18 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:11:29.6609237Z       %19 = tt.addptr %18, %15 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:11:29.6609463Z       tt.store %19, %17 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:11:29.6609662Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:11:29.6609816Z     tt.return
2026-02-21T08:11:29.6609945Z   }
2026-02-21T08:11:29.6610058Z }
2026-02-21T08:11:29.6610131Z 
2026-02-21T08:11:29.6610180Z {-#
2026-02-21T08:11:29.6610303Z   external_resources: {
2026-02-21T08:11:29.6610457Z     mlir_reproducer: {
2026-02-21T08:11:29.6614788Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=1}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=1}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=1}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:11:29.6619480Z       disable_threading: false,
2026-02-21T08:11:29.6619689Z       verify_each: true
2026-02-21T08:11:29.6619866Z     }
2026-02-21T08:11:29.6620014Z   }
2026-02-21T08:11:29.6620145Z #-}
2026-02-21T08:11:29.6620637Z /tmp/torchinductor_root/np/cnp3647mdqcwjvzyuqilvqr5f6dahdv7ylecezdqdpc2zi24si6c.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:11:29.6622000Z /tmp/torchinductor_root/np/cnp3647mdqcwjvzyuqilvqr5f6dahdv7ylecezdqdpc2zi24si6c.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:11:29.6623057Z [34s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:11:29.6624197Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=32, num_sm_multiplier=16, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, False], range_num_stages=[0, 0], range_unroll_factors=[4, 0], range_warp_specializes=[False, True]), static_shapes=True)
2026-02-21T08:11:29.6625246Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:11:29.6625502Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:11:32.4247265Z module {
2026-02-21T08:11:32.4248125Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:11:32.4248867Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:11:32.4249527Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:11:32.4253262Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:11:32.4257426Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x256xf32>
2026-02-21T08:11:32.4261554Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:11:32.4265460Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:11:32.4269342Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T08:11:32.4272834Z     %c8192_i64 = arith.constant 8192 : i64
2026-02-21T08:11:32.4275474Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:11:32.4275835Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<16x256xf32>>
2026-02-21T08:11:32.4276300Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<16x256xf32>>
2026-02-21T08:11:32.4276624Z     %2 = tt.get_program_id x : i32
2026-02-21T08:11:32.4276828Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:11:32.4277076Z     %4 = arith.minsi %3, %c256_i32 : i32
2026-02-21T08:11:32.4277738Z     scf.for %arg5 = %2 to %4 step %c1_i32  : i32 {
2026-02-21T08:11:32.4278043Z       %5 = arith.muli %arg5, %c16_i32 : i32
2026-02-21T08:11:32.4278324Z       %6 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:11:32.4278617Z       %7 = tt.splat %5 : i32 -> tensor<16xi32>
2026-02-21T08:11:32.4278839Z       %8 = arith.addi %7, %6 : tensor<16xi32>
2026-02-21T08:11:32.4279281Z       %9 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c256_i32 iter_args(%arg7 = %cst) -> (tensor<16x256xf32>)  : i32 {
2026-02-21T08:11:32.4279937Z         %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc<tensor<16x256xf32>> -> tensor<16x256xf32>
2026-02-21T08:11:32.4280519Z         %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc<tensor<16x256xf32>> -> tensor<16x256xf32>
2026-02-21T08:11:32.4280920Z         %15 = scf.if %arg3 -> (tensor<16x256xf32>) {
2026-02-21T08:11:32.4281629Z           %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32>
2026-02-21T08:11:32.4282093Z           %18 = arith.subf %14, %13 : tensor<16x256xf32>
2026-02-21T08:11:32.4282300Z           %19 = arith.mulf %17, %18 : tensor<16x256xf32>
2026-02-21T08:11:32.4282538Z           %20 = arith.addf %19, %cst : tensor<16x256xf32>
2026-02-21T08:11:32.4282745Z           scf.yield %20 : tensor<16x256xf32>
2026-02-21T08:11:32.4282912Z         } else {
2026-02-21T08:11:32.4283081Z           %17 = tt.splat %arg4 : f32 -> tensor<16x256xf32>
2026-02-21T08:11:32.4283299Z           %18 = arith.cmpf ogt, %14, %17 : tensor<16x256xf32>
2026-02-21T08:11:32.4283527Z           %19 = arith.cmpf une, %14, %14 : tensor<16x256xf32>
2026-02-21T08:11:32.4283745Z           %20 = arith.ori %18, %19 : tensor<16x256xi1>
2026-02-21T08:11:32.4283984Z           %21 = arith.select %20, %14, %17 : tensor<16x256xi1>, tensor<16x256xf32>
2026-02-21T08:11:32.4284234Z           %22 = math.log %21 : tensor<16x256xf32>
2026-02-21T08:11:32.4284433Z           %23 = arith.subf %22, %13 : tensor<16x256xf32>
2026-02-21T08:11:32.4284638Z           %24 = arith.mulf %14, %23 : tensor<16x256xf32>
2026-02-21T08:11:32.4284837Z           %25 = arith.addf %24, %cst : tensor<16x256xf32>
2026-02-21T08:11:32.4285039Z           scf.yield %25 : tensor<16x256xf32>
2026-02-21T08:11:32.4285213Z         }
2026-02-21T08:11:32.4285355Z         %16 = arith.addf %arg7, %15 : tensor<16x256xf32>
2026-02-21T08:11:32.4285552Z         scf.yield %16 : tensor<16x256xf32>
2026-02-21T08:11:32.4285768Z       } {tt.flatten, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:11:32.4286000Z       %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({
2026-02-21T08:11:32.4286183Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:32.4286361Z         %13 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:11:32.4286538Z         tt.reduce.return %13 : f32
2026-02-21T08:11:32.4286723Z       }) : (tensor<16x256xf32>) -> tensor<16xf32>
2026-02-21T08:11:32.4286955Z       %11 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:11:32.4287210Z       %12 = tt.addptr %11, %8 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:11:32.4287445Z       tt.store %12, %10 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:11:32.4287619Z     }
2026-02-21T08:11:32.4287740Z     tt.return
2026-02-21T08:11:32.4287858Z   }
2026-02-21T08:11:32.4287976Z }
2026-02-21T08:11:32.4288039Z 
2026-02-21T08:11:32.4288085Z {-#
2026-02-21T08:11:32.4288211Z   external_resources: {
2026-02-21T08:11:32.4288365Z     mlir_reproducer: {
2026-02-21T08:11:32.4292704Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:11:32.4297234Z       disable_threading: false,
2026-02-21T08:11:32.4297410Z       verify_each: true
2026-02-21T08:11:32.4297553Z     }
2026-02-21T08:11:32.4297662Z   }
2026-02-21T08:11:32.4297781Z #-}
2026-02-21T08:11:32.4298184Z /tmp/torchinductor_root/dc/cdcak6bwi27fdqg3bzvvkvgo6selj43m4xajbp5b6gu3kl7zznvk.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:11:32.4299360Z /tmp/torchinductor_root/dc/cdcak6bwi27fdqg3bzvvkvgo6selj43m4xajbp5b6gu3kl7zznvk.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:11:32.4300322Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:11:32.4301334Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:11:32.4302310Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:11:32.4302567Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:11:33.8815557Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:11:33.8816279Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:11:33.8816892Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:11:33.8817091Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:11:33.8817283Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:11:33.8817503Z     %cst = arith.constant dense<0.000000e+00> : tensor<4x4xf32>
2026-02-21T08:11:33.8817736Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:11:33.8817928Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:11:33.8818115Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T08:11:33.8818304Z     %c8192_i64 = arith.constant 8192 : i64
2026-02-21T08:11:33.8818485Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:11:33.8818811Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<4x4xf32>>
2026-02-21T08:11:33.8819259Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<4x4xf32>>
2026-02-21T08:11:33.8819895Z     %2 = tt.get_program_id x : i32
2026-02-21T08:11:33.8820088Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:11:33.8820279Z     %4 = arith.minsi %3, %c1024_i32 : i32
2026-02-21T08:11:33.8820473Z     %5 = arith.subi %4, %2 : i32
2026-02-21T08:11:33.8820648Z     %c1_i32_0 = arith.constant 1 : i32
2026-02-21T08:11:33.8820845Z     %6 = arith.subi %c1_i32, %c1_i32_0 : i32
2026-02-21T08:11:33.8821028Z     %7 = arith.addi %5, %6 : i32
2026-02-21T08:11:33.8821203Z     %8 = arith.divui %7, %c1_i32 : i32
2026-02-21T08:11:33.8821380Z     %c4_i32_1 = arith.constant 4 : i32
2026-02-21T08:11:33.8821569Z     %9 = arith.remsi %8, %c4_i32_1 : i32
2026-02-21T08:11:33.8821753Z     %10 = arith.subi %8, %9 : i32
2026-02-21T08:11:33.8825274Z     %11 = arith.muli %10, %c1_i32 : i32
2026-02-21T08:11:33.8825469Z     %12 = arith.addi %2, %11 : i32
2026-02-21T08:11:33.8825662Z     %13 = arith.muli %c1_i32, %c4_i32_1 : i32
2026-02-21T08:11:33.8826001Z     scf.for %arg5 = %2 to %12 step %13  : i32 {
2026-02-21T08:11:33.8826219Z       %14 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:11:33.8826469Z       %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:11:33.8826729Z       %16 = tt.splat %14 : i32 -> tensor<4xi32>
2026-02-21T08:11:33.8826943Z       %17 = arith.addi %16, %15 : tensor<4xi32>
2026-02-21T08:11:33.8827269Z       %18 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:11:33.8827691Z         %52 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:11:33.8828043Z         %53 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:11:33.8828320Z         %54 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:11:33.8828688Z           %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:11:33.8829048Z           %57 = arith.subf %53, %52 : tensor<4x4xf32>
2026-02-21T08:11:33.8829245Z           %58 = arith.mulf %56, %57 : tensor<4x4xf32>
2026-02-21T08:11:33.8829450Z           %59 = arith.addf %58, %cst : tensor<4x4xf32>
2026-02-21T08:11:33.8829639Z           scf.yield %59 : tensor<4x4xf32>
2026-02-21T08:11:33.8829808Z         } else {
2026-02-21T08:11:33.8829960Z           %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:11:33.8830171Z           %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32>
2026-02-21T08:11:33.8830383Z           %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32>
2026-02-21T08:11:33.8830581Z           %59 = arith.ori %57, %58 : tensor<4x4xi1>
2026-02-21T08:11:33.8830815Z           %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:11:33.8831046Z           %61 = math.log %60 : tensor<4x4xf32>
2026-02-21T08:11:33.8831236Z           %62 = arith.subf %61, %52 : tensor<4x4xf32>
2026-02-21T08:11:33.8831426Z           %63 = arith.mulf %53, %62 : tensor<4x4xf32>
2026-02-21T08:11:33.8831625Z           %64 = arith.addf %63, %cst : tensor<4x4xf32>
2026-02-21T08:11:33.8831829Z           scf.yield %64 : tensor<4x4xf32>
2026-02-21T08:11:33.8832053Z         }
2026-02-21T08:11:33.8832210Z         %55 = arith.addf %arg7, %54 : tensor<4x4xf32>
2026-02-21T08:11:33.8832407Z         scf.yield %55 : tensor<4x4xf32>
2026-02-21T08:11:33.8832743Z       } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:11:33.8833089Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:11:33.8833296Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:33.8833481Z         %52 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:11:33.8833693Z         tt.reduce.return %52 : f32
2026-02-21T08:11:33.8833876Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:11:33.8834093Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:11:33.8834350Z       %21 = tt.addptr %20, %17 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:11:33.8834684Z       tt.store %21, %19 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:11:33.8834910Z       %c1_i32_2 = arith.constant 1 : i32
2026-02-21T08:11:33.8835095Z       %22 = arith.muli %c1_i32, %c1_i32_2 : i32
2026-02-21T08:11:33.8835274Z       %23 = arith.addi %arg5, %22 : i32
2026-02-21T08:11:33.8835452Z       %24 = arith.muli %23, %c4_i32 : i32
2026-02-21T08:11:33.8835669Z       %25 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:11:33.8835895Z       %26 = tt.splat %24 : i32 -> tensor<4xi32>
2026-02-21T08:11:33.8836084Z       %27 = arith.addi %26, %25 : tensor<4xi32>
2026-02-21T08:11:33.8836374Z       %28 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:11:33.8836754Z         %52 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:11:33.8837164Z         %53 = tt.descriptor_load %1[%24, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:11:33.8837446Z         %54 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:11:33.8837799Z           %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:11:33.8838147Z           %57 = arith.subf %53, %52 : tensor<4x4xf32>
2026-02-21T08:11:33.8838346Z           %58 = arith.mulf %56, %57 : tensor<4x4xf32>
2026-02-21T08:11:33.8838539Z           %59 = arith.addf %58, %cst : tensor<4x4xf32>
2026-02-21T08:11:33.8838732Z           scf.yield %59 : tensor<4x4xf32>
2026-02-21T08:11:33.8838898Z         } else {
2026-02-21T08:11:33.8839052Z           %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:11:33.8839268Z           %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32>
2026-02-21T08:11:33.8839526Z           %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32>
2026-02-21T08:11:33.8839735Z           %59 = arith.ori %57, %58 : tensor<4x4xi1>
2026-02-21T08:11:33.8839964Z           %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:11:33.8840196Z           %61 = math.log %60 : tensor<4x4xf32>
2026-02-21T08:11:33.8840386Z           %62 = arith.subf %61, %52 : tensor<4x4xf32>
2026-02-21T08:11:33.8840573Z           %63 = arith.mulf %53, %62 : tensor<4x4xf32>
2026-02-21T08:11:33.8840775Z           %64 = arith.addf %63, %cst : tensor<4x4xf32>
2026-02-21T08:11:33.8840960Z           scf.yield %64 : tensor<4x4xf32>
2026-02-21T08:11:33.8841127Z         }
2026-02-21T08:11:33.8841262Z         %55 = arith.addf %arg7, %54 : tensor<4x4xf32>
2026-02-21T08:11:33.8841449Z         scf.yield %55 : tensor<4x4xf32>
2026-02-21T08:11:33.8841750Z       } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:11:33.8842112Z       %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({
2026-02-21T08:11:33.8842300Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:33.8842475Z         %52 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:11:33.8842673Z         tt.reduce.return %52 : f32
2026-02-21T08:11:33.8842863Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:11:33.8843096Z       %30 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:11:33.8843360Z       %31 = tt.addptr %30, %27 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:11:33.8843603Z       tt.store %31, %29 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:11:33.8843808Z       %c2_i32 = arith.constant 2 : i32
2026-02-21T08:11:33.8843999Z       %32 = arith.muli %c1_i32, %c2_i32 : i32
2026-02-21T08:11:33.8844181Z       %33 = arith.addi %arg5, %32 : i32
2026-02-21T08:11:33.8844351Z       %34 = arith.muli %33, %c4_i32 : i32
2026-02-21T08:11:33.8844571Z       %35 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:11:33.8844826Z       %36 = tt.splat %34 : i32 -> tensor<4xi32>
2026-02-21T08:11:33.8845038Z       %37 = arith.addi %36, %35 : tensor<4xi32>
2026-02-21T08:11:33.8845363Z       %38 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:11:33.8845849Z         %52 = tt.descriptor_load %0[%34, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:11:33.8846244Z         %53 = tt.descriptor_load %1[%34, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:11:33.8846553Z         %54 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:11:33.8846949Z           %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:11:33.8847344Z           %57 = arith.subf %53, %52 : tensor<4x4xf32>
2026-02-21T08:11:33.8847569Z           %58 = arith.mulf %56, %57 : tensor<4x4xf32>
2026-02-21T08:11:33.8847796Z           %59 = arith.addf %58, %cst : tensor<4x4xf32>
2026-02-21T08:11:33.8848007Z           scf.yield %59 : tensor<4x4xf32>
2026-02-21T08:11:33.8848197Z         } else {
2026-02-21T08:11:33.8848425Z           %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:11:33.8848669Z           %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32>
2026-02-21T08:11:33.8848899Z           %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32>
2026-02-21T08:11:33.8849129Z           %59 = arith.ori %57, %58 : tensor<4x4xi1>
2026-02-21T08:11:33.8849388Z           %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:11:33.8849644Z           %61 = math.log %60 : tensor<4x4xf32>
2026-02-21T08:11:33.8849860Z           %62 = arith.subf %61, %52 : tensor<4x4xf32>
2026-02-21T08:11:33.8850071Z           %63 = arith.mulf %53, %62 : tensor<4x4xf32>
2026-02-21T08:11:33.8850294Z           %64 = arith.addf %63, %cst : tensor<4x4xf32>
2026-02-21T08:11:33.8850501Z           scf.yield %64 : tensor<4x4xf32>
2026-02-21T08:11:33.8850687Z         }
2026-02-21T08:11:33.8850839Z         %55 = arith.addf %arg7, %54 : tensor<4x4xf32>
2026-02-21T08:11:33.8851077Z         scf.yield %55 : tensor<4x4xf32>
2026-02-21T08:11:33.8851383Z       } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:11:33.8851694Z       %39 = "tt.reduce"(%38) <{axis = 1 : i32}> ({
2026-02-21T08:11:33.8851934Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:33.8852121Z         %52 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:11:33.8852319Z         tt.reduce.return %52 : f32
2026-02-21T08:11:33.8852509Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:11:33.8852744Z       %40 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:11:33.8853020Z       %41 = tt.addptr %40, %37 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:11:33.8853262Z       tt.store %41, %39 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:11:33.8853456Z       %c3_i32 = arith.constant 3 : i32
2026-02-21T08:11:33.8853631Z       %42 = arith.muli %c1_i32, %c3_i32 : i32
2026-02-21T08:11:33.8853812Z       %43 = arith.addi %arg5, %42 : i32
2026-02-21T08:11:33.8853981Z       %44 = arith.muli %43, %c4_i32 : i32
2026-02-21T08:11:33.8854201Z       %45 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:11:33.8854430Z       %46 = tt.splat %44 : i32 -> tensor<4xi32>
2026-02-21T08:11:33.8854612Z       %47 = arith.addi %46, %45 : tensor<4xi32>
2026-02-21T08:11:33.8854902Z       %48 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:11:33.8855268Z         %52 = tt.descriptor_load %0[%44, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:11:33.8855614Z         %53 = tt.descriptor_load %1[%44, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:11:33.8855885Z         %54 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:11:33.8856235Z           %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:11:33.8856586Z           %57 = arith.subf %53, %52 : tensor<4x4xf32>
2026-02-21T08:11:33.8856844Z           %58 = arith.mulf %56, %57 : tensor<4x4xf32>
2026-02-21T08:11:33.8857052Z           %59 = arith.addf %58, %cst : tensor<4x4xf32>
2026-02-21T08:11:33.8857243Z           scf.yield %59 : tensor<4x4xf32>
2026-02-21T08:11:33.8857415Z         } else {
2026-02-21T08:11:33.8857574Z           %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:11:33.8857791Z           %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32>
2026-02-21T08:11:33.8858009Z           %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32>
2026-02-21T08:11:33.8858212Z           %59 = arith.ori %57, %58 : tensor<4x4xi1>
2026-02-21T08:11:33.8858449Z           %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:11:33.8858684Z           %61 = math.log %60 : tensor<4x4xf32>
2026-02-21T08:11:33.8858881Z           %62 = arith.subf %61, %52 : tensor<4x4xf32>
2026-02-21T08:11:33.8859078Z           %63 = arith.mulf %53, %62 : tensor<4x4xf32>
2026-02-21T08:11:33.8859348Z           %64 = arith.addf %63, %cst : tensor<4x4xf32>
2026-02-21T08:11:33.8859544Z           scf.yield %64 : tensor<4x4xf32>
2026-02-21T08:11:33.8859703Z         }
2026-02-21T08:11:33.8859846Z         %55 = arith.addf %arg7, %54 : tensor<4x4xf32>
2026-02-21T08:11:33.8860029Z         scf.yield %55 : tensor<4x4xf32>
2026-02-21T08:11:33.8860333Z       } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:11:33.8860646Z       %49 = "tt.reduce"(%48) <{axis = 1 : i32}> ({
2026-02-21T08:11:33.8860838Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:33.8861008Z         %52 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:11:33.8861199Z         tt.reduce.return %52 : f32
2026-02-21T08:11:33.8861387Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:11:33.8861599Z       %50 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:11:33.8861909Z       %51 = tt.addptr %50, %47 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:11:33.8862150Z       tt.store %51, %49 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:11:33.8862387Z     } {tt.disallow_acc_multi_buffer}
2026-02-21T08:11:33.8862593Z     scf.for %arg5 = %12 to %4 step %c1_i32  : i32 {
2026-02-21T08:11:33.8862802Z       %14 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:11:33.8863036Z       %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:11:33.8863288Z       %16 = tt.splat %14 : i32 -> tensor<4xi32>
2026-02-21T08:11:33.8863520Z       %17 = arith.addi %16, %15 : tensor<4xi32>
2026-02-21T08:11:33.8863813Z       %18 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:11:33.8864177Z         %22 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:11:33.8864529Z         %23 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:11:33.8864812Z         %24 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:11:33.8865157Z           %26 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:11:33.8865504Z           %27 = arith.subf %23, %22 : tensor<4x4xf32>
2026-02-21T08:11:33.8865695Z           %28 = arith.mulf %26, %27 : tensor<4x4xf32>
2026-02-21T08:11:33.8865894Z           %29 = arith.addf %28, %cst : tensor<4x4xf32>
2026-02-21T08:11:33.8866086Z           scf.yield %29 : tensor<4x4xf32>
2026-02-21T08:11:33.8866245Z         } else {
2026-02-21T08:11:33.8866404Z           %26 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:11:33.8866605Z           %27 = arith.cmpf ogt, %23, %26 : tensor<4x4xf32>
2026-02-21T08:11:33.8866817Z           %28 = arith.cmpf une, %23, %23 : tensor<4x4xf32>
2026-02-21T08:11:33.8867010Z           %29 = arith.ori %27, %28 : tensor<4x4xi1>
2026-02-21T08:11:33.8867239Z           %30 = arith.select %29, %23, %26 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:11:33.8867464Z           %31 = math.log %30 : tensor<4x4xf32>
2026-02-21T08:11:33.8867710Z           %32 = arith.subf %31, %22 : tensor<4x4xf32>
2026-02-21T08:11:33.8867901Z           %33 = arith.mulf %23, %32 : tensor<4x4xf32>
2026-02-21T08:11:33.8868087Z           %34 = arith.addf %33, %cst : tensor<4x4xf32>
2026-02-21T08:11:33.8868274Z           scf.yield %34 : tensor<4x4xf32>
2026-02-21T08:11:33.8868433Z         }
2026-02-21T08:11:33.8868575Z         %25 = arith.addf %arg7, %24 : tensor<4x4xf32>
2026-02-21T08:11:33.8868757Z         scf.yield %25 : tensor<4x4xf32>
2026-02-21T08:11:33.8869056Z       } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:11:33.8869377Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:11:33.8869555Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:33.8869730Z         %22 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:11:33.8869902Z         tt.reduce.return %22 : f32
2026-02-21T08:11:33.8870160Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:11:33.8870379Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:11:33.8870629Z       %21 = tt.addptr %20, %17 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:11:33.8870848Z       tt.store %21, %19 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:11:33.8871071Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:11:33.8871271Z     tt.return
2026-02-21T08:11:33.8871391Z   }
2026-02-21T08:11:33.8871516Z }
2026-02-21T08:11:33.8871583Z 
2026-02-21T08:11:33.8871634Z {-#
2026-02-21T08:11:33.8871764Z   external_resources: {
2026-02-21T08:11:33.8871947Z     mlir_reproducer: {
2026-02-21T08:11:33.8876559Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:11:33.8881415Z       disable_threading: false,
2026-02-21T08:11:33.8881596Z       verify_each: true
2026-02-21T08:11:33.8881762Z     }
2026-02-21T08:11:33.8881960Z   }
2026-02-21T08:11:33.8882143Z #-}
2026-02-21T08:11:33.8882718Z /tmp/torchinductor_root/vo/cvoeviq4uu2ccyyvereddu2zu4p225xiu5p2fh3dy2k7ekhtaajo.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:11:33.8884068Z /tmp/torchinductor_root/vo/cvoeviq4uu2ccyyvereddu2zu4p225xiu5p2fh3dy2k7ekhtaajo.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:11:33.8885235Z [38s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:11:33.8886492Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'last'], maxnreg=128, num_sm_multiplier=64, num_stages=6, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[0, 3], range_unroll_factors=[4, 1], range_warp_specializes=[False, True]), static_shapes=True)
2026-02-21T08:11:33.8887590Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:11:33.8887869Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:11:35.3642032Z module {
2026-02-21T08:11:35.3643012Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:11:35.3643639Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:11:35.3643852Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:11:35.3644045Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:11:35.3644278Z     %cst = arith.constant dense<0.000000e+00> : tensor<4x8192xf32>
2026-02-21T08:11:35.3644521Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:11:35.3644715Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:11:35.3644905Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T08:11:35.3645096Z     %c8192_i64 = arith.constant 8192 : i64
2026-02-21T08:11:35.3645275Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:11:35.3645609Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<4x8192xf32>>
2026-02-21T08:11:35.3646071Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<4x8192xf32>>
2026-02-21T08:11:35.3646399Z     %2 = tt.get_program_id x : i32
2026-02-21T08:11:35.3646588Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:11:35.3646774Z     %4 = arith.minsi %3, %c1024_i32 : i32
2026-02-21T08:11:35.3646985Z     scf.for %arg5 = %2 to %4 step %c1_i32  : i32 {
2026-02-21T08:11:35.3647193Z       %5 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:11:35.3647432Z       %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:11:35.3647689Z       %7 = tt.splat %5 : i32 -> tensor<4xi32>
2026-02-21T08:11:35.3647891Z       %8 = arith.addi %7, %6 : tensor<4xi32>
2026-02-21T08:11:35.3648180Z       %9 = tt.descriptor_load %0[%5, %c0_i32] : !tt.tensordesc<tensor<4x8192xf32>> -> tensor<4x8192xf32>
2026-02-21T08:11:35.3648587Z       %10 = tt.descriptor_load %1[%5, %c0_i32] : !tt.tensordesc<tensor<4x8192xf32>> -> tensor<4x8192xf32>
2026-02-21T08:11:35.3648902Z       %11 = scf.if %arg3 -> (tensor<4x8192xf32>) {
2026-02-21T08:11:35.3649290Z         %16 = tt.extern_elementwise %10 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x8192xf32>) -> tensor<4x8192xf32>
2026-02-21T08:11:35.3649689Z         %17 = arith.subf %10, %9 : tensor<4x8192xf32>
2026-02-21T08:11:35.3649905Z         %18 = arith.mulf %16, %17 : tensor<4x8192xf32>
2026-02-21T08:11:35.3650130Z         %19 = arith.addf %18, %cst : tensor<4x8192xf32>
2026-02-21T08:11:35.3650356Z         scf.yield %19 : tensor<4x8192xf32>
2026-02-21T08:11:35.3650531Z       } else {
2026-02-21T08:11:35.3650706Z         %16 = tt.splat %arg4 : f32 -> tensor<4x8192xf32>
2026-02-21T08:11:35.3650935Z         %17 = arith.cmpf ogt, %10, %16 : tensor<4x8192xf32>
2026-02-21T08:11:35.3651175Z         %18 = arith.cmpf une, %10, %10 : tensor<4x8192xf32>
2026-02-21T08:11:35.3651395Z         %19 = arith.ori %17, %18 : tensor<4x8192xi1>
2026-02-21T08:11:35.3651656Z         %20 = arith.select %19, %10, %16 : tensor<4x8192xi1>, tensor<4x8192xf32>
2026-02-21T08:11:35.3652112Z         %21 = math.log %20 : tensor<4x8192xf32>
2026-02-21T08:11:35.3652323Z         %22 = arith.subf %21, %9 : tensor<4x8192xf32>
2026-02-21T08:11:35.3652527Z         %23 = arith.mulf %10, %22 : tensor<4x8192xf32>
2026-02-21T08:11:35.3652734Z         %24 = arith.addf %23, %cst : tensor<4x8192xf32>
2026-02-21T08:11:35.3652943Z         scf.yield %24 : tensor<4x8192xf32>
2026-02-21T08:11:35.3653121Z       }
2026-02-21T08:11:35.3653279Z       %12 = arith.addf %11, %cst : tensor<4x8192xf32>
2026-02-21T08:11:35.3653493Z       %13 = "tt.reduce"(%12) <{axis = 1 : i32}> ({
2026-02-21T08:11:35.3653675Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:11:35.3653851Z         %16 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:11:35.3654026Z         tt.reduce.return %16 : f32
2026-02-21T08:11:35.3654217Z       }) : (tensor<4x8192xf32>) -> tensor<4xf32>
2026-02-21T08:11:35.3654526Z       %14 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:11:35.3654807Z       %15 = tt.addptr %14, %8 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:11:35.3655046Z       tt.store %15, %13 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:11:35.3655272Z     } {tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:11:35.3655473Z     tt.return
2026-02-21T08:11:35.3655601Z   }
2026-02-21T08:11:35.3655733Z }
2026-02-21T08:11:35.3655798Z 
2026-02-21T08:11:35.3655845Z {-#
2026-02-21T08:11:35.3655975Z   external_resources: {
2026-02-21T08:11:35.3656122Z     mlir_reproducer: {
2026-02-21T08:11:35.3660359Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=1}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=1}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=1}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:11:35.3664899Z       disable_threading: false,
2026-02-21T08:11:35.3665069Z       verify_each: true
2026-02-21T08:11:35.3665215Z     }
2026-02-21T08:11:35.3665329Z   }
2026-02-21T08:11:35.3665445Z #-}
2026-02-21T08:11:35.3665857Z /tmp/torchinductor_root/ax/caxj6wc7yve7diibicypm2atgzdubmmgoizzagqbuz2cpheuc57o.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:11:35.3667071Z /tmp/torchinductor_root/ax/caxj6wc7yve7diibicypm2atgzdubmmgoizzagqbuz2cpheuc57o.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:11:35.3668130Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:11:35.3669220Z Config: @helion.kernel(config=helion.Config(block_sizes=[8192, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_sm_multiplier=16, num_stages=1, num_warps=32, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:11:35.3670242Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:11:35.3670551Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:11:35.4301140Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 15.3 configs/s
2026-02-21T08:11:35.4310297Z [40s] Adaptive compile timeout: 30s (90% percentile=4.0s, bounds=[30.0s, 30s])
2026-02-21T08:11:36.1785666Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1326.9 configs/s
2026-02-21T08:11:36.2241445Z [40s] Initial random population of 100, 5 starting points: 
2026-02-21T08:11:36.2243279Z error=7
2026-02-21T08:11:36.2243430Z timeout=6
2026-02-21T08:11:36.2243560Z ok=87
2026-02-21T08:11:36.2243678Z min=0.0747
2026-02-21T08:11:36.2243805Z mid=0.8366
2026-02-21T08:11:36.2243920Z max=40.4429
2026-02-21T08:11:36.2244062Z best={'block_sizes': [1024, 1],
2026-02-21T08:11:36.2244281Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:11:36.2244520Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:11:36.2244702Z  'num_sm_multiplier': 16,
2026-02-21T08:11:36.2244860Z  'num_stages': 1,
2026-02-21T08:11:36.2245001Z  'num_warps': 1,
2026-02-21T08:11:36.2245153Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:11:36.2245348Z  'range_flattens': [None, None],
2026-02-21T08:11:36.2245544Z  'range_multi_buffers': [False, True],
2026-02-21T08:11:36.2245740Z  'range_num_stages': [0, 4],
2026-02-21T08:11:36.2245901Z  'range_unroll_factors': [2, 0],
2026-02-21T08:11:36.2246086Z  'range_warp_specializes': [None, True]}
2026-02-21T08:11:36.2258366Z [40s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:11:37.4728448Z [42s] Generation 1 starting: 89 neighbors, 5 active search path(s)
2026-02-21T08:11:46.8057786Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 4.2 configs/s
2026-02-21T08:11:52.4587462Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 16.4 configs/s
2026-02-21T08:11:59.5651347Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 142.2 configs/s
2026-02-21T08:11:59.8595130Z [64s] Generation 1 complete: 
2026-02-21T08:11:59.8597589Z ok=95
2026-02-21T08:11:59.8597790Z min=0.0645
2026-02-21T08:11:59.8597979Z mid=0.0851
2026-02-21T08:11:59.8598203Z max=0.6297
2026-02-21T08:11:59.8598959Z best={'block_sizes': [256, 1],
2026-02-21T08:11:59.8599294Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:11:59.8599655Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:11:59.8599881Z  'num_stages': 7,
2026-02-21T08:11:59.8600045Z  'num_warps': 1,
2026-02-21T08:11:59.8600372Z  'pid_type': 'flat',
2026-02-21T08:11:59.8600573Z  'range_flattens': [None, False],
2026-02-21T08:11:59.8600848Z  'range_multi_buffers': [None, True],
2026-02-21T08:11:59.8606276Z  'range_num_stages': [0, 3],
2026-02-21T08:11:59.8610708Z  'range_unroll_factors': [0, 3],
2026-02-21T08:11:59.8615319Z  'range_warp_specializes': [None, None]}
2026-02-21T08:11:59.8620250Z [64s] Fitting surrogate: 195 points, 195 targets
2026-02-21T08:12:00.9236668Z [65s] Generation 2 starting: 75 neighbors, 5 active search path(s)
2026-02-21T08:12:04.3550523Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 39.9 configs/s
2026-02-21T08:12:09.3159184Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 15.8 configs/s
2026-02-21T08:12:16.9608069Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 132.2 configs/s
2026-02-21T08:12:17.3219381Z [82s] Generation 2 complete: 
2026-02-21T08:12:17.3219690Z ok=81
2026-02-21T08:12:17.3219848Z min=0.0624
2026-02-21T08:12:17.3219975Z mid=0.0746
2026-02-21T08:12:17.3220135Z max=0.1874
2026-02-21T08:12:17.3220675Z best={'block_sizes': [1024, 1],
2026-02-21T08:12:17.3220976Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:12:17.3221245Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:12:17.3221438Z  'num_stages': 7,
2026-02-21T08:12:17.3221577Z  'num_warps': 4,
2026-02-21T08:12:17.3221728Z  'pid_type': 'flat',
2026-02-21T08:12:17.3222083Z  'range_flattens': [None, False],
2026-02-21T08:12:17.3222309Z  'range_multi_buffers': [None, True],
2026-02-21T08:12:17.3222500Z  'range_num_stages': [0, 3],
2026-02-21T08:12:17.3222669Z  'range_unroll_factors': [0, 3],
2026-02-21T08:12:17.3222840Z  'range_warp_specializes': [None, None]}
2026-02-21T08:12:17.3234351Z [82s] Fitting surrogate: 276 points, 276 targets
2026-02-21T08:12:18.5042873Z [83s] Generation 3 starting: 75 neighbors, 5 active search path(s)
2026-02-21T08:12:23.4683753Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 7.3 configs/s
2026-02-21T08:12:28.1895755Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 16.5 configs/s
2026-02-21T08:12:36.2874726Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 124.9 configs/s
2026-02-21T08:12:36.6560792Z [101s] Generation 3 complete: 
2026-02-21T08:12:36.6564523Z ok=81
2026-02-21T08:12:36.6568916Z min=0.0624
2026-02-21T08:12:36.6572717Z mid=0.0708
2026-02-21T08:12:36.6577447Z max=0.1916
2026-02-21T08:12:36.6581831Z best={'block_sizes': [1024, 1],
2026-02-21T08:12:36.6585676Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:12:36.6586959Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:12:36.6587171Z  'num_stages': 1,
2026-02-21T08:12:36.6587329Z  'num_warps': 4,
2026-02-21T08:12:36.6587477Z  'pid_type': 'flat',
2026-02-21T08:12:36.6587631Z  'range_flattens': [None, None],
2026-02-21T08:12:36.6587813Z  'range_multi_buffers': [None, True],
2026-02-21T08:12:36.6587989Z  'range_num_stages': [0, 3],
2026-02-21T08:12:36.6588154Z  'range_unroll_factors': [0, 0],
2026-02-21T08:12:36.6588325Z  'range_warp_specializes': [None, True]}
2026-02-21T08:12:36.6588536Z [101s] Fitting surrogate: 357 points, 357 targets
2026-02-21T08:12:37.7026476Z [102s] Generation 4 starting: 74 neighbors, 5 active search path(s)
2026-02-21T08:12:40.9885624Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 30.8 configs/s
2026-02-21T08:12:45.5758147Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 16.7 configs/s
2026-02-21T08:12:53.6633084Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 125.0 configs/s
2026-02-21T08:12:54.0263904Z [118s] Generation 4 complete: 
2026-02-21T08:12:54.0268124Z error=1
2026-02-21T08:12:54.0272572Z ok=79
2026-02-21T08:12:54.0276478Z min=0.0604
2026-02-21T08:12:54.0281015Z mid=0.0696
2026-02-21T08:12:54.0284917Z max=0.1322
2026-02-21T08:12:54.0286412Z best={'block_sizes': [1024, 1],
2026-02-21T08:12:54.0286717Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:12:54.0289724Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:12:54.0289959Z  'num_stages': 1,
2026-02-21T08:12:54.0290105Z  'num_warps': 4,
2026-02-21T08:12:54.0290253Z  'pid_type': 'flat',
2026-02-21T08:12:54.0290411Z  'range_flattens': [None, None],
2026-02-21T08:12:54.0291017Z  'range_multi_buffers': [None, True],
2026-02-21T08:12:54.0291231Z  'range_num_stages': [0, 3],
2026-02-21T08:12:54.0291404Z  'range_unroll_factors': [0, 1],
2026-02-21T08:12:54.0291586Z  'range_warp_specializes': [None, True]}
2026-02-21T08:12:54.0291806Z [118s] Fitting surrogate: 437 points, 437 targets
2026-02-21T08:12:55.0173603Z [119s] Generation 5 starting: 70 neighbors, 5 active search path(s)
2026-02-21T08:13:00.7318807Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 6.2 configs/s
2026-02-21T08:13:04.9773114Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 73/73 17.4 configs/s
2026-02-21T08:13:12.4134181Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 136.0 configs/s
2026-02-21T08:13:12.7733467Z [137s] Generation 5 complete: 
2026-02-21T08:13:12.7737870Z ok=75
2026-02-21T08:13:12.7741723Z min=0.0624
2026-02-21T08:13:12.7744850Z mid=0.0726
2026-02-21T08:13:12.7748821Z max=0.2263
2026-02-21T08:13:12.7750406Z best={'block_sizes': [1024, 1],
2026-02-21T08:13:12.7750688Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:13:12.7750944Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:13:12.7751154Z  'num_stages': 1,
2026-02-21T08:13:12.7751313Z  'num_warps': 4,
2026-02-21T08:13:12.7751462Z  'pid_type': 'flat',
2026-02-21T08:13:12.7751636Z  'range_flattens': [None, None],
2026-02-21T08:13:12.7751822Z  'range_multi_buffers': [None, True],
2026-02-21T08:13:12.7752099Z  'range_num_stages': [0, 3],
2026-02-21T08:13:12.7752275Z  'range_unroll_factors': [0, 1],
2026-02-21T08:13:12.7752471Z  'range_warp_specializes': [None, True]}
2026-02-21T08:13:12.7752701Z [137s] Fitting surrogate: 512 points, 512 targets
2026-02-21T08:13:13.6527774Z [138s] Generation 6 starting: 51 neighbors, 4 active search path(s)
2026-02-21T08:13:18.1085863Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 5.4 configs/s
2026-02-21T08:13:21.1280833Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 17.5 configs/s
2026-02-21T08:13:27.0178495Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 179.0 configs/s
2026-02-21T08:13:27.2826117Z [151s] Generation 6 complete: 
2026-02-21T08:13:27.2828017Z ok=55
2026-02-21T08:13:27.2828258Z min=0.0624
2026-02-21T08:13:27.2832414Z mid=0.0646
2026-02-21T08:13:27.2836322Z max=0.2303
2026-02-21T08:13:27.2841050Z best={'block_sizes': [2048, 1],
2026-02-21T08:13:27.2845579Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:13:27.2847148Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:13:27.2847428Z  'num_stages': 1,
2026-02-21T08:13:27.2852485Z  'num_warps': 8,
2026-02-21T08:13:27.2854620Z  'pid_type': 'flat',
2026-02-21T08:13:27.2854831Z  'range_flattens': [None, False],
2026-02-21T08:13:27.2855028Z  'range_multi_buffers': [None, True],
2026-02-21T08:13:27.2855263Z  'range_num_stages': [0, 3],
2026-02-21T08:13:27.2855854Z  'range_unroll_factors': [0, 1],
2026-02-21T08:13:27.2856047Z  'range_warp_specializes': [None, True]}
2026-02-21T08:13:27.2856330Z [151s] Fitting surrogate: 567 points, 567 targets
2026-02-21T08:13:27.8487395Z [152s] Generation 7 starting: 24 neighbors, 2 active search path(s)
2026-02-21T08:13:29.6097197Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25/25 20.5 configs/s
2026-02-21T08:13:31.0620069Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 25/25 17.8 configs/s
2026-02-21T08:13:33.8545070Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 361.1 configs/s
2026-02-21T08:13:33.9919309Z [158s] Generation 7 complete: 
2026-02-21T08:13:33.9924182Z ok=27
2026-02-21T08:13:33.9926124Z min=0.0624
2026-02-21T08:13:33.9930463Z mid=0.0688
2026-02-21T08:13:33.9934219Z max=0.1669
2026-02-21T08:13:33.9938616Z best={'block_sizes': [2048, 1],
2026-02-21T08:13:33.9941608Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:13:33.9941967Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:13:33.9942172Z  'num_stages': 1,
2026-02-21T08:13:33.9942313Z  'num_warps': 8,
2026-02-21T08:13:33.9942459Z  'pid_type': 'flat',
2026-02-21T08:13:33.9942614Z  'range_flattens': [None, False],
2026-02-21T08:13:33.9942795Z  'range_multi_buffers': [None, True],
2026-02-21T08:13:33.9942969Z  'range_num_stages': [0, 3],
2026-02-21T08:13:33.9943133Z  'range_unroll_factors': [0, 1],
2026-02-21T08:13:33.9943311Z  'range_warp_specializes': [None, True]}
2026-02-21T08:13:33.9943507Z [158s] Fitting surrogate: 594 points, 594 targets
2026-02-21T08:13:34.4725561Z [159s] Generation 8 starting: 20 neighbors, 2 active search path(s)
2026-02-21T08:14:05.2519152Z [189s] Timeout after 30s compiling Config(block_sizes=[2048, 4], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=6, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[None, None])
2026-02-21T08:14:05.2534664Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 0.3 configs/s
2026-02-21T08:14:06.4147079Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 18.9 configs/s
2026-02-21T08:14:08.8177140Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 419.4 configs/s
2026-02-21T08:14:08.9470828Z [193s] Generation 8 complete: 
2026-02-21T08:14:08.9474705Z timeout=1
2026-02-21T08:14:08.9476634Z ok=22
2026-02-21T08:14:08.9476846Z min=0.0624
2026-02-21T08:14:08.9482358Z mid=0.0644
2026-02-21T08:14:08.9486720Z max=0.0829
2026-02-21T08:14:08.9488299Z best={'block_sizes': [2048, 1],
2026-02-21T08:14:08.9488636Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:14:08.9493353Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:14:08.9493649Z  'num_stages': 1,
2026-02-21T08:14:08.9493832Z  'num_warps': 8,
2026-02-21T08:14:08.9493993Z  'pid_type': 'flat',
2026-02-21T08:14:08.9494180Z  'range_flattens': [None, False],
2026-02-21T08:14:08.9499135Z  'range_multi_buffers': [None, True],
2026-02-21T08:14:08.9504334Z  'range_num_stages': [0, 3],
2026-02-21T08:14:08.9508769Z  'range_unroll_factors': [0, 1],
2026-02-21T08:14:08.9512650Z  'range_warp_specializes': [None, True]}
2026-02-21T08:14:08.9517690Z [193s] Fitting surrogate: 617 points, 617 targets
2026-02-21T08:14:09.2981525Z [194s] Generation 9 starting: 13 neighbors, 1 active search path(s)
2026-02-21T08:14:15.3992819Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 1.3 configs/s
2026-02-21T08:14:16.1632726Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 13/13 18.1 configs/s
2026-02-21T08:14:17.5385322Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 730.0 configs/s
2026-02-21T08:14:17.6117190Z [202s] Generation 9 complete: 
2026-02-21T08:14:17.6120624Z ok=15
2026-02-21T08:14:17.6123388Z min=0.0624
2026-02-21T08:14:17.6126160Z mid=0.0625
2026-02-21T08:14:17.6129582Z max=0.3105
2026-02-21T08:14:17.6129777Z best={'block_sizes': [2048, 1],
2026-02-21T08:14:17.6130002Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:14:17.6130245Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:14:17.6130435Z  'num_stages': 1,
2026-02-21T08:14:17.6130572Z  'num_warps': 8,
2026-02-21T08:14:17.6130713Z  'pid_type': 'flat',
2026-02-21T08:14:17.6130866Z  'range_flattens': [None, False],
2026-02-21T08:14:17.6131050Z  'range_multi_buffers': [None, True],
2026-02-21T08:14:17.6131226Z  'range_num_stages': [0, 3],
2026-02-21T08:14:17.6131393Z  'range_unroll_factors': [0, 1],
2026-02-21T08:14:17.6131592Z  'range_warp_specializes': [None, True]}
2026-02-21T08:14:17.6133438Z [202s] Fitting surrogate: 632 points, 632 targets
2026-02-21T08:14:18.0095214Z [202s] Generation 10 starting: 11 neighbors, 1 active search path(s)
2026-02-21T08:14:18.5545670Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 25.3 configs/s
2026-02-21T08:14:19.1826644Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 19.0 configs/s
2026-02-21T08:14:20.5471640Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 736.2 configs/s
2026-02-21T08:14:20.6467157Z [205s] Generation 10 complete: 
2026-02-21T08:14:20.6467455Z ok=13
2026-02-21T08:14:20.6467646Z min=0.0624
2026-02-21T08:14:20.6467832Z mid=0.0625
2026-02-21T08:14:20.6468011Z max=0.0687
2026-02-21T08:14:20.6468212Z best={'block_sizes': [2048, 1],
2026-02-21T08:14:20.6468652Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:14:20.6469075Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:14:20.6469409Z  'num_stages': 1,
2026-02-21T08:14:20.6469626Z  'num_warps': 8,
2026-02-21T08:14:20.6469835Z  'pid_type': 'flat',
2026-02-21T08:14:20.6470076Z  'range_flattens': [None, False],
2026-02-21T08:14:20.6470349Z  'range_multi_buffers': [None, True],
2026-02-21T08:14:20.6470639Z  'range_num_stages': [0, 3],
2026-02-21T08:14:20.6470898Z  'range_unroll_factors': [0, 1],
2026-02-21T08:14:20.6471182Z  'range_warp_specializes': [None, True]}
2026-02-21T08:14:20.6497534Z [205s] Fitting surrogate: 645 points, 645 targets
2026-02-21T08:14:20.9183245Z [205s] Autotuning complete in 205.6s after searching 611 configs.
2026-02-21T08:14:20.9187728Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:14:20.9192569Z     @helion.kernel(config=helion.Config(block_sizes=[2048, 1], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=1, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:14:20.9193847Z 
2026-02-21T08:14:20.9199984Z [205s] Code of selected kernel: /tmp/torchinductor_root/kn/ckn2rv4kwlogyvhbtasra32a2ixoldjoc2cgkod7mj3op4wnp5w3.py
2026-02-21T08:14:21.8348638Z WARNING:tritonbench.utils.triton_op:Completed input ID 1:
2026-02-21T08:14:21.8352796Z (B, T, V)
2026-02-21T08:14:21.8357250Z --------------
2026-02-21T08:14:21.8360621Z (8, 512, 8192)
2026-02-21T08:14:21.8364606Z 
2026-02-21T08:14:21.8366986Z  33%|███▎      | 2/6 [06:09<12:38, 189.69s/it]
2026-02-21T08:14:21.8366986Z WARNING:tritonbench.utils.triton_op:Running input ID 2:
2026-02-21T08:14:21.8367468Z (B, T, V)
2026-02-21T08:14:21.8367656Z ---------------
2026-02-21T08:14:21.8372233Z (8, 512, 16384)
2026-02-21T08:14:21.8372556Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for torch_kl_div
2026-02-21T08:14:22.9761309Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for liger_kl_div
2026-02-21T08:14:24.3426268Z INFO:tritonbench.utils.triton_op:Took 2.56ms to get benchmark function for torch_compile_kl_div
2026-02-21T08:14:28.0402546Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:14:28.0403888Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:14:28.0404123Z               'dtype': 'torch.float32',
2026-02-21T08:14:28.0404313Z               'shape': (4096, 16384),
2026-02-21T08:14:28.0404501Z               'stride': (16384, 1)},
2026-02-21T08:14:28.0404670Z             { 'device': 'cuda:0',
2026-02-21T08:14:28.0404841Z               'dtype': 'torch.float32',
2026-02-21T08:14:28.0405023Z               'shape': (4096, 16384),
2026-02-21T08:14:28.0405186Z               'stride': (16384, 1)}),
2026-02-21T08:14:28.0405346Z   'kwargs': {}}
2026-02-21T08:14:28.0429364Z INFO:tritonbench.utils.triton_op:Took 3.22ms to get benchmark function for helion_kl_div_tritonbench
2026-02-21T08:14:28.3593507Z [0s] Autotune random seed: 2134765727
2026-02-21T08:14:28.5074679Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:15:00.7860780Z [32s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=32, num_sm_multiplier=128, num_stages=5, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[2, 4], range_unroll_factors=[0, 1], range_warp_specializes=[False, None])
2026-02-21T08:15:01.2981405Z [32s] Timeout after 30s compiling Config(block_sizes=[128, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', ''], num_stages=8, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None])
2026-02-21T08:15:02.0949821Z [33s] Timeout after 30s compiling Config(block_sizes=[2048, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', ''], maxnreg=64, num_sm_multiplier=128, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[None, None])
2026-02-21T08:15:02.2762955Z [33s] Timeout after 30s compiling Config(block_sizes=[512, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=128, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, None], range_num_stages=[4, 1], range_unroll_factors=[3, 4], range_warp_specializes=[False, None])
2026-02-21T08:15:02.3881234Z [33s] Timeout after 30s compiling Config(block_sizes=[1024, 16], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'first'], num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[4, 4], range_unroll_factors=[3, 0], range_warp_specializes=[False, False])
2026-02-21T08:15:02.4974407Z [33s] Timeout after 30s compiling Config(block_sizes=[512, 256], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=6, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, True])
2026-02-21T08:15:02.4990852Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s
2026-02-21T08:15:02.7415672Z module {
2026-02-21T08:15:02.7416695Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:15:02.7422209Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:15:02.7427924Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:15:02.7431959Z     %cst = arith.constant dense<0.000000e+00> : tensor<512x32xf32>
2026-02-21T08:15:02.7433890Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:15:02.7434121Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:15:02.7434319Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T08:15:02.7434508Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T08:15:02.7434698Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:15:02.7435009Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<512x32xf32>>
2026-02-21T08:15:02.7435437Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<512x32xf32>>
2026-02-21T08:15:02.7435750Z     %2 = tt.get_program_id x : i32
2026-02-21T08:15:02.7435940Z     %3 = arith.muli %2, %c512_i32 : i32
2026-02-21T08:15:02.7436168Z     %4 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:15:02.7436406Z     %5 = tt.splat %3 : i32 -> tensor<512xi32>
2026-02-21T08:15:02.7436599Z     %6 = arith.addi %5, %4 : tensor<512xi32>
2026-02-21T08:15:02.7436898Z     %7 = scf.for %arg5 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg6 = %cst) -> (tensor<512x32xf32>)  : i32 {
2026-02-21T08:15:02.7437306Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<512x32xf32>> -> tensor<512x32xf32>
2026-02-21T08:15:02.7437677Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<512x32xf32>> -> tensor<512x32xf32>
2026-02-21T08:15:02.7437958Z       %13 = scf.if %arg3 -> (tensor<512x32xf32>) {
2026-02-21T08:15:02.7438324Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x32xf32>) -> tensor<512x32xf32>
2026-02-21T08:15:02.7438709Z         %16 = arith.subf %12, %11 : tensor<512x32xf32>
2026-02-21T08:15:02.7438920Z         %17 = arith.mulf %15, %16 : tensor<512x32xf32>
2026-02-21T08:15:02.7439130Z         %18 = arith.addf %17, %cst : tensor<512x32xf32>
2026-02-21T08:15:02.7439323Z         scf.yield %18 : tensor<512x32xf32>
2026-02-21T08:15:02.7439494Z       } else {
2026-02-21T08:15:02.7439651Z         %15 = tt.splat %arg4 : f32 -> tensor<512x32xf32>
2026-02-21T08:15:02.7439872Z         %16 = arith.cmpf ogt, %12, %15 : tensor<512x32xf32>
2026-02-21T08:15:02.7440090Z         %17 = arith.cmpf une, %12, %12 : tensor<512x32xf32>
2026-02-21T08:15:02.7440301Z         %18 = arith.ori %16, %17 : tensor<512x32xi1>
2026-02-21T08:15:02.7440545Z         %19 = arith.select %18, %12, %15 : tensor<512x32xi1>, tensor<512x32xf32>
2026-02-21T08:15:02.7440783Z         %20 = math.log %19 : tensor<512x32xf32>
2026-02-21T08:15:02.7440981Z         %21 = arith.subf %20, %11 : tensor<512x32xf32>
2026-02-21T08:15:02.7441175Z         %22 = arith.mulf %12, %21 : tensor<512x32xf32>
2026-02-21T08:15:02.7441709Z         %23 = arith.addf %22, %cst : tensor<512x32xf32>
2026-02-21T08:15:02.7441974Z         scf.yield %23 : tensor<512x32xf32>
2026-02-21T08:15:02.7442148Z       }
2026-02-21T08:15:02.7442286Z       %14 = arith.addf %arg6, %13 : tensor<512x32xf32>
2026-02-21T08:15:02.7442484Z       scf.yield %14 : tensor<512x32xf32>
2026-02-21T08:15:02.7442805Z     } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:15:02.7443131Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:15:02.7443327Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:15:02.7443503Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:15:02.7443694Z       tt.reduce.return %11 : f32
2026-02-21T08:15:02.7443874Z     }) : (tensor<512x32xf32>) -> tensor<512xf32>
2026-02-21T08:15:02.7444110Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<512x!tt.ptr<f32>>
2026-02-21T08:15:02.7444438Z     %10 = tt.addptr %9, %6 : tensor<512x!tt.ptr<f32>>, tensor<512xi32>
2026-02-21T08:15:02.7444680Z     tt.store %10, %8 : tensor<512x!tt.ptr<f32>>
2026-02-21T08:15:02.7444872Z     tt.return
2026-02-21T08:15:02.7445003Z   }
2026-02-21T08:15:02.7445134Z }
2026-02-21T08:15:02.7445204Z 
2026-02-21T08:15:02.7445257Z {-#
2026-02-21T08:15:02.7445398Z   external_resources: {
2026-02-21T08:15:02.7445557Z     mlir_reproducer: {
2026-02-21T08:15:02.7449886Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:15:02.7454270Z       disable_threading: false,
2026-02-21T08:15:02.7454438Z       verify_each: true
2026-02-21T08:15:02.7454576Z     }
2026-02-21T08:15:02.7454698Z   }
2026-02-21T08:15:02.7454806Z #-}
2026-02-21T08:15:02.7455220Z /tmp/torchinductor_root/gz/cgztxd3tqx7ki7scwmrlgdnafqpeglbx7qjnld2ruexcgpkellr2.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:15:02.7456402Z /tmp/torchinductor_root/gz/cgztxd3tqx7ki7scwmrlgdnafqpeglbx7qjnld2ruexcgpkellr2.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:15:02.7457386Z [34s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:15:02.7458432Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:15:02.7459300Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:15:02.7459584Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:15:03.5012878Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:15:03.5015411Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:15:03.5016892Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:15:03.5017256Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:15:03.5018773Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:15:03.5019100Z     %c2368_i32 = arith.constant 2368 : i32
2026-02-21T08:15:03.5019404Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x32xf32>
2026-02-21T08:15:03.5019689Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:15:03.5019913Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:15:03.5020146Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T08:15:03.5020389Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T08:15:03.5020612Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:15:03.5020969Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<16x32xf32>>
2026-02-21T08:15:03.5021474Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<16x32xf32>>
2026-02-21T08:15:03.5021965Z     %2 = tt.get_program_id x : i32
2026-02-21T08:15:03.5022176Z     %3 = arith.subi %c256_i32, %2 : i32
2026-02-21T08:15:03.5022378Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:15:03.5022571Z     %4 = arith.subi %c2368_i32, %c1_i32 : i32
2026-02-21T08:15:03.5022774Z     %5 = arith.addi %3, %4 : i32
2026-02-21T08:15:03.5022957Z     %6 = arith.divui %5, %c2368_i32 : i32
2026-02-21T08:15:03.5023166Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:15:03.5023372Z     %7 = arith.remsi %6, %c4_i32 : i32
2026-02-21T08:15:03.5023593Z     %8 = arith.subi %6, %7 : i32
2026-02-21T08:15:03.5023800Z     %9 = arith.muli %8, %c2368_i32 : i32
2026-02-21T08:15:03.5024005Z     %10 = arith.addi %2, %9 : i32
2026-02-21T08:15:03.5024217Z     %11 = arith.muli %c2368_i32, %c4_i32 : i32
2026-02-21T08:15:03.5024438Z     scf.for %arg5 = %2 to %10 step %11  : i32 {
2026-02-21T08:15:03.5024652Z       %12 = arith.muli %arg5, %c16_i32 : i32
2026-02-21T08:15:03.5024895Z       %13 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:15:03.5025207Z       %14 = tt.splat %12 : i32 -> tensor<16xi32>
2026-02-21T08:15:03.5025420Z       %15 = arith.addi %14, %13 : tensor<16xi32>
2026-02-21T08:15:03.5025777Z       %16 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>)  : i32 {
2026-02-21T08:15:03.5026243Z         %50 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:15:03.5026760Z         %51 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:15:03.5027103Z         %52 = scf.if %arg3 -> (tensor<16x32xf32>) {
2026-02-21T08:15:03.5027517Z           %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32>
2026-02-21T08:15:03.5027977Z           %55 = arith.subf %51, %50 : tensor<16x32xf32>
2026-02-21T08:15:03.5028239Z           %56 = arith.mulf %54, %55 : tensor<16x32xf32>
2026-02-21T08:15:03.5028495Z           %57 = arith.addf %56, %cst : tensor<16x32xf32>
2026-02-21T08:15:03.5028990Z           scf.yield %57 : tensor<16x32xf32>
2026-02-21T08:15:03.5029178Z         } else {
2026-02-21T08:15:03.5029362Z           %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32>
2026-02-21T08:15:03.5029605Z           %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32>
2026-02-21T08:15:03.5029843Z           %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32>
2026-02-21T08:15:03.5030071Z           %57 = arith.ori %55, %56 : tensor<16x32xi1>
2026-02-21T08:15:03.5030317Z           %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32>
2026-02-21T08:15:03.5030572Z           %59 = math.log %58 : tensor<16x32xf32>
2026-02-21T08:15:03.5030770Z           %60 = arith.subf %59, %50 : tensor<16x32xf32>
2026-02-21T08:15:03.5031019Z           %61 = arith.mulf %51, %60 : tensor<16x32xf32>
2026-02-21T08:15:03.5031235Z           %62 = arith.addf %61, %cst : tensor<16x32xf32>
2026-02-21T08:15:03.5031526Z           scf.yield %62 : tensor<16x32xf32>
2026-02-21T08:15:03.5031705Z         }
2026-02-21T08:15:03.5031902Z         %53 = arith.addf %arg7, %52 : tensor<16x32xf32>
2026-02-21T08:15:03.5032101Z         scf.yield %53 : tensor<16x32xf32>
2026-02-21T08:15:03.5032323Z       } {tt.disallow_acc_multi_buffer, tt.warp_specialize}
2026-02-21T08:15:03.5032544Z       %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({
2026-02-21T08:15:03.5032744Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:15:03.5032921Z         %50 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:15:03.5033116Z         tt.reduce.return %50 : f32
2026-02-21T08:15:03.5033310Z       }) : (tensor<16x32xf32>) -> tensor<16xf32>
2026-02-21T08:15:03.5033542Z       %18 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:15:03.5033817Z       %19 = tt.addptr %18, %15 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:15:03.5034056Z       tt.store %19, %17 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:15:03.5034262Z       %c1_i32_0 = arith.constant 1 : i32
2026-02-21T08:15:03.5034456Z       %20 = arith.muli %c2368_i32, %c1_i32_0 : i32
2026-02-21T08:15:03.5034659Z       %21 = arith.addi %arg5, %20 : i32
2026-02-21T08:15:03.5034847Z       %22 = arith.muli %21, %c16_i32 : i32
2026-02-21T08:15:03.5035077Z       %23 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:15:03.5035333Z       %24 = tt.splat %22 : i32 -> tensor<16xi32>
2026-02-21T08:15:03.5035528Z       %25 = arith.addi %24, %23 : tensor<16xi32>
2026-02-21T08:15:03.5035854Z       %26 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>)  : i32 {
2026-02-21T08:15:03.5036261Z         %50 = tt.descriptor_load %0[%22, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:15:03.5036643Z         %51 = tt.descriptor_load %1[%22, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:15:03.5036941Z         %52 = scf.if %arg3 -> (tensor<16x32xf32>) {
2026-02-21T08:15:03.5037307Z           %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32>
2026-02-21T08:15:03.5037690Z           %55 = arith.subf %51, %50 : tensor<16x32xf32>
2026-02-21T08:15:03.5037898Z           %56 = arith.mulf %54, %55 : tensor<16x32xf32>
2026-02-21T08:15:03.5038115Z           %57 = arith.addf %56, %cst : tensor<16x32xf32>
2026-02-21T08:15:03.5038311Z           scf.yield %57 : tensor<16x32xf32>
2026-02-21T08:15:03.5038513Z         } else {
2026-02-21T08:15:03.5038685Z           %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32>
2026-02-21T08:15:03.5038905Z           %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32>
2026-02-21T08:15:03.5039152Z           %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32>
2026-02-21T08:15:03.5039364Z           %57 = arith.ori %55, %56 : tensor<16x32xi1>
2026-02-21T08:15:03.5039610Z           %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32>
2026-02-21T08:15:03.5039857Z           %59 = math.log %58 : tensor<16x32xf32>
2026-02-21T08:15:03.5040164Z           %60 = arith.subf %59, %50 : tensor<16x32xf32>
2026-02-21T08:15:03.5040375Z           %61 = arith.mulf %51, %60 : tensor<16x32xf32>
2026-02-21T08:15:03.5040580Z           %62 = arith.addf %61, %cst : tensor<16x32xf32>
2026-02-21T08:15:03.5040785Z           scf.yield %62 : tensor<16x32xf32>
2026-02-21T08:15:03.5040954Z         }
2026-02-21T08:15:03.5041106Z         %53 = arith.addf %arg7, %52 : tensor<16x32xf32>
2026-02-21T08:15:03.5041302Z         scf.yield %53 : tensor<16x32xf32>
2026-02-21T08:15:03.5041525Z       } {tt.disallow_acc_multi_buffer, tt.warp_specialize}
2026-02-21T08:15:03.5041745Z       %27 = "tt.reduce"(%26) <{axis = 1 : i32}> ({
2026-02-21T08:15:03.5041987Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:15:03.5042238Z         %50 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:15:03.5042425Z         tt.reduce.return %50 : f32
2026-02-21T08:15:03.5042615Z       }) : (tensor<16x32xf32>) -> tensor<16xf32>
2026-02-21T08:15:03.5042903Z       %28 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:15:03.5043182Z       %29 = tt.addptr %28, %25 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:15:03.5043427Z       tt.store %29, %27 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:15:03.5043639Z       %c2_i32 = arith.constant 2 : i32
2026-02-21T08:15:03.5043835Z       %30 = arith.muli %c2368_i32, %c2_i32 : i32
2026-02-21T08:15:03.5044026Z       %31 = arith.addi %arg5, %30 : i32
2026-02-21T08:15:03.5044218Z       %32 = arith.muli %31, %c16_i32 : i32
2026-02-21T08:15:03.5044447Z       %33 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:15:03.5044703Z       %34 = tt.splat %32 : i32 -> tensor<16xi32>
2026-02-21T08:15:03.5044901Z       %35 = arith.addi %34, %33 : tensor<16xi32>
2026-02-21T08:15:03.5045222Z       %36 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>)  : i32 {
2026-02-21T08:15:03.5045642Z         %50 = tt.descriptor_load %0[%32, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:15:03.5046025Z         %51 = tt.descriptor_load %1[%32, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:15:03.5046334Z         %52 = scf.if %arg3 -> (tensor<16x32xf32>) {
2026-02-21T08:15:03.5046707Z           %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32>
2026-02-21T08:15:03.5047094Z           %55 = arith.subf %51, %50 : tensor<16x32xf32>
2026-02-21T08:15:03.5047317Z           %56 = arith.mulf %54, %55 : tensor<16x32xf32>
2026-02-21T08:15:03.5047533Z           %57 = arith.addf %56, %cst : tensor<16x32xf32>
2026-02-21T08:15:03.5047748Z           scf.yield %57 : tensor<16x32xf32>
2026-02-21T08:15:03.5047927Z         } else {
2026-02-21T08:15:03.5048111Z           %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32>
2026-02-21T08:15:03.5048339Z           %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32>
2026-02-21T08:15:03.5048595Z           %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32>
2026-02-21T08:15:03.5048811Z           %57 = arith.ori %55, %56 : tensor<16x32xi1>
2026-02-21T08:15:03.5049056Z           %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32>
2026-02-21T08:15:03.5049319Z           %59 = math.log %58 : tensor<16x32xf32>
2026-02-21T08:15:03.5049515Z           %60 = arith.subf %59, %50 : tensor<16x32xf32>
2026-02-21T08:15:03.5049725Z           %61 = arith.mulf %51, %60 : tensor<16x32xf32>
2026-02-21T08:15:03.5049931Z           %62 = arith.addf %61, %cst : tensor<16x32xf32>
2026-02-21T08:15:03.5050133Z           scf.yield %62 : tensor<16x32xf32>
2026-02-21T08:15:03.5050301Z         }
2026-02-21T08:15:03.5050452Z         %53 = arith.addf %arg7, %52 : tensor<16x32xf32>
2026-02-21T08:15:03.5050651Z         scf.yield %53 : tensor<16x32xf32>
2026-02-21T08:15:03.5050863Z       } {tt.disallow_acc_multi_buffer, tt.warp_specialize}
2026-02-21T08:15:03.5051093Z       %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({
2026-02-21T08:15:03.5051349Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:15:03.5051535Z         %50 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:15:03.5051717Z         tt.reduce.return %50 : f32
2026-02-21T08:15:03.5051940Z       }) : (tensor<16x32xf32>) -> tensor<16xf32>
2026-02-21T08:15:03.5052170Z       %38 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:15:03.5052439Z       %39 = tt.addptr %38, %35 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:15:03.5052681Z       tt.store %39, %37 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:15:03.5052878Z       %c3_i32 = arith.constant 3 : i32
2026-02-21T08:15:03.5053069Z       %40 = arith.muli %c2368_i32, %c3_i32 : i32
2026-02-21T08:15:03.5053255Z       %41 = arith.addi %arg5, %40 : i32
2026-02-21T08:15:03.5053435Z       %42 = arith.muli %41, %c16_i32 : i32
2026-02-21T08:15:03.5053657Z       %43 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:15:03.5053955Z       %44 = tt.splat %42 : i32 -> tensor<16xi32>
2026-02-21T08:15:03.5054159Z       %45 = arith.addi %44, %43 : tensor<16xi32>
2026-02-21T08:15:03.5054463Z       %46 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>)  : i32 {
2026-02-21T08:15:03.5054865Z         %50 = tt.descriptor_load %0[%42, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:15:03.5055232Z         %51 = tt.descriptor_load %1[%42, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:15:03.5055529Z         %52 = scf.if %arg3 -> (tensor<16x32xf32>) {
2026-02-21T08:15:03.5055894Z           %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32>
2026-02-21T08:15:03.5056274Z           %55 = arith.subf %51, %50 : tensor<16x32xf32>
2026-02-21T08:15:03.5056485Z           %56 = arith.mulf %54, %55 : tensor<16x32xf32>
2026-02-21T08:15:03.5056692Z           %57 = arith.addf %56, %cst : tensor<16x32xf32>
2026-02-21T08:15:03.5056900Z           scf.yield %57 : tensor<16x32xf32>
2026-02-21T08:15:03.5057068Z         } else {
2026-02-21T08:15:03.5057235Z           %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32>
2026-02-21T08:15:03.5057451Z           %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32>
2026-02-21T08:15:03.5057678Z           %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32>
2026-02-21T08:15:03.5057895Z           %57 = arith.ori %55, %56 : tensor<16x32xi1>
2026-02-21T08:15:03.5058131Z           %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32>
2026-02-21T08:15:03.5058379Z           %59 = math.log %58 : tensor<16x32xf32>
2026-02-21T08:15:03.5058573Z           %60 = arith.subf %59, %50 : tensor<16x32xf32>
2026-02-21T08:15:03.5058781Z           %61 = arith.mulf %51, %60 : tensor<16x32xf32>
2026-02-21T08:15:03.5058982Z           %62 = arith.addf %61, %cst : tensor<16x32xf32>
2026-02-21T08:15:03.5059186Z           scf.yield %62 : tensor<16x32xf32>
2026-02-21T08:15:03.5059365Z         }
2026-02-21T08:15:03.5059518Z         %53 = arith.addf %arg7, %52 : tensor<16x32xf32>
2026-02-21T08:15:03.5059720Z         scf.yield %53 : tensor<16x32xf32>
2026-02-21T08:15:03.5059929Z       } {tt.disallow_acc_multi_buffer, tt.warp_specialize}
2026-02-21T08:15:03.5060154Z       %47 = "tt.reduce"(%46) <{axis = 1 : i32}> ({
2026-02-21T08:15:03.5060341Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:15:03.5060522Z         %50 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:15:03.5060704Z         tt.reduce.return %50 : f32
2026-02-21T08:15:03.5060895Z       }) : (tensor<16x32xf32>) -> tensor<16xf32>
2026-02-21T08:15:03.5061131Z       %48 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:15:03.5061396Z       %49 = tt.addptr %48, %45 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:15:03.5061645Z       tt.store %49, %47 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:15:03.5061829Z     }
2026-02-21T08:15:03.5062042Z     scf.for %arg5 = %10 to %c256_i32 step %c2368_i32  : i32 {
2026-02-21T08:15:03.5062264Z       %12 = arith.muli %arg5, %c16_i32 : i32
2026-02-21T08:15:03.5062603Z       %13 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:15:03.5062869Z       %14 = tt.splat %12 : i32 -> tensor<16xi32>
2026-02-21T08:15:03.5063087Z       %15 = arith.addi %14, %13 : tensor<16xi32>
2026-02-21T08:15:03.5063448Z       %16 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>)  : i32 {
2026-02-21T08:15:03.5063889Z         %20 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:15:03.5064302Z         %21 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc<tensor<16x32xf32>> -> tensor<16x32xf32>
2026-02-21T08:15:03.5064602Z         %22 = scf.if %arg3 -> (tensor<16x32xf32>) {
2026-02-21T08:15:03.5065060Z           %24 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32>
2026-02-21T08:15:03.5065515Z           %25 = arith.subf %21, %20 : tensor<16x32xf32>
2026-02-21T08:15:03.5065734Z           %26 = arith.mulf %24, %25 : tensor<16x32xf32>
2026-02-21T08:15:03.5065949Z           %27 = arith.addf %26, %cst : tensor<16x32xf32>
2026-02-21T08:15:03.5066159Z           scf.yield %27 : tensor<16x32xf32>
2026-02-21T08:15:03.5066350Z         } else {
2026-02-21T08:15:03.5066510Z           %24 = tt.splat %arg4 : f32 -> tensor<16x32xf32>
2026-02-21T08:15:03.5066733Z           %25 = arith.cmpf ogt, %21, %24 : tensor<16x32xf32>
2026-02-21T08:15:03.5066997Z           %26 = arith.cmpf une, %21, %21 : tensor<16x32xf32>
2026-02-21T08:15:03.5067227Z           %27 = arith.ori %25, %26 : tensor<16x32xi1>
2026-02-21T08:15:03.5067500Z           %28 = arith.select %27, %21, %24 : tensor<16x32xi1>, tensor<16x32xf32>
2026-02-21T08:15:03.5067753Z           %29 = math.log %28 : tensor<16x32xf32>
2026-02-21T08:15:03.5067954Z           %30 = arith.subf %29, %20 : tensor<16x32xf32>
2026-02-21T08:15:03.5068169Z           %31 = arith.mulf %21, %30 : tensor<16x32xf32>
2026-02-21T08:15:03.5068382Z           %32 = arith.addf %31, %cst : tensor<16x32xf32>
2026-02-21T08:15:03.5068578Z           scf.yield %32 : tensor<16x32xf32>
2026-02-21T08:15:03.5068755Z         }
2026-02-21T08:15:03.5068909Z         %23 = arith.addf %arg7, %22 : tensor<16x32xf32>
2026-02-21T08:15:03.5069112Z         scf.yield %23 : tensor<16x32xf32>
2026-02-21T08:15:03.5069334Z       } {tt.disallow_acc_multi_buffer, tt.warp_specialize}
2026-02-21T08:15:03.5069571Z       %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({
2026-02-21T08:15:03.5069786Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:15:03.5069962Z         %20 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:15:03.5070152Z         tt.reduce.return %20 : f32
2026-02-21T08:15:03.5070334Z       }) : (tensor<16x32xf32>) -> tensor<16xf32>
2026-02-21T08:15:03.5070567Z       %18 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:15:03.5070838Z       %19 = tt.addptr %18, %15 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:15:03.5071077Z       tt.store %19, %17 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:15:03.5071276Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:15:03.5071435Z     tt.return
2026-02-21T08:15:03.5071569Z   }
2026-02-21T08:15:03.5071686Z }
2026-02-21T08:15:03.5071761Z 
2026-02-21T08:15:03.5071812Z {-#
2026-02-21T08:15:03.5071984Z   external_resources: {
2026-02-21T08:15:03.5072143Z     mlir_reproducer: {
2026-02-21T08:15:03.5076650Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=1}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=1}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=1}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:15:03.5081112Z       disable_threading: false,
2026-02-21T08:15:03.5081286Z       verify_each: true
2026-02-21T08:15:03.5081426Z     }
2026-02-21T08:15:03.5081548Z   }
2026-02-21T08:15:03.5081657Z #-}
2026-02-21T08:15:03.5082123Z /tmp/torchinductor_root/fx/cfxhv7mynb2iqmyluksc2n3g4deg6txguqnoxbc4ybakkfpdvbpq.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:15:03.5083331Z /tmp/torchinductor_root/fx/cfxhv7mynb2iqmyluksc2n3g4deg6txguqnoxbc4ybakkfpdvbpq.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:15:03.5084339Z [34s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:15:03.5085450Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=32, num_sm_multiplier=16, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, False], range_num_stages=[0, 0], range_unroll_factors=[4, 0], range_warp_specializes=[False, True]), static_shapes=True)
2026-02-21T08:15:03.5086462Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:15:03.5086708Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:15:07.1647991Z module {
2026-02-21T08:15:07.1648895Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:15:07.1649600Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:15:07.1649823Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:15:07.1650025Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:15:07.1650257Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x256xf32>
2026-02-21T08:15:07.1650525Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:15:07.1650710Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:15:07.1650919Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T08:15:07.1651111Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T08:15:07.1651288Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:15:07.1651609Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<16x256xf32>>
2026-02-21T08:15:07.1652293Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<16x256xf32>>
2026-02-21T08:15:07.1652620Z     %2 = tt.get_program_id x : i32
2026-02-21T08:15:07.1653167Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:15:07.1653347Z     %4 = arith.minsi %3, %c256_i32 : i32
2026-02-21T08:15:07.1653551Z     scf.for %arg5 = %2 to %4 step %c1_i32  : i32 {
2026-02-21T08:15:07.1653766Z       %5 = arith.muli %arg5, %c16_i32 : i32
2026-02-21T08:15:07.1654003Z       %6 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:15:07.1654270Z       %7 = tt.splat %5 : i32 -> tensor<16xi32>
2026-02-21T08:15:07.1654468Z       %8 = arith.addi %7, %6 : tensor<16xi32>
2026-02-21T08:15:07.1654786Z       %9 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c256_i32 iter_args(%arg7 = %cst) -> (tensor<16x256xf32>)  : i32 {
2026-02-21T08:15:07.1655229Z         %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc<tensor<16x256xf32>> -> tensor<16x256xf32>
2026-02-21T08:15:07.1655621Z         %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc<tensor<16x256xf32>> -> tensor<16x256xf32>
2026-02-21T08:15:07.1656013Z         %15 = scf.if %arg3 -> (tensor<16x256xf32>) {
2026-02-21T08:15:07.1656420Z           %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32>
2026-02-21T08:15:07.1656807Z           %18 = arith.subf %14, %13 : tensor<16x256xf32>
2026-02-21T08:15:07.1657016Z           %19 = arith.mulf %17, %18 : tensor<16x256xf32>
2026-02-21T08:15:07.1657233Z           %20 = arith.addf %19, %cst : tensor<16x256xf32>
2026-02-21T08:15:07.1657434Z           scf.yield %20 : tensor<16x256xf32>
2026-02-21T08:15:07.1657612Z         } else {
2026-02-21T08:15:07.1657777Z           %17 = tt.splat %arg4 : f32 -> tensor<16x256xf32>
2026-02-21T08:15:07.1658105Z           %18 = arith.cmpf ogt, %14, %17 : tensor<16x256xf32>
2026-02-21T08:15:07.1658350Z           %19 = arith.cmpf une, %14, %14 : tensor<16x256xf32>
2026-02-21T08:15:07.1658575Z           %20 = arith.ori %18, %19 : tensor<16x256xi1>
2026-02-21T08:15:07.1658827Z           %21 = arith.select %20, %14, %17 : tensor<16x256xi1>, tensor<16x256xf32>
2026-02-21T08:15:07.1659074Z           %22 = math.log %21 : tensor<16x256xf32>
2026-02-21T08:15:07.1659277Z           %23 = arith.subf %22, %13 : tensor<16x256xf32>
2026-02-21T08:15:07.1659497Z           %24 = arith.mulf %14, %23 : tensor<16x256xf32>
2026-02-21T08:15:07.1659728Z           %25 = arith.addf %24, %cst : tensor<16x256xf32>
2026-02-21T08:15:07.1659943Z           scf.yield %25 : tensor<16x256xf32>
2026-02-21T08:15:07.1660129Z         }
2026-02-21T08:15:07.1660285Z         %16 = arith.addf %arg7, %15 : tensor<16x256xf32>
2026-02-21T08:15:07.1660479Z         scf.yield %16 : tensor<16x256xf32>
2026-02-21T08:15:07.1660710Z       } {tt.flatten, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:15:07.1660939Z       %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({
2026-02-21T08:15:07.1661134Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:15:07.1661316Z         %13 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:15:07.1661512Z         tt.reduce.return %13 : f32
2026-02-21T08:15:07.1661710Z       }) : (tensor<16x256xf32>) -> tensor<16xf32>
2026-02-21T08:15:07.1661989Z       %11 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:15:07.1662259Z       %12 = tt.addptr %11, %8 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:15:07.1662488Z       tt.store %12, %10 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:15:07.1662671Z     }
2026-02-21T08:15:07.1662791Z     tt.return
2026-02-21T08:15:07.1662921Z   }
2026-02-21T08:15:07.1663040Z }
2026-02-21T08:15:07.1663119Z 
2026-02-21T08:15:07.1663170Z {-#
2026-02-21T08:15:07.1663302Z   external_resources: {
2026-02-21T08:15:07.1663454Z     mlir_reproducer: {
2026-02-21T08:15:07.1667761Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:15:07.1672083Z       disable_threading: false,
2026-02-21T08:15:07.1672250Z       verify_each: true
2026-02-21T08:15:07.1672386Z     }
2026-02-21T08:15:07.1672505Z   }
2026-02-21T08:15:07.1672612Z #-}
2026-02-21T08:15:07.1673026Z /tmp/torchinductor_root/zv/czvcur7zmglxpsruw55sec2h27wqfefv4zsbjzrwcbfysarfkw6k.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:15:07.1678791Z /tmp/torchinductor_root/zv/czvcur7zmglxpsruw55sec2h27wqfefv4zsbjzrwcbfysarfkw6k.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:15:07.1679826Z [38s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:15:07.1680862Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:15:07.1681776Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:15:07.1682096Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:15:08.9224387Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:15:08.9225080Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:15:08.9225646Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:15:08.9225837Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:15:08.9226029Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:15:08.9226242Z     %cst = arith.constant dense<0.000000e+00> : tensor<4x4xf32>
2026-02-21T08:15:08.9226463Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:15:08.9226637Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:15:08.9226831Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T08:15:08.9227020Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T08:15:08.9227194Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:15:08.9227513Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<4x4xf32>>
2026-02-21T08:15:08.9228265Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<4x4xf32>>
2026-02-21T08:15:08.9228579Z     %2 = tt.get_program_id x : i32
2026-02-21T08:15:08.9228750Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:15:08.9228934Z     %4 = arith.minsi %3, %c1024_i32 : i32
2026-02-21T08:15:08.9229119Z     %5 = arith.subi %4, %2 : i32
2026-02-21T08:15:08.9229287Z     %c1_i32_0 = arith.constant 1 : i32
2026-02-21T08:15:08.9229469Z     %6 = arith.subi %c1_i32, %c1_i32_0 : i32
2026-02-21T08:15:08.9229640Z     %7 = arith.addi %5, %6 : i32
2026-02-21T08:15:08.9229808Z     %8 = arith.divui %7, %c1_i32 : i32
2026-02-21T08:15:08.9229976Z     %c4_i32_1 = arith.constant 4 : i32
2026-02-21T08:15:08.9230150Z     %9 = arith.remsi %8, %c4_i32_1 : i32
2026-02-21T08:15:08.9230316Z     %10 = arith.subi %8, %9 : i32
2026-02-21T08:15:08.9230487Z     %11 = arith.muli %10, %c1_i32 : i32
2026-02-21T08:15:08.9230749Z     %12 = arith.addi %2, %11 : i32
2026-02-21T08:15:08.9230929Z     %13 = arith.muli %c1_i32, %c4_i32_1 : i32
2026-02-21T08:15:08.9231132Z     scf.for %arg5 = %2 to %12 step %13  : i32 {
2026-02-21T08:15:08.9231380Z       %14 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:15:08.9231611Z       %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:15:08.9231949Z       %16 = tt.splat %14 : i32 -> tensor<4xi32>
2026-02-21T08:15:08.9232145Z       %17 = arith.addi %16, %15 : tensor<4xi32>
2026-02-21T08:15:08.9232457Z       %18 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:15:08.9232850Z         %52 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:15:08.9233216Z         %53 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:15:08.9233506Z         %54 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:15:08.9233871Z           %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:15:08.9234247Z           %57 = arith.subf %53, %52 : tensor<4x4xf32>
2026-02-21T08:15:08.9234447Z           %58 = arith.mulf %56, %57 : tensor<4x4xf32>
2026-02-21T08:15:08.9234659Z           %59 = arith.addf %58, %cst : tensor<4x4xf32>
2026-02-21T08:15:08.9234854Z           scf.yield %59 : tensor<4x4xf32>
2026-02-21T08:15:08.9235034Z         } else {
2026-02-21T08:15:08.9235198Z           %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:15:08.9235405Z           %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32>
2026-02-21T08:15:08.9235622Z           %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32>
2026-02-21T08:15:08.9235821Z           %59 = arith.ori %57, %58 : tensor<4x4xi1>
2026-02-21T08:15:08.9236055Z           %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:15:08.9236288Z           %61 = math.log %60 : tensor<4x4xf32>
2026-02-21T08:15:08.9236487Z           %62 = arith.subf %61, %52 : tensor<4x4xf32>
2026-02-21T08:15:08.9236684Z           %63 = arith.mulf %53, %62 : tensor<4x4xf32>
2026-02-21T08:15:08.9236879Z           %64 = arith.addf %63, %cst : tensor<4x4xf32>
2026-02-21T08:15:08.9237073Z           scf.yield %64 : tensor<4x4xf32>
2026-02-21T08:15:08.9237236Z         }
2026-02-21T08:15:08.9237384Z         %55 = arith.addf %arg7, %54 : tensor<4x4xf32>
2026-02-21T08:15:08.9237577Z         scf.yield %55 : tensor<4x4xf32>
2026-02-21T08:15:08.9237911Z       } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:15:08.9238252Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:15:08.9238453Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:15:08.9238640Z         %52 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:15:08.9238825Z         tt.reduce.return %52 : f32
2026-02-21T08:15:08.9239017Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:15:08.9239250Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:15:08.9239716Z       %21 = tt.addptr %20, %17 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:15:08.9239959Z       tt.store %21, %19 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:15:08.9240181Z       %c1_i32_2 = arith.constant 1 : i32
2026-02-21T08:15:08.9240380Z       %22 = arith.muli %c1_i32, %c1_i32_2 : i32
2026-02-21T08:15:08.9240587Z       %23 = arith.addi %arg5, %22 : i32
2026-02-21T08:15:08.9240769Z       %24 = arith.muli %23, %c4_i32 : i32
2026-02-21T08:15:08.9241007Z       %25 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:15:08.9241257Z       %26 = tt.splat %24 : i32 -> tensor<4xi32>
2026-02-21T08:15:08.9241449Z       %27 = arith.addi %26, %25 : tensor<4xi32>
2026-02-21T08:15:08.9241767Z       %28 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:15:08.9242261Z         %52 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:15:08.9242638Z         %53 = tt.descriptor_load %1[%24, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:15:08.9242927Z         %54 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:15:08.9243299Z           %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:15:08.9243671Z           %57 = arith.subf %53, %52 : tensor<4x4xf32>
2026-02-21T08:15:08.9243876Z           %58 = arith.mulf %56, %57 : tensor<4x4xf32>
2026-02-21T08:15:08.9244090Z           %59 = arith.addf %58, %cst : tensor<4x4xf32>
2026-02-21T08:15:08.9244288Z           scf.yield %59 : tensor<4x4xf32>
2026-02-21T08:15:08.9244468Z         } else {
2026-02-21T08:15:08.9244626Z           %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:15:08.9244849Z           %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32>
2026-02-21T08:15:08.9245075Z           %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32>
2026-02-21T08:15:08.9245285Z           %59 = arith.ori %57, %58 : tensor<4x4xi1>
2026-02-21T08:15:08.9245527Z           %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:15:08.9245762Z           %61 = math.log %60 : tensor<4x4xf32>
2026-02-21T08:15:08.9245959Z           %62 = arith.subf %61, %52 : tensor<4x4xf32>
2026-02-21T08:15:08.9246152Z           %63 = arith.mulf %53, %62 : tensor<4x4xf32>
2026-02-21T08:15:08.9246357Z           %64 = arith.addf %63, %cst : tensor<4x4xf32>
2026-02-21T08:15:08.9246553Z           scf.yield %64 : tensor<4x4xf32>
2026-02-21T08:15:08.9246723Z         }
2026-02-21T08:15:08.9246863Z         %55 = arith.addf %arg7, %54 : tensor<4x4xf32>
2026-02-21T08:15:08.9247045Z         scf.yield %55 : tensor<4x4xf32>
2026-02-21T08:15:08.9247353Z       } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:15:08.9247673Z       %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({
2026-02-21T08:15:08.9247867Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:15:08.9248036Z         %52 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:15:08.9248219Z         tt.reduce.return %52 : f32
2026-02-21T08:15:08.9248402Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:15:08.9248614Z       %30 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:15:08.9248869Z       %31 = tt.addptr %30, %27 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:15:08.9249094Z       tt.store %31, %29 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:15:08.9249287Z       %c2_i32 = arith.constant 2 : i32
2026-02-21T08:15:08.9249460Z       %32 = arith.muli %c1_i32, %c2_i32 : i32
2026-02-21T08:15:08.9249640Z       %33 = arith.addi %arg5, %32 : i32
2026-02-21T08:15:08.9249816Z       %34 = arith.muli %33, %c4_i32 : i32
2026-02-21T08:15:08.9250026Z       %35 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:15:08.9250262Z       %36 = tt.splat %34 : i32 -> tensor<4xi32>
2026-02-21T08:15:08.9250507Z       %37 = arith.addi %36, %35 : tensor<4xi32>
2026-02-21T08:15:08.9250847Z       %38 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:15:08.9251255Z         %52 = tt.descriptor_load %0[%34, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:15:08.9251639Z         %53 = tt.descriptor_load %1[%34, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:15:08.9251966Z         %54 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:15:08.9252324Z           %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:15:08.9252715Z           %57 = arith.subf %53, %52 : tensor<4x4xf32>
2026-02-21T08:15:08.9252930Z           %58 = arith.mulf %56, %57 : tensor<4x4xf32>
2026-02-21T08:15:08.9253167Z           %59 = arith.addf %58, %cst : tensor<4x4xf32>
2026-02-21T08:15:08.9253463Z           scf.yield %59 : tensor<4x4xf32>
2026-02-21T08:15:08.9253632Z         } else {
2026-02-21T08:15:08.9253795Z           %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:15:08.9254005Z           %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32>
2026-02-21T08:15:08.9254238Z           %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32>
2026-02-21T08:15:08.9254451Z           %59 = arith.ori %57, %58 : tensor<4x4xi1>
2026-02-21T08:15:08.9254702Z           %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:15:08.9254948Z           %61 = math.log %60 : tensor<4x4xf32>
2026-02-21T08:15:08.9255145Z           %62 = arith.subf %61, %52 : tensor<4x4xf32>
2026-02-21T08:15:08.9255363Z           %63 = arith.mulf %53, %62 : tensor<4x4xf32>
2026-02-21T08:15:08.9255577Z           %64 = arith.addf %63, %cst : tensor<4x4xf32>
2026-02-21T08:15:08.9255790Z           scf.yield %64 : tensor<4x4xf32>
2026-02-21T08:15:08.9255971Z         }
2026-02-21T08:15:08.9256122Z         %55 = arith.addf %arg7, %54 : tensor<4x4xf32>
2026-02-21T08:15:08.9256311Z         scf.yield %55 : tensor<4x4xf32>
2026-02-21T08:15:08.9256732Z       } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:15:08.9257076Z       %39 = "tt.reduce"(%38) <{axis = 1 : i32}> ({
2026-02-21T08:15:08.9257261Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:15:08.9257448Z         %52 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:15:08.9257621Z         tt.reduce.return %52 : f32
2026-02-21T08:15:08.9257799Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:15:08.9258009Z       %40 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:15:08.9258260Z       %41 = tt.addptr %40, %37 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:15:08.9258479Z       tt.store %41, %39 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:15:08.9258672Z       %c3_i32 = arith.constant 3 : i32
2026-02-21T08:15:08.9258856Z       %42 = arith.muli %c1_i32, %c3_i32 : i32
2026-02-21T08:15:08.9259032Z       %43 = arith.addi %arg5, %42 : i32
2026-02-21T08:15:08.9259204Z       %44 = arith.muli %43, %c4_i32 : i32
2026-02-21T08:15:08.9259409Z       %45 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:15:08.9259641Z       %46 = tt.splat %44 : i32 -> tensor<4xi32>
2026-02-21T08:15:08.9259823Z       %47 = arith.addi %46, %45 : tensor<4xi32>
2026-02-21T08:15:08.9260120Z       %48 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:15:08.9260499Z         %52 = tt.descriptor_load %0[%44, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:15:08.9260840Z         %53 = tt.descriptor_load %1[%44, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:15:08.9261118Z         %54 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:15:08.9261462Z           %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:15:08.9261906Z           %57 = arith.subf %53, %52 : tensor<4x4xf32>
2026-02-21T08:15:08.9262130Z           %58 = arith.mulf %56, %57 : tensor<4x4xf32>
2026-02-21T08:15:08.9262323Z           %59 = arith.addf %58, %cst : tensor<4x4xf32>
2026-02-21T08:15:08.9262515Z           scf.yield %59 : tensor<4x4xf32>
2026-02-21T08:15:08.9262717Z         } else {
2026-02-21T08:15:08.9262918Z           %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:15:08.9263213Z           %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32>
2026-02-21T08:15:08.9263468Z           %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32>
2026-02-21T08:15:08.9263670Z           %59 = arith.ori %57, %58 : tensor<4x4xi1>
2026-02-21T08:15:08.9263889Z           %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:15:08.9264123Z           %61 = math.log %60 : tensor<4x4xf32>
2026-02-21T08:15:08.9264307Z           %62 = arith.subf %61, %52 : tensor<4x4xf32>
2026-02-21T08:15:08.9264570Z           %63 = arith.mulf %53, %62 : tensor<4x4xf32>
2026-02-21T08:15:08.9264782Z           %64 = arith.addf %63, %cst : tensor<4x4xf32>
2026-02-21T08:15:08.9265038Z           scf.yield %64 : tensor<4x4xf32>
2026-02-21T08:15:08.9265269Z         }
2026-02-21T08:15:08.9265452Z         %55 = arith.addf %arg7, %54 : tensor<4x4xf32>
2026-02-21T08:15:08.9265668Z         scf.yield %55 : tensor<4x4xf32>
2026-02-21T08:15:08.9265973Z       } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:15:08.9266302Z       %49 = "tt.reduce"(%48) <{axis = 1 : i32}> ({
2026-02-21T08:15:08.9266491Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:15:08.9266675Z         %52 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:15:08.9266861Z         tt.reduce.return %52 : f32
2026-02-21T08:15:08.9267033Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:15:08.9267246Z       %50 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:15:08.9267495Z       %51 = tt.addptr %50, %47 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:15:08.9267722Z       tt.store %51, %49 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:15:08.9267906Z     } {tt.disallow_acc_multi_buffer}
2026-02-21T08:15:08.9268101Z     scf.for %arg5 = %12 to %4 step %c1_i32  : i32 {
2026-02-21T08:15:08.9268297Z       %14 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:15:08.9268510Z       %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:15:08.9268741Z       %16 = tt.splat %14 : i32 -> tensor<4xi32>
2026-02-21T08:15:08.9268921Z       %17 = arith.addi %16, %15 : tensor<4xi32>
2026-02-21T08:15:08.9269215Z       %18 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:15:08.9269597Z         %22 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:15:08.9269949Z         %23 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:15:08.9270230Z         %24 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:15:08.9270570Z           %26 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:15:08.9270922Z           %27 = arith.subf %23, %22 : tensor<4x4xf32>
2026-02-21T08:15:08.9271113Z           %28 = arith.mulf %26, %27 : tensor<4x4xf32>
2026-02-21T08:15:08.9271313Z           %29 = arith.addf %28, %cst : tensor<4x4xf32>
2026-02-21T08:15:08.9271506Z           scf.yield %29 : tensor<4x4xf32>
2026-02-21T08:15:08.9271665Z         } else {
2026-02-21T08:15:08.9271825Z           %26 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:15:08.9272059Z           %27 = arith.cmpf ogt, %23, %26 : tensor<4x4xf32>
2026-02-21T08:15:08.9272268Z           %28 = arith.cmpf une, %23, %23 : tensor<4x4xf32>
2026-02-21T08:15:08.9272462Z           %29 = arith.ori %27, %28 : tensor<4x4xi1>
2026-02-21T08:15:08.9272696Z           %30 = arith.select %29, %23, %26 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:15:08.9272995Z           %31 = math.log %30 : tensor<4x4xf32>
2026-02-21T08:15:08.9273186Z           %32 = arith.subf %31, %22 : tensor<4x4xf32>
2026-02-21T08:15:08.9273387Z           %33 = arith.mulf %23, %32 : tensor<4x4xf32>
2026-02-21T08:15:08.9273583Z           %34 = arith.addf %33, %cst : tensor<4x4xf32>
2026-02-21T08:15:08.9273779Z           scf.yield %34 : tensor<4x4xf32>
2026-02-21T08:15:08.9273940Z         }
2026-02-21T08:15:08.9274087Z         %25 = arith.addf %arg7, %24 : tensor<4x4xf32>
2026-02-21T08:15:08.9274278Z         scf.yield %25 : tensor<4x4xf32>
2026-02-21T08:15:08.9274596Z       } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:15:08.9274926Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:15:08.9275117Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:15:08.9275370Z         %22 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:15:08.9275553Z         tt.reduce.return %22 : f32
2026-02-21T08:15:08.9275732Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:15:08.9275939Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:15:08.9276193Z       %21 = tt.addptr %20, %17 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:15:08.9276422Z       tt.store %21, %19 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:15:08.9276641Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:15:08.9276846Z     tt.return
2026-02-21T08:15:08.9276964Z   }
2026-02-21T08:15:08.9277086Z }
2026-02-21T08:15:08.9277152Z 
2026-02-21T08:15:08.9277199Z {-#
2026-02-21T08:15:08.9277329Z   external_resources: {
2026-02-21T08:15:08.9277477Z     mlir_reproducer: {
2026-02-21T08:15:08.9281751Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:15:08.9286101Z       disable_threading: false,
2026-02-21T08:15:08.9286265Z       verify_each: true
2026-02-21T08:15:08.9286405Z     }
2026-02-21T08:15:08.9286517Z   }
2026-02-21T08:15:08.9286629Z #-}
2026-02-21T08:15:08.9287038Z /tmp/torchinductor_root/yn/cynvmjngx6ng2kx3ixz3yqs4355ndpgxqui52zyrc6jmswkt3tv3.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:15:08.9288223Z /tmp/torchinductor_root/yn/cynvmjngx6ng2kx3ixz3yqs4355ndpgxqui52zyrc6jmswkt3tv3.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:15:08.9289236Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:15:08.9290310Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'last'], maxnreg=128, num_sm_multiplier=64, num_stages=6, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[0, 3], range_unroll_factors=[4, 1], range_warp_specializes=[False, True]), static_shapes=True)
2026-02-21T08:15:08.9291286Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:15:08.9291585Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:15:09.8608250Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:15:09.8609277Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:15:09.8610213Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:15:09.8610501Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:15:09.8610810Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:15:09.8611092Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:15:09.8611442Z     %cst = arith.constant dense<0.000000e+00> : tensor<8x1024xf32>
2026-02-21T08:15:09.8611815Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:15:09.8612520Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:15:09.8612831Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T08:15:09.8613153Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T08:15:09.8613453Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:15:09.8613994Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<8x1024xf32>>
2026-02-21T08:15:09.8614750Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<8x1024xf32>>
2026-02-21T08:15:09.8615263Z     %2 = tt.get_program_id x : i32
2026-02-21T08:15:09.8615542Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:15:09.8615815Z     %4 = arith.minsi %3, %c512_i32 : i32
2026-02-21T08:15:09.8616094Z     %5 = arith.subi %4, %2 : i32
2026-02-21T08:15:09.8616354Z     %c1_i32_0 = arith.constant 1 : i32
2026-02-21T08:15:09.8616643Z     %6 = arith.subi %c1_i32, %c1_i32_0 : i32
2026-02-21T08:15:09.8616917Z     %7 = arith.addi %5, %6 : i32
2026-02-21T08:15:09.8617180Z     %8 = arith.divui %7, %c1_i32 : i32
2026-02-21T08:15:09.8617451Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:15:09.8617723Z     %9 = arith.remsi %8, %c2_i32 : i32
2026-02-21T08:15:09.8617997Z     %10 = arith.subi %8, %9 : i32
2026-02-21T08:15:09.8618260Z     %11 = arith.muli %10, %c1_i32 : i32
2026-02-21T08:15:09.8618538Z     %12 = arith.addi %2, %11 : i32
2026-02-21T08:15:09.8618803Z     %13 = arith.muli %c1_i32, %c2_i32 : i32
2026-02-21T08:15:09.8619110Z     scf.for %arg5 = %2 to %12 step %13  : i32 {
2026-02-21T08:15:09.8619406Z       %14 = arith.muli %arg5, %c8_i32 : i32
2026-02-21T08:15:09.8619767Z       %15 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:15:09.8620162Z       %16 = tt.splat %14 : i32 -> tensor<8xi32>
2026-02-21T08:15:09.8620458Z       %17 = arith.addi %16, %15 : tensor<8xi32>
2026-02-21T08:15:09.8620983Z       %18 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<8x1024xf32>)  : i32 {
2026-02-21T08:15:09.8621672Z         %32 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<8x1024xf32>> -> tensor<8x1024xf32>
2026-02-21T08:15:09.8622776Z         %33 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc<tensor<8x1024xf32>> -> tensor<8x1024xf32>
2026-02-21T08:15:09.8623369Z         %34 = scf.if %arg3 -> (tensor<8x1024xf32>) {
2026-02-21T08:15:09.8624224Z           %36 = tt.extern_elementwise %33 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T08:15:09.8624964Z           %37 = arith.subf %33, %32 : tensor<8x1024xf32>
2026-02-21T08:15:09.8625369Z           %38 = arith.mulf %36, %37 : tensor<8x1024xf32>
2026-02-21T08:15:09.8625863Z           %39 = arith.addf %38, %cst : tensor<8x1024xf32>
2026-02-21T08:15:09.8626387Z           scf.yield %39 : tensor<8x1024xf32>
2026-02-21T08:15:09.8626836Z         } else {
2026-02-21T08:15:09.8627140Z           %36 = tt.splat %arg4 : f32 -> tensor<8x1024xf32>
2026-02-21T08:15:09.8627731Z           %37 = arith.cmpf ogt, %33, %36 : tensor<8x1024xf32>
2026-02-21T08:15:09.8628332Z           %38 = arith.cmpf une, %33, %33 : tensor<8x1024xf32>
2026-02-21T08:15:09.8628736Z           %39 = arith.ori %37, %38 : tensor<8x1024xi1>
2026-02-21T08:15:09.8629280Z           %40 = arith.select %39, %33, %36 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T08:15:09.8629754Z           %41 = math.log %40 : tensor<8x1024xf32>
2026-02-21T08:15:09.8630164Z           %42 = arith.subf %41, %32 : tensor<8x1024xf32>
2026-02-21T08:15:09.8630598Z           %43 = arith.mulf %33, %42 : tensor<8x1024xf32>
2026-02-21T08:15:09.8631045Z           %44 = arith.addf %43, %cst : tensor<8x1024xf32>
2026-02-21T08:15:09.8631450Z           scf.yield %44 : tensor<8x1024xf32>
2026-02-21T08:15:09.8631817Z         }
2026-02-21T08:15:09.8632195Z         %35 = arith.addf %arg7, %34 : tensor<8x1024xf32>
2026-02-21T08:15:09.8632574Z         scf.yield %35 : tensor<8x1024xf32>
2026-02-21T08:15:09.8633164Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:15:09.8633754Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:15:09.8634145Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:15:09.8634577Z         %32 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:15:09.8634933Z         tt.reduce.return %32 : f32
2026-02-21T08:15:09.8635307Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T08:15:09.8635792Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<8x!tt.ptr<f32>>
2026-02-21T08:15:09.8636309Z       %21 = tt.addptr %20, %17 : tensor<8x!tt.ptr<f32>>, tensor<8xi32>
2026-02-21T08:15:09.8636787Z       tt.store %21, %19 : tensor<8x!tt.ptr<f32>>
2026-02-21T08:15:09.8637218Z       %c1_i32_1 = arith.constant 1 : i32
2026-02-21T08:15:09.8637621Z       %22 = arith.muli %c1_i32, %c1_i32_1 : i32
2026-02-21T08:15:09.8638100Z       %23 = arith.addi %arg5, %22 : i32
2026-02-21T08:15:09.8638520Z       %24 = arith.muli %23, %c8_i32 : i32
2026-02-21T08:15:09.8638964Z       %25 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:15:09.8639412Z       %26 = tt.splat %24 : i32 -> tensor<8xi32>
2026-02-21T08:15:09.8639868Z       %27 = arith.addi %26, %25 : tensor<8xi32>
2026-02-21T08:15:09.8640453Z       %28 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<8x1024xf32>)  : i32 {
2026-02-21T08:15:09.8641230Z         %32 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc<tensor<8x1024xf32>> -> tensor<8x1024xf32>
2026-02-21T08:15:09.8642045Z         %33 = tt.descriptor_load %1[%24, %arg6] : !tt.tensordesc<tensor<8x1024xf32>> -> tensor<8x1024xf32>
2026-02-21T08:15:09.8642593Z         %34 = scf.if %arg3 -> (tensor<8x1024xf32>) {
2026-02-21T08:15:09.8643307Z           %36 = tt.extern_elementwise %33 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T08:15:09.8643989Z           %37 = arith.subf %33, %32 : tensor<8x1024xf32>
2026-02-21T08:15:09.8644437Z           %38 = arith.mulf %36, %37 : tensor<8x1024xf32>
2026-02-21T08:15:09.8644896Z           %39 = arith.addf %38, %cst : tensor<8x1024xf32>
2026-02-21T08:15:09.8645380Z           scf.yield %39 : tensor<8x1024xf32>
2026-02-21T08:15:09.8645764Z         } else {
2026-02-21T08:15:09.8646089Z           %36 = tt.splat %arg4 : f32 -> tensor<8x1024xf32>
2026-02-21T08:15:09.8646555Z           %37 = arith.cmpf ogt, %33, %36 : tensor<8x1024xf32>
2026-02-21T08:15:09.8646999Z           %38 = arith.cmpf une, %33, %33 : tensor<8x1024xf32>
2026-02-21T08:15:09.8647460Z           %39 = arith.ori %37, %38 : tensor<8x1024xi1>
2026-02-21T08:15:09.8647981Z           %40 = arith.select %39, %33, %36 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T08:15:09.8648450Z           %41 = math.log %40 : tensor<8x1024xf32>
2026-02-21T08:15:09.8648890Z           %42 = arith.subf %41, %32 : tensor<8x1024xf32>
2026-02-21T08:15:09.8649265Z           %43 = arith.mulf %33, %42 : tensor<8x1024xf32>
2026-02-21T08:15:09.8649739Z           %44 = arith.addf %43, %cst : tensor<8x1024xf32>
2026-02-21T08:15:09.8650131Z           scf.yield %44 : tensor<8x1024xf32>
2026-02-21T08:15:09.8650565Z         }
2026-02-21T08:15:09.8650945Z         %35 = arith.addf %arg7, %34 : tensor<8x1024xf32>
2026-02-21T08:15:09.8651328Z         scf.yield %35 : tensor<8x1024xf32>
2026-02-21T08:15:09.8651901Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:15:09.8652486Z       %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({
2026-02-21T08:15:09.8652885Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:15:09.8653228Z         %32 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:15:09.8653665Z         tt.reduce.return %32 : f32
2026-02-21T08:15:09.8654059Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T08:15:09.8654484Z       %30 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<8x!tt.ptr<f32>>
2026-02-21T08:15:09.8655040Z       %31 = tt.addptr %30, %27 : tensor<8x!tt.ptr<f32>>, tensor<8xi32>
2026-02-21T08:15:09.8655476Z       tt.store %31, %29 : tensor<8x!tt.ptr<f32>>
2026-02-21T08:15:09.8655863Z     }
2026-02-21T08:15:09.8656177Z     scf.for %arg5 = %12 to %4 step %c1_i32  : i32 {
2026-02-21T08:15:09.8656594Z       %14 = arith.muli %arg5, %c8_i32 : i32
2026-02-21T08:15:09.8657058Z       %15 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:15:09.8657530Z       %16 = tt.splat %14 : i32 -> tensor<8xi32>
2026-02-21T08:15:09.8657957Z       %17 = arith.addi %16, %15 : tensor<8xi32>
2026-02-21T08:15:09.8658538Z       %18 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<8x1024xf32>)  : i32 {
2026-02-21T08:15:09.8659357Z         %22 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<8x1024xf32>> -> tensor<8x1024xf32>
2026-02-21T08:15:09.8660101Z         %23 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc<tensor<8x1024xf32>> -> tensor<8x1024xf32>
2026-02-21T08:15:09.8660645Z         %24 = scf.if %arg3 -> (tensor<8x1024xf32>) {
2026-02-21T08:15:09.8661383Z           %26 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32>
2026-02-21T08:15:09.8662382Z           %27 = arith.subf %23, %22 : tensor<8x1024xf32>
2026-02-21T08:15:09.8662823Z           %28 = arith.mulf %26, %27 : tensor<8x1024xf32>
2026-02-21T08:15:09.8663280Z           %29 = arith.addf %28, %cst : tensor<8x1024xf32>
2026-02-21T08:15:09.8663675Z           scf.yield %29 : tensor<8x1024xf32>
2026-02-21T08:15:09.8664061Z         } else {
2026-02-21T08:15:09.8664354Z           %26 = tt.splat %arg4 : f32 -> tensor<8x1024xf32>
2026-02-21T08:15:09.8664866Z           %27 = arith.cmpf ogt, %23, %26 : tensor<8x1024xf32>
2026-02-21T08:15:09.8665285Z           %28 = arith.cmpf une, %23, %23 : tensor<8x1024xf32>
2026-02-21T08:15:09.8665734Z           %29 = arith.ori %27, %28 : tensor<8x1024xi1>
2026-02-21T08:15:09.8666267Z           %30 = arith.select %29, %23, %26 : tensor<8x1024xi1>, tensor<8x1024xf32>
2026-02-21T08:15:09.8666741Z           %31 = math.log %30 : tensor<8x1024xf32>
2026-02-21T08:15:09.8667142Z           %32 = arith.subf %31, %22 : tensor<8x1024xf32>
2026-02-21T08:15:09.8667657Z           %33 = arith.mulf %23, %32 : tensor<8x1024xf32>
2026-02-21T08:15:09.8668105Z           %34 = arith.addf %33, %cst : tensor<8x1024xf32>
2026-02-21T08:15:09.8668486Z           scf.yield %34 : tensor<8x1024xf32>
2026-02-21T08:15:09.8668883Z         }
2026-02-21T08:15:09.8669223Z         %25 = arith.addf %arg7, %24 : tensor<8x1024xf32>
2026-02-21T08:15:09.8669606Z         scf.yield %25 : tensor<8x1024xf32>
2026-02-21T08:15:09.8670180Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:15:09.8670734Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:15:09.8671135Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:15:09.8671545Z         %22 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:15:09.8671940Z         tt.reduce.return %22 : f32
2026-02-21T08:15:09.8672331Z       }) : (tensor<8x1024xf32>) -> tensor<8xf32>
2026-02-21T08:15:09.8672851Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<8x!tt.ptr<f32>>
2026-02-21T08:15:09.8673386Z       %21 = tt.addptr %20, %17 : tensor<8x!tt.ptr<f32>>, tensor<8xi32>
2026-02-21T08:15:09.8673816Z       tt.store %21, %19 : tensor<8x!tt.ptr<f32>>
2026-02-21T08:15:09.8674237Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:15:09.8674595Z     tt.return
2026-02-21T08:15:09.8674851Z   }
2026-02-21T08:15:09.8675147Z }
2026-02-21T08:15:09.8675295Z 
2026-02-21T08:15:09.8675394Z {-#
2026-02-21T08:15:09.8675687Z   external_resources: {
2026-02-21T08:15:09.8675977Z     mlir_reproducer: {
2026-02-21T08:15:09.8683726Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:15:09.8691773Z       disable_threading: false,
2026-02-21T08:15:09.8692209Z       verify_each: true
2026-02-21T08:15:09.8692531Z     }
2026-02-21T08:15:09.8692752Z   }
2026-02-21T08:15:09.8693058Z #-}
2026-02-21T08:15:09.8693815Z /tmp/torchinductor_root/q2/cq2juvjyoieznyrwo64yoc5zbzf6hjv6zdhby6kzf5phwmki44fv.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:15:09.8696045Z /tmp/torchinductor_root/q2/cq2juvjyoieznyrwo64yoc5zbzf6hjv6zdhby6kzf5phwmki44fv.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:15:09.8697974Z [41s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:15:09.8699907Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 8], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=4, num_stages=7, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:15:09.8701716Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:15:09.8702302Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:15:10.8636009Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 12.0 configs/s
2026-02-21T08:15:10.8646522Z [42s] Adaptive compile timeout: 30s (90% percentile=4.2s, bounds=[30.0s, 30s])
2026-02-21T08:15:12.2538563Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 711.2 configs/s
2026-02-21T08:15:12.3129089Z [43s] Initial random population of 100, 5 starting points: 
2026-02-21T08:15:12.3132442Z error=8
2026-02-21T08:15:12.3137420Z timeout=6
2026-02-21T08:15:12.3141124Z ok=86
2026-02-21T08:15:12.3146800Z min=0.1260
2026-02-21T08:15:12.3148234Z mid=1.5774
2026-02-21T08:15:12.3148490Z max=95.2207
2026-02-21T08:15:12.3148676Z best={'block_sizes': [1024, 1],
2026-02-21T08:15:12.3149141Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:15:12.3149464Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:15:12.3149756Z  'num_sm_multiplier': 16,
2026-02-21T08:15:12.3149978Z  'num_stages': 1,
2026-02-21T08:15:12.3150201Z  'num_warps': 1,
2026-02-21T08:15:12.3150436Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:15:12.3150741Z  'range_flattens': [None, None],
2026-02-21T08:15:12.3151037Z  'range_multi_buffers': [False, True],
2026-02-21T08:15:12.3151302Z  'range_num_stages': [0, 4],
2026-02-21T08:15:12.3151582Z  'range_unroll_factors': [2, 0],
2026-02-21T08:15:12.3151844Z  'range_warp_specializes': [None, True]}
2026-02-21T08:15:12.3152167Z [43s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:15:13.7099841Z [45s] Generation 1 starting: 94 neighbors, 5 active search path(s)
2026-02-21T08:15:23.2513234Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98/98 3.9 configs/s
2026-02-21T08:15:28.8528543Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 98/98 17.6 configs/s
2026-02-21T08:15:42.1906187Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 75.5 configs/s
2026-02-21T08:15:42.5876067Z [74s] Generation 1 complete: 
2026-02-21T08:15:42.5880212Z error=2
2026-02-21T08:15:42.5884067Z ok=98
2026-02-21T08:15:42.5886162Z min=0.1116
2026-02-21T08:15:42.5886725Z mid=0.1424
2026-02-21T08:15:42.5887003Z max=0.6441
2026-02-21T08:15:42.5887237Z best={'block_sizes': [1024, 1],
2026-02-21T08:15:42.5887575Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:15:42.5887963Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:15:42.5888246Z  'num_sm_multiplier': 64,
2026-02-21T08:15:42.5888453Z  'num_stages': 7,
2026-02-21T08:15:42.5888684Z  'num_warps': 8,
2026-02-21T08:15:42.5888883Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:15:42.5889148Z  'range_flattens': [False, False],
2026-02-21T08:15:42.5889377Z  'range_multi_buffers': [True, True],
2026-02-21T08:15:42.5889650Z  'range_num_stages': [1, 3],
2026-02-21T08:15:42.5889951Z  'range_unroll_factors': [0, 3],
2026-02-21T08:15:42.5890201Z  'range_warp_specializes': [True, None]}
2026-02-21T08:15:42.5891074Z [74s] Fitting surrogate: 200 points, 200 targets
2026-02-21T08:15:44.0695451Z [75s] Generation 2 starting: 97 neighbors, 5 active search path(s)
2026-02-21T08:16:02.3798557Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 101/101 1.3 configs/s
2026-02-21T08:16:08.1785404Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 101/101 17.6 configs/s
2026-02-21T08:16:21.8105636Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 73.8 configs/s
2026-02-21T08:16:22.2793538Z [113s] Generation 2 complete: 
2026-02-21T08:16:22.2796755Z error=1
2026-02-21T08:16:22.2797344Z ok=102
2026-02-21T08:16:22.2797605Z min=0.1136
2026-02-21T08:16:22.2797811Z mid=0.1382
2026-02-21T08:16:22.2798009Z max=0.7056
2026-02-21T08:16:22.2798183Z best={'block_sizes': [1024, 1],
2026-02-21T08:16:22.2798521Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:16:22.2798826Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:16:22.2799083Z  'num_sm_multiplier': 64,
2026-02-21T08:16:22.2799284Z  'num_stages': 7,
2026-02-21T08:16:22.2799495Z  'num_warps': 8,
2026-02-21T08:16:22.2799705Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:16:22.2799937Z  'range_flattens': [False, False],
2026-02-21T08:16:22.2800335Z  'range_multi_buffers': [True, True],
2026-02-21T08:16:22.2800878Z  'range_num_stages': [1, 3],
2026-02-21T08:16:22.2801163Z  'range_unroll_factors': [0, 3],
2026-02-21T08:16:22.2801387Z  'range_warp_specializes': [True, None]}
2026-02-21T08:16:22.2820474Z [113s] Fitting surrogate: 303 points, 303 targets
2026-02-21T08:16:23.8236134Z [115s] Generation 3 starting: 82 neighbors, 5 active search path(s)
2026-02-21T08:16:44.6867847Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 0.6 configs/s
2026-02-21T08:16:46.7051756Z module {
2026-02-21T08:16:46.7055856Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:16:46.7060129Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:16:46.7062312Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:16:46.7062997Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:16:46.7063572Z     %cst = arith.constant dense<16384> : tensor<4x1xi32>
2026-02-21T08:16:46.7063922Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<4x1024xf32>
2026-02-21T08:16:46.7064246Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:16:46.7064497Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:16:46.7064728Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T08:16:46.7065022Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T08:16:46.7065252Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:16:46.7065645Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>>
2026-02-21T08:16:46.7066105Z     %1 = tt.get_program_id x : i32
2026-02-21T08:16:46.7066423Z     scf.for %arg5 = %1 to %c1024_i32 step %c9472_i32  : i32 {
2026-02-21T08:16:46.7066673Z       %2 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:16:46.7067277Z       %3 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:16:46.7067608Z       %4 = tt.splat %2 : i32 -> tensor<4xi32>
2026-02-21T08:16:46.7067880Z       %5 = arith.addi %4, %3 : tensor<4xi32>
2026-02-21T08:16:46.7068160Z       %c15360_i32 = arith.constant 15360 : i32
2026-02-21T08:16:46.7068390Z       %c3072_i32 = arith.constant 3072 : i32
2026-02-21T08:16:46.7068795Z       %6 = scf.for %arg6 = %c0_i32 to %c15360_i32 step %c3072_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x1024xf32>)  : i32 {
2026-02-21T08:16:46.7069226Z         %25 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T08:16:46.7069559Z         %26 = tt.splat %arg6 : i32 -> tensor<1024xi32>
2026-02-21T08:16:46.7069855Z         %27 = arith.addi %26, %25 : tensor<1024xi32>
2026-02-21T08:16:46.7070202Z         %28 = tt.descriptor_load %0[%2, %arg6] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:16:46.7070605Z         %29 = tt.expand_dims %5 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:16:46.7070906Z         %30 = arith.muli %29, %cst : tensor<4x1xi32>
2026-02-21T08:16:46.7071284Z         %31 = tt.expand_dims %27 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T08:16:46.7072035Z         %32 = tt.broadcast %30 : tensor<4x1xi32> -> tensor<4x1024xi32>
2026-02-21T08:16:46.7072339Z         %33 = tt.broadcast %31 : tensor<1x1024xi32> -> tensor<4x1024xi32>
2026-02-21T08:16:46.7072685Z         %34 = arith.addi %32, %33 : tensor<4x1024xi32>
2026-02-21T08:16:46.7072971Z         %35 = tt.splat %arg1 : !tt.ptr<f32> -> tensor<4x1024x!tt.ptr<f32>>
2026-02-21T08:16:46.7073327Z         %36 = tt.addptr %35, %34 : tensor<4x1024x!tt.ptr<f32>>, tensor<4x1024xi32>
2026-02-21T08:16:46.7073725Z         %37 = tt.load %36 evictionPolicy = evict_first : tensor<4x1024x!tt.ptr<f32>>
2026-02-21T08:16:46.7074032Z         %38 = scf.if %arg3 -> (tensor<4x1024xf32>) {
2026-02-21T08:16:46.7074467Z           %74 = tt.extern_elementwise %37 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32>
2026-02-21T08:16:46.7075019Z           %75 = arith.subf %37, %28 : tensor<4x1024xf32>
2026-02-21T08:16:46.7075319Z           %76 = arith.mulf %74, %75 : tensor<4x1024xf32>
2026-02-21T08:16:46.7075614Z           %77 = arith.addf %76, %cst_0 : tensor<4x1024xf32>
2026-02-21T08:16:46.7076253Z           scf.yield %77 : tensor<4x1024xf32>
2026-02-21T08:16:46.7076503Z         } else {
2026-02-21T08:16:46.7076698Z           %74 = tt.splat %arg4 : f32 -> tensor<4x1024xf32>
2026-02-21T08:16:46.7077017Z           %75 = arith.cmpf ogt, %37, %74 : tensor<4x1024xf32>
2026-02-21T08:16:46.7077290Z           %76 = arith.cmpf une, %37, %37 : tensor<4x1024xf32>
2026-02-21T08:16:46.7077562Z           %77 = arith.ori %75, %76 : tensor<4x1024xi1>
2026-02-21T08:16:46.7077890Z           %78 = arith.select %77, %37, %74 : tensor<4x1024xi1>, tensor<4x1024xf32>
2026-02-21T08:16:46.7078185Z           %79 = math.log %78 : tensor<4x1024xf32>
2026-02-21T08:16:46.7078445Z           %80 = arith.subf %79, %28 : tensor<4x1024xf32>
2026-02-21T08:16:46.7078716Z           %81 = arith.mulf %37, %80 : tensor<4x1024xf32>
2026-02-21T08:16:46.7079005Z           %82 = arith.addf %81, %cst_0 : tensor<4x1024xf32>
2026-02-21T08:16:46.7079246Z           scf.yield %82 : tensor<4x1024xf32>
2026-02-21T08:16:46.7079500Z         }
2026-02-21T08:16:46.7079728Z         %39 = arith.addf %arg7, %38 : tensor<4x1024xf32>
2026-02-21T08:16:46.7079966Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:16:46.7080245Z         %40 = arith.muli %c1024_i32, %c1_i32 : i32
2026-02-21T08:16:46.7080481Z         %41 = arith.addi %arg6, %40 : i32
2026-02-21T08:16:46.7080782Z         %42 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T08:16:46.7081072Z         %43 = tt.splat %41 : i32 -> tensor<1024xi32>
2026-02-21T08:16:46.7081362Z         %44 = arith.addi %43, %42 : tensor<1024xi32>
2026-02-21T08:16:46.7081716Z         %45 = tt.descriptor_load %0[%2, %41] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:16:46.7082122Z         %46 = tt.expand_dims %5 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:16:46.7082463Z         %47 = arith.muli %46, %cst : tensor<4x1xi32>
2026-02-21T08:16:46.7082757Z         %48 = tt.expand_dims %44 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T08:16:46.7083120Z         %49 = tt.broadcast %47 : tensor<4x1xi32> -> tensor<4x1024xi32>
2026-02-21T08:16:46.7083466Z         %50 = tt.broadcast %48 : tensor<1x1024xi32> -> tensor<4x1024xi32>
2026-02-21T08:16:46.7083745Z         %51 = arith.addi %49, %50 : tensor<4x1024xi32>
2026-02-21T08:16:46.7084057Z         %52 = tt.splat %arg1 : !tt.ptr<f32> -> tensor<4x1024x!tt.ptr<f32>>
2026-02-21T08:16:46.7084384Z         %53 = tt.addptr %52, %51 : tensor<4x1024x!tt.ptr<f32>>, tensor<4x1024xi32>
2026-02-21T08:16:46.7084746Z         %54 = tt.load %53 evictionPolicy = evict_first : tensor<4x1024x!tt.ptr<f32>>
2026-02-21T08:16:46.7085030Z         %55 = scf.if %arg3 -> (tensor<4x1024xf32>) {
2026-02-21T08:16:46.7085492Z           %74 = tt.extern_elementwise %54 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32>
2026-02-21T08:16:46.7086017Z           %75 = arith.subf %54, %45 : tensor<4x1024xf32>
2026-02-21T08:16:46.7086261Z           %76 = arith.mulf %74, %75 : tensor<4x1024xf32>
2026-02-21T08:16:46.7086568Z           %77 = arith.addf %76, %cst_0 : tensor<4x1024xf32>
2026-02-21T08:16:46.7086811Z           scf.yield %77 : tensor<4x1024xf32>
2026-02-21T08:16:46.7087055Z         } else {
2026-02-21T08:16:46.7087311Z           %74 = tt.splat %arg4 : f32 -> tensor<4x1024xf32>
2026-02-21T08:16:46.7087571Z           %75 = arith.cmpf ogt, %54, %74 : tensor<4x1024xf32>
2026-02-21T08:16:46.7087859Z           %76 = arith.cmpf une, %54, %54 : tensor<4x1024xf32>
2026-02-21T08:16:46.7088137Z           %77 = arith.ori %75, %76 : tensor<4x1024xi1>
2026-02-21T08:16:46.7088450Z           %78 = arith.select %77, %54, %74 : tensor<4x1024xi1>, tensor<4x1024xf32>
2026-02-21T08:16:46.7088727Z           %79 = math.log %78 : tensor<4x1024xf32>
2026-02-21T08:16:46.7089080Z           %80 = arith.subf %79, %45 : tensor<4x1024xf32>
2026-02-21T08:16:46.7089365Z           %81 = arith.mulf %54, %80 : tensor<4x1024xf32>
2026-02-21T08:16:46.7089618Z           %82 = arith.addf %81, %cst_0 : tensor<4x1024xf32>
2026-02-21T08:16:46.7089900Z           scf.yield %82 : tensor<4x1024xf32>
2026-02-21T08:16:46.7090121Z         }
2026-02-21T08:16:46.7090991Z         %56 = arith.addf %39, %55 : tensor<4x1024xf32>
2026-02-21T08:16:46.7091227Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:16:46.7091506Z         %57 = arith.muli %c1024_i32, %c2_i32 : i32
2026-02-21T08:16:46.7091764Z         %58 = arith.addi %arg6, %57 : i32
2026-02-21T08:16:46.7092077Z         %59 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T08:16:46.7092419Z         %60 = tt.splat %58 : i32 -> tensor<1024xi32>
2026-02-21T08:16:46.7092657Z         %61 = arith.addi %60, %59 : tensor<1024xi32>
2026-02-21T08:16:46.7093005Z         %62 = tt.descriptor_load %0[%2, %58] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:16:46.7093434Z         %63 = tt.expand_dims %5 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:16:46.7093723Z         %64 = arith.muli %63, %cst : tensor<4x1xi32>
2026-02-21T08:16:46.7094046Z         %65 = tt.expand_dims %61 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T08:16:46.7094389Z         %66 = tt.broadcast %64 : tensor<4x1xi32> -> tensor<4x1024xi32>
2026-02-21T08:16:46.7094717Z         %67 = tt.broadcast %65 : tensor<1x1024xi32> -> tensor<4x1024xi32>
2026-02-21T08:16:46.7094976Z         %68 = arith.addi %66, %67 : tensor<4x1024xi32>
2026-02-21T08:16:46.7095320Z         %69 = tt.splat %arg1 : !tt.ptr<f32> -> tensor<4x1024x!tt.ptr<f32>>
2026-02-21T08:16:46.7095661Z         %70 = tt.addptr %69, %68 : tensor<4x1024x!tt.ptr<f32>>, tensor<4x1024xi32>
2026-02-21T08:16:46.7095979Z         %71 = tt.load %70 evictionPolicy = evict_first : tensor<4x1024x!tt.ptr<f32>>
2026-02-21T08:16:46.7096342Z         %72 = scf.if %arg3 -> (tensor<4x1024xf32>) {
2026-02-21T08:16:46.7096743Z           %74 = tt.extern_elementwise %71 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32>
2026-02-21T08:16:46.7097170Z           %75 = arith.subf %71, %62 : tensor<4x1024xf32>
2026-02-21T08:16:46.7097484Z           %76 = arith.mulf %74, %75 : tensor<4x1024xf32>
2026-02-21T08:16:46.7097737Z           %77 = arith.addf %76, %cst_0 : tensor<4x1024xf32>
2026-02-21T08:16:46.7098002Z           scf.yield %77 : tensor<4x1024xf32>
2026-02-21T08:16:46.7098249Z         } else {
2026-02-21T08:16:46.7098490Z           %74 = tt.splat %arg4 : f32 -> tensor<4x1024xf32>
2026-02-21T08:16:46.7098768Z           %75 = arith.cmpf ogt, %71, %74 : tensor<4x1024xf32>
2026-02-21T08:16:46.7099087Z           %76 = arith.cmpf une, %71, %71 : tensor<4x1024xf32>
2026-02-21T08:16:46.7099378Z           %77 = arith.ori %75, %76 : tensor<4x1024xi1>
2026-02-21T08:16:46.7099679Z           %78 = arith.select %77, %71, %74 : tensor<4x1024xi1>, tensor<4x1024xf32>
2026-02-21T08:16:46.7100079Z           %79 = math.log %78 : tensor<4x1024xf32>
2026-02-21T08:16:46.7100332Z           %80 = arith.subf %79, %62 : tensor<4x1024xf32>
2026-02-21T08:16:46.7100628Z           %81 = arith.mulf %71, %80 : tensor<4x1024xf32>
2026-02-21T08:16:46.7100946Z           %82 = arith.addf %81, %cst_0 : tensor<4x1024xf32>
2026-02-21T08:16:46.7101210Z           scf.yield %82 : tensor<4x1024xf32>
2026-02-21T08:16:46.7101463Z         }
2026-02-21T08:16:46.7101655Z         %73 = arith.addf %56, %72 : tensor<4x1024xf32>
2026-02-21T08:16:46.7101985Z         scf.yield %73 : tensor<4x1024xf32>
2026-02-21T08:16:46.7102222Z       } {tt.num_stages = 1 : i32}
2026-02-21T08:16:46.7102660Z       %7 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T08:16:46.7103023Z       %8 = tt.splat %c15360_i32 : i32 -> tensor<1024xi32>
2026-02-21T08:16:46.7103286Z       %9 = arith.addi %8, %7 : tensor<1024xi32>
2026-02-21T08:16:46.7103739Z       %10 = tt.descriptor_load %0[%2, %c15360_i32] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:16:46.7104169Z       %11 = tt.expand_dims %5 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:16:46.7104509Z       %12 = arith.muli %11, %cst : tensor<4x1xi32>
2026-02-21T08:16:46.7104847Z       %13 = tt.expand_dims %9 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32>
2026-02-21T08:16:46.7105209Z       %14 = tt.broadcast %12 : tensor<4x1xi32> -> tensor<4x1024xi32>
2026-02-21T08:16:46.7105557Z       %15 = tt.broadcast %13 : tensor<1x1024xi32> -> tensor<4x1024xi32>
2026-02-21T08:16:46.7105830Z       %16 = arith.addi %14, %15 : tensor<4x1024xi32>
2026-02-21T08:16:46.7106185Z       %17 = tt.splat %arg1 : !tt.ptr<f32> -> tensor<4x1024x!tt.ptr<f32>>
2026-02-21T08:16:46.7106499Z       %18 = tt.addptr %17, %16 : tensor<4x1024x!tt.ptr<f32>>, tensor<4x1024xi32>
2026-02-21T08:16:46.7106851Z       %19 = tt.load %18 evictionPolicy = evict_first : tensor<4x1024x!tt.ptr<f32>>
2026-02-21T08:16:46.7107224Z       %20 = scf.if %arg3 -> (tensor<4x1024xf32>) {
2026-02-21T08:16:46.7107626Z         %25 = tt.extern_elementwise %19 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32>
2026-02-21T08:16:46.7108055Z         %26 = arith.subf %19, %10 : tensor<4x1024xf32>
2026-02-21T08:16:46.7108331Z         %27 = arith.mulf %25, %26 : tensor<4x1024xf32>
2026-02-21T08:16:46.7108610Z         %28 = arith.addf %27, %cst_0 : tensor<4x1024xf32>
2026-02-21T08:16:46.7108876Z         scf.yield %28 : tensor<4x1024xf32>
2026-02-21T08:16:46.7109119Z       } else {
2026-02-21T08:16:46.7109350Z         %25 = tt.splat %arg4 : f32 -> tensor<4x1024xf32>
2026-02-21T08:16:46.7109617Z         %26 = arith.cmpf ogt, %19, %25 : tensor<4x1024xf32>
2026-02-21T08:16:46.7109922Z         %27 = arith.cmpf une, %19, %19 : tensor<4x1024xf32>
2026-02-21T08:16:46.7110178Z         %28 = arith.ori %26, %27 : tensor<4x1024xi1>
2026-02-21T08:16:46.7110502Z         %29 = arith.select %28, %19, %25 : tensor<4x1024xi1>, tensor<4x1024xf32>
2026-02-21T08:16:46.7110834Z         %30 = math.log %29 : tensor<4x1024xf32>
2026-02-21T08:16:46.7111070Z         %31 = arith.subf %30, %10 : tensor<4x1024xf32>
2026-02-21T08:16:46.7111354Z         %32 = arith.mulf %19, %31 : tensor<4x1024xf32>
2026-02-21T08:16:46.7111608Z         %33 = arith.addf %32, %cst_0 : tensor<4x1024xf32>
2026-02-21T08:16:46.7111924Z         scf.yield %33 : tensor<4x1024xf32>
2026-02-21T08:16:46.7112146Z       }
2026-02-21T08:16:46.7112366Z       %21 = arith.addf %6, %20 : tensor<4x1024xf32>
2026-02-21T08:16:46.7112650Z       %22 = "tt.reduce"(%21) <{axis = 1 : i32}> ({
2026-02-21T08:16:46.7112890Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:16:46.7113141Z         %25 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:16:46.7113380Z         tt.reduce.return %25 : f32
2026-02-21T08:16:46.7113644Z       }) : (tensor<4x1024xf32>) -> tensor<4xf32>
2026-02-21T08:16:46.7113901Z       %23 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:16:46.7114319Z       %24 = tt.addptr %23, %5 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:16:46.7114621Z       tt.store %24, %22 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:16:46.7114866Z     } {tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:16:46.7115154Z     tt.return
2026-02-21T08:16:46.7115328Z   }
2026-02-21T08:16:46.7115508Z }
2026-02-21T08:16:46.7115616Z 
2026-02-21T08:16:46.7115706Z {-#
2026-02-21T08:16:46.7115912Z   external_resources: {
2026-02-21T08:16:46.7116108Z     mlir_reproducer: {
2026-02-21T08:16:46.7120584Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:16:46.7125073Z       disable_threading: false,
2026-02-21T08:16:46.7125344Z       verify_each: true
2026-02-21T08:16:46.7125530Z     }
2026-02-21T08:16:46.7125715Z   }
2026-02-21T08:16:46.7125907Z #-}
2026-02-21T08:16:46.7126401Z /tmp/torchinductor_root/d7/cd7j2kp2pczr6gv2rsagvocxjuaejcorrxzh2minax3zj62t4r32.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:16:46.7127653Z /tmp/torchinductor_root/d7/cd7j2kp2pczr6gv2rsagvocxjuaejcorrxzh2minax3zj62t4r32.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:16:46.7128723Z [138s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:16:46.7130984Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 4], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_sm_multiplier=64, num_stages=8, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, True], range_num_stages=[1, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:16:46.7132049Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:16:46.7132415Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:16:49.4638130Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 18.0 configs/s
2026-02-21T08:17:00.8528662Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 88.3 configs/s
2026-02-21T08:17:01.2177135Z [152s] Generation 3 complete: 
2026-02-21T08:17:01.2177558Z error=3
2026-02-21T08:17:01.2177964Z ok=85
2026-02-21T08:17:01.2178288Z min=0.1116
2026-02-21T08:17:01.2178539Z mid=0.1340
2026-02-21T08:17:01.2178857Z max=1.3301
2026-02-21T08:17:01.2179139Z best={'block_sizes': [1024, 1],
2026-02-21T08:17:01.2179637Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:17:01.2180191Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:17:01.2180576Z  'num_sm_multiplier': 64,
2026-02-21T08:17:01.2180921Z  'num_stages': 7,
2026-02-21T08:17:01.2181209Z  'num_warps': 8,
2026-02-21T08:17:01.2181578Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:17:01.2182293Z  'range_flattens': [False, False],
2026-02-21T08:17:01.2182707Z  'range_multi_buffers': [True, True],
2026-02-21T08:17:01.2183084Z  'range_num_stages': [1, 2],
2026-02-21T08:17:01.2183446Z  'range_unroll_factors': [0, 3],
2026-02-21T08:17:01.2184268Z  'range_warp_specializes': [True, None]}
2026-02-21T08:17:01.2206706Z [152s] Fitting surrogate: 391 points, 391 targets
2026-02-21T08:17:02.4524032Z [153s] Generation 4 starting: 86 neighbors, 5 active search path(s)
2026-02-21T08:17:07.1291633Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 16.5 configs/s
2026-02-21T08:17:12.2350736Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 17.6 configs/s
2026-02-21T08:17:26.3318287Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 71.4 configs/s
2026-02-21T08:17:26.9034885Z [178s] Generation 4 complete: 
2026-02-21T08:17:26.9035268Z ok=91
2026-02-21T08:17:26.9035708Z min=0.1116
2026-02-21T08:17:26.9035990Z mid=0.1300
2026-02-21T08:17:26.9036320Z max=0.5458
2026-02-21T08:17:26.9036600Z best={'block_sizes': [1024, 1],
2026-02-21T08:17:26.9037100Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:17:26.9037710Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:17:26.9038203Z  'num_sm_multiplier': 64,
2026-02-21T08:17:26.9038580Z  'num_stages': 7,
2026-02-21T08:17:26.9038893Z  'num_warps': 8,
2026-02-21T08:17:26.9039244Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:17:26.9039604Z  'range_flattens': [False, False],
2026-02-21T08:17:26.9040015Z  'range_multi_buffers': [True, True],
2026-02-21T08:17:26.9040373Z  'range_num_stages': [1, 2],
2026-02-21T08:17:26.9040738Z  'range_unroll_factors': [0, 3],
2026-02-21T08:17:26.9041138Z  'range_warp_specializes': [True, None]}
2026-02-21T08:17:26.9070160Z [178s] Fitting surrogate: 482 points, 482 targets
2026-02-21T08:17:28.3099343Z [179s] Generation 5 starting: 82 neighbors, 5 active search path(s)
2026-02-21T08:17:33.5673036Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 5.7 configs/s
2026-02-21T08:17:38.4081119Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 17.5 configs/s
2026-02-21T08:17:50.6193169Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 82.4 configs/s
2026-02-21T08:17:50.9954071Z [202s] Generation 5 complete: 
2026-02-21T08:17:50.9955774Z ok=87
2026-02-21T08:17:50.9956064Z min=0.1116
2026-02-21T08:17:50.9956391Z mid=0.1281
2026-02-21T08:17:50.9956595Z max=0.9944
2026-02-21T08:17:50.9956793Z best={'block_sizes': [1024, 1],
2026-02-21T08:17:50.9957167Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:17:50.9957520Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:17:50.9957775Z  'num_sm_multiplier': 64,
2026-02-21T08:17:50.9958037Z  'num_stages': 7,
2026-02-21T08:17:50.9958225Z  'num_warps': 8,
2026-02-21T08:17:50.9958527Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:17:50.9958803Z  'range_flattens': [False, False],
2026-02-21T08:17:50.9959072Z  'range_multi_buffers': [True, True],
2026-02-21T08:17:50.9959998Z  'range_num_stages': [1, 2],
2026-02-21T08:17:50.9960262Z  'range_unroll_factors': [0, 3],
2026-02-21T08:17:50.9960525Z  'range_warp_specializes': [True, None]}
2026-02-21T08:17:50.9975769Z [202s] Fitting surrogate: 569 points, 569 targets
2026-02-21T08:17:52.0050557Z [203s] Generation 6 starting: 55 neighbors, 3 active search path(s)
2026-02-21T08:17:55.2340577Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57/57 53.9 configs/s
2026-02-21T08:17:58.5042924Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 57/57 17.7 configs/s
2026-02-21T08:18:07.5940343Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 111.2 configs/s
2026-02-21T08:18:07.9003695Z [219s] Generation 6 complete: 
2026-02-21T08:18:07.9004144Z ok=59
2026-02-21T08:18:07.9004626Z min=0.1134
2026-02-21T08:18:07.9004928Z mid=0.1342
2026-02-21T08:18:07.9005259Z max=0.3552
2026-02-21T08:18:07.9005598Z best={'block_sizes': [2048, 1],
2026-02-21T08:18:07.9006112Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:18:07.9006585Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:18:07.9007477Z  'num_sm_multiplier': 16,
2026-02-21T08:18:07.9007905Z  'num_stages': 1,
2026-02-21T08:18:07.9008227Z  'num_warps': 8,
2026-02-21T08:18:07.9008664Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:18:07.9009075Z  'range_flattens': [False, False],
2026-02-21T08:18:07.9009479Z  'range_multi_buffers': [True, True],
2026-02-21T08:18:07.9009807Z  'range_num_stages': [0, 4],
2026-02-21T08:18:07.9010224Z  'range_unroll_factors': [0, 3],
2026-02-21T08:18:07.9010559Z  'range_warp_specializes': [True, None]}
2026-02-21T08:18:07.9026376Z [219s] Fitting surrogate: 628 points, 628 targets
2026-02-21T08:18:08.7498914Z [220s] Generation 7 starting: 53 neighbors, 3 active search path(s)
2026-02-21T08:18:11.4477787Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 24.9 configs/s
2026-02-21T08:18:14.5033400Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 17.6 configs/s
2026-02-21T08:18:23.8971007Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 111.4 configs/s
2026-02-21T08:18:24.1710654Z [235s] Generation 7 complete: 
2026-02-21T08:18:24.1714945Z ok=56
2026-02-21T08:18:24.1716770Z min=0.1096
2026-02-21T08:18:24.1716938Z mid=0.1240
2026-02-21T08:18:24.1717064Z max=0.1946
2026-02-21T08:18:24.1717214Z best={'block_sizes': [2048, 1],
2026-02-21T08:18:24.1717438Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:18:24.1717690Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:18:24.1717890Z  'num_stages': 1,
2026-02-21T08:18:24.1718030Z  'num_warps': 8,
2026-02-21T08:18:24.1718178Z  'pid_type': 'flat',
2026-02-21T08:18:24.1718336Z  'range_flattens': [None, False],
2026-02-21T08:18:24.1718521Z  'range_multi_buffers': [None, True],
2026-02-21T08:18:24.1718698Z  'range_num_stages': [0, 4],
2026-02-21T08:18:24.1718868Z  'range_unroll_factors': [0, 0],
2026-02-21T08:18:24.1719040Z  'range_warp_specializes': [None, True]}
2026-02-21T08:18:24.1731004Z [235s] Fitting surrogate: 684 points, 684 targets
2026-02-21T08:18:24.7789195Z [236s] Generation 8 starting: 30 neighbors, 2 active search path(s)
2026-02-21T08:18:26.6547864Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 14.9 configs/s
2026-02-21T08:18:28.3843129Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 30/30 17.8 configs/s
2026-02-21T08:18:33.3608566Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 203.1 configs/s
2026-02-21T08:18:33.5289236Z [245s] Generation 8 complete: 
2026-02-21T08:18:33.5292395Z ok=32
2026-02-21T08:18:33.5293874Z min=0.1096
2026-02-21T08:18:33.5294029Z mid=0.1261
2026-02-21T08:18:33.5294156Z max=0.1976
2026-02-21T08:18:33.5294293Z best={'block_sizes': [2048, 1],
2026-02-21T08:18:33.5294536Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:18:33.5294775Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:18:33.5295350Z  'num_stages': 1,
2026-02-21T08:18:33.5295521Z  'num_warps': 8,
2026-02-21T08:18:33.5295671Z  'pid_type': 'flat',
2026-02-21T08:18:33.5295849Z  'range_flattens': [None, False],
2026-02-21T08:18:33.5296037Z  'range_multi_buffers': [None, True],
2026-02-21T08:18:33.5296234Z  'range_num_stages': [0, 4],
2026-02-21T08:18:33.5296402Z  'range_unroll_factors': [0, 0],
2026-02-21T08:18:33.5296592Z  'range_warp_specializes': [None, True]}
2026-02-21T08:18:33.5313072Z [245s] Fitting surrogate: 716 points, 716 targets
2026-02-21T08:18:33.8191643Z [245s] Autotuning complete in 245.3s after searching 691 configs.
2026-02-21T08:18:33.8192080Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:18:33.8193028Z     @helion.kernel(config=helion.Config(block_sizes=[2048, 1], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=1, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:18:33.8193872Z 
2026-02-21T08:18:33.8194129Z [245s] Code of selected kernel: /tmp/torchinductor_root/pg/cpgzaakzb2yxo736ajfixkoxq5uniyijriimmreocuh7kbuo7e5a.py
2026-02-21T08:18:34.9426624Z WARNING:tritonbench.utils.triton_op:Completed input ID 2:
2026-02-21T08:18:34.9430756Z (B, T, V)
2026-02-21T08:18:34.9435447Z ---------------
2026-02-21T08:18:34.9439957Z (8, 512, 16384)
2026-02-21T08:18:34.9445254Z 
2026-02-21T08:18:34.9462375Z  50%|█████     | 3/6 [10:22<10:55, 218.65s/it]WARNING:tritonbench.utils.triton_op:Running input ID 3:
2026-02-21T08:18:34.9463514Z (B, T, V)
2026-02-21T08:18:34.9463688Z ---------------
2026-02-21T08:18:34.9463840Z (8, 512, 32768)
2026-02-21T08:18:34.9464192Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for torch_kl_div
2026-02-21T08:18:36.0372789Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for liger_kl_div
2026-02-21T08:18:37.1399751Z INFO:tritonbench.utils.triton_op:Took 2.45ms to get benchmark function for torch_compile_kl_div
2026-02-21T08:18:41.0186952Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:18:41.0187321Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:18:41.0187603Z               'dtype': 'torch.float32',
2026-02-21T08:18:41.0187897Z               'shape': (4096, 32768),
2026-02-21T08:18:41.0188172Z               'stride': (32768, 1)},
2026-02-21T08:18:41.0188436Z             { 'device': 'cuda:0',
2026-02-21T08:18:41.0188711Z               'dtype': 'torch.float32',
2026-02-21T08:18:41.0188989Z               'shape': (4096, 32768),
2026-02-21T08:18:41.0189261Z               'stride': (32768, 1)}),
2026-02-21T08:18:41.0189516Z   'kwargs': {}}
2026-02-21T08:18:41.0218205Z INFO:tritonbench.utils.triton_op:Took 3.65ms to get benchmark function for helion_kl_div_tritonbench
2026-02-21T08:18:41.2257327Z [0s] Autotune random seed: 2134765727
2026-02-21T08:18:41.3791055Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:19:13.5560953Z [32s] Timeout after 30s compiling Config(block_sizes=[256, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, None])
2026-02-21T08:19:13.8538850Z [32s] Timeout after 30s compiling Config(block_sizes=[2048, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], maxnreg=128, num_sm_multiplier=64, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, True], range_num_stages=[2, 0], range_unroll_factors=[3, 2], range_warp_specializes=[False, None])
2026-02-21T08:19:13.9047268Z [32s] Timeout after 30s compiling Config(block_sizes=[1024, 64], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['last', 'last'], maxnreg=256, num_sm_multiplier=16, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, True], range_num_stages=[4, 2], range_unroll_factors=[3, 0], range_warp_specializes=[False, True])
2026-02-21T08:19:14.5742865Z [33s] Timeout after 30s compiling Config(block_sizes=[512, 128], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'first'], maxnreg=256, num_sm_multiplier=16, num_stages=5, num_warps=4, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 4], range_warp_specializes=[True, None])
2026-02-21T08:19:15.9552905Z [34s] Timeout after 30s compiling Config(block_sizes=[256, 64], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', ''], num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None])
2026-02-21T08:19:16.2317633Z [34s] Timeout after 30s compiling Config(block_sizes=[2048, 64], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'first'], maxnreg=128, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[1, 1], range_unroll_factors=[3, 3], range_warp_specializes=[False, False])
2026-02-21T08:19:16.3061367Z [34s] Timeout after 30s compiling Config(block_sizes=[512, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[2, 1], range_unroll_factors=[3, 2], range_warp_specializes=[False, False])
2026-02-21T08:19:16.6930642Z [35s] Timeout after 30s compiling Config(block_sizes=[2048, 512], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=64, num_sm_multiplier=16, num_stages=8, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[1, 0], range_unroll_factors=[4, 2], range_warp_specializes=[False, False])
2026-02-21T08:19:16.6945203Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s
2026-02-21T08:19:17.3394435Z module {
2026-02-21T08:19:17.3399186Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:19:17.3403909Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:19:17.3407380Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:19:17.3411308Z     %c296_i32 = arith.constant 296 : i32
2026-02-21T08:19:17.3412642Z     %cst = arith.constant dense<0.000000e+00> : tensor<64x1024xf32>
2026-02-21T08:19:17.3412900Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:19:17.3413117Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:19:17.3413320Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T08:19:17.3413528Z     %c32768_i64 = arith.constant 32768 : i64
2026-02-21T08:19:17.3413717Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:19:17.3414062Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<64x1024xf32>>
2026-02-21T08:19:17.3414534Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<64x1024xf32>>
2026-02-21T08:19:17.3414857Z     %2 = tt.get_program_id x : i32
2026-02-21T08:19:17.3415085Z     scf.for %arg5 = %2 to %c64_i32 step %c296_i32  : i32 {
2026-02-21T08:19:17.3415294Z       %3 = arith.muli %arg5, %c64_i32 : i32
2026-02-21T08:19:17.3415639Z       %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:19:17.3415915Z       %5 = tt.splat %3 : i32 -> tensor<64xi32>
2026-02-21T08:19:17.3416113Z       %6 = arith.addi %5, %4 : tensor<64xi32>
2026-02-21T08:19:17.3416443Z       %7 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<64x1024xf32>)  : i32 {
2026-02-21T08:19:17.3416866Z         %11 = tt.descriptor_load %0[%3, %arg6] : !tt.tensordesc<tensor<64x1024xf32>> -> tensor<64x1024xf32>
2026-02-21T08:19:17.3417258Z         %12 = tt.descriptor_load %1[%3, %arg6] : !tt.tensordesc<tensor<64x1024xf32>> -> tensor<64x1024xf32>
2026-02-21T08:19:17.3417561Z         %13 = scf.if %arg3 -> (tensor<64x1024xf32>) {
2026-02-21T08:19:17.3417950Z           %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x1024xf32>) -> tensor<64x1024xf32>
2026-02-21T08:19:17.3418374Z           %16 = arith.subf %12, %11 : tensor<64x1024xf32>
2026-02-21T08:19:17.3418596Z           %17 = arith.mulf %15, %16 : tensor<64x1024xf32>
2026-02-21T08:19:17.3418824Z           %18 = arith.addf %17, %cst : tensor<64x1024xf32>
2026-02-21T08:19:17.3419036Z           scf.yield %18 : tensor<64x1024xf32>
2026-02-21T08:19:17.3419213Z         } else {
2026-02-21T08:19:17.3419389Z           %15 = tt.splat %arg4 : f32 -> tensor<64x1024xf32>
2026-02-21T08:19:17.3419614Z           %16 = arith.cmpf ogt, %12, %15 : tensor<64x1024xf32>
2026-02-21T08:19:17.3419847Z           %17 = arith.cmpf une, %12, %12 : tensor<64x1024xf32>
2026-02-21T08:19:17.3420067Z           %18 = arith.ori %16, %17 : tensor<64x1024xi1>
2026-02-21T08:19:17.3420323Z           %19 = arith.select %18, %12, %15 : tensor<64x1024xi1>, tensor<64x1024xf32>
2026-02-21T08:19:17.3420580Z           %20 = math.log %19 : tensor<64x1024xf32>
2026-02-21T08:19:17.3420786Z           %21 = arith.subf %20, %11 : tensor<64x1024xf32>
2026-02-21T08:19:17.3421002Z           %22 = arith.mulf %12, %21 : tensor<64x1024xf32>
2026-02-21T08:19:17.3421220Z           %23 = arith.addf %22, %cst : tensor<64x1024xf32>
2026-02-21T08:19:17.3421429Z           scf.yield %23 : tensor<64x1024xf32>
2026-02-21T08:19:17.3421604Z         }
2026-02-21T08:19:17.3421768Z         %14 = arith.addf %arg7, %13 : tensor<64x1024xf32>
2026-02-21T08:19:17.3421998Z         scf.yield %14 : tensor<64x1024xf32>
2026-02-21T08:19:17.3422193Z       } {tt.warp_specialize}
2026-02-21T08:19:17.3422374Z       %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:19:17.3422558Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:19:17.3422741Z         %11 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:19:17.3422922Z         tt.reduce.return %11 : f32
2026-02-21T08:19:17.3423116Z       }) : (tensor<64x1024xf32>) -> tensor<64xf32>
2026-02-21T08:19:17.3423343Z       %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
2026-02-21T08:19:17.3423610Z       %10 = tt.addptr %9, %6 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
2026-02-21T08:19:17.3423842Z       tt.store %10, %8 : tensor<64x!tt.ptr<f32>>
2026-02-21T08:19:17.3424172Z     } {tt.disallow_acc_multi_buffer}
2026-02-21T08:19:17.3424353Z     tt.return
2026-02-21T08:19:17.3424484Z   }
2026-02-21T08:19:17.3424620Z }
2026-02-21T08:19:17.3424692Z 
2026-02-21T08:19:17.3424746Z {-#
2026-02-21T08:19:17.3424892Z   external_resources: {
2026-02-21T08:19:17.3425053Z     mlir_reproducer: {
2026-02-21T08:19:17.3429394Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:19:17.3433709Z       disable_threading: false,
2026-02-21T08:19:17.3433880Z       verify_each: true
2026-02-21T08:19:17.3434023Z     }
2026-02-21T08:19:17.3434135Z   }
2026-02-21T08:19:17.3434250Z #-}
2026-02-21T08:19:17.3434652Z /tmp/torchinductor_root/cl/cclsdp73pnkd5ihcd7ywluctzniw5lm2m64ipuaqnio43c7av6wv.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:19:17.3435827Z /tmp/torchinductor_root/cl/cclsdp73pnkd5ihcd7ywluctzniw5lm2m64ipuaqnio43c7av6wv.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:19:17.3436769Z [35s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:19:17.3437831Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=2, num_stages=6, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[False, True]), static_shapes=True)
2026-02-21T08:19:17.3438799Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:19:17.3439049Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:19:18.0472416Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:19:18.0477575Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:19:18.0481735Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:19:18.0483230Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:19:18.0483446Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:19:18.0483639Z     %c4736_i32 = arith.constant 4736 : i32
2026-02-21T08:19:18.0483848Z     %cst = arith.constant dense<32768> : tensor<4x1xi32>
2026-02-21T08:19:18.0484101Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<4x64xf32>
2026-02-21T08:19:18.0484322Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:19:18.0484503Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:19:18.0484689Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T08:19:18.0484894Z     %c32768_i64 = arith.constant 32768 : i64
2026-02-21T08:19:18.0485066Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:19:18.0485378Z     %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<4x64xf32>>
2026-02-21T08:19:18.0485693Z     %1 = tt.get_program_id x : i32
2026-02-21T08:19:18.0486168Z     scf.for %arg5 = %1 to %c1024_i32 step %c4736_i32  : i32 {
2026-02-21T08:19:18.0486422Z       %2 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:19:18.0486649Z       %3 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:19:18.0486898Z       %4 = tt.splat %2 : i32 -> tensor<4xi32>
2026-02-21T08:19:18.0487078Z       %5 = arith.addi %4, %3 : tensor<4xi32>
2026-02-21T08:19:18.0487261Z       %c128_i32 = arith.constant 128 : i32
2026-02-21T08:19:18.0487569Z       %6 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c128_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x64xf32>)  : i32 {
2026-02-21T08:19:18.0487948Z         %10 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:19:18.0488199Z         %11 = tt.splat %arg6 : i32 -> tensor<64xi32>
2026-02-21T08:19:18.0488402Z         %12 = arith.addi %11, %10 : tensor<64xi32>
2026-02-21T08:19:18.0488641Z         %13 = tt.expand_dims %5 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:19:18.0488896Z         %14 = arith.muli %13, %cst : tensor<4x1xi32>
2026-02-21T08:19:18.0489141Z         %15 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
2026-02-21T08:19:18.0489423Z         %16 = tt.broadcast %14 : tensor<4x1xi32> -> tensor<4x64xi32>
2026-02-21T08:19:18.0489669Z         %17 = tt.broadcast %15 : tensor<1x64xi32> -> tensor<4x64xi32>
2026-02-21T08:19:18.0489895Z         %18 = arith.addi %16, %17 : tensor<4x64xi32>
2026-02-21T08:19:18.0490127Z         %19 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x64x!tt.ptr<f32>>
2026-02-21T08:19:18.0490389Z         %20 = tt.addptr %19, %18 : tensor<4x64x!tt.ptr<f32>>, tensor<4x64xi32>
2026-02-21T08:19:18.0490638Z         %21 = tt.load %20 : tensor<4x64x!tt.ptr<f32>>
2026-02-21T08:19:18.0490916Z         %22 = tt.descriptor_load %0[%2, %arg6] : !tt.tensordesc<tensor<4x64xf32>> -> tensor<4x64xf32>
2026-02-21T08:19:18.0491199Z         %23 = scf.if %arg3 -> (tensor<4x64xf32>) {
2026-02-21T08:19:18.0491572Z           %42 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x64xf32>) -> tensor<4x64xf32>
2026-02-21T08:19:18.0491991Z           %43 = arith.subf %22, %21 : tensor<4x64xf32>
2026-02-21T08:19:18.0492198Z           %44 = arith.mulf %42, %43 : tensor<4x64xf32>
2026-02-21T08:19:18.0492397Z           %45 = arith.addf %44, %cst_0 : tensor<4x64xf32>
2026-02-21T08:19:18.0492597Z           scf.yield %45 : tensor<4x64xf32>
2026-02-21T08:19:18.0492762Z         } else {
2026-02-21T08:19:18.0492928Z           %42 = tt.splat %arg4 : f32 -> tensor<4x64xf32>
2026-02-21T08:19:18.0493147Z           %43 = arith.cmpf ogt, %22, %42 : tensor<4x64xf32>
2026-02-21T08:19:18.0493361Z           %44 = arith.cmpf une, %22, %22 : tensor<4x64xf32>
2026-02-21T08:19:18.0493569Z           %45 = arith.ori %43, %44 : tensor<4x64xi1>
2026-02-21T08:19:18.0493797Z           %46 = arith.select %45, %22, %42 : tensor<4x64xi1>, tensor<4x64xf32>
2026-02-21T08:19:18.0494036Z           %47 = math.log %46 : tensor<4x64xf32>
2026-02-21T08:19:18.0494322Z           %48 = arith.subf %47, %21 : tensor<4x64xf32>
2026-02-21T08:19:18.0494521Z           %49 = arith.mulf %22, %48 : tensor<4x64xf32>
2026-02-21T08:19:18.0494725Z           %50 = arith.addf %49, %cst_0 : tensor<4x64xf32>
2026-02-21T08:19:18.0494915Z           scf.yield %50 : tensor<4x64xf32>
2026-02-21T08:19:18.0495084Z         }
2026-02-21T08:19:18.0495224Z         %24 = arith.addf %arg7, %23 : tensor<4x64xf32>
2026-02-21T08:19:18.0495428Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:19:18.0495609Z         %25 = arith.muli %c64_i32, %c1_i32 : i32
2026-02-21T08:19:18.0495796Z         %26 = arith.addi %arg6, %25 : i32
2026-02-21T08:19:18.0496014Z         %27 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:19:18.0496261Z         %28 = tt.splat %26 : i32 -> tensor<64xi32>
2026-02-21T08:19:18.0496461Z         %29 = arith.addi %28, %27 : tensor<64xi32>
2026-02-21T08:19:18.0496764Z         %30 = tt.expand_dims %5 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:19:18.0497033Z         %31 = arith.muli %30, %cst : tensor<4x1xi32>
2026-02-21T08:19:18.0497267Z         %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
2026-02-21T08:19:18.0497546Z         %33 = tt.broadcast %31 : tensor<4x1xi32> -> tensor<4x64xi32>
2026-02-21T08:19:18.0497787Z         %34 = tt.broadcast %32 : tensor<1x64xi32> -> tensor<4x64xi32>
2026-02-21T08:19:18.0498013Z         %35 = arith.addi %33, %34 : tensor<4x64xi32>
2026-02-21T08:19:18.0498239Z         %36 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x64x!tt.ptr<f32>>
2026-02-21T08:19:18.0498492Z         %37 = tt.addptr %36, %35 : tensor<4x64x!tt.ptr<f32>>, tensor<4x64xi32>
2026-02-21T08:19:18.0498735Z         %38 = tt.load %37 : tensor<4x64x!tt.ptr<f32>>
2026-02-21T08:19:18.0499002Z         %39 = tt.descriptor_load %0[%2, %26] : !tt.tensordesc<tensor<4x64xf32>> -> tensor<4x64xf32>
2026-02-21T08:19:18.0499282Z         %40 = scf.if %arg3 -> (tensor<4x64xf32>) {
2026-02-21T08:19:18.0499631Z           %42 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x64xf32>) -> tensor<4x64xf32>
2026-02-21T08:19:18.0499986Z           %43 = arith.subf %39, %38 : tensor<4x64xf32>
2026-02-21T08:19:18.0500185Z           %44 = arith.mulf %42, %43 : tensor<4x64xf32>
2026-02-21T08:19:18.0500384Z           %45 = arith.addf %44, %cst_0 : tensor<4x64xf32>
2026-02-21T08:19:18.0500581Z           scf.yield %45 : tensor<4x64xf32>
2026-02-21T08:19:18.0500745Z         } else {
2026-02-21T08:19:18.0500906Z           %42 = tt.splat %arg4 : f32 -> tensor<4x64xf32>
2026-02-21T08:19:18.0501111Z           %43 = arith.cmpf ogt, %39, %42 : tensor<4x64xf32>
2026-02-21T08:19:18.0501323Z           %44 = arith.cmpf une, %39, %39 : tensor<4x64xf32>
2026-02-21T08:19:18.0501527Z           %45 = arith.ori %43, %44 : tensor<4x64xi1>
2026-02-21T08:19:18.0501747Z           %46 = arith.select %45, %39, %42 : tensor<4x64xi1>, tensor<4x64xf32>
2026-02-21T08:19:18.0502028Z           %47 = math.log %46 : tensor<4x64xf32>
2026-02-21T08:19:18.0502217Z           %48 = arith.subf %47, %38 : tensor<4x64xf32>
2026-02-21T08:19:18.0502410Z           %49 = arith.mulf %39, %48 : tensor<4x64xf32>
2026-02-21T08:19:18.0502604Z           %50 = arith.addf %49, %cst_0 : tensor<4x64xf32>
2026-02-21T08:19:18.0502799Z           scf.yield %50 : tensor<4x64xf32>
2026-02-21T08:19:18.0502964Z         }
2026-02-21T08:19:18.0503099Z         %41 = arith.addf %24, %40 : tensor<4x64xf32>
2026-02-21T08:19:18.0503285Z         scf.yield %41 : tensor<4x64xf32>
2026-02-21T08:19:18.0503492Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:19:18.0503715Z       %7 = "tt.reduce"(%6) <{axis = 1 : i32}> ({
2026-02-21T08:19:18.0503894Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:19:18.0504074Z         %10 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:19:18.0504250Z         tt.reduce.return %10 : f32
2026-02-21T08:19:18.0504436Z       }) : (tensor<4x64xf32>) -> tensor<4xf32>
2026-02-21T08:19:18.0504656Z       %8 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:19:18.0504967Z       %9 = tt.addptr %8, %5 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:19:18.0505192Z       tt.store %9, %7 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:19:18.0505469Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:19:18.0505740Z     tt.return
2026-02-21T08:19:18.0505861Z   }
2026-02-21T08:19:18.0505982Z }
2026-02-21T08:19:18.0506048Z 
2026-02-21T08:19:18.0506103Z {-#
2026-02-21T08:19:18.0506223Z   external_resources: {
2026-02-21T08:19:18.0506380Z     mlir_reproducer: {
2026-02-21T08:19:18.0510681Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:19:18.0515064Z       disable_threading: false,
2026-02-21T08:19:18.0515233Z       verify_each: true
2026-02-21T08:19:18.0515383Z     }
2026-02-21T08:19:18.0515500Z   }
2026-02-21T08:19:18.0515624Z #-}
2026-02-21T08:19:18.0516036Z /tmp/torchinductor_root/2p/c2p632ghn6dygk7gjcvqdffs6626zjbyiy4i6kevgfmfqap3huip.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:19:18.0517207Z /tmp/torchinductor_root/2p/c2p632ghn6dygk7gjcvqdffs6626zjbyiy4i6kevgfmfqap3huip.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:19:18.0518160Z [36s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:19:18.0519211Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 4], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'first'], maxnreg=128, num_sm_multiplier=32, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:19:18.0520158Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:19:18.0520411Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:19:19.2010095Z Initial population exploring neighbors  28% ━━━╸           28/100 12.2 configs/s
2026-02-21T08:19:19.2014258Z 
2026-02-21T08:19:19.2016798Z  50%|█████     | 3/6 [11:07<11:07, 222.35s/it]
2026-02-21T08:19:19.2021911Z WARNING:tritonbench.utils.triton_op:Caught exception on backend helion_kl_div_tritonbench, terminating early with partial results
2026-02-21T08:19:19.2025871Z Traceback (most recent call last):
2026-02-21T08:19:19.2041104Z   File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 1199, in run
2026-02-21T08:19:19.2041571Z     y_vals: Dict[str, BenchmarkOperatorMetrics] = functools.reduce(
2026-02-21T08:19:19.2041833Z                                                   ^^^^^^^^^^^^^^^^^
2026-02-21T08:19:19.2042321Z   File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 1188, in _reduce_benchmarks
2026-02-21T08:19:19.2042683Z     torch.accelerator.synchronize()
2026-02-21T08:19:19.2043266Z   File "/__w/helion/helion/.venv/lib/python3.12/site-packages/torch/accelerator/__init__.py", line 235, in synchronize
2026-02-21T08:19:19.2043658Z     torch._C._accelerator_synchronizeDevice(device_index)
2026-02-21T08:19:19.2043910Z torch.AcceleratorError: CUDA error: misaligned address
2026-02-21T08:19:19.2044349Z Search for `cudaErrorMisalignedAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
2026-02-21T08:19:19.2044892Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2026-02-21T08:19:19.2045272Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1
2026-02-21T08:19:19.2045539Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2026-02-21T08:19:19.2045709Z 
2026-02-21T08:19:19.2045919Z WARNING:tritonbench.utils.triton_op:Failing input: --input-id 3 --num-inputs 1 --input-sample-mode first-k
2026-02-21T08:19:19.2046338Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmprfy49r42.csv
2026-02-21T08:19:19.8589813Z       (B, T, V)    liger_kl_div-speedup    liger_kl_div-accuracy    torch_compile_kl_div-speedup    torch_compile_kl_div-accuracy    helion_kl_div_tritonbench-speedup    helion_kl_div_tritonbench-accuracy
2026-02-21T08:19:19.8591496Z ---------------  ----------------------  -----------------------  ------------------------------  -------------------------------  -----------------------------------  ------------------------------------
2026-02-21T08:19:19.8592288Z  (8, 512, 4096)                 3.10454                        1                         3.03078                                1                              3.58413                                     1
2026-02-21T08:19:19.8596901Z  (8, 512, 8192)                 3.52643                        1                         3.20763                                1                              4.03754                                     1
2026-02-21T08:19:19.8598427Z (8, 512, 16384)                 4.00574                        1                         3.27739                                1                              3.96176                                     1
2026-02-21T08:19:19.8598986Z         average                 3.54557                        1                         3.17193                                1                              3.86114                                     1
2026-02-21T08:21:26.9323454Z Using num_inputs=20 for kl_div
2026-02-21T08:21:27.2957645Z Running kl_div benchmark with Helion implementation...
2026-02-21T08:21:27.2958684Z 
2026-02-21T08:21:27.6420488Z Warning: Requested 20 inputs but only 6 available. Using all available inputs.
2026-02-21T08:21:27.6425505Z Equally-spaced-k mode: Selected 6 equally spaced inputs (total available: 6)
2026-02-21T08:21:27.6430080Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 1, 2, 3, 4, 5]
2026-02-21T08:21:27.6434649Z 
2026-02-21T08:21:27.6446782Z   0%|          | 0/6 [00:00<?, ?it/s]WARNING:tritonbench.utils.triton_op:Running input ID 0:
2026-02-21T08:21:27.6447220Z (B, T, V)
2026-02-21T08:21:27.6451820Z --------------
2026-02-21T08:21:27.6456275Z (8, 512, 4096)
2026-02-21T08:21:27.6461946Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for torch_kl_div
2026-02-21T08:21:28.9742855Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for liger_kl_div
2026-02-21T08:21:30.4280369Z INFO:tritonbench.utils.triton_op:Took 41.71ms to get benchmark function for torch_compile_kl_div
2026-02-21T08:21:31.4112312Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:21:31.4116196Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:21:31.4118318Z               'dtype': 'torch.float32',
2026-02-21T08:21:31.4118596Z               'shape': (4096, 4096),
2026-02-21T08:21:31.4118821Z               'stride': (4096, 1)},
2026-02-21T08:21:31.4119056Z             { 'device': 'cuda:0',
2026-02-21T08:21:31.4119280Z               'dtype': 'torch.float32',
2026-02-21T08:21:31.4119506Z               'shape': (4096, 4096),
2026-02-21T08:21:31.4119719Z               'stride': (4096, 1)}),
2026-02-21T08:21:31.4120367Z   'kwargs': {}}
2026-02-21T08:21:31.4120762Z INFO:tritonbench.utils.triton_op:Took 0.57ms to get benchmark function for helion_kl_div_tritonbench
2026-02-21T08:21:31.7088664Z [0s] Autotune random seed: 2135561342
2026-02-21T08:21:31.7622869Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:22:34.3220541Z [62s] Timeout after 60s compiling Config(block_sizes=[64, 1024], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[None, None])
2026-02-21T08:22:34.5529200Z [62s] Timeout after 60s compiling Config(block_sizes=[4096, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', ''], maxnreg=32, num_sm_multiplier=128, num_stages=6, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[2, 1], range_unroll_factors=[2, 2], range_warp_specializes=[False, None])
2026-02-21T08:22:35.2557561Z [63s] Timeout after 60s compiling Config(block_sizes=[256, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_sm_multiplier=16, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, False], range_num_stages=[3, 4], range_unroll_factors=[4, 3], range_warp_specializes=[None, False])
2026-02-21T08:22:35.4252739Z [63s] Timeout after 60s compiling Config(block_sizes=[16, 4096], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=32, num_sm_multiplier=16, num_stages=7, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[3, 1], range_unroll_factors=[1, 1], range_warp_specializes=[None, True])
2026-02-21T08:22:35.4270026Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.3 configs/s
2026-02-21T08:22:37.2733096Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:22:37.2738375Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:22:37.2745529Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:22:37.2745756Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:22:37.2745935Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:22:37.2746116Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:22:37.2746329Z     %cst = arith.constant dense<0.000000e+00> : tensor<256x32xf32>
2026-02-21T08:22:37.2746569Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:22:37.2746765Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:22:37.2746984Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:22:37.2747470Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:22:37.2747784Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<256x32xf32>>
2026-02-21T08:22:37.2748211Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<256x32xf32>>
2026-02-21T08:22:37.2748510Z     %2 = tt.get_program_id x : i32
2026-02-21T08:22:37.2748697Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:22:37.2748882Z     %4 = arith.minsi %3, %c16_i32 : i32
2026-02-21T08:22:37.2749056Z     %5 = arith.subi %4, %2 : i32
2026-02-21T08:22:37.2749232Z     %c1_i32_0 = arith.constant 1 : i32
2026-02-21T08:22:37.2749409Z     %6 = arith.subi %c1_i32, %c1_i32_0 : i32
2026-02-21T08:22:37.2749587Z     %7 = arith.addi %5, %6 : i32
2026-02-21T08:22:37.2749745Z     %8 = arith.divui %7, %c1_i32 : i32
2026-02-21T08:22:37.2749922Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:22:37.2750197Z     %9 = arith.remsi %8, %c2_i32 : i32
2026-02-21T08:22:37.2750378Z     %10 = arith.subi %8, %9 : i32
2026-02-21T08:22:37.2750542Z     %11 = arith.muli %10, %c1_i32 : i32
2026-02-21T08:22:37.2750723Z     %12 = arith.addi %2, %11 : i32
2026-02-21T08:22:37.2750900Z     %13 = arith.muli %c1_i32, %c2_i32 : i32
2026-02-21T08:22:37.2751090Z     scf.for %arg5 = %2 to %12 step %13  : i32 {
2026-02-21T08:22:37.2751290Z       %14 = arith.muli %arg5, %c256_i32 : i32
2026-02-21T08:22:37.2751516Z       %15 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:22:37.2751774Z       %16 = tt.splat %14 : i32 -> tensor<256xi32>
2026-02-21T08:22:37.2752074Z       %17 = arith.addi %16, %15 : tensor<256xi32>
2026-02-21T08:22:37.2752395Z       %18 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<256x32xf32>)  : i32 {
2026-02-21T08:22:37.2752809Z         %32 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<256x32xf32>> -> tensor<256x32xf32>
2026-02-21T08:22:37.2753182Z         %33 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc<tensor<256x32xf32>> -> tensor<256x32xf32>
2026-02-21T08:22:37.2753487Z         %34 = scf.if %arg3 -> (tensor<256x32xf32>) {
2026-02-21T08:22:37.2753854Z           %36 = tt.extern_elementwise %33 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x32xf32>) -> tensor<256x32xf32>
2026-02-21T08:22:37.2754234Z           %37 = arith.subf %33, %32 : tensor<256x32xf32>
2026-02-21T08:22:37.2754448Z           %38 = arith.mulf %36, %37 : tensor<256x32xf32>
2026-02-21T08:22:37.2754652Z           %39 = arith.addf %38, %cst : tensor<256x32xf32>
2026-02-21T08:22:37.2754857Z           scf.yield %39 : tensor<256x32xf32>
2026-02-21T08:22:37.2755025Z         } else {
2026-02-21T08:22:37.2755189Z           %36 = tt.splat %arg4 : f32 -> tensor<256x32xf32>
2026-02-21T08:22:37.2755402Z           %37 = arith.cmpf ogt, %33, %36 : tensor<256x32xf32>
2026-02-21T08:22:37.2755628Z           %38 = arith.cmpf une, %33, %33 : tensor<256x32xf32>
2026-02-21T08:22:37.2755848Z           %39 = arith.ori %37, %38 : tensor<256x32xi1>
2026-02-21T08:22:37.2756086Z           %40 = arith.select %39, %33, %36 : tensor<256x32xi1>, tensor<256x32xf32>
2026-02-21T08:22:37.2756327Z           %41 = math.log %40 : tensor<256x32xf32>
2026-02-21T08:22:37.2756521Z           %42 = arith.subf %41, %32 : tensor<256x32xf32>
2026-02-21T08:22:37.2756723Z           %43 = arith.mulf %33, %42 : tensor<256x32xf32>
2026-02-21T08:22:37.2756923Z           %44 = arith.addf %43, %cst : tensor<256x32xf32>
2026-02-21T08:22:37.2757120Z           scf.yield %44 : tensor<256x32xf32>
2026-02-21T08:22:37.2757284Z         }
2026-02-21T08:22:37.2757435Z         %35 = arith.addf %arg7, %34 : tensor<256x32xf32>
2026-02-21T08:22:37.2757632Z         scf.yield %35 : tensor<256x32xf32>
2026-02-21T08:22:37.2757828Z       } {tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:22:37.2758039Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:22:37.2758226Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:22:37.2758494Z         %32 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:22:37.2758675Z         tt.reduce.return %32 : f32
2026-02-21T08:22:37.2758859Z       }) : (tensor<256x32xf32>) -> tensor<256xf32>
2026-02-21T08:22:37.2759088Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<256x!tt.ptr<f32>>
2026-02-21T08:22:37.2759344Z       %21 = tt.addptr %20, %17 : tensor<256x!tt.ptr<f32>>, tensor<256xi32>
2026-02-21T08:22:37.2759588Z       tt.store %21, %19 : tensor<256x!tt.ptr<f32>>
2026-02-21T08:22:37.2759781Z       %c1_i32_1 = arith.constant 1 : i32
2026-02-21T08:22:37.2759974Z       %22 = arith.muli %c1_i32, %c1_i32_1 : i32
2026-02-21T08:22:37.2760155Z       %23 = arith.addi %arg5, %22 : i32
2026-02-21T08:22:37.2760334Z       %24 = arith.muli %23, %c256_i32 : i32
2026-02-21T08:22:37.2760551Z       %25 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:22:37.2760797Z       %26 = tt.splat %24 : i32 -> tensor<256xi32>
2026-02-21T08:22:37.2761047Z       %27 = arith.addi %26, %25 : tensor<256xi32>
2026-02-21T08:22:37.2761350Z       %28 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<256x32xf32>)  : i32 {
2026-02-21T08:22:37.2761744Z         %32 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc<tensor<256x32xf32>> -> tensor<256x32xf32>
2026-02-21T08:22:37.2762149Z         %33 = tt.descriptor_load %1[%24, %arg6] : !tt.tensordesc<tensor<256x32xf32>> -> tensor<256x32xf32>
2026-02-21T08:22:37.2762441Z         %34 = scf.if %arg3 -> (tensor<256x32xf32>) {
2026-02-21T08:22:37.2762806Z           %36 = tt.extern_elementwise %33 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x32xf32>) -> tensor<256x32xf32>
2026-02-21T08:22:37.2763170Z           %37 = arith.subf %33, %32 : tensor<256x32xf32>
2026-02-21T08:22:37.2763379Z           %38 = arith.mulf %36, %37 : tensor<256x32xf32>
2026-02-21T08:22:37.2763583Z           %39 = arith.addf %38, %cst : tensor<256x32xf32>
2026-02-21T08:22:37.2763788Z           scf.yield %39 : tensor<256x32xf32>
2026-02-21T08:22:37.2763956Z         } else {
2026-02-21T08:22:37.2764124Z           %36 = tt.splat %arg4 : f32 -> tensor<256x32xf32>
2026-02-21T08:22:37.2764348Z           %37 = arith.cmpf ogt, %33, %36 : tensor<256x32xf32>
2026-02-21T08:22:37.2764565Z           %38 = arith.cmpf une, %33, %33 : tensor<256x32xf32>
2026-02-21T08:22:37.2764778Z           %39 = arith.ori %37, %38 : tensor<256x32xi1>
2026-02-21T08:22:37.2765014Z           %40 = arith.select %39, %33, %36 : tensor<256x32xi1>, tensor<256x32xf32>
2026-02-21T08:22:37.2765259Z           %41 = math.log %40 : tensor<256x32xf32>
2026-02-21T08:22:37.2765449Z           %42 = arith.subf %41, %32 : tensor<256x32xf32>
2026-02-21T08:22:37.2765649Z           %43 = arith.mulf %33, %42 : tensor<256x32xf32>
2026-02-21T08:22:37.2765856Z           %44 = arith.addf %43, %cst : tensor<256x32xf32>
2026-02-21T08:22:37.2766044Z           scf.yield %44 : tensor<256x32xf32>
2026-02-21T08:22:37.2766212Z         }
2026-02-21T08:22:37.2766356Z         %35 = arith.addf %arg7, %34 : tensor<256x32xf32>
2026-02-21T08:22:37.2766552Z         scf.yield %35 : tensor<256x32xf32>
2026-02-21T08:22:37.2766746Z       } {tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:22:37.2766953Z       %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({
2026-02-21T08:22:37.2767141Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:22:37.2767328Z         %32 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:22:37.2767519Z         tt.reduce.return %32 : f32
2026-02-21T08:22:37.2767704Z       }) : (tensor<256x32xf32>) -> tensor<256xf32>
2026-02-21T08:22:37.2767945Z       %30 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<256x!tt.ptr<f32>>
2026-02-21T08:22:37.2768215Z       %31 = tt.addptr %30, %27 : tensor<256x!tt.ptr<f32>>, tensor<256xi32>
2026-02-21T08:22:37.2768462Z       tt.store %31, %29 : tensor<256x!tt.ptr<f32>>
2026-02-21T08:22:37.2768665Z     } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:22:37.2768881Z     scf.for %arg5 = %12 to %4 step %c1_i32  : i32 {
2026-02-21T08:22:37.2769154Z       %14 = arith.muli %arg5, %c256_i32 : i32
2026-02-21T08:22:37.2769390Z       %15 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:22:37.2769648Z       %16 = tt.splat %14 : i32 -> tensor<256xi32>
2026-02-21T08:22:37.2769846Z       %17 = arith.addi %16, %15 : tensor<256xi32>
2026-02-21T08:22:37.2770165Z       %18 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<256x32xf32>)  : i32 {
2026-02-21T08:22:37.2770578Z         %22 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<256x32xf32>> -> tensor<256x32xf32>
2026-02-21T08:22:37.2770964Z         %23 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc<tensor<256x32xf32>> -> tensor<256x32xf32>
2026-02-21T08:22:37.2771268Z         %24 = scf.if %arg3 -> (tensor<256x32xf32>) {
2026-02-21T08:22:37.2771696Z           %26 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x32xf32>) -> tensor<256x32xf32>
2026-02-21T08:22:37.2772114Z           %27 = arith.subf %23, %22 : tensor<256x32xf32>
2026-02-21T08:22:37.2772322Z           %28 = arith.mulf %26, %27 : tensor<256x32xf32>
2026-02-21T08:22:37.2772539Z           %29 = arith.addf %28, %cst : tensor<256x32xf32>
2026-02-21T08:22:37.2772748Z           scf.yield %29 : tensor<256x32xf32>
2026-02-21T08:22:37.2772921Z         } else {
2026-02-21T08:22:37.2773091Z           %26 = tt.splat %arg4 : f32 -> tensor<256x32xf32>
2026-02-21T08:22:37.2773316Z           %27 = arith.cmpf ogt, %23, %26 : tensor<256x32xf32>
2026-02-21T08:22:37.2773552Z           %28 = arith.cmpf une, %23, %23 : tensor<256x32xf32>
2026-02-21T08:22:37.2773766Z           %29 = arith.ori %27, %28 : tensor<256x32xi1>
2026-02-21T08:22:37.2774021Z           %30 = arith.select %29, %23, %26 : tensor<256x32xi1>, tensor<256x32xf32>
2026-02-21T08:22:37.2774278Z           %31 = math.log %30 : tensor<256x32xf32>
2026-02-21T08:22:37.2774481Z           %32 = arith.subf %31, %22 : tensor<256x32xf32>
2026-02-21T08:22:37.2774698Z           %33 = arith.mulf %23, %32 : tensor<256x32xf32>
2026-02-21T08:22:37.2774910Z           %34 = arith.addf %33, %cst : tensor<256x32xf32>
2026-02-21T08:22:37.2775117Z           scf.yield %34 : tensor<256x32xf32>
2026-02-21T08:22:37.2775280Z         }
2026-02-21T08:22:37.2775430Z         %25 = arith.addf %arg7, %24 : tensor<256x32xf32>
2026-02-21T08:22:37.2775617Z         scf.yield %25 : tensor<256x32xf32>
2026-02-21T08:22:37.2775816Z       } {tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:22:37.2776019Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:22:37.2776199Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:22:37.2776375Z         %22 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:22:37.2776550Z         tt.reduce.return %22 : f32
2026-02-21T08:22:37.2776733Z       }) : (tensor<256x32xf32>) -> tensor<256xf32>
2026-02-21T08:22:37.2776958Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<256x!tt.ptr<f32>>
2026-02-21T08:22:37.2777224Z       %21 = tt.addptr %20, %17 : tensor<256x!tt.ptr<f32>>, tensor<256xi32>
2026-02-21T08:22:37.2777465Z       tt.store %21, %19 : tensor<256x!tt.ptr<f32>>
2026-02-21T08:22:37.2777656Z     } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:22:37.2777829Z     tt.return
2026-02-21T08:22:37.2777948Z   }
2026-02-21T08:22:37.2778068Z }
2026-02-21T08:22:37.2778133Z 
2026-02-21T08:22:37.2778182Z {-#
2026-02-21T08:22:37.2778310Z   external_resources: {
2026-02-21T08:22:37.2778463Z     mlir_reproducer: {
2026-02-21T08:22:37.2782928Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:22:37.2787426Z       disable_threading: false,
2026-02-21T08:22:37.2787591Z       verify_each: true
2026-02-21T08:22:37.2787739Z     }
2026-02-21T08:22:37.2787862Z   }
2026-02-21T08:22:37.2787973Z #-}
2026-02-21T08:22:37.2788411Z /tmp/torchinductor_root/g7/cg7uh4kqlcjratoh3u7e26ozcc5fu6il4yz5rozohovifnytn3wd.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:22:37.2789639Z /tmp/torchinductor_root/g7/cg7uh4kqlcjratoh3u7e26ozcc5fu6il4yz5rozohovifnytn3wd.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:22:37.2790653Z [65s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:22:37.2791801Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last'], maxnreg=128, num_sm_multiplier=16, num_stages=6, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:22:37.2792822Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:22:37.2793069Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:22:37.7676007Z module attributes {ttg.maxnreg = 256 : i32} {
2026-02-21T08:22:37.7676858Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:22:37.7677530Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:22:37.7677750Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:22:37.7677951Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:22:37.7678193Z     %cst = arith.constant dense<0.000000e+00> : tensor<64x256xf32>
2026-02-21T08:22:37.7678452Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:22:37.7678638Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:22:37.7678833Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:22:37.7679036Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:22:37.7679419Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<64x256xf32>>
2026-02-21T08:22:37.7679929Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<64x256xf32>>
2026-02-21T08:22:37.7680294Z     %2 = tt.get_program_id x : i32
2026-02-21T08:22:37.7680881Z     scf.for %arg5 = %2 to %c64_i32 step %c9472_i32  : i32 {
2026-02-21T08:22:37.7681103Z       %3 = arith.muli %arg5, %c64_i32 : i32
2026-02-21T08:22:37.7681344Z       %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:22:37.7681592Z       %5 = tt.splat %3 : i32 -> tensor<64xi32>
2026-02-21T08:22:37.7681784Z       %6 = arith.addi %5, %4 : tensor<64xi32>
2026-02-21T08:22:37.7682051Z       %c3840_i32 = arith.constant 3840 : i32
2026-02-21T08:22:37.7682231Z       %c768_i32 = arith.constant 768 : i32
2026-02-21T08:22:37.7682544Z       %7 = scf.for %arg6 = %c0_i32 to %c3840_i32 step %c768_i32 iter_args(%arg7 = %cst) -> (tensor<64x256xf32>)  : i32 {
2026-02-21T08:22:37.7682944Z         %15 = tt.descriptor_load %0[%3, %arg6] : !tt.tensordesc<tensor<64x256xf32>> -> tensor<64x256xf32>
2026-02-21T08:22:37.7683320Z         %16 = tt.descriptor_load %1[%3, %arg6] : !tt.tensordesc<tensor<64x256xf32>> -> tensor<64x256xf32>
2026-02-21T08:22:37.7683707Z         %17 = scf.if %arg3 -> (tensor<64x256xf32>) {
2026-02-21T08:22:37.7684086Z           %31 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x256xf32>) -> tensor<64x256xf32>
2026-02-21T08:22:37.7684462Z           %32 = arith.subf %16, %15 : tensor<64x256xf32>
2026-02-21T08:22:37.7684665Z           %33 = arith.mulf %31, %32 : tensor<64x256xf32>
2026-02-21T08:22:37.7684877Z           %34 = arith.addf %33, %cst : tensor<64x256xf32>
2026-02-21T08:22:37.7685073Z           scf.yield %34 : tensor<64x256xf32>
2026-02-21T08:22:37.7685250Z         } else {
2026-02-21T08:22:37.7685409Z           %31 = tt.splat %arg4 : f32 -> tensor<64x256xf32>
2026-02-21T08:22:37.7685633Z           %32 = arith.cmpf ogt, %16, %31 : tensor<64x256xf32>
2026-02-21T08:22:37.7685858Z           %33 = arith.cmpf une, %16, %16 : tensor<64x256xf32>
2026-02-21T08:22:37.7686064Z           %34 = arith.ori %32, %33 : tensor<64x256xi1>
2026-02-21T08:22:37.7686311Z           %35 = arith.select %34, %16, %31 : tensor<64x256xi1>, tensor<64x256xf32>
2026-02-21T08:22:37.7686553Z           %36 = math.log %35 : tensor<64x256xf32>
2026-02-21T08:22:37.7686756Z           %37 = arith.subf %36, %15 : tensor<64x256xf32>
2026-02-21T08:22:37.7686952Z           %38 = arith.mulf %16, %37 : tensor<64x256xf32>
2026-02-21T08:22:37.7687158Z           %39 = arith.addf %38, %cst : tensor<64x256xf32>
2026-02-21T08:22:37.7687356Z           scf.yield %39 : tensor<64x256xf32>
2026-02-21T08:22:37.7687525Z         }
2026-02-21T08:22:37.7687675Z         %18 = arith.addf %arg7, %17 : tensor<64x256xf32>
2026-02-21T08:22:37.7687866Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:22:37.7688056Z         %19 = arith.muli %c256_i32, %c1_i32 : i32
2026-02-21T08:22:37.7688237Z         %20 = arith.addi %arg6, %19 : i32
2026-02-21T08:22:37.7688507Z         %21 = tt.descriptor_load %0[%3, %20] : !tt.tensordesc<tensor<64x256xf32>> -> tensor<64x256xf32>
2026-02-21T08:22:37.7688872Z         %22 = tt.descriptor_load %1[%3, %20] : !tt.tensordesc<tensor<64x256xf32>> -> tensor<64x256xf32>
2026-02-21T08:22:37.7689157Z         %23 = scf.if %arg3 -> (tensor<64x256xf32>) {
2026-02-21T08:22:37.7689517Z           %31 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x256xf32>) -> tensor<64x256xf32>
2026-02-21T08:22:37.7689874Z           %32 = arith.subf %22, %21 : tensor<64x256xf32>
2026-02-21T08:22:37.7690077Z           %33 = arith.mulf %31, %32 : tensor<64x256xf32>
2026-02-21T08:22:37.7690276Z           %34 = arith.addf %33, %cst : tensor<64x256xf32>
2026-02-21T08:22:37.7690472Z           scf.yield %34 : tensor<64x256xf32>
2026-02-21T08:22:37.7690645Z         } else {
2026-02-21T08:22:37.7690800Z           %31 = tt.splat %arg4 : f32 -> tensor<64x256xf32>
2026-02-21T08:22:37.7691017Z           %32 = arith.cmpf ogt, %22, %31 : tensor<64x256xf32>
2026-02-21T08:22:37.7691232Z           %33 = arith.cmpf une, %22, %22 : tensor<64x256xf32>
2026-02-21T08:22:37.7691450Z           %34 = arith.ori %32, %33 : tensor<64x256xi1>
2026-02-21T08:22:37.7691792Z           %35 = arith.select %34, %22, %31 : tensor<64x256xi1>, tensor<64x256xf32>
2026-02-21T08:22:37.7692072Z           %36 = math.log %35 : tensor<64x256xf32>
2026-02-21T08:22:37.7692269Z           %37 = arith.subf %36, %21 : tensor<64x256xf32>
2026-02-21T08:22:37.7692465Z           %38 = arith.mulf %22, %37 : tensor<64x256xf32>
2026-02-21T08:22:37.7692675Z           %39 = arith.addf %38, %cst : tensor<64x256xf32>
2026-02-21T08:22:37.7692866Z           scf.yield %39 : tensor<64x256xf32>
2026-02-21T08:22:37.7693035Z         }
2026-02-21T08:22:37.7693173Z         %24 = arith.addf %18, %23 : tensor<64x256xf32>
2026-02-21T08:22:37.7693371Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:22:37.7693561Z         %25 = arith.muli %c256_i32, %c2_i32 : i32
2026-02-21T08:22:37.7693744Z         %26 = arith.addi %arg6, %25 : i32
2026-02-21T08:22:37.7694078Z         %27 = tt.descriptor_load %0[%3, %26] : !tt.tensordesc<tensor<64x256xf32>> -> tensor<64x256xf32>
2026-02-21T08:22:37.7694431Z         %28 = tt.descriptor_load %1[%3, %26] : !tt.tensordesc<tensor<64x256xf32>> -> tensor<64x256xf32>
2026-02-21T08:22:37.7694711Z         %29 = scf.if %arg3 -> (tensor<64x256xf32>) {
2026-02-21T08:22:37.7695058Z           %31 = tt.extern_elementwise %28 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x256xf32>) -> tensor<64x256xf32>
2026-02-21T08:22:37.7695417Z           %32 = arith.subf %28, %27 : tensor<64x256xf32>
2026-02-21T08:22:37.7695619Z           %33 = arith.mulf %31, %32 : tensor<64x256xf32>
2026-02-21T08:22:37.7695814Z           %34 = arith.addf %33, %cst : tensor<64x256xf32>
2026-02-21T08:22:37.7696009Z           scf.yield %34 : tensor<64x256xf32>
2026-02-21T08:22:37.7696172Z         } else {
2026-02-21T08:22:37.7696334Z           %31 = tt.splat %arg4 : f32 -> tensor<64x256xf32>
2026-02-21T08:22:37.7696544Z           %32 = arith.cmpf ogt, %28, %31 : tensor<64x256xf32>
2026-02-21T08:22:37.7696763Z           %33 = arith.cmpf une, %28, %28 : tensor<64x256xf32>
2026-02-21T08:22:37.7696972Z           %34 = arith.ori %32, %33 : tensor<64x256xi1>
2026-02-21T08:22:37.7697200Z           %35 = arith.select %34, %28, %31 : tensor<64x256xi1>, tensor<64x256xf32>
2026-02-21T08:22:37.7697440Z           %36 = math.log %35 : tensor<64x256xf32>
2026-02-21T08:22:37.7697627Z           %37 = arith.subf %36, %27 : tensor<64x256xf32>
2026-02-21T08:22:37.7697835Z           %38 = arith.mulf %28, %37 : tensor<64x256xf32>
2026-02-21T08:22:37.7698040Z           %39 = arith.addf %38, %cst : tensor<64x256xf32>
2026-02-21T08:22:37.7698245Z           scf.yield %39 : tensor<64x256xf32>
2026-02-21T08:22:37.7698421Z         }
2026-02-21T08:22:37.7698565Z         %30 = arith.addf %24, %29 : tensor<64x256xf32>
2026-02-21T08:22:37.7698766Z         scf.yield %30 : tensor<64x256xf32>
2026-02-21T08:22:37.7698951Z       } {tt.num_stages = 1 : i32}
2026-02-21T08:22:37.7699244Z       %8 = tt.descriptor_load %0[%3, %c3840_i32] : !tt.tensordesc<tensor<64x256xf32>> -> tensor<64x256xf32>
2026-02-21T08:22:37.7699638Z       %9 = tt.descriptor_load %1[%3, %c3840_i32] : !tt.tensordesc<tensor<64x256xf32>> -> tensor<64x256xf32>
2026-02-21T08:22:37.7699943Z       %10 = scf.if %arg3 -> (tensor<64x256xf32>) {
2026-02-21T08:22:37.7700322Z         %15 = tt.extern_elementwise %9 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x256xf32>) -> tensor<64x256xf32>
2026-02-21T08:22:37.7700694Z         %16 = arith.subf %9, %8 : tensor<64x256xf32>
2026-02-21T08:22:37.7700909Z         %17 = arith.mulf %15, %16 : tensor<64x256xf32>
2026-02-21T08:22:37.7701117Z         %18 = arith.addf %17, %cst : tensor<64x256xf32>
2026-02-21T08:22:37.7701322Z         scf.yield %18 : tensor<64x256xf32>
2026-02-21T08:22:37.7701494Z       } else {
2026-02-21T08:22:37.7701658Z         %15 = tt.splat %arg4 : f32 -> tensor<64x256xf32>
2026-02-21T08:22:37.7701899Z         %16 = arith.cmpf ogt, %9, %15 : tensor<64x256xf32>
2026-02-21T08:22:37.7702130Z         %17 = arith.cmpf une, %9, %9 : tensor<64x256xf32>
2026-02-21T08:22:37.7702410Z         %18 = arith.ori %16, %17 : tensor<64x256xi1>
2026-02-21T08:22:37.7702646Z         %19 = arith.select %18, %9, %15 : tensor<64x256xi1>, tensor<64x256xf32>
2026-02-21T08:22:37.7702897Z         %20 = math.log %19 : tensor<64x256xf32>
2026-02-21T08:22:37.7703099Z         %21 = arith.subf %20, %8 : tensor<64x256xf32>
2026-02-21T08:22:37.7703309Z         %22 = arith.mulf %9, %21 : tensor<64x256xf32>
2026-02-21T08:22:37.7703517Z         %23 = arith.addf %22, %cst : tensor<64x256xf32>
2026-02-21T08:22:37.7703719Z         scf.yield %23 : tensor<64x256xf32>
2026-02-21T08:22:37.7703898Z       }
2026-02-21T08:22:37.7704043Z       %11 = arith.addf %7, %10 : tensor<64x256xf32>
2026-02-21T08:22:37.7704274Z       %12 = "tt.reduce"(%11) <{axis = 1 : i32}> ({
2026-02-21T08:22:37.7704465Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:22:37.7704653Z         %15 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:22:37.7704839Z         tt.reduce.return %15 : f32
2026-02-21T08:22:37.7705088Z       }) : (tensor<64x256xf32>) -> tensor<64xf32>
2026-02-21T08:22:37.7705333Z       %13 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
2026-02-21T08:22:37.7705600Z       %14 = tt.addptr %13, %6 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
2026-02-21T08:22:37.7705836Z       tt.store %14, %12 : tensor<64x!tt.ptr<f32>>
2026-02-21T08:22:37.7706091Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:22:37.7706339Z     tt.return
2026-02-21T08:22:37.7706458Z   }
2026-02-21T08:22:37.7706578Z }
2026-02-21T08:22:37.7706643Z 
2026-02-21T08:22:37.7706697Z {-#
2026-02-21T08:22:37.7706817Z   external_resources: {
2026-02-21T08:22:37.7706972Z     mlir_reproducer: {
2026-02-21T08:22:37.7711254Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:22:37.7715605Z       disable_threading: false,
2026-02-21T08:22:37.7715774Z       verify_each: true
2026-02-21T08:22:37.7715910Z     }
2026-02-21T08:22:37.7716029Z   }
2026-02-21T08:22:37.7716133Z #-}
2026-02-21T08:22:37.7716545Z /tmp/torchinductor_root/zw/czwn57nyont3ac4ro5t4qpyubljjxltznojztyn6lmiuy36skcnd.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:22:37.7717725Z /tmp/torchinductor_root/zw/czwn57nyont3ac4ro5t4qpyubljjxltznojztyn6lmiuy36skcnd.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:22:37.7718737Z [66s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:22:37.7719805Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=256, num_sm_multiplier=64, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[4, 4], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:22:37.7720771Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:22:37.7721014Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:22:37.7989146Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:22:37.7994543Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:22:37.7995467Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:22:37.7995671Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:22:37.7995892Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:22:37.7996110Z     %cst = arith.constant dense<0.000000e+00> : tensor<128x32xf32>
2026-02-21T08:22:37.7996339Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:22:37.7996517Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:22:37.7996702Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:22:37.7996868Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:22:37.7997188Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<128x32xf32>>
2026-02-21T08:22:37.7997622Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<128x32xf32>>
2026-02-21T08:22:37.7997923Z     %2 = tt.get_program_id x : i32
2026-02-21T08:22:37.7998099Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:22:37.7998271Z     %4 = arith.minsi %3, %c32_i32 : i32
2026-02-21T08:22:37.7998468Z     scf.for %arg5 = %2 to %4 step %c1_i32  : i32 {
2026-02-21T08:22:37.7998666Z       %5 = arith.muli %arg5, %c128_i32 : i32
2026-02-21T08:22:37.7998896Z       %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:22:37.7999139Z       %7 = tt.splat %5 : i32 -> tensor<128xi32>
2026-02-21T08:22:37.7999332Z       %8 = arith.addi %7, %6 : tensor<128xi32>
2026-02-21T08:22:37.7999637Z       %9 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<128x32xf32>)  : i32 {
2026-02-21T08:22:37.8000033Z         %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc<tensor<128x32xf32>> -> tensor<128x32xf32>
2026-02-21T08:22:37.8000405Z         %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc<tensor<128x32xf32>> -> tensor<128x32xf32>
2026-02-21T08:22:37.8000693Z         %15 = scf.if %arg3 -> (tensor<128x32xf32>) {
2026-02-21T08:22:37.8001063Z           %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x32xf32>) -> tensor<128x32xf32>
2026-02-21T08:22:37.8001432Z           %18 = arith.subf %14, %13 : tensor<128x32xf32>
2026-02-21T08:22:37.8001635Z           %19 = arith.mulf %17, %18 : tensor<128x32xf32>
2026-02-21T08:22:37.8001846Z           %20 = arith.addf %19, %cst : tensor<128x32xf32>
2026-02-21T08:22:37.8002120Z           scf.yield %20 : tensor<128x32xf32>
2026-02-21T08:22:37.8002297Z         } else {
2026-02-21T08:22:37.8002459Z           %17 = tt.splat %arg4 : f32 -> tensor<128x32xf32>
2026-02-21T08:22:37.8002686Z           %18 = arith.cmpf ogt, %14, %17 : tensor<128x32xf32>
2026-02-21T08:22:37.8003120Z           %19 = arith.cmpf une, %14, %14 : tensor<128x32xf32>
2026-02-21T08:22:37.8003328Z           %20 = arith.ori %18, %19 : tensor<128x32xi1>
2026-02-21T08:22:37.8003569Z           %21 = arith.select %20, %14, %17 : tensor<128x32xi1>, tensor<128x32xf32>
2026-02-21T08:22:37.8003807Z           %22 = math.log %21 : tensor<128x32xf32>
2026-02-21T08:22:37.8004011Z           %23 = arith.subf %22, %13 : tensor<128x32xf32>
2026-02-21T08:22:37.8004209Z           %24 = arith.mulf %14, %23 : tensor<128x32xf32>
2026-02-21T08:22:37.8004417Z           %25 = arith.addf %24, %cst : tensor<128x32xf32>
2026-02-21T08:22:37.8004617Z           scf.yield %25 : tensor<128x32xf32>
2026-02-21T08:22:37.8004780Z         }
2026-02-21T08:22:37.8004932Z         %16 = arith.addf %arg7, %15 : tensor<128x32xf32>
2026-02-21T08:22:37.8005122Z         scf.yield %16 : tensor<128x32xf32>
2026-02-21T08:22:37.8005399Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:22:37.8005740Z       %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({
2026-02-21T08:22:37.8005955Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:22:37.8006138Z         %13 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:22:37.8006324Z         tt.reduce.return %13 : f32
2026-02-21T08:22:37.8006522Z       }) : (tensor<128x32xf32>) -> tensor<128xf32>
2026-02-21T08:22:37.8006762Z       %11 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<128x!tt.ptr<f32>>
2026-02-21T08:22:37.8007021Z       %12 = tt.addptr %11, %8 : tensor<128x!tt.ptr<f32>>, tensor<128xi32>
2026-02-21T08:22:37.8007266Z       tt.store %12, %10 : tensor<128x!tt.ptr<f32>>
2026-02-21T08:22:37.8007456Z     } {tt.loop_unroll_factor = 1 : i32}
2026-02-21T08:22:37.8007627Z     tt.return
2026-02-21T08:22:37.8007747Z   }
2026-02-21T08:22:37.8007865Z }
2026-02-21T08:22:37.8007930Z 
2026-02-21T08:22:37.8007978Z {-#
2026-02-21T08:22:37.8008108Z   external_resources: {
2026-02-21T08:22:37.8008261Z     mlir_reproducer: {
2026-02-21T08:22:37.8012716Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=1}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=1}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=1}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:22:37.8017302Z       disable_threading: false,
2026-02-21T08:22:37.8017467Z       verify_each: true
2026-02-21T08:22:37.8017618Z     }
2026-02-21T08:22:37.8017732Z   }
2026-02-21T08:22:37.8017849Z #-}
2026-02-21T08:22:37.8018304Z /tmp/torchinductor_root/eg/cegvfk24qtnxvmxn4bycu6f6zzbkjopadvnuuaxwludmaet2ilma.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:22:37.8019682Z /tmp/torchinductor_root/eg/cegvfk24qtnxvmxn4bycu6f6zzbkjopadvnuuaxwludmaet2ilma.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:22:37.8020708Z [66s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:22:37.8021916Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=2, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, False], range_num_stages=[0, 2], range_unroll_factors=[1, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:22:37.8022941Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:22:37.8023195Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:22:38.2299361Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:22:38.2301548Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:22:38.2302398Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:22:38.2302590Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:22:38.2302782Z     %c2368_i32 = arith.constant 2368 : i32
2026-02-21T08:22:38.2302988Z     %cst = arith.constant dense<4096> : tensor<4x1xi32>
2026-02-21T08:22:38.2303240Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<4x4xf32>
2026-02-21T08:22:38.2303491Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:22:38.2303687Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:22:38.2303860Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:22:38.2304038Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:22:38.2304343Z     %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<4x4xf32>>
2026-02-21T08:22:38.2304645Z     %1 = tt.get_program_id x : i32
2026-02-21T08:22:38.2304820Z     %2 = arith.subi %c1024_i32, %1 : i32
2026-02-21T08:22:38.2304989Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:22:38.2305166Z     %3 = arith.subi %c2368_i32, %c1_i32 : i32
2026-02-21T08:22:38.2305353Z     %4 = arith.addi %2, %3 : i32
2026-02-21T08:22:38.2305524Z     %5 = arith.divui %4, %c2368_i32 : i32
2026-02-21T08:22:38.2305979Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T08:22:38.2306149Z     %6 = arith.remsi %5, %c3_i32 : i32
2026-02-21T08:22:38.2306325Z     %7 = arith.subi %5, %6 : i32
2026-02-21T08:22:38.2306490Z     %8 = arith.muli %7, %c2368_i32 : i32
2026-02-21T08:22:38.2306670Z     %9 = arith.addi %1, %8 : i32
2026-02-21T08:22:38.2306840Z     %10 = arith.muli %c2368_i32, %c3_i32 : i32
2026-02-21T08:22:38.2307037Z     scf.for %arg5 = %1 to %9 step %10  : i32 {
2026-02-21T08:22:38.2307231Z       %11 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:22:38.2307451Z       %12 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:22:38.2307699Z       %13 = tt.splat %11 : i32 -> tensor<4xi32>
2026-02-21T08:22:38.2307885Z       %14 = arith.addi %13, %12 : tensor<4xi32>
2026-02-21T08:22:38.2308191Z       %15 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:22:38.2308497Z         %39 = tt.splat %arg6 : i32 -> tensor<4xi32>
2026-02-21T08:22:38.2308702Z         %40 = arith.addi %39, %12 : tensor<4xi32>
2026-02-21T08:22:38.2308994Z         %41 = tt.expand_dims %14 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:22:38.2312740Z         %42 = arith.muli %41, %cst : tensor<4x1xi32>
2026-02-21T08:22:38.2317431Z         %43 = tt.expand_dims %40 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32>
2026-02-21T08:22:38.2319466Z         %44 = tt.broadcast %42 : tensor<4x1xi32> -> tensor<4x4xi32>
2026-02-21T08:22:38.2319807Z         %45 = tt.broadcast %43 : tensor<1x4xi32> -> tensor<4x4xi32>
2026-02-21T08:22:38.2324561Z         %46 = arith.addi %44, %45 : tensor<4x4xi32>
2026-02-21T08:22:38.2324922Z         %47 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x4x!tt.ptr<f32>>
2026-02-21T08:22:38.2325231Z         %48 = tt.addptr %47, %46 : tensor<4x4x!tt.ptr<f32>>, tensor<4x4xi32>
2026-02-21T08:22:38.2330391Z         %49 = tt.load %48 evictionPolicy = evict_last : tensor<4x4x!tt.ptr<f32>>
2026-02-21T08:22:38.2334419Z         %50 = tt.descriptor_load %0[%11, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:22:38.2336447Z         %51 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:22:38.2337137Z           %53 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:22:38.2337537Z           %54 = arith.subf %50, %49 : tensor<4x4xf32>
2026-02-21T08:22:38.2341995Z           %55 = arith.mulf %53, %54 : tensor<4x4xf32>
2026-02-21T08:22:38.2345806Z           %56 = arith.addf %55, %cst_0 : tensor<4x4xf32>
2026-02-21T08:22:38.2350283Z           scf.yield %56 : tensor<4x4xf32>
2026-02-21T08:22:38.2351906Z         } else {
2026-02-21T08:22:38.2352128Z           %53 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:22:38.2352426Z           %54 = arith.cmpf ogt, %50, %53 : tensor<4x4xf32>
2026-02-21T08:22:38.2356903Z           %55 = arith.cmpf une, %50, %50 : tensor<4x4xf32>
2026-02-21T08:22:38.2357225Z           %56 = arith.ori %54, %55 : tensor<4x4xi1>
2026-02-21T08:22:38.2357494Z           %57 = arith.select %56, %50, %53 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:22:38.2363296Z           %58 = math.log %57 : tensor<4x4xf32>
2026-02-21T08:22:38.2365194Z           %59 = arith.subf %58, %49 : tensor<4x4xf32>
2026-02-21T08:22:38.2365432Z           %60 = arith.mulf %50, %59 : tensor<4x4xf32>
2026-02-21T08:22:38.2365645Z           %61 = arith.addf %60, %cst_0 : tensor<4x4xf32>
2026-02-21T08:22:38.2365842Z           scf.yield %61 : tensor<4x4xf32>
2026-02-21T08:22:38.2366016Z         }
2026-02-21T08:22:38.2366161Z         %52 = arith.addf %arg7, %51 : tensor<4x4xf32>
2026-02-21T08:22:38.2366357Z         scf.yield %52 : tensor<4x4xf32>
2026-02-21T08:22:38.2366607Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:22:38.2366882Z       %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({
2026-02-21T08:22:38.2367075Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:22:38.2367247Z         %39 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:22:38.2367431Z         tt.reduce.return %39 : f32
2026-02-21T08:22:38.2367612Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:22:38.2367854Z       %17 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:22:38.2368111Z       %18 = tt.addptr %17, %14 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:22:38.2368346Z       tt.store %18, %16 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:22:38.2368544Z       %c1_i32_1 = arith.constant 1 : i32
2026-02-21T08:22:38.2368724Z       %19 = arith.muli %c2368_i32, %c1_i32_1 : i32
2026-02-21T08:22:38.2368917Z       %20 = arith.addi %arg5, %19 : i32
2026-02-21T08:22:38.2369088Z       %21 = arith.muli %20, %c4_i32 : i32
2026-02-21T08:22:38.2369308Z       %22 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:22:38.2369539Z       %23 = tt.splat %21 : i32 -> tensor<4xi32>
2026-02-21T08:22:38.2369767Z       %24 = arith.addi %23, %22 : tensor<4xi32>
2026-02-21T08:22:38.2370076Z       %25 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:22:38.2370389Z         %39 = tt.splat %arg6 : i32 -> tensor<4xi32>
2026-02-21T08:22:38.2370586Z         %40 = arith.addi %39, %22 : tensor<4xi32>
2026-02-21T08:22:38.2371044Z         %41 = tt.expand_dims %24 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:22:38.2371300Z         %42 = arith.muli %41, %cst : tensor<4x1xi32>
2026-02-21T08:22:38.2371546Z         %43 = tt.expand_dims %40 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32>
2026-02-21T08:22:38.2371835Z         %44 = tt.broadcast %42 : tensor<4x1xi32> -> tensor<4x4xi32>
2026-02-21T08:22:38.2372154Z         %45 = tt.broadcast %43 : tensor<1x4xi32> -> tensor<4x4xi32>
2026-02-21T08:22:38.2372393Z         %46 = arith.addi %44, %45 : tensor<4x4xi32>
2026-02-21T08:22:38.2372622Z         %47 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x4x!tt.ptr<f32>>
2026-02-21T08:22:38.2372895Z         %48 = tt.addptr %47, %46 : tensor<4x4x!tt.ptr<f32>>, tensor<4x4xi32>
2026-02-21T08:22:38.2373179Z         %49 = tt.load %48 evictionPolicy = evict_last : tensor<4x4x!tt.ptr<f32>>
2026-02-21T08:22:38.2373592Z         %50 = tt.descriptor_load %0[%21, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:22:38.2373888Z         %51 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:22:38.2374242Z           %53 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:22:38.2374601Z           %54 = arith.subf %50, %49 : tensor<4x4xf32>
2026-02-21T08:22:38.2374795Z           %55 = arith.mulf %53, %54 : tensor<4x4xf32>
2026-02-21T08:22:38.2375005Z           %56 = arith.addf %55, %cst_0 : tensor<4x4xf32>
2026-02-21T08:22:38.2375195Z           scf.yield %56 : tensor<4x4xf32>
2026-02-21T08:22:38.2375365Z         } else {
2026-02-21T08:22:38.2375526Z           %53 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:22:38.2375731Z           %54 = arith.cmpf ogt, %50, %53 : tensor<4x4xf32>
2026-02-21T08:22:38.2375947Z           %55 = arith.cmpf une, %50, %50 : tensor<4x4xf32>
2026-02-21T08:22:38.2376145Z           %56 = arith.ori %54, %55 : tensor<4x4xi1>
2026-02-21T08:22:38.2376377Z           %57 = arith.select %56, %50, %53 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:22:38.2376603Z           %58 = math.log %57 : tensor<4x4xf32>
2026-02-21T08:22:38.2376794Z           %59 = arith.subf %58, %49 : tensor<4x4xf32>
2026-02-21T08:22:38.2376985Z           %60 = arith.mulf %50, %59 : tensor<4x4xf32>
2026-02-21T08:22:38.2377179Z           %61 = arith.addf %60, %cst_0 : tensor<4x4xf32>
2026-02-21T08:22:38.2377371Z           scf.yield %61 : tensor<4x4xf32>
2026-02-21T08:22:38.2377530Z         }
2026-02-21T08:22:38.2377670Z         %52 = arith.addf %arg7, %51 : tensor<4x4xf32>
2026-02-21T08:22:38.2377851Z         scf.yield %52 : tensor<4x4xf32>
2026-02-21T08:22:38.2378093Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:22:38.2378346Z       %26 = "tt.reduce"(%25) <{axis = 1 : i32}> ({
2026-02-21T08:22:38.2378532Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:22:38.2378706Z         %39 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:22:38.2378883Z         tt.reduce.return %39 : f32
2026-02-21T08:22:38.2379065Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:22:38.2379275Z       %27 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:22:38.2379527Z       %28 = tt.addptr %27, %24 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:22:38.2379748Z       tt.store %28, %26 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:22:38.2379940Z       %c2_i32 = arith.constant 2 : i32
2026-02-21T08:22:38.2380123Z       %29 = arith.muli %c2368_i32, %c2_i32 : i32
2026-02-21T08:22:38.2380300Z       %30 = arith.addi %arg5, %29 : i32
2026-02-21T08:22:38.2380476Z       %31 = arith.muli %30, %c4_i32 : i32
2026-02-21T08:22:38.2380684Z       %32 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:22:38.2380918Z       %33 = tt.splat %31 : i32 -> tensor<4xi32>
2026-02-21T08:22:38.2381101Z       %34 = arith.addi %33, %32 : tensor<4xi32>
2026-02-21T08:22:38.2381403Z       %35 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:22:38.2381784Z         %39 = tt.splat %arg6 : i32 -> tensor<4xi32>
2026-02-21T08:22:38.2382021Z         %40 = arith.addi %39, %32 : tensor<4xi32>
2026-02-21T08:22:38.2382265Z         %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:22:38.2382515Z         %42 = arith.muli %41, %cst : tensor<4x1xi32>
2026-02-21T08:22:38.2382758Z         %43 = tt.expand_dims %40 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32>
2026-02-21T08:22:38.2383024Z         %44 = tt.broadcast %42 : tensor<4x1xi32> -> tensor<4x4xi32>
2026-02-21T08:22:38.2383269Z         %45 = tt.broadcast %43 : tensor<1x4xi32> -> tensor<4x4xi32>
2026-02-21T08:22:38.2383492Z         %46 = arith.addi %44, %45 : tensor<4x4xi32>
2026-02-21T08:22:38.2383717Z         %47 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x4x!tt.ptr<f32>>
2026-02-21T08:22:38.2384042Z         %48 = tt.addptr %47, %46 : tensor<4x4x!tt.ptr<f32>>, tensor<4x4xi32>
2026-02-21T08:22:38.2384318Z         %49 = tt.load %48 evictionPolicy = evict_last : tensor<4x4x!tt.ptr<f32>>
2026-02-21T08:22:38.2384638Z         %50 = tt.descriptor_load %0[%31, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:22:38.2384914Z         %51 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:22:38.2385264Z           %53 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:22:38.2385619Z           %54 = arith.subf %50, %49 : tensor<4x4xf32>
2026-02-21T08:22:38.2385812Z           %55 = arith.mulf %53, %54 : tensor<4x4xf32>
2026-02-21T08:22:38.2386017Z           %56 = arith.addf %55, %cst_0 : tensor<4x4xf32>
2026-02-21T08:22:38.2386205Z           scf.yield %56 : tensor<4x4xf32>
2026-02-21T08:22:38.2386373Z         } else {
2026-02-21T08:22:38.2386526Z           %53 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:22:38.2386742Z           %54 = arith.cmpf ogt, %50, %53 : tensor<4x4xf32>
2026-02-21T08:22:38.2386958Z           %55 = arith.cmpf une, %50, %50 : tensor<4x4xf32>
2026-02-21T08:22:38.2387156Z           %56 = arith.ori %54, %55 : tensor<4x4xi1>
2026-02-21T08:22:38.2387383Z           %57 = arith.select %56, %50, %53 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:22:38.2387607Z           %58 = math.log %57 : tensor<4x4xf32>
2026-02-21T08:22:38.2387800Z           %59 = arith.subf %58, %49 : tensor<4x4xf32>
2026-02-21T08:22:38.2387985Z           %60 = arith.mulf %50, %59 : tensor<4x4xf32>
2026-02-21T08:22:38.2388187Z           %61 = arith.addf %60, %cst_0 : tensor<4x4xf32>
2026-02-21T08:22:38.2388377Z           scf.yield %61 : tensor<4x4xf32>
2026-02-21T08:22:38.2388537Z         }
2026-02-21T08:22:38.2388683Z         %52 = arith.addf %arg7, %51 : tensor<4x4xf32>
2026-02-21T08:22:38.2388867Z         scf.yield %52 : tensor<4x4xf32>
2026-02-21T08:22:38.2389115Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:22:38.2389402Z       %36 = "tt.reduce"(%35) <{axis = 1 : i32}> ({
2026-02-21T08:22:38.2389597Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:22:38.2389771Z         %39 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:22:38.2389957Z         tt.reduce.return %39 : f32
2026-02-21T08:22:38.2390135Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:22:38.2390356Z       %37 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:22:38.2390615Z       %38 = tt.addptr %37, %34 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:22:38.2390840Z       tt.store %38, %36 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:22:38.2391035Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:22:38.2391233Z     scf.for %arg5 = %9 to %c1024_i32 step %c2368_i32  : i32 {
2026-02-21T08:22:38.2391451Z       %11 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:22:38.2391670Z       %12 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:22:38.2391947Z       %13 = tt.splat %11 : i32 -> tensor<4xi32>
2026-02-21T08:22:38.2392206Z       %14 = arith.addi %13, %12 : tensor<4xi32>
2026-02-21T08:22:38.2392507Z       %15 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:22:38.2392820Z         %19 = tt.splat %arg6 : i32 -> tensor<4xi32>
2026-02-21T08:22:38.2393021Z         %20 = arith.addi %19, %12 : tensor<4xi32>
2026-02-21T08:22:38.2393262Z         %21 = tt.expand_dims %14 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:22:38.2393508Z         %22 = arith.muli %21, %cst : tensor<4x1xi32>
2026-02-21T08:22:38.2393747Z         %23 = tt.expand_dims %20 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32>
2026-02-21T08:22:38.2394014Z         %24 = tt.broadcast %22 : tensor<4x1xi32> -> tensor<4x4xi32>
2026-02-21T08:22:38.2394249Z         %25 = tt.broadcast %23 : tensor<1x4xi32> -> tensor<4x4xi32>
2026-02-21T08:22:38.2394467Z         %26 = arith.addi %24, %25 : tensor<4x4xi32>
2026-02-21T08:22:38.2394754Z         %27 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x4x!tt.ptr<f32>>
2026-02-21T08:22:38.2395016Z         %28 = tt.addptr %27, %26 : tensor<4x4x!tt.ptr<f32>>, tensor<4x4xi32>
2026-02-21T08:22:38.2395285Z         %29 = tt.load %28 evictionPolicy = evict_last : tensor<4x4x!tt.ptr<f32>>
2026-02-21T08:22:38.2395605Z         %30 = tt.descriptor_load %0[%11, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:22:38.2395885Z         %31 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:22:38.2396226Z           %33 = tt.extern_elementwise %30 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:22:38.2396576Z           %34 = arith.subf %30, %29 : tensor<4x4xf32>
2026-02-21T08:22:38.2396767Z           %35 = arith.mulf %33, %34 : tensor<4x4xf32>
2026-02-21T08:22:38.2396974Z           %36 = arith.addf %35, %cst_0 : tensor<4x4xf32>
2026-02-21T08:22:38.2397165Z           scf.yield %36 : tensor<4x4xf32>
2026-02-21T08:22:38.2397337Z         } else {
2026-02-21T08:22:38.2397499Z           %33 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:22:38.2397703Z           %34 = arith.cmpf ogt, %30, %33 : tensor<4x4xf32>
2026-02-21T08:22:38.2397917Z           %35 = arith.cmpf une, %30, %30 : tensor<4x4xf32>
2026-02-21T08:22:38.2398112Z           %36 = arith.ori %34, %35 : tensor<4x4xi1>
2026-02-21T08:22:38.2398341Z           %37 = arith.select %36, %30, %33 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:22:38.2398564Z           %38 = math.log %37 : tensor<4x4xf32>
2026-02-21T08:22:38.2398756Z           %39 = arith.subf %38, %29 : tensor<4x4xf32>
2026-02-21T08:22:38.2398951Z           %40 = arith.mulf %30, %39 : tensor<4x4xf32>
2026-02-21T08:22:38.2399144Z           %41 = arith.addf %40, %cst_0 : tensor<4x4xf32>
2026-02-21T08:22:38.2399338Z           scf.yield %41 : tensor<4x4xf32>
2026-02-21T08:22:38.2399500Z         }
2026-02-21T08:22:38.2399645Z         %32 = arith.addf %arg7, %31 : tensor<4x4xf32>
2026-02-21T08:22:38.2399831Z         scf.yield %32 : tensor<4x4xf32>
2026-02-21T08:22:38.2400084Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:22:38.2400340Z       %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({
2026-02-21T08:22:38.2400537Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:22:38.2400713Z         %19 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:22:38.2400888Z         tt.reduce.return %19 : f32
2026-02-21T08:22:38.2401069Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:22:38.2401279Z       %17 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:22:38.2401531Z       %18 = tt.addptr %17, %14 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:22:38.2401752Z       tt.store %18, %16 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:22:38.2401977Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:22:38.2402140Z     tt.return
2026-02-21T08:22:38.2402260Z   }
2026-02-21T08:22:38.2402378Z }
2026-02-21T08:22:38.2402445Z 
2026-02-21T08:22:38.2402495Z {-#
2026-02-21T08:22:38.2402629Z   external_resources: {
2026-02-21T08:22:38.2402834Z     mlir_reproducer: {
2026-02-21T08:22:38.2407137Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:22:38.2411450Z       disable_threading: false,
2026-02-21T08:22:38.2411616Z       verify_each: true
2026-02-21T08:22:38.2411753Z     }
2026-02-21T08:22:38.2411911Z   }
2026-02-21T08:22:38.2412026Z #-}
2026-02-21T08:22:38.2412440Z /tmp/torchinductor_root/5a/c5aid2qwpcuu4qg3gno4tcuqrrl3ww6r6k5sx3ichpbpqpwiqtw4.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:22:38.2413624Z /tmp/torchinductor_root/5a/c5aid2qwpcuu4qg3gno4tcuqrrl3ww6r6k5sx3ichpbpqpwiqtw4.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:22:38.2414589Z [66s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:22:38.2415651Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 4], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[3, 2], range_unroll_factors=[3, 1], range_warp_specializes=[False, True]), static_shapes=True)
2026-02-21T08:22:38.2416620Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:22:38.2416864Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:22:38.5164816Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:22:38.5169054Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:22:38.5173236Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:22:38.5176714Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:22:38.5181408Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:22:38.5183413Z     %c592_i32 = arith.constant 592 : i32
2026-02-21T08:22:38.5183754Z     %cst = arith.constant dense<4096> : tensor<32x1xi32>
2026-02-21T08:22:38.5184326Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<32x8xf32>
2026-02-21T08:22:38.5189217Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:22:38.5189491Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:22:38.5189727Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:22:38.5189947Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:22:38.5190290Z     %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<32x8xf32>>
2026-02-21T08:22:38.5190635Z     %1 = tt.get_program_id x : i32
2026-02-21T08:22:38.5190854Z     scf.for %arg5 = %1 to %c128_i32 step %c592_i32  : i32 {
2026-02-21T08:22:38.5191070Z       %2 = arith.muli %arg5, %c32_i32 : i32
2026-02-21T08:22:38.5191302Z       %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32>
2026-02-21T08:22:38.5191540Z       %4 = tt.splat %2 : i32 -> tensor<32xi32>
2026-02-21T08:22:38.5192110Z       %5 = arith.addi %4, %3 : tensor<32xi32>
2026-02-21T08:22:38.5192318Z       %c16_i32 = arith.constant 16 : i32
2026-02-21T08:22:38.5192631Z       %6 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c16_i32 iter_args(%arg7 = %cst_0) -> (tensor<32x8xf32>)  : i32 {
2026-02-21T08:22:38.5192986Z         %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:22:38.5193231Z         %11 = tt.splat %arg6 : i32 -> tensor<8xi32>
2026-02-21T08:22:38.5193436Z         %12 = arith.addi %11, %10 : tensor<8xi32>
2026-02-21T08:22:38.5193677Z         %13 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:22:38.5193938Z         %14 = arith.muli %13, %cst : tensor<32x1xi32>
2026-02-21T08:22:38.5194177Z         %15 = tt.expand_dims %12 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:22:38.5194456Z         %16 = tt.broadcast %14 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:22:38.5194707Z         %17 = tt.broadcast %15 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:22:38.5194930Z         %18 = arith.addi %16, %17 : tensor<32x8xi32>
2026-02-21T08:22:38.5195161Z         %19 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<32x8x!tt.ptr<f32>>
2026-02-21T08:22:38.5195417Z         %20 = tt.addptr %19, %18 : tensor<32x8x!tt.ptr<f32>>, tensor<32x8xi32>
2026-02-21T08:22:38.5195704Z         %21 = tt.load %20 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f32>>
2026-02-21T08:22:38.5196035Z         %22 = tt.descriptor_load %0[%2, %arg6] : !tt.tensordesc<tensor<32x8xf32>> -> tensor<32x8xf32>
2026-02-21T08:22:38.5196334Z         %23 = scf.if %arg3 -> (tensor<32x8xf32>) {
2026-02-21T08:22:38.5196693Z           %42 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:22:38.5197053Z           %43 = arith.subf %22, %21 : tensor<32x8xf32>
2026-02-21T08:22:38.5197250Z           %44 = arith.mulf %42, %43 : tensor<32x8xf32>
2026-02-21T08:22:38.5197461Z           %45 = arith.addf %44, %cst_0 : tensor<32x8xf32>
2026-02-21T08:22:38.5197656Z           scf.yield %45 : tensor<32x8xf32>
2026-02-21T08:22:38.5197827Z         } else {
2026-02-21T08:22:38.5197985Z           %42 = tt.splat %arg4 : f32 -> tensor<32x8xf32>
2026-02-21T08:22:38.5198204Z           %43 = arith.cmpf ogt, %22, %42 : tensor<32x8xf32>
2026-02-21T08:22:38.5198420Z           %44 = arith.cmpf une, %22, %22 : tensor<32x8xf32>
2026-02-21T08:22:38.5198617Z           %45 = arith.ori %43, %44 : tensor<32x8xi1>
2026-02-21T08:22:38.5198849Z           %46 = arith.select %45, %22, %42 : tensor<32x8xi1>, tensor<32x8xf32>
2026-02-21T08:22:38.5199076Z           %47 = math.log %46 : tensor<32x8xf32>
2026-02-21T08:22:38.5199269Z           %48 = arith.subf %47, %21 : tensor<32x8xf32>
2026-02-21T08:22:38.5199456Z           %49 = arith.mulf %22, %48 : tensor<32x8xf32>
2026-02-21T08:22:38.5199659Z           %50 = arith.addf %49, %cst_0 : tensor<32x8xf32>
2026-02-21T08:22:38.5199848Z           scf.yield %50 : tensor<32x8xf32>
2026-02-21T08:22:38.5200016Z         }
2026-02-21T08:22:38.5200262Z         %24 = arith.addf %arg7, %23 : tensor<32x8xf32>
2026-02-21T08:22:38.5200450Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:22:38.5200639Z         %25 = arith.muli %c8_i32, %c1_i32 : i32
2026-02-21T08:22:38.5200817Z         %26 = arith.addi %arg6, %25 : i32
2026-02-21T08:22:38.5201039Z         %27 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:22:38.5201272Z         %28 = tt.splat %26 : i32 -> tensor<8xi32>
2026-02-21T08:22:38.5201469Z         %29 = arith.addi %28, %27 : tensor<8xi32>
2026-02-21T08:22:38.5201710Z         %30 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:22:38.5202003Z         %31 = arith.muli %30, %cst : tensor<32x1xi32>
2026-02-21T08:22:38.5202247Z         %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:22:38.5202513Z         %33 = tt.broadcast %31 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:22:38.5202840Z         %34 = tt.broadcast %32 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:22:38.5203063Z         %35 = arith.addi %33, %34 : tensor<32x8xi32>
2026-02-21T08:22:38.5203291Z         %36 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<32x8x!tt.ptr<f32>>
2026-02-21T08:22:38.5203552Z         %37 = tt.addptr %36, %35 : tensor<32x8x!tt.ptr<f32>>, tensor<32x8xi32>
2026-02-21T08:22:38.5203831Z         %38 = tt.load %37 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f32>>
2026-02-21T08:22:38.5204156Z         %39 = tt.descriptor_load %0[%2, %26] : !tt.tensordesc<tensor<32x8xf32>> -> tensor<32x8xf32>
2026-02-21T08:22:38.5204428Z         %40 = scf.if %arg3 -> (tensor<32x8xf32>) {
2026-02-21T08:22:38.5204784Z           %42 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:22:38.5205145Z           %43 = arith.subf %39, %38 : tensor<32x8xf32>
2026-02-21T08:22:38.5205346Z           %44 = arith.mulf %42, %43 : tensor<32x8xf32>
2026-02-21T08:22:38.5205566Z           %45 = arith.addf %44, %cst_0 : tensor<32x8xf32>
2026-02-21T08:22:38.5205758Z           scf.yield %45 : tensor<32x8xf32>
2026-02-21T08:22:38.5205929Z         } else {
2026-02-21T08:22:38.5206080Z           %42 = tt.splat %arg4 : f32 -> tensor<32x8xf32>
2026-02-21T08:22:38.5206292Z           %43 = arith.cmpf ogt, %39, %42 : tensor<32x8xf32>
2026-02-21T08:22:38.5206495Z           %44 = arith.cmpf une, %39, %39 : tensor<32x8xf32>
2026-02-21T08:22:38.5206699Z           %45 = arith.ori %43, %44 : tensor<32x8xi1>
2026-02-21T08:22:38.5206928Z           %46 = arith.select %45, %39, %42 : tensor<32x8xi1>, tensor<32x8xf32>
2026-02-21T08:22:38.5207152Z           %47 = math.log %46 : tensor<32x8xf32>
2026-02-21T08:22:38.5207342Z           %48 = arith.subf %47, %38 : tensor<32x8xf32>
2026-02-21T08:22:38.5207528Z           %49 = arith.mulf %39, %48 : tensor<32x8xf32>
2026-02-21T08:22:38.5207730Z           %50 = arith.addf %49, %cst_0 : tensor<32x8xf32>
2026-02-21T08:22:38.5207918Z           scf.yield %50 : tensor<32x8xf32>
2026-02-21T08:22:38.5208090Z         }
2026-02-21T08:22:38.5208230Z         %41 = arith.addf %24, %40 : tensor<32x8xf32>
2026-02-21T08:22:38.5208412Z         scf.yield %41 : tensor<32x8xf32>
2026-02-21T08:22:38.5208626Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:22:38.5208842Z       %7 = "tt.reduce"(%6) <{axis = 1 : i32}> ({
2026-02-21T08:22:38.5209031Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:22:38.5209200Z         %10 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:22:38.5209382Z         tt.reduce.return %10 : f32
2026-02-21T08:22:38.5209558Z       }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:22:38.5209780Z       %8 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<32x!tt.ptr<f32>>
2026-02-21T08:22:38.5210030Z       %9 = tt.addptr %8, %5 : tensor<32x!tt.ptr<f32>>, tensor<32xi32>
2026-02-21T08:22:38.5210250Z       tt.store %9, %7 : tensor<32x!tt.ptr<f32>>
2026-02-21T08:22:38.5210448Z     } {tt.flatten, tt.warp_specialize}
2026-02-21T08:22:38.5210671Z     tt.return
2026-02-21T08:22:38.5210798Z   }
2026-02-21T08:22:38.5210911Z }
2026-02-21T08:22:38.5210985Z 
2026-02-21T08:22:38.5211032Z {-#
2026-02-21T08:22:38.5211154Z   external_resources: {
2026-02-21T08:22:38.5211309Z     mlir_reproducer: {
2026-02-21T08:22:38.5215654Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:22:38.5220307Z       disable_threading: false,
2026-02-21T08:22:38.5220480Z       verify_each: true
2026-02-21T08:22:38.5220623Z     }
2026-02-21T08:22:38.5220748Z   }
2026-02-21T08:22:38.5220860Z #-}
2026-02-21T08:22:38.5221293Z /tmp/torchinductor_root/5y/c5yk235kd52a3fiyoi3qqhw5kqfyvszntwziwlt5uvp3ycrgotdh.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:22:38.5222556Z /tmp/torchinductor_root/5y/c5yk235kd52a3fiyoi3qqhw5kqfyvszntwziwlt5uvp3ycrgotdh.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:22:38.5223558Z [66s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:22:38.5224661Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', ''], maxnreg=32, num_sm_multiplier=4, num_stages=5, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:22:38.5225620Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:22:38.5225864Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:22:39.2821589Z module {
2026-02-21T08:22:39.2823024Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:22:39.2823617Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:22:39.2823844Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:22:39.2824123Z     %cst = arith.constant dense<0.000000e+00> : tensor<1024x8xf32>
2026-02-21T08:22:39.2824757Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:22:39.2824981Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:22:39.2825184Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:22:39.2825397Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:22:39.2825772Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<1024x8xf32>>
2026-02-21T08:22:39.2826282Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<1024x8xf32>>
2026-02-21T08:22:39.2826624Z     %2 = tt.get_program_id x : i32
2026-02-21T08:22:39.2826822Z     %3 = arith.muli %2, %c1024_i32 : i32
2026-02-21T08:22:39.2827069Z     %4 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T08:22:39.2827357Z     %5 = tt.splat %3 : i32 -> tensor<1024xi32>
2026-02-21T08:22:39.2827675Z     %6 = arith.addi %5, %4 : tensor<1024xi32>
2026-02-21T08:22:39.2827997Z     %7 = scf.for %arg5 = %c0_i32 to %c4096_i32 step %c8_i32 iter_args(%arg6 = %cst) -> (tensor<1024x8xf32>)  : i32 {
2026-02-21T08:22:39.2828390Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<1024x8xf32>> -> tensor<1024x8xf32>
2026-02-21T08:22:39.2828757Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<1024x8xf32>> -> tensor<1024x8xf32>
2026-02-21T08:22:39.2829037Z       %13 = scf.if %arg3 -> (tensor<1024x8xf32>) {
2026-02-21T08:22:39.2829409Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1024x8xf32>) -> tensor<1024x8xf32>
2026-02-21T08:22:39.2829784Z         %16 = arith.subf %12, %11 : tensor<1024x8xf32>
2026-02-21T08:22:39.2829983Z         %17 = arith.mulf %15, %16 : tensor<1024x8xf32>
2026-02-21T08:22:39.2830192Z         %18 = arith.addf %17, %cst : tensor<1024x8xf32>
2026-02-21T08:22:39.2830383Z         scf.yield %18 : tensor<1024x8xf32>
2026-02-21T08:22:39.2830557Z       } else {
2026-02-21T08:22:39.2830710Z         %15 = tt.splat %arg4 : f32 -> tensor<1024x8xf32>
2026-02-21T08:22:39.2830930Z         %16 = arith.cmpf ogt, %12, %15 : tensor<1024x8xf32>
2026-02-21T08:22:39.2831141Z         %17 = arith.cmpf une, %12, %12 : tensor<1024x8xf32>
2026-02-21T08:22:39.2831357Z         %18 = arith.ori %16, %17 : tensor<1024x8xi1>
2026-02-21T08:22:39.2831596Z         %19 = arith.select %18, %12, %15 : tensor<1024x8xi1>, tensor<1024x8xf32>
2026-02-21T08:22:39.2831833Z         %20 = math.log %19 : tensor<1024x8xf32>
2026-02-21T08:22:39.2832076Z         %21 = arith.subf %20, %11 : tensor<1024x8xf32>
2026-02-21T08:22:39.2832269Z         %22 = arith.mulf %12, %21 : tensor<1024x8xf32>
2026-02-21T08:22:39.2832474Z         %23 = arith.addf %22, %cst : tensor<1024x8xf32>
2026-02-21T08:22:39.2832664Z         scf.yield %23 : tensor<1024x8xf32>
2026-02-21T08:22:39.2832835Z       }
2026-02-21T08:22:39.2832983Z       %14 = arith.addf %arg6, %13 : tensor<1024x8xf32>
2026-02-21T08:22:39.2833174Z       scf.yield %14 : tensor<1024x8xf32>
2026-02-21T08:22:39.2833430Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:22:39.2833689Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:22:39.2833878Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:22:39.2834046Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:22:39.2834233Z       tt.reduce.return %11 : f32
2026-02-21T08:22:39.2834412Z     }) : (tensor<1024x8xf32>) -> tensor<1024xf32>
2026-02-21T08:22:39.2834646Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<1024x!tt.ptr<f32>>
2026-02-21T08:22:39.2834917Z     %10 = tt.addptr %9, %6 : tensor<1024x!tt.ptr<f32>>, tensor<1024xi32>
2026-02-21T08:22:39.2835153Z     tt.store %10, %8 : tensor<1024x!tt.ptr<f32>>
2026-02-21T08:22:39.2835345Z     tt.return
2026-02-21T08:22:39.2835467Z   }
2026-02-21T08:22:39.2835586Z }
2026-02-21T08:22:39.2835652Z 
2026-02-21T08:22:39.2835699Z {-#
2026-02-21T08:22:39.2835830Z   external_resources: {
2026-02-21T08:22:39.2836062Z     mlir_reproducer: {
2026-02-21T08:22:39.2840399Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:22:39.2844760Z       disable_threading: false,
2026-02-21T08:22:39.2844925Z       verify_each: true
2026-02-21T08:22:39.2845073Z     }
2026-02-21T08:22:39.2845187Z   }
2026-02-21T08:22:39.2845311Z #-}
2026-02-21T08:22:39.2845743Z /tmp/torchinductor_root/xg/cxgm3kagergn5wzsdsmbbpk6fkefzss35oywbsrjzcxa47fbxywa.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:22:39.2846984Z /tmp/torchinductor_root/xg/cxgm3kagergn5wzsdsmbbpk6fkefzss35oywbsrjzcxa47fbxywa.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:22:39.2847981Z [67s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:22:39.2849003Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=4, num_warps=32, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:22:39.2849902Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:22:39.2850156Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:22:42.0238511Z module attributes {ttg.maxnreg = 64 : i32} {
2026-02-21T08:22:42.0242849Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:22:42.0244380Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:22:42.0244611Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:22:42.0244797Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:22:42.0244987Z     %c592_i32 = arith.constant 592 : i32
2026-02-21T08:22:42.0245212Z     %cst = arith.constant dense<0.000000e+00> : tensor<8x64xf32>
2026-02-21T08:22:42.0245477Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:22:42.0245985Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:22:42.0246170Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:22:42.0246362Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:22:42.0246674Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<8x64xf32>>
2026-02-21T08:22:42.0247104Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<8x64xf32>>
2026-02-21T08:22:42.0247406Z     %2 = tt.get_program_id x : i32
2026-02-21T08:22:42.0247616Z     scf.for %arg5 = %2 to %c512_i32 step %c592_i32  : i32 {
2026-02-21T08:22:42.0247832Z       %3 = arith.muli %arg5, %c8_i32 : i32
2026-02-21T08:22:42.0248050Z       %4 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:22:42.0248301Z       %5 = tt.splat %3 : i32 -> tensor<8xi32>
2026-02-21T08:22:42.0248579Z       %6 = arith.addi %5, %4 : tensor<8xi32>
2026-02-21T08:22:42.0248774Z       %c4032_i32 = arith.constant 4032 : i32
2026-02-21T08:22:42.0248952Z       %c192_i32 = arith.constant 192 : i32
2026-02-21T08:22:42.0249257Z       %7 = scf.for %arg6 = %c0_i32 to %c4032_i32 step %c192_i32 iter_args(%arg7 = %cst) -> (tensor<8x64xf32>)  : i32 {
2026-02-21T08:22:42.0249658Z         %15 = tt.descriptor_load %0[%3, %arg6] : !tt.tensordesc<tensor<8x64xf32>> -> tensor<8x64xf32>
2026-02-21T08:22:42.0250011Z         %16 = tt.descriptor_load %1[%3, %arg6] : !tt.tensordesc<tensor<8x64xf32>> -> tensor<8x64xf32>
2026-02-21T08:22:42.0250296Z         %17 = scf.if %arg3 -> (tensor<8x64xf32>) {
2026-02-21T08:22:42.0250656Z           %31 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x64xf32>) -> tensor<8x64xf32>
2026-02-21T08:22:42.0251020Z           %32 = arith.subf %16, %15 : tensor<8x64xf32>
2026-02-21T08:22:42.0251219Z           %33 = arith.mulf %31, %32 : tensor<8x64xf32>
2026-02-21T08:22:42.0251463Z           %34 = arith.addf %33, %cst : tensor<8x64xf32>
2026-02-21T08:22:42.0251666Z           scf.yield %34 : tensor<8x64xf32>
2026-02-21T08:22:42.0251830Z         } else {
2026-02-21T08:22:42.0252090Z           %31 = tt.splat %arg4 : f32 -> tensor<8x64xf32>
2026-02-21T08:22:42.0252313Z           %32 = arith.cmpf ogt, %16, %31 : tensor<8x64xf32>
2026-02-21T08:22:42.0252526Z           %33 = arith.cmpf une, %16, %16 : tensor<8x64xf32>
2026-02-21T08:22:42.0252738Z           %34 = arith.ori %32, %33 : tensor<8x64xi1>
2026-02-21T08:22:42.0252972Z           %35 = arith.select %34, %16, %31 : tensor<8x64xi1>, tensor<8x64xf32>
2026-02-21T08:22:42.0253216Z           %36 = math.log %35 : tensor<8x64xf32>
2026-02-21T08:22:42.0253406Z           %37 = arith.subf %36, %15 : tensor<8x64xf32>
2026-02-21T08:22:42.0253606Z           %38 = arith.mulf %16, %37 : tensor<8x64xf32>
2026-02-21T08:22:42.0253811Z           %39 = arith.addf %38, %cst : tensor<8x64xf32>
2026-02-21T08:22:42.0253999Z           scf.yield %39 : tensor<8x64xf32>
2026-02-21T08:22:42.0254174Z         }
2026-02-21T08:22:42.0254313Z         %18 = arith.addf %arg7, %17 : tensor<8x64xf32>
2026-02-21T08:22:42.0254506Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:22:42.0254685Z         %19 = arith.muli %c64_i32, %c1_i32 : i32
2026-02-21T08:22:42.0254870Z         %20 = arith.addi %arg6, %19 : i32
2026-02-21T08:22:42.0255126Z         %21 = tt.descriptor_load %0[%3, %20] : !tt.tensordesc<tensor<8x64xf32>> -> tensor<8x64xf32>
2026-02-21T08:22:42.0255470Z         %22 = tt.descriptor_load %1[%3, %20] : !tt.tensordesc<tensor<8x64xf32>> -> tensor<8x64xf32>
2026-02-21T08:22:42.0255745Z         %23 = scf.if %arg3 -> (tensor<8x64xf32>) {
2026-02-21T08:22:42.0256094Z           %31 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x64xf32>) -> tensor<8x64xf32>
2026-02-21T08:22:42.0256446Z           %32 = arith.subf %22, %21 : tensor<8x64xf32>
2026-02-21T08:22:42.0256643Z           %33 = arith.mulf %31, %32 : tensor<8x64xf32>
2026-02-21T08:22:42.0256964Z           %34 = arith.addf %33, %cst : tensor<8x64xf32>
2026-02-21T08:22:42.0257160Z           scf.yield %34 : tensor<8x64xf32>
2026-02-21T08:22:42.0257328Z         } else {
2026-02-21T08:22:42.0257495Z           %31 = tt.splat %arg4 : f32 -> tensor<8x64xf32>
2026-02-21T08:22:42.0257712Z           %32 = arith.cmpf ogt, %22, %31 : tensor<8x64xf32>
2026-02-21T08:22:42.0257936Z           %33 = arith.cmpf une, %22, %22 : tensor<8x64xf32>
2026-02-21T08:22:42.0258140Z           %34 = arith.ori %32, %33 : tensor<8x64xi1>
2026-02-21T08:22:42.0258386Z           %35 = arith.select %34, %22, %31 : tensor<8x64xi1>, tensor<8x64xf32>
2026-02-21T08:22:42.0258634Z           %36 = math.log %35 : tensor<8x64xf32>
2026-02-21T08:22:42.0258825Z           %37 = arith.subf %36, %21 : tensor<8x64xf32>
2026-02-21T08:22:42.0259023Z           %38 = arith.mulf %22, %37 : tensor<8x64xf32>
2026-02-21T08:22:42.0259218Z           %39 = arith.addf %38, %cst : tensor<8x64xf32>
2026-02-21T08:22:42.0259502Z           scf.yield %39 : tensor<8x64xf32>
2026-02-21T08:22:42.0259671Z         }
2026-02-21T08:22:42.0259815Z         %24 = arith.addf %18, %23 : tensor<8x64xf32>
2026-02-21T08:22:42.0259998Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:22:42.0260185Z         %25 = arith.muli %c64_i32, %c2_i32 : i32
2026-02-21T08:22:42.0260371Z         %26 = arith.addi %arg6, %25 : i32
2026-02-21T08:22:42.0260625Z         %27 = tt.descriptor_load %0[%3, %26] : !tt.tensordesc<tensor<8x64xf32>> -> tensor<8x64xf32>
2026-02-21T08:22:42.0260969Z         %28 = tt.descriptor_load %1[%3, %26] : !tt.tensordesc<tensor<8x64xf32>> -> tensor<8x64xf32>
2026-02-21T08:22:42.0261235Z         %29 = scf.if %arg3 -> (tensor<8x64xf32>) {
2026-02-21T08:22:42.0261589Z           %31 = tt.extern_elementwise %28 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x64xf32>) -> tensor<8x64xf32>
2026-02-21T08:22:42.0261972Z           %32 = arith.subf %28, %27 : tensor<8x64xf32>
2026-02-21T08:22:42.0262176Z           %33 = arith.mulf %31, %32 : tensor<8x64xf32>
2026-02-21T08:22:42.0262382Z           %34 = arith.addf %33, %cst : tensor<8x64xf32>
2026-02-21T08:22:42.0262570Z           scf.yield %34 : tensor<8x64xf32>
2026-02-21T08:22:42.0262739Z         } else {
2026-02-21T08:22:42.0262894Z           %31 = tt.splat %arg4 : f32 -> tensor<8x64xf32>
2026-02-21T08:22:42.0263109Z           %32 = arith.cmpf ogt, %28, %31 : tensor<8x64xf32>
2026-02-21T08:22:42.0263319Z           %33 = arith.cmpf une, %28, %28 : tensor<8x64xf32>
2026-02-21T08:22:42.0263525Z           %34 = arith.ori %32, %33 : tensor<8x64xi1>
2026-02-21T08:22:42.0263758Z           %35 = arith.select %34, %28, %31 : tensor<8x64xi1>, tensor<8x64xf32>
2026-02-21T08:22:42.0263992Z           %36 = math.log %35 : tensor<8x64xf32>
2026-02-21T08:22:42.0264187Z           %37 = arith.subf %36, %27 : tensor<8x64xf32>
2026-02-21T08:22:42.0264379Z           %38 = arith.mulf %28, %37 : tensor<8x64xf32>
2026-02-21T08:22:42.0264584Z           %39 = arith.addf %38, %cst : tensor<8x64xf32>
2026-02-21T08:22:42.0264771Z           scf.yield %39 : tensor<8x64xf32>
2026-02-21T08:22:42.0264938Z         }
2026-02-21T08:22:42.0265073Z         %30 = arith.addf %24, %29 : tensor<8x64xf32>
2026-02-21T08:22:42.0265261Z         scf.yield %30 : tensor<8x64xf32>
2026-02-21T08:22:42.0265443Z       } {tt.num_stages = 1 : i32}
2026-02-21T08:22:42.0265705Z       %8 = tt.descriptor_load %0[%3, %c4032_i32] : !tt.tensordesc<tensor<8x64xf32>> -> tensor<8x64xf32>
2026-02-21T08:22:42.0266070Z       %9 = tt.descriptor_load %1[%3, %c4032_i32] : !tt.tensordesc<tensor<8x64xf32>> -> tensor<8x64xf32>
2026-02-21T08:22:42.0266350Z       %10 = scf.if %arg3 -> (tensor<8x64xf32>) {
2026-02-21T08:22:42.0266720Z         %15 = tt.extern_elementwise %9 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x64xf32>) -> tensor<8x64xf32>
2026-02-21T08:22:42.0267089Z         %16 = arith.subf %9, %8 : tensor<8x64xf32>
2026-02-21T08:22:42.0267293Z         %17 = arith.mulf %15, %16 : tensor<8x64xf32>
2026-02-21T08:22:42.0267513Z         %18 = arith.addf %17, %cst : tensor<8x64xf32>
2026-02-21T08:22:42.0267782Z         scf.yield %18 : tensor<8x64xf32>
2026-02-21T08:22:42.0267959Z       } else {
2026-02-21T08:22:42.0268112Z         %15 = tt.splat %arg4 : f32 -> tensor<8x64xf32>
2026-02-21T08:22:42.0268337Z         %16 = arith.cmpf ogt, %9, %15 : tensor<8x64xf32>
2026-02-21T08:22:42.0268551Z         %17 = arith.cmpf une, %9, %9 : tensor<8x64xf32>
2026-02-21T08:22:42.0268766Z         %18 = arith.ori %16, %17 : tensor<8x64xi1>
2026-02-21T08:22:42.0269014Z         %19 = arith.select %18, %9, %15 : tensor<8x64xi1>, tensor<8x64xf32>
2026-02-21T08:22:42.0269253Z         %20 = math.log %19 : tensor<8x64xf32>
2026-02-21T08:22:42.0269466Z         %21 = arith.subf %20, %8 : tensor<8x64xf32>
2026-02-21T08:22:42.0269673Z         %22 = arith.mulf %9, %21 : tensor<8x64xf32>
2026-02-21T08:22:42.0269897Z         %23 = arith.addf %22, %cst : tensor<8x64xf32>
2026-02-21T08:22:42.0270093Z         scf.yield %23 : tensor<8x64xf32>
2026-02-21T08:22:42.0270339Z       }
2026-02-21T08:22:42.0270491Z       %11 = arith.addf %7, %10 : tensor<8x64xf32>
2026-02-21T08:22:42.0270688Z       %12 = "tt.reduce"(%11) <{axis = 1 : i32}> ({
2026-02-21T08:22:42.0270886Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:22:42.0271065Z         %15 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:22:42.0271281Z         tt.reduce.return %15 : f32
2026-02-21T08:22:42.0271474Z       }) : (tensor<8x64xf32>) -> tensor<8xf32>
2026-02-21T08:22:42.0271706Z       %13 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<8x!tt.ptr<f32>>
2026-02-21T08:22:42.0272001Z       %14 = tt.addptr %13, %6 : tensor<8x!tt.ptr<f32>>, tensor<8xi32>
2026-02-21T08:22:42.0272239Z       tt.store %14, %12 : tensor<8x!tt.ptr<f32>>
2026-02-21T08:22:42.0272441Z     } {tt.flatten, tt.warp_specialize}
2026-02-21T08:22:42.0272621Z     tt.return
2026-02-21T08:22:42.0272747Z   }
2026-02-21T08:22:42.0272872Z }
2026-02-21T08:22:42.0272941Z 
2026-02-21T08:22:42.0272992Z {-#
2026-02-21T08:22:42.0273131Z   external_resources: {
2026-02-21T08:22:42.0273290Z     mlir_reproducer: {
2026-02-21T08:22:42.0277570Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:22:42.0281958Z       disable_threading: false,
2026-02-21T08:22:42.0282122Z       verify_each: true
2026-02-21T08:22:42.0282289Z     }
2026-02-21T08:22:42.0282422Z   }
2026-02-21T08:22:42.0282648Z #-}
2026-02-21T08:22:42.0283153Z /tmp/torchinductor_root/2f/c2ftjbl75eyrgfb27icab6zcwzvurrffx5xrfg7hnekeoijqcnby.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:22:42.0284496Z /tmp/torchinductor_root/2f/c2ftjbl75eyrgfb27icab6zcwzvurrffx5xrfg7hnekeoijqcnby.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:22:42.0285528Z [70s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:22:42.0286799Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 8], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'last'], maxnreg=64, num_sm_multiplier=4, num_stages=7, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:22:42.0287858Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:22:42.0288126Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:22:42.0288725Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 15.2 configs/s
2026-02-21T08:22:42.0289095Z [70s] Adaptive compile timeout: 30s (90% percentile=24.9s, bounds=[30.0s, 60s])
2026-02-21T08:22:42.6896593Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1510.5 configs/s
2026-02-21T08:22:42.7343628Z [70s] Initial random population of 100, 5 starting points: 
2026-02-21T08:22:42.7345340Z error=15
2026-02-21T08:22:42.7345520Z timeout=4
2026-02-21T08:22:42.7345677Z ok=81
2026-02-21T08:22:42.7345821Z min=0.0471
2026-02-21T08:22:42.7345976Z mid=0.6176
2026-02-21T08:22:42.7346119Z max=37.2368
2026-02-21T08:22:42.7346341Z best={'block_sizes': [2048, 2],
2026-02-21T08:22:42.7346598Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:22:42.7346840Z  'load_eviction_policies': ['first', ''],
2026-02-21T08:22:42.7347060Z  'num_stages': 5,
2026-02-21T08:22:42.7347215Z  'num_warps': 4,
2026-02-21T08:22:42.7347388Z  'pid_type': 'flat',
2026-02-21T08:22:42.7347562Z  'range_flattens': [None, None],
2026-02-21T08:22:42.7347747Z  'range_multi_buffers': [None, None],
2026-02-21T08:22:42.7347946Z  'range_num_stages': [0, 3],
2026-02-21T08:22:42.7348123Z  'range_unroll_factors': [0, 1],
2026-02-21T08:22:42.7348322Z  'range_warp_specializes': [None, True]}
2026-02-21T08:22:42.7359051Z [70s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:22:43.8964203Z [72s] Generation 1 starting: 87 neighbors, 5 active search path(s)
2026-02-21T08:22:48.8908961Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 11.7 configs/s
2026-02-21T08:22:54.5204447Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 16.5 configs/s
2026-02-21T08:23:00.9891489Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 159.0 configs/s
2026-02-21T08:23:01.3797600Z [89s] Generation 1 complete: 
2026-02-21T08:23:01.3799154Z error=1
2026-02-21T08:23:01.3799309Z ok=92
2026-02-21T08:23:01.3799434Z min=0.0441
2026-02-21T08:23:01.3799563Z mid=0.0562
2026-02-21T08:23:01.3799677Z max=0.2325
2026-02-21T08:23:01.3799820Z best={'block_sizes': [2048, 2],
2026-02-21T08:23:01.3800059Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:23:01.3800323Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:23:01.3800506Z  'num_stages': 7,
2026-02-21T08:23:01.3800650Z  'num_warps': 32,
2026-02-21T08:23:01.3800788Z  'pid_type': 'flat',
2026-02-21T08:23:01.3800938Z  'range_flattens': [None, True],
2026-02-21T08:23:01.3801115Z  'range_multi_buffers': [None, None],
2026-02-21T08:23:01.3801328Z  'range_num_stages': [0, 1],
2026-02-21T08:23:01.3802103Z  'range_unroll_factors': [0, 3],
2026-02-21T08:23:01.3802284Z  'range_warp_specializes': [None, False]}
2026-02-21T08:23:01.3810431Z [89s] Fitting surrogate: 193 points, 193 targets
2026-02-21T08:23:02.4552061Z [90s] Generation 2 starting: 71 neighbors, 5 active search path(s)
2026-02-21T08:23:05.4714568Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 27.0 configs/s
2026-02-21T08:23:06.3911684Z module {
2026-02-21T08:23:06.3916573Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:23:06.3921086Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:23:06.3926038Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:23:06.3931066Z     %cst = arith.constant dense<0.000000e+00> : tensor<4x256xf32>
2026-02-21T08:23:06.3935814Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:23:06.3937424Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:23:06.3937647Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:23:06.3937834Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:23:06.3938145Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<4x256xf32>>
2026-02-21T08:23:06.3938579Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<4x256xf32>>
2026-02-21T08:23:06.3938892Z     %2 = tt.get_program_id x : i32
2026-02-21T08:23:06.3939066Z     %3 = arith.muli %2, %c4_i32 : i32
2026-02-21T08:23:06.3939287Z     %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:23:06.3939517Z     %5 = tt.splat %3 : i32 -> tensor<4xi32>
2026-02-21T08:23:06.3939707Z     %6 = arith.addi %5, %4 : tensor<4xi32>
2026-02-21T08:23:06.3940014Z     %7 = scf.for %arg5 = %c0_i32 to %c4096_i32 step %c256_i32 iter_args(%arg6 = %cst) -> (tensor<4x256xf32>)  : i32 {
2026-02-21T08:23:06.3940420Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<4x256xf32>> -> tensor<4x256xf32>
2026-02-21T08:23:06.3940779Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<4x256xf32>> -> tensor<4x256xf32>
2026-02-21T08:23:06.3941057Z       %13 = scf.if %arg3 -> (tensor<4x256xf32>) {
2026-02-21T08:23:06.3941420Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x256xf32>) -> tensor<4x256xf32>
2026-02-21T08:23:06.3941780Z         %16 = arith.subf %12, %11 : tensor<4x256xf32>
2026-02-21T08:23:06.3942077Z         %17 = arith.mulf %15, %16 : tensor<4x256xf32>
2026-02-21T08:23:06.3942284Z         %18 = arith.addf %17, %cst : tensor<4x256xf32>
2026-02-21T08:23:06.3942473Z         scf.yield %18 : tensor<4x256xf32>
2026-02-21T08:23:06.3942647Z       } else {
2026-02-21T08:23:06.3942804Z         %15 = tt.splat %arg4 : f32 -> tensor<4x256xf32>
2026-02-21T08:23:06.3943034Z         %16 = arith.cmpf ogt, %12, %15 : tensor<4x256xf32>
2026-02-21T08:23:06.3943246Z         %17 = arith.cmpf une, %12, %12 : tensor<4x256xf32>
2026-02-21T08:23:06.3943472Z         %18 = arith.ori %16, %17 : tensor<4x256xi1>
2026-02-21T08:23:06.3943710Z         %19 = arith.select %18, %12, %15 : tensor<4x256xi1>, tensor<4x256xf32>
2026-02-21T08:23:06.3943943Z         %20 = math.log %19 : tensor<4x256xf32>
2026-02-21T08:23:06.3944140Z         %21 = arith.subf %20, %11 : tensor<4x256xf32>
2026-02-21T08:23:06.3944329Z         %22 = arith.mulf %12, %21 : tensor<4x256xf32>
2026-02-21T08:23:06.3944530Z         %23 = arith.addf %22, %cst : tensor<4x256xf32>
2026-02-21T08:23:06.3944725Z         scf.yield %23 : tensor<4x256xf32>
2026-02-21T08:23:06.3944899Z       }
2026-02-21T08:23:06.3945050Z       %14 = arith.addf %arg6, %13 : tensor<4x256xf32>
2026-02-21T08:23:06.3945236Z       scf.yield %14 : tensor<4x256xf32>
2026-02-21T08:23:06.3945489Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:23:06.3946044Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:23:06.3946230Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:23:06.3946398Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:23:06.3946585Z       tt.reduce.return %11 : f32
2026-02-21T08:23:06.3946759Z     }) : (tensor<4x256xf32>) -> tensor<4xf32>
2026-02-21T08:23:06.3946986Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:23:06.3947246Z     %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:23:06.3947467Z     tt.store %10, %8 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:23:06.3947659Z     tt.return
2026-02-21T08:23:06.3947777Z   }
2026-02-21T08:23:06.3947899Z }
2026-02-21T08:23:06.3947963Z 
2026-02-21T08:23:06.3948011Z {-#
2026-02-21T08:23:06.3948137Z   external_resources: {
2026-02-21T08:23:06.3948291Z     mlir_reproducer: {
2026-02-21T08:23:06.3952556Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:23:06.3956898Z       disable_threading: false,
2026-02-21T08:23:06.3957067Z       verify_each: true
2026-02-21T08:23:06.3957207Z     }
2026-02-21T08:23:06.3957327Z   }
2026-02-21T08:23:06.3957435Z #-}
2026-02-21T08:23:06.3957874Z /tmp/torchinductor_root/55/c55vipmor3vwbduj4eavmyyhpqhttt56s25htj5r6oae4vfksynf.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:23:06.3959197Z /tmp/torchinductor_root/55/c55vipmor3vwbduj4eavmyyhpqhttt56s25htj5r6oae4vfksynf.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:23:06.3960249Z [94s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:23:06.3961337Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'first'], num_stages=6, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:23:06.3962320Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:23:06.3962645Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:23:06.5902045Z module {
2026-02-21T08:23:06.5906370Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:23:06.5907643Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:23:06.5907874Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:23:06.5910950Z     %cst = arith.constant dense<0.000000e+00> : tensor<4x1024xf32>
2026-02-21T08:23:06.5911262Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:23:06.5916040Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:23:06.5919276Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:23:06.5921337Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:23:06.5922003Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>>
2026-02-21T08:23:06.5922481Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>>
2026-02-21T08:23:06.5922795Z     %2 = tt.get_program_id x : i32
2026-02-21T08:23:06.5922979Z     %3 = arith.muli %2, %c4_i32 : i32
2026-02-21T08:23:06.5923208Z     %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:23:06.5923440Z     %5 = tt.splat %3 : i32 -> tensor<4xi32>
2026-02-21T08:23:06.5923631Z     %6 = arith.addi %5, %4 : tensor<4xi32>
2026-02-21T08:23:06.5923934Z     %7 = scf.for %arg5 = %c0_i32 to %c4096_i32 step %c1024_i32 iter_args(%arg6 = %cst) -> (tensor<4x1024xf32>)  : i32 {
2026-02-21T08:23:06.5924338Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:23:06.5924710Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:23:06.5924991Z       %13 = scf.if %arg3 -> (tensor<4x1024xf32>) {
2026-02-21T08:23:06.5925364Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32>
2026-02-21T08:23:06.5925726Z         %16 = arith.subf %12, %11 : tensor<4x1024xf32>
2026-02-21T08:23:06.5925931Z         %17 = arith.mulf %15, %16 : tensor<4x1024xf32>
2026-02-21T08:23:06.5926131Z         %18 = arith.addf %17, %cst : tensor<4x1024xf32>
2026-02-21T08:23:06.5926331Z         scf.yield %18 : tensor<4x1024xf32>
2026-02-21T08:23:06.5926501Z       } else {
2026-02-21T08:23:06.5926658Z         %15 = tt.splat %arg4 : f32 -> tensor<4x1024xf32>
2026-02-21T08:23:06.5926878Z         %16 = arith.cmpf ogt, %12, %15 : tensor<4x1024xf32>
2026-02-21T08:23:06.5927090Z         %17 = arith.cmpf une, %12, %12 : tensor<4x1024xf32>
2026-02-21T08:23:06.5927302Z         %18 = arith.ori %16, %17 : tensor<4x1024xi1>
2026-02-21T08:23:06.5927540Z         %19 = arith.select %18, %12, %15 : tensor<4x1024xi1>, tensor<4x1024xf32>
2026-02-21T08:23:06.5927786Z         %20 = math.log %19 : tensor<4x1024xf32>
2026-02-21T08:23:06.5927981Z         %21 = arith.subf %20, %11 : tensor<4x1024xf32>
2026-02-21T08:23:06.5928171Z         %22 = arith.mulf %12, %21 : tensor<4x1024xf32>
2026-02-21T08:23:06.5928395Z         %23 = arith.addf %22, %cst : tensor<4x1024xf32>
2026-02-21T08:23:06.5928591Z         scf.yield %23 : tensor<4x1024xf32>
2026-02-21T08:23:06.5928760Z       }
2026-02-21T08:23:06.5928922Z       %14 = arith.addf %arg6, %13 : tensor<4x1024xf32>
2026-02-21T08:23:06.5929114Z       scf.yield %14 : tensor<4x1024xf32>
2026-02-21T08:23:06.5929360Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:23:06.5929631Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:23:06.5929809Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:23:06.5929985Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:23:06.5930172Z       tt.reduce.return %11 : f32
2026-02-21T08:23:06.5930450Z     }) : (tensor<4x1024xf32>) -> tensor<4xf32>
2026-02-21T08:23:06.5930673Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:23:06.5930916Z     %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:23:06.5931146Z     tt.store %10, %8 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:23:06.5931317Z     tt.return
2026-02-21T08:23:06.5931444Z   }
2026-02-21T08:23:06.5931558Z }
2026-02-21T08:23:06.5931632Z 
2026-02-21T08:23:06.5931680Z {-#
2026-02-21T08:23:06.5931807Z   external_resources: {
2026-02-21T08:23:06.5932001Z     mlir_reproducer: {
2026-02-21T08:23:06.5936352Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:23:06.5940644Z       disable_threading: false,
2026-02-21T08:23:06.5940807Z       verify_each: true
2026-02-21T08:23:06.5940942Z     }
2026-02-21T08:23:06.5941062Z   }
2026-02-21T08:23:06.5941168Z #-}
2026-02-21T08:23:06.5941569Z /tmp/torchinductor_root/h2/ch2ffeze3ynkvw7chal4c5bahlgouvx7edgegll52ryqk4xw6kh7.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:23:06.5942793Z /tmp/torchinductor_root/h2/ch2ffeze3ynkvw7chal4c5bahlgouvx7edgegll52ryqk4xw6kh7.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:23:06.5943756Z [94s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:23:06.5944707Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:23:06.5945561Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:23:06.5945804Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:23:09.6154394Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 71/71 17.3 configs/s
2026-02-21T08:23:15.7767856Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 165.5 configs/s
2026-02-21T08:23:16.1488227Z [104s] Generation 2 complete: 
2026-02-21T08:23:16.1493182Z error=2
2026-02-21T08:23:16.1494525Z ok=74
2026-02-21T08:23:16.1494688Z min=0.0439
2026-02-21T08:23:16.1494814Z mid=0.0480
2026-02-21T08:23:16.1494939Z max=0.1137
2026-02-21T08:23:16.1495070Z best={'block_sizes': [256, 1],
2026-02-21T08:23:16.1495335Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:23:16.1495598Z  'load_eviction_policies': ['', 'first'],
2026-02-21T08:23:16.1495779Z  'num_stages': 5,
2026-02-21T08:23:16.1495916Z  'num_warps': 1,
2026-02-21T08:23:16.1496064Z  'pid_type': 'flat',
2026-02-21T08:23:16.1496224Z  'range_flattens': [None, False],
2026-02-21T08:23:16.1496398Z  'range_multi_buffers': [None, False],
2026-02-21T08:23:16.1496581Z  'range_num_stages': [0, 1],
2026-02-21T08:23:16.1497046Z  'range_unroll_factors': [0, 1],
2026-02-21T08:23:16.1497267Z  'range_warp_specializes': [None, False]}
2026-02-21T08:23:16.1504214Z [104s] Fitting surrogate: 269 points, 269 targets
2026-02-21T08:23:17.3596444Z [105s] Generation 3 starting: 68 neighbors, 5 active search path(s)
2026-02-21T08:23:20.0443414Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 48.9 configs/s
2026-02-21T08:23:24.0639764Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 17.4 configs/s
2026-02-21T08:23:30.3983306Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 165.0 configs/s
2026-02-21T08:23:30.8063304Z [119s] Generation 3 complete: 
2026-02-21T08:23:30.8067523Z error=1
2026-02-21T08:23:30.8067755Z ok=72
2026-02-21T08:23:30.8067899Z min=0.0419
2026-02-21T08:23:30.8068024Z mid=0.0461
2026-02-21T08:23:30.8068197Z max=0.0891
2026-02-21T08:23:30.8068383Z best={'block_sizes': [256, 1],
2026-02-21T08:23:30.8069062Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:23:30.8069380Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:23:30.8069574Z  'num_stages': 5,
2026-02-21T08:23:30.8072673Z  'num_warps': 1,
2026-02-21T08:23:30.8072913Z  'pid_type': 'flat',
2026-02-21T08:23:30.8073120Z  'range_flattens': [None, False],
2026-02-21T08:23:30.8073337Z  'range_multi_buffers': [None, False],
2026-02-21T08:23:30.8073544Z  'range_num_stages': [0, 1],
2026-02-21T08:23:30.8073740Z  'range_unroll_factors': [0, 1],
2026-02-21T08:23:30.8073924Z  'range_warp_specializes': [None, False]}
2026-02-21T08:23:30.8075338Z [119s] Fitting surrogate: 342 points, 342 targets
2026-02-21T08:23:31.9201173Z [120s] Generation 4 starting: 64 neighbors, 5 active search path(s)
2026-02-21T08:23:34.4654667Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 48.4 configs/s
2026-02-21T08:23:38.2996783Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 17.1 configs/s
2026-02-21T08:23:44.0829286Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 175.4 configs/s
2026-02-21T08:23:44.4997514Z [132s] Generation 4 complete: 
2026-02-21T08:23:44.4999211Z ok=70
2026-02-21T08:23:44.4999386Z min=0.0419
2026-02-21T08:23:44.4999517Z mid=0.0460
2026-02-21T08:23:44.4999649Z max=0.2673
2026-02-21T08:23:44.4999787Z best={'block_sizes': [256, 1],
2026-02-21T08:23:44.5000057Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:23:44.5000334Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:23:44.5000524Z  'num_stages': 5,
2026-02-21T08:23:44.5000666Z  'num_warps': 1,
2026-02-21T08:23:44.5000803Z  'pid_type': 'flat',
2026-02-21T08:23:44.5000962Z  'range_flattens': [None, False],
2026-02-21T08:23:44.5001134Z  'range_multi_buffers': [None, False],
2026-02-21T08:23:44.5001356Z  'range_num_stages': [0, 1],
2026-02-21T08:23:44.5002263Z  'range_unroll_factors': [0, 1],
2026-02-21T08:23:44.5002446Z  'range_warp_specializes': [None, False]}
2026-02-21T08:23:44.5025592Z [132s] Fitting surrogate: 412 points, 412 targets
2026-02-21T08:23:45.5938998Z [133s] Generation 5 starting: 39 neighbors, 3 active search path(s)
2026-02-21T08:23:47.3373944Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 61.4 configs/s
2026-02-21T08:23:49.6600338Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 39/39 17.1 configs/s
2026-02-21T08:23:53.3166462Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 280.8 configs/s
2026-02-21T08:23:53.5229371Z [141s] Generation 5 complete: 
2026-02-21T08:23:53.5229754Z ok=43
2026-02-21T08:23:53.5230002Z min=0.0419
2026-02-21T08:23:53.5230159Z mid=0.0460
2026-02-21T08:23:53.5230300Z max=0.0624
2026-02-21T08:23:53.5230824Z best={'block_sizes': [256, 2],
2026-02-21T08:23:53.5231116Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:23:53.5231380Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:23:53.5231587Z  'num_stages': 5,
2026-02-21T08:23:53.5231730Z  'num_warps': 4,
2026-02-21T08:23:53.5232081Z  'pid_type': 'flat',
2026-02-21T08:23:53.5232254Z  'range_flattens': [None, False],
2026-02-21T08:23:53.5232430Z  'range_multi_buffers': [None, True],
2026-02-21T08:23:53.5232615Z  'range_num_stages': [0, 1],
2026-02-21T08:23:53.5232773Z  'range_unroll_factors': [0, 0],
2026-02-21T08:23:53.5232954Z  'range_warp_specializes': [None, False]}
2026-02-21T08:23:53.5247198Z [141s] Fitting surrogate: 455 points, 455 targets
2026-02-21T08:23:54.0003923Z [142s] Generation 6 starting: 27 neighbors, 2 active search path(s)
2026-02-21T08:23:55.2545270Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 28/28 37.5 configs/s
2026-02-21T08:23:56.8989081Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 28/28 17.5 configs/s
2026-02-21T08:23:59.5726979Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 413.6 configs/s
2026-02-21T08:23:59.7140133Z [147s] Generation 6 complete: 
2026-02-21T08:23:59.7144224Z ok=29
2026-02-21T08:23:59.7145657Z min=0.0420
2026-02-21T08:23:59.7145819Z mid=0.0440
2026-02-21T08:23:59.7145944Z max=0.0603
2026-02-21T08:23:59.7146081Z best={'block_sizes': [256, 1],
2026-02-21T08:23:59.7146336Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:23:59.7146607Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:23:59.7146801Z  'num_stages': 5,
2026-02-21T08:23:59.7146939Z  'num_warps': 8,
2026-02-21T08:23:59.7147085Z  'pid_type': 'flat',
2026-02-21T08:23:59.7147240Z  'range_flattens': [None, False],
2026-02-21T08:23:59.7147421Z  'range_multi_buffers': [None, False],
2026-02-21T08:23:59.7147637Z  'range_num_stages': [0, 1],
2026-02-21T08:23:59.7148237Z  'range_unroll_factors': [0, 0],
2026-02-21T08:23:59.7148425Z  'range_warp_specializes': [None, False]}
2026-02-21T08:23:59.7154519Z [147s] Fitting surrogate: 484 points, 484 targets
2026-02-21T08:24:00.1964705Z [148s] Generation 7 starting: 27 neighbors, 2 active search path(s)
2026-02-21T08:24:01.4861311Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 60.2 configs/s
2026-02-21T08:24:03.0847372Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 27/27 17.4 configs/s
2026-02-21T08:24:05.2310964Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 471.2 configs/s
2026-02-21T08:24:05.3567109Z [153s] Generation 7 complete: 
2026-02-21T08:24:05.3568822Z ok=29
2026-02-21T08:24:05.3568986Z min=0.0419
2026-02-21T08:24:05.3569110Z mid=0.0420
2026-02-21T08:24:05.3569237Z max=0.2732
2026-02-21T08:24:05.3569688Z best={'block_sizes': [256, 1],
2026-02-21T08:24:05.3569969Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:24:05.3570227Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:24:05.3570415Z  'num_stages': 5,
2026-02-21T08:24:05.3570550Z  'num_warps': 8,
2026-02-21T08:24:05.3570691Z  'pid_type': 'flat',
2026-02-21T08:24:05.3570844Z  'range_flattens': [None, False],
2026-02-21T08:24:05.3571025Z  'range_multi_buffers': [None, False],
2026-02-21T08:24:05.3571228Z  'range_num_stages': [0, 1],
2026-02-21T08:24:05.3571399Z  'range_unroll_factors': [0, 0],
2026-02-21T08:24:05.3571570Z  'range_warp_specializes': [None, False]}
2026-02-21T08:24:05.3583956Z [153s] Fitting surrogate: 513 points, 513 targets
2026-02-21T08:24:05.6214372Z [153s] Autotuning complete in 153.9s after searching 487 configs.
2026-02-21T08:24:05.6216256Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:24:05.6217244Z     @helion.kernel(config=helion.Config(block_sizes=[256, 1], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=5, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, False]), static_shapes=True)
2026-02-21T08:24:05.6218213Z 
2026-02-21T08:24:05.6218481Z [153s] Code of selected kernel: /tmp/torchinductor_root/2e/c2ef76cyfenisfioqqaqdn2hszipb6ekljuztdrkll7knehpfj34.py
2026-02-21T08:24:05.6400780Z from __future__ import annotations
2026-02-21T08:24:05.6401001Z 
2026-02-21T08:24:05.6405609Z import torch
2026-02-21T08:24:05.6409558Z import triton
2026-02-21T08:24:05.6411075Z import triton.language as tl
2026-02-21T08:24:05.6411363Z from torch._inductor.runtime import triton_helpers
2026-02-21T08:24:05.6411641Z from torch._inductor.runtime.triton_helpers import math as tl_math
2026-02-21T08:24:05.6416897Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T08:24:05.6418446Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T08:24:05.6418650Z 
2026-02-21T08:24:05.6418724Z _BLOCK_SIZE_1 = tl.constexpr(1)
2026-02-21T08:24:05.6418910Z _BLOCK_SIZE_0 = tl.constexpr(256)
2026-02-21T08:24:05.6419064Z 
2026-02-21T08:24:05.6419127Z @triton.jit
2026-02-21T08:24:05.6419327Z def _helion_kl_div_forward(y_pred, y_true, loss, log_target, eps):
2026-02-21T08:24:05.6424300Z     # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m):
2026-02-21T08:24:05.6426341Z     pid_0 = tl.program_id(0)
2026-02-21T08:24:05.6426590Z     offset_1 = pid_0
2026-02-21T08:24:05.6426775Z     indices_1 = offset_1 + tl.zeros([1], tl.int32)
2026-02-21T08:24:05.6431687Z     # src[kl_div.py:90]: loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32)
2026-02-21T08:24:05.6433378Z     loss_sum = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T08:24:05.6433703Z     # src[kl_div.py:92]: for tile_v in hl.tile(V, block_size=block_size_n):
2026-02-21T08:24:05.6434097Z     # src[kl_div.py:93]:     kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32)
2026-02-21T08:24:05.6434682Z     # src[kl_div.py:92-112]: ...
2026-02-21T08:24:05.6439260Z     for offset_0 in tl.range(0, 4096, _BLOCK_SIZE_0, warp_specialize=False, num_stages=1, disallow_acc_multi_buffer=True, flatten=False):
2026-02-21T08:24:05.6440748Z         indices_0 = offset_0 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32)
2026-02-21T08:24:05.6441021Z         loss_sum_copy = loss_sum
2026-02-21T08:24:05.6441213Z         loss_sum_copy_0 = loss_sum_copy
2026-02-21T08:24:05.6441509Z         # src[kl_div.py:93]: kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32)
2026-02-21T08:24:05.6441846Z         kl_loss = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T08:24:05.6442207Z         # src[kl_div.py:95]: y_pred_val = y_pred[tile_bt, tile_v]
2026-02-21T08:24:05.6442800Z         y_pred_val = tl.load(y_pred + (indices_1[:, None] * 4096 + indices_0[None, :] * 1), None, eviction_policy='evict_first')
2026-02-21T08:24:05.6443192Z         # src[kl_div.py:96]: y_true_val = y_true[tile_bt, tile_v]
2026-02-21T08:24:05.6443549Z         y_true_val = tl.load(y_true + (indices_1[:, None] * 4096 + indices_0[None, :] * 1), None, eviction_policy='evict_first')
2026-02-21T08:24:05.6443885Z         # src[kl_div.py:98]: if log_target:
2026-02-21T08:24:05.6444173Z         # src[kl_div.py:99]:     # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space
2026-02-21T08:24:05.6444483Z         # src[kl_div.py:100]:     prob_true = torch.exp(y_true_val)
2026-02-21T08:24:05.6444713Z         # src[kl_div.py:98-106]: ...
2026-02-21T08:24:05.6444888Z         if log_target:
2026-02-21T08:24:05.6445057Z             y_true_val_copy = y_true_val
2026-02-21T08:24:05.6445252Z             y_pred_val_copy = y_pred_val
2026-02-21T08:24:05.6445432Z             kl_loss_copy = kl_loss
2026-02-21T08:24:05.6445627Z             y_true_val_copy_0 = y_true_val_copy
2026-02-21T08:24:05.6445828Z             y_pred_val_copy_0 = y_pred_val_copy
2026-02-21T08:24:05.6446024Z             kl_loss_copy_0 = kl_loss_copy
2026-02-21T08:24:05.6446238Z             # src[kl_div.py:100]: prob_true = torch.exp(y_true_val)
2026-02-21T08:24:05.6446476Z             v_0 = libdevice.exp(y_true_val_copy_0)
2026-02-21T08:24:05.6446725Z             # src[kl_div.py:101]: kl_loss += prob_true * (y_true_val - y_pred_val)
2026-02-21T08:24:05.6446990Z             v_1 = y_true_val_copy_0 - y_pred_val_copy_0
2026-02-21T08:24:05.6447191Z             v_2 = v_0 * v_1
2026-02-21T08:24:05.6447357Z             kl_loss = kl_loss_copy_0 + v_2
2026-02-21T08:24:05.6447554Z         # src[kl_div.py:98]: if log_target:
2026-02-21T08:24:05.6447811Z         # src[kl_div.py:99]:     # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space
2026-02-21T08:24:05.6448113Z         # src[kl_div.py:100]:     prob_true = torch.exp(y_true_val)
2026-02-21T08:24:05.6448326Z         # src[kl_div.py:98-106]: ...
2026-02-21T08:24:05.6448510Z         _not = not log_target
2026-02-21T08:24:05.6448677Z         if _not:
2026-02-21T08:24:05.6448823Z             y_true_val_copy_1 = y_true_val
2026-02-21T08:24:05.6449011Z             y_pred_val_copy_1 = y_pred_val
2026-02-21T08:24:05.6449188Z             kl_loss_copy_1 = kl_loss
2026-02-21T08:24:05.6449387Z             y_true_val_copy_1_0 = y_true_val_copy_1
2026-02-21T08:24:05.6449580Z             y_pred_val_copy_1_0 = y_pred_val_copy_1
2026-02-21T08:24:05.6449776Z             kl_loss_copy_1_0 = kl_loss_copy_1
2026-02-21T08:24:05.6450014Z             # src[kl_div.py:105]: log_true = torch.log(torch.clamp(y_true_val, min=eps))
2026-02-21T08:24:05.6450302Z             v_4 = triton_helpers.maximum(y_true_val_copy_1_0, eps)
2026-02-21T08:24:05.6450513Z             v_5 = tl_math.log(v_4)
2026-02-21T08:24:05.6450725Z             # src[kl_div.py:106]: kl_loss += y_true_val * (log_true - y_pred_val)
2026-02-21T08:24:05.6450962Z             v_6 = v_5 - y_pred_val_copy_1_0
2026-02-21T08:24:05.6451138Z             v_7 = y_true_val_copy_1_0 * v_6
2026-02-21T08:24:05.6451455Z             kl_loss = kl_loss_copy_1_0 + v_7
2026-02-21T08:24:05.6451639Z         # src[kl_div.py:112]: loss_sum += kl_loss
2026-02-21T08:24:05.6451834Z         loss_sum = loss_sum_copy_0 + kl_loss
2026-02-21T08:24:05.6452092Z     # src[kl_div.py:115]: loss[tile_bt] = loss_sum.sum(dim=-1)
2026-02-21T08:24:05.6452317Z     sum_1 = tl.cast(tl.sum(loss_sum, 1), tl.float32)
2026-02-21T08:24:05.6452525Z     tl.store(loss + indices_1 * 1, sum_1, None)
2026-02-21T08:24:05.6452652Z 
2026-02-21T08:24:05.6452954Z def kl_div_forward(y_pred: Tensor, y_true: Tensor, log_target: bool=False, reduction: str='batchmean', eps: float=1e-10, *, _launcher=_default_launcher):
2026-02-21T08:24:05.6453339Z     """
2026-02-21T08:24:05.6453476Z     Compute KL Divergence loss.
2026-02-21T08:24:05.6453585Z 
2026-02-21T08:24:05.6453637Z     Args:
2026-02-21T08:24:05.6453807Z         y_pred: Input predictions in log-space, shape (BT, V)
2026-02-21T08:24:05.6454156Z         y_true: Target values (probabilities or log-probabilities), shape (BT, V)
2026-02-21T08:24:05.6454492Z         log_target: If True, y_true is in log-space; if False, y_true is probabilities
2026-02-21T08:24:05.6454803Z         reduction: Reduction mode ('none', 'sum', 'mean', 'batchmean')
2026-02-21T08:24:05.6455040Z         eps: Small value to avoid numerical issues
2026-02-21T08:24:05.6455171Z 
2026-02-21T08:24:05.6455231Z     Returns:
2026-02-21T08:24:05.6455364Z         loss: KL divergence loss
2026-02-21T08:24:05.6455522Z     """
2026-02-21T08:24:05.6455657Z     # src[kl_div.py:74]: BT, V = y_pred.shape
2026-02-21T08:24:05.6455842Z     BT, V = y_pred.shape
2026-02-21T08:24:05.6456031Z     # src[kl_div.py:75]: assert y_true.shape == y_pred.shape, (
2026-02-21T08:24:05.6456305Z     # src[kl_div.py:76]:     f"Shape mismatch: {y_true.shape} != {y_pred.shape}"
2026-02-21T08:24:05.6456548Z     # src[kl_div.py:77]: )
2026-02-21T08:24:05.6456793Z     assert y_true.shape == y_pred.shape, f'Shape mismatch: {y_true.shape} != {y_pred.shape}'
2026-02-21T08:24:05.6457087Z     # src[kl_div.py:80]: if reduction == "none":
2026-02-21T08:24:05.6457303Z     # src[kl_div.py:81]:     loss = torch.zeros_like(y_pred)
2026-02-21T08:24:05.6457508Z     # src[kl_div.py:82]: else:
2026-02-21T08:24:05.6457661Z     # src[kl_div.py:80-83]: ...
2026-02-21T08:24:05.6457823Z     if reduction == 'none':
2026-02-21T08:24:05.6458004Z         # src[kl_div.py:81]: loss = torch.zeros_like(y_pred)
2026-02-21T08:24:05.6458210Z         loss = torch.zeros_like(y_pred)
2026-02-21T08:24:05.6458378Z     else:
2026-02-21T08:24:05.6458592Z         # src[kl_div.py:83]: loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device)
2026-02-21T08:24:05.6458920Z         loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device)
2026-02-21T08:24:05.6459200Z     # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m):
2026-02-21T08:24:05.6459515Z     # src[kl_div.py:90]:     loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32)
2026-02-21T08:24:05.6459773Z     # src[kl_div.py:89-115]: ...
2026-02-21T08:24:05.6460075Z     _launcher(_helion_kl_div_forward, (4096,), y_pred, y_true, loss, log_target, eps, num_warps=8, num_stages=5)
2026-02-21T08:24:05.6460412Z     # src[kl_div.py:118]: if reduction == "batchmean":
2026-02-21T08:24:05.6460641Z     # src[kl_div.py:119]:     final_loss = torch.sum(loss) / BT
2026-02-21T08:24:05.6460872Z     # src[kl_div.py:120]: elif reduction == "sum":
2026-02-21T08:24:05.6461057Z     # src[kl_div.py:118-125]: ...
2026-02-21T08:24:05.6461226Z     if reduction == 'batchmean':
2026-02-21T08:24:05.6461417Z         # src[kl_div.py:119]: final_loss = torch.sum(loss) / BT
2026-02-21T08:24:05.6461630Z         final_loss = torch.sum(loss) / BT
2026-02-21T08:24:05.6461811Z     elif reduction == 'sum':
2026-02-21T08:24:05.6462030Z         # src[kl_div.py:121]: final_loss = torch.sum(loss, dim=0)
2026-02-21T08:24:05.6462246Z         final_loss = torch.sum(loss, dim=0)
2026-02-21T08:24:05.6462422Z     elif reduction == 'mean':
2026-02-21T08:24:05.6462626Z         # src[kl_div.py:123]: final_loss = torch.sum(loss) / (BT * V)
2026-02-21T08:24:05.6462902Z         final_loss = torch.sum(loss) / (BT * V)
2026-02-21T08:24:05.6463075Z     else:
2026-02-21T08:24:05.6463210Z         # src[kl_div.py:125]: final_loss = loss
2026-02-21T08:24:05.6463389Z         final_loss = loss
2026-02-21T08:24:05.6463551Z     # src[kl_div.py:127]: return final_loss
2026-02-21T08:24:05.6463717Z     return final_loss
2026-02-21T08:24:06.3863360Z WARNING:tritonbench.utils.triton_op:Completed input ID 0:
2026-02-21T08:24:06.3866358Z (B, T, V)
2026-02-21T08:24:06.3870909Z --------------
2026-02-21T08:24:06.3875641Z (8, 512, 4096)
2026-02-21T08:24:06.3880090Z 
2026-02-21T08:24:06.3885679Z  17%|█▋        | 1/6 [02:38<13:13, 158.74s/it]WARNING:tritonbench.utils.triton_op:Running input ID 1:
2026-02-21T08:24:06.3889671Z (B, T, V)
2026-02-21T08:24:06.3891075Z --------------
2026-02-21T08:24:06.3891266Z (8, 512, 8192)
2026-02-21T08:24:06.3891834Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for torch_kl_div
2026-02-21T08:24:07.5717779Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for liger_kl_div
2026-02-21T08:24:08.6712898Z INFO:tritonbench.utils.triton_op:Took 2.53ms to get benchmark function for torch_compile_kl_div
2026-02-21T08:24:10.0853200Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:24:10.0854717Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:24:10.0854951Z               'dtype': 'torch.float32',
2026-02-21T08:24:10.0855149Z               'shape': (4096, 8192),
2026-02-21T08:24:10.0855324Z               'stride': (8192, 1)},
2026-02-21T08:24:10.0855510Z             { 'device': 'cuda:0',
2026-02-21T08:24:10.0855686Z               'dtype': 'torch.float32',
2026-02-21T08:24:10.0855872Z               'shape': (4096, 8192),
2026-02-21T08:24:10.0856051Z               'stride': (8192, 1)}),
2026-02-21T08:24:10.0856215Z   'kwargs': {}}
2026-02-21T08:24:10.0870307Z INFO:tritonbench.utils.triton_op:Took 1.99ms to get benchmark function for helion_kl_div_tritonbench
2026-02-21T08:24:10.2818123Z [0s] Autotune random seed: 2135561342
2026-02-21T08:24:10.3158497Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:24:42.7911097Z [32s] Timeout after 30s compiling Config(block_sizes=[128, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=256, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[0, 0], range_unroll_factors=[3, 3], range_warp_specializes=[None, None])
2026-02-21T08:24:42.9785705Z [32s] Timeout after 30s compiling Config(block_sizes=[128, 512], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], maxnreg=32, num_sm_multiplier=2, num_stages=5, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[True, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[False, None])
2026-02-21T08:24:43.2263283Z [32s] Timeout after 30s compiling Config(block_sizes=[64, 1024], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[None, None])
2026-02-21T08:24:43.5086750Z [33s] Timeout after 30s compiling Config(block_sizes=[4096, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', ''], maxnreg=32, num_sm_multiplier=128, num_stages=6, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[2, 1], range_unroll_factors=[2, 2], range_warp_specializes=[False, None])
2026-02-21T08:24:43.8508309Z [33s] Timeout after 30s compiling Config(block_sizes=[512, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'last'], maxnreg=128, num_sm_multiplier=1, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[1, 3], range_unroll_factors=[3, 3], range_warp_specializes=[None, None])
2026-02-21T08:24:43.9744928Z [33s] Timeout after 30s compiling Config(block_sizes=[256, 128], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['last', ''], maxnreg=32, num_sm_multiplier=32, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, False], range_num_stages=[4, 4], range_unroll_factors=[3, 0], range_warp_specializes=[None, False])
2026-02-21T08:24:44.2789379Z [33s] Timeout after 30s compiling Config(block_sizes=[256, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_sm_multiplier=16, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, False], range_num_stages=[3, 4], range_unroll_factors=[4, 3], range_warp_specializes=[None, False])
2026-02-21T08:24:44.3474285Z [34s] Timeout after 30s compiling Config(block_sizes=[4096, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], maxnreg=128, num_sm_multiplier=128, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[4, 3], range_unroll_factors=[3, 1], range_warp_specializes=[None, None])
2026-02-21T08:24:44.3829933Z [34s] Timeout after 30s compiling Config(block_sizes=[512, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, False])
2026-02-21T08:24:44.3848072Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.1 configs/s
2026-02-21T08:24:46.7521025Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:24:46.7526336Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:24:46.7531805Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:24:46.7537460Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:24:46.7539100Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:24:46.7539299Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:24:46.7542678Z     %cst = arith.constant dense<0.000000e+00> : tensor<256x32xf32>
2026-02-21T08:24:46.7542952Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:24:46.7543138Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:24:46.7543351Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T08:24:46.7543588Z     %c8192_i64 = arith.constant 8192 : i64
2026-02-21T08:24:46.7545972Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:24:46.7546396Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<256x32xf32>>
2026-02-21T08:24:46.7546853Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<256x32xf32>>
2026-02-21T08:24:46.7547175Z     %2 = tt.get_program_id x : i32
2026-02-21T08:24:46.7547369Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:24:46.7547557Z     %4 = arith.minsi %3, %c16_i32 : i32
2026-02-21T08:24:46.7547748Z     %5 = arith.subi %4, %2 : i32
2026-02-21T08:24:46.7547927Z     %c1_i32_0 = arith.constant 1 : i32
2026-02-21T08:24:46.7548122Z     %6 = arith.subi %c1_i32, %c1_i32_0 : i32
2026-02-21T08:24:46.7548305Z     %7 = arith.addi %5, %6 : i32
2026-02-21T08:24:46.7548532Z     %8 = arith.divui %7, %c1_i32 : i32
2026-02-21T08:24:46.7549108Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:24:46.7549286Z     %9 = arith.remsi %8, %c2_i32 : i32
2026-02-21T08:24:46.7549468Z     %10 = arith.subi %8, %9 : i32
2026-02-21T08:24:46.7549642Z     %11 = arith.muli %10, %c1_i32 : i32
2026-02-21T08:24:46.7549828Z     %12 = arith.addi %2, %11 : i32
2026-02-21T08:24:46.7550007Z     %13 = arith.muli %c1_i32, %c2_i32 : i32
2026-02-21T08:24:46.7550221Z     scf.for %arg5 = %2 to %12 step %13  : i32 {
2026-02-21T08:24:46.7550422Z       %14 = arith.muli %arg5, %c256_i32 : i32
2026-02-21T08:24:46.7550667Z       %15 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:24:46.7550942Z       %16 = tt.splat %14 : i32 -> tensor<256xi32>
2026-02-21T08:24:46.7551133Z       %17 = arith.addi %16, %15 : tensor<256xi32>
2026-02-21T08:24:46.7551449Z       %18 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<256x32xf32>)  : i32 {
2026-02-21T08:24:46.7552026Z         %32 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<256x32xf32>> -> tensor<256x32xf32>
2026-02-21T08:24:46.7552447Z         %33 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc<tensor<256x32xf32>> -> tensor<256x32xf32>
2026-02-21T08:24:46.7552743Z         %34 = scf.if %arg3 -> (tensor<256x32xf32>) {
2026-02-21T08:24:46.7553111Z           %36 = tt.extern_elementwise %33 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x32xf32>) -> tensor<256x32xf32>
2026-02-21T08:24:46.7553491Z           %37 = arith.subf %33, %32 : tensor<256x32xf32>
2026-02-21T08:24:46.7553702Z           %38 = arith.mulf %36, %37 : tensor<256x32xf32>
2026-02-21T08:24:46.7553910Z           %39 = arith.addf %38, %cst : tensor<256x32xf32>
2026-02-21T08:24:46.7554118Z           scf.yield %39 : tensor<256x32xf32>
2026-02-21T08:24:46.7554291Z         } else {
2026-02-21T08:24:46.7554461Z           %36 = tt.splat %arg4 : f32 -> tensor<256x32xf32>
2026-02-21T08:24:46.7554688Z           %37 = arith.cmpf ogt, %33, %36 : tensor<256x32xf32>
2026-02-21T08:24:46.7554924Z           %38 = arith.cmpf une, %33, %33 : tensor<256x32xf32>
2026-02-21T08:24:46.7555141Z           %39 = arith.ori %37, %38 : tensor<256x32xi1>
2026-02-21T08:24:46.7555380Z           %40 = arith.select %39, %33, %36 : tensor<256x32xi1>, tensor<256x32xf32>
2026-02-21T08:24:46.7555626Z           %41 = math.log %40 : tensor<256x32xf32>
2026-02-21T08:24:46.7555827Z           %42 = arith.subf %41, %32 : tensor<256x32xf32>
2026-02-21T08:24:46.7556037Z           %43 = arith.mulf %33, %42 : tensor<256x32xf32>
2026-02-21T08:24:46.7556241Z           %44 = arith.addf %43, %cst : tensor<256x32xf32>
2026-02-21T08:24:46.7556447Z           scf.yield %44 : tensor<256x32xf32>
2026-02-21T08:24:46.7556615Z         }
2026-02-21T08:24:46.7556770Z         %35 = arith.addf %arg7, %34 : tensor<256x32xf32>
2026-02-21T08:24:46.7556969Z         scf.yield %35 : tensor<256x32xf32>
2026-02-21T08:24:46.7557171Z       } {tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:24:46.7557382Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:24:46.7557572Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:46.7557753Z         %32 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:24:46.7557935Z         tt.reduce.return %32 : f32
2026-02-21T08:24:46.7558124Z       }) : (tensor<256x32xf32>) -> tensor<256xf32>
2026-02-21T08:24:46.7558358Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<256x!tt.ptr<f32>>
2026-02-21T08:24:46.7558614Z       %21 = tt.addptr %20, %17 : tensor<256x!tt.ptr<f32>>, tensor<256xi32>
2026-02-21T08:24:46.7558854Z       tt.store %21, %19 : tensor<256x!tt.ptr<f32>>
2026-02-21T08:24:46.7559046Z       %c1_i32_1 = arith.constant 1 : i32
2026-02-21T08:24:46.7559239Z       %22 = arith.muli %c1_i32, %c1_i32_1 : i32
2026-02-21T08:24:46.7559422Z       %23 = arith.addi %arg5, %22 : i32
2026-02-21T08:24:46.7559603Z       %24 = arith.muli %23, %c256_i32 : i32
2026-02-21T08:24:46.7559826Z       %25 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:24:46.7560074Z       %26 = tt.splat %24 : i32 -> tensor<256xi32>
2026-02-21T08:24:46.7560357Z       %27 = arith.addi %26, %25 : tensor<256xi32>
2026-02-21T08:24:46.7560653Z       %28 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<256x32xf32>)  : i32 {
2026-02-21T08:24:46.7561053Z         %32 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc<tensor<256x32xf32>> -> tensor<256x32xf32>
2026-02-21T08:24:46.7561415Z         %33 = tt.descriptor_load %1[%24, %arg6] : !tt.tensordesc<tensor<256x32xf32>> -> tensor<256x32xf32>
2026-02-21T08:24:46.7561708Z         %34 = scf.if %arg3 -> (tensor<256x32xf32>) {
2026-02-21T08:24:46.7562100Z           %36 = tt.extern_elementwise %33 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x32xf32>) -> tensor<256x32xf32>
2026-02-21T08:24:46.7562458Z           %37 = arith.subf %33, %32 : tensor<256x32xf32>
2026-02-21T08:24:46.7562661Z           %38 = arith.mulf %36, %37 : tensor<256x32xf32>
2026-02-21T08:24:46.7562922Z           %39 = arith.addf %38, %cst : tensor<256x32xf32>
2026-02-21T08:24:46.7563128Z           scf.yield %39 : tensor<256x32xf32>
2026-02-21T08:24:46.7563291Z         } else {
2026-02-21T08:24:46.7563454Z           %36 = tt.splat %arg4 : f32 -> tensor<256x32xf32>
2026-02-21T08:24:46.7563672Z           %37 = arith.cmpf ogt, %33, %36 : tensor<256x32xf32>
2026-02-21T08:24:46.7563883Z           %38 = arith.cmpf une, %33, %33 : tensor<256x32xf32>
2026-02-21T08:24:46.7564092Z           %39 = arith.ori %37, %38 : tensor<256x32xi1>
2026-02-21T08:24:46.7564322Z           %40 = arith.select %39, %33, %36 : tensor<256x32xi1>, tensor<256x32xf32>
2026-02-21T08:24:46.7564562Z           %41 = math.log %40 : tensor<256x32xf32>
2026-02-21T08:24:46.7564779Z           %42 = arith.subf %41, %32 : tensor<256x32xf32>
2026-02-21T08:24:46.7564980Z           %43 = arith.mulf %33, %42 : tensor<256x32xf32>
2026-02-21T08:24:46.7565182Z           %44 = arith.addf %43, %cst : tensor<256x32xf32>
2026-02-21T08:24:46.7565376Z           scf.yield %44 : tensor<256x32xf32>
2026-02-21T08:24:46.7565550Z         }
2026-02-21T08:24:46.7565691Z         %35 = arith.addf %arg7, %34 : tensor<256x32xf32>
2026-02-21T08:24:46.7565888Z         scf.yield %35 : tensor<256x32xf32>
2026-02-21T08:24:46.7566083Z       } {tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:24:46.7566293Z       %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({
2026-02-21T08:24:46.7566476Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:46.7566654Z         %32 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:24:46.7566841Z         tt.reduce.return %32 : f32
2026-02-21T08:24:46.7567024Z       }) : (tensor<256x32xf32>) -> tensor<256xf32>
2026-02-21T08:24:46.7567253Z       %30 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<256x!tt.ptr<f32>>
2026-02-21T08:24:46.7567513Z       %31 = tt.addptr %30, %27 : tensor<256x!tt.ptr<f32>>, tensor<256xi32>
2026-02-21T08:24:46.7567748Z       tt.store %31, %29 : tensor<256x!tt.ptr<f32>>
2026-02-21T08:24:46.7567942Z     } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:24:46.7568147Z     scf.for %arg5 = %12 to %4 step %c1_i32  : i32 {
2026-02-21T08:24:46.7568348Z       %14 = arith.muli %arg5, %c256_i32 : i32
2026-02-21T08:24:46.7568574Z       %15 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:24:46.7568814Z       %16 = tt.splat %14 : i32 -> tensor<256xi32>
2026-02-21T08:24:46.7569000Z       %17 = arith.addi %16, %15 : tensor<256xi32>
2026-02-21T08:24:46.7569304Z       %18 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<256x32xf32>)  : i32 {
2026-02-21T08:24:46.7569693Z         %22 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<256x32xf32>> -> tensor<256x32xf32>
2026-02-21T08:24:46.7570064Z         %23 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc<tensor<256x32xf32>> -> tensor<256x32xf32>
2026-02-21T08:24:46.7570353Z         %24 = scf.if %arg3 -> (tensor<256x32xf32>) {
2026-02-21T08:24:46.7570709Z           %26 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x32xf32>) -> tensor<256x32xf32>
2026-02-21T08:24:46.7571130Z           %27 = arith.subf %23, %22 : tensor<256x32xf32>
2026-02-21T08:24:46.7571331Z           %28 = arith.mulf %26, %27 : tensor<256x32xf32>
2026-02-21T08:24:46.7571542Z           %29 = arith.addf %28, %cst : tensor<256x32xf32>
2026-02-21T08:24:46.7571756Z           scf.yield %29 : tensor<256x32xf32>
2026-02-21T08:24:46.7571951Z         } else {
2026-02-21T08:24:46.7572120Z           %26 = tt.splat %arg4 : f32 -> tensor<256x32xf32>
2026-02-21T08:24:46.7572337Z           %27 = arith.cmpf ogt, %23, %26 : tensor<256x32xf32>
2026-02-21T08:24:46.7572562Z           %28 = arith.cmpf une, %23, %23 : tensor<256x32xf32>
2026-02-21T08:24:46.7572770Z           %29 = arith.ori %27, %28 : tensor<256x32xi1>
2026-02-21T08:24:46.7573012Z           %30 = arith.select %29, %23, %26 : tensor<256x32xi1>, tensor<256x32xf32>
2026-02-21T08:24:46.7573263Z           %31 = math.log %30 : tensor<256x32xf32>
2026-02-21T08:24:46.7573519Z           %32 = arith.subf %31, %22 : tensor<256x32xf32>
2026-02-21T08:24:46.7573729Z           %33 = arith.mulf %23, %32 : tensor<256x32xf32>
2026-02-21T08:24:46.7573932Z           %34 = arith.addf %33, %cst : tensor<256x32xf32>
2026-02-21T08:24:46.7574130Z           scf.yield %34 : tensor<256x32xf32>
2026-02-21T08:24:46.7574294Z         }
2026-02-21T08:24:46.7574445Z         %25 = arith.addf %arg7, %24 : tensor<256x32xf32>
2026-02-21T08:24:46.7574635Z         scf.yield %25 : tensor<256x32xf32>
2026-02-21T08:24:46.7574838Z       } {tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:24:46.7575046Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:24:46.7575229Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:46.7575411Z         %22 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:24:46.7575591Z         tt.reduce.return %22 : f32
2026-02-21T08:24:46.7575780Z       }) : (tensor<256x32xf32>) -> tensor<256xf32>
2026-02-21T08:24:46.7576009Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<256x!tt.ptr<f32>>
2026-02-21T08:24:46.7576285Z       %21 = tt.addptr %20, %17 : tensor<256x!tt.ptr<f32>>, tensor<256xi32>
2026-02-21T08:24:46.7576520Z       tt.store %21, %19 : tensor<256x!tt.ptr<f32>>
2026-02-21T08:24:46.7576711Z     } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:24:46.7576884Z     tt.return
2026-02-21T08:24:46.7577004Z   }
2026-02-21T08:24:46.7577126Z }
2026-02-21T08:24:46.7577192Z 
2026-02-21T08:24:46.7577242Z {-#
2026-02-21T08:24:46.7577372Z   external_resources: {
2026-02-21T08:24:46.7577522Z     mlir_reproducer: {
2026-02-21T08:24:46.7581738Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:24:46.7586125Z       disable_threading: false,
2026-02-21T08:24:46.7586289Z       verify_each: true
2026-02-21T08:24:46.7586433Z     }
2026-02-21T08:24:46.7586560Z   }
2026-02-21T08:24:46.7586673Z #-}
2026-02-21T08:24:46.7587094Z /tmp/torchinductor_root/mr/cmro5mf7dlfwrt33obhsbtlfyttej6je6h7p2i3nwbklsmwo5qaf.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:24:46.7588352Z /tmp/torchinductor_root/mr/cmro5mf7dlfwrt33obhsbtlfyttej6je6h7p2i3nwbklsmwo5qaf.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:24:46.7589314Z [36s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:24:46.7590380Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last'], maxnreg=128, num_sm_multiplier=16, num_stages=6, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:24:46.7596244Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:24:46.7596535Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:24:47.5541600Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:24:47.5542493Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:24:47.5543152Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:24:47.5543354Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:24:47.5547427Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:24:47.5549530Z     %cst = arith.constant dense<0.000000e+00> : tensor<128x32xf32>
2026-02-21T08:24:47.5549799Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:24:47.5549984Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:24:47.5550172Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T08:24:47.5550344Z     %c8192_i64 = arith.constant 8192 : i64
2026-02-21T08:24:47.5550529Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:24:47.5550841Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<128x32xf32>>
2026-02-21T08:24:47.5551290Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<128x32xf32>>
2026-02-21T08:24:47.5551609Z     %2 = tt.get_program_id x : i32
2026-02-21T08:24:47.5551779Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:24:47.5552043Z     %4 = arith.minsi %3, %c32_i32 : i32
2026-02-21T08:24:47.5552242Z     scf.for %arg5 = %2 to %4 step %c1_i32  : i32 {
2026-02-21T08:24:47.5552451Z       %5 = arith.muli %arg5, %c128_i32 : i32
2026-02-21T08:24:47.5552679Z       %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:24:47.5552949Z       %7 = tt.splat %5 : i32 -> tensor<128xi32>
2026-02-21T08:24:47.5553152Z       %8 = arith.addi %7, %6 : tensor<128xi32>
2026-02-21T08:24:47.5553460Z       %9 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<128x32xf32>)  : i32 {
2026-02-21T08:24:47.5553877Z         %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc<tensor<128x32xf32>> -> tensor<128x32xf32>
2026-02-21T08:24:47.5554255Z         %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc<tensor<128x32xf32>> -> tensor<128x32xf32>
2026-02-21T08:24:47.5554902Z         %15 = scf.if %arg3 -> (tensor<128x32xf32>) {
2026-02-21T08:24:47.5555281Z           %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x32xf32>) -> tensor<128x32xf32>
2026-02-21T08:24:47.5555649Z           %18 = arith.subf %14, %13 : tensor<128x32xf32>
2026-02-21T08:24:47.5555867Z           %19 = arith.mulf %17, %18 : tensor<128x32xf32>
2026-02-21T08:24:47.5556073Z           %20 = arith.addf %19, %cst : tensor<128x32xf32>
2026-02-21T08:24:47.5556277Z           scf.yield %20 : tensor<128x32xf32>
2026-02-21T08:24:47.5556445Z         } else {
2026-02-21T08:24:47.5556612Z           %17 = tt.splat %arg4 : f32 -> tensor<128x32xf32>
2026-02-21T08:24:47.5556833Z           %18 = arith.cmpf ogt, %14, %17 : tensor<128x32xf32>
2026-02-21T08:24:47.5557175Z           %19 = arith.cmpf une, %14, %14 : tensor<128x32xf32>
2026-02-21T08:24:47.5557397Z           %20 = arith.ori %18, %19 : tensor<128x32xi1>
2026-02-21T08:24:47.5557630Z           %21 = arith.select %20, %14, %17 : tensor<128x32xi1>, tensor<128x32xf32>
2026-02-21T08:24:47.5557882Z           %22 = math.log %21 : tensor<128x32xf32>
2026-02-21T08:24:47.5558094Z           %23 = arith.subf %22, %13 : tensor<128x32xf32>
2026-02-21T08:24:47.5558296Z           %24 = arith.mulf %14, %23 : tensor<128x32xf32>
2026-02-21T08:24:47.5558532Z           %25 = arith.addf %24, %cst : tensor<128x32xf32>
2026-02-21T08:24:47.5558722Z           scf.yield %25 : tensor<128x32xf32>
2026-02-21T08:24:47.5558893Z         }
2026-02-21T08:24:47.5559036Z         %16 = arith.addf %arg7, %15 : tensor<128x32xf32>
2026-02-21T08:24:47.5559230Z         scf.yield %16 : tensor<128x32xf32>
2026-02-21T08:24:47.5559508Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:24:47.5559801Z       %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({
2026-02-21T08:24:47.5559999Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:47.5560172Z         %13 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:24:47.5560360Z         tt.reduce.return %13 : f32
2026-02-21T08:24:47.5560542Z       }) : (tensor<128x32xf32>) -> tensor<128xf32>
2026-02-21T08:24:47.5560778Z       %11 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<128x!tt.ptr<f32>>
2026-02-21T08:24:47.5561041Z       %12 = tt.addptr %11, %8 : tensor<128x!tt.ptr<f32>>, tensor<128xi32>
2026-02-21T08:24:47.5561273Z       tt.store %12, %10 : tensor<128x!tt.ptr<f32>>
2026-02-21T08:24:47.5561473Z     } {tt.loop_unroll_factor = 1 : i32}
2026-02-21T08:24:47.5561633Z     tt.return
2026-02-21T08:24:47.5561761Z   }
2026-02-21T08:24:47.5561910Z }
2026-02-21T08:24:47.5561984Z 
2026-02-21T08:24:47.5562031Z {-#
2026-02-21T08:24:47.5562152Z   external_resources: {
2026-02-21T08:24:47.5562305Z     mlir_reproducer: {
2026-02-21T08:24:47.5566590Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=1}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=1}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=1}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:24:47.5571103Z       disable_threading: false,
2026-02-21T08:24:47.5571273Z       verify_each: true
2026-02-21T08:24:47.5571413Z     }
2026-02-21T08:24:47.5571532Z   }
2026-02-21T08:24:47.5571640Z #-}
2026-02-21T08:24:47.5572195Z /tmp/torchinductor_root/jh/cjh3kqlworq67kiavguonnx5vbnedwyzfa3s7tsnyxsk2v2fxuls.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:24:47.5573390Z /tmp/torchinductor_root/jh/cjh3kqlworq67kiavguonnx5vbnedwyzfa3s7tsnyxsk2v2fxuls.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:24:47.5574347Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:24:47.5575431Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=2, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, False], range_num_stages=[0, 2], range_unroll_factors=[1, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:24:47.5576426Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:24:47.5576672Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:24:48.0930839Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:24:48.0933199Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:24:48.0933809Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:24:48.0934021Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:24:48.0934212Z     %c2368_i32 = arith.constant 2368 : i32
2026-02-21T08:24:48.0934440Z     %cst = arith.constant dense<8192> : tensor<4x1xi32>
2026-02-21T08:24:48.0934690Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<4x4xf32>
2026-02-21T08:24:48.0934916Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:24:48.0935120Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:24:48.0935315Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T08:24:48.0935486Z     %c8192_i64 = arith.constant 8192 : i64
2026-02-21T08:24:48.0935665Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:24:48.0935972Z     %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<4x4xf32>>
2026-02-21T08:24:48.0936280Z     %1 = tt.get_program_id x : i32
2026-02-21T08:24:48.0936460Z     %2 = arith.subi %c1024_i32, %1 : i32
2026-02-21T08:24:48.0936632Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:24:48.0936814Z     %3 = arith.subi %c2368_i32, %c1_i32 : i32
2026-02-21T08:24:48.0936997Z     %4 = arith.addi %2, %3 : i32
2026-02-21T08:24:48.0937173Z     %5 = arith.divui %4, %c2368_i32 : i32
2026-02-21T08:24:48.0937346Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T08:24:48.0937521Z     %6 = arith.remsi %5, %c3_i32 : i32
2026-02-21T08:24:48.0937693Z     %7 = arith.subi %5, %6 : i32
2026-02-21T08:24:48.0937872Z     %8 = arith.muli %7, %c2368_i32 : i32
2026-02-21T08:24:48.0938341Z     %9 = arith.addi %1, %8 : i32
2026-02-21T08:24:48.0938510Z     %10 = arith.muli %c2368_i32, %c3_i32 : i32
2026-02-21T08:24:48.0938748Z     scf.for %arg5 = %1 to %9 step %10  : i32 {
2026-02-21T08:24:48.0938940Z       %11 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:24:48.0939155Z       %12 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:24:48.0939402Z       %13 = tt.splat %11 : i32 -> tensor<4xi32>
2026-02-21T08:24:48.0939596Z       %14 = arith.addi %13, %12 : tensor<4xi32>
2026-02-21T08:24:48.0939894Z       %15 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c4_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:24:48.0940207Z         %39 = tt.splat %arg6 : i32 -> tensor<4xi32>
2026-02-21T08:24:48.0940403Z         %40 = arith.addi %39, %12 : tensor<4xi32>
2026-02-21T08:24:48.0940649Z         %41 = tt.expand_dims %14 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:24:48.0940995Z         %42 = arith.muli %41, %cst : tensor<4x1xi32>
2026-02-21T08:24:48.0941235Z         %43 = tt.expand_dims %40 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32>
2026-02-21T08:24:48.0941512Z         %44 = tt.broadcast %42 : tensor<4x1xi32> -> tensor<4x4xi32>
2026-02-21T08:24:48.0941751Z         %45 = tt.broadcast %43 : tensor<1x4xi32> -> tensor<4x4xi32>
2026-02-21T08:24:48.0942042Z         %46 = arith.addi %44, %45 : tensor<4x4xi32>
2026-02-21T08:24:48.0942266Z         %47 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x4x!tt.ptr<f32>>
2026-02-21T08:24:48.0942528Z         %48 = tt.addptr %47, %46 : tensor<4x4x!tt.ptr<f32>>, tensor<4x4xi32>
2026-02-21T08:24:48.0942814Z         %49 = tt.load %48 evictionPolicy = evict_last : tensor<4x4x!tt.ptr<f32>>
2026-02-21T08:24:48.0943139Z         %50 = tt.descriptor_load %0[%11, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:24:48.0943424Z         %51 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:24:48.0943776Z           %53 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:24:48.0944141Z           %54 = arith.subf %50, %49 : tensor<4x4xf32>
2026-02-21T08:24:48.0944340Z           %55 = arith.mulf %53, %54 : tensor<4x4xf32>
2026-02-21T08:24:48.0944555Z           %56 = arith.addf %55, %cst_0 : tensor<4x4xf32>
2026-02-21T08:24:48.0944760Z           scf.yield %56 : tensor<4x4xf32>
2026-02-21T08:24:48.0944930Z         } else {
2026-02-21T08:24:48.0945094Z           %53 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:24:48.0945300Z           %54 = arith.cmpf ogt, %50, %53 : tensor<4x4xf32>
2026-02-21T08:24:48.0945511Z           %55 = arith.cmpf une, %50, %50 : tensor<4x4xf32>
2026-02-21T08:24:48.0945707Z           %56 = arith.ori %54, %55 : tensor<4x4xi1>
2026-02-21T08:24:48.0945939Z           %57 = arith.select %56, %50, %53 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:24:48.0946178Z           %58 = math.log %57 : tensor<4x4xf32>
2026-02-21T08:24:48.0946365Z           %59 = arith.subf %58, %49 : tensor<4x4xf32>
2026-02-21T08:24:48.0946561Z           %60 = arith.mulf %50, %59 : tensor<4x4xf32>
2026-02-21T08:24:48.0946756Z           %61 = arith.addf %60, %cst_0 : tensor<4x4xf32>
2026-02-21T08:24:48.0946949Z           scf.yield %61 : tensor<4x4xf32>
2026-02-21T08:24:48.0947107Z         }
2026-02-21T08:24:48.0947251Z         %52 = arith.addf %arg7, %51 : tensor<4x4xf32>
2026-02-21T08:24:48.0947434Z         scf.yield %52 : tensor<4x4xf32>
2026-02-21T08:24:48.0947683Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:24:48.0947950Z       %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({
2026-02-21T08:24:48.0948134Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:48.0948312Z         %39 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:24:48.0948489Z         tt.reduce.return %39 : f32
2026-02-21T08:24:48.0948671Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:24:48.0948886Z       %17 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:24:48.0949253Z       %18 = tt.addptr %17, %14 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:24:48.0949486Z       tt.store %18, %16 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:24:48.0949675Z       %c1_i32_1 = arith.constant 1 : i32
2026-02-21T08:24:48.0949865Z       %19 = arith.muli %c2368_i32, %c1_i32_1 : i32
2026-02-21T08:24:48.0950049Z       %20 = arith.addi %arg5, %19 : i32
2026-02-21T08:24:48.0950225Z       %21 = arith.muli %20, %c4_i32 : i32
2026-02-21T08:24:48.0950436Z       %22 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:24:48.0950708Z       %23 = tt.splat %21 : i32 -> tensor<4xi32>
2026-02-21T08:24:48.0950892Z       %24 = arith.addi %23, %22 : tensor<4xi32>
2026-02-21T08:24:48.0951191Z       %25 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c4_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:24:48.0951495Z         %39 = tt.splat %arg6 : i32 -> tensor<4xi32>
2026-02-21T08:24:48.0951774Z         %40 = arith.addi %39, %22 : tensor<4xi32>
2026-02-21T08:24:48.0952079Z         %41 = tt.expand_dims %24 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:24:48.0952342Z         %42 = arith.muli %41, %cst : tensor<4x1xi32>
2026-02-21T08:24:48.0952597Z         %43 = tt.expand_dims %40 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32>
2026-02-21T08:24:48.0952882Z         %44 = tt.broadcast %42 : tensor<4x1xi32> -> tensor<4x4xi32>
2026-02-21T08:24:48.0953132Z         %45 = tt.broadcast %43 : tensor<1x4xi32> -> tensor<4x4xi32>
2026-02-21T08:24:48.0953362Z         %46 = arith.addi %44, %45 : tensor<4x4xi32>
2026-02-21T08:24:48.0953593Z         %47 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x4x!tt.ptr<f32>>
2026-02-21T08:24:48.0953866Z         %48 = tt.addptr %47, %46 : tensor<4x4x!tt.ptr<f32>>, tensor<4x4xi32>
2026-02-21T08:24:48.0954151Z         %49 = tt.load %48 evictionPolicy = evict_last : tensor<4x4x!tt.ptr<f32>>
2026-02-21T08:24:48.0954497Z         %50 = tt.descriptor_load %0[%21, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:24:48.0954799Z         %51 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:24:48.0955161Z           %53 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:24:48.0955532Z           %54 = arith.subf %50, %49 : tensor<4x4xf32>
2026-02-21T08:24:48.0955736Z           %55 = arith.mulf %53, %54 : tensor<4x4xf32>
2026-02-21T08:24:48.0955953Z           %56 = arith.addf %55, %cst_0 : tensor<4x4xf32>
2026-02-21T08:24:48.0956153Z           scf.yield %56 : tensor<4x4xf32>
2026-02-21T08:24:48.0956334Z         } else {
2026-02-21T08:24:48.0956503Z           %53 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:24:48.0956717Z           %54 = arith.cmpf ogt, %50, %53 : tensor<4x4xf32>
2026-02-21T08:24:48.0956940Z           %55 = arith.cmpf une, %50, %50 : tensor<4x4xf32>
2026-02-21T08:24:48.0957146Z           %56 = arith.ori %54, %55 : tensor<4x4xi1>
2026-02-21T08:24:48.0957386Z           %57 = arith.select %56, %50, %53 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:24:48.0957619Z           %58 = math.log %57 : tensor<4x4xf32>
2026-02-21T08:24:48.0957821Z           %59 = arith.subf %58, %49 : tensor<4x4xf32>
2026-02-21T08:24:48.0958023Z           %60 = arith.mulf %50, %59 : tensor<4x4xf32>
2026-02-21T08:24:48.0958226Z           %61 = arith.addf %60, %cst_0 : tensor<4x4xf32>
2026-02-21T08:24:48.0958427Z           scf.yield %61 : tensor<4x4xf32>
2026-02-21T08:24:48.0958593Z         }
2026-02-21T08:24:48.0958745Z         %52 = arith.addf %arg7, %51 : tensor<4x4xf32>
2026-02-21T08:24:48.0958938Z         scf.yield %52 : tensor<4x4xf32>
2026-02-21T08:24:48.0959197Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:24:48.0959475Z       %26 = "tt.reduce"(%25) <{axis = 1 : i32}> ({
2026-02-21T08:24:48.0959667Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:48.0959852Z         %39 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:24:48.0960110Z         tt.reduce.return %39 : f32
2026-02-21T08:24:48.0960312Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:24:48.0960529Z       %27 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:24:48.0960785Z       %28 = tt.addptr %27, %24 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:24:48.0961010Z       tt.store %28, %26 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:24:48.0961208Z       %c2_i32 = arith.constant 2 : i32
2026-02-21T08:24:48.0961396Z       %29 = arith.muli %c2368_i32, %c2_i32 : i32
2026-02-21T08:24:48.0961578Z       %30 = arith.addi %arg5, %29 : i32
2026-02-21T08:24:48.0961757Z       %31 = arith.muli %30, %c4_i32 : i32
2026-02-21T08:24:48.0962004Z       %32 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:24:48.0962235Z       %33 = tt.splat %31 : i32 -> tensor<4xi32>
2026-02-21T08:24:48.0962416Z       %34 = arith.addi %33, %32 : tensor<4xi32>
2026-02-21T08:24:48.0962772Z       %35 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c4_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:24:48.0963086Z         %39 = tt.splat %arg6 : i32 -> tensor<4xi32>
2026-02-21T08:24:48.0963279Z         %40 = arith.addi %39, %32 : tensor<4xi32>
2026-02-21T08:24:48.0963519Z         %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:24:48.0963765Z         %42 = arith.muli %41, %cst : tensor<4x1xi32>
2026-02-21T08:24:48.0964007Z         %43 = tt.expand_dims %40 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32>
2026-02-21T08:24:48.0964269Z         %44 = tt.broadcast %42 : tensor<4x1xi32> -> tensor<4x4xi32>
2026-02-21T08:24:48.0964513Z         %45 = tt.broadcast %43 : tensor<1x4xi32> -> tensor<4x4xi32>
2026-02-21T08:24:48.0964730Z         %46 = arith.addi %44, %45 : tensor<4x4xi32>
2026-02-21T08:24:48.0964951Z         %47 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x4x!tt.ptr<f32>>
2026-02-21T08:24:48.0965212Z         %48 = tt.addptr %47, %46 : tensor<4x4x!tt.ptr<f32>>, tensor<4x4xi32>
2026-02-21T08:24:48.0965486Z         %49 = tt.load %48 evictionPolicy = evict_last : tensor<4x4x!tt.ptr<f32>>
2026-02-21T08:24:48.0965813Z         %50 = tt.descriptor_load %0[%31, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:24:48.0966090Z         %51 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:24:48.0966438Z           %53 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:24:48.0966791Z           %54 = arith.subf %50, %49 : tensor<4x4xf32>
2026-02-21T08:24:48.0966981Z           %55 = arith.mulf %53, %54 : tensor<4x4xf32>
2026-02-21T08:24:48.0967189Z           %56 = arith.addf %55, %cst_0 : tensor<4x4xf32>
2026-02-21T08:24:48.0967378Z           scf.yield %56 : tensor<4x4xf32>
2026-02-21T08:24:48.0967552Z         } else {
2026-02-21T08:24:48.0967710Z           %53 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:24:48.0967925Z           %54 = arith.cmpf ogt, %50, %53 : tensor<4x4xf32>
2026-02-21T08:24:48.0968140Z           %55 = arith.cmpf une, %50, %50 : tensor<4x4xf32>
2026-02-21T08:24:48.0968334Z           %56 = arith.ori %54, %55 : tensor<4x4xi1>
2026-02-21T08:24:48.0968587Z           %57 = arith.select %56, %50, %53 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:24:48.0968819Z           %58 = math.log %57 : tensor<4x4xf32>
2026-02-21T08:24:48.0969005Z           %59 = arith.subf %58, %49 : tensor<4x4xf32>
2026-02-21T08:24:48.0969199Z           %60 = arith.mulf %50, %59 : tensor<4x4xf32>
2026-02-21T08:24:48.0969399Z           %61 = arith.addf %60, %cst_0 : tensor<4x4xf32>
2026-02-21T08:24:48.0969586Z           scf.yield %61 : tensor<4x4xf32>
2026-02-21T08:24:48.0969751Z         }
2026-02-21T08:24:48.0969888Z         %52 = arith.addf %arg7, %51 : tensor<4x4xf32>
2026-02-21T08:24:48.0970076Z         scf.yield %52 : tensor<4x4xf32>
2026-02-21T08:24:48.0970316Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:24:48.0970636Z       %36 = "tt.reduce"(%35) <{axis = 1 : i32}> ({
2026-02-21T08:24:48.0970817Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:48.0970993Z         %39 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:24:48.0971174Z         tt.reduce.return %39 : f32
2026-02-21T08:24:48.0971370Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:24:48.0971586Z       %37 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:24:48.0971838Z       %38 = tt.addptr %37, %34 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:24:48.0972091Z       tt.store %38, %36 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:24:48.0972282Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:24:48.0972479Z     scf.for %arg5 = %9 to %c1024_i32 step %c2368_i32  : i32 {
2026-02-21T08:24:48.0972697Z       %11 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:24:48.0972915Z       %12 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:24:48.0973209Z       %13 = tt.splat %11 : i32 -> tensor<4xi32>
2026-02-21T08:24:48.0973409Z       %14 = arith.addi %13, %12 : tensor<4xi32>
2026-02-21T08:24:48.0973697Z       %15 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c4_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x4xf32>)  : i32 {
2026-02-21T08:24:48.0974005Z         %19 = tt.splat %arg6 : i32 -> tensor<4xi32>
2026-02-21T08:24:48.0974194Z         %20 = arith.addi %19, %12 : tensor<4xi32>
2026-02-21T08:24:48.0974433Z         %21 = tt.expand_dims %14 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:24:48.0974677Z         %22 = arith.muli %21, %cst : tensor<4x1xi32>
2026-02-21T08:24:48.0974917Z         %23 = tt.expand_dims %20 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32>
2026-02-21T08:24:48.0975196Z         %24 = tt.broadcast %22 : tensor<4x1xi32> -> tensor<4x4xi32>
2026-02-21T08:24:48.0975437Z         %25 = tt.broadcast %23 : tensor<1x4xi32> -> tensor<4x4xi32>
2026-02-21T08:24:48.0975660Z         %26 = arith.addi %24, %25 : tensor<4x4xi32>
2026-02-21T08:24:48.0975887Z         %27 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x4x!tt.ptr<f32>>
2026-02-21T08:24:48.0976151Z         %28 = tt.addptr %27, %26 : tensor<4x4x!tt.ptr<f32>>, tensor<4x4xi32>
2026-02-21T08:24:48.0976419Z         %29 = tt.load %28 evictionPolicy = evict_last : tensor<4x4x!tt.ptr<f32>>
2026-02-21T08:24:48.0976739Z         %30 = tt.descriptor_load %0[%11, %arg6] : !tt.tensordesc<tensor<4x4xf32>> -> tensor<4x4xf32>
2026-02-21T08:24:48.0977021Z         %31 = scf.if %arg3 -> (tensor<4x4xf32>) {
2026-02-21T08:24:48.0977358Z           %33 = tt.extern_elementwise %30 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32>
2026-02-21T08:24:48.0977724Z           %34 = arith.subf %30, %29 : tensor<4x4xf32>
2026-02-21T08:24:48.0977919Z           %35 = arith.mulf %33, %34 : tensor<4x4xf32>
2026-02-21T08:24:48.0978122Z           %36 = arith.addf %35, %cst_0 : tensor<4x4xf32>
2026-02-21T08:24:48.0978317Z           scf.yield %36 : tensor<4x4xf32>
2026-02-21T08:24:48.0978529Z         } else {
2026-02-21T08:24:48.0978724Z           %33 = tt.splat %arg4 : f32 -> tensor<4x4xf32>
2026-02-21T08:24:48.0978939Z           %34 = arith.cmpf ogt, %30, %33 : tensor<4x4xf32>
2026-02-21T08:24:48.0979145Z           %35 = arith.cmpf une, %30, %30 : tensor<4x4xf32>
2026-02-21T08:24:48.0979347Z           %36 = arith.ori %34, %35 : tensor<4x4xi1>
2026-02-21T08:24:48.0979568Z           %37 = arith.select %36, %30, %33 : tensor<4x4xi1>, tensor<4x4xf32>
2026-02-21T08:24:48.0979798Z           %38 = math.log %37 : tensor<4x4xf32>
2026-02-21T08:24:48.0979982Z           %39 = arith.subf %38, %29 : tensor<4x4xf32>
2026-02-21T08:24:48.0980175Z           %40 = arith.mulf %30, %39 : tensor<4x4xf32>
2026-02-21T08:24:48.0980374Z           %41 = arith.addf %40, %cst_0 : tensor<4x4xf32>
2026-02-21T08:24:48.0980562Z           scf.yield %41 : tensor<4x4xf32>
2026-02-21T08:24:48.0980729Z         }
2026-02-21T08:24:48.0980864Z         %32 = arith.addf %arg7, %31 : tensor<4x4xf32>
2026-02-21T08:24:48.0981057Z         scf.yield %32 : tensor<4x4xf32>
2026-02-21T08:24:48.0981354Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:24:48.0981618Z       %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({
2026-02-21T08:24:48.0981799Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:48.0982007Z         %19 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:24:48.0982193Z         tt.reduce.return %19 : f32
2026-02-21T08:24:48.0982372Z       }) : (tensor<4x4xf32>) -> tensor<4xf32>
2026-02-21T08:24:48.0982595Z       %17 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:24:48.0982845Z       %18 = tt.addptr %17, %14 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:24:48.0983078Z       tt.store %18, %16 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:24:48.0983262Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:24:48.0983426Z     tt.return
2026-02-21T08:24:48.0983552Z   }
2026-02-21T08:24:48.0983677Z }
2026-02-21T08:24:48.0983746Z 
2026-02-21T08:24:48.0983808Z {-#
2026-02-21T08:24:48.0984013Z   external_resources: {
2026-02-21T08:24:48.0984177Z     mlir_reproducer: {
2026-02-21T08:24:48.0988438Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:24:48.0992813Z       disable_threading: false,
2026-02-21T08:24:48.0992972Z       verify_each: true
2026-02-21T08:24:48.0993116Z     }
2026-02-21T08:24:48.0993230Z   }
2026-02-21T08:24:48.0993349Z #-}
2026-02-21T08:24:48.0993754Z /tmp/torchinductor_root/7g/c7gjsy6mlluox5unwj6ock57hszytyp5zlzgitn3szsuyjxwnppz.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:24:48.0994963Z /tmp/torchinductor_root/7g/c7gjsy6mlluox5unwj6ock57hszytyp5zlzgitn3szsuyjxwnppz.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:24:48.0995945Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:24:48.0997040Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 4], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[3, 2], range_unroll_factors=[3, 1], range_warp_specializes=[False, True]), static_shapes=True)
2026-02-21T08:24:48.0998085Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:24:48.0998351Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:24:48.3769190Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:24:48.3772801Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:24:48.3773753Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:24:48.3773950Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:24:48.3774146Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:24:48.3774370Z     %c592_i32 = arith.constant 592 : i32
2026-02-21T08:24:48.3774929Z     %cst = arith.constant dense<8192> : tensor<32x1xi32>
2026-02-21T08:24:48.3775214Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<32x8xf32>
2026-02-21T08:24:48.3775432Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:24:48.3775618Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:24:48.3775797Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T08:24:48.3775975Z     %c8192_i64 = arith.constant 8192 : i64
2026-02-21T08:24:48.3776155Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:24:48.3776456Z     %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<32x8xf32>>
2026-02-21T08:24:48.3776772Z     %1 = tt.get_program_id x : i32
2026-02-21T08:24:48.3776975Z     scf.for %arg5 = %1 to %c128_i32 step %c592_i32  : i32 {
2026-02-21T08:24:48.3777192Z       %2 = arith.muli %arg5, %c32_i32 : i32
2026-02-21T08:24:48.3777416Z       %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32>
2026-02-21T08:24:48.3777671Z       %4 = tt.splat %2 : i32 -> tensor<32xi32>
2026-02-21T08:24:48.3777869Z       %5 = arith.addi %4, %3 : tensor<32xi32>
2026-02-21T08:24:48.3778047Z       %c16_i32 = arith.constant 16 : i32
2026-02-21T08:24:48.3778352Z       %6 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c16_i32 iter_args(%arg7 = %cst_0) -> (tensor<32x8xf32>)  : i32 {
2026-02-21T08:24:48.3778696Z         %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:24:48.3778942Z         %11 = tt.splat %arg6 : i32 -> tensor<8xi32>
2026-02-21T08:24:48.3779139Z         %12 = arith.addi %11, %10 : tensor<8xi32>
2026-02-21T08:24:48.3779390Z         %13 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:24:48.3779653Z         %14 = arith.muli %13, %cst : tensor<32x1xi32>
2026-02-21T08:24:48.3779897Z         %15 = tt.expand_dims %12 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:24:48.3780180Z         %16 = tt.broadcast %14 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:24:48.3780431Z         %17 = tt.broadcast %15 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:24:48.3780675Z         %18 = arith.addi %16, %17 : tensor<32x8xi32>
2026-02-21T08:24:48.3780902Z         %19 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<32x8x!tt.ptr<f32>>
2026-02-21T08:24:48.3781178Z         %20 = tt.addptr %19, %18 : tensor<32x8x!tt.ptr<f32>>, tensor<32x8xi32>
2026-02-21T08:24:48.3781466Z         %21 = tt.load %20 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f32>>
2026-02-21T08:24:48.3781792Z         %22 = tt.descriptor_load %0[%2, %arg6] : !tt.tensordesc<tensor<32x8xf32>> -> tensor<32x8xf32>
2026-02-21T08:24:48.3782298Z         %23 = scf.if %arg3 -> (tensor<32x8xf32>) {
2026-02-21T08:24:48.3782653Z           %42 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:24:48.3783025Z           %43 = arith.subf %22, %21 : tensor<32x8xf32>
2026-02-21T08:24:48.3783235Z           %44 = arith.mulf %42, %43 : tensor<32x8xf32>
2026-02-21T08:24:48.3783574Z           %45 = arith.addf %44, %cst_0 : tensor<32x8xf32>
2026-02-21T08:24:48.3783775Z           scf.yield %45 : tensor<32x8xf32>
2026-02-21T08:24:48.3783941Z         } else {
2026-02-21T08:24:48.3784129Z           %42 = tt.splat %arg4 : f32 -> tensor<32x8xf32>
2026-02-21T08:24:48.3784345Z           %43 = arith.cmpf ogt, %22, %42 : tensor<32x8xf32>
2026-02-21T08:24:48.3784571Z           %44 = arith.cmpf une, %22, %22 : tensor<32x8xf32>
2026-02-21T08:24:48.3784785Z           %45 = arith.ori %43, %44 : tensor<32x8xi1>
2026-02-21T08:24:48.3785021Z           %46 = arith.select %45, %22, %42 : tensor<32x8xi1>, tensor<32x8xf32>
2026-02-21T08:24:48.3785266Z           %47 = math.log %46 : tensor<32x8xf32>
2026-02-21T08:24:48.3785458Z           %48 = arith.subf %47, %21 : tensor<32x8xf32>
2026-02-21T08:24:48.3785662Z           %49 = arith.mulf %22, %48 : tensor<32x8xf32>
2026-02-21T08:24:48.3785863Z           %50 = arith.addf %49, %cst_0 : tensor<32x8xf32>
2026-02-21T08:24:48.3786128Z           scf.yield %50 : tensor<32x8xf32>
2026-02-21T08:24:48.3786300Z         }
2026-02-21T08:24:48.3786453Z         %24 = arith.addf %arg7, %23 : tensor<32x8xf32>
2026-02-21T08:24:48.3786655Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:24:48.3786845Z         %25 = arith.muli %c8_i32, %c1_i32 : i32
2026-02-21T08:24:48.3787035Z         %26 = arith.addi %arg6, %25 : i32
2026-02-21T08:24:48.3787256Z         %27 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:24:48.3787502Z         %28 = tt.splat %26 : i32 -> tensor<8xi32>
2026-02-21T08:24:48.3787699Z         %29 = arith.addi %28, %27 : tensor<8xi32>
2026-02-21T08:24:48.3787947Z         %30 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32>
2026-02-21T08:24:48.3788213Z         %31 = arith.muli %30, %cst : tensor<32x1xi32>
2026-02-21T08:24:48.3788460Z         %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:24:48.3788747Z         %33 = tt.broadcast %31 : tensor<32x1xi32> -> tensor<32x8xi32>
2026-02-21T08:24:48.3789004Z         %34 = tt.broadcast %32 : tensor<1x8xi32> -> tensor<32x8xi32>
2026-02-21T08:24:48.3789246Z         %35 = arith.addi %33, %34 : tensor<32x8xi32>
2026-02-21T08:24:48.3789468Z         %36 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<32x8x!tt.ptr<f32>>
2026-02-21T08:24:48.3789734Z         %37 = tt.addptr %36, %35 : tensor<32x8x!tt.ptr<f32>>, tensor<32x8xi32>
2026-02-21T08:24:48.3790021Z         %38 = tt.load %37 evictionPolicy = evict_first : tensor<32x8x!tt.ptr<f32>>
2026-02-21T08:24:48.3790346Z         %39 = tt.descriptor_load %0[%2, %26] : !tt.tensordesc<tensor<32x8xf32>> -> tensor<32x8xf32>
2026-02-21T08:24:48.3790634Z         %40 = scf.if %arg3 -> (tensor<32x8xf32>) {
2026-02-21T08:24:48.3790979Z           %42 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32>
2026-02-21T08:24:48.3791343Z           %43 = arith.subf %39, %38 : tensor<32x8xf32>
2026-02-21T08:24:48.3791549Z           %44 = arith.mulf %42, %43 : tensor<32x8xf32>
2026-02-21T08:24:48.3791755Z           %45 = arith.addf %44, %cst_0 : tensor<32x8xf32>
2026-02-21T08:24:48.3791986Z           scf.yield %45 : tensor<32x8xf32>
2026-02-21T08:24:48.3792150Z         } else {
2026-02-21T08:24:48.3792314Z           %42 = tt.splat %arg4 : f32 -> tensor<32x8xf32>
2026-02-21T08:24:48.3792522Z           %43 = arith.cmpf ogt, %39, %42 : tensor<32x8xf32>
2026-02-21T08:24:48.3792743Z           %44 = arith.cmpf une, %39, %39 : tensor<32x8xf32>
2026-02-21T08:24:48.3792949Z           %45 = arith.ori %43, %44 : tensor<32x8xi1>
2026-02-21T08:24:48.3793195Z           %46 = arith.select %45, %39, %42 : tensor<32x8xi1>, tensor<32x8xf32>
2026-02-21T08:24:48.3793440Z           %47 = math.log %46 : tensor<32x8xf32>
2026-02-21T08:24:48.3793634Z           %48 = arith.subf %47, %38 : tensor<32x8xf32>
2026-02-21T08:24:48.3793845Z           %49 = arith.mulf %39, %48 : tensor<32x8xf32>
2026-02-21T08:24:48.3794048Z           %50 = arith.addf %49, %cst_0 : tensor<32x8xf32>
2026-02-21T08:24:48.3794307Z           scf.yield %50 : tensor<32x8xf32>
2026-02-21T08:24:48.3794466Z         }
2026-02-21T08:24:48.3794609Z         %41 = arith.addf %24, %40 : tensor<32x8xf32>
2026-02-21T08:24:48.3794796Z         scf.yield %41 : tensor<32x8xf32>
2026-02-21T08:24:48.3795003Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:24:48.3795227Z       %7 = "tt.reduce"(%6) <{axis = 1 : i32}> ({
2026-02-21T08:24:48.3795405Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:48.3795580Z         %10 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:24:48.3795756Z         tt.reduce.return %10 : f32
2026-02-21T08:24:48.3795941Z       }) : (tensor<32x8xf32>) -> tensor<32xf32>
2026-02-21T08:24:48.3796156Z       %8 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<32x!tt.ptr<f32>>
2026-02-21T08:24:48.3796413Z       %9 = tt.addptr %8, %5 : tensor<32x!tt.ptr<f32>>, tensor<32xi32>
2026-02-21T08:24:48.3796694Z       tt.store %9, %7 : tensor<32x!tt.ptr<f32>>
2026-02-21T08:24:48.3796888Z     } {tt.flatten, tt.warp_specialize}
2026-02-21T08:24:48.3797060Z     tt.return
2026-02-21T08:24:48.3797179Z   }
2026-02-21T08:24:48.3797298Z }
2026-02-21T08:24:48.3797363Z 
2026-02-21T08:24:48.3797411Z {-#
2026-02-21T08:24:48.3797572Z   external_resources: {
2026-02-21T08:24:48.3797721Z     mlir_reproducer: {
2026-02-21T08:24:48.3801932Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:24:48.3806268Z       disable_threading: false,
2026-02-21T08:24:48.3806425Z       verify_each: true
2026-02-21T08:24:48.3806573Z     }
2026-02-21T08:24:48.3806686Z   }
2026-02-21T08:24:48.3806800Z #-}
2026-02-21T08:24:48.3807212Z /tmp/torchinductor_root/zc/czcgaxxqjni443xocez7ve6cni2qufwsdli2zj3hwqhryjh4lgxx.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:24:48.3808384Z /tmp/torchinductor_root/zc/czcgaxxqjni443xocez7ve6cni2qufwsdli2zj3hwqhryjh4lgxx.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:24:48.3809336Z [38s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:24:48.3810456Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', ''], maxnreg=32, num_sm_multiplier=4, num_stages=5, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:24:48.3811409Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:24:48.3811657Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:24:49.2414568Z module {
2026-02-21T08:24:49.2416155Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:24:49.2416732Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:24:49.2417244Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:24:49.2417505Z     %cst = arith.constant dense<0.000000e+00> : tensor<1024x8xf32>
2026-02-21T08:24:49.2417732Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:24:49.2417923Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:24:49.2418102Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T08:24:49.2418275Z     %c8192_i64 = arith.constant 8192 : i64
2026-02-21T08:24:49.2418452Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:24:49.2418757Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<1024x8xf32>>
2026-02-21T08:24:49.2419185Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<1024x8xf32>>
2026-02-21T08:24:49.2419485Z     %2 = tt.get_program_id x : i32
2026-02-21T08:24:49.2419674Z     %3 = arith.muli %2, %c1024_i32 : i32
2026-02-21T08:24:49.2419911Z     %4 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T08:24:49.2420167Z     %5 = tt.splat %3 : i32 -> tensor<1024xi32>
2026-02-21T08:24:49.2420359Z     %6 = arith.addi %5, %4 : tensor<1024xi32>
2026-02-21T08:24:49.2420661Z     %7 = scf.for %arg5 = %c0_i32 to %c8192_i32 step %c8_i32 iter_args(%arg6 = %cst) -> (tensor<1024x8xf32>)  : i32 {
2026-02-21T08:24:49.2421058Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<1024x8xf32>> -> tensor<1024x8xf32>
2026-02-21T08:24:49.2421424Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<1024x8xf32>> -> tensor<1024x8xf32>
2026-02-21T08:24:49.2421703Z       %13 = scf.if %arg3 -> (tensor<1024x8xf32>) {
2026-02-21T08:24:49.2422159Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1024x8xf32>) -> tensor<1024x8xf32>
2026-02-21T08:24:49.2422518Z         %16 = arith.subf %12, %11 : tensor<1024x8xf32>
2026-02-21T08:24:49.2422727Z         %17 = arith.mulf %15, %16 : tensor<1024x8xf32>
2026-02-21T08:24:49.2422935Z         %18 = arith.addf %17, %cst : tensor<1024x8xf32>
2026-02-21T08:24:49.2423143Z         scf.yield %18 : tensor<1024x8xf32>
2026-02-21T08:24:49.2423315Z       } else {
2026-02-21T08:24:49.2423472Z         %15 = tt.splat %arg4 : f32 -> tensor<1024x8xf32>
2026-02-21T08:24:49.2423696Z         %16 = arith.cmpf ogt, %12, %15 : tensor<1024x8xf32>
2026-02-21T08:24:49.2423913Z         %17 = arith.cmpf une, %12, %12 : tensor<1024x8xf32>
2026-02-21T08:24:49.2424124Z         %18 = arith.ori %16, %17 : tensor<1024x8xi1>
2026-02-21T08:24:49.2424360Z         %19 = arith.select %18, %12, %15 : tensor<1024x8xi1>, tensor<1024x8xf32>
2026-02-21T08:24:49.2424601Z         %20 = math.log %19 : tensor<1024x8xf32>
2026-02-21T08:24:49.2424797Z         %21 = arith.subf %20, %11 : tensor<1024x8xf32>
2026-02-21T08:24:49.2424991Z         %22 = arith.mulf %12, %21 : tensor<1024x8xf32>
2026-02-21T08:24:49.2425193Z         %23 = arith.addf %22, %cst : tensor<1024x8xf32>
2026-02-21T08:24:49.2425383Z         scf.yield %23 : tensor<1024x8xf32>
2026-02-21T08:24:49.2425737Z       }
2026-02-21T08:24:49.2425880Z       %14 = arith.addf %arg6, %13 : tensor<1024x8xf32>
2026-02-21T08:24:49.2426078Z       scf.yield %14 : tensor<1024x8xf32>
2026-02-21T08:24:49.2426353Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:24:49.2426614Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:24:49.2426805Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:24:49.2426972Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:24:49.2427157Z       tt.reduce.return %11 : f32
2026-02-21T08:24:49.2427333Z     }) : (tensor<1024x8xf32>) -> tensor<1024xf32>
2026-02-21T08:24:49.2427568Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<1024x!tt.ptr<f32>>
2026-02-21T08:24:49.2427822Z     %10 = tt.addptr %9, %6 : tensor<1024x!tt.ptr<f32>>, tensor<1024xi32>
2026-02-21T08:24:49.2428061Z     tt.store %10, %8 : tensor<1024x!tt.ptr<f32>>
2026-02-21T08:24:49.2428242Z     tt.return
2026-02-21T08:24:49.2428429Z   }
2026-02-21T08:24:49.2428557Z }
2026-02-21T08:24:49.2428621Z 
2026-02-21T08:24:49.2428669Z {-#
2026-02-21T08:24:49.2428796Z   external_resources: {
2026-02-21T08:24:49.2428944Z     mlir_reproducer: {
2026-02-21T08:24:49.2433147Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:24:49.2437615Z       disable_threading: false,
2026-02-21T08:24:49.2437782Z       verify_each: true
2026-02-21T08:24:49.2437924Z     }
2026-02-21T08:24:49.2438036Z   }
2026-02-21T08:24:49.2438152Z #-}
2026-02-21T08:24:49.2438565Z /tmp/torchinductor_root/nr/cnr7f25gh6e2u5sisdbfibfm75rhxvditarqdcbkxerpiisvqpft.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:24:49.2439759Z /tmp/torchinductor_root/nr/cnr7f25gh6e2u5sisdbfibfm75rhxvditarqdcbkxerpiisvqpft.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:24:49.2440755Z [38s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:24:49.2441723Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=4, num_warps=32, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:24:49.2442671Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:24:49.2442918Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:24:49.7262703Z module attributes {ttg.maxnreg = 256 : i32} {
2026-02-21T08:24:49.7266589Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:24:49.7270818Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:24:49.7272693Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:24:49.7272914Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:24:49.7273412Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:24:49.7273651Z     %cst = arith.constant dense<8192> : tensor<2048x1xi32>
2026-02-21T08:24:49.7273916Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<2048x8xf32>
2026-02-21T08:24:49.7274143Z     %c2048_i32 = arith.constant 2048 : i32
2026-02-21T08:24:49.7274337Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:24:49.7274522Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T08:24:49.7274698Z     %c8192_i64 = arith.constant 8192 : i64
2026-02-21T08:24:49.7274878Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:24:49.7275185Z     %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<2048x8xf32>>
2026-02-21T08:24:49.7275509Z     %1 = tt.get_program_id x : i32
2026-02-21T08:24:49.7275680Z     %2 = arith.addi %1, %c1_i32 : i32
2026-02-21T08:24:49.7275859Z     %3 = arith.minsi %2, %c2_i32 : i32
2026-02-21T08:24:49.7276054Z     scf.for %arg5 = %1 to %3 step %c1_i32  : i32 {
2026-02-21T08:24:49.7276258Z       %4 = arith.muli %arg5, %c2048_i32 : i32
2026-02-21T08:24:49.7276509Z       %5 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32>
2026-02-21T08:24:49.7276758Z       %6 = tt.splat %4 : i32 -> tensor<2048xi32>
2026-02-21T08:24:49.7276953Z       %7 = arith.addi %6, %5 : tensor<2048xi32>
2026-02-21T08:24:49.7277140Z       %c8184_i32 = arith.constant 8184 : i32
2026-02-21T08:24:49.7277333Z       %c24_i32 = arith.constant 24 : i32
2026-02-21T08:24:49.7277635Z       %8 = scf.for %arg6 = %c0_i32 to %c8184_i32 step %c24_i32 iter_args(%arg7 = %cst_0) -> (tensor<2048x8xf32>)  : i32 {
2026-02-21T08:24:49.7277988Z         %27 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:24:49.7278234Z         %28 = tt.splat %arg6 : i32 -> tensor<8xi32>
2026-02-21T08:24:49.7278430Z         %29 = arith.addi %28, %27 : tensor<8xi32>
2026-02-21T08:24:49.7278720Z         %30 = tt.expand_dims %7 {axis = 1 : i32} : tensor<2048xi32> -> tensor<2048x1xi32>
2026-02-21T08:24:49.7278992Z         %31 = arith.muli %30, %cst : tensor<2048x1xi32>
2026-02-21T08:24:49.7279242Z         %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:24:49.7279527Z         %33 = tt.broadcast %31 : tensor<2048x1xi32> -> tensor<2048x8xi32>
2026-02-21T08:24:49.7279786Z         %34 = tt.broadcast %32 : tensor<1x8xi32> -> tensor<2048x8xi32>
2026-02-21T08:24:49.7280012Z         %35 = arith.addi %33, %34 : tensor<2048x8xi32>
2026-02-21T08:24:49.7280247Z         %36 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<2048x8x!tt.ptr<f32>>
2026-02-21T08:24:49.7280518Z         %37 = tt.addptr %36, %35 : tensor<2048x8x!tt.ptr<f32>>, tensor<2048x8xi32>
2026-02-21T08:24:49.7280816Z         %38 = tt.load %37 evictionPolicy = evict_first : tensor<2048x8x!tt.ptr<f32>>
2026-02-21T08:24:49.7281156Z         %39 = tt.descriptor_load %0[%4, %arg6] : !tt.tensordesc<tensor<2048x8xf32>> -> tensor<2048x8xf32>
2026-02-21T08:24:49.7281446Z         %40 = scf.if %arg3 -> (tensor<2048x8xf32>) {
2026-02-21T08:24:49.7281820Z           %76 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<2048x8xf32>) -> tensor<2048x8xf32>
2026-02-21T08:24:49.7282409Z           %77 = arith.subf %39, %38 : tensor<2048x8xf32>
2026-02-21T08:24:49.7282620Z           %78 = arith.mulf %76, %77 : tensor<2048x8xf32>
2026-02-21T08:24:49.7282827Z           %79 = arith.addf %78, %cst_0 : tensor<2048x8xf32>
2026-02-21T08:24:49.7283036Z           scf.yield %79 : tensor<2048x8xf32>
2026-02-21T08:24:49.7283211Z         } else {
2026-02-21T08:24:49.7283372Z           %76 = tt.splat %arg4 : f32 -> tensor<2048x8xf32>
2026-02-21T08:24:49.7283595Z           %77 = arith.cmpf ogt, %39, %76 : tensor<2048x8xf32>
2026-02-21T08:24:49.7283810Z           %78 = arith.cmpf une, %39, %39 : tensor<2048x8xf32>
2026-02-21T08:24:49.7284023Z           %79 = arith.ori %77, %78 : tensor<2048x8xi1>
2026-02-21T08:24:49.7284254Z           %80 = arith.select %79, %39, %76 : tensor<2048x8xi1>, tensor<2048x8xf32>
2026-02-21T08:24:49.7284575Z           %81 = math.log %80 : tensor<2048x8xf32>
2026-02-21T08:24:49.7284782Z           %82 = arith.subf %81, %38 : tensor<2048x8xf32>
2026-02-21T08:24:49.7284979Z           %83 = arith.mulf %39, %82 : tensor<2048x8xf32>
2026-02-21T08:24:49.7285187Z           %84 = arith.addf %83, %cst_0 : tensor<2048x8xf32>
2026-02-21T08:24:49.7285379Z           scf.yield %84 : tensor<2048x8xf32>
2026-02-21T08:24:49.7285548Z         }
2026-02-21T08:24:49.7285689Z         %41 = arith.addf %arg7, %40 : tensor<2048x8xf32>
2026-02-21T08:24:49.7285891Z         %c1_i32_1 = arith.constant 1 : i32
2026-02-21T08:24:49.7286085Z         %42 = arith.muli %c8_i32, %c1_i32_1 : i32
2026-02-21T08:24:49.7286271Z         %43 = arith.addi %arg6, %42 : i32
2026-02-21T08:24:49.7286505Z         %44 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:24:49.7286739Z         %45 = tt.splat %43 : i32 -> tensor<8xi32>
2026-02-21T08:24:49.7286929Z         %46 = arith.addi %45, %44 : tensor<8xi32>
2026-02-21T08:24:49.7287170Z         %47 = tt.expand_dims %7 {axis = 1 : i32} : tensor<2048xi32> -> tensor<2048x1xi32>
2026-02-21T08:24:49.7287437Z         %48 = arith.muli %47, %cst : tensor<2048x1xi32>
2026-02-21T08:24:49.7287675Z         %49 = tt.expand_dims %46 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:24:49.7287954Z         %50 = tt.broadcast %48 : tensor<2048x1xi32> -> tensor<2048x8xi32>
2026-02-21T08:24:49.7288213Z         %51 = tt.broadcast %49 : tensor<1x8xi32> -> tensor<2048x8xi32>
2026-02-21T08:24:49.7288434Z         %52 = arith.addi %50, %51 : tensor<2048x8xi32>
2026-02-21T08:24:49.7288662Z         %53 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<2048x8x!tt.ptr<f32>>
2026-02-21T08:24:49.7288925Z         %54 = tt.addptr %53, %52 : tensor<2048x8x!tt.ptr<f32>>, tensor<2048x8xi32>
2026-02-21T08:24:49.7289221Z         %55 = tt.load %54 evictionPolicy = evict_first : tensor<2048x8x!tt.ptr<f32>>
2026-02-21T08:24:49.7289557Z         %56 = tt.descriptor_load %0[%4, %43] : !tt.tensordesc<tensor<2048x8xf32>> -> tensor<2048x8xf32>
2026-02-21T08:24:49.7289846Z         %57 = scf.if %arg3 -> (tensor<2048x8xf32>) {
2026-02-21T08:24:49.7290207Z           %76 = tt.extern_elementwise %56 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<2048x8xf32>) -> tensor<2048x8xf32>
2026-02-21T08:24:49.7290568Z           %77 = arith.subf %56, %55 : tensor<2048x8xf32>
2026-02-21T08:24:49.7290775Z           %78 = arith.mulf %76, %77 : tensor<2048x8xf32>
2026-02-21T08:24:49.7290976Z           %79 = arith.addf %78, %cst_0 : tensor<2048x8xf32>
2026-02-21T08:24:49.7291182Z           scf.yield %79 : tensor<2048x8xf32>
2026-02-21T08:24:49.7291354Z         } else {
2026-02-21T08:24:49.7291508Z           %76 = tt.splat %arg4 : f32 -> tensor<2048x8xf32>
2026-02-21T08:24:49.7291726Z           %77 = arith.cmpf ogt, %56, %76 : tensor<2048x8xf32>
2026-02-21T08:24:49.7291976Z           %78 = arith.cmpf une, %56, %56 : tensor<2048x8xf32>
2026-02-21T08:24:49.7292191Z           %79 = arith.ori %77, %78 : tensor<2048x8xi1>
2026-02-21T08:24:49.7292485Z           %80 = arith.select %79, %56, %76 : tensor<2048x8xi1>, tensor<2048x8xf32>
2026-02-21T08:24:49.7292726Z           %81 = math.log %80 : tensor<2048x8xf32>
2026-02-21T08:24:49.7292923Z           %82 = arith.subf %81, %55 : tensor<2048x8xf32>
2026-02-21T08:24:49.7293119Z           %83 = arith.mulf %56, %82 : tensor<2048x8xf32>
2026-02-21T08:24:49.7293325Z           %84 = arith.addf %83, %cst_0 : tensor<2048x8xf32>
2026-02-21T08:24:49.7293520Z           scf.yield %84 : tensor<2048x8xf32>
2026-02-21T08:24:49.7293691Z         }
2026-02-21T08:24:49.7293827Z         %58 = arith.addf %41, %57 : tensor<2048x8xf32>
2026-02-21T08:24:49.7294021Z         %c2_i32_2 = arith.constant 2 : i32
2026-02-21T08:24:49.7294210Z         %59 = arith.muli %c8_i32, %c2_i32_2 : i32
2026-02-21T08:24:49.7294391Z         %60 = arith.addi %arg6, %59 : i32
2026-02-21T08:24:49.7294611Z         %61 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:24:49.7294922Z         %62 = tt.splat %60 : i32 -> tensor<8xi32>
2026-02-21T08:24:49.7295123Z         %63 = arith.addi %62, %61 : tensor<8xi32>
2026-02-21T08:24:49.7295369Z         %64 = tt.expand_dims %7 {axis = 1 : i32} : tensor<2048xi32> -> tensor<2048x1xi32>
2026-02-21T08:24:49.7295639Z         %65 = arith.muli %64, %cst : tensor<2048x1xi32>
2026-02-21T08:24:49.7295891Z         %66 = tt.expand_dims %63 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:24:49.7296168Z         %67 = tt.broadcast %65 : tensor<2048x1xi32> -> tensor<2048x8xi32>
2026-02-21T08:24:49.7296437Z         %68 = tt.broadcast %66 : tensor<1x8xi32> -> tensor<2048x8xi32>
2026-02-21T08:24:49.7296696Z         %69 = arith.addi %67, %68 : tensor<2048x8xi32>
2026-02-21T08:24:49.7296936Z         %70 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<2048x8x!tt.ptr<f32>>
2026-02-21T08:24:49.7297210Z         %71 = tt.addptr %70, %69 : tensor<2048x8x!tt.ptr<f32>>, tensor<2048x8xi32>
2026-02-21T08:24:49.7297514Z         %72 = tt.load %71 evictionPolicy = evict_first : tensor<2048x8x!tt.ptr<f32>>
2026-02-21T08:24:49.7297856Z         %73 = tt.descriptor_load %0[%4, %60] : !tt.tensordesc<tensor<2048x8xf32>> -> tensor<2048x8xf32>
2026-02-21T08:24:49.7298145Z         %74 = scf.if %arg3 -> (tensor<2048x8xf32>) {
2026-02-21T08:24:49.7298508Z           %76 = tt.extern_elementwise %73 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<2048x8xf32>) -> tensor<2048x8xf32>
2026-02-21T08:24:49.7298868Z           %77 = arith.subf %73, %72 : tensor<2048x8xf32>
2026-02-21T08:24:49.7299079Z           %78 = arith.mulf %76, %77 : tensor<2048x8xf32>
2026-02-21T08:24:49.7299288Z           %79 = arith.addf %78, %cst_0 : tensor<2048x8xf32>
2026-02-21T08:24:49.7299497Z           scf.yield %79 : tensor<2048x8xf32>
2026-02-21T08:24:49.7299670Z         } else {
2026-02-21T08:24:49.7299830Z           %76 = tt.splat %arg4 : f32 -> tensor<2048x8xf32>
2026-02-21T08:24:49.7300055Z           %77 = arith.cmpf ogt, %73, %76 : tensor<2048x8xf32>
2026-02-21T08:24:49.7300274Z           %78 = arith.cmpf une, %73, %73 : tensor<2048x8xf32>
2026-02-21T08:24:49.7300491Z           %79 = arith.ori %77, %78 : tensor<2048x8xi1>
2026-02-21T08:24:49.7300727Z           %80 = arith.select %79, %73, %76 : tensor<2048x8xi1>, tensor<2048x8xf32>
2026-02-21T08:24:49.7300976Z           %81 = math.log %80 : tensor<2048x8xf32>
2026-02-21T08:24:49.7301179Z           %82 = arith.subf %81, %72 : tensor<2048x8xf32>
2026-02-21T08:24:49.7301377Z           %83 = arith.mulf %73, %82 : tensor<2048x8xf32>
2026-02-21T08:24:49.7301591Z           %84 = arith.addf %83, %cst_0 : tensor<2048x8xf32>
2026-02-21T08:24:49.7301789Z           scf.yield %84 : tensor<2048x8xf32>
2026-02-21T08:24:49.7301984Z         }
2026-02-21T08:24:49.7302120Z         %75 = arith.addf %58, %74 : tensor<2048x8xf32>
2026-02-21T08:24:49.7302309Z         scf.yield %75 : tensor<2048x8xf32>
2026-02-21T08:24:49.7302492Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:24:49.7302719Z       %9 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:24:49.7303023Z       %10 = tt.splat %c8184_i32 : i32 -> tensor<8xi32>
2026-02-21T08:24:49.7303214Z       %11 = arith.addi %10, %9 : tensor<8xi32>
2026-02-21T08:24:49.7303459Z       %12 = tt.expand_dims %7 {axis = 1 : i32} : tensor<2048xi32> -> tensor<2048x1xi32>
2026-02-21T08:24:49.7303717Z       %13 = arith.muli %12, %cst : tensor<2048x1xi32>
2026-02-21T08:24:49.7303984Z       %14 = tt.expand_dims %11 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:24:49.7304252Z       %15 = tt.broadcast %13 : tensor<2048x1xi32> -> tensor<2048x8xi32>
2026-02-21T08:24:49.7304510Z       %16 = tt.broadcast %14 : tensor<1x8xi32> -> tensor<2048x8xi32>
2026-02-21T08:24:49.7304759Z       %17 = arith.addi %15, %16 : tensor<2048x8xi32>
2026-02-21T08:24:49.7304992Z       %18 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<2048x8x!tt.ptr<f32>>
2026-02-21T08:24:49.7305276Z       %19 = tt.addptr %18, %17 : tensor<2048x8x!tt.ptr<f32>>, tensor<2048x8xi32>
2026-02-21T08:24:49.7305635Z       %20 = tt.load %19 evictionPolicy = evict_first : tensor<2048x8x!tt.ptr<f32>>
2026-02-21T08:24:49.7306010Z       %21 = tt.descriptor_load %0[%4, %c8184_i32] : !tt.tensordesc<tensor<2048x8xf32>> -> tensor<2048x8xf32>
2026-02-21T08:24:49.7306330Z       %22 = scf.if %arg3 -> (tensor<2048x8xf32>) {
2026-02-21T08:24:49.7306702Z         %27 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<2048x8xf32>) -> tensor<2048x8xf32>
2026-02-21T08:24:49.7307083Z         %28 = arith.subf %21, %20 : tensor<2048x8xf32>
2026-02-21T08:24:49.7307289Z         %29 = arith.mulf %27, %28 : tensor<2048x8xf32>
2026-02-21T08:24:49.7307509Z         %30 = arith.addf %29, %cst_0 : tensor<2048x8xf32>
2026-02-21T08:24:49.7307712Z         scf.yield %30 : tensor<2048x8xf32>
2026-02-21T08:24:49.7307891Z       } else {
2026-02-21T08:24:49.7308060Z         %27 = tt.splat %arg4 : f32 -> tensor<2048x8xf32>
2026-02-21T08:24:49.7308284Z         %28 = arith.cmpf ogt, %21, %27 : tensor<2048x8xf32>
2026-02-21T08:24:49.7308527Z         %29 = arith.cmpf une, %21, %21 : tensor<2048x8xf32>
2026-02-21T08:24:49.7308752Z         %30 = arith.ori %28, %29 : tensor<2048x8xi1>
2026-02-21T08:24:49.7309005Z         %31 = arith.select %30, %21, %27 : tensor<2048x8xi1>, tensor<2048x8xf32>
2026-02-21T08:24:49.7309251Z         %32 = math.log %31 : tensor<2048x8xf32>
2026-02-21T08:24:49.7309456Z         %33 = arith.subf %32, %20 : tensor<2048x8xf32>
2026-02-21T08:24:49.7309667Z         %34 = arith.mulf %21, %33 : tensor<2048x8xf32>
2026-02-21T08:24:49.7309880Z         %35 = arith.addf %34, %cst_0 : tensor<2048x8xf32>
2026-02-21T08:24:49.7310090Z         scf.yield %35 : tensor<2048x8xf32>
2026-02-21T08:24:49.7310261Z       }
2026-02-21T08:24:49.7310414Z       %23 = arith.addf %8, %22 : tensor<2048x8xf32>
2026-02-21T08:24:49.7310616Z       %24 = "tt.reduce"(%23) <{axis = 1 : i32}> ({
2026-02-21T08:24:49.7310816Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:49.7310995Z         %27 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:24:49.7311193Z         tt.reduce.return %27 : f32
2026-02-21T08:24:49.7311392Z       }) : (tensor<2048x8xf32>) -> tensor<2048xf32>
2026-02-21T08:24:49.7311631Z       %25 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<2048x!tt.ptr<f32>>
2026-02-21T08:24:49.7311936Z       %26 = tt.addptr %25, %7 : tensor<2048x!tt.ptr<f32>>, tensor<2048xi32>
2026-02-21T08:24:49.7312185Z       tt.store %26, %24 : tensor<2048x!tt.ptr<f32>>
2026-02-21T08:24:49.7312391Z     } {tt.warp_specialize}
2026-02-21T08:24:49.7312545Z     tt.return
2026-02-21T08:24:49.7312681Z   }
2026-02-21T08:24:49.7312800Z }
2026-02-21T08:24:49.7312876Z 
2026-02-21T08:24:49.7312926Z {-#
2026-02-21T08:24:49.7313060Z   external_resources: {
2026-02-21T08:24:49.7313218Z     mlir_reproducer: {
2026-02-21T08:24:49.7317548Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:24:49.7321923Z       disable_threading: false,
2026-02-21T08:24:49.7322093Z       verify_each: true
2026-02-21T08:24:49.7322232Z     }
2026-02-21T08:24:49.7322351Z   }
2026-02-21T08:24:49.7322459Z #-}
2026-02-21T08:24:49.7322886Z /tmp/torchinductor_root/7q/c7qvgckpip6quizl7wwune2zn2qif7xj2v2ry4kmed7ypp353pjy.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:24:49.7324052Z /tmp/torchinductor_root/7q/c7qvgckpip6quizl7wwune2zn2qif7xj2v2ry4kmed7ypp353pjy.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:24:49.7325004Z [39s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:24:49.7326045Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 2048], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', ''], maxnreg=256, num_sm_multiplier=128, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:24:49.7326976Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:24:49.7327225Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:24:52.2024985Z module {
2026-02-21T08:24:52.2029646Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:24:52.2033725Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:24:52.2035184Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:24:52.2035401Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:24:52.2035584Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:24:52.2035793Z     %cst = arith.constant dense<8192> : tensor<4x1xi32>
2026-02-21T08:24:52.2036051Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<4x32xf32>
2026-02-21T08:24:52.2036291Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:24:52.2036473Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:24:52.2036656Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T08:24:52.2036842Z     %c8192_i64 = arith.constant 8192 : i64
2026-02-21T08:24:52.2037363Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:24:52.2037664Z     %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : <f32>, <tensor<4x32xf32>>
2026-02-21T08:24:52.2037982Z     %1 = tt.get_program_id x : i32
2026-02-21T08:24:52.2038162Z     %2 = arith.muli %1, %c4_i32 : i32
2026-02-21T08:24:52.2038329Z     %3 = arith.addi %2, %c4_i32 : i32
2026-02-21T08:24:52.2038506Z     %4 = arith.minsi %3, %c1024_i32 : i32
2026-02-21T08:24:52.2038678Z     %5 = arith.subi %4, %2 : i32
2026-02-21T08:24:52.2038847Z     %c1_i32_1 = arith.constant 1 : i32
2026-02-21T08:24:52.2039022Z     %6 = arith.subi %c1_i32, %c1_i32_1 : i32
2026-02-21T08:24:52.2039198Z     %7 = arith.addi %5, %6 : i32
2026-02-21T08:24:52.2039355Z     %8 = arith.divui %7, %c1_i32 : i32
2026-02-21T08:24:52.2039527Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T08:24:52.2039695Z     %9 = arith.remsi %8, %c3_i32 : i32
2026-02-21T08:24:52.2039855Z     %10 = arith.subi %8, %9 : i32
2026-02-21T08:24:52.2040126Z     %11 = arith.muli %10, %c1_i32 : i32
2026-02-21T08:24:52.2040302Z     %12 = arith.addi %2, %11 : i32
2026-02-21T08:24:52.2040478Z     %13 = arith.muli %c1_i32, %c3_i32 : i32
2026-02-21T08:24:52.2040667Z     scf.for %arg5 = %2 to %12 step %13  : i32 {
2026-02-21T08:24:52.2040865Z       %14 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:24:52.2041083Z       %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:24:52.2041334Z       %16 = tt.splat %14 : i32 -> tensor<4xi32>
2026-02-21T08:24:52.2041526Z       %17 = arith.addi %16, %15 : tensor<4xi32>
2026-02-21T08:24:52.2041831Z       %18 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x32xf32>)  : i32 {
2026-02-21T08:24:52.2042366Z         %42 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32>
2026-02-21T08:24:52.2042612Z         %43 = tt.splat %arg6 : i32 -> tensor<32xi32>
2026-02-21T08:24:52.2042823Z         %44 = arith.addi %43, %42 : tensor<32xi32>
2026-02-21T08:24:52.2043075Z         %45 = tt.expand_dims %17 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:24:52.2043324Z         %46 = arith.muli %45, %cst : tensor<4x1xi32>
2026-02-21T08:24:52.2043574Z         %47 = tt.expand_dims %44 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:24:52.2043850Z         %48 = tt.broadcast %46 : tensor<4x1xi32> -> tensor<4x32xi32>
2026-02-21T08:24:52.2044102Z         %49 = tt.broadcast %47 : tensor<1x32xi32> -> tensor<4x32xi32>
2026-02-21T08:24:52.2044319Z         %50 = arith.addi %48, %49 : tensor<4x32xi32>
2026-02-21T08:24:52.2044550Z         %51 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x32x!tt.ptr<f32>>
2026-02-21T08:24:52.2044816Z         %52 = tt.addptr %51, %50 : tensor<4x32x!tt.ptr<f32>>, tensor<4x32xi32>
2026-02-21T08:24:52.2045096Z         %53 = tt.load %52 evictionPolicy = evict_last : tensor<4x32x!tt.ptr<f32>>
2026-02-21T08:24:52.2045430Z         %54 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<4x32xf32>> -> tensor<4x32xf32>
2026-02-21T08:24:52.2045712Z         %55 = scf.if %arg3 -> (tensor<4x32xf32>) {
2026-02-21T08:24:52.2046074Z           %57 = tt.extern_elementwise %54 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x32xf32>) -> tensor<4x32xf32>
2026-02-21T08:24:52.2046436Z           %58 = arith.subf %54, %53 : tensor<4x32xf32>
2026-02-21T08:24:52.2046635Z           %59 = arith.mulf %57, %58 : tensor<4x32xf32>
2026-02-21T08:24:52.2046845Z           %60 = arith.addf %59, %cst_0 : tensor<4x32xf32>
2026-02-21T08:24:52.2047039Z           scf.yield %60 : tensor<4x32xf32>
2026-02-21T08:24:52.2047208Z         } else {
2026-02-21T08:24:52.2047412Z           %57 = tt.splat %arg4 : f32 -> tensor<4x32xf32>
2026-02-21T08:24:52.2047623Z           %58 = arith.cmpf ogt, %54, %57 : tensor<4x32xf32>
2026-02-21T08:24:52.2047841Z           %59 = arith.cmpf une, %54, %54 : tensor<4x32xf32>
2026-02-21T08:24:52.2048047Z           %60 = arith.ori %58, %59 : tensor<4x32xi1>
2026-02-21T08:24:52.2048278Z           %61 = arith.select %60, %54, %57 : tensor<4x32xi1>, tensor<4x32xf32>
2026-02-21T08:24:52.2048608Z           %62 = math.log %61 : tensor<4x32xf32>
2026-02-21T08:24:52.2048797Z           %63 = arith.subf %62, %53 : tensor<4x32xf32>
2026-02-21T08:24:52.2048995Z           %64 = arith.mulf %54, %63 : tensor<4x32xf32>
2026-02-21T08:24:52.2049195Z           %65 = arith.addf %64, %cst_0 : tensor<4x32xf32>
2026-02-21T08:24:52.2049397Z           scf.yield %65 : tensor<4x32xf32>
2026-02-21T08:24:52.2049567Z         }
2026-02-21T08:24:52.2049709Z         %56 = arith.addf %arg7, %55 : tensor<4x32xf32>
2026-02-21T08:24:52.2049903Z         scf.yield %56 : tensor<4x32xf32>
2026-02-21T08:24:52.2050085Z       } {tt.flatten, tt.warp_specialize}
2026-02-21T08:24:52.2050278Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:24:52.2050458Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:52.2050633Z         %42 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:24:52.2050870Z         tt.reduce.return %42 : f32
2026-02-21T08:24:52.2051058Z       }) : (tensor<4x32xf32>) -> tensor<4xf32>
2026-02-21T08:24:52.2051279Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:24:52.2051526Z       %21 = tt.addptr %20, %17 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:24:52.2051755Z       tt.store %21, %19 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:24:52.2051973Z       %c1_i32_2 = arith.constant 1 : i32
2026-02-21T08:24:52.2052158Z       %22 = arith.muli %c1_i32, %c1_i32_2 : i32
2026-02-21T08:24:52.2052336Z       %23 = arith.addi %arg5, %22 : i32
2026-02-21T08:24:52.2052512Z       %24 = arith.muli %23, %c4_i32 : i32
2026-02-21T08:24:52.2052721Z       %25 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:24:52.2052957Z       %26 = tt.splat %24 : i32 -> tensor<4xi32>
2026-02-21T08:24:52.2053147Z       %27 = arith.addi %26, %25 : tensor<4xi32>
2026-02-21T08:24:52.2053445Z       %28 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x32xf32>)  : i32 {
2026-02-21T08:24:52.2053793Z         %42 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32>
2026-02-21T08:24:52.2054031Z         %43 = tt.splat %arg6 : i32 -> tensor<32xi32>
2026-02-21T08:24:52.2054231Z         %44 = arith.addi %43, %42 : tensor<32xi32>
2026-02-21T08:24:52.2054475Z         %45 = tt.expand_dims %27 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:24:52.2054722Z         %46 = arith.muli %45, %cst : tensor<4x1xi32>
2026-02-21T08:24:52.2054963Z         %47 = tt.expand_dims %44 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:24:52.2055235Z         %48 = tt.broadcast %46 : tensor<4x1xi32> -> tensor<4x32xi32>
2026-02-21T08:24:52.2055485Z         %49 = tt.broadcast %47 : tensor<1x32xi32> -> tensor<4x32xi32>
2026-02-21T08:24:52.2055703Z         %50 = arith.addi %48, %49 : tensor<4x32xi32>
2026-02-21T08:24:52.2055927Z         %51 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x32x!tt.ptr<f32>>
2026-02-21T08:24:52.2056192Z         %52 = tt.addptr %51, %50 : tensor<4x32x!tt.ptr<f32>>, tensor<4x32xi32>
2026-02-21T08:24:52.2056471Z         %53 = tt.load %52 evictionPolicy = evict_last : tensor<4x32x!tt.ptr<f32>>
2026-02-21T08:24:52.2056805Z         %54 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc<tensor<4x32xf32>> -> tensor<4x32xf32>
2026-02-21T08:24:52.2057085Z         %55 = scf.if %arg3 -> (tensor<4x32xf32>) {
2026-02-21T08:24:52.2057442Z           %57 = tt.extern_elementwise %54 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x32xf32>) -> tensor<4x32xf32>
2026-02-21T08:24:52.2057802Z           %58 = arith.subf %54, %53 : tensor<4x32xf32>
2026-02-21T08:24:52.2058000Z           %59 = arith.mulf %57, %58 : tensor<4x32xf32>
2026-02-21T08:24:52.2058216Z           %60 = arith.addf %59, %cst_0 : tensor<4x32xf32>
2026-02-21T08:24:52.2058442Z           scf.yield %60 : tensor<4x32xf32>
2026-02-21T08:24:52.2058619Z         } else {
2026-02-21T08:24:52.2058786Z           %57 = tt.splat %arg4 : f32 -> tensor<4x32xf32>
2026-02-21T08:24:52.2059080Z           %58 = arith.cmpf ogt, %54, %57 : tensor<4x32xf32>
2026-02-21T08:24:52.2059299Z           %59 = arith.cmpf une, %54, %54 : tensor<4x32xf32>
2026-02-21T08:24:52.2059516Z           %60 = arith.ori %58, %59 : tensor<4x32xi1>
2026-02-21T08:24:52.2059761Z           %61 = arith.select %60, %54, %57 : tensor<4x32xi1>, tensor<4x32xf32>
2026-02-21T08:24:52.2060002Z           %62 = math.log %61 : tensor<4x32xf32>
2026-02-21T08:24:52.2060206Z           %63 = arith.subf %62, %53 : tensor<4x32xf32>
2026-02-21T08:24:52.2060404Z           %64 = arith.mulf %54, %63 : tensor<4x32xf32>
2026-02-21T08:24:52.2060619Z           %65 = arith.addf %64, %cst_0 : tensor<4x32xf32>
2026-02-21T08:24:52.2060816Z           scf.yield %65 : tensor<4x32xf32>
2026-02-21T08:24:52.2060996Z         }
2026-02-21T08:24:52.2061147Z         %56 = arith.addf %arg7, %55 : tensor<4x32xf32>
2026-02-21T08:24:52.2061338Z         scf.yield %56 : tensor<4x32xf32>
2026-02-21T08:24:52.2061637Z       } {tt.flatten, tt.warp_specialize}
2026-02-21T08:24:52.2061837Z       %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({
2026-02-21T08:24:52.2062068Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:52.2062245Z         %42 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:24:52.2062438Z         tt.reduce.return %42 : f32
2026-02-21T08:24:52.2062624Z       }) : (tensor<4x32xf32>) -> tensor<4xf32>
2026-02-21T08:24:52.2062857Z       %30 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:24:52.2063124Z       %31 = tt.addptr %30, %27 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:24:52.2063356Z       tt.store %31, %29 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:24:52.2063558Z       %c2_i32 = arith.constant 2 : i32
2026-02-21T08:24:52.2063741Z       %32 = arith.muli %c1_i32, %c2_i32 : i32
2026-02-21T08:24:52.2063929Z       %33 = arith.addi %arg5, %32 : i32
2026-02-21T08:24:52.2064104Z       %34 = arith.muli %33, %c4_i32 : i32
2026-02-21T08:24:52.2064331Z       %35 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:24:52.2064577Z       %36 = tt.splat %34 : i32 -> tensor<4xi32>
2026-02-21T08:24:52.2064766Z       %37 = arith.addi %36, %35 : tensor<4xi32>
2026-02-21T08:24:52.2065082Z       %38 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x32xf32>)  : i32 {
2026-02-21T08:24:52.2065435Z         %42 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32>
2026-02-21T08:24:52.2065689Z         %43 = tt.splat %arg6 : i32 -> tensor<32xi32>
2026-02-21T08:24:52.2065926Z         %44 = arith.addi %43, %42 : tensor<32xi32>
2026-02-21T08:24:52.2066167Z         %45 = tt.expand_dims %37 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:24:52.2066420Z         %46 = arith.muli %45, %cst : tensor<4x1xi32>
2026-02-21T08:24:52.2066656Z         %47 = tt.expand_dims %44 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:24:52.2066936Z         %48 = tt.broadcast %46 : tensor<4x1xi32> -> tensor<4x32xi32>
2026-02-21T08:24:52.2067178Z         %49 = tt.broadcast %47 : tensor<1x32xi32> -> tensor<4x32xi32>
2026-02-21T08:24:52.2067406Z         %50 = arith.addi %48, %49 : tensor<4x32xi32>
2026-02-21T08:24:52.2067621Z         %51 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x32x!tt.ptr<f32>>
2026-02-21T08:24:52.2067883Z         %52 = tt.addptr %51, %50 : tensor<4x32x!tt.ptr<f32>>, tensor<4x32xi32>
2026-02-21T08:24:52.2068162Z         %53 = tt.load %52 evictionPolicy = evict_last : tensor<4x32x!tt.ptr<f32>>
2026-02-21T08:24:52.2068481Z         %54 = tt.descriptor_load %0[%34, %arg6] : !tt.tensordesc<tensor<4x32xf32>> -> tensor<4x32xf32>
2026-02-21T08:24:52.2068764Z         %55 = scf.if %arg3 -> (tensor<4x32xf32>) {
2026-02-21T08:24:52.2069108Z           %57 = tt.extern_elementwise %54 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x32xf32>) -> tensor<4x32xf32>
2026-02-21T08:24:52.2069465Z           %58 = arith.subf %54, %53 : tensor<4x32xf32>
2026-02-21T08:24:52.2069672Z           %59 = arith.mulf %57, %58 : tensor<4x32xf32>
2026-02-21T08:24:52.2069935Z           %60 = arith.addf %59, %cst_0 : tensor<4x32xf32>
2026-02-21T08:24:52.2070134Z           scf.yield %60 : tensor<4x32xf32>
2026-02-21T08:24:52.2070296Z         } else {
2026-02-21T08:24:52.2070456Z           %57 = tt.splat %arg4 : f32 -> tensor<4x32xf32>
2026-02-21T08:24:52.2070664Z           %58 = arith.cmpf ogt, %54, %57 : tensor<4x32xf32>
2026-02-21T08:24:52.2070882Z           %59 = arith.cmpf une, %54, %54 : tensor<4x32xf32>
2026-02-21T08:24:52.2071091Z           %60 = arith.ori %58, %59 : tensor<4x32xi1>
2026-02-21T08:24:52.2071315Z           %61 = arith.select %60, %54, %57 : tensor<4x32xi1>, tensor<4x32xf32>
2026-02-21T08:24:52.2071549Z           %62 = math.log %61 : tensor<4x32xf32>
2026-02-21T08:24:52.2071735Z           %63 = arith.subf %62, %53 : tensor<4x32xf32>
2026-02-21T08:24:52.2071955Z           %64 = arith.mulf %54, %63 : tensor<4x32xf32>
2026-02-21T08:24:52.2072151Z           %65 = arith.addf %64, %cst_0 : tensor<4x32xf32>
2026-02-21T08:24:52.2072418Z           scf.yield %65 : tensor<4x32xf32>
2026-02-21T08:24:52.2072584Z         }
2026-02-21T08:24:52.2072731Z         %56 = arith.addf %arg7, %55 : tensor<4x32xf32>
2026-02-21T08:24:52.2072922Z         scf.yield %56 : tensor<4x32xf32>
2026-02-21T08:24:52.2073100Z       } {tt.flatten, tt.warp_specialize}
2026-02-21T08:24:52.2073294Z       %39 = "tt.reduce"(%38) <{axis = 1 : i32}> ({
2026-02-21T08:24:52.2073477Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:52.2073656Z         %42 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:24:52.2073832Z         tt.reduce.return %42 : f32
2026-02-21T08:24:52.2074016Z       }) : (tensor<4x32xf32>) -> tensor<4xf32>
2026-02-21T08:24:52.2074231Z       %40 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:24:52.2074488Z       %41 = tt.addptr %40, %37 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:24:52.2074716Z       tt.store %41, %39 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:24:52.2074959Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:24:52.2075217Z     scf.for %arg5 = %12 to %4 step %c1_i32  : i32 {
2026-02-21T08:24:52.2075408Z       %14 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:24:52.2075630Z       %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:24:52.2075858Z       %16 = tt.splat %14 : i32 -> tensor<4xi32>
2026-02-21T08:24:52.2076049Z       %17 = arith.addi %16, %15 : tensor<4xi32>
2026-02-21T08:24:52.2076351Z       %18 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x32xf32>)  : i32 {
2026-02-21T08:24:52.2076692Z         %22 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32>
2026-02-21T08:24:52.2076934Z         %23 = tt.splat %arg6 : i32 -> tensor<32xi32>
2026-02-21T08:24:52.2077124Z         %24 = arith.addi %23, %22 : tensor<32xi32>
2026-02-21T08:24:52.2077365Z         %25 = tt.expand_dims %17 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:24:52.2077617Z         %26 = arith.muli %25, %cst : tensor<4x1xi32>
2026-02-21T08:24:52.2077851Z         %27 = tt.expand_dims %24 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32>
2026-02-21T08:24:52.2078128Z         %28 = tt.broadcast %26 : tensor<4x1xi32> -> tensor<4x32xi32>
2026-02-21T08:24:52.2078372Z         %29 = tt.broadcast %27 : tensor<1x32xi32> -> tensor<4x32xi32>
2026-02-21T08:24:52.2078596Z         %30 = arith.addi %28, %29 : tensor<4x32xi32>
2026-02-21T08:24:52.2078812Z         %31 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x32x!tt.ptr<f32>>
2026-02-21T08:24:52.2079076Z         %32 = tt.addptr %31, %30 : tensor<4x32x!tt.ptr<f32>>, tensor<4x32xi32>
2026-02-21T08:24:52.2079361Z         %33 = tt.load %32 evictionPolicy = evict_last : tensor<4x32x!tt.ptr<f32>>
2026-02-21T08:24:52.2079683Z         %34 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<4x32xf32>> -> tensor<4x32xf32>
2026-02-21T08:24:52.2079970Z         %35 = scf.if %arg3 -> (tensor<4x32xf32>) {
2026-02-21T08:24:52.2080318Z           %37 = tt.extern_elementwise %34 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x32xf32>) -> tensor<4x32xf32>
2026-02-21T08:24:52.2080736Z           %38 = arith.subf %34, %33 : tensor<4x32xf32>
2026-02-21T08:24:52.2080931Z           %39 = arith.mulf %37, %38 : tensor<4x32xf32>
2026-02-21T08:24:52.2081144Z           %40 = arith.addf %39, %cst_0 : tensor<4x32xf32>
2026-02-21T08:24:52.2081349Z           scf.yield %40 : tensor<4x32xf32>
2026-02-21T08:24:52.2081513Z         } else {
2026-02-21T08:24:52.2081675Z           %37 = tt.splat %arg4 : f32 -> tensor<4x32xf32>
2026-02-21T08:24:52.2081918Z           %38 = arith.cmpf ogt, %34, %37 : tensor<4x32xf32>
2026-02-21T08:24:52.2082151Z           %39 = arith.cmpf une, %34, %34 : tensor<4x32xf32>
2026-02-21T08:24:52.2082357Z           %40 = arith.ori %38, %39 : tensor<4x32xi1>
2026-02-21T08:24:52.2082590Z           %41 = arith.select %40, %34, %37 : tensor<4x32xi1>, tensor<4x32xf32>
2026-02-21T08:24:52.2082823Z           %42 = math.log %41 : tensor<4x32xf32>
2026-02-21T08:24:52.2083068Z           %43 = arith.subf %42, %33 : tensor<4x32xf32>
2026-02-21T08:24:52.2083272Z           %44 = arith.mulf %34, %43 : tensor<4x32xf32>
2026-02-21T08:24:52.2083470Z           %45 = arith.addf %44, %cst_0 : tensor<4x32xf32>
2026-02-21T08:24:52.2083668Z           scf.yield %45 : tensor<4x32xf32>
2026-02-21T08:24:52.2083829Z         }
2026-02-21T08:24:52.2083978Z         %36 = arith.addf %arg7, %35 : tensor<4x32xf32>
2026-02-21T08:24:52.2084163Z         scf.yield %36 : tensor<4x32xf32>
2026-02-21T08:24:52.2084347Z       } {tt.flatten, tt.warp_specialize}
2026-02-21T08:24:52.2084537Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:24:52.2084716Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:24:52.2084893Z         %22 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:24:52.2085070Z         tt.reduce.return %22 : f32
2026-02-21T08:24:52.2085252Z       }) : (tensor<4x32xf32>) -> tensor<4xf32>
2026-02-21T08:24:52.2085467Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:24:52.2085726Z       %21 = tt.addptr %20, %17 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:24:52.2085952Z       tt.store %21, %19 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:24:52.2086192Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:24:52.2086420Z     tt.return
2026-02-21T08:24:52.2086540Z   }
2026-02-21T08:24:52.2086662Z }
2026-02-21T08:24:52.2086729Z 
2026-02-21T08:24:52.2086777Z {-#
2026-02-21T08:24:52.2086904Z   external_resources: {
2026-02-21T08:24:52.2087052Z     mlir_reproducer: {
2026-02-21T08:24:52.2091314Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:24:52.2095705Z       disable_threading: false,
2026-02-21T08:24:52.2095866Z       verify_each: true
2026-02-21T08:24:52.2096037Z     }
2026-02-21T08:24:52.2096161Z   }
2026-02-21T08:24:52.2096303Z #-}
2026-02-21T08:24:52.2096827Z /tmp/torchinductor_root/ut/cutkrmuratgdy4w4xrkpvokkovuoxsp5tyx4ojndcaoy767utpnl.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:24:52.2098243Z /tmp/torchinductor_root/ut/cutkrmuratgdy4w4xrkpvokkovuoxsp5tyx4ojndcaoy767utpnl.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:24:52.2099336Z [41s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:24:52.2100436Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 4], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', ''], num_sm_multiplier=2, num_stages=7, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[2, 0], range_unroll_factors=[3, 0], range_warp_specializes=[False, True]), static_shapes=True)
2026-02-21T08:24:52.2101400Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:24:52.2101673Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:24:52.2714478Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 12.7 configs/s
2026-02-21T08:24:52.2725302Z [41s] Adaptive compile timeout: 30s (90% percentile=25.3s, bounds=[30.0s, 30s])
2026-02-21T08:24:53.2595370Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1009.9 configs/s
2026-02-21T08:24:53.3118534Z [42s] Initial random population of 100, 5 starting points: 
2026-02-21T08:24:53.3124007Z error=14
2026-02-21T08:24:53.3127877Z timeout=9
2026-02-21T08:24:53.3129403Z ok=77
2026-02-21T08:24:53.3129604Z min=0.0707
2026-02-21T08:24:53.3135156Z mid=1.2359
2026-02-21T08:24:53.3138328Z max=78.4087
2026-02-21T08:24:53.3142288Z best={'block_sizes': [256, 2],
2026-02-21T08:24:53.3144174Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:24:53.3144397Z  'load_eviction_policies': ['first', 'last'],
2026-02-21T08:24:53.3144588Z  'num_stages': 6,
2026-02-21T08:24:53.3144725Z  'num_warps': 1,
2026-02-21T08:24:53.3144869Z  'pid_type': 'flat',
2026-02-21T08:24:53.3145026Z  'range_flattens': [None, True],
2026-02-21T08:24:53.3145206Z  'range_multi_buffers': [None, None],
2026-02-21T08:24:53.3145391Z  'range_num_stages': [0, 0],
2026-02-21T08:24:53.3145582Z  'range_unroll_factors': [0, 1],
2026-02-21T08:24:53.3145777Z  'range_warp_specializes': [None, True]}
2026-02-21T08:24:53.3145982Z [42s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:24:54.6128768Z [44s] Generation 1 starting: 80 neighbors, 5 active search path(s)
2026-02-21T08:24:59.7567037Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 8.3 configs/s
2026-02-21T08:25:04.6039369Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 83/83 17.3 configs/s
2026-02-21T08:25:12.8341328Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 127.0 configs/s
2026-02-21T08:25:13.1924694Z [62s] Generation 1 complete: 
2026-02-21T08:25:13.1928341Z ok=86
2026-02-21T08:25:13.1932445Z min=0.0685
2026-02-21T08:25:13.1936341Z mid=0.0769
2026-02-21T08:25:13.1940318Z max=0.8306
2026-02-21T08:25:13.1944249Z best={'block_sizes': [2048, 1],
2026-02-21T08:25:13.1948713Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:25:13.1950259Z  'load_eviction_policies': ['first', ''],
2026-02-21T08:25:13.1950584Z  'num_stages': 5,
2026-02-21T08:25:13.1950829Z  'num_warps': 8,
2026-02-21T08:25:13.1951034Z  'pid_type': 'flat',
2026-02-21T08:25:13.1951353Z  'range_flattens': [None, None],
2026-02-21T08:25:13.1951620Z  'range_multi_buffers': [None, None],
2026-02-21T08:25:13.1951962Z  'range_num_stages': [0, 3],
2026-02-21T08:25:13.1952194Z  'range_unroll_factors': [0, 1],
2026-02-21T08:25:13.1952456Z  'range_warp_specializes': [None, True]}
2026-02-21T08:25:13.1952745Z [62s] Fitting surrogate: 186 points, 186 targets
2026-02-21T08:25:14.1738084Z [63s] Generation 2 starting: 69 neighbors, 5 active search path(s)
2026-02-21T08:25:26.6790989Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 1.7 configs/s
2026-02-21T08:25:30.8929143Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 72/72 17.3 configs/s
2026-02-21T08:25:37.3164139Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 157.3 configs/s
2026-02-21T08:25:37.5962284Z [87s] Generation 2 complete: 
2026-02-21T08:25:37.5962595Z ok=75
2026-02-21T08:25:37.5962749Z min=0.0624
2026-02-21T08:25:37.5962876Z mid=0.0728
2026-02-21T08:25:37.5963037Z max=0.3326
2026-02-21T08:25:37.5963191Z best={'block_sizes': [1024, 1],
2026-02-21T08:25:37.5963456Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:25:37.5963729Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:25:37.5963916Z  'num_stages': 7,
2026-02-21T08:25:37.5964068Z  'num_warps': 4,
2026-02-21T08:25:37.5964205Z  'pid_type': 'flat',
2026-02-21T08:25:37.5964365Z  'range_flattens': [None, None],
2026-02-21T08:25:37.5964538Z  'range_multi_buffers': [None, None],
2026-02-21T08:25:37.5964719Z  'range_num_stages': [0, 1],
2026-02-21T08:25:37.5964878Z  'range_unroll_factors': [0, 0],
2026-02-21T08:25:37.5965104Z  'range_warp_specializes': [None, True]}
2026-02-21T08:25:37.5978748Z [87s] Fitting surrogate: 261 points, 261 targets
2026-02-21T08:25:38.6348352Z [88s] Generation 3 starting: 69 neighbors, 5 active search path(s)
2026-02-21T08:25:50.6669015Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 1.9 configs/s
2026-02-21T08:25:54.7470804Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 17.3 configs/s
2026-02-21T08:26:01.8268185Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 142.8 configs/s
2026-02-21T08:26:02.1449869Z [111s] Generation 3 complete: 
2026-02-21T08:26:02.1454303Z ok=74
2026-02-21T08:26:02.1458869Z min=0.0624
2026-02-21T08:26:02.1462916Z mid=0.0687
2026-02-21T08:26:02.1467476Z max=0.6073
2026-02-21T08:26:02.1472003Z best={'block_sizes': [1024, 1],
2026-02-21T08:26:02.1475804Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:26:02.1476146Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:26:02.1476359Z  'num_stages': 6,
2026-02-21T08:26:02.1481542Z  'num_warps': 1,
2026-02-21T08:26:02.1485400Z  'pid_type': 'flat',
2026-02-21T08:26:02.1489343Z  'range_flattens': [None, True],
2026-02-21T08:26:02.1493311Z  'range_multi_buffers': [None, None],
2026-02-21T08:26:02.1495374Z  'range_num_stages': [0, 0],
2026-02-21T08:26:02.1495585Z  'range_unroll_factors': [0, 1],
2026-02-21T08:26:02.1495768Z  'range_warp_specializes': [None, True]}
2026-02-21T08:26:02.1496052Z [111s] Fitting surrogate: 335 points, 335 targets
2026-02-21T08:26:03.1052805Z [112s] Generation 4 starting: 67 neighbors, 5 active search path(s)
2026-02-21T08:26:06.4328393Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 17.2 configs/s
2026-02-21T08:26:10.4187957Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 17.5 configs/s
2026-02-21T08:26:17.7003457Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 138.8 configs/s
2026-02-21T08:26:18.1189959Z [127s] Generation 4 complete: 
2026-02-21T08:26:18.1193424Z ok=73
2026-02-21T08:26:18.1194973Z min=0.0624
2026-02-21T08:26:18.1195124Z mid=0.0646
2026-02-21T08:26:18.1195252Z max=0.3154
2026-02-21T08:26:18.1195382Z best={'block_sizes': [1024, 1],
2026-02-21T08:26:18.1195600Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:26:18.1195816Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:26:18.1196003Z  'num_stages': 6,
2026-02-21T08:26:18.1196148Z  'num_warps': 1,
2026-02-21T08:26:18.1196283Z  'pid_type': 'flat',
2026-02-21T08:26:18.1196446Z  'range_flattens': [None, True],
2026-02-21T08:26:18.1196620Z  'range_multi_buffers': [None, None],
2026-02-21T08:26:18.1196805Z  'range_num_stages': [0, 0],
2026-02-21T08:26:18.1196966Z  'range_unroll_factors': [0, 0],
2026-02-21T08:26:18.1197484Z  'range_warp_specializes': [None, True]}
2026-02-21T08:26:18.1207024Z [127s] Fitting surrogate: 408 points, 408 targets
2026-02-21T08:26:18.9072947Z [128s] Generation 5 starting: 51 neighbors, 4 active search path(s)
2026-02-21T08:26:30.7792752Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 1.3 configs/s
2026-02-21T08:26:33.8577812Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 17.5 configs/s
2026-02-21T08:26:39.6299365Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 186.3 configs/s
2026-02-21T08:26:39.8992321Z [149s] Generation 5 complete: 
2026-02-21T08:26:39.8993695Z ok=55
2026-02-21T08:26:39.8993850Z min=0.0624
2026-02-21T08:26:39.8993984Z mid=0.0644
2026-02-21T08:26:39.8994103Z max=0.3040
2026-02-21T08:26:39.8994244Z best={'block_sizes': [2048, 1],
2026-02-21T08:26:39.8994497Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:26:39.8994794Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:26:39.8995002Z  'num_stages': 5,
2026-02-21T08:26:39.8995138Z  'num_warps': 8,
2026-02-21T08:26:39.8995278Z  'pid_type': 'flat',
2026-02-21T08:26:39.8995426Z  'range_flattens': [None, False],
2026-02-21T08:26:39.8995605Z  'range_multi_buffers': [None, False],
2026-02-21T08:26:39.8995780Z  'range_num_stages': [0, 1],
2026-02-21T08:26:39.8995943Z  'range_unroll_factors': [0, 1],
2026-02-21T08:26:39.8996114Z  'range_warp_specializes': [None, True]}
2026-02-21T08:26:39.9007177Z [149s] Fitting surrogate: 463 points, 463 targets
2026-02-21T08:26:40.6200495Z [150s] Generation 6 starting: 36 neighbors, 3 active search path(s)
2026-02-21T08:26:42.2896442Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 35.1 configs/s
2026-02-21T08:26:44.3610002Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 17.8 configs/s
2026-02-21T08:26:48.3941142Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 250.4 configs/s
2026-02-21T08:26:48.6182201Z [158s] Generation 6 complete: 
2026-02-21T08:26:48.6186212Z ok=39
2026-02-21T08:26:48.6190082Z min=0.0624
2026-02-21T08:26:48.6191490Z mid=0.0626
2026-02-21T08:26:48.6191644Z max=0.1363
2026-02-21T08:26:48.6191795Z best={'block_sizes': [2048, 1],
2026-02-21T08:26:48.6192122Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:26:48.6192401Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:26:48.6192593Z  'num_stages': 5,
2026-02-21T08:26:48.6192744Z  'num_warps': 8,
2026-02-21T08:26:48.6192880Z  'pid_type': 'flat',
2026-02-21T08:26:48.6193041Z  'range_flattens': [None, False],
2026-02-21T08:26:48.6193211Z  'range_multi_buffers': [None, False],
2026-02-21T08:26:48.6193397Z  'range_num_stages': [0, 1],
2026-02-21T08:26:48.6193563Z  'range_unroll_factors': [0, 1],
2026-02-21T08:26:48.6193734Z  'range_warp_specializes': [None, True]}
2026-02-21T08:26:48.6201733Z [158s] Fitting surrogate: 502 points, 502 targets
2026-02-21T08:26:49.0287000Z [158s] Generation 7 starting: 11 neighbors, 1 active search path(s)
2026-02-21T08:26:49.6899402Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 44.6 configs/s
2026-02-21T08:26:50.3414894Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 11/11 18.2 configs/s
2026-02-21T08:26:51.9889422Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 699.8 configs/s
2026-02-21T08:26:52.0612134Z [161s] Generation 7 complete: 
2026-02-21T08:26:52.0614182Z ok=13
2026-02-21T08:26:52.0618840Z min=0.0624
2026-02-21T08:26:52.0620180Z mid=0.0624
2026-02-21T08:26:52.0620338Z max=0.0932
2026-02-21T08:26:52.0620475Z best={'block_sizes': [2048, 1],
2026-02-21T08:26:52.0620726Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:26:52.0621023Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:26:52.0621230Z  'num_stages': 5,
2026-02-21T08:26:52.0621369Z  'num_warps': 8,
2026-02-21T08:26:52.0621502Z  'pid_type': 'flat',
2026-02-21T08:26:52.0621654Z  'range_flattens': [None, False],
2026-02-21T08:26:52.0621824Z  'range_multi_buffers': [None, False],
2026-02-21T08:26:52.0622513Z  'range_num_stages': [0, 1],
2026-02-21T08:26:52.0622675Z  'range_unroll_factors': [0, 1],
2026-02-21T08:26:52.0622855Z  'range_warp_specializes': [None, True]}
2026-02-21T08:26:52.0633097Z [161s] Fitting surrogate: 515 points, 515 targets
2026-02-21T08:26:52.3349124Z [162s] Autotuning complete in 162.0s after searching 485 configs.
2026-02-21T08:26:52.3350614Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:26:52.3351567Z     @helion.kernel(config=helion.Config(block_sizes=[2048, 1], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=5, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:26:52.3352613Z 
2026-02-21T08:26:52.3352864Z [162s] Code of selected kernel: /tmp/torchinductor_root/im/cimnzdq4zypdmhpvdsxfhsdli2ldhkgwquhjbqrdqltuunfy3gzp.py
2026-02-21T08:26:52.3534268Z from __future__ import annotations
2026-02-21T08:26:52.3534475Z 
2026-02-21T08:26:52.3534545Z import torch
2026-02-21T08:26:52.3534679Z import triton
2026-02-21T08:26:52.3534829Z import triton.language as tl
2026-02-21T08:26:52.3535026Z from torch._inductor.runtime import triton_helpers
2026-02-21T08:26:52.3535329Z from torch._inductor.runtime.triton_helpers import math as tl_math
2026-02-21T08:26:52.3535613Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T08:26:52.3535888Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T08:26:52.3536057Z 
2026-02-21T08:26:52.3536123Z _BLOCK_SIZE_1 = tl.constexpr(1)
2026-02-21T08:26:52.3536323Z _BLOCK_SIZE_0 = tl.constexpr(2048)
2026-02-21T08:26:52.3536783Z 
2026-02-21T08:26:52.3536850Z @triton.jit
2026-02-21T08:26:52.3537033Z def _helion_kl_div_forward(y_pred, y_true, loss, log_target, eps):
2026-02-21T08:26:52.3537322Z     # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m):
2026-02-21T08:26:52.3537553Z     pid_0 = tl.program_id(0)
2026-02-21T08:26:52.3537712Z     offset_1 = pid_0
2026-02-21T08:26:52.3537873Z     indices_1 = offset_1 + tl.zeros([1], tl.int32)
2026-02-21T08:26:52.3538153Z     # src[kl_div.py:90]: loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32)
2026-02-21T08:26:52.3538465Z     loss_sum = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T08:26:52.3538730Z     # src[kl_div.py:92]: for tile_v in hl.tile(V, block_size=block_size_n):
2026-02-21T08:26:52.3539038Z     # src[kl_div.py:93]:     kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32)
2026-02-21T08:26:52.3539404Z     # src[kl_div.py:92-112]: ...
2026-02-21T08:26:52.3539808Z     for offset_0 in tl.range(0, 8192, _BLOCK_SIZE_0, loop_unroll_factor=1, warp_specialize=True, num_stages=1, disallow_acc_multi_buffer=True, flatten=False):
2026-02-21T08:26:52.3540255Z         indices_0 = offset_0 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32)
2026-02-21T08:26:52.3540478Z         loss_sum_copy = loss_sum
2026-02-21T08:26:52.3540653Z         loss_sum_copy_0 = loss_sum_copy
2026-02-21T08:26:52.3540909Z         # src[kl_div.py:93]: kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32)
2026-02-21T08:26:52.3541221Z         kl_loss = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T08:26:52.3541475Z         # src[kl_div.py:95]: y_pred_val = y_pred[tile_bt, tile_v]
2026-02-21T08:26:52.3541828Z         y_pred_val = tl.load(y_pred + (indices_1[:, None] * 8192 + indices_0[None, :] * 1), None, eviction_policy='evict_first')
2026-02-21T08:26:52.3542273Z         # src[kl_div.py:96]: y_true_val = y_true[tile_bt, tile_v]
2026-02-21T08:26:52.3542632Z         y_true_val = tl.load(y_true + (indices_1[:, None] * 8192 + indices_0[None, :] * 1), None, eviction_policy='evict_first')
2026-02-21T08:26:52.3542978Z         # src[kl_div.py:98]: if log_target:
2026-02-21T08:26:52.3543266Z         # src[kl_div.py:99]:     # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space
2026-02-21T08:26:52.3543580Z         # src[kl_div.py:100]:     prob_true = torch.exp(y_true_val)
2026-02-21T08:26:52.3543809Z         # src[kl_div.py:98-106]: ...
2026-02-21T08:26:52.3543980Z         if log_target:
2026-02-21T08:26:52.3544142Z             y_true_val_copy = y_true_val
2026-02-21T08:26:52.3544326Z             y_pred_val_copy = y_pred_val
2026-02-21T08:26:52.3544512Z             kl_loss_copy = kl_loss
2026-02-21T08:26:52.3544689Z             y_true_val_copy_0 = y_true_val_copy
2026-02-21T08:26:52.3544891Z             y_pred_val_copy_0 = y_pred_val_copy
2026-02-21T08:26:52.3545077Z             kl_loss_copy_0 = kl_loss_copy
2026-02-21T08:26:52.3545305Z             # src[kl_div.py:100]: prob_true = torch.exp(y_true_val)
2026-02-21T08:26:52.3545544Z             v_0 = libdevice.exp(y_true_val_copy_0)
2026-02-21T08:26:52.3545787Z             # src[kl_div.py:101]: kl_loss += prob_true * (y_true_val - y_pred_val)
2026-02-21T08:26:52.3546054Z             v_1 = y_true_val_copy_0 - y_pred_val_copy_0
2026-02-21T08:26:52.3546245Z             v_2 = v_0 * v_1
2026-02-21T08:26:52.3546415Z             kl_loss = kl_loss_copy_0 + v_2
2026-02-21T08:26:52.3546600Z         # src[kl_div.py:98]: if log_target:
2026-02-21T08:26:52.3546861Z         # src[kl_div.py:99]:     # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space
2026-02-21T08:26:52.3547162Z         # src[kl_div.py:100]:     prob_true = torch.exp(y_true_val)
2026-02-21T08:26:52.3547374Z         # src[kl_div.py:98-106]: ...
2026-02-21T08:26:52.3547553Z         _not = not log_target
2026-02-21T08:26:52.3547705Z         if _not:
2026-02-21T08:26:52.3547857Z             y_true_val_copy_1 = y_true_val
2026-02-21T08:26:52.3548120Z             y_pred_val_copy_1 = y_pred_val
2026-02-21T08:26:52.3548302Z             kl_loss_copy_1 = kl_loss
2026-02-21T08:26:52.3548493Z             y_true_val_copy_1_0 = y_true_val_copy_1
2026-02-21T08:26:52.3548707Z             y_pred_val_copy_1_0 = y_pred_val_copy_1
2026-02-21T08:26:52.3548917Z             kl_loss_copy_1_0 = kl_loss_copy_1
2026-02-21T08:26:52.3549169Z             # src[kl_div.py:105]: log_true = torch.log(torch.clamp(y_true_val, min=eps))
2026-02-21T08:26:52.3549467Z             v_4 = triton_helpers.maximum(y_true_val_copy_1_0, eps)
2026-02-21T08:26:52.3549684Z             v_5 = tl_math.log(v_4)
2026-02-21T08:26:52.3549913Z             # src[kl_div.py:106]: kl_loss += y_true_val * (log_true - y_pred_val)
2026-02-21T08:26:52.3550156Z             v_6 = v_5 - y_pred_val_copy_1_0
2026-02-21T08:26:52.3550346Z             v_7 = y_true_val_copy_1_0 * v_6
2026-02-21T08:26:52.3550530Z             kl_loss = kl_loss_copy_1_0 + v_7
2026-02-21T08:26:52.3550799Z         # src[kl_div.py:112]: loss_sum += kl_loss
2026-02-21T08:26:52.3551008Z         loss_sum = loss_sum_copy_0 + kl_loss
2026-02-21T08:26:52.3551226Z     # src[kl_div.py:115]: loss[tile_bt] = loss_sum.sum(dim=-1)
2026-02-21T08:26:52.3551473Z     sum_1 = tl.cast(tl.sum(loss_sum, 1), tl.float32)
2026-02-21T08:26:52.3551689Z     tl.store(loss + indices_1 * 1, sum_1, None)
2026-02-21T08:26:52.3551826Z 
2026-02-21T08:26:52.3552187Z def kl_div_forward(y_pred: Tensor, y_true: Tensor, log_target: bool=False, reduction: str='batchmean', eps: float=1e-10, *, _launcher=_default_launcher):
2026-02-21T08:26:52.3552641Z     """
2026-02-21T08:26:52.3552778Z     Compute KL Divergence loss.
2026-02-21T08:26:52.3552892Z 
2026-02-21T08:26:52.3552956Z     Args:
2026-02-21T08:26:52.3553128Z         y_pred: Input predictions in log-space, shape (BT, V)
2026-02-21T08:26:52.3553459Z         y_true: Target values (probabilities or log-probabilities), shape (BT, V)
2026-02-21T08:26:52.3553781Z         log_target: If True, y_true is in log-space; if False, y_true is probabilities
2026-02-21T08:26:52.3554089Z         reduction: Reduction mode ('none', 'sum', 'mean', 'batchmean')
2026-02-21T08:26:52.3554334Z         eps: Small value to avoid numerical issues
2026-02-21T08:26:52.3554461Z 
2026-02-21T08:26:52.3554515Z     Returns:
2026-02-21T08:26:52.3554654Z         loss: KL divergence loss
2026-02-21T08:26:52.3554804Z     """
2026-02-21T08:26:52.3554945Z     # src[kl_div.py:74]: BT, V = y_pred.shape
2026-02-21T08:26:52.3555121Z     BT, V = y_pred.shape
2026-02-21T08:26:52.3555315Z     # src[kl_div.py:75]: assert y_true.shape == y_pred.shape, (
2026-02-21T08:26:52.3555575Z     # src[kl_div.py:76]:     f"Shape mismatch: {y_true.shape} != {y_pred.shape}"
2026-02-21T08:26:52.3555811Z     # src[kl_div.py:77]: )
2026-02-21T08:26:52.3556060Z     assert y_true.shape == y_pred.shape, f'Shape mismatch: {y_true.shape} != {y_pred.shape}'
2026-02-21T08:26:52.3556338Z     # src[kl_div.py:80]: if reduction == "none":
2026-02-21T08:26:52.3556555Z     # src[kl_div.py:81]:     loss = torch.zeros_like(y_pred)
2026-02-21T08:26:52.3556752Z     # src[kl_div.py:82]: else:
2026-02-21T08:26:52.3556912Z     # src[kl_div.py:80-83]: ...
2026-02-21T08:26:52.3557066Z     if reduction == 'none':
2026-02-21T08:26:52.3557254Z         # src[kl_div.py:81]: loss = torch.zeros_like(y_pred)
2026-02-21T08:26:52.3557462Z         loss = torch.zeros_like(y_pred)
2026-02-21T08:26:52.3557624Z     else:
2026-02-21T08:26:52.3557845Z         # src[kl_div.py:83]: loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device)
2026-02-21T08:26:52.3558162Z         loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device)
2026-02-21T08:26:52.3558454Z     # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m):
2026-02-21T08:26:52.3558761Z     # src[kl_div.py:90]:     loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32)
2026-02-21T08:26:52.3559020Z     # src[kl_div.py:89-115]: ...
2026-02-21T08:26:52.3559317Z     _launcher(_helion_kl_div_forward, (4096,), y_pred, y_true, loss, log_target, eps, num_warps=8, num_stages=5)
2026-02-21T08:26:52.3559708Z     # src[kl_div.py:118]: if reduction == "batchmean":
2026-02-21T08:26:52.3559939Z     # src[kl_div.py:119]:     final_loss = torch.sum(loss) / BT
2026-02-21T08:26:52.3560162Z     # src[kl_div.py:120]: elif reduction == "sum":
2026-02-21T08:26:52.3560353Z     # src[kl_div.py:118-125]: ...
2026-02-21T08:26:52.3560515Z     if reduction == 'batchmean':
2026-02-21T08:26:52.3560713Z         # src[kl_div.py:119]: final_loss = torch.sum(loss) / BT
2026-02-21T08:26:52.3560929Z         final_loss = torch.sum(loss) / BT
2026-02-21T08:26:52.3561105Z     elif reduction == 'sum':
2026-02-21T08:26:52.3561301Z         # src[kl_div.py:121]: final_loss = torch.sum(loss, dim=0)
2026-02-21T08:26:52.3561512Z         final_loss = torch.sum(loss, dim=0)
2026-02-21T08:26:52.3561696Z     elif reduction == 'mean':
2026-02-21T08:26:52.3561943Z         # src[kl_div.py:123]: final_loss = torch.sum(loss) / (BT * V)
2026-02-21T08:26:52.3562229Z         final_loss = torch.sum(loss) / (BT * V)
2026-02-21T08:26:52.3562402Z     else:
2026-02-21T08:26:52.3562546Z         # src[kl_div.py:125]: final_loss = loss
2026-02-21T08:26:52.3562741Z         final_loss = loss
2026-02-21T08:26:52.3562908Z     # src[kl_div.py:127]: return final_loss
2026-02-21T08:26:52.3563096Z     return final_loss
2026-02-21T08:26:53.2457624Z WARNING:tritonbench.utils.triton_op:Completed input ID 1:
2026-02-21T08:26:53.2462326Z (B, T, V)
2026-02-21T08:26:53.2466526Z --------------
2026-02-21T08:26:53.2468594Z (8, 512, 8192)
2026-02-21T08:26:53.2468714Z 
2026-02-21T08:26:53.2469278Z  33%|███▎      | 2/6 [05:25<10:54, 163.52s/it]WARNING:tritonbench.utils.triton_op:Running input ID 2:
2026-02-21T08:26:53.2469574Z (B, T, V)
2026-02-21T08:26:53.2469710Z ---------------
2026-02-21T08:26:53.2469845Z (8, 512, 16384)
2026-02-21T08:26:53.2470086Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for torch_kl_div
2026-02-21T08:26:54.3674728Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for liger_kl_div
2026-02-21T08:26:55.4523673Z INFO:tritonbench.utils.triton_op:Took 3.20ms to get benchmark function for torch_compile_kl_div
2026-02-21T08:26:56.7705818Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:26:56.7706090Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:26:56.7706286Z               'dtype': 'torch.float32',
2026-02-21T08:26:56.7706466Z               'shape': (4096, 16384),
2026-02-21T08:26:56.7706650Z               'stride': (16384, 1)},
2026-02-21T08:26:56.7706826Z             { 'device': 'cuda:0',
2026-02-21T08:26:56.7706987Z               'dtype': 'torch.float32',
2026-02-21T08:26:56.7707163Z               'shape': (4096, 16384),
2026-02-21T08:26:56.7707321Z               'stride': (16384, 1)}),
2026-02-21T08:26:56.7707484Z   'kwargs': {}}
2026-02-21T08:26:56.7718103Z INFO:tritonbench.utils.triton_op:Took 1.65ms to get benchmark function for helion_kl_div_tritonbench
2026-02-21T08:26:56.9671141Z [0s] Autotune random seed: 2135561342
2026-02-21T08:26:57.0108165Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:27:29.3092854Z [32s] Timeout after 30s compiling Config(block_sizes=[128, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=256, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[0, 0], range_unroll_factors=[3, 3], range_warp_specializes=[None, None])
2026-02-21T08:27:29.5604744Z [32s] Timeout after 30s compiling Config(block_sizes=[128, 512], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], maxnreg=32, num_sm_multiplier=2, num_stages=5, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[True, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[False, None])
2026-02-21T08:27:29.9480009Z [32s] Timeout after 30s compiling Config(block_sizes=[64, 1024], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[None, None])
2026-02-21T08:27:30.1333046Z [33s] Timeout after 30s compiling Config(block_sizes=[16, 2048], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=128, num_sm_multiplier=1, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[3, 3], range_unroll_factors=[3, 3], range_warp_specializes=[False, False])
2026-02-21T08:27:30.6582980Z [33s] Timeout after 30s compiling Config(block_sizes=[512, 512], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], num_stages=5, num_warps=32, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[None, False])
2026-02-21T08:27:30.8233278Z [33s] Timeout after 30s compiling Config(block_sizes=[32, 2048], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], num_stages=6, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, True])
2026-02-21T08:27:31.3254793Z [34s] Timeout after 30s compiling Config(block_sizes=[512, 512], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last'], maxnreg=256, num_sm_multiplier=128, num_stages=7, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[4, 3], range_unroll_factors=[1, 3], range_warp_specializes=[None, None])
2026-02-21T08:27:31.6744315Z [34s] Timeout after 30s compiling Config(block_sizes=[512, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, False])
2026-02-21T08:27:31.6770099Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.9 configs/s
2026-02-21T08:27:35.0579206Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:27:35.0580139Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:27:35.0581018Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:27:35.0581295Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:27:35.0581599Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:27:35.0582177Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:27:35.0582502Z     %cst = arith.constant dense<0.000000e+00> : tensor<256x32xf32>
2026-02-21T08:27:35.0582856Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:27:35.0583122Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:27:35.0583405Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T08:27:35.0583716Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T08:27:35.0583977Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:27:35.0584458Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<256x32xf32>>
2026-02-21T08:27:35.0585180Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<256x32xf32>>
2026-02-21T08:27:35.0585692Z     %2 = tt.get_program_id x : i32
2026-02-21T08:27:35.0585967Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:27:35.0586247Z     %4 = arith.minsi %3, %c16_i32 : i32
2026-02-21T08:27:35.0586914Z     %5 = arith.subi %4, %2 : i32
2026-02-21T08:27:35.0587175Z     %c1_i32_0 = arith.constant 1 : i32
2026-02-21T08:27:35.0587474Z     %6 = arith.subi %c1_i32, %c1_i32_0 : i32
2026-02-21T08:27:35.0587755Z     %7 = arith.addi %5, %6 : i32
2026-02-21T08:27:35.0588049Z     %8 = arith.divui %7, %c1_i32 : i32
2026-02-21T08:27:35.0588337Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:27:35.0588612Z     %9 = arith.remsi %8, %c2_i32 : i32
2026-02-21T08:27:35.0588890Z     %10 = arith.subi %8, %9 : i32
2026-02-21T08:27:35.0589153Z     %11 = arith.muli %10, %c1_i32 : i32
2026-02-21T08:27:35.0589431Z     %12 = arith.addi %2, %11 : i32
2026-02-21T08:27:35.0589696Z     %13 = arith.muli %c1_i32, %c2_i32 : i32
2026-02-21T08:27:35.0590006Z     scf.for %arg5 = %2 to %12 step %13  : i32 {
2026-02-21T08:27:35.0590318Z       %14 = arith.muli %arg5, %c256_i32 : i32
2026-02-21T08:27:35.0590813Z       %15 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:27:35.0591239Z       %16 = tt.splat %14 : i32 -> tensor<256xi32>
2026-02-21T08:27:35.0591546Z       %17 = arith.addi %16, %15 : tensor<256xi32>
2026-02-21T08:27:35.0592118Z       %18 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<256x32xf32>)  : i32 {
2026-02-21T08:27:35.0592795Z         %32 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<256x32xf32>> -> tensor<256x32xf32>
2026-02-21T08:27:35.0593432Z         %33 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc<tensor<256x32xf32>> -> tensor<256x32xf32>
2026-02-21T08:27:35.0593910Z         %34 = scf.if %arg3 -> (tensor<256x32xf32>) {
2026-02-21T08:27:35.0594522Z           %36 = tt.extern_elementwise %33 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x32xf32>) -> tensor<256x32xf32>
2026-02-21T08:27:35.0595160Z           %37 = arith.subf %33, %32 : tensor<256x32xf32>
2026-02-21T08:27:35.0595484Z           %38 = arith.mulf %36, %37 : tensor<256x32xf32>
2026-02-21T08:27:35.0595829Z           %39 = arith.addf %38, %cst : tensor<256x32xf32>
2026-02-21T08:27:35.0596143Z           scf.yield %39 : tensor<256x32xf32>
2026-02-21T08:27:35.0596412Z         } else {
2026-02-21T08:27:35.0596658Z           %36 = tt.splat %arg4 : f32 -> tensor<256x32xf32>
2026-02-21T08:27:35.0597019Z           %37 = arith.cmpf ogt, %33, %36 : tensor<256x32xf32>
2026-02-21T08:27:35.0597389Z           %38 = arith.cmpf une, %33, %33 : tensor<256x32xf32>
2026-02-21T08:27:35.0597719Z           %39 = arith.ori %37, %38 : tensor<256x32xi1>
2026-02-21T08:27:35.0598121Z           %40 = arith.select %39, %33, %36 : tensor<256x32xi1>, tensor<256x32xf32>
2026-02-21T08:27:35.0598512Z           %41 = math.log %40 : tensor<256x32xf32>
2026-02-21T08:27:35.0598836Z           %42 = arith.subf %41, %32 : tensor<256x32xf32>
2026-02-21T08:27:35.0599158Z           %43 = arith.mulf %33, %42 : tensor<256x32xf32>
2026-02-21T08:27:35.0599503Z           %44 = arith.addf %43, %cst : tensor<256x32xf32>
2026-02-21T08:27:35.0599829Z           scf.yield %44 : tensor<256x32xf32>
2026-02-21T08:27:35.0600092Z         }
2026-02-21T08:27:35.0600318Z         %35 = arith.addf %arg7, %34 : tensor<256x32xf32>
2026-02-21T08:27:35.0600633Z         scf.yield %35 : tensor<256x32xf32>
2026-02-21T08:27:35.0600966Z       } {tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:27:35.0601289Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:27:35.0601592Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:27:35.0601907Z         %32 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:27:35.0602191Z         tt.reduce.return %32 : f32
2026-02-21T08:27:35.0602509Z       }) : (tensor<256x32xf32>) -> tensor<256xf32>
2026-02-21T08:27:35.0602871Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<256x!tt.ptr<f32>>
2026-02-21T08:27:35.0603305Z       %21 = tt.addptr %20, %17 : tensor<256x!tt.ptr<f32>>, tensor<256xi32>
2026-02-21T08:27:35.0603690Z       tt.store %21, %19 : tensor<256x!tt.ptr<f32>>
2026-02-21T08:27:35.0604001Z       %c1_i32_1 = arith.constant 1 : i32
2026-02-21T08:27:35.0604399Z       %22 = arith.muli %c1_i32, %c1_i32_1 : i32
2026-02-21T08:27:35.0604700Z       %23 = arith.addi %arg5, %22 : i32
2026-02-21T08:27:35.0604984Z       %24 = arith.muli %23, %c256_i32 : i32
2026-02-21T08:27:35.0605340Z       %25 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:27:35.0605747Z       %26 = tt.splat %24 : i32 -> tensor<256xi32>
2026-02-21T08:27:35.0606039Z       %27 = arith.addi %26, %25 : tensor<256xi32>
2026-02-21T08:27:35.0606512Z       %28 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<256x32xf32>)  : i32 {
2026-02-21T08:27:35.0607133Z         %32 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc<tensor<256x32xf32>> -> tensor<256x32xf32>
2026-02-21T08:27:35.0607711Z         %33 = tt.descriptor_load %1[%24, %arg6] : !tt.tensordesc<tensor<256x32xf32>> -> tensor<256x32xf32>
2026-02-21T08:27:35.0608241Z         %34 = scf.if %arg3 -> (tensor<256x32xf32>) {
2026-02-21T08:27:35.0608814Z           %36 = tt.extern_elementwise %33 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x32xf32>) -> tensor<256x32xf32>
2026-02-21T08:27:35.0609361Z           %37 = arith.subf %33, %32 : tensor<256x32xf32>
2026-02-21T08:27:35.0609675Z           %38 = arith.mulf %36, %37 : tensor<256x32xf32>
2026-02-21T08:27:35.0609982Z           %39 = arith.addf %38, %cst : tensor<256x32xf32>
2026-02-21T08:27:35.0610289Z           scf.yield %39 : tensor<256x32xf32>
2026-02-21T08:27:35.0610534Z         } else {
2026-02-21T08:27:35.0610769Z           %36 = tt.splat %arg4 : f32 -> tensor<256x32xf32>
2026-02-21T08:27:35.0611105Z           %37 = arith.cmpf ogt, %33, %36 : tensor<256x32xf32>
2026-02-21T08:27:35.0611432Z           %38 = arith.cmpf une, %33, %33 : tensor<256x32xf32>
2026-02-21T08:27:35.0611755Z           %39 = arith.ori %37, %38 : tensor<256x32xi1>
2026-02-21T08:27:35.0612190Z           %40 = arith.select %39, %33, %36 : tensor<256x32xi1>, tensor<256x32xf32>
2026-02-21T08:27:35.0612594Z           %41 = math.log %40 : tensor<256x32xf32>
2026-02-21T08:27:35.0612910Z           %42 = arith.subf %41, %32 : tensor<256x32xf32>
2026-02-21T08:27:35.0613242Z           %43 = arith.mulf %33, %42 : tensor<256x32xf32>
2026-02-21T08:27:35.0613582Z           %44 = arith.addf %43, %cst : tensor<256x32xf32>
2026-02-21T08:27:35.0613890Z           scf.yield %44 : tensor<256x32xf32>
2026-02-21T08:27:35.0614161Z         }
2026-02-21T08:27:35.0614385Z         %35 = arith.addf %arg7, %34 : tensor<256x32xf32>
2026-02-21T08:27:35.0614704Z         scf.yield %35 : tensor<256x32xf32>
2026-02-21T08:27:35.0615017Z       } {tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:27:35.0615350Z       %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({
2026-02-21T08:27:35.0615643Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:27:35.0615923Z         %32 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:27:35.0616217Z         tt.reduce.return %32 : f32
2026-02-21T08:27:35.0616504Z       }) : (tensor<256x32xf32>) -> tensor<256xf32>
2026-02-21T08:27:35.0616877Z       %30 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<256x!tt.ptr<f32>>
2026-02-21T08:27:35.0617302Z       %31 = tt.addptr %30, %27 : tensor<256x!tt.ptr<f32>>, tensor<256xi32>
2026-02-21T08:27:35.0617692Z       tt.store %31, %29 : tensor<256x!tt.ptr<f32>>
2026-02-21T08:27:35.0618007Z     } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:27:35.0618329Z     scf.for %arg5 = %12 to %4 step %c1_i32  : i32 {
2026-02-21T08:27:35.0618658Z       %14 = arith.muli %arg5, %c256_i32 : i32
2026-02-21T08:27:35.0619027Z       %15 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:27:35.0619423Z       %16 = tt.splat %14 : i32 -> tensor<256xi32>
2026-02-21T08:27:35.0619731Z       %17 = arith.addi %16, %15 : tensor<256xi32>
2026-02-21T08:27:35.0620249Z       %18 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<256x32xf32>)  : i32 {
2026-02-21T08:27:35.0620919Z         %22 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<256x32xf32>> -> tensor<256x32xf32>
2026-02-21T08:27:35.0621645Z         %23 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc<tensor<256x32xf32>> -> tensor<256x32xf32>
2026-02-21T08:27:35.0622159Z         %24 = scf.if %arg3 -> (tensor<256x32xf32>) {
2026-02-21T08:27:35.0622760Z           %26 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x32xf32>) -> tensor<256x32xf32>
2026-02-21T08:27:35.0623384Z           %27 = arith.subf %23, %22 : tensor<256x32xf32>
2026-02-21T08:27:35.0623714Z           %28 = arith.mulf %26, %27 : tensor<256x32xf32>
2026-02-21T08:27:35.0624062Z           %29 = arith.addf %28, %cst : tensor<256x32xf32>
2026-02-21T08:27:35.0624397Z           scf.yield %29 : tensor<256x32xf32>
2026-02-21T08:27:35.0624664Z         } else {
2026-02-21T08:27:35.0624922Z           %26 = tt.splat %arg4 : f32 -> tensor<256x32xf32>
2026-02-21T08:27:35.0625364Z           %27 = arith.cmpf ogt, %23, %26 : tensor<256x32xf32>
2026-02-21T08:27:35.0625731Z           %28 = arith.cmpf une, %23, %23 : tensor<256x32xf32>
2026-02-21T08:27:35.0626073Z           %29 = arith.ori %27, %28 : tensor<256x32xi1>
2026-02-21T08:27:35.0626468Z           %30 = arith.select %29, %23, %26 : tensor<256x32xi1>, tensor<256x32xf32>
2026-02-21T08:27:35.0626866Z           %31 = math.log %30 : tensor<256x32xf32>
2026-02-21T08:27:35.0627181Z           %32 = arith.subf %31, %22 : tensor<256x32xf32>
2026-02-21T08:27:35.0627517Z           %33 = arith.mulf %23, %32 : tensor<256x32xf32>
2026-02-21T08:27:35.0627845Z           %34 = arith.addf %33, %cst : tensor<256x32xf32>
2026-02-21T08:27:35.0628168Z           scf.yield %34 : tensor<256x32xf32>
2026-02-21T08:27:35.0628432Z         }
2026-02-21T08:27:35.0628660Z         %25 = arith.addf %arg7, %24 : tensor<256x32xf32>
2026-02-21T08:27:35.0628969Z         scf.yield %25 : tensor<256x32xf32>
2026-02-21T08:27:35.0629295Z       } {tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:27:35.0629634Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:27:35.0629934Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:27:35.0630215Z         %22 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:27:35.0630502Z         tt.reduce.return %22 : f32
2026-02-21T08:27:35.0630790Z       }) : (tensor<256x32xf32>) -> tensor<256xf32>
2026-02-21T08:27:35.0631144Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<256x!tt.ptr<f32>>
2026-02-21T08:27:35.0631548Z       %21 = tt.addptr %20, %17 : tensor<256x!tt.ptr<f32>>, tensor<256xi32>
2026-02-21T08:27:35.0631958Z       tt.store %21, %19 : tensor<256x!tt.ptr<f32>>
2026-02-21T08:27:35.0632280Z     } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:27:35.0632558Z     tt.return
2026-02-21T08:27:35.0632742Z   }
2026-02-21T08:27:35.0632918Z }
2026-02-21T08:27:35.0633018Z 
2026-02-21T08:27:35.0633087Z {-#
2026-02-21T08:27:35.0633277Z   external_resources: {
2026-02-21T08:27:35.0633510Z     mlir_reproducer: {
2026-02-21T08:27:35.0640588Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:27:35.0648007Z       disable_threading: false,
2026-02-21T08:27:35.0648249Z       verify_each: true
2026-02-21T08:27:35.0648446Z     }
2026-02-21T08:27:35.0648611Z   }
2026-02-21T08:27:35.0648761Z #-}
2026-02-21T08:27:35.0649543Z /tmp/torchinductor_root/lt/cltaacf64k627mu66er2ezv42z4eihhvqghotubql3xm55hfpgyi.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:27:35.0651474Z /tmp/torchinductor_root/lt/cltaacf64k627mu66er2ezv42z4eihhvqghotubql3xm55hfpgyi.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:27:35.0653118Z [38s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:27:35.0654862Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last'], maxnreg=128, num_sm_multiplier=16, num_stages=6, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:27:35.0656444Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:27:35.0656819Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:27:36.4903151Z module attributes {ttg.maxnreg = 256 : i32} {
2026-02-21T08:27:36.4903929Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:27:36.4904549Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:27:36.4904744Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:27:36.4904935Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:27:36.4905162Z     %cst = arith.constant dense<0.000000e+00> : tensor<64x256xf32>
2026-02-21T08:27:36.4905395Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:27:36.4905608Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:27:36.4905819Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T08:27:36.4906008Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T08:27:36.4906181Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:27:36.4906497Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<64x256xf32>>
2026-02-21T08:27:36.4906923Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<64x256xf32>>
2026-02-21T08:27:36.4907231Z     %2 = tt.get_program_id x : i32
2026-02-21T08:27:36.4907430Z     scf.for %arg5 = %2 to %c64_i32 step %c9472_i32  : i32 {
2026-02-21T08:27:36.4907644Z       %3 = arith.muli %arg5, %c64_i32 : i32
2026-02-21T08:27:36.4907873Z       %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:27:36.4908115Z       %5 = tt.splat %3 : i32 -> tensor<64xi32>
2026-02-21T08:27:36.4908313Z       %6 = arith.addi %5, %4 : tensor<64xi32>
2026-02-21T08:27:36.4908510Z       %c16128_i32 = arith.constant 16128 : i32
2026-02-21T08:27:36.4909174Z       %c768_i32 = arith.constant 768 : i32
2026-02-21T08:27:36.4909665Z       %7 = scf.for %arg6 = %c0_i32 to %c16128_i32 step %c768_i32 iter_args(%arg7 = %cst) -> (tensor<64x256xf32>)  : i32 {
2026-02-21T08:27:36.4910339Z         %15 = tt.descriptor_load %0[%3, %arg6] : !tt.tensordesc<tensor<64x256xf32>> -> tensor<64x256xf32>
2026-02-21T08:27:36.4910959Z         %16 = tt.descriptor_load %1[%3, %arg6] : !tt.tensordesc<tensor<64x256xf32>> -> tensor<64x256xf32>
2026-02-21T08:27:36.4911422Z         %17 = scf.if %arg3 -> (tensor<64x256xf32>) {
2026-02-21T08:27:36.4912127Z           %31 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x256xf32>) -> tensor<64x256xf32>
2026-02-21T08:27:36.4912747Z           %32 = arith.subf %16, %15 : tensor<64x256xf32>
2026-02-21T08:27:36.4913081Z           %33 = arith.mulf %31, %32 : tensor<64x256xf32>
2026-02-21T08:27:36.4913543Z           %34 = arith.addf %33, %cst : tensor<64x256xf32>
2026-02-21T08:27:36.4913862Z           scf.yield %34 : tensor<64x256xf32>
2026-02-21T08:27:36.4914136Z         } else {
2026-02-21T08:27:36.4914380Z           %31 = tt.splat %arg4 : f32 -> tensor<64x256xf32>
2026-02-21T08:27:36.4914740Z           %32 = arith.cmpf ogt, %16, %31 : tensor<64x256xf32>
2026-02-21T08:27:36.4915097Z           %33 = arith.cmpf une, %16, %16 : tensor<64x256xf32>
2026-02-21T08:27:36.4915439Z           %34 = arith.ori %32, %33 : tensor<64x256xi1>
2026-02-21T08:27:36.4915831Z           %35 = arith.select %34, %16, %31 : tensor<64x256xi1>, tensor<64x256xf32>
2026-02-21T08:27:36.4916218Z           %36 = math.log %35 : tensor<64x256xf32>
2026-02-21T08:27:36.4916539Z           %37 = arith.subf %36, %15 : tensor<64x256xf32>
2026-02-21T08:27:36.4916866Z           %38 = arith.mulf %16, %37 : tensor<64x256xf32>
2026-02-21T08:27:36.4917203Z           %39 = arith.addf %38, %cst : tensor<64x256xf32>
2026-02-21T08:27:36.4917523Z           scf.yield %39 : tensor<64x256xf32>
2026-02-21T08:27:36.4917798Z         }
2026-02-21T08:27:36.4918023Z         %18 = arith.addf %arg7, %17 : tensor<64x256xf32>
2026-02-21T08:27:36.4918346Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:27:36.4918651Z         %19 = arith.muli %c256_i32, %c1_i32 : i32
2026-02-21T08:27:36.4918933Z         %20 = arith.addi %arg6, %19 : i32
2026-02-21T08:27:36.4919366Z         %21 = tt.descriptor_load %0[%3, %20] : !tt.tensordesc<tensor<64x256xf32>> -> tensor<64x256xf32>
2026-02-21T08:27:36.4919992Z         %22 = tt.descriptor_load %1[%3, %20] : !tt.tensordesc<tensor<64x256xf32>> -> tensor<64x256xf32>
2026-02-21T08:27:36.4920463Z         %23 = scf.if %arg3 -> (tensor<64x256xf32>) {
2026-02-21T08:27:36.4921072Z           %31 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x256xf32>) -> tensor<64x256xf32>
2026-02-21T08:27:36.4921688Z           %32 = arith.subf %22, %21 : tensor<64x256xf32>
2026-02-21T08:27:36.4922059Z           %33 = arith.mulf %31, %32 : tensor<64x256xf32>
2026-02-21T08:27:36.4922417Z           %34 = arith.addf %33, %cst : tensor<64x256xf32>
2026-02-21T08:27:36.4922731Z           scf.yield %34 : tensor<64x256xf32>
2026-02-21T08:27:36.4923002Z         } else {
2026-02-21T08:27:36.4923246Z           %31 = tt.splat %arg4 : f32 -> tensor<64x256xf32>
2026-02-21T08:27:36.4923609Z           %32 = arith.cmpf ogt, %22, %31 : tensor<64x256xf32>
2026-02-21T08:27:36.4923981Z           %33 = arith.cmpf une, %22, %22 : tensor<64x256xf32>
2026-02-21T08:27:36.4924316Z           %34 = arith.ori %32, %33 : tensor<64x256xi1>
2026-02-21T08:27:36.4924717Z           %35 = arith.select %34, %22, %31 : tensor<64x256xi1>, tensor<64x256xf32>
2026-02-21T08:27:36.4925105Z           %36 = math.log %35 : tensor<64x256xf32>
2026-02-21T08:27:36.4925426Z           %37 = arith.subf %36, %21 : tensor<64x256xf32>
2026-02-21T08:27:36.4925752Z           %38 = arith.mulf %22, %37 : tensor<64x256xf32>
2026-02-21T08:27:36.4926091Z           %39 = arith.addf %38, %cst : tensor<64x256xf32>
2026-02-21T08:27:36.4926508Z           scf.yield %39 : tensor<64x256xf32>
2026-02-21T08:27:36.4926768Z         }
2026-02-21T08:27:36.4926994Z         %24 = arith.addf %18, %23 : tensor<64x256xf32>
2026-02-21T08:27:36.4927303Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:27:36.4927603Z         %25 = arith.muli %c256_i32, %c2_i32 : i32
2026-02-21T08:27:36.4927893Z         %26 = arith.addi %arg6, %25 : i32
2026-02-21T08:27:36.4928333Z         %27 = tt.descriptor_load %0[%3, %26] : !tt.tensordesc<tensor<64x256xf32>> -> tensor<64x256xf32>
2026-02-21T08:27:36.4928925Z         %28 = tt.descriptor_load %1[%3, %26] : !tt.tensordesc<tensor<64x256xf32>> -> tensor<64x256xf32>
2026-02-21T08:27:36.4929395Z         %29 = scf.if %arg3 -> (tensor<64x256xf32>) {
2026-02-21T08:27:36.4930002Z           %31 = tt.extern_elementwise %28 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x256xf32>) -> tensor<64x256xf32>
2026-02-21T08:27:36.4930726Z           %32 = arith.subf %28, %27 : tensor<64x256xf32>
2026-02-21T08:27:36.4931059Z           %33 = arith.mulf %31, %32 : tensor<64x256xf32>
2026-02-21T08:27:36.4931394Z           %34 = arith.addf %33, %cst : tensor<64x256xf32>
2026-02-21T08:27:36.4931712Z           scf.yield %34 : tensor<64x256xf32>
2026-02-21T08:27:36.4932081Z         } else {
2026-02-21T08:27:36.4932327Z           %31 = tt.splat %arg4 : f32 -> tensor<64x256xf32>
2026-02-21T08:27:36.4932688Z           %32 = arith.cmpf ogt, %28, %31 : tensor<64x256xf32>
2026-02-21T08:27:36.4933043Z           %33 = arith.cmpf une, %28, %28 : tensor<64x256xf32>
2026-02-21T08:27:36.4933391Z           %34 = arith.ori %32, %33 : tensor<64x256xi1>
2026-02-21T08:27:36.4933785Z           %35 = arith.select %34, %28, %31 : tensor<64x256xi1>, tensor<64x256xf32>
2026-02-21T08:27:36.4934166Z           %36 = math.log %35 : tensor<64x256xf32>
2026-02-21T08:27:36.4934487Z           %37 = arith.subf %36, %27 : tensor<64x256xf32>
2026-02-21T08:27:36.4934815Z           %38 = arith.mulf %28, %37 : tensor<64x256xf32>
2026-02-21T08:27:36.4935149Z           %39 = arith.addf %38, %cst : tensor<64x256xf32>
2026-02-21T08:27:36.4935463Z           scf.yield %39 : tensor<64x256xf32>
2026-02-21T08:27:36.4935732Z         }
2026-02-21T08:27:36.4935954Z         %30 = arith.addf %24, %29 : tensor<64x256xf32>
2026-02-21T08:27:36.4936259Z         scf.yield %30 : tensor<64x256xf32>
2026-02-21T08:27:36.4936550Z       } {tt.num_stages = 1 : i32}
2026-02-21T08:27:36.4936998Z       %8 = tt.descriptor_load %0[%3, %c16128_i32] : !tt.tensordesc<tensor<64x256xf32>> -> tensor<64x256xf32>
2026-02-21T08:27:36.4937651Z       %9 = tt.descriptor_load %1[%3, %c16128_i32] : !tt.tensordesc<tensor<64x256xf32>> -> tensor<64x256xf32>
2026-02-21T08:27:36.4938136Z       %10 = scf.if %arg3 -> (tensor<64x256xf32>) {
2026-02-21T08:27:36.4938741Z         %15 = tt.extern_elementwise %9 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x256xf32>) -> tensor<64x256xf32>
2026-02-21T08:27:36.4939330Z         %16 = arith.subf %9, %8 : tensor<64x256xf32>
2026-02-21T08:27:36.4939654Z         %17 = arith.mulf %15, %16 : tensor<64x256xf32>
2026-02-21T08:27:36.4939992Z         %18 = arith.addf %17, %cst : tensor<64x256xf32>
2026-02-21T08:27:36.4940301Z         scf.yield %18 : tensor<64x256xf32>
2026-02-21T08:27:36.4940569Z       } else {
2026-02-21T08:27:36.4940803Z         %15 = tt.splat %arg4 : f32 -> tensor<64x256xf32>
2026-02-21T08:27:36.4941156Z         %16 = arith.cmpf ogt, %9, %15 : tensor<64x256xf32>
2026-02-21T08:27:36.4941510Z         %17 = arith.cmpf une, %9, %9 : tensor<64x256xf32>
2026-02-21T08:27:36.4941841Z         %18 = arith.ori %16, %17 : tensor<64x256xi1>
2026-02-21T08:27:36.4942271Z         %19 = arith.select %18, %9, %15 : tensor<64x256xi1>, tensor<64x256xf32>
2026-02-21T08:27:36.4942655Z         %20 = math.log %19 : tensor<64x256xf32>
2026-02-21T08:27:36.4942983Z         %21 = arith.subf %20, %8 : tensor<64x256xf32>
2026-02-21T08:27:36.4943306Z         %22 = arith.mulf %9, %21 : tensor<64x256xf32>
2026-02-21T08:27:36.4943648Z         %23 = arith.addf %22, %cst : tensor<64x256xf32>
2026-02-21T08:27:36.4944058Z         scf.yield %23 : tensor<64x256xf32>
2026-02-21T08:27:36.4944326Z       }
2026-02-21T08:27:36.4944553Z       %11 = arith.addf %7, %10 : tensor<64x256xf32>
2026-02-21T08:27:36.4944879Z       %12 = "tt.reduce"(%11) <{axis = 1 : i32}> ({
2026-02-21T08:27:36.4945188Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:27:36.4945474Z         %15 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:27:36.4945780Z         tt.reduce.return %15 : f32
2026-02-21T08:27:36.4946075Z       }) : (tensor<64x256xf32>) -> tensor<64xf32>
2026-02-21T08:27:36.4946457Z       %13 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
2026-02-21T08:27:36.4946893Z       %14 = tt.addptr %13, %6 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
2026-02-21T08:27:36.4947272Z       tt.store %14, %12 : tensor<64x!tt.ptr<f32>>
2026-02-21T08:27:36.4947692Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:27:36.4948167Z     tt.return
2026-02-21T08:27:36.4948364Z   }
2026-02-21T08:27:36.4948537Z }
2026-02-21T08:27:36.4948652Z 
2026-02-21T08:27:36.4948722Z {-#
2026-02-21T08:27:36.4948907Z   external_resources: {
2026-02-21T08:27:36.4949150Z     mlir_reproducer: {
2026-02-21T08:27:36.4956743Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:27:36.4964776Z       disable_threading: false,
2026-02-21T08:27:36.4965044Z       verify_each: true
2026-02-21T08:27:36.4965262Z     }
2026-02-21T08:27:36.4965440Z   }
2026-02-21T08:27:36.4965605Z #-}
2026-02-21T08:27:36.4966327Z /tmp/torchinductor_root/hh/chhlewbekkztkmhfeiv3trprbofzgt5kx7k3voaanknl5ztkortp.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:27:36.4968469Z /tmp/torchinductor_root/hh/chhlewbekkztkmhfeiv3trprbofzgt5kx7k3voaanknl5ztkortp.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:27:36.4970203Z [39s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:27:36.4972146Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=256, num_sm_multiplier=64, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[4, 4], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:27:36.4973944Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:27:36.4974360Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:27:36.5452407Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:27:36.5453333Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:27:36.5454234Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:27:36.5454656Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:27:36.5454939Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:27:36.5455306Z     %cst = arith.constant dense<0.000000e+00> : tensor<128x32xf32>
2026-02-21T08:27:36.5455670Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:27:36.5455961Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:27:36.5456259Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T08:27:36.5456559Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T08:27:36.5456855Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:27:36.5457355Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<128x32xf32>>
2026-02-21T08:27:36.5458088Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<128x32xf32>>
2026-02-21T08:27:36.5458597Z     %2 = tt.get_program_id x : i32
2026-02-21T08:27:36.5458858Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:27:36.5459136Z     %4 = arith.minsi %3, %c32_i32 : i32
2026-02-21T08:27:36.5459429Z     scf.for %arg5 = %2 to %4 step %c1_i32  : i32 {
2026-02-21T08:27:36.5459743Z       %5 = arith.muli %arg5, %c128_i32 : i32
2026-02-21T08:27:36.5460102Z       %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:27:36.5460500Z       %7 = tt.splat %5 : i32 -> tensor<128xi32>
2026-02-21T08:27:36.5460807Z       %8 = arith.addi %7, %6 : tensor<128xi32>
2026-02-21T08:27:36.5461307Z       %9 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<128x32xf32>)  : i32 {
2026-02-21T08:27:36.5462017Z         %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc<tensor<128x32xf32>> -> tensor<128x32xf32>
2026-02-21T08:27:36.5462650Z         %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc<tensor<128x32xf32>> -> tensor<128x32xf32>
2026-02-21T08:27:36.5463130Z         %15 = scf.if %arg3 -> (tensor<128x32xf32>) {
2026-02-21T08:27:36.5463735Z           %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x32xf32>) -> tensor<128x32xf32>
2026-02-21T08:27:36.5464349Z           %18 = arith.subf %14, %13 : tensor<128x32xf32>
2026-02-21T08:27:36.5464679Z           %19 = arith.mulf %17, %18 : tensor<128x32xf32>
2026-02-21T08:27:36.5465011Z           %20 = arith.addf %19, %cst : tensor<128x32xf32>
2026-02-21T08:27:36.5465330Z           scf.yield %20 : tensor<128x32xf32>
2026-02-21T08:27:36.5465592Z         } else {
2026-02-21T08:27:36.5465843Z           %17 = tt.splat %arg4 : f32 -> tensor<128x32xf32>
2026-02-21T08:27:36.5466206Z           %18 = arith.cmpf ogt, %14, %17 : tensor<128x32xf32>
2026-02-21T08:27:36.5466559Z           %19 = arith.cmpf une, %14, %14 : tensor<128x32xf32>
2026-02-21T08:27:36.5466900Z           %20 = arith.ori %18, %19 : tensor<128x32xi1>
2026-02-21T08:27:36.5467273Z           %21 = arith.select %20, %14, %17 : tensor<128x32xi1>, tensor<128x32xf32>
2026-02-21T08:27:36.5467665Z           %22 = math.log %21 : tensor<128x32xf32>
2026-02-21T08:27:36.5467984Z           %23 = arith.subf %22, %13 : tensor<128x32xf32>
2026-02-21T08:27:36.5468415Z           %24 = arith.mulf %14, %23 : tensor<128x32xf32>
2026-02-21T08:27:36.5468759Z           %25 = arith.addf %24, %cst : tensor<128x32xf32>
2026-02-21T08:27:36.5469075Z           scf.yield %25 : tensor<128x32xf32>
2026-02-21T08:27:36.5469348Z         }
2026-02-21T08:27:36.5469573Z         %16 = arith.addf %arg7, %15 : tensor<128x32xf32>
2026-02-21T08:27:36.5469895Z         scf.yield %16 : tensor<128x32xf32>
2026-02-21T08:27:36.5470342Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:27:36.5470816Z       %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({
2026-02-21T08:27:36.5471110Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:27:36.5471385Z         %13 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:27:36.5471676Z         tt.reduce.return %13 : f32
2026-02-21T08:27:36.5472001Z       }) : (tensor<128x32xf32>) -> tensor<128xf32>
2026-02-21T08:27:36.5472455Z       %11 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<128x!tt.ptr<f32>>
2026-02-21T08:27:36.5472889Z       %12 = tt.addptr %11, %8 : tensor<128x!tt.ptr<f32>>, tensor<128xi32>
2026-02-21T08:27:36.5473287Z       tt.store %12, %10 : tensor<128x!tt.ptr<f32>>
2026-02-21T08:27:36.5473575Z     } {tt.loop_unroll_factor = 1 : i32}
2026-02-21T08:27:36.5473848Z     tt.return
2026-02-21T08:27:36.5474039Z   }
2026-02-21T08:27:36.5474206Z }
2026-02-21T08:27:36.5474309Z 
2026-02-21T08:27:36.5474387Z {-#
2026-02-21T08:27:36.5474568Z   external_resources: {
2026-02-21T08:27:36.5474812Z     mlir_reproducer: {
2026-02-21T08:27:36.5482429Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=1}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=1}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=1}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:27:36.5490317Z       disable_threading: false,
2026-02-21T08:27:36.5490618Z       verify_each: true
2026-02-21T08:27:36.5490883Z     }
2026-02-21T08:27:36.5491090Z   }
2026-02-21T08:27:36.5491301Z #-}
2026-02-21T08:27:36.5492133Z /tmp/torchinductor_root/zc/czcqwzi5sssdgaf7drc4sspkirwdodgk7w2szpqbcbt5ahalb7yk.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:27:36.5494370Z /tmp/torchinductor_root/zc/czcqwzi5sssdgaf7drc4sspkirwdodgk7w2szpqbcbt5ahalb7yk.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:27:36.5496220Z [39s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:27:36.5498128Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=2, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, False], range_num_stages=[0, 2], range_unroll_factors=[1, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:27:36.5499856Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:27:36.5500256Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:27:37.9793557Z module {
2026-02-21T08:27:37.9796158Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:27:37.9796889Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:27:37.9797077Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:27:37.9797252Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:27:37.9797475Z     %cst = arith.constant dense<0.000000e+00> : tensor<64x8xf32>
2026-02-21T08:27:37.9797693Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:27:37.9797876Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:27:37.9798054Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T08:27:37.9798236Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T08:27:37.9798407Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:27:37.9798718Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<64x8xf32>>
2026-02-21T08:27:37.9799157Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<64x8xf32>>
2026-02-21T08:27:37.9799468Z     %2 = tt.get_program_id x : i32
2026-02-21T08:27:37.9799641Z     %3 = arith.subi %c64_i32, %2 : i32
2026-02-21T08:27:37.9799809Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:27:37.9799990Z     %4 = arith.subi %c9472_i32, %c1_i32 : i32
2026-02-21T08:27:37.9800166Z     %5 = arith.addi %3, %4 : i32
2026-02-21T08:27:37.9800337Z     %6 = arith.divui %5, %c9472_i32 : i32
2026-02-21T08:27:37.9800516Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T08:27:37.9800707Z     %7 = arith.remsi %6, %c3_i32 : i32
2026-02-21T08:27:37.9819089Z     %8 = arith.subi %6, %7 : i32
2026-02-21T08:27:37.9819320Z     %9 = arith.muli %8, %c9472_i32 : i32
2026-02-21T08:27:37.9819518Z     %10 = arith.addi %2, %9 : i32
2026-02-21T08:27:37.9819721Z     %11 = arith.muli %c9472_i32, %c3_i32 : i32
2026-02-21T08:27:37.9819937Z     scf.for %arg5 = %2 to %10 step %11  : i32 {
2026-02-21T08:27:37.9820169Z       %12 = arith.muli %arg5, %c64_i32 : i32
2026-02-21T08:27:37.9820443Z       %13 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:27:37.9820714Z       %14 = tt.splat %12 : i32 -> tensor<64xi32>
2026-02-21T08:27:37.9820935Z       %15 = arith.addi %14, %13 : tensor<64xi32>
2026-02-21T08:27:37.9821261Z       %16 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c8_i32 iter_args(%arg7 = %cst) -> (tensor<64x8xf32>)  : i32 {
2026-02-21T08:27:37.9821693Z         %40 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc<tensor<64x8xf32>> -> tensor<64x8xf32>
2026-02-21T08:27:37.9822121Z         %41 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc<tensor<64x8xf32>> -> tensor<64x8xf32>
2026-02-21T08:27:37.9822414Z         %42 = scf.if %arg3 -> (tensor<64x8xf32>) {
2026-02-21T08:27:37.9822801Z           %44 = tt.extern_elementwise %41 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x8xf32>) -> tensor<64x8xf32>
2026-02-21T08:27:37.9823186Z           %45 = arith.subf %41, %40 : tensor<64x8xf32>
2026-02-21T08:27:37.9823615Z           %46 = arith.mulf %44, %45 : tensor<64x8xf32>
2026-02-21T08:27:37.9823830Z           %47 = arith.addf %46, %cst : tensor<64x8xf32>
2026-02-21T08:27:37.9824050Z           scf.yield %47 : tensor<64x8xf32>
2026-02-21T08:27:37.9824238Z         } else {
2026-02-21T08:27:37.9824408Z           %44 = tt.splat %arg4 : f32 -> tensor<64x8xf32>
2026-02-21T08:27:37.9824645Z           %45 = arith.cmpf ogt, %41, %44 : tensor<64x8xf32>
2026-02-21T08:27:37.9824876Z           %46 = arith.cmpf une, %41, %41 : tensor<64x8xf32>
2026-02-21T08:27:37.9825097Z           %47 = arith.ori %45, %46 : tensor<64x8xi1>
2026-02-21T08:27:37.9825340Z           %48 = arith.select %47, %41, %44 : tensor<64x8xi1>, tensor<64x8xf32>
2026-02-21T08:27:37.9825600Z           %49 = math.log %48 : tensor<64x8xf32>
2026-02-21T08:27:37.9825810Z           %50 = arith.subf %49, %40 : tensor<64x8xf32>
2026-02-21T08:27:37.9826083Z           %51 = arith.mulf %41, %50 : tensor<64x8xf32>
2026-02-21T08:27:37.9826311Z           %52 = arith.addf %51, %cst : tensor<64x8xf32>
2026-02-21T08:27:37.9826514Z           scf.yield %52 : tensor<64x8xf32>
2026-02-21T08:27:37.9826698Z         }
2026-02-21T08:27:37.9826849Z         %43 = arith.addf %arg7, %42 : tensor<64x8xf32>
2026-02-21T08:27:37.9827058Z         scf.yield %43 : tensor<64x8xf32>
2026-02-21T08:27:37.9827324Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:27:37.9827615Z       %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({
2026-02-21T08:27:37.9827820Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:27:37.9828007Z         %40 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:27:37.9828210Z         tt.reduce.return %40 : f32
2026-02-21T08:27:37.9828401Z       }) : (tensor<64x8xf32>) -> tensor<64xf32>
2026-02-21T08:27:37.9828654Z       %18 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
2026-02-21T08:27:37.9828918Z       %19 = tt.addptr %18, %15 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
2026-02-21T08:27:37.9829163Z       tt.store %19, %17 : tensor<64x!tt.ptr<f32>>
2026-02-21T08:27:37.9829369Z       %c1_i32_0 = arith.constant 1 : i32
2026-02-21T08:27:37.9829557Z       %20 = arith.muli %c9472_i32, %c1_i32_0 : i32
2026-02-21T08:27:37.9829751Z       %21 = arith.addi %arg5, %20 : i32
2026-02-21T08:27:37.9829926Z       %22 = arith.muli %21, %c64_i32 : i32
2026-02-21T08:27:37.9830153Z       %23 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:27:37.9830389Z       %24 = tt.splat %22 : i32 -> tensor<64xi32>
2026-02-21T08:27:37.9830587Z       %25 = arith.addi %24, %23 : tensor<64xi32>
2026-02-21T08:27:37.9830887Z       %26 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c8_i32 iter_args(%arg7 = %cst) -> (tensor<64x8xf32>)  : i32 {
2026-02-21T08:27:37.9831318Z         %40 = tt.descriptor_load %0[%22, %arg6] : !tt.tensordesc<tensor<64x8xf32>> -> tensor<64x8xf32>
2026-02-21T08:27:37.9831694Z         %41 = tt.descriptor_load %1[%22, %arg6] : !tt.tensordesc<tensor<64x8xf32>> -> tensor<64x8xf32>
2026-02-21T08:27:37.9832022Z         %42 = scf.if %arg3 -> (tensor<64x8xf32>) {
2026-02-21T08:27:37.9832388Z           %44 = tt.extern_elementwise %41 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x8xf32>) -> tensor<64x8xf32>
2026-02-21T08:27:37.9832744Z           %45 = arith.subf %41, %40 : tensor<64x8xf32>
2026-02-21T08:27:37.9832956Z           %46 = arith.mulf %44, %45 : tensor<64x8xf32>
2026-02-21T08:27:37.9833173Z           %47 = arith.addf %46, %cst : tensor<64x8xf32>
2026-02-21T08:27:37.9833370Z           scf.yield %47 : tensor<64x8xf32>
2026-02-21T08:27:37.9833550Z         } else {
2026-02-21T08:27:37.9833714Z           %44 = tt.splat %arg4 : f32 -> tensor<64x8xf32>
2026-02-21T08:27:37.9833941Z           %45 = arith.cmpf ogt, %41, %44 : tensor<64x8xf32>
2026-02-21T08:27:37.9834161Z           %46 = arith.cmpf une, %41, %41 : tensor<64x8xf32>
2026-02-21T08:27:37.9834380Z           %47 = arith.ori %45, %46 : tensor<64x8xi1>
2026-02-21T08:27:37.9834634Z           %48 = arith.select %47, %41, %44 : tensor<64x8xi1>, tensor<64x8xf32>
2026-02-21T08:27:37.9834974Z           %49 = math.log %48 : tensor<64x8xf32>
2026-02-21T08:27:37.9835182Z           %50 = arith.subf %49, %40 : tensor<64x8xf32>
2026-02-21T08:27:37.9835382Z           %51 = arith.mulf %41, %50 : tensor<64x8xf32>
2026-02-21T08:27:37.9835596Z           %52 = arith.addf %51, %cst : tensor<64x8xf32>
2026-02-21T08:27:37.9835790Z           scf.yield %52 : tensor<64x8xf32>
2026-02-21T08:27:37.9835966Z         }
2026-02-21T08:27:37.9836112Z         %43 = arith.addf %arg7, %42 : tensor<64x8xf32>
2026-02-21T08:27:37.9836319Z         scf.yield %43 : tensor<64x8xf32>
2026-02-21T08:27:37.9836581Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:27:37.9836848Z       %27 = "tt.reduce"(%26) <{axis = 1 : i32}> ({
2026-02-21T08:27:37.9837049Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:27:37.9837226Z         %40 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:27:37.9837486Z         tt.reduce.return %40 : f32
2026-02-21T08:27:37.9837675Z       }) : (tensor<64x8xf32>) -> tensor<64xf32>
2026-02-21T08:27:37.9837912Z       %28 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
2026-02-21T08:27:37.9838178Z       %29 = tt.addptr %28, %25 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
2026-02-21T08:27:37.9838416Z       tt.store %29, %27 : tensor<64x!tt.ptr<f32>>
2026-02-21T08:27:37.9838625Z       %c2_i32 = arith.constant 2 : i32
2026-02-21T08:27:37.9838811Z       %30 = arith.muli %c9472_i32, %c2_i32 : i32
2026-02-21T08:27:37.9839008Z       %31 = arith.addi %arg5, %30 : i32
2026-02-21T08:27:37.9839184Z       %32 = arith.muli %31, %c64_i32 : i32
2026-02-21T08:27:37.9839416Z       %33 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:27:37.9839652Z       %34 = tt.splat %32 : i32 -> tensor<64xi32>
2026-02-21T08:27:37.9839850Z       %35 = arith.addi %34, %33 : tensor<64xi32>
2026-02-21T08:27:37.9840164Z       %36 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c8_i32 iter_args(%arg7 = %cst) -> (tensor<64x8xf32>)  : i32 {
2026-02-21T08:27:37.9840564Z         %40 = tt.descriptor_load %0[%32, %arg6] : !tt.tensordesc<tensor<64x8xf32>> -> tensor<64x8xf32>
2026-02-21T08:27:37.9840928Z         %41 = tt.descriptor_load %1[%32, %arg6] : !tt.tensordesc<tensor<64x8xf32>> -> tensor<64x8xf32>
2026-02-21T08:27:37.9841203Z         %42 = scf.if %arg3 -> (tensor<64x8xf32>) {
2026-02-21T08:27:37.9841563Z           %44 = tt.extern_elementwise %41 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x8xf32>) -> tensor<64x8xf32>
2026-02-21T08:27:37.9841986Z           %45 = arith.subf %41, %40 : tensor<64x8xf32>
2026-02-21T08:27:37.9842182Z           %46 = arith.mulf %44, %45 : tensor<64x8xf32>
2026-02-21T08:27:37.9842389Z           %47 = arith.addf %46, %cst : tensor<64x8xf32>
2026-02-21T08:27:37.9842578Z           scf.yield %47 : tensor<64x8xf32>
2026-02-21T08:27:37.9842751Z         } else {
2026-02-21T08:27:37.9842910Z           %44 = tt.splat %arg4 : f32 -> tensor<64x8xf32>
2026-02-21T08:27:37.9843133Z           %45 = arith.cmpf ogt, %41, %44 : tensor<64x8xf32>
2026-02-21T08:27:37.9843354Z           %46 = arith.cmpf une, %41, %41 : tensor<64x8xf32>
2026-02-21T08:27:37.9843556Z           %47 = arith.ori %45, %46 : tensor<64x8xi1>
2026-02-21T08:27:37.9843798Z           %48 = arith.select %47, %41, %44 : tensor<64x8xi1>, tensor<64x8xf32>
2026-02-21T08:27:37.9844029Z           %49 = math.log %48 : tensor<64x8xf32>
2026-02-21T08:27:37.9844223Z           %50 = arith.subf %49, %40 : tensor<64x8xf32>
2026-02-21T08:27:37.9844415Z           %51 = arith.mulf %41, %50 : tensor<64x8xf32>
2026-02-21T08:27:37.9844624Z           %52 = arith.addf %51, %cst : tensor<64x8xf32>
2026-02-21T08:27:37.9844819Z           scf.yield %52 : tensor<64x8xf32>
2026-02-21T08:27:37.9844981Z         }
2026-02-21T08:27:37.9845132Z         %43 = arith.addf %arg7, %42 : tensor<64x8xf32>
2026-02-21T08:27:37.9845320Z         scf.yield %43 : tensor<64x8xf32>
2026-02-21T08:27:37.9845577Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:27:37.9845891Z       %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({
2026-02-21T08:27:37.9846084Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:27:37.9846257Z         %40 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:27:37.9846447Z         tt.reduce.return %40 : f32
2026-02-21T08:27:37.9846632Z       }) : (tensor<64x8xf32>) -> tensor<64xf32>
2026-02-21T08:27:37.9846852Z       %38 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
2026-02-21T08:27:37.9847110Z       %39 = tt.addptr %38, %35 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
2026-02-21T08:27:37.9847336Z       tt.store %39, %37 : tensor<64x!tt.ptr<f32>>
2026-02-21T08:27:37.9847533Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:27:37.9847731Z     scf.for %arg5 = %10 to %c64_i32 step %c9472_i32  : i32 {
2026-02-21T08:27:37.9847951Z       %12 = arith.muli %arg5, %c64_i32 : i32
2026-02-21T08:27:37.9848235Z       %13 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:27:37.9848473Z       %14 = tt.splat %12 : i32 -> tensor<64xi32>
2026-02-21T08:27:37.9848666Z       %15 = arith.addi %14, %13 : tensor<64xi32>
2026-02-21T08:27:37.9848961Z       %16 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c8_i32 iter_args(%arg7 = %cst) -> (tensor<64x8xf32>)  : i32 {
2026-02-21T08:27:37.9849355Z         %20 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc<tensor<64x8xf32>> -> tensor<64x8xf32>
2026-02-21T08:27:37.9849704Z         %21 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc<tensor<64x8xf32>> -> tensor<64x8xf32>
2026-02-21T08:27:37.9849986Z         %22 = scf.if %arg3 -> (tensor<64x8xf32>) {
2026-02-21T08:27:37.9850343Z           %24 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x8xf32>) -> tensor<64x8xf32>
2026-02-21T08:27:37.9850691Z           %25 = arith.subf %21, %20 : tensor<64x8xf32>
2026-02-21T08:27:37.9850903Z           %26 = arith.mulf %24, %25 : tensor<64x8xf32>
2026-02-21T08:27:37.9851108Z           %27 = arith.addf %26, %cst : tensor<64x8xf32>
2026-02-21T08:27:37.9851308Z           scf.yield %27 : tensor<64x8xf32>
2026-02-21T08:27:37.9851474Z         } else {
2026-02-21T08:27:37.9851641Z           %24 = tt.splat %arg4 : f32 -> tensor<64x8xf32>
2026-02-21T08:27:37.9851894Z           %25 = arith.cmpf ogt, %21, %24 : tensor<64x8xf32>
2026-02-21T08:27:37.9852109Z           %26 = arith.cmpf une, %21, %21 : tensor<64x8xf32>
2026-02-21T08:27:37.9852324Z           %27 = arith.ori %25, %26 : tensor<64x8xi1>
2026-02-21T08:27:37.9852555Z           %28 = arith.select %27, %21, %24 : tensor<64x8xi1>, tensor<64x8xf32>
2026-02-21T08:27:37.9852794Z           %29 = math.log %28 : tensor<64x8xf32>
2026-02-21T08:27:37.9852982Z           %30 = arith.subf %29, %20 : tensor<64x8xf32>
2026-02-21T08:27:37.9853186Z           %31 = arith.mulf %21, %30 : tensor<64x8xf32>
2026-02-21T08:27:37.9853390Z           %32 = arith.addf %31, %cst : tensor<64x8xf32>
2026-02-21T08:27:37.9853578Z           scf.yield %32 : tensor<64x8xf32>
2026-02-21T08:27:37.9853752Z         }
2026-02-21T08:27:37.9853910Z         %23 = arith.addf %arg7, %22 : tensor<64x8xf32>
2026-02-21T08:27:37.9854108Z         scf.yield %23 : tensor<64x8xf32>
2026-02-21T08:27:37.9854351Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:27:37.9854619Z       %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({
2026-02-21T08:27:37.9854813Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:27:37.9854985Z         %20 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:27:37.9855172Z         tt.reduce.return %20 : f32
2026-02-21T08:27:37.9855350Z       }) : (tensor<64x8xf32>) -> tensor<64xf32>
2026-02-21T08:27:37.9855580Z       %18 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
2026-02-21T08:27:37.9855832Z       %19 = tt.addptr %18, %15 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
2026-02-21T08:27:37.9856070Z       tt.store %19, %17 : tensor<64x!tt.ptr<f32>>
2026-02-21T08:27:37.9856273Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:27:37.9856490Z     tt.return
2026-02-21T08:27:37.9856624Z   }
2026-02-21T08:27:37.9856742Z }
2026-02-21T08:27:37.9856813Z 
2026-02-21T08:27:37.9856875Z {-#
2026-02-21T08:27:37.9857000Z   external_resources: {
2026-02-21T08:27:37.9857162Z     mlir_reproducer: {
2026-02-21T08:27:37.9861523Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:27:37.9866207Z       disable_threading: false,
2026-02-21T08:27:37.9866380Z       verify_each: true
2026-02-21T08:27:37.9866547Z     }
2026-02-21T08:27:37.9866700Z   }
2026-02-21T08:27:37.9866840Z #-}
2026-02-21T08:27:37.9867383Z /tmp/torchinductor_root/ep/cep5rg3g5xg5gfzsazcgokkggs64aoxaafehishm262xa2jmr7tz.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:27:37.9868809Z /tmp/torchinductor_root/ep/cep5rg3g5xg5gfzsazcgokkggs64aoxaafehishm262xa2jmr7tz.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:27:37.9869873Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:27:37.9871070Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, False], range_num_stages=[4, 3], range_unroll_factors=[3, 0], range_warp_specializes=[False, True]), static_shapes=True)
2026-02-21T08:27:37.9872119Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:27:37.9872387Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:27:42.1171165Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 9.6 configs/s
2026-02-21T08:27:42.1180959Z [45s] Adaptive compile timeout: 30s (90% percentile=16.5s, bounds=[30.0s, 30s])
2026-02-21T08:27:43.1528880Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 950.5 configs/s
2026-02-21T08:27:43.2066427Z [46s] Initial random population of 100, 5 starting points: 
2026-02-21T08:27:43.2070256Z error=14
2026-02-21T08:27:43.2071387Z timeout=8
2026-02-21T08:27:43.2071517Z ok=78
2026-02-21T08:27:43.2071635Z min=0.1157
2026-02-21T08:27:43.2071765Z mid=1.2339
2026-02-21T08:27:43.2071936Z max=148.8250
2026-02-21T08:27:43.2072096Z best={'block_sizes': [2048, 1],
2026-02-21T08:27:43.2072344Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:27:43.2072579Z  'load_eviction_policies': ['', 'first'],
2026-02-21T08:27:43.2072768Z  'num_stages': 8,
2026-02-21T08:27:43.2072910Z  'num_warps': 4,
2026-02-21T08:27:43.2073042Z  'pid_type': 'flat',
2026-02-21T08:27:43.2073198Z  'range_flattens': [None, False],
2026-02-21T08:27:43.2073368Z  'range_multi_buffers': [None, False],
2026-02-21T08:27:43.2073549Z  'range_num_stages': [0, 4],
2026-02-21T08:27:43.2073706Z  'range_unroll_factors': [0, 0],
2026-02-21T08:27:43.2073885Z  'range_warp_specializes': [None, True]}
2026-02-21T08:27:43.2080914Z [46s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:27:44.2433360Z [47s] Generation 1 starting: 78 neighbors, 5 active search path(s)
2026-02-21T08:27:56.9809316Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80/80 1.4 configs/s
2026-02-21T08:28:00.1843083Z module {
2026-02-21T08:28:00.1844997Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:28:00.1845671Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:28:00.1845890Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:28:00.1846156Z     %cst = arith.constant dense<0.000000e+00> : tensor<8x256xf32>
2026-02-21T08:28:00.1846416Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:28:00.1846632Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:28:00.1846857Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T08:28:00.1847069Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T08:28:00.1847314Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:28:00.1847689Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<8x256xf32>>
2026-02-21T08:28:00.1848194Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<8x256xf32>>
2026-02-21T08:28:00.1848543Z     %2 = tt.get_program_id x : i32
2026-02-21T08:28:00.1848746Z     %3 = arith.muli %2, %c8_i32 : i32
2026-02-21T08:28:00.1849001Z     %4 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:28:00.1849268Z     %5 = tt.splat %3 : i32 -> tensor<8xi32>
2026-02-21T08:28:00.1849482Z     %6 = arith.addi %5, %4 : tensor<8xi32>
2026-02-21T08:28:00.1849827Z     %7 = scf.for %arg5 = %c0_i32 to %c16384_i32 step %c256_i32 iter_args(%arg6 = %cst) -> (tensor<8x256xf32>)  : i32 {
2026-02-21T08:28:00.1850301Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<8x256xf32>> -> tensor<8x256xf32>
2026-02-21T08:28:00.1850715Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<8x256xf32>> -> tensor<8x256xf32>
2026-02-21T08:28:00.1851054Z       %13 = scf.if %arg3 -> (tensor<8x256xf32>) {
2026-02-21T08:28:00.1851471Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x256xf32>) -> tensor<8x256xf32>
2026-02-21T08:28:00.1852118Z         %16 = arith.subf %12, %11 : tensor<8x256xf32>
2026-02-21T08:28:00.1852368Z         %17 = arith.mulf %15, %16 : tensor<8x256xf32>
2026-02-21T08:28:00.1852601Z         %18 = arith.addf %17, %cst : tensor<8x256xf32>
2026-02-21T08:28:00.1852839Z         scf.yield %18 : tensor<8x256xf32>
2026-02-21T08:28:00.1853034Z       } else {
2026-02-21T08:28:00.1853230Z         %15 = tt.splat %arg4 : f32 -> tensor<8x256xf32>
2026-02-21T08:28:00.1853491Z         %16 = arith.cmpf ogt, %12, %15 : tensor<8x256xf32>
2026-02-21T08:28:00.1853743Z         %17 = arith.cmpf une, %12, %12 : tensor<8x256xf32>
2026-02-21T08:28:00.1853995Z         %18 = arith.ori %16, %17 : tensor<8x256xi1>
2026-02-21T08:28:00.1854700Z         %19 = arith.select %18, %12, %15 : tensor<8x256xi1>, tensor<8x256xf32>
2026-02-21T08:28:00.1854994Z         %20 = math.log %19 : tensor<8x256xf32>
2026-02-21T08:28:00.1855215Z         %21 = arith.subf %20, %11 : tensor<8x256xf32>
2026-02-21T08:28:00.1855455Z         %22 = arith.mulf %12, %21 : tensor<8x256xf32>
2026-02-21T08:28:00.1855690Z         %23 = arith.addf %22, %cst : tensor<8x256xf32>
2026-02-21T08:28:00.1855907Z         scf.yield %23 : tensor<8x256xf32>
2026-02-21T08:28:00.1856107Z       }
2026-02-21T08:28:00.1856266Z       %14 = arith.addf %arg6, %13 : tensor<8x256xf32>
2026-02-21T08:28:00.1856492Z       scf.yield %14 : tensor<8x256xf32>
2026-02-21T08:28:00.1856777Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:28:00.1857104Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:28:00.1857318Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:28:00.1857631Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:28:00.1857856Z       tt.reduce.return %11 : f32
2026-02-21T08:28:00.1858059Z     }) : (tensor<8x256xf32>) -> tensor<8xf32>
2026-02-21T08:28:00.1858315Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<8x!tt.ptr<f32>>
2026-02-21T08:28:00.1858603Z     %10 = tt.addptr %9, %6 : tensor<8x!tt.ptr<f32>>, tensor<8xi32>
2026-02-21T08:28:00.1858869Z     tt.store %10, %8 : tensor<8x!tt.ptr<f32>>
2026-02-21T08:28:00.1859074Z     tt.return
2026-02-21T08:28:00.1859213Z   }
2026-02-21T08:28:00.1859352Z }
2026-02-21T08:28:00.1859429Z 
2026-02-21T08:28:00.1859483Z {-#
2026-02-21T08:28:00.1859630Z   external_resources: {
2026-02-21T08:28:00.1859802Z     mlir_reproducer: {
2026-02-21T08:28:00.1864783Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:28:00.1869605Z       disable_threading: false,
2026-02-21T08:28:00.1869778Z       verify_each: true
2026-02-21T08:28:00.1869932Z     }
2026-02-21T08:28:00.1870052Z   }
2026-02-21T08:28:00.1870174Z #-}
2026-02-21T08:28:00.1870618Z /tmp/torchinductor_root/eb/cebfo3qanzurb5hreek6etdls6omorkgf56ltlrepkgmmlylpopl.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:28:00.1871930Z /tmp/torchinductor_root/eb/cebfo3qanzurb5hreek6etdls6omorkgf56ltlrepkgmmlylpopl.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:28:00.1873189Z [63s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:28:00.1874285Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 8], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:28:00.1875223Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:28:00.1875494Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:28:01.6397620Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 80/80 17.3 configs/s
2026-02-21T08:28:14.1025729Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 82.6 configs/s
2026-02-21T08:28:14.4585743Z [77s] Generation 1 complete: 
2026-02-21T08:28:14.4589900Z error=1
2026-02-21T08:28:14.4591384Z ok=83
2026-02-21T08:28:14.4591582Z min=0.1177
2026-02-21T08:28:14.4597242Z mid=0.1361
2026-02-21T08:28:14.4599460Z max=1.1249
2026-02-21T08:28:14.4599675Z best={'block_sizes': [2048, 1],
2026-02-21T08:28:14.4603063Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:28:14.4606295Z  'load_eviction_policies': ['', 'first'],
2026-02-21T08:28:14.4610755Z  'num_stages': 8,
2026-02-21T08:28:14.4614578Z  'num_warps': 4,
2026-02-21T08:28:14.4618959Z  'pid_type': 'flat',
2026-02-21T08:28:14.4620574Z  'range_flattens': [None, False],
2026-02-21T08:28:14.4620855Z  'range_multi_buffers': [None, False],
2026-02-21T08:28:14.4626244Z  'range_num_stages': [0, 4],
2026-02-21T08:28:14.4627844Z  'range_unroll_factors': [0, 0],
2026-02-21T08:28:14.4628151Z  'range_warp_specializes': [None, True]}
2026-02-21T08:28:14.4633332Z [77s] Fitting surrogate: 184 points, 184 targets
2026-02-21T08:28:15.4274941Z [78s] Generation 2 starting: 66 neighbors, 5 active search path(s)
2026-02-21T08:28:20.7571119Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67/67 3.7 configs/s
2026-02-21T08:28:23.1239762Z module {
2026-02-21T08:28:23.1244633Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:28:23.1245758Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:28:23.1246006Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:28:23.1246231Z     %cst = arith.constant dense<0.000000e+00> : tensor<4x256xf32>
2026-02-21T08:28:23.1246466Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:28:23.1246645Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:28:23.1246869Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T08:28:23.1247065Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T08:28:23.1247242Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:28:23.1247546Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<4x256xf32>>
2026-02-21T08:28:23.1247975Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<4x256xf32>>
2026-02-21T08:28:23.1248284Z     %2 = tt.get_program_id x : i32
2026-02-21T08:28:23.1248456Z     %3 = arith.muli %2, %c4_i32 : i32
2026-02-21T08:28:23.1248674Z     %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:28:23.1248903Z     %5 = tt.splat %3 : i32 -> tensor<4xi32>
2026-02-21T08:28:23.1249091Z     %6 = arith.addi %5, %4 : tensor<4xi32>
2026-02-21T08:28:23.1249395Z     %7 = scf.for %arg5 = %c0_i32 to %c16384_i32 step %c256_i32 iter_args(%arg6 = %cst) -> (tensor<4x256xf32>)  : i32 {
2026-02-21T08:28:23.1250132Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<4x256xf32>> -> tensor<4x256xf32>
2026-02-21T08:28:23.1250502Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<4x256xf32>> -> tensor<4x256xf32>
2026-02-21T08:28:23.1250784Z       %13 = scf.if %arg3 -> (tensor<4x256xf32>) {
2026-02-21T08:28:23.1251156Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x256xf32>) -> tensor<4x256xf32>
2026-02-21T08:28:23.1251530Z         %16 = arith.subf %12, %11 : tensor<4x256xf32>
2026-02-21T08:28:23.1251731Z         %17 = arith.mulf %15, %16 : tensor<4x256xf32>
2026-02-21T08:28:23.1251981Z         %18 = arith.addf %17, %cst : tensor<4x256xf32>
2026-02-21T08:28:23.1252180Z         scf.yield %18 : tensor<4x256xf32>
2026-02-21T08:28:23.1252351Z       } else {
2026-02-21T08:28:23.1252507Z         %15 = tt.splat %arg4 : f32 -> tensor<4x256xf32>
2026-02-21T08:28:23.1252838Z         %16 = arith.cmpf ogt, %12, %15 : tensor<4x256xf32>
2026-02-21T08:28:23.1253066Z         %17 = arith.cmpf une, %12, %12 : tensor<4x256xf32>
2026-02-21T08:28:23.1253265Z         %18 = arith.ori %16, %17 : tensor<4x256xi1>
2026-02-21T08:28:23.1253507Z         %19 = arith.select %18, %12, %15 : tensor<4x256xi1>, tensor<4x256xf32>
2026-02-21T08:28:23.1253738Z         %20 = math.log %19 : tensor<4x256xf32>
2026-02-21T08:28:23.1253932Z         %21 = arith.subf %20, %11 : tensor<4x256xf32>
2026-02-21T08:28:23.1254124Z         %22 = arith.mulf %12, %21 : tensor<4x256xf32>
2026-02-21T08:28:23.1254324Z         %23 = arith.addf %22, %cst : tensor<4x256xf32>
2026-02-21T08:28:23.1254516Z         scf.yield %23 : tensor<4x256xf32>
2026-02-21T08:28:23.1254679Z       }
2026-02-21T08:28:23.1254827Z       %14 = arith.addf %arg6, %13 : tensor<4x256xf32>
2026-02-21T08:28:23.1255014Z       scf.yield %14 : tensor<4x256xf32>
2026-02-21T08:28:23.1255263Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:28:23.1255521Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:28:23.1255710Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:28:23.1255875Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:28:23.1256061Z       tt.reduce.return %11 : f32
2026-02-21T08:28:23.1256245Z     }) : (tensor<4x256xf32>) -> tensor<4xf32>
2026-02-21T08:28:23.1256459Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:28:23.1256716Z     %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:28:23.1256935Z     tt.store %10, %8 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:28:23.1257111Z     tt.return
2026-02-21T08:28:23.1257229Z   }
2026-02-21T08:28:23.1257346Z }
2026-02-21T08:28:23.1257413Z 
2026-02-21T08:28:23.1257469Z {-#
2026-02-21T08:28:23.1257598Z   external_resources: {
2026-02-21T08:28:23.1257753Z     mlir_reproducer: {
2026-02-21T08:28:23.1262173Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:28:23.1266816Z       disable_threading: false,
2026-02-21T08:28:23.1266984Z       verify_each: true
2026-02-21T08:28:23.1267122Z     }
2026-02-21T08:28:23.1267243Z   }
2026-02-21T08:28:23.1267352Z #-}
2026-02-21T08:28:23.1267829Z /tmp/torchinductor_root/7a/c7ahhktrmlgwjp4px77jizrsvzaxhjcxlkqp6bijd6wtzgezrfa6.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:28:23.1269038Z /tmp/torchinductor_root/7a/c7ahhktrmlgwjp4px77jizrsvzaxhjcxlkqp6bijd6wtzgezrfa6.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:28:23.1270020Z [86s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:28:23.1271008Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:28:23.1271927Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:28:23.1272181Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:28:23.5600017Z module {
2026-02-21T08:28:23.5600953Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:28:23.5602177Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:28:23.5602478Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:28:23.5602845Z     %cst = arith.constant dense<0.000000e+00> : tensor<8x512xf32>
2026-02-21T08:28:23.5603200Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:28:23.5603489Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:28:23.5603819Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T08:28:23.5604112Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T08:28:23.5604398Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:28:23.5604932Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<8x512xf32>>
2026-02-21T08:28:23.5605660Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<8x512xf32>>
2026-02-21T08:28:23.5606162Z     %2 = tt.get_program_id x : i32
2026-02-21T08:28:23.5606438Z     %3 = arith.muli %2, %c8_i32 : i32
2026-02-21T08:28:23.5606785Z     %4 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:28:23.5607153Z     %5 = tt.splat %3 : i32 -> tensor<8xi32>
2026-02-21T08:28:23.5607446Z     %6 = arith.addi %5, %4 : tensor<8xi32>
2026-02-21T08:28:23.5607937Z     %7 = scf.for %arg5 = %c0_i32 to %c16384_i32 step %c512_i32 iter_args(%arg6 = %cst) -> (tensor<8x512xf32>)  : i32 {
2026-02-21T08:28:23.5608596Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<8x512xf32>> -> tensor<8x512xf32>
2026-02-21T08:28:23.5609193Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<8x512xf32>> -> tensor<8x512xf32>
2026-02-21T08:28:23.5609996Z       %13 = scf.if %arg3 -> (tensor<8x512xf32>) {
2026-02-21T08:28:23.5610595Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32>
2026-02-21T08:28:23.5611193Z         %16 = arith.subf %12, %11 : tensor<8x512xf32>
2026-02-21T08:28:23.5611519Z         %17 = arith.mulf %15, %16 : tensor<8x512xf32>
2026-02-21T08:28:23.5611837Z         %18 = arith.addf %17, %cst : tensor<8x512xf32>
2026-02-21T08:28:23.5612187Z         scf.yield %18 : tensor<8x512xf32>
2026-02-21T08:28:23.5612453Z       } else {
2026-02-21T08:28:23.5612705Z         %15 = tt.splat %arg4 : f32 -> tensor<8x512xf32>
2026-02-21T08:28:23.5613070Z         %16 = arith.cmpf ogt, %12, %15 : tensor<8x512xf32>
2026-02-21T08:28:23.5613422Z         %17 = arith.cmpf une, %12, %12 : tensor<8x512xf32>
2026-02-21T08:28:23.5613766Z         %18 = arith.ori %16, %17 : tensor<8x512xi1>
2026-02-21T08:28:23.5614277Z         %19 = arith.select %18, %12, %15 : tensor<8x512xi1>, tensor<8x512xf32>
2026-02-21T08:28:23.5614681Z         %20 = math.log %19 : tensor<8x512xf32>
2026-02-21T08:28:23.5614992Z         %21 = arith.subf %20, %11 : tensor<8x512xf32>
2026-02-21T08:28:23.5615316Z         %22 = arith.mulf %12, %21 : tensor<8x512xf32>
2026-02-21T08:28:23.5615651Z         %23 = arith.addf %22, %cst : tensor<8x512xf32>
2026-02-21T08:28:23.5615949Z         scf.yield %23 : tensor<8x512xf32>
2026-02-21T08:28:23.5616218Z       }
2026-02-21T08:28:23.5616434Z       %14 = arith.addf %arg6, %13 : tensor<8x512xf32>
2026-02-21T08:28:23.5616740Z       scf.yield %14 : tensor<8x512xf32>
2026-02-21T08:28:23.5617147Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:28:23.5617582Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:28:23.5617872Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:28:23.5618152Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:28:23.5618451Z       tt.reduce.return %11 : f32
2026-02-21T08:28:23.5618738Z     }) : (tensor<8x512xf32>) -> tensor<8xf32>
2026-02-21T08:28:23.5619100Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<8x!tt.ptr<f32>>
2026-02-21T08:28:23.5619523Z     %10 = tt.addptr %9, %6 : tensor<8x!tt.ptr<f32>>, tensor<8xi32>
2026-02-21T08:28:23.5619897Z     tt.store %10, %8 : tensor<8x!tt.ptr<f32>>
2026-02-21T08:28:23.5620179Z     tt.return
2026-02-21T08:28:23.5620356Z   }
2026-02-21T08:28:23.5620532Z }
2026-02-21T08:28:23.5620630Z 
2026-02-21T08:28:23.5620699Z {-#
2026-02-21T08:28:23.5620892Z   external_resources: {
2026-02-21T08:28:23.5621126Z     mlir_reproducer: {
2026-02-21T08:28:23.5628734Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:28:23.5636723Z       disable_threading: false,
2026-02-21T08:28:23.5636987Z       verify_each: true
2026-02-21T08:28:23.5637199Z     }
2026-02-21T08:28:23.5637374Z   }
2026-02-21T08:28:23.5637537Z #-}
2026-02-21T08:28:23.5638248Z /tmp/torchinductor_root/e5/ce54lljssnh2prtaef7tqwd3wn6vl4knfn4dpqe6vnzyl4to3czs.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:28:23.5640414Z /tmp/torchinductor_root/e5/ce54lljssnh2prtaef7tqwd3wn6vl4knfn4dpqe6vnzyl4to3czs.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:28:23.5642166Z [86s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:28:23.5643885Z Config: @helion.kernel(config=helion.Config(block_sizes=[512, 8], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:28:23.5645412Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:28:23.5645826Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:28:24.5473753Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 67/67 17.9 configs/s
2026-02-21T08:28:35.9013091Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 90.6 configs/s
2026-02-21T08:28:36.2323135Z [99s] Generation 2 complete: 
2026-02-21T08:28:36.2327411Z error=2
2026-02-21T08:28:36.2331363Z ok=70
2026-02-21T08:28:36.2335191Z min=0.1160
2026-02-21T08:28:36.2336686Z mid=0.1258
2026-02-21T08:28:36.2336847Z max=0.5642
2026-02-21T08:28:36.2336984Z best={'block_sizes': [2048, 1],
2026-02-21T08:28:36.2337216Z  'indexing': ['pointer', 'tensor_descriptor', 'pointer'],
2026-02-21T08:28:36.2337450Z  'load_eviction_policies': ['', 'first'],
2026-02-21T08:28:36.2337620Z  'num_stages': 8,
2026-02-21T08:28:36.2337764Z  'num_warps': 1,
2026-02-21T08:28:36.2337901Z  'pid_type': 'flat',
2026-02-21T08:28:36.2338057Z  'range_flattens': [None, False],
2026-02-21T08:28:36.2338230Z  'range_multi_buffers': [None, None],
2026-02-21T08:28:36.2338415Z  'range_num_stages': [0, 4],
2026-02-21T08:28:36.2338576Z  'range_unroll_factors': [0, 0],
2026-02-21T08:28:36.2338762Z  'range_warp_specializes': [None, True]}
2026-02-21T08:28:36.2338997Z [99s] Fitting surrogate: 256 points, 256 targets
2026-02-21T08:28:37.0553238Z [100s] Generation 3 starting: 57 neighbors, 4 active search path(s)
2026-02-21T08:28:41.2143421Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 6.2 configs/s
2026-02-21T08:28:43.2977658Z module {
2026-02-21T08:28:43.2978258Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:28:43.2978880Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:28:43.2979072Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:28:43.2979301Z     %cst = arith.constant dense<0.000000e+00> : tensor<4x256xf32>
2026-02-21T08:28:43.2979519Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:28:43.2979699Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:28:43.2979890Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T08:28:43.2980517Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T08:28:43.2980695Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:28:43.2981013Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<4x256xf32>>
2026-02-21T08:28:43.2981455Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<4x256xf32>>
2026-02-21T08:28:43.2981761Z     %2 = tt.get_program_id x : i32
2026-02-21T08:28:43.2982150Z     %3 = arith.muli %2, %c4_i32 : i32
2026-02-21T08:28:43.2982368Z     %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:28:43.2982609Z     %5 = tt.splat %3 : i32 -> tensor<4xi32>
2026-02-21T08:28:43.2982793Z     %6 = arith.addi %5, %4 : tensor<4xi32>
2026-02-21T08:28:43.2983100Z     %7 = scf.for %arg5 = %c0_i32 to %c16384_i32 step %c256_i32 iter_args(%arg6 = %cst) -> (tensor<4x256xf32>)  : i32 {
2026-02-21T08:28:43.2983608Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<4x256xf32>> -> tensor<4x256xf32>
2026-02-21T08:28:43.2983983Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<4x256xf32>> -> tensor<4x256xf32>
2026-02-21T08:28:43.2984276Z       %13 = scf.if %arg3 -> (tensor<4x256xf32>) {
2026-02-21T08:28:43.2984639Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x256xf32>) -> tensor<4x256xf32>
2026-02-21T08:28:43.2985011Z         %16 = arith.subf %12, %11 : tensor<4x256xf32>
2026-02-21T08:28:43.2985247Z         %17 = arith.mulf %15, %16 : tensor<4x256xf32>
2026-02-21T08:28:43.2985448Z         %18 = arith.addf %17, %cst : tensor<4x256xf32>
2026-02-21T08:28:43.2985645Z         scf.yield %18 : tensor<4x256xf32>
2026-02-21T08:28:43.2985808Z       } else {
2026-02-21T08:28:43.2985970Z         %15 = tt.splat %arg4 : f32 -> tensor<4x256xf32>
2026-02-21T08:28:43.2986182Z         %16 = arith.cmpf ogt, %12, %15 : tensor<4x256xf32>
2026-02-21T08:28:43.2986409Z         %17 = arith.cmpf une, %12, %12 : tensor<4x256xf32>
2026-02-21T08:28:43.2986621Z         %18 = arith.ori %16, %17 : tensor<4x256xi1>
2026-02-21T08:28:43.2986853Z         %19 = arith.select %18, %12, %15 : tensor<4x256xi1>, tensor<4x256xf32>
2026-02-21T08:28:43.2987095Z         %20 = math.log %19 : tensor<4x256xf32>
2026-02-21T08:28:43.2987283Z         %21 = arith.subf %20, %11 : tensor<4x256xf32>
2026-02-21T08:28:43.2987485Z         %22 = arith.mulf %12, %21 : tensor<4x256xf32>
2026-02-21T08:28:43.2987681Z         %23 = arith.addf %22, %cst : tensor<4x256xf32>
2026-02-21T08:28:43.2987878Z         scf.yield %23 : tensor<4x256xf32>
2026-02-21T08:28:43.2988050Z       }
2026-02-21T08:28:43.2988189Z       %14 = arith.addf %arg6, %13 : tensor<4x256xf32>
2026-02-21T08:28:43.2988385Z       scf.yield %14 : tensor<4x256xf32>
2026-02-21T08:28:43.2988634Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:28:43.2988901Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:28:43.2989087Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:28:43.2989261Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:28:43.2989439Z       tt.reduce.return %11 : f32
2026-02-21T08:28:43.2989619Z     }) : (tensor<4x256xf32>) -> tensor<4xf32>
2026-02-21T08:28:43.2989839Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:28:43.2990085Z     %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:28:43.2990312Z     tt.store %10, %8 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:28:43.2990484Z     tt.return
2026-02-21T08:28:43.2990608Z   }
2026-02-21T08:28:43.2990721Z }
2026-02-21T08:28:43.2990796Z 
2026-02-21T08:28:43.2990845Z {-#
2026-02-21T08:28:43.2990967Z   external_resources: {
2026-02-21T08:28:43.2991125Z     mlir_reproducer: {
2026-02-21T08:28:43.2995480Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:28:43.2999840Z       disable_threading: false,
2026-02-21T08:28:43.3000006Z       verify_each: true
2026-02-21T08:28:43.3000144Z     }
2026-02-21T08:28:43.3000263Z   }
2026-02-21T08:28:43.3000369Z #-}
2026-02-21T08:28:43.3000779Z /tmp/torchinductor_root/yz/cyzks6vbmk67lvqrfhanze723cfpvacoohlj5tj54zfrsorbhg3o.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:28:43.3001977Z /tmp/torchinductor_root/yz/cyzks6vbmk67lvqrfhanze723cfpvacoohlj5tj54zfrsorbhg3o.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:28:43.3002948Z [106s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:28:43.3003919Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'first'], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:28:43.3004788Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:28:43.3005034Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:28:43.6554426Z module {
2026-02-21T08:28:43.6555087Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:28:43.6560213Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:28:43.6564348Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:28:43.6564672Z     %cst = arith.constant dense<0.000000e+00> : tensor<4x256xf32>
2026-02-21T08:28:43.6569102Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:28:43.6573645Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:28:43.6576945Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T08:28:43.6577254Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T08:28:43.6577485Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:28:43.6577866Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<4x256xf32>>
2026-02-21T08:28:43.6578640Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<4x256xf32>>
2026-02-21T08:28:43.6583742Z     %2 = tt.get_program_id x : i32
2026-02-21T08:28:43.6585791Z     %3 = arith.muli %2, %c4_i32 : i32
2026-02-21T08:28:43.6586066Z     %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:28:43.6586316Z     %5 = tt.splat %3 : i32 -> tensor<4xi32>
2026-02-21T08:28:43.6586505Z     %6 = arith.addi %5, %4 : tensor<4xi32>
2026-02-21T08:28:43.6586817Z     %7 = scf.for %arg5 = %c0_i32 to %c16384_i32 step %c256_i32 iter_args(%arg6 = %cst) -> (tensor<4x256xf32>)  : i32 {
2026-02-21T08:28:43.6587213Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<4x256xf32>> -> tensor<4x256xf32>
2026-02-21T08:28:43.6587577Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<4x256xf32>> -> tensor<4x256xf32>
2026-02-21T08:28:43.6588084Z       %13 = scf.if %arg3 -> (tensor<4x256xf32>) {
2026-02-21T08:28:43.6588463Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x256xf32>) -> tensor<4x256xf32>
2026-02-21T08:28:43.6588828Z         %16 = arith.subf %12, %11 : tensor<4x256xf32>
2026-02-21T08:28:43.6589054Z         %17 = arith.mulf %15, %16 : tensor<4x256xf32>
2026-02-21T08:28:43.6589257Z         %18 = arith.addf %17, %cst : tensor<4x256xf32>
2026-02-21T08:28:43.6589468Z         scf.yield %18 : tensor<4x256xf32>
2026-02-21T08:28:43.6589632Z       } else {
2026-02-21T08:28:43.6589794Z         %15 = tt.splat %arg4 : f32 -> tensor<4x256xf32>
2026-02-21T08:28:43.6590010Z         %16 = arith.cmpf ogt, %12, %15 : tensor<4x256xf32>
2026-02-21T08:28:43.6590222Z         %17 = arith.cmpf une, %12, %12 : tensor<4x256xf32>
2026-02-21T08:28:43.6590448Z         %18 = arith.ori %16, %17 : tensor<4x256xi1>
2026-02-21T08:28:43.6590688Z         %19 = arith.select %18, %12, %15 : tensor<4x256xi1>, tensor<4x256xf32>
2026-02-21T08:28:43.6590938Z         %20 = math.log %19 : tensor<4x256xf32>
2026-02-21T08:28:43.6591133Z         %21 = arith.subf %20, %11 : tensor<4x256xf32>
2026-02-21T08:28:43.6591326Z         %22 = arith.mulf %12, %21 : tensor<4x256xf32>
2026-02-21T08:28:43.6591528Z         %23 = arith.addf %22, %cst : tensor<4x256xf32>
2026-02-21T08:28:43.6591715Z         scf.yield %23 : tensor<4x256xf32>
2026-02-21T08:28:43.6592006Z       }
2026-02-21T08:28:43.6592151Z       %14 = arith.addf %arg6, %13 : tensor<4x256xf32>
2026-02-21T08:28:43.6592343Z       scf.yield %14 : tensor<4x256xf32>
2026-02-21T08:28:43.6592547Z     } {tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:28:43.6592744Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:28:43.6592933Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:28:43.6593103Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:28:43.6593291Z       tt.reduce.return %11 : f32
2026-02-21T08:28:43.6593473Z     }) : (tensor<4x256xf32>) -> tensor<4xf32>
2026-02-21T08:28:43.6593703Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:28:43.6593953Z     %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:28:43.6594182Z     tt.store %10, %8 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:28:43.6594360Z     tt.return
2026-02-21T08:28:43.6594477Z   }
2026-02-21T08:28:43.6594612Z }
2026-02-21T08:28:43.6594677Z 
2026-02-21T08:28:43.6594724Z {-#
2026-02-21T08:28:43.6594850Z   external_resources: {
2026-02-21T08:28:43.6594997Z     mlir_reproducer: {
2026-02-21T08:28:43.6599265Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:28:43.6603664Z       disable_threading: false,
2026-02-21T08:28:43.6603846Z       verify_each: true
2026-02-21T08:28:43.6603997Z     }
2026-02-21T08:28:43.6604144Z   }
2026-02-21T08:28:43.6604305Z #-}
2026-02-21T08:28:43.6604824Z /tmp/torchinductor_root/uu/cuucfqp53u36oyzbhddfv4rgl27hwc7yimcic2whdoumdwksxue4.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:28:43.6606165Z /tmp/torchinductor_root/uu/cuucfqp53u36oyzbhddfv4rgl27hwc7yimcic2whdoumdwksxue4.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:28:43.6607230Z [106s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:28:43.6608314Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'first'], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:28:43.6609279Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:28:43.6609534Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:28:44.5186198Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 18.1 configs/s
2026-02-21T08:28:52.9784791Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 118.8 configs/s
2026-02-21T08:28:53.2397662Z [116s] Generation 3 complete: 
2026-02-21T08:28:53.2402036Z error=2
2026-02-21T08:28:53.2403318Z ok=59
2026-02-21T08:28:53.2403474Z min=0.1136
2026-02-21T08:28:53.2403610Z mid=0.1238
2026-02-21T08:28:53.2403724Z max=0.3698
2026-02-21T08:28:53.2403864Z best={'block_sizes': [1024, 2],
2026-02-21T08:28:53.2404122Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:28:53.2404402Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:28:53.2404589Z  'num_stages': 5,
2026-02-21T08:28:53.2404724Z  'num_warps': 1,
2026-02-21T08:28:53.2404865Z  'pid_type': 'flat',
2026-02-21T08:28:53.2405015Z  'range_flattens': [None, False],
2026-02-21T08:28:53.2405192Z  'range_multi_buffers': [None, True],
2026-02-21T08:28:53.2405365Z  'range_num_stages': [0, 1],
2026-02-21T08:28:53.2405529Z  'range_unroll_factors': [0, 1],
2026-02-21T08:28:53.2405725Z  'range_warp_specializes': [None, True]}
2026-02-21T08:28:53.2414148Z [116s] Fitting surrogate: 317 points, 317 targets
2026-02-21T08:28:53.9812475Z [116s] Generation 4 starting: 43 neighbors, 3 active search path(s)
2026-02-21T08:28:56.0748019Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44/44 22.1 configs/s
2026-02-21T08:28:58.6039366Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 44/44 17.7 configs/s
2026-02-21T08:29:05.9807670Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 143.4 configs/s
2026-02-21T08:29:06.2051060Z [129s] Generation 4 complete: 
2026-02-21T08:29:06.2051339Z ok=46
2026-02-21T08:29:06.2051474Z min=0.1076
2026-02-21T08:29:06.2051627Z mid=0.1199
2026-02-21T08:29:06.2051750Z max=0.1894
2026-02-21T08:29:06.2053407Z best={'block_sizes': [1024, 1],
2026-02-21T08:29:06.2054093Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:29:06.2054414Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:29:06.2054615Z  'num_stages': 5,
2026-02-21T08:29:06.2054755Z  'num_warps': 1,
2026-02-21T08:29:06.2054905Z  'pid_type': 'flat',
2026-02-21T08:29:06.2055065Z  'range_flattens': [None, False],
2026-02-21T08:29:06.2055249Z  'range_multi_buffers': [None, True],
2026-02-21T08:29:06.2055427Z  'range_num_stages': [0, 2],
2026-02-21T08:29:06.2055597Z  'range_unroll_factors': [0, 1],
2026-02-21T08:29:06.2055777Z  'range_warp_specializes': [None, True]}
2026-02-21T08:29:06.2065609Z [129s] Fitting surrogate: 363 points, 363 targets
2026-02-21T08:29:07.0483884Z [130s] Generation 5 starting: 42 neighbors, 3 active search path(s)
2026-02-21T08:29:09.0796819Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43/43 41.5 configs/s
2026-02-21T08:29:11.5514359Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 43/43 17.7 configs/s
2026-02-21T08:29:18.4312394Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 147.1 configs/s
2026-02-21T08:29:18.6500814Z [141s] Generation 5 complete: 
2026-02-21T08:29:18.6504386Z ok=46
2026-02-21T08:29:18.6508276Z min=0.1076
2026-02-21T08:29:18.6511476Z mid=0.1199
2026-02-21T08:29:18.6515560Z max=0.2203
2026-02-21T08:29:18.6518859Z best={'block_sizes': [1024, 1],
2026-02-21T08:29:18.6522824Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:29:18.6524153Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:29:18.6524352Z  'num_stages': 5,
2026-02-21T08:29:18.6524496Z  'num_warps': 1,
2026-02-21T08:29:18.6524640Z  'pid_type': 'flat',
2026-02-21T08:29:18.6524804Z  'range_flattens': [None, False],
2026-02-21T08:29:18.6524985Z  'range_multi_buffers': [None, True],
2026-02-21T08:29:18.6525166Z  'range_num_stages': [0, 3],
2026-02-21T08:29:18.6525337Z  'range_unroll_factors': [0, 1],
2026-02-21T08:29:18.6525542Z  'range_warp_specializes': [None, True]}
2026-02-21T08:29:18.6525765Z [141s] Fitting surrogate: 409 points, 409 targets
2026-02-21T08:29:19.0303301Z [142s] Generation 6 starting: 20 neighbors, 2 active search path(s)
2026-02-21T08:29:20.5111784Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 33.4 configs/s
2026-02-21T08:29:21.2905523Z module {
2026-02-21T08:29:21.2907863Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:29:21.2908582Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:29:21.2908793Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:29:21.2909051Z     %cst = arith.constant dense<0.000000e+00> : tensor<4x1024xf32>
2026-02-21T08:29:21.2909296Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:29:21.2909494Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:29:21.2909728Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T08:29:21.2910372Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T08:29:21.2910573Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:29:21.2910908Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>>
2026-02-21T08:29:21.2911390Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>>
2026-02-21T08:29:21.2913044Z     %2 = tt.get_program_id x : i32
2026-02-21T08:29:21.2913288Z     %3 = arith.muli %2, %c4_i32 : i32
2026-02-21T08:29:21.2917887Z     %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:29:21.2922155Z     %5 = tt.splat %3 : i32 -> tensor<4xi32>
2026-02-21T08:29:21.2926943Z     %6 = arith.addi %5, %4 : tensor<4xi32>
2026-02-21T08:29:21.2931435Z     %7 = scf.for %arg5 = %c0_i32 to %c16384_i32 step %c1024_i32 iter_args(%arg6 = %cst) -> (tensor<4x1024xf32>)  : i32 {
2026-02-21T08:29:21.2936793Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:29:21.2938309Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:29:21.2938654Z       %13 = scf.if %arg3 -> (tensor<4x1024xf32>) {
2026-02-21T08:29:21.2939054Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32>
2026-02-21T08:29:21.2939457Z         %16 = arith.subf %12, %11 : tensor<4x1024xf32>
2026-02-21T08:29:21.2939659Z         %17 = arith.mulf %15, %16 : tensor<4x1024xf32>
2026-02-21T08:29:21.2939873Z         %18 = arith.addf %17, %cst : tensor<4x1024xf32>
2026-02-21T08:29:21.2940073Z         scf.yield %18 : tensor<4x1024xf32>
2026-02-21T08:29:21.2940257Z       } else {
2026-02-21T08:29:21.2940427Z         %15 = tt.splat %arg4 : f32 -> tensor<4x1024xf32>
2026-02-21T08:29:21.2940677Z         %16 = arith.cmpf ogt, %12, %15 : tensor<4x1024xf32>
2026-02-21T08:29:21.2940918Z         %17 = arith.cmpf une, %12, %12 : tensor<4x1024xf32>
2026-02-21T08:29:21.2941147Z         %18 = arith.ori %16, %17 : tensor<4x1024xi1>
2026-02-21T08:29:21.2941406Z         %19 = arith.select %18, %12, %15 : tensor<4x1024xi1>, tensor<4x1024xf32>
2026-02-21T08:29:21.2941662Z         %20 = math.log %19 : tensor<4x1024xf32>
2026-02-21T08:29:21.2942062Z         %21 = arith.subf %20, %11 : tensor<4x1024xf32>
2026-02-21T08:29:21.2942279Z         %22 = arith.mulf %12, %21 : tensor<4x1024xf32>
2026-02-21T08:29:21.2942502Z         %23 = arith.addf %22, %cst : tensor<4x1024xf32>
2026-02-21T08:29:21.2942707Z         scf.yield %23 : tensor<4x1024xf32>
2026-02-21T08:29:21.2942932Z       }
2026-02-21T08:29:21.2943098Z       %14 = arith.addf %arg6, %13 : tensor<4x1024xf32>
2026-02-21T08:29:21.2943286Z       scf.yield %14 : tensor<4x1024xf32>
2026-02-21T08:29:21.2943543Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:29:21.2944084Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:29:21.2944280Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:29:21.2944459Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:29:21.2944656Z       tt.reduce.return %11 : f32
2026-02-21T08:29:21.2944844Z     }) : (tensor<4x1024xf32>) -> tensor<4xf32>
2026-02-21T08:29:21.2945065Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:29:21.2945325Z     %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:29:21.2945558Z     tt.store %10, %8 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:29:21.2945747Z     tt.return
2026-02-21T08:29:21.2945875Z   }
2026-02-21T08:29:21.2946012Z }
2026-02-21T08:29:21.2946084Z 
2026-02-21T08:29:21.2946136Z {-#
2026-02-21T08:29:21.2946274Z   external_resources: {
2026-02-21T08:29:21.2946433Z     mlir_reproducer: {
2026-02-21T08:29:21.2950793Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:29:21.2955157Z       disable_threading: false,
2026-02-21T08:29:21.2955330Z       verify_each: true
2026-02-21T08:29:21.2955485Z     }
2026-02-21T08:29:21.2955605Z   }
2026-02-21T08:29:21.2955733Z #-}
2026-02-21T08:29:21.2956162Z /tmp/torchinductor_root/qh/cqhg3hekzjiacdff4zpxjzwm4m5mixo7gnd7d65mn2s4lw3jh4q3.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:29:21.2957398Z /tmp/torchinductor_root/qh/cqhg3hekzjiacdff4zpxjzwm4m5mixo7gnd7d65mn2s4lw3jh4q3.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:29:21.2958394Z [144s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:29:21.2959430Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:29:21.2960363Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:29:21.2960686Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:29:21.6948476Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 18.5 configs/s
2026-02-21T08:29:25.4521024Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 291.9 configs/s
2026-02-21T08:29:25.5820980Z [148s] Generation 6 complete: 
2026-02-21T08:29:25.5822622Z error=1
2026-02-21T08:29:25.5822780Z ok=22
2026-02-21T08:29:25.5822907Z min=0.1095
2026-02-21T08:29:25.5823047Z mid=0.1158
2026-02-21T08:29:25.5823168Z max=0.1669
2026-02-21T08:29:25.5823318Z best={'block_sizes': [1024, 1],
2026-02-21T08:29:25.5823576Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:29:25.5823866Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:29:25.5824053Z  'num_stages': 5,
2026-02-21T08:29:25.5824563Z  'num_warps': 2,
2026-02-21T08:29:25.5824736Z  'pid_type': 'flat',
2026-02-21T08:29:25.5824889Z  'range_flattens': [None, False],
2026-02-21T08:29:25.5825072Z  'range_multi_buffers': [None, True],
2026-02-21T08:29:25.5825247Z  'range_num_stages': [0, 3],
2026-02-21T08:29:25.5825417Z  'range_unroll_factors': [0, 1],
2026-02-21T08:29:25.5825591Z  'range_warp_specializes': [None, True]}
2026-02-21T08:29:25.5832889Z [148s] Fitting surrogate: 432 points, 432 targets
2026-02-21T08:29:25.7421770Z [148s] Autotuning complete in 148.7s after searching 406 configs.
2026-02-21T08:29:25.7422359Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:29:25.7423308Z     @helion.kernel(config=helion.Config(block_sizes=[1024, 1], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=5, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:29:25.7424147Z 
2026-02-21T08:29:25.7717755Z [148s] Code of selected kernel: /tmp/torchinductor_root/ev/cevbvrakdvigvaxviegoca6q5acz4nqw7apu6ysxypfng7qcpeon.py
2026-02-21T08:29:25.7718151Z from __future__ import annotations
2026-02-21T08:29:25.7719290Z 
2026-02-21T08:29:25.7719451Z import torch
2026-02-21T08:29:25.7719611Z import triton
2026-02-21T08:29:25.7719765Z import triton.language as tl
2026-02-21T08:29:25.7719964Z from torch._inductor.runtime import triton_helpers
2026-02-21T08:29:25.7720234Z from torch._inductor.runtime.triton_helpers import math as tl_math
2026-02-21T08:29:25.7720515Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T08:29:25.7720792Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T08:29:25.7720958Z 
2026-02-21T08:29:25.7721033Z _BLOCK_SIZE_1 = tl.constexpr(1)
2026-02-21T08:29:25.7721204Z _BLOCK_SIZE_0 = tl.constexpr(1024)
2026-02-21T08:29:25.7721313Z 
2026-02-21T08:29:25.7721387Z @triton.jit
2026-02-21T08:29:25.7721579Z def _helion_kl_div_forward(y_pred, y_true, loss, log_target, eps):
2026-02-21T08:29:25.7721935Z     # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m):
2026-02-21T08:29:25.7722179Z     pid_0 = tl.program_id(0)
2026-02-21T08:29:25.7722353Z     offset_1 = pid_0
2026-02-21T08:29:25.7722535Z     indices_1 = offset_1 + tl.zeros([1], tl.int32)
2026-02-21T08:29:25.7722815Z     # src[kl_div.py:90]: loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32)
2026-02-21T08:29:25.7723141Z     loss_sum = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T08:29:25.7723420Z     # src[kl_div.py:92]: for tile_v in hl.tile(V, block_size=block_size_n):
2026-02-21T08:29:25.7723740Z     # src[kl_div.py:93]:     kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32)
2026-02-21T08:29:25.7724007Z     # src[kl_div.py:92-112]: ...
2026-02-21T08:29:25.7724444Z     for offset_0 in tl.range(0, 16384, _BLOCK_SIZE_0, loop_unroll_factor=1, warp_specialize=True, num_stages=3, disallow_acc_multi_buffer=False, flatten=False):
2026-02-21T08:29:25.7725182Z         indices_0 = offset_0 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32)
2026-02-21T08:29:25.7725415Z         loss_sum_copy = loss_sum
2026-02-21T08:29:25.7725584Z         loss_sum_copy_0 = loss_sum_copy
2026-02-21T08:29:25.7725846Z         # src[kl_div.py:93]: kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32)
2026-02-21T08:29:25.7726152Z         kl_loss = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T08:29:25.7726420Z         # src[kl_div.py:95]: y_pred_val = y_pred[tile_bt, tile_v]
2026-02-21T08:29:25.7726772Z         y_pred_val = tl.load(y_pred + (indices_1[:, None] * 16384 + indices_0[None, :] * 1), None, eviction_policy='evict_first')
2026-02-21T08:29:25.7727120Z         # src[kl_div.py:96]: y_true_val = y_true[tile_bt, tile_v]
2026-02-21T08:29:25.7727545Z         y_true_val = tl.load(y_true + (indices_1[:, None] * 16384 + indices_0[None, :] * 1), None, eviction_policy='evict_first')
2026-02-21T08:29:25.7727871Z         # src[kl_div.py:98]: if log_target:
2026-02-21T08:29:25.7728131Z         # src[kl_div.py:99]:     # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space
2026-02-21T08:29:25.7728418Z         # src[kl_div.py:100]:     prob_true = torch.exp(y_true_val)
2026-02-21T08:29:25.7728635Z         # src[kl_div.py:98-106]: ...
2026-02-21T08:29:25.7728809Z         if log_target:
2026-02-21T08:29:25.7728961Z             y_true_val_copy = y_true_val
2026-02-21T08:29:25.7729144Z             y_pred_val_copy = y_pred_val
2026-02-21T08:29:25.7729316Z             kl_loss_copy = kl_loss
2026-02-21T08:29:25.7729495Z             y_true_val_copy_0 = y_true_val_copy
2026-02-21T08:29:25.7729681Z             y_pred_val_copy_0 = y_pred_val_copy
2026-02-21T08:29:25.7729867Z             kl_loss_copy_0 = kl_loss_copy
2026-02-21T08:29:25.7730070Z             # src[kl_div.py:100]: prob_true = torch.exp(y_true_val)
2026-02-21T08:29:25.7730297Z             v_0 = libdevice.exp(y_true_val_copy_0)
2026-02-21T08:29:25.7730540Z             # src[kl_div.py:101]: kl_loss += prob_true * (y_true_val - y_pred_val)
2026-02-21T08:29:25.7730789Z             v_1 = y_true_val_copy_0 - y_pred_val_copy_0
2026-02-21T08:29:25.7730979Z             v_2 = v_0 * v_1
2026-02-21T08:29:25.7731136Z             kl_loss = kl_loss_copy_0 + v_2
2026-02-21T08:29:25.7731319Z         # src[kl_div.py:98]: if log_target:
2026-02-21T08:29:25.7731562Z         # src[kl_div.py:99]:     # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space
2026-02-21T08:29:25.7731889Z         # src[kl_div.py:100]:     prob_true = torch.exp(y_true_val)
2026-02-21T08:29:25.7732117Z         # src[kl_div.py:98-106]: ...
2026-02-21T08:29:25.7732290Z         _not = not log_target
2026-02-21T08:29:25.7732456Z         if _not:
2026-02-21T08:29:25.7732603Z             y_true_val_copy_1 = y_true_val
2026-02-21T08:29:25.7732796Z             y_pred_val_copy_1 = y_pred_val
2026-02-21T08:29:25.7732977Z             kl_loss_copy_1 = kl_loss
2026-02-21T08:29:25.7733176Z             y_true_val_copy_1_0 = y_true_val_copy_1
2026-02-21T08:29:25.7733381Z             y_pred_val_copy_1_0 = y_pred_val_copy_1
2026-02-21T08:29:25.7733588Z             kl_loss_copy_1_0 = kl_loss_copy_1
2026-02-21T08:29:25.7733848Z             # src[kl_div.py:105]: log_true = torch.log(torch.clamp(y_true_val, min=eps))
2026-02-21T08:29:25.7734142Z             v_4 = triton_helpers.maximum(y_true_val_copy_1_0, eps)
2026-02-21T08:29:25.7734374Z             v_5 = tl_math.log(v_4)
2026-02-21T08:29:25.7734597Z             # src[kl_div.py:106]: kl_loss += y_true_val * (log_true - y_pred_val)
2026-02-21T08:29:25.7734844Z             v_6 = v_5 - y_pred_val_copy_1_0
2026-02-21T08:29:25.7735025Z             v_7 = y_true_val_copy_1_0 * v_6
2026-02-21T08:29:25.7735212Z             kl_loss = kl_loss_copy_1_0 + v_7
2026-02-21T08:29:25.7735411Z         # src[kl_div.py:112]: loss_sum += kl_loss
2026-02-21T08:29:25.7735609Z         loss_sum = loss_sum_copy_0 + kl_loss
2026-02-21T08:29:25.7735913Z     # src[kl_div.py:115]: loss[tile_bt] = loss_sum.sum(dim=-1)
2026-02-21T08:29:25.7736157Z     sum_1 = tl.cast(tl.sum(loss_sum, 1), tl.float32)
2026-02-21T08:29:25.7736383Z     tl.store(loss + indices_1 * 1, sum_1, None)
2026-02-21T08:29:25.7736519Z 
2026-02-21T08:29:25.7736819Z def kl_div_forward(y_pred: Tensor, y_true: Tensor, log_target: bool=False, reduction: str='batchmean', eps: float=1e-10, *, _launcher=_default_launcher):
2026-02-21T08:29:25.7737233Z     """
2026-02-21T08:29:25.7737381Z     Compute KL Divergence loss.
2026-02-21T08:29:25.7737496Z 
2026-02-21T08:29:25.7737553Z     Args:
2026-02-21T08:29:25.7737739Z         y_pred: Input predictions in log-space, shape (BT, V)
2026-02-21T08:29:25.7738036Z         y_true: Target values (probabilities or log-probabilities), shape (BT, V)
2026-02-21T08:29:25.7738386Z         log_target: If True, y_true is in log-space; if False, y_true is probabilities
2026-02-21T08:29:25.7738767Z         reduction: Reduction mode ('none', 'sum', 'mean', 'batchmean')
2026-02-21T08:29:25.7739031Z         eps: Small value to avoid numerical issues
2026-02-21T08:29:25.7739176Z 
2026-02-21T08:29:25.7739235Z     Returns:
2026-02-21T08:29:25.7739367Z         loss: KL divergence loss
2026-02-21T08:29:25.7739525Z     """
2026-02-21T08:29:25.7739659Z     # src[kl_div.py:74]: BT, V = y_pred.shape
2026-02-21T08:29:25.7739842Z     BT, V = y_pred.shape
2026-02-21T08:29:25.7740030Z     # src[kl_div.py:75]: assert y_true.shape == y_pred.shape, (
2026-02-21T08:29:25.7740298Z     # src[kl_div.py:76]:     f"Shape mismatch: {y_true.shape} != {y_pred.shape}"
2026-02-21T08:29:25.7740523Z     # src[kl_div.py:77]: )
2026-02-21T08:29:25.7740770Z     assert y_true.shape == y_pred.shape, f'Shape mismatch: {y_true.shape} != {y_pred.shape}'
2026-02-21T08:29:25.7741057Z     # src[kl_div.py:80]: if reduction == "none":
2026-02-21T08:29:25.7741272Z     # src[kl_div.py:81]:     loss = torch.zeros_like(y_pred)
2026-02-21T08:29:25.7741479Z     # src[kl_div.py:82]: else:
2026-02-21T08:29:25.7741636Z     # src[kl_div.py:80-83]: ...
2026-02-21T08:29:25.7741796Z     if reduction == 'none':
2026-02-21T08:29:25.7742005Z         # src[kl_div.py:81]: loss = torch.zeros_like(y_pred)
2026-02-21T08:29:25.7742214Z         loss = torch.zeros_like(y_pred)
2026-02-21T08:29:25.7742381Z     else:
2026-02-21T08:29:25.7742594Z         # src[kl_div.py:83]: loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device)
2026-02-21T08:29:25.7742915Z         loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device)
2026-02-21T08:29:25.7743197Z     # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m):
2026-02-21T08:29:25.7743508Z     # src[kl_div.py:90]:     loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32)
2026-02-21T08:29:25.7743757Z     # src[kl_div.py:89-115]: ...
2026-02-21T08:29:25.7744052Z     _launcher(_helion_kl_div_forward, (4096,), y_pred, y_true, loss, log_target, eps, num_warps=4, num_stages=5)
2026-02-21T08:29:25.7744387Z     # src[kl_div.py:118]: if reduction == "batchmean":
2026-02-21T08:29:25.7744619Z     # src[kl_div.py:119]:     final_loss = torch.sum(loss) / BT
2026-02-21T08:29:25.7744850Z     # src[kl_div.py:120]: elif reduction == "sum":
2026-02-21T08:29:25.7745041Z     # src[kl_div.py:118-125]: ...
2026-02-21T08:29:25.7745212Z     if reduction == 'batchmean':
2026-02-21T08:29:25.7745402Z         # src[kl_div.py:119]: final_loss = torch.sum(loss) / BT
2026-02-21T08:29:25.7745621Z         final_loss = torch.sum(loss) / BT
2026-02-21T08:29:25.7745805Z     elif reduction == 'sum':
2026-02-21T08:29:25.7745997Z         # src[kl_div.py:121]: final_loss = torch.sum(loss, dim=0)
2026-02-21T08:29:25.7746213Z         final_loss = torch.sum(loss, dim=0)
2026-02-21T08:29:25.7746389Z     elif reduction == 'mean':
2026-02-21T08:29:25.7746589Z         # src[kl_div.py:123]: final_loss = torch.sum(loss) / (BT * V)
2026-02-21T08:29:25.7746806Z         final_loss = torch.sum(loss) / (BT * V)
2026-02-21T08:29:25.7746980Z     else:
2026-02-21T08:29:25.7747113Z         # src[kl_div.py:125]: final_loss = loss
2026-02-21T08:29:25.7747357Z         final_loss = loss
2026-02-21T08:29:25.7747518Z     # src[kl_div.py:127]: return final_loss
2026-02-21T08:29:25.7747687Z     return final_loss
2026-02-21T08:29:26.8022786Z WARNING:tritonbench.utils.triton_op:Completed input ID 2:
2026-02-21T08:29:26.8026726Z (B, T, V)
2026-02-21T08:29:26.8026923Z ---------------
2026-02-21T08:29:26.8027118Z (8, 512, 16384)
2026-02-21T08:29:26.8027225Z 
2026-02-21T08:29:26.8376710Z  50%|█████     | 3/6 [07:59<07:56, 158.97s/it]
2026-02-21T08:29:26.8376710Z WARNING:tritonbench.utils.triton_op:Running input ID 3:
2026-02-21T08:29:26.8381071Z (B, T, V)
2026-02-21T08:29:26.8383066Z ---------------
2026-02-21T08:29:26.8383297Z (8, 512, 32768)
2026-02-21T08:29:26.8387587Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for torch_kl_div
2026-02-21T08:29:28.0374680Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for liger_kl_div
2026-02-21T08:29:29.1633193Z INFO:tritonbench.utils.triton_op:Took 3.97ms to get benchmark function for torch_compile_kl_div
2026-02-21T08:29:30.5049934Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:29:30.5050298Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:29:30.5050592Z               'dtype': 'torch.float32',
2026-02-21T08:29:30.5050868Z               'shape': (4096, 32768),
2026-02-21T08:29:30.5051144Z               'stride': (32768, 1)},
2026-02-21T08:29:30.5051417Z             { 'device': 'cuda:0',
2026-02-21T08:29:30.5051677Z               'dtype': 'torch.float32',
2026-02-21T08:29:30.5052298Z               'shape': (4096, 32768),
2026-02-21T08:29:30.5052566Z               'stride': (32768, 1)}),
2026-02-21T08:29:30.5052826Z   'kwargs': {}}
2026-02-21T08:29:30.5062052Z INFO:tritonbench.utils.triton_op:Took 1.64ms to get benchmark function for helion_kl_div_tritonbench
2026-02-21T08:29:30.7376594Z [0s] Autotune random seed: 2135561342
2026-02-21T08:29:30.7732030Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:30:03.6654858Z [32s] Timeout after 30s compiling Config(block_sizes=[2048, 128], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'last'], num_sm_multiplier=64, num_stages=8, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[3, 0], range_unroll_factors=[0, 4], range_warp_specializes=[True, None])
2026-02-21T08:30:03.7081163Z [32s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_stages=7, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True])
2026-02-21T08:30:04.2249891Z [33s] Timeout after 30s compiling Config(block_sizes=[2048, 8], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['first', ''], num_stages=8, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[None, None])
2026-02-21T08:30:04.3136854Z [33s] Timeout after 30s compiling Config(block_sizes=[2048, 512], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], num_sm_multiplier=16, num_stages=8, num_warps=32, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[False, False], range_num_stages=[2, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None])
2026-02-21T08:30:04.6716066Z [33s] Timeout after 30s compiling Config(block_sizes=[64, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', ''], maxnreg=256, num_sm_multiplier=128, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[True, None])
2026-02-21T08:30:05.8680310Z [35s] Timeout after 30s compiling Config(block_sizes=[128, 1024], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=32, num_sm_multiplier=16, num_stages=7, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[3, 1], range_unroll_factors=[1, 1], range_warp_specializes=[None, True])
2026-02-21T08:30:05.8695971Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s
2026-02-21T08:30:05.9579640Z module {
2026-02-21T08:30:05.9584414Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:30:05.9585658Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:30:05.9586274Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:30:05.9586512Z     %cst = arith.constant dense<32768> : tensor<16x1xi32>
2026-02-21T08:30:05.9586775Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<16x8xf32>
2026-02-21T08:30:05.9586994Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:30:05.9587180Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:30:05.9587371Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T08:30:05.9587550Z     %c32768_i64 = arith.constant 32768 : i64
2026-02-21T08:30:05.9587731Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:30:05.9588039Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<16x8xf32>>
2026-02-21T08:30:05.9588355Z     %1 = tt.get_program_id x : i32
2026-02-21T08:30:05.9588524Z     %2 = arith.muli %1, %c16_i32 : i32
2026-02-21T08:30:05.9588749Z     %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:30:05.9588989Z     %4 = tt.splat %2 : i32 -> tensor<16xi32>
2026-02-21T08:30:05.9589173Z     %5 = arith.addi %4, %3 : tensor<16xi32>
2026-02-21T08:30:05.9589481Z     %6 = scf.for %arg5 = %c0_i32 to %c32768_i32 step %c8_i32 iter_args(%arg6 = %cst_0) -> (tensor<16x8xf32>)  : i32 {
2026-02-21T08:30:05.9589820Z       %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:30:05.9590064Z       %11 = tt.splat %arg5 : i32 -> tensor<8xi32>
2026-02-21T08:30:05.9590256Z       %12 = arith.addi %11, %10 : tensor<8xi32>
2026-02-21T08:30:05.9590533Z       %13 = tt.descriptor_load %0[%2, %arg5] : !tt.tensordesc<tensor<16x8xf32>> -> tensor<16x8xf32>
2026-02-21T08:30:05.9590869Z       %14 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T08:30:05.9591118Z       %15 = arith.muli %14, %cst : tensor<16x1xi32>
2026-02-21T08:30:05.9591363Z       %16 = tt.expand_dims %12 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:30:05.9591631Z       %17 = tt.broadcast %15 : tensor<16x1xi32> -> tensor<16x8xi32>
2026-02-21T08:30:05.9591932Z       %18 = tt.broadcast %16 : tensor<1x8xi32> -> tensor<16x8xi32>
2026-02-21T08:30:05.9592162Z       %19 = arith.addi %17, %18 : tensor<16x8xi32>
2026-02-21T08:30:05.9592399Z       %20 = tt.splat %arg1 : !tt.ptr<f32> -> tensor<16x8x!tt.ptr<f32>>
2026-02-21T08:30:05.9592671Z       %21 = tt.addptr %20, %19 : tensor<16x8x!tt.ptr<f32>>, tensor<16x8xi32>
2026-02-21T08:30:05.9592957Z       %22 = tt.load %21 evictionPolicy = evict_first : tensor<16x8x!tt.ptr<f32>>
2026-02-21T08:30:05.9593220Z       %23 = scf.if %arg3 -> (tensor<16x8xf32>) {
2026-02-21T08:30:05.9593572Z         %25 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x8xf32>) -> tensor<16x8xf32>
2026-02-21T08:30:05.9593939Z         %26 = arith.subf %22, %13 : tensor<16x8xf32>
2026-02-21T08:30:05.9594148Z         %27 = arith.mulf %25, %26 : tensor<16x8xf32>
2026-02-21T08:30:05.9594352Z         %28 = arith.addf %27, %cst_0 : tensor<16x8xf32>
2026-02-21T08:30:05.9594556Z         scf.yield %28 : tensor<16x8xf32>
2026-02-21T08:30:05.9594845Z       } else {
2026-02-21T08:30:05.9595004Z         %25 = tt.splat %arg4 : f32 -> tensor<16x8xf32>
2026-02-21T08:30:05.9595213Z         %26 = arith.cmpf ogt, %22, %25 : tensor<16x8xf32>
2026-02-21T08:30:05.9595428Z         %27 = arith.cmpf une, %22, %22 : tensor<16x8xf32>
2026-02-21T08:30:05.9595621Z         %28 = arith.ori %26, %27 : tensor<16x8xi1>
2026-02-21T08:30:05.9595853Z         %29 = arith.select %28, %22, %25 : tensor<16x8xi1>, tensor<16x8xf32>
2026-02-21T08:30:05.9596086Z         %30 = math.log %29 : tensor<16x8xf32>
2026-02-21T08:30:05.9596272Z         %31 = arith.subf %30, %13 : tensor<16x8xf32>
2026-02-21T08:30:05.9596468Z         %32 = arith.mulf %22, %31 : tensor<16x8xf32>
2026-02-21T08:30:05.9596662Z         %33 = arith.addf %32, %cst_0 : tensor<16x8xf32>
2026-02-21T08:30:05.9596856Z         scf.yield %33 : tensor<16x8xf32>
2026-02-21T08:30:05.9597016Z       }
2026-02-21T08:30:05.9597162Z       %24 = arith.addf %arg6, %23 : tensor<16x8xf32>
2026-02-21T08:30:05.9597412Z       scf.yield %24 : tensor<16x8xf32>
2026-02-21T08:30:05.9597590Z     } {tt.warp_specialize}
2026-02-21T08:30:05.9597759Z     %7 = "tt.reduce"(%6) <{axis = 1 : i32}> ({
2026-02-21T08:30:05.9597934Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:30:05.9598107Z       %10 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:30:05.9598281Z       tt.reduce.return %10 : f32
2026-02-21T08:30:05.9598457Z     }) : (tensor<16x8xf32>) -> tensor<16xf32>
2026-02-21T08:30:05.9598669Z     %8 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:30:05.9598922Z     %9 = tt.addptr %8, %5 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:30:05.9599160Z     tt.store %9, %7 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:30:05.9599332Z     tt.return
2026-02-21T08:30:05.9599458Z   }
2026-02-21T08:30:05.9599571Z }
2026-02-21T08:30:05.9599636Z 
2026-02-21T08:30:05.9599693Z {-#
2026-02-21T08:30:05.9599817Z   external_resources: {
2026-02-21T08:30:05.9599973Z     mlir_reproducer: {
2026-02-21T08:30:05.9604313Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:30:05.9608697Z       disable_threading: false,
2026-02-21T08:30:05.9608879Z       verify_each: true
2026-02-21T08:30:05.9609029Z     }
2026-02-21T08:30:05.9609163Z   }
2026-02-21T08:30:05.9609294Z #-}
2026-02-21T08:30:05.9609791Z /tmp/torchinductor_root/hc/chc5bnnpsov5qsmmbqdu5bumoylsvdxz3pcs4sl2kb2c3a5myqiv.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:30:05.9611199Z /tmp/torchinductor_root/hc/chc5bnnpsov5qsmmbqdu5bumoylsvdxz3pcs4sl2kb2c3a5myqiv.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:30:05.9612320Z [35s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:30:05.9613428Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 16], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=6, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:30:05.9614373Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:30:05.9614619Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:30:11.4559625Z module attributes {ttg.maxnreg = 64 : i32} {
2026-02-21T08:30:11.4560538Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:30:11.4561338Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:30:11.4561588Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:30:11.4561812Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:30:11.4562186Z     %cst = arith.constant dense<0.000000e+00> : tensor<64x16xf32>
2026-02-21T08:30:11.4562410Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:30:11.4562604Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:30:11.4562836Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T08:30:11.4563028Z     %c32768_i64 = arith.constant 32768 : i64
2026-02-21T08:30:11.4563208Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:30:11.4563512Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<64x16xf32>>
2026-02-21T08:30:11.4563944Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<64x16xf32>>
2026-02-21T08:30:11.4564244Z     %2 = tt.get_program_id x : i32
2026-02-21T08:30:11.4564441Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:30:11.4564620Z     %4 = arith.minsi %3, %c64_i32 : i32
2026-02-21T08:30:11.4564791Z     %5 = arith.subi %4, %2 : i32
2026-02-21T08:30:11.4564965Z     %c1_i32_0 = arith.constant 1 : i32
2026-02-21T08:30:11.4565139Z     %6 = arith.subi %c1_i32, %c1_i32_0 : i32
2026-02-21T08:30:11.4565317Z     %7 = arith.addi %5, %6 : i32
2026-02-21T08:30:11.4565476Z     %8 = arith.divui %7, %c1_i32 : i32
2026-02-21T08:30:11.4565660Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T08:30:11.4565826Z     %9 = arith.remsi %8, %c3_i32 : i32
2026-02-21T08:30:11.4565995Z     %10 = arith.subi %8, %9 : i32
2026-02-21T08:30:11.4566159Z     %11 = arith.muli %10, %c1_i32 : i32
2026-02-21T08:30:11.4566333Z     %12 = arith.addi %2, %11 : i32
2026-02-21T08:30:11.4566505Z     %13 = arith.muli %c1_i32, %c3_i32 : i32
2026-02-21T08:30:11.4566692Z     scf.for %arg5 = %2 to %12 step %13  : i32 {
2026-02-21T08:30:11.4566891Z       %14 = arith.muli %arg5, %c64_i32 : i32
2026-02-21T08:30:11.4567115Z       %15 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:30:11.4567371Z       %16 = tt.splat %14 : i32 -> tensor<64xi32>
2026-02-21T08:30:11.4567560Z       %17 = arith.addi %16, %15 : tensor<64xi32>
2026-02-21T08:30:11.4567871Z       %18 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c16_i32 iter_args(%arg7 = %cst) -> (tensor<64x16xf32>)  : i32 {
2026-02-21T08:30:11.4568275Z         %42 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<64x16xf32>> -> tensor<64x16xf32>
2026-02-21T08:30:11.4568964Z         %43 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc<tensor<64x16xf32>> -> tensor<64x16xf32>
2026-02-21T08:30:11.4569262Z         %44 = scf.if %arg3 -> (tensor<64x16xf32>) {
2026-02-21T08:30:11.4569623Z           %46 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x16xf32>) -> tensor<64x16xf32>
2026-02-21T08:30:11.4569998Z           %47 = arith.subf %43, %42 : tensor<64x16xf32>
2026-02-21T08:30:11.4570210Z           %48 = arith.mulf %46, %47 : tensor<64x16xf32>
2026-02-21T08:30:11.4570412Z           %49 = arith.addf %48, %cst : tensor<64x16xf32>
2026-02-21T08:30:11.4570615Z           scf.yield %49 : tensor<64x16xf32>
2026-02-21T08:30:11.4570780Z         } else {
2026-02-21T08:30:11.4570942Z           %46 = tt.splat %arg4 : f32 -> tensor<64x16xf32>
2026-02-21T08:30:11.4571158Z           %47 = arith.cmpf ogt, %43, %46 : tensor<64x16xf32>
2026-02-21T08:30:11.4571503Z           %48 = arith.cmpf une, %43, %43 : tensor<64x16xf32>
2026-02-21T08:30:11.4571718Z           %49 = arith.ori %47, %48 : tensor<64x16xi1>
2026-02-21T08:30:11.4571997Z           %50 = arith.select %49, %43, %46 : tensor<64x16xi1>, tensor<64x16xf32>
2026-02-21T08:30:11.4572239Z           %51 = math.log %50 : tensor<64x16xf32>
2026-02-21T08:30:11.4572429Z           %52 = arith.subf %51, %42 : tensor<64x16xf32>
2026-02-21T08:30:11.4572637Z           %53 = arith.mulf %43, %52 : tensor<64x16xf32>
2026-02-21T08:30:11.4572839Z           %54 = arith.addf %53, %cst : tensor<64x16xf32>
2026-02-21T08:30:11.4573038Z           scf.yield %54 : tensor<64x16xf32>
2026-02-21T08:30:11.4573208Z         }
2026-02-21T08:30:11.4573359Z         %45 = arith.addf %arg7, %44 : tensor<64x16xf32>
2026-02-21T08:30:11.4573553Z         scf.yield %45 : tensor<64x16xf32>
2026-02-21T08:30:11.4573750Z       } {tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:30:11.4573963Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:30:11.4574152Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:11.4574330Z         %42 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:30:11.4574506Z         tt.reduce.return %42 : f32
2026-02-21T08:30:11.4574687Z       }) : (tensor<64x16xf32>) -> tensor<64xf32>
2026-02-21T08:30:11.4574908Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
2026-02-21T08:30:11.4575166Z       %21 = tt.addptr %20, %17 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
2026-02-21T08:30:11.4575400Z       tt.store %21, %19 : tensor<64x!tt.ptr<f32>>
2026-02-21T08:30:11.4575587Z       %c1_i32_1 = arith.constant 1 : i32
2026-02-21T08:30:11.4575774Z       %22 = arith.muli %c1_i32, %c1_i32_1 : i32
2026-02-21T08:30:11.4575953Z       %23 = arith.addi %arg5, %22 : i32
2026-02-21T08:30:11.4576132Z       %24 = arith.muli %23, %c64_i32 : i32
2026-02-21T08:30:11.4576347Z       %25 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:30:11.4576587Z       %26 = tt.splat %24 : i32 -> tensor<64xi32>
2026-02-21T08:30:11.4576779Z       %27 = arith.addi %26, %25 : tensor<64xi32>
2026-02-21T08:30:11.4577075Z       %28 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c16_i32 iter_args(%arg7 = %cst) -> (tensor<64x16xf32>)  : i32 {
2026-02-21T08:30:11.4577467Z         %42 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc<tensor<64x16xf32>> -> tensor<64x16xf32>
2026-02-21T08:30:11.4577819Z         %43 = tt.descriptor_load %1[%24, %arg6] : !tt.tensordesc<tensor<64x16xf32>> -> tensor<64x16xf32>
2026-02-21T08:30:11.4578103Z         %44 = scf.if %arg3 -> (tensor<64x16xf32>) {
2026-02-21T08:30:11.4578458Z           %46 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x16xf32>) -> tensor<64x16xf32>
2026-02-21T08:30:11.4578813Z           %47 = arith.subf %43, %42 : tensor<64x16xf32>
2026-02-21T08:30:11.4579025Z           %48 = arith.mulf %46, %47 : tensor<64x16xf32>
2026-02-21T08:30:11.4579227Z           %49 = arith.addf %48, %cst : tensor<64x16xf32>
2026-02-21T08:30:11.4579496Z           scf.yield %49 : tensor<64x16xf32>
2026-02-21T08:30:11.4579658Z         } else {
2026-02-21T08:30:11.4579819Z           %46 = tt.splat %arg4 : f32 -> tensor<64x16xf32>
2026-02-21T08:30:11.4580034Z           %47 = arith.cmpf ogt, %43, %46 : tensor<64x16xf32>
2026-02-21T08:30:11.4580250Z           %48 = arith.cmpf une, %43, %43 : tensor<64x16xf32>
2026-02-21T08:30:11.4580456Z           %49 = arith.ori %47, %48 : tensor<64x16xi1>
2026-02-21T08:30:11.4580683Z           %50 = arith.select %49, %43, %46 : tensor<64x16xi1>, tensor<64x16xf32>
2026-02-21T08:30:11.4580920Z           %51 = math.log %50 : tensor<64x16xf32>
2026-02-21T08:30:11.4581111Z           %52 = arith.subf %51, %42 : tensor<64x16xf32>
2026-02-21T08:30:11.4581312Z           %53 = arith.mulf %43, %52 : tensor<64x16xf32>
2026-02-21T08:30:11.4581513Z           %54 = arith.addf %53, %cst : tensor<64x16xf32>
2026-02-21T08:30:11.4581735Z           scf.yield %54 : tensor<64x16xf32>
2026-02-21T08:30:11.4582137Z         }
2026-02-21T08:30:11.4582289Z         %45 = arith.addf %arg7, %44 : tensor<64x16xf32>
2026-02-21T08:30:11.4582476Z         scf.yield %45 : tensor<64x16xf32>
2026-02-21T08:30:11.4582678Z       } {tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:30:11.4582875Z       %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({
2026-02-21T08:30:11.4583064Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:11.4583241Z         %42 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:30:11.4583417Z         tt.reduce.return %42 : f32
2026-02-21T08:30:11.4583601Z       }) : (tensor<64x16xf32>) -> tensor<64xf32>
2026-02-21T08:30:11.4583815Z       %30 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
2026-02-21T08:30:11.4584074Z       %31 = tt.addptr %30, %27 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
2026-02-21T08:30:11.4584298Z       tt.store %31, %29 : tensor<64x!tt.ptr<f32>>
2026-02-21T08:30:11.4584493Z       %c2_i32 = arith.constant 2 : i32
2026-02-21T08:30:11.4584680Z       %32 = arith.muli %c1_i32, %c2_i32 : i32
2026-02-21T08:30:11.4584858Z       %33 = arith.addi %arg5, %32 : i32
2026-02-21T08:30:11.4585035Z       %34 = arith.muli %33, %c64_i32 : i32
2026-02-21T08:30:11.4585247Z       %35 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:30:11.4585484Z       %36 = tt.splat %34 : i32 -> tensor<64xi32>
2026-02-21T08:30:11.4585663Z       %37 = arith.addi %36, %35 : tensor<64xi32>
2026-02-21T08:30:11.4585991Z       %38 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c16_i32 iter_args(%arg7 = %cst) -> (tensor<64x16xf32>)  : i32 {
2026-02-21T08:30:11.4586390Z         %42 = tt.descriptor_load %0[%34, %arg6] : !tt.tensordesc<tensor<64x16xf32>> -> tensor<64x16xf32>
2026-02-21T08:30:11.4586750Z         %43 = tt.descriptor_load %1[%34, %arg6] : !tt.tensordesc<tensor<64x16xf32>> -> tensor<64x16xf32>
2026-02-21T08:30:11.4587038Z         %44 = scf.if %arg3 -> (tensor<64x16xf32>) {
2026-02-21T08:30:11.4587395Z           %46 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x16xf32>) -> tensor<64x16xf32>
2026-02-21T08:30:11.4587768Z           %47 = arith.subf %43, %42 : tensor<64x16xf32>
2026-02-21T08:30:11.4587969Z           %48 = arith.mulf %46, %47 : tensor<64x16xf32>
2026-02-21T08:30:11.4588174Z           %49 = arith.addf %48, %cst : tensor<64x16xf32>
2026-02-21T08:30:11.4588372Z           scf.yield %49 : tensor<64x16xf32>
2026-02-21T08:30:11.4588537Z         } else {
2026-02-21T08:30:11.4588699Z           %46 = tt.splat %arg4 : f32 -> tensor<64x16xf32>
2026-02-21T08:30:11.4588914Z           %47 = arith.cmpf ogt, %43, %46 : tensor<64x16xf32>
2026-02-21T08:30:11.4589147Z           %48 = arith.cmpf une, %43, %43 : tensor<64x16xf32>
2026-02-21T08:30:11.4589346Z           %49 = arith.ori %47, %48 : tensor<64x16xi1>
2026-02-21T08:30:11.4589579Z           %50 = arith.select %49, %43, %46 : tensor<64x16xi1>, tensor<64x16xf32>
2026-02-21T08:30:11.4589814Z           %51 = math.log %50 : tensor<64x16xf32>
2026-02-21T08:30:11.4590003Z           %52 = arith.subf %51, %42 : tensor<64x16xf32>
2026-02-21T08:30:11.4590261Z           %53 = arith.mulf %43, %52 : tensor<64x16xf32>
2026-02-21T08:30:11.4590455Z           %54 = arith.addf %53, %cst : tensor<64x16xf32>
2026-02-21T08:30:11.4590651Z           scf.yield %54 : tensor<64x16xf32>
2026-02-21T08:30:11.4590813Z         }
2026-02-21T08:30:11.4590961Z         %45 = arith.addf %arg7, %44 : tensor<64x16xf32>
2026-02-21T08:30:11.4591147Z         scf.yield %45 : tensor<64x16xf32>
2026-02-21T08:30:11.4591347Z       } {tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:30:11.4591577Z       %39 = "tt.reduce"(%38) <{axis = 1 : i32}> ({
2026-02-21T08:30:11.4591759Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:11.4591972Z         %42 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:30:11.4592158Z         tt.reduce.return %42 : f32
2026-02-21T08:30:11.4592338Z       }) : (tensor<64x16xf32>) -> tensor<64xf32>
2026-02-21T08:30:11.4592568Z       %40 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
2026-02-21T08:30:11.4592921Z       %41 = tt.addptr %40, %37 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
2026-02-21T08:30:11.4593162Z       tt.store %41, %39 : tensor<64x!tt.ptr<f32>>
2026-02-21T08:30:11.4593347Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:30:11.4593529Z     scf.for %arg5 = %12 to %4 step %c1_i32  : i32 {
2026-02-21T08:30:11.4593728Z       %14 = arith.muli %arg5, %c64_i32 : i32
2026-02-21T08:30:11.4593943Z       %15 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:30:11.4594180Z       %16 = tt.splat %14 : i32 -> tensor<64xi32>
2026-02-21T08:30:11.4594363Z       %17 = arith.addi %16, %15 : tensor<64xi32>
2026-02-21T08:30:11.4594667Z       %18 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c16_i32 iter_args(%arg7 = %cst) -> (tensor<64x16xf32>)  : i32 {
2026-02-21T08:30:11.4595057Z         %22 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<64x16xf32>> -> tensor<64x16xf32>
2026-02-21T08:30:11.4595437Z         %23 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc<tensor<64x16xf32>> -> tensor<64x16xf32>
2026-02-21T08:30:11.4595725Z         %24 = scf.if %arg3 -> (tensor<64x16xf32>) {
2026-02-21T08:30:11.4596074Z           %26 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x16xf32>) -> tensor<64x16xf32>
2026-02-21T08:30:11.4596436Z           %27 = arith.subf %23, %22 : tensor<64x16xf32>
2026-02-21T08:30:11.4596636Z           %28 = arith.mulf %26, %27 : tensor<64x16xf32>
2026-02-21T08:30:11.4596845Z           %29 = arith.addf %28, %cst : tensor<64x16xf32>
2026-02-21T08:30:11.4597043Z           scf.yield %29 : tensor<64x16xf32>
2026-02-21T08:30:11.4597206Z         } else {
2026-02-21T08:30:11.4597366Z           %26 = tt.splat %arg4 : f32 -> tensor<64x16xf32>
2026-02-21T08:30:11.4597578Z           %27 = arith.cmpf ogt, %23, %26 : tensor<64x16xf32>
2026-02-21T08:30:11.4597799Z           %28 = arith.cmpf une, %23, %23 : tensor<64x16xf32>
2026-02-21T08:30:11.4598006Z           %29 = arith.ori %27, %28 : tensor<64x16xi1>
2026-02-21T08:30:11.4598246Z           %30 = arith.select %29, %23, %26 : tensor<64x16xi1>, tensor<64x16xf32>
2026-02-21T08:30:11.4598475Z           %31 = math.log %30 : tensor<64x16xf32>
2026-02-21T08:30:11.4598670Z           %32 = arith.subf %31, %22 : tensor<64x16xf32>
2026-02-21T08:30:11.4598867Z           %33 = arith.mulf %23, %32 : tensor<64x16xf32>
2026-02-21T08:30:11.4599062Z           %34 = arith.addf %33, %cst : tensor<64x16xf32>
2026-02-21T08:30:11.4599256Z           scf.yield %34 : tensor<64x16xf32>
2026-02-21T08:30:11.4599418Z         }
2026-02-21T08:30:11.4599562Z         %25 = arith.addf %arg7, %24 : tensor<64x16xf32>
2026-02-21T08:30:11.4599746Z         scf.yield %25 : tensor<64x16xf32>
2026-02-21T08:30:11.4599943Z       } {tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:30:11.4600143Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:30:11.4600324Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:11.4600499Z         %22 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:30:11.4600736Z         tt.reduce.return %22 : f32
2026-02-21T08:30:11.4600918Z       }) : (tensor<64x16xf32>) -> tensor<64xf32>
2026-02-21T08:30:11.4601134Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
2026-02-21T08:30:11.4601390Z       %21 = tt.addptr %20, %17 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
2026-02-21T08:30:11.4601611Z       tt.store %21, %19 : tensor<64x!tt.ptr<f32>>
2026-02-21T08:30:11.4601801Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:30:11.4601990Z     tt.return
2026-02-21T08:30:11.4602113Z   }
2026-02-21T08:30:11.4602234Z }
2026-02-21T08:30:11.4602301Z 
2026-02-21T08:30:11.4602350Z {-#
2026-02-21T08:30:11.4602482Z   external_resources: {
2026-02-21T08:30:11.4602630Z     mlir_reproducer: {
2026-02-21T08:30:11.4607014Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:30:11.4611555Z       disable_threading: false,
2026-02-21T08:30:11.4611759Z       verify_each: true
2026-02-21T08:30:11.4611955Z     }
2026-02-21T08:30:11.4612105Z   }
2026-02-21T08:30:11.4612262Z #-}
2026-02-21T08:30:11.4612791Z /tmp/torchinductor_root/2l/c2lby4mn655jcmwxwsdua65gj62bm6qvop2az7nfbzwxlpel5xur.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:30:11.4614185Z /tmp/torchinductor_root/2l/c2lby4mn655jcmwxwsdua65gj62bm6qvop2az7nfbzwxlpel5xur.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:30:11.4615327Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:30:11.4616503Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'first'], maxnreg=64, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[3, 1], range_unroll_factors=[3, 0], range_warp_specializes=[False, True]), static_shapes=True)
2026-02-21T08:30:11.4617572Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:30:11.4617821Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:30:11.4921674Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:30:11.4925311Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:30:11.4930261Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:30:11.4934840Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:30:11.4938752Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:30:11.4943854Z     %c592_i32 = arith.constant 592 : i32
2026-02-21T08:30:11.4947124Z     %cst = arith.constant dense<0.000000e+00> : tensor<128x16xf32>
2026-02-21T08:30:11.4950301Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:30:11.4955501Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:30:11.4955790Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T08:30:11.4959945Z     %c32768_i64 = arith.constant 32768 : i64
2026-02-21T08:30:11.4964140Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:30:11.4964612Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<128x16xf32>>
2026-02-21T08:30:11.4965428Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<128x16xf32>>
2026-02-21T08:30:11.4970340Z     %2 = tt.get_program_id x : i32
2026-02-21T08:30:11.4974912Z     scf.for %arg5 = %2 to %c32_i32 step %c592_i32  : i32 {
2026-02-21T08:30:11.4978670Z       %3 = arith.muli %arg5, %c128_i32 : i32
2026-02-21T08:30:11.4983177Z       %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:30:11.4985110Z       %5 = tt.splat %3 : i32 -> tensor<128xi32>
2026-02-21T08:30:11.4985353Z       %6 = arith.addi %5, %4 : tensor<128xi32>
2026-02-21T08:30:11.4985669Z       %7 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c16_i32 iter_args(%arg7 = %cst) -> (tensor<128x16xf32>)  : i32 {
2026-02-21T08:30:11.4986089Z         %11 = tt.descriptor_load %0[%3, %arg6] : !tt.tensordesc<tensor<128x16xf32>> -> tensor<128x16xf32>
2026-02-21T08:30:11.4986464Z         %12 = tt.descriptor_load %1[%3, %arg6] : !tt.tensordesc<tensor<128x16xf32>> -> tensor<128x16xf32>
2026-02-21T08:30:11.4986749Z         %13 = scf.if %arg3 -> (tensor<128x16xf32>) {
2026-02-21T08:30:11.4987118Z           %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x16xf32>) -> tensor<128x16xf32>
2026-02-21T08:30:11.4987477Z           %16 = arith.subf %12, %11 : tensor<128x16xf32>
2026-02-21T08:30:11.4987685Z           %17 = arith.mulf %15, %16 : tensor<128x16xf32>
2026-02-21T08:30:11.4987888Z           %18 = arith.addf %17, %cst : tensor<128x16xf32>
2026-02-21T08:30:11.4988091Z           scf.yield %18 : tensor<128x16xf32>
2026-02-21T08:30:11.4988266Z         } else {
2026-02-21T08:30:11.4988425Z           %15 = tt.splat %arg4 : f32 -> tensor<128x16xf32>
2026-02-21T08:30:11.4988654Z           %16 = arith.cmpf ogt, %12, %15 : tensor<128x16xf32>
2026-02-21T08:30:11.4988873Z           %17 = arith.cmpf une, %12, %12 : tensor<128x16xf32>
2026-02-21T08:30:11.4989085Z           %18 = arith.ori %16, %17 : tensor<128x16xi1>
2026-02-21T08:30:11.4989321Z           %19 = arith.select %18, %12, %15 : tensor<128x16xi1>, tensor<128x16xf32>
2026-02-21T08:30:11.4989570Z           %20 = math.log %19 : tensor<128x16xf32>
2026-02-21T08:30:11.4989774Z           %21 = arith.subf %20, %11 : tensor<128x16xf32>
2026-02-21T08:30:11.4989976Z           %22 = arith.mulf %12, %21 : tensor<128x16xf32>
2026-02-21T08:30:11.4990183Z           %23 = arith.addf %22, %cst : tensor<128x16xf32>
2026-02-21T08:30:11.4990376Z           scf.yield %23 : tensor<128x16xf32>
2026-02-21T08:30:11.4990561Z         }
2026-02-21T08:30:11.4990702Z         %14 = arith.addf %arg7, %13 : tensor<128x16xf32>
2026-02-21T08:30:11.4990897Z         scf.yield %14 : tensor<128x16xf32>
2026-02-21T08:30:11.4991069Z       }
2026-02-21T08:30:11.4991204Z       %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:30:11.4991612Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:11.4991786Z         %11 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:30:11.4992055Z         tt.reduce.return %11 : f32
2026-02-21T08:30:11.4992237Z       }) : (tensor<128x16xf32>) -> tensor<128xf32>
2026-02-21T08:30:11.4992469Z       %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<128x!tt.ptr<f32>>
2026-02-21T08:30:11.4992727Z       %10 = tt.addptr %9, %6 : tensor<128x!tt.ptr<f32>>, tensor<128xi32>
2026-02-21T08:30:11.4992966Z       tt.store %10, %8 : tensor<128x!tt.ptr<f32>>
2026-02-21T08:30:11.4993203Z     } {tt.flatten, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:30:11.4993409Z     tt.return
2026-02-21T08:30:11.4993538Z   }
2026-02-21T08:30:11.4993650Z }
2026-02-21T08:30:11.4993725Z 
2026-02-21T08:30:11.4993773Z {-#
2026-02-21T08:30:11.4993895Z   external_resources: {
2026-02-21T08:30:11.4994054Z     mlir_reproducer: {
2026-02-21T08:30:11.4998455Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:30:11.5002708Z       disable_threading: false,
2026-02-21T08:30:11.5002872Z       verify_each: true
2026-02-21T08:30:11.5003008Z     }
2026-02-21T08:30:11.5003124Z   }
2026-02-21T08:30:11.5003229Z #-}
2026-02-21T08:30:11.5003643Z /tmp/torchinductor_root/kv/ckvgqnmj4yegtswbnyslacdbdlyyi26tqlv5jhjogcicpk66757j.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:30:11.5004821Z /tmp/torchinductor_root/kv/ckvgqnmj4yegtswbnyslacdbdlyyi26tqlv5jhjogcicpk66757j.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:30:11.5005776Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:30:11.5006848Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=32, num_sm_multiplier=4, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:30:11.5007858Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:30:11.5008103Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:30:11.5236647Z module attributes {ttg.maxnreg = 64 : i32} {
2026-02-21T08:30:11.5238497Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:30:11.5239134Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:30:11.5243343Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:30:11.5247943Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:30:11.5251840Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:30:11.5253774Z     %cst = arith.constant dense<0.000000e+00> : tensor<4x128xf32>
2026-02-21T08:30:11.5254021Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:30:11.5254407Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:30:11.5254649Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T08:30:11.5258402Z     %c32768_i64 = arith.constant 32768 : i64
2026-02-21T08:30:11.5262927Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:30:11.5266929Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<4x128xf32>>
2026-02-21T08:30:11.5268309Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<4x128xf32>>
2026-02-21T08:30:11.5268664Z     %2 = tt.get_program_id x : i32
2026-02-21T08:30:11.5268848Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:30:11.5269046Z     %4 = arith.minsi %3, %c1024_i32 : i32
2026-02-21T08:30:11.5269248Z     scf.for %arg5 = %2 to %4 step %c1_i32  : i32 {
2026-02-21T08:30:11.5269457Z       %5 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:30:11.5269699Z       %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:30:11.5269942Z       %7 = tt.splat %5 : i32 -> tensor<4xi32>
2026-02-21T08:30:11.5270133Z       %8 = arith.addi %7, %6 : tensor<4xi32>
2026-02-21T08:30:11.5270435Z       %9 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c128_i32 iter_args(%arg7 = %cst) -> (tensor<4x128xf32>)  : i32 {
2026-02-21T08:30:11.5270842Z         %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc<tensor<4x128xf32>> -> tensor<4x128xf32>
2026-02-21T08:30:11.5271227Z         %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc<tensor<4x128xf32>> -> tensor<4x128xf32>
2026-02-21T08:30:11.5271519Z         %15 = scf.if %arg3 -> (tensor<4x128xf32>) {
2026-02-21T08:30:11.5271998Z           %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x128xf32>) -> tensor<4x128xf32>
2026-02-21T08:30:11.5272564Z           %18 = arith.subf %14, %13 : tensor<4x128xf32>
2026-02-21T08:30:11.5272883Z           %19 = arith.mulf %17, %18 : tensor<4x128xf32>
2026-02-21T08:30:11.5273142Z           %20 = arith.addf %19, %cst : tensor<4x128xf32>
2026-02-21T08:30:11.5273370Z           scf.yield %20 : tensor<4x128xf32>
2026-02-21T08:30:11.5273544Z         } else {
2026-02-21T08:30:11.5273721Z           %17 = tt.splat %arg4 : f32 -> tensor<4x128xf32>
2026-02-21T08:30:11.5273977Z           %18 = arith.cmpf ogt, %14, %17 : tensor<4x128xf32>
2026-02-21T08:30:11.5274202Z           %19 = arith.cmpf une, %14, %14 : tensor<4x128xf32>
2026-02-21T08:30:11.5274424Z           %20 = arith.ori %18, %19 : tensor<4x128xi1>
2026-02-21T08:30:11.5274668Z           %21 = arith.select %20, %14, %17 : tensor<4x128xi1>, tensor<4x128xf32>
2026-02-21T08:30:11.5274918Z           %22 = math.log %21 : tensor<4x128xf32>
2026-02-21T08:30:11.5275123Z           %23 = arith.subf %22, %13 : tensor<4x128xf32>
2026-02-21T08:30:11.5275326Z           %24 = arith.mulf %14, %23 : tensor<4x128xf32>
2026-02-21T08:30:11.5275539Z           %25 = arith.addf %24, %cst : tensor<4x128xf32>
2026-02-21T08:30:11.5275741Z           scf.yield %25 : tensor<4x128xf32>
2026-02-21T08:30:11.5276169Z         }
2026-02-21T08:30:11.5276319Z         %16 = arith.addf %arg7, %15 : tensor<4x128xf32>
2026-02-21T08:30:11.5276524Z         scf.yield %16 : tensor<4x128xf32>
2026-02-21T08:30:11.5276817Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:30:11.5277121Z       %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({
2026-02-21T08:30:11.5277323Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:11.5277503Z         %13 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:30:11.5277698Z         tt.reduce.return %13 : f32
2026-02-21T08:30:11.5277887Z       }) : (tensor<4x128xf32>) -> tensor<4xf32>
2026-02-21T08:30:11.5278123Z       %11 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:30:11.5278385Z       %12 = tt.addptr %11, %8 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:30:11.5278624Z       tt.store %12, %10 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:30:11.5279011Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32}
2026-02-21T08:30:11.5279318Z     tt.return
2026-02-21T08:30:11.5279465Z   }
2026-02-21T08:30:11.5279584Z }
2026-02-21T08:30:11.5279662Z 
2026-02-21T08:30:11.5279713Z {-#
2026-02-21T08:30:11.5279842Z   external_resources: {
2026-02-21T08:30:11.5280001Z     mlir_reproducer: {
2026-02-21T08:30:11.5284444Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:30:11.5288774Z       disable_threading: false,
2026-02-21T08:30:11.5288943Z       verify_each: true
2026-02-21T08:30:11.5289080Z     }
2026-02-21T08:30:11.5289198Z   }
2026-02-21T08:30:11.5289303Z #-}
2026-02-21T08:30:11.5289718Z /tmp/torchinductor_root/cw/ccwb7dn2gwrggtpkebrg3wqmhspmdsywjh5dp3wmoysq6jmdayeb.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:30:11.5290897Z /tmp/torchinductor_root/cw/ccwb7dn2gwrggtpkebrg3wqmhspmdsywjh5dp3wmoysq6jmdayeb.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:30:11.5291914Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:30:11.5293183Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', ''], maxnreg=64, num_sm_multiplier=16, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, False], range_num_stages=[2, 1], range_unroll_factors=[1, 0], range_warp_specializes=[False, True]), static_shapes=True)
2026-02-21T08:30:11.5294162Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:30:11.5294408Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:30:12.7843000Z module {
2026-02-21T08:30:12.7843638Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:30:12.7846018Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:30:12.7846243Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:30:12.7846471Z     %cst = arith.constant dense<0.000000e+00> : tensor<64x64xf32>
2026-02-21T08:30:12.7846705Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:30:12.7846891Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:30:12.7847069Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T08:30:12.7847252Z     %c32768_i64 = arith.constant 32768 : i64
2026-02-21T08:30:12.7847424Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:30:12.7847752Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<64x64xf32>>
2026-02-21T08:30:12.7849153Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<64x64xf32>>
2026-02-21T08:30:12.7849473Z     %2 = tt.get_program_id x : i32
2026-02-21T08:30:12.7849654Z     %3 = arith.subi %c64_i32, %2 : i32
2026-02-21T08:30:12.7849863Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:30:12.7853963Z     %4 = arith.subi %c9472_i32, %c1_i32 : i32
2026-02-21T08:30:12.7854248Z     %5 = arith.addi %3, %4 : i32
2026-02-21T08:30:12.7858626Z     %6 = arith.divui %5, %c9472_i32 : i32
2026-02-21T08:30:12.7864036Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T08:30:12.7867307Z     %7 = arith.remsi %6, %c3_i32 : i32
2026-02-21T08:30:12.7870712Z     %8 = arith.subi %6, %7 : i32
2026-02-21T08:30:12.7872673Z     %9 = arith.muli %8, %c9472_i32 : i32
2026-02-21T08:30:12.7872903Z     %10 = arith.addi %2, %9 : i32
2026-02-21T08:30:12.7873089Z     %11 = arith.muli %c9472_i32, %c3_i32 : i32
2026-02-21T08:30:12.7873296Z     scf.for %arg5 = %2 to %10 step %11  : i32 {
2026-02-21T08:30:12.7873490Z       %12 = arith.muli %arg5, %c64_i32 : i32
2026-02-21T08:30:12.7873726Z       %13 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:30:12.7873975Z       %14 = tt.splat %12 : i32 -> tensor<64xi32>
2026-02-21T08:30:12.7874185Z       %15 = arith.addi %14, %13 : tensor<64xi32>
2026-02-21T08:30:12.7874509Z       %16 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c64_i32 iter_args(%arg7 = %cst) -> (tensor<64x64xf32>)  : i32 {
2026-02-21T08:30:12.7874907Z         %40 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc<tensor<64x64xf32>> -> tensor<64x64xf32>
2026-02-21T08:30:12.7875273Z         %41 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc<tensor<64x64xf32>> -> tensor<64x64xf32>
2026-02-21T08:30:12.7875551Z         %42 = scf.if %arg3 -> (tensor<64x64xf32>) {
2026-02-21T08:30:12.7875921Z           %44 = tt.extern_elementwise %41 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x64xf32>) -> tensor<64x64xf32>
2026-02-21T08:30:12.7876297Z           %45 = arith.subf %41, %40 : tensor<64x64xf32>
2026-02-21T08:30:12.7876507Z           %46 = arith.mulf %44, %45 : tensor<64x64xf32>
2026-02-21T08:30:12.7876728Z           %47 = arith.addf %46, %cst : tensor<64x64xf32>
2026-02-21T08:30:12.7876929Z           scf.yield %47 : tensor<64x64xf32>
2026-02-21T08:30:12.7877110Z         } else {
2026-02-21T08:30:12.7877588Z           %44 = tt.splat %arg4 : f32 -> tensor<64x64xf32>
2026-02-21T08:30:12.7877827Z           %45 = arith.cmpf ogt, %41, %44 : tensor<64x64xf32>
2026-02-21T08:30:12.7878061Z           %46 = arith.cmpf une, %41, %41 : tensor<64x64xf32>
2026-02-21T08:30:12.7878274Z           %47 = arith.ori %45, %46 : tensor<64x64xi1>
2026-02-21T08:30:12.7878525Z           %48 = arith.select %47, %41, %44 : tensor<64x64xi1>, tensor<64x64xf32>
2026-02-21T08:30:12.7878769Z           %49 = math.log %48 : tensor<64x64xf32>
2026-02-21T08:30:12.7878970Z           %50 = arith.subf %49, %40 : tensor<64x64xf32>
2026-02-21T08:30:12.7879166Z           %51 = arith.mulf %41, %50 : tensor<64x64xf32>
2026-02-21T08:30:12.7879372Z           %52 = arith.addf %51, %cst : tensor<64x64xf32>
2026-02-21T08:30:12.7879569Z           scf.yield %52 : tensor<64x64xf32>
2026-02-21T08:30:12.7879733Z         }
2026-02-21T08:30:12.7879949Z         %43 = arith.addf %arg7, %42 : tensor<64x64xf32>
2026-02-21T08:30:12.7880147Z         scf.yield %43 : tensor<64x64xf32>
2026-02-21T08:30:12.7880435Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:30:12.7880710Z       %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({
2026-02-21T08:30:12.7880894Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:12.7881073Z         %40 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:30:12.7881258Z         tt.reduce.return %40 : f32
2026-02-21T08:30:12.7881437Z       }) : (tensor<64x64xf32>) -> tensor<64xf32>
2026-02-21T08:30:12.7881665Z       %18 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
2026-02-21T08:30:12.7882073Z       %19 = tt.addptr %18, %15 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
2026-02-21T08:30:12.7882310Z       tt.store %19, %17 : tensor<64x!tt.ptr<f32>>
2026-02-21T08:30:12.7882500Z       %c1_i32_0 = arith.constant 1 : i32
2026-02-21T08:30:12.7882691Z       %20 = arith.muli %c9472_i32, %c1_i32_0 : i32
2026-02-21T08:30:12.7882878Z       %21 = arith.addi %arg5, %20 : i32
2026-02-21T08:30:12.7883059Z       %22 = arith.muli %21, %c64_i32 : i32
2026-02-21T08:30:12.7883282Z       %23 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:30:12.7883511Z       %24 = tt.splat %22 : i32 -> tensor<64xi32>
2026-02-21T08:30:12.7883703Z       %25 = arith.addi %24, %23 : tensor<64xi32>
2026-02-21T08:30:12.7884000Z       %26 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c64_i32 iter_args(%arg7 = %cst) -> (tensor<64x64xf32>)  : i32 {
2026-02-21T08:30:12.7884397Z         %40 = tt.descriptor_load %0[%22, %arg6] : !tt.tensordesc<tensor<64x64xf32>> -> tensor<64x64xf32>
2026-02-21T08:30:12.7884751Z         %41 = tt.descriptor_load %1[%22, %arg6] : !tt.tensordesc<tensor<64x64xf32>> -> tensor<64x64xf32>
2026-02-21T08:30:12.7885026Z         %42 = scf.if %arg3 -> (tensor<64x64xf32>) {
2026-02-21T08:30:12.7885381Z           %44 = tt.extern_elementwise %41 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x64xf32>) -> tensor<64x64xf32>
2026-02-21T08:30:12.7885737Z           %45 = arith.subf %41, %40 : tensor<64x64xf32>
2026-02-21T08:30:12.7885943Z           %46 = arith.mulf %44, %45 : tensor<64x64xf32>
2026-02-21T08:30:12.7886143Z           %47 = arith.addf %46, %cst : tensor<64x64xf32>
2026-02-21T08:30:12.7886339Z           scf.yield %47 : tensor<64x64xf32>
2026-02-21T08:30:12.7886507Z         } else {
2026-02-21T08:30:12.7886660Z           %44 = tt.splat %arg4 : f32 -> tensor<64x64xf32>
2026-02-21T08:30:12.7886877Z           %45 = arith.cmpf ogt, %41, %44 : tensor<64x64xf32>
2026-02-21T08:30:12.7887089Z           %46 = arith.cmpf une, %41, %41 : tensor<64x64xf32>
2026-02-21T08:30:12.7887300Z           %47 = arith.ori %45, %46 : tensor<64x64xi1>
2026-02-21T08:30:12.7887528Z           %48 = arith.select %47, %41, %44 : tensor<64x64xi1>, tensor<64x64xf32>
2026-02-21T08:30:12.7887769Z           %49 = math.log %48 : tensor<64x64xf32>
2026-02-21T08:30:12.7887962Z           %50 = arith.subf %49, %40 : tensor<64x64xf32>
2026-02-21T08:30:12.7888161Z           %51 = arith.mulf %41, %50 : tensor<64x64xf32>
2026-02-21T08:30:12.7888449Z           %52 = arith.addf %51, %cst : tensor<64x64xf32>
2026-02-21T08:30:12.7888645Z           scf.yield %52 : tensor<64x64xf32>
2026-02-21T08:30:12.7888820Z         }
2026-02-21T08:30:12.7888966Z         %43 = arith.addf %arg7, %42 : tensor<64x64xf32>
2026-02-21T08:30:12.7889169Z         scf.yield %43 : tensor<64x64xf32>
2026-02-21T08:30:12.7889424Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:30:12.7889705Z       %27 = "tt.reduce"(%26) <{axis = 1 : i32}> ({
2026-02-21T08:30:12.7889906Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:12.7890083Z         %40 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:30:12.7890273Z         tt.reduce.return %40 : f32
2026-02-21T08:30:12.7890453Z       }) : (tensor<64x64xf32>) -> tensor<64xf32>
2026-02-21T08:30:12.7890685Z       %28 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
2026-02-21T08:30:12.7890993Z       %29 = tt.addptr %28, %25 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
2026-02-21T08:30:12.7891235Z       tt.store %29, %27 : tensor<64x!tt.ptr<f32>>
2026-02-21T08:30:12.7891432Z       %c2_i32 = arith.constant 2 : i32
2026-02-21T08:30:12.7891608Z       %30 = arith.muli %c9472_i32, %c2_i32 : i32
2026-02-21T08:30:12.7891792Z       %31 = arith.addi %arg5, %30 : i32
2026-02-21T08:30:12.7892001Z       %32 = arith.muli %31, %c64_i32 : i32
2026-02-21T08:30:12.7892226Z       %33 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:30:12.7892458Z       %34 = tt.splat %32 : i32 -> tensor<64xi32>
2026-02-21T08:30:12.7892649Z       %35 = arith.addi %34, %33 : tensor<64xi32>
2026-02-21T08:30:12.7892946Z       %36 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c64_i32 iter_args(%arg7 = %cst) -> (tensor<64x64xf32>)  : i32 {
2026-02-21T08:30:12.7893341Z         %40 = tt.descriptor_load %0[%32, %arg6] : !tt.tensordesc<tensor<64x64xf32>> -> tensor<64x64xf32>
2026-02-21T08:30:12.7893705Z         %41 = tt.descriptor_load %1[%32, %arg6] : !tt.tensordesc<tensor<64x64xf32>> -> tensor<64x64xf32>
2026-02-21T08:30:12.7893988Z         %42 = scf.if %arg3 -> (tensor<64x64xf32>) {
2026-02-21T08:30:12.7894348Z           %44 = tt.extern_elementwise %41 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x64xf32>) -> tensor<64x64xf32>
2026-02-21T08:30:12.7894704Z           %45 = arith.subf %41, %40 : tensor<64x64xf32>
2026-02-21T08:30:12.7894911Z           %46 = arith.mulf %44, %45 : tensor<64x64xf32>
2026-02-21T08:30:12.7895117Z           %47 = arith.addf %46, %cst : tensor<64x64xf32>
2026-02-21T08:30:12.7895307Z           scf.yield %47 : tensor<64x64xf32>
2026-02-21T08:30:12.7895477Z         } else {
2026-02-21T08:30:12.7895627Z           %44 = tt.splat %arg4 : f32 -> tensor<64x64xf32>
2026-02-21T08:30:12.7895843Z           %45 = arith.cmpf ogt, %41, %44 : tensor<64x64xf32>
2026-02-21T08:30:12.7896054Z           %46 = arith.cmpf une, %41, %41 : tensor<64x64xf32>
2026-02-21T08:30:12.7896265Z           %47 = arith.ori %45, %46 : tensor<64x64xi1>
2026-02-21T08:30:12.7896502Z           %48 = arith.select %47, %41, %44 : tensor<64x64xi1>, tensor<64x64xf32>
2026-02-21T08:30:12.7896734Z           %49 = math.log %48 : tensor<64x64xf32>
2026-02-21T08:30:12.7896953Z           %50 = arith.subf %49, %40 : tensor<64x64xf32>
2026-02-21T08:30:12.7897156Z           %51 = arith.mulf %41, %50 : tensor<64x64xf32>
2026-02-21T08:30:12.7897353Z           %52 = arith.addf %51, %cst : tensor<64x64xf32>
2026-02-21T08:30:12.7897548Z           scf.yield %52 : tensor<64x64xf32>
2026-02-21T08:30:12.7897710Z         }
2026-02-21T08:30:12.7897855Z         %43 = arith.addf %arg7, %42 : tensor<64x64xf32>
2026-02-21T08:30:12.7898039Z         scf.yield %43 : tensor<64x64xf32>
2026-02-21T08:30:12.7898288Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:30:12.7898552Z       %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({
2026-02-21T08:30:12.7898733Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:12.7898970Z         %40 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:30:12.7899145Z         tt.reduce.return %40 : f32
2026-02-21T08:30:12.7899325Z       }) : (tensor<64x64xf32>) -> tensor<64xf32>
2026-02-21T08:30:12.7899549Z       %38 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
2026-02-21T08:30:12.7899824Z       %39 = tt.addptr %38, %35 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
2026-02-21T08:30:12.7900071Z       tt.store %39, %37 : tensor<64x!tt.ptr<f32>>
2026-02-21T08:30:12.7900265Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:30:12.7900473Z     scf.for %arg5 = %10 to %c64_i32 step %c9472_i32  : i32 {
2026-02-21T08:30:12.7900690Z       %12 = arith.muli %arg5, %c64_i32 : i32
2026-02-21T08:30:12.7900926Z       %13 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:30:12.7901165Z       %14 = tt.splat %12 : i32 -> tensor<64xi32>
2026-02-21T08:30:12.7901366Z       %15 = arith.addi %14, %13 : tensor<64xi32>
2026-02-21T08:30:12.7901737Z       %16 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c64_i32 iter_args(%arg7 = %cst) -> (tensor<64x64xf32>)  : i32 {
2026-02-21T08:30:12.7902182Z         %20 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc<tensor<64x64xf32>> -> tensor<64x64xf32>
2026-02-21T08:30:12.7902563Z         %21 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc<tensor<64x64xf32>> -> tensor<64x64xf32>
2026-02-21T08:30:12.7902858Z         %22 = scf.if %arg3 -> (tensor<64x64xf32>) {
2026-02-21T08:30:12.7903235Z           %24 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x64xf32>) -> tensor<64x64xf32>
2026-02-21T08:30:12.7903609Z           %25 = arith.subf %21, %20 : tensor<64x64xf32>
2026-02-21T08:30:12.7903821Z           %26 = arith.mulf %24, %25 : tensor<64x64xf32>
2026-02-21T08:30:12.7904036Z           %27 = arith.addf %26, %cst : tensor<64x64xf32>
2026-02-21T08:30:12.7904235Z           scf.yield %27 : tensor<64x64xf32>
2026-02-21T08:30:12.7904414Z         } else {
2026-02-21T08:30:12.7904578Z           %24 = tt.splat %arg4 : f32 -> tensor<64x64xf32>
2026-02-21T08:30:12.7904811Z           %25 = arith.cmpf ogt, %21, %24 : tensor<64x64xf32>
2026-02-21T08:30:12.7905033Z           %26 = arith.cmpf une, %21, %21 : tensor<64x64xf32>
2026-02-21T08:30:12.7905253Z           %27 = arith.ori %25, %26 : tensor<64x64xi1>
2026-02-21T08:30:12.7905501Z           %28 = arith.select %27, %21, %24 : tensor<64x64xi1>, tensor<64x64xf32>
2026-02-21T08:30:12.7905745Z           %29 = math.log %28 : tensor<64x64xf32>
2026-02-21T08:30:12.7905951Z           %30 = arith.subf %29, %20 : tensor<64x64xf32>
2026-02-21T08:30:12.7906153Z           %31 = arith.mulf %21, %30 : tensor<64x64xf32>
2026-02-21T08:30:12.7906367Z           %32 = arith.addf %31, %cst : tensor<64x64xf32>
2026-02-21T08:30:12.7906562Z           scf.yield %32 : tensor<64x64xf32>
2026-02-21T08:30:12.7906740Z         }
2026-02-21T08:30:12.7906886Z         %23 = arith.addf %arg7, %22 : tensor<64x64xf32>
2026-02-21T08:30:12.7907099Z         scf.yield %23 : tensor<64x64xf32>
2026-02-21T08:30:12.7907353Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:30:12.7907611Z       %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({
2026-02-21T08:30:12.7907802Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:12.7907971Z         %20 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:30:12.7908157Z         tt.reduce.return %20 : f32
2026-02-21T08:30:12.7908335Z       }) : (tensor<64x64xf32>) -> tensor<64xf32>
2026-02-21T08:30:12.7908562Z       %18 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
2026-02-21T08:30:12.7908819Z       %19 = tt.addptr %18, %15 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
2026-02-21T08:30:12.7909047Z       tt.store %19, %17 : tensor<64x!tt.ptr<f32>>
2026-02-21T08:30:12.7909236Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:30:12.7909386Z     tt.return
2026-02-21T08:30:12.7909512Z   }
2026-02-21T08:30:12.7909625Z }
2026-02-21T08:30:12.7909698Z 
2026-02-21T08:30:12.7909747Z {-#
2026-02-21T08:30:12.7909870Z   external_resources: {
2026-02-21T08:30:12.7910086Z     mlir_reproducer: {
2026-02-21T08:30:12.7914407Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:30:12.7918788Z       disable_threading: false,
2026-02-21T08:30:12.7918956Z       verify_each: true
2026-02-21T08:30:12.7919095Z     }
2026-02-21T08:30:12.7919221Z   }
2026-02-21T08:30:12.7919333Z #-}
2026-02-21T08:30:12.7919740Z /tmp/torchinductor_root/4e/c4e2f3hrdolelqb36u3a232yzjx6227te325rbipesnksnfjniyl.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:30:12.7920924Z /tmp/torchinductor_root/4e/c4e2f3hrdolelqb36u3a232yzjx6227te325rbipesnksnfjniyl.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:30:12.7921926Z [42s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:30:12.7923019Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, False], range_num_stages=[4, 3], range_unroll_factors=[3, 0], range_warp_specializes=[False, True]), static_shapes=True)
2026-02-21T08:30:12.7924004Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:30:12.7924252Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:30:13.5490679Z module {
2026-02-21T08:30:13.5491613Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:30:13.5492631Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:30:13.5492927Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:30:13.5493296Z     %cst = arith.constant dense<0.000000e+00> : tensor<128x32xf32>
2026-02-21T08:30:13.5493665Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:30:13.5493999Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:30:13.5494636Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T08:30:13.5494943Z     %c32768_i64 = arith.constant 32768 : i64
2026-02-21T08:30:13.5495233Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:30:13.5495745Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<128x32xf32>>
2026-02-21T08:30:13.5496493Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<128x32xf32>>
2026-02-21T08:30:13.5496998Z     %2 = tt.get_program_id x : i32
2026-02-21T08:30:13.5497273Z     %3 = arith.muli %2, %c128_i32 : i32
2026-02-21T08:30:13.5497628Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:30:13.5498027Z     %5 = tt.splat %3 : i32 -> tensor<128xi32>
2026-02-21T08:30:13.5498334Z     %6 = arith.addi %5, %4 : tensor<128xi32>
2026-02-21T08:30:13.5498980Z     %7 = scf.for %arg5 = %c0_i32 to %c32768_i32 step %c32_i32 iter_args(%arg6 = %cst) -> (tensor<128x32xf32>)  : i32 {
2026-02-21T08:30:13.5499668Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<128x32xf32>> -> tensor<128x32xf32>
2026-02-21T08:30:13.5500297Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<128x32xf32>> -> tensor<128x32xf32>
2026-02-21T08:30:13.5500768Z       %13 = scf.if %arg3 -> (tensor<128x32xf32>) {
2026-02-21T08:30:13.5501386Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x32xf32>) -> tensor<128x32xf32>
2026-02-21T08:30:13.5502045Z         %16 = arith.subf %12, %11 : tensor<128x32xf32>
2026-02-21T08:30:13.5502379Z         %17 = arith.mulf %15, %16 : tensor<128x32xf32>
2026-02-21T08:30:13.5502713Z         %18 = arith.addf %17, %cst : tensor<128x32xf32>
2026-02-21T08:30:13.5503036Z         scf.yield %18 : tensor<128x32xf32>
2026-02-21T08:30:13.5503311Z       } else {
2026-02-21T08:30:13.5503560Z         %15 = tt.splat %arg4 : f32 -> tensor<128x32xf32>
2026-02-21T08:30:13.5503930Z         %16 = arith.cmpf ogt, %12, %15 : tensor<128x32xf32>
2026-02-21T08:30:13.5504287Z         %17 = arith.cmpf une, %12, %12 : tensor<128x32xf32>
2026-02-21T08:30:13.5504634Z         %18 = arith.ori %16, %17 : tensor<128x32xi1>
2026-02-21T08:30:13.5505025Z         %19 = arith.select %18, %12, %15 : tensor<128x32xi1>, tensor<128x32xf32>
2026-02-21T08:30:13.5505427Z         %20 = math.log %19 : tensor<128x32xf32>
2026-02-21T08:30:13.5505751Z         %21 = arith.subf %20, %11 : tensor<128x32xf32>
2026-02-21T08:30:13.5506076Z         %22 = arith.mulf %12, %21 : tensor<128x32xf32>
2026-02-21T08:30:13.5506412Z         %23 = arith.addf %22, %cst : tensor<128x32xf32>
2026-02-21T08:30:13.5506721Z         scf.yield %23 : tensor<128x32xf32>
2026-02-21T08:30:13.5506995Z       }
2026-02-21T08:30:13.5507211Z       %14 = arith.addf %arg6, %13 : tensor<128x32xf32>
2026-02-21T08:30:13.5507526Z       scf.yield %14 : tensor<128x32xf32>
2026-02-21T08:30:13.5507939Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:30:13.5508376Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:30:13.5508673Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:30:13.5508943Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:30:13.5509237Z       tt.reduce.return %11 : f32
2026-02-21T08:30:13.5509519Z     }) : (tensor<128x32xf32>) -> tensor<128xf32>
2026-02-21T08:30:13.5509890Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<128x!tt.ptr<f32>>
2026-02-21T08:30:13.5510312Z     %10 = tt.addptr %9, %6 : tensor<128x!tt.ptr<f32>>, tensor<128xi32>
2026-02-21T08:30:13.5510695Z     tt.store %10, %8 : tensor<128x!tt.ptr<f32>>
2026-02-21T08:30:13.5510982Z     tt.return
2026-02-21T08:30:13.5511165Z   }
2026-02-21T08:30:13.5511340Z }
2026-02-21T08:30:13.5511443Z 
2026-02-21T08:30:13.5511514Z {-#
2026-02-21T08:30:13.5511709Z   external_resources: {
2026-02-21T08:30:13.5511983Z     mlir_reproducer: {
2026-02-21T08:30:13.5519659Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:30:13.5527550Z       disable_threading: false,
2026-02-21T08:30:13.5527812Z       verify_each: true
2026-02-21T08:30:13.5528033Z     }
2026-02-21T08:30:13.5528200Z   }
2026-02-21T08:30:13.5528368Z #-}
2026-02-21T08:30:13.5529073Z /tmp/torchinductor_root/3y/c3ylosf5o3jtxcrwzpobw5iszyv4zvcreo5umlfav3tb2yj6e6zh.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:30:13.5531186Z /tmp/torchinductor_root/3y/c3ylosf5o3jtxcrwzpobw5iszyv4zvcreo5umlfav3tb2yj6e6zh.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:30:13.5532966Z [42s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:30:13.5534659Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=4, num_warps=32, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:30:13.5536172Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:30:13.5536593Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:30:15.7120605Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:30:15.7121319Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:30:15.7122000Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:30:15.7122190Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:30:15.7122373Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:30:15.7122585Z     %cst = arith.constant dense<0.000000e+00> : tensor<64x128xf32>
2026-02-21T08:30:15.7122819Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:30:15.7122997Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:30:15.7123225Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T08:30:15.7123727Z     %c32768_i64 = arith.constant 32768 : i64
2026-02-21T08:30:15.7123903Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:30:15.7124218Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<64x128xf32>>
2026-02-21T08:30:15.7124649Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<64x128xf32>>
2026-02-21T08:30:15.7124966Z     %2 = tt.get_program_id x : i32
2026-02-21T08:30:15.7125139Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:30:15.7125321Z     %4 = arith.minsi %3, %c64_i32 : i32
2026-02-21T08:30:15.7125519Z     scf.for %arg5 = %2 to %4 step %c1_i32  : i32 {
2026-02-21T08:30:15.7125711Z       %5 = arith.muli %arg5, %c64_i32 : i32
2026-02-21T08:30:15.7125940Z       %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:30:15.7126183Z       %7 = tt.splat %5 : i32 -> tensor<64xi32>
2026-02-21T08:30:15.7126469Z       %8 = arith.addi %7, %6 : tensor<64xi32>
2026-02-21T08:30:15.7126658Z       %c256_i32 = arith.constant 256 : i32
2026-02-21T08:30:15.7126970Z       %9 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c256_i32 iter_args(%arg7 = %cst) -> (tensor<64x128xf32>)  : i32 {
2026-02-21T08:30:15.7127376Z         %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc<tensor<64x128xf32>> -> tensor<64x128xf32>
2026-02-21T08:30:15.7127743Z         %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc<tensor<64x128xf32>> -> tensor<64x128xf32>
2026-02-21T08:30:15.7128040Z         %15 = scf.if %arg3 -> (tensor<64x128xf32>) {
2026-02-21T08:30:15.7128409Z           %23 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32>
2026-02-21T08:30:15.7128793Z           %24 = arith.subf %14, %13 : tensor<64x128xf32>
2026-02-21T08:30:15.7129018Z           %25 = arith.mulf %23, %24 : tensor<64x128xf32>
2026-02-21T08:30:15.7129229Z           %26 = arith.addf %25, %cst : tensor<64x128xf32>
2026-02-21T08:30:15.7129431Z           scf.yield %26 : tensor<64x128xf32>
2026-02-21T08:30:15.7129596Z         } else {
2026-02-21T08:30:15.7129760Z           %23 = tt.splat %arg4 : f32 -> tensor<64x128xf32>
2026-02-21T08:30:15.7129973Z           %24 = arith.cmpf ogt, %14, %23 : tensor<64x128xf32>
2026-02-21T08:30:15.7130194Z           %25 = arith.cmpf une, %14, %14 : tensor<64x128xf32>
2026-02-21T08:30:15.7130400Z           %26 = arith.ori %24, %25 : tensor<64x128xi1>
2026-02-21T08:30:15.7130643Z           %27 = arith.select %26, %14, %23 : tensor<64x128xi1>, tensor<64x128xf32>
2026-02-21T08:30:15.7130884Z           %28 = math.log %27 : tensor<64x128xf32>
2026-02-21T08:30:15.7131076Z           %29 = arith.subf %28, %13 : tensor<64x128xf32>
2026-02-21T08:30:15.7131280Z           %30 = arith.mulf %14, %29 : tensor<64x128xf32>
2026-02-21T08:30:15.7131479Z           %31 = arith.addf %30, %cst : tensor<64x128xf32>
2026-02-21T08:30:15.7131679Z           scf.yield %31 : tensor<64x128xf32>
2026-02-21T08:30:15.7131842Z         }
2026-02-21T08:30:15.7132108Z         %16 = arith.addf %arg7, %15 : tensor<64x128xf32>
2026-02-21T08:30:15.7132305Z         %c1_i32_0 = arith.constant 1 : i32
2026-02-21T08:30:15.7132523Z         %17 = arith.muli %c128_i32, %c1_i32_0 : i32
2026-02-21T08:30:15.7132709Z         %18 = arith.addi %arg6, %17 : i32
2026-02-21T08:30:15.7132991Z         %19 = tt.descriptor_load %0[%5, %18] : !tt.tensordesc<tensor<64x128xf32>> -> tensor<64x128xf32>
2026-02-21T08:30:15.7133353Z         %20 = tt.descriptor_load %1[%5, %18] : !tt.tensordesc<tensor<64x128xf32>> -> tensor<64x128xf32>
2026-02-21T08:30:15.7133636Z         %21 = scf.if %arg3 -> (tensor<64x128xf32>) {
2026-02-21T08:30:15.7134006Z           %23 = tt.extern_elementwise %20 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32>
2026-02-21T08:30:15.7134362Z           %24 = arith.subf %20, %19 : tensor<64x128xf32>
2026-02-21T08:30:15.7134569Z           %25 = arith.mulf %23, %24 : tensor<64x128xf32>
2026-02-21T08:30:15.7134858Z           %26 = arith.addf %25, %cst : tensor<64x128xf32>
2026-02-21T08:30:15.7135052Z           scf.yield %26 : tensor<64x128xf32>
2026-02-21T08:30:15.7135231Z         } else {
2026-02-21T08:30:15.7135395Z           %23 = tt.splat %arg4 : f32 -> tensor<64x128xf32>
2026-02-21T08:30:15.7135619Z           %24 = arith.cmpf ogt, %20, %23 : tensor<64x128xf32>
2026-02-21T08:30:15.7135840Z           %25 = arith.cmpf une, %20, %20 : tensor<64x128xf32>
2026-02-21T08:30:15.7136057Z           %26 = arith.ori %24, %25 : tensor<64x128xi1>
2026-02-21T08:30:15.7136295Z           %27 = arith.select %26, %20, %23 : tensor<64x128xi1>, tensor<64x128xf32>
2026-02-21T08:30:15.7136544Z           %28 = math.log %27 : tensor<64x128xf32>
2026-02-21T08:30:15.7136747Z           %29 = arith.subf %28, %19 : tensor<64x128xf32>
2026-02-21T08:30:15.7136947Z           %30 = arith.mulf %20, %29 : tensor<64x128xf32>
2026-02-21T08:30:15.7137214Z           %31 = arith.addf %30, %cst : tensor<64x128xf32>
2026-02-21T08:30:15.7137410Z           scf.yield %31 : tensor<64x128xf32>
2026-02-21T08:30:15.7137577Z         }
2026-02-21T08:30:15.7137714Z         %22 = arith.addf %16, %21 : tensor<64x128xf32>
2026-02-21T08:30:15.7137906Z         scf.yield %22 : tensor<64x128xf32>
2026-02-21T08:30:15.7138092Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:30:15.7138278Z       %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({
2026-02-21T08:30:15.7138462Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:15.7138632Z         %13 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:30:15.7138813Z         tt.reduce.return %13 : f32
2026-02-21T08:30:15.7138987Z       }) : (tensor<64x128xf32>) -> tensor<64xf32>
2026-02-21T08:30:15.7139216Z       %11 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
2026-02-21T08:30:15.7139470Z       %12 = tt.addptr %11, %8 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
2026-02-21T08:30:15.7139703Z       tt.store %12, %10 : tensor<64x!tt.ptr<f32>>
2026-02-21T08:30:15.7140051Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:30:15.7140379Z     tt.return
2026-02-21T08:30:15.7140507Z   }
2026-02-21T08:30:15.7140621Z }
2026-02-21T08:30:15.7140695Z 
2026-02-21T08:30:15.7140743Z {-#
2026-02-21T08:30:15.7140863Z   external_resources: {
2026-02-21T08:30:15.7141019Z     mlir_reproducer: {
2026-02-21T08:30:15.7145338Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:30:15.7149891Z       disable_threading: false,
2026-02-21T08:30:15.7150068Z       verify_each: true
2026-02-21T08:30:15.7150211Z     }
2026-02-21T08:30:15.7150335Z   }
2026-02-21T08:30:15.7150447Z #-}
2026-02-21T08:30:15.7150875Z /tmp/torchinductor_root/4f/c4fdvapk55kjtm4nftfpzewslzr3q745k423xfsdvr6boffxkbgg.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:30:15.7152135Z /tmp/torchinductor_root/4f/c4fdvapk55kjtm4nftfpzewslzr3q745k423xfsdvr6boffxkbgg.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:30:15.7153205Z [44s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:30:15.7154368Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', ''], maxnreg=32, num_sm_multiplier=64, num_stages=8, num_warps=4, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[1, 1], range_unroll_factors=[1, 2], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:30:15.7155409Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:30:15.7155694Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:30:17.5356182Z module {
2026-02-21T08:30:17.5360904Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:30:17.5364501Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:30:17.5368940Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:30:17.5372449Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:30:17.5372787Z     %cst = arith.constant dense<0.000000e+00> : tensor<1024x4xf32>
2026-02-21T08:30:17.5373049Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:30:17.5373258Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:30:17.5373445Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T08:30:17.5373635Z     %c32768_i64 = arith.constant 32768 : i64
2026-02-21T08:30:17.5373833Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:30:17.5378140Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<1024x4xf32>>
2026-02-21T08:30:17.5382714Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<1024x4xf32>>
2026-02-21T08:30:17.5384388Z     %2 = tt.get_program_id x : i32
2026-02-21T08:30:17.5384625Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:30:17.5384820Z     %4 = arith.minsi %3, %c4_i32 : i32
2026-02-21T08:30:17.5385017Z     scf.for %arg5 = %2 to %4 step %c1_i32  : i32 {
2026-02-21T08:30:17.5385228Z       %5 = arith.muli %arg5, %c1024_i32 : i32
2026-02-21T08:30:17.5385465Z       %6 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T08:30:17.5385712Z       %7 = tt.splat %5 : i32 -> tensor<1024xi32>
2026-02-21T08:30:17.5385908Z       %8 = arith.addi %7, %6 : tensor<1024xi32>
2026-02-21T08:30:17.5386210Z       %9 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<1024x4xf32>)  : i32 {
2026-02-21T08:30:17.5386610Z         %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc<tensor<1024x4xf32>> -> tensor<1024x4xf32>
2026-02-21T08:30:17.5386966Z         %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc<tensor<1024x4xf32>> -> tensor<1024x4xf32>
2026-02-21T08:30:17.5387262Z         %15 = scf.if %arg3 -> (tensor<1024x4xf32>) {
2026-02-21T08:30:17.5387973Z           %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1024x4xf32>) -> tensor<1024x4xf32>
2026-02-21T08:30:17.5388336Z           %18 = arith.subf %14, %13 : tensor<1024x4xf32>
2026-02-21T08:30:17.5388548Z           %19 = arith.mulf %17, %18 : tensor<1024x4xf32>
2026-02-21T08:30:17.5388751Z           %20 = arith.addf %19, %cst : tensor<1024x4xf32>
2026-02-21T08:30:17.5388956Z           scf.yield %20 : tensor<1024x4xf32>
2026-02-21T08:30:17.5389128Z         } else {
2026-02-21T08:30:17.5389286Z           %17 = tt.splat %arg4 : f32 -> tensor<1024x4xf32>
2026-02-21T08:30:17.5389511Z           %18 = arith.cmpf ogt, %14, %17 : tensor<1024x4xf32>
2026-02-21T08:30:17.5389722Z           %19 = arith.cmpf une, %14, %14 : tensor<1024x4xf32>
2026-02-21T08:30:17.5389930Z           %20 = arith.ori %18, %19 : tensor<1024x4xi1>
2026-02-21T08:30:17.5390257Z           %21 = arith.select %20, %14, %17 : tensor<1024x4xi1>, tensor<1024x4xf32>
2026-02-21T08:30:17.5390510Z           %22 = math.log %21 : tensor<1024x4xf32>
2026-02-21T08:30:17.5390702Z           %23 = arith.subf %22, %13 : tensor<1024x4xf32>
2026-02-21T08:30:17.5390906Z           %24 = arith.mulf %14, %23 : tensor<1024x4xf32>
2026-02-21T08:30:17.5391117Z           %25 = arith.addf %24, %cst : tensor<1024x4xf32>
2026-02-21T08:30:17.5391311Z           scf.yield %25 : tensor<1024x4xf32>
2026-02-21T08:30:17.5391507Z         }
2026-02-21T08:30:17.5391656Z         %16 = arith.addf %arg7, %15 : tensor<1024x4xf32>
2026-02-21T08:30:17.5391954Z         scf.yield %16 : tensor<1024x4xf32>
2026-02-21T08:30:17.5392149Z       } {tt.num_stages = 1 : i32}
2026-02-21T08:30:17.5392331Z       %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({
2026-02-21T08:30:17.5392528Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:30:17.5392706Z         %13 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:30:17.5392901Z         tt.reduce.return %13 : f32
2026-02-21T08:30:17.5393089Z       }) : (tensor<1024x4xf32>) -> tensor<1024xf32>
2026-02-21T08:30:17.5393332Z       %11 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<1024x!tt.ptr<f32>>
2026-02-21T08:30:17.5393592Z       %12 = tt.addptr %11, %8 : tensor<1024x!tt.ptr<f32>>, tensor<1024xi32>
2026-02-21T08:30:17.5393838Z       tt.store %12, %10 : tensor<1024x!tt.ptr<f32>>
2026-02-21T08:30:17.5394084Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.warp_specialize}
2026-02-21T08:30:17.5394301Z     tt.return
2026-02-21T08:30:17.5394434Z   }
2026-02-21T08:30:17.5394547Z }
2026-02-21T08:30:17.5394620Z 
2026-02-21T08:30:17.5394668Z {-#
2026-02-21T08:30:17.5394791Z   external_resources: {
2026-02-21T08:30:17.5394946Z     mlir_reproducer: {
2026-02-21T08:30:17.5399171Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:30:17.5403750Z       disable_threading: false,
2026-02-21T08:30:17.5403930Z       verify_each: true
2026-02-21T08:30:17.5404075Z     }
2026-02-21T08:30:17.5404204Z   }
2026-02-21T08:30:17.5404310Z #-}
2026-02-21T08:30:17.5404723Z /tmp/torchinductor_root/4m/c4moxtqczhzhvxwf6zuzg4uslqz5jjoud3lscas2hcnwmlxlqjkw.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:30:17.5405986Z /tmp/torchinductor_root/4m/c4moxtqczhzhvxwf6zuzg4uslqz5jjoud3lscas2hcnwmlxlqjkw.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:30:17.5406943Z [46s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:30:17.5407982Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 1024], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=32, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[False, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:30:17.5408908Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:30:17.5409151Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:30:17.7902974Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 8.3 configs/s
2026-02-21T08:30:17.7911312Z [47s] Adaptive compile timeout: 30s (90% percentile=7.7s, bounds=[30.0s, 30s])
2026-02-21T08:30:20.7499619Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 916/916 305.9 configs/s
2026-02-21T08:30:20.8381254Z [50s] Initial random population of 100, 5 starting points: 
2026-02-21T08:30:20.8381549Z error=17
2026-02-21T08:30:20.8381691Z timeout=6
2026-02-21T08:30:20.8381833Z ok=77
2026-02-21T08:30:20.8382246Z min=0.2264
2026-02-21T08:30:20.8382390Z mid=1.7909
2026-02-21T08:30:20.8382517Z max=220.4436
2026-02-21T08:30:20.8382673Z best={'block_sizes': [512, 1],
2026-02-21T08:30:20.8382924Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:30:20.8383179Z  'load_eviction_policies': ['', 'first'],
2026-02-21T08:30:20.8383403Z  'maxnreg': 32,
2026-02-21T08:30:20.8383551Z  'num_sm_multiplier': 64,
2026-02-21T08:30:20.8383714Z  'num_stages': 6,
2026-02-21T08:30:20.8383851Z  'num_warps': 2,
2026-02-21T08:30:20.8384038Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:30:20.8384244Z  'range_flattens': [False, False],
2026-02-21T08:30:20.8384434Z  'range_multi_buffers': [False, True],
2026-02-21T08:30:20.8384620Z  'range_num_stages': [0, 0],
2026-02-21T08:30:20.8384795Z  'range_unroll_factors': [3, 0],
2026-02-21T08:30:20.8384972Z  'range_warp_specializes': [None, True]}
2026-02-21T08:30:20.8403131Z [50s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:30:21.9389721Z [51s] Generation 1 starting: 83 neighbors, 5 active search path(s)
2026-02-21T08:30:28.7519312Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 7.3 configs/s
2026-02-21T08:30:33.8826935Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 87/87 17.1 configs/s
2026-02-21T08:30:52.0907472Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━ 992/992 55.5 configs/s
2026-02-21T08:30:52.4249142Z [81s] Generation 1 complete: 
2026-02-21T08:30:52.4253201Z ok=88
2026-02-21T08:30:52.4256598Z min=0.2038
2026-02-21T08:30:52.4260513Z mid=0.2652
2026-02-21T08:30:52.4264376Z max=1.6531
2026-02-21T08:30:52.4268287Z best={'block_sizes': [1024, 1],
2026-02-21T08:30:52.4271813Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:30:52.4277547Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:30:52.4277890Z  'num_stages': 6,
2026-02-21T08:30:52.4278120Z  'num_warps': 1,
2026-02-21T08:30:52.4278370Z  'pid_type': 'flat',
2026-02-21T08:30:52.4278632Z  'range_flattens': [None, False],
2026-02-21T08:30:52.4278921Z  'range_multi_buffers': [None, None],
2026-02-21T08:30:52.4279207Z  'range_num_stages': [0, 4],
2026-02-21T08:30:52.4279464Z  'range_unroll_factors': [0, 0],
2026-02-21T08:30:52.4279760Z  'range_warp_specializes': [None, True]}
2026-02-21T08:30:52.4280104Z [81s] Fitting surrogate: 188 points, 188 targets
2026-02-21T08:30:53.3653720Z [82s] Generation 2 starting: 66 neighbors, 5 active search path(s)
2026-02-21T08:30:59.2351035Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 3.5 configs/s
2026-02-21T08:31:02.2089531Z module {
2026-02-21T08:31:02.2090160Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:31:02.2090703Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:31:02.2090899Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:31:02.2091120Z     %cst = arith.constant dense<0.000000e+00> : tensor<4x1024xf32>
2026-02-21T08:31:02.2091358Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:31:02.2091538Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:31:02.2091721Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T08:31:02.2092117Z     %c32768_i64 = arith.constant 32768 : i64
2026-02-21T08:31:02.2092296Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:31:02.2092649Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>>
2026-02-21T08:31:02.2093091Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>>
2026-02-21T08:31:02.2093402Z     %2 = tt.get_program_id x : i32
2026-02-21T08:31:02.2093580Z     %3 = arith.muli %2, %c4_i32 : i32
2026-02-21T08:31:02.2093794Z     %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:31:02.2094033Z     %5 = tt.splat %3 : i32 -> tensor<4xi32>
2026-02-21T08:31:02.2094212Z     %6 = arith.addi %5, %4 : tensor<4xi32>
2026-02-21T08:31:02.2094522Z     %7 = scf.for %arg5 = %c0_i32 to %c32768_i32 step %c1024_i32 iter_args(%arg6 = %cst) -> (tensor<4x1024xf32>)  : i32 {
2026-02-21T08:31:02.2094918Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:31:02.2095282Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:31:02.2095577Z       %13 = scf.if %arg3 -> (tensor<4x1024xf32>) {
2026-02-21T08:31:02.2095937Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32>
2026-02-21T08:31:02.2096305Z         %16 = arith.subf %12, %11 : tensor<4x1024xf32>
2026-02-21T08:31:02.2096503Z         %17 = arith.mulf %15, %16 : tensor<4x1024xf32>
2026-02-21T08:31:02.2096709Z         %18 = arith.addf %17, %cst : tensor<4x1024xf32>
2026-02-21T08:31:02.2096907Z         scf.yield %18 : tensor<4x1024xf32>
2026-02-21T08:31:02.2097071Z       } else {
2026-02-21T08:31:02.2097232Z         %15 = tt.splat %arg4 : f32 -> tensor<4x1024xf32>
2026-02-21T08:31:02.2097446Z         %16 = arith.cmpf ogt, %12, %15 : tensor<4x1024xf32>
2026-02-21T08:31:02.2097669Z         %17 = arith.cmpf une, %12, %12 : tensor<4x1024xf32>
2026-02-21T08:31:02.2097874Z         %18 = arith.ori %16, %17 : tensor<4x1024xi1>
2026-02-21T08:31:02.2098124Z         %19 = arith.select %18, %12, %15 : tensor<4x1024xi1>, tensor<4x1024xf32>
2026-02-21T08:31:02.2098677Z         %20 = math.log %19 : tensor<4x1024xf32>
2026-02-21T08:31:02.2098868Z         %21 = arith.subf %20, %11 : tensor<4x1024xf32>
2026-02-21T08:31:02.2099073Z         %22 = arith.mulf %12, %21 : tensor<4x1024xf32>
2026-02-21T08:31:02.2099275Z         %23 = arith.addf %22, %cst : tensor<4x1024xf32>
2026-02-21T08:31:02.2099473Z         scf.yield %23 : tensor<4x1024xf32>
2026-02-21T08:31:02.2099639Z       }
2026-02-21T08:31:02.2099786Z       %14 = arith.addf %arg6, %13 : tensor<4x1024xf32>
2026-02-21T08:31:02.2099974Z       scf.yield %14 : tensor<4x1024xf32>
2026-02-21T08:31:02.2100178Z     } {tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:31:02.2100393Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:31:02.2100574Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:31:02.2100754Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:31:02.2100933Z       tt.reduce.return %11 : f32
2026-02-21T08:31:02.2101204Z     }) : (tensor<4x1024xf32>) -> tensor<4xf32>
2026-02-21T08:31:02.2101428Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:31:02.2101678Z     %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:31:02.2101934Z     tt.store %10, %8 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:31:02.2102112Z     tt.return
2026-02-21T08:31:02.2102238Z   }
2026-02-21T08:31:02.2102354Z }
2026-02-21T08:31:02.2102420Z 
2026-02-21T08:31:02.2102479Z {-#
2026-02-21T08:31:02.2102602Z   external_resources: {
2026-02-21T08:31:02.2102758Z     mlir_reproducer: {
2026-02-21T08:31:02.2106943Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:31:02.2111253Z       disable_threading: false,
2026-02-21T08:31:02.2111414Z       verify_each: true
2026-02-21T08:31:02.2111557Z     }
2026-02-21T08:31:02.2111674Z   }
2026-02-21T08:31:02.2111782Z #-}
2026-02-21T08:31:02.2112234Z /tmp/torchinductor_root/jp/cjp7lrgyeevtgjtetd7rzaqljywnn7cyuy3rdfzuigyauoom3p2n.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:31:02.2113423Z /tmp/torchinductor_root/jp/cjp7lrgyeevtgjtetd7rzaqljywnn7cyuy3rdfzuigyauoom3p2n.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:31:02.2114514Z [91s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:31:02.2115510Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=6, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:31:02.2116406Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:31:02.2116670Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:31:03.0808486Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 17.9 configs/s
2026-02-21T08:31:21.9657799Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 54.0 configs/s
2026-02-21T08:31:22.4658807Z [111s] Generation 2 complete: 
2026-02-21T08:31:22.4660902Z error=1
2026-02-21T08:31:22.4661104Z ok=71
2026-02-21T08:31:22.4661282Z min=0.2100
2026-02-21T08:31:22.4661455Z mid=0.2263
2026-02-21T08:31:22.4661634Z max=0.6544
2026-02-21T08:31:22.4661819Z best={'block_sizes': [1024, 1],
2026-02-21T08:31:22.4662507Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:31:22.4662922Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:31:22.4663218Z  'num_stages': 6,
2026-02-21T08:31:22.4663420Z  'num_warps': 4,
2026-02-21T08:31:22.4663635Z  'pid_type': 'flat',
2026-02-21T08:31:22.4663863Z  'range_flattens': [None, False],
2026-02-21T08:31:22.4664140Z  'range_multi_buffers': [None, True],
2026-02-21T08:31:22.4664418Z  'range_num_stages': [0, 4],
2026-02-21T08:31:22.4664651Z  'range_unroll_factors': [0, 0],
2026-02-21T08:31:22.4664918Z  'range_warp_specializes': [None, True]}
2026-02-21T08:31:22.4674888Z [111s] Fitting surrogate: 260 points, 260 targets
2026-02-21T08:31:23.5767809Z [112s] Generation 3 starting: 66 neighbors, 5 active search path(s)
2026-02-21T08:31:27.5864826Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67/67 41.5 configs/s
2026-02-21T08:31:31.3911557Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 67/67 17.8 configs/s
2026-02-21T08:31:51.1861057Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 50.7 configs/s
2026-02-21T08:31:51.5823617Z [140s] Generation 3 complete: 
2026-02-21T08:31:51.5824005Z ok=72
2026-02-21T08:31:51.5826718Z min=0.2100
2026-02-21T08:31:51.5826916Z mid=0.2222
2026-02-21T08:31:51.5827099Z max=0.3532
2026-02-21T08:31:51.5827302Z best={'block_sizes': [1024, 1],
2026-02-21T08:31:51.5827740Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:31:51.5830048Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:31:51.5830360Z  'num_stages': 6,
2026-02-21T08:31:51.5830584Z  'num_warps': 4,
2026-02-21T08:31:51.5830832Z  'pid_type': 'flat',
2026-02-21T08:31:51.5831089Z  'range_flattens': [None, False],
2026-02-21T08:31:51.5831423Z  'range_multi_buffers': [None, True],
2026-02-21T08:31:51.5833364Z  'range_num_stages': [0, 4],
2026-02-21T08:31:51.5833631Z  'range_unroll_factors': [0, 0],
2026-02-21T08:31:51.5833894Z  'range_warp_specializes': [None, True]}
2026-02-21T08:31:51.5849959Z [140s] Fitting surrogate: 332 points, 332 targets
2026-02-21T08:31:52.5970740Z [141s] Generation 4 starting: 52 neighbors, 5 active search path(s)
2026-02-21T08:31:57.0729511Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 7.6 configs/s
2026-02-21T08:32:00.1009171Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 17.8 configs/s
2026-02-21T08:32:14.5749143Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 69.6 configs/s
2026-02-21T08:32:14.8670745Z [164s] Generation 4 complete: 
2026-02-21T08:32:14.8675806Z ok=58
2026-02-21T08:32:14.8679592Z min=0.2121
2026-02-21T08:32:14.8683480Z mid=0.2222
2026-02-21T08:32:14.8687386Z max=0.4756
2026-02-21T08:32:14.8690570Z best={'block_sizes': [2048, 1],
2026-02-21T08:32:14.8694523Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:32:14.8698319Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:32:14.8698605Z  'num_stages': 6,
2026-02-21T08:32:14.8698774Z  'num_warps': 8,
2026-02-21T08:32:14.8698946Z  'pid_type': 'flat',
2026-02-21T08:32:14.8699120Z  'range_flattens': [None, None],
2026-02-21T08:32:14.8705444Z  'range_multi_buffers': [None, True],
2026-02-21T08:32:14.8709511Z  'range_num_stages': [0, 0],
2026-02-21T08:32:14.8710762Z  'range_unroll_factors': [0, 0],
2026-02-21T08:32:14.8710956Z  'range_warp_specializes': [None, True]}
2026-02-21T08:32:14.8711176Z [164s] Fitting surrogate: 390 points, 390 targets
2026-02-21T08:32:15.8751043Z [165s] Generation 5 starting: 62 neighbors, 5 active search path(s)
2026-02-21T08:32:18.9424954Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64/64 29.6 configs/s
2026-02-21T08:32:22.6223862Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 64/64 17.6 configs/s
2026-02-21T08:32:40.0366995Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 58.5 configs/s
2026-02-21T08:32:40.3961371Z [189s] Generation 5 complete: 
2026-02-21T08:32:40.3962771Z ok=67
2026-02-21T08:32:40.3962931Z min=0.2099
2026-02-21T08:32:40.3963064Z mid=0.2180
2026-02-21T08:32:40.3963179Z max=0.5397
2026-02-21T08:32:40.3963325Z best={'block_sizes': [2048, 1],
2026-02-21T08:32:40.3963578Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:32:40.3963838Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:32:40.3964029Z  'num_stages': 6,
2026-02-21T08:32:40.3964166Z  'num_warps': 8,
2026-02-21T08:32:40.3964311Z  'pid_type': 'flat',
2026-02-21T08:32:40.3964462Z  'range_flattens': [None, True],
2026-02-21T08:32:40.3964640Z  'range_multi_buffers': [None, True],
2026-02-21T08:32:40.3964849Z  'range_num_stages': [0, 0],
2026-02-21T08:32:40.3965043Z  'range_unroll_factors': [0, 0],
2026-02-21T08:32:40.3965243Z  'range_warp_specializes': [None, True]}
2026-02-21T08:32:40.3974231Z [189s] Fitting surrogate: 457 points, 457 targets
2026-02-21T08:32:40.9507361Z [190s] Generation 6 starting: 34 neighbors, 3 active search path(s)
2026-02-21T08:32:45.0083164Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 1.9 configs/s
2026-02-21T08:32:47.0085239Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 17.9 configs/s
2026-02-21T08:32:56.0879194Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 110.6 configs/s
2026-02-21T08:32:56.2911274Z [205s] Generation 6 complete: 
2026-02-21T08:32:56.2912726Z ok=38
2026-02-21T08:32:56.2912917Z min=0.2100
2026-02-21T08:32:56.2913072Z mid=0.2221
2026-02-21T08:32:56.2913205Z max=1.0025
2026-02-21T08:32:56.2913369Z best={'block_sizes': [2048, 1],
2026-02-21T08:32:56.2913677Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:32:56.2913995Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:32:56.2914205Z  'num_stages': 6,
2026-02-21T08:32:56.2914366Z  'num_warps': 8,
2026-02-21T08:32:56.2914519Z  'pid_type': 'flat',
2026-02-21T08:32:56.2914703Z  'range_flattens': [None, True],
2026-02-21T08:32:56.2914906Z  'range_multi_buffers': [None, True],
2026-02-21T08:32:56.2915111Z  'range_num_stages': [0, 0],
2026-02-21T08:32:56.2915298Z  'range_unroll_factors': [0, 0],
2026-02-21T08:32:56.2915497Z  'range_warp_specializes': [None, True]}
2026-02-21T08:32:56.2932908Z [205s] Fitting surrogate: 495 points, 495 targets
2026-02-21T08:32:57.0400204Z [206s] Generation 7 starting: 45 neighbors, 3 active search path(s)
2026-02-21T08:32:59.5501424Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47/47 29.1 configs/s
2026-02-21T08:33:02.2413116Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 47/47 17.8 configs/s
2026-02-21T08:33:14.7091789Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 82.4 configs/s
2026-02-21T08:33:14.9691044Z [224s] Generation 7 complete: 
2026-02-21T08:33:14.9691342Z ok=49
2026-02-21T08:33:14.9691493Z min=0.2100
2026-02-21T08:33:14.9691650Z mid=0.2243
2026-02-21T08:33:14.9691799Z max=1.2893
2026-02-21T08:33:14.9692066Z best={'block_sizes': [2048, 1],
2026-02-21T08:33:14.9692281Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:33:14.9692516Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:33:14.9692708Z  'num_stages': 2,
2026-02-21T08:33:14.9692861Z  'num_warps': 8,
2026-02-21T08:33:14.9693002Z  'pid_type': 'flat',
2026-02-21T08:33:14.9693165Z  'range_flattens': [None, False],
2026-02-21T08:33:14.9693341Z  'range_multi_buffers': [None, True],
2026-02-21T08:33:14.9693523Z  'range_num_stages': [0, 4],
2026-02-21T08:33:14.9693691Z  'range_unroll_factors': [0, 1],
2026-02-21T08:33:14.9693867Z  'range_warp_specializes': [None, True]}
2026-02-21T08:33:14.9706557Z [224s] Fitting surrogate: 544 points, 544 targets
2026-02-21T08:33:15.5782142Z [224s] Generation 8 starting: 28 neighbors, 2 active search path(s)
2026-02-21T08:33:17.5818722Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 29/29 17.2 configs/s
2026-02-21T08:33:19.2484986Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 29/29 17.9 configs/s
2026-02-21T08:33:26.6110744Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 136.4 configs/s
2026-02-21T08:33:26.7714481Z [235s] Generation 8 complete: 
2026-02-21T08:33:26.7714730Z ok=30
2026-02-21T08:33:26.7714861Z min=0.2056
2026-02-21T08:33:26.7714988Z mid=0.2388
2026-02-21T08:33:26.7715105Z max=0.6790
2026-02-21T08:33:26.7715243Z best={'block_sizes': [2048, 1],
2026-02-21T08:33:26.7715440Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:33:26.7715662Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:33:26.7715881Z  'num_stages': 3,
2026-02-21T08:33:26.7716039Z  'num_warps': 8,
2026-02-21T08:33:26.7716171Z  'pid_type': 'flat',
2026-02-21T08:33:26.7716327Z  'range_flattens': [None, True],
2026-02-21T08:33:26.7716494Z  'range_multi_buffers': [None, None],
2026-02-21T08:33:26.7716675Z  'range_num_stages': [0, 3],
2026-02-21T08:33:26.7716836Z  'range_unroll_factors': [0, 0],
2026-02-21T08:33:26.7717005Z  'range_warp_specializes': [None, True]}
2026-02-21T08:33:26.7733432Z [236s] Fitting surrogate: 574 points, 574 targets
2026-02-21T08:33:27.1784300Z [236s] Generation 9 starting: 12 neighbors, 1 active search path(s)
2026-02-21T08:33:28.1266056Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 24.3 configs/s
2026-02-21T08:33:28.8172824Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 12/12 18.7 configs/s
2026-02-21T08:33:31.5497216Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 363.2 configs/s
2026-02-21T08:33:31.6353060Z [240s] Generation 9 complete: 
2026-02-21T08:33:31.6353365Z ok=13
2026-02-21T08:33:31.6353562Z min=0.2098
2026-02-21T08:33:31.6353745Z mid=0.2200
2026-02-21T08:33:31.6353928Z max=0.3554
2026-02-21T08:33:31.6354134Z best={'block_sizes': [2048, 1],
2026-02-21T08:33:31.6354440Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:33:31.6354794Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:33:31.6355090Z  'num_stages': 4,
2026-02-21T08:33:31.6355308Z  'num_warps': 8,
2026-02-21T08:33:31.6355515Z  'pid_type': 'flat',
2026-02-21T08:33:31.6355755Z  'range_flattens': [None, True],
2026-02-21T08:33:31.6356025Z  'range_multi_buffers': [None, None],
2026-02-21T08:33:31.6356312Z  'range_num_stages': [0, 4],
2026-02-21T08:33:31.6356572Z  'range_unroll_factors': [0, 0],
2026-02-21T08:33:31.6356851Z  'range_warp_specializes': [None, True]}
2026-02-21T08:33:31.6378387Z [240s] Fitting surrogate: 587 points, 587 targets
2026-02-21T08:33:31.8074022Z [241s] Autotuning complete in 241.0s after searching 556 configs.
2026-02-21T08:33:31.8074494Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:33:31.8075905Z     @helion.kernel(config=helion.Config(block_sizes=[2048, 1], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=4, num_warps=8, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:33:31.8077169Z 
2026-02-21T08:33:31.8077565Z [241s] Code of selected kernel: /tmp/torchinductor_root/62/c624ylltgvkjwjg4c7ypjk65kagjycjwkdkbdjnh2wdf7knnrj6m.py
2026-02-21T08:33:31.8365277Z from __future__ import annotations
2026-02-21T08:33:31.8365472Z 
2026-02-21T08:33:31.8365566Z import torch
2026-02-21T08:33:31.8365755Z import triton
2026-02-21T08:33:31.8365972Z import triton.language as tl
2026-02-21T08:33:31.8366275Z from torch._inductor.runtime import triton_helpers
2026-02-21T08:33:31.8366739Z from torch._inductor.runtime.triton_helpers import math as tl_math
2026-02-21T08:33:31.8367200Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T08:33:31.8367629Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T08:33:31.8367909Z 
2026-02-21T08:33:31.8368005Z _BLOCK_SIZE_1 = tl.constexpr(1)
2026-02-21T08:33:31.8368267Z _BLOCK_SIZE_0 = tl.constexpr(2048)
2026-02-21T08:33:31.8368440Z 
2026-02-21T08:33:31.8368515Z @triton.jit
2026-02-21T08:33:31.8368798Z def _helion_kl_div_forward(y_pred, y_true, loss, log_target, eps):
2026-02-21T08:33:31.8369247Z     # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m):
2026-02-21T08:33:31.8369629Z     pid_0 = tl.program_id(0)
2026-02-21T08:33:31.8369863Z     offset_1 = pid_0
2026-02-21T08:33:31.8370119Z     indices_1 = offset_1 + tl.zeros([1], tl.int32)
2026-02-21T08:33:31.8370558Z     # src[kl_div.py:90]: loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32)
2026-02-21T08:33:31.8371063Z     loss_sum = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T08:33:31.8371504Z     # src[kl_div.py:92]: for tile_v in hl.tile(V, block_size=block_size_n):
2026-02-21T08:33:31.8372290Z     # src[kl_div.py:93]:     kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32)
2026-02-21T08:33:31.8372700Z     # src[kl_div.py:92-112]: ...
2026-02-21T08:33:31.8373110Z     for offset_0 in tl.range(0, 32768, _BLOCK_SIZE_0, warp_specialize=True, num_stages=4, flatten=True):
2026-02-21T08:33:31.8373624Z         indices_0 = offset_0 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32)
2026-02-21T08:33:31.8373958Z         loss_sum_copy = loss_sum
2026-02-21T08:33:31.8374212Z         loss_sum_copy_0 = loss_sum_copy
2026-02-21T08:33:31.8374635Z         # src[kl_div.py:93]: kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32)
2026-02-21T08:33:31.8375105Z         kl_loss = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T08:33:31.8375517Z         # src[kl_div.py:95]: y_pred_val = y_pred[tile_bt, tile_v]
2026-02-21T08:33:31.8376246Z         y_pred_val = tl.load(y_pred + (indices_1[:, None] * 32768 + indices_0[None, :] * 1), None, eviction_policy='evict_first')
2026-02-21T08:33:31.8376784Z         # src[kl_div.py:96]: y_true_val = y_true[tile_bt, tile_v]
2026-02-21T08:33:31.8377341Z         y_true_val = tl.load(y_true + (indices_1[:, None] * 32768 + indices_0[None, :] * 1), None, eviction_policy='evict_first')
2026-02-21T08:33:31.8377860Z         # src[kl_div.py:98]: if log_target:
2026-02-21T08:33:31.8378277Z         # src[kl_div.py:99]:     # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space
2026-02-21T08:33:31.8378750Z         # src[kl_div.py:100]:     prob_true = torch.exp(y_true_val)
2026-02-21T08:33:31.8379093Z         # src[kl_div.py:98-106]: ...
2026-02-21T08:33:31.8379357Z         if log_target:
2026-02-21T08:33:31.8379586Z             y_true_val_copy = y_true_val
2026-02-21T08:33:31.8379958Z             y_pred_val_copy = y_pred_val
2026-02-21T08:33:31.8380231Z             kl_loss_copy = kl_loss
2026-02-21T08:33:31.8380513Z             y_true_val_copy_0 = y_true_val_copy
2026-02-21T08:33:31.8380818Z             y_pred_val_copy_0 = y_pred_val_copy
2026-02-21T08:33:31.8381114Z             kl_loss_copy_0 = kl_loss_copy
2026-02-21T08:33:31.8381445Z             # src[kl_div.py:100]: prob_true = torch.exp(y_true_val)
2026-02-21T08:33:31.8381805Z             v_0 = libdevice.exp(y_true_val_copy_0)
2026-02-21T08:33:31.8382242Z             # src[kl_div.py:101]: kl_loss += prob_true * (y_true_val - y_pred_val)
2026-02-21T08:33:31.8382643Z             v_1 = y_true_val_copy_0 - y_pred_val_copy_0
2026-02-21T08:33:31.8382939Z             v_2 = v_0 * v_1
2026-02-21T08:33:31.8383179Z             kl_loss = kl_loss_copy_0 + v_2
2026-02-21T08:33:31.8383463Z         # src[kl_div.py:98]: if log_target:
2026-02-21T08:33:31.8383856Z         # src[kl_div.py:99]:     # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space
2026-02-21T08:33:31.8384321Z         # src[kl_div.py:100]:     prob_true = torch.exp(y_true_val)
2026-02-21T08:33:31.8384662Z         # src[kl_div.py:98-106]: ...
2026-02-21T08:33:31.8384920Z         _not = not log_target
2026-02-21T08:33:31.8385151Z         if _not:
2026-02-21T08:33:31.8385357Z             y_true_val_copy_1 = y_true_val
2026-02-21T08:33:31.8385634Z             y_pred_val_copy_1 = y_pred_val
2026-02-21T08:33:31.8385900Z             kl_loss_copy_1 = kl_loss
2026-02-21T08:33:31.8386185Z             y_true_val_copy_1_0 = y_true_val_copy_1
2026-02-21T08:33:31.8386493Z             y_pred_val_copy_1_0 = y_pred_val_copy_1
2026-02-21T08:33:31.8386794Z             kl_loss_copy_1_0 = kl_loss_copy_1
2026-02-21T08:33:31.8387185Z             # src[kl_div.py:105]: log_true = torch.log(torch.clamp(y_true_val, min=eps))
2026-02-21T08:33:31.8387637Z             v_4 = triton_helpers.maximum(y_true_val_copy_1_0, eps)
2026-02-21T08:33:31.8387978Z             v_5 = tl_math.log(v_4)
2026-02-21T08:33:31.8388317Z             # src[kl_div.py:106]: kl_loss += y_true_val * (log_true - y_pred_val)
2026-02-21T08:33:31.8388695Z             v_6 = v_5 - y_pred_val_copy_1_0
2026-02-21T08:33:31.8388968Z             v_7 = y_true_val_copy_1_0 * v_6
2026-02-21T08:33:31.8389255Z             kl_loss = kl_loss_copy_1_0 + v_7
2026-02-21T08:33:31.8389544Z         # src[kl_div.py:112]: loss_sum += kl_loss
2026-02-21T08:33:31.8389848Z         loss_sum = loss_sum_copy_0 + kl_loss
2026-02-21T08:33:31.8390184Z     # src[kl_div.py:115]: loss[tile_bt] = loss_sum.sum(dim=-1)
2026-02-21T08:33:31.8390546Z     sum_1 = tl.cast(tl.sum(loss_sum, 1), tl.float32)
2026-02-21T08:33:31.8390873Z     tl.store(loss + indices_1 * 1, sum_1, None)
2026-02-21T08:33:31.8391079Z 
2026-02-21T08:33:31.8391537Z def kl_div_forward(y_pred: Tensor, y_true: Tensor, log_target: bool=False, reduction: str='batchmean', eps: float=1e-10, *, _launcher=_default_launcher):
2026-02-21T08:33:31.8392230Z     """
2026-02-21T08:33:31.8392430Z     Compute KL Divergence loss.
2026-02-21T08:33:31.8392605Z 
2026-02-21T08:33:31.8392764Z     Args:
2026-02-21T08:33:31.8393030Z         y_pred: Input predictions in log-space, shape (BT, V)
2026-02-21T08:33:31.8393482Z         y_true: Target values (probabilities or log-probabilities), shape (BT, V)
2026-02-21T08:33:31.8394023Z         log_target: If True, y_true is in log-space; if False, y_true is probabilities
2026-02-21T08:33:31.8394506Z         reduction: Reduction mode ('none', 'sum', 'mean', 'batchmean')
2026-02-21T08:33:31.8394898Z         eps: Small value to avoid numerical issues
2026-02-21T08:33:31.8395108Z 
2026-02-21T08:33:31.8395192Z     Returns:
2026-02-21T08:33:31.8395396Z         loss: KL divergence loss
2026-02-21T08:33:31.8395633Z     """
2026-02-21T08:33:31.8395832Z     # src[kl_div.py:74]: BT, V = y_pred.shape
2026-02-21T08:33:31.8396115Z     BT, V = y_pred.shape
2026-02-21T08:33:31.8396406Z     # src[kl_div.py:75]: assert y_true.shape == y_pred.shape, (
2026-02-21T08:33:31.8396915Z     # src[kl_div.py:76]:     f"Shape mismatch: {y_true.shape} != {y_pred.shape}"
2026-02-21T08:33:31.8397285Z     # src[kl_div.py:77]: )
2026-02-21T08:33:31.8397687Z     assert y_true.shape == y_pred.shape, f'Shape mismatch: {y_true.shape} != {y_pred.shape}'
2026-02-21T08:33:31.8398149Z     # src[kl_div.py:80]: if reduction == "none":
2026-02-21T08:33:31.8398492Z     # src[kl_div.py:81]:     loss = torch.zeros_like(y_pred)
2026-02-21T08:33:31.8398814Z     # src[kl_div.py:82]: else:
2026-02-21T08:33:31.8399052Z     # src[kl_div.py:80-83]: ...
2026-02-21T08:33:31.8399295Z     if reduction == 'none':
2026-02-21T08:33:31.8399576Z         # src[kl_div.py:81]: loss = torch.zeros_like(y_pred)
2026-02-21T08:33:31.8399900Z         loss = torch.zeros_like(y_pred)
2026-02-21T08:33:31.8400153Z     else:
2026-02-21T08:33:31.8400485Z         # src[kl_div.py:83]: loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device)
2026-02-21T08:33:31.8401005Z         loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device)
2026-02-21T08:33:31.8401459Z     # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m):
2026-02-21T08:33:31.8402546Z     # src[kl_div.py:90]:     loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32)
2026-02-21T08:33:31.8402963Z     # src[kl_div.py:89-115]: ...
2026-02-21T08:33:31.8403443Z     _launcher(_helion_kl_div_forward, (4096,), y_pred, y_true, loss, log_target, eps, num_warps=8, num_stages=4)
2026-02-21T08:33:31.8403987Z     # src[kl_div.py:118]: if reduction == "batchmean":
2026-02-21T08:33:31.8404351Z     # src[kl_div.py:119]:     final_loss = torch.sum(loss) / BT
2026-02-21T08:33:31.8404716Z     # src[kl_div.py:120]: elif reduction == "sum":
2026-02-21T08:33:31.8405012Z     # src[kl_div.py:118-125]: ...
2026-02-21T08:33:31.8405277Z     if reduction == 'batchmean':
2026-02-21T08:33:31.8405582Z         # src[kl_div.py:119]: final_loss = torch.sum(loss) / BT
2026-02-21T08:33:31.8405926Z         final_loss = torch.sum(loss) / BT
2026-02-21T08:33:31.8406202Z     elif reduction == 'sum':
2026-02-21T08:33:31.8406510Z         # src[kl_div.py:121]: final_loss = torch.sum(loss, dim=0)
2026-02-21T08:33:31.8406857Z         final_loss = torch.sum(loss, dim=0)
2026-02-21T08:33:31.8407131Z     elif reduction == 'mean':
2026-02-21T08:33:31.8407447Z         # src[kl_div.py:123]: final_loss = torch.sum(loss) / (BT * V)
2026-02-21T08:33:31.8407802Z         final_loss = torch.sum(loss) / (BT * V)
2026-02-21T08:33:31.8408074Z     else:
2026-02-21T08:33:31.8408276Z         # src[kl_div.py:125]: final_loss = loss
2026-02-21T08:33:31.8408563Z         final_loss = loss
2026-02-21T08:33:31.8408809Z     # src[kl_div.py:127]: return final_loss
2026-02-21T08:33:31.8409077Z     return final_loss
2026-02-21T08:33:33.0024955Z WARNING:tritonbench.utils.triton_op:Completed input ID 3:
2026-02-21T08:33:33.0028992Z (B, T, V)
2026-02-21T08:33:33.0030975Z ---------------
2026-02-21T08:33:33.0031216Z (8, 512, 32768)
2026-02-21T08:33:33.0035875Z 
2026-02-21T08:33:33.0044481Z  67%|██████▋   | 4/6 [12:05<06:26, 193.41s/it]
2026-02-21T08:33:33.0044481Z WARNING:tritonbench.utils.triton_op:Running input ID 4:
2026-02-21T08:33:33.0049333Z (B, T, V)
2026-02-21T08:33:33.0050829Z ---------------
2026-02-21T08:33:33.0051021Z (8, 512, 65536)
2026-02-21T08:33:33.0057576Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for torch_kl_div
2026-02-21T08:33:34.0779906Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for liger_kl_div
2026-02-21T08:33:35.2863482Z INFO:tritonbench.utils.triton_op:Took 2.69ms to get benchmark function for torch_compile_kl_div
2026-02-21T08:33:39.0012667Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:33:39.0016792Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:33:39.0020797Z               'dtype': 'torch.float32',
2026-02-21T08:33:39.0023767Z               'shape': (4096, 65536),
2026-02-21T08:33:39.0027746Z               'stride': (65536, 1)},
2026-02-21T08:33:39.0031636Z             { 'device': 'cuda:0',
2026-02-21T08:33:39.0036142Z               'dtype': 'torch.float32',
2026-02-21T08:33:39.0039488Z               'shape': (4096, 65536),
2026-02-21T08:33:39.0040607Z               'stride': (65536, 1)}),
2026-02-21T08:33:39.0040794Z   'kwargs': {}}
2026-02-21T08:33:39.0041084Z INFO:tritonbench.utils.triton_op:Took 2.55ms to get benchmark function for helion_kl_div_tritonbench
2026-02-21T08:33:39.2205507Z [0s] Autotune random seed: 2135561342
2026-02-21T08:33:39.3918736Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:34:12.6724283Z [33s] Timeout after 30s compiling Config(block_sizes=[1024, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[None, None])
2026-02-21T08:34:13.0652796Z [33s] Timeout after 30s compiling Config(block_sizes=[4096, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], num_stages=2, num_warps=32, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[None, None])
2026-02-21T08:34:14.8944745Z [35s] Timeout after 30s compiling Config(block_sizes=[65536, 8], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], maxnreg=64, num_sm_multiplier=8, num_stages=7, num_warps=32, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[1, 0], range_unroll_factors=[2, 0], range_warp_specializes=[False, None])
2026-02-21T08:34:15.1111322Z [35s] Timeout after 30s compiling Config(block_sizes=[512, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=64, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[1, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None])
2026-02-21T08:34:15.1126468Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s
2026-02-21T08:34:15.2078138Z module {
2026-02-21T08:34:15.2083040Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:34:15.2084593Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:34:15.2084820Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:34:15.2085036Z     %cst = arith.constant dense<65536> : tensor<16x1xi32>
2026-02-21T08:34:15.2085303Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<16x8xf32>
2026-02-21T08:34:15.2085533Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:34:15.2085732Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:34:15.2085949Z     %c65536_i32 = arith.constant 65536 : i32
2026-02-21T08:34:15.2086572Z     %c65536_i64 = arith.constant 65536 : i64
2026-02-21T08:34:15.2086769Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:34:15.2087086Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c65536_i32], [%c65536_i64, %c1_i64] : <f32>, <tensor<16x8xf32>>
2026-02-21T08:34:15.2087416Z     %1 = tt.get_program_id x : i32
2026-02-21T08:34:15.2087591Z     %2 = arith.muli %1, %c16_i32 : i32
2026-02-21T08:34:15.2087823Z     %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:34:15.2088066Z     %4 = tt.splat %2 : i32 -> tensor<16xi32>
2026-02-21T08:34:15.2088246Z     %5 = arith.addi %4, %3 : tensor<16xi32>
2026-02-21T08:34:15.2088551Z     %6 = scf.for %arg5 = %c0_i32 to %c65536_i32 step %c8_i32 iter_args(%arg6 = %cst_0) -> (tensor<16x8xf32>)  : i32 {
2026-02-21T08:34:15.2088887Z       %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:34:15.2089238Z       %11 = tt.splat %arg5 : i32 -> tensor<8xi32>
2026-02-21T08:34:15.2089530Z       %12 = arith.addi %11, %10 : tensor<8xi32>
2026-02-21T08:34:15.2089842Z       %13 = tt.descriptor_load %0[%2, %arg5] : !tt.tensordesc<tensor<16x8xf32>> -> tensor<16x8xf32>
2026-02-21T08:34:15.2090177Z       %14 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T08:34:15.2090429Z       %15 = arith.muli %14, %cst : tensor<16x1xi32>
2026-02-21T08:34:15.2090676Z       %16 = tt.expand_dims %12 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:34:15.2091000Z       %17 = tt.broadcast %15 : tensor<16x1xi32> -> tensor<16x8xi32>
2026-02-21T08:34:15.2091263Z       %18 = tt.broadcast %16 : tensor<1x8xi32> -> tensor<16x8xi32>
2026-02-21T08:34:15.2091490Z       %19 = arith.addi %17, %18 : tensor<16x8xi32>
2026-02-21T08:34:15.2091715Z       %20 = tt.splat %arg1 : !tt.ptr<f32> -> tensor<16x8x!tt.ptr<f32>>
2026-02-21T08:34:15.2094986Z       %21 = tt.addptr %20, %19 : tensor<16x8x!tt.ptr<f32>>, tensor<16x8xi32>
2026-02-21T08:34:15.2099232Z       %22 = tt.load %21 evictionPolicy = evict_first : tensor<16x8x!tt.ptr<f32>>
2026-02-21T08:34:15.2102982Z       %23 = scf.if %arg3 -> (tensor<16x8xf32>) {
2026-02-21T08:34:15.2106427Z         %25 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x8xf32>) -> tensor<16x8xf32>
2026-02-21T08:34:15.2110364Z         %26 = arith.subf %22, %13 : tensor<16x8xf32>
2026-02-21T08:34:15.2114342Z         %27 = arith.mulf %25, %26 : tensor<16x8xf32>
2026-02-21T08:34:15.2116227Z         %28 = arith.addf %27, %cst_0 : tensor<16x8xf32>
2026-02-21T08:34:15.2116440Z         scf.yield %28 : tensor<16x8xf32>
2026-02-21T08:34:15.2116617Z       } else {
2026-02-21T08:34:15.2116783Z         %25 = tt.splat %arg4 : f32 -> tensor<16x8xf32>
2026-02-21T08:34:15.2117016Z         %26 = arith.cmpf ogt, %22, %25 : tensor<16x8xf32>
2026-02-21T08:34:15.2117233Z         %27 = arith.cmpf une, %22, %22 : tensor<16x8xf32>
2026-02-21T08:34:15.2117444Z         %28 = arith.ori %26, %27 : tensor<16x8xi1>
2026-02-21T08:34:15.2117688Z         %29 = arith.select %28, %22, %25 : tensor<16x8xi1>, tensor<16x8xf32>
2026-02-21T08:34:15.2117926Z         %30 = math.log %29 : tensor<16x8xf32>
2026-02-21T08:34:15.2118113Z         %31 = arith.subf %30, %13 : tensor<16x8xf32>
2026-02-21T08:34:15.2118314Z         %32 = arith.mulf %22, %31 : tensor<16x8xf32>
2026-02-21T08:34:15.2118511Z         %33 = arith.addf %32, %cst_0 : tensor<16x8xf32>
2026-02-21T08:34:15.2118703Z         scf.yield %33 : tensor<16x8xf32>
2026-02-21T08:34:15.2118867Z       }
2026-02-21T08:34:15.2119017Z       %24 = arith.addf %arg6, %23 : tensor<16x8xf32>
2026-02-21T08:34:15.2119212Z       scf.yield %24 : tensor<16x8xf32>
2026-02-21T08:34:15.2119387Z     } {tt.warp_specialize}
2026-02-21T08:34:15.2119562Z     %7 = "tt.reduce"(%6) <{axis = 1 : i32}> ({
2026-02-21T08:34:15.2119749Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:34:15.2119933Z       %10 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:34:15.2120122Z       tt.reduce.return %10 : f32
2026-02-21T08:34:15.2120538Z     }) : (tensor<16x8xf32>) -> tensor<16xf32>
2026-02-21T08:34:15.2120766Z     %8 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:34:15.2121042Z     %9 = tt.addptr %8, %5 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:34:15.2121280Z     tt.store %9, %7 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:34:15.2121458Z     tt.return
2026-02-21T08:34:15.2121597Z   }
2026-02-21T08:34:15.2121719Z }
2026-02-21T08:34:15.2121785Z 
2026-02-21T08:34:15.2121994Z {-#
2026-02-21T08:34:15.2122124Z   external_resources: {
2026-02-21T08:34:15.2122284Z     mlir_reproducer: {
2026-02-21T08:34:15.2126614Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:34:15.2131132Z       disable_threading: false,
2026-02-21T08:34:15.2131308Z       verify_each: true
2026-02-21T08:34:15.2131450Z     }
2026-02-21T08:34:15.2131574Z   }
2026-02-21T08:34:15.2131686Z #-}
2026-02-21T08:34:15.2132144Z /tmp/torchinductor_root/23/c23utnd4szzwwjmnl75xrskyirxil2q5pn33uhnulddun2ry4cua.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:34:15.2133359Z /tmp/torchinductor_root/23/c23utnd4szzwwjmnl75xrskyirxil2q5pn33uhnulddun2ry4cua.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:34:15.2134353Z [35s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:34:15.2135381Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 16], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=6, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:34:15.2136215Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:34:15.2136459Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:34:23.7560551Z module attributes {ttg.maxnreg = 256 : i32} {
2026-02-21T08:34:23.7562480Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:34:23.7563517Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:34:23.7568154Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:34:23.7569703Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:34:23.7570022Z     %cst = arith.constant dense<0.000000e+00> : tensor<4x4096xf32>
2026-02-21T08:34:23.7570259Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:34:23.7576267Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:34:23.7580827Z     %c65536_i32 = arith.constant 65536 : i32
2026-02-21T08:34:23.7585485Z     %c65536_i64 = arith.constant 65536 : i64
2026-02-21T08:34:23.7589961Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:34:23.7592211Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c65536_i32], [%c65536_i64, %c1_i64] : <f32>, <tensor<4x4096xf32>>
2026-02-21T08:34:23.7592981Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c65536_i32], [%c65536_i64, %c1_i64] : <f32>, <tensor<4x4096xf32>>
2026-02-21T08:34:23.7593374Z     %2 = tt.get_program_id x : i32
2026-02-21T08:34:23.7598305Z     scf.for %arg5 = %2 to %c1024_i32 step %c9472_i32  : i32 {
2026-02-21T08:34:23.7598642Z       %3 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:34:23.7598912Z       %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:34:23.7603678Z       %5 = tt.splat %3 : i32 -> tensor<4xi32>
2026-02-21T08:34:23.7608100Z       %6 = arith.addi %5, %4 : tensor<4xi32>
2026-02-21T08:34:23.7609632Z       %c61440_i32 = arith.constant 61440 : i32
2026-02-21T08:34:23.7609862Z       %c12288_i32 = arith.constant 12288 : i32
2026-02-21T08:34:23.7610257Z       %7 = scf.for %arg6 = %c0_i32 to %c61440_i32 step %c12288_i32 iter_args(%arg7 = %cst) -> (tensor<4x4096xf32>)  : i32 {
2026-02-21T08:34:23.7615514Z         %15 = tt.descriptor_load %0[%3, %arg6] : !tt.tensordesc<tensor<4x4096xf32>> -> tensor<4x4096xf32>
2026-02-21T08:34:23.7619957Z         %16 = tt.descriptor_load %1[%3, %arg6] : !tt.tensordesc<tensor<4x4096xf32>> -> tensor<4x4096xf32>
2026-02-21T08:34:23.7621816Z         %17 = scf.if %arg3 -> (tensor<4x4096xf32>) {
2026-02-21T08:34:23.7622309Z           %31 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4096xf32>) -> tensor<4x4096xf32>
2026-02-21T08:34:23.7622691Z           %32 = arith.subf %16, %15 : tensor<4x4096xf32>
2026-02-21T08:34:23.7622895Z           %33 = arith.mulf %31, %32 : tensor<4x4096xf32>
2026-02-21T08:34:23.7623114Z           %34 = arith.addf %33, %cst : tensor<4x4096xf32>
2026-02-21T08:34:23.7623313Z           scf.yield %34 : tensor<4x4096xf32>
2026-02-21T08:34:23.7623489Z         } else {
2026-02-21T08:34:23.7623652Z           %31 = tt.splat %arg4 : f32 -> tensor<4x4096xf32>
2026-02-21T08:34:23.7623877Z           %32 = arith.cmpf ogt, %16, %31 : tensor<4x4096xf32>
2026-02-21T08:34:23.7624108Z           %33 = arith.cmpf une, %16, %16 : tensor<4x4096xf32>
2026-02-21T08:34:23.7624319Z           %34 = arith.ori %32, %33 : tensor<4x4096xi1>
2026-02-21T08:34:23.7624561Z           %35 = arith.select %34, %16, %31 : tensor<4x4096xi1>, tensor<4x4096xf32>
2026-02-21T08:34:23.7624799Z           %36 = math.log %35 : tensor<4x4096xf32>
2026-02-21T08:34:23.7624997Z           %37 = arith.subf %36, %15 : tensor<4x4096xf32>
2026-02-21T08:34:23.7625190Z           %38 = arith.mulf %16, %37 : tensor<4x4096xf32>
2026-02-21T08:34:23.7625394Z           %39 = arith.addf %38, %cst : tensor<4x4096xf32>
2026-02-21T08:34:23.7625589Z           scf.yield %39 : tensor<4x4096xf32>
2026-02-21T08:34:23.7625763Z         }
2026-02-21T08:34:23.7625915Z         %18 = arith.addf %arg7, %17 : tensor<4x4096xf32>
2026-02-21T08:34:23.7626106Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:34:23.7626298Z         %19 = arith.muli %c4096_i32, %c1_i32 : i32
2026-02-21T08:34:23.7626477Z         %20 = arith.addi %arg6, %19 : i32
2026-02-21T08:34:23.7626751Z         %21 = tt.descriptor_load %0[%3, %20] : !tt.tensordesc<tensor<4x4096xf32>> -> tensor<4x4096xf32>
2026-02-21T08:34:23.7627332Z         %22 = tt.descriptor_load %1[%3, %20] : !tt.tensordesc<tensor<4x4096xf32>> -> tensor<4x4096xf32>
2026-02-21T08:34:23.7627622Z         %23 = scf.if %arg3 -> (tensor<4x4096xf32>) {
2026-02-21T08:34:23.7627985Z           %31 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4096xf32>) -> tensor<4x4096xf32>
2026-02-21T08:34:23.7628341Z           %32 = arith.subf %22, %21 : tensor<4x4096xf32>
2026-02-21T08:34:23.7628547Z           %33 = arith.mulf %31, %32 : tensor<4x4096xf32>
2026-02-21T08:34:23.7628748Z           %34 = arith.addf %33, %cst : tensor<4x4096xf32>
2026-02-21T08:34:23.7628949Z           scf.yield %34 : tensor<4x4096xf32>
2026-02-21T08:34:23.7629123Z         } else {
2026-02-21T08:34:23.7629313Z           %31 = tt.splat %arg4 : f32 -> tensor<4x4096xf32>
2026-02-21T08:34:23.7629605Z           %32 = arith.cmpf ogt, %22, %31 : tensor<4x4096xf32>
2026-02-21T08:34:23.7629828Z           %33 = arith.cmpf une, %22, %22 : tensor<4x4096xf32>
2026-02-21T08:34:23.7630047Z           %34 = arith.ori %32, %33 : tensor<4x4096xi1>
2026-02-21T08:34:23.7630289Z           %35 = arith.select %34, %22, %31 : tensor<4x4096xi1>, tensor<4x4096xf32>
2026-02-21T08:34:23.7630522Z           %36 = math.log %35 : tensor<4x4096xf32>
2026-02-21T08:34:23.7630725Z           %37 = arith.subf %36, %21 : tensor<4x4096xf32>
2026-02-21T08:34:23.7630920Z           %38 = arith.mulf %22, %37 : tensor<4x4096xf32>
2026-02-21T08:34:23.7631131Z           %39 = arith.addf %38, %cst : tensor<4x4096xf32>
2026-02-21T08:34:23.7631323Z           scf.yield %39 : tensor<4x4096xf32>
2026-02-21T08:34:23.7631498Z         }
2026-02-21T08:34:23.7631641Z         %24 = arith.addf %18, %23 : tensor<4x4096xf32>
2026-02-21T08:34:23.7631828Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:34:23.7632063Z         %25 = arith.muli %c4096_i32, %c2_i32 : i32
2026-02-21T08:34:23.7632245Z         %26 = arith.addi %arg6, %25 : i32
2026-02-21T08:34:23.7632516Z         %27 = tt.descriptor_load %0[%3, %26] : !tt.tensordesc<tensor<4x4096xf32>> -> tensor<4x4096xf32>
2026-02-21T08:34:23.7632862Z         %28 = tt.descriptor_load %1[%3, %26] : !tt.tensordesc<tensor<4x4096xf32>> -> tensor<4x4096xf32>
2026-02-21T08:34:23.7633145Z         %29 = scf.if %arg3 -> (tensor<4x4096xf32>) {
2026-02-21T08:34:23.7633503Z           %31 = tt.extern_elementwise %28 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4096xf32>) -> tensor<4x4096xf32>
2026-02-21T08:34:23.7633860Z           %32 = arith.subf %28, %27 : tensor<4x4096xf32>
2026-02-21T08:34:23.7634065Z           %33 = arith.mulf %31, %32 : tensor<4x4096xf32>
2026-02-21T08:34:23.7634266Z           %34 = arith.addf %33, %cst : tensor<4x4096xf32>
2026-02-21T08:34:23.7634466Z           scf.yield %34 : tensor<4x4096xf32>
2026-02-21T08:34:23.7634634Z         } else {
2026-02-21T08:34:23.7634801Z           %31 = tt.splat %arg4 : f32 -> tensor<4x4096xf32>
2026-02-21T08:34:23.7635021Z           %32 = arith.cmpf ogt, %28, %31 : tensor<4x4096xf32>
2026-02-21T08:34:23.7635234Z           %33 = arith.cmpf une, %28, %28 : tensor<4x4096xf32>
2026-02-21T08:34:23.7635444Z           %34 = arith.ori %32, %33 : tensor<4x4096xi1>
2026-02-21T08:34:23.7635671Z           %35 = arith.select %34, %28, %31 : tensor<4x4096xi1>, tensor<4x4096xf32>
2026-02-21T08:34:23.7635911Z           %36 = math.log %35 : tensor<4x4096xf32>
2026-02-21T08:34:23.7636099Z           %37 = arith.subf %36, %27 : tensor<4x4096xf32>
2026-02-21T08:34:23.7636298Z           %38 = arith.mulf %28, %37 : tensor<4x4096xf32>
2026-02-21T08:34:23.7636498Z           %39 = arith.addf %38, %cst : tensor<4x4096xf32>
2026-02-21T08:34:23.7636686Z           scf.yield %39 : tensor<4x4096xf32>
2026-02-21T08:34:23.7636855Z         }
2026-02-21T08:34:23.7636989Z         %30 = arith.addf %24, %29 : tensor<4x4096xf32>
2026-02-21T08:34:23.7637182Z         scf.yield %30 : tensor<4x4096xf32>
2026-02-21T08:34:23.7637360Z       } {tt.num_stages = 1 : i32}
2026-02-21T08:34:23.7637707Z       %8 = tt.descriptor_load %0[%3, %c61440_i32] : !tt.tensordesc<tensor<4x4096xf32>> -> tensor<4x4096xf32>
2026-02-21T08:34:23.7638097Z       %9 = tt.descriptor_load %1[%3, %c61440_i32] : !tt.tensordesc<tensor<4x4096xf32>> -> tensor<4x4096xf32>
2026-02-21T08:34:23.7638389Z       %10 = scf.if %arg3 -> (tensor<4x4096xf32>) {
2026-02-21T08:34:23.7638751Z         %15 = tt.extern_elementwise %9 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4096xf32>) -> tensor<4x4096xf32>
2026-02-21T08:34:23.7639109Z         %16 = arith.subf %9, %8 : tensor<4x4096xf32>
2026-02-21T08:34:23.7639320Z         %17 = arith.mulf %15, %16 : tensor<4x4096xf32>
2026-02-21T08:34:23.7639526Z         %18 = arith.addf %17, %cst : tensor<4x4096xf32>
2026-02-21T08:34:23.7639730Z         scf.yield %18 : tensor<4x4096xf32>
2026-02-21T08:34:23.7639908Z       } else {
2026-02-21T08:34:23.7640118Z         %15 = tt.splat %arg4 : f32 -> tensor<4x4096xf32>
2026-02-21T08:34:23.7640337Z         %16 = arith.cmpf ogt, %9, %15 : tensor<4x4096xf32>
2026-02-21T08:34:23.7640547Z         %17 = arith.cmpf une, %9, %9 : tensor<4x4096xf32>
2026-02-21T08:34:23.7640754Z         %18 = arith.ori %16, %17 : tensor<4x4096xi1>
2026-02-21T08:34:23.7640984Z         %19 = arith.select %18, %9, %15 : tensor<4x4096xi1>, tensor<4x4096xf32>
2026-02-21T08:34:23.7641229Z         %20 = math.log %19 : tensor<4x4096xf32>
2026-02-21T08:34:23.7641431Z         %21 = arith.subf %20, %8 : tensor<4x4096xf32>
2026-02-21T08:34:23.7641625Z         %22 = arith.mulf %9, %21 : tensor<4x4096xf32>
2026-02-21T08:34:23.7641829Z         %23 = arith.addf %22, %cst : tensor<4x4096xf32>
2026-02-21T08:34:23.7642045Z         scf.yield %23 : tensor<4x4096xf32>
2026-02-21T08:34:23.7642214Z       }
2026-02-21T08:34:23.7642350Z       %11 = arith.addf %7, %10 : tensor<4x4096xf32>
2026-02-21T08:34:23.7642553Z       %12 = "tt.reduce"(%11) <{axis = 1 : i32}> ({
2026-02-21T08:34:23.7642739Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:34:23.7642920Z         %15 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:34:23.7643102Z         tt.reduce.return %15 : f32
2026-02-21T08:34:23.7643281Z       }) : (tensor<4x4096xf32>) -> tensor<4xf32>
2026-02-21T08:34:23.7643510Z       %13 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:34:23.7643761Z       %14 = tt.addptr %13, %6 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:34:23.7643993Z       tt.store %14, %12 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:34:23.7644248Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:34:23.7644496Z     tt.return
2026-02-21T08:34:23.7644618Z   }
2026-02-21T08:34:23.7644737Z }
2026-02-21T08:34:23.7644804Z 
2026-02-21T08:34:23.7644860Z {-#
2026-02-21T08:34:23.7644981Z   external_resources: {
2026-02-21T08:34:23.7645135Z     mlir_reproducer: {
2026-02-21T08:34:23.7649363Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:34:23.7653796Z       disable_threading: false,
2026-02-21T08:34:23.7653960Z       verify_each: true
2026-02-21T08:34:23.7654120Z     }
2026-02-21T08:34:23.7654256Z   }
2026-02-21T08:34:23.7654395Z #-}
2026-02-21T08:34:23.7655001Z /tmp/torchinductor_root/by/cbyrkaomiylxwyh72dcm7phdezoaifof2wfpdtzmnlfnjxf2eehc.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:34:23.7656282Z /tmp/torchinductor_root/by/cbyrkaomiylxwyh72dcm7phdezoaifof2wfpdtzmnlfnjxf2eehc.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:34:23.7657329Z [44s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:34:23.7658491Z Config: @helion.kernel(config=helion.Config(block_sizes=[4096, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=256, num_sm_multiplier=64, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[4, 4], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:34:23.7659513Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:34:23.7659798Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:34:23.7891157Z module attributes {ttg.maxnreg = 128 : i32} {
2026-02-21T08:34:23.7896135Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:34:23.7900288Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:34:23.7904212Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:34:23.7906384Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:34:23.7906702Z     %cst = arith.constant dense<0.000000e+00> : tensor<64x1024xf32>
2026-02-21T08:34:23.7906948Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:34:23.7911392Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:34:23.7915356Z     %c65536_i32 = arith.constant 65536 : i32
2026-02-21T08:34:23.7919492Z     %c65536_i64 = arith.constant 65536 : i64
2026-02-21T08:34:23.7921198Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:34:23.7921579Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c65536_i32], [%c65536_i64, %c1_i64] : <f32>, <tensor<64x1024xf32>>
2026-02-21T08:34:23.7922084Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c65536_i32], [%c65536_i64, %c1_i64] : <f32>, <tensor<64x1024xf32>>
2026-02-21T08:34:23.7922411Z     %2 = tt.get_program_id x : i32
2026-02-21T08:34:23.7922605Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:34:23.7922790Z     %4 = arith.minsi %3, %c64_i32 : i32
2026-02-21T08:34:23.7923002Z     scf.for %arg5 = %2 to %4 step %c1_i32  : i32 {
2026-02-21T08:34:23.7923202Z       %5 = arith.muli %arg5, %c64_i32 : i32
2026-02-21T08:34:23.7923433Z       %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:34:23.7923672Z       %7 = tt.splat %5 : i32 -> tensor<64xi32>
2026-02-21T08:34:23.7923879Z       %8 = arith.addi %7, %6 : tensor<64xi32>
2026-02-21T08:34:23.7924426Z       %9 = scf.for %arg6 = %c0_i32 to %c65536_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<64x1024xf32>)  : i32 {
2026-02-21T08:34:23.7924842Z         %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc<tensor<64x1024xf32>> -> tensor<64x1024xf32>
2026-02-21T08:34:23.7925216Z         %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc<tensor<64x1024xf32>> -> tensor<64x1024xf32>
2026-02-21T08:34:23.7925506Z         %15 = scf.if %arg3 -> (tensor<64x1024xf32>) {
2026-02-21T08:34:23.7925885Z           %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x1024xf32>) -> tensor<64x1024xf32>
2026-02-21T08:34:23.7926261Z           %18 = arith.subf %14, %13 : tensor<64x1024xf32>
2026-02-21T08:34:23.7926476Z           %19 = arith.mulf %17, %18 : tensor<64x1024xf32>
2026-02-21T08:34:23.7926690Z           %20 = arith.addf %19, %cst : tensor<64x1024xf32>
2026-02-21T08:34:23.7926961Z           scf.yield %20 : tensor<64x1024xf32>
2026-02-21T08:34:23.7927144Z         } else {
2026-02-21T08:34:23.7927309Z           %17 = tt.splat %arg4 : f32 -> tensor<64x1024xf32>
2026-02-21T08:34:23.7927537Z           %18 = arith.cmpf ogt, %14, %17 : tensor<64x1024xf32>
2026-02-21T08:34:23.7927756Z           %19 = arith.cmpf une, %14, %14 : tensor<64x1024xf32>
2026-02-21T08:34:23.7927974Z           %20 = arith.ori %18, %19 : tensor<64x1024xi1>
2026-02-21T08:34:23.7928221Z           %21 = arith.select %20, %14, %17 : tensor<64x1024xi1>, tensor<64x1024xf32>
2026-02-21T08:34:23.7928479Z           %22 = math.log %21 : tensor<64x1024xf32>
2026-02-21T08:34:23.7928678Z           %23 = arith.subf %22, %13 : tensor<64x1024xf32>
2026-02-21T08:34:23.7928872Z           %24 = arith.mulf %14, %23 : tensor<64x1024xf32>
2026-02-21T08:34:23.7929077Z           %25 = arith.addf %24, %cst : tensor<64x1024xf32>
2026-02-21T08:34:23.7929275Z           scf.yield %25 : tensor<64x1024xf32>
2026-02-21T08:34:23.7929438Z         }
2026-02-21T08:34:23.7929590Z         %16 = arith.addf %arg7, %15 : tensor<64x1024xf32>
2026-02-21T08:34:23.7929782Z         scf.yield %16 : tensor<64x1024xf32>
2026-02-21T08:34:23.7930061Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:34:23.7930345Z       %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({
2026-02-21T08:34:23.7930535Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:34:23.7930712Z         %13 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:34:23.7930887Z         tt.reduce.return %13 : f32
2026-02-21T08:34:23.7931071Z       }) : (tensor<64x1024xf32>) -> tensor<64xf32>
2026-02-21T08:34:23.7931295Z       %11 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
2026-02-21T08:34:23.7931552Z       %12 = tt.addptr %11, %8 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
2026-02-21T08:34:23.7931773Z       tt.store %12, %10 : tensor<64x!tt.ptr<f32>>
2026-02-21T08:34:23.7932017Z     } {tt.loop_unroll_factor = 1 : i32}
2026-02-21T08:34:23.7932180Z     tt.return
2026-02-21T08:34:23.7932314Z   }
2026-02-21T08:34:23.7932434Z }
2026-02-21T08:34:23.7932500Z 
2026-02-21T08:34:23.7932551Z {-#
2026-02-21T08:34:23.7932684Z   external_resources: {
2026-02-21T08:34:23.7932835Z     mlir_reproducer: {
2026-02-21T08:34:23.7937081Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=1}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=1}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=1}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:34:23.7941531Z       disable_threading: false,
2026-02-21T08:34:23.7941717Z       verify_each: true
2026-02-21T08:34:23.7941900Z     }
2026-02-21T08:34:23.7942012Z   }
2026-02-21T08:34:23.7942127Z #-}
2026-02-21T08:34:23.7942545Z /tmp/torchinductor_root/4b/c4bilkrdypptccswjo2372mrikzya4kjnax776j2kcpz7owgfsfp.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:34:23.7943693Z /tmp/torchinductor_root/4b/c4bilkrdypptccswjo2372mrikzya4kjnax776j2kcpz7owgfsfp.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:34:23.7944643Z [44s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:34:23.7945725Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=2, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, False], range_num_stages=[0, 2], range_unroll_factors=[1, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:34:23.7946707Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:34:23.7946958Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:34:24.7708492Z module attributes {ttg.maxnreg = 32 : i32} {
2026-02-21T08:34:24.7710190Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:34:24.7710880Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:34:24.7711073Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:34:24.7711300Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:34:24.7711497Z     %c2368_i32 = arith.constant 2368 : i32
2026-02-21T08:34:24.7711712Z     %cst = arith.constant dense<65536> : tensor<4x1xi32>
2026-02-21T08:34:24.7712204Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<4x16xf32>
2026-02-21T08:34:24.7712435Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:34:24.7712617Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:34:24.7712801Z     %c65536_i32 = arith.constant 65536 : i32
2026-02-21T08:34:24.7712991Z     %c65536_i64 = arith.constant 65536 : i64
2026-02-21T08:34:24.7713164Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:34:24.7713482Z     %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c65536_i32], [%c65536_i64, %c1_i64] : <f32>, <tensor<4x16xf32>>
2026-02-21T08:34:24.7713798Z     %1 = tt.get_program_id x : i32
2026-02-21T08:34:24.7713976Z     %2 = arith.subi %c1024_i32, %1 : i32
2026-02-21T08:34:24.7714148Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:34:24.7714337Z     %3 = arith.subi %c2368_i32, %c1_i32 : i32
2026-02-21T08:34:24.7714885Z     %4 = arith.addi %2, %3 : i32
2026-02-21T08:34:24.7715058Z     %5 = arith.divui %4, %c2368_i32 : i32
2026-02-21T08:34:24.7715246Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T08:34:24.7715421Z     %6 = arith.remsi %5, %c3_i32 : i32
2026-02-21T08:34:24.7715603Z     %7 = arith.subi %5, %6 : i32
2026-02-21T08:34:24.7715767Z     %8 = arith.muli %7, %c2368_i32 : i32
2026-02-21T08:34:24.7715950Z     %9 = arith.addi %1, %8 : i32
2026-02-21T08:34:24.7716121Z     %10 = arith.muli %c2368_i32, %c3_i32 : i32
2026-02-21T08:34:24.7716326Z     scf.for %arg5 = %1 to %9 step %10  : i32 {
2026-02-21T08:34:24.7716525Z       %11 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:34:24.7716749Z       %12 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:34:24.7717007Z       %13 = tt.splat %11 : i32 -> tensor<4xi32>
2026-02-21T08:34:24.7717198Z       %14 = arith.addi %13, %12 : tensor<4xi32>
2026-02-21T08:34:24.7717617Z       %15 = scf.for %arg6 = %c0_i32 to %c65536_i32 step %c16_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x16xf32>)  : i32 {
2026-02-21T08:34:24.7717973Z         %39 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:34:24.7718227Z         %40 = tt.splat %arg6 : i32 -> tensor<16xi32>
2026-02-21T08:34:24.7718432Z         %41 = arith.addi %40, %39 : tensor<16xi32>
2026-02-21T08:34:24.7718678Z         %42 = tt.expand_dims %14 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:34:24.7718940Z         %43 = arith.muli %42, %cst : tensor<4x1xi32>
2026-02-21T08:34:24.7719186Z         %44 = tt.expand_dims %41 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32>
2026-02-21T08:34:24.7719481Z         %45 = tt.broadcast %43 : tensor<4x1xi32> -> tensor<4x16xi32>
2026-02-21T08:34:24.7719735Z         %46 = tt.broadcast %44 : tensor<1x16xi32> -> tensor<4x16xi32>
2026-02-21T08:34:24.7719971Z         %47 = arith.addi %45, %46 : tensor<4x16xi32>
2026-02-21T08:34:24.7720203Z         %48 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x16x!tt.ptr<f32>>
2026-02-21T08:34:24.7720508Z         %49 = tt.addptr %48, %47 : tensor<4x16x!tt.ptr<f32>>, tensor<4x16xi32>
2026-02-21T08:34:24.7720787Z         %50 = tt.load %49 evictionPolicy = evict_last : tensor<4x16x!tt.ptr<f32>>
2026-02-21T08:34:24.7721124Z         %51 = tt.descriptor_load %0[%11, %arg6] : !tt.tensordesc<tensor<4x16xf32>> -> tensor<4x16xf32>
2026-02-21T08:34:24.7721410Z         %52 = scf.if %arg3 -> (tensor<4x16xf32>) {
2026-02-21T08:34:24.7721760Z           %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x16xf32>) -> tensor<4x16xf32>
2026-02-21T08:34:24.7722158Z           %55 = arith.subf %51, %50 : tensor<4x16xf32>
2026-02-21T08:34:24.7722354Z           %56 = arith.mulf %54, %55 : tensor<4x16xf32>
2026-02-21T08:34:24.7722568Z           %57 = arith.addf %56, %cst_0 : tensor<4x16xf32>
2026-02-21T08:34:24.7722765Z           scf.yield %57 : tensor<4x16xf32>
2026-02-21T08:34:24.7722941Z         } else {
2026-02-21T08:34:24.7723110Z           %54 = tt.splat %arg4 : f32 -> tensor<4x16xf32>
2026-02-21T08:34:24.7723321Z           %55 = arith.cmpf ogt, %51, %54 : tensor<4x16xf32>
2026-02-21T08:34:24.7723544Z           %56 = arith.cmpf une, %51, %51 : tensor<4x16xf32>
2026-02-21T08:34:24.7723745Z           %57 = arith.ori %55, %56 : tensor<4x16xi1>
2026-02-21T08:34:24.7723983Z           %58 = arith.select %57, %51, %54 : tensor<4x16xi1>, tensor<4x16xf32>
2026-02-21T08:34:24.7724217Z           %59 = math.log %58 : tensor<4x16xf32>
2026-02-21T08:34:24.7724412Z           %60 = arith.subf %59, %50 : tensor<4x16xf32>
2026-02-21T08:34:24.7724610Z           %61 = arith.mulf %51, %60 : tensor<4x16xf32>
2026-02-21T08:34:24.7724808Z           %62 = arith.addf %61, %cst_0 : tensor<4x16xf32>
2026-02-21T08:34:24.7725005Z           scf.yield %62 : tensor<4x16xf32>
2026-02-21T08:34:24.7725167Z         }
2026-02-21T08:34:24.7725312Z         %53 = arith.addf %arg7, %52 : tensor<4x16xf32>
2026-02-21T08:34:24.7725500Z         scf.yield %53 : tensor<4x16xf32>
2026-02-21T08:34:24.7725819Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:34:24.7726073Z       %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({
2026-02-21T08:34:24.7726260Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:34:24.7726443Z         %39 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:34:24.7726629Z         tt.reduce.return %39 : f32
2026-02-21T08:34:24.7726821Z       }) : (tensor<4x16xf32>) -> tensor<4xf32>
2026-02-21T08:34:24.7727047Z       %17 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:34:24.7727312Z       %18 = tt.addptr %17, %14 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:34:24.7727545Z       tt.store %18, %16 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:34:24.7727748Z       %c1_i32_1 = arith.constant 1 : i32
2026-02-21T08:34:24.7727945Z       %19 = arith.muli %c2368_i32, %c1_i32_1 : i32
2026-02-21T08:34:24.7728195Z       %20 = arith.addi %arg5, %19 : i32
2026-02-21T08:34:24.7728387Z       %21 = arith.muli %20, %c4_i32 : i32
2026-02-21T08:34:24.7728608Z       %22 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:34:24.7728855Z       %23 = tt.splat %21 : i32 -> tensor<4xi32>
2026-02-21T08:34:24.7729047Z       %24 = arith.addi %23, %22 : tensor<4xi32>
2026-02-21T08:34:24.7729369Z       %25 = scf.for %arg6 = %c0_i32 to %c65536_i32 step %c16_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x16xf32>)  : i32 {
2026-02-21T08:34:24.7729744Z         %39 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:34:24.7729995Z         %40 = tt.splat %arg6 : i32 -> tensor<16xi32>
2026-02-21T08:34:24.7730207Z         %41 = arith.addi %40, %39 : tensor<16xi32>
2026-02-21T08:34:24.7730455Z         %42 = tt.expand_dims %24 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:34:24.7730723Z         %43 = arith.muli %42, %cst : tensor<4x1xi32>
2026-02-21T08:34:24.7730978Z         %44 = tt.expand_dims %41 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32>
2026-02-21T08:34:24.7731284Z         %45 = tt.broadcast %43 : tensor<4x1xi32> -> tensor<4x16xi32>
2026-02-21T08:34:24.7731554Z         %46 = tt.broadcast %44 : tensor<1x16xi32> -> tensor<4x16xi32>
2026-02-21T08:34:24.7731781Z         %47 = arith.addi %45, %46 : tensor<4x16xi32>
2026-02-21T08:34:24.7732049Z         %48 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x16x!tt.ptr<f32>>
2026-02-21T08:34:24.7732318Z         %49 = tt.addptr %48, %47 : tensor<4x16x!tt.ptr<f32>>, tensor<4x16xi32>
2026-02-21T08:34:24.7732615Z         %50 = tt.load %49 evictionPolicy = evict_last : tensor<4x16x!tt.ptr<f32>>
2026-02-21T08:34:24.7732957Z         %51 = tt.descriptor_load %0[%21, %arg6] : !tt.tensordesc<tensor<4x16xf32>> -> tensor<4x16xf32>
2026-02-21T08:34:24.7733257Z         %52 = scf.if %arg3 -> (tensor<4x16xf32>) {
2026-02-21T08:34:24.7733632Z           %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x16xf32>) -> tensor<4x16xf32>
2026-02-21T08:34:24.7734007Z           %55 = arith.subf %51, %50 : tensor<4x16xf32>
2026-02-21T08:34:24.7734220Z           %56 = arith.mulf %54, %55 : tensor<4x16xf32>
2026-02-21T08:34:24.7734431Z           %57 = arith.addf %56, %cst_0 : tensor<4x16xf32>
2026-02-21T08:34:24.7734641Z           scf.yield %57 : tensor<4x16xf32>
2026-02-21T08:34:24.7734812Z         } else {
2026-02-21T08:34:24.7734983Z           %54 = tt.splat %arg4 : f32 -> tensor<4x16xf32>
2026-02-21T08:34:24.7735212Z           %55 = arith.cmpf ogt, %51, %54 : tensor<4x16xf32>
2026-02-21T08:34:24.7735423Z           %56 = arith.cmpf une, %51, %51 : tensor<4x16xf32>
2026-02-21T08:34:24.7735632Z           %57 = arith.ori %55, %56 : tensor<4x16xi1>
2026-02-21T08:34:24.7735860Z           %58 = arith.select %57, %51, %54 : tensor<4x16xi1>, tensor<4x16xf32>
2026-02-21T08:34:24.7736095Z           %59 = math.log %58 : tensor<4x16xf32>
2026-02-21T08:34:24.7736280Z           %60 = arith.subf %59, %50 : tensor<4x16xf32>
2026-02-21T08:34:24.7736479Z           %61 = arith.mulf %51, %60 : tensor<4x16xf32>
2026-02-21T08:34:24.7736745Z           %62 = arith.addf %61, %cst_0 : tensor<4x16xf32>
2026-02-21T08:34:24.7736931Z           scf.yield %62 : tensor<4x16xf32>
2026-02-21T08:34:24.7737099Z         }
2026-02-21T08:34:24.7737237Z         %53 = arith.addf %arg7, %52 : tensor<4x16xf32>
2026-02-21T08:34:24.7737427Z         scf.yield %53 : tensor<4x16xf32>
2026-02-21T08:34:24.7737663Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:34:24.7737923Z       %26 = "tt.reduce"(%25) <{axis = 1 : i32}> ({
2026-02-21T08:34:24.7738110Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:34:24.7738279Z         %39 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:34:24.7738463Z         tt.reduce.return %39 : f32
2026-02-21T08:34:24.7738641Z       }) : (tensor<4x16xf32>) -> tensor<4xf32>
2026-02-21T08:34:24.7738863Z       %27 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:34:24.7739167Z       %28 = tt.addptr %27, %24 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:34:24.7739400Z       tt.store %28, %26 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:34:24.7739588Z       %c2_i32 = arith.constant 2 : i32
2026-02-21T08:34:24.7739770Z       %29 = arith.muli %c2368_i32, %c2_i32 : i32
2026-02-21T08:34:24.7739957Z       %30 = arith.addi %arg5, %29 : i32
2026-02-21T08:34:24.7740126Z       %31 = arith.muli %30, %c4_i32 : i32
2026-02-21T08:34:24.7740345Z       %32 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:34:24.7740576Z       %33 = tt.splat %31 : i32 -> tensor<4xi32>
2026-02-21T08:34:24.7740768Z       %34 = arith.addi %33, %32 : tensor<4xi32>
2026-02-21T08:34:24.7741067Z       %35 = scf.for %arg6 = %c0_i32 to %c65536_i32 step %c16_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x16xf32>)  : i32 {
2026-02-21T08:34:24.7741423Z         %39 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:34:24.7741673Z         %40 = tt.splat %arg6 : i32 -> tensor<16xi32>
2026-02-21T08:34:24.7741925Z         %41 = arith.addi %40, %39 : tensor<16xi32>
2026-02-21T08:34:24.7742177Z         %42 = tt.expand_dims %34 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:34:24.7742421Z         %43 = arith.muli %42, %cst : tensor<4x1xi32>
2026-02-21T08:34:24.7742662Z         %44 = tt.expand_dims %41 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32>
2026-02-21T08:34:24.7742927Z         %45 = tt.broadcast %43 : tensor<4x1xi32> -> tensor<4x16xi32>
2026-02-21T08:34:24.7743173Z         %46 = tt.broadcast %44 : tensor<1x16xi32> -> tensor<4x16xi32>
2026-02-21T08:34:24.7743393Z         %47 = arith.addi %45, %46 : tensor<4x16xi32>
2026-02-21T08:34:24.7743610Z         %48 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x16x!tt.ptr<f32>>
2026-02-21T08:34:24.7743871Z         %49 = tt.addptr %48, %47 : tensor<4x16x!tt.ptr<f32>>, tensor<4x16xi32>
2026-02-21T08:34:24.7744143Z         %50 = tt.load %49 evictionPolicy = evict_last : tensor<4x16x!tt.ptr<f32>>
2026-02-21T08:34:24.7744470Z         %51 = tt.descriptor_load %0[%31, %arg6] : !tt.tensordesc<tensor<4x16xf32>> -> tensor<4x16xf32>
2026-02-21T08:34:24.7744756Z         %52 = scf.if %arg3 -> (tensor<4x16xf32>) {
2026-02-21T08:34:24.7745106Z           %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x16xf32>) -> tensor<4x16xf32>
2026-02-21T08:34:24.7745465Z           %55 = arith.subf %51, %50 : tensor<4x16xf32>
2026-02-21T08:34:24.7745662Z           %56 = arith.mulf %54, %55 : tensor<4x16xf32>
2026-02-21T08:34:24.7745871Z           %57 = arith.addf %56, %cst_0 : tensor<4x16xf32>
2026-02-21T08:34:24.7746063Z           scf.yield %57 : tensor<4x16xf32>
2026-02-21T08:34:24.7746236Z         } else {
2026-02-21T08:34:24.7746399Z           %54 = tt.splat %arg4 : f32 -> tensor<4x16xf32>
2026-02-21T08:34:24.7746609Z           %55 = arith.cmpf ogt, %51, %54 : tensor<4x16xf32>
2026-02-21T08:34:24.7746826Z           %56 = arith.cmpf une, %51, %51 : tensor<4x16xf32>
2026-02-21T08:34:24.7747028Z           %57 = arith.ori %55, %56 : tensor<4x16xi1>
2026-02-21T08:34:24.7747319Z           %58 = arith.select %57, %51, %54 : tensor<4x16xi1>, tensor<4x16xf32>
2026-02-21T08:34:24.7747547Z           %59 = math.log %58 : tensor<4x16xf32>
2026-02-21T08:34:24.7747744Z           %60 = arith.subf %59, %50 : tensor<4x16xf32>
2026-02-21T08:34:24.7747941Z           %61 = arith.mulf %51, %60 : tensor<4x16xf32>
2026-02-21T08:34:24.7748138Z           %62 = arith.addf %61, %cst_0 : tensor<4x16xf32>
2026-02-21T08:34:24.7748333Z           scf.yield %62 : tensor<4x16xf32>
2026-02-21T08:34:24.7748491Z         }
2026-02-21T08:34:24.7748636Z         %53 = arith.addf %arg7, %52 : tensor<4x16xf32>
2026-02-21T08:34:24.7748818Z         scf.yield %53 : tensor<4x16xf32>
2026-02-21T08:34:24.7749063Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:34:24.7749317Z       %36 = "tt.reduce"(%35) <{axis = 1 : i32}> ({
2026-02-21T08:34:24.7749506Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:34:24.7749764Z         %39 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:34:24.7749945Z         tt.reduce.return %39 : f32
2026-02-21T08:34:24.7750125Z       }) : (tensor<4x16xf32>) -> tensor<4xf32>
2026-02-21T08:34:24.7750342Z       %37 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:34:24.7750595Z       %38 = tt.addptr %37, %34 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:34:24.7750816Z       tt.store %38, %36 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:34:24.7751005Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:34:24.7751204Z     scf.for %arg5 = %9 to %c1024_i32 step %c2368_i32  : i32 {
2026-02-21T08:34:24.7751424Z       %11 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:34:24.7751678Z       %12 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:34:24.7751952Z       %13 = tt.splat %11 : i32 -> tensor<4xi32>
2026-02-21T08:34:24.7752137Z       %14 = arith.addi %13, %12 : tensor<4xi32>
2026-02-21T08:34:24.7752440Z       %15 = scf.for %arg6 = %c0_i32 to %c65536_i32 step %c16_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x16xf32>)  : i32 {
2026-02-21T08:34:24.7752780Z         %19 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:34:24.7753025Z         %20 = tt.splat %arg6 : i32 -> tensor<16xi32>
2026-02-21T08:34:24.7753225Z         %21 = arith.addi %20, %19 : tensor<16xi32>
2026-02-21T08:34:24.7753457Z         %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32>
2026-02-21T08:34:24.7753707Z         %23 = arith.muli %22, %cst : tensor<4x1xi32>
2026-02-21T08:34:24.7753943Z         %24 = tt.expand_dims %21 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32>
2026-02-21T08:34:24.7754219Z         %25 = tt.broadcast %23 : tensor<4x1xi32> -> tensor<4x16xi32>
2026-02-21T08:34:24.7754463Z         %26 = tt.broadcast %24 : tensor<1x16xi32> -> tensor<4x16xi32>
2026-02-21T08:34:24.7754693Z         %27 = arith.addi %25, %26 : tensor<4x16xi32>
2026-02-21T08:34:24.7754924Z         %28 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<4x16x!tt.ptr<f32>>
2026-02-21T08:34:24.7755181Z         %29 = tt.addptr %28, %27 : tensor<4x16x!tt.ptr<f32>>, tensor<4x16xi32>
2026-02-21T08:34:24.7755468Z         %30 = tt.load %29 evictionPolicy = evict_last : tensor<4x16x!tt.ptr<f32>>
2026-02-21T08:34:24.7755787Z         %31 = tt.descriptor_load %0[%11, %arg6] : !tt.tensordesc<tensor<4x16xf32>> -> tensor<4x16xf32>
2026-02-21T08:34:24.7756071Z         %32 = scf.if %arg3 -> (tensor<4x16xf32>) {
2026-02-21T08:34:24.7756417Z           %34 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x16xf32>) -> tensor<4x16xf32>
2026-02-21T08:34:24.7756776Z           %35 = arith.subf %31, %30 : tensor<4x16xf32>
2026-02-21T08:34:24.7756979Z           %36 = arith.mulf %34, %35 : tensor<4x16xf32>
2026-02-21T08:34:24.7757176Z           %37 = arith.addf %36, %cst_0 : tensor<4x16xf32>
2026-02-21T08:34:24.7757374Z           scf.yield %37 : tensor<4x16xf32>
2026-02-21T08:34:24.7757535Z         } else {
2026-02-21T08:34:24.7757700Z           %34 = tt.splat %arg4 : f32 -> tensor<4x16xf32>
2026-02-21T08:34:24.7757964Z           %35 = arith.cmpf ogt, %31, %34 : tensor<4x16xf32>
2026-02-21T08:34:24.7758180Z           %36 = arith.cmpf une, %31, %31 : tensor<4x16xf32>
2026-02-21T08:34:24.7758387Z           %37 = arith.ori %35, %36 : tensor<4x16xi1>
2026-02-21T08:34:24.7758612Z           %38 = arith.select %37, %31, %34 : tensor<4x16xi1>, tensor<4x16xf32>
2026-02-21T08:34:24.7758863Z           %39 = math.log %38 : tensor<4x16xf32>
2026-02-21T08:34:24.7759046Z           %40 = arith.subf %39, %30 : tensor<4x16xf32>
2026-02-21T08:34:24.7759244Z           %41 = arith.mulf %31, %40 : tensor<4x16xf32>
2026-02-21T08:34:24.7759448Z           %42 = arith.addf %41, %cst_0 : tensor<4x16xf32>
2026-02-21T08:34:24.7759638Z           scf.yield %42 : tensor<4x16xf32>
2026-02-21T08:34:24.7759808Z         }
2026-02-21T08:34:24.7759945Z         %33 = arith.addf %arg7, %32 : tensor<4x16xf32>
2026-02-21T08:34:24.7760190Z         scf.yield %33 : tensor<4x16xf32>
2026-02-21T08:34:24.7760431Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:34:24.7760693Z       %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({
2026-02-21T08:34:24.7760880Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:34:24.7761049Z         %19 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:34:24.7761230Z         tt.reduce.return %19 : f32
2026-02-21T08:34:24.7761405Z       }) : (tensor<4x16xf32>) -> tensor<4xf32>
2026-02-21T08:34:24.7761625Z       %17 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:34:24.7761919Z       %18 = tt.addptr %17, %14 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:34:24.7762156Z       tt.store %18, %16 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:34:24.7762339Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:34:24.7762501Z     tt.return
2026-02-21T08:34:24.7762635Z   }
2026-02-21T08:34:24.7762754Z }
2026-02-21T08:34:24.7762825Z 
2026-02-21T08:34:24.7762885Z {-#
2026-02-21T08:34:24.7763010Z   external_resources: {
2026-02-21T08:34:24.7763172Z     mlir_reproducer: {
2026-02-21T08:34:24.7767372Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:34:24.7771644Z       disable_threading: false,
2026-02-21T08:34:24.7771835Z       verify_each: true
2026-02-21T08:34:24.7772025Z     }
2026-02-21T08:34:24.7772170Z   }
2026-02-21T08:34:24.7772373Z #-}
2026-02-21T08:34:24.7772866Z /tmp/torchinductor_root/hg/chgh4z3oq5aytli2qovqw2k2dzxgymzdxa2ewtfhxxv62w56enwr.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:34:24.7774123Z /tmp/torchinductor_root/hg/chgh4z3oq5aytli2qovqw2k2dzxgymzdxa2ewtfhxxv62w56enwr.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:34:24.7775156Z [45s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:34:24.7776416Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 4], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[3, 2], range_unroll_factors=[3, 1], range_warp_specializes=[False, True]), static_shapes=True)
2026-02-21T08:34:24.7777441Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:34:24.7777686Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:34:30.0213438Z module attributes {ttg.maxnreg = 256 : i32} {
2026-02-21T08:34:30.0218518Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:34:30.0219145Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:34:30.0219353Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:34:30.0219544Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:34:30.0219710Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:34:30.0219954Z     %cst = arith.constant dense<65536> : tensor<1024x1xi32>
2026-02-21T08:34:30.0220225Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<1024x64xf32>
2026-02-21T08:34:30.0220462Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:34:30.0220639Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:34:30.0220826Z     %c65536_i32 = arith.constant 65536 : i32
2026-02-21T08:34:30.0221011Z     %c65536_i64 = arith.constant 65536 : i64
2026-02-21T08:34:30.0221184Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:34:30.0221499Z     %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c65536_i32], [%c65536_i64, %c1_i64] : <f32>, <tensor<1024x64xf32>>
2026-02-21T08:34:30.0221808Z     %1 = tt.get_program_id x : i32
2026-02-21T08:34:30.0222231Z     %2 = arith.addi %1, %c1_i32 : i32
2026-02-21T08:34:30.0222406Z     %3 = arith.minsi %2, %c4_i32 : i32
2026-02-21T08:34:30.0222608Z     scf.for %arg5 = %1 to %3 step %c1_i32  : i32 {
2026-02-21T08:34:30.0222810Z       %4 = arith.muli %arg5, %c1024_i32 : i32
2026-02-21T08:34:30.0223060Z       %5 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32>
2026-02-21T08:34:30.0223328Z       %6 = tt.splat %4 : i32 -> tensor<1024xi32>
2026-02-21T08:34:30.0223518Z       %7 = arith.addi %6, %5 : tensor<1024xi32>
2026-02-21T08:34:30.0223741Z       %c65472_i32 = arith.constant 65472 : i32
2026-02-21T08:34:30.0223927Z       %c192_i32 = arith.constant 192 : i32
2026-02-21T08:34:30.0224244Z       %8 = scf.for %arg6 = %c0_i32 to %c65472_i32 step %c192_i32 iter_args(%arg7 = %cst_0) -> (tensor<1024x64xf32>)  : i32 {
2026-02-21T08:34:30.0224608Z         %27 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:34:30.0224861Z         %28 = tt.splat %arg6 : i32 -> tensor<64xi32>
2026-02-21T08:34:30.0225062Z         %29 = arith.addi %28, %27 : tensor<64xi32>
2026-02-21T08:34:30.0225322Z         %30 = tt.expand_dims %7 {axis = 1 : i32} : tensor<1024xi32> -> tensor<1024x1xi32>
2026-02-21T08:34:30.0225585Z         %31 = arith.muli %30, %cst : tensor<1024x1xi32>
2026-02-21T08:34:30.0225844Z         %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
2026-02-21T08:34:30.0226490Z         %33 = tt.broadcast %31 : tensor<1024x1xi32> -> tensor<1024x64xi32>
2026-02-21T08:34:30.0226757Z         %34 = tt.broadcast %32 : tensor<1x64xi32> -> tensor<1024x64xi32>
2026-02-21T08:34:30.0226998Z         %35 = arith.addi %33, %34 : tensor<1024x64xi32>
2026-02-21T08:34:30.0227230Z         %36 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<1024x64x!tt.ptr<f32>>
2026-02-21T08:34:30.0227510Z         %37 = tt.addptr %36, %35 : tensor<1024x64x!tt.ptr<f32>>, tensor<1024x64xi32>
2026-02-21T08:34:30.0227808Z         %38 = tt.load %37 evictionPolicy = evict_first : tensor<1024x64x!tt.ptr<f32>>
2026-02-21T08:34:30.0228163Z         %39 = tt.descriptor_load %0[%4, %arg6] : !tt.tensordesc<tensor<1024x64xf32>> -> tensor<1024x64xf32>
2026-02-21T08:34:30.0228465Z         %40 = scf.if %arg3 -> (tensor<1024x64xf32>) {
2026-02-21T08:34:30.0228935Z           %76 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1024x64xf32>) -> tensor<1024x64xf32>
2026-02-21T08:34:30.0229323Z           %77 = arith.subf %39, %38 : tensor<1024x64xf32>
2026-02-21T08:34:30.0229527Z           %78 = arith.mulf %76, %77 : tensor<1024x64xf32>
2026-02-21T08:34:30.0229754Z           %79 = arith.addf %78, %cst_0 : tensor<1024x64xf32>
2026-02-21T08:34:30.0229963Z           scf.yield %79 : tensor<1024x64xf32>
2026-02-21T08:34:30.0230149Z         } else {
2026-02-21T08:34:30.0230330Z           %76 = tt.splat %arg4 : f32 -> tensor<1024x64xf32>
2026-02-21T08:34:30.0230561Z           %77 = arith.cmpf ogt, %39, %76 : tensor<1024x64xf32>
2026-02-21T08:34:30.0230801Z           %78 = arith.cmpf une, %39, %39 : tensor<1024x64xf32>
2026-02-21T08:34:30.0231023Z           %79 = arith.ori %77, %78 : tensor<1024x64xi1>
2026-02-21T08:34:30.0231277Z           %80 = arith.select %79, %39, %76 : tensor<1024x64xi1>, tensor<1024x64xf32>
2026-02-21T08:34:30.0231524Z           %81 = math.log %80 : tensor<1024x64xf32>
2026-02-21T08:34:30.0231736Z           %82 = arith.subf %81, %38 : tensor<1024x64xf32>
2026-02-21T08:34:30.0231986Z           %83 = arith.mulf %39, %82 : tensor<1024x64xf32>
2026-02-21T08:34:30.0232191Z           %84 = arith.addf %83, %cst_0 : tensor<1024x64xf32>
2026-02-21T08:34:30.0232397Z           scf.yield %84 : tensor<1024x64xf32>
2026-02-21T08:34:30.0232565Z         }
2026-02-21T08:34:30.0232722Z         %41 = arith.addf %arg7, %40 : tensor<1024x64xf32>
2026-02-21T08:34:30.0232919Z         %c1_i32_1 = arith.constant 1 : i32
2026-02-21T08:34:30.0233110Z         %42 = arith.muli %c64_i32, %c1_i32_1 : i32
2026-02-21T08:34:30.0233303Z         %43 = arith.addi %arg6, %42 : i32
2026-02-21T08:34:30.0233522Z         %44 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:34:30.0233763Z         %45 = tt.splat %43 : i32 -> tensor<64xi32>
2026-02-21T08:34:30.0233953Z         %46 = arith.addi %45, %44 : tensor<64xi32>
2026-02-21T08:34:30.0234208Z         %47 = tt.expand_dims %7 {axis = 1 : i32} : tensor<1024xi32> -> tensor<1024x1xi32>
2026-02-21T08:34:30.0234481Z         %48 = arith.muli %47, %cst : tensor<1024x1xi32>
2026-02-21T08:34:30.0234745Z         %49 = tt.expand_dims %46 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
2026-02-21T08:34:30.0235044Z         %50 = tt.broadcast %48 : tensor<1024x1xi32> -> tensor<1024x64xi32>
2026-02-21T08:34:30.0235317Z         %51 = tt.broadcast %49 : tensor<1x64xi32> -> tensor<1024x64xi32>
2026-02-21T08:34:30.0235561Z         %52 = arith.addi %50, %51 : tensor<1024x64xi32>
2026-02-21T08:34:30.0235797Z         %53 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<1024x64x!tt.ptr<f32>>
2026-02-21T08:34:30.0236086Z         %54 = tt.addptr %53, %52 : tensor<1024x64x!tt.ptr<f32>>, tensor<1024x64xi32>
2026-02-21T08:34:30.0236391Z         %55 = tt.load %54 evictionPolicy = evict_first : tensor<1024x64x!tt.ptr<f32>>
2026-02-21T08:34:30.0236752Z         %56 = tt.descriptor_load %0[%4, %43] : !tt.tensordesc<tensor<1024x64xf32>> -> tensor<1024x64xf32>
2026-02-21T08:34:30.0237164Z         %57 = scf.if %arg3 -> (tensor<1024x64xf32>) {
2026-02-21T08:34:30.0237540Z           %76 = tt.extern_elementwise %56 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1024x64xf32>) -> tensor<1024x64xf32>
2026-02-21T08:34:30.0237934Z           %77 = arith.subf %56, %55 : tensor<1024x64xf32>
2026-02-21T08:34:30.0238149Z           %78 = arith.mulf %76, %77 : tensor<1024x64xf32>
2026-02-21T08:34:30.0238376Z           %79 = arith.addf %78, %cst_0 : tensor<1024x64xf32>
2026-02-21T08:34:30.0238596Z           scf.yield %79 : tensor<1024x64xf32>
2026-02-21T08:34:30.0238778Z         } else {
2026-02-21T08:34:30.0238959Z           %76 = tt.splat %arg4 : f32 -> tensor<1024x64xf32>
2026-02-21T08:34:30.0239191Z           %77 = arith.cmpf ogt, %56, %76 : tensor<1024x64xf32>
2026-02-21T08:34:30.0239428Z           %78 = arith.cmpf une, %56, %56 : tensor<1024x64xf32>
2026-02-21T08:34:30.0239648Z           %79 = arith.ori %77, %78 : tensor<1024x64xi1>
2026-02-21T08:34:30.0239968Z           %80 = arith.select %79, %56, %76 : tensor<1024x64xi1>, tensor<1024x64xf32>
2026-02-21T08:34:30.0240228Z           %81 = math.log %80 : tensor<1024x64xf32>
2026-02-21T08:34:30.0240430Z           %82 = arith.subf %81, %55 : tensor<1024x64xf32>
2026-02-21T08:34:30.0240646Z           %83 = arith.mulf %56, %82 : tensor<1024x64xf32>
2026-02-21T08:34:30.0240859Z           %84 = arith.addf %83, %cst_0 : tensor<1024x64xf32>
2026-02-21T08:34:30.0241071Z           scf.yield %84 : tensor<1024x64xf32>
2026-02-21T08:34:30.0241245Z         }
2026-02-21T08:34:30.0241403Z         %58 = arith.addf %41, %57 : tensor<1024x64xf32>
2026-02-21T08:34:30.0241607Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:34:30.0241811Z         %59 = arith.muli %c64_i32, %c2_i32 : i32
2026-02-21T08:34:30.0242043Z         %60 = arith.addi %arg6, %59 : i32
2026-02-21T08:34:30.0242271Z         %61 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:34:30.0242526Z         %62 = tt.splat %60 : i32 -> tensor<64xi32>
2026-02-21T08:34:30.0242725Z         %63 = arith.addi %62, %61 : tensor<64xi32>
2026-02-21T08:34:30.0242982Z         %64 = tt.expand_dims %7 {axis = 1 : i32} : tensor<1024xi32> -> tensor<1024x1xi32>
2026-02-21T08:34:30.0243247Z         %65 = arith.muli %64, %cst : tensor<1024x1xi32>
2026-02-21T08:34:30.0243505Z         %66 = tt.expand_dims %63 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
2026-02-21T08:34:30.0243806Z         %67 = tt.broadcast %65 : tensor<1024x1xi32> -> tensor<1024x64xi32>
2026-02-21T08:34:30.0244076Z         %68 = tt.broadcast %66 : tensor<1x64xi32> -> tensor<1024x64xi32>
2026-02-21T08:34:30.0244312Z         %69 = arith.addi %67, %68 : tensor<1024x64xi32>
2026-02-21T08:34:30.0244537Z         %70 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<1024x64x!tt.ptr<f32>>
2026-02-21T08:34:30.0244815Z         %71 = tt.addptr %70, %69 : tensor<1024x64x!tt.ptr<f32>>, tensor<1024x64xi32>
2026-02-21T08:34:30.0245107Z         %72 = tt.load %71 evictionPolicy = evict_first : tensor<1024x64x!tt.ptr<f32>>
2026-02-21T08:34:30.0245455Z         %73 = tt.descriptor_load %0[%4, %60] : !tt.tensordesc<tensor<1024x64xf32>> -> tensor<1024x64xf32>
2026-02-21T08:34:30.0245750Z         %74 = scf.if %arg3 -> (tensor<1024x64xf32>) {
2026-02-21T08:34:30.0246105Z           %76 = tt.extern_elementwise %73 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1024x64xf32>) -> tensor<1024x64xf32>
2026-02-21T08:34:30.0246475Z           %77 = arith.subf %73, %72 : tensor<1024x64xf32>
2026-02-21T08:34:30.0246678Z           %78 = arith.mulf %76, %77 : tensor<1024x64xf32>
2026-02-21T08:34:30.0246892Z           %79 = arith.addf %78, %cst_0 : tensor<1024x64xf32>
2026-02-21T08:34:30.0247097Z           scf.yield %79 : tensor<1024x64xf32>
2026-02-21T08:34:30.0247264Z         } else {
2026-02-21T08:34:30.0247426Z           %76 = tt.splat %arg4 : f32 -> tensor<1024x64xf32>
2026-02-21T08:34:30.0247639Z           %77 = arith.cmpf ogt, %73, %76 : tensor<1024x64xf32>
2026-02-21T08:34:30.0247859Z           %78 = arith.cmpf une, %73, %73 : tensor<1024x64xf32>
2026-02-21T08:34:30.0248173Z           %79 = arith.ori %77, %78 : tensor<1024x64xi1>
2026-02-21T08:34:30.0248409Z           %80 = arith.select %79, %73, %76 : tensor<1024x64xi1>, tensor<1024x64xf32>
2026-02-21T08:34:30.0248644Z           %81 = math.log %80 : tensor<1024x64xf32>
2026-02-21T08:34:30.0248896Z           %82 = arith.subf %81, %72 : tensor<1024x64xf32>
2026-02-21T08:34:30.0249101Z           %83 = arith.mulf %73, %82 : tensor<1024x64xf32>
2026-02-21T08:34:30.0249322Z           %84 = arith.addf %83, %cst_0 : tensor<1024x64xf32>
2026-02-21T08:34:30.0249535Z           scf.yield %84 : tensor<1024x64xf32>
2026-02-21T08:34:30.0249703Z         }
2026-02-21T08:34:30.0249860Z         %75 = arith.addf %58, %74 : tensor<1024x64xf32>
2026-02-21T08:34:30.0250050Z         scf.yield %75 : tensor<1024x64xf32>
2026-02-21T08:34:30.0250241Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:34:30.0250519Z       %9 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
2026-02-21T08:34:30.0250773Z       %10 = tt.splat %c65472_i32 : i32 -> tensor<64xi32>
2026-02-21T08:34:30.0250981Z       %11 = arith.addi %10, %9 : tensor<64xi32>
2026-02-21T08:34:30.0251214Z       %12 = tt.expand_dims %7 {axis = 1 : i32} : tensor<1024xi32> -> tensor<1024x1xi32>
2026-02-21T08:34:30.0251478Z       %13 = arith.muli %12, %cst : tensor<1024x1xi32>
2026-02-21T08:34:30.0251715Z       %14 = tt.expand_dims %11 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
2026-02-21T08:34:30.0252025Z       %15 = tt.broadcast %13 : tensor<1024x1xi32> -> tensor<1024x64xi32>
2026-02-21T08:34:30.0252280Z       %16 = tt.broadcast %14 : tensor<1x64xi32> -> tensor<1024x64xi32>
2026-02-21T08:34:30.0252513Z       %17 = arith.addi %15, %16 : tensor<1024x64xi32>
2026-02-21T08:34:30.0252744Z       %18 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<1024x64x!tt.ptr<f32>>
2026-02-21T08:34:30.0253011Z       %19 = tt.addptr %18, %17 : tensor<1024x64x!tt.ptr<f32>>, tensor<1024x64xi32>
2026-02-21T08:34:30.0253314Z       %20 = tt.load %19 evictionPolicy = evict_first : tensor<1024x64x!tt.ptr<f32>>
2026-02-21T08:34:30.0253661Z       %21 = tt.descriptor_load %0[%4, %c65472_i32] : !tt.tensordesc<tensor<1024x64xf32>> -> tensor<1024x64xf32>
2026-02-21T08:34:30.0253961Z       %22 = scf.if %arg3 -> (tensor<1024x64xf32>) {
2026-02-21T08:34:30.0254313Z         %27 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1024x64xf32>) -> tensor<1024x64xf32>
2026-02-21T08:34:30.0254678Z         %28 = arith.subf %21, %20 : tensor<1024x64xf32>
2026-02-21T08:34:30.0254879Z         %29 = arith.mulf %27, %28 : tensor<1024x64xf32>
2026-02-21T08:34:30.0255082Z         %30 = arith.addf %29, %cst_0 : tensor<1024x64xf32>
2026-02-21T08:34:30.0255282Z         scf.yield %30 : tensor<1024x64xf32>
2026-02-21T08:34:30.0255445Z       } else {
2026-02-21T08:34:30.0255603Z         %27 = tt.splat %arg4 : f32 -> tensor<1024x64xf32>
2026-02-21T08:34:30.0255818Z         %28 = arith.cmpf ogt, %21, %27 : tensor<1024x64xf32>
2026-02-21T08:34:30.0256038Z         %29 = arith.cmpf une, %21, %21 : tensor<1024x64xf32>
2026-02-21T08:34:30.0256245Z         %30 = arith.ori %28, %29 : tensor<1024x64xi1>
2026-02-21T08:34:30.0256476Z         %31 = arith.select %30, %21, %27 : tensor<1024x64xi1>, tensor<1024x64xf32>
2026-02-21T08:34:30.0256719Z         %32 = math.log %31 : tensor<1024x64xf32>
2026-02-21T08:34:30.0256909Z         %33 = arith.subf %32, %20 : tensor<1024x64xf32>
2026-02-21T08:34:30.0257108Z         %34 = arith.mulf %21, %33 : tensor<1024x64xf32>
2026-02-21T08:34:30.0257305Z         %35 = arith.addf %34, %cst_0 : tensor<1024x64xf32>
2026-02-21T08:34:30.0257505Z         scf.yield %35 : tensor<1024x64xf32>
2026-02-21T08:34:30.0257674Z       }
2026-02-21T08:34:30.0257814Z       %23 = arith.addf %8, %22 : tensor<1024x64xf32>
2026-02-21T08:34:30.0258013Z       %24 = "tt.reduce"(%23) <{axis = 1 : i32}> ({
2026-02-21T08:34:30.0258195Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:34:30.0258374Z         %27 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:34:30.0258615Z         tt.reduce.return %27 : f32
2026-02-21T08:34:30.0258805Z       }) : (tensor<1024x64xf32>) -> tensor<1024xf32>
2026-02-21T08:34:30.0259030Z       %25 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<1024x!tt.ptr<f32>>
2026-02-21T08:34:30.0259295Z       %26 = tt.addptr %25, %7 : tensor<1024x!tt.ptr<f32>>, tensor<1024xi32>
2026-02-21T08:34:30.0259542Z       tt.store %26, %24 : tensor<1024x!tt.ptr<f32>>
2026-02-21T08:34:30.0259735Z     } {tt.warp_specialize}
2026-02-21T08:34:30.0259895Z     tt.return
2026-02-21T08:34:30.0260017Z   }
2026-02-21T08:34:30.0260139Z }
2026-02-21T08:34:30.0260205Z 
2026-02-21T08:34:30.0260253Z {-#
2026-02-21T08:34:30.0260391Z   external_resources: {
2026-02-21T08:34:30.0260541Z     mlir_reproducer: {
2026-02-21T08:34:30.0264955Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:34:30.0269407Z       disable_threading: false,
2026-02-21T08:34:30.0269570Z       verify_each: true
2026-02-21T08:34:30.0269717Z     }
2026-02-21T08:34:30.0269833Z   }
2026-02-21T08:34:30.0269950Z #-}
2026-02-21T08:34:30.0270362Z /tmp/torchinductor_root/4r/c4rovp3bv6fv3fr5la5e2rov22fm2kkuysyfoxnyh52ztqjqxk3h.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:34:30.0271552Z /tmp/torchinductor_root/4r/c4rovp3bv6fv3fr5la5e2rov22fm2kkuysyfoxnyh52ztqjqxk3h.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:34:30.0272546Z [50s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:34:30.0273617Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', ''], maxnreg=256, num_sm_multiplier=128, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:34:30.0274580Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:34:30.0274839Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:34:36.3901802Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 4.6 configs/s
2026-02-21T08:34:36.3910938Z [56s] Adaptive compile timeout: 30s (90% percentile=4.8s, bounds=[30.0s, 30s])
2026-02-21T08:34:39.7926325Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 535/535 170.9 configs/s
2026-02-21T08:34:39.9123003Z [60s] Initial random population of 100, 5 starting points: 
2026-02-21T08:34:39.9124288Z error=12
2026-02-21T08:34:39.9124503Z timeout=4
2026-02-21T08:34:39.9124680Z ok=84
2026-02-21T08:34:39.9124874Z min=0.4066
2026-02-21T08:34:39.9125050Z mid=3.2093
2026-02-21T08:34:39.9125235Z max=448.7332
2026-02-21T08:34:39.9125448Z best={'block_sizes': [4096, 1],
2026-02-21T08:34:39.9125841Z  'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:34:39.9126282Z  'load_eviction_policies': ['last', 'first'],
2026-02-21T08:34:39.9126580Z  'num_stages': 6,
2026-02-21T08:34:39.9127142Z  'num_warps': 32,
2026-02-21T08:34:39.9127378Z  'pid_type': 'flat',
2026-02-21T08:34:39.9127626Z  'range_flattens': [None, True],
2026-02-21T08:34:39.9127903Z  'range_multi_buffers': [None, False],
2026-02-21T08:34:39.9128183Z  'range_num_stages': [0, 1],
2026-02-21T08:34:39.9128435Z  'range_unroll_factors': [0, 1],
2026-02-21T08:34:39.9128759Z  'range_warp_specializes': [None, False]}
2026-02-21T08:34:39.9148215Z [60s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:34:41.4274740Z [62s] Generation 1 starting: 89 neighbors, 5 active search path(s)
2026-02-21T08:35:16.2960526Z [96s] Timeout after 30s compiling Config(block_sizes=[4096, 4], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, False], range_num_stages=[4, 1], range_unroll_factors=[0, 1], range_warp_specializes=[False, None])
2026-02-21T08:35:16.2977203Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 0.5 configs/s
2026-02-21T08:35:21.7469551Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 17.0 configs/s
2026-02-21T08:35:41.0794890Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━ 535/535 27.7 configs/s
2026-02-21T08:35:41.3383499Z [121s] Generation 1 complete: 
2026-02-21T08:35:41.3385136Z timeout=1
2026-02-21T08:35:41.3385338Z ok=93
2026-02-21T08:35:41.3385511Z min=0.4158
2026-02-21T08:35:41.3385673Z mid=0.5090
2026-02-21T08:35:41.3385836Z max=3.1733
2026-02-21T08:35:41.3386016Z best={'block_sizes': [4096, 1],
2026-02-21T08:35:41.3386348Z  'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'],
2026-02-21T08:35:41.3386691Z  'load_eviction_policies': ['last', 'first'],
2026-02-21T08:35:41.3386945Z  'num_stages': 6,
2026-02-21T08:35:41.3387129Z  'num_warps': 32,
2026-02-21T08:35:41.3387317Z  'pid_type': 'flat',
2026-02-21T08:35:41.3387522Z  'range_flattens': [None, True],
2026-02-21T08:35:41.3387797Z  'range_multi_buffers': [None, True],
2026-02-21T08:35:41.3388037Z  'range_num_stages': [0, 1],
2026-02-21T08:35:41.3388196Z  'range_unroll_factors': [0, 1],
2026-02-21T08:35:41.3388377Z  'range_warp_specializes': [None, False]}
2026-02-21T08:35:41.3405201Z [121s] Fitting surrogate: 194 points, 194 targets
2026-02-21T08:35:42.5488272Z [123s] Generation 2 starting: 74 neighbors, 5 active search path(s)
2026-02-21T08:35:47.0082679Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 11.8 configs/s
2026-02-21T08:35:51.3448029Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 17.2 configs/s
2026-02-21T08:36:09.1641509Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━ 537/537 30.9 configs/s
2026-02-21T08:36:09.4279403Z [150s] Generation 2 complete: 
2026-02-21T08:36:09.4283552Z ok=79
2026-02-21T08:36:09.4285485Z min=0.4227
2026-02-21T08:36:09.4285645Z mid=0.4516
2026-02-21T08:36:09.4285766Z max=1.5099
2026-02-21T08:36:09.4285910Z best={'block_sizes': [1024, 1],
2026-02-21T08:36:09.4286174Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:36:09.4286915Z  'load_eviction_policies': ['', 'last'],
2026-02-21T08:36:09.4287092Z  'num_stages': 1,
2026-02-21T08:36:09.4287244Z  'num_warps': 1,
2026-02-21T08:36:09.4287383Z  'pid_type': 'flat',
2026-02-21T08:36:09.4287540Z  'range_flattens': [None, False],
2026-02-21T08:36:09.4287713Z  'range_multi_buffers': [None, False],
2026-02-21T08:36:09.4287895Z  'range_num_stages': [0, 1],
2026-02-21T08:36:09.4288061Z  'range_unroll_factors': [0, 1],
2026-02-21T08:36:09.4288231Z  'range_warp_specializes': [None, True]}
2026-02-21T08:36:09.4293951Z [150s] Fitting surrogate: 273 points, 273 targets
2026-02-21T08:36:10.3138608Z [150s] Generation 3 starting: 61 neighbors, 5 active search path(s)
2026-02-21T08:36:13.4705099Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 38.1 configs/s
2026-02-21T08:36:17.0030558Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 61/61 17.5 configs/s
2026-02-21T08:36:33.8969338Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━ 537/537 31.8 configs/s
2026-02-21T08:36:34.1413032Z [174s] Generation 3 complete: 
2026-02-21T08:36:34.1417030Z ok=66
2026-02-21T08:36:34.1420421Z min=0.4208
2026-02-21T08:36:34.1424363Z mid=0.4354
2026-02-21T08:36:34.1428218Z max=0.8530
2026-02-21T08:36:34.1432613Z best={'block_sizes': [2048, 2],
2026-02-21T08:36:34.1436243Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:36:34.1439546Z  'load_eviction_policies': ['', 'first'],
2026-02-21T08:36:34.1439831Z  'num_stages': 6,
2026-02-21T08:36:34.1440014Z  'num_warps': 32,
2026-02-21T08:36:34.1440200Z  'pid_type': 'flat',
2026-02-21T08:36:34.1440393Z  'range_flattens': [None, False],
2026-02-21T08:36:34.1440587Z  'range_multi_buffers': [None, False],
2026-02-21T08:36:34.1440768Z  'range_num_stages': [0, 0],
2026-02-21T08:36:34.1440941Z  'range_unroll_factors': [0, 0],
2026-02-21T08:36:34.1446362Z  'range_warp_specializes': [None, True]}
2026-02-21T08:36:34.1446650Z [174s] Fitting surrogate: 339 points, 339 targets
2026-02-21T08:36:34.9291284Z [175s] Generation 4 starting: 49 neighbors, 4 active search path(s)
2026-02-21T08:36:37.6158598Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50/50 35.7 configs/s
2026-02-21T08:36:40.5115912Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 50/50 17.5 configs/s
2026-02-21T08:36:53.3666940Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 541/541 42.0 configs/s
2026-02-21T08:36:53.5639210Z [194s] Generation 4 complete: 
2026-02-21T08:36:53.5642845Z ok=53
2026-02-21T08:36:53.5647945Z min=0.4158
2026-02-21T08:36:53.5649571Z mid=0.4332
2026-02-21T08:36:53.5649778Z max=0.9440
2026-02-21T08:36:53.5652755Z best={'block_sizes': [2048, 1],
2026-02-21T08:36:53.5652999Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:36:53.5653211Z  'load_eviction_policies': ['last', 'first'],
2026-02-21T08:36:53.5653448Z  'num_stages': 7,
2026-02-21T08:36:53.5653593Z  'num_warps': 8,
2026-02-21T08:36:53.5653765Z  'pid_type': 'flat',
2026-02-21T08:36:53.5658517Z  'range_flattens': [None, True],
2026-02-21T08:36:53.5662324Z  'range_multi_buffers': [None, None],
2026-02-21T08:36:53.5666704Z  'range_num_stages': [0, 1],
2026-02-21T08:36:53.5668202Z  'range_unroll_factors': [0, 1],
2026-02-21T08:36:53.5668438Z  'range_warp_specializes': [None, False]}
2026-02-21T08:36:53.5668719Z [194s] Fitting surrogate: 392 points, 392 targets
2026-02-21T08:36:54.2506544Z [194s] Generation 5 starting: 44 neighbors, 4 active search path(s)
2026-02-21T08:36:57.1962355Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44/44 11.5 configs/s
2026-02-21T08:36:59.7190837Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 44/44 17.8 configs/s
2026-02-21T08:37:12.1869304Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━ 541/541 43.3 configs/s
2026-02-21T08:37:12.3947837Z [213s] Generation 5 complete: 
2026-02-21T08:37:12.3949585Z ok=48
2026-02-21T08:37:12.3949747Z min=0.4127
2026-02-21T08:37:12.3949889Z mid=0.4392
2026-02-21T08:37:12.3950049Z max=0.7652
2026-02-21T08:37:12.3950969Z best={'block_sizes': [2048, 2],
2026-02-21T08:37:12.3951215Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:37:12.3951430Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:37:12.3951619Z  'num_stages': 7,
2026-02-21T08:37:12.3952833Z  'num_warps': 32,
2026-02-21T08:37:12.3953032Z  'pid_type': 'flat',
2026-02-21T08:37:12.3953222Z  'range_flattens': [None, False],
2026-02-21T08:37:12.3953426Z  'range_multi_buffers': [None, False],
2026-02-21T08:37:12.3953640Z  'range_num_stages': [0, 1],
2026-02-21T08:37:12.3953830Z  'range_unroll_factors': [0, 0],
2026-02-21T08:37:12.3954029Z  'range_warp_specializes': [None, True]}
2026-02-21T08:37:12.3966389Z [213s] Fitting surrogate: 440 points, 440 targets
2026-02-21T08:37:12.8664910Z [213s] Generation 6 starting: 23 neighbors, 2 active search path(s)
2026-02-21T08:37:15.1785038Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24/24 10.4 configs/s
2026-02-21T08:37:16.5905949Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 24/24 17.6 configs/s
2026-02-21T08:37:22.2230610Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 550/550 96.7 configs/s
2026-02-21T08:37:22.3499726Z [222s] Generation 6 complete: 
2026-02-21T08:37:22.3503246Z ok=25
2026-02-21T08:37:22.3507131Z min=0.4128
2026-02-21T08:37:22.3511016Z mid=0.4793
2026-02-21T08:37:22.3514928Z max=0.8183
2026-02-21T08:37:22.3518917Z best={'block_sizes': [2048, 2],
2026-02-21T08:37:22.3520332Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:37:22.3520557Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:37:22.3520752Z  'num_stages': 7,
2026-02-21T08:37:22.3520895Z  'num_warps': 32,
2026-02-21T08:37:22.3521050Z  'pid_type': 'flat',
2026-02-21T08:37:22.3521211Z  'range_flattens': [None, False],
2026-02-21T08:37:22.3521385Z  'range_multi_buffers': [None, False],
2026-02-21T08:37:22.3521569Z  'range_num_stages': [0, 1],
2026-02-21T08:37:22.3521732Z  'range_unroll_factors': [0, 0],
2026-02-21T08:37:22.3522129Z  'range_warp_specializes': [None, True]}
2026-02-21T08:37:22.3522355Z [222s] Fitting surrogate: 465 points, 465 targets
2026-02-21T08:37:22.7608723Z [223s] Generation 7 starting: 21 neighbors, 2 active search path(s)
2026-02-21T08:37:25.8397991Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 4.2 configs/s
2026-02-21T08:37:27.1254987Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 22/22 17.7 configs/s
2026-02-21T08:37:32.5693543Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━ 550/550 99.9 configs/s
2026-02-21T08:37:32.6940091Z [233s] Generation 7 complete: 
2026-02-21T08:37:32.6942528Z ok=23
2026-02-21T08:37:32.6942726Z min=0.4153
2026-02-21T08:37:32.6942914Z mid=0.4374
2026-02-21T08:37:32.6943083Z max=0.9636
2026-02-21T08:37:32.6943274Z best={'block_sizes': [2048, 2],
2026-02-21T08:37:32.6943565Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:37:32.6943905Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:37:32.6944244Z  'num_stages': 7,
2026-02-21T08:37:32.6944924Z  'num_warps': 32,
2026-02-21T08:37:32.6945136Z  'pid_type': 'flat',
2026-02-21T08:37:32.6945366Z  'range_flattens': [None, False],
2026-02-21T08:37:32.6945645Z  'range_multi_buffers': [None, False],
2026-02-21T08:37:32.6945922Z  'range_num_stages': [0, 0],
2026-02-21T08:37:32.6946173Z  'range_unroll_factors': [0, 0],
2026-02-21T08:37:32.6946441Z  'range_warp_specializes': [None, True]}
2026-02-21T08:37:32.6961327Z [233s] Fitting surrogate: 488 points, 488 targets
2026-02-21T08:37:33.1877003Z [233s] Generation 8 starting: 20 neighbors, 2 active search path(s)
2026-02-21T08:37:34.4299739Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 31.0 configs/s
2026-02-21T08:37:35.6118333Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 20/20 17.6 configs/s
2026-02-21T08:37:41.4480141Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━ 556/556 94.5 configs/s
2026-02-21T08:37:41.5749823Z [242s] Generation 8 complete: 
2026-02-21T08:37:41.5754568Z ok=22
2026-02-21T08:37:41.5755769Z min=0.4147
2026-02-21T08:37:41.5755928Z mid=0.4495
2026-02-21T08:37:41.5756047Z max=0.6380
2026-02-21T08:37:41.5756192Z best={'block_sizes': [2048, 1],
2026-02-21T08:37:41.5756395Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:37:41.5756614Z  'load_eviction_policies': ['first', 'first'],
2026-02-21T08:37:41.5756803Z  'num_stages': 7,
2026-02-21T08:37:41.5756939Z  'num_warps': 8,
2026-02-21T08:37:41.5757081Z  'pid_type': 'flat',
2026-02-21T08:37:41.5757230Z  'range_flattens': [None, True],
2026-02-21T08:37:41.5757411Z  'range_multi_buffers': [None, False],
2026-02-21T08:37:41.5757585Z  'range_num_stages': [0, 1],
2026-02-21T08:37:41.5757752Z  'range_unroll_factors': [0, 0],
2026-02-21T08:37:41.5757924Z  'range_warp_specializes': [None, True]}
2026-02-21T08:37:41.5766274Z [242s] Fitting surrogate: 510 points, 510 targets
2026-02-21T08:37:41.8548884Z [242s] Autotuning complete in 242.5s after searching 482 configs.
2026-02-21T08:37:41.8552215Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:37:41.8557062Z     @helion.kernel(config=helion.Config(block_sizes=[2048, 1], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=7, num_warps=8, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:37:41.8557866Z 
2026-02-21T08:37:41.8558124Z [242s] Code of selected kernel: /tmp/torchinductor_root/p6/cp6bl3k74ixwykbkjt4ihezx34drzzpnpmzbaq6zywbrex3k4wwg.py
2026-02-21T08:37:41.8742276Z from __future__ import annotations
2026-02-21T08:37:41.8742545Z 
2026-02-21T08:37:41.8746751Z import torch
2026-02-21T08:37:41.8746974Z import triton
2026-02-21T08:37:41.8747124Z import triton.language as tl
2026-02-21T08:37:41.8747334Z from torch._inductor.runtime import triton_helpers
2026-02-21T08:37:41.8747626Z from torch._inductor.runtime.triton_helpers import math as tl_math
2026-02-21T08:37:41.8747923Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T08:37:41.8748197Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T08:37:41.8748368Z 
2026-02-21T08:37:41.8748438Z _BLOCK_SIZE_1 = tl.constexpr(1)
2026-02-21T08:37:41.8748615Z _BLOCK_SIZE_0 = tl.constexpr(2048)
2026-02-21T08:37:41.8748725Z 
2026-02-21T08:37:41.8748780Z @triton.jit
2026-02-21T08:37:41.8748967Z def _helion_kl_div_forward(y_pred, y_true, loss, log_target, eps):
2026-02-21T08:37:41.8749256Z     # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m):
2026-02-21T08:37:41.8749493Z     pid_0 = tl.program_id(0)
2026-02-21T08:37:41.8749661Z     offset_1 = pid_0
2026-02-21T08:37:41.8749828Z     indices_1 = offset_1 + tl.zeros([1], tl.int32)
2026-02-21T08:37:41.8750105Z     # src[kl_div.py:90]: loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32)
2026-02-21T08:37:41.8750420Z     loss_sum = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T08:37:41.8751015Z     # src[kl_div.py:92]: for tile_v in hl.tile(V, block_size=block_size_n):
2026-02-21T08:37:41.8751335Z     # src[kl_div.py:93]:     kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32)
2026-02-21T08:37:41.8751603Z     # src[kl_div.py:92-112]: ...
2026-02-21T08:37:41.8752036Z     for offset_0 in tl.range(0, 65536, _BLOCK_SIZE_0, warp_specialize=True, num_stages=1, disallow_acc_multi_buffer=True, flatten=True):
2026-02-21T08:37:41.8752444Z         indices_0 = offset_0 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32)
2026-02-21T08:37:41.8752676Z         loss_sum_copy = loss_sum
2026-02-21T08:37:41.8752843Z         loss_sum_copy_0 = loss_sum_copy
2026-02-21T08:37:41.8753111Z         # src[kl_div.py:93]: kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32)
2026-02-21T08:37:41.8753425Z         kl_loss = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T08:37:41.8753774Z         # src[kl_div.py:95]: y_pred_val = y_pred[tile_bt, tile_v]
2026-02-21T08:37:41.8754139Z         y_pred_val = tl.load(y_pred + (indices_1[:, None] * 65536 + indices_0[None, :] * 1), None, eviction_policy='evict_first')
2026-02-21T08:37:41.8754486Z         # src[kl_div.py:96]: y_true_val = y_true[tile_bt, tile_v]
2026-02-21T08:37:41.8754828Z         y_true_val = tl.load(y_true + (indices_1[:, None] * 65536 + indices_0[None, :] * 1), None, eviction_policy='evict_first')
2026-02-21T08:37:41.8755153Z         # src[kl_div.py:98]: if log_target:
2026-02-21T08:37:41.8755415Z         # src[kl_div.py:99]:     # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space
2026-02-21T08:37:41.8755716Z         # src[kl_div.py:100]:     prob_true = torch.exp(y_true_val)
2026-02-21T08:37:41.8755929Z         # src[kl_div.py:98-106]: ...
2026-02-21T08:37:41.8756101Z         if log_target:
2026-02-21T08:37:41.8756253Z             y_true_val_copy = y_true_val
2026-02-21T08:37:41.8756437Z             y_pred_val_copy = y_pred_val
2026-02-21T08:37:41.8756615Z             kl_loss_copy = kl_loss
2026-02-21T08:37:41.8756806Z             y_true_val_copy_0 = y_true_val_copy
2026-02-21T08:37:41.8757026Z             y_pred_val_copy_0 = y_pred_val_copy
2026-02-21T08:37:41.8757207Z             kl_loss_copy_0 = kl_loss_copy
2026-02-21T08:37:41.8757419Z             # src[kl_div.py:100]: prob_true = torch.exp(y_true_val)
2026-02-21T08:37:41.8757640Z             v_0 = libdevice.exp(y_true_val_copy_0)
2026-02-21T08:37:41.8757886Z             # src[kl_div.py:101]: kl_loss += prob_true * (y_true_val - y_pred_val)
2026-02-21T08:37:41.8758137Z             v_1 = y_true_val_copy_0 - y_pred_val_copy_0
2026-02-21T08:37:41.8758329Z             v_2 = v_0 * v_1
2026-02-21T08:37:41.8758493Z             kl_loss = kl_loss_copy_0 + v_2
2026-02-21T08:37:41.8758670Z         # src[kl_div.py:98]: if log_target:
2026-02-21T08:37:41.8758922Z         # src[kl_div.py:99]:     # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space
2026-02-21T08:37:41.8759210Z         # src[kl_div.py:100]:     prob_true = torch.exp(y_true_val)
2026-02-21T08:37:41.8759424Z         # src[kl_div.py:98-106]: ...
2026-02-21T08:37:41.8759592Z         _not = not log_target
2026-02-21T08:37:41.8759747Z         if _not:
2026-02-21T08:37:41.8759885Z             y_true_val_copy_1 = y_true_val
2026-02-21T08:37:41.8760066Z             y_pred_val_copy_1 = y_pred_val
2026-02-21T08:37:41.8760243Z             kl_loss_copy_1 = kl_loss
2026-02-21T08:37:41.8760425Z             y_true_val_copy_1_0 = y_true_val_copy_1
2026-02-21T08:37:41.8760627Z             y_pred_val_copy_1_0 = y_pred_val_copy_1
2026-02-21T08:37:41.8760816Z             kl_loss_copy_1_0 = kl_loss_copy_1
2026-02-21T08:37:41.8761064Z             # src[kl_div.py:105]: log_true = torch.log(torch.clamp(y_true_val, min=eps))
2026-02-21T08:37:41.8761348Z             v_4 = triton_helpers.maximum(y_true_val_copy_1_0, eps)
2026-02-21T08:37:41.8761563Z             v_5 = tl_math.log(v_4)
2026-02-21T08:37:41.8761784Z             # src[kl_div.py:106]: kl_loss += y_true_val * (log_true - y_pred_val)
2026-02-21T08:37:41.8762135Z             v_6 = v_5 - y_pred_val_copy_1_0
2026-02-21T08:37:41.8762320Z             v_7 = y_true_val_copy_1_0 * v_6
2026-02-21T08:37:41.8762495Z             kl_loss = kl_loss_copy_1_0 + v_7
2026-02-21T08:37:41.8762683Z         # src[kl_div.py:112]: loss_sum += kl_loss
2026-02-21T08:37:41.8762866Z         loss_sum = loss_sum_copy_0 + kl_loss
2026-02-21T08:37:41.8763079Z     # src[kl_div.py:115]: loss[tile_bt] = loss_sum.sum(dim=-1)
2026-02-21T08:37:41.8763313Z     sum_1 = tl.cast(tl.sum(loss_sum, 1), tl.float32)
2026-02-21T08:37:41.8763522Z     tl.store(loss + indices_1 * 1, sum_1, None)
2026-02-21T08:37:41.8763650Z 
2026-02-21T08:37:41.8763937Z def kl_div_forward(y_pred: Tensor, y_true: Tensor, log_target: bool=False, reduction: str='batchmean', eps: float=1e-10, *, _launcher=_default_launcher):
2026-02-21T08:37:41.8764315Z     """
2026-02-21T08:37:41.8764452Z     Compute KL Divergence loss.
2026-02-21T08:37:41.8764560Z 
2026-02-21T08:37:41.8764611Z     Args:
2026-02-21T08:37:41.8764845Z         y_pred: Input predictions in log-space, shape (BT, V)
2026-02-21T08:37:41.8765138Z         y_true: Target values (probabilities or log-probabilities), shape (BT, V)
2026-02-21T08:37:41.8765458Z         log_target: If True, y_true is in log-space; if False, y_true is probabilities
2026-02-21T08:37:41.8765767Z         reduction: Reduction mode ('none', 'sum', 'mean', 'batchmean')
2026-02-21T08:37:41.8766003Z         eps: Small value to avoid numerical issues
2026-02-21T08:37:41.8766139Z 
2026-02-21T08:37:41.8766191Z     Returns:
2026-02-21T08:37:41.8766323Z         loss: KL divergence loss
2026-02-21T08:37:41.8766484Z     """
2026-02-21T08:37:41.8766626Z     # src[kl_div.py:74]: BT, V = y_pred.shape
2026-02-21T08:37:41.8766803Z     BT, V = y_pred.shape
2026-02-21T08:37:41.8767004Z     # src[kl_div.py:75]: assert y_true.shape == y_pred.shape, (
2026-02-21T08:37:41.8767267Z     # src[kl_div.py:76]:     f"Shape mismatch: {y_true.shape} != {y_pred.shape}"
2026-02-21T08:37:41.8767513Z     # src[kl_div.py:77]: )
2026-02-21T08:37:41.8767765Z     assert y_true.shape == y_pred.shape, f'Shape mismatch: {y_true.shape} != {y_pred.shape}'
2026-02-21T08:37:41.8768052Z     # src[kl_div.py:80]: if reduction == "none":
2026-02-21T08:37:41.8768272Z     # src[kl_div.py:81]:     loss = torch.zeros_like(y_pred)
2026-02-21T08:37:41.8768469Z     # src[kl_div.py:82]: else:
2026-02-21T08:37:41.8768633Z     # src[kl_div.py:80-83]: ...
2026-02-21T08:37:41.8768788Z     if reduction == 'none':
2026-02-21T08:37:41.8768977Z         # src[kl_div.py:81]: loss = torch.zeros_like(y_pred)
2026-02-21T08:37:41.8769175Z         loss = torch.zeros_like(y_pred)
2026-02-21T08:37:41.8769343Z     else:
2026-02-21T08:37:41.8769556Z         # src[kl_div.py:83]: loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device)
2026-02-21T08:37:41.8769879Z         loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device)
2026-02-21T08:37:41.8770172Z     # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m):
2026-02-21T08:37:41.8770500Z     # src[kl_div.py:90]:     loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32)
2026-02-21T08:37:41.8770760Z     # src[kl_div.py:89-115]: ...
2026-02-21T08:37:41.8771047Z     _launcher(_helion_kl_div_forward, (4096,), y_pred, y_true, loss, log_target, eps, num_warps=8, num_stages=7)
2026-02-21T08:37:41.8771383Z     # src[kl_div.py:118]: if reduction == "batchmean":
2026-02-21T08:37:41.8771618Z     # src[kl_div.py:119]:     final_loss = torch.sum(loss) / BT
2026-02-21T08:37:41.8771838Z     # src[kl_div.py:120]: elif reduction == "sum":
2026-02-21T08:37:41.8772069Z     # src[kl_div.py:118-125]: ...
2026-02-21T08:37:41.8772233Z     if reduction == 'batchmean':
2026-02-21T08:37:41.8772433Z         # src[kl_div.py:119]: final_loss = torch.sum(loss) / BT
2026-02-21T08:37:41.8772639Z         final_loss = torch.sum(loss) / BT
2026-02-21T08:37:41.8772823Z     elif reduction == 'sum':
2026-02-21T08:37:41.8773011Z         # src[kl_div.py:121]: final_loss = torch.sum(loss, dim=0)
2026-02-21T08:37:41.8773285Z         final_loss = torch.sum(loss, dim=0)
2026-02-21T08:37:41.8773469Z     elif reduction == 'mean':
2026-02-21T08:37:41.8773664Z         # src[kl_div.py:123]: final_loss = torch.sum(loss) / (BT * V)
2026-02-21T08:37:41.8773887Z         final_loss = torch.sum(loss) / (BT * V)
2026-02-21T08:37:41.8774055Z     else:
2026-02-21T08:37:41.8774194Z         # src[kl_div.py:125]: final_loss = loss
2026-02-21T08:37:41.8774365Z         final_loss = loss
2026-02-21T08:37:41.8774529Z     # src[kl_div.py:127]: return final_loss
2026-02-21T08:37:41.8774693Z     return final_loss
2026-02-21T08:37:43.2186227Z WARNING:tritonbench.utils.triton_op:Completed input ID 4:
2026-02-21T08:37:43.2188082Z (B, T, V)
2026-02-21T08:37:43.2188320Z ---------------
2026-02-21T08:37:43.2192737Z (8, 512, 65536)
2026-02-21T08:37:43.2196557Z 
2026-02-21T08:37:43.2525628Z  83%|████████▎ | 5/6 [16:15<03:33, 213.89s/it]
2026-02-21T08:37:43.2525628Z WARNING:tritonbench.utils.triton_op:Running input ID 5:
2026-02-21T08:37:43.2529806Z (B, T, V)
2026-02-21T08:37:43.2533419Z ----------------
2026-02-21T08:37:43.2536371Z (8, 512, 131072)
2026-02-21T08:37:43.2549572Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for torch_kl_div
2026-02-21T08:37:44.5169098Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for liger_kl_div
2026-02-21T08:37:45.5232461Z INFO:tritonbench.utils.triton_op:Took 4.58ms to get benchmark function for torch_compile_kl_div
2026-02-21T08:37:49.7770973Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:37:49.7771423Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:37:49.7775897Z               'dtype': 'torch.float32',
2026-02-21T08:37:49.7779977Z               'shape': (4096, 131072),
2026-02-21T08:37:49.7783761Z               'stride': (131072, 1)},
2026-02-21T08:37:49.7788150Z             { 'device': 'cuda:0',
2026-02-21T08:37:49.7789636Z               'dtype': 'torch.float32',
2026-02-21T08:37:49.7789852Z               'shape': (4096, 131072),
2026-02-21T08:37:49.7790119Z               'stride': (131072, 1)}),
2026-02-21T08:37:49.7790300Z   'kwargs': {}}
2026-02-21T08:37:49.7795344Z INFO:tritonbench.utils.triton_op:Took 2.69ms to get benchmark function for helion_kl_div_tritonbench
2026-02-21T08:37:49.9972296Z [0s] Autotune random seed: 2135561342
2026-02-21T08:37:50.1548037Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:38:23.2642791Z [33s] Timeout after 30s compiling Config(block_sizes=[65536, 1], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', ''], maxnreg=128, num_sm_multiplier=1, num_stages=6, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, False], range_num_stages=[1, 0], range_unroll_factors=[0, 1], range_warp_specializes=[True, None])
2026-02-21T08:38:23.6352722Z [33s] Timeout after 30s compiling Config(block_sizes=[512, 256], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=32, num_sm_multiplier=4, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, False], range_num_stages=[1, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None])
2026-02-21T08:38:24.2169797Z [34s] Timeout after 30s compiling Config(block_sizes=[128, 512], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=128, num_sm_multiplier=1, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[3, 3], range_unroll_factors=[3, 3], range_warp_specializes=[False, False])
2026-02-21T08:38:24.6411773Z [34s] Timeout after 30s compiling Config(block_sizes=[4096, 4], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', ''], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[None, False])
2026-02-21T08:38:25.1597895Z [35s] Timeout after 30s compiling Config(block_sizes=[1024, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=128, num_sm_multiplier=2, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[2, 4], range_unroll_factors=[4, 0], range_warp_specializes=[False, False])
2026-02-21T08:38:25.9038097Z [35s] Timeout after 30s compiling Config(block_sizes=[512, 1024], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', ''], num_stages=7, num_warps=32, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[None, None])
2026-02-21T08:38:26.6282787Z [36s] Timeout after 30s compiling Config(block_sizes=[2048, 8], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=4, num_stages=5, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[0, 3], range_unroll_factors=[3, 0], range_warp_specializes=[False, False])
2026-02-21T08:38:26.6835125Z [36s] Timeout after 30s compiling Config(block_sizes=[4096, 32], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['first', 'last'], num_stages=8, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[None, False])
2026-02-21T08:38:27.3763688Z [37s] Timeout after 30s compiling Config(block_sizes=[65536, 8], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], maxnreg=64, num_sm_multiplier=8, num_stages=7, num_warps=32, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[1, 0], range_unroll_factors=[2, 0], range_warp_specializes=[False, None])
2026-02-21T08:38:27.3778979Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.0 configs/s
2026-02-21T08:38:27.4877576Z module {
2026-02-21T08:38:27.4878227Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:38:27.4878818Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:38:27.4879008Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:38:27.4879243Z     %cst = arith.constant dense<131072> : tensor<16x1xi32>
2026-02-21T08:38:27.4879515Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<16x8xf32>
2026-02-21T08:38:27.4879765Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:38:27.4879964Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:38:27.4880197Z     %c131072_i32 = arith.constant 131072 : i32
2026-02-21T08:38:27.4880433Z     %c131072_i64 = arith.constant 131072 : i64
2026-02-21T08:38:27.4880609Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:38:27.4880923Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<16x8xf32>>
2026-02-21T08:38:27.4881235Z     %1 = tt.get_program_id x : i32
2026-02-21T08:38:27.4881418Z     %2 = arith.muli %1, %c16_i32 : i32
2026-02-21T08:38:27.4881646Z     %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:38:27.4882126Z     %4 = tt.splat %2 : i32 -> tensor<16xi32>
2026-02-21T08:38:27.4882335Z     %5 = arith.addi %4, %3 : tensor<16xi32>
2026-02-21T08:38:27.4882644Z     %6 = scf.for %arg5 = %c0_i32 to %c131072_i32 step %c8_i32 iter_args(%arg6 = %cst_0) -> (tensor<16x8xf32>)  : i32 {
2026-02-21T08:38:27.4883006Z       %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32>
2026-02-21T08:38:27.4883259Z       %11 = tt.splat %arg5 : i32 -> tensor<8xi32>
2026-02-21T08:38:27.4883777Z       %12 = arith.addi %11, %10 : tensor<8xi32>
2026-02-21T08:38:27.4884069Z       %13 = tt.descriptor_load %0[%2, %arg5] : !tt.tensordesc<tensor<16x8xf32>> -> tensor<16x8xf32>
2026-02-21T08:38:27.4884398Z       %14 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32>
2026-02-21T08:38:27.4884658Z       %15 = arith.muli %14, %cst : tensor<16x1xi32>
2026-02-21T08:38:27.4884898Z       %16 = tt.expand_dims %12 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32>
2026-02-21T08:38:27.4885176Z       %17 = tt.broadcast %15 : tensor<16x1xi32> -> tensor<16x8xi32>
2026-02-21T08:38:27.4885420Z       %18 = tt.broadcast %16 : tensor<1x8xi32> -> tensor<16x8xi32>
2026-02-21T08:38:27.4885644Z       %19 = arith.addi %17, %18 : tensor<16x8xi32>
2026-02-21T08:38:27.4885877Z       %20 = tt.splat %arg1 : !tt.ptr<f32> -> tensor<16x8x!tt.ptr<f32>>
2026-02-21T08:38:27.4886141Z       %21 = tt.addptr %20, %19 : tensor<16x8x!tt.ptr<f32>>, tensor<16x8xi32>
2026-02-21T08:38:27.4886523Z       %22 = tt.load %21 evictionPolicy = evict_first : tensor<16x8x!tt.ptr<f32>>
2026-02-21T08:38:27.4886775Z       %23 = scf.if %arg3 -> (tensor<16x8xf32>) {
2026-02-21T08:38:27.4887126Z         %25 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x8xf32>) -> tensor<16x8xf32>
2026-02-21T08:38:27.4887487Z         %26 = arith.subf %22, %13 : tensor<16x8xf32>
2026-02-21T08:38:27.4887684Z         %27 = arith.mulf %25, %26 : tensor<16x8xf32>
2026-02-21T08:38:27.4887891Z         %28 = arith.addf %27, %cst_0 : tensor<16x8xf32>
2026-02-21T08:38:27.4888081Z         scf.yield %28 : tensor<16x8xf32>
2026-02-21T08:38:27.4888254Z       } else {
2026-02-21T08:38:27.4888408Z         %25 = tt.splat %arg4 : f32 -> tensor<16x8xf32>
2026-02-21T08:38:27.4888627Z         %26 = arith.cmpf ogt, %22, %25 : tensor<16x8xf32>
2026-02-21T08:38:27.4888838Z         %27 = arith.cmpf une, %22, %22 : tensor<16x8xf32>
2026-02-21T08:38:27.4889046Z         %28 = arith.ori %26, %27 : tensor<16x8xi1>
2026-02-21T08:38:27.4889278Z         %29 = arith.select %28, %22, %25 : tensor<16x8xi1>, tensor<16x8xf32>
2026-02-21T08:38:27.4889504Z         %30 = math.log %29 : tensor<16x8xf32>
2026-02-21T08:38:27.4889692Z         %31 = arith.subf %30, %13 : tensor<16x8xf32>
2026-02-21T08:38:27.4889877Z         %32 = arith.mulf %22, %31 : tensor<16x8xf32>
2026-02-21T08:38:27.4890078Z         %33 = arith.addf %32, %cst_0 : tensor<16x8xf32>
2026-02-21T08:38:27.4890261Z         scf.yield %33 : tensor<16x8xf32>
2026-02-21T08:38:27.4890426Z       }
2026-02-21T08:38:27.4890568Z       %24 = arith.addf %arg6, %23 : tensor<16x8xf32>
2026-02-21T08:38:27.4890750Z       scf.yield %24 : tensor<16x8xf32>
2026-02-21T08:38:27.4890946Z     } {tt.warp_specialize}
2026-02-21T08:38:27.4891117Z     %7 = "tt.reduce"(%6) <{axis = 1 : i32}> ({
2026-02-21T08:38:27.4891294Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:38:27.4891471Z       %10 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:38:27.4891644Z       tt.reduce.return %10 : f32
2026-02-21T08:38:27.4891830Z     }) : (tensor<16x8xf32>) -> tensor<16xf32>
2026-02-21T08:38:27.4892084Z     %8 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:38:27.4892342Z     %9 = tt.addptr %8, %5 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:38:27.4892564Z     tt.store %9, %7 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:38:27.4892742Z     tt.return
2026-02-21T08:38:27.4892870Z   }
2026-02-21T08:38:27.4892987Z }
2026-02-21T08:38:27.4893053Z 
2026-02-21T08:38:27.4893112Z {-#
2026-02-21T08:38:27.4893252Z   external_resources: {
2026-02-21T08:38:27.4893409Z     mlir_reproducer: {
2026-02-21T08:38:27.4897761Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:38:27.4902396Z       disable_threading: false,
2026-02-21T08:38:27.4902576Z       verify_each: true
2026-02-21T08:38:27.4902732Z     }
2026-02-21T08:38:27.4902871Z   }
2026-02-21T08:38:27.4902991Z #-}
2026-02-21T08:38:27.4903544Z /tmp/torchinductor_root/xo/cxoplxv4egu44ahe6hembpyhmwszytytmjhzlnwzjl6cai3rj64b.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:38:27.4904965Z /tmp/torchinductor_root/xo/cxoplxv4egu44ahe6hembpyhmwszytytmjhzlnwzjl6cai3rj64b.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:38:27.4906040Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:38:27.4907073Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 16], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=6, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:38:27.4908012Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:38:27.4908262Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:38:35.9154087Z module attributes {ttg.maxnreg = 64 : i32} {
2026-02-21T08:38:35.9155705Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:38:35.9156300Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:38:35.9156484Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:38:35.9156678Z     %c4736_i32 = arith.constant 4736 : i32
2026-02-21T08:38:35.9156909Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x256xf32>
2026-02-21T08:38:35.9157131Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:38:35.9157313Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:38:35.9157497Z     %c131072_i32 = arith.constant 131072 : i32
2026-02-21T08:38:35.9157689Z     %c131072_i64 = arith.constant 131072 : i64
2026-02-21T08:38:35.9157866Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:38:35.9158186Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<16x256xf32>>
2026-02-21T08:38:35.9158630Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<16x256xf32>>
2026-02-21T08:38:35.9159352Z     %2 = tt.get_program_id x : i32
2026-02-21T08:38:35.9159564Z     scf.for %arg5 = %2 to %c256_i32 step %c4736_i32  : i32 {
2026-02-21T08:38:35.9159782Z       %3 = arith.muli %arg5, %c16_i32 : i32
2026-02-21T08:38:35.9160013Z       %4 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32>
2026-02-21T08:38:35.9160252Z       %5 = tt.splat %3 : i32 -> tensor<16xi32>
2026-02-21T08:38:35.9160449Z       %6 = arith.addi %5, %4 : tensor<16xi32>
2026-02-21T08:38:35.9160634Z       %c512_i32 = arith.constant 512 : i32
2026-02-21T08:38:35.9160939Z       %7 = scf.for %arg6 = %c0_i32 to %c131072_i32 step %c512_i32 iter_args(%arg7 = %cst) -> (tensor<16x256xf32>)  : i32 {
2026-02-21T08:38:35.9161351Z         %11 = tt.descriptor_load %0[%3, %arg6] : !tt.tensordesc<tensor<16x256xf32>> -> tensor<16x256xf32>
2026-02-21T08:38:35.9161810Z         %12 = tt.descriptor_load %1[%3, %arg6] : !tt.tensordesc<tensor<16x256xf32>> -> tensor<16x256xf32>
2026-02-21T08:38:35.9162342Z         %13 = scf.if %arg3 -> (tensor<16x256xf32>) {
2026-02-21T08:38:35.9162719Z           %21 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32>
2026-02-21T08:38:35.9173751Z           %22 = arith.subf %12, %11 : tensor<16x256xf32>
2026-02-21T08:38:35.9178003Z           %23 = arith.mulf %21, %22 : tensor<16x256xf32>
2026-02-21T08:38:35.9179334Z           %24 = arith.addf %23, %cst : tensor<16x256xf32>
2026-02-21T08:38:35.9179574Z           scf.yield %24 : tensor<16x256xf32>
2026-02-21T08:38:35.9179759Z         } else {
2026-02-21T08:38:35.9179934Z           %21 = tt.splat %arg4 : f32 -> tensor<16x256xf32>
2026-02-21T08:38:35.9180170Z           %22 = arith.cmpf ogt, %12, %21 : tensor<16x256xf32>
2026-02-21T08:38:35.9180389Z           %23 = arith.cmpf une, %12, %12 : tensor<16x256xf32>
2026-02-21T08:38:35.9180615Z           %24 = arith.ori %22, %23 : tensor<16x256xi1>
2026-02-21T08:38:35.9180872Z           %25 = arith.select %24, %12, %21 : tensor<16x256xi1>, tensor<16x256xf32>
2026-02-21T08:38:35.9181111Z           %26 = math.log %25 : tensor<16x256xf32>
2026-02-21T08:38:35.9181315Z           %27 = arith.subf %26, %11 : tensor<16x256xf32>
2026-02-21T08:38:35.9181515Z           %28 = arith.mulf %12, %27 : tensor<16x256xf32>
2026-02-21T08:38:35.9181724Z           %29 = arith.addf %28, %cst : tensor<16x256xf32>
2026-02-21T08:38:35.9182117Z           scf.yield %29 : tensor<16x256xf32>
2026-02-21T08:38:35.9182294Z         }
2026-02-21T08:38:35.9182443Z         %14 = arith.addf %arg7, %13 : tensor<16x256xf32>
2026-02-21T08:38:35.9182650Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:38:35.9182847Z         %15 = arith.muli %c256_i32, %c1_i32 : i32
2026-02-21T08:38:35.9183034Z         %16 = arith.addi %arg6, %15 : i32
2026-02-21T08:38:35.9183316Z         %17 = tt.descriptor_load %0[%3, %16] : !tt.tensordesc<tensor<16x256xf32>> -> tensor<16x256xf32>
2026-02-21T08:38:35.9183673Z         %18 = tt.descriptor_load %1[%3, %16] : !tt.tensordesc<tensor<16x256xf32>> -> tensor<16x256xf32>
2026-02-21T08:38:35.9183960Z         %19 = scf.if %arg3 -> (tensor<16x256xf32>) {
2026-02-21T08:38:35.9184331Z           %21 = tt.extern_elementwise %18 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32>
2026-02-21T08:38:35.9184687Z           %22 = arith.subf %18, %17 : tensor<16x256xf32>
2026-02-21T08:38:35.9184892Z           %23 = arith.mulf %21, %22 : tensor<16x256xf32>
2026-02-21T08:38:35.9185093Z           %24 = arith.addf %23, %cst : tensor<16x256xf32>
2026-02-21T08:38:35.9185290Z           scf.yield %24 : tensor<16x256xf32>
2026-02-21T08:38:35.9185453Z         } else {
2026-02-21T08:38:35.9185633Z           %21 = tt.splat %arg4 : f32 -> tensor<16x256xf32>
2026-02-21T08:38:35.9185853Z           %22 = arith.cmpf ogt, %18, %21 : tensor<16x256xf32>
2026-02-21T08:38:35.9186070Z           %23 = arith.cmpf une, %18, %18 : tensor<16x256xf32>
2026-02-21T08:38:35.9186538Z           %24 = arith.ori %22, %23 : tensor<16x256xi1>
2026-02-21T08:38:35.9186774Z           %25 = arith.select %24, %18, %21 : tensor<16x256xi1>, tensor<16x256xf32>
2026-02-21T08:38:35.9187017Z           %26 = math.log %25 : tensor<16x256xf32>
2026-02-21T08:38:35.9187216Z           %27 = arith.subf %26, %17 : tensor<16x256xf32>
2026-02-21T08:38:35.9187414Z           %28 = arith.mulf %18, %27 : tensor<16x256xf32>
2026-02-21T08:38:35.9187621Z           %29 = arith.addf %28, %cst : tensor<16x256xf32>
2026-02-21T08:38:35.9187814Z           scf.yield %29 : tensor<16x256xf32>
2026-02-21T08:38:35.9187984Z         }
2026-02-21T08:38:35.9188123Z         %20 = arith.addf %14, %19 : tensor<16x256xf32>
2026-02-21T08:38:35.9188319Z         scf.yield %20 : tensor<16x256xf32>
2026-02-21T08:38:35.9188503Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:38:35.9188698Z       %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:38:35.9188948Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:38:35.9189127Z         %11 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:38:35.9189313Z         tt.reduce.return %11 : f32
2026-02-21T08:38:35.9189493Z       }) : (tensor<16x256xf32>) -> tensor<16xf32>
2026-02-21T08:38:35.9189722Z       %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>>
2026-02-21T08:38:35.9189978Z       %10 = tt.addptr %9, %6 : tensor<16x!tt.ptr<f32>>, tensor<16xi32>
2026-02-21T08:38:35.9190216Z       tt.store %10, %8 : tensor<16x!tt.ptr<f32>>
2026-02-21T08:38:35.9190455Z     } {tt.flatten, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:38:35.9190662Z     tt.return
2026-02-21T08:38:35.9190801Z   }
2026-02-21T08:38:35.9190918Z }
2026-02-21T08:38:35.9190985Z 
2026-02-21T08:38:35.9191042Z {-#
2026-02-21T08:38:35.9191164Z   external_resources: {
2026-02-21T08:38:35.9191320Z     mlir_reproducer: {
2026-02-21T08:38:35.9195570Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:38:35.9199972Z       disable_threading: false,
2026-02-21T08:38:35.9200139Z       verify_each: true
2026-02-21T08:38:35.9200282Z     }
2026-02-21T08:38:35.9200408Z   }
2026-02-21T08:38:35.9200520Z #-}
2026-02-21T08:38:35.9200953Z /tmp/torchinductor_root/u5/cu5cqukk7327wqhkcxuvresgfsw2rabe3azrdt6atotry4lstclq.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:38:35.9202279Z /tmp/torchinductor_root/u5/cu5cqukk7327wqhkcxuvresgfsw2rabe3azrdt6atotry4lstclq.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:38:35.9203278Z [45s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:38:35.9204390Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], maxnreg=64, num_sm_multiplier=32, num_stages=6, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[True, True], range_num_stages=[3, 2], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:38:35.9205455Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:38:35.9205723Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:38:46.7982836Z module {
2026-02-21T08:38:46.7983871Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:38:46.7984914Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:38:46.7985254Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:38:46.7985624Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:38:46.7985917Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:38:46.7986299Z     %cst = arith.constant dense<0.000000e+00> : tensor<256x1024xf32>
2026-02-21T08:38:46.7986691Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:38:46.7987011Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:38:46.7987362Z     %c131072_i32 = arith.constant 131072 : i32
2026-02-21T08:38:46.7987713Z     %c131072_i64 = arith.constant 131072 : i64
2026-02-21T08:38:46.7988029Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:38:46.7988586Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<256x1024xf32>>
2026-02-21T08:38:46.7989415Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<256x1024xf32>>
2026-02-21T08:38:46.7989974Z     %2 = tt.get_program_id x : i32
2026-02-21T08:38:46.7990276Z     %3 = arith.addi %2, %c1_i32 : i32
2026-02-21T08:38:46.7990569Z     %4 = arith.minsi %3, %c16_i32 : i32
2026-02-21T08:38:46.7990872Z     %5 = arith.subi %4, %2 : i32
2026-02-21T08:38:46.7991162Z     %c1_i32_0 = arith.constant 1 : i32
2026-02-21T08:38:46.7991469Z     %6 = arith.subi %c1_i32, %c1_i32_0 : i32
2026-02-21T08:38:46.7991774Z     %7 = arith.addi %5, %6 : i32
2026-02-21T08:38:46.7992138Z     %8 = arith.divui %7, %c1_i32 : i32
2026-02-21T08:38:46.7992461Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T08:38:46.7992752Z     %9 = arith.remsi %8, %c3_i32 : i32
2026-02-21T08:38:46.7993060Z     %10 = arith.subi %8, %9 : i32
2026-02-21T08:38:46.7993343Z     %11 = arith.muli %10, %c1_i32 : i32
2026-02-21T08:38:46.7993651Z     %12 = arith.addi %2, %11 : i32
2026-02-21T08:38:46.7993951Z     %13 = arith.muli %c1_i32, %c3_i32 : i32
2026-02-21T08:38:46.7994280Z     scf.for %arg5 = %2 to %12 step %13  : i32 {
2026-02-21T08:38:46.7994620Z       %14 = arith.muli %arg5, %c256_i32 : i32
2026-02-21T08:38:46.7995018Z       %15 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:38:46.7995462Z       %16 = tt.splat %14 : i32 -> tensor<256xi32>
2026-02-21T08:38:46.7995797Z       %17 = arith.addi %16, %15 : tensor<256xi32>
2026-02-21T08:38:46.7996376Z       %18 = scf.for %arg6 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<256x1024xf32>)  : i32 {
2026-02-21T08:38:46.7997150Z         %42 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<256x1024xf32>> -> tensor<256x1024xf32>
2026-02-21T08:38:46.7998232Z         %43 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc<tensor<256x1024xf32>> -> tensor<256x1024xf32>
2026-02-21T08:38:46.7998772Z         %44 = scf.if %arg3 -> (tensor<256x1024xf32>) {
2026-02-21T08:38:46.7999451Z           %46 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x1024xf32>) -> tensor<256x1024xf32>
2026-02-21T08:38:46.8000147Z           %47 = arith.subf %43, %42 : tensor<256x1024xf32>
2026-02-21T08:38:46.8000524Z           %48 = arith.mulf %46, %47 : tensor<256x1024xf32>
2026-02-21T08:38:46.8000897Z           %49 = arith.addf %48, %cst : tensor<256x1024xf32>
2026-02-21T08:38:46.8001261Z           scf.yield %49 : tensor<256x1024xf32>
2026-02-21T08:38:46.8001554Z         } else {
2026-02-21T08:38:46.8001838Z           %46 = tt.splat %arg4 : f32 -> tensor<256x1024xf32>
2026-02-21T08:38:46.8002488Z           %47 = arith.cmpf ogt, %43, %46 : tensor<256x1024xf32>
2026-02-21T08:38:46.8002891Z           %48 = arith.cmpf une, %43, %43 : tensor<256x1024xf32>
2026-02-21T08:38:46.8003274Z           %49 = arith.ori %47, %48 : tensor<256x1024xi1>
2026-02-21T08:38:46.8003701Z           %50 = arith.select %49, %43, %46 : tensor<256x1024xi1>, tensor<256x1024xf32>
2026-02-21T08:38:46.8004154Z           %51 = math.log %50 : tensor<256x1024xf32>
2026-02-21T08:38:46.8004502Z           %52 = arith.subf %51, %42 : tensor<256x1024xf32>
2026-02-21T08:38:46.8004863Z           %53 = arith.mulf %43, %52 : tensor<256x1024xf32>
2026-02-21T08:38:46.8005232Z           %54 = arith.addf %53, %cst : tensor<256x1024xf32>
2026-02-21T08:38:46.8005578Z           scf.yield %54 : tensor<256x1024xf32>
2026-02-21T08:38:46.8005866Z         }
2026-02-21T08:38:46.8006110Z         %45 = arith.addf %arg7, %44 : tensor<256x1024xf32>
2026-02-21T08:38:46.8006462Z         scf.yield %45 : tensor<256x1024xf32>
2026-02-21T08:38:46.8006827Z       } {tt.flatten, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:38:46.8007231Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:38:46.8007550Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:38:46.8007980Z         %42 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:38:46.8017554Z         tt.reduce.return %42 : f32
2026-02-21T08:38:46.8017931Z       }) : (tensor<256x1024xf32>) -> tensor<256xf32>
2026-02-21T08:38:46.8018360Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<256x!tt.ptr<f32>>
2026-02-21T08:38:46.8018850Z       %21 = tt.addptr %20, %17 : tensor<256x!tt.ptr<f32>>, tensor<256xi32>
2026-02-21T08:38:46.8019285Z       tt.store %21, %19 : tensor<256x!tt.ptr<f32>>
2026-02-21T08:38:46.8019632Z       %c1_i32_1 = arith.constant 1 : i32
2026-02-21T08:38:46.8019977Z       %22 = arith.muli %c1_i32, %c1_i32_1 : i32
2026-02-21T08:38:46.8020307Z       %23 = arith.addi %arg5, %22 : i32
2026-02-21T08:38:46.8020627Z       %24 = arith.muli %23, %c256_i32 : i32
2026-02-21T08:38:46.8021039Z       %25 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:38:46.8021500Z       %26 = tt.splat %24 : i32 -> tensor<256xi32>
2026-02-21T08:38:46.8021842Z       %27 = arith.addi %26, %25 : tensor<256xi32>
2026-02-21T08:38:46.8022510Z       %28 = scf.for %arg6 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<256x1024xf32>)  : i32 {
2026-02-21T08:38:46.8023313Z         %42 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc<tensor<256x1024xf32>> -> tensor<256x1024xf32>
2026-02-21T08:38:46.8024037Z         %43 = tt.descriptor_load %1[%24, %arg6] : !tt.tensordesc<tensor<256x1024xf32>> -> tensor<256x1024xf32>
2026-02-21T08:38:46.8024586Z         %44 = scf.if %arg3 -> (tensor<256x1024xf32>) {
2026-02-21T08:38:46.8025274Z           %46 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x1024xf32>) -> tensor<256x1024xf32>
2026-02-21T08:38:46.8025969Z           %47 = arith.subf %43, %42 : tensor<256x1024xf32>
2026-02-21T08:38:46.8026350Z           %48 = arith.mulf %46, %47 : tensor<256x1024xf32>
2026-02-21T08:38:46.8026946Z           %49 = arith.addf %48, %cst : tensor<256x1024xf32>
2026-02-21T08:38:46.8027319Z           scf.yield %49 : tensor<256x1024xf32>
2026-02-21T08:38:46.8027619Z         } else {
2026-02-21T08:38:46.8027906Z           %46 = tt.splat %arg4 : f32 -> tensor<256x1024xf32>
2026-02-21T08:38:46.8028304Z           %47 = arith.cmpf ogt, %43, %46 : tensor<256x1024xf32>
2026-02-21T08:38:46.8028717Z           %48 = arith.cmpf une, %43, %43 : tensor<256x1024xf32>
2026-02-21T08:38:46.8029110Z           %49 = arith.ori %47, %48 : tensor<256x1024xi1>
2026-02-21T08:38:46.8029547Z           %50 = arith.select %49, %43, %46 : tensor<256x1024xi1>, tensor<256x1024xf32>
2026-02-21T08:38:46.8030003Z           %51 = math.log %50 : tensor<256x1024xf32>
2026-02-21T08:38:46.8030365Z           %52 = arith.subf %51, %42 : tensor<256x1024xf32>
2026-02-21T08:38:46.8030741Z           %53 = arith.mulf %43, %52 : tensor<256x1024xf32>
2026-02-21T08:38:46.8031213Z           %54 = arith.addf %53, %cst : tensor<256x1024xf32>
2026-02-21T08:38:46.8031587Z           scf.yield %54 : tensor<256x1024xf32>
2026-02-21T08:38:46.8031947Z         }
2026-02-21T08:38:46.8032190Z         %45 = arith.addf %arg7, %44 : tensor<256x1024xf32>
2026-02-21T08:38:46.8032553Z         scf.yield %45 : tensor<256x1024xf32>
2026-02-21T08:38:46.8032945Z       } {tt.flatten, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:38:46.8033356Z       %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({
2026-02-21T08:38:46.8033672Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:38:46.8033982Z         %42 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:38:46.8034297Z         tt.reduce.return %42 : f32
2026-02-21T08:38:46.8034629Z       }) : (tensor<256x1024xf32>) -> tensor<256xf32>
2026-02-21T08:38:46.8035044Z       %30 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<256x!tt.ptr<f32>>
2026-02-21T08:38:46.8035510Z       %31 = tt.addptr %30, %27 : tensor<256x!tt.ptr<f32>>, tensor<256xi32>
2026-02-21T08:38:46.8035940Z       tt.store %31, %29 : tensor<256x!tt.ptr<f32>>
2026-02-21T08:38:46.8036284Z       %c2_i32 = arith.constant 2 : i32
2026-02-21T08:38:46.8036606Z       %32 = arith.muli %c1_i32, %c2_i32 : i32
2026-02-21T08:38:46.8036917Z       %33 = arith.addi %arg5, %32 : i32
2026-02-21T08:38:46.8037227Z       %34 = arith.muli %33, %c256_i32 : i32
2026-02-21T08:38:46.8037629Z       %35 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:38:46.8038046Z       %36 = tt.splat %34 : i32 -> tensor<256xi32>
2026-02-21T08:38:46.8038391Z       %37 = arith.addi %36, %35 : tensor<256xi32>
2026-02-21T08:38:46.8038961Z       %38 = scf.for %arg6 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<256x1024xf32>)  : i32 {
2026-02-21T08:38:46.8039736Z         %42 = tt.descriptor_load %0[%34, %arg6] : !tt.tensordesc<tensor<256x1024xf32>> -> tensor<256x1024xf32>
2026-02-21T08:38:46.8040435Z         %43 = tt.descriptor_load %1[%34, %arg6] : !tt.tensordesc<tensor<256x1024xf32>> -> tensor<256x1024xf32>
2026-02-21T08:38:46.8040985Z         %44 = scf.if %arg3 -> (tensor<256x1024xf32>) {
2026-02-21T08:38:46.8041665Z           %46 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x1024xf32>) -> tensor<256x1024xf32>
2026-02-21T08:38:46.8042411Z           %47 = arith.subf %43, %42 : tensor<256x1024xf32>
2026-02-21T08:38:46.8042791Z           %48 = arith.mulf %46, %47 : tensor<256x1024xf32>
2026-02-21T08:38:46.8043169Z           %49 = arith.addf %48, %cst : tensor<256x1024xf32>
2026-02-21T08:38:46.8043536Z           scf.yield %49 : tensor<256x1024xf32>
2026-02-21T08:38:46.8043845Z         } else {
2026-02-21T08:38:46.8044119Z           %46 = tt.splat %arg4 : f32 -> tensor<256x1024xf32>
2026-02-21T08:38:46.8044522Z           %47 = arith.cmpf ogt, %43, %46 : tensor<256x1024xf32>
2026-02-21T08:38:46.8044931Z           %48 = arith.cmpf une, %43, %43 : tensor<256x1024xf32>
2026-02-21T08:38:46.8045327Z           %49 = arith.ori %47, %48 : tensor<256x1024xi1>
2026-02-21T08:38:46.8045854Z           %50 = arith.select %49, %43, %46 : tensor<256x1024xi1>, tensor<256x1024xf32>
2026-02-21T08:38:46.8046305Z           %51 = math.log %50 : tensor<256x1024xf32>
2026-02-21T08:38:46.8046666Z           %52 = arith.subf %51, %42 : tensor<256x1024xf32>
2026-02-21T08:38:46.8047030Z           %53 = arith.mulf %43, %52 : tensor<256x1024xf32>
2026-02-21T08:38:46.8047411Z           %54 = arith.addf %53, %cst : tensor<256x1024xf32>
2026-02-21T08:38:46.8047762Z           scf.yield %54 : tensor<256x1024xf32>
2026-02-21T08:38:46.8048060Z         }
2026-02-21T08:38:46.8048313Z         %45 = arith.addf %arg7, %44 : tensor<256x1024xf32>
2026-02-21T08:38:46.8048676Z         scf.yield %45 : tensor<256x1024xf32>
2026-02-21T08:38:46.8049062Z       } {tt.flatten, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:38:46.8049471Z       %39 = "tt.reduce"(%38) <{axis = 1 : i32}> ({
2026-02-21T08:38:46.8049804Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:38:46.8050190Z         %42 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:38:46.8050519Z         tt.reduce.return %42 : f32
2026-02-21T08:38:46.8050840Z       }) : (tensor<256x1024xf32>) -> tensor<256xf32>
2026-02-21T08:38:46.8051243Z       %40 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<256x!tt.ptr<f32>>
2026-02-21T08:38:46.8051711Z       %41 = tt.addptr %40, %37 : tensor<256x!tt.ptr<f32>>, tensor<256xi32>
2026-02-21T08:38:46.8052200Z       tt.store %41, %39 : tensor<256x!tt.ptr<f32>>
2026-02-21T08:38:46.8052608Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:38:46.8053005Z     scf.for %arg5 = %12 to %4 step %c1_i32  : i32 {
2026-02-21T08:38:46.8053363Z       %14 = arith.muli %arg5, %c256_i32 : i32
2026-02-21T08:38:46.8053764Z       %15 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
2026-02-21T08:38:46.8054203Z       %16 = tt.splat %14 : i32 -> tensor<256xi32>
2026-02-21T08:38:46.8054534Z       %17 = arith.addi %16, %15 : tensor<256xi32>
2026-02-21T08:38:46.8055111Z       %18 = scf.for %arg6 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<256x1024xf32>)  : i32 {
2026-02-21T08:38:46.8055889Z         %22 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc<tensor<256x1024xf32>> -> tensor<256x1024xf32>
2026-02-21T08:38:46.8056591Z         %23 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc<tensor<256x1024xf32>> -> tensor<256x1024xf32>
2026-02-21T08:38:46.8057133Z         %24 = scf.if %arg3 -> (tensor<256x1024xf32>) {
2026-02-21T08:38:46.8057799Z           %26 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x1024xf32>) -> tensor<256x1024xf32>
2026-02-21T08:38:46.8058492Z           %27 = arith.subf %23, %22 : tensor<256x1024xf32>
2026-02-21T08:38:46.8058861Z           %28 = arith.mulf %26, %27 : tensor<256x1024xf32>
2026-02-21T08:38:46.8059234Z           %29 = arith.addf %28, %cst : tensor<256x1024xf32>
2026-02-21T08:38:46.8059600Z           scf.yield %29 : tensor<256x1024xf32>
2026-02-21T08:38:46.8059902Z         } else {
2026-02-21T08:38:46.8060186Z           %26 = tt.splat %arg4 : f32 -> tensor<256x1024xf32>
2026-02-21T08:38:46.8060578Z           %27 = arith.cmpf ogt, %23, %26 : tensor<256x1024xf32>
2026-02-21T08:38:46.8060985Z           %28 = arith.cmpf une, %23, %23 : tensor<256x1024xf32>
2026-02-21T08:38:46.8061375Z           %29 = arith.ori %27, %28 : tensor<256x1024xi1>
2026-02-21T08:38:46.8061804Z           %30 = arith.select %29, %23, %26 : tensor<256x1024xi1>, tensor<256x1024xf32>
2026-02-21T08:38:46.8062289Z           %31 = math.log %30 : tensor<256x1024xf32>
2026-02-21T08:38:46.8062649Z           %32 = arith.subf %31, %22 : tensor<256x1024xf32>
2026-02-21T08:38:46.8063025Z           %33 = arith.mulf %23, %32 : tensor<256x1024xf32>
2026-02-21T08:38:46.8063397Z           %34 = arith.addf %33, %cst : tensor<256x1024xf32>
2026-02-21T08:38:46.8063772Z           scf.yield %34 : tensor<256x1024xf32>
2026-02-21T08:38:46.8064075Z         }
2026-02-21T08:38:46.8064349Z         %25 = arith.addf %arg7, %24 : tensor<256x1024xf32>
2026-02-21T08:38:46.8064799Z         scf.yield %25 : tensor<256x1024xf32>
2026-02-21T08:38:46.8065192Z       } {tt.flatten, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:38:46.8065601Z       %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({
2026-02-21T08:38:46.8065925Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:38:46.8066240Z         %22 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:38:46.8066561Z         tt.reduce.return %22 : f32
2026-02-21T08:38:46.8066880Z       }) : (tensor<256x1024xf32>) -> tensor<256xf32>
2026-02-21T08:38:46.8067300Z       %20 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<256x!tt.ptr<f32>>
2026-02-21T08:38:46.8067768Z       %21 = tt.addptr %20, %17 : tensor<256x!tt.ptr<f32>>, tensor<256xi32>
2026-02-21T08:38:46.8068196Z       tt.store %21, %19 : tensor<256x!tt.ptr<f32>>
2026-02-21T08:38:46.8068595Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:38:46.8068955Z     tt.return
2026-02-21T08:38:46.8069237Z   }
2026-02-21T08:38:46.8069439Z }
2026-02-21T08:38:46.8069552Z 
2026-02-21T08:38:46.8069639Z {-#
2026-02-21T08:38:46.8069845Z   external_resources: {
2026-02-21T08:38:46.8070114Z     mlir_reproducer: {
2026-02-21T08:38:46.8078310Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:38:46.8086849Z       disable_threading: false,
2026-02-21T08:38:46.8087128Z       verify_each: true
2026-02-21T08:38:46.8087373Z     }
2026-02-21T08:38:46.8087557Z   }
2026-02-21T08:38:46.8087744Z #-}
2026-02-21T08:38:46.8088513Z /tmp/torchinductor_root/76/c76eam7pb3egtsi7cxrddznxiy4ady6vqx25xgh2kxuulbqy77jo.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:38:46.8090809Z /tmp/torchinductor_root/76/c76eam7pb3egtsi7cxrddznxiy4ady6vqx25xgh2kxuulbqy77jo.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:38:46.8092719Z [56s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:38:46.8094745Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'last'], num_sm_multiplier=128, num_stages=3, num_warps=32, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[4, 4], range_unroll_factors=[3, 0], range_warp_specializes=[False, True]), static_shapes=True)
2026-02-21T08:38:46.8096635Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:38:46.8097090Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:38:53.8150925Z module attributes {ttg.maxnreg = 64 : i32} {
2026-02-21T08:38:53.8152815Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:38:53.8153633Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:38:53.8153892Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:38:53.8154586Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:38:53.8154803Z     %c9472_i32 = arith.constant 9472 : i32
2026-02-21T08:38:53.8155028Z     %cst = arith.constant dense<131072> : tensor<128x1xi32>
2026-02-21T08:38:53.8155282Z     %cst_0 = arith.constant dense<0.000000e+00> : tensor<128x512xf32>
2026-02-21T08:38:53.8155527Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:38:53.8155716Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:38:53.8155902Z     %c131072_i32 = arith.constant 131072 : i32
2026-02-21T08:38:53.8156098Z     %c131072_i64 = arith.constant 131072 : i64
2026-02-21T08:38:53.8156280Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:38:53.8156604Z     %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<128x512xf32>>
2026-02-21T08:38:53.8156928Z     %1 = tt.get_program_id x : i32
2026-02-21T08:38:53.8157147Z     scf.for %arg5 = %1 to %c32_i32 step %c9472_i32  : i32 {
2026-02-21T08:38:53.8157360Z       %2 = arith.muli %arg5, %c128_i32 : i32
2026-02-21T08:38:53.8157604Z       %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
2026-02-21T08:38:53.8157869Z       %4 = tt.splat %2 : i32 -> tensor<128xi32>
2026-02-21T08:38:53.8158057Z       %5 = arith.addi %4, %3 : tensor<128xi32>
2026-02-21T08:38:53.8158257Z       %c130560_i32 = arith.constant 130560 : i32
2026-02-21T08:38:53.8158445Z       %c1536_i32 = arith.constant 1536 : i32
2026-02-21T08:38:53.8158773Z       %6 = scf.for %arg6 = %c0_i32 to %c130560_i32 step %c1536_i32 iter_args(%arg7 = %cst_0) -> (tensor<128x512xf32>)  : i32 {
2026-02-21T08:38:53.8159146Z         %25 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:38:53.8159400Z         %26 = tt.splat %arg6 : i32 -> tensor<512xi32>
2026-02-21T08:38:53.8159600Z         %27 = arith.addi %26, %25 : tensor<512xi32>
2026-02-21T08:38:53.8159850Z         %28 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:38:53.8160109Z         %29 = arith.muli %28, %cst : tensor<128x1xi32>
2026-02-21T08:38:53.8160364Z         %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:38:53.8160650Z         %31 = tt.broadcast %29 : tensor<128x1xi32> -> tensor<128x512xi32>
2026-02-21T08:38:53.8160907Z         %32 = tt.broadcast %30 : tensor<1x512xi32> -> tensor<128x512xi32>
2026-02-21T08:38:53.8161148Z         %33 = arith.addi %31, %32 : tensor<128x512xi32>
2026-02-21T08:38:53.8161384Z         %34 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<128x512x!tt.ptr<f32>>
2026-02-21T08:38:53.8161664Z         %35 = tt.addptr %34, %33 : tensor<128x512x!tt.ptr<f32>>, tensor<128x512xi32>
2026-02-21T08:38:53.8162041Z         %36 = tt.load %35 evictionPolicy = evict_last : tensor<128x512x!tt.ptr<f32>>
2026-02-21T08:38:53.8162400Z         %37 = tt.descriptor_load %0[%2, %arg6] : !tt.tensordesc<tensor<128x512xf32>> -> tensor<128x512xf32>
2026-02-21T08:38:53.8162714Z         %38 = scf.if %arg3 -> (tensor<128x512xf32>) {
2026-02-21T08:38:53.8163091Z           %74 = tt.extern_elementwise %37 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x512xf32>) -> tensor<128x512xf32>
2026-02-21T08:38:53.8163634Z           %75 = arith.subf %37, %36 : tensor<128x512xf32>
2026-02-21T08:38:53.8163850Z           %76 = arith.mulf %74, %75 : tensor<128x512xf32>
2026-02-21T08:38:53.8164080Z           %77 = arith.addf %76, %cst_0 : tensor<128x512xf32>
2026-02-21T08:38:53.8164301Z           scf.yield %77 : tensor<128x512xf32>
2026-02-21T08:38:53.8164478Z         } else {
2026-02-21T08:38:53.8164656Z           %74 = tt.splat %arg4 : f32 -> tensor<128x512xf32>
2026-02-21T08:38:53.8164879Z           %75 = arith.cmpf ogt, %37, %74 : tensor<128x512xf32>
2026-02-21T08:38:53.8165109Z           %76 = arith.cmpf une, %37, %37 : tensor<128x512xf32>
2026-02-21T08:38:53.8165316Z           %77 = arith.ori %75, %76 : tensor<128x512xi1>
2026-02-21T08:38:53.8165579Z           %78 = arith.select %77, %37, %74 : tensor<128x512xi1>, tensor<128x512xf32>
2026-02-21T08:38:53.8165903Z           %79 = math.log %78 : tensor<128x512xf32>
2026-02-21T08:38:53.8166108Z           %80 = arith.subf %79, %36 : tensor<128x512xf32>
2026-02-21T08:38:53.8166317Z           %81 = arith.mulf %37, %80 : tensor<128x512xf32>
2026-02-21T08:38:53.8166521Z           %82 = arith.addf %81, %cst_0 : tensor<128x512xf32>
2026-02-21T08:38:53.8166722Z           scf.yield %82 : tensor<128x512xf32>
2026-02-21T08:38:53.8166885Z         }
2026-02-21T08:38:53.8167038Z         %39 = arith.addf %arg7, %38 : tensor<128x512xf32>
2026-02-21T08:38:53.8167247Z         %c1_i32 = arith.constant 1 : i32
2026-02-21T08:38:53.8167432Z         %40 = arith.muli %c512_i32, %c1_i32 : i32
2026-02-21T08:38:53.8167623Z         %41 = arith.addi %arg6, %40 : i32
2026-02-21T08:38:53.8167843Z         %42 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:38:53.8168087Z         %43 = tt.splat %41 : i32 -> tensor<512xi32>
2026-02-21T08:38:53.8168280Z         %44 = arith.addi %43, %42 : tensor<512xi32>
2026-02-21T08:38:53.8168531Z         %45 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:38:53.8168797Z         %46 = arith.muli %45, %cst : tensor<128x1xi32>
2026-02-21T08:38:53.8169040Z         %47 = tt.expand_dims %44 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:38:53.8169329Z         %48 = tt.broadcast %46 : tensor<128x1xi32> -> tensor<128x512xi32>
2026-02-21T08:38:53.8169586Z         %49 = tt.broadcast %47 : tensor<1x512xi32> -> tensor<128x512xi32>
2026-02-21T08:38:53.8169824Z         %50 = arith.addi %48, %49 : tensor<128x512xi32>
2026-02-21T08:38:53.8170051Z         %51 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<128x512x!tt.ptr<f32>>
2026-02-21T08:38:53.8170323Z         %52 = tt.addptr %51, %50 : tensor<128x512x!tt.ptr<f32>>, tensor<128x512xi32>
2026-02-21T08:38:53.8170615Z         %53 = tt.load %52 evictionPolicy = evict_last : tensor<128x512x!tt.ptr<f32>>
2026-02-21T08:38:53.8170943Z         %54 = tt.descriptor_load %0[%2, %41] : !tt.tensordesc<tensor<128x512xf32>> -> tensor<128x512xf32>
2026-02-21T08:38:53.8171237Z         %55 = scf.if %arg3 -> (tensor<128x512xf32>) {
2026-02-21T08:38:53.8171595Z           %74 = tt.extern_elementwise %54 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x512xf32>) -> tensor<128x512xf32>
2026-02-21T08:38:53.8172009Z           %75 = arith.subf %54, %53 : tensor<128x512xf32>
2026-02-21T08:38:53.8172215Z           %76 = arith.mulf %74, %75 : tensor<128x512xf32>
2026-02-21T08:38:53.8172421Z           %77 = arith.addf %76, %cst_0 : tensor<128x512xf32>
2026-02-21T08:38:53.8172625Z           scf.yield %77 : tensor<128x512xf32>
2026-02-21T08:38:53.8172792Z         } else {
2026-02-21T08:38:53.8172953Z           %74 = tt.splat %arg4 : f32 -> tensor<128x512xf32>
2026-02-21T08:38:53.8173168Z           %75 = arith.cmpf ogt, %54, %74 : tensor<128x512xf32>
2026-02-21T08:38:53.8173391Z           %76 = arith.cmpf une, %54, %54 : tensor<128x512xf32>
2026-02-21T08:38:53.8173603Z           %77 = arith.ori %75, %76 : tensor<128x512xi1>
2026-02-21T08:38:53.8173911Z           %78 = arith.select %77, %54, %74 : tensor<128x512xi1>, tensor<128x512xf32>
2026-02-21T08:38:53.8174159Z           %79 = math.log %78 : tensor<128x512xf32>
2026-02-21T08:38:53.8174360Z           %80 = arith.subf %79, %53 : tensor<128x512xf32>
2026-02-21T08:38:53.8174568Z           %81 = arith.mulf %54, %80 : tensor<128x512xf32>
2026-02-21T08:38:53.8174775Z           %82 = arith.addf %81, %cst_0 : tensor<128x512xf32>
2026-02-21T08:38:53.8174983Z           scf.yield %82 : tensor<128x512xf32>
2026-02-21T08:38:53.8175152Z         }
2026-02-21T08:38:53.8175307Z         %56 = arith.addf %39, %55 : tensor<128x512xf32>
2026-02-21T08:38:53.8175508Z         %c2_i32 = arith.constant 2 : i32
2026-02-21T08:38:53.8175695Z         %57 = arith.muli %c512_i32, %c2_i32 : i32
2026-02-21T08:38:53.8175910Z         %58 = arith.addi %arg6, %57 : i32
2026-02-21T08:38:53.8176143Z         %59 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:38:53.8176439Z         %60 = tt.splat %58 : i32 -> tensor<512xi32>
2026-02-21T08:38:53.8176646Z         %61 = arith.addi %60, %59 : tensor<512xi32>
2026-02-21T08:38:53.8176882Z         %62 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:38:53.8177151Z         %63 = arith.muli %62, %cst : tensor<128x1xi32>
2026-02-21T08:38:53.8177396Z         %64 = tt.expand_dims %61 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:38:53.8177690Z         %65 = tt.broadcast %63 : tensor<128x1xi32> -> tensor<128x512xi32>
2026-02-21T08:38:53.8177957Z         %66 = tt.broadcast %64 : tensor<1x512xi32> -> tensor<128x512xi32>
2026-02-21T08:38:53.8178191Z         %67 = arith.addi %65, %66 : tensor<128x512xi32>
2026-02-21T08:38:53.8178427Z         %68 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<128x512x!tt.ptr<f32>>
2026-02-21T08:38:53.8178693Z         %69 = tt.addptr %68, %67 : tensor<128x512x!tt.ptr<f32>>, tensor<128x512xi32>
2026-02-21T08:38:53.8178989Z         %70 = tt.load %69 evictionPolicy = evict_last : tensor<128x512x!tt.ptr<f32>>
2026-02-21T08:38:53.8179326Z         %71 = tt.descriptor_load %0[%2, %58] : !tt.tensordesc<tensor<128x512xf32>> -> tensor<128x512xf32>
2026-02-21T08:38:53.8179610Z         %72 = scf.if %arg3 -> (tensor<128x512xf32>) {
2026-02-21T08:38:53.8179972Z           %74 = tt.extern_elementwise %71 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x512xf32>) -> tensor<128x512xf32>
2026-02-21T08:38:53.8180331Z           %75 = arith.subf %71, %70 : tensor<128x512xf32>
2026-02-21T08:38:53.8180536Z           %76 = arith.mulf %74, %75 : tensor<128x512xf32>
2026-02-21T08:38:53.8180740Z           %77 = arith.addf %76, %cst_0 : tensor<128x512xf32>
2026-02-21T08:38:53.8180943Z           scf.yield %77 : tensor<128x512xf32>
2026-02-21T08:38:53.8181115Z         } else {
2026-02-21T08:38:53.8181271Z           %74 = tt.splat %arg4 : f32 -> tensor<128x512xf32>
2026-02-21T08:38:53.8181496Z           %75 = arith.cmpf ogt, %71, %74 : tensor<128x512xf32>
2026-02-21T08:38:53.8181711Z           %76 = arith.cmpf une, %71, %71 : tensor<128x512xf32>
2026-02-21T08:38:53.8181959Z           %77 = arith.ori %75, %76 : tensor<128x512xi1>
2026-02-21T08:38:53.8182193Z           %78 = arith.select %77, %71, %74 : tensor<128x512xi1>, tensor<128x512xf32>
2026-02-21T08:38:53.8182435Z           %79 = math.log %78 : tensor<128x512xf32>
2026-02-21T08:38:53.8182636Z           %80 = arith.subf %79, %70 : tensor<128x512xf32>
2026-02-21T08:38:53.8182833Z           %81 = arith.mulf %71, %80 : tensor<128x512xf32>
2026-02-21T08:38:53.8183044Z           %82 = arith.addf %81, %cst_0 : tensor<128x512xf32>
2026-02-21T08:38:53.8183239Z           scf.yield %82 : tensor<128x512xf32>
2026-02-21T08:38:53.8183410Z         }
2026-02-21T08:38:53.8183549Z         %73 = arith.addf %56, %72 : tensor<128x512xf32>
2026-02-21T08:38:53.8183742Z         scf.yield %73 : tensor<128x512xf32>
2026-02-21T08:38:53.8183919Z       } {tt.num_stages = 1 : i32}
2026-02-21T08:38:53.8184139Z       %7 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32>
2026-02-21T08:38:53.8184445Z       %8 = tt.splat %c130560_i32 : i32 -> tensor<512xi32>
2026-02-21T08:38:53.8184638Z       %9 = arith.addi %8, %7 : tensor<512xi32>
2026-02-21T08:38:53.8184877Z       %10 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
2026-02-21T08:38:53.8185126Z       %11 = arith.muli %10, %cst : tensor<128x1xi32>
2026-02-21T08:38:53.8185367Z       %12 = tt.expand_dims %9 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32>
2026-02-21T08:38:53.8185707Z       %13 = tt.broadcast %11 : tensor<128x1xi32> -> tensor<128x512xi32>
2026-02-21T08:38:53.8185960Z       %14 = tt.broadcast %12 : tensor<1x512xi32> -> tensor<128x512xi32>
2026-02-21T08:38:53.8186193Z       %15 = arith.addi %13, %14 : tensor<128x512xi32>
2026-02-21T08:38:53.8186418Z       %16 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<128x512x!tt.ptr<f32>>
2026-02-21T08:38:53.8186841Z       %17 = tt.addptr %16, %15 : tensor<128x512x!tt.ptr<f32>>, tensor<128x512xi32>
2026-02-21T08:38:53.8187248Z       %18 = tt.load %17 evictionPolicy = evict_last : tensor<128x512x!tt.ptr<f32>>
2026-02-21T08:38:53.8187776Z       %19 = tt.descriptor_load %0[%2, %c130560_i32] : !tt.tensordesc<tensor<128x512xf32>> -> tensor<128x512xf32>
2026-02-21T08:38:53.8188201Z       %20 = scf.if %arg3 -> (tensor<128x512xf32>) {
2026-02-21T08:38:53.8188611Z         %25 = tt.extern_elementwise %19 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x512xf32>) -> tensor<128x512xf32>
2026-02-21T08:38:53.8188986Z         %26 = arith.subf %19, %18 : tensor<128x512xf32>
2026-02-21T08:38:53.8189193Z         %27 = arith.mulf %25, %26 : tensor<128x512xf32>
2026-02-21T08:38:53.8189418Z         %28 = arith.addf %27, %cst_0 : tensor<128x512xf32>
2026-02-21T08:38:53.8189622Z         scf.yield %28 : tensor<128x512xf32>
2026-02-21T08:38:53.8189804Z       } else {
2026-02-21T08:38:53.8189968Z         %25 = tt.splat %arg4 : f32 -> tensor<128x512xf32>
2026-02-21T08:38:53.8190185Z         %26 = arith.cmpf ogt, %19, %25 : tensor<128x512xf32>
2026-02-21T08:38:53.8190467Z         %27 = arith.cmpf une, %19, %19 : tensor<128x512xf32>
2026-02-21T08:38:53.8190749Z         %28 = arith.ori %26, %27 : tensor<128x512xi1>
2026-02-21T08:38:53.8191074Z         %29 = arith.select %28, %19, %25 : tensor<128x512xi1>, tensor<128x512xf32>
2026-02-21T08:38:53.8191423Z         %30 = math.log %29 : tensor<128x512xf32>
2026-02-21T08:38:53.8191689Z         %31 = arith.subf %30, %18 : tensor<128x512xf32>
2026-02-21T08:38:53.8191990Z         %32 = arith.mulf %19, %31 : tensor<128x512xf32>
2026-02-21T08:38:53.8192257Z         %33 = arith.addf %32, %cst_0 : tensor<128x512xf32>
2026-02-21T08:38:53.8192520Z         scf.yield %33 : tensor<128x512xf32>
2026-02-21T08:38:53.8192736Z       }
2026-02-21T08:38:53.8192930Z       %21 = arith.addf %6, %20 : tensor<128x512xf32>
2026-02-21T08:38:53.8193183Z       %22 = "tt.reduce"(%21) <{axis = 1 : i32}> ({
2026-02-21T08:38:53.8193429Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:38:53.8193657Z         %25 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:38:53.8193904Z         tt.reduce.return %25 : f32
2026-02-21T08:38:53.8194149Z       }) : (tensor<128x512xf32>) -> tensor<128xf32>
2026-02-21T08:38:53.8194440Z       %23 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<128x!tt.ptr<f32>>
2026-02-21T08:38:53.8194781Z       %24 = tt.addptr %23, %5 : tensor<128x!tt.ptr<f32>>, tensor<128xi32>
2026-02-21T08:38:53.8195082Z       tt.store %24, %22 : tensor<128x!tt.ptr<f32>>
2026-02-21T08:38:53.8195425Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32, tt.warp_specialize}
2026-02-21T08:38:53.8195737Z     tt.return
2026-02-21T08:38:53.8195899Z   }
2026-02-21T08:38:53.8196049Z }
2026-02-21T08:38:53.8196136Z 
2026-02-21T08:38:53.8196197Z {-#
2026-02-21T08:38:53.8196362Z   external_resources: {
2026-02-21T08:38:53.8196554Z     mlir_reproducer: {
2026-02-21T08:38:53.8201108Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:38:53.8206156Z       disable_threading: false,
2026-02-21T08:38:53.8206326Z       verify_each: true
2026-02-21T08:38:53.8206463Z     }
2026-02-21T08:38:53.8206584Z   }
2026-02-21T08:38:53.8206692Z #-}
2026-02-21T08:38:53.8207122Z /tmp/torchinductor_root/eo/ceodyuyt7ys736a4ooyt23geblu5dmn43rrkpaapfpn63ttb3xo2.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:38:53.8208308Z /tmp/torchinductor_root/eo/ceodyuyt7ys736a4ooyt23geblu5dmn43rrkpaapfpn63ttb3xo2.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:38:53.8209264Z [63s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:38:53.8210364Z Config: @helion.kernel(config=helion.Config(block_sizes=[512, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=64, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[1, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True)
2026-02-21T08:38:53.8211356Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:38:53.8211608Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:38:54.5017307Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 3.7 configs/s
2026-02-21T08:38:54.5037505Z [64s] Adaptive compile timeout: 30s (90% percentile=13.3s, bounds=[30.0s, 30s])
2026-02-21T08:38:58.3073061Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 270/270 68.6 configs/s
2026-02-21T08:38:58.4775243Z [68s] Initial random population of 100, 5 starting points: 
2026-02-21T08:38:58.4775596Z error=11
2026-02-21T08:38:58.4775760Z timeout=9
2026-02-21T08:38:58.4775905Z ok=80
2026-02-21T08:38:58.4776057Z min=0.8233
2026-02-21T08:38:58.4776202Z mid=5.1088
2026-02-21T08:38:58.4776351Z max=380.8112
2026-02-21T08:38:58.4776522Z best={'block_sizes': [512, 2],
2026-02-21T08:38:58.4776787Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:38:58.4777052Z  'load_eviction_policies': ['', ''],
2026-02-21T08:38:58.4777313Z  'num_sm_multiplier': 128,
2026-02-21T08:38:58.4777896Z  'num_stages': 3,
2026-02-21T08:38:58.4778064Z  'num_warps': 1,
2026-02-21T08:38:58.4778219Z  'pid_type': 'persistent_blocked',
2026-02-21T08:38:58.4778394Z  'range_flattens': [None, False],
2026-02-21T08:38:58.4778574Z  'range_multi_buffers': [True, False],
2026-02-21T08:38:58.4778747Z  'range_num_stages': [1, 2],
2026-02-21T08:38:58.4778913Z  'range_unroll_factors': [4, 0],
2026-02-21T08:38:58.4779082Z  'range_warp_specializes': [False, True]}
2026-02-21T08:38:58.4792935Z [68s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:38:59.6372205Z [69s] Generation 1 starting: 79 neighbors, 5 active search path(s)
2026-02-21T08:39:28.5617781Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 0.3 configs/s
2026-02-21T08:39:33.7881166Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 83/83 16.0 configs/s
2026-02-21T08:39:51.3974920Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━ 276/276 15.6 configs/s
2026-02-21T08:39:51.6301418Z [121s] Generation 1 complete: 
2026-02-21T08:39:51.6303705Z ok=85
2026-02-21T08:39:51.6305057Z min=0.8048
2026-02-21T08:39:51.6305237Z mid=0.9575
2026-02-21T08:39:51.6305362Z max=4.8620
2026-02-21T08:39:51.6305514Z best={'block_sizes': [1024, 2],
2026-02-21T08:39:51.6305739Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:39:51.6305976Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:39:51.6306160Z  'num_stages': 7,
2026-02-21T08:39:51.6306294Z  'num_warps': 8,
2026-02-21T08:39:51.6306433Z  'pid_type': 'flat',
2026-02-21T08:39:51.6306582Z  'range_flattens': [None, None],
2026-02-21T08:39:51.6306762Z  'range_multi_buffers': [None, True],
2026-02-21T08:39:51.6306937Z  'range_num_stages': [0, 0],
2026-02-21T08:39:51.6307104Z  'range_unroll_factors': [0, 0],
2026-02-21T08:39:51.6307273Z  'range_warp_specializes': [None, True]}
2026-02-21T08:39:51.6322916Z [121s] Fitting surrogate: 185 points, 185 targets
2026-02-21T08:39:52.6770220Z [122s] Generation 2 starting: 74 neighbors, 5 active search path(s)
2026-02-21T08:39:58.0359501Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 28.6 configs/s
2026-02-21T08:40:01.1090317Z module {
2026-02-21T08:40:01.1094561Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:40:01.1095534Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:40:01.1095773Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:40:01.1096015Z     %cst = arith.constant dense<0.000000e+00> : tensor<4x1024xf32>
2026-02-21T08:40:01.1096252Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:40:01.1096433Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:40:01.1096632Z     %c131072_i32 = arith.constant 131072 : i32
2026-02-21T08:40:01.1096821Z     %c131072_i64 = arith.constant 131072 : i64
2026-02-21T08:40:01.1097042Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:40:01.1097370Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>>
2026-02-21T08:40:01.1097816Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>>
2026-02-21T08:40:01.1098132Z     %2 = tt.get_program_id x : i32
2026-02-21T08:40:01.1098307Z     %3 = arith.muli %2, %c4_i32 : i32
2026-02-21T08:40:01.1098527Z     %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:40:01.1098760Z     %5 = tt.splat %3 : i32 -> tensor<4xi32>
2026-02-21T08:40:01.1098946Z     %6 = arith.addi %5, %4 : tensor<4xi32>
2026-02-21T08:40:01.1099254Z     %7 = scf.for %arg5 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg6 = %cst) -> (tensor<4x1024xf32>)  : i32 {
2026-02-21T08:40:01.1099673Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:40:01.1100050Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:40:01.1100771Z       %13 = scf.if %arg3 -> (tensor<4x1024xf32>) {
2026-02-21T08:40:01.1101142Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32>
2026-02-21T08:40:01.1101510Z         %16 = arith.subf %12, %11 : tensor<4x1024xf32>
2026-02-21T08:40:01.1101719Z         %17 = arith.mulf %15, %16 : tensor<4x1024xf32>
2026-02-21T08:40:01.1102114Z         %18 = arith.addf %17, %cst : tensor<4x1024xf32>
2026-02-21T08:40:01.1102312Z         scf.yield %18 : tensor<4x1024xf32>
2026-02-21T08:40:01.1102488Z       } else {
2026-02-21T08:40:01.1102651Z         %15 = tt.splat %arg4 : f32 -> tensor<4x1024xf32>
2026-02-21T08:40:01.1102885Z         %16 = arith.cmpf ogt, %12, %15 : tensor<4x1024xf32>
2026-02-21T08:40:01.1103108Z         %17 = arith.cmpf une, %12, %12 : tensor<4x1024xf32>
2026-02-21T08:40:01.1103441Z         %18 = arith.ori %16, %17 : tensor<4x1024xi1>
2026-02-21T08:40:01.1103687Z         %19 = arith.select %18, %12, %15 : tensor<4x1024xi1>, tensor<4x1024xf32>
2026-02-21T08:40:01.1103940Z         %20 = math.log %19 : tensor<4x1024xf32>
2026-02-21T08:40:01.1104145Z         %21 = arith.subf %20, %11 : tensor<4x1024xf32>
2026-02-21T08:40:01.1104343Z         %22 = arith.mulf %12, %21 : tensor<4x1024xf32>
2026-02-21T08:40:01.1104549Z         %23 = arith.addf %22, %cst : tensor<4x1024xf32>
2026-02-21T08:40:01.1104736Z         scf.yield %23 : tensor<4x1024xf32>
2026-02-21T08:40:01.1104906Z       }
2026-02-21T08:40:01.1105045Z       %14 = arith.addf %arg6, %13 : tensor<4x1024xf32>
2026-02-21T08:40:01.1105238Z       scf.yield %14 : tensor<4x1024xf32>
2026-02-21T08:40:01.1105556Z     } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:40:01.1105873Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:40:01.1106061Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:40:01.1106232Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:40:01.1106415Z       tt.reduce.return %11 : f32
2026-02-21T08:40:01.1106589Z     }) : (tensor<4x1024xf32>) -> tensor<4xf32>
2026-02-21T08:40:01.1106812Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:40:01.1107093Z     %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:40:01.1107311Z     tt.store %10, %8 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:40:01.1107490Z     tt.return
2026-02-21T08:40:01.1107608Z   }
2026-02-21T08:40:01.1107745Z }
2026-02-21T08:40:01.1107814Z 
2026-02-21T08:40:01.1107864Z {-#
2026-02-21T08:40:01.1107996Z   external_resources: {
2026-02-21T08:40:01.1108151Z     mlir_reproducer: {
2026-02-21T08:40:01.1112644Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:40:01.1117223Z       disable_threading: false,
2026-02-21T08:40:01.1117387Z       verify_each: true
2026-02-21T08:40:01.1117524Z     }
2026-02-21T08:40:01.1117644Z   }
2026-02-21T08:40:01.1117748Z #-}
2026-02-21T08:40:01.1118166Z /tmp/torchinductor_root/fj/cfjjwp2d5ymqupw6azxeyvudspbobtgib35peteoayy42bebl5rg.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:40:01.1119386Z /tmp/torchinductor_root/fj/cfjjwp2d5ymqupw6azxeyvudspbobtgib35peteoayy42bebl5rg.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:40:01.1120347Z [130s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:40:01.1121298Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:40:01.1122202Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:40:01.1122448Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:40:01.3004798Z module {
2026-02-21T08:40:01.3005416Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:40:01.3006032Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:40:01.3006221Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:40:01.3006452Z     %cst = arith.constant dense<0.000000e+00> : tensor<4x1024xf32>
2026-02-21T08:40:01.3006674Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:40:01.3006862Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:40:01.3007051Z     %c131072_i32 = arith.constant 131072 : i32
2026-02-21T08:40:01.3007247Z     %c131072_i64 = arith.constant 131072 : i64
2026-02-21T08:40:01.3007426Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:40:01.3007750Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>>
2026-02-21T08:40:01.3008203Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>>
2026-02-21T08:40:01.3008509Z     %2 = tt.get_program_id x : i32
2026-02-21T08:40:01.3008688Z     %3 = arith.muli %2, %c4_i32 : i32
2026-02-21T08:40:01.3008902Z     %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:40:01.3009140Z     %5 = tt.splat %3 : i32 -> tensor<4xi32>
2026-02-21T08:40:01.3009328Z     %6 = arith.addi %5, %4 : tensor<4xi32>
2026-02-21T08:40:01.3009633Z     %7 = scf.for %arg5 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg6 = %cst) -> (tensor<4x1024xf32>)  : i32 {
2026-02-21T08:40:01.3010044Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:40:01.3010402Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:40:01.3010691Z       %13 = scf.if %arg3 -> (tensor<4x1024xf32>) {
2026-02-21T08:40:01.3011324Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32>
2026-02-21T08:40:01.3011696Z         %16 = arith.subf %12, %11 : tensor<4x1024xf32>
2026-02-21T08:40:01.3012073Z         %17 = arith.mulf %15, %16 : tensor<4x1024xf32>
2026-02-21T08:40:01.3012280Z         %18 = arith.addf %17, %cst : tensor<4x1024xf32>
2026-02-21T08:40:01.3012482Z         scf.yield %18 : tensor<4x1024xf32>
2026-02-21T08:40:01.3012648Z       } else {
2026-02-21T08:40:01.3012813Z         %15 = tt.splat %arg4 : f32 -> tensor<4x1024xf32>
2026-02-21T08:40:01.3013029Z         %16 = arith.cmpf ogt, %12, %15 : tensor<4x1024xf32>
2026-02-21T08:40:01.3013257Z         %17 = arith.cmpf une, %12, %12 : tensor<4x1024xf32>
2026-02-21T08:40:01.3013468Z         %18 = arith.ori %16, %17 : tensor<4x1024xi1>
2026-02-21T08:40:01.3013703Z         %19 = arith.select %18, %12, %15 : tensor<4x1024xi1>, tensor<4x1024xf32>
2026-02-21T08:40:01.3014022Z         %20 = math.log %19 : tensor<4x1024xf32>
2026-02-21T08:40:01.3014217Z         %21 = arith.subf %20, %11 : tensor<4x1024xf32>
2026-02-21T08:40:01.3014411Z         %22 = arith.mulf %12, %21 : tensor<4x1024xf32>
2026-02-21T08:40:01.3014609Z         %23 = arith.addf %22, %cst : tensor<4x1024xf32>
2026-02-21T08:40:01.3014808Z         scf.yield %23 : tensor<4x1024xf32>
2026-02-21T08:40:01.3014979Z       }
2026-02-21T08:40:01.3015118Z       %14 = arith.addf %arg6, %13 : tensor<4x1024xf32>
2026-02-21T08:40:01.3015312Z       scf.yield %14 : tensor<4x1024xf32>
2026-02-21T08:40:01.3015614Z     } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize}
2026-02-21T08:40:01.3015936Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:40:01.3016115Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:40:01.3016288Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:40:01.3016468Z       tt.reduce.return %11 : f32
2026-02-21T08:40:01.3016653Z     }) : (tensor<4x1024xf32>) -> tensor<4xf32>
2026-02-21T08:40:01.3016877Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:40:01.3017123Z     %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:40:01.3017351Z     tt.store %10, %8 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:40:01.3017522Z     tt.return
2026-02-21T08:40:01.3017650Z   }
2026-02-21T08:40:01.3017764Z }
2026-02-21T08:40:01.3017839Z 
2026-02-21T08:40:01.3017887Z {-#
2026-02-21T08:40:01.3018009Z   external_resources: {
2026-02-21T08:40:01.3018166Z     mlir_reproducer: {
2026-02-21T08:40:01.3022391Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:40:01.3026763Z       disable_threading: false,
2026-02-21T08:40:01.3026954Z       verify_each: true
2026-02-21T08:40:01.3027102Z     }
2026-02-21T08:40:01.3027251Z   }
2026-02-21T08:40:01.3027394Z #-}
2026-02-21T08:40:01.3027939Z /tmp/torchinductor_root/fj/cfjjwp2d5ymqupw6azxeyvudspbobtgib35peteoayy42bebl5rg.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:40:01.3029325Z /tmp/torchinductor_root/fj/cfjjwp2d5ymqupw6azxeyvudspbobtgib35peteoayy42bebl5rg.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:40:01.3030436Z [131s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:40:01.3031500Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_stages=8, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:40:01.3032480Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:40:01.3032744Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:40:02.6545168Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 16.4 configs/s
2026-02-21T08:40:18.8577900Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━ 276/276 17.0 configs/s
2026-02-21T08:40:19.0813938Z [148s] Generation 2 complete: 
2026-02-21T08:40:19.0817765Z error=2
2026-02-21T08:40:19.0821635Z ok=77
2026-02-21T08:40:19.0826165Z min=0.8273
2026-02-21T08:40:19.0830869Z mid=0.9144
2026-02-21T08:40:19.0835437Z max=2.6329
2026-02-21T08:40:19.0835664Z best={'block_sizes': [1024, 1],
2026-02-21T08:40:19.0840574Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:40:19.0841987Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:40:19.0842221Z  'num_stages': 7,
2026-02-21T08:40:19.0842368Z  'num_warps': 2,
2026-02-21T08:40:19.0842524Z  'pid_type': 'flat',
2026-02-21T08:40:19.0842698Z  'range_flattens': [None, None],
2026-02-21T08:40:19.0842872Z  'range_multi_buffers': [None, None],
2026-02-21T08:40:19.0843057Z  'range_num_stages': [0, 0],
2026-02-21T08:40:19.0843215Z  'range_unroll_factors': [0, 0],
2026-02-21T08:40:19.0843394Z  'range_warp_specializes': [None, True]}
2026-02-21T08:40:19.0843693Z [148s] Fitting surrogate: 264 points, 264 targets
2026-02-21T08:40:20.0657962Z [149s] Generation 3 starting: 66 neighbors, 5 active search path(s)
2026-02-21T08:40:25.9010677Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 2.7 configs/s
2026-02-21T08:40:30.2407067Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 15.8 configs/s
2026-02-21T08:40:47.2681171Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━ 276/276 16.2 configs/s
2026-02-21T08:40:47.4999309Z [177s] Generation 3 complete: 
2026-02-21T08:40:47.5002492Z ok=72
2026-02-21T08:40:47.5006534Z min=0.8499
2026-02-21T08:40:47.5010379Z mid=0.8683
2026-02-21T08:40:47.5015750Z max=3.1497
2026-02-21T08:40:47.5017757Z best={'block_sizes': [1024, 1],
2026-02-21T08:40:47.5018019Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:40:47.5018271Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:40:47.5018470Z  'num_stages': 7,
2026-02-21T08:40:47.5018614Z  'num_warps': 2,
2026-02-21T08:40:47.5018797Z  'pid_type': 'flat',
2026-02-21T08:40:47.5019290Z  'range_flattens': [None, None],
2026-02-21T08:40:47.5019482Z  'range_multi_buffers': [None, None],
2026-02-21T08:40:47.5019666Z  'range_num_stages': [0, 0],
2026-02-21T08:40:47.5019843Z  'range_unroll_factors': [0, 0],
2026-02-21T08:40:47.5020020Z  'range_warp_specializes': [None, True]}
2026-02-21T08:40:47.5020244Z [177s] Fitting surrogate: 336 points, 336 targets
2026-02-21T08:40:48.3869507Z [178s] Generation 4 starting: 61 neighbors, 5 active search path(s)
2026-02-21T08:40:51.6807098Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 20.7 configs/s
2026-02-21T08:40:52.4356041Z module {
2026-02-21T08:40:52.4356645Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:40:52.4361039Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:40:52.4362869Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:40:52.4363183Z     %cst = arith.constant dense<0.000000e+00> : tensor<4x512xf32>
2026-02-21T08:40:52.4363411Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:40:52.4363596Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:40:52.4363785Z     %c131072_i32 = arith.constant 131072 : i32
2026-02-21T08:40:52.4363982Z     %c131072_i64 = arith.constant 131072 : i64
2026-02-21T08:40:52.4364167Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:40:52.4364483Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<4x512xf32>>
2026-02-21T08:40:52.4364923Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<4x512xf32>>
2026-02-21T08:40:52.4365229Z     %2 = tt.get_program_id x : i32
2026-02-21T08:40:52.4365406Z     %3 = arith.muli %2, %c4_i32 : i32
2026-02-21T08:40:52.4365629Z     %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:40:52.4365870Z     %5 = tt.splat %3 : i32 -> tensor<4xi32>
2026-02-21T08:40:52.4366076Z     %6 = arith.addi %5, %4 : tensor<4xi32>
2026-02-21T08:40:52.4366383Z     %7 = scf.for %arg5 = %c0_i32 to %c131072_i32 step %c512_i32 iter_args(%arg6 = %cst) -> (tensor<4x512xf32>)  : i32 {
2026-02-21T08:40:52.4366794Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<4x512xf32>> -> tensor<4x512xf32>
2026-02-21T08:40:52.4367146Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<4x512xf32>> -> tensor<4x512xf32>
2026-02-21T08:40:52.4367430Z       %13 = scf.if %arg3 -> (tensor<4x512xf32>) {
2026-02-21T08:40:52.4367858Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x512xf32>) -> tensor<4x512xf32>
2026-02-21T08:40:52.4372245Z         %16 = arith.subf %12, %11 : tensor<4x512xf32>
2026-02-21T08:40:52.4376592Z         %17 = arith.mulf %15, %16 : tensor<4x512xf32>
2026-02-21T08:40:52.4381720Z         %18 = arith.addf %17, %cst : tensor<4x512xf32>
2026-02-21T08:40:52.4383476Z         scf.yield %18 : tensor<4x512xf32>
2026-02-21T08:40:52.4383696Z       } else {
2026-02-21T08:40:52.4383869Z         %15 = tt.splat %arg4 : f32 -> tensor<4x512xf32>
2026-02-21T08:40:52.4384104Z         %16 = arith.cmpf ogt, %12, %15 : tensor<4x512xf32>
2026-02-21T08:40:52.4384321Z         %17 = arith.cmpf une, %12, %12 : tensor<4x512xf32>
2026-02-21T08:40:52.4384539Z         %18 = arith.ori %16, %17 : tensor<4x512xi1>
2026-02-21T08:40:52.4384778Z         %19 = arith.select %18, %12, %15 : tensor<4x512xi1>, tensor<4x512xf32>
2026-02-21T08:40:52.4385021Z         %20 = math.log %19 : tensor<4x512xf32>
2026-02-21T08:40:52.4385223Z         %21 = arith.subf %20, %11 : tensor<4x512xf32>
2026-02-21T08:40:52.4385425Z         %22 = arith.mulf %12, %21 : tensor<4x512xf32>
2026-02-21T08:40:52.4385629Z         %23 = arith.addf %22, %cst : tensor<4x512xf32>
2026-02-21T08:40:52.4385820Z         scf.yield %23 : tensor<4x512xf32>
2026-02-21T08:40:52.4386006Z       }
2026-02-21T08:40:52.4386156Z       %14 = arith.addf %arg6, %13 : tensor<4x512xf32>
2026-02-21T08:40:52.4386350Z       scf.yield %14 : tensor<4x512xf32>
2026-02-21T08:40:52.4386661Z     } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:40:52.4386988Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:40:52.4387176Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:40:52.4387343Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:40:52.4387527Z       tt.reduce.return %11 : f32
2026-02-21T08:40:52.4387703Z     }) : (tensor<4x512xf32>) -> tensor<4xf32>
2026-02-21T08:40:52.4387929Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:40:52.4388179Z     %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:40:52.4388402Z     tt.store %10, %8 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:40:52.4388578Z     tt.return
2026-02-21T08:40:52.4388696Z   }
2026-02-21T08:40:52.4389009Z }
2026-02-21T08:40:52.4389090Z 
2026-02-21T08:40:52.4389138Z {-#
2026-02-21T08:40:52.4389269Z   external_resources: {
2026-02-21T08:40:52.4389421Z     mlir_reproducer: {
2026-02-21T08:40:52.4393760Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:40:52.4398154Z       disable_threading: false,
2026-02-21T08:40:52.4398326Z       verify_each: true
2026-02-21T08:40:52.4398563Z     }
2026-02-21T08:40:52.4398675Z   }
2026-02-21T08:40:52.4398792Z #-}
2026-02-21T08:40:52.4399199Z /tmp/torchinductor_root/av/cavgjzn4xdul3dvrx4qwewvpikdjh7yah47dgg3c7k5theizcaka.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:40:52.4400372Z /tmp/torchinductor_root/av/cavgjzn4xdul3dvrx4qwewvpikdjh7yah47dgg3c7k5theizcaka.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:40:52.4401324Z [182s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:40:52.4402343Z Config: @helion.kernel(config=helion.Config(block_sizes=[512, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:40:52.4403230Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:40:52.4403489Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:40:55.4619133Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 63/63 16.9 configs/s
2026-02-21T08:41:10.3109027Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 276/276 18.5 configs/s
2026-02-21T08:41:10.5334893Z [200s] Generation 4 complete: 
2026-02-21T08:41:10.5336658Z error=2
2026-02-21T08:41:10.5336807Z ok=65
2026-02-21T08:41:10.5336944Z min=0.8469
2026-02-21T08:41:10.5337069Z mid=0.8653
2026-02-21T08:41:10.5337200Z max=3.2273
2026-02-21T08:41:10.5337342Z best={'block_sizes': [1024, 1],
2026-02-21T08:41:10.5337903Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:41:10.5338172Z  'load_eviction_policies': ['last', ''],
2026-02-21T08:41:10.5338361Z  'num_stages': 7,
2026-02-21T08:41:10.5338515Z  'num_warps': 1,
2026-02-21T08:41:10.5338653Z  'pid_type': 'flat',
2026-02-21T08:41:10.5338812Z  'range_flattens': [None, None],
2026-02-21T08:41:10.5338981Z  'range_multi_buffers': [None, False],
2026-02-21T08:41:10.5339161Z  'range_num_stages': [0, 1],
2026-02-21T08:41:10.5339317Z  'range_unroll_factors': [0, 0],
2026-02-21T08:41:10.5339497Z  'range_warp_specializes': [None, True]}
2026-02-21T08:41:10.5355016Z [200s] Fitting surrogate: 403 points, 403 targets
2026-02-21T08:41:11.3702252Z [201s] Generation 5 starting: 59 neighbors, 5 active search path(s)
2026-02-21T08:41:15.7046711Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 6.1 configs/s
2026-02-21T08:41:19.5675653Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 62/62 16.2 configs/s
2026-02-21T08:41:34.0593390Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━ 276/276 19.0 configs/s
2026-02-21T08:41:34.2772428Z [224s] Generation 5 complete: 
2026-02-21T08:41:34.2774361Z ok=64
2026-02-21T08:41:34.2774523Z min=0.8443
2026-02-21T08:41:34.2774650Z mid=0.9020
2026-02-21T08:41:34.2774774Z max=4.4452
2026-02-21T08:41:34.2774905Z best={'block_sizes': [1024, 1],
2026-02-21T08:41:34.2775134Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:41:34.2775360Z  'load_eviction_policies': ['', ''],
2026-02-21T08:41:34.2775526Z  'num_stages': 8,
2026-02-21T08:41:34.2775667Z  'num_warps': 1,
2026-02-21T08:41:34.2775801Z  'pid_type': 'flat',
2026-02-21T08:41:34.2775960Z  'range_flattens': [None, None],
2026-02-21T08:41:34.2776132Z  'range_multi_buffers': [None, False],
2026-02-21T08:41:34.2776312Z  'range_num_stages': [0, 1],
2026-02-21T08:41:34.2776472Z  'range_unroll_factors': [0, 0],
2026-02-21T08:41:34.2776651Z  'range_warp_specializes': [None, True]}
2026-02-21T08:41:34.2791712Z [224s] Fitting surrogate: 467 points, 467 targets
2026-02-21T08:41:35.1952760Z [225s] Generation 6 starting: 61 neighbors, 5 active search path(s)
2026-02-21T08:41:40.6770857Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 2.4 configs/s
2026-02-21T08:41:44.3835586Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 61/61 16.6 configs/s
2026-02-21T08:41:59.6353095Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 276/276 18.0 configs/s
2026-02-21T08:41:59.8609392Z [249s] Generation 6 complete: 
2026-02-21T08:41:59.8609651Z ok=66
2026-02-21T08:41:59.8609837Z min=0.8591
2026-02-21T08:41:59.8609989Z mid=0.8776
2026-02-21T08:41:59.8610139Z max=3.6290
2026-02-21T08:41:59.8610308Z best={'block_sizes': [1024, 1],
2026-02-21T08:41:59.8610561Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:41:59.8610820Z  'load_eviction_policies': ['', ''],
2026-02-21T08:41:59.8615536Z  'num_stages': 8,
2026-02-21T08:41:59.8615735Z  'num_warps': 1,
2026-02-21T08:41:59.8615885Z  'pid_type': 'flat',
2026-02-21T08:41:59.8616356Z  'range_flattens': [None, None],
2026-02-21T08:41:59.8616564Z  'range_multi_buffers': [None, False],
2026-02-21T08:41:59.8616748Z  'range_num_stages': [0, 1],
2026-02-21T08:41:59.8616909Z  'range_unroll_factors': [0, 1],
2026-02-21T08:41:59.8617090Z  'range_warp_specializes': [None, True]}
2026-02-21T08:41:59.8630597Z [249s] Fitting surrogate: 533 points, 533 targets
2026-02-21T08:42:00.6755238Z [250s] Generation 7 starting: 50 neighbors, 5 active search path(s)
2026-02-21T08:42:03.5758623Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 30.8 configs/s
2026-02-21T08:42:04.3683478Z module {
2026-02-21T08:42:04.3684169Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:42:04.3684871Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:42:04.3685108Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:42:04.3685384Z     %cst = arith.constant dense<0.000000e+00> : tensor<4x1024xf32>
2026-02-21T08:42:04.3685656Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:42:04.3685838Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:42:04.3686030Z     %c131072_i32 = arith.constant 131072 : i32
2026-02-21T08:42:04.3686218Z     %c131072_i64 = arith.constant 131072 : i64
2026-02-21T08:42:04.3686401Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:42:04.3686714Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>>
2026-02-21T08:42:04.3687156Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>>
2026-02-21T08:42:04.3687458Z     %2 = tt.get_program_id x : i32
2026-02-21T08:42:04.3687634Z     %3 = arith.muli %2, %c4_i32 : i32
2026-02-21T08:42:04.3687852Z     %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:42:04.3688088Z     %5 = tt.splat %3 : i32 -> tensor<4xi32>
2026-02-21T08:42:04.3688281Z     %6 = arith.addi %5, %4 : tensor<4xi32>
2026-02-21T08:42:04.3688586Z     %7 = scf.for %arg5 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg6 = %cst) -> (tensor<4x1024xf32>)  : i32 {
2026-02-21T08:42:04.3688993Z       %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:42:04.3689352Z       %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:42:04.3689641Z       %13 = scf.if %arg3 -> (tensor<4x1024xf32>) {
2026-02-21T08:42:04.3690003Z         %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32>
2026-02-21T08:42:04.3690362Z         %16 = arith.subf %12, %11 : tensor<4x1024xf32>
2026-02-21T08:42:04.3690566Z         %17 = arith.mulf %15, %16 : tensor<4x1024xf32>
2026-02-21T08:42:04.3690768Z         %18 = arith.addf %17, %cst : tensor<4x1024xf32>
2026-02-21T08:42:04.3691310Z         scf.yield %18 : tensor<4x1024xf32>
2026-02-21T08:42:04.3691481Z       } else {
2026-02-21T08:42:04.3691645Z         %15 = tt.splat %arg4 : f32 -> tensor<4x1024xf32>
2026-02-21T08:42:04.3692023Z         %16 = arith.cmpf ogt, %12, %15 : tensor<4x1024xf32>
2026-02-21T08:42:04.3692249Z         %17 = arith.cmpf une, %12, %12 : tensor<4x1024xf32>
2026-02-21T08:42:04.3692461Z         %18 = arith.ori %16, %17 : tensor<4x1024xi1>
2026-02-21T08:42:04.3692696Z         %19 = arith.select %18, %12, %15 : tensor<4x1024xi1>, tensor<4x1024xf32>
2026-02-21T08:42:04.3692945Z         %20 = math.log %19 : tensor<4x1024xf32>
2026-02-21T08:42:04.3693136Z         %21 = arith.subf %20, %11 : tensor<4x1024xf32>
2026-02-21T08:42:04.3693337Z         %22 = arith.mulf %12, %21 : tensor<4x1024xf32>
2026-02-21T08:42:04.3693546Z         %23 = arith.addf %22, %cst : tensor<4x1024xf32>
2026-02-21T08:42:04.3693737Z         scf.yield %23 : tensor<4x1024xf32>
2026-02-21T08:42:04.3693907Z       }
2026-02-21T08:42:04.3694151Z       %14 = arith.addf %arg6, %13 : tensor<4x1024xf32>
2026-02-21T08:42:04.3694357Z       scf.yield %14 : tensor<4x1024xf32>
2026-02-21T08:42:04.3694660Z     } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize}
2026-02-21T08:42:04.3694984Z     %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({
2026-02-21T08:42:04.3695171Z     ^bb0(%arg5: f32, %arg6: f32):
2026-02-21T08:42:04.3695342Z       %11 = arith.addf %arg5, %arg6 : f32
2026-02-21T08:42:04.3695528Z       tt.reduce.return %11 : f32
2026-02-21T08:42:04.3695707Z     }) : (tensor<4x1024xf32>) -> tensor<4xf32>
2026-02-21T08:42:04.3695933Z     %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:42:04.3696179Z     %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:42:04.3696413Z     tt.store %10, %8 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:42:04.3696584Z     tt.return
2026-02-21T08:42:04.3696712Z   }
2026-02-21T08:42:04.3696836Z }
2026-02-21T08:42:04.3696903Z 
2026-02-21T08:42:04.3696955Z {-#
2026-02-21T08:42:04.3697082Z   external_resources: {
2026-02-21T08:42:04.3697230Z     mlir_reproducer: {
2026-02-21T08:42:04.3701444Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:42:04.3705806Z       disable_threading: false,
2026-02-21T08:42:04.3705970Z       verify_each: true
2026-02-21T08:42:04.3706195Z     }
2026-02-21T08:42:04.3706311Z   }
2026-02-21T08:42:04.3706430Z #-}
2026-02-21T08:42:04.3706846Z /tmp/torchinductor_root/x7/cx77nrsi5h5ftxazbuunxznwcmli6oncasidotgp6j4mzzin64mi.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:42:04.3708118Z /tmp/torchinductor_root/x7/cx77nrsi5h5ftxazbuunxznwcmli6oncasidotgp6j4mzzin64mi.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:42:04.3709147Z [254s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:42:04.3710288Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:42:04.3711263Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:42:04.3711522Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:42:06.8374889Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 54/54 16.8 configs/s
2026-02-21T08:42:18.7811842Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━ 276/276 23.0 configs/s
2026-02-21T08:42:18.9866785Z [268s] Generation 7 complete: 
2026-02-21T08:42:18.9870703Z error=2
2026-02-21T08:42:18.9875305Z ok=54
2026-02-21T08:42:18.9879607Z min=0.8439
2026-02-21T08:42:18.9884044Z mid=0.9154
2026-02-21T08:42:18.9889126Z max=4.9690
2026-02-21T08:42:18.9893815Z best={'block_sizes': [1024, 1],
2026-02-21T08:42:18.9894150Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:42:18.9898800Z  'load_eviction_policies': ['first', ''],
2026-02-21T08:42:18.9902558Z  'num_stages': 8,
2026-02-21T08:42:18.9907058Z  'num_warps': 1,
2026-02-21T08:42:18.9908888Z  'pid_type': 'flat',
2026-02-21T08:42:18.9909089Z  'range_flattens': [None, None],
2026-02-21T08:42:18.9909271Z  'range_multi_buffers': [None, False],
2026-02-21T08:42:18.9909458Z  'range_num_stages': [0, 1],
2026-02-21T08:42:18.9909620Z  'range_unroll_factors': [0, 1],
2026-02-21T08:42:18.9909800Z  'range_warp_specializes': [None, True]}
2026-02-21T08:42:18.9910084Z [268s] Fitting surrogate: 589 points, 589 targets
2026-02-21T08:42:19.8464143Z [269s] Generation 8 starting: 48 neighbors, 4 active search path(s)
2026-02-21T08:42:23.9218831Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49/49 3.1 configs/s
2026-02-21T08:42:26.9104551Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 49/49 16.6 configs/s
2026-02-21T08:42:38.4361108Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━ 276/276 23.8 configs/s
2026-02-21T08:42:38.6367717Z [288s] Generation 8 complete: 
2026-02-21T08:42:38.6373337Z ok=52
2026-02-21T08:42:38.6375028Z min=0.8320
2026-02-21T08:42:38.6375193Z mid=0.8786
2026-02-21T08:42:38.6375313Z max=2.6164
2026-02-21T08:42:38.6375458Z best={'block_sizes': [2048, 2],
2026-02-21T08:42:38.6375681Z  'indexing': ['pointer', 'pointer', 'tensor_descriptor'],
2026-02-21T08:42:38.6375915Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:42:38.6376089Z  'num_stages': 7,
2026-02-21T08:42:38.6376232Z  'num_warps': 32,
2026-02-21T08:42:38.6376373Z  'pid_type': 'flat',
2026-02-21T08:42:38.6376526Z  'range_flattens': [None, False],
2026-02-21T08:42:38.6376711Z  'range_multi_buffers': [None, False],
2026-02-21T08:42:38.6376891Z  'range_num_stages': [0, 0],
2026-02-21T08:42:38.6377090Z  'range_unroll_factors': [0, 0],
2026-02-21T08:42:38.6377263Z  'range_warp_specializes': [None, False]}
2026-02-21T08:42:38.6390875Z [288s] Fitting surrogate: 641 points, 641 targets
2026-02-21T08:42:39.2907877Z [289s] Generation 9 starting: 31 neighbors, 3 active search path(s)
2026-02-21T08:42:42.9391711Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 2.9 configs/s
2026-02-21T08:42:43.7004085Z module attributes {ttg.maxnreg = 64 : i32} {
2026-02-21T08:42:43.7008454Z   tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} {
2026-02-21T08:42:43.7012781Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:42:43.7014214Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:42:43.7014445Z     %c148_i32 = arith.constant 148 : i32
2026-02-21T08:42:43.7014674Z     %cst = arith.constant dense<0.000000e+00> : tensor<4x1024xf32>
2026-02-21T08:42:43.7014907Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:42:43.7015087Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:42:43.7015596Z     %c131072_i32 = arith.constant 131072 : i32
2026-02-21T08:42:43.7015820Z     %c131072_i64 = arith.constant 131072 : i64
2026-02-21T08:42:43.7015998Z     %c1_i64 = arith.constant 1 : i64
2026-02-21T08:42:43.7016317Z     %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>>
2026-02-21T08:42:43.7016757Z     %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>>
2026-02-21T08:42:43.7017069Z     %2 = tt.get_program_id x : i32
2026-02-21T08:42:43.7017242Z     %3 = arith.subi %c1024_i32, %2 : i32
2026-02-21T08:42:43.7017411Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:42:43.7017588Z     %4 = arith.subi %c148_i32, %c1_i32 : i32
2026-02-21T08:42:43.7017758Z     %5 = arith.addi %3, %4 : i32
2026-02-21T08:42:43.7017924Z     %6 = arith.divui %5, %c148_i32 : i32
2026-02-21T08:42:43.7018089Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:42:43.7018259Z     %7 = arith.remsi %6, %c2_i32 : i32
2026-02-21T08:42:43.7018427Z     %8 = arith.subi %6, %7 : i32
2026-02-21T08:42:43.7018598Z     %9 = arith.muli %8, %c148_i32 : i32
2026-02-21T08:42:43.7018770Z     %10 = arith.addi %2, %9 : i32
2026-02-21T08:42:43.7018957Z     %11 = arith.muli %c148_i32, %c2_i32 : i32
2026-02-21T08:42:43.7019153Z     scf.for %arg5 = %2 to %10 step %11  : i32 {
2026-02-21T08:42:43.7019342Z       %12 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:42:43.7019568Z       %13 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:42:43.7019805Z       %14 = tt.splat %12 : i32 -> tensor<4xi32>
2026-02-21T08:42:43.7019996Z       %15 = arith.addi %14, %13 : tensor<4xi32>
2026-02-21T08:42:43.7020314Z       %16 = scf.for %arg6 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<4x1024xf32>)  : i32 {
2026-02-21T08:42:43.7020723Z         %30 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:42:43.7021095Z         %31 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:42:43.7021390Z         %32 = scf.if %arg3 -> (tensor<4x1024xf32>) {
2026-02-21T08:42:43.7021760Z           %34 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32>
2026-02-21T08:42:43.7022218Z           %35 = arith.subf %31, %30 : tensor<4x1024xf32>
2026-02-21T08:42:43.7022425Z           %36 = arith.mulf %34, %35 : tensor<4x1024xf32>
2026-02-21T08:42:43.7022643Z           %37 = arith.addf %36, %cst : tensor<4x1024xf32>
2026-02-21T08:42:43.7022844Z           scf.yield %37 : tensor<4x1024xf32>
2026-02-21T08:42:43.7023021Z         } else {
2026-02-21T08:42:43.7023232Z           %34 = tt.splat %arg4 : f32 -> tensor<4x1024xf32>
2026-02-21T08:42:43.7023460Z           %35 = arith.cmpf ogt, %31, %34 : tensor<4x1024xf32>
2026-02-21T08:42:43.7023688Z           %36 = arith.cmpf une, %31, %31 : tensor<4x1024xf32>
2026-02-21T08:42:43.7023897Z           %37 = arith.ori %35, %36 : tensor<4x1024xi1>
2026-02-21T08:42:43.7024298Z           %38 = arith.select %37, %31, %34 : tensor<4x1024xi1>, tensor<4x1024xf32>
2026-02-21T08:42:43.7024551Z           %39 = math.log %38 : tensor<4x1024xf32>
2026-02-21T08:42:43.7024754Z           %40 = arith.subf %39, %30 : tensor<4x1024xf32>
2026-02-21T08:42:43.7024963Z           %41 = arith.mulf %31, %40 : tensor<4x1024xf32>
2026-02-21T08:42:43.7025164Z           %42 = arith.addf %41, %cst : tensor<4x1024xf32>
2026-02-21T08:42:43.7025366Z           scf.yield %42 : tensor<4x1024xf32>
2026-02-21T08:42:43.7025534Z         }
2026-02-21T08:42:43.7025686Z         %33 = arith.addf %arg7, %32 : tensor<4x1024xf32>
2026-02-21T08:42:43.7025877Z         scf.yield %33 : tensor<4x1024xf32>
2026-02-21T08:42:43.7026134Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:42:43.7026407Z       %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({
2026-02-21T08:42:43.7026667Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:42:43.7026855Z         %30 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:42:43.7027032Z         tt.reduce.return %30 : f32
2026-02-21T08:42:43.7027215Z       }) : (tensor<4x1024xf32>) -> tensor<4xf32>
2026-02-21T08:42:43.7027436Z       %18 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:42:43.7027695Z       %19 = tt.addptr %18, %15 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:42:43.7027924Z       tt.store %19, %17 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:42:43.7028132Z       %c1_i32_0 = arith.constant 1 : i32
2026-02-21T08:42:43.7028332Z       %20 = arith.muli %c148_i32, %c1_i32_0 : i32
2026-02-21T08:42:43.7028519Z       %21 = arith.addi %arg5, %20 : i32
2026-02-21T08:42:43.7028693Z       %22 = arith.muli %21, %c4_i32 : i32
2026-02-21T08:42:43.7028911Z       %23 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:42:43.7029140Z       %24 = tt.splat %22 : i32 -> tensor<4xi32>
2026-02-21T08:42:43.7029330Z       %25 = arith.addi %24, %23 : tensor<4xi32>
2026-02-21T08:42:43.7029637Z       %26 = scf.for %arg6 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<4x1024xf32>)  : i32 {
2026-02-21T08:42:43.7030047Z         %30 = tt.descriptor_load %0[%22, %arg6] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:42:43.7030412Z         %31 = tt.descriptor_load %1[%22, %arg6] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:42:43.7030697Z         %32 = scf.if %arg3 -> (tensor<4x1024xf32>) {
2026-02-21T08:42:43.7031059Z           %34 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32>
2026-02-21T08:42:43.7031424Z           %35 = arith.subf %31, %30 : tensor<4x1024xf32>
2026-02-21T08:42:43.7031629Z           %36 = arith.mulf %34, %35 : tensor<4x1024xf32>
2026-02-21T08:42:43.7031827Z           %37 = arith.addf %36, %cst : tensor<4x1024xf32>
2026-02-21T08:42:43.7032075Z           scf.yield %37 : tensor<4x1024xf32>
2026-02-21T08:42:43.7032250Z         } else {
2026-02-21T08:42:43.7032406Z           %34 = tt.splat %arg4 : f32 -> tensor<4x1024xf32>
2026-02-21T08:42:43.7032631Z           %35 = arith.cmpf ogt, %31, %34 : tensor<4x1024xf32>
2026-02-21T08:42:43.7032848Z           %36 = arith.cmpf une, %31, %31 : tensor<4x1024xf32>
2026-02-21T08:42:43.7033064Z           %37 = arith.ori %35, %36 : tensor<4x1024xi1>
2026-02-21T08:42:43.7033301Z           %38 = arith.select %37, %31, %34 : tensor<4x1024xi1>, tensor<4x1024xf32>
2026-02-21T08:42:43.7033546Z           %39 = math.log %38 : tensor<4x1024xf32>
2026-02-21T08:42:43.7033748Z           %40 = arith.subf %39, %30 : tensor<4x1024xf32>
2026-02-21T08:42:43.7033946Z           %41 = arith.mulf %31, %40 : tensor<4x1024xf32>
2026-02-21T08:42:43.7034156Z           %42 = arith.addf %41, %cst : tensor<4x1024xf32>
2026-02-21T08:42:43.7034351Z           scf.yield %42 : tensor<4x1024xf32>
2026-02-21T08:42:43.7034526Z         }
2026-02-21T08:42:43.7034674Z         %33 = arith.addf %arg7, %32 : tensor<4x1024xf32>
2026-02-21T08:42:43.7034949Z         scf.yield %33 : tensor<4x1024xf32>
2026-02-21T08:42:43.7035190Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:42:43.7035454Z       %27 = "tt.reduce"(%26) <{axis = 1 : i32}> ({
2026-02-21T08:42:43.7035641Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:42:43.7035812Z         %30 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:42:43.7035997Z         tt.reduce.return %30 : f32
2026-02-21T08:42:43.7036174Z       }) : (tensor<4x1024xf32>) -> tensor<4xf32>
2026-02-21T08:42:43.7036397Z       %28 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:42:43.7036646Z       %29 = tt.addptr %28, %25 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:42:43.7036879Z       tt.store %29, %27 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:42:43.7037104Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:42:43.7037409Z     scf.for %arg5 = %10 to %c1024_i32 step %c148_i32  : i32 {
2026-02-21T08:42:43.7037631Z       %12 = arith.muli %arg5, %c4_i32 : i32
2026-02-21T08:42:43.7037850Z       %13 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32>
2026-02-21T08:42:43.7038093Z       %14 = tt.splat %12 : i32 -> tensor<4xi32>
2026-02-21T08:42:43.7038284Z       %15 = arith.addi %14, %13 : tensor<4xi32>
2026-02-21T08:42:43.7038607Z       %16 = scf.for %arg6 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<4x1024xf32>)  : i32 {
2026-02-21T08:42:43.7039029Z         %20 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:42:43.7039404Z         %21 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32>
2026-02-21T08:42:43.7039705Z         %22 = scf.if %arg3 -> (tensor<4x1024xf32>) {
2026-02-21T08:42:43.7040074Z           %24 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32>
2026-02-21T08:42:43.7040447Z           %25 = arith.subf %21, %20 : tensor<4x1024xf32>
2026-02-21T08:42:43.7040668Z           %26 = arith.mulf %24, %25 : tensor<4x1024xf32>
2026-02-21T08:42:43.7040881Z           %27 = arith.addf %26, %cst : tensor<4x1024xf32>
2026-02-21T08:42:43.7041095Z           scf.yield %27 : tensor<4x1024xf32>
2026-02-21T08:42:43.7041263Z         } else {
2026-02-21T08:42:43.7041428Z           %24 = tt.splat %arg4 : f32 -> tensor<4x1024xf32>
2026-02-21T08:42:43.7041646Z           %25 = arith.cmpf ogt, %21, %24 : tensor<4x1024xf32>
2026-02-21T08:42:43.7041912Z           %26 = arith.cmpf une, %21, %21 : tensor<4x1024xf32>
2026-02-21T08:42:43.7042119Z           %27 = arith.ori %25, %26 : tensor<4x1024xi1>
2026-02-21T08:42:43.7042360Z           %28 = arith.select %27, %21, %24 : tensor<4x1024xi1>, tensor<4x1024xf32>
2026-02-21T08:42:43.7042605Z           %29 = math.log %28 : tensor<4x1024xf32>
2026-02-21T08:42:43.7042799Z           %30 = arith.subf %29, %20 : tensor<4x1024xf32>
2026-02-21T08:42:43.7043008Z           %31 = arith.mulf %21, %30 : tensor<4x1024xf32>
2026-02-21T08:42:43.7043207Z           %32 = arith.addf %31, %cst : tensor<4x1024xf32>
2026-02-21T08:42:43.7043408Z           scf.yield %32 : tensor<4x1024xf32>
2026-02-21T08:42:43.7043571Z         }
2026-02-21T08:42:43.7043720Z         %23 = arith.addf %arg7, %22 : tensor<4x1024xf32>
2026-02-21T08:42:43.7043916Z         scf.yield %23 : tensor<4x1024xf32>
2026-02-21T08:42:43.7044156Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize}
2026-02-21T08:42:43.7044424Z       %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({
2026-02-21T08:42:43.7044606Z       ^bb0(%arg6: f32, %arg7: f32):
2026-02-21T08:42:43.7044780Z         %20 = arith.addf %arg6, %arg7 : f32
2026-02-21T08:42:43.7044955Z         tt.reduce.return %20 : f32
2026-02-21T08:42:43.7045139Z       }) : (tensor<4x1024xf32>) -> tensor<4xf32>
2026-02-21T08:42:43.7045358Z       %18 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>>
2026-02-21T08:42:43.7045608Z       %19 = tt.addptr %18, %15 : tensor<4x!tt.ptr<f32>>, tensor<4xi32>
2026-02-21T08:42:43.7045906Z       tt.store %19, %17 : tensor<4x!tt.ptr<f32>>
2026-02-21T08:42:43.7046118Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:42:43.7046317Z     tt.return
2026-02-21T08:42:43.7046436Z   }
2026-02-21T08:42:43.7046553Z }
2026-02-21T08:42:43.7046619Z 
2026-02-21T08:42:43.7046667Z {-#
2026-02-21T08:42:43.7046796Z   external_resources: {
2026-02-21T08:42:43.7046947Z     mlir_reproducer: {
2026-02-21T08:42:43.7051204Z       pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
2026-02-21T08:42:43.7055618Z       disable_threading: false,
2026-02-21T08:42:43.7055790Z       verify_each: true
2026-02-21T08:42:43.7055933Z     }
2026-02-21T08:42:43.7056060Z   }
2026-02-21T08:42:43.7056173Z #-}
2026-02-21T08:42:43.7056606Z /tmp/torchinductor_root/g5/cg5nwozgggubgr466sqq55w5riqw7l6mhmslcmevptxxg2gvpnsu.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:42:43.7057844Z /tmp/torchinductor_root/g5/cg5nwozgggubgr466sqq55w5riqw7l6mhmslcmevptxxg2gvpnsu.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:42:43.7058857Z [293s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:42:43.7059976Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], maxnreg=64, num_sm_multiplier=1, num_stages=7, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[1, 3], range_unroll_factors=[2, 1], range_warp_specializes=[None, True]), static_shapes=True)
2026-02-21T08:42:43.7060993Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:42:43.7061254Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:42:44.9776074Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 32/32 16.0 configs/s
2026-02-21T08:42:52.1028582Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━ 279/279 38.6 configs/s
2026-02-21T08:42:52.2820885Z [302s] Generation 9 complete: 
2026-02-21T08:42:52.2826009Z error=1
2026-02-21T08:42:52.2831065Z ok=33
2026-02-21T08:42:52.2832449Z min=0.8337
2026-02-21T08:42:52.2832610Z mid=0.8899
2026-02-21T08:42:52.2832727Z max=8.5688
2026-02-21T08:42:52.2832869Z best={'block_sizes': [2048, 2],
2026-02-21T08:42:52.2833110Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:42:52.2833369Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:42:52.2833547Z  'num_stages': 7,
2026-02-21T08:42:52.2833687Z  'num_warps': 32,
2026-02-21T08:42:52.2833824Z  'pid_type': 'flat',
2026-02-21T08:42:52.2833974Z  'range_flattens': [None, False],
2026-02-21T08:42:52.2834148Z  'range_multi_buffers': [None, False],
2026-02-21T08:42:52.2834321Z  'range_num_stages': [0, 0],
2026-02-21T08:42:52.2834485Z  'range_unroll_factors': [0, 0],
2026-02-21T08:42:52.2834655Z  'range_warp_specializes': [None, False]}
2026-02-21T08:42:52.2842979Z [302s] Fitting surrogate: 675 points, 675 targets
2026-02-21T08:42:52.8175470Z [302s] Generation 10 starting: 22 neighbors, 2 active search path(s)
2026-02-21T08:42:54.2309654Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 26.7 configs/s
2026-02-21T08:42:55.5637298Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 22/22 17.1 configs/s
2026-02-21T08:43:00.3182865Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━ 279/279 57.4 configs/s
2026-02-21T08:43:00.4798522Z [310s] Generation 10 complete: 
2026-02-21T08:43:00.4802870Z error=1
2026-02-21T08:43:00.4804278Z ok=23
2026-02-21T08:43:00.4804441Z min=0.8344
2026-02-21T08:43:00.4804574Z mid=0.8929
2026-02-21T08:43:00.4804702Z max=3.5113
2026-02-21T08:43:00.4804843Z best={'block_sizes': [2048, 2],
2026-02-21T08:43:00.4805102Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:43:00.4805357Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:43:00.4805548Z  'num_stages': 7,
2026-02-21T08:43:00.4805718Z  'num_warps': 32,
2026-02-21T08:43:00.4805868Z  'pid_type': 'flat',
2026-02-21T08:43:00.4806026Z  'range_flattens': [None, False],
2026-02-21T08:43:00.4806199Z  'range_multi_buffers': [None, False],
2026-02-21T08:43:00.4806380Z  'range_num_stages': [0, 0],
2026-02-21T08:43:00.4806539Z  'range_unroll_factors': [0, 0],
2026-02-21T08:43:00.4806718Z  'range_warp_specializes': [None, False]}
2026-02-21T08:43:00.4818307Z [310s] Fitting surrogate: 699 points, 699 targets
2026-02-21T08:43:00.8480201Z [310s] Generation 11 starting: 8 neighbors, 1 active search path(s)
2026-02-21T08:43:01.4792740Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8/8 54.7 configs/s
2026-02-21T08:43:01.9829196Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 8/8 17.4 configs/s
2026-02-21T08:43:03.9848713Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━ 279/279 131.1 configs/s
2026-02-21T08:43:04.1325024Z [313s] Generation 11 complete: 
2026-02-21T08:43:04.1329396Z ok=9
2026-02-21T08:43:04.1331097Z min=0.8335
2026-02-21T08:43:04.1331260Z mid=0.8551
2026-02-21T08:43:04.1331386Z max=1.2503
2026-02-21T08:43:04.1331515Z best={'block_sizes': [2048, 2],
2026-02-21T08:43:04.1331765Z  'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'],
2026-02-21T08:43:04.1332277Z  'load_eviction_policies': ['last', 'last'],
2026-02-21T08:43:04.1332458Z  'num_stages': 7,
2026-02-21T08:43:04.1332602Z  'num_warps': 32,
2026-02-21T08:43:04.1332735Z  'pid_type': 'flat',
2026-02-21T08:43:04.1332897Z  'range_flattens': [None, False],
2026-02-21T08:43:04.1333070Z  'range_multi_buffers': [None, False],
2026-02-21T08:43:04.1333250Z  'range_num_stages': [0, 0],
2026-02-21T08:43:04.1333408Z  'range_unroll_factors': [0, 0],
2026-02-21T08:43:04.1333587Z  'range_warp_specializes': [None, False]}
2026-02-21T08:43:04.1347478Z [313s] Fitting surrogate: 708 points, 708 targets
2026-02-21T08:43:04.4168295Z [314s] Autotuning complete in 314.3s after searching 668 configs.
2026-02-21T08:43:04.4168704Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:43:04.4169974Z     @helion.kernel(config=helion.Config(block_sizes=[2048, 2], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], num_stages=7, num_warps=32, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, False]), static_shapes=True)
2026-02-21T08:43:04.4170784Z 
2026-02-21T08:43:04.4171025Z [314s] Code of selected kernel: /tmp/torchinductor_root/xt/cxtwlkxgiqf2akxmt2wa6geuqdah3nhx4n5v3r2xiooicgcsodh2.py
2026-02-21T08:43:04.4355732Z from __future__ import annotations
2026-02-21T08:43:04.4355965Z 
2026-02-21T08:43:04.4356125Z import torch
2026-02-21T08:43:04.4356260Z import triton
2026-02-21T08:43:04.4356465Z import triton.language as tl
2026-02-21T08:43:04.4356705Z from torch._inductor.runtime import triton_helpers
2026-02-21T08:43:04.4357259Z from torch._inductor.runtime.triton_helpers import math as tl_math
2026-02-21T08:43:04.4357594Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T08:43:04.4357871Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T08:43:04.4358051Z 
2026-02-21T08:43:04.4358121Z _BLOCK_SIZE_1 = tl.constexpr(2)
2026-02-21T08:43:04.4358292Z _BLOCK_SIZE_0 = tl.constexpr(2048)
2026-02-21T08:43:04.4358411Z 
2026-02-21T08:43:04.4358465Z @triton.jit
2026-02-21T08:43:04.4358645Z def _helion_kl_div_forward(y_pred, y_true, loss, log_target, eps):
2026-02-21T08:43:04.4358960Z     # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m):
2026-02-21T08:43:04.4359210Z     pid_0 = tl.program_id(0)
2026-02-21T08:43:04.4359384Z     offset_1 = pid_0 * _BLOCK_SIZE_1
2026-02-21T08:43:04.4359618Z     indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32)
2026-02-21T08:43:04.4359919Z     # src[kl_div.py:90]: loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32)
2026-02-21T08:43:04.4360234Z     loss_sum = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T08:43:04.4360511Z     # src[kl_div.py:92]: for tile_v in hl.tile(V, block_size=block_size_n):
2026-02-21T08:43:04.4360816Z     # src[kl_div.py:93]:     kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32)
2026-02-21T08:43:04.4361090Z     # src[kl_div.py:92-112]: ...
2026-02-21T08:43:04.4361415Z     for offset_0 in tl.range(0, 131072, _BLOCK_SIZE_0, warp_specialize=False, disallow_acc_multi_buffer=True, flatten=False):
2026-02-21T08:43:04.4361811Z         indices_0 = offset_0 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32)
2026-02-21T08:43:04.4362128Z         loss_sum_copy = loss_sum
2026-02-21T08:43:04.4362305Z         loss_sum_copy_0 = loss_sum_copy
2026-02-21T08:43:04.4362602Z         # src[kl_div.py:93]: kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32)
2026-02-21T08:43:04.4362906Z         kl_loss = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32)
2026-02-21T08:43:04.4363177Z         # src[kl_div.py:95]: y_pred_val = y_pred[tile_bt, tile_v]
2026-02-21T08:43:04.4363540Z         y_pred_val = tl.load(y_pred + (indices_1[:, None] * 131072 + indices_0[None, :] * 1), None, eviction_policy='evict_last')
2026-02-21T08:43:04.4363888Z         # src[kl_div.py:96]: y_true_val = y_true[tile_bt, tile_v]
2026-02-21T08:43:04.4364231Z         y_true_val = tl.load(y_true + (indices_1[:, None] * 131072 + indices_0[None, :] * 1), None, eviction_policy='evict_last')
2026-02-21T08:43:04.4364552Z         # src[kl_div.py:98]: if log_target:
2026-02-21T08:43:04.4364816Z         # src[kl_div.py:99]:     # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space
2026-02-21T08:43:04.4365116Z         # src[kl_div.py:100]:     prob_true = torch.exp(y_true_val)
2026-02-21T08:43:04.4365324Z         # src[kl_div.py:98-106]: ...
2026-02-21T08:43:04.4365514Z         if log_target:
2026-02-21T08:43:04.4365671Z             y_true_val_copy = y_true_val
2026-02-21T08:43:04.4365862Z             y_pred_val_copy = y_pred_val
2026-02-21T08:43:04.4366040Z             kl_loss_copy = kl_loss
2026-02-21T08:43:04.4366337Z             y_true_val_copy_0 = y_true_val_copy
2026-02-21T08:43:04.4366535Z             y_pred_val_copy_0 = y_pred_val_copy
2026-02-21T08:43:04.4366731Z             kl_loss_copy_0 = kl_loss_copy
2026-02-21T08:43:04.4366964Z             # src[kl_div.py:100]: prob_true = torch.exp(y_true_val)
2026-02-21T08:43:04.4367195Z             v_0 = libdevice.exp(y_true_val_copy_0)
2026-02-21T08:43:04.4367450Z             # src[kl_div.py:101]: kl_loss += prob_true * (y_true_val - y_pred_val)
2026-02-21T08:43:04.4367713Z             v_1 = y_true_val_copy_0 - y_pred_val_copy_0
2026-02-21T08:43:04.4367912Z             v_2 = v_0 * v_1
2026-02-21T08:43:04.4368078Z             kl_loss = kl_loss_copy_0 + v_2
2026-02-21T08:43:04.4368273Z         # src[kl_div.py:98]: if log_target:
2026-02-21T08:43:04.4368539Z         # src[kl_div.py:99]:     # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space
2026-02-21T08:43:04.4368909Z         # src[kl_div.py:100]:     prob_true = torch.exp(y_true_val)
2026-02-21T08:43:04.4369137Z         # src[kl_div.py:98-106]: ...
2026-02-21T08:43:04.4369311Z         _not = not log_target
2026-02-21T08:43:04.4369477Z         if _not:
2026-02-21T08:43:04.4369625Z             y_true_val_copy_1 = y_true_val
2026-02-21T08:43:04.4369811Z             y_pred_val_copy_1 = y_pred_val
2026-02-21T08:43:04.4369986Z             kl_loss_copy_1 = kl_loss
2026-02-21T08:43:04.4370177Z             y_true_val_copy_1_0 = y_true_val_copy_1
2026-02-21T08:43:04.4370393Z             y_pred_val_copy_1_0 = y_pred_val_copy_1
2026-02-21T08:43:04.4370597Z             kl_loss_copy_1_0 = kl_loss_copy_1
2026-02-21T08:43:04.4370864Z             # src[kl_div.py:105]: log_true = torch.log(torch.clamp(y_true_val, min=eps))
2026-02-21T08:43:04.4371161Z             v_4 = triton_helpers.maximum(y_true_val_copy_1_0, eps)
2026-02-21T08:43:04.4371384Z             v_5 = tl_math.log(v_4)
2026-02-21T08:43:04.4371608Z             # src[kl_div.py:106]: kl_loss += y_true_val * (log_true - y_pred_val)
2026-02-21T08:43:04.4371918Z             v_6 = v_5 - y_pred_val_copy_1_0
2026-02-21T08:43:04.4372121Z             v_7 = y_true_val_copy_1_0 * v_6
2026-02-21T08:43:04.4372313Z             kl_loss = kl_loss_copy_1_0 + v_7
2026-02-21T08:43:04.4372524Z         # src[kl_div.py:112]: loss_sum += kl_loss
2026-02-21T08:43:04.4372723Z         loss_sum = loss_sum_copy_0 + kl_loss
2026-02-21T08:43:04.4372953Z     # src[kl_div.py:115]: loss[tile_bt] = loss_sum.sum(dim=-1)
2026-02-21T08:43:04.4373191Z     sum_1 = tl.cast(tl.sum(loss_sum, 1), tl.float32)
2026-02-21T08:43:04.4373417Z     tl.store(loss + indices_1 * 1, sum_1, None)
2026-02-21T08:43:04.4373551Z 
2026-02-21T08:43:04.4373854Z def kl_div_forward(y_pred: Tensor, y_true: Tensor, log_target: bool=False, reduction: str='batchmean', eps: float=1e-10, *, _launcher=_default_launcher):
2026-02-21T08:43:04.4374262Z     """
2026-02-21T08:43:04.4374402Z     Compute KL Divergence loss.
2026-02-21T08:43:04.4374508Z 
2026-02-21T08:43:04.4374561Z     Args:
2026-02-21T08:43:04.4374739Z         y_pred: Input predictions in log-space, shape (BT, V)
2026-02-21T08:43:04.4375021Z         y_true: Target values (probabilities or log-probabilities), shape (BT, V)
2026-02-21T08:43:04.4375351Z         log_target: If True, y_true is in log-space; if False, y_true is probabilities
2026-02-21T08:43:04.4375662Z         reduction: Reduction mode ('none', 'sum', 'mean', 'batchmean')
2026-02-21T08:43:04.4375901Z         eps: Small value to avoid numerical issues
2026-02-21T08:43:04.4376039Z 
2026-02-21T08:43:04.4376092Z     Returns:
2026-02-21T08:43:04.4376225Z         loss: KL divergence loss
2026-02-21T08:43:04.4376380Z     """
2026-02-21T08:43:04.4376515Z     # src[kl_div.py:74]: BT, V = y_pred.shape
2026-02-21T08:43:04.4376698Z     BT, V = y_pred.shape
2026-02-21T08:43:04.4376890Z     # src[kl_div.py:75]: assert y_true.shape == y_pred.shape, (
2026-02-21T08:43:04.4377156Z     # src[kl_div.py:76]:     f"Shape mismatch: {y_true.shape} != {y_pred.shape}"
2026-02-21T08:43:04.4377394Z     # src[kl_div.py:77]: )
2026-02-21T08:43:04.4377709Z     assert y_true.shape == y_pred.shape, f'Shape mismatch: {y_true.shape} != {y_pred.shape}'
2026-02-21T08:43:04.4377991Z     # src[kl_div.py:80]: if reduction == "none":
2026-02-21T08:43:04.4378206Z     # src[kl_div.py:81]:     loss = torch.zeros_like(y_pred)
2026-02-21T08:43:04.4378411Z     # src[kl_div.py:82]: else:
2026-02-21T08:43:04.4378567Z     # src[kl_div.py:80-83]: ...
2026-02-21T08:43:04.4378728Z     if reduction == 'none':
2026-02-21T08:43:04.4378913Z         # src[kl_div.py:81]: loss = torch.zeros_like(y_pred)
2026-02-21T08:43:04.4379114Z         loss = torch.zeros_like(y_pred)
2026-02-21T08:43:04.4379283Z     else:
2026-02-21T08:43:04.4379498Z         # src[kl_div.py:83]: loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device)
2026-02-21T08:43:04.4379824Z         loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device)
2026-02-21T08:43:04.4380168Z     # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m):
2026-02-21T08:43:04.4380405Z     _BLOCK_SIZE_1 = 2
2026-02-21T08:43:04.4380605Z     # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m):
2026-02-21T08:43:04.4380907Z     # src[kl_div.py:90]:     loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32)
2026-02-21T08:43:04.4381167Z     # src[kl_div.py:89-115]: ...
2026-02-21T08:43:04.4381512Z     _launcher(_helion_kl_div_forward, (triton.cdiv(4096, _BLOCK_SIZE_1),), y_pred, y_true, loss, log_target, eps, num_warps=32, num_stages=7)
2026-02-21T08:43:04.4381925Z     # src[kl_div.py:118]: if reduction == "batchmean":
2026-02-21T08:43:04.4382156Z     # src[kl_div.py:119]:     final_loss = torch.sum(loss) / BT
2026-02-21T08:43:04.4382391Z     # src[kl_div.py:120]: elif reduction == "sum":
2026-02-21T08:43:04.4382590Z     # src[kl_div.py:118-125]: ...
2026-02-21T08:43:04.4382758Z     if reduction == 'batchmean':
2026-02-21T08:43:04.4382960Z         # src[kl_div.py:119]: final_loss = torch.sum(loss) / BT
2026-02-21T08:43:04.4383171Z         final_loss = torch.sum(loss) / BT
2026-02-21T08:43:04.4383352Z     elif reduction == 'sum':
2026-02-21T08:43:04.4383538Z         # src[kl_div.py:121]: final_loss = torch.sum(loss, dim=0)
2026-02-21T08:43:04.4383756Z         final_loss = torch.sum(loss, dim=0)
2026-02-21T08:43:04.4383938Z     elif reduction == 'mean':
2026-02-21T08:43:04.4384132Z         # src[kl_div.py:123]: final_loss = torch.sum(loss) / (BT * V)
2026-02-21T08:43:04.4384359Z         final_loss = torch.sum(loss) / (BT * V)
2026-02-21T08:43:04.4384528Z     else:
2026-02-21T08:43:04.4384670Z         # src[kl_div.py:125]: final_loss = loss
2026-02-21T08:43:04.4384845Z         final_loss = loss
2026-02-21T08:43:04.4385010Z     # src[kl_div.py:127]: return final_loss
2026-02-21T08:43:04.4385179Z     return final_loss
2026-02-21T08:43:05.7162929Z WARNING:tritonbench.utils.triton_op:Completed input ID 5:
2026-02-21T08:43:05.7164747Z (B, T, V)
2026-02-21T08:43:05.7164904Z ----------------
2026-02-21T08:43:05.7165068Z (8, 512, 131072)
2026-02-21T08:43:05.7165164Z 
2026-02-21T08:43:05.7165485Z 100%|██████████| 6/6 [21:38<00:00, 250.82s/it]
2026-02-21T08:43:05.7170044Z 100%|██████████| 6/6 [21:38<00:00, 216.35s/it]
2026-02-21T08:43:05.7179192Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmpcy2mmull.csv
2026-02-21T08:43:06.3849077Z        (B, T, V)    liger_kl_div-speedup    liger_kl_div-accuracy    torch_compile_kl_div-speedup    torch_compile_kl_div-accuracy    helion_kl_div_tritonbench-speedup    helion_kl_div_tritonbench-accuracy
2026-02-21T08:43:06.3853430Z ----------------  ----------------------  -----------------------  ------------------------------  -------------------------------  -----------------------------------  ------------------------------------
2026-02-21T08:43:06.3855460Z   (8, 512, 4096)                 3.11483                        1                         3.03404                                1                              3.36283                                     1
2026-02-21T08:43:06.3855969Z   (8, 512, 8192)                 3.50195                        1                         3.18197                                1                              3.94311                                     1
2026-02-21T08:43:06.3856772Z  (8, 512, 16384)                 4.03663                        1                         3.25016                                1                              4.16629                                     1
2026-02-21T08:43:06.3857223Z  (8, 512, 32768)                 4.0457                         1                         3.15498                                1                              3.98916                                     1
2026-02-21T08:43:06.3857666Z  (8, 512, 65536)                 3.99696                        1                         3.44325                                1                              3.88139                                     1
2026-02-21T08:43:06.3858230Z (8, 512, 131072)                 3.74916                        1                         3.49813                                1                              3.61356                                     1
2026-02-21T08:43:06.3858689Z          average                 3.74087                        1                         3.26042                                1                              3.82605                                     1
2026-02-21T08:43:08.4984482Z ✅ Completed benchmark for kernel: kl_div
2026-02-21T08:43:08.4994259Z [
2026-02-21T08:43:08.4994499Z   {
2026-02-21T08:43:08.4994663Z     "benchmark": {
2026-02-21T08:43:08.4994841Z       "name": "Helion Benchmark",
2026-02-21T08:43:08.4995100Z       "extra_info": {
2026-02-21T08:43:08.4995255Z         "device": "NVIDIA B200"
2026-02-21T08:43:08.4995476Z       }
2026-02-21T08:43:08.4995604Z     },
2026-02-21T08:43:08.4995764Z     "model": {
2026-02-21T08:43:08.4995917Z       "name": "kl_div"
2026-02-21T08:43:08.4996055Z     },
2026-02-21T08:43:08.4996231Z     "metric": {
2026-02-21T08:43:08.4996408Z       "name": "triton_speedup",
2026-02-21T08:43:08.4996650Z       "benchmark_values": [
2026-02-21T08:43:08.4996812Z         3.1148345231848293,
2026-02-21T08:43:08.4996952Z         3.5019458271068076,
2026-02-21T08:43:08.4997100Z         4.036633414300421,
2026-02-21T08:43:08.4997243Z         4.045695489846165,
2026-02-21T08:43:08.4997392Z         3.99695912395291,
2026-02-21T08:43:08.4997532Z         3.7491585331985173
2026-02-21T08:43:08.4997672Z       ]
2026-02-21T08:43:08.4997783Z     },
2026-02-21T08:43:08.4997904Z     "shape": [
2026-02-21T08:43:08.4998029Z       "(8, 512, 4096)",
2026-02-21T08:43:08.4998170Z       "(8, 512, 8192)",
2026-02-21T08:43:08.4998309Z       "(8, 512, 16384)",
2026-02-21T08:43:08.4998446Z       "(8, 512, 32768)",
2026-02-21T08:43:08.4998585Z       "(8, 512, 65536)",
2026-02-21T08:43:08.4998716Z       "(8, 512, 131072)"
2026-02-21T08:43:08.4998851Z     ]
2026-02-21T08:43:08.4998962Z   },
2026-02-21T08:43:08.4999079Z   {
2026-02-21T08:43:08.4999194Z     "benchmark": {
2026-02-21T08:43:08.4999347Z       "name": "Helion Benchmark",
2026-02-21T08:43:08.4999511Z       "extra_info": {
2026-02-21T08:43:08.4999659Z         "device": "NVIDIA B200"
2026-02-21T08:43:08.4999806Z       }
2026-02-21T08:43:08.4999925Z     },
2026-02-21T08:43:08.5000047Z     "model": {
2026-02-21T08:43:08.5000172Z       "name": "kl_div"
2026-02-21T08:43:08.5000312Z     },
2026-02-21T08:43:08.5000424Z     "metric": {
2026-02-21T08:43:08.5000567Z       "name": "triton_accuracy",
2026-02-21T08:43:08.5000728Z       "benchmark_values": [
2026-02-21T08:43:08.5000879Z         1.0,
2026-02-21T08:43:08.5000999Z         1.0,
2026-02-21T08:43:08.5001126Z         1.0,
2026-02-21T08:43:08.5001242Z         1.0,
2026-02-21T08:43:08.5001366Z         1.0,
2026-02-21T08:43:08.5001481Z         1.0
2026-02-21T08:43:08.5001602Z       ]
2026-02-21T08:43:08.5001716Z     },
2026-02-21T08:43:08.5001827Z     "shape": [
2026-02-21T08:43:08.5002195Z       "(8, 512, 4096)",
2026-02-21T08:43:08.5002334Z       "(8, 512, 8192)",
2026-02-21T08:43:08.5002485Z       "(8, 512, 16384)",
2026-02-21T08:43:08.5003095Z       "(8, 512, 32768)",
2026-02-21T08:43:08.5003236Z       "(8, 512, 65536)",
2026-02-21T08:43:08.5003367Z       "(8, 512, 131072)"
2026-02-21T08:43:08.5003530Z     ]
2026-02-21T08:43:08.5003661Z   },
2026-02-21T08:43:08.5003779Z   {
2026-02-21T08:43:08.5003912Z     "benchmark": {
2026-02-21T08:43:08.5004059Z       "name": "Helion Benchmark",
2026-02-21T08:43:08.5004230Z       "extra_info": {
2026-02-21T08:43:08.5004368Z         "device": "NVIDIA B200"
2026-02-21T08:43:08.5004520Z       }
2026-02-21T08:43:08.5004628Z     },
2026-02-21T08:43:08.5004749Z     "model": {
2026-02-21T08:43:08.5004872Z       "name": "kl_div"
2026-02-21T08:43:08.5005010Z     },
2026-02-21T08:43:08.5005122Z     "metric": {
2026-02-21T08:43:08.5005270Z       "name": "torch_compile_speedup",
2026-02-21T08:43:08.5005451Z       "benchmark_values": [
2026-02-21T08:43:08.5005597Z         3.034036673978964,
2026-02-21T08:43:08.5005743Z         3.181974407448598,
2026-02-21T08:43:08.5005978Z         3.2501613834972476,
2026-02-21T08:43:08.5006134Z         3.15497766353392,
2026-02-21T08:43:08.5006272Z         3.443246731995977,
2026-02-21T08:43:08.5006414Z         3.4981284879927594
2026-02-21T08:43:08.5006545Z       ]
2026-02-21T08:43:08.5006662Z     },
2026-02-21T08:43:08.5006772Z     "shape": [
2026-02-21T08:43:08.5006902Z       "(8, 512, 4096)",
2026-02-21T08:43:08.5007043Z       "(8, 512, 8192)",
2026-02-21T08:43:08.5007174Z       "(8, 512, 16384)",
2026-02-21T08:43:08.5007316Z       "(8, 512, 32768)",
2026-02-21T08:43:08.5007449Z       "(8, 512, 65536)",
2026-02-21T08:43:08.5007588Z       "(8, 512, 131072)"
2026-02-21T08:43:08.5007714Z     ]
2026-02-21T08:43:08.5007832Z   },
2026-02-21T08:43:08.5007938Z   {
2026-02-21T08:43:08.5008058Z     "benchmark": {
2026-02-21T08:43:08.5008194Z       "name": "Helion Benchmark",
2026-02-21T08:43:08.5008356Z       "extra_info": {
2026-02-21T08:43:08.5008494Z         "device": "NVIDIA B200"
2026-02-21T08:43:08.5008643Z       }
2026-02-21T08:43:08.5008791Z     },
2026-02-21T08:43:08.5008904Z     "model": {
2026-02-21T08:43:08.5009034Z       "name": "kl_div"
2026-02-21T08:43:08.5009162Z     },
2026-02-21T08:43:08.5009282Z     "metric": {
2026-02-21T08:43:08.5009420Z       "name": "torch_compile_accuracy",
2026-02-21T08:43:08.5009598Z       "benchmark_values": [
2026-02-21T08:43:08.5009748Z         1.0,
2026-02-21T08:43:08.5009865Z         1.0,
2026-02-21T08:43:08.5009992Z         1.0,
2026-02-21T08:43:08.5010108Z         1.0,
2026-02-21T08:43:08.5010232Z         1.0,
2026-02-21T08:43:08.5010350Z         1.0
2026-02-21T08:43:08.5010472Z       ]
2026-02-21T08:43:08.5010580Z     },
2026-02-21T08:43:08.5010701Z     "shape": [
2026-02-21T08:43:08.5010823Z       "(8, 512, 4096)",
2026-02-21T08:43:08.5010962Z       "(8, 512, 8192)",
2026-02-21T08:43:08.5011091Z       "(8, 512, 16384)",
2026-02-21T08:43:08.5011230Z       "(8, 512, 32768)",
2026-02-21T08:43:08.5011368Z       "(8, 512, 65536)",
2026-02-21T08:43:08.5011501Z       "(8, 512, 131072)"
2026-02-21T08:43:08.5011639Z     ]
2026-02-21T08:43:08.5011750Z   },
2026-02-21T08:43:08.5011911Z   {
2026-02-21T08:43:08.5012025Z     "benchmark": {
2026-02-21T08:43:08.5012172Z       "name": "Helion Benchmark",
2026-02-21T08:43:08.5012329Z       "extra_info": {
2026-02-21T08:43:08.5012480Z         "device": "NVIDIA B200"
2026-02-21T08:43:08.5012626Z       }
2026-02-21T08:43:08.5012746Z     },
2026-02-21T08:43:08.5012859Z     "model": {
2026-02-21T08:43:08.5012993Z       "name": "kl_div"
2026-02-21T08:43:08.5013131Z     },
2026-02-21T08:43:08.5013259Z     "metric": {
2026-02-21T08:43:08.5013409Z       "name": "helion_speedup",
2026-02-21T08:43:08.5013571Z       "benchmark_values": [
2026-02-21T08:43:08.5013732Z         3.3628256221234514,
2026-02-21T08:43:08.5013875Z         3.9431083058633747,
2026-02-21T08:43:08.5014024Z         4.1662855246496715,
2026-02-21T08:43:08.5014166Z         3.9891636594329802,
2026-02-21T08:43:08.5014316Z         3.881386565950531,
2026-02-21T08:43:08.5014459Z         3.6135570456389985
2026-02-21T08:43:08.5014607Z       ]
2026-02-21T08:43:08.5014820Z     },
2026-02-21T08:43:08.5014943Z     "shape": [
2026-02-21T08:43:08.5015086Z       "(8, 512, 4096)",
2026-02-21T08:43:08.5015232Z       "(8, 512, 8192)",
2026-02-21T08:43:08.5015385Z       "(8, 512, 16384)",
2026-02-21T08:43:08.5015530Z       "(8, 512, 32768)",
2026-02-21T08:43:08.5015684Z       "(8, 512, 65536)",
2026-02-21T08:43:08.5015829Z       "(8, 512, 131072)"
2026-02-21T08:43:08.5015978Z     ]
2026-02-21T08:43:08.5016100Z   },
2026-02-21T08:43:08.5016229Z   {
2026-02-21T08:43:08.5016357Z     "benchmark": {
2026-02-21T08:43:08.5016520Z       "name": "Helion Benchmark",
2026-02-21T08:43:08.5016699Z       "extra_info": {
2026-02-21T08:43:08.5016851Z         "device": "NVIDIA B200"
2026-02-21T08:43:08.5017016Z       }
2026-02-21T08:43:08.5017138Z     },
2026-02-21T08:43:08.5017274Z     "model": {
2026-02-21T08:43:08.5017413Z       "name": "kl_div"
2026-02-21T08:43:08.5017566Z     },
2026-02-21T08:43:08.5017692Z     "metric": {
2026-02-21T08:43:08.5017922Z       "name": "helion_accuracy",
2026-02-21T08:43:08.5018097Z       "benchmark_values": [
2026-02-21T08:43:08.5018258Z         1.0,
2026-02-21T08:43:08.5018386Z         1.0,
2026-02-21T08:43:08.5018520Z         1.0,
2026-02-21T08:43:08.5018652Z         1.0,
2026-02-21T08:43:08.5018779Z         1.0,
2026-02-21T08:43:08.5018913Z         1.0
2026-02-21T08:43:08.5019038Z       ]
2026-02-21T08:43:08.5019169Z     },
2026-02-21T08:43:08.5019293Z     "shape": [
2026-02-21T08:43:08.5019435Z       "(8, 512, 4096)",
2026-02-21T08:43:08.5019581Z       "(8, 512, 8192)",
2026-02-21T08:43:08.5019734Z       "(8, 512, 16384)",
2026-02-21T08:43:08.5019879Z       "(8, 512, 32768)",
2026-02-21T08:43:08.5020031Z       "(8, 512, 65536)",
2026-02-21T08:43:08.5020174Z       "(8, 512, 131072)"
2026-02-21T08:43:08.5020324Z     ]
2026-02-21T08:43:08.5020449Z   }
2026-02-21T08:43:08.5036264Z ]
2026-02-21T08:43:08.5089991Z ##[group]Run pytorch/test-infra/.github/actions/gather-benchmark-metadata@main
2026-02-21T08:43:08.5090266Z with:
2026-02-21T08:43:08.5090634Z   github-token: ***
2026-02-21T08:43:08.5090794Z   venv: .venv/bin/activate
2026-02-21T08:43:08.5090951Z   schema-version: v3
2026-02-21T08:43:08.5091099Z env:
2026-02-21T08:43:08.5091234Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:43:08.5091447Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:08.5091714Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:43:08.5092017Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:08.5092234Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:08.5092446Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:08.5092794Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:43:08.5093221Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:43:08.5093446Z ##[endgroup]
2026-02-21T08:43:08.5149819Z ##[group]Run set -eux
2026-02-21T08:43:08.5150004Z set -eux
2026-02-21T08:43:08.5150160Z 
2026-02-21T08:43:08.5150318Z if [[ -z "${GITHUB_TOKEN}" ]]; then
2026-02-21T08:43:08.5150525Z   echo "Missing github-token input"
2026-02-21T08:43:08.5150714Z   exit 1
2026-02-21T08:43:08.5150843Z fi
2026-02-21T08:43:08.5151779Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T08:43:08.5152060Z env:
2026-02-21T08:43:08.5152205Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:43:08.5152411Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:08.5152671Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:43:08.5152925Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:08.5153141Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:08.5153361Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:08.5153735Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:43:08.5154248Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:43:08.5154614Z   GITHUB_TOKEN: ***
2026-02-21T08:43:08.5154756Z ##[endgroup]
2026-02-21T08:43:08.5713459Z + [[ -z *** ]]
2026-02-21T08:43:08.5777369Z ##[group]Run pytorch/test-infra/.github/actions/get-workflow-job-id@main
2026-02-21T08:43:08.5777635Z with:
2026-02-21T08:43:08.5777877Z   github-token: ***
2026-02-21T08:43:08.5778027Z env:
2026-02-21T08:43:08.5778159Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:43:08.5778368Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:08.5778618Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:43:08.5778855Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:08.5779075Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:08.5779302Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:08.5779653Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:43:08.5780029Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:43:08.5780246Z ##[endgroup]
2026-02-21T08:43:08.5789603Z ##[group]Run set -eux
2026-02-21T08:43:08.5789784Z set -eux
2026-02-21T08:43:08.5789939Z 
2026-02-21T08:43:08.5790245Z python3 "${GITHUB_ACTION_PATH}/../../scripts/get_workflow_job_id.py" "${GITHUB_RUN_ID}" "${RUNNER_NAME}"
2026-02-21T08:43:08.5790695Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T08:43:08.5790898Z env:
2026-02-21T08:43:08.5791045Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:43:08.5791252Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:08.5791511Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:43:08.5791971Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:08.5792198Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:08.5792425Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:08.5792801Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:43:08.5793210Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:43:08.5793545Z   GITHUB_TOKEN: ***
2026-02-21T08:43:08.5793704Z ##[endgroup]
2026-02-21T08:43:08.6340797Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/get-workflow-job-id/../../scripts/get_workflow_job_id.py 22253280836 dgxb200-04-1007
2026-02-21T08:43:09.9433141Z setting job-id=64380329773
2026-02-21T08:43:09.9435304Z setting job-name=run-b200 (kl_div) / benchmark-cu130-kl_div-py3.12-b200
2026-02-21T08:43:09.9579657Z ##[group]Run set -eux
2026-02-21T08:43:09.9579831Z set -eux
2026-02-21T08:43:09.9579959Z 
2026-02-21T08:43:09.9580129Z if [[ -n ".venv/bin/activate" ]]; then
2026-02-21T08:43:09.9580331Z   source ".venv/bin/activate"
2026-02-21T08:43:09.9580494Z fi
2026-02-21T08:43:09.9580621Z 
2026-02-21T08:43:09.9580840Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_metadata.py" \
2026-02-21T08:43:09.9581141Z   --schema-version "${SCHEMA_VERSION}" \
2026-02-21T08:43:09.9581337Z   --repo "${REPO}" \
2026-02-21T08:43:09.9581514Z   --head-branch "${HEAD_BRANCH}" \
2026-02-21T08:43:09.9581706Z   --head-sha "${HEAD_SHA}" \
2026-02-21T08:43:09.9581965Z   --workflow-id "${WORKFLOW_RUN_ID}" \
2026-02-21T08:43:09.9582164Z   --run-attempt "${RUN_ATTEMPT}" \
2026-02-21T08:43:09.9582340Z   --job-id "${JOB_ID}" \
2026-02-21T08:43:09.9582517Z   --job-name "${JOB_NAME}"
2026-02-21T08:43:09.9582768Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T08:43:09.9582953Z env:
2026-02-21T08:43:09.9583085Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:43:09.9583282Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:09.9583519Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:43:09.9583858Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:09.9584070Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:09.9584271Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:09.9584619Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:43:09.9584993Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:43:09.9585203Z   SCHEMA_VERSION: v3
2026-02-21T08:43:09.9585359Z   REPO: pytorch/helion
2026-02-21T08:43:09.9585509Z   HEAD_BRANCH: refs/heads/main
2026-02-21T08:43:09.9585711Z   HEAD_SHA: 874a7d0cadab18218a84ad3579d329dc95c51820
2026-02-21T08:43:09.9585909Z   WORKFLOW_RUN_ID: 22253280836
2026-02-21T08:43:09.9586077Z   RUN_ATTEMPT: 1
2026-02-21T08:43:09.9586211Z   JOB_ID: 64380329773
2026-02-21T08:43:09.9586409Z   JOB_NAME: run-b200 (kl_div) / benchmark-cu130-kl_div-py3.12-b200
2026-02-21T08:43:09.9586637Z ##[endgroup]
2026-02-21T08:43:10.0174548Z + [[ -n .venv/bin/activate ]]
2026-02-21T08:43:10.0174767Z + source .venv/bin/activate
2026-02-21T08:43:10.0174935Z ++ '[' -z '' ']'
2026-02-21T08:43:10.0175081Z ++ '[' -n x ']'
2026-02-21T08:43:10.0175232Z ++ SCRIPT_PATH=.venv/bin/activate
2026-02-21T08:43:10.0175493Z ++ '[' .venv/bin/activate = /__w/_temp/297db34b-cc20-4609-b220-031348e5e286.sh ']'
2026-02-21T08:43:10.0175757Z ++ deactivate nondestructive
2026-02-21T08:43:10.0175921Z ++ unset -f pydoc
2026-02-21T08:43:10.0176052Z ++ '[' -z '' ']'
2026-02-21T08:43:10.0176183Z ++ '[' -z '' ']'
2026-02-21T08:43:10.0176303Z ++ hash -r
2026-02-21T08:43:10.0176426Z ++ '[' -z '' ']'
2026-02-21T08:43:10.0176553Z ++ unset VIRTUAL_ENV
2026-02-21T08:43:10.0176706Z ++ unset VIRTUAL_ENV_PROMPT
2026-02-21T08:43:10.0177153Z ++ '[' '!' nondestructive = nondestructive ']'
2026-02-21T08:43:10.0177368Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv
2026-02-21T08:43:10.0177646Z ++ '[' linux-gnu = cygwin ']'
2026-02-21T08:43:10.0177839Z ++ '[' linux-gnu = msys ']'
2026-02-21T08:43:10.0178029Z ++ export VIRTUAL_ENV
2026-02-21T08:43:10.0178169Z ++ '[' -z '' ']'
2026-02-21T08:43:10.0178313Z ++ unset SCRIPT_PATH
2026-02-21T08:43:10.0178921Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T08:43:10.0180009Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T08:43:10.0180646Z ++ export PATH
2026-02-21T08:43:10.0180784Z ++ '[' xhelion '!=' x ']'
2026-02-21T08:43:10.0180956Z ++ VIRTUAL_ENV_PROMPT=helion
2026-02-21T08:43:10.0181120Z ++ export VIRTUAL_ENV_PROMPT
2026-02-21T08:43:10.0181267Z ++ '[' -z '' ']'
2026-02-21T08:43:10.0181395Z ++ '[' -z '' ']'
2026-02-21T08:43:10.0181521Z ++ _OLD_VIRTUAL_PS1=
2026-02-21T08:43:10.0181667Z ++ PS1='(helion) '
2026-02-21T08:43:10.0181802Z ++ export PS1
2026-02-21T08:43:10.0182018Z ++ alias pydoc
2026-02-21T08:43:10.0182146Z ++ true
2026-02-21T08:43:10.0182273Z ++ hash -r
2026-02-21T08:43:10.0183220Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/gather-benchmark-metadata/../../scripts/benchmarks/gather_metadata.py --schema-version v3 --repo pytorch/helion --head-branch refs/heads/main --head-sha 874a7d0cadab18218a84ad3579d329dc95c51820 --workflow-id 22253280836 --run-attempt 1 --job-id 64380329773 --job-name 'run-b200 (kl_div) / benchmark-cu130-kl_div-py3.12-b200'
2026-02-21T08:43:10.0543606Z ##[group]Run pytorch/test-infra/.github/actions/gather-runners-info@main
2026-02-21T08:43:10.0543866Z with:
2026-02-21T08:43:10.0544006Z   venv: .venv/bin/activate
2026-02-21T08:43:10.0544154Z env:
2026-02-21T08:43:10.0544292Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:43:10.0544570Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:10.0544818Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:43:10.0545050Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:10.0545269Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:10.0545476Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:10.0545829Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:43:10.0546223Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:43:10.0546432Z ##[endgroup]
2026-02-21T08:43:10.0555367Z ##[group]Run set -eux
2026-02-21T08:43:10.0555537Z set -eux
2026-02-21T08:43:10.0555679Z 
2026-02-21T08:43:10.0555815Z if command -v nvidia-smi; then
2026-02-21T08:43:10.0556007Z   DEVICE_NAME=cuda
2026-02-21T08:43:10.0556167Z   nvidia-smi
2026-02-21T08:43:10.0556327Z elif command -v rocm-smi; then
2026-02-21T08:43:10.0556509Z   DEVICE_NAME=rocm
2026-02-21T08:43:10.0556657Z   rocm-smi
2026-02-21T08:43:10.0556809Z elif command -v hl-smi; then
2026-02-21T08:43:10.0556981Z   DEVICE_NAME=hpu
2026-02-21T08:43:10.0557133Z   hl-smi
2026-02-21T08:43:10.0557258Z else
2026-02-21T08:43:10.0557397Z   arch=$(uname -m)
2026-02-21T08:43:10.0557545Z 
2026-02-21T08:43:10.0557667Z   case "$arch" in
2026-02-21T08:43:10.0557822Z     aarch64|arm64)
2026-02-21T08:43:10.0557978Z       DEVICE_NAME=arm64-cpu
2026-02-21T08:43:10.0558146Z       ;;
2026-02-21T08:43:10.0558271Z     *)
2026-02-21T08:43:10.0558409Z       DEVICE_NAME=cpu
2026-02-21T08:43:10.0558558Z       ;;
2026-02-21T08:43:10.0558686Z   esac
2026-02-21T08:43:10.0558810Z   lscpu
2026-02-21T08:43:10.0558945Z fi
2026-02-21T08:43:10.0559112Z echo "DEVICE_NAME=$DEVICE_NAME" >> $GITHUB_ENV
2026-02-21T08:43:10.0559397Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T08:43:10.0559591Z env:
2026-02-21T08:43:10.0559719Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:43:10.0559915Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:10.0560148Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:43:10.0560385Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:10.0560596Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:10.0560799Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:10.0561151Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:43:10.0561518Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:43:10.0561732Z ##[endgroup]
2026-02-21T08:43:10.1111408Z /usr/bin/nvidia-smi
2026-02-21T08:43:10.1112953Z + command -v nvidia-smi
2026-02-21T08:43:10.1113134Z + DEVICE_NAME=cuda
2026-02-21T08:43:10.1113277Z + nvidia-smi
2026-02-21T08:43:10.1256870Z Sat Feb 21 08:43:10 2026       
2026-02-21T08:43:10.1257220Z +-----------------------------------------------------------------------------------------+
2026-02-21T08:43:10.1257618Z | NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
2026-02-21T08:43:10.1258001Z +-----------------------------------------+------------------------+----------------------+
2026-02-21T08:43:10.1258381Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2026-02-21T08:43:10.1259367Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2026-02-21T08:43:10.1259741Z |                                         |                        |               MIG M. |
2026-02-21T08:43:10.1259970Z |=========================================+========================+======================|
2026-02-21T08:43:10.1318187Z |   0  NVIDIA B200                    Off |   00000000:9D:00.0 Off |                    0 |
2026-02-21T08:43:10.1318604Z | N/A   40C    P0            197W /  750W |       0MiB / 183359MiB |      0%      Default |
2026-02-21T08:43:10.1318932Z |                                         |                        |             Disabled |
2026-02-21T08:43:10.1319264Z +-----------------------------------------+------------------------+----------------------+
2026-02-21T08:43:10.1319460Z 
2026-02-21T08:43:10.1319585Z +-----------------------------------------------------------------------------------------+
2026-02-21T08:43:10.1319883Z | Processes:                                                                              |
2026-02-21T08:43:10.1320169Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2026-02-21T08:43:10.1320429Z |        ID   ID                                                               Usage      |
2026-02-21T08:43:10.1320668Z |=========================================================================================|
2026-02-21T08:43:10.1320930Z |  No running processes found                                                             |
2026-02-21T08:43:10.1321228Z +-----------------------------------------------------------------------------------------+
2026-02-21T08:43:10.1592502Z + echo DEVICE_NAME=cuda
2026-02-21T08:43:10.1627426Z ##[group]Run set -eux
2026-02-21T08:43:10.1627619Z set -eux
2026-02-21T08:43:10.1627756Z 
2026-02-21T08:43:10.1627915Z if [[ "${DEVICE_NAME}" == "cuda" ]]; then
2026-02-21T08:43:10.1628146Z   # Return the same device name as PyTorch
2026-02-21T08:43:10.1628467Z   DEVICE_TYPE=$(nvidia-smi -i 0 --query-gpu=name --format=csv,noheader)
2026-02-21T08:43:10.1628746Z elif [[ "${DEVICE_NAME}" == "rocm" ]]; then
2026-02-21T08:43:10.1629047Z   DEVICE_TYPE=$(rocminfo | grep "Marketing Name" | tail -n1 | awk -F':' '{print $2}' | xargs)
2026-02-21T08:43:10.1629365Z elif [[ "${DEVICE_NAME}" == "hpu" ]]; then
2026-02-21T08:43:10.1629713Z   DEVICE_TYPE="Intel Gaudi3 "$(hl-smi -q | grep "Product Name" | head -n 1 | awk -F ':' '{print $2}' | sed 's/^ *//')
2026-02-21T08:43:10.1630055Z elif [[ "${DEVICE_NAME}" == "cpu" ]]; then
2026-02-21T08:43:10.1630724Z   DEVICE_TYPE="$(lscpu | grep "Model name" | sed -E 's/.*Model name:[[:space:]]*//; s/Intel\(R\)//g; s/\(R\)//g; s/\(TM\)//g; s/CPU//g; s/Processor//g; s/[[:space:]]+/ /g; s/^ //; s/ $//; s/ /_/g')_$(awk -F: '/Core\(s\) per socket/ {c=$2} /Socket\(s\)/ {s=$2} END {gsub(/ /,"",c); gsub(/ /,"",s); printf "%sc", c*s}' < <(lscpu))"
2026-02-21T08:43:10.1631373Z elif [[ "${DEVICE_NAME}" == "arm64-cpu" ]]; then
2026-02-21T08:43:10.1631679Z   DEVICE_TYPE=$(lscpu | grep 'Vendor ID' | cut -f 2 -d ":" | awk '{$1=$1}1' | cut -f 2 -d " ")
2026-02-21T08:43:10.1632014Z fi
2026-02-21T08:43:10.1632178Z echo "DEVICE_TYPE=$DEVICE_TYPE" >> $GITHUB_ENV
2026-02-21T08:43:10.1632464Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T08:43:10.1632665Z env:
2026-02-21T08:43:10.1632795Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:43:10.1632989Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:10.1633222Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:43:10.1633457Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:10.1633660Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:10.1633868Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:10.1634297Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:43:10.1634667Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:43:10.1634884Z   DEVICE_NAME: cuda
2026-02-21T08:43:10.1635023Z ##[endgroup]
2026-02-21T08:43:10.2121025Z + [[ cuda == \c\u\d\a ]]
2026-02-21T08:43:10.2125611Z ++ nvidia-smi -i 0 --query-gpu=name --format=csv,noheader
2026-02-21T08:43:10.2305549Z + DEVICE_TYPE='NVIDIA B200'
2026-02-21T08:43:10.2307203Z + echo 'DEVICE_TYPE=NVIDIA B200'
2026-02-21T08:43:10.2339998Z ##[group]Run set -eux
2026-02-21T08:43:10.2340165Z set -eux
2026-02-21T08:43:10.2340294Z 
2026-02-21T08:43:10.2340446Z if [[ -n ".venv/bin/activate" ]]; then
2026-02-21T08:43:10.2340642Z   source ".venv/bin/activate"
2026-02-21T08:43:10.2340813Z fi
2026-02-21T08:43:10.2340932Z 
2026-02-21T08:43:10.2341123Z python3 -mpip install psutil==7.0.0 nvidia-ml-py==13.580.82
2026-02-21T08:43:10.2341461Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_runners_info.py"
2026-02-21T08:43:10.2341838Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T08:43:10.2342104Z env:
2026-02-21T08:43:10.2342233Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:43:10.2342433Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:10.2342669Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:43:10.2342906Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:10.2343118Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:10.2343317Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:10.2343662Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:43:10.2344027Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:43:10.2344240Z   DEVICE_NAME: cuda
2026-02-21T08:43:10.2344379Z   DEVICE_TYPE: NVIDIA B200
2026-02-21T08:43:10.2344533Z ##[endgroup]
2026-02-21T08:43:10.2822696Z + [[ -n .venv/bin/activate ]]
2026-02-21T08:43:10.2822974Z + source .venv/bin/activate
2026-02-21T08:43:10.2823168Z ++ '[' -z '' ']'
2026-02-21T08:43:10.2823323Z ++ '[' -n x ']'
2026-02-21T08:43:10.2823476Z ++ SCRIPT_PATH=.venv/bin/activate
2026-02-21T08:43:10.2823735Z ++ '[' .venv/bin/activate = /__w/_temp/f0e3475c-e481-4719-a418-fb2e6ecaf6e2.sh ']'
2026-02-21T08:43:10.2824007Z ++ deactivate nondestructive
2026-02-21T08:43:10.2824171Z ++ unset -f pydoc
2026-02-21T08:43:10.2824303Z ++ '[' -z '' ']'
2026-02-21T08:43:10.2824440Z ++ '[' -z '' ']'
2026-02-21T08:43:10.2824576Z ++ hash -r
2026-02-21T08:43:10.2824705Z ++ '[' -z '' ']'
2026-02-21T08:43:10.2824834Z ++ unset VIRTUAL_ENV
2026-02-21T08:43:10.2824994Z ++ unset VIRTUAL_ENV_PROMPT
2026-02-21T08:43:10.2825173Z ++ '[' '!' nondestructive = nondestructive ']'
2026-02-21T08:43:10.2825373Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv
2026-02-21T08:43:10.2825555Z ++ '[' linux-gnu = cygwin ']'
2026-02-21T08:43:10.2825708Z ++ '[' linux-gnu = msys ']'
2026-02-21T08:43:10.2825884Z ++ export VIRTUAL_ENV
2026-02-21T08:43:10.2826018Z ++ '[' -z '' ']'
2026-02-21T08:43:10.2826152Z ++ unset SCRIPT_PATH
2026-02-21T08:43:10.2826758Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T08:43:10.2827858Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T08:43:10.2828501Z ++ export PATH
2026-02-21T08:43:10.2828641Z ++ '[' xhelion '!=' x ']'
2026-02-21T08:43:10.2828808Z ++ VIRTUAL_ENV_PROMPT=helion
2026-02-21T08:43:10.2828975Z ++ export VIRTUAL_ENV_PROMPT
2026-02-21T08:43:10.2829134Z ++ '[' -z '' ']'
2026-02-21T08:43:10.2829450Z ++ '[' -z '' ']'
2026-02-21T08:43:10.2829587Z ++ _OLD_VIRTUAL_PS1=
2026-02-21T08:43:10.2829737Z ++ PS1='(helion) '
2026-02-21T08:43:10.2829874Z ++ export PS1
2026-02-21T08:43:10.2830014Z ++ alias pydoc
2026-02-21T08:43:10.2830144Z ++ true
2026-02-21T08:43:10.2830331Z ++ hash -r
2026-02-21T08:43:10.2830562Z + python3 -mpip install psutil==7.0.0 nvidia-ml-py==13.580.82
2026-02-21T08:43:10.9333144Z Collecting psutil==7.0.0
2026-02-21T08:43:10.9844739Z   Downloading psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
2026-02-21T08:43:11.0051804Z Collecting nvidia-ml-py==13.580.82
2026-02-21T08:43:11.0108076Z   Downloading nvidia_ml_py-13.580.82-py3-none-any.whl.metadata (9.6 kB)
2026-02-21T08:43:11.0200804Z Downloading psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (277 kB)
2026-02-21T08:43:11.0388459Z Downloading nvidia_ml_py-13.580.82-py3-none-any.whl (49 kB)
2026-02-21T08:43:11.1221699Z Installing collected packages: nvidia-ml-py, psutil
2026-02-21T08:43:11.1229116Z   Attempting uninstall: nvidia-ml-py
2026-02-21T08:43:11.1246995Z     Found existing installation: nvidia-ml-py 13.590.48
2026-02-21T08:43:11.1257486Z     Uninstalling nvidia-ml-py-13.590.48:
2026-02-21T08:43:11.1898196Z       Successfully uninstalled nvidia-ml-py-13.590.48
2026-02-21T08:43:11.2359727Z   Attempting uninstall: psutil
2026-02-21T08:43:11.2389655Z     Found existing installation: psutil 7.2.2
2026-02-21T08:43:11.2404589Z     Uninstalling psutil-7.2.2:
2026-02-21T08:43:11.2408592Z       Successfully uninstalled psutil-7.2.2
2026-02-21T08:43:11.3524495Z 
2026-02-21T08:43:11.3556459Z Successfully installed nvidia-ml-py-13.580.82 psutil-7.0.0
2026-02-21T08:43:11.4756596Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/gather-runners-info/../../scripts/benchmarks/gather_runners_info.py
2026-02-21T08:43:13.1112720Z ##[group]Run pytorch/test-infra/.github/actions/gather-dependencies@main
2026-02-21T08:43:13.1112993Z with:
2026-02-21T08:43:13.1113137Z   venv: .venv/bin/activate
2026-02-21T08:43:13.1113288Z env:
2026-02-21T08:43:13.1113431Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:43:13.1113629Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:13.1113897Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:43:13.1114130Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:13.1114345Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:13.1114551Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:13.1114901Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:43:13.1115278Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:43:13.1115488Z   DEVICE_NAME: cuda
2026-02-21T08:43:13.1115662Z   DEVICE_TYPE: NVIDIA B200
2026-02-21T08:43:13.1115812Z ##[endgroup]
2026-02-21T08:43:13.1124363Z ##[group]Run set -eux
2026-02-21T08:43:13.1124530Z set -eux
2026-02-21T08:43:13.1124667Z 
2026-02-21T08:43:13.1124815Z # TODO (huydhn): Implement this part
2026-02-21T08:43:13.1125047Z echo "dependencies={}" >> "${GITHUB_OUTPUT}"
2026-02-21T08:43:13.1125358Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T08:43:13.1125547Z env:
2026-02-21T08:43:13.1125685Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:43:13.1125875Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:13.1126114Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:43:13.1126345Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:13.1126559Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:13.1126769Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:13.1127123Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:43:13.1127502Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:43:13.1127710Z   DEVICE_NAME: cuda
2026-02-21T08:43:13.1127858Z   DEVICE_TYPE: NVIDIA B200
2026-02-21T08:43:13.1128005Z ##[endgroup]
2026-02-21T08:43:13.1706392Z + echo 'dependencies={}'
2026-02-21T08:43:13.1755454Z ##[group]Run actions/upload-artifact@v6
2026-02-21T08:43:13.1755658Z with:
2026-02-21T08:43:13.1755811Z   name: benchmark-results-b200-kl_div
2026-02-21T08:43:13.1755997Z   path: test/test-reports
2026-02-21T08:43:13.1756163Z   if-no-files-found: warn
2026-02-21T08:43:13.1756318Z   compression-level: 6
2026-02-21T08:43:13.1756470Z   overwrite: false
2026-02-21T08:43:13.1756611Z   include-hidden-files: false
2026-02-21T08:43:13.1756774Z env:
2026-02-21T08:43:13.1756903Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:43:13.1757103Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:13.1757341Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:43:13.1757583Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:13.1757796Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:13.1758003Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:43:13.1758377Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2026-02-21T08:43:13.1758773Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:43:13.1758988Z   DEVICE_NAME: cuda
2026-02-21T08:43:13.1759128Z   DEVICE_TYPE: NVIDIA B200
2026-02-21T08:43:13.1759285Z ##[endgroup]
2026-02-21T08:43:13.1761398Z ##[command]/usr/bin/docker exec  227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T08:43:13.3962312Z With the provided path, there will be 1 file uploaded
2026-02-21T08:43:13.3966787Z Artifact name is valid!
2026-02-21T08:43:13.3971460Z Root directory input is valid!
2026-02-21T08:43:13.6655149Z Beginning upload of artifact content to blob storage
2026-02-21T08:43:14.0133362Z Uploaded bytes 622
2026-02-21T08:43:14.1034365Z Finished uploading artifact content to blob storage!
2026-02-21T08:43:14.1036020Z SHA256 digest of uploaded artifact zip is 5753666ca7007086ebe962314e4f20502e13d355286efeb938ff9ecc486f419a
2026-02-21T08:43:14.1036416Z Finalizing artifact upload
2026-02-21T08:43:14.4186289Z Artifact benchmark-results-b200-kl_div.zip successfully finalized. Artifact ID 5600481450
2026-02-21T08:43:14.4186860Z Artifact benchmark-results-b200-kl_div has been successfully uploaded! Final size is 622 bytes. Artifact ID is 5600481450
2026-02-21T08:43:14.4190494Z Artifact download URL: https://github.com/pytorch/helion/actions/runs/22253280836/artifacts/5600481450
2026-02-21T08:43:14.4311744Z Post job cleanup.
2026-02-21T08:43:14.4315561Z ##[command]/usr/bin/docker exec  227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T08:43:14.6199438Z UV_PYTHON_INSTALL_DIR is already set to /github/home/.local/share/uv/python
2026-02-21T08:43:14.6199931Z (node:83677) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
2026-02-21T08:43:14.6200366Z (Use `node --trace-deprecation ...` to show where the warning was created)
2026-02-21T08:43:14.6294229Z Post job cleanup.
2026-02-21T08:43:14.6296587Z ##[command]/usr/bin/docker exec  227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T08:43:14.8236290Z Post job cleanup.
2026-02-21T08:43:14.8239367Z ##[command]/usr/bin/docker exec  227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T08:43:14.9928874Z [command]/usr/bin/git version
2026-02-21T08:43:14.9958720Z git version 2.43.0
2026-02-21T08:43:14.9989677Z Temporarily overriding HOME='/__w/_temp/f15b19ce-d71f-469b-8167-b80cbddb17e7' before making global git config changes
2026-02-21T08:43:14.9991691Z Adding repository directory to the temporary git global config as a safe directory
2026-02-21T08:43:14.9992186Z [command]/usr/bin/git config --global --add safe.directory /__w/helion/helion
2026-02-21T08:43:15.0028149Z Removing SSH command configuration
2026-02-21T08:43:15.0028471Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2026-02-21T08:43:15.0054257Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2026-02-21T08:43:15.0275987Z Removing HTTP extra header
2026-02-21T08:43:15.0276432Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2026-02-21T08:43:15.0297702Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2026-02-21T08:43:15.0504823Z Removing includeIf entries pointing to credentials config files
2026-02-21T08:43:15.0513433Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir:
2026-02-21T08:43:15.0527265Z includeif.gitdir:/__w/helion/helion/.git.path
2026-02-21T08:43:15.0527565Z includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path
2026-02-21T08:43:15.0527837Z includeif.gitdir:/github/workspace/.git.path
2026-02-21T08:43:15.0528079Z includeif.gitdir:/github/workspace/.git/worktrees/*.path
2026-02-21T08:43:15.0534023Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/__w/helion/helion/.git.path
2026-02-21T08:43:15.0550998Z /__w/_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config
2026-02-21T08:43:15.0558638Z [command]/usr/bin/git config --local --unset includeif.gitdir:/__w/helion/helion/.git.path /__w/_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config
2026-02-21T08:43:15.0580410Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path
2026-02-21T08:43:15.0607188Z /__w/_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config
2026-02-21T08:43:15.0612191Z [command]/usr/bin/git config --local --unset includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path /__w/_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config
2026-02-21T08:43:15.0636745Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/github/workspace/.git.path
2026-02-21T08:43:15.0653591Z /github/runner_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config
2026-02-21T08:43:15.0657675Z [command]/usr/bin/git config --local --unset includeif.gitdir:/github/workspace/.git.path /github/runner_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config
2026-02-21T08:43:15.0684498Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/github/workspace/.git/worktrees/*.path
2026-02-21T08:43:15.0704700Z /github/runner_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config
2026-02-21T08:43:15.0711788Z [command]/usr/bin/git config --local --unset includeif.gitdir:/github/workspace/.git/worktrees/*.path /github/runner_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config
2026-02-21T08:43:15.0742613Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url
2026-02-21T08:43:15.0965489Z Removing credentials config '/__w/_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config'
2026-02-21T08:43:15.1058465Z Stop and remove container: d92e73d387994e1e949d78541a22b449_nvidiacuda1301develubuntu2404_f71b97
2026-02-21T08:43:15.1061532Z ##[command]/usr/bin/docker rm --force 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12
2026-02-21T08:43:18.2876758Z 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12
2026-02-21T08:43:18.2905225Z Remove container network: github_network_635fe730ba6e422c803643553ff1a973
2026-02-21T08:43:18.2907943Z ##[command]/usr/bin/docker network rm github_network_635fe730ba6e422c803643553ff1a973
2026-02-21T08:43:18.7397906Z github_network_635fe730ba6e422c803643553ff1a973
2026-02-21T08:43:18.7452398Z Evaluate and set job outputs
2026-02-21T08:43:18.7457545Z Set output 'benchmark-metadata'
2026-02-21T08:43:18.7458985Z Set output 'runners-info'
2026-02-21T08:43:18.7459631Z Set output 'dependencies'
2026-02-21T08:43:18.7460035Z Cleaning up orphan processes