2026-02-21T08:04:43.3368283Z Current runner version: '2.331.0' 2026-02-21T08:04:43.3372126Z Runner name: 'dgxb200-04-1007' 2026-02-21T08:04:43.3372643Z Runner group name: 'default' 2026-02-21T08:04:43.3373259Z Machine name: 'a3bc1758654d' 2026-02-21T08:04:43.3374906Z ##[group]GITHUB_TOKEN Permissions 2026-02-21T08:04:43.3376411Z Contents: read 2026-02-21T08:04:43.3376795Z Metadata: read 2026-02-21T08:04:43.3377191Z ##[endgroup] 2026-02-21T08:04:43.3378625Z Secret source: Actions 2026-02-21T08:04:43.3379112Z Prepare workflow directory 2026-02-21T08:04:43.3740661Z Prepare all required actions 2026-02-21T08:04:43.3769110Z Getting action download info 2026-02-21T08:04:43.8293574Z Download action repository 'actions/checkout@v6' (SHA:de0fac2e4500dabe0009e67214ff5f5447ce83dd) 2026-02-21T08:04:44.1523697Z Download action repository 'actions/setup-python@v6' (SHA:a309ff8b426b58ec0e2a45f0f869d46889d02405) 2026-02-21T08:04:44.5027040Z Download action repository 'astral-sh/setup-uv@v7' (SHA:eac588ad8def6316056a12d4907a9d4d84ff7a3b) 2026-02-21T08:04:44.8777290Z Download action repository 'pytorch/test-infra@main' (SHA:bb8f04ff3961233c844fde6533c7c6c5f0857909) 2026-02-21T08:04:45.5042756Z Download action repository 'actions/upload-artifact@v6' (SHA:b7c566a772e6b6bfb58ed0dc250532a479d7789f) 2026-02-21T08:04:45.9823539Z Getting action download info 2026-02-21T08:04:46.2022857Z Uses: pytorch/helion/.github/workflows/benchmark.yml@refs/heads/main (874a7d0cadab18218a84ad3579d329dc95c51820) 2026-02-21T08:04:46.2026539Z ##[group] Inputs 2026-02-21T08:04:46.2027208Z runner: linux.dgx.b200 2026-02-21T08:04:46.2027895Z python-version: 3.12 2026-02-21T08:04:46.2028570Z image: nvidia/cuda:13.0.1-devel-ubuntu24.04 2026-02-21T08:04:46.2029387Z runtime-version: cu130 2026-02-21T08:04:46.2030085Z container-options: --gpus all 2026-02-21T08:04:46.2030725Z alias: b200 2026-02-21T08:04:46.2031304Z kernels: kl_div 2026-02-21T08:04:46.2031921Z env-vars: 2026-02-21T08:04:46.2032472Z custom-args: 
2026-02-21T08:04:46.2033252Z run_h100: true 2026-02-21T08:04:46.2033902Z run_b200: true 2026-02-21T08:04:46.2034443Z run_mi325x: true 2026-02-21T08:04:46.2035004Z ##[endgroup] 2026-02-21T08:04:46.2035842Z Complete job name: run-b200 (kl_div) / benchmark-cu130-kl_div-py3.12-b200 2026-02-21T08:04:46.2338073Z ##[group]Checking docker version 2026-02-21T08:04:46.2348010Z ##[command]/usr/bin/docker version --format '{{.Server.APIVersion}}' 2026-02-21T08:04:46.3358808Z '1.53' 2026-02-21T08:04:46.3379514Z Docker daemon API version: '1.53' 2026-02-21T08:04:46.3380478Z ##[command]/usr/bin/docker version --format '{{.Client.APIVersion}}' 2026-02-21T08:04:46.4254851Z '1.52' 2026-02-21T08:04:46.4278316Z Docker client API version: '1.52' 2026-02-21T08:04:46.4282964Z ##[endgroup] 2026-02-21T08:04:46.4285173Z ##[group]Clean up resources from previous jobs 2026-02-21T08:04:46.4289359Z ##[command]/usr/bin/docker ps --all --quiet --no-trunc --filter "label=f446d1" 2026-02-21T08:04:46.4416515Z ##[command]/usr/bin/docker network prune --force --filter "label=f446d1" 2026-02-21T08:04:46.4535132Z ##[endgroup] 2026-02-21T08:04:46.4535817Z ##[group]Create local container network 2026-02-21T08:04:46.4543659Z ##[command]/usr/bin/docker network create --label f446d1 github_network_635fe730ba6e422c803643553ff1a973 2026-02-21T08:04:46.9217946Z 7031020d6c0021ee991d36b442e7325c86524e11541bb3c152cea95d7e7fb78f 2026-02-21T08:04:46.9241137Z ##[endgroup] 2026-02-21T08:04:46.9262699Z ##[group]Starting job container 2026-02-21T08:04:46.9279375Z ##[command]/usr/bin/docker pull nvidia/cuda:13.0.1-devel-ubuntu24.04 2026-02-21T08:04:47.7508419Z 13.0.1-devel-ubuntu24.04: Pulling from nvidia/cuda 2026-02-21T08:04:48.0466017Z 1cd98a0b9132: Pulling fs layer 2026-02-21T08:04:48.0470152Z eea924c2c8fb: Pulling fs layer 2026-02-21T08:04:48.0474994Z afcf80b42416: Pulling fs layer 2026-02-21T08:04:48.0476203Z e93dd1223ff5: Pulling fs layer 2026-02-21T08:04:48.0476464Z 76249c7cd503: Pulling fs layer 
2026-02-21T08:04:48.0476762Z c20926c42231: Pulling fs layer 2026-02-21T08:04:48.0477072Z c03b8ec8dd33: Pulling fs layer 2026-02-21T08:04:48.0477644Z d7913b78456a: Pulling fs layer 2026-02-21T08:04:48.0477901Z 8fb7ecb711ef: Pulling fs layer 2026-02-21T08:04:48.0478186Z ab7341a40ee7: Pulling fs layer 2026-02-21T08:04:48.0478409Z 401d11fb2a09: Pulling fs layer 2026-02-21T08:04:48.2432361Z 8fb7ecb711ef: Download complete 2026-02-21T08:04:48.2436722Z 1cd98a0b9132: Download complete 2026-02-21T08:04:48.3428527Z afcf80b42416: Download complete 2026-02-21T08:04:48.3431248Z c20926c42231: Download complete 2026-02-21T08:04:48.3438728Z c03b8ec8dd33: Download complete 2026-02-21T08:04:48.3444090Z d7913b78456a: Download complete 2026-02-21T08:04:48.4421893Z 401d11fb2a09: Download complete 2026-02-21T08:04:48.8422883Z 76249c7cd503: Download complete 2026-02-21T08:04:49.9423896Z ab7341a40ee7: Download complete 2026-02-21T08:04:49.9452086Z 76249c7cd503: Pull complete 2026-02-21T08:04:59.1420091Z eea924c2c8fb: Download complete 2026-02-21T08:05:01.6472411Z 401d11fb2a09: Pull complete 2026-02-21T08:05:06.1431667Z e93dd1223ff5: Download complete 2026-02-21T08:05:06.2457036Z c03b8ec8dd33: Pull complete 2026-02-21T08:05:06.2461983Z d7913b78456a: Pull complete 2026-02-21T08:05:06.2466800Z ab7341a40ee7: Pull complete 2026-02-21T08:05:22.6450886Z 8fb7ecb711ef: Pull complete 2026-02-21T08:05:22.6457911Z afcf80b42416: Pull complete 2026-02-21T08:05:22.6458366Z c20926c42231: Pull complete 2026-02-21T08:05:22.6461983Z eea924c2c8fb: Pull complete 2026-02-21T08:06:01.3445994Z e93dd1223ff5: Pull complete 2026-02-21T08:06:01.3653968Z 1cd98a0b9132: Pull complete 2026-02-21T08:06:01.3654572Z Digest: sha256:7d2f6a8c2071d911524f95061a0db363e24d27aa51ec831fcccf9e76eb72bc92 2026-02-21T08:06:01.3659936Z Status: Downloaded newer image for nvidia/cuda:13.0.1-devel-ubuntu24.04 2026-02-21T08:06:01.3660834Z docker.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 2026-02-21T08:06:01.3731170Z ##[command]/usr/bin/docker 
create --name d92e73d387994e1e949d78541a22b449_nvidiacuda1301develubuntu2404_f71b97 --label f446d1 --workdir /__w/helion/helion --network github_network_635fe730ba6e422c803643553ff1a973 --gpus all -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/eve/_work":"/__w" -v "/home/eve/externals":"/__e":ro -v "/home/eve/_work/_temp":"/__w/_temp" -v "/home/eve/_work/_actions":"/__w/_actions" -v "/home/eve/_work/_tool":"/__w/_tool" -v "/home/eve/_work/_temp/_github_home":"/github/home" -v "/home/eve/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" nvidia/cuda:13.0.1-devel-ubuntu24.04 "-f" "/dev/null" 2026-02-21T08:06:01.4372048Z 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 2026-02-21T08:06:01.4396375Z ##[command]/usr/bin/docker start 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 2026-02-21T08:06:02.0761501Z 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 2026-02-21T08:06:02.0782664Z ##[command]/usr/bin/docker ps --all --filter id=227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 --filter status=running --no-trunc --format "{{.ID}} {{.Status}}" 2026-02-21T08:06:02.0937938Z 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 Up Less than a second 2026-02-21T08:06:02.0958792Z ##[command]/usr/bin/docker inspect --format "{{range .Config.Env}}{{println .}}{{end}}" 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 2026-02-21T08:06:02.1059325Z HOME=/github/home 2026-02-21T08:06:02.1059637Z GITHUB_ACTIONS=true 2026-02-21T08:06:02.1059865Z CI=true 2026-02-21T08:06:02.1060386Z PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:06:02.1060792Z NVARCH=x86_64 2026-02-21T08:06:02.1065895Z NVIDIA_REQUIRE_CUDA=cuda>=13.0 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 
brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571 brand=unknown,driver>=575,driver<576 brand=grid,driver>=575,driver<576 brand=tesla,driver>=575,driver<576 brand=nvidia,driver>=575,driver<576 brand=quadro,driver>=575,driver<576 brand=quadrortx,driver>=575,driver<576 brand=nvidiartx,driver>=575,driver<576 brand=vapps,driver>=575,driver<576 brand=vpc,driver>=575,driver<576 brand=vcs,driver>=575,driver<576 
brand=vws,driver>=575,driver<576 brand=cloudgaming,driver>=575,driver<576 2026-02-21T08:06:02.1071292Z NV_CUDA_CUDART_VERSION=13.0.88-1 2026-02-21T08:06:02.1071574Z CUDA_VERSION=13.0.1 2026-02-21T08:06:02.1071932Z LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:02.1072378Z NVIDIA_VISIBLE_DEVICES=all 2026-02-21T08:06:02.1072670Z NVIDIA_DRIVER_CAPABILITIES=compute,utility 2026-02-21T08:06:02.1072957Z NV_CUDA_LIB_VERSION=13.0.1-1 2026-02-21T08:06:02.1073393Z NV_NVTX_VERSION=13.0.85-1 2026-02-21T08:06:02.1073628Z NV_LIBNPP_VERSION=13.0.1.2-1 2026-02-21T08:06:02.1073921Z NV_LIBNPP_PACKAGE=libnpp-13-0=13.0.1.2-1 2026-02-21T08:06:02.1074193Z NV_LIBCUSPARSE_VERSION=12.6.3.3-1 2026-02-21T08:06:02.1074563Z NV_LIBCUBLAS_PACKAGE_NAME=libcublas-13-0 2026-02-21T08:06:02.1074940Z NV_LIBCUBLAS_VERSION=13.0.2.14-1 2026-02-21T08:06:02.1075373Z NV_LIBCUBLAS_PACKAGE=libcublas-13-0=13.0.2.14-1 2026-02-21T08:06:02.1075816Z NV_LIBNCCL_PACKAGE_NAME=libnccl2 2026-02-21T08:06:02.1076162Z NV_LIBNCCL_PACKAGE_VERSION=2.28.3-1 2026-02-21T08:06:02.1076572Z NCCL_VERSION=2.28.3-1 2026-02-21T08:06:02.1076887Z NV_LIBNCCL_PACKAGE=libnccl2=2.28.3-1+cuda13.0 2026-02-21T08:06:02.1077304Z NVIDIA_PRODUCT_NAME=CUDA 2026-02-21T08:06:02.1077587Z NV_CUDA_CUDART_DEV_VERSION=13.0.88-1 2026-02-21T08:06:02.1078031Z NV_NVML_DEV_VERSION=13.0.87-1 2026-02-21T08:06:02.1078419Z NV_LIBCUSPARSE_DEV_VERSION=12.6.3.3-1 2026-02-21T08:06:02.1078746Z NV_LIBNPP_DEV_VERSION=13.0.1.2-1 2026-02-21T08:06:02.1079210Z NV_LIBNPP_DEV_PACKAGE=libnpp-dev-13-0=13.0.1.2-1 2026-02-21T08:06:02.1079661Z NV_LIBCUBLAS_DEV_VERSION=13.0.2.14-1 2026-02-21T08:06:02.1080107Z NV_LIBCUBLAS_DEV_PACKAGE_NAME=libcublas-dev-13-0 2026-02-21T08:06:02.1080568Z NV_LIBCUBLAS_DEV_PACKAGE=libcublas-dev-13-0=13.0.2.14-1 2026-02-21T08:06:02.1081026Z NV_CUDA_NSIGHT_COMPUTE_VERSION=13.0.1-1 2026-02-21T08:06:02.1081491Z NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE=cuda-nsight-compute-13-0=13.0.1-1 
2026-02-21T08:06:02.1082047Z NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev 2026-02-21T08:06:02.1082441Z NV_LIBNCCL_DEV_PACKAGE_VERSION=2.28.3-1 2026-02-21T08:06:02.1082794Z NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.28.3-1+cuda13.0 2026-02-21T08:06:02.1083170Z LIBRARY_PATH=/usr/local/cuda/lib64/stubs 2026-02-21T08:06:02.1089072Z ##[endgroup] 2026-02-21T08:06:02.1096761Z ##[group]Waiting for all services to be ready 2026-02-21T08:06:02.1098264Z ##[endgroup] 2026-02-21T08:06:02.1236773Z ##[group]Run echo "Detected NVIDIA image" 2026-02-21T08:06:02.1237100Z echo "Detected NVIDIA image" 2026-02-21T08:06:02.1237421Z nvidia-smi || echo "nvidia-smi not found" 2026-02-21T08:06:02.1239606Z shell: bash -l {0} 2026-02-21T08:06:02.1239972Z env: 2026-02-21T08:06:02.1240159Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:02.1240459Z ##[endgroup] 2026-02-21T08:06:02.1902180Z Detected NVIDIA image 2026-02-21T08:06:02.2177926Z Sat Feb 21 08:06:02 2026 2026-02-21T08:06:02.2179673Z +-----------------------------------------------------------------------------------------+ 2026-02-21T08:06:02.2180134Z | NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 | 2026-02-21T08:06:02.2180572Z +-----------------------------------------+------------------------+----------------------+ 2026-02-21T08:06:02.2181140Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2026-02-21T08:06:02.2181663Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2026-02-21T08:06:02.2182272Z | | | MIG M. 
| 2026-02-21T08:06:02.2182575Z |=========================================+========================+======================| 2026-02-21T08:06:02.2264178Z | 0 NVIDIA B200 Off | 00000000:9D:00.0 Off | 0 | 2026-02-21T08:06:02.2264717Z | N/A 30C P0 141W / 750W | 0MiB / 183359MiB | 0% Default | 2026-02-21T08:06:02.2265149Z | | | Disabled | 2026-02-21T08:06:02.2265548Z +-----------------------------------------+------------------------+----------------------+ 2026-02-21T08:06:02.2265752Z 2026-02-21T08:06:02.2265987Z +-----------------------------------------------------------------------------------------+ 2026-02-21T08:06:02.2266464Z | Processes: | 2026-02-21T08:06:02.2266843Z | GPU GI CI PID Type Process name GPU Memory | 2026-02-21T08:06:02.2267233Z | ID ID Usage | 2026-02-21T08:06:02.2267599Z |=========================================================================================| 2026-02-21T08:06:02.2267980Z | No running processes found | 2026-02-21T08:06:02.2268428Z +-----------------------------------------------------------------------------------------+ 2026-02-21T08:06:02.2653324Z ##[group]Run set -x 2026-02-21T08:06:02.2653603Z set -x 2026-02-21T08:06:02.2653847Z apt-get update 2026-02-21T08:06:02.2654051Z apt-get install -y git 2026-02-21T08:06:02.2654379Z shell: bash -l {0} 2026-02-21T08:06:02.2654606Z env: 2026-02-21T08:06:02.2654779Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:02.2655025Z ##[endgroup] 2026-02-21T08:06:02.3337203Z + apt-get update 2026-02-21T08:06:02.4037990Z Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64 InRelease [1581 B] 2026-02-21T08:06:02.5048299Z Get:2 http://archive.ubuntu.com/ubuntu noble InRelease [256 kB] 2026-02-21T08:06:02.5217282Z Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64 Packages [1218 kB] 2026-02-21T08:06:02.7038209Z Get:4 http://security.ubuntu.com/ubuntu noble-security InRelease [126 kB] 2026-02-21T08:06:02.8853330Z Get:5 
http://archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB] 2026-02-21T08:06:02.9798405Z Get:6 http://archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB] 2026-02-21T08:06:03.0879310Z Get:7 http://archive.ubuntu.com/ubuntu noble/universe amd64 Packages [19.3 MB] 2026-02-21T08:06:03.4928632Z Get:8 http://security.ubuntu.com/ubuntu noble-security/universe amd64 Packages [1207 kB] 2026-02-21T08:06:03.6780963Z Get:9 http://archive.ubuntu.com/ubuntu noble/multiverse amd64 Packages [331 kB] 2026-02-21T08:06:03.6803333Z Get:10 http://archive.ubuntu.com/ubuntu noble/restricted amd64 Packages [117 kB] 2026-02-21T08:06:03.6820022Z Get:11 http://archive.ubuntu.com/ubuntu noble/main amd64 Packages [1808 kB] 2026-02-21T08:06:03.7134140Z Get:12 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 Packages [2240 kB] 2026-02-21T08:06:03.7487782Z Get:13 http://archive.ubuntu.com/ubuntu noble-updates/restricted amd64 Packages [3381 kB] 2026-02-21T08:06:03.8063516Z Get:14 http://archive.ubuntu.com/ubuntu noble-updates/universe amd64 Packages [2016 kB] 2026-02-21T08:06:03.8349289Z Get:15 http://archive.ubuntu.com/ubuntu noble-updates/multiverse amd64 Packages [38.1 kB] 2026-02-21T08:06:03.8360917Z Get:16 http://archive.ubuntu.com/ubuntu noble-backports/universe amd64 Packages [34.6 kB] 2026-02-21T08:06:03.8367210Z Get:17 http://archive.ubuntu.com/ubuntu noble-backports/main amd64 Packages [49.5 kB] 2026-02-21T08:06:04.2925413Z Get:18 http://security.ubuntu.com/ubuntu noble-security/multiverse amd64 Packages [34.8 kB] 2026-02-21T08:06:04.2926078Z Get:19 http://security.ubuntu.com/ubuntu noble-security/restricted amd64 Packages [3196 kB] 2026-02-21T08:06:04.6281213Z Get:20 http://security.ubuntu.com/ubuntu noble-security/main amd64 Packages [1857 kB] 2026-02-21T08:06:04.8164375Z Fetched 37.5 MB in 2s (15.3 MB/s) 2026-02-21T08:06:05.5357689Z Reading package lists... 
2026-02-21T08:06:05.5507659Z W: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details. 2026-02-21T08:06:05.5517007Z + apt-get install -y git 2026-02-21T08:06:06.2345827Z Reading package lists... 2026-02-21T08:06:06.4085499Z Building dependency tree... 2026-02-21T08:06:06.4089604Z Reading state information... 2026-02-21T08:06:06.5763568Z The following additional packages will be installed: 2026-02-21T08:06:06.5764302Z git-man krb5-locales less libbrotli1 libbsd0 libcbor0.10 libcurl3t64-gnutls 2026-02-21T08:06:06.5764985Z libedit2 liberror-perl libexpat1 libfido2-1 libgssapi-krb5-2 libk5crypto3 2026-02-21T08:06:06.5765442Z libkeyutils1 libkrb5-3 libkrb5support0 libnghttp2-14 libpsl5t64 librtmp1 2026-02-21T08:06:06.5765917Z libssh-4 libx11-6 libx11-data libxau6 libxcb1 libxdmcp6 libxext6 libxmuu1 2026-02-21T08:06:06.5767896Z openssh-client publicsuffix xauth 2026-02-21T08:06:06.5775229Z Suggested packages: 2026-02-21T08:06:06.5775649Z gettext-base git-daemon-run | git-daemon-sysvinit git-doc git-email git-gui 2026-02-21T08:06:06.5776553Z gitk gitweb git-cvs git-mediawiki git-svn krb5-doc krb5-user keychain 2026-02-21T08:06:06.5776916Z libpam-ssh monkeysphere ssh-askpass 2026-02-21T08:06:06.6160387Z The following NEW packages will be installed: 2026-02-21T08:06:06.6162433Z git git-man krb5-locales less libbrotli1 libbsd0 libcbor0.10 2026-02-21T08:06:06.6162934Z libcurl3t64-gnutls libedit2 liberror-perl libexpat1 libfido2-1 2026-02-21T08:06:06.6163324Z libgssapi-krb5-2 libk5crypto3 libkeyutils1 libkrb5-3 libkrb5support0 2026-02-21T08:06:06.6163857Z libnghttp2-14 libpsl5t64 librtmp1 libssh-4 libx11-6 libx11-data libxau6 2026-02-21T08:06:06.6164249Z libxcb1 libxdmcp6 libxext6 libxmuu1 openssh-client publicsuffix xauth 2026-02-21T08:06:06.9720263Z 0 upgraded, 31 newly installed, 0 to remove and 86 not upgraded. 
2026-02-21T08:06:06.9720968Z Need to get 8886 kB of archives. 2026-02-21T08:06:06.9721381Z After this operation, 38.0 MB of additional disk space will be used. 2026-02-21T08:06:06.9722438Z Get:1 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 krb5-locales all 1.20.1-6ubuntu2.6 [14.8 kB] 2026-02-21T08:06:07.3153653Z Get:2 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 less amd64 590-2ubuntu2.1 [142 kB] 2026-02-21T08:06:07.7774995Z Get:3 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libbsd0 amd64 0.12.1-1build1.1 [41.2 kB] 2026-02-21T08:06:07.8465778Z Get:4 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libexpat1 amd64 2.6.1-2ubuntu0.4 [88.2 kB] 2026-02-21T08:06:07.9394552Z Get:5 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libkrb5support0 amd64 1.20.1-6ubuntu2.6 [34.4 kB] 2026-02-21T08:06:07.9682883Z Get:6 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libk5crypto3 amd64 1.20.1-6ubuntu2.6 [82.0 kB] 2026-02-21T08:06:08.0239354Z Get:7 http://archive.ubuntu.com/ubuntu noble/main amd64 libkeyutils1 amd64 1.6.3-3build1 [9490 B] 2026-02-21T08:06:08.0287320Z Get:8 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libkrb5-3 amd64 1.20.1-6ubuntu2.6 [348 kB] 2026-02-21T08:06:08.1843644Z Get:9 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libgssapi-krb5-2 amd64 1.20.1-6ubuntu2.6 [143 kB] 2026-02-21T08:06:08.2343739Z Get:10 http://archive.ubuntu.com/ubuntu noble/main amd64 libcbor0.10 amd64 0.10.2-1.2ubuntu2 [25.8 kB] 2026-02-21T08:06:08.2379714Z Get:11 http://archive.ubuntu.com/ubuntu noble/main amd64 libedit2 amd64 3.1-20230828-1build1 [97.6 kB] 2026-02-21T08:06:08.2676490Z Get:12 http://archive.ubuntu.com/ubuntu noble/main amd64 libfido2-1 amd64 1.14.0-1build3 [83.5 kB] 2026-02-21T08:06:08.2807567Z Get:13 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libnghttp2-14 amd64 1.59.0-1ubuntu0.2 [74.3 kB] 2026-02-21T08:06:08.2924571Z Get:14 http://archive.ubuntu.com/ubuntu noble/main 
amd64 libpsl5t64 amd64 0.21.2-1.1build1 [57.1 kB] 2026-02-21T08:06:08.3023523Z Get:15 http://archive.ubuntu.com/ubuntu noble/main amd64 libxau6 amd64 1:1.0.9-1build6 [7160 B] 2026-02-21T08:06:08.3035612Z Get:16 http://archive.ubuntu.com/ubuntu noble/main amd64 libxdmcp6 amd64 1:1.1.3-0ubuntu6 [10.3 kB] 2026-02-21T08:06:08.3068365Z Get:17 http://archive.ubuntu.com/ubuntu noble/main amd64 libxcb1 amd64 1.15-1ubuntu2 [47.7 kB] 2026-02-21T08:06:08.3524776Z Get:18 http://archive.ubuntu.com/ubuntu noble/main amd64 libx11-data all 2:1.8.7-1build1 [115 kB] 2026-02-21T08:06:08.4107278Z Get:19 http://archive.ubuntu.com/ubuntu noble/main amd64 libx11-6 amd64 2:1.8.7-1build1 [650 kB] 2026-02-21T08:06:08.4849108Z Get:20 http://archive.ubuntu.com/ubuntu noble/main amd64 libxext6 amd64 2:1.3.4-1build2 [30.4 kB] 2026-02-21T08:06:08.4970868Z Get:21 http://archive.ubuntu.com/ubuntu noble/main amd64 libxmuu1 amd64 2:1.1.3-3build2 [8958 B] 2026-02-21T08:06:08.4980376Z Get:22 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 openssh-client amd64 1:9.6p1-3ubuntu13.14 [906 kB] 2026-02-21T08:06:08.5920916Z Get:23 http://archive.ubuntu.com/ubuntu noble/main amd64 publicsuffix all 20231001.0357-0.1 [129 kB] 2026-02-21T08:06:08.6032445Z Get:24 http://archive.ubuntu.com/ubuntu noble/main amd64 xauth amd64 1:1.1.2-1build1 [25.6 kB] 2026-02-21T08:06:08.6060559Z Get:25 http://archive.ubuntu.com/ubuntu noble/main amd64 libbrotli1 amd64 1.1.0-2build2 [331 kB] 2026-02-21T08:06:08.6357989Z Get:26 http://archive.ubuntu.com/ubuntu noble/main amd64 librtmp1 amd64 2.4+20151223.gitfa8646d.1-2build7 [56.3 kB] 2026-02-21T08:06:08.6408388Z Get:27 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libssh-4 amd64 0.10.6-2ubuntu0.3 [190 kB] 2026-02-21T08:06:08.6568880Z Get:28 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libcurl3t64-gnutls amd64 8.5.0-2ubuntu10.6 [333 kB] 2026-02-21T08:06:08.6830526Z Get:29 http://archive.ubuntu.com/ubuntu noble/main amd64 liberror-perl all 0.17029-2 
[25.6 kB] 2026-02-21T08:06:08.6847321Z Get:30 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 git-man all 1:2.43.0-1ubuntu7.3 [1100 kB] 2026-02-21T08:06:08.7396742Z Get:31 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 git amd64 1:2.43.0-1ubuntu7.3 [3680 kB] 2026-02-21T08:06:08.9716914Z debconf: delaying package configuration, since apt-utils is not installed 2026-02-21T08:06:08.9923189Z Fetched 8886 kB in 2s (3934 kB/s) 2026-02-21T08:06:09.0055352Z Selecting previously unselected package krb5-locales. 2026-02-21T08:06:09.0074257Z (Reading database ... 2026-02-21T08:06:09.0078929Z (Reading database ... 5% 2026-02-21T08:06:09.0080735Z (Reading database ... 10% 2026-02-21T08:06:09.0081011Z (Reading database ... 15% 2026-02-21T08:06:09.0081197Z (Reading database ... 20% 2026-02-21T08:06:09.0081474Z (Reading database ... 25% 2026-02-21T08:06:09.0081707Z (Reading database ... 30% 2026-02-21T08:06:09.0086274Z (Reading database ... 35% 2026-02-21T08:06:09.0088813Z (Reading database ... 40% 2026-02-21T08:06:09.0089145Z (Reading database ... 45% 2026-02-21T08:06:09.0089351Z (Reading database ... 50% 2026-02-21T08:06:09.0089700Z (Reading database ... 55% 2026-02-21T08:06:09.0089895Z (Reading database ... 60% 2026-02-21T08:06:09.0090109Z (Reading database ... 65% 2026-02-21T08:06:09.0090304Z (Reading database ... 70% 2026-02-21T08:06:09.0090548Z (Reading database ... 75% 2026-02-21T08:06:09.0099908Z (Reading database ... 80% 2026-02-21T08:06:09.0101711Z (Reading database ... 85% 2026-02-21T08:06:09.0108548Z (Reading database ... 90% 2026-02-21T08:06:09.0116650Z (Reading database ... 95% 2026-02-21T08:06:09.0118559Z (Reading database ... 100% 2026-02-21T08:06:09.0118867Z (Reading database ... 15507 files and directories currently installed.) 2026-02-21T08:06:09.0119384Z Preparing to unpack .../00-krb5-locales_1.20.1-6ubuntu2.6_all.deb ... 2026-02-21T08:06:09.0129883Z Unpacking krb5-locales (1.20.1-6ubuntu2.6) ... 
2026-02-21T08:06:09.0278103Z Selecting previously unselected package less. 2026-02-21T08:06:09.0286577Z Preparing to unpack .../01-less_590-2ubuntu2.1_amd64.deb ... 2026-02-21T08:06:09.0305850Z Unpacking less (590-2ubuntu2.1) ... 2026-02-21T08:06:09.0461596Z Selecting previously unselected package libbsd0:amd64. 2026-02-21T08:06:09.0470415Z Preparing to unpack .../02-libbsd0_0.12.1-1build1.1_amd64.deb ... 2026-02-21T08:06:09.0488803Z Unpacking libbsd0:amd64 (0.12.1-1build1.1) ... 2026-02-21T08:06:09.0629183Z Selecting previously unselected package libexpat1:amd64. 2026-02-21T08:06:09.0636720Z Preparing to unpack .../03-libexpat1_2.6.1-2ubuntu0.4_amd64.deb ... 2026-02-21T08:06:09.0647265Z Unpacking libexpat1:amd64 (2.6.1-2ubuntu0.4) ... 2026-02-21T08:06:09.0789025Z Selecting previously unselected package libkrb5support0:amd64. 2026-02-21T08:06:09.0797332Z Preparing to unpack .../04-libkrb5support0_1.20.1-6ubuntu2.6_amd64.deb ... 2026-02-21T08:06:09.0807427Z Unpacking libkrb5support0:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:09.0942930Z Selecting previously unselected package libk5crypto3:amd64. 2026-02-21T08:06:09.0949473Z Preparing to unpack .../05-libk5crypto3_1.20.1-6ubuntu2.6_amd64.deb ... 2026-02-21T08:06:09.0961367Z Unpacking libk5crypto3:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:09.1097081Z Selecting previously unselected package libkeyutils1:amd64. 2026-02-21T08:06:09.1104102Z Preparing to unpack .../06-libkeyutils1_1.6.3-3build1_amd64.deb ... 2026-02-21T08:06:09.1111341Z Unpacking libkeyutils1:amd64 (1.6.3-3build1) ... 2026-02-21T08:06:09.1235646Z Selecting previously unselected package libkrb5-3:amd64. 2026-02-21T08:06:09.1243638Z Preparing to unpack .../07-libkrb5-3_1.20.1-6ubuntu2.6_amd64.deb ... 2026-02-21T08:06:09.1255235Z Unpacking libkrb5-3:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:09.1422641Z Selecting previously unselected package libgssapi-krb5-2:amd64. 
2026-02-21T08:06:09.1430279Z Preparing to unpack .../08-libgssapi-krb5-2_1.20.1-6ubuntu2.6_amd64.deb ... 2026-02-21T08:06:09.1440274Z Unpacking libgssapi-krb5-2:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:09.1583153Z Selecting previously unselected package libcbor0.10:amd64. 2026-02-21T08:06:09.1589448Z Preparing to unpack .../09-libcbor0.10_0.10.2-1.2ubuntu2_amd64.deb ... 2026-02-21T08:06:09.1598912Z Unpacking libcbor0.10:amd64 (0.10.2-1.2ubuntu2) ... 2026-02-21T08:06:09.1721952Z Selecting previously unselected package libedit2:amd64. 2026-02-21T08:06:09.1730425Z Preparing to unpack .../10-libedit2_3.1-20230828-1build1_amd64.deb ... 2026-02-21T08:06:09.1739574Z Unpacking libedit2:amd64 (3.1-20230828-1build1) ... 2026-02-21T08:06:09.1878029Z Selecting previously unselected package libfido2-1:amd64. 2026-02-21T08:06:09.1885345Z Preparing to unpack .../11-libfido2-1_1.14.0-1build3_amd64.deb ... 2026-02-21T08:06:09.1886070Z Unpacking libfido2-1:amd64 (1.14.0-1build3) ... 2026-02-21T08:06:09.2019500Z Selecting previously unselected package libnghttp2-14:amd64. 2026-02-21T08:06:09.2027268Z Preparing to unpack .../12-libnghttp2-14_1.59.0-1ubuntu0.2_amd64.deb ... 2026-02-21T08:06:09.2043392Z Unpacking libnghttp2-14:amd64 (1.59.0-1ubuntu0.2) ... 2026-02-21T08:06:09.2186850Z Selecting previously unselected package libpsl5t64:amd64. 2026-02-21T08:06:09.2195601Z Preparing to unpack .../13-libpsl5t64_0.21.2-1.1build1_amd64.deb ... 2026-02-21T08:06:09.2209773Z Unpacking libpsl5t64:amd64 (0.21.2-1.1build1) ... 2026-02-21T08:06:09.2335222Z Selecting previously unselected package libxau6:amd64. 2026-02-21T08:06:09.2342902Z Preparing to unpack .../14-libxau6_1%3a1.0.9-1build6_amd64.deb ... 2026-02-21T08:06:09.2353063Z Unpacking libxau6:amd64 (1:1.0.9-1build6) ... 2026-02-21T08:06:09.2466441Z Selecting previously unselected package libxdmcp6:amd64. 2026-02-21T08:06:09.2477128Z Preparing to unpack .../15-libxdmcp6_1%3a1.1.3-0ubuntu6_amd64.deb ... 
2026-02-21T08:06:09.2491500Z Unpacking libxdmcp6:amd64 (1:1.1.3-0ubuntu6) ... 2026-02-21T08:06:09.2625605Z Selecting previously unselected package libxcb1:amd64. 2026-02-21T08:06:09.2637059Z Preparing to unpack .../16-libxcb1_1.15-1ubuntu2_amd64.deb ... 2026-02-21T08:06:09.2646789Z Unpacking libxcb1:amd64 (1.15-1ubuntu2) ... 2026-02-21T08:06:09.2769993Z Selecting previously unselected package libx11-data. 2026-02-21T08:06:09.2782165Z Preparing to unpack .../17-libx11-data_2%3a1.8.7-1build1_all.deb ... 2026-02-21T08:06:09.2792556Z Unpacking libx11-data (2:1.8.7-1build1) ... 2026-02-21T08:06:09.3090406Z Selecting previously unselected package libx11-6:amd64. 2026-02-21T08:06:09.3101821Z Preparing to unpack .../18-libx11-6_2%3a1.8.7-1build1_amd64.deb ... 2026-02-21T08:06:09.3111278Z Unpacking libx11-6:amd64 (2:1.8.7-1build1) ... 2026-02-21T08:06:09.3296772Z Selecting previously unselected package libxext6:amd64. 2026-02-21T08:06:09.3308891Z Preparing to unpack .../19-libxext6_2%3a1.3.4-1build2_amd64.deb ... 2026-02-21T08:06:09.3318506Z Unpacking libxext6:amd64 (2:1.3.4-1build2) ... 2026-02-21T08:06:09.3446903Z Selecting previously unselected package libxmuu1:amd64. 2026-02-21T08:06:09.3458272Z Preparing to unpack .../20-libxmuu1_2%3a1.1.3-3build2_amd64.deb ... 2026-02-21T08:06:09.3468236Z Unpacking libxmuu1:amd64 (2:1.1.3-3build2) ... 2026-02-21T08:06:09.3608361Z Selecting previously unselected package openssh-client. 2026-02-21T08:06:09.3620610Z Preparing to unpack .../21-openssh-client_1%3a9.6p1-3ubuntu13.14_amd64.deb ... 2026-02-21T08:06:09.3677216Z Unpacking openssh-client (1:9.6p1-3ubuntu13.14) ... 2026-02-21T08:06:09.3973059Z Selecting previously unselected package publicsuffix. 2026-02-21T08:06:09.3980395Z Preparing to unpack .../22-publicsuffix_20231001.0357-0.1_all.deb ... 2026-02-21T08:06:09.3989600Z Unpacking publicsuffix (20231001.0357-0.1) ... 2026-02-21T08:06:09.4116488Z Selecting previously unselected package xauth. 
2026-02-21T08:06:09.4124568Z Preparing to unpack .../23-xauth_1%3a1.1.2-1build1_amd64.deb ... 2026-02-21T08:06:09.4133375Z Unpacking xauth (1:1.1.2-1build1) ... 2026-02-21T08:06:09.4260158Z Selecting previously unselected package libbrotli1:amd64. 2026-02-21T08:06:09.4266550Z Preparing to unpack .../24-libbrotli1_1.1.0-2build2_amd64.deb ... 2026-02-21T08:06:09.4277745Z Unpacking libbrotli1:amd64 (1.1.0-2build2) ... 2026-02-21T08:06:09.4425849Z Selecting previously unselected package librtmp1:amd64. 2026-02-21T08:06:09.4434267Z Preparing to unpack .../25-librtmp1_2.4+20151223.gitfa8646d.1-2build7_amd64.deb ... 2026-02-21T08:06:09.4444839Z Unpacking librtmp1:amd64 (2.4+20151223.gitfa8646d.1-2build7) ... 2026-02-21T08:06:09.4577054Z Selecting previously unselected package libssh-4:amd64. 2026-02-21T08:06:09.4585596Z Preparing to unpack .../26-libssh-4_0.10.6-2ubuntu0.3_amd64.deb ... 2026-02-21T08:06:09.4595036Z Unpacking libssh-4:amd64 (0.10.6-2ubuntu0.3) ... 2026-02-21T08:06:09.4739003Z Selecting previously unselected package libcurl3t64-gnutls:amd64. 2026-02-21T08:06:09.4749025Z Preparing to unpack .../27-libcurl3t64-gnutls_8.5.0-2ubuntu10.6_amd64.deb ... 2026-02-21T08:06:09.4757863Z Unpacking libcurl3t64-gnutls:amd64 (8.5.0-2ubuntu10.6) ... 2026-02-21T08:06:09.4904845Z Selecting previously unselected package liberror-perl. 2026-02-21T08:06:09.4914741Z Preparing to unpack .../28-liberror-perl_0.17029-2_all.deb ... 2026-02-21T08:06:09.4919774Z Unpacking liberror-perl (0.17029-2) ... 2026-02-21T08:06:09.5049869Z Selecting previously unselected package git-man. 2026-02-21T08:06:09.5057488Z Preparing to unpack .../29-git-man_1%3a2.43.0-1ubuntu7.3_all.deb ... 2026-02-21T08:06:09.5067147Z Unpacking git-man (1:2.43.0-1ubuntu7.3) ... 2026-02-21T08:06:09.5265233Z Selecting previously unselected package git. 2026-02-21T08:06:09.5267652Z Preparing to unpack .../30-git_1%3a2.43.0-1ubuntu7.3_amd64.deb ... 2026-02-21T08:06:09.5328433Z Unpacking git (1:2.43.0-1ubuntu7.3) ... 
2026-02-21T08:06:09.6510527Z Setting up libexpat1:amd64 (2.6.1-2ubuntu0.4) ... 2026-02-21T08:06:09.6546357Z Setting up libxau6:amd64 (1:1.0.9-1build6) ... 2026-02-21T08:06:09.6563363Z Setting up libkeyutils1:amd64 (1.6.3-3build1) ... 2026-02-21T08:06:09.6612604Z Setting up libcbor0.10:amd64 (0.10.2-1.2ubuntu2) ... 2026-02-21T08:06:09.6644258Z Setting up libbrotli1:amd64 (1.1.0-2build2) ... 2026-02-21T08:06:09.6665359Z Setting up libpsl5t64:amd64 (0.21.2-1.1build1) ... 2026-02-21T08:06:09.6726908Z Setting up libnghttp2-14:amd64 (1.59.0-1ubuntu0.2) ... 2026-02-21T08:06:09.6743008Z Setting up less (590-2ubuntu2.1) ... 2026-02-21T08:06:09.6824734Z Setting up krb5-locales (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:09.6872876Z Setting up libkrb5support0:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:09.6950287Z Setting up liberror-perl (0.17029-2) ... 2026-02-21T08:06:09.6969440Z Setting up libx11-data (2:1.8.7-1build1) ... 2026-02-21T08:06:09.6995074Z Setting up librtmp1:amd64 (2.4+20151223.gitfa8646d.1-2build7) ... 2026-02-21T08:06:09.7025756Z Setting up libk5crypto3:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:09.7059736Z Setting up git-man (1:2.43.0-1ubuntu7.3) ... 2026-02-21T08:06:09.7071327Z Setting up libkrb5-3:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:09.7080207Z Setting up libfido2-1:amd64 (1.14.0-1build3) ... 2026-02-21T08:06:09.7128228Z Setting up libbsd0:amd64 (0.12.1-1build1.1) ... 2026-02-21T08:06:09.7161499Z Setting up publicsuffix (20231001.0357-0.1) ... 2026-02-21T08:06:09.7175745Z Setting up libxdmcp6:amd64 (1:1.1.3-0ubuntu6) ... 2026-02-21T08:06:09.7195691Z Setting up libxcb1:amd64 (1.15-1ubuntu2) ... 2026-02-21T08:06:09.7254112Z Setting up libedit2:amd64 (3.1-20230828-1build1) ... 2026-02-21T08:06:09.7323351Z Setting up libgssapi-krb5-2:amd64 (1.20.1-6ubuntu2.6) ... 2026-02-21T08:06:09.7400996Z Setting up libssh-4:amd64 (0.10.6-2ubuntu0.3) ... 2026-02-21T08:06:09.7458860Z Setting up libx11-6:amd64 (2:1.8.7-1build1) ... 
2026-02-21T08:06:09.7527549Z Setting up libxmuu1:amd64 (2:1.1.3-3build2) ... 2026-02-21T08:06:09.7579733Z Setting up openssh-client (1:9.6p1-3ubuntu13.14) ... 2026-02-21T08:06:09.8138336Z Setting up libcurl3t64-gnutls:amd64 (8.5.0-2ubuntu10.6) ... 2026-02-21T08:06:09.8191227Z Setting up libxext6:amd64 (2:1.3.4-1build2) ... 2026-02-21T08:06:09.8215026Z Setting up git (1:2.43.0-1ubuntu7.3) ... 2026-02-21T08:06:09.8285070Z Setting up xauth (1:1.1.2-1build1) ... 2026-02-21T08:06:09.8304123Z Processing triggers for libc-bin (2.39-0ubuntu8.5) ... 2026-02-21T08:06:09.8690768Z ##[group]Run actions/checkout@v6 2026-02-21T08:06:09.8691044Z with: 2026-02-21T08:06:09.8691308Z repository: pytorch/helion 2026-02-21T08:06:09.8691714Z token: *** 2026-02-21T08:06:09.8692231Z ssh-strict: true 2026-02-21T08:06:09.8692473Z ssh-user: git 2026-02-21T08:06:09.8692704Z persist-credentials: true 2026-02-21T08:06:09.8692908Z clean: true 2026-02-21T08:06:09.8693168Z sparse-checkout-cone-mode: true 2026-02-21T08:06:09.8693391Z fetch-depth: 1 2026-02-21T08:06:09.8693606Z fetch-tags: false 2026-02-21T08:06:09.8693810Z show-progress: true 2026-02-21T08:06:09.8694205Z lfs: false 2026-02-21T08:06:09.8694427Z submodules: false 2026-02-21T08:06:09.8694621Z set-safe-directory: true 2026-02-21T08:06:09.8694866Z env: 2026-02-21T08:06:09.8695045Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:09.8695294Z ##[endgroup] 2026-02-21T08:06:09.8730208Z ##[command]/usr/bin/docker exec 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:06:10.0486316Z Syncing repository: pytorch/helion 2026-02-21T08:06:10.0487243Z ##[group]Getting Git version info 2026-02-21T08:06:10.0487593Z Working directory is '/__w/helion/helion' 2026-02-21T08:06:10.0488001Z [command]/usr/bin/git version 2026-02-21T08:06:10.0491608Z git version 2.43.0 2026-02-21T08:06:10.0508288Z ##[endgroup] 2026-02-21T08:06:10.0519414Z Temporarily overriding 
HOME='/__w/_temp/d4a90a64-5a5e-461b-8f77-058ba5cbd50e' before making global git config changes 2026-02-21T08:06:10.0519978Z Adding repository directory to the temporary git global config as a safe directory 2026-02-21T08:06:10.0528469Z [command]/usr/bin/git config --global --add safe.directory /__w/helion/helion 2026-02-21T08:06:10.0547487Z Deleting the contents of '/__w/helion/helion' 2026-02-21T08:06:10.0552625Z ##[group]Initializing the repository 2026-02-21T08:06:10.0556425Z [command]/usr/bin/git init /__w/helion/helion 2026-02-21T08:06:10.0580423Z hint: Using 'master' as the name for the initial branch. This default branch name 2026-02-21T08:06:10.0590371Z hint: is subject to change. To configure the initial branch name to use in all 2026-02-21T08:06:10.0590831Z hint: of your new repositories, which will suppress this warning, call: 2026-02-21T08:06:10.0591151Z hint: 2026-02-21T08:06:10.0591427Z hint: git config --global init.defaultBranch 2026-02-21T08:06:10.0591708Z hint: 2026-02-21T08:06:10.0591991Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and 2026-02-21T08:06:10.0592408Z hint: 'development'. 
The just-created branch can be renamed via this command: 2026-02-21T08:06:10.0592725Z hint: 2026-02-21T08:06:10.0592940Z hint: git branch -m 2026-02-21T08:06:10.0593263Z Initialized empty Git repository in /__w/helion/helion/.git/ 2026-02-21T08:06:10.0593923Z [command]/usr/bin/git remote add origin https://github.com/pytorch/helion 2026-02-21T08:06:10.0621997Z ##[endgroup] 2026-02-21T08:06:10.0622376Z ##[group]Disabling automatic garbage collection 2026-02-21T08:06:10.0626603Z [command]/usr/bin/git config --local gc.auto 0 2026-02-21T08:06:10.0654043Z ##[endgroup] 2026-02-21T08:06:10.0654368Z ##[group]Setting up auth 2026-02-21T08:06:10.0654671Z Removing SSH command configuration 2026-02-21T08:06:10.0659730Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2026-02-21T08:06:10.0687422Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2026-02-21T08:06:10.0929312Z Removing HTTP extra header 2026-02-21T08:06:10.0931906Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2026-02-21T08:06:10.0952983Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2026-02-21T08:06:10.1186609Z Removing includeIf entries pointing to credentials config files 2026-02-21T08:06:10.1187046Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2026-02-21T08:06:10.1211186Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2026-02-21T08:06:10.1436537Z [command]/usr/bin/git config --file /__w/_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config http.https://github.com/.extraheader 
AUTHORIZATION: basic *** 2026-02-21T08:06:10.1468617Z [command]/usr/bin/git config --local includeIf.gitdir:/__w/helion/helion/.git.path /__w/_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config 2026-02-21T08:06:10.1495797Z [command]/usr/bin/git config --local includeIf.gitdir:/__w/helion/helion/.git/worktrees/*.path /__w/_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config 2026-02-21T08:06:10.1522100Z [command]/usr/bin/git config --local includeIf.gitdir:/github/workspace/.git.path /github/runner_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config 2026-02-21T08:06:10.1548391Z [command]/usr/bin/git config --local includeIf.gitdir:/github/workspace/.git/worktrees/*.path /github/runner_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config 2026-02-21T08:06:10.1572982Z ##[endgroup] 2026-02-21T08:06:10.1573542Z ##[group]Fetching the repository 2026-02-21T08:06:10.1581253Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +874a7d0cadab18218a84ad3579d329dc95c51820:refs/remotes/origin/main 2026-02-21T08:06:10.7027859Z From https://github.com/pytorch/helion 2026-02-21T08:06:10.7028419Z * [new ref] 874a7d0cadab18218a84ad3579d329dc95c51820 -> origin/main 2026-02-21T08:06:10.7051106Z [command]/usr/bin/git branch --list --remote origin/main 2026-02-21T08:06:10.7075046Z origin/main 2026-02-21T08:06:10.7080151Z [command]/usr/bin/git rev-parse refs/remotes/origin/main 2026-02-21T08:06:10.7099788Z 874a7d0cadab18218a84ad3579d329dc95c51820 2026-02-21T08:06:10.7106467Z ##[endgroup] 2026-02-21T08:06:10.7106904Z ##[group]Determining the checkout info 2026-02-21T08:06:10.7107310Z ##[endgroup] 2026-02-21T08:06:10.7107620Z [command]/usr/bin/git sparse-checkout disable 2026-02-21T08:06:10.7137003Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig 2026-02-21T08:06:10.7158161Z ##[group]Checking out the ref 2026-02-21T08:06:10.7158722Z [command]/usr/bin/git 
checkout --progress --force -B main refs/remotes/origin/main 2026-02-21T08:06:10.7381140Z Switched to a new branch 'main' 2026-02-21T08:06:10.7383307Z branch 'main' set up to track 'origin/main'. 2026-02-21T08:06:10.7393489Z ##[endgroup] 2026-02-21T08:06:10.7418705Z [command]/usr/bin/git log -1 --format=%H 2026-02-21T08:06:10.7440129Z 874a7d0cadab18218a84ad3579d329dc95c51820 2026-02-21T08:06:10.7620283Z ##[group]Run actions/setup-python@v6 2026-02-21T08:06:10.7620550Z with: 2026-02-21T08:06:10.7620795Z python-version: 3.12 2026-02-21T08:06:10.7621001Z check-latest: false 2026-02-21T08:06:10.7621347Z token: *** 2026-02-21T08:06:10.7621563Z update-environment: true 2026-02-21T08:06:10.7621822Z allow-prereleases: false 2026-02-21T08:06:10.7622350Z freethreaded: false 2026-02-21T08:06:10.7622599Z env: 2026-02-21T08:06:10.7622889Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:10.7623148Z ##[endgroup] 2026-02-21T08:06:10.7627502Z ##[command]/usr/bin/docker exec 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:06:10.9753142Z ##[group]Installed versions 2026-02-21T08:06:10.9765799Z Version 3.12 was not found in the local cache 2026-02-21T08:06:11.5393127Z Version 3.12 is available for downloading 2026-02-21T08:06:11.5393726Z Download from "https://github.com/actions/python-versions/releases/download/3.12.12-18393146713/python-3.12.12-linux-24.04-x64.tar.gz" 2026-02-21T08:06:12.1030930Z Extract downloaded archive 2026-02-21T08:06:12.1136041Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /__w/_temp/d62fbf56-d8ad-4d0d-af37-a0efc99c2fd5 -f /__w/_temp/e5950131-de28-49d4-8940-531b92dab25a 2026-02-21T08:06:13.9560102Z Execute installation script 2026-02-21T08:06:13.9655991Z Check if Python hostedtoolcache folder exist... 2026-02-21T08:06:13.9656468Z Creating Python hostedtoolcache folder... 
2026-02-21T08:06:13.9665504Z Create Python 3.12.12 folder 2026-02-21T08:06:13.9675718Z Copy Python binaries to hostedtoolcache folder 2026-02-21T08:06:14.2252656Z Create additional symlinks (Required for the UsePythonVersion Azure Pipelines task and the setup-python GitHub Action) 2026-02-21T08:06:14.2291126Z Upgrading pip... 2026-02-21T08:06:15.9003864Z Looking in links: /tmp/tmp4xt3whd9 2026-02-21T08:06:15.9005741Z Requirement already satisfied: pip in /__w/_tool/Python/3.12.12/x64/lib/python3.12/site-packages (25.0.1) 2026-02-21T08:06:15.9044470Z ##[error]WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning. 2026-02-21T08:06:16.5040699Z ##[error]WARNING: The directory '/github/home/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag. 
2026-02-21T08:06:16.6650737Z Collecting pip 2026-02-21T08:06:16.7748270Z Downloading pip-26.0.1-py3-none-any.whl.metadata (4.7 kB) 2026-02-21T08:06:16.7795140Z Downloading pip-26.0.1-py3-none-any.whl (1.8 MB) 2026-02-21T08:06:16.8153552Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 88.8 MB/s eta 0:00:00 2026-02-21T08:06:16.8154998Z 2026-02-21T08:06:16.8239742Z Installing collected packages: pip 2026-02-21T08:06:16.8243784Z Attempting uninstall: pip 2026-02-21T08:06:16.8252982Z Found existing installation: pip 25.0.1 2026-02-21T08:06:16.8425623Z Uninstalling pip-25.0.1: 2026-02-21T08:06:16.8459474Z Successfully uninstalled pip-25.0.1 2026-02-21T08:06:17.4624620Z Successfully installed pip-26.0.1 2026-02-21T08:06:17.5142795Z Create complete file 2026-02-21T08:06:17.5185301Z Successfully set up CPython (3.12.12) 2026-02-21T08:06:17.5186977Z ##[endgroup] 2026-02-21T08:06:17.5392054Z ##[group]Run astral-sh/setup-uv@v7 2026-02-21T08:06:17.5392359Z with: 2026-02-21T08:06:17.5392556Z activate-environment: false 2026-02-21T08:06:17.5392841Z working-directory: /home/eve/_work/helion/helion 2026-02-21T08:06:17.5393278Z github-token: *** 2026-02-21T08:06:17.5393473Z enable-cache: auto 2026-02-21T08:06:17.5393926Z cache-dependency-glob: **/*requirements*.txt **/*requirements*.in **/*constraints*.txt **/*constraints*.in **/pyproject.toml **/uv.lock **/*.py.lock 2026-02-21T08:06:17.5394424Z restore-cache: true 2026-02-21T08:06:17.5394654Z save-cache: true 2026-02-21T08:06:17.5394868Z prune-cache: true 2026-02-21T08:06:17.5395085Z cache-python: false 2026-02-21T08:06:17.5395336Z ignore-nothing-to-cache: false 2026-02-21T08:06:17.5395562Z ignore-empty-workdir: false 2026-02-21T08:06:17.5395832Z add-problem-matchers: true 2026-02-21T08:06:17.5396048Z resolution-strategy: highest 2026-02-21T08:06:17.5396299Z env: 2026-02-21T08:06:17.5396581Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:17.5396834Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:17.5397138Z 
PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:17.5397492Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:17.5397799Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:17.5398045Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:17.5398517Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:17.5398909Z ##[endgroup] 2026-02-21T08:06:17.5405091Z ##[command]/usr/bin/docker exec 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:06:17.7586923Z (node:802) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead. 2026-02-21T08:06:17.7591350Z (Use `node --trace-deprecation ...` to show where the warning was created) 2026-02-21T08:06:17.7666079Z Trying to find version for uv in: /__w/helion/helion/uv.toml 2026-02-21T08:06:17.7670809Z Could not find file: /__w/helion/helion/uv.toml 2026-02-21T08:06:17.7674890Z Trying to find version for uv in: /__w/helion/helion/pyproject.toml 2026-02-21T08:06:17.7675854Z Could not determine uv version from uv.toml or pyproject.toml. Falling back to latest. 2026-02-21T08:06:17.7676430Z Getting latest version from GitHub API... 2026-02-21T08:06:17.9994286Z manifest-file not provided, reading from local file. 2026-02-21T08:06:18.0036840Z manifest-file does not contain version 0.10.4, arch x86_64, platform unknown-linux-gnu. Falling back to GitHub releases. 2026-02-21T08:06:18.0038096Z Downloading uv from "https://github.com/astral-sh/uv/releases/download/0.10.4/uv-x86_64-unknown-linux-gnu.tar.gz" ... 
2026-02-21T08:06:18.2892404Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /__w/_temp/64c5fce0-958c-4688-96ee-c42a87209806 -f /__w/_temp/a48647e9-acec-4d1e-9dfb-f63693d86f7a 2026-02-21T08:06:18.7027267Z Added /github/home/.local/bin to the path 2026-02-21T08:06:18.7027773Z Added /__w/_tool/uv/0.10.4/x86_64 to the path 2026-02-21T08:06:18.7028117Z Set UV_PYTHON_INSTALL_DIR to /github/home/.local/share/uv/python 2026-02-21T08:06:18.7028448Z Added /github/home/.local/share/uv/python to the path 2026-02-21T08:06:18.7037577Z Successfully installed uv version 0.10.4 2026-02-21T08:06:18.8429873Z ##[group]Run uv venv --python 3.12 2026-02-21T08:06:18.8430185Z uv venv --python 3.12 2026-02-21T08:06:18.8430549Z shell: bash -l {0} 2026-02-21T08:06:18.8430766Z env: 2026-02-21T08:06:18.8430981Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:18.8431274Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:18.8431611Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:18.8431990Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:18.8432314Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:18.8432575Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:18.8433030Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:18.8433525Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:06:18.8433790Z ##[endgroup] 2026-02-21T08:06:18.9687321Z Using CPython 3.12.12 interpreter at: /__w/_tool/Python/3.12.12/x64/bin/python3.12 2026-02-21T08:06:18.9687837Z Creating virtual environment at: .venv 2026-02-21T08:06:18.9692156Z Activate with: source .venv/bin/activate 2026-02-21T08:06:18.9752008Z ##[group]Run source .venv/bin/activate 2026-02-21T08:06:18.9752319Z source .venv/bin/activate 2026-02-21T08:06:18.9752744Z uv pip install -U "torch==2.9.*" --index-url 
https://download.pytorch.org/whl/cu130 2026-02-21T08:06:18.9753255Z shell: bash -l {0} 2026-02-21T08:06:18.9753452Z env: 2026-02-21T08:06:18.9753717Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:18.9754012Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:18.9754338Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:18.9754668Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:18.9754947Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:18.9755232Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:18.9755626Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:18.9756186Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:06:18.9756432Z ##[endgroup] 2026-02-21T08:06:19.7425218Z Resolved 26 packages in 677ms 2026-02-21T08:06:19.7455352Z Downloading networkx (2.0MiB) 2026-02-21T08:06:19.7455718Z Downloading sympy (6.0MiB) 2026-02-21T08:06:19.7516726Z Downloading torch (584.2MiB) 2026-02-21T08:06:19.7517053Z Downloading triton (162.6MiB) 2026-02-21T08:06:19.7619112Z Downloading nvidia-cuda-cupti (10.2MiB) 2026-02-21T08:06:19.7663527Z Downloading nvidia-cufft (204.2MiB) 2026-02-21T08:06:19.7815632Z Downloading nvidia-cuda-runtime (2.1MiB) 2026-02-21T08:06:19.7816127Z Downloading nvidia-cusolver (184.5MiB) 2026-02-21T08:06:19.7834810Z Downloading nvidia-curand (56.8MiB) 2026-02-21T08:06:19.7899546Z Downloading nvidia-cusparse (133.8MiB) 2026-02-21T08:06:19.7991568Z Downloading nvidia-nvshmem-cu13 (57.6MiB) 2026-02-21T08:06:19.8055341Z Downloading nvidia-cusparselt-cu13 (162.0MiB) 2026-02-21T08:06:19.8153455Z Downloading nvidia-cufile (1.2MiB) 2026-02-21T08:06:19.8254350Z Downloading nvidia-cuda-nvrtc (86.0MiB) 2026-02-21T08:06:19.8296238Z Downloading nvidia-nvjitlink (38.8MiB) 2026-02-21T08:06:19.8447679Z Downloading nvidia-cudnn-cu13 (332.4MiB) 2026-02-21T08:06:19.8457183Z 
Downloading nvidia-cublas (400.0MiB) 2026-02-21T08:06:19.8511738Z Downloading nvidia-nccl-cu13 (184.9MiB) 2026-02-21T08:06:20.1887214Z Downloaded nvidia-cufile 2026-02-21T08:06:20.3823310Z Downloaded nvidia-cuda-runtime 2026-02-21T08:06:20.7340746Z Downloaded networkx 2026-02-21T08:06:21.2139641Z Downloaded nvidia-cuda-cupti 2026-02-21T08:06:22.1827692Z Downloaded sympy 2026-02-21T08:06:22.8033537Z Downloaded triton 2026-02-21T08:06:24.1800704Z Downloaded nvidia-nvjitlink 2026-02-21T08:06:24.8383938Z Downloaded nvidia-curand 2026-02-21T08:06:24.8568612Z Downloaded nvidia-nvshmem-cu13 2026-02-21T08:06:26.4547457Z Downloaded nvidia-cuda-nvrtc 2026-02-21T08:06:27.2401085Z Downloaded nvidia-cusparse 2026-02-21T08:06:27.4027116Z Downloaded nvidia-cufft 2026-02-21T08:06:27.5774501Z Downloaded nvidia-cusolver 2026-02-21T08:06:27.7892806Z Downloaded nvidia-cusparselt-cu13 2026-02-21T08:06:28.0327616Z Downloaded nvidia-nccl-cu13 2026-02-21T08:06:29.2162216Z Downloaded nvidia-cudnn-cu13 2026-02-21T08:06:29.6663905Z Downloaded nvidia-cublas 2026-02-21T08:06:33.0426751Z Downloaded torch 2026-02-21T08:06:33.0431426Z Prepared 26 packages in 13.30s 2026-02-21T08:06:33.0465728Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:06:33.0466296Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:06:33.0466997Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:06:33.8675455Z Installed 26 packages in 823ms 2026-02-21T08:06:33.8677974Z + filelock==3.20.0 2026-02-21T08:06:33.8678297Z + fsspec==2025.12.0 2026-02-21T08:06:33.8678488Z + jinja2==3.1.6 2026-02-21T08:06:33.8678874Z + markupsafe==3.0.2 2026-02-21T08:06:33.8679401Z + mpmath==1.3.0 2026-02-21T08:06:33.8679619Z + networkx==3.6.1 2026-02-21T08:06:33.8679881Z + nvidia-cublas==13.0.0.19 2026-02-21T08:06:33.8680145Z + nvidia-cuda-cupti==13.0.48 2026-02-21T08:06:33.8680358Z + nvidia-cuda-nvrtc==13.0.48 2026-02-21T08:06:33.8680629Z + nvidia-cuda-runtime==13.0.48 2026-02-21T08:06:33.8680874Z + nvidia-cudnn-cu13==9.13.0.50 2026-02-21T08:06:33.8681091Z + nvidia-cufft==12.0.0.15 2026-02-21T08:06:33.8681344Z + nvidia-cufile==1.15.0.42 2026-02-21T08:06:33.8681542Z + nvidia-curand==10.4.0.35 2026-02-21T08:06:33.8681779Z + nvidia-cusolver==12.0.3.29 2026-02-21T08:06:33.8682212Z + nvidia-cusparse==12.6.2.49 2026-02-21T08:06:33.8682493Z + nvidia-cusparselt-cu13==0.8.0 2026-02-21T08:06:33.8682730Z + nvidia-nccl-cu13==2.27.7 2026-02-21T08:06:33.8682979Z + nvidia-nvjitlink==13.0.39 2026-02-21T08:06:33.8683242Z + nvidia-nvshmem-cu13==3.3.24 2026-02-21T08:06:33.8683458Z + nvidia-nvtx==13.0.39 2026-02-21T08:06:33.8683685Z + setuptools==70.2.0 2026-02-21T08:06:33.8683885Z + sympy==1.14.0 2026-02-21T08:06:33.8684108Z + torch==2.9.1+cu130 2026-02-21T08:06:33.8684275Z + triton==3.5.1 2026-02-21T08:06:33.8684526Z + typing-extensions==4.15.0 2026-02-21T08:06:33.8794968Z ##[group]Run source .venv/bin/activate 2026-02-21T08:06:33.8795289Z source .venv/bin/activate 2026-02-21T08:06:33.8795596Z SETUPTOOLS_SCM_PRETEND_VERSION="0.0.0" uv pip install -e .'[dev]' 2026-02-21T08:06:33.8795983Z python -c "import helion; print(helion.__name__)" 2026-02-21T08:06:33.8796445Z shell: bash -l {0} 2026-02-21T08:06:33.8796661Z env: 2026-02-21T08:06:33.8796888Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:33.8797282Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:33.8797609Z 
PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:33.8797904Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:33.8798200Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:33.8798456Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:33.8799012Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:33.8799444Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:06:33.8799748Z ##[endgroup] 2026-02-21T08:06:34.9777784Z Resolved 30 packages in 996ms 2026-02-21T08:06:34.9785668Z Building helion @ file:///__w/helion/helion 2026-02-21T08:06:34.9838804Z Downloading pygments (1.2MiB) 2026-02-21T08:06:34.9847993Z Downloading scikit-learn (8.5MiB) 2026-02-21T08:06:34.9848330Z Downloading virtualenv (5.6MiB) 2026-02-21T08:06:34.9992904Z Downloading numpy (15.8MiB) 2026-02-21T08:06:35.0029667Z Downloading scipy (33.4MiB) 2026-02-21T08:06:35.1469628Z Built helion @ file:///__w/helion/helion 2026-02-21T08:06:35.1711417Z Downloaded pygments 2026-02-21T08:06:35.1782854Z Downloaded virtualenv 2026-02-21T08:06:35.5851424Z Downloaded scikit-learn 2026-02-21T08:06:35.6512659Z Downloaded numpy 2026-02-21T08:06:36.0021729Z Downloaded scipy 2026-02-21T08:06:36.0029372Z Prepared 27 packages in 1.02s 2026-02-21T08:06:36.0036547Z Uninstalled 1 package in 0.63ms 2026-02-21T08:06:36.0037101Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:06:36.0037759Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:06:36.0038391Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:06:36.1024108Z Installed 29 packages in 98ms 2026-02-21T08:06:36.1025866Z + cfgv==3.5.0 2026-02-21T08:06:36.1026095Z + distlib==0.4.0 2026-02-21T08:06:36.1026312Z + expecttest==0.3.0 2026-02-21T08:06:36.1026587Z + filecheck==1.0.3 2026-02-21T08:06:36.1031795Z - filelock==3.20.0 2026-02-21T08:06:36.1035642Z + filelock==3.24.3 2026-02-21T08:06:36.1037428Z + helion==0.0.0 (from file:///__w/helion/helion) 2026-02-21T08:06:36.1037753Z + hypothesis==6.151.9 2026-02-21T08:06:36.1037934Z + identify==2.6.16 2026-02-21T08:06:36.1038193Z + iniconfig==2.3.0 2026-02-21T08:06:36.1038409Z + joblib==1.5.3 2026-02-21T08:06:36.1038617Z + markdown-it-py==4.0.0 2026-02-21T08:06:36.1038853Z + mdurl==0.1.2 2026-02-21T08:06:36.1039073Z + nodeenv==1.10.0 2026-02-21T08:06:36.1039252Z + numpy==2.4.2 2026-02-21T08:06:36.1039480Z + packaging==26.0 2026-02-21T08:06:36.1039672Z + platformdirs==4.9.2 2026-02-21T08:06:36.1039906Z + pluggy==1.6.0 2026-02-21T08:06:36.1040138Z + pre-commit==4.5.1 2026-02-21T08:06:36.1040317Z + psutil==7.2.2 2026-02-21T08:06:36.1040525Z + pygments==2.19.2 2026-02-21T08:06:36.1040717Z + pytest==9.0.2 2026-02-21T08:06:36.1040944Z + pytest-timeout==2.4.0 2026-02-21T08:06:36.1041135Z + pyyaml==6.0.3 2026-02-21T08:06:36.1041347Z + rich==14.3.3 2026-02-21T08:06:36.1041531Z + scikit-learn==1.8.0 2026-02-21T08:06:36.1041744Z + scipy==1.17.0 2026-02-21T08:06:36.1042036Z + sortedcontainers==2.4.0 2026-02-21T08:06:36.1042250Z + threadpoolctl==3.6.0 2026-02-21T08:06:36.1042474Z + virtualenv==20.38.0 2026-02-21T08:06:47.3830423Z helion 2026-02-21T08:06:48.0677409Z ##[group]Run set -x 2026-02-21T08:06:48.0677651Z set -x 2026-02-21T08:06:48.0677891Z source .venv/bin/activate 2026-02-21T08:06:48.0678117Z uv pip install pip 2026-02-21T08:06:48.0678388Z uv pip install quack-kernels --no-deps 2026-02-21T08:06:48.0678702Z mkdir -p benchmarks/ && pushd benchmarks/ 2026-02-21T08:06:48.0679010Z git clone https://github.com/pytorch-labs/tritonbench/ 
2026-02-21T08:06:48.0679331Z pushd tritonbench/ 2026-02-21T08:06:48.0679708Z git submodule update --init --recursive 2026-02-21T08:06:48.0680020Z uv pip install -r requirements.txt 2026-02-21T08:06:48.0680261Z python install.py --liger 2026-02-21T08:06:48.0680539Z uv pip install -e . --no-deps 2026-02-21T08:06:48.0680812Z popd 2026-02-21T08:06:48.0681000Z popd 2026-02-21T08:06:48.0681335Z shell: bash -l {0} 2026-02-21T08:06:48.0681548Z env: 2026-02-21T08:06:48.0681836Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:06:48.0682158Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:48.0682492Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:06:48.0682780Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:48.0683099Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:48.0683396Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:06:48.0683806Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:06:48.0684297Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:06:48.0684563Z ##[endgroup] 2026-02-21T08:06:48.2204564Z + source .venv/bin/activate 2026-02-21T08:06:48.2206124Z ++ '[' -z '' ']' 2026-02-21T08:06:48.2206318Z ++ '[' -n x ']' 2026-02-21T08:06:48.2206578Z ++ SCRIPT_PATH=.venv/bin/activate 2026-02-21T08:06:48.2206942Z ++ '[' .venv/bin/activate = /__w/_temp/cca8e86e-7f80-40d3-a870-a3aad5960484.sh ']' 2026-02-21T08:06:48.2207295Z ++ deactivate nondestructive 2026-02-21T08:06:48.2207491Z ++ unset -f pydoc 2026-02-21T08:06:48.2207738Z ++ '[' -z '' ']' 2026-02-21T08:06:48.2207913Z ++ '[' -z '' ']' 2026-02-21T08:06:48.2208102Z ++ hash -r 2026-02-21T08:06:48.2208350Z ++ '[' -z '' ']' 2026-02-21T08:06:48.2208528Z ++ unset VIRTUAL_ENV 2026-02-21T08:06:48.2208745Z ++ unset VIRTUAL_ENV_PROMPT 2026-02-21T08:06:48.2209026Z ++ '[' '!' 
nondestructive = nondestructive ']' 2026-02-21T08:06:48.2209309Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv 2026-02-21T08:06:48.2209624Z ++ '[' linux-gnu = cygwin ']' 2026-02-21T08:06:48.2209845Z ++ '[' linux-gnu = msys ']' 2026-02-21T08:06:48.2210074Z ++ export VIRTUAL_ENV 2026-02-21T08:06:48.2210251Z ++ '[' -z '' ']' 2026-02-21T08:06:48.2210510Z ++ unset SCRIPT_PATH 2026-02-21T08:06:48.2211169Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:06:48.2212404Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:06:48.2213146Z ++ export PATH 2026-02-21T08:06:48.2213335Z ++ '[' xhelion '!=' x ']' 2026-02-21T08:06:48.2213578Z ++ VIRTUAL_ENV_PROMPT=helion 2026-02-21T08:06:48.2213819Z ++ export VIRTUAL_ENV_PROMPT 2026-02-21T08:06:48.2214054Z ++ '[' -z '' ']' 2026-02-21T08:06:48.2214230Z ++ '[' -z '' ']' 2026-02-21T08:06:48.2214451Z ++ _OLD_VIRTUAL_PS1= 2026-02-21T08:06:48.2214667Z ++ PS1='(helion) ' 2026-02-21T08:06:48.2214855Z ++ export PS1 2026-02-21T08:06:48.2215074Z ++ alias pydoc 2026-02-21T08:06:48.2215243Z ++ true 2026-02-21T08:06:48.2215448Z ++ hash -r 2026-02-21T08:06:48.2215984Z + uv pip install pip 2026-02-21T08:06:48.2541632Z Resolved 1 package in 24ms 2026-02-21T08:06:48.2592780Z Downloading pip (1.7MiB) 2026-02-21T08:06:48.3065572Z Downloaded pip 2026-02-21T08:06:48.3067111Z Prepared 1 package in 52ms 2026-02-21T08:06:48.3105218Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 
2026-02-21T08:06:48.3106222Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:06:48.3107279Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T08:06:48.3294480Z Installed 1 package in 22ms 2026-02-21T08:06:48.3294937Z + pip==26.0.1 2026-02-21T08:06:48.3330394Z + uv pip install quack-kernels --no-deps 2026-02-21T08:06:48.6477110Z Resolved 1 package in 304ms 2026-02-21T08:06:48.7483846Z Prepared 1 package in 100ms 2026-02-21T08:06:48.7512066Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:06:48.7512842Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:06:48.7513506Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T08:06:48.8735399Z Installed 1 package in 124ms 2026-02-21T08:06:48.8738471Z + quack-kernels==0.2.10 2026-02-21T08:06:48.8764043Z + mkdir -p benchmarks/ 2026-02-21T08:06:48.8772308Z + pushd benchmarks/ 2026-02-21T08:06:48.8772753Z + git clone https://github.com/pytorch-labs/tritonbench/ 2026-02-21T08:06:48.8773148Z /__w/helion/helion/benchmarks /__w/helion/helion 2026-02-21T08:06:48.8785986Z Cloning into 'tritonbench'... 
2026-02-21T08:06:50.7102251Z + pushd tritonbench/ 2026-02-21T08:06:50.7102745Z /__w/helion/helion/benchmarks/tritonbench /__w/helion/helion/benchmarks /__w/helion/helion 2026-02-21T08:06:50.7103340Z + git submodule update --init --recursive 2026-02-21T08:06:50.8431387Z Submodule 'submodules/ThunderKittens' (https://github.com/HazyResearch/ThunderKittens.git) registered for path 'submodules/ThunderKittens' 2026-02-21T08:06:50.8432748Z Submodule 'submodules/aiter' (https://github.com/ROCm/aiter.git) registered for path 'submodules/aiter' 2026-02-21T08:06:50.8439161Z Submodule 'submodules/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/cutlass' 2026-02-21T08:06:50.9193504Z Submodule 'submodules/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'submodules/flash-attention' 2026-02-21T08:06:50.9198088Z Submodule 'submodules/generative-recommenders' (https://github.com/facebookresearch/generative-recommenders.git) registered for path 'submodules/generative-recommenders' 2026-02-21T08:06:50.9232344Z Submodule 'submodules/xformers' (https://github.com/facebookresearch/xformers.git) registered for path 'submodules/xformers' 2026-02-21T08:06:50.9254729Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/ThunderKittens'... 2026-02-21T08:06:54.2042365Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/aiter'... 2026-02-21T08:07:06.9085770Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/cutlass'... 2026-02-21T08:07:11.4233107Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention'... 2026-02-21T08:07:12.3114214Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/generative-recommenders'... 2026-02-21T08:07:12.7746618Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers'... 
2026-02-21T08:07:14.3082408Z Submodule path 'submodules/ThunderKittens': checked out '25f7568450b412a1984a4f619fb28373df06fa1b' 2026-02-21T08:07:14.6208495Z Submodule path 'submodules/aiter': checked out '1f5b378dcc9d9b0bcd9456c8c767b7424a5e8190' 2026-02-21T08:07:14.8557980Z Submodule '3rdparty/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/aiter/3rdparty/composable_kernel' 2026-02-21T08:07:14.8582242Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/aiter/3rdparty/composable_kernel'... 2026-02-21T08:07:19.2397598Z Submodule path 'submodules/aiter/3rdparty/composable_kernel': checked out 'e31a7a4f29b371c32ea9daf9211b6ae1fed2fa40' 2026-02-21T08:07:19.7049731Z Submodule path 'submodules/cutlass': checked out 'ad7b2f5e84fcfa124cb02b91d5bd26d238c0459e' 2026-02-21T08:07:19.7866866Z Submodule path 'submodules/flash-attention': checked out '43375aab2893018dfb7950db1cfa623c14946ad6' 2026-02-21T08:07:19.7884155Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/flash-attention/csrc/composable_kernel' 2026-02-21T08:07:19.7886159Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/flash-attention/csrc/cutlass' 2026-02-21T08:07:19.7910847Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention/csrc/composable_kernel'... 2026-02-21T08:07:23.8118806Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention/csrc/cutlass'... 
2026-02-21T08:07:27.8667275Z Submodule path 'submodules/flash-attention/csrc/composable_kernel': checked out 'e8709c24f403173ad21a2da907d1347957e324fb' 2026-02-21T08:07:28.3451038Z Submodule path 'submodules/flash-attention/csrc/cutlass': checked out 'b1d6e2c9b334dfa811e4183dfbd02419249e4b52' 2026-02-21T08:07:28.3711815Z Submodule path 'submodules/generative-recommenders': checked out '88512dbd71b053226bc4ef8ec1630e3db53e55e5' 2026-02-21T08:07:28.3727629Z Submodule 'generative_recommenders/ops/cpp/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass' 2026-02-21T08:07:28.3749207Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass'... 2026-02-21T08:07:32.7798311Z Submodule path 'submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass': checked out 'dc4817921edda44a549197ff3a9dcf5df0636e7b' 2026-02-21T08:07:32.9380397Z Submodule path 'submodules/xformers': checked out '8fc8ec5a4d6498ff81c0c418b89bbaf133ae3a44' 2026-02-21T08:07:32.9395740Z Submodule 'third_party/composable_kernel_tiled' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/xformers/third_party/composable_kernel_tiled' 2026-02-21T08:07:32.9396553Z Submodule 'third_party/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/xformers/third_party/cutlass' 2026-02-21T08:07:32.9398714Z Submodule 'third_party/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'submodules/xformers/third_party/flash-attention' 2026-02-21T08:07:32.9423977Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/composable_kernel_tiled'... 2026-02-21T08:07:37.1043334Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/cutlass'... 
2026-02-21T08:07:40.6255788Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention'... 2026-02-21T08:07:41.6066837Z Submodule path 'submodules/xformers/third_party/composable_kernel_tiled': checked out '4f54fa30583704f34da2ac50372d524cae6bad7d' 2026-02-21T08:07:42.0968079Z Submodule path 'submodules/xformers/third_party/cutlass': checked out 'e9627ce55b42fd2599f58cd4396da9380954def0' 2026-02-21T08:07:42.1509740Z Submodule path 'submodules/xformers/third_party/flash-attention': checked out '979702c87a8713a8e0a5e9fee122b90d2ef13be5' 2026-02-21T08:07:42.1529265Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/xformers/third_party/flash-attention/csrc/composable_kernel' 2026-02-21T08:07:42.1534068Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/xformers/third_party/flash-attention/csrc/cutlass' 2026-02-21T08:07:42.1555184Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention/csrc/composable_kernel'... 2026-02-21T08:07:47.1167634Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention/csrc/cutlass'... 
2026-02-21T08:07:51.3367719Z Submodule path 'submodules/xformers/third_party/flash-attention/csrc/composable_kernel': checked out '888317e698e9803c62bd38568abc9e05d7709f33' 2026-02-21T08:07:51.8245317Z Submodule path 'submodules/xformers/third_party/flash-attention/csrc/cutlass': checked out 'c506e16788cb08416a4a57e11a9067beeee29420' 2026-02-21T08:07:51.8287801Z + uv pip install -r requirements.txt 2026-02-21T08:07:51.8358686Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:07:51.9722735Z Resolved 30 packages in 135ms 2026-02-21T08:07:51.9779784Z Downloading fonttools (4.7MiB) 2026-02-21T08:07:51.9780008Z Downloading hf-xet (3.2MiB) 2026-02-21T08:07:51.9782280Z Downloading matplotlib (8.3MiB) 2026-02-21T08:07:51.9789336Z Downloading kiwisolver (1.4MiB) 2026-02-21T08:07:51.9789721Z Downloading transformers (10.3MiB) 2026-02-21T08:07:51.9805690Z Downloading tokenizers (3.0MiB) 2026-02-21T08:07:51.9805997Z Downloading pillow (6.7MiB) 2026-02-21T08:07:52.1314544Z Downloaded kiwisolver 2026-02-21T08:07:52.1940728Z Downloaded tokenizers 2026-02-21T08:07:52.1973595Z Downloaded hf-xet 2026-02-21T08:07:52.3719922Z Downloaded pillow 2026-02-21T08:07:52.3969043Z Downloaded fonttools 2026-02-21T08:07:52.5168300Z Downloaded matplotlib 2026-02-21T08:07:53.0360523Z Downloaded transformers 2026-02-21T08:07:53.0360951Z Prepared 23 packages in 1.06s 2026-02-21T08:07:53.0400588Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:07:53.0401400Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:07:53.0402491Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:07:53.1056476Z Installed 23 packages in 69ms 2026-02-21T08:07:53.1056689Z + certifi==2026.1.4 2026-02-21T08:07:53.1056869Z + charset-normalizer==3.4.4 2026-02-21T08:07:53.1057040Z + contourpy==1.3.3 2026-02-21T08:07:53.1057195Z + cycler==0.12.1 2026-02-21T08:07:53.1057329Z + fonttools==4.61.1 2026-02-21T08:07:53.1057479Z + hf-xet==1.2.0 2026-02-21T08:07:53.1057622Z + huggingface-hub==0.36.2 2026-02-21T08:07:53.1057783Z + idna==3.11 2026-02-21T08:07:53.1057914Z + kiwisolver==1.4.9 2026-02-21T08:07:53.1058082Z + matplotlib==3.10.8 2026-02-21T08:07:53.1058254Z + nvidia-ml-py==13.590.48 2026-02-21T08:07:53.1058404Z + pillow==12.1.1 2026-02-21T08:07:53.1058547Z + pyparsing==3.3.2 2026-02-21T08:07:53.1058691Z + python-dateutil==2.9.0.post0 2026-02-21T08:07:53.1058856Z + regex==2026.2.19 2026-02-21T08:07:53.1058986Z + requests==2.32.5 2026-02-21T08:07:53.1059131Z + safetensors==0.7.0 2026-02-21T08:07:53.1059269Z + six==1.17.0 2026-02-21T08:07:53.1059406Z + tabulate==0.9.0 2026-02-21T08:07:53.1059558Z + tokenizers==0.21.4 2026-02-21T08:07:53.1059695Z + tqdm==4.67.3 2026-02-21T08:07:53.1059836Z + transformers==4.53.0 2026-02-21T08:07:53.1059976Z + urllib3==2.6.3 2026-02-21T08:07:53.1163760Z + python install.py --liger 2026-02-21T08:07:58.3507675Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:07:58.3531332Z Audited 6 packages in 3ms 2026-02-21T08:07:58.4118462Z INFO:__main__:[tritonbench] installing liger-kernels... 2026-02-21T08:07:58.4177595Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:07:58.5069653Z Resolved 1 package in 88ms 2026-02-21T08:07:58.5288593Z Prepared 1 package in 21ms 2026-02-21T08:07:58.5326361Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:07:58.5326982Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 
2026-02-21T08:07:58.5328010Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T08:07:58.5355299Z Installed 1 package in 7ms 2026-02-21T08:07:58.5355616Z + liger-kernel-nightly==0.7.0.dev20260219183429 2026-02-21T08:07:58.5386790Z INFO:__main__:[tritonbench] installation complete! 2026-02-21T08:07:58.9814125Z + uv pip install -e . --no-deps 2026-02-21T08:07:59.0245768Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T08:07:59.0277546Z Resolved 1 package in 2ms 2026-02-21T08:07:59.0288069Z Building tritonbench @ file:///__w/helion/helion/benchmarks/tritonbench 2026-02-21T08:07:59.8034807Z Built tritonbench @ file:///__w/helion/helion/benchmarks/tritonbench 2026-02-21T08:07:59.8056166Z Prepared 1 package in 777ms 2026-02-21T08:07:59.8058374Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T08:07:59.8058879Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T08:07:59.8059424Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T08:07:59.8059767Z Installed 1 package in 0.60ms 2026-02-21T08:07:59.8060041Z + tritonbench==0.0.1 (from file:///__w/helion/helion/benchmarks/tritonbench) 2026-02-21T08:07:59.8138223Z + popd 2026-02-21T08:07:59.8138404Z + popd 2026-02-21T08:07:59.8140218Z /__w/helion/helion/benchmarks /__w/helion/helion 2026-02-21T08:07:59.8140471Z /__w/helion/helion 2026-02-21T08:07:59.8194791Z ##[group]Run rm -rf /tmp/torchinductor_*/ || true 2026-02-21T08:07:59.8195090Z rm -rf /tmp/torchinductor_*/ || true 2026-02-21T08:07:59.8195269Z  2026-02-21T08:07:59.8195411Z source .venv/bin/activate 2026-02-21T08:07:59.8195574Z  2026-02-21T08:07:59.8195732Z TEST_REPORTS_DIR=$(pwd)/test/test-reports 2026-02-21T08:07:59.8195935Z mkdir -p "$TEST_REPORTS_DIR" 2026-02-21T08:07:59.8196118Z echo "$TEST_REPORTS_DIR" 2026-02-21T08:07:59.8196273Z  2026-02-21T08:07:59.8196401Z KERNEL_LIST="kl_div" 2026-02-21T08:07:59.8196583Z for kernel in ${KERNEL_LIST//,/ }; do 2026-02-21T08:07:59.8196795Z  echo "==========================================" 2026-02-21T08:07:59.8197034Z  echo "Running benchmark for kernel: $kernel" 2026-02-21T08:07:59.8197250Z  echo "==========================================" 2026-02-21T08:07:59.8197438Z  2026-02-21T08:07:59.8197651Z  # Get available implementations and baseline for this kernel 2026-02-21T08:07:59.8198039Z  KERNEL_INFO=$(python benchmarks/run.py --list-impls-for-benchmark-ci --op $kernel | grep "^$kernel:") 2026-02-21T08:07:59.8198426Z  IMPLS=$(echo "$KERNEL_INFO" | sed -n 's/.*impls=\([^ ]*\).*/\1/p') 2026-02-21T08:07:59.8198733Z  BASELINE=$(echo "$KERNEL_INFO" | sed -n 's/.*baseline=\([^ ]*\).*/\1/p') 2026-02-21T08:07:59.8198971Z  2026-02-21T08:07:59.8199107Z  if [[ -z "$IMPLS" ]]; then 2026-02-21T08:07:59.8199356Z  echo "Warning: No implementations found for kernel $kernel, skipping..." 
2026-02-21T08:07:59.8199618Z  continue 2026-02-21T08:07:59.8199771Z  fi 2026-02-21T08:07:59.8199918Z  if [[ -z "$BASELINE" ]]; then 2026-02-21T08:07:59.8200161Z  echo "Warning: No baseline found for kernel $kernel, skipping..." 2026-02-21T08:07:59.8200393Z  continue 2026-02-21T08:07:59.8200546Z  fi 2026-02-21T08:07:59.8200691Z  echo "Using baseline: $BASELINE" 2026-02-21T08:07:59.8200923Z  echo "Available implementations for $kernel: $IMPLS" 2026-02-21T08:07:59.8201127Z  2026-02-21T08:07:59.8201294Z  # Do autotuning but do not record the results 2026-02-21T08:07:59.8201497Z  python benchmarks/run.py \ 2026-02-21T08:07:59.8201676Z  --op $kernel \ 2026-02-21T08:07:59.8201898Z  --metrics speedup,accuracy \ 2026-02-21T08:07:59.8202125Z  --latency-measure-mode triton_do_bench \ 2026-02-21T08:07:59.8202346Z  --cudagraph \ 2026-02-21T08:07:59.8202513Z  --only $IMPLS \ 2026-02-21T08:07:59.8202766Z  --only-match-mode prefix-with-baseline \ 2026-02-21T08:07:59.8202964Z  --baseline $BASELINE \ 2026-02-21T08:07:59.8203138Z  --atol 1e-2 \ 2026-02-21T08:07:59.8203300Z  --rtol 1e-2 \ 2026-02-21T08:07:59.8203607Z  --input-sample-mode equally-spaced-k \ 2026-02-21T08:07:59.8203810Z  --keep-going \ 2026-02-21T08:07:59.8203957Z   2026-02-21T08:07:59.8204086Z  2026-02-21T08:07:59.8204209Z  # Relax the GPU 2026-02-21T08:07:59.8204364Z  sleep 2m 2026-02-21T08:07:59.8204493Z  2026-02-21T08:07:59.8204645Z  # Run again with cache and record results 2026-02-21T08:07:59.8204937Z  HELION_PRINT_OUTPUT_CODE=1 HELION_ASSERT_CACHE_HIT=1 python benchmarks/run.py \ 2026-02-21T08:07:59.8205207Z  --op $kernel \ 2026-02-21T08:07:59.8205382Z  --metrics speedup,accuracy \ 2026-02-21T08:07:59.8205583Z  --latency-measure-mode triton_do_bench \ 2026-02-21T08:07:59.8205776Z  --cudagraph \ 2026-02-21T08:07:59.8205927Z  --only $IMPLS \ 2026-02-21T08:07:59.8206219Z  --only-match-mode prefix-with-baseline \ 2026-02-21T08:07:59.8206419Z  --baseline $BASELINE \ 2026-02-21T08:07:59.8206587Z  --atol 1e-2 \ 
2026-02-21T08:07:59.8206738Z  --rtol 1e-2 \ 2026-02-21T08:07:59.8206913Z  --input-sample-mode equally-spaced-k \ 2026-02-21T08:07:59.8207145Z  --output "$TEST_REPORTS_DIR/helionbench.json" \ 2026-02-21T08:07:59.8207371Z  --append-to-output \ 2026-02-21T08:07:59.8207559Z  --keep-going \ 2026-02-21T08:07:59.8207703Z   2026-02-21T08:07:59.8207834Z  2026-02-21T08:07:59.8208007Z  echo "✅ Completed benchmark for kernel: $kernel" 2026-02-21T08:07:59.8208201Z done 2026-02-21T08:07:59.8208328Z  2026-02-21T08:07:59.8208491Z if [[ ! -s "$TEST_REPORTS_DIR/helionbench.json" ]]; then 2026-02-21T08:07:59.8208738Z  echo "❌ helionbench.json is missing or empty" 2026-02-21T08:07:59.8208925Z  exit 1 2026-02-21T08:07:59.8209065Z fi 2026-02-21T08:07:59.8209217Z cat "$TEST_REPORTS_DIR/helionbench.json" 2026-02-21T08:07:59.8209531Z shell: bash -l {0} 2026-02-21T08:07:59.8209673Z env: 2026-02-21T08:07:59.8209808Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:07:59.8210009Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:59.8210239Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:07:59.8210483Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:59.8210701Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:59.8210929Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:07:59.8211296Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:07:59.8211682Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:07:59.8211993Z ##[endgroup] 2026-02-21T08:07:59.8805612Z /__w/helion/helion/test/test-reports 2026-02-21T08:07:59.8805936Z ========================================== 2026-02-21T08:07:59.8806194Z Running benchmark for kernel: kl_div 2026-02-21T08:07:59.8806398Z ========================================== 2026-02-21T08:08:05.3733002Z Using baseline: torch_kl_div 2026-02-21T08:08:05.3735211Z Available implementations 
for kl_div: helion_kl_div_tritonbench,liger_kl_div,torch_compile_kl_div 2026-02-21T08:08:10.8191226Z Using num_inputs=20 for kl_div 2026-02-21T08:08:11.6803413Z Running kl_div benchmark with Helion implementation... 2026-02-21T08:08:11.6807425Z 2026-02-21T08:08:12.1529161Z Warning: Requested 20 inputs but only 6 available. Using all available inputs. 2026-02-21T08:08:12.1529540Z Equally-spaced-k mode: Selected 6 equally spaced inputs (total available: 6) 2026-02-21T08:08:12.1529874Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 1, 2, 3, 4, 5] 2026-02-21T08:08:12.1533956Z 2026-02-21T08:08:12.1545756Z 0%| | 0/6 [00:00 {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:09:22.5433207Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:09:22.5433414Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:09:22.5433647Z %cst = arith.constant dense<0.000000e+00> : tensor<512x32xf32> 2026-02-21T08:09:22.5433870Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:09:22.5434054Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:09:22.5434231Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:09:22.5434409Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:09:22.5434711Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:09:22.5435134Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:09:22.5435439Z %2 = tt.get_program_id x : i32 2026-02-21T08:09:22.5435607Z %3 = arith.muli %2, %c512_i32 : i32 2026-02-21T08:09:22.5435850Z %4 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:09:22.5436382Z %5 = tt.splat %3 : i32 -> tensor<512xi32> 2026-02-21T08:09:22.5436580Z %6 = arith.addi %5, %4 : tensor<512xi32> 2026-02-21T08:09:22.5436894Z %7 = scf.for %arg5 = %c0_i32 to %c4096_i32 step %c32_i32 
iter_args(%arg6 = %cst) -> (tensor<512x32xf32>) : i32 { 2026-02-21T08:09:22.5437297Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc> -> tensor<512x32xf32> 2026-02-21T08:09:22.5438104Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc> -> tensor<512x32xf32> 2026-02-21T08:09:22.5438461Z %13 = scf.if %arg3 -> (tensor<512x32xf32>) { 2026-02-21T08:09:22.5438838Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x32xf32>) -> tensor<512x32xf32> 2026-02-21T08:09:22.5439221Z %16 = arith.subf %12, %11 : tensor<512x32xf32> 2026-02-21T08:09:22.5439714Z %17 = arith.mulf %15, %16 : tensor<512x32xf32> 2026-02-21T08:09:22.5439938Z %18 = arith.addf %17, %cst : tensor<512x32xf32> 2026-02-21T08:09:22.5440152Z scf.yield %18 : tensor<512x32xf32> 2026-02-21T08:09:22.5441029Z } else { 2026-02-21T08:09:22.5441212Z %15 = tt.splat %arg4 : f32 -> tensor<512x32xf32> 2026-02-21T08:09:22.5441448Z %16 = arith.cmpf ogt, %12, %15 : tensor<512x32xf32> 2026-02-21T08:09:22.5441708Z %17 = arith.cmpf une, %12, %12 : tensor<512x32xf32> 2026-02-21T08:09:22.5441969Z %18 = arith.ori %16, %17 : tensor<512x32xi1> 2026-02-21T08:09:22.5442206Z %19 = arith.select %18, %12, %15 : tensor<512x32xi1>, tensor<512x32xf32> 2026-02-21T08:09:22.5442455Z %20 = math.log %19 : tensor<512x32xf32> 2026-02-21T08:09:22.5442657Z %21 = arith.subf %20, %11 : tensor<512x32xf32> 2026-02-21T08:09:22.5442849Z %22 = arith.mulf %12, %21 : tensor<512x32xf32> 2026-02-21T08:09:22.5443060Z %23 = arith.addf %22, %cst : tensor<512x32xf32> 2026-02-21T08:09:22.5443254Z scf.yield %23 : tensor<512x32xf32> 2026-02-21T08:09:22.5443426Z } 2026-02-21T08:09:22.5443565Z %14 = arith.addf %arg6, %13 : tensor<512x32xf32> 2026-02-21T08:09:22.5443759Z scf.yield %14 : tensor<512x32xf32> 2026-02-21T08:09:22.5444069Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:09:22.5444397Z %8 = "tt.reduce"(%7) <{axis = 1 
: i32}> ({ 2026-02-21T08:09:22.5444587Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:09:22.5444755Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:09:22.5444939Z tt.reduce.return %11 : f32 2026-02-21T08:09:22.5445113Z }) : (tensor<512x32xf32>) -> tensor<512xf32> 2026-02-21T08:09:22.5445340Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<512x!tt.ptr> 2026-02-21T08:09:22.5445623Z %10 = tt.addptr %9, %6 : tensor<512x!tt.ptr>, tensor<512xi32> 2026-02-21T08:09:22.5445854Z tt.store %10, %8 : tensor<512x!tt.ptr> 2026-02-21T08:09:22.5446040Z tt.return 2026-02-21T08:09:22.5446160Z } 2026-02-21T08:09:22.5446281Z } 2026-02-21T08:09:22.5446346Z 2026-02-21T08:09:22.5446394Z {-# 2026-02-21T08:09:22.5446532Z external_resources: { 2026-02-21T08:09:22.5446687Z mlir_reproducer: { 2026-02-21T08:09:22.5451100Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, 
triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:09:22.5455636Z disable_threading: false, 2026-02-21T08:09:22.5455818Z verify_each: true 2026-02-21T08:09:22.5455972Z } 2026-02-21T08:09:22.5456107Z } 2026-02-21T08:09:22.5456223Z #-} 2026-02-21T08:09:22.5456758Z /tmp/torchinductor_root/kq/ckqgjm23ayjvdfpaxlbrhuqqydpuweref52jm26k6bgk45sg2bj3.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:09:22.5458046Z /tmp/torchinductor_root/kq/ckqgjm23ayjvdfpaxlbrhuqqydpuweref52jm26k6bgk45sg2bj3.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:09:22.5459082Z [63s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:09:22.5460167Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:09:22.5461095Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:09:22.5461346Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:09:23.1297349Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:09:23.1298040Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:09:23.1298607Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:09:23.1299988Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:09:23.1300190Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:09:23.1300407Z %c2368_i32 = arith.constant 2368 : i32 2026-02-21T08:09:23.1306151Z %cst = arith.constant dense<0.000000e+00> : tensor<16x32xf32> 2026-02-21T08:09:23.1310015Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:09:23.1314642Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:09:23.1318620Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:09:23.1322501Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:09:23.1324631Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:09:23.1325071Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:09:23.1325382Z %2 = tt.get_program_id x : i32 2026-02-21T08:09:23.1325559Z %3 = arith.subi %c256_i32, %2 : i32 2026-02-21T08:09:23.1325745Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:09:23.1325972Z 
%4 = arith.subi %c2368_i32, %c1_i32 : i32 2026-02-21T08:09:23.1326514Z %5 = arith.addi %3, %4 : i32 2026-02-21T08:09:23.1326690Z %6 = arith.divui %5, %c2368_i32 : i32 2026-02-21T08:09:23.1326864Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:09:23.1327041Z %7 = arith.remsi %6, %c4_i32 : i32 2026-02-21T08:09:23.1327208Z %8 = arith.subi %6, %7 : i32 2026-02-21T08:09:23.1327376Z %9 = arith.muli %8, %c2368_i32 : i32 2026-02-21T08:09:23.1327544Z %10 = arith.addi %2, %9 : i32 2026-02-21T08:09:23.1327724Z %11 = arith.muli %c2368_i32, %c4_i32 : i32 2026-02-21T08:09:23.1327916Z scf.for %arg5 = %2 to %10 step %11 : i32 { 2026-02-21T08:09:23.1328110Z %12 = arith.muli %arg5, %c16_i32 : i32 2026-02-21T08:09:23.1328385Z %13 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:09:23.1328634Z %14 = tt.splat %12 : i32 -> tensor<16xi32> 2026-02-21T08:09:23.1328823Z %15 = arith.addi %14, %13 : tensor<16xi32> 2026-02-21T08:09:23.1329222Z %16 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>) : i32 { 2026-02-21T08:09:23.1329629Z %50 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:09:23.1329996Z %51 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:09:23.1330286Z %52 = scf.if %arg3 -> (tensor<16x32xf32>) { 2026-02-21T08:09:23.1330655Z %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32> 2026-02-21T08:09:23.1331029Z %55 = arith.subf %51, %50 : tensor<16x32xf32> 2026-02-21T08:09:23.1331231Z %56 = arith.mulf %54, %55 : tensor<16x32xf32> 2026-02-21T08:09:23.1331495Z %57 = arith.addf %56, %cst : tensor<16x32xf32> 2026-02-21T08:09:23.1331693Z scf.yield %57 : tensor<16x32xf32> 2026-02-21T08:09:23.1332040Z } else { 2026-02-21T08:09:23.1332219Z %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32> 2026-02-21T08:09:23.1332437Z %55 = arith.cmpf ogt, %51, %54 : 
tensor<16x32xf32> 2026-02-21T08:09:23.1332661Z %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32> 2026-02-21T08:09:23.1332871Z %57 = arith.ori %55, %56 : tensor<16x32xi1> 2026-02-21T08:09:23.1333115Z %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32> 2026-02-21T08:09:23.1333362Z %59 = math.log %58 : tensor<16x32xf32> 2026-02-21T08:09:23.1333554Z %60 = arith.subf %59, %50 : tensor<16x32xf32> 2026-02-21T08:09:23.1333755Z %61 = arith.mulf %51, %60 : tensor<16x32xf32> 2026-02-21T08:09:23.1333956Z %62 = arith.addf %61, %cst : tensor<16x32xf32> 2026-02-21T08:09:23.1334155Z scf.yield %62 : tensor<16x32xf32> 2026-02-21T08:09:23.1334319Z } 2026-02-21T08:09:23.1334474Z %53 = arith.addf %arg7, %52 : tensor<16x32xf32> 2026-02-21T08:09:23.1334667Z scf.yield %53 : tensor<16x32xf32> 2026-02-21T08:09:23.1334894Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 2026-02-21T08:09:23.1335121Z %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({ 2026-02-21T08:09:23.1335305Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:23.1335485Z %50 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:23.1335665Z tt.reduce.return %50 : f32 2026-02-21T08:09:23.1335851Z }) : (tensor<16x32xf32>) -> tensor<16xf32> 2026-02-21T08:09:23.1336072Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:09:23.1336336Z %19 = tt.addptr %18, %15 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:09:23.1336573Z tt.store %19, %17 : tensor<16x!tt.ptr> 2026-02-21T08:09:23.1336765Z %c1_i32_0 = arith.constant 1 : i32 2026-02-21T08:09:23.1336955Z %20 = arith.muli %c2368_i32, %c1_i32_0 : i32 2026-02-21T08:09:23.1337140Z %21 = arith.addi %arg5, %20 : i32 2026-02-21T08:09:23.1337323Z %22 = arith.muli %21, %c16_i32 : i32 2026-02-21T08:09:23.1337617Z %23 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:09:23.1337858Z %24 = tt.splat %22 : i32 -> tensor<16xi32> 2026-02-21T08:09:23.1338042Z %25 = arith.addi %24, %23 : tensor<16xi32> 2026-02-21T08:09:23.1338354Z %26 = 
scf.for %arg6 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>) : i32 { 2026-02-21T08:09:23.1338751Z %50 = tt.descriptor_load %0[%22, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:09:23.1339110Z %51 = tt.descriptor_load %1[%22, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:09:23.1339400Z %52 = scf.if %arg3 -> (tensor<16x32xf32>) { 2026-02-21T08:09:23.1339752Z %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32> 2026-02-21T08:09:23.1340176Z %55 = arith.subf %51, %50 : tensor<16x32xf32> 2026-02-21T08:09:23.1340386Z %56 = arith.mulf %54, %55 : tensor<16x32xf32> 2026-02-21T08:09:23.1340582Z %57 = arith.addf %56, %cst : tensor<16x32xf32> 2026-02-21T08:09:23.1340775Z scf.yield %57 : tensor<16x32xf32> 2026-02-21T08:09:23.1340938Z } else { 2026-02-21T08:09:23.1341098Z %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32> 2026-02-21T08:09:23.1341306Z %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32> 2026-02-21T08:09:23.1341524Z %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32> 2026-02-21T08:09:23.1341731Z %57 = arith.ori %55, %56 : tensor<16x32xi1> 2026-02-21T08:09:23.1341994Z %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32> 2026-02-21T08:09:23.1342235Z %59 = math.log %58 : tensor<16x32xf32> 2026-02-21T08:09:23.1342423Z %60 = arith.subf %59, %50 : tensor<16x32xf32> 2026-02-21T08:09:23.1342624Z %61 = arith.mulf %51, %60 : tensor<16x32xf32> 2026-02-21T08:09:23.1342823Z %62 = arith.addf %61, %cst : tensor<16x32xf32> 2026-02-21T08:09:23.1343028Z scf.yield %62 : tensor<16x32xf32> 2026-02-21T08:09:23.1343198Z } 2026-02-21T08:09:23.1343351Z %53 = arith.addf %arg7, %52 : tensor<16x32xf32> 2026-02-21T08:09:23.1343555Z scf.yield %53 : tensor<16x32xf32> 2026-02-21T08:09:23.1343770Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 2026-02-21T08:09:23.1343997Z %27 = "tt.reduce"(%26) <{axis = 1 : i32}> ({ 
2026-02-21T08:09:23.1344189Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:23.1344375Z %50 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:23.1344558Z tt.reduce.return %50 : f32 2026-02-21T08:09:23.1344751Z }) : (tensor<16x32xf32>) -> tensor<16xf32> 2026-02-21T08:09:23.1344986Z %28 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:09:23.1345251Z %29 = tt.addptr %28, %25 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:09:23.1345502Z tt.store %29, %27 : tensor<16x!tt.ptr> 2026-02-21T08:09:23.1345704Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:09:23.1345900Z %30 = arith.muli %c2368_i32, %c2_i32 : i32 2026-02-21T08:09:23.1346090Z %31 = arith.addi %arg5, %30 : i32 2026-02-21T08:09:23.1346279Z %32 = arith.muli %31, %c16_i32 : i32 2026-02-21T08:09:23.1346505Z %33 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:09:23.1346753Z %34 = tt.splat %32 : i32 -> tensor<16xi32> 2026-02-21T08:09:23.1346950Z %35 = arith.addi %34, %33 : tensor<16xi32> 2026-02-21T08:09:23.1347257Z %36 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>) : i32 { 2026-02-21T08:09:23.1347676Z %50 = tt.descriptor_load %0[%32, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:09:23.1348058Z %51 = tt.descriptor_load %1[%32, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:09:23.1348418Z %52 = scf.if %arg3 -> (tensor<16x32xf32>) { 2026-02-21T08:09:23.1348789Z %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32> 2026-02-21T08:09:23.1349160Z %55 = arith.subf %51, %50 : tensor<16x32xf32> 2026-02-21T08:09:23.1349376Z %56 = arith.mulf %54, %55 : tensor<16x32xf32> 2026-02-21T08:09:23.1349586Z %57 = arith.addf %56, %cst : tensor<16x32xf32> 2026-02-21T08:09:23.1349789Z scf.yield %57 : tensor<16x32xf32> 2026-02-21T08:09:23.1349962Z } else { 2026-02-21T08:09:23.1350132Z %54 = tt.splat %arg4 : f32 -> 
tensor<16x32xf32> 2026-02-21T08:09:23.1350358Z %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32> 2026-02-21T08:09:23.1350589Z %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32> 2026-02-21T08:09:23.1350800Z %57 = arith.ori %55, %56 : tensor<16x32xi1> 2026-02-21T08:09:23.1351110Z %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32> 2026-02-21T08:09:23.1351353Z %59 = math.log %58 : tensor<16x32xf32> 2026-02-21T08:09:23.1351540Z %60 = arith.subf %59, %50 : tensor<16x32xf32> 2026-02-21T08:09:23.1351738Z %61 = arith.mulf %51, %60 : tensor<16x32xf32> 2026-02-21T08:09:23.1351974Z %62 = arith.addf %61, %cst : tensor<16x32xf32> 2026-02-21T08:09:23.1352162Z scf.yield %62 : tensor<16x32xf32> 2026-02-21T08:09:23.1352333Z } 2026-02-21T08:09:23.1352498Z %53 = arith.addf %arg7, %52 : tensor<16x32xf32> 2026-02-21T08:09:23.1352686Z scf.yield %53 : tensor<16x32xf32> 2026-02-21T08:09:23.1352896Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 2026-02-21T08:09:23.1353104Z %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({ 2026-02-21T08:09:23.1353293Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:23.1353468Z %50 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:23.1353647Z tt.reduce.return %50 : f32 2026-02-21T08:09:23.1353833Z }) : (tensor<16x32xf32>) -> tensor<16xf32> 2026-02-21T08:09:23.1354052Z %38 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:09:23.1354308Z %39 = tt.addptr %38, %35 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:09:23.1354535Z tt.store %39, %37 : tensor<16x!tt.ptr> 2026-02-21T08:09:23.1354730Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:09:23.1354903Z %40 = arith.muli %c2368_i32, %c3_i32 : i32 2026-02-21T08:09:23.1355087Z %41 = arith.addi %arg5, %40 : i32 2026-02-21T08:09:23.1355263Z %42 = arith.muli %41, %c16_i32 : i32 2026-02-21T08:09:23.1355478Z %43 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:09:23.1355714Z %44 = tt.splat %42 : i32 -> tensor<16xi32> 2026-02-21T08:09:23.1355895Z %45 = 
arith.addi %44, %43 : tensor<16xi32> 2026-02-21T08:09:23.1356195Z %46 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>) : i32 { 2026-02-21T08:09:23.1356583Z %50 = tt.descriptor_load %0[%42, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:09:23.1356945Z %51 = tt.descriptor_load %1[%42, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:09:23.1357231Z %52 = scf.if %arg3 -> (tensor<16x32xf32>) { 2026-02-21T08:09:23.1357580Z %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32> 2026-02-21T08:09:23.1357941Z %55 = arith.subf %51, %50 : tensor<16x32xf32> 2026-02-21T08:09:23.1358142Z %56 = arith.mulf %54, %55 : tensor<16x32xf32> 2026-02-21T08:09:23.1358349Z %57 = arith.addf %56, %cst : tensor<16x32xf32> 2026-02-21T08:09:23.1358545Z scf.yield %57 : tensor<16x32xf32> 2026-02-21T08:09:23.1358708Z } else { 2026-02-21T08:09:23.1358870Z %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32> 2026-02-21T08:09:23.1359134Z %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32> 2026-02-21T08:09:23.1359349Z %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32> 2026-02-21T08:09:23.1359550Z %57 = arith.ori %55, %56 : tensor<16x32xi1> 2026-02-21T08:09:23.1359785Z %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32> 2026-02-21T08:09:23.1360022Z %59 = math.log %58 : tensor<16x32xf32> 2026-02-21T08:09:23.1360208Z %60 = arith.subf %59, %50 : tensor<16x32xf32> 2026-02-21T08:09:23.1360408Z %61 = arith.mulf %51, %60 : tensor<16x32xf32> 2026-02-21T08:09:23.1360605Z %62 = arith.addf %61, %cst : tensor<16x32xf32> 2026-02-21T08:09:23.1360800Z scf.yield %62 : tensor<16x32xf32> 2026-02-21T08:09:23.1360963Z } 2026-02-21T08:09:23.1361110Z %53 = arith.addf %arg7, %52 : tensor<16x32xf32> 2026-02-21T08:09:23.1361295Z scf.yield %53 : tensor<16x32xf32> 2026-02-21T08:09:23.1361577Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 
2026-02-21T08:09:23.1361799Z %47 = "tt.reduce"(%46) <{axis = 1 : i32}> ({ 2026-02-21T08:09:23.1362009Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:23.1362182Z %50 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:23.1362356Z tt.reduce.return %50 : f32 2026-02-21T08:09:23.1362540Z }) : (tensor<16x32xf32>) -> tensor<16xf32> 2026-02-21T08:09:23.1362762Z %48 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:09:23.1363011Z %49 = tt.addptr %48, %45 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:09:23.1363242Z tt.store %49, %47 : tensor<16x!tt.ptr> 2026-02-21T08:09:23.1363414Z } 2026-02-21T08:09:23.1363581Z scf.for %arg5 = %10 to %c256_i32 step %c2368_i32 : i32 { 2026-02-21T08:09:23.1363787Z %12 = arith.muli %arg5, %c16_i32 : i32 2026-02-21T08:09:23.1364012Z %13 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:09:23.1364244Z %14 = tt.splat %12 : i32 -> tensor<16xi32> 2026-02-21T08:09:23.1364437Z %15 = arith.addi %14, %13 : tensor<16xi32> 2026-02-21T08:09:23.1364738Z %16 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>) : i32 { 2026-02-21T08:09:23.1365121Z %20 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:09:23.1365480Z %21 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:09:23.1365757Z %22 = scf.if %arg3 -> (tensor<16x32xf32>) { 2026-02-21T08:09:23.1366113Z %24 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32> 2026-02-21T08:09:23.1366472Z %25 = arith.subf %21, %20 : tensor<16x32xf32> 2026-02-21T08:09:23.1366671Z %26 = arith.mulf %24, %25 : tensor<16x32xf32> 2026-02-21T08:09:23.1366880Z %27 = arith.addf %26, %cst : tensor<16x32xf32> 2026-02-21T08:09:23.1367069Z scf.yield %27 : tensor<16x32xf32> 2026-02-21T08:09:23.1367236Z } else { 2026-02-21T08:09:23.1367387Z %24 = tt.splat %arg4 : f32 -> tensor<16x32xf32> 
2026-02-21T08:09:23.1367601Z %25 = arith.cmpf ogt, %21, %24 : tensor<16x32xf32> 2026-02-21T08:09:23.1367814Z %26 = arith.cmpf une, %21, %21 : tensor<16x32xf32> 2026-02-21T08:09:23.1368014Z %27 = arith.ori %25, %26 : tensor<16x32xi1> 2026-02-21T08:09:23.1368247Z %28 = arith.select %27, %21, %24 : tensor<16x32xi1>, tensor<16x32xf32> 2026-02-21T08:09:23.1368476Z %29 = math.log %28 : tensor<16x32xf32> 2026-02-21T08:09:23.1368667Z %30 = arith.subf %29, %20 : tensor<16x32xf32> 2026-02-21T08:09:23.1368859Z %31 = arith.mulf %21, %30 : tensor<16x32xf32> 2026-02-21T08:09:23.1369061Z %32 = arith.addf %31, %cst : tensor<16x32xf32> 2026-02-21T08:09:23.1369256Z scf.yield %32 : tensor<16x32xf32> 2026-02-21T08:09:23.1369471Z } 2026-02-21T08:09:23.1369616Z %23 = arith.addf %arg7, %22 : tensor<16x32xf32> 2026-02-21T08:09:23.1369803Z scf.yield %23 : tensor<16x32xf32> 2026-02-21T08:09:23.1370012Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 2026-02-21T08:09:23.1370220Z %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({ 2026-02-21T08:09:23.1370406Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:23.1370572Z %20 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:23.1370754Z tt.reduce.return %20 : f32 2026-02-21T08:09:23.1370936Z }) : (tensor<16x32xf32>) -> tensor<16xf32> 2026-02-21T08:09:23.1371157Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:09:23.1371420Z %19 = tt.addptr %18, %15 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:09:23.1371651Z tt.store %19, %17 : tensor<16x!tt.ptr> 2026-02-21T08:09:23.1371888Z } {tt.num_stages = 1 : i32} 2026-02-21T08:09:23.1372102Z tt.return 2026-02-21T08:09:23.1372243Z } 2026-02-21T08:09:23.1372367Z } 2026-02-21T08:09:23.1372445Z 2026-02-21T08:09:23.1372495Z {-# 2026-02-21T08:09:23.1372634Z external_resources: { 2026-02-21T08:09:23.1372792Z mlir_reproducer: { 2026-02-21T08:09:23.1377162Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, 
tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=1}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=1}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=1}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:09:23.1381536Z disable_threading: false, 2026-02-21T08:09:23.1381698Z verify_each: true 2026-02-21T08:09:23.1381844Z } 2026-02-21T08:09:23.1381988Z } 2026-02-21T08:09:23.1382104Z #-} 2026-02-21T08:09:23.1382542Z /tmp/torchinductor_root/tt/cttkhh7z2rjgcswtho32r5mlfurqew4gn2qvsirs7o74lkakfnym.py:21:0: error: Failures have been detected while processing an 
MLIR pass pipeline 2026-02-21T08:09:23.1383838Z /tmp/torchinductor_root/tt/cttkhh7z2rjgcswtho32r5mlfurqew4gn2qvsirs7o74lkakfnym.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:09:23.1384917Z [64s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:09:23.1386089Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=32, num_sm_multiplier=16, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, False], range_num_stages=[0, 0], range_unroll_factors=[4, 0], range_warp_specializes=[False, True]), static_shapes=True) 2026-02-21T08:09:23.1387207Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:09:23.1387463Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:09:25.6557917Z module { 2026-02-21T08:09:25.6560252Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:09:25.6560883Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:09:25.6561181Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:09:25.6561639Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:09:25.6562982Z %cst = arith.constant dense<0.000000e+00> : tensor<16x256xf32> 2026-02-21T08:09:25.6563223Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:09:25.6563404Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:09:25.6563589Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:09:25.6563760Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:09:25.6564073Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:09:25.6564505Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:09:25.6564802Z %2 = tt.get_program_id x : i32 2026-02-21T08:09:25.6564979Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:09:25.6565157Z %4 = arith.minsi %3, %c256_i32 : i32 2026-02-21T08:09:25.6569132Z scf.for %arg5 = %2 to %4 step %c1_i32 : i32 { 2026-02-21T08:09:25.6570158Z %5 = arith.muli %arg5, %c16_i32 : i32 2026-02-21T08:09:25.6570412Z %6 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:09:25.6570674Z %7 = tt.splat %5 : i32 -> tensor<16xi32> 2026-02-21T08:09:25.6570863Z %8 = arith.addi %7, %6 : tensor<16xi32> 2026-02-21T08:09:25.6571174Z %9 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c256_i32 iter_args(%arg7 = %cst) -> (tensor<16x256xf32>) : i32 { 2026-02-21T08:09:25.6571582Z %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc> -> tensor<16x256xf32> 2026-02-21T08:09:25.6572094Z %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc> -> 
tensor<16x256xf32> 2026-02-21T08:09:25.6572392Z %15 = scf.if %arg3 -> (tensor<16x256xf32>) { 2026-02-21T08:09:25.6572761Z %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32> 2026-02-21T08:09:25.6573148Z %18 = arith.subf %14, %13 : tensor<16x256xf32> 2026-02-21T08:09:25.6573360Z %19 = arith.mulf %17, %18 : tensor<16x256xf32> 2026-02-21T08:09:25.6573576Z %20 = arith.addf %19, %cst : tensor<16x256xf32> 2026-02-21T08:09:25.6573778Z scf.yield %20 : tensor<16x256xf32> 2026-02-21T08:09:25.6573948Z } else { 2026-02-21T08:09:25.6574119Z %17 = tt.splat %arg4 : f32 -> tensor<16x256xf32> 2026-02-21T08:09:25.6574335Z %18 = arith.cmpf ogt, %14, %17 : tensor<16x256xf32> 2026-02-21T08:09:25.6574559Z %19 = arith.cmpf une, %14, %14 : tensor<16x256xf32> 2026-02-21T08:09:25.6574764Z %20 = arith.ori %18, %19 : tensor<16x256xi1> 2026-02-21T08:09:25.6575029Z %21 = arith.select %20, %14, %17 : tensor<16x256xi1>, tensor<16x256xf32> 2026-02-21T08:09:25.6575279Z %22 = math.log %21 : tensor<16x256xf32> 2026-02-21T08:09:25.6575473Z %23 = arith.subf %22, %13 : tensor<16x256xf32> 2026-02-21T08:09:25.6575687Z %24 = arith.mulf %14, %23 : tensor<16x256xf32> 2026-02-21T08:09:25.6576166Z %25 = arith.addf %24, %cst : tensor<16x256xf32> 2026-02-21T08:09:25.6576374Z scf.yield %25 : tensor<16x256xf32> 2026-02-21T08:09:25.6576556Z } 2026-02-21T08:09:25.6576711Z %16 = arith.addf %arg7, %15 : tensor<16x256xf32> 2026-02-21T08:09:25.6576918Z scf.yield %16 : tensor<16x256xf32> 2026-02-21T08:09:25.6577141Z } {tt.flatten, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:09:25.6577372Z %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({ 2026-02-21T08:09:25.6577557Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:25.6577737Z %13 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:25.6577918Z tt.reduce.return %13 : f32 2026-02-21T08:09:25.6578109Z }) : (tensor<16x256xf32>) -> tensor<16xf32> 2026-02-21T08:09:25.6578339Z %11 = 
tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:09:25.6578677Z %12 = tt.addptr %11, %8 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:09:25.6578916Z tt.store %12, %10 : tensor<16x!tt.ptr> 2026-02-21T08:09:25.6579092Z } 2026-02-21T08:09:25.6579217Z tt.return 2026-02-21T08:09:25.6579336Z } 2026-02-21T08:09:25.6579458Z } 2026-02-21T08:09:25.6579523Z 2026-02-21T08:09:25.6579571Z {-# 2026-02-21T08:09:25.6579701Z external_resources: { 2026-02-21T08:09:25.6579861Z mlir_reproducer: { 2026-02-21T08:09:25.6584179Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, 
tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:09:25.6588587Z disable_threading: false, 2026-02-21T08:09:25.6588754Z verify_each: true 2026-02-21T08:09:25.6588890Z } 2026-02-21T08:09:25.6589010Z } 2026-02-21T08:09:25.6589116Z #-} 2026-02-21T08:09:25.6589525Z /tmp/torchinductor_root/um/cum4laz43eppv7xfzfxng5552jl2lswbog6nywjoe3ypha2sth6j.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:09:25.6590715Z /tmp/torchinductor_root/um/cum4laz43eppv7xfzfxng5552jl2lswbog6nywjoe3ypha2sth6j.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:09:25.6591684Z [66s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:09:25.6592833Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:09:25.6593769Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:09:25.6594014Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:09:26.0076310Z module { 2026-02-21T08:09:26.0077039Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:09:26.0077975Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:09:26.0078199Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:09:26.0078439Z %cst = arith.constant dense<0.000000e+00> : tensor<32x16xf32> 2026-02-21T08:09:26.0078673Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:09:26.0078867Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:09:26.0079051Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:09:26.0079238Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:09:26.0079554Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:09:26.0080002Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:09:26.0080322Z %2 = tt.get_program_id x : i32 2026-02-21T08:09:26.0080518Z %3 = arith.muli %2, %c32_i32 : i32 2026-02-21T08:09:26.0080751Z %4 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:09:26.0081006Z %5 = tt.splat %3 : i32 -> tensor<32xi32> 2026-02-21T08:09:26.0081214Z %6 = arith.addi %5, %4 : tensor<32xi32> 2026-02-21T08:09:26.0081529Z %7 = scf.for %arg5 = %c0_i32 to %c4096_i32 step %c16_i32 iter_args(%arg6 = %cst) -> (tensor<32x16xf32>) : i32 { 2026-02-21T08:09:26.0082130Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc> -> tensor<32x16xf32> 2026-02-21T08:09:26.0082527Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc> -> tensor<32x16xf32> 2026-02-21T08:09:26.0082825Z %13 = scf.if %arg3 -> (tensor<32x16xf32>) { 2026-02-21T08:09:26.0083216Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x16xf32>) -> tensor<32x16xf32> 
2026-02-21T08:09:26.0083607Z %16 = arith.subf %12, %11 : tensor<32x16xf32> 2026-02-21T08:09:26.0083820Z %17 = arith.mulf %15, %16 : tensor<32x16xf32> 2026-02-21T08:09:26.0084032Z %18 = arith.addf %17, %cst : tensor<32x16xf32> 2026-02-21T08:09:26.0084228Z scf.yield %18 : tensor<32x16xf32> 2026-02-21T08:09:26.0084402Z } else { 2026-02-21T08:09:26.0084556Z %15 = tt.splat %arg4 : f32 -> tensor<32x16xf32> 2026-02-21T08:09:26.0084775Z %16 = arith.cmpf ogt, %12, %15 : tensor<32x16xf32> 2026-02-21T08:09:26.0084985Z %17 = arith.cmpf une, %12, %12 : tensor<32x16xf32> 2026-02-21T08:09:26.0085189Z %18 = arith.ori %16, %17 : tensor<32x16xi1> 2026-02-21T08:09:26.0085427Z %19 = arith.select %18, %12, %15 : tensor<32x16xi1>, tensor<32x16xf32> 2026-02-21T08:09:26.0085659Z %20 = math.log %19 : tensor<32x16xf32> 2026-02-21T08:09:26.0085853Z %21 = arith.subf %20, %11 : tensor<32x16xf32> 2026-02-21T08:09:26.0086046Z %22 = arith.mulf %12, %21 : tensor<32x16xf32> 2026-02-21T08:09:26.0086249Z %23 = arith.addf %22, %cst : tensor<32x16xf32> 2026-02-21T08:09:26.0086435Z scf.yield %23 : tensor<32x16xf32> 2026-02-21T08:09:26.0086603Z } 2026-02-21T08:09:26.0086751Z %14 = arith.addf %arg6, %13 : tensor<32x16xf32> 2026-02-21T08:09:26.0087060Z scf.yield %14 : tensor<32x16xf32> 2026-02-21T08:09:26.0087245Z } {tt.warp_specialize} 2026-02-21T08:09:26.0087407Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:09:26.0087594Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:09:26.0087760Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:09:26.0087945Z tt.reduce.return %11 : f32 2026-02-21T08:09:26.0088120Z }) : (tensor<32x16xf32>) -> tensor<32xf32> 2026-02-21T08:09:26.0088349Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<32x!tt.ptr> 2026-02-21T08:09:26.0088606Z %10 = tt.addptr %9, %6 : tensor<32x!tt.ptr>, tensor<32xi32> 2026-02-21T08:09:26.0088829Z tt.store %10, %8 : tensor<32x!tt.ptr> 2026-02-21T08:09:26.0089008Z tt.return 2026-02-21T08:09:26.0089126Z } 2026-02-21T08:09:26.0089275Z } 
2026-02-21T08:09:26.0089339Z 2026-02-21T08:09:26.0089395Z {-# 2026-02-21T08:09:26.0089580Z external_resources: { 2026-02-21T08:09:26.0089750Z mlir_reproducer: { 2026-02-21T08:09:26.0093988Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=1}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=1}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=1}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:09:26.0098325Z 
disable_threading: false, 2026-02-21T08:09:26.0098502Z verify_each: true 2026-02-21T08:09:26.0098656Z } 2026-02-21T08:09:26.0098792Z } 2026-02-21T08:09:26.0098927Z #-} 2026-02-21T08:09:26.0099443Z /tmp/torchinductor_root/pj/cpjr5oe4aculykfgs75dfjbrgt34pqrvlqqderpr3fiksvzv4gvf.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:09:26.0100804Z /tmp/torchinductor_root/pj/cpjr5oe4aculykfgs75dfjbrgt34pqrvlqqderpr3fiksvzv4gvf.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:09:26.0101908Z [66s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:09:26.0103010Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', ''], num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:09:26.0104033Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:09:26.0104318Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:09:26.0578281Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:09:26.0581033Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:09:26.0581934Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:09:26.0586912Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:09:26.0588769Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:09:26.0589020Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:09:26.0589259Z %cst = arith.constant dense<4096> : tensor<512x1xi32> 2026-02-21T08:09:26.0589732Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<512x4xf32> 2026-02-21T08:09:26.0590006Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:09:26.0590196Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:09:26.0590384Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:09:26.0590568Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:09:26.0590882Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:09:26.0591205Z %1 = tt.get_program_id x : i32 2026-02-21T08:09:26.0591409Z scf.for %arg5 = %1 to %c8_i32 step %c9472_i32 : i32 { 2026-02-21T08:09:26.0591629Z %2 = arith.muli %arg5, %c512_i32 : i32 2026-02-21T08:09:26.0591923Z %3 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:09:26.0592187Z %4 = tt.splat %2 : i32 -> tensor<512xi32> 2026-02-21T08:09:26.0592388Z %5 = arith.addi %4, %3 : tensor<512xi32> 2026-02-21T08:09:26.0592582Z %c4092_i32 = arith.constant 4092 : i32 2026-02-21T08:09:26.0592783Z %c12_i32 = arith.constant 12 : i32 2026-02-21T08:09:26.0593091Z %6 = scf.for %arg6 = %c0_i32 to %c4092_i32 step %c12_i32 iter_args(%arg7 = %cst_0) -> (tensor<512x4xf32>) : i32 { 2026-02-21T08:09:26.0593462Z %25 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 
2026-02-21T08:09:26.0593709Z %26 = tt.splat %arg6 : i32 -> tensor<4xi32> 2026-02-21T08:09:26.0593922Z %27 = arith.addi %26, %25 : tensor<4xi32> 2026-02-21T08:09:26.0594216Z %28 = tt.descriptor_load %0[%2, %arg6] : !tt.tensordesc> -> tensor<512x4xf32> 2026-02-21T08:09:26.0594557Z %29 = tt.expand_dims %5 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:09:26.0594832Z %30 = arith.muli %29, %cst : tensor<512x1xi32> 2026-02-21T08:09:26.0595079Z %31 = tt.expand_dims %27 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32> 2026-02-21T08:09:26.0595368Z %32 = tt.broadcast %30 : tensor<512x1xi32> -> tensor<512x4xi32> 2026-02-21T08:09:26.0595623Z %33 = tt.broadcast %31 : tensor<1x4xi32> -> tensor<512x4xi32> 2026-02-21T08:09:26.0595857Z %34 = arith.addi %32, %33 : tensor<512x4xi32> 2026-02-21T08:09:26.0596098Z %35 = tt.splat %arg1 : !tt.ptr -> tensor<512x4x!tt.ptr> 2026-02-21T08:09:26.0596367Z %36 = tt.addptr %35, %34 : tensor<512x4x!tt.ptr>, tensor<512x4xi32> 2026-02-21T08:09:26.0596616Z %37 = tt.load %36 : tensor<512x4x!tt.ptr> 2026-02-21T08:09:26.0596817Z %38 = scf.if %arg3 -> (tensor<512x4xf32>) { 2026-02-21T08:09:26.0597189Z %74 = tt.extern_elementwise %37 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x4xf32>) -> tensor<512x4xf32> 2026-02-21T08:09:26.0597562Z %75 = arith.subf %37, %28 : tensor<512x4xf32> 2026-02-21T08:09:26.0597805Z %76 = arith.mulf %74, %75 : tensor<512x4xf32> 2026-02-21T08:09:26.0598012Z %77 = arith.addf %76, %cst_0 : tensor<512x4xf32> 2026-02-21T08:09:26.0598308Z scf.yield %77 : tensor<512x4xf32> 2026-02-21T08:09:26.0598479Z } else { 2026-02-21T08:09:26.0598649Z %74 = tt.splat %arg4 : f32 -> tensor<512x4xf32> 2026-02-21T08:09:26.0598866Z %75 = arith.cmpf ogt, %37, %74 : tensor<512x4xf32> 2026-02-21T08:09:26.0599091Z %76 = arith.cmpf une, %37, %37 : tensor<512x4xf32> 2026-02-21T08:09:26.0599305Z %77 = arith.ori %75, %76 : tensor<512x4xi1> 2026-02-21T08:09:26.0599538Z %78 = arith.select %77, %37, %74 : 
tensor<512x4xi1>, tensor<512x4xf32> 2026-02-21T08:09:26.0599785Z %79 = math.log %78 : tensor<512x4xf32> 2026-02-21T08:09:26.0599981Z %80 = arith.subf %79, %28 : tensor<512x4xf32> 2026-02-21T08:09:26.0600187Z %81 = arith.mulf %37, %80 : tensor<512x4xf32> 2026-02-21T08:09:26.0600391Z %82 = arith.addf %81, %cst_0 : tensor<512x4xf32> 2026-02-21T08:09:26.0600595Z scf.yield %82 : tensor<512x4xf32> 2026-02-21T08:09:26.0600841Z } 2026-02-21T08:09:26.0600992Z %39 = arith.addf %arg7, %38 : tensor<512x4xf32> 2026-02-21T08:09:26.0601194Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:09:26.0601385Z %40 = arith.muli %c4_i32, %c1_i32 : i32 2026-02-21T08:09:26.0601579Z %41 = arith.addi %arg6, %40 : i32 2026-02-21T08:09:26.0601816Z %42 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:09:26.0602098Z %43 = tt.splat %41 : i32 -> tensor<4xi32> 2026-02-21T08:09:26.0602286Z %44 = arith.addi %43, %42 : tensor<4xi32> 2026-02-21T08:09:26.0602560Z %45 = tt.descriptor_load %0[%2, %41] : !tt.tensordesc> -> tensor<512x4xf32> 2026-02-21T08:09:26.0602891Z %46 = tt.expand_dims %5 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:09:26.0603144Z %47 = arith.muli %46, %cst : tensor<512x1xi32> 2026-02-21T08:09:26.0603397Z %48 = tt.expand_dims %44 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32> 2026-02-21T08:09:26.0603674Z %49 = tt.broadcast %47 : tensor<512x1xi32> -> tensor<512x4xi32> 2026-02-21T08:09:26.0603930Z %50 = tt.broadcast %48 : tensor<1x4xi32> -> tensor<512x4xi32> 2026-02-21T08:09:26.0604160Z %51 = arith.addi %49, %50 : tensor<512x4xi32> 2026-02-21T08:09:26.0604384Z %52 = tt.splat %arg1 : !tt.ptr -> tensor<512x4x!tt.ptr> 2026-02-21T08:09:26.0604655Z %53 = tt.addptr %52, %51 : tensor<512x4x!tt.ptr>, tensor<512x4xi32> 2026-02-21T08:09:26.0604893Z %54 = tt.load %53 : tensor<512x4x!tt.ptr> 2026-02-21T08:09:26.0605100Z %55 = scf.if %arg3 -> (tensor<512x4xf32>) { 2026-02-21T08:09:26.0605447Z %74 = tt.extern_elementwise %54 {libname = "", 
libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x4xf32>) -> tensor<512x4xf32> 2026-02-21T08:09:26.0605813Z %75 = arith.subf %54, %45 : tensor<512x4xf32> 2026-02-21T08:09:26.0606018Z %76 = arith.mulf %74, %75 : tensor<512x4xf32> 2026-02-21T08:09:26.0606225Z %77 = arith.addf %76, %cst_0 : tensor<512x4xf32> 2026-02-21T08:09:26.0606424Z scf.yield %77 : tensor<512x4xf32> 2026-02-21T08:09:26.0606589Z } else { 2026-02-21T08:09:26.0606752Z %74 = tt.splat %arg4 : f32 -> tensor<512x4xf32> 2026-02-21T08:09:26.0606964Z %75 = arith.cmpf ogt, %54, %74 : tensor<512x4xf32> 2026-02-21T08:09:26.0607181Z %76 = arith.cmpf une, %54, %54 : tensor<512x4xf32> 2026-02-21T08:09:26.0607389Z %77 = arith.ori %75, %76 : tensor<512x4xi1> 2026-02-21T08:09:26.0607617Z %78 = arith.select %77, %54, %74 : tensor<512x4xi1>, tensor<512x4xf32> 2026-02-21T08:09:26.0607854Z %79 = math.log %78 : tensor<512x4xf32> 2026-02-21T08:09:26.0608041Z %80 = arith.subf %79, %45 : tensor<512x4xf32> 2026-02-21T08:09:26.0608240Z %81 = arith.mulf %54, %80 : tensor<512x4xf32> 2026-02-21T08:09:26.0608437Z %82 = arith.addf %81, %cst_0 : tensor<512x4xf32> 2026-02-21T08:09:26.0608635Z scf.yield %82 : tensor<512x4xf32> 2026-02-21T08:09:26.0608863Z } 2026-02-21T08:09:26.0609000Z %56 = arith.addf %39, %55 : tensor<512x4xf32> 2026-02-21T08:09:26.0609190Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:09:26.0609371Z %57 = arith.muli %c4_i32, %c2_i32 : i32 2026-02-21T08:09:26.0609554Z %58 = arith.addi %arg6, %57 : i32 2026-02-21T08:09:26.0609764Z %59 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:09:26.0609999Z %60 = tt.splat %58 : i32 -> tensor<4xi32> 2026-02-21T08:09:26.0610184Z %61 = arith.addi %60, %59 : tensor<4xi32> 2026-02-21T08:09:26.0610455Z %62 = tt.descriptor_load %0[%2, %58] : !tt.tensordesc> -> tensor<512x4xf32> 2026-02-21T08:09:26.0610783Z %63 = tt.expand_dims %5 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:09:26.0611033Z %64 = arith.muli %63, %cst 
: tensor<512x1xi32> 2026-02-21T08:09:26.0611321Z %65 = tt.expand_dims %61 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32> 2026-02-21T08:09:26.0611596Z %66 = tt.broadcast %64 : tensor<512x1xi32> -> tensor<512x4xi32> 2026-02-21T08:09:26.0611871Z %67 = tt.broadcast %65 : tensor<1x4xi32> -> tensor<512x4xi32> 2026-02-21T08:09:26.0612101Z %68 = arith.addi %66, %67 : tensor<512x4xi32> 2026-02-21T08:09:26.0612323Z %69 = tt.splat %arg1 : !tt.ptr -> tensor<512x4x!tt.ptr> 2026-02-21T08:09:26.0612592Z %70 = tt.addptr %69, %68 : tensor<512x4x!tt.ptr>, tensor<512x4xi32> 2026-02-21T08:09:26.0612832Z %71 = tt.load %70 : tensor<512x4x!tt.ptr> 2026-02-21T08:09:26.0613038Z %72 = scf.if %arg3 -> (tensor<512x4xf32>) { 2026-02-21T08:09:26.0613389Z %74 = tt.extern_elementwise %71 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x4xf32>) -> tensor<512x4xf32> 2026-02-21T08:09:26.0613752Z %75 = arith.subf %71, %62 : tensor<512x4xf32> 2026-02-21T08:09:26.0613955Z %76 = arith.mulf %74, %75 : tensor<512x4xf32> 2026-02-21T08:09:26.0614156Z %77 = arith.addf %76, %cst_0 : tensor<512x4xf32> 2026-02-21T08:09:26.0614353Z scf.yield %77 : tensor<512x4xf32> 2026-02-21T08:09:26.0614514Z } else { 2026-02-21T08:09:26.0614677Z %74 = tt.splat %arg4 : f32 -> tensor<512x4xf32> 2026-02-21T08:09:26.0614887Z %75 = arith.cmpf ogt, %71, %74 : tensor<512x4xf32> 2026-02-21T08:09:26.0615107Z %76 = arith.cmpf une, %71, %71 : tensor<512x4xf32> 2026-02-21T08:09:26.0615313Z %77 = arith.ori %75, %76 : tensor<512x4xi1> 2026-02-21T08:09:26.0615535Z %78 = arith.select %77, %71, %74 : tensor<512x4xi1>, tensor<512x4xf32> 2026-02-21T08:09:26.0615769Z %79 = math.log %78 : tensor<512x4xf32> 2026-02-21T08:09:26.0615956Z %80 = arith.subf %79, %62 : tensor<512x4xf32> 2026-02-21T08:09:26.0616152Z %81 = arith.mulf %71, %80 : tensor<512x4xf32> 2026-02-21T08:09:26.0616392Z %82 = arith.addf %81, %cst_0 : tensor<512x4xf32> 2026-02-21T08:09:26.0616596Z scf.yield %82 : tensor<512x4xf32> 
2026-02-21T08:09:26.0616763Z } 2026-02-21T08:09:26.0616900Z %73 = arith.addf %56, %72 : tensor<512x4xf32> 2026-02-21T08:09:26.0617088Z scf.yield %73 : tensor<512x4xf32> 2026-02-21T08:09:26.0617269Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:09:26.0617497Z %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:09:26.0617730Z %8 = tt.splat %c4092_i32 : i32 -> tensor<4xi32> 2026-02-21T08:09:26.0617928Z %9 = arith.addi %8, %7 : tensor<4xi32> 2026-02-21T08:09:26.0618203Z %10 = tt.descriptor_load %0[%2, %c4092_i32] : !tt.tensordesc> -> tensor<512x4xf32> 2026-02-21T08:09:26.0618540Z %11 = tt.expand_dims %5 {axis = 1 : i32} : tensor<512xi32> -> tensor<512x1xi32> 2026-02-21T08:09:26.0618799Z %12 = arith.muli %11, %cst : tensor<512x1xi32> 2026-02-21T08:09:26.0619033Z %13 = tt.expand_dims %9 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32> 2026-02-21T08:09:26.0619365Z %14 = tt.broadcast %12 : tensor<512x1xi32> -> tensor<512x4xi32> 2026-02-21T08:09:26.0619606Z %15 = tt.broadcast %13 : tensor<1x4xi32> -> tensor<512x4xi32> 2026-02-21T08:09:26.0619828Z %16 = arith.addi %14, %15 : tensor<512x4xi32> 2026-02-21T08:09:26.0620054Z %17 = tt.splat %arg1 : !tt.ptr -> tensor<512x4x!tt.ptr> 2026-02-21T08:09:26.0620306Z %18 = tt.addptr %17, %16 : tensor<512x4x!tt.ptr>, tensor<512x4xi32> 2026-02-21T08:09:26.0620546Z %19 = tt.load %18 : tensor<512x4x!tt.ptr> 2026-02-21T08:09:26.0620739Z %20 = scf.if %arg3 -> (tensor<512x4xf32>) { 2026-02-21T08:09:26.0621086Z %25 = tt.extern_elementwise %19 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x4xf32>) -> tensor<512x4xf32> 2026-02-21T08:09:26.0621428Z %26 = arith.subf %19, %10 : tensor<512x4xf32> 2026-02-21T08:09:26.0621678Z %27 = arith.mulf %25, %26 : tensor<512x4xf32> 2026-02-21T08:09:26.0621924Z %28 = arith.addf %27, %cst_0 : tensor<512x4xf32> 2026-02-21T08:09:26.0622116Z scf.yield %28 : tensor<512x4xf32> 2026-02-21T08:09:26.0622286Z } else { 2026-02-21T08:09:26.0622435Z %25 = 
tt.splat %arg4 : f32 -> tensor<512x4xf32> 2026-02-21T08:09:26.0622650Z %26 = arith.cmpf ogt, %19, %25 : tensor<512x4xf32> 2026-02-21T08:09:26.0622855Z %27 = arith.cmpf une, %19, %19 : tensor<512x4xf32> 2026-02-21T08:09:26.0623060Z %28 = arith.ori %26, %27 : tensor<512x4xi1> 2026-02-21T08:09:26.0623282Z %29 = arith.select %28, %19, %25 : tensor<512x4xi1>, tensor<512x4xf32> 2026-02-21T08:09:26.0623520Z %30 = math.log %29 : tensor<512x4xf32> 2026-02-21T08:09:26.0623714Z %31 = arith.subf %30, %10 : tensor<512x4xf32> 2026-02-21T08:09:26.0623907Z %32 = arith.mulf %19, %31 : tensor<512x4xf32> 2026-02-21T08:09:26.0624116Z %33 = arith.addf %32, %cst_0 : tensor<512x4xf32> 2026-02-21T08:09:26.0624306Z scf.yield %33 : tensor<512x4xf32> 2026-02-21T08:09:26.0624488Z } 2026-02-21T08:09:26.0624629Z %21 = arith.addf %6, %20 : tensor<512x4xf32> 2026-02-21T08:09:26.0624827Z %22 = "tt.reduce"(%21) <{axis = 1 : i32}> ({ 2026-02-21T08:09:26.0625016Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:26.0625186Z %25 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:26.0625369Z tt.reduce.return %25 : f32 2026-02-21T08:09:26.0625546Z }) : (tensor<512x4xf32>) -> tensor<512xf32> 2026-02-21T08:09:26.0625771Z %23 = tt.splat %arg2 : !tt.ptr -> tensor<512x!tt.ptr> 2026-02-21T08:09:26.0626021Z %24 = tt.addptr %23, %5 : tensor<512x!tt.ptr>, tensor<512xi32> 2026-02-21T08:09:26.0626253Z tt.store %24, %22 : tensor<512x!tt.ptr> 2026-02-21T08:09:26.0626511Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:09:26.0626749Z tt.return 2026-02-21T08:09:26.0626877Z } 2026-02-21T08:09:26.0626992Z } 2026-02-21T08:09:26.0627061Z 2026-02-21T08:09:26.0627119Z {-# 2026-02-21T08:09:26.0627241Z external_resources: { 2026-02-21T08:09:26.0627398Z mlir_reproducer: { 2026-02-21T08:09:26.0631656Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, 
triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:09:26.0636218Z disable_threading: false, 2026-02-21T08:09:26.0636456Z verify_each: true 2026-02-21T08:09:26.0636601Z } 2026-02-21T08:09:26.0636750Z } 2026-02-21T08:09:26.0636875Z #-} 2026-02-21T08:09:26.0637356Z /tmp/torchinductor_root/yu/cyudquc76sft6j3sqezvgqcbsdkdqd5kew3iaciebsyyk2vta6eh.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 
2026-02-21T08:09:26.0638594Z /tmp/torchinductor_root/yu/cyudquc76sft6j3sqezvgqcbsdkdqd5kew3iaciebsyyk2vta6eh.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:09:26.0639616Z [66s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:09:26.0640702Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 512], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', ''], maxnreg=32, num_sm_multiplier=64, num_stages=7, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[3, 2], range_unroll_factors=[1, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:09:26.0641675Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:09:26.0641957Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:09:27.0517642Z module { 2026-02-21T08:09:27.0518367Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:09:27.0518970Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:09:27.0519163Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:09:27.0519335Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:09:27.0519719Z %cst = arith.constant dense<0.000000e+00> : tensor<64x128xf32> 2026-02-21T08:09:27.0520001Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:09:27.0520188Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:09:27.0520373Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:09:27.0520547Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:09:27.0520856Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:09:27.0521356Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:09:27.0521734Z %2 = tt.get_program_id x : i32 2026-02-21T08:09:27.0522133Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:09:27.0522317Z %4 = arith.minsi %3, %c64_i32 : i32 2026-02-21T08:09:27.0522517Z scf.for %arg5 = %2 to %4 step %c1_i32 : i32 { 2026-02-21T08:09:27.0522714Z %5 = arith.muli %arg5, %c64_i32 : i32 2026-02-21T08:09:27.0522952Z %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:09:27.0523516Z %7 = tt.splat %5 : i32 -> tensor<64xi32> 2026-02-21T08:09:27.0523712Z %8 = arith.addi %7, %6 : tensor<64xi32> 2026-02-21T08:09:27.0524017Z %9 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c128_i32 iter_args(%arg7 = %cst) -> (tensor<64x128xf32>) : i32 { 2026-02-21T08:09:27.0524429Z %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc> -> tensor<64x128xf32> 2026-02-21T08:09:27.0524800Z %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc> -> 
tensor<64x128xf32> 2026-02-21T08:09:27.0525089Z %15 = scf.if %arg3 -> (tensor<64x128xf32>) { 2026-02-21T08:09:27.0525466Z %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32> 2026-02-21T08:09:27.0525832Z %18 = arith.subf %14, %13 : tensor<64x128xf32> 2026-02-21T08:09:27.0526146Z %19 = arith.mulf %17, %18 : tensor<64x128xf32> 2026-02-21T08:09:27.0526364Z %20 = arith.addf %19, %cst : tensor<64x128xf32> 2026-02-21T08:09:27.0526557Z scf.yield %20 : tensor<64x128xf32> 2026-02-21T08:09:27.0526732Z } else { 2026-02-21T08:09:27.0526893Z %17 = tt.splat %arg4 : f32 -> tensor<64x128xf32> 2026-02-21T08:09:27.0527119Z %18 = arith.cmpf ogt, %14, %17 : tensor<64x128xf32> 2026-02-21T08:09:27.0527339Z %19 = arith.cmpf une, %14, %14 : tensor<64x128xf32> 2026-02-21T08:09:27.0527554Z %20 = arith.ori %18, %19 : tensor<64x128xi1> 2026-02-21T08:09:27.0527798Z %21 = arith.select %20, %14, %17 : tensor<64x128xi1>, tensor<64x128xf32> 2026-02-21T08:09:27.0528037Z %22 = math.log %21 : tensor<64x128xf32> 2026-02-21T08:09:27.0528239Z %23 = arith.subf %22, %13 : tensor<64x128xf32> 2026-02-21T08:09:27.0528435Z %24 = arith.mulf %14, %23 : tensor<64x128xf32> 2026-02-21T08:09:27.0528646Z %25 = arith.addf %24, %cst : tensor<64x128xf32> 2026-02-21T08:09:27.0528840Z scf.yield %25 : tensor<64x128xf32> 2026-02-21T08:09:27.0529010Z } 2026-02-21T08:09:27.0529152Z %16 = arith.addf %arg7, %15 : tensor<64x128xf32> 2026-02-21T08:09:27.0529346Z scf.yield %16 : tensor<64x128xf32> 2026-02-21T08:09:27.0529552Z } {tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:09:27.0529755Z %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({ 2026-02-21T08:09:27.0529945Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:27.0530120Z %13 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:27.0530308Z tt.reduce.return %13 : f32 2026-02-21T08:09:27.0530488Z }) : (tensor<64x128xf32>) -> tensor<64xf32> 2026-02-21T08:09:27.0530720Z %11 = tt.splat %arg2 : 
!tt.ptr -> tensor<64x!tt.ptr> 2026-02-21T08:09:27.0530978Z %12 = tt.addptr %11, %8 : tensor<64x!tt.ptr>, tensor<64xi32> 2026-02-21T08:09:27.0531201Z tt.store %12, %10 : tensor<64x!tt.ptr> 2026-02-21T08:09:27.0531399Z } {tt.flatten, tt.num_stages = 4 : i32} 2026-02-21T08:09:27.0531568Z tt.return 2026-02-21T08:09:27.0531693Z } 2026-02-21T08:09:27.0531807Z } 2026-02-21T08:09:27.0531937Z 2026-02-21T08:09:27.0531987Z {-# 2026-02-21T08:09:27.0532110Z external_resources: { 2026-02-21T08:09:27.0532269Z mlir_reproducer: { 2026-02-21T08:09:27.0536660Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, 
tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:09:27.0541051Z disable_threading: false, 2026-02-21T08:09:27.0541249Z verify_each: true 2026-02-21T08:09:27.0541402Z } 2026-02-21T08:09:27.0541539Z } 2026-02-21T08:09:27.0541661Z #-} 2026-02-21T08:09:27.0542185Z /tmp/torchinductor_root/q6/cq6q5pv2cbszkz2i6st67lcbiwqdbcytfsidsup2i3mkuyuhy5nm.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:09:27.0543498Z /tmp/torchinductor_root/q6/cq6q5pv2cbszkz2i6st67lcbiwqdbcytfsidsup2i3mkuyuhy5nm.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:09:27.0544542Z [67s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:09:27.0545725Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'last'], num_sm_multiplier=8, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[True, None], range_num_stages=[4, 1], range_unroll_factors=[0, 0], range_warp_specializes=[False, True]), static_shapes=True) 2026-02-21T08:09:27.0546717Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:09:27.0547007Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:09:27.1087234Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:09:27.1087944Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:09:27.1088574Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:09:27.1088783Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:09:27.1088961Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:09:27.1089182Z %cst = arith.constant dense<0.000000e+00> : tensor<4x4xf32> 2026-02-21T08:09:27.1089405Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:09:27.1089597Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:09:27.1089800Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:09:27.1089989Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:09:27.1090313Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:09:27.1090756Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:09:27.1091076Z %2 = tt.get_program_id x : i32 2026-02-21T08:09:27.1091261Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:09:27.1091441Z %4 = arith.minsi %3, %c1024_i32 : i32 2026-02-21T08:09:27.1091931Z %5 = arith.subi %4, %2 : i32 2026-02-21T08:09:27.1092112Z %c1_i32_0 = arith.constant 1 : i32 2026-02-21T08:09:27.1092310Z %6 = arith.subi %c1_i32, %c1_i32_0 : i32 2026-02-21T08:09:27.1092492Z %7 = arith.addi %5, %6 : i32 2026-02-21T08:09:27.1092671Z %8 = arith.divui %7, %c1_i32 : i32 2026-02-21T08:09:27.1092863Z %c4_i32_1 = arith.constant 4 : i32 2026-02-21T08:09:27.1093067Z %9 = arith.remsi %8, %c4_i32_1 : i32 2026-02-21T08:09:27.1093260Z %10 = arith.subi %8, %9 : i32 2026-02-21T08:09:27.1093438Z %11 = arith.muli %10, %c1_i32 : i32 2026-02-21T08:09:27.1093633Z %12 = arith.addi %2, %11 : i32 2026-02-21T08:09:27.1093822Z %13 = arith.muli %c1_i32, %c4_i32_1 
: i32 2026-02-21T08:09:27.1094042Z scf.for %arg5 = %2 to %12 step %13 : i32 { 2026-02-21T08:09:27.1094248Z %14 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:09:27.1094488Z %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:09:27.1094807Z %16 = tt.splat %14 : i32 -> tensor<4xi32> 2026-02-21T08:09:27.1095014Z %17 = arith.addi %16, %15 : tensor<4xi32> 2026-02-21T08:09:27.1095345Z %18 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:09:27.1095779Z %52 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:09:27.1096173Z %53 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:09:27.1096464Z %54 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:09:27.1096842Z %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:09:27.1097221Z %57 = arith.subf %53, %52 : tensor<4x4xf32> 2026-02-21T08:09:27.1097420Z %58 = arith.mulf %56, %57 : tensor<4x4xf32> 2026-02-21T08:09:27.1097632Z %59 = arith.addf %58, %cst : tensor<4x4xf32> 2026-02-21T08:09:27.1097830Z scf.yield %59 : tensor<4x4xf32> 2026-02-21T08:09:27.1098003Z } else { 2026-02-21T08:09:27.1098161Z %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:09:27.1098380Z %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32> 2026-02-21T08:09:27.1098597Z %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32> 2026-02-21T08:09:27.1098801Z %59 = arith.ori %57, %58 : tensor<4x4xi1> 2026-02-21T08:09:27.1099042Z %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:09:27.1099274Z %61 = math.log %60 : tensor<4x4xf32> 2026-02-21T08:09:27.1099469Z %62 = arith.subf %61, %52 : tensor<4x4xf32> 2026-02-21T08:09:27.1099662Z %63 = arith.mulf %53, %62 : tensor<4x4xf32> 2026-02-21T08:09:27.1099863Z %64 = arith.addf %63, %cst : tensor<4x4xf32> 
2026-02-21T08:09:27.1100052Z scf.yield %64 : tensor<4x4xf32> 2026-02-21T08:09:27.1100223Z } 2026-02-21T08:09:27.1100372Z %55 = arith.addf %arg7, %54 : tensor<4x4xf32> 2026-02-21T08:09:27.1100561Z scf.yield %55 : tensor<4x4xf32> 2026-02-21T08:09:27.1100875Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:09:27.1101200Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:09:27.1101392Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:27.1101566Z %52 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:27.1101751Z tt.reduce.return %52 : f32 2026-02-21T08:09:27.1101982Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:09:27.1102203Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:09:27.1102468Z %21 = tt.addptr %20, %17 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:09:27.1102697Z tt.store %21, %19 : tensor<4x!tt.ptr> 2026-02-21T08:09:27.1102892Z %c1_i32_2 = arith.constant 1 : i32 2026-02-21T08:09:27.1103077Z %22 = arith.muli %c1_i32, %c1_i32_2 : i32 2026-02-21T08:09:27.1103329Z %23 = arith.addi %arg5, %22 : i32 2026-02-21T08:09:27.1103545Z %24 = arith.muli %23, %c4_i32 : i32 2026-02-21T08:09:27.1103762Z %25 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:09:27.1104003Z %26 = tt.splat %24 : i32 -> tensor<4xi32> 2026-02-21T08:09:27.1104191Z %27 = arith.addi %26, %25 : tensor<4xi32> 2026-02-21T08:09:27.1104498Z %28 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:09:27.1104893Z %52 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:09:27.1105250Z %53 = tt.descriptor_load %1[%24, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:09:27.1105539Z %54 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:09:27.1105946Z %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) 
-> tensor<4x4xf32> 2026-02-21T08:09:27.1106313Z %57 = arith.subf %53, %52 : tensor<4x4xf32> 2026-02-21T08:09:27.1106516Z %58 = arith.mulf %56, %57 : tensor<4x4xf32> 2026-02-21T08:09:27.1106715Z %59 = arith.addf %58, %cst : tensor<4x4xf32> 2026-02-21T08:09:27.1106912Z scf.yield %59 : tensor<4x4xf32> 2026-02-21T08:09:27.1107075Z } else { 2026-02-21T08:09:27.1107240Z %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:09:27.1107451Z %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32> 2026-02-21T08:09:27.1107667Z %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32> 2026-02-21T08:09:27.1107874Z %59 = arith.ori %57, %58 : tensor<4x4xi1> 2026-02-21T08:09:27.1108104Z %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:09:27.1108346Z %61 = math.log %60 : tensor<4x4xf32> 2026-02-21T08:09:27.1108539Z %62 = arith.subf %61, %52 : tensor<4x4xf32> 2026-02-21T08:09:27.1108740Z %63 = arith.mulf %53, %62 : tensor<4x4xf32> 2026-02-21T08:09:27.1108933Z %64 = arith.addf %63, %cst : tensor<4x4xf32> 2026-02-21T08:09:27.1109127Z scf.yield %64 : tensor<4x4xf32> 2026-02-21T08:09:27.1109296Z } 2026-02-21T08:09:27.1109434Z %55 = arith.addf %arg7, %54 : tensor<4x4xf32> 2026-02-21T08:09:27.1109626Z scf.yield %55 : tensor<4x4xf32> 2026-02-21T08:09:27.1109935Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:09:27.1110265Z %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({ 2026-02-21T08:09:27.1110448Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:27.1110626Z %52 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:27.1110867Z tt.reduce.return %52 : f32 2026-02-21T08:09:27.1111055Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:09:27.1111281Z %30 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:09:27.1111541Z %31 = tt.addptr %30, %27 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:09:27.1111804Z tt.store %31, %29 : tensor<4x!tt.ptr> 2026-02-21T08:09:27.1112038Z %c2_i32 = 
arith.constant 2 : i32 2026-02-21T08:09:27.1112226Z %32 = arith.muli %c1_i32, %c2_i32 : i32 2026-02-21T08:09:27.1112406Z %33 = arith.addi %arg5, %32 : i32 2026-02-21T08:09:27.1112583Z %34 = arith.muli %33, %c4_i32 : i32 2026-02-21T08:09:27.1112796Z %35 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:09:27.1113034Z %36 = tt.splat %34 : i32 -> tensor<4xi32> 2026-02-21T08:09:27.1113220Z %37 = arith.addi %36, %35 : tensor<4xi32> 2026-02-21T08:09:27.1113517Z %38 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:09:27.1113906Z %52 = tt.descriptor_load %0[%34, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:09:27.1114321Z %53 = tt.descriptor_load %1[%34, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:09:27.1114600Z %54 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:09:27.1114948Z %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:09:27.1115305Z %57 = arith.subf %53, %52 : tensor<4x4xf32> 2026-02-21T08:09:27.1115505Z %58 = arith.mulf %56, %57 : tensor<4x4xf32> 2026-02-21T08:09:27.1115697Z %59 = arith.addf %58, %cst : tensor<4x4xf32> 2026-02-21T08:09:27.1115890Z scf.yield %59 : tensor<4x4xf32> 2026-02-21T08:09:27.1116050Z } else { 2026-02-21T08:09:27.1116211Z %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:09:27.1116416Z %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32> 2026-02-21T08:09:27.1116627Z %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32> 2026-02-21T08:09:27.1116884Z %59 = arith.ori %57, %58 : tensor<4x4xi1> 2026-02-21T08:09:27.1117118Z %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:09:27.1117353Z %61 = math.log %60 : tensor<4x4xf32> 2026-02-21T08:09:27.1117538Z %62 = arith.subf %61, %52 : tensor<4x4xf32> 2026-02-21T08:09:27.1117733Z %63 = arith.mulf %53, %62 : tensor<4x4xf32> 
2026-02-21T08:09:27.1117922Z %64 = arith.addf %63, %cst : tensor<4x4xf32> 2026-02-21T08:09:27.1118112Z scf.yield %64 : tensor<4x4xf32> 2026-02-21T08:09:27.1118270Z } 2026-02-21T08:09:27.1118412Z %55 = arith.addf %arg7, %54 : tensor<4x4xf32> 2026-02-21T08:09:27.1118599Z scf.yield %55 : tensor<4x4xf32> 2026-02-21T08:09:27.1118895Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:09:27.1119220Z %39 = "tt.reduce"(%38) <{axis = 1 : i32}> ({ 2026-02-21T08:09:27.1119405Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:27.1119584Z %52 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:27.1119758Z tt.reduce.return %52 : f32 2026-02-21T08:09:27.1119940Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:09:27.1120159Z %40 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:09:27.1120403Z %41 = tt.addptr %40, %37 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:09:27.1120635Z tt.store %41, %39 : tensor<4x!tt.ptr> 2026-02-21T08:09:27.1120822Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:09:27.1121004Z %42 = arith.muli %c1_i32, %c3_i32 : i32 2026-02-21T08:09:27.1121175Z %43 = arith.addi %arg5, %42 : i32 2026-02-21T08:09:27.1121348Z %44 = arith.muli %43, %c4_i32 : i32 2026-02-21T08:09:27.1121555Z %45 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:09:27.1121790Z %46 = tt.splat %44 : i32 -> tensor<4xi32> 2026-02-21T08:09:27.1122031Z %47 = arith.addi %46, %45 : tensor<4xi32> 2026-02-21T08:09:27.1122328Z %48 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:09:27.1122725Z %52 = tt.descriptor_load %0[%44, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:09:27.1123084Z %53 = tt.descriptor_load %1[%44, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:09:27.1123375Z %54 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:09:27.1123735Z %56 = tt.extern_elementwise %53 {libname = "", 
libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:09:27.1124090Z %57 = arith.subf %53, %52 : tensor<4x4xf32> 2026-02-21T08:09:27.1124292Z %58 = arith.mulf %56, %57 : tensor<4x4xf32> 2026-02-21T08:09:27.1124490Z %59 = arith.addf %58, %cst : tensor<4x4xf32> 2026-02-21T08:09:27.1124691Z scf.yield %59 : tensor<4x4xf32> 2026-02-21T08:09:27.1124862Z } else { 2026-02-21T08:09:27.1125101Z %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:09:27.1125318Z %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32> 2026-02-21T08:09:27.1125530Z %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32> 2026-02-21T08:09:27.1125758Z %59 = arith.ori %57, %58 : tensor<4x4xi1> 2026-02-21T08:09:27.1125995Z %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:09:27.1126230Z %61 = math.log %60 : tensor<4x4xf32> 2026-02-21T08:09:27.1126426Z %62 = arith.subf %61, %52 : tensor<4x4xf32> 2026-02-21T08:09:27.1126619Z %63 = arith.mulf %53, %62 : tensor<4x4xf32> 2026-02-21T08:09:27.1126831Z %64 = arith.addf %63, %cst : tensor<4x4xf32> 2026-02-21T08:09:27.1127013Z scf.yield %64 : tensor<4x4xf32> 2026-02-21T08:09:27.1127181Z } 2026-02-21T08:09:27.1127315Z %55 = arith.addf %arg7, %54 : tensor<4x4xf32> 2026-02-21T08:09:27.1127556Z scf.yield %55 : tensor<4x4xf32> 2026-02-21T08:09:27.1127862Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:09:27.1128177Z %49 = "tt.reduce"(%48) <{axis = 1 : i32}> ({ 2026-02-21T08:09:27.1128365Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:27.1128533Z %52 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:27.1128715Z tt.reduce.return %52 : f32 2026-02-21T08:09:27.1128890Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:09:27.1129108Z %50 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:09:27.1129358Z %51 = tt.addptr %50, %47 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:09:27.1129577Z tt.store %51, %49 
: tensor<4x!tt.ptr> 2026-02-21T08:09:27.1129768Z } {tt.disallow_acc_multi_buffer} 2026-02-21T08:09:27.1129954Z scf.for %arg5 = %12 to %4 step %c1_i32 : i32 { 2026-02-21T08:09:27.1130155Z %14 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:09:27.1130370Z %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:09:27.1130602Z %16 = tt.splat %14 : i32 -> tensor<4xi32> 2026-02-21T08:09:27.1130791Z %17 = arith.addi %16, %15 : tensor<4xi32> 2026-02-21T08:09:27.1131075Z %18 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:09:27.1131449Z %22 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:09:27.1131797Z %23 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:09:27.1132110Z %24 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:09:27.1132453Z %26 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:09:27.1132805Z %27 = arith.subf %23, %22 : tensor<4x4xf32> 2026-02-21T08:09:27.1133009Z %28 = arith.mulf %26, %27 : tensor<4x4xf32> 2026-02-21T08:09:27.1133206Z %29 = arith.addf %28, %cst : tensor<4x4xf32> 2026-02-21T08:09:27.1133398Z scf.yield %29 : tensor<4x4xf32> 2026-02-21T08:09:27.1133558Z } else { 2026-02-21T08:09:27.1133718Z %26 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:09:27.1133925Z %27 = arith.cmpf ogt, %23, %26 : tensor<4x4xf32> 2026-02-21T08:09:27.1134146Z %28 = arith.cmpf une, %23, %23 : tensor<4x4xf32> 2026-02-21T08:09:27.1134356Z %29 = arith.ori %27, %28 : tensor<4x4xi1> 2026-02-21T08:09:27.1134584Z %30 = arith.select %29, %23, %26 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:09:27.1134819Z %31 = math.log %30 : tensor<4x4xf32> 2026-02-21T08:09:27.1135002Z %32 = arith.subf %31, %22 : tensor<4x4xf32> 2026-02-21T08:09:27.1135196Z %33 = arith.mulf %23, %32 : tensor<4x4xf32> 
2026-02-21T08:09:27.1135388Z %34 = arith.addf %33, %cst : tensor<4x4xf32> 2026-02-21T08:09:27.1135635Z scf.yield %34 : tensor<4x4xf32> 2026-02-21T08:09:27.1135800Z } 2026-02-21T08:09:27.1135934Z %25 = arith.addf %arg7, %24 : tensor<4x4xf32> 2026-02-21T08:09:27.1136122Z scf.yield %25 : tensor<4x4xf32> 2026-02-21T08:09:27.1136422Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:09:27.1136742Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:09:27.1136925Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:09:27.1137103Z %22 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:09:27.1137279Z tt.reduce.return %22 : f32 2026-02-21T08:09:27.1137459Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:09:27.1137676Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:09:27.1137921Z %21 = tt.addptr %20, %17 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:09:27.1138198Z tt.store %21, %19 : tensor<4x!tt.ptr> 2026-02-21T08:09:27.1138421Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:09:27.1138623Z tt.return 2026-02-21T08:09:27.1138742Z } 2026-02-21T08:09:27.1138860Z } 2026-02-21T08:09:27.1138926Z 2026-02-21T08:09:27.1138983Z {-# 2026-02-21T08:09:27.1139105Z external_resources: { 2026-02-21T08:09:27.1139261Z mlir_reproducer: { 2026-02-21T08:09:27.1143521Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false 
top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:09:27.1147868Z disable_threading: false, 2026-02-21T08:09:27.1148076Z verify_each: true 2026-02-21T08:09:27.1148233Z } 2026-02-21T08:09:27.1148358Z } 2026-02-21T08:09:27.1148495Z #-} 2026-02-21T08:09:27.1148988Z /tmp/torchinductor_root/dc/cdco3tfp4dqsv4m67ihzdj7b22uvz6pscxnp3nmyybieeqfozo77.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:09:27.1150228Z /tmp/torchinductor_root/dc/cdco3tfp4dqsv4m67ihzdj7b22uvz6pscxnp3nmyybieeqfozo77.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:09:27.1151239Z [68s] Triton compile failed. 
This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:09:27.1152498Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'last'], maxnreg=128, num_sm_multiplier=64, num_stages=6, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[0, 3], range_unroll_factors=[4, 1], range_warp_specializes=[False, True]), static_shapes=True) 2026-02-21T08:09:27.1153557Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:09:27.1153812Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:09:28.5226629Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 16.1 configs/s 2026-02-21T08:09:28.5238123Z [69s] Adaptive compile timeout: 30s (90% percentile=2.7s, bounds=[30.0s, 60s]) 2026-02-21T08:09:29.5539385Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 985.5 configs/s 2026-02-21T08:09:29.6109262Z [70s] Initial random population of 100, 5 starting points: 2026-02-21T08:09:29.6111008Z error=9 2026-02-21T08:09:29.6111160Z timeout=3 2026-02-21T08:09:29.6111290Z ok=88 2026-02-21T08:09:29.6111410Z min=0.0482 2026-02-21T08:09:29.6111537Z mid=0.4474 2026-02-21T08:09:29.6111652Z max=23.3493 2026-02-21T08:09:29.6111799Z best={'block_sizes': [256, 1], 2026-02-21T08:09:29.6112124Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:09:29.6112382Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:09:29.6112580Z 'num_sm_multiplier': 64, 2026-02-21T08:09:29.6112734Z 'num_stages': 7, 2026-02-21T08:09:29.6112879Z 'num_warps': 2, 2026-02-21T08:09:29.6113032Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:09:29.6113228Z 'range_flattens': [False, False], 2026-02-21T08:09:29.6113413Z 'range_multi_buffers': [True, True], 2026-02-21T08:09:29.6113600Z 'range_num_stages': [1, 3], 
2026-02-21T08:09:29.6113781Z 'range_unroll_factors': [4, 3], 2026-02-21T08:09:29.6113970Z 'range_warp_specializes': [False, None]} 2026-02-21T08:09:29.6130142Z [70s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:09:30.7586740Z [71s] Generation 1 starting: 85 neighbors, 5 active search path(s) 2026-02-21T08:09:39.2258333Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88/88 3.1 configs/s 2026-02-21T08:09:44.3928397Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 88/88 17.2 configs/s 2026-02-21T08:09:50.5439347Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 167.6 2026-02-21T08:09:50.5443234Z configs/s 2026-02-21T08:09:50.9958506Z [91s] Generation 1 complete: 2026-02-21T08:09:50.9963243Z error=1 2026-02-21T08:09:50.9964984Z ok=89 2026-02-21T08:09:50.9965185Z min=0.0420 2026-02-21T08:09:50.9969786Z mid=0.0564 2026-02-21T08:09:50.9971160Z max=0.5583 2026-02-21T08:09:50.9971330Z best={'block_sizes': [1024, 1], 2026-02-21T08:09:50.9971597Z 'indexing': ['tensor_descriptor', 'pointer', 'pointer'], 2026-02-21T08:09:50.9972196Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:09:50.9976863Z 'num_stages': 6, 2026-02-21T08:09:50.9978527Z 'num_warps': 16, 2026-02-21T08:09:50.9978759Z 'pid_type': 'flat', 2026-02-21T08:09:50.9984036Z 'range_flattens': [None, False], 2026-02-21T08:09:50.9988036Z 'range_multi_buffers': [None, None], 2026-02-21T08:09:50.9991708Z 'range_num_stages': [0, 1], 2026-02-21T08:09:50.9995649Z 'range_unroll_factors': [0, 0], 2026-02-21T08:09:50.9997600Z 'range_warp_specializes': [None, None]} 2026-02-21T08:09:50.9997889Z [91s] Fitting surrogate: 190 points, 190 targets 2026-02-21T08:09:52.3318551Z [93s] Generation 2 starting: 83 neighbors, 5 active search path(s) 2026-02-21T08:09:56.0529604Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 16.4 configs/s 2026-02-21T08:10:01.1281008Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 17.1 configs/s 
2026-02-21T08:10:08.1211338Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 148.9 2026-02-21T08:10:08.1212728Z configs/s 2026-02-21T08:10:08.5232435Z [109s] Generation 2 complete: 2026-02-21T08:10:08.5232785Z ok=89 2026-02-21T08:10:08.5232986Z min=0.0420 2026-02-21T08:10:08.5233176Z mid=0.0481 2026-02-21T08:10:08.5233475Z max=0.2386 2026-02-21T08:10:08.5233743Z best={'block_sizes': [256, 1], 2026-02-21T08:10:08.5234114Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:10:08.5234494Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:10:08.5234806Z 'num_stages': 7, 2026-02-21T08:10:08.5235018Z 'num_warps': 1, 2026-02-21T08:10:08.5235241Z 'pid_type': 'flat', 2026-02-21T08:10:08.5235491Z 'range_flattens': [None, False], 2026-02-21T08:10:08.5235795Z 'range_multi_buffers': [None, None], 2026-02-21T08:10:08.5236100Z 'range_num_stages': [0, 4], 2026-02-21T08:10:08.5236722Z 'range_unroll_factors': [0, 2], 2026-02-21T08:10:08.5237049Z 'range_warp_specializes': [None, None]} 2026-02-21T08:10:08.5247759Z [109s] Fitting surrogate: 279 points, 279 targets 2026-02-21T08:10:09.4625057Z [110s] Generation 3 starting: 72 neighbors, 5 active search path(s) 2026-02-21T08:10:14.4937969Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 6.1 configs/s 2026-02-21T08:10:18.9886983Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 16.8 configs/s 2026-02-21T08:10:25.5608795Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 159.8 2026-02-21T08:10:25.5612730Z configs/s 2026-02-21T08:10:25.9245601Z [126s] Generation 3 complete: 2026-02-21T08:10:25.9249682Z ok=77 2026-02-21T08:10:25.9253589Z min=0.0420 2026-02-21T08:10:25.9258057Z mid=0.0441 2026-02-21T08:10:25.9262581Z max=0.1893 2026-02-21T08:10:25.9267099Z best={'block_sizes': [1024, 1], 2026-02-21T08:10:25.9271239Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:10:25.9271625Z 'load_eviction_policies': 
['first', 'first'], 2026-02-21T08:10:25.9271928Z 'num_sm_multiplier': 64, 2026-02-21T08:10:25.9276701Z 'num_stages': 7, 2026-02-21T08:10:25.9280514Z 'num_warps': 1, 2026-02-21T08:10:25.9285132Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:10:25.9289428Z 'range_flattens': [False, False], 2026-02-21T08:10:25.9290965Z 'range_multi_buffers': [True, True], 2026-02-21T08:10:25.9291193Z 'range_num_stages': [0, 3], 2026-02-21T08:10:25.9291361Z 'range_unroll_factors': [0, 3], 2026-02-21T08:10:25.9291546Z 'range_warp_specializes': [True, None]} 2026-02-21T08:10:25.9291836Z [126s] Fitting surrogate: 356 points, 356 targets 2026-02-21T08:10:26.7770502Z [127s] Generation 4 starting: 48 neighbors, 4 active search path(s) 2026-02-21T08:10:29.3484173Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50/50 11.1 configs/s 2026-02-21T08:10:32.2508841Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 50/50 17.5 configs/s 2026-02-21T08:10:36.5705667Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 236.5 2026-02-21T08:10:36.5707360Z configs/s 2026-02-21T08:10:36.8224670Z [137s] Generation 4 complete: 2026-02-21T08:10:36.8227023Z error=1 2026-02-21T08:10:36.8227154Z ok=52 2026-02-21T08:10:36.8227280Z min=0.0419 2026-02-21T08:10:36.8227403Z mid=0.0440 2026-02-21T08:10:36.8227516Z max=0.0707 2026-02-21T08:10:36.8227651Z best={'block_sizes': [1024, 1], 2026-02-21T08:10:36.8227881Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:10:36.8228147Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:10:36.8228337Z 'num_sm_multiplier': 64, 2026-02-21T08:10:36.8228495Z 'num_stages': 7, 2026-02-21T08:10:36.8228628Z 'num_warps': 1, 2026-02-21T08:10:36.8228784Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:10:36.8228975Z 'range_flattens': [False, False], 2026-02-21T08:10:36.8229182Z 'range_multi_buffers': [True, True], 2026-02-21T08:10:36.8229786Z 'range_num_stages': [0, 3], 2026-02-21T08:10:36.8229952Z 
'range_unroll_factors': [0, 3], 2026-02-21T08:10:36.8230134Z 'range_warp_specializes': [True, None]} 2026-02-21T08:10:36.8237450Z [137s] Fitting surrogate: 409 points, 409 targets 2026-02-21T08:10:37.3216462Z [138s] Generation 5 starting: 24 neighbors, 2 active search path(s) 2026-02-21T08:10:40.7628589Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25/25 3.9 configs/s 2026-02-21T08:10:42.2705342Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 25/25 17.1 configs/s 2026-02-21T08:10:44.7157139Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 462.6 2026-02-21T08:10:44.7160868Z configs/s 2026-02-21T08:10:44.8720037Z [145s] Generation 5 complete: 2026-02-21T08:10:44.8720348Z ok=27 2026-02-21T08:10:44.8720540Z min=0.0420 2026-02-21T08:10:44.8721172Z mid=0.0441 2026-02-21T08:10:44.8721386Z max=0.1976 2026-02-21T08:10:44.8721587Z best={'block_sizes': [1024, 1], 2026-02-21T08:10:44.8722258Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:10:44.8722687Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:10:44.8723008Z 'num_sm_multiplier': 64, 2026-02-21T08:10:44.8723247Z 'num_stages': 7, 2026-02-21T08:10:44.8723466Z 'num_warps': 1, 2026-02-21T08:10:44.8723710Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:10:44.8724004Z 'range_flattens': [False, False], 2026-02-21T08:10:44.8724301Z 'range_multi_buffers': [True, True], 2026-02-21T08:10:44.8724589Z 'range_num_stages': [0, 3], 2026-02-21T08:10:44.8724852Z 'range_unroll_factors': [0, 3], 2026-02-21T08:10:44.8725131Z 'range_warp_specializes': [True, None]} 2026-02-21T08:10:44.8742106Z [145s] Fitting surrogate: 436 points, 436 targets 2026-02-21T08:10:45.2098794Z [146s] Generation 6 starting: 11 neighbors, 1 active search path(s) 2026-02-21T08:10:46.6141522Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 10.5 configs/s 2026-02-21T08:10:47.2602137Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 11/11 18.4 
configs/s 2026-02-21T08:10:48.2960374Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 972.6 2026-02-21T08:10:48.2964316Z configs/s 2026-02-21T08:10:48.3748171Z [149s] Generation 6 complete: 2026-02-21T08:10:48.3752556Z ok=13 2026-02-21T08:10:48.3756931Z min=0.0420 2026-02-21T08:10:48.3760956Z mid=0.0420 2026-02-21T08:10:48.3763052Z max=0.0727 2026-02-21T08:10:48.3763262Z best={'block_sizes': [1024, 1], 2026-02-21T08:10:48.3768209Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:10:48.3769704Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:10:48.3769975Z 'num_sm_multiplier': 64, 2026-02-21T08:10:48.3772919Z 'num_stages': 7, 2026-02-21T08:10:48.3773145Z 'num_warps': 1, 2026-02-21T08:10:48.3777372Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:10:48.3782492Z 'range_flattens': [False, False], 2026-02-21T08:10:48.3784024Z 'range_multi_buffers': [True, True], 2026-02-21T08:10:48.3784293Z 'range_num_stages': [0, 3], 2026-02-21T08:10:48.3789543Z 'range_unroll_factors': [0, 3], 2026-02-21T08:10:48.3791126Z 'range_warp_specializes': [True, None]} 2026-02-21T08:10:48.3791418Z [149s] Fitting surrogate: 449 points, 449 targets 2026-02-21T08:10:48.5521459Z [149s] Autotuning complete in 149.5s after searching 432 configs. 
2026-02-21T08:10:48.5521773Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:10:48.5522978Z @helion.kernel(config=helion.Config(block_sizes=[1024, 1], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_sm_multiplier=64, num_stages=7, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, True], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:10:48.5524195Z 2026-02-21T08:10:48.5524438Z [149s] Code of selected kernel: /tmp/torchinductor_root/6p/c6pg5h2vgc4upe4bcu6zqjxnou4tx4etxnen3szzszsx4spcfpfc.py 2026-02-21T08:10:49.4870238Z WARNING:tritonbench.utils.triton_op:Completed input ID 0: 2026-02-21T08:10:49.4870521Z (B, T, V) 2026-02-21T08:10:49.4870658Z -------------- 2026-02-21T08:10:49.4870805Z (8, 512, 4096) 2026-02-21T08:10:49.4870950Z 2026-02-21T08:10:49.4889042Z 17%|█▋ | 1/6 [02:37<13:06, 157.33s/it]WARNING:tritonbench.utils.triton_op:Running input ID 1: 2026-02-21T08:10:49.4893116Z (B, T, V) 2026-02-21T08:10:49.4897566Z -------------- 2026-02-21T08:10:49.4899026Z (8, 512, 8192) 2026-02-21T08:10:49.4899386Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for torch_kl_div 2026-02-21T08:10:50.6456954Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for liger_kl_div 2026-02-21T08:10:51.9928881Z INFO:tritonbench.utils.triton_op:Took 2.71ms to get benchmark function for torch_compile_kl_div 2026-02-21T08:10:54.8907807Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:10:54.8911910Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:10:54.8913251Z 'dtype': 'torch.float32', 2026-02-21T08:10:54.8913469Z 'shape': (4096, 8192), 2026-02-21T08:10:54.8913652Z 'stride': (8192, 1)}, 2026-02-21T08:10:54.8913819Z { 'device': 'cuda:0', 2026-02-21T08:10:54.8913999Z 'dtype': 'torch.float32', 2026-02-21T08:10:54.8914174Z 
'shape': (4096, 8192), 2026-02-21T08:10:54.8914348Z 'stride': (8192, 1)}), 2026-02-21T08:10:54.8914503Z 'kwargs': {}} 2026-02-21T08:10:54.8927652Z INFO:tritonbench.utils.triton_op:Took 2.47ms to get benchmark function for helion_kl_div_tritonbench 2026-02-21T08:10:55.1297852Z [0s] Autotune random seed: 2134765727 2026-02-21T08:10:55.2897126Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:11:27.4924418Z [32s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=32, num_sm_multiplier=128, num_stages=5, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[2, 4], range_unroll_factors=[0, 1], range_warp_specializes=[False, None]) 2026-02-21T08:11:27.8236028Z [32s] Timeout after 30s compiling Config(block_sizes=[128, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', ''], num_stages=8, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]) 2026-02-21T08:11:28.2982647Z [33s] Timeout after 30s compiling Config(block_sizes=[2048, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', ''], maxnreg=64, num_sm_multiplier=128, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[None, None]) 2026-02-21T08:11:28.4597333Z [33s] Timeout after 30s compiling Config(block_sizes=[512, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, 
num_sm_multiplier=128, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, None], range_num_stages=[4, 1], range_unroll_factors=[3, 4], range_warp_specializes=[False, None]) 2026-02-21T08:11:28.5632963Z [33s] Timeout after 30s compiling Config(block_sizes=[1024, 16], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'first'], num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[4, 4], range_unroll_factors=[3, 0], range_warp_specializes=[False, False]) 2026-02-21T08:11:28.8492042Z [33s] Timeout after 30s compiling Config(block_sizes=[512, 128], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', ''], num_sm_multiplier=64, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, True], range_num_stages=[4, 1], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]) 2026-02-21T08:11:28.8509287Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s 2026-02-21T08:11:29.0294922Z module { 2026-02-21T08:11:29.0299529Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:11:29.0300506Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:11:29.0300708Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:11:29.0300949Z %cst = arith.constant dense<0.000000e+00> : tensor<512x32xf32> 2026-02-21T08:11:29.0301185Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:11:29.0301380Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:11:29.0301577Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:11:29.0301758Z %c8192_i64 = arith.constant 8192 : i64 
2026-02-21T08:11:29.0302057Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:11:29.0305539Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:11:29.0309560Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:11:29.0313413Z %2 = tt.get_program_id x : i32 2026-02-21T08:11:29.0316681Z %3 = arith.muli %2, %c512_i32 : i32 2026-02-21T08:11:29.0320756Z %4 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:11:29.0322543Z %5 = tt.splat %3 : i32 -> tensor<512xi32> 2026-02-21T08:11:29.0322760Z %6 = arith.addi %5, %4 : tensor<512xi32> 2026-02-21T08:11:29.0323080Z %7 = scf.for %arg5 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg6 = %cst) -> (tensor<512x32xf32>) : i32 { 2026-02-21T08:11:29.0323504Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc> -> tensor<512x32xf32> 2026-02-21T08:11:29.0323871Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc> -> tensor<512x32xf32> 2026-02-21T08:11:29.0324154Z %13 = scf.if %arg3 -> (tensor<512x32xf32>) { 2026-02-21T08:11:29.0324526Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x32xf32>) -> tensor<512x32xf32> 2026-02-21T08:11:29.0324894Z %16 = arith.subf %12, %11 : tensor<512x32xf32> 2026-02-21T08:11:29.0325099Z %17 = arith.mulf %15, %16 : tensor<512x32xf32> 2026-02-21T08:11:29.0325316Z %18 = arith.addf %17, %cst : tensor<512x32xf32> 2026-02-21T08:11:29.0325514Z scf.yield %18 : tensor<512x32xf32> 2026-02-21T08:11:29.0325690Z } else { 2026-02-21T08:11:29.0325847Z %15 = tt.splat %arg4 : f32 -> tensor<512x32xf32> 2026-02-21T08:11:29.0326072Z %16 = arith.cmpf ogt, %12, %15 : tensor<512x32xf32> 2026-02-21T08:11:29.0326295Z %17 = arith.cmpf une, %12, %12 : tensor<512x32xf32> 2026-02-21T08:11:29.0326501Z %18 = arith.ori %16, %17 : tensor<512x32xi1> 2026-02-21T08:11:29.0326743Z %19 = arith.select %18, %12, %15 : 
tensor<512x32xi1>, tensor<512x32xf32> 2026-02-21T08:11:29.0326982Z %20 = math.log %19 : tensor<512x32xf32> 2026-02-21T08:11:29.0327183Z %21 = arith.subf %20, %11 : tensor<512x32xf32> 2026-02-21T08:11:29.0327381Z %22 = arith.mulf %12, %21 : tensor<512x32xf32> 2026-02-21T08:11:29.0327596Z %23 = arith.addf %22, %cst : tensor<512x32xf32> 2026-02-21T08:11:29.0328046Z scf.yield %23 : tensor<512x32xf32> 2026-02-21T08:11:29.0328218Z } 2026-02-21T08:11:29.0328375Z %14 = arith.addf %arg6, %13 : tensor<512x32xf32> 2026-02-21T08:11:29.0328569Z scf.yield %14 : tensor<512x32xf32> 2026-02-21T08:11:29.0328895Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:11:29.0329222Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:11:29.0329421Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:11:29.0329598Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:11:29.0329777Z tt.reduce.return %11 : f32 2026-02-21T08:11:29.0329961Z }) : (tensor<512x32xf32>) -> tensor<512xf32> 2026-02-21T08:11:29.0330184Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<512x!tt.ptr> 2026-02-21T08:11:29.0330450Z %10 = tt.addptr %9, %6 : tensor<512x!tt.ptr>, tensor<512xi32> 2026-02-21T08:11:29.0330805Z tt.store %10, %8 : tensor<512x!tt.ptr> 2026-02-21T08:11:29.0330999Z tt.return 2026-02-21T08:11:29.0331129Z } 2026-02-21T08:11:29.0331260Z } 2026-02-21T08:11:29.0331329Z 2026-02-21T08:11:29.0331390Z {-# 2026-02-21T08:11:29.0331522Z external_resources: { 2026-02-21T08:11:29.0331684Z mlir_reproducer: { 2026-02-21T08:11:29.0335992Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, 
triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:11:29.0340312Z disable_threading: false, 2026-02-21T08:11:29.0340472Z verify_each: true 2026-02-21T08:11:29.0340621Z } 2026-02-21T08:11:29.0340736Z } 2026-02-21T08:11:29.0340853Z #-} 2026-02-21T08:11:29.0341264Z /tmp/torchinductor_root/6v/c6vcceyrhox3eack2pqh6el7mbxcvppgwliuhit5zyi62nwxsnqp.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:11:29.0342539Z /tmp/torchinductor_root/6v/c6vcceyrhox3eack2pqh6el7mbxcvppgwliuhit5zyi62nwxsnqp.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, 
`TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:11:29.0343519Z [33s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:11:29.0344499Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:11:29.0345469Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:11:29.0345726Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:11:29.6553584Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:11:29.6558997Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:11:29.6560058Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:11:29.6560638Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:11:29.6560855Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:11:29.6561042Z %c2368_i32 = arith.constant 2368 : i32 2026-02-21T08:11:29.6561266Z %cst = arith.constant dense<0.000000e+00> : tensor<16x32xf32> 2026-02-21T08:11:29.6561495Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:11:29.6561667Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:11:29.6562023Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:11:29.6562217Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:11:29.6562394Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:11:29.6562711Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8192_i32], [%c8192_i64, 
%c1_i64] : , > 2026-02-21T08:11:29.6563133Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:11:29.6563443Z %2 = tt.get_program_id x : i32 2026-02-21T08:11:29.6563622Z %3 = arith.subi %c256_i32, %2 : i32 2026-02-21T08:11:29.6563809Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:11:29.6563996Z %4 = arith.subi %c2368_i32, %c1_i32 : i32 2026-02-21T08:11:29.6564177Z %5 = arith.addi %3, %4 : i32 2026-02-21T08:11:29.6564360Z %6 = arith.divui %5, %c2368_i32 : i32 2026-02-21T08:11:29.6564543Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:11:29.6564721Z %7 = arith.remsi %6, %c4_i32 : i32 2026-02-21T08:11:29.6564890Z %8 = arith.subi %6, %7 : i32 2026-02-21T08:11:29.6565059Z %9 = arith.muli %8, %c2368_i32 : i32 2026-02-21T08:11:29.6565231Z %10 = arith.addi %2, %9 : i32 2026-02-21T08:11:29.6565414Z %11 = arith.muli %c2368_i32, %c4_i32 : i32 2026-02-21T08:11:29.6565618Z scf.for %arg5 = %2 to %10 step %11 : i32 { 2026-02-21T08:11:29.6565813Z %12 = arith.muli %arg5, %c16_i32 : i32 2026-02-21T08:11:29.6566048Z %13 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:11:29.6566297Z %14 = tt.splat %12 : i32 -> tensor<16xi32> 2026-02-21T08:11:29.6566500Z %15 = arith.addi %14, %13 : tensor<16xi32> 2026-02-21T08:11:29.6566816Z %16 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>) : i32 { 2026-02-21T08:11:29.6567226Z %50 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:11:29.6567597Z %51 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:11:29.6567885Z %52 = scf.if %arg3 -> (tensor<16x32xf32>) { 2026-02-21T08:11:29.6568259Z %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32> 2026-02-21T08:11:29.6568622Z %55 = arith.subf %51, %50 : tensor<16x32xf32> 2026-02-21T08:11:29.6568836Z %56 = 
arith.mulf %54, %55 : tensor<16x32xf32> 2026-02-21T08:11:29.6569051Z %57 = arith.addf %56, %cst : tensor<16x32xf32> 2026-02-21T08:11:29.6569250Z scf.yield %57 : tensor<16x32xf32> 2026-02-21T08:11:29.6569548Z } else { 2026-02-21T08:11:29.6569705Z %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32> 2026-02-21T08:11:29.6569929Z %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32> 2026-02-21T08:11:29.6570144Z %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32> 2026-02-21T08:11:29.6570358Z %57 = arith.ori %55, %56 : tensor<16x32xi1> 2026-02-21T08:11:29.6570589Z %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32> 2026-02-21T08:11:29.6570833Z %59 = math.log %58 : tensor<16x32xf32> 2026-02-21T08:11:29.6571042Z %60 = arith.subf %59, %50 : tensor<16x32xf32> 2026-02-21T08:11:29.6571247Z %61 = arith.mulf %51, %60 : tensor<16x32xf32> 2026-02-21T08:11:29.6571460Z %62 = arith.addf %61, %cst : tensor<16x32xf32> 2026-02-21T08:11:29.6571657Z scf.yield %62 : tensor<16x32xf32> 2026-02-21T08:11:29.6571835Z } 2026-02-21T08:11:29.6572085Z %53 = arith.addf %arg7, %52 : tensor<16x32xf32> 2026-02-21T08:11:29.6572296Z scf.yield %53 : tensor<16x32xf32> 2026-02-21T08:11:29.6572521Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 2026-02-21T08:11:29.6572744Z %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({ 2026-02-21T08:11:29.6572939Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:29.6573114Z %50 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:11:29.6573302Z tt.reduce.return %50 : f32 2026-02-21T08:11:29.6573486Z }) : (tensor<16x32xf32>) -> tensor<16xf32> 2026-02-21T08:11:29.6573724Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:11:29.6573987Z %19 = tt.addptr %18, %15 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:11:29.6574234Z tt.store %19, %17 : tensor<16x!tt.ptr> 2026-02-21T08:11:29.6574439Z %c1_i32_0 = arith.constant 1 : i32 2026-02-21T08:11:29.6574628Z %20 = arith.muli %c2368_i32, %c1_i32_0 : i32 2026-02-21T08:11:29.6574828Z %21 = arith.addi %arg5, %20 
: i32 2026-02-21T08:11:29.6575013Z %22 = arith.muli %21, %c16_i32 : i32 2026-02-21T08:11:29.6575249Z %23 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:11:29.6575495Z %24 = tt.splat %22 : i32 -> tensor<16xi32> 2026-02-21T08:11:29.6575702Z %25 = arith.addi %24, %23 : tensor<16xi32> 2026-02-21T08:11:29.6576025Z %26 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>) : i32 { 2026-02-21T08:11:29.6576476Z %50 = tt.descriptor_load %0[%22, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:11:29.6576854Z %51 = tt.descriptor_load %1[%22, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:11:29.6577152Z %52 = scf.if %arg3 -> (tensor<16x32xf32>) { 2026-02-21T08:11:29.6577525Z %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32> 2026-02-21T08:11:29.6577901Z %55 = arith.subf %51, %50 : tensor<16x32xf32> 2026-02-21T08:11:29.6578116Z %56 = arith.mulf %54, %55 : tensor<16x32xf32> 2026-02-21T08:11:29.6578322Z %57 = arith.addf %56, %cst : tensor<16x32xf32> 2026-02-21T08:11:29.6578527Z scf.yield %57 : tensor<16x32xf32> 2026-02-21T08:11:29.6578699Z } else { 2026-02-21T08:11:29.6578871Z %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32> 2026-02-21T08:11:29.6579092Z %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32> 2026-02-21T08:11:29.6579321Z %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32> 2026-02-21T08:11:29.6579539Z %57 = arith.ori %55, %56 : tensor<16x32xi1> 2026-02-21T08:11:29.6579782Z %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32> 2026-02-21T08:11:29.6580017Z %59 = math.log %58 : tensor<16x32xf32> 2026-02-21T08:11:29.6580205Z %60 = arith.subf %59, %50 : tensor<16x32xf32> 2026-02-21T08:11:29.6580476Z %61 = arith.mulf %51, %60 : tensor<16x32xf32> 2026-02-21T08:11:29.6580668Z %62 = arith.addf %61, %cst : tensor<16x32xf32> 2026-02-21T08:11:29.6580863Z scf.yield %62 : 
tensor<16x32xf32> 2026-02-21T08:11:29.6581032Z } 2026-02-21T08:11:29.6581170Z %53 = arith.addf %arg7, %52 : tensor<16x32xf32> 2026-02-21T08:11:29.6581361Z scf.yield %53 : tensor<16x32xf32> 2026-02-21T08:11:29.6581565Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 2026-02-21T08:11:29.6581779Z %27 = "tt.reduce"(%26) <{axis = 1 : i32}> ({ 2026-02-21T08:11:29.6581992Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:29.6582173Z %50 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:11:29.6582353Z tt.reduce.return %50 : f32 2026-02-21T08:11:29.6582537Z }) : (tensor<16x32xf32>) -> tensor<16xf32> 2026-02-21T08:11:29.6582766Z %28 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:11:29.6583073Z %29 = tt.addptr %28, %25 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:11:29.6583316Z tt.store %29, %27 : tensor<16x!tt.ptr> 2026-02-21T08:11:29.6583511Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:11:29.6583697Z %30 = arith.muli %c2368_i32, %c2_i32 : i32 2026-02-21T08:11:29.6583881Z %31 = arith.addi %arg5, %30 : i32 2026-02-21T08:11:29.6584061Z %32 = arith.muli %31, %c16_i32 : i32 2026-02-21T08:11:29.6584288Z %33 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:11:29.6584530Z %34 = tt.splat %32 : i32 -> tensor<16xi32> 2026-02-21T08:11:29.6584723Z %35 = arith.addi %34, %33 : tensor<16xi32> 2026-02-21T08:11:29.6585024Z %36 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>) : i32 { 2026-02-21T08:11:29.6585420Z %50 = tt.descriptor_load %0[%32, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:11:29.6585826Z %51 = tt.descriptor_load %1[%32, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:11:29.6586122Z %52 = scf.if %arg3 -> (tensor<16x32xf32>) { 2026-02-21T08:11:29.6586483Z %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32> 2026-02-21T08:11:29.6586852Z %55 = arith.subf %51, %50 
: tensor<16x32xf32> 2026-02-21T08:11:29.6587056Z %56 = arith.mulf %54, %55 : tensor<16x32xf32> 2026-02-21T08:11:29.6587269Z %57 = arith.addf %56, %cst : tensor<16x32xf32> 2026-02-21T08:11:29.6587461Z scf.yield %57 : tensor<16x32xf32> 2026-02-21T08:11:29.6587633Z } else { 2026-02-21T08:11:29.6587792Z %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32> 2026-02-21T08:11:29.6588013Z %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32> 2026-02-21T08:11:29.6588239Z %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32> 2026-02-21T08:11:29.6588450Z %57 = arith.ori %55, %56 : tensor<16x32xi1> 2026-02-21T08:11:29.6588697Z %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32> 2026-02-21T08:11:29.6588934Z %59 = math.log %58 : tensor<16x32xf32> 2026-02-21T08:11:29.6589136Z %60 = arith.subf %59, %50 : tensor<16x32xf32> 2026-02-21T08:11:29.6589332Z %61 = arith.mulf %51, %60 : tensor<16x32xf32> 2026-02-21T08:11:29.6589540Z %62 = arith.addf %61, %cst : tensor<16x32xf32> 2026-02-21T08:11:29.6589743Z scf.yield %62 : tensor<16x32xf32> 2026-02-21T08:11:29.6589911Z } 2026-02-21T08:11:29.6590063Z %53 = arith.addf %arg7, %52 : tensor<16x32xf32> 2026-02-21T08:11:29.6590255Z scf.yield %53 : tensor<16x32xf32> 2026-02-21T08:11:29.6590467Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 2026-02-21T08:11:29.6590684Z %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({ 2026-02-21T08:11:29.6590879Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:29.6591060Z %50 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:11:29.6591289Z tt.reduce.return %50 : f32 2026-02-21T08:11:29.6591470Z }) : (tensor<16x32xf32>) -> tensor<16xf32> 2026-02-21T08:11:29.6591762Z %38 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:11:29.6592051Z %39 = tt.addptr %38, %35 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:11:29.6592276Z tt.store %39, %37 : tensor<16x!tt.ptr> 2026-02-21T08:11:29.6592473Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:11:29.6592648Z %40 = arith.muli %c2368_i32, %c3_i32 : i32 
2026-02-21T08:11:29.6592835Z %41 = arith.addi %arg5, %40 : i32 2026-02-21T08:11:29.6593015Z %42 = arith.muli %41, %c16_i32 : i32 2026-02-21T08:11:29.6593233Z %43 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:11:29.6593470Z %44 = tt.splat %42 : i32 -> tensor<16xi32> 2026-02-21T08:11:29.6593653Z %45 = arith.addi %44, %43 : tensor<16xi32> 2026-02-21T08:11:29.6594013Z %46 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>) : i32 { 2026-02-21T08:11:29.6594403Z %50 = tt.descriptor_load %0[%42, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:11:29.6594764Z %51 = tt.descriptor_load %1[%42, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:11:29.6595049Z %52 = scf.if %arg3 -> (tensor<16x32xf32>) { 2026-02-21T08:11:29.6595399Z %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32> 2026-02-21T08:11:29.6595760Z %55 = arith.subf %51, %50 : tensor<16x32xf32> 2026-02-21T08:11:29.6595957Z %56 = arith.mulf %54, %55 : tensor<16x32xf32> 2026-02-21T08:11:29.6596163Z %57 = arith.addf %56, %cst : tensor<16x32xf32> 2026-02-21T08:11:29.6596359Z scf.yield %57 : tensor<16x32xf32> 2026-02-21T08:11:29.6596525Z } else { 2026-02-21T08:11:29.6596692Z %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32> 2026-02-21T08:11:29.6596906Z %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32> 2026-02-21T08:11:29.6597133Z %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32> 2026-02-21T08:11:29.6597337Z %57 = arith.ori %55, %56 : tensor<16x32xi1> 2026-02-21T08:11:29.6597571Z %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32> 2026-02-21T08:11:29.6597810Z %59 = math.log %58 : tensor<16x32xf32> 2026-02-21T08:11:29.6597997Z %60 = arith.subf %59, %50 : tensor<16x32xf32> 2026-02-21T08:11:29.6598194Z %61 = arith.mulf %51, %60 : tensor<16x32xf32> 2026-02-21T08:11:29.6598387Z %62 = arith.addf %61, %cst : 
tensor<16x32xf32> 2026-02-21T08:11:29.6598582Z scf.yield %62 : tensor<16x32xf32> 2026-02-21T08:11:29.6598744Z } 2026-02-21T08:11:29.6598889Z %53 = arith.addf %arg7, %52 : tensor<16x32xf32> 2026-02-21T08:11:29.6599079Z scf.yield %53 : tensor<16x32xf32> 2026-02-21T08:11:29.6599290Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 2026-02-21T08:11:29.6599503Z %47 = "tt.reduce"(%46) <{axis = 1 : i32}> ({ 2026-02-21T08:11:29.6599683Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:29.6599863Z %50 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:11:29.6600039Z tt.reduce.return %50 : f32 2026-02-21T08:11:29.6600221Z }) : (tensor<16x32xf32>) -> tensor<16xf32> 2026-02-21T08:11:29.6600437Z %48 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:11:29.6600696Z %49 = tt.addptr %48, %45 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:11:29.6600923Z tt.store %49, %47 : tensor<16x!tt.ptr> 2026-02-21T08:11:29.6601107Z } 2026-02-21T08:11:29.6601272Z scf.for %arg5 = %10 to %c256_i32 step %c2368_i32 : i32 { 2026-02-21T08:11:29.6601480Z %12 = arith.muli %arg5, %c16_i32 : i32 2026-02-21T08:11:29.6601707Z %13 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:11:29.6602023Z %14 = tt.splat %12 : i32 -> tensor<16xi32> 2026-02-21T08:11:29.6602212Z %15 = arith.addi %14, %13 : tensor<16xi32> 2026-02-21T08:11:29.6602507Z %16 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>) : i32 { 2026-02-21T08:11:29.6602899Z %20 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:11:29.6603259Z %21 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:11:29.6603537Z %22 = scf.if %arg3 -> (tensor<16x32xf32>) { 2026-02-21T08:11:29.6603896Z %24 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32> 2026-02-21T08:11:29.6604253Z %25 = arith.subf %21, %20 : 
tensor<16x32xf32> 2026-02-21T08:11:29.6604510Z %26 = arith.mulf %24, %25 : tensor<16x32xf32> 2026-02-21T08:11:29.6604721Z %27 = arith.addf %26, %cst : tensor<16x32xf32> 2026-02-21T08:11:29.6604908Z scf.yield %27 : tensor<16x32xf32> 2026-02-21T08:11:29.6605076Z } else { 2026-02-21T08:11:29.6605230Z %24 = tt.splat %arg4 : f32 -> tensor<16x32xf32> 2026-02-21T08:11:29.6605448Z %25 = arith.cmpf ogt, %21, %24 : tensor<16x32xf32> 2026-02-21T08:11:29.6605661Z %26 = arith.cmpf une, %21, %21 : tensor<16x32xf32> 2026-02-21T08:11:29.6605875Z %27 = arith.ori %25, %26 : tensor<16x32xi1> 2026-02-21T08:11:29.6606108Z %28 = arith.select %27, %21, %24 : tensor<16x32xi1>, tensor<16x32xf32> 2026-02-21T08:11:29.6606338Z %29 = math.log %28 : tensor<16x32xf32> 2026-02-21T08:11:29.6606532Z %30 = arith.subf %29, %20 : tensor<16x32xf32> 2026-02-21T08:11:29.6606725Z %31 = arith.mulf %21, %30 : tensor<16x32xf32> 2026-02-21T08:11:29.6606934Z %32 = arith.addf %31, %cst : tensor<16x32xf32> 2026-02-21T08:11:29.6607124Z scf.yield %32 : tensor<16x32xf32> 2026-02-21T08:11:29.6607293Z } 2026-02-21T08:11:29.6607432Z %23 = arith.addf %arg7, %22 : tensor<16x32xf32> 2026-02-21T08:11:29.6607630Z scf.yield %23 : tensor<16x32xf32> 2026-02-21T08:11:29.6607844Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 2026-02-21T08:11:29.6608053Z %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({ 2026-02-21T08:11:29.6608239Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:29.6608405Z %20 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:11:29.6608583Z tt.reduce.return %20 : f32 2026-02-21T08:11:29.6608757Z }) : (tensor<16x32xf32>) -> tensor<16xf32> 2026-02-21T08:11:29.6608980Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:11:29.6609237Z %19 = tt.addptr %18, %15 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:11:29.6609463Z tt.store %19, %17 : tensor<16x!tt.ptr> 2026-02-21T08:11:29.6609662Z } {tt.num_stages = 1 : i32} 2026-02-21T08:11:29.6609816Z tt.return 2026-02-21T08:11:29.6609945Z } 
2026-02-21T08:11:29.6610058Z } 2026-02-21T08:11:29.6610131Z 2026-02-21T08:11:29.6610180Z {-# 2026-02-21T08:11:29.6610303Z external_resources: { 2026-02-21T08:11:29.6610457Z mlir_reproducer: { 2026-02-21T08:11:29.6614788Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=1}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=1}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=1}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 
2026-02-21T08:11:29.6619480Z disable_threading: false, 2026-02-21T08:11:29.6619689Z verify_each: true 2026-02-21T08:11:29.6619866Z } 2026-02-21T08:11:29.6620014Z } 2026-02-21T08:11:29.6620145Z #-} 2026-02-21T08:11:29.6620637Z /tmp/torchinductor_root/np/cnp3647mdqcwjvzyuqilvqr5f6dahdv7ylecezdqdpc2zi24si6c.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:11:29.6622000Z /tmp/torchinductor_root/np/cnp3647mdqcwjvzyuqilvqr5f6dahdv7ylecezdqdpc2zi24si6c.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:11:29.6623057Z [34s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:11:29.6624197Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=32, num_sm_multiplier=16, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, False], range_num_stages=[0, 0], range_unroll_factors=[4, 0], range_warp_specializes=[False, True]), static_shapes=True) 2026-02-21T08:11:29.6625246Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:11:29.6625502Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:11:32.4247265Z module { 2026-02-21T08:11:32.4248125Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:11:32.4248867Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:11:32.4249527Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:11:32.4253262Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:11:32.4257426Z %cst = arith.constant dense<0.000000e+00> : tensor<16x256xf32> 2026-02-21T08:11:32.4261554Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:11:32.4265460Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:11:32.4269342Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:11:32.4272834Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:11:32.4275474Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:11:32.4275835Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:11:32.4276300Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:11:32.4276624Z %2 = tt.get_program_id x : i32 2026-02-21T08:11:32.4276828Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:11:32.4277076Z %4 = arith.minsi %3, %c256_i32 : i32 2026-02-21T08:11:32.4277738Z scf.for %arg5 = %2 to %4 step %c1_i32 : i32 { 2026-02-21T08:11:32.4278043Z %5 = arith.muli %arg5, %c16_i32 : i32 2026-02-21T08:11:32.4278324Z %6 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:11:32.4278617Z %7 = tt.splat %5 : i32 -> tensor<16xi32> 2026-02-21T08:11:32.4278839Z %8 = arith.addi %7, %6 : tensor<16xi32> 2026-02-21T08:11:32.4279281Z %9 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c256_i32 iter_args(%arg7 = %cst) -> (tensor<16x256xf32>) : i32 { 2026-02-21T08:11:32.4279937Z %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc> -> tensor<16x256xf32> 2026-02-21T08:11:32.4280519Z 
%14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc> -> tensor<16x256xf32> 2026-02-21T08:11:32.4280920Z %15 = scf.if %arg3 -> (tensor<16x256xf32>) { 2026-02-21T08:11:32.4281629Z %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32> 2026-02-21T08:11:32.4282093Z %18 = arith.subf %14, %13 : tensor<16x256xf32> 2026-02-21T08:11:32.4282300Z %19 = arith.mulf %17, %18 : tensor<16x256xf32> 2026-02-21T08:11:32.4282538Z %20 = arith.addf %19, %cst : tensor<16x256xf32> 2026-02-21T08:11:32.4282745Z scf.yield %20 : tensor<16x256xf32> 2026-02-21T08:11:32.4282912Z } else { 2026-02-21T08:11:32.4283081Z %17 = tt.splat %arg4 : f32 -> tensor<16x256xf32> 2026-02-21T08:11:32.4283299Z %18 = arith.cmpf ogt, %14, %17 : tensor<16x256xf32> 2026-02-21T08:11:32.4283527Z %19 = arith.cmpf une, %14, %14 : tensor<16x256xf32> 2026-02-21T08:11:32.4283745Z %20 = arith.ori %18, %19 : tensor<16x256xi1> 2026-02-21T08:11:32.4283984Z %21 = arith.select %20, %14, %17 : tensor<16x256xi1>, tensor<16x256xf32> 2026-02-21T08:11:32.4284234Z %22 = math.log %21 : tensor<16x256xf32> 2026-02-21T08:11:32.4284433Z %23 = arith.subf %22, %13 : tensor<16x256xf32> 2026-02-21T08:11:32.4284638Z %24 = arith.mulf %14, %23 : tensor<16x256xf32> 2026-02-21T08:11:32.4284837Z %25 = arith.addf %24, %cst : tensor<16x256xf32> 2026-02-21T08:11:32.4285039Z scf.yield %25 : tensor<16x256xf32> 2026-02-21T08:11:32.4285213Z } 2026-02-21T08:11:32.4285355Z %16 = arith.addf %arg7, %15 : tensor<16x256xf32> 2026-02-21T08:11:32.4285552Z scf.yield %16 : tensor<16x256xf32> 2026-02-21T08:11:32.4285768Z } {tt.flatten, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:11:32.4286000Z %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({ 2026-02-21T08:11:32.4286183Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:32.4286361Z %13 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:11:32.4286538Z tt.reduce.return %13 : f32 2026-02-21T08:11:32.4286723Z }) : 
(tensor<16x256xf32>) -> tensor<16xf32> 2026-02-21T08:11:32.4286955Z %11 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:11:32.4287210Z %12 = tt.addptr %11, %8 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:11:32.4287445Z tt.store %12, %10 : tensor<16x!tt.ptr> 2026-02-21T08:11:32.4287619Z } 2026-02-21T08:11:32.4287740Z tt.return 2026-02-21T08:11:32.4287858Z } 2026-02-21T08:11:32.4287976Z } 2026-02-21T08:11:32.4288039Z 2026-02-21T08:11:32.4288085Z {-# 2026-02-21T08:11:32.4288211Z external_resources: { 2026-02-21T08:11:32.4288365Z mlir_reproducer: { 2026-02-21T08:11:32.4292704Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, 
triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:11:32.4297234Z disable_threading: false, 2026-02-21T08:11:32.4297410Z verify_each: true 2026-02-21T08:11:32.4297553Z } 2026-02-21T08:11:32.4297662Z } 2026-02-21T08:11:32.4297781Z #-} 2026-02-21T08:11:32.4298184Z /tmp/torchinductor_root/dc/cdcak6bwi27fdqg3bzvvkvgo6selj43m4xajbp5b6gu3kl7zznvk.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:11:32.4299360Z /tmp/torchinductor_root/dc/cdcak6bwi27fdqg3bzvvkvgo6selj43m4xajbp5b6gu3kl7zznvk.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:11:32.4300322Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:11:32.4301334Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:11:32.4302310Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:11:32.4302567Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:11:33.8815557Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:11:33.8816279Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:11:33.8816892Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:11:33.8817091Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:11:33.8817283Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:11:33.8817503Z %cst = arith.constant dense<0.000000e+00> : tensor<4x4xf32> 2026-02-21T08:11:33.8817736Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:11:33.8817928Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:11:33.8818115Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:11:33.8818304Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:11:33.8818485Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:11:33.8818811Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:11:33.8819259Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:11:33.8819895Z %2 = tt.get_program_id x : i32 2026-02-21T08:11:33.8820088Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:11:33.8820279Z %4 = arith.minsi %3, %c1024_i32 : i32 2026-02-21T08:11:33.8820473Z %5 = arith.subi %4, %2 : i32 2026-02-21T08:11:33.8820648Z %c1_i32_0 = arith.constant 1 : i32 2026-02-21T08:11:33.8820845Z %6 = arith.subi %c1_i32, %c1_i32_0 : i32 2026-02-21T08:11:33.8821028Z %7 = arith.addi %5, %6 : i32 2026-02-21T08:11:33.8821203Z %8 = arith.divui %7, %c1_i32 : i32 2026-02-21T08:11:33.8821380Z %c4_i32_1 = arith.constant 4 : i32 2026-02-21T08:11:33.8821569Z %9 = arith.remsi %8, %c4_i32_1 : i32 2026-02-21T08:11:33.8821753Z %10 = arith.subi %8, %9 : i32 2026-02-21T08:11:33.8825274Z %11 = arith.muli %10, %c1_i32 : i32 2026-02-21T08:11:33.8825469Z %12 = arith.addi %2, %11 : 
i32 2026-02-21T08:11:33.8825662Z %13 = arith.muli %c1_i32, %c4_i32_1 : i32 2026-02-21T08:11:33.8826001Z scf.for %arg5 = %2 to %12 step %13 : i32 { 2026-02-21T08:11:33.8826219Z %14 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:11:33.8826469Z %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:11:33.8826729Z %16 = tt.splat %14 : i32 -> tensor<4xi32> 2026-02-21T08:11:33.8826943Z %17 = arith.addi %16, %15 : tensor<4xi32> 2026-02-21T08:11:33.8827269Z %18 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:11:33.8827691Z %52 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:11:33.8828043Z %53 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:11:33.8828320Z %54 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:11:33.8828688Z %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:11:33.8829048Z %57 = arith.subf %53, %52 : tensor<4x4xf32> 2026-02-21T08:11:33.8829245Z %58 = arith.mulf %56, %57 : tensor<4x4xf32> 2026-02-21T08:11:33.8829450Z %59 = arith.addf %58, %cst : tensor<4x4xf32> 2026-02-21T08:11:33.8829639Z scf.yield %59 : tensor<4x4xf32> 2026-02-21T08:11:33.8829808Z } else { 2026-02-21T08:11:33.8829960Z %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:11:33.8830171Z %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32> 2026-02-21T08:11:33.8830383Z %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32> 2026-02-21T08:11:33.8830581Z %59 = arith.ori %57, %58 : tensor<4x4xi1> 2026-02-21T08:11:33.8830815Z %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:11:33.8831046Z %61 = math.log %60 : tensor<4x4xf32> 2026-02-21T08:11:33.8831236Z %62 = arith.subf %61, %52 : tensor<4x4xf32> 2026-02-21T08:11:33.8831426Z %63 = arith.mulf %53, %62 : tensor<4x4xf32> 2026-02-21T08:11:33.8831625Z 
%64 = arith.addf %63, %cst : tensor<4x4xf32> 2026-02-21T08:11:33.8831829Z scf.yield %64 : tensor<4x4xf32> 2026-02-21T08:11:33.8832053Z } 2026-02-21T08:11:33.8832210Z %55 = arith.addf %arg7, %54 : tensor<4x4xf32> 2026-02-21T08:11:33.8832407Z scf.yield %55 : tensor<4x4xf32> 2026-02-21T08:11:33.8832743Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:11:33.8833089Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:11:33.8833296Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:33.8833481Z %52 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:11:33.8833693Z tt.reduce.return %52 : f32 2026-02-21T08:11:33.8833876Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:11:33.8834093Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:11:33.8834350Z %21 = tt.addptr %20, %17 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:11:33.8834684Z tt.store %21, %19 : tensor<4x!tt.ptr> 2026-02-21T08:11:33.8834910Z %c1_i32_2 = arith.constant 1 : i32 2026-02-21T08:11:33.8835095Z %22 = arith.muli %c1_i32, %c1_i32_2 : i32 2026-02-21T08:11:33.8835274Z %23 = arith.addi %arg5, %22 : i32 2026-02-21T08:11:33.8835452Z %24 = arith.muli %23, %c4_i32 : i32 2026-02-21T08:11:33.8835669Z %25 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:11:33.8835895Z %26 = tt.splat %24 : i32 -> tensor<4xi32> 2026-02-21T08:11:33.8836084Z %27 = arith.addi %26, %25 : tensor<4xi32> 2026-02-21T08:11:33.8836374Z %28 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:11:33.8836754Z %52 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:11:33.8837164Z %53 = tt.descriptor_load %1[%24, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:11:33.8837446Z %54 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:11:33.8837799Z %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = 
true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:11:33.8838147Z %57 = arith.subf %53, %52 : tensor<4x4xf32> 2026-02-21T08:11:33.8838346Z %58 = arith.mulf %56, %57 : tensor<4x4xf32> 2026-02-21T08:11:33.8838539Z %59 = arith.addf %58, %cst : tensor<4x4xf32> 2026-02-21T08:11:33.8838732Z scf.yield %59 : tensor<4x4xf32> 2026-02-21T08:11:33.8838898Z } else { 2026-02-21T08:11:33.8839052Z %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:11:33.8839268Z %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32> 2026-02-21T08:11:33.8839526Z %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32> 2026-02-21T08:11:33.8839735Z %59 = arith.ori %57, %58 : tensor<4x4xi1> 2026-02-21T08:11:33.8839964Z %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:11:33.8840196Z %61 = math.log %60 : tensor<4x4xf32> 2026-02-21T08:11:33.8840386Z %62 = arith.subf %61, %52 : tensor<4x4xf32> 2026-02-21T08:11:33.8840573Z %63 = arith.mulf %53, %62 : tensor<4x4xf32> 2026-02-21T08:11:33.8840775Z %64 = arith.addf %63, %cst : tensor<4x4xf32> 2026-02-21T08:11:33.8840960Z scf.yield %64 : tensor<4x4xf32> 2026-02-21T08:11:33.8841127Z } 2026-02-21T08:11:33.8841262Z %55 = arith.addf %arg7, %54 : tensor<4x4xf32> 2026-02-21T08:11:33.8841449Z scf.yield %55 : tensor<4x4xf32> 2026-02-21T08:11:33.8841750Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:11:33.8842112Z %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({ 2026-02-21T08:11:33.8842300Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:33.8842475Z %52 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:11:33.8842673Z tt.reduce.return %52 : f32 2026-02-21T08:11:33.8842863Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:11:33.8843096Z %30 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:11:33.8843360Z %31 = tt.addptr %30, %27 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:11:33.8843603Z tt.store %31, %29 : tensor<4x!tt.ptr> 
2026-02-21T08:11:33.8843808Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:11:33.8843999Z %32 = arith.muli %c1_i32, %c2_i32 : i32 2026-02-21T08:11:33.8844181Z %33 = arith.addi %arg5, %32 : i32 2026-02-21T08:11:33.8844351Z %34 = arith.muli %33, %c4_i32 : i32 2026-02-21T08:11:33.8844571Z %35 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:11:33.8844826Z %36 = tt.splat %34 : i32 -> tensor<4xi32> 2026-02-21T08:11:33.8845038Z %37 = arith.addi %36, %35 : tensor<4xi32> 2026-02-21T08:11:33.8845363Z %38 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:11:33.8845849Z %52 = tt.descriptor_load %0[%34, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:11:33.8846244Z %53 = tt.descriptor_load %1[%34, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:11:33.8846553Z %54 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:11:33.8846949Z %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:11:33.8847344Z %57 = arith.subf %53, %52 : tensor<4x4xf32> 2026-02-21T08:11:33.8847569Z %58 = arith.mulf %56, %57 : tensor<4x4xf32> 2026-02-21T08:11:33.8847796Z %59 = arith.addf %58, %cst : tensor<4x4xf32> 2026-02-21T08:11:33.8848007Z scf.yield %59 : tensor<4x4xf32> 2026-02-21T08:11:33.8848197Z } else { 2026-02-21T08:11:33.8848425Z %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:11:33.8848669Z %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32> 2026-02-21T08:11:33.8848899Z %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32> 2026-02-21T08:11:33.8849129Z %59 = arith.ori %57, %58 : tensor<4x4xi1> 2026-02-21T08:11:33.8849388Z %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:11:33.8849644Z %61 = math.log %60 : tensor<4x4xf32> 2026-02-21T08:11:33.8849860Z %62 = arith.subf %61, %52 : tensor<4x4xf32> 2026-02-21T08:11:33.8850071Z %63 = arith.mulf %53, %62 : 
tensor<4x4xf32> 2026-02-21T08:11:33.8850294Z %64 = arith.addf %63, %cst : tensor<4x4xf32> 2026-02-21T08:11:33.8850501Z scf.yield %64 : tensor<4x4xf32> 2026-02-21T08:11:33.8850687Z } 2026-02-21T08:11:33.8850839Z %55 = arith.addf %arg7, %54 : tensor<4x4xf32> 2026-02-21T08:11:33.8851077Z scf.yield %55 : tensor<4x4xf32> 2026-02-21T08:11:33.8851383Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:11:33.8851694Z %39 = "tt.reduce"(%38) <{axis = 1 : i32}> ({ 2026-02-21T08:11:33.8851934Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:33.8852121Z %52 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:11:33.8852319Z tt.reduce.return %52 : f32 2026-02-21T08:11:33.8852509Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:11:33.8852744Z %40 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:11:33.8853020Z %41 = tt.addptr %40, %37 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:11:33.8853262Z tt.store %41, %39 : tensor<4x!tt.ptr> 2026-02-21T08:11:33.8853456Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:11:33.8853631Z %42 = arith.muli %c1_i32, %c3_i32 : i32 2026-02-21T08:11:33.8853812Z %43 = arith.addi %arg5, %42 : i32 2026-02-21T08:11:33.8853981Z %44 = arith.muli %43, %c4_i32 : i32 2026-02-21T08:11:33.8854201Z %45 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:11:33.8854430Z %46 = tt.splat %44 : i32 -> tensor<4xi32> 2026-02-21T08:11:33.8854612Z %47 = arith.addi %46, %45 : tensor<4xi32> 2026-02-21T08:11:33.8854902Z %48 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:11:33.8855268Z %52 = tt.descriptor_load %0[%44, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:11:33.8855614Z %53 = tt.descriptor_load %1[%44, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:11:33.8855885Z %54 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:11:33.8856235Z %56 = tt.extern_elementwise %53 
{libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:11:33.8856586Z %57 = arith.subf %53, %52 : tensor<4x4xf32> 2026-02-21T08:11:33.8856844Z %58 = arith.mulf %56, %57 : tensor<4x4xf32> 2026-02-21T08:11:33.8857052Z %59 = arith.addf %58, %cst : tensor<4x4xf32> 2026-02-21T08:11:33.8857243Z scf.yield %59 : tensor<4x4xf32> 2026-02-21T08:11:33.8857415Z } else { 2026-02-21T08:11:33.8857574Z %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:11:33.8857791Z %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32> 2026-02-21T08:11:33.8858009Z %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32> 2026-02-21T08:11:33.8858212Z %59 = arith.ori %57, %58 : tensor<4x4xi1> 2026-02-21T08:11:33.8858449Z %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:11:33.8858684Z %61 = math.log %60 : tensor<4x4xf32> 2026-02-21T08:11:33.8858881Z %62 = arith.subf %61, %52 : tensor<4x4xf32> 2026-02-21T08:11:33.8859078Z %63 = arith.mulf %53, %62 : tensor<4x4xf32> 2026-02-21T08:11:33.8859348Z %64 = arith.addf %63, %cst : tensor<4x4xf32> 2026-02-21T08:11:33.8859544Z scf.yield %64 : tensor<4x4xf32> 2026-02-21T08:11:33.8859703Z } 2026-02-21T08:11:33.8859846Z %55 = arith.addf %arg7, %54 : tensor<4x4xf32> 2026-02-21T08:11:33.8860029Z scf.yield %55 : tensor<4x4xf32> 2026-02-21T08:11:33.8860333Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:11:33.8860646Z %49 = "tt.reduce"(%48) <{axis = 1 : i32}> ({ 2026-02-21T08:11:33.8860838Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:33.8861008Z %52 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:11:33.8861199Z tt.reduce.return %52 : f32 2026-02-21T08:11:33.8861387Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:11:33.8861599Z %50 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:11:33.8861909Z %51 = tt.addptr %50, %47 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:11:33.8862150Z 
tt.store %51, %49 : tensor<4x!tt.ptr> 2026-02-21T08:11:33.8862387Z } {tt.disallow_acc_multi_buffer} 2026-02-21T08:11:33.8862593Z scf.for %arg5 = %12 to %4 step %c1_i32 : i32 { 2026-02-21T08:11:33.8862802Z %14 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:11:33.8863036Z %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:11:33.8863288Z %16 = tt.splat %14 : i32 -> tensor<4xi32> 2026-02-21T08:11:33.8863520Z %17 = arith.addi %16, %15 : tensor<4xi32> 2026-02-21T08:11:33.8863813Z %18 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:11:33.8864177Z %22 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:11:33.8864529Z %23 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:11:33.8864812Z %24 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:11:33.8865157Z %26 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:11:33.8865504Z %27 = arith.subf %23, %22 : tensor<4x4xf32> 2026-02-21T08:11:33.8865695Z %28 = arith.mulf %26, %27 : tensor<4x4xf32> 2026-02-21T08:11:33.8865894Z %29 = arith.addf %28, %cst : tensor<4x4xf32> 2026-02-21T08:11:33.8866086Z scf.yield %29 : tensor<4x4xf32> 2026-02-21T08:11:33.8866245Z } else { 2026-02-21T08:11:33.8866404Z %26 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:11:33.8866605Z %27 = arith.cmpf ogt, %23, %26 : tensor<4x4xf32> 2026-02-21T08:11:33.8866817Z %28 = arith.cmpf une, %23, %23 : tensor<4x4xf32> 2026-02-21T08:11:33.8867010Z %29 = arith.ori %27, %28 : tensor<4x4xi1> 2026-02-21T08:11:33.8867239Z %30 = arith.select %29, %23, %26 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:11:33.8867464Z %31 = math.log %30 : tensor<4x4xf32> 2026-02-21T08:11:33.8867710Z %32 = arith.subf %31, %22 : tensor<4x4xf32> 2026-02-21T08:11:33.8867901Z %33 = arith.mulf %23, %32 : tensor<4x4xf32> 
2026-02-21T08:11:33.8868087Z %34 = arith.addf %33, %cst : tensor<4x4xf32> 2026-02-21T08:11:33.8868274Z scf.yield %34 : tensor<4x4xf32> 2026-02-21T08:11:33.8868433Z } 2026-02-21T08:11:33.8868575Z %25 = arith.addf %arg7, %24 : tensor<4x4xf32> 2026-02-21T08:11:33.8868757Z scf.yield %25 : tensor<4x4xf32> 2026-02-21T08:11:33.8869056Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:11:33.8869377Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:11:33.8869555Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:33.8869730Z %22 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:11:33.8869902Z tt.reduce.return %22 : f32 2026-02-21T08:11:33.8870160Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:11:33.8870379Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:11:33.8870629Z %21 = tt.addptr %20, %17 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:11:33.8870848Z tt.store %21, %19 : tensor<4x!tt.ptr> 2026-02-21T08:11:33.8871071Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:11:33.8871271Z tt.return 2026-02-21T08:11:33.8871391Z } 2026-02-21T08:11:33.8871516Z } 2026-02-21T08:11:33.8871583Z 2026-02-21T08:11:33.8871634Z {-# 2026-02-21T08:11:33.8871764Z external_resources: { 2026-02-21T08:11:33.8871947Z mlir_reproducer: { 2026-02-21T08:11:33.8876559Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false 
top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:11:33.8881415Z disable_threading: false, 2026-02-21T08:11:33.8881596Z verify_each: true 2026-02-21T08:11:33.8881762Z } 2026-02-21T08:11:33.8881960Z } 2026-02-21T08:11:33.8882143Z #-} 2026-02-21T08:11:33.8882718Z /tmp/torchinductor_root/vo/cvoeviq4uu2ccyyvereddu2zu4p225xiu5p2fh3dy2k7ekhtaajo.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:11:33.8884068Z /tmp/torchinductor_root/vo/cvoeviq4uu2ccyyvereddu2zu4p225xiu5p2fh3dy2k7ekhtaajo.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:11:33.8885235Z [38s] Triton compile failed. 
This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:11:33.8886492Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'last'], maxnreg=128, num_sm_multiplier=64, num_stages=6, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[0, 3], range_unroll_factors=[4, 1], range_warp_specializes=[False, True]), static_shapes=True) 2026-02-21T08:11:33.8887590Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:11:33.8887869Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:11:35.3642032Z module { 2026-02-21T08:11:35.3643012Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:11:35.3643639Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:11:35.3643852Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:11:35.3644045Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:11:35.3644278Z %cst = arith.constant dense<0.000000e+00> : tensor<4x8192xf32> 2026-02-21T08:11:35.3644521Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:11:35.3644715Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:11:35.3644905Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:11:35.3645096Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:11:35.3645275Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:11:35.3645609Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:11:35.3646071Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:11:35.3646399Z %2 = tt.get_program_id x : i32 2026-02-21T08:11:35.3646588Z %3 = arith.addi %2, %c1_i32 : i32 
2026-02-21T08:11:35.3646774Z %4 = arith.minsi %3, %c1024_i32 : i32 2026-02-21T08:11:35.3646985Z scf.for %arg5 = %2 to %4 step %c1_i32 : i32 { 2026-02-21T08:11:35.3647193Z %5 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:11:35.3647432Z %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:11:35.3647689Z %7 = tt.splat %5 : i32 -> tensor<4xi32> 2026-02-21T08:11:35.3647891Z %8 = arith.addi %7, %6 : tensor<4xi32> 2026-02-21T08:11:35.3648180Z %9 = tt.descriptor_load %0[%5, %c0_i32] : !tt.tensordesc> -> tensor<4x8192xf32> 2026-02-21T08:11:35.3648587Z %10 = tt.descriptor_load %1[%5, %c0_i32] : !tt.tensordesc> -> tensor<4x8192xf32> 2026-02-21T08:11:35.3648902Z %11 = scf.if %arg3 -> (tensor<4x8192xf32>) { 2026-02-21T08:11:35.3649290Z %16 = tt.extern_elementwise %10 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x8192xf32>) -> tensor<4x8192xf32> 2026-02-21T08:11:35.3649689Z %17 = arith.subf %10, %9 : tensor<4x8192xf32> 2026-02-21T08:11:35.3649905Z %18 = arith.mulf %16, %17 : tensor<4x8192xf32> 2026-02-21T08:11:35.3650130Z %19 = arith.addf %18, %cst : tensor<4x8192xf32> 2026-02-21T08:11:35.3650356Z scf.yield %19 : tensor<4x8192xf32> 2026-02-21T08:11:35.3650531Z } else { 2026-02-21T08:11:35.3650706Z %16 = tt.splat %arg4 : f32 -> tensor<4x8192xf32> 2026-02-21T08:11:35.3650935Z %17 = arith.cmpf ogt, %10, %16 : tensor<4x8192xf32> 2026-02-21T08:11:35.3651175Z %18 = arith.cmpf une, %10, %10 : tensor<4x8192xf32> 2026-02-21T08:11:35.3651395Z %19 = arith.ori %17, %18 : tensor<4x8192xi1> 2026-02-21T08:11:35.3651656Z %20 = arith.select %19, %10, %16 : tensor<4x8192xi1>, tensor<4x8192xf32> 2026-02-21T08:11:35.3652112Z %21 = math.log %20 : tensor<4x8192xf32> 2026-02-21T08:11:35.3652323Z %22 = arith.subf %21, %9 : tensor<4x8192xf32> 2026-02-21T08:11:35.3652527Z %23 = arith.mulf %10, %22 : tensor<4x8192xf32> 2026-02-21T08:11:35.3652734Z %24 = arith.addf %23, %cst : tensor<4x8192xf32> 2026-02-21T08:11:35.3652943Z scf.yield %24 : 
tensor<4x8192xf32> 2026-02-21T08:11:35.3653121Z } 2026-02-21T08:11:35.3653279Z %12 = arith.addf %11, %cst : tensor<4x8192xf32> 2026-02-21T08:11:35.3653493Z %13 = "tt.reduce"(%12) <{axis = 1 : i32}> ({ 2026-02-21T08:11:35.3653675Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:11:35.3653851Z %16 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:11:35.3654026Z tt.reduce.return %16 : f32 2026-02-21T08:11:35.3654217Z }) : (tensor<4x8192xf32>) -> tensor<4xf32> 2026-02-21T08:11:35.3654526Z %14 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:11:35.3654807Z %15 = tt.addptr %14, %8 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:11:35.3655046Z tt.store %15, %13 : tensor<4x!tt.ptr> 2026-02-21T08:11:35.3655272Z } {tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:11:35.3655473Z tt.return 2026-02-21T08:11:35.3655601Z } 2026-02-21T08:11:35.3655733Z } 2026-02-21T08:11:35.3655798Z 2026-02-21T08:11:35.3655845Z {-# 2026-02-21T08:11:35.3655975Z external_resources: { 2026-02-21T08:11:35.3656122Z mlir_reproducer: { 2026-02-21T08:11:35.3660359Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=1}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=1}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=1}, 
tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:11:35.3664899Z disable_threading: false, 2026-02-21T08:11:35.3665069Z verify_each: true 2026-02-21T08:11:35.3665215Z } 2026-02-21T08:11:35.3665329Z } 2026-02-21T08:11:35.3665445Z #-} 2026-02-21T08:11:35.3665857Z /tmp/torchinductor_root/ax/caxj6wc7yve7diibicypm2atgzdubmmgoizzagqbuz2cpheuc57o.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:11:35.3667071Z /tmp/torchinductor_root/ax/caxj6wc7yve7diibicypm2atgzdubmmgoizzagqbuz2cpheuc57o.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:11:35.3668130Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:11:35.3669220Z Config: @helion.kernel(config=helion.Config(block_sizes=[8192, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_sm_multiplier=16, num_stages=1, num_warps=32, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:11:35.3670242Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:11:35.3670551Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:11:35.4301140Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 15.3 configs/s 2026-02-21T08:11:35.4310297Z [40s] Adaptive compile timeout: 30s (90% percentile=4.0s, bounds=[30.0s, 30s]) 2026-02-21T08:11:36.1785666Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1326.9 configs/s 2026-02-21T08:11:36.2241445Z [40s] Initial random population of 100, 5 starting points: 2026-02-21T08:11:36.2243279Z error=7 2026-02-21T08:11:36.2243430Z timeout=6 2026-02-21T08:11:36.2243560Z ok=87 2026-02-21T08:11:36.2243678Z min=0.0747 2026-02-21T08:11:36.2243805Z mid=0.8366 2026-02-21T08:11:36.2243920Z max=40.4429 2026-02-21T08:11:36.2244062Z best={'block_sizes': [1024, 1], 2026-02-21T08:11:36.2244281Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:11:36.2244520Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:11:36.2244702Z 'num_sm_multiplier': 16, 2026-02-21T08:11:36.2244860Z 'num_stages': 1, 2026-02-21T08:11:36.2245001Z 'num_warps': 1, 2026-02-21T08:11:36.2245153Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:11:36.2245348Z 'range_flattens': [None, None], 2026-02-21T08:11:36.2245544Z 'range_multi_buffers': [False, True], 2026-02-21T08:11:36.2245740Z 'range_num_stages': [0, 4], 2026-02-21T08:11:36.2245901Z 'range_unroll_factors': [2, 0], 2026-02-21T08:11:36.2246086Z 
'range_warp_specializes': [None, True]} 2026-02-21T08:11:36.2258366Z [40s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:11:37.4728448Z [42s] Generation 1 starting: 89 neighbors, 5 active search path(s) 2026-02-21T08:11:46.8057786Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 4.2 configs/s 2026-02-21T08:11:52.4587462Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 16.4 configs/s 2026-02-21T08:11:59.5651347Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 142.2 2026-02-21T08:11:59.5655436Z configs/s 2026-02-21T08:11:59.8595130Z [64s] Generation 1 complete: 2026-02-21T08:11:59.8597589Z ok=95 2026-02-21T08:11:59.8597790Z min=0.0645 2026-02-21T08:11:59.8597979Z mid=0.0851 2026-02-21T08:11:59.8598203Z max=0.6297 2026-02-21T08:11:59.8598959Z best={'block_sizes': [256, 1], 2026-02-21T08:11:59.8599294Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:11:59.8599655Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:11:59.8599881Z 'num_stages': 7, 2026-02-21T08:11:59.8600045Z 'num_warps': 1, 2026-02-21T08:11:59.8600372Z 'pid_type': 'flat', 2026-02-21T08:11:59.8600573Z 'range_flattens': [None, False], 2026-02-21T08:11:59.8600848Z 'range_multi_buffers': [None, True], 2026-02-21T08:11:59.8606276Z 'range_num_stages': [0, 3], 2026-02-21T08:11:59.8610708Z 'range_unroll_factors': [0, 3], 2026-02-21T08:11:59.8615319Z 'range_warp_specializes': [None, None]} 2026-02-21T08:11:59.8620250Z [64s] Fitting surrogate: 195 points, 195 targets 2026-02-21T08:12:00.9236668Z [65s] Generation 2 starting: 75 neighbors, 5 active search path(s) 2026-02-21T08:12:04.3550523Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 39.9 configs/s 2026-02-21T08:12:09.3159184Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 15.8 configs/s 2026-02-21T08:12:16.9608069Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 132.2 2026-02-21T08:12:16.9611313Z 
configs/s 2026-02-21T08:12:17.3219381Z [82s] Generation 2 complete: 2026-02-21T08:12:17.3219690Z ok=81 2026-02-21T08:12:17.3219848Z min=0.0624 2026-02-21T08:12:17.3219975Z mid=0.0746 2026-02-21T08:12:17.3220135Z max=0.1874 2026-02-21T08:12:17.3220675Z best={'block_sizes': [1024, 1], 2026-02-21T08:12:17.3220976Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:12:17.3221245Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:12:17.3221438Z 'num_stages': 7, 2026-02-21T08:12:17.3221577Z 'num_warps': 4, 2026-02-21T08:12:17.3221728Z 'pid_type': 'flat', 2026-02-21T08:12:17.3222083Z 'range_flattens': [None, False], 2026-02-21T08:12:17.3222309Z 'range_multi_buffers': [None, True], 2026-02-21T08:12:17.3222500Z 'range_num_stages': [0, 3], 2026-02-21T08:12:17.3222669Z 'range_unroll_factors': [0, 3], 2026-02-21T08:12:17.3222840Z 'range_warp_specializes': [None, None]} 2026-02-21T08:12:17.3234351Z [82s] Fitting surrogate: 276 points, 276 targets 2026-02-21T08:12:18.5042873Z [83s] Generation 3 starting: 75 neighbors, 5 active search path(s) 2026-02-21T08:12:23.4683753Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 7.3 configs/s 2026-02-21T08:12:28.1895755Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 16.5 configs/s 2026-02-21T08:12:36.2874726Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 124.9 2026-02-21T08:12:36.2875365Z configs/s 2026-02-21T08:12:36.6560792Z [101s] Generation 3 complete: 2026-02-21T08:12:36.6564523Z ok=81 2026-02-21T08:12:36.6568916Z min=0.0624 2026-02-21T08:12:36.6572717Z mid=0.0708 2026-02-21T08:12:36.6577447Z max=0.1916 2026-02-21T08:12:36.6581831Z best={'block_sizes': [1024, 1], 2026-02-21T08:12:36.6585676Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:12:36.6586959Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:12:36.6587171Z 'num_stages': 1, 2026-02-21T08:12:36.6587329Z 'num_warps': 4, 2026-02-21T08:12:36.6587477Z 
'pid_type': 'flat', 2026-02-21T08:12:36.6587631Z 'range_flattens': [None, None], 2026-02-21T08:12:36.6587813Z 'range_multi_buffers': [None, True], 2026-02-21T08:12:36.6587989Z 'range_num_stages': [0, 3], 2026-02-21T08:12:36.6588154Z 'range_unroll_factors': [0, 0], 2026-02-21T08:12:36.6588325Z 'range_warp_specializes': [None, True]} 2026-02-21T08:12:36.6588536Z [101s] Fitting surrogate: 357 points, 357 targets 2026-02-21T08:12:37.7026476Z [102s] Generation 4 starting: 74 neighbors, 5 active search path(s) 2026-02-21T08:12:40.9885624Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 30.8 configs/s 2026-02-21T08:12:45.5758147Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 16.7 configs/s 2026-02-21T08:12:53.6633084Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 125.0 2026-02-21T08:12:53.6633493Z configs/s 2026-02-21T08:12:54.0263904Z [118s] Generation 4 complete: 2026-02-21T08:12:54.0268124Z error=1 2026-02-21T08:12:54.0272572Z ok=79 2026-02-21T08:12:54.0276478Z min=0.0604 2026-02-21T08:12:54.0281015Z mid=0.0696 2026-02-21T08:12:54.0284917Z max=0.1322 2026-02-21T08:12:54.0286412Z best={'block_sizes': [1024, 1], 2026-02-21T08:12:54.0286717Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:12:54.0289724Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:12:54.0289959Z 'num_stages': 1, 2026-02-21T08:12:54.0290105Z 'num_warps': 4, 2026-02-21T08:12:54.0290253Z 'pid_type': 'flat', 2026-02-21T08:12:54.0290411Z 'range_flattens': [None, None], 2026-02-21T08:12:54.0291017Z 'range_multi_buffers': [None, True], 2026-02-21T08:12:54.0291231Z 'range_num_stages': [0, 3], 2026-02-21T08:12:54.0291404Z 'range_unroll_factors': [0, 1], 2026-02-21T08:12:54.0291586Z 'range_warp_specializes': [None, True]} 2026-02-21T08:12:54.0291806Z [118s] Fitting surrogate: 437 points, 437 targets 2026-02-21T08:12:55.0173603Z [119s] Generation 5 starting: 70 neighbors, 5 active search path(s) 
2026-02-21T08:13:00.7318807Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 6.2 configs/s 2026-02-21T08:13:04.9773114Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 73/73 17.4 configs/s 2026-02-21T08:13:12.4134181Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 136.0 2026-02-21T08:13:12.4134709Z configs/s 2026-02-21T08:13:12.7733467Z [137s] Generation 5 complete: 2026-02-21T08:13:12.7737870Z ok=75 2026-02-21T08:13:12.7741723Z min=0.0624 2026-02-21T08:13:12.7744850Z mid=0.0726 2026-02-21T08:13:12.7748821Z max=0.2263 2026-02-21T08:13:12.7750406Z best={'block_sizes': [1024, 1], 2026-02-21T08:13:12.7750688Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:13:12.7750944Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:13:12.7751154Z 'num_stages': 1, 2026-02-21T08:13:12.7751313Z 'num_warps': 4, 2026-02-21T08:13:12.7751462Z 'pid_type': 'flat', 2026-02-21T08:13:12.7751636Z 'range_flattens': [None, None], 2026-02-21T08:13:12.7751822Z 'range_multi_buffers': [None, True], 2026-02-21T08:13:12.7752099Z 'range_num_stages': [0, 3], 2026-02-21T08:13:12.7752275Z 'range_unroll_factors': [0, 1], 2026-02-21T08:13:12.7752471Z 'range_warp_specializes': [None, True]} 2026-02-21T08:13:12.7752701Z [137s] Fitting surrogate: 512 points, 512 targets 2026-02-21T08:13:13.6527774Z [138s] Generation 6 starting: 51 neighbors, 4 active search path(s) 2026-02-21T08:13:18.1085863Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 5.4 configs/s 2026-02-21T08:13:21.1280833Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 17.5 configs/s 2026-02-21T08:13:27.0178495Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 179.0 2026-02-21T08:13:27.0178903Z configs/s 2026-02-21T08:13:27.2826117Z [151s] Generation 6 complete: 2026-02-21T08:13:27.2828017Z ok=55 2026-02-21T08:13:27.2828258Z min=0.0624 2026-02-21T08:13:27.2832414Z mid=0.0646 2026-02-21T08:13:27.2836322Z max=0.2303 
2026-02-21T08:13:27.2841050Z best={'block_sizes': [2048, 1], 2026-02-21T08:13:27.2845579Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:13:27.2847148Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:13:27.2847428Z 'num_stages': 1, 2026-02-21T08:13:27.2852485Z 'num_warps': 8, 2026-02-21T08:13:27.2854620Z 'pid_type': 'flat', 2026-02-21T08:13:27.2854831Z 'range_flattens': [None, False], 2026-02-21T08:13:27.2855028Z 'range_multi_buffers': [None, True], 2026-02-21T08:13:27.2855263Z 'range_num_stages': [0, 3], 2026-02-21T08:13:27.2855854Z 'range_unroll_factors': [0, 1], 2026-02-21T08:13:27.2856047Z 'range_warp_specializes': [None, True]} 2026-02-21T08:13:27.2856330Z [151s] Fitting surrogate: 567 points, 567 targets 2026-02-21T08:13:27.8487395Z [152s] Generation 7 starting: 24 neighbors, 2 active search path(s) 2026-02-21T08:13:29.6097197Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25/25 20.5 configs/s 2026-02-21T08:13:31.0620069Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 25/25 17.8 configs/s 2026-02-21T08:13:33.8545070Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 361.1 2026-02-21T08:13:33.8546378Z configs/s 2026-02-21T08:13:33.9919309Z [158s] Generation 7 complete: 2026-02-21T08:13:33.9924182Z ok=27 2026-02-21T08:13:33.9926124Z min=0.0624 2026-02-21T08:13:33.9930463Z mid=0.0688 2026-02-21T08:13:33.9934219Z max=0.1669 2026-02-21T08:13:33.9938616Z best={'block_sizes': [2048, 1], 2026-02-21T08:13:33.9941608Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:13:33.9941967Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:13:33.9942172Z 'num_stages': 1, 2026-02-21T08:13:33.9942313Z 'num_warps': 8, 2026-02-21T08:13:33.9942459Z 'pid_type': 'flat', 2026-02-21T08:13:33.9942614Z 'range_flattens': [None, False], 2026-02-21T08:13:33.9942795Z 'range_multi_buffers': [None, True], 2026-02-21T08:13:33.9942969Z 'range_num_stages': [0, 3], 
2026-02-21T08:13:33.9943133Z 'range_unroll_factors': [0, 1], 2026-02-21T08:13:33.9943311Z 'range_warp_specializes': [None, True]} 2026-02-21T08:13:33.9943507Z [158s] Fitting surrogate: 594 points, 594 targets 2026-02-21T08:13:34.4725561Z [159s] Generation 8 starting: 20 neighbors, 2 active search path(s) 2026-02-21T08:14:05.2519152Z [189s] Timeout after 30s compiling Config(block_sizes=[2048, 4], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=6, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]) 2026-02-21T08:14:05.2534664Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 0.3 configs/s 2026-02-21T08:14:06.4147079Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 18.9 configs/s 2026-02-21T08:14:08.8177140Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 419.4 2026-02-21T08:14:08.8182162Z configs/s 2026-02-21T08:14:08.9470828Z [193s] Generation 8 complete: 2026-02-21T08:14:08.9474705Z timeout=1 2026-02-21T08:14:08.9476634Z ok=22 2026-02-21T08:14:08.9476846Z min=0.0624 2026-02-21T08:14:08.9482358Z mid=0.0644 2026-02-21T08:14:08.9486720Z max=0.0829 2026-02-21T08:14:08.9488299Z best={'block_sizes': [2048, 1], 2026-02-21T08:14:08.9488636Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:14:08.9493353Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:14:08.9493649Z 'num_stages': 1, 2026-02-21T08:14:08.9493832Z 'num_warps': 8, 2026-02-21T08:14:08.9493993Z 'pid_type': 'flat', 2026-02-21T08:14:08.9494180Z 'range_flattens': [None, False], 2026-02-21T08:14:08.9499135Z 'range_multi_buffers': [None, True], 2026-02-21T08:14:08.9504334Z 'range_num_stages': [0, 3], 2026-02-21T08:14:08.9508769Z 'range_unroll_factors': [0, 1], 2026-02-21T08:14:08.9512650Z 'range_warp_specializes': [None, True]} 
2026-02-21T08:14:08.9517690Z [193s] Fitting surrogate: 617 points, 617 targets 2026-02-21T08:14:09.2981525Z [194s] Generation 9 starting: 13 neighbors, 1 active search path(s) 2026-02-21T08:14:15.3992819Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 1.3 configs/s 2026-02-21T08:14:16.1632726Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 13/13 18.1 configs/s 2026-02-21T08:14:17.5385322Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 730.0 2026-02-21T08:14:17.5386005Z configs/s 2026-02-21T08:14:17.6117190Z [202s] Generation 9 complete: 2026-02-21T08:14:17.6120624Z ok=15 2026-02-21T08:14:17.6123388Z min=0.0624 2026-02-21T08:14:17.6126160Z mid=0.0625 2026-02-21T08:14:17.6129582Z max=0.3105 2026-02-21T08:14:17.6129777Z best={'block_sizes': [2048, 1], 2026-02-21T08:14:17.6130002Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:14:17.6130245Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:14:17.6130435Z 'num_stages': 1, 2026-02-21T08:14:17.6130572Z 'num_warps': 8, 2026-02-21T08:14:17.6130713Z 'pid_type': 'flat', 2026-02-21T08:14:17.6130866Z 'range_flattens': [None, False], 2026-02-21T08:14:17.6131050Z 'range_multi_buffers': [None, True], 2026-02-21T08:14:17.6131226Z 'range_num_stages': [0, 3], 2026-02-21T08:14:17.6131393Z 'range_unroll_factors': [0, 1], 2026-02-21T08:14:17.6131592Z 'range_warp_specializes': [None, True]} 2026-02-21T08:14:17.6133438Z [202s] Fitting surrogate: 632 points, 632 targets 2026-02-21T08:14:18.0095214Z [202s] Generation 10 starting: 11 neighbors, 1 active search path(s) 2026-02-21T08:14:18.5545670Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 25.3 configs/s 2026-02-21T08:14:19.1826644Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 19.0 configs/s 2026-02-21T08:14:20.5471640Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 736.2 2026-02-21T08:14:20.5472605Z configs/s 2026-02-21T08:14:20.6467157Z [205s] 
Generation 10 complete: 2026-02-21T08:14:20.6467455Z ok=13 2026-02-21T08:14:20.6467646Z min=0.0624 2026-02-21T08:14:20.6467832Z mid=0.0625 2026-02-21T08:14:20.6468011Z max=0.0687 2026-02-21T08:14:20.6468212Z best={'block_sizes': [2048, 1], 2026-02-21T08:14:20.6468652Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:14:20.6469075Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:14:20.6469409Z 'num_stages': 1, 2026-02-21T08:14:20.6469626Z 'num_warps': 8, 2026-02-21T08:14:20.6469835Z 'pid_type': 'flat', 2026-02-21T08:14:20.6470076Z 'range_flattens': [None, False], 2026-02-21T08:14:20.6470349Z 'range_multi_buffers': [None, True], 2026-02-21T08:14:20.6470639Z 'range_num_stages': [0, 3], 2026-02-21T08:14:20.6470898Z 'range_unroll_factors': [0, 1], 2026-02-21T08:14:20.6471182Z 'range_warp_specializes': [None, True]} 2026-02-21T08:14:20.6497534Z [205s] Fitting surrogate: 645 points, 645 targets 2026-02-21T08:14:20.9183245Z [205s] Autotuning complete in 205.6s after searching 611 configs. 
2026-02-21T08:14:20.9187728Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:14:20.9192569Z @helion.kernel(config=helion.Config(block_sizes=[2048, 1], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=1, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:14:20.9193847Z 2026-02-21T08:14:20.9199984Z [205s] Code of selected kernel: /tmp/torchinductor_root/kn/ckn2rv4kwlogyvhbtasra32a2ixoldjoc2cgkod7mj3op4wnp5w3.py 2026-02-21T08:14:21.8348638Z WARNING:tritonbench.utils.triton_op:Completed input ID 1: 2026-02-21T08:14:21.8352796Z (B, T, V) 2026-02-21T08:14:21.8357250Z -------------- 2026-02-21T08:14:21.8360621Z (8, 512, 8192) 2026-02-21T08:14:21.8364606Z 2026-02-21T08:14:21.8366986Z 33%|███▎ | 2/6 [06:09<12:38, 189.69s/it]WARNING:tritonbench.utils.triton_op:Running input ID 2: 2026-02-21T08:14:21.8367468Z (B, T, V) 2026-02-21T08:14:21.8367656Z --------------- 2026-02-21T08:14:21.8372233Z (8, 512, 16384) 2026-02-21T08:14:21.8372556Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for torch_kl_div 2026-02-21T08:14:22.9761309Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for liger_kl_div 2026-02-21T08:14:24.3426268Z INFO:tritonbench.utils.triton_op:Took 2.56ms to get benchmark function for torch_compile_kl_div 2026-02-21T08:14:28.0402546Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:14:28.0403888Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:14:28.0404123Z 'dtype': 'torch.float32', 2026-02-21T08:14:28.0404313Z 'shape': (4096, 16384), 2026-02-21T08:14:28.0404501Z 'stride': (16384, 1)}, 2026-02-21T08:14:28.0404670Z { 'device': 'cuda:0', 2026-02-21T08:14:28.0404841Z 'dtype': 'torch.float32', 2026-02-21T08:14:28.0405023Z 'shape': (4096, 16384), 
2026-02-21T08:14:28.0405186Z 'stride': (16384, 1)}), 2026-02-21T08:14:28.0405346Z 'kwargs': {}} 2026-02-21T08:14:28.0429364Z INFO:tritonbench.utils.triton_op:Took 3.22ms to get benchmark function for helion_kl_div_tritonbench 2026-02-21T08:14:28.3593507Z [0s] Autotune random seed: 2134765727 2026-02-21T08:14:28.5074679Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:15:00.7860780Z [32s] Timeout after 30s compiling Config(block_sizes=[64, 512], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=32, num_sm_multiplier=128, num_stages=5, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[2, 4], range_unroll_factors=[0, 1], range_warp_specializes=[False, None]) 2026-02-21T08:15:01.2981405Z [32s] Timeout after 30s compiling Config(block_sizes=[128, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', ''], num_stages=8, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[None, None]) 2026-02-21T08:15:02.0949821Z [33s] Timeout after 30s compiling Config(block_sizes=[2048, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', ''], maxnreg=64, num_sm_multiplier=128, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[None, None]) 2026-02-21T08:15:02.2762955Z [33s] Timeout after 30s compiling Config(block_sizes=[512, 32], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=128, num_stages=4, 
num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, None], range_num_stages=[4, 1], range_unroll_factors=[3, 4], range_warp_specializes=[False, None]) 2026-02-21T08:15:02.3881234Z [33s] Timeout after 30s compiling Config(block_sizes=[1024, 16], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'first'], num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[4, 4], range_unroll_factors=[3, 0], range_warp_specializes=[False, False]) 2026-02-21T08:15:02.4974407Z [33s] Timeout after 30s compiling Config(block_sizes=[512, 256], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=6, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]) 2026-02-21T08:15:02.4990852Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s 2026-02-21T08:15:02.7415672Z module { 2026-02-21T08:15:02.7416695Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:15:02.7422209Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:15:02.7427924Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:15:02.7431959Z %cst = arith.constant dense<0.000000e+00> : tensor<512x32xf32> 2026-02-21T08:15:02.7433890Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:15:02.7434121Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:15:02.7434319Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:15:02.7434508Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T08:15:02.7434698Z %c1_i64 = arith.constant 1 : i64 
2026-02-21T08:15:02.7435009Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:15:02.7435437Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:15:02.7435750Z %2 = tt.get_program_id x : i32 2026-02-21T08:15:02.7435940Z %3 = arith.muli %2, %c512_i32 : i32 2026-02-21T08:15:02.7436168Z %4 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:15:02.7436406Z %5 = tt.splat %3 : i32 -> tensor<512xi32> 2026-02-21T08:15:02.7436599Z %6 = arith.addi %5, %4 : tensor<512xi32> 2026-02-21T08:15:02.7436898Z %7 = scf.for %arg5 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg6 = %cst) -> (tensor<512x32xf32>) : i32 { 2026-02-21T08:15:02.7437306Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc> -> tensor<512x32xf32> 2026-02-21T08:15:02.7437677Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc> -> tensor<512x32xf32> 2026-02-21T08:15:02.7437958Z %13 = scf.if %arg3 -> (tensor<512x32xf32>) { 2026-02-21T08:15:02.7438324Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<512x32xf32>) -> tensor<512x32xf32> 2026-02-21T08:15:02.7438709Z %16 = arith.subf %12, %11 : tensor<512x32xf32> 2026-02-21T08:15:02.7438920Z %17 = arith.mulf %15, %16 : tensor<512x32xf32> 2026-02-21T08:15:02.7439130Z %18 = arith.addf %17, %cst : tensor<512x32xf32> 2026-02-21T08:15:02.7439323Z scf.yield %18 : tensor<512x32xf32> 2026-02-21T08:15:02.7439494Z } else { 2026-02-21T08:15:02.7439651Z %15 = tt.splat %arg4 : f32 -> tensor<512x32xf32> 2026-02-21T08:15:02.7439872Z %16 = arith.cmpf ogt, %12, %15 : tensor<512x32xf32> 2026-02-21T08:15:02.7440090Z %17 = arith.cmpf une, %12, %12 : tensor<512x32xf32> 2026-02-21T08:15:02.7440301Z %18 = arith.ori %16, %17 : tensor<512x32xi1> 2026-02-21T08:15:02.7440545Z %19 = arith.select %18, %12, %15 : tensor<512x32xi1>, tensor<512x32xf32> 2026-02-21T08:15:02.7440783Z %20 = 
math.log %19 : tensor<512x32xf32> 2026-02-21T08:15:02.7440981Z %21 = arith.subf %20, %11 : tensor<512x32xf32> 2026-02-21T08:15:02.7441175Z %22 = arith.mulf %12, %21 : tensor<512x32xf32> 2026-02-21T08:15:02.7441709Z %23 = arith.addf %22, %cst : tensor<512x32xf32> 2026-02-21T08:15:02.7441974Z scf.yield %23 : tensor<512x32xf32> 2026-02-21T08:15:02.7442148Z } 2026-02-21T08:15:02.7442286Z %14 = arith.addf %arg6, %13 : tensor<512x32xf32> 2026-02-21T08:15:02.7442484Z scf.yield %14 : tensor<512x32xf32> 2026-02-21T08:15:02.7442805Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:15:02.7443131Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:15:02.7443327Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:15:02.7443503Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:15:02.7443694Z tt.reduce.return %11 : f32 2026-02-21T08:15:02.7443874Z }) : (tensor<512x32xf32>) -> tensor<512xf32> 2026-02-21T08:15:02.7444110Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<512x!tt.ptr> 2026-02-21T08:15:02.7444438Z %10 = tt.addptr %9, %6 : tensor<512x!tt.ptr>, tensor<512xi32> 2026-02-21T08:15:02.7444680Z tt.store %10, %8 : tensor<512x!tt.ptr> 2026-02-21T08:15:02.7444872Z tt.return 2026-02-21T08:15:02.7445003Z } 2026-02-21T08:15:02.7445134Z } 2026-02-21T08:15:02.7445204Z 2026-02-21T08:15:02.7445257Z {-# 2026-02-21T08:15:02.7445398Z external_resources: { 2026-02-21T08:15:02.7445557Z mlir_reproducer: { 2026-02-21T08:15:02.7449886Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, 
tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:15:02.7454270Z disable_threading: false, 2026-02-21T08:15:02.7454438Z verify_each: true 2026-02-21T08:15:02.7454576Z } 2026-02-21T08:15:02.7454698Z } 2026-02-21T08:15:02.7454806Z #-} 2026-02-21T08:15:02.7455220Z /tmp/torchinductor_root/gz/cgztxd3tqx7ki7scwmrlgdnafqpeglbx7qjnld2ruexcgpkellr2.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:15:02.7456402Z /tmp/torchinductor_root/gz/cgztxd3tqx7ki7scwmrlgdnafqpeglbx7qjnld2ruexcgpkellr2.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at 
`std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:15:02.7457386Z [34s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:15:02.7458432Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:15:02.7459300Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:15:02.7459584Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:15:03.5012878Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:15:03.5015411Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:15:03.5016892Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:15:03.5017256Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:15:03.5018773Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:15:03.5019100Z %c2368_i32 = arith.constant 2368 : i32 2026-02-21T08:15:03.5019404Z %cst = arith.constant dense<0.000000e+00> : tensor<16x32xf32> 2026-02-21T08:15:03.5019689Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:15:03.5019913Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:15:03.5020146Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:15:03.5020389Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T08:15:03.5020612Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:15:03.5020969Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:15:03.5021474Z %1 = tt.make_tensor_descriptor %arg1, 
[%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:15:03.5021965Z %2 = tt.get_program_id x : i32 2026-02-21T08:15:03.5022176Z %3 = arith.subi %c256_i32, %2 : i32 2026-02-21T08:15:03.5022378Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:15:03.5022571Z %4 = arith.subi %c2368_i32, %c1_i32 : i32 2026-02-21T08:15:03.5022774Z %5 = arith.addi %3, %4 : i32 2026-02-21T08:15:03.5022957Z %6 = arith.divui %5, %c2368_i32 : i32 2026-02-21T08:15:03.5023166Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:15:03.5023372Z %7 = arith.remsi %6, %c4_i32 : i32 2026-02-21T08:15:03.5023593Z %8 = arith.subi %6, %7 : i32 2026-02-21T08:15:03.5023800Z %9 = arith.muli %8, %c2368_i32 : i32 2026-02-21T08:15:03.5024005Z %10 = arith.addi %2, %9 : i32 2026-02-21T08:15:03.5024217Z %11 = arith.muli %c2368_i32, %c4_i32 : i32 2026-02-21T08:15:03.5024438Z scf.for %arg5 = %2 to %10 step %11 : i32 { 2026-02-21T08:15:03.5024652Z %12 = arith.muli %arg5, %c16_i32 : i32 2026-02-21T08:15:03.5024895Z %13 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:15:03.5025207Z %14 = tt.splat %12 : i32 -> tensor<16xi32> 2026-02-21T08:15:03.5025420Z %15 = arith.addi %14, %13 : tensor<16xi32> 2026-02-21T08:15:03.5025777Z %16 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>) : i32 { 2026-02-21T08:15:03.5026243Z %50 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:15:03.5026760Z %51 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:15:03.5027103Z %52 = scf.if %arg3 -> (tensor<16x32xf32>) { 2026-02-21T08:15:03.5027517Z %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32> 2026-02-21T08:15:03.5027977Z %55 = arith.subf %51, %50 : tensor<16x32xf32> 2026-02-21T08:15:03.5028239Z %56 = arith.mulf %54, %55 : tensor<16x32xf32> 2026-02-21T08:15:03.5028495Z %57 = arith.addf 
%56, %cst : tensor<16x32xf32> 2026-02-21T08:15:03.5028990Z scf.yield %57 : tensor<16x32xf32> 2026-02-21T08:15:03.5029178Z } else { 2026-02-21T08:15:03.5029362Z %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32> 2026-02-21T08:15:03.5029605Z %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32> 2026-02-21T08:15:03.5029843Z %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32> 2026-02-21T08:15:03.5030071Z %57 = arith.ori %55, %56 : tensor<16x32xi1> 2026-02-21T08:15:03.5030317Z %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32> 2026-02-21T08:15:03.5030572Z %59 = math.log %58 : tensor<16x32xf32> 2026-02-21T08:15:03.5030770Z %60 = arith.subf %59, %50 : tensor<16x32xf32> 2026-02-21T08:15:03.5031019Z %61 = arith.mulf %51, %60 : tensor<16x32xf32> 2026-02-21T08:15:03.5031235Z %62 = arith.addf %61, %cst : tensor<16x32xf32> 2026-02-21T08:15:03.5031526Z scf.yield %62 : tensor<16x32xf32> 2026-02-21T08:15:03.5031705Z } 2026-02-21T08:15:03.5031902Z %53 = arith.addf %arg7, %52 : tensor<16x32xf32> 2026-02-21T08:15:03.5032101Z scf.yield %53 : tensor<16x32xf32> 2026-02-21T08:15:03.5032323Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 2026-02-21T08:15:03.5032544Z %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({ 2026-02-21T08:15:03.5032744Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:15:03.5032921Z %50 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:15:03.5033116Z tt.reduce.return %50 : f32 2026-02-21T08:15:03.5033310Z }) : (tensor<16x32xf32>) -> tensor<16xf32> 2026-02-21T08:15:03.5033542Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:15:03.5033817Z %19 = tt.addptr %18, %15 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:15:03.5034056Z tt.store %19, %17 : tensor<16x!tt.ptr> 2026-02-21T08:15:03.5034262Z %c1_i32_0 = arith.constant 1 : i32 2026-02-21T08:15:03.5034456Z %20 = arith.muli %c2368_i32, %c1_i32_0 : i32 2026-02-21T08:15:03.5034659Z %21 = arith.addi %arg5, %20 : i32 2026-02-21T08:15:03.5034847Z %22 = arith.muli %21, %c16_i32 : i32 
2026-02-21T08:15:03.5035077Z %23 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:15:03.5035333Z %24 = tt.splat %22 : i32 -> tensor<16xi32> 2026-02-21T08:15:03.5035528Z %25 = arith.addi %24, %23 : tensor<16xi32> 2026-02-21T08:15:03.5035854Z %26 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>) : i32 { 2026-02-21T08:15:03.5036261Z %50 = tt.descriptor_load %0[%22, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:15:03.5036643Z %51 = tt.descriptor_load %1[%22, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:15:03.5036941Z %52 = scf.if %arg3 -> (tensor<16x32xf32>) { 2026-02-21T08:15:03.5037307Z %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32> 2026-02-21T08:15:03.5037690Z %55 = arith.subf %51, %50 : tensor<16x32xf32> 2026-02-21T08:15:03.5037898Z %56 = arith.mulf %54, %55 : tensor<16x32xf32> 2026-02-21T08:15:03.5038115Z %57 = arith.addf %56, %cst : tensor<16x32xf32> 2026-02-21T08:15:03.5038311Z scf.yield %57 : tensor<16x32xf32> 2026-02-21T08:15:03.5038513Z } else { 2026-02-21T08:15:03.5038685Z %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32> 2026-02-21T08:15:03.5038905Z %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32> 2026-02-21T08:15:03.5039152Z %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32> 2026-02-21T08:15:03.5039364Z %57 = arith.ori %55, %56 : tensor<16x32xi1> 2026-02-21T08:15:03.5039610Z %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32> 2026-02-21T08:15:03.5039857Z %59 = math.log %58 : tensor<16x32xf32> 2026-02-21T08:15:03.5040164Z %60 = arith.subf %59, %50 : tensor<16x32xf32> 2026-02-21T08:15:03.5040375Z %61 = arith.mulf %51, %60 : tensor<16x32xf32> 2026-02-21T08:15:03.5040580Z %62 = arith.addf %61, %cst : tensor<16x32xf32> 2026-02-21T08:15:03.5040785Z scf.yield %62 : tensor<16x32xf32> 2026-02-21T08:15:03.5040954Z } 2026-02-21T08:15:03.5041106Z 
%53 = arith.addf %arg7, %52 : tensor<16x32xf32> 2026-02-21T08:15:03.5041302Z scf.yield %53 : tensor<16x32xf32> 2026-02-21T08:15:03.5041525Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 2026-02-21T08:15:03.5041745Z %27 = "tt.reduce"(%26) <{axis = 1 : i32}> ({ 2026-02-21T08:15:03.5041987Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:15:03.5042238Z %50 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:15:03.5042425Z tt.reduce.return %50 : f32 2026-02-21T08:15:03.5042615Z }) : (tensor<16x32xf32>) -> tensor<16xf32> 2026-02-21T08:15:03.5042903Z %28 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:15:03.5043182Z %29 = tt.addptr %28, %25 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:15:03.5043427Z tt.store %29, %27 : tensor<16x!tt.ptr> 2026-02-21T08:15:03.5043639Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:15:03.5043835Z %30 = arith.muli %c2368_i32, %c2_i32 : i32 2026-02-21T08:15:03.5044026Z %31 = arith.addi %arg5, %30 : i32 2026-02-21T08:15:03.5044218Z %32 = arith.muli %31, %c16_i32 : i32 2026-02-21T08:15:03.5044447Z %33 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:15:03.5044703Z %34 = tt.splat %32 : i32 -> tensor<16xi32> 2026-02-21T08:15:03.5044901Z %35 = arith.addi %34, %33 : tensor<16xi32> 2026-02-21T08:15:03.5045222Z %36 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>) : i32 { 2026-02-21T08:15:03.5045642Z %50 = tt.descriptor_load %0[%32, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:15:03.5046025Z %51 = tt.descriptor_load %1[%32, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:15:03.5046334Z %52 = scf.if %arg3 -> (tensor<16x32xf32>) { 2026-02-21T08:15:03.5046707Z %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32> 2026-02-21T08:15:03.5047094Z %55 = arith.subf %51, %50 : tensor<16x32xf32> 2026-02-21T08:15:03.5047317Z %56 = arith.mulf %54, %55 : 
tensor<16x32xf32> 2026-02-21T08:15:03.5047533Z %57 = arith.addf %56, %cst : tensor<16x32xf32> 2026-02-21T08:15:03.5047748Z scf.yield %57 : tensor<16x32xf32> 2026-02-21T08:15:03.5047927Z } else { 2026-02-21T08:15:03.5048111Z %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32> 2026-02-21T08:15:03.5048339Z %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32> 2026-02-21T08:15:03.5048595Z %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32> 2026-02-21T08:15:03.5048811Z %57 = arith.ori %55, %56 : tensor<16x32xi1> 2026-02-21T08:15:03.5049056Z %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32> 2026-02-21T08:15:03.5049319Z %59 = math.log %58 : tensor<16x32xf32> 2026-02-21T08:15:03.5049515Z %60 = arith.subf %59, %50 : tensor<16x32xf32> 2026-02-21T08:15:03.5049725Z %61 = arith.mulf %51, %60 : tensor<16x32xf32> 2026-02-21T08:15:03.5049931Z %62 = arith.addf %61, %cst : tensor<16x32xf32> 2026-02-21T08:15:03.5050133Z scf.yield %62 : tensor<16x32xf32> 2026-02-21T08:15:03.5050301Z } 2026-02-21T08:15:03.5050452Z %53 = arith.addf %arg7, %52 : tensor<16x32xf32> 2026-02-21T08:15:03.5050651Z scf.yield %53 : tensor<16x32xf32> 2026-02-21T08:15:03.5050863Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 2026-02-21T08:15:03.5051093Z %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({ 2026-02-21T08:15:03.5051349Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:15:03.5051535Z %50 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:15:03.5051717Z tt.reduce.return %50 : f32 2026-02-21T08:15:03.5051940Z }) : (tensor<16x32xf32>) -> tensor<16xf32> 2026-02-21T08:15:03.5052170Z %38 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:15:03.5052439Z %39 = tt.addptr %38, %35 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:15:03.5052681Z tt.store %39, %37 : tensor<16x!tt.ptr> 2026-02-21T08:15:03.5052878Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:15:03.5053069Z %40 = arith.muli %c2368_i32, %c3_i32 : i32 2026-02-21T08:15:03.5053255Z %41 = arith.addi %arg5, %40 : i32 
2026-02-21T08:15:03.5053435Z %42 = arith.muli %41, %c16_i32 : i32 2026-02-21T08:15:03.5053657Z %43 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:15:03.5053955Z %44 = tt.splat %42 : i32 -> tensor<16xi32> 2026-02-21T08:15:03.5054159Z %45 = arith.addi %44, %43 : tensor<16xi32> 2026-02-21T08:15:03.5054463Z %46 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>) : i32 { 2026-02-21T08:15:03.5054865Z %50 = tt.descriptor_load %0[%42, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:15:03.5055232Z %51 = tt.descriptor_load %1[%42, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:15:03.5055529Z %52 = scf.if %arg3 -> (tensor<16x32xf32>) { 2026-02-21T08:15:03.5055894Z %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32> 2026-02-21T08:15:03.5056274Z %55 = arith.subf %51, %50 : tensor<16x32xf32> 2026-02-21T08:15:03.5056485Z %56 = arith.mulf %54, %55 : tensor<16x32xf32> 2026-02-21T08:15:03.5056692Z %57 = arith.addf %56, %cst : tensor<16x32xf32> 2026-02-21T08:15:03.5056900Z scf.yield %57 : tensor<16x32xf32> 2026-02-21T08:15:03.5057068Z } else { 2026-02-21T08:15:03.5057235Z %54 = tt.splat %arg4 : f32 -> tensor<16x32xf32> 2026-02-21T08:15:03.5057451Z %55 = arith.cmpf ogt, %51, %54 : tensor<16x32xf32> 2026-02-21T08:15:03.5057678Z %56 = arith.cmpf une, %51, %51 : tensor<16x32xf32> 2026-02-21T08:15:03.5057895Z %57 = arith.ori %55, %56 : tensor<16x32xi1> 2026-02-21T08:15:03.5058131Z %58 = arith.select %57, %51, %54 : tensor<16x32xi1>, tensor<16x32xf32> 2026-02-21T08:15:03.5058379Z %59 = math.log %58 : tensor<16x32xf32> 2026-02-21T08:15:03.5058573Z %60 = arith.subf %59, %50 : tensor<16x32xf32> 2026-02-21T08:15:03.5058781Z %61 = arith.mulf %51, %60 : tensor<16x32xf32> 2026-02-21T08:15:03.5058982Z %62 = arith.addf %61, %cst : tensor<16x32xf32> 2026-02-21T08:15:03.5059186Z scf.yield %62 : 
tensor<16x32xf32> 2026-02-21T08:15:03.5059365Z } 2026-02-21T08:15:03.5059518Z %53 = arith.addf %arg7, %52 : tensor<16x32xf32> 2026-02-21T08:15:03.5059720Z scf.yield %53 : tensor<16x32xf32> 2026-02-21T08:15:03.5059929Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 2026-02-21T08:15:03.5060154Z %47 = "tt.reduce"(%46) <{axis = 1 : i32}> ({ 2026-02-21T08:15:03.5060341Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:15:03.5060522Z %50 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:15:03.5060704Z tt.reduce.return %50 : f32 2026-02-21T08:15:03.5060895Z }) : (tensor<16x32xf32>) -> tensor<16xf32> 2026-02-21T08:15:03.5061131Z %48 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:15:03.5061396Z %49 = tt.addptr %48, %45 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:15:03.5061645Z tt.store %49, %47 : tensor<16x!tt.ptr> 2026-02-21T08:15:03.5061829Z } 2026-02-21T08:15:03.5062042Z scf.for %arg5 = %10 to %c256_i32 step %c2368_i32 : i32 { 2026-02-21T08:15:03.5062264Z %12 = arith.muli %arg5, %c16_i32 : i32 2026-02-21T08:15:03.5062603Z %13 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:15:03.5062869Z %14 = tt.splat %12 : i32 -> tensor<16xi32> 2026-02-21T08:15:03.5063087Z %15 = arith.addi %14, %13 : tensor<16xi32> 2026-02-21T08:15:03.5063448Z %16 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<16x32xf32>) : i32 { 2026-02-21T08:15:03.5063889Z %20 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:15:03.5064302Z %21 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc> -> tensor<16x32xf32> 2026-02-21T08:15:03.5064602Z %22 = scf.if %arg3 -> (tensor<16x32xf32>) { 2026-02-21T08:15:03.5065060Z %24 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x32xf32>) -> tensor<16x32xf32> 2026-02-21T08:15:03.5065515Z %25 = arith.subf %21, %20 : tensor<16x32xf32> 2026-02-21T08:15:03.5065734Z %26 = arith.mulf %24, %25 : 
tensor<16x32xf32> 2026-02-21T08:15:03.5065949Z %27 = arith.addf %26, %cst : tensor<16x32xf32> 2026-02-21T08:15:03.5066159Z scf.yield %27 : tensor<16x32xf32> 2026-02-21T08:15:03.5066350Z } else { 2026-02-21T08:15:03.5066510Z %24 = tt.splat %arg4 : f32 -> tensor<16x32xf32> 2026-02-21T08:15:03.5066733Z %25 = arith.cmpf ogt, %21, %24 : tensor<16x32xf32> 2026-02-21T08:15:03.5066997Z %26 = arith.cmpf une, %21, %21 : tensor<16x32xf32> 2026-02-21T08:15:03.5067227Z %27 = arith.ori %25, %26 : tensor<16x32xi1> 2026-02-21T08:15:03.5067500Z %28 = arith.select %27, %21, %24 : tensor<16x32xi1>, tensor<16x32xf32> 2026-02-21T08:15:03.5067753Z %29 = math.log %28 : tensor<16x32xf32> 2026-02-21T08:15:03.5067954Z %30 = arith.subf %29, %20 : tensor<16x32xf32> 2026-02-21T08:15:03.5068169Z %31 = arith.mulf %21, %30 : tensor<16x32xf32> 2026-02-21T08:15:03.5068382Z %32 = arith.addf %31, %cst : tensor<16x32xf32> 2026-02-21T08:15:03.5068578Z scf.yield %32 : tensor<16x32xf32> 2026-02-21T08:15:03.5068755Z } 2026-02-21T08:15:03.5068909Z %23 = arith.addf %arg7, %22 : tensor<16x32xf32> 2026-02-21T08:15:03.5069112Z scf.yield %23 : tensor<16x32xf32> 2026-02-21T08:15:03.5069334Z } {tt.disallow_acc_multi_buffer, tt.warp_specialize} 2026-02-21T08:15:03.5069571Z %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({ 2026-02-21T08:15:03.5069786Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:15:03.5069962Z %20 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:15:03.5070152Z tt.reduce.return %20 : f32 2026-02-21T08:15:03.5070334Z }) : (tensor<16x32xf32>) -> tensor<16xf32> 2026-02-21T08:15:03.5070567Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:15:03.5070838Z %19 = tt.addptr %18, %15 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:15:03.5071077Z tt.store %19, %17 : tensor<16x!tt.ptr> 2026-02-21T08:15:03.5071276Z } {tt.num_stages = 1 : i32} 2026-02-21T08:15:03.5071435Z tt.return 2026-02-21T08:15:03.5071569Z } 2026-02-21T08:15:03.5071686Z } 2026-02-21T08:15:03.5071761Z 2026-02-21T08:15:03.5071812Z 
{-# 2026-02-21T08:15:03.5071984Z external_resources: { 2026-02-21T08:15:03.5072143Z mlir_reproducer: { 2026-02-21T08:15:03.5076650Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=1}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=1}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=1}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:15:03.5081112Z disable_threading: false, 2026-02-21T08:15:03.5081286Z verify_each: 
true 2026-02-21T08:15:03.5081426Z } 2026-02-21T08:15:03.5081548Z } 2026-02-21T08:15:03.5081657Z #-} 2026-02-21T08:15:03.5082123Z /tmp/torchinductor_root/fx/cfxhv7mynb2iqmyluksc2n3g4deg6txguqnoxbc4ybakkfpdvbpq.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:15:03.5083331Z /tmp/torchinductor_root/fx/cfxhv7mynb2iqmyluksc2n3g4deg6txguqnoxbc4ybakkfpdvbpq.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:15:03.5084339Z [34s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:15:03.5085450Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=32, num_sm_multiplier=16, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, False], range_num_stages=[0, 0], range_unroll_factors=[4, 0], range_warp_specializes=[False, True]), static_shapes=True) 2026-02-21T08:15:03.5086462Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:15:03.5086708Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:15:07.1647991Z module { 2026-02-21T08:15:07.1648895Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:15:07.1649600Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:15:07.1649823Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:15:07.1650025Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:15:07.1650257Z %cst = arith.constant dense<0.000000e+00> : tensor<16x256xf32> 2026-02-21T08:15:07.1650525Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:15:07.1650710Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:15:07.1650919Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:15:07.1651111Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T08:15:07.1651288Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:15:07.1651609Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:15:07.1652293Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:15:07.1652620Z %2 = tt.get_program_id x : i32 2026-02-21T08:15:07.1653167Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:15:07.1653347Z %4 = arith.minsi %3, %c256_i32 : i32 2026-02-21T08:15:07.1653551Z scf.for %arg5 = %2 to %4 step %c1_i32 : i32 { 2026-02-21T08:15:07.1653766Z %5 = arith.muli %arg5, %c16_i32 : i32 2026-02-21T08:15:07.1654003Z %6 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:15:07.1654270Z %7 = tt.splat %5 : i32 -> tensor<16xi32> 2026-02-21T08:15:07.1654468Z %8 = arith.addi %7, %6 : tensor<16xi32> 2026-02-21T08:15:07.1654786Z %9 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c256_i32 iter_args(%arg7 = %cst) -> (tensor<16x256xf32>) : i32 { 2026-02-21T08:15:07.1655229Z %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc> -> tensor<16x256xf32> 
2026-02-21T08:15:07.1655621Z %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc> -> tensor<16x256xf32> 2026-02-21T08:15:07.1656013Z %15 = scf.if %arg3 -> (tensor<16x256xf32>) { 2026-02-21T08:15:07.1656420Z %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32> 2026-02-21T08:15:07.1656807Z %18 = arith.subf %14, %13 : tensor<16x256xf32> 2026-02-21T08:15:07.1657016Z %19 = arith.mulf %17, %18 : tensor<16x256xf32> 2026-02-21T08:15:07.1657233Z %20 = arith.addf %19, %cst : tensor<16x256xf32> 2026-02-21T08:15:07.1657434Z scf.yield %20 : tensor<16x256xf32> 2026-02-21T08:15:07.1657612Z } else { 2026-02-21T08:15:07.1657777Z %17 = tt.splat %arg4 : f32 -> tensor<16x256xf32> 2026-02-21T08:15:07.1658105Z %18 = arith.cmpf ogt, %14, %17 : tensor<16x256xf32> 2026-02-21T08:15:07.1658350Z %19 = arith.cmpf une, %14, %14 : tensor<16x256xf32> 2026-02-21T08:15:07.1658575Z %20 = arith.ori %18, %19 : tensor<16x256xi1> 2026-02-21T08:15:07.1658827Z %21 = arith.select %20, %14, %17 : tensor<16x256xi1>, tensor<16x256xf32> 2026-02-21T08:15:07.1659074Z %22 = math.log %21 : tensor<16x256xf32> 2026-02-21T08:15:07.1659277Z %23 = arith.subf %22, %13 : tensor<16x256xf32> 2026-02-21T08:15:07.1659497Z %24 = arith.mulf %14, %23 : tensor<16x256xf32> 2026-02-21T08:15:07.1659728Z %25 = arith.addf %24, %cst : tensor<16x256xf32> 2026-02-21T08:15:07.1659943Z scf.yield %25 : tensor<16x256xf32> 2026-02-21T08:15:07.1660129Z } 2026-02-21T08:15:07.1660285Z %16 = arith.addf %arg7, %15 : tensor<16x256xf32> 2026-02-21T08:15:07.1660479Z scf.yield %16 : tensor<16x256xf32> 2026-02-21T08:15:07.1660710Z } {tt.flatten, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:15:07.1660939Z %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({ 2026-02-21T08:15:07.1661134Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:15:07.1661316Z %13 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:15:07.1661512Z tt.reduce.return %13 : f32 
2026-02-21T08:15:07.1661710Z }) : (tensor<16x256xf32>) -> tensor<16xf32> 2026-02-21T08:15:07.1661989Z %11 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:15:07.1662259Z %12 = tt.addptr %11, %8 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:15:07.1662488Z tt.store %12, %10 : tensor<16x!tt.ptr> 2026-02-21T08:15:07.1662671Z } 2026-02-21T08:15:07.1662791Z tt.return 2026-02-21T08:15:07.1662921Z } 2026-02-21T08:15:07.1663040Z } 2026-02-21T08:15:07.1663119Z 2026-02-21T08:15:07.1663170Z {-# 2026-02-21T08:15:07.1663302Z external_resources: { 2026-02-21T08:15:07.1663454Z mlir_reproducer: { 2026-02-21T08:15:07.1667761Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, 
tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:15:07.1672083Z disable_threading: false, 2026-02-21T08:15:07.1672250Z verify_each: true 2026-02-21T08:15:07.1672386Z } 2026-02-21T08:15:07.1672505Z } 2026-02-21T08:15:07.1672612Z #-} 2026-02-21T08:15:07.1673026Z /tmp/torchinductor_root/zv/czvcur7zmglxpsruw55sec2h27wqfefv4zsbjzrwcbfysarfkw6k.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:15:07.1678791Z /tmp/torchinductor_root/zv/czvcur7zmglxpsruw55sec2h27wqfefv4zsbjzrwcbfysarfkw6k.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:15:07.1679826Z [38s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:15:07.1680862Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:15:07.1681776Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:15:07.1682096Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:15:08.9224387Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:15:08.9225080Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:15:08.9225646Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:15:08.9225837Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:15:08.9226029Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:15:08.9226242Z %cst = arith.constant dense<0.000000e+00> : tensor<4x4xf32> 2026-02-21T08:15:08.9226463Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:15:08.9226637Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:15:08.9226831Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:15:08.9227020Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T08:15:08.9227194Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:15:08.9227513Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:15:08.9228265Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:15:08.9228579Z %2 = tt.get_program_id x : i32 2026-02-21T08:15:08.9228750Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:15:08.9228934Z %4 = arith.minsi %3, %c1024_i32 : i32 2026-02-21T08:15:08.9229119Z %5 = arith.subi %4, %2 : i32 2026-02-21T08:15:08.9229287Z %c1_i32_0 = arith.constant 1 : i32 2026-02-21T08:15:08.9229469Z %6 = arith.subi %c1_i32, %c1_i32_0 : i32 2026-02-21T08:15:08.9229640Z %7 = arith.addi %5, %6 : i32 2026-02-21T08:15:08.9229808Z %8 = arith.divui %7, %c1_i32 : i32 2026-02-21T08:15:08.9229976Z %c4_i32_1 = arith.constant 4 : i32 2026-02-21T08:15:08.9230150Z %9 = arith.remsi %8, %c4_i32_1 : i32 2026-02-21T08:15:08.9230316Z %10 = arith.subi %8, %9 : i32 2026-02-21T08:15:08.9230487Z %11 = arith.muli %10, %c1_i32 : i32 2026-02-21T08:15:08.9230749Z %12 = arith.addi %2, 
%11 : i32 2026-02-21T08:15:08.9230929Z %13 = arith.muli %c1_i32, %c4_i32_1 : i32 2026-02-21T08:15:08.9231132Z scf.for %arg5 = %2 to %12 step %13 : i32 { 2026-02-21T08:15:08.9231380Z %14 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:15:08.9231611Z %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:15:08.9231949Z %16 = tt.splat %14 : i32 -> tensor<4xi32> 2026-02-21T08:15:08.9232145Z %17 = arith.addi %16, %15 : tensor<4xi32> 2026-02-21T08:15:08.9232457Z %18 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:15:08.9232850Z %52 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:15:08.9233216Z %53 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:15:08.9233506Z %54 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:15:08.9233871Z %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:15:08.9234247Z %57 = arith.subf %53, %52 : tensor<4x4xf32> 2026-02-21T08:15:08.9234447Z %58 = arith.mulf %56, %57 : tensor<4x4xf32> 2026-02-21T08:15:08.9234659Z %59 = arith.addf %58, %cst : tensor<4x4xf32> 2026-02-21T08:15:08.9234854Z scf.yield %59 : tensor<4x4xf32> 2026-02-21T08:15:08.9235034Z } else { 2026-02-21T08:15:08.9235198Z %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:15:08.9235405Z %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32> 2026-02-21T08:15:08.9235622Z %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32> 2026-02-21T08:15:08.9235821Z %59 = arith.ori %57, %58 : tensor<4x4xi1> 2026-02-21T08:15:08.9236055Z %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:15:08.9236288Z %61 = math.log %60 : tensor<4x4xf32> 2026-02-21T08:15:08.9236487Z %62 = arith.subf %61, %52 : tensor<4x4xf32> 2026-02-21T08:15:08.9236684Z %63 = arith.mulf %53, %62 : tensor<4x4xf32> 
2026-02-21T08:15:08.9236879Z %64 = arith.addf %63, %cst : tensor<4x4xf32> 2026-02-21T08:15:08.9237073Z scf.yield %64 : tensor<4x4xf32> 2026-02-21T08:15:08.9237236Z } 2026-02-21T08:15:08.9237384Z %55 = arith.addf %arg7, %54 : tensor<4x4xf32> 2026-02-21T08:15:08.9237577Z scf.yield %55 : tensor<4x4xf32> 2026-02-21T08:15:08.9237911Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:15:08.9238252Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:15:08.9238453Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:15:08.9238640Z %52 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:15:08.9238825Z tt.reduce.return %52 : f32 2026-02-21T08:15:08.9239017Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:15:08.9239250Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:15:08.9239716Z %21 = tt.addptr %20, %17 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:15:08.9239959Z tt.store %21, %19 : tensor<4x!tt.ptr> 2026-02-21T08:15:08.9240181Z %c1_i32_2 = arith.constant 1 : i32 2026-02-21T08:15:08.9240380Z %22 = arith.muli %c1_i32, %c1_i32_2 : i32 2026-02-21T08:15:08.9240587Z %23 = arith.addi %arg5, %22 : i32 2026-02-21T08:15:08.9240769Z %24 = arith.muli %23, %c4_i32 : i32 2026-02-21T08:15:08.9241007Z %25 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:15:08.9241257Z %26 = tt.splat %24 : i32 -> tensor<4xi32> 2026-02-21T08:15:08.9241449Z %27 = arith.addi %26, %25 : tensor<4xi32> 2026-02-21T08:15:08.9241767Z %28 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:15:08.9242261Z %52 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:15:08.9242638Z %53 = tt.descriptor_load %1[%24, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:15:08.9242927Z %54 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:15:08.9243299Z %56 = tt.extern_elementwise %53 {libname = 
"", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:15:08.9243671Z %57 = arith.subf %53, %52 : tensor<4x4xf32> 2026-02-21T08:15:08.9243876Z %58 = arith.mulf %56, %57 : tensor<4x4xf32> 2026-02-21T08:15:08.9244090Z %59 = arith.addf %58, %cst : tensor<4x4xf32> 2026-02-21T08:15:08.9244288Z scf.yield %59 : tensor<4x4xf32> 2026-02-21T08:15:08.9244468Z } else { 2026-02-21T08:15:08.9244626Z %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:15:08.9244849Z %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32> 2026-02-21T08:15:08.9245075Z %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32> 2026-02-21T08:15:08.9245285Z %59 = arith.ori %57, %58 : tensor<4x4xi1> 2026-02-21T08:15:08.9245527Z %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:15:08.9245762Z %61 = math.log %60 : tensor<4x4xf32> 2026-02-21T08:15:08.9245959Z %62 = arith.subf %61, %52 : tensor<4x4xf32> 2026-02-21T08:15:08.9246152Z %63 = arith.mulf %53, %62 : tensor<4x4xf32> 2026-02-21T08:15:08.9246357Z %64 = arith.addf %63, %cst : tensor<4x4xf32> 2026-02-21T08:15:08.9246553Z scf.yield %64 : tensor<4x4xf32> 2026-02-21T08:15:08.9246723Z } 2026-02-21T08:15:08.9246863Z %55 = arith.addf %arg7, %54 : tensor<4x4xf32> 2026-02-21T08:15:08.9247045Z scf.yield %55 : tensor<4x4xf32> 2026-02-21T08:15:08.9247353Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:15:08.9247673Z %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({ 2026-02-21T08:15:08.9247867Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:15:08.9248036Z %52 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:15:08.9248219Z tt.reduce.return %52 : f32 2026-02-21T08:15:08.9248402Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:15:08.9248614Z %30 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:15:08.9248869Z %31 = tt.addptr %30, %27 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:15:08.9249094Z tt.store %31, 
%29 : tensor<4x!tt.ptr> 2026-02-21T08:15:08.9249287Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:15:08.9249460Z %32 = arith.muli %c1_i32, %c2_i32 : i32 2026-02-21T08:15:08.9249640Z %33 = arith.addi %arg5, %32 : i32 2026-02-21T08:15:08.9249816Z %34 = arith.muli %33, %c4_i32 : i32 2026-02-21T08:15:08.9250026Z %35 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:15:08.9250262Z %36 = tt.splat %34 : i32 -> tensor<4xi32> 2026-02-21T08:15:08.9250507Z %37 = arith.addi %36, %35 : tensor<4xi32> 2026-02-21T08:15:08.9250847Z %38 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:15:08.9251255Z %52 = tt.descriptor_load %0[%34, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:15:08.9251639Z %53 = tt.descriptor_load %1[%34, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:15:08.9251966Z %54 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:15:08.9252324Z %56 = tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:15:08.9252715Z %57 = arith.subf %53, %52 : tensor<4x4xf32> 2026-02-21T08:15:08.9252930Z %58 = arith.mulf %56, %57 : tensor<4x4xf32> 2026-02-21T08:15:08.9253167Z %59 = arith.addf %58, %cst : tensor<4x4xf32> 2026-02-21T08:15:08.9253463Z scf.yield %59 : tensor<4x4xf32> 2026-02-21T08:15:08.9253632Z } else { 2026-02-21T08:15:08.9253795Z %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:15:08.9254005Z %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32> 2026-02-21T08:15:08.9254238Z %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32> 2026-02-21T08:15:08.9254451Z %59 = arith.ori %57, %58 : tensor<4x4xi1> 2026-02-21T08:15:08.9254702Z %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:15:08.9254948Z %61 = math.log %60 : tensor<4x4xf32> 2026-02-21T08:15:08.9255145Z %62 = arith.subf %61, %52 : tensor<4x4xf32> 2026-02-21T08:15:08.9255363Z %63 
= arith.mulf %53, %62 : tensor<4x4xf32> 2026-02-21T08:15:08.9255577Z %64 = arith.addf %63, %cst : tensor<4x4xf32> 2026-02-21T08:15:08.9255790Z scf.yield %64 : tensor<4x4xf32> 2026-02-21T08:15:08.9255971Z } 2026-02-21T08:15:08.9256122Z %55 = arith.addf %arg7, %54 : tensor<4x4xf32> 2026-02-21T08:15:08.9256311Z scf.yield %55 : tensor<4x4xf32> 2026-02-21T08:15:08.9256732Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:15:08.9257076Z %39 = "tt.reduce"(%38) <{axis = 1 : i32}> ({ 2026-02-21T08:15:08.9257261Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:15:08.9257448Z %52 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:15:08.9257621Z tt.reduce.return %52 : f32 2026-02-21T08:15:08.9257799Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:15:08.9258009Z %40 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:15:08.9258260Z %41 = tt.addptr %40, %37 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:15:08.9258479Z tt.store %41, %39 : tensor<4x!tt.ptr> 2026-02-21T08:15:08.9258672Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:15:08.9258856Z %42 = arith.muli %c1_i32, %c3_i32 : i32 2026-02-21T08:15:08.9259032Z %43 = arith.addi %arg5, %42 : i32 2026-02-21T08:15:08.9259204Z %44 = arith.muli %43, %c4_i32 : i32 2026-02-21T08:15:08.9259409Z %45 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:15:08.9259641Z %46 = tt.splat %44 : i32 -> tensor<4xi32> 2026-02-21T08:15:08.9259823Z %47 = arith.addi %46, %45 : tensor<4xi32> 2026-02-21T08:15:08.9260120Z %48 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:15:08.9260499Z %52 = tt.descriptor_load %0[%44, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:15:08.9260840Z %53 = tt.descriptor_load %1[%44, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:15:08.9261118Z %54 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:15:08.9261462Z %56 = 
tt.extern_elementwise %53 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:15:08.9261906Z %57 = arith.subf %53, %52 : tensor<4x4xf32> 2026-02-21T08:15:08.9262130Z %58 = arith.mulf %56, %57 : tensor<4x4xf32> 2026-02-21T08:15:08.9262323Z %59 = arith.addf %58, %cst : tensor<4x4xf32> 2026-02-21T08:15:08.9262515Z scf.yield %59 : tensor<4x4xf32> 2026-02-21T08:15:08.9262717Z } else { 2026-02-21T08:15:08.9262918Z %56 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:15:08.9263213Z %57 = arith.cmpf ogt, %53, %56 : tensor<4x4xf32> 2026-02-21T08:15:08.9263468Z %58 = arith.cmpf une, %53, %53 : tensor<4x4xf32> 2026-02-21T08:15:08.9263670Z %59 = arith.ori %57, %58 : tensor<4x4xi1> 2026-02-21T08:15:08.9263889Z %60 = arith.select %59, %53, %56 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:15:08.9264123Z %61 = math.log %60 : tensor<4x4xf32> 2026-02-21T08:15:08.9264307Z %62 = arith.subf %61, %52 : tensor<4x4xf32> 2026-02-21T08:15:08.9264570Z %63 = arith.mulf %53, %62 : tensor<4x4xf32> 2026-02-21T08:15:08.9264782Z %64 = arith.addf %63, %cst : tensor<4x4xf32> 2026-02-21T08:15:08.9265038Z scf.yield %64 : tensor<4x4xf32> 2026-02-21T08:15:08.9265269Z } 2026-02-21T08:15:08.9265452Z %55 = arith.addf %arg7, %54 : tensor<4x4xf32> 2026-02-21T08:15:08.9265668Z scf.yield %55 : tensor<4x4xf32> 2026-02-21T08:15:08.9265973Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:15:08.9266302Z %49 = "tt.reduce"(%48) <{axis = 1 : i32}> ({ 2026-02-21T08:15:08.9266491Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:15:08.9266675Z %52 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:15:08.9266861Z tt.reduce.return %52 : f32 2026-02-21T08:15:08.9267033Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:15:08.9267246Z %50 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:15:08.9267495Z %51 = tt.addptr %50, %47 : tensor<4x!tt.ptr>, tensor<4xi32> 
2026-02-21T08:15:08.9267722Z tt.store %51, %49 : tensor<4x!tt.ptr> 2026-02-21T08:15:08.9267906Z } {tt.disallow_acc_multi_buffer} 2026-02-21T08:15:08.9268101Z scf.for %arg5 = %12 to %4 step %c1_i32 : i32 { 2026-02-21T08:15:08.9268297Z %14 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:15:08.9268510Z %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:15:08.9268741Z %16 = tt.splat %14 : i32 -> tensor<4xi32> 2026-02-21T08:15:08.9268921Z %17 = arith.addi %16, %15 : tensor<4xi32> 2026-02-21T08:15:08.9269215Z %18 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:15:08.9269597Z %22 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:15:08.9269949Z %23 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:15:08.9270230Z %24 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:15:08.9270570Z %26 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:15:08.9270922Z %27 = arith.subf %23, %22 : tensor<4x4xf32> 2026-02-21T08:15:08.9271113Z %28 = arith.mulf %26, %27 : tensor<4x4xf32> 2026-02-21T08:15:08.9271313Z %29 = arith.addf %28, %cst : tensor<4x4xf32> 2026-02-21T08:15:08.9271506Z scf.yield %29 : tensor<4x4xf32> 2026-02-21T08:15:08.9271665Z } else { 2026-02-21T08:15:08.9271825Z %26 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:15:08.9272059Z %27 = arith.cmpf ogt, %23, %26 : tensor<4x4xf32> 2026-02-21T08:15:08.9272268Z %28 = arith.cmpf une, %23, %23 : tensor<4x4xf32> 2026-02-21T08:15:08.9272462Z %29 = arith.ori %27, %28 : tensor<4x4xi1> 2026-02-21T08:15:08.9272696Z %30 = arith.select %29, %23, %26 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:15:08.9272995Z %31 = math.log %30 : tensor<4x4xf32> 2026-02-21T08:15:08.9273186Z %32 = arith.subf %31, %22 : tensor<4x4xf32> 2026-02-21T08:15:08.9273387Z %33 = 
arith.mulf %23, %32 : tensor<4x4xf32> 2026-02-21T08:15:08.9273583Z %34 = arith.addf %33, %cst : tensor<4x4xf32> 2026-02-21T08:15:08.9273779Z scf.yield %34 : tensor<4x4xf32> 2026-02-21T08:15:08.9273940Z } 2026-02-21T08:15:08.9274087Z %25 = arith.addf %arg7, %24 : tensor<4x4xf32> 2026-02-21T08:15:08.9274278Z scf.yield %25 : tensor<4x4xf32> 2026-02-21T08:15:08.9274596Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:15:08.9274926Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:15:08.9275117Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:15:08.9275370Z %22 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:15:08.9275553Z tt.reduce.return %22 : f32 2026-02-21T08:15:08.9275732Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:15:08.9275939Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:15:08.9276193Z %21 = tt.addptr %20, %17 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:15:08.9276422Z tt.store %21, %19 : tensor<4x!tt.ptr> 2026-02-21T08:15:08.9276641Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:15:08.9276846Z tt.return 2026-02-21T08:15:08.9276964Z } 2026-02-21T08:15:08.9277086Z } 2026-02-21T08:15:08.9277152Z 2026-02-21T08:15:08.9277199Z {-# 2026-02-21T08:15:08.9277329Z external_resources: { 2026-02-21T08:15:08.9277477Z mlir_reproducer: { 2026-02-21T08:15:08.9281751Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal 
test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:15:08.9286101Z disable_threading: false, 2026-02-21T08:15:08.9286265Z verify_each: true 2026-02-21T08:15:08.9286405Z } 2026-02-21T08:15:08.9286517Z } 2026-02-21T08:15:08.9286629Z #-} 2026-02-21T08:15:08.9287038Z /tmp/torchinductor_root/yn/cynvmjngx6ng2kx3ixz3yqs4355ndpgxqui52zyrc6jmswkt3tv3.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:15:08.9288223Z /tmp/torchinductor_root/yn/cynvmjngx6ng2kx3ixz3yqs4355ndpgxqui52zyrc6jmswkt3tv3.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:15:08.9289236Z [40s] 
Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:15:08.9290310Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'last'], maxnreg=128, num_sm_multiplier=64, num_stages=6, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[0, 3], range_unroll_factors=[4, 1], range_warp_specializes=[False, True]), static_shapes=True) 2026-02-21T08:15:08.9291286Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:15:08.9291585Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:15:09.8608250Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:15:09.8609277Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:15:09.8610213Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:15:09.8610501Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:15:09.8610810Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:15:09.8611092Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:15:09.8611442Z %cst = arith.constant dense<0.000000e+00> : tensor<8x1024xf32> 2026-02-21T08:15:09.8611815Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:15:09.8612520Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:15:09.8612831Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:15:09.8613153Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T08:15:09.8613453Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:15:09.8613994Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:15:09.8614750Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 
2026-02-21T08:15:09.8615263Z %2 = tt.get_program_id x : i32 2026-02-21T08:15:09.8615542Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:15:09.8615815Z %4 = arith.minsi %3, %c512_i32 : i32 2026-02-21T08:15:09.8616094Z %5 = arith.subi %4, %2 : i32 2026-02-21T08:15:09.8616354Z %c1_i32_0 = arith.constant 1 : i32 2026-02-21T08:15:09.8616643Z %6 = arith.subi %c1_i32, %c1_i32_0 : i32 2026-02-21T08:15:09.8616917Z %7 = arith.addi %5, %6 : i32 2026-02-21T08:15:09.8617180Z %8 = arith.divui %7, %c1_i32 : i32 2026-02-21T08:15:09.8617451Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:15:09.8617723Z %9 = arith.remsi %8, %c2_i32 : i32 2026-02-21T08:15:09.8617997Z %10 = arith.subi %8, %9 : i32 2026-02-21T08:15:09.8618260Z %11 = arith.muli %10, %c1_i32 : i32 2026-02-21T08:15:09.8618538Z %12 = arith.addi %2, %11 : i32 2026-02-21T08:15:09.8618803Z %13 = arith.muli %c1_i32, %c2_i32 : i32 2026-02-21T08:15:09.8619110Z scf.for %arg5 = %2 to %12 step %13 : i32 { 2026-02-21T08:15:09.8619406Z %14 = arith.muli %arg5, %c8_i32 : i32 2026-02-21T08:15:09.8619767Z %15 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:15:09.8620162Z %16 = tt.splat %14 : i32 -> tensor<8xi32> 2026-02-21T08:15:09.8620458Z %17 = arith.addi %16, %15 : tensor<8xi32> 2026-02-21T08:15:09.8620983Z %18 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<8x1024xf32>) : i32 { 2026-02-21T08:15:09.8621672Z %32 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<8x1024xf32> 2026-02-21T08:15:09.8622776Z %33 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc> -> tensor<8x1024xf32> 2026-02-21T08:15:09.8623369Z %34 = scf.if %arg3 -> (tensor<8x1024xf32>) { 2026-02-21T08:15:09.8624224Z %36 = tt.extern_elementwise %33 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T08:15:09.8624964Z %37 = arith.subf %33, %32 : tensor<8x1024xf32> 2026-02-21T08:15:09.8625369Z %38 = arith.mulf %36, %37 
: tensor<8x1024xf32> 2026-02-21T08:15:09.8625863Z %39 = arith.addf %38, %cst : tensor<8x1024xf32> 2026-02-21T08:15:09.8626387Z scf.yield %39 : tensor<8x1024xf32> 2026-02-21T08:15:09.8626836Z } else { 2026-02-21T08:15:09.8627140Z %36 = tt.splat %arg4 : f32 -> tensor<8x1024xf32> 2026-02-21T08:15:09.8627731Z %37 = arith.cmpf ogt, %33, %36 : tensor<8x1024xf32> 2026-02-21T08:15:09.8628332Z %38 = arith.cmpf une, %33, %33 : tensor<8x1024xf32> 2026-02-21T08:15:09.8628736Z %39 = arith.ori %37, %38 : tensor<8x1024xi1> 2026-02-21T08:15:09.8629280Z %40 = arith.select %39, %33, %36 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T08:15:09.8629754Z %41 = math.log %40 : tensor<8x1024xf32> 2026-02-21T08:15:09.8630164Z %42 = arith.subf %41, %32 : tensor<8x1024xf32> 2026-02-21T08:15:09.8630598Z %43 = arith.mulf %33, %42 : tensor<8x1024xf32> 2026-02-21T08:15:09.8631045Z %44 = arith.addf %43, %cst : tensor<8x1024xf32> 2026-02-21T08:15:09.8631450Z scf.yield %44 : tensor<8x1024xf32> 2026-02-21T08:15:09.8631817Z } 2026-02-21T08:15:09.8632195Z %35 = arith.addf %arg7, %34 : tensor<8x1024xf32> 2026-02-21T08:15:09.8632574Z scf.yield %35 : tensor<8x1024xf32> 2026-02-21T08:15:09.8633164Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:15:09.8633754Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:15:09.8634145Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:15:09.8634577Z %32 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:15:09.8634933Z tt.reduce.return %32 : f32 2026-02-21T08:15:09.8635307Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T08:15:09.8635792Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<8x!tt.ptr> 2026-02-21T08:15:09.8636309Z %21 = tt.addptr %20, %17 : tensor<8x!tt.ptr>, tensor<8xi32> 2026-02-21T08:15:09.8636787Z tt.store %21, %19 : tensor<8x!tt.ptr> 2026-02-21T08:15:09.8637218Z %c1_i32_1 = arith.constant 1 : i32 2026-02-21T08:15:09.8637621Z %22 = arith.muli %c1_i32, %c1_i32_1 : i32 2026-02-21T08:15:09.8638100Z 
%23 = arith.addi %arg5, %22 : i32 2026-02-21T08:15:09.8638520Z %24 = arith.muli %23, %c8_i32 : i32 2026-02-21T08:15:09.8638964Z %25 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:15:09.8639412Z %26 = tt.splat %24 : i32 -> tensor<8xi32> 2026-02-21T08:15:09.8639868Z %27 = arith.addi %26, %25 : tensor<8xi32> 2026-02-21T08:15:09.8640453Z %28 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<8x1024xf32>) : i32 { 2026-02-21T08:15:09.8641230Z %32 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc> -> tensor<8x1024xf32> 2026-02-21T08:15:09.8642045Z %33 = tt.descriptor_load %1[%24, %arg6] : !tt.tensordesc> -> tensor<8x1024xf32> 2026-02-21T08:15:09.8642593Z %34 = scf.if %arg3 -> (tensor<8x1024xf32>) { 2026-02-21T08:15:09.8643307Z %36 = tt.extern_elementwise %33 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T08:15:09.8643989Z %37 = arith.subf %33, %32 : tensor<8x1024xf32> 2026-02-21T08:15:09.8644437Z %38 = arith.mulf %36, %37 : tensor<8x1024xf32> 2026-02-21T08:15:09.8644896Z %39 = arith.addf %38, %cst : tensor<8x1024xf32> 2026-02-21T08:15:09.8645380Z scf.yield %39 : tensor<8x1024xf32> 2026-02-21T08:15:09.8645764Z } else { 2026-02-21T08:15:09.8646089Z %36 = tt.splat %arg4 : f32 -> tensor<8x1024xf32> 2026-02-21T08:15:09.8646555Z %37 = arith.cmpf ogt, %33, %36 : tensor<8x1024xf32> 2026-02-21T08:15:09.8646999Z %38 = arith.cmpf une, %33, %33 : tensor<8x1024xf32> 2026-02-21T08:15:09.8647460Z %39 = arith.ori %37, %38 : tensor<8x1024xi1> 2026-02-21T08:15:09.8647981Z %40 = arith.select %39, %33, %36 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T08:15:09.8648450Z %41 = math.log %40 : tensor<8x1024xf32> 2026-02-21T08:15:09.8648890Z %42 = arith.subf %41, %32 : tensor<8x1024xf32> 2026-02-21T08:15:09.8649265Z %43 = arith.mulf %33, %42 : tensor<8x1024xf32> 2026-02-21T08:15:09.8649739Z %44 = arith.addf %43, %cst : tensor<8x1024xf32> 
2026-02-21T08:15:09.8650131Z scf.yield %44 : tensor<8x1024xf32> 2026-02-21T08:15:09.8650565Z } 2026-02-21T08:15:09.8650945Z %35 = arith.addf %arg7, %34 : tensor<8x1024xf32> 2026-02-21T08:15:09.8651328Z scf.yield %35 : tensor<8x1024xf32> 2026-02-21T08:15:09.8651901Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:15:09.8652486Z %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({ 2026-02-21T08:15:09.8652885Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:15:09.8653228Z %32 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:15:09.8653665Z tt.reduce.return %32 : f32 2026-02-21T08:15:09.8654059Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T08:15:09.8654484Z %30 = tt.splat %arg2 : !tt.ptr -> tensor<8x!tt.ptr> 2026-02-21T08:15:09.8655040Z %31 = tt.addptr %30, %27 : tensor<8x!tt.ptr>, tensor<8xi32> 2026-02-21T08:15:09.8655476Z tt.store %31, %29 : tensor<8x!tt.ptr> 2026-02-21T08:15:09.8655863Z } 2026-02-21T08:15:09.8656177Z scf.for %arg5 = %12 to %4 step %c1_i32 : i32 { 2026-02-21T08:15:09.8656594Z %14 = arith.muli %arg5, %c8_i32 : i32 2026-02-21T08:15:09.8657058Z %15 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:15:09.8657530Z %16 = tt.splat %14 : i32 -> tensor<8xi32> 2026-02-21T08:15:09.8657957Z %17 = arith.addi %16, %15 : tensor<8xi32> 2026-02-21T08:15:09.8658538Z %18 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<8x1024xf32>) : i32 { 2026-02-21T08:15:09.8659357Z %22 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<8x1024xf32> 2026-02-21T08:15:09.8660101Z %23 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc> -> tensor<8x1024xf32> 2026-02-21T08:15:09.8660645Z %24 = scf.if %arg3 -> (tensor<8x1024xf32>) { 2026-02-21T08:15:09.8661383Z %26 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x1024xf32>) -> tensor<8x1024xf32> 2026-02-21T08:15:09.8662382Z %27 = arith.subf %23, %22 : 
tensor<8x1024xf32> 2026-02-21T08:15:09.8662823Z %28 = arith.mulf %26, %27 : tensor<8x1024xf32> 2026-02-21T08:15:09.8663280Z %29 = arith.addf %28, %cst : tensor<8x1024xf32> 2026-02-21T08:15:09.8663675Z scf.yield %29 : tensor<8x1024xf32> 2026-02-21T08:15:09.8664061Z } else { 2026-02-21T08:15:09.8664354Z %26 = tt.splat %arg4 : f32 -> tensor<8x1024xf32> 2026-02-21T08:15:09.8664866Z %27 = arith.cmpf ogt, %23, %26 : tensor<8x1024xf32> 2026-02-21T08:15:09.8665285Z %28 = arith.cmpf une, %23, %23 : tensor<8x1024xf32> 2026-02-21T08:15:09.8665734Z %29 = arith.ori %27, %28 : tensor<8x1024xi1> 2026-02-21T08:15:09.8666267Z %30 = arith.select %29, %23, %26 : tensor<8x1024xi1>, tensor<8x1024xf32> 2026-02-21T08:15:09.8666741Z %31 = math.log %30 : tensor<8x1024xf32> 2026-02-21T08:15:09.8667142Z %32 = arith.subf %31, %22 : tensor<8x1024xf32> 2026-02-21T08:15:09.8667657Z %33 = arith.mulf %23, %32 : tensor<8x1024xf32> 2026-02-21T08:15:09.8668105Z %34 = arith.addf %33, %cst : tensor<8x1024xf32> 2026-02-21T08:15:09.8668486Z scf.yield %34 : tensor<8x1024xf32> 2026-02-21T08:15:09.8668883Z } 2026-02-21T08:15:09.8669223Z %25 = arith.addf %arg7, %24 : tensor<8x1024xf32> 2026-02-21T08:15:09.8669606Z scf.yield %25 : tensor<8x1024xf32> 2026-02-21T08:15:09.8670180Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:15:09.8670734Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:15:09.8671135Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:15:09.8671545Z %22 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:15:09.8671940Z tt.reduce.return %22 : f32 2026-02-21T08:15:09.8672331Z }) : (tensor<8x1024xf32>) -> tensor<8xf32> 2026-02-21T08:15:09.8672851Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<8x!tt.ptr> 2026-02-21T08:15:09.8673386Z %21 = tt.addptr %20, %17 : tensor<8x!tt.ptr>, tensor<8xi32> 2026-02-21T08:15:09.8673816Z tt.store %21, %19 : tensor<8x!tt.ptr> 2026-02-21T08:15:09.8674237Z } {tt.num_stages = 1 : i32} 2026-02-21T08:15:09.8674595Z 
tt.return 2026-02-21T08:15:09.8674851Z } 2026-02-21T08:15:09.8675147Z } 2026-02-21T08:15:09.8675295Z 2026-02-21T08:15:09.8675394Z {-# 2026-02-21T08:15:09.8675687Z external_resources: { 2026-02-21T08:15:09.8675977Z mlir_reproducer: { 2026-02-21T08:15:09.8683726Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal 
test-convergence=false top-down=true})", 2026-02-21T08:15:09.8691773Z disable_threading: false, 2026-02-21T08:15:09.8692209Z verify_each: true 2026-02-21T08:15:09.8692531Z } 2026-02-21T08:15:09.8692752Z } 2026-02-21T08:15:09.8693058Z #-} 2026-02-21T08:15:09.8693815Z /tmp/torchinductor_root/q2/cq2juvjyoieznyrwo64yoc5zbzf6hjv6zdhby6kzf5phwmki44fv.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:15:09.8696045Z /tmp/torchinductor_root/q2/cq2juvjyoieznyrwo64yoc5zbzf6hjv6zdhby6kzf5phwmki44fv.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:15:09.8697974Z [41s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:15:09.8699907Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 8], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=4, num_stages=7, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:15:09.8701716Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:15:09.8702302Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:15:10.8636009Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 12.0 configs/s 2026-02-21T08:15:10.8646522Z [42s] Adaptive compile timeout: 30s (90% percentile=4.2s, bounds=[30.0s, 30s]) 2026-02-21T08:15:12.2538563Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 711.2 configs/s 2026-02-21T08:15:12.3129089Z [43s] Initial random population of 100, 5 starting points: 2026-02-21T08:15:12.3132442Z error=8 2026-02-21T08:15:12.3137420Z timeout=6 2026-02-21T08:15:12.3141124Z ok=86 2026-02-21T08:15:12.3146800Z min=0.1260 2026-02-21T08:15:12.3148234Z mid=1.5774 2026-02-21T08:15:12.3148490Z max=95.2207 2026-02-21T08:15:12.3148676Z best={'block_sizes': [1024, 1], 2026-02-21T08:15:12.3149141Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:15:12.3149464Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:15:12.3149756Z 'num_sm_multiplier': 16, 2026-02-21T08:15:12.3149978Z 'num_stages': 1, 2026-02-21T08:15:12.3150201Z 'num_warps': 1, 2026-02-21T08:15:12.3150436Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:15:12.3150741Z 'range_flattens': [None, None], 2026-02-21T08:15:12.3151037Z 'range_multi_buffers': [False, True], 2026-02-21T08:15:12.3151302Z 'range_num_stages': [0, 4], 2026-02-21T08:15:12.3151582Z 'range_unroll_factors': [2, 0], 2026-02-21T08:15:12.3151844Z 'range_warp_specializes': [None, True]} 2026-02-21T08:15:12.3152167Z [43s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:15:13.7099841Z [45s] Generation 1 starting: 94 neighbors, 5 active search path(s) 2026-02-21T08:15:23.2513234Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98/98 3.9 configs/s 2026-02-21T08:15:28.8528543Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 98/98 17.6 configs/s 2026-02-21T08:15:42.1906187Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 75.5 configs/s 2026-02-21T08:15:42.5876067Z [74s] Generation 1 complete: 2026-02-21T08:15:42.5880212Z error=2 
2026-02-21T08:15:42.5884067Z ok=98 2026-02-21T08:15:42.5886162Z min=0.1116 2026-02-21T08:15:42.5886725Z mid=0.1424 2026-02-21T08:15:42.5887003Z max=0.6441 2026-02-21T08:15:42.5887237Z best={'block_sizes': [1024, 1], 2026-02-21T08:15:42.5887575Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:15:42.5887963Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:15:42.5888246Z 'num_sm_multiplier': 64, 2026-02-21T08:15:42.5888453Z 'num_stages': 7, 2026-02-21T08:15:42.5888684Z 'num_warps': 8, 2026-02-21T08:15:42.5888883Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:15:42.5889148Z 'range_flattens': [False, False], 2026-02-21T08:15:42.5889377Z 'range_multi_buffers': [True, True], 2026-02-21T08:15:42.5889650Z 'range_num_stages': [1, 3], 2026-02-21T08:15:42.5889951Z 'range_unroll_factors': [0, 3], 2026-02-21T08:15:42.5890201Z 'range_warp_specializes': [True, None]} 2026-02-21T08:15:42.5891074Z [74s] Fitting surrogate: 200 points, 200 targets 2026-02-21T08:15:44.0695451Z [75s] Generation 2 starting: 97 neighbors, 5 active search path(s) 2026-02-21T08:16:02.3798557Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 101/101 1.3 configs/s 2026-02-21T08:16:08.1785404Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 101/101 17.6 configs/s 2026-02-21T08:16:21.8105636Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 73.8 configs/s 2026-02-21T08:16:22.2793538Z [113s] Generation 2 complete: 2026-02-21T08:16:22.2796755Z error=1 2026-02-21T08:16:22.2797344Z ok=102 2026-02-21T08:16:22.2797605Z min=0.1136 2026-02-21T08:16:22.2797811Z mid=0.1382 2026-02-21T08:16:22.2798009Z max=0.7056 2026-02-21T08:16:22.2798183Z best={'block_sizes': [1024, 1], 2026-02-21T08:16:22.2798521Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:16:22.2798826Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:16:22.2799083Z 'num_sm_multiplier': 64, 2026-02-21T08:16:22.2799284Z 'num_stages': 
7, 2026-02-21T08:16:22.2799495Z 'num_warps': 8, 2026-02-21T08:16:22.2799705Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:16:22.2799937Z 'range_flattens': [False, False], 2026-02-21T08:16:22.2800335Z 'range_multi_buffers': [True, True], 2026-02-21T08:16:22.2800878Z 'range_num_stages': [1, 3], 2026-02-21T08:16:22.2801163Z 'range_unroll_factors': [0, 3], 2026-02-21T08:16:22.2801387Z 'range_warp_specializes': [True, None]} 2026-02-21T08:16:22.2820474Z [113s] Fitting surrogate: 303 points, 303 targets 2026-02-21T08:16:23.8236134Z [115s] Generation 3 starting: 82 neighbors, 5 active search path(s) 2026-02-21T08:16:44.6867847Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 0.6 configs/s 2026-02-21T08:16:46.7051756Z module { 2026-02-21T08:16:46.7055856Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:16:46.7060129Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:16:46.7062312Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:16:46.7062997Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:16:46.7063572Z %cst = arith.constant dense<16384> : tensor<4x1xi32> 2026-02-21T08:16:46.7063922Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<4x1024xf32> 2026-02-21T08:16:46.7064246Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:16:46.7064497Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:16:46.7064728Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:16:46.7065022Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T08:16:46.7065252Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:16:46.7065645Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:16:46.7066105Z %1 = tt.get_program_id x : i32 2026-02-21T08:16:46.7066423Z scf.for %arg5 = %1 to %c1024_i32 step %c9472_i32 : i32 { 
2026-02-21T08:16:46.7066673Z %2 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:16:46.7067277Z %3 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:16:46.7067608Z %4 = tt.splat %2 : i32 -> tensor<4xi32> 2026-02-21T08:16:46.7067880Z %5 = arith.addi %4, %3 : tensor<4xi32> 2026-02-21T08:16:46.7068160Z %c15360_i32 = arith.constant 15360 : i32 2026-02-21T08:16:46.7068390Z %c3072_i32 = arith.constant 3072 : i32 2026-02-21T08:16:46.7068795Z %6 = scf.for %arg6 = %c0_i32 to %c15360_i32 step %c3072_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x1024xf32>) : i32 { 2026-02-21T08:16:46.7069226Z %25 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T08:16:46.7069559Z %26 = tt.splat %arg6 : i32 -> tensor<1024xi32> 2026-02-21T08:16:46.7069855Z %27 = arith.addi %26, %25 : tensor<1024xi32> 2026-02-21T08:16:46.7070202Z %28 = tt.descriptor_load %0[%2, %arg6] : !tt.tensordesc> -> tensor<4x1024xf32> 2026-02-21T08:16:46.7070605Z %29 = tt.expand_dims %5 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:16:46.7070906Z %30 = arith.muli %29, %cst : tensor<4x1xi32> 2026-02-21T08:16:46.7071284Z %31 = tt.expand_dims %27 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T08:16:46.7072035Z %32 = tt.broadcast %30 : tensor<4x1xi32> -> tensor<4x1024xi32> 2026-02-21T08:16:46.7072339Z %33 = tt.broadcast %31 : tensor<1x1024xi32> -> tensor<4x1024xi32> 2026-02-21T08:16:46.7072685Z %34 = arith.addi %32, %33 : tensor<4x1024xi32> 2026-02-21T08:16:46.7072971Z %35 = tt.splat %arg1 : !tt.ptr -> tensor<4x1024x!tt.ptr> 2026-02-21T08:16:46.7073327Z %36 = tt.addptr %35, %34 : tensor<4x1024x!tt.ptr>, tensor<4x1024xi32> 2026-02-21T08:16:46.7073725Z %37 = tt.load %36 evictionPolicy = evict_first : tensor<4x1024x!tt.ptr> 2026-02-21T08:16:46.7074032Z %38 = scf.if %arg3 -> (tensor<4x1024xf32>) { 2026-02-21T08:16:46.7074467Z %74 = tt.extern_elementwise %37 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : 
(tensor<4x1024xf32>) -> tensor<4x1024xf32> 2026-02-21T08:16:46.7075019Z %75 = arith.subf %37, %28 : tensor<4x1024xf32> 2026-02-21T08:16:46.7075319Z %76 = arith.mulf %74, %75 : tensor<4x1024xf32> 2026-02-21T08:16:46.7075614Z %77 = arith.addf %76, %cst_0 : tensor<4x1024xf32> 2026-02-21T08:16:46.7076253Z scf.yield %77 : tensor<4x1024xf32> 2026-02-21T08:16:46.7076503Z } else { 2026-02-21T08:16:46.7076698Z %74 = tt.splat %arg4 : f32 -> tensor<4x1024xf32> 2026-02-21T08:16:46.7077017Z %75 = arith.cmpf ogt, %37, %74 : tensor<4x1024xf32> 2026-02-21T08:16:46.7077290Z %76 = arith.cmpf une, %37, %37 : tensor<4x1024xf32> 2026-02-21T08:16:46.7077562Z %77 = arith.ori %75, %76 : tensor<4x1024xi1> 2026-02-21T08:16:46.7077890Z %78 = arith.select %77, %37, %74 : tensor<4x1024xi1>, tensor<4x1024xf32> 2026-02-21T08:16:46.7078185Z %79 = math.log %78 : tensor<4x1024xf32> 2026-02-21T08:16:46.7078445Z %80 = arith.subf %79, %28 : tensor<4x1024xf32> 2026-02-21T08:16:46.7078716Z %81 = arith.mulf %37, %80 : tensor<4x1024xf32> 2026-02-21T08:16:46.7079005Z %82 = arith.addf %81, %cst_0 : tensor<4x1024xf32> 2026-02-21T08:16:46.7079246Z scf.yield %82 : tensor<4x1024xf32> 2026-02-21T08:16:46.7079500Z } 2026-02-21T08:16:46.7079728Z %39 = arith.addf %arg7, %38 : tensor<4x1024xf32> 2026-02-21T08:16:46.7079966Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:16:46.7080245Z %40 = arith.muli %c1024_i32, %c1_i32 : i32 2026-02-21T08:16:46.7080481Z %41 = arith.addi %arg6, %40 : i32 2026-02-21T08:16:46.7080782Z %42 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T08:16:46.7081072Z %43 = tt.splat %41 : i32 -> tensor<1024xi32> 2026-02-21T08:16:46.7081362Z %44 = arith.addi %43, %42 : tensor<1024xi32> 2026-02-21T08:16:46.7081716Z %45 = tt.descriptor_load %0[%2, %41] : !tt.tensordesc> -> tensor<4x1024xf32> 2026-02-21T08:16:46.7082122Z %46 = tt.expand_dims %5 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:16:46.7082463Z %47 = arith.muli %46, %cst : tensor<4x1xi32> 
2026-02-21T08:16:46.7082757Z %48 = tt.expand_dims %44 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T08:16:46.7083120Z %49 = tt.broadcast %47 : tensor<4x1xi32> -> tensor<4x1024xi32> 2026-02-21T08:16:46.7083466Z %50 = tt.broadcast %48 : tensor<1x1024xi32> -> tensor<4x1024xi32> 2026-02-21T08:16:46.7083745Z %51 = arith.addi %49, %50 : tensor<4x1024xi32> 2026-02-21T08:16:46.7084057Z %52 = tt.splat %arg1 : !tt.ptr -> tensor<4x1024x!tt.ptr> 2026-02-21T08:16:46.7084384Z %53 = tt.addptr %52, %51 : tensor<4x1024x!tt.ptr>, tensor<4x1024xi32> 2026-02-21T08:16:46.7084746Z %54 = tt.load %53 evictionPolicy = evict_first : tensor<4x1024x!tt.ptr> 2026-02-21T08:16:46.7085030Z %55 = scf.if %arg3 -> (tensor<4x1024xf32>) { 2026-02-21T08:16:46.7085492Z %74 = tt.extern_elementwise %54 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32> 2026-02-21T08:16:46.7086017Z %75 = arith.subf %54, %45 : tensor<4x1024xf32> 2026-02-21T08:16:46.7086261Z %76 = arith.mulf %74, %75 : tensor<4x1024xf32> 2026-02-21T08:16:46.7086568Z %77 = arith.addf %76, %cst_0 : tensor<4x1024xf32> 2026-02-21T08:16:46.7086811Z scf.yield %77 : tensor<4x1024xf32> 2026-02-21T08:16:46.7087055Z } else { 2026-02-21T08:16:46.7087311Z %74 = tt.splat %arg4 : f32 -> tensor<4x1024xf32> 2026-02-21T08:16:46.7087571Z %75 = arith.cmpf ogt, %54, %74 : tensor<4x1024xf32> 2026-02-21T08:16:46.7087859Z %76 = arith.cmpf une, %54, %54 : tensor<4x1024xf32> 2026-02-21T08:16:46.7088137Z %77 = arith.ori %75, %76 : tensor<4x1024xi1> 2026-02-21T08:16:46.7088450Z %78 = arith.select %77, %54, %74 : tensor<4x1024xi1>, tensor<4x1024xf32> 2026-02-21T08:16:46.7088727Z %79 = math.log %78 : tensor<4x1024xf32> 2026-02-21T08:16:46.7089080Z %80 = arith.subf %79, %45 : tensor<4x1024xf32> 2026-02-21T08:16:46.7089365Z %81 = arith.mulf %54, %80 : tensor<4x1024xf32> 2026-02-21T08:16:46.7089618Z %82 = arith.addf %81, %cst_0 : tensor<4x1024xf32> 2026-02-21T08:16:46.7089900Z scf.yield %82 : 
tensor<4x1024xf32> 2026-02-21T08:16:46.7090121Z } 2026-02-21T08:16:46.7090991Z %56 = arith.addf %39, %55 : tensor<4x1024xf32> 2026-02-21T08:16:46.7091227Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:16:46.7091506Z %57 = arith.muli %c1024_i32, %c2_i32 : i32 2026-02-21T08:16:46.7091764Z %58 = arith.addi %arg6, %57 : i32 2026-02-21T08:16:46.7092077Z %59 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T08:16:46.7092419Z %60 = tt.splat %58 : i32 -> tensor<1024xi32> 2026-02-21T08:16:46.7092657Z %61 = arith.addi %60, %59 : tensor<1024xi32> 2026-02-21T08:16:46.7093005Z %62 = tt.descriptor_load %0[%2, %58] : !tt.tensordesc> -> tensor<4x1024xf32> 2026-02-21T08:16:46.7093434Z %63 = tt.expand_dims %5 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:16:46.7093723Z %64 = arith.muli %63, %cst : tensor<4x1xi32> 2026-02-21T08:16:46.7094046Z %65 = tt.expand_dims %61 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T08:16:46.7094389Z %66 = tt.broadcast %64 : tensor<4x1xi32> -> tensor<4x1024xi32> 2026-02-21T08:16:46.7094717Z %67 = tt.broadcast %65 : tensor<1x1024xi32> -> tensor<4x1024xi32> 2026-02-21T08:16:46.7094976Z %68 = arith.addi %66, %67 : tensor<4x1024xi32> 2026-02-21T08:16:46.7095320Z %69 = tt.splat %arg1 : !tt.ptr -> tensor<4x1024x!tt.ptr> 2026-02-21T08:16:46.7095661Z %70 = tt.addptr %69, %68 : tensor<4x1024x!tt.ptr>, tensor<4x1024xi32> 2026-02-21T08:16:46.7095979Z %71 = tt.load %70 evictionPolicy = evict_first : tensor<4x1024x!tt.ptr> 2026-02-21T08:16:46.7096342Z %72 = scf.if %arg3 -> (tensor<4x1024xf32>) { 2026-02-21T08:16:46.7096743Z %74 = tt.extern_elementwise %71 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32> 2026-02-21T08:16:46.7097170Z %75 = arith.subf %71, %62 : tensor<4x1024xf32> 2026-02-21T08:16:46.7097484Z %76 = arith.mulf %74, %75 : tensor<4x1024xf32> 2026-02-21T08:16:46.7097737Z %77 = arith.addf %76, %cst_0 : tensor<4x1024xf32> 
2026-02-21T08:16:46.7098002Z scf.yield %77 : tensor<4x1024xf32> 2026-02-21T08:16:46.7098249Z } else { 2026-02-21T08:16:46.7098490Z %74 = tt.splat %arg4 : f32 -> tensor<4x1024xf32> 2026-02-21T08:16:46.7098768Z %75 = arith.cmpf ogt, %71, %74 : tensor<4x1024xf32> 2026-02-21T08:16:46.7099087Z %76 = arith.cmpf une, %71, %71 : tensor<4x1024xf32> 2026-02-21T08:16:46.7099378Z %77 = arith.ori %75, %76 : tensor<4x1024xi1> 2026-02-21T08:16:46.7099679Z %78 = arith.select %77, %71, %74 : tensor<4x1024xi1>, tensor<4x1024xf32> 2026-02-21T08:16:46.7100079Z %79 = math.log %78 : tensor<4x1024xf32> 2026-02-21T08:16:46.7100332Z %80 = arith.subf %79, %62 : tensor<4x1024xf32> 2026-02-21T08:16:46.7100628Z %81 = arith.mulf %71, %80 : tensor<4x1024xf32> 2026-02-21T08:16:46.7100946Z %82 = arith.addf %81, %cst_0 : tensor<4x1024xf32> 2026-02-21T08:16:46.7101210Z scf.yield %82 : tensor<4x1024xf32> 2026-02-21T08:16:46.7101463Z } 2026-02-21T08:16:46.7101655Z %73 = arith.addf %56, %72 : tensor<4x1024xf32> 2026-02-21T08:16:46.7101985Z scf.yield %73 : tensor<4x1024xf32> 2026-02-21T08:16:46.7102222Z } {tt.num_stages = 1 : i32} 2026-02-21T08:16:46.7102660Z %7 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T08:16:46.7103023Z %8 = tt.splat %c15360_i32 : i32 -> tensor<1024xi32> 2026-02-21T08:16:46.7103286Z %9 = arith.addi %8, %7 : tensor<1024xi32> 2026-02-21T08:16:46.7103739Z %10 = tt.descriptor_load %0[%2, %c15360_i32] : !tt.tensordesc> -> tensor<4x1024xf32> 2026-02-21T08:16:46.7104169Z %11 = tt.expand_dims %5 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:16:46.7104509Z %12 = arith.muli %11, %cst : tensor<4x1xi32> 2026-02-21T08:16:46.7104847Z %13 = tt.expand_dims %9 {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> 2026-02-21T08:16:46.7105209Z %14 = tt.broadcast %12 : tensor<4x1xi32> -> tensor<4x1024xi32> 2026-02-21T08:16:46.7105557Z %15 = tt.broadcast %13 : tensor<1x1024xi32> -> tensor<4x1024xi32> 2026-02-21T08:16:46.7105830Z %16 = 
arith.addi %14, %15 : tensor<4x1024xi32> 2026-02-21T08:16:46.7106185Z %17 = tt.splat %arg1 : !tt.ptr -> tensor<4x1024x!tt.ptr> 2026-02-21T08:16:46.7106499Z %18 = tt.addptr %17, %16 : tensor<4x1024x!tt.ptr>, tensor<4x1024xi32> 2026-02-21T08:16:46.7106851Z %19 = tt.load %18 evictionPolicy = evict_first : tensor<4x1024x!tt.ptr> 2026-02-21T08:16:46.7107224Z %20 = scf.if %arg3 -> (tensor<4x1024xf32>) { 2026-02-21T08:16:46.7107626Z %25 = tt.extern_elementwise %19 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32> 2026-02-21T08:16:46.7108055Z %26 = arith.subf %19, %10 : tensor<4x1024xf32> 2026-02-21T08:16:46.7108331Z %27 = arith.mulf %25, %26 : tensor<4x1024xf32> 2026-02-21T08:16:46.7108610Z %28 = arith.addf %27, %cst_0 : tensor<4x1024xf32> 2026-02-21T08:16:46.7108876Z scf.yield %28 : tensor<4x1024xf32> 2026-02-21T08:16:46.7109119Z } else { 2026-02-21T08:16:46.7109350Z %25 = tt.splat %arg4 : f32 -> tensor<4x1024xf32> 2026-02-21T08:16:46.7109617Z %26 = arith.cmpf ogt, %19, %25 : tensor<4x1024xf32> 2026-02-21T08:16:46.7109922Z %27 = arith.cmpf une, %19, %19 : tensor<4x1024xf32> 2026-02-21T08:16:46.7110178Z %28 = arith.ori %26, %27 : tensor<4x1024xi1> 2026-02-21T08:16:46.7110502Z %29 = arith.select %28, %19, %25 : tensor<4x1024xi1>, tensor<4x1024xf32> 2026-02-21T08:16:46.7110834Z %30 = math.log %29 : tensor<4x1024xf32> 2026-02-21T08:16:46.7111070Z %31 = arith.subf %30, %10 : tensor<4x1024xf32> 2026-02-21T08:16:46.7111354Z %32 = arith.mulf %19, %31 : tensor<4x1024xf32> 2026-02-21T08:16:46.7111608Z %33 = arith.addf %32, %cst_0 : tensor<4x1024xf32> 2026-02-21T08:16:46.7111924Z scf.yield %33 : tensor<4x1024xf32> 2026-02-21T08:16:46.7112146Z } 2026-02-21T08:16:46.7112366Z %21 = arith.addf %6, %20 : tensor<4x1024xf32> 2026-02-21T08:16:46.7112650Z %22 = "tt.reduce"(%21) <{axis = 1 : i32}> ({ 2026-02-21T08:16:46.7112890Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:16:46.7113141Z %25 = arith.addf %arg6, %arg7 : f32 
2026-02-21T08:16:46.7113380Z tt.reduce.return %25 : f32 2026-02-21T08:16:46.7113644Z }) : (tensor<4x1024xf32>) -> tensor<4xf32> 2026-02-21T08:16:46.7113901Z %23 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:16:46.7114319Z %24 = tt.addptr %23, %5 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:16:46.7114621Z tt.store %24, %22 : tensor<4x!tt.ptr> 2026-02-21T08:16:46.7114866Z } {tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:16:46.7115154Z tt.return 2026-02-21T08:16:46.7115328Z } 2026-02-21T08:16:46.7115508Z } 2026-02-21T08:16:46.7115616Z 2026-02-21T08:16:46.7115706Z {-# 2026-02-21T08:16:46.7115912Z external_resources: { 2026-02-21T08:16:46.7116108Z mlir_reproducer: { 2026-02-21T08:16:46.7120584Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, 
tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:16:46.7125073Z disable_threading: false, 2026-02-21T08:16:46.7125344Z verify_each: true 2026-02-21T08:16:46.7125530Z } 2026-02-21T08:16:46.7125715Z } 2026-02-21T08:16:46.7125907Z #-} 2026-02-21T08:16:46.7126401Z /tmp/torchinductor_root/d7/cd7j2kp2pczr6gv2rsagvocxjuaejcorrxzh2minax3zj62t4r32.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:16:46.7127653Z /tmp/torchinductor_root/d7/cd7j2kp2pczr6gv2rsagvocxjuaejcorrxzh2minax3zj62t4r32.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:16:46.7128723Z [138s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:16:46.7130984Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 4], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_sm_multiplier=64, num_stages=8, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, True], range_num_stages=[1, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:16:46.7132049Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:16:46.7132415Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:16:49.4638130Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 18.0 configs/s 2026-02-21T08:17:00.8528662Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 88.3 configs/s 2026-02-21T08:17:01.2177135Z [152s] Generation 3 complete: 2026-02-21T08:17:01.2177558Z error=3 2026-02-21T08:17:01.2177964Z ok=85 2026-02-21T08:17:01.2178288Z min=0.1116 2026-02-21T08:17:01.2178539Z mid=0.1340 2026-02-21T08:17:01.2178857Z max=1.3301 2026-02-21T08:17:01.2179139Z best={'block_sizes': [1024, 1], 2026-02-21T08:17:01.2179637Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:17:01.2180191Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:17:01.2180576Z 'num_sm_multiplier': 64, 2026-02-21T08:17:01.2180921Z 'num_stages': 7, 2026-02-21T08:17:01.2181209Z 'num_warps': 8, 2026-02-21T08:17:01.2181578Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:17:01.2182293Z 'range_flattens': [False, False], 2026-02-21T08:17:01.2182707Z 'range_multi_buffers': [True, True], 2026-02-21T08:17:01.2183084Z 'range_num_stages': [1, 2], 2026-02-21T08:17:01.2183446Z 'range_unroll_factors': [0, 3], 2026-02-21T08:17:01.2184268Z 'range_warp_specializes': [True, None]} 2026-02-21T08:17:01.2206706Z [152s] Fitting surrogate: 391 points, 391 targets 2026-02-21T08:17:02.4524032Z [153s] Generation 4 
starting: 86 neighbors, 5 active search path(s) 2026-02-21T08:17:07.1291633Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 16.5 configs/s 2026-02-21T08:17:12.2350736Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 17.6 configs/s 2026-02-21T08:17:26.3318287Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 71.4 configs/s 2026-02-21T08:17:26.9034885Z [178s] Generation 4 complete: 2026-02-21T08:17:26.9035268Z ok=91 2026-02-21T08:17:26.9035708Z min=0.1116 2026-02-21T08:17:26.9035990Z mid=0.1300 2026-02-21T08:17:26.9036320Z max=0.5458 2026-02-21T08:17:26.9036600Z best={'block_sizes': [1024, 1], 2026-02-21T08:17:26.9037100Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:17:26.9037710Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:17:26.9038203Z 'num_sm_multiplier': 64, 2026-02-21T08:17:26.9038580Z 'num_stages': 7, 2026-02-21T08:17:26.9038893Z 'num_warps': 8, 2026-02-21T08:17:26.9039244Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:17:26.9039604Z 'range_flattens': [False, False], 2026-02-21T08:17:26.9040015Z 'range_multi_buffers': [True, True], 2026-02-21T08:17:26.9040373Z 'range_num_stages': [1, 2], 2026-02-21T08:17:26.9040738Z 'range_unroll_factors': [0, 3], 2026-02-21T08:17:26.9041138Z 'range_warp_specializes': [True, None]} 2026-02-21T08:17:26.9070160Z [178s] Fitting surrogate: 482 points, 482 targets 2026-02-21T08:17:28.3099343Z [179s] Generation 5 starting: 82 neighbors, 5 active search path(s) 2026-02-21T08:17:33.5673036Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 5.7 configs/s 2026-02-21T08:17:38.4081119Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 17.5 configs/s 2026-02-21T08:17:50.6193169Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 82.4 configs/s 2026-02-21T08:17:50.9954071Z [202s] Generation 5 complete: 2026-02-21T08:17:50.9955774Z ok=87 2026-02-21T08:17:50.9956064Z min=0.1116 
2026-02-21T08:17:50.9956391Z mid=0.1281 2026-02-21T08:17:50.9956595Z max=0.9944 2026-02-21T08:17:50.9956793Z best={'block_sizes': [1024, 1], 2026-02-21T08:17:50.9957167Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:17:50.9957520Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:17:50.9957775Z 'num_sm_multiplier': 64, 2026-02-21T08:17:50.9958037Z 'num_stages': 7, 2026-02-21T08:17:50.9958225Z 'num_warps': 8, 2026-02-21T08:17:50.9958527Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:17:50.9958803Z 'range_flattens': [False, False], 2026-02-21T08:17:50.9959072Z 'range_multi_buffers': [True, True], 2026-02-21T08:17:50.9959998Z 'range_num_stages': [1, 2], 2026-02-21T08:17:50.9960262Z 'range_unroll_factors': [0, 3], 2026-02-21T08:17:50.9960525Z 'range_warp_specializes': [True, None]} 2026-02-21T08:17:50.9975769Z [202s] Fitting surrogate: 569 points, 569 targets 2026-02-21T08:17:52.0050557Z [203s] Generation 6 starting: 55 neighbors, 3 active search path(s) 2026-02-21T08:17:55.2340577Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57/57 53.9 configs/s 2026-02-21T08:17:58.5042924Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 57/57 17.7 configs/s 2026-02-21T08:18:07.5940343Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 111.2 2026-02-21T08:18:07.5944329Z configs/s 2026-02-21T08:18:07.9003695Z [219s] Generation 6 complete: 2026-02-21T08:18:07.9004144Z ok=59 2026-02-21T08:18:07.9004626Z min=0.1134 2026-02-21T08:18:07.9004928Z mid=0.1342 2026-02-21T08:18:07.9005259Z max=0.3552 2026-02-21T08:18:07.9005598Z best={'block_sizes': [2048, 1], 2026-02-21T08:18:07.9006112Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:18:07.9006585Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:18:07.9007477Z 'num_sm_multiplier': 16, 2026-02-21T08:18:07.9007905Z 'num_stages': 1, 2026-02-21T08:18:07.9008227Z 'num_warps': 8, 2026-02-21T08:18:07.9008664Z 'pid_type': 
'persistent_interleaved', 2026-02-21T08:18:07.9009075Z 'range_flattens': [False, False], 2026-02-21T08:18:07.9009479Z 'range_multi_buffers': [True, True], 2026-02-21T08:18:07.9009807Z 'range_num_stages': [0, 4], 2026-02-21T08:18:07.9010224Z 'range_unroll_factors': [0, 3], 2026-02-21T08:18:07.9010559Z 'range_warp_specializes': [True, None]} 2026-02-21T08:18:07.9026376Z [219s] Fitting surrogate: 628 points, 628 targets 2026-02-21T08:18:08.7498914Z [220s] Generation 7 starting: 53 neighbors, 3 active search path(s) 2026-02-21T08:18:11.4477787Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 24.9 configs/s 2026-02-21T08:18:14.5033400Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 17.6 configs/s 2026-02-21T08:18:23.8971007Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 111.4 2026-02-21T08:18:23.8971401Z configs/s 2026-02-21T08:18:24.1710654Z [235s] Generation 7 complete: 2026-02-21T08:18:24.1714945Z ok=56 2026-02-21T08:18:24.1716770Z min=0.1096 2026-02-21T08:18:24.1716938Z mid=0.1240 2026-02-21T08:18:24.1717064Z max=0.1946 2026-02-21T08:18:24.1717214Z best={'block_sizes': [2048, 1], 2026-02-21T08:18:24.1717438Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:18:24.1717690Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:18:24.1717890Z 'num_stages': 1, 2026-02-21T08:18:24.1718030Z 'num_warps': 8, 2026-02-21T08:18:24.1718178Z 'pid_type': 'flat', 2026-02-21T08:18:24.1718336Z 'range_flattens': [None, False], 2026-02-21T08:18:24.1718521Z 'range_multi_buffers': [None, True], 2026-02-21T08:18:24.1718698Z 'range_num_stages': [0, 4], 2026-02-21T08:18:24.1718868Z 'range_unroll_factors': [0, 0], 2026-02-21T08:18:24.1719040Z 'range_warp_specializes': [None, True]} 2026-02-21T08:18:24.1731004Z [235s] Fitting surrogate: 684 points, 684 targets 2026-02-21T08:18:24.7789195Z [236s] Generation 8 starting: 30 neighbors, 2 active search path(s) 2026-02-21T08:18:26.6547864Z Generation 8: 
precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 14.9 configs/s 2026-02-21T08:18:28.3843129Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 30/30 17.8 configs/s 2026-02-21T08:18:33.3608566Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 203.1 2026-02-21T08:18:33.3609886Z configs/s 2026-02-21T08:18:33.5289236Z [245s] Generation 8 complete: 2026-02-21T08:18:33.5292395Z ok=32 2026-02-21T08:18:33.5293874Z min=0.1096 2026-02-21T08:18:33.5294029Z mid=0.1261 2026-02-21T08:18:33.5294156Z max=0.1976 2026-02-21T08:18:33.5294293Z best={'block_sizes': [2048, 1], 2026-02-21T08:18:33.5294536Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:18:33.5294775Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:18:33.5295350Z 'num_stages': 1, 2026-02-21T08:18:33.5295521Z 'num_warps': 8, 2026-02-21T08:18:33.5295671Z 'pid_type': 'flat', 2026-02-21T08:18:33.5295849Z 'range_flattens': [None, False], 2026-02-21T08:18:33.5296037Z 'range_multi_buffers': [None, True], 2026-02-21T08:18:33.5296234Z 'range_num_stages': [0, 4], 2026-02-21T08:18:33.5296402Z 'range_unroll_factors': [0, 0], 2026-02-21T08:18:33.5296592Z 'range_warp_specializes': [None, True]} 2026-02-21T08:18:33.5313072Z [245s] Fitting surrogate: 716 points, 716 targets 2026-02-21T08:18:33.8191643Z [245s] Autotuning complete in 245.3s after searching 691 configs. 
2026-02-21T08:18:33.8192080Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:18:33.8193028Z @helion.kernel(config=helion.Config(block_sizes=[2048, 1], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=1, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:18:33.8193872Z 2026-02-21T08:18:33.8194129Z [245s] Code of selected kernel: /tmp/torchinductor_root/pg/cpgzaakzb2yxo736ajfixkoxq5uniyijriimmreocuh7kbuo7e5a.py 2026-02-21T08:18:34.9426624Z WARNING:tritonbench.utils.triton_op:Completed input ID 2: 2026-02-21T08:18:34.9430756Z (B, T, V) 2026-02-21T08:18:34.9435447Z --------------- 2026-02-21T08:18:34.9439957Z (8, 512, 16384) 2026-02-21T08:18:34.9445254Z 2026-02-21T08:18:34.9462375Z 50%|█████ | 3/6 [10:22<10:55, 218.65s/it]WARNING:tritonbench.utils.triton_op:Running input ID 3: 2026-02-21T08:18:34.9463514Z (B, T, V) 2026-02-21T08:18:34.9463688Z --------------- 2026-02-21T08:18:34.9463840Z (8, 512, 32768) 2026-02-21T08:18:34.9464192Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for torch_kl_div 2026-02-21T08:18:36.0372789Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for liger_kl_div 2026-02-21T08:18:37.1399751Z INFO:tritonbench.utils.triton_op:Took 2.45ms to get benchmark function for torch_compile_kl_div 2026-02-21T08:18:41.0186952Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:18:41.0187321Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:18:41.0187603Z 'dtype': 'torch.float32', 2026-02-21T08:18:41.0187897Z 'shape': (4096, 32768), 2026-02-21T08:18:41.0188172Z 'stride': (32768, 1)}, 2026-02-21T08:18:41.0188436Z { 'device': 'cuda:0', 2026-02-21T08:18:41.0188711Z 'dtype': 'torch.float32', 2026-02-21T08:18:41.0188989Z 'shape': (4096, 32768), 
2026-02-21T08:18:41.0189261Z 'stride': (32768, 1)}), 2026-02-21T08:18:41.0189516Z 'kwargs': {}} 2026-02-21T08:18:41.0218205Z INFO:tritonbench.utils.triton_op:Took 3.65ms to get benchmark function for helion_kl_div_tritonbench 2026-02-21T08:18:41.2257327Z [0s] Autotune random seed: 2134765727 2026-02-21T08:18:41.3791055Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:19:13.5560953Z [32s] Timeout after 30s compiling Config(block_sizes=[256, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, None]) 2026-02-21T08:19:13.8538850Z [32s] Timeout after 30s compiling Config(block_sizes=[2048, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], maxnreg=128, num_sm_multiplier=64, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, True], range_num_stages=[2, 0], range_unroll_factors=[3, 2], range_warp_specializes=[False, None]) 2026-02-21T08:19:13.9047268Z [32s] Timeout after 30s compiling Config(block_sizes=[1024, 64], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['last', 'last'], maxnreg=256, num_sm_multiplier=16, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, True], range_num_stages=[4, 2], range_unroll_factors=[3, 0], range_warp_specializes=[False, True]) 2026-02-21T08:19:14.5742865Z [33s] Timeout after 30s compiling Config(block_sizes=[512, 128], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'first'], maxnreg=256, num_sm_multiplier=16, num_stages=5, num_warps=4, 
pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]) 2026-02-21T08:19:15.9552905Z [34s] Timeout after 30s compiling Config(block_sizes=[256, 64], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', ''], num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]) 2026-02-21T08:19:16.2317633Z [34s] Timeout after 30s compiling Config(block_sizes=[2048, 64], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'first'], maxnreg=128, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[1, 1], range_unroll_factors=[3, 3], range_warp_specializes=[False, False]) 2026-02-21T08:19:16.3061367Z [34s] Timeout after 30s compiling Config(block_sizes=[512, 512], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_sm_multiplier=32, num_stages=8, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[2, 1], range_unroll_factors=[3, 2], range_warp_specializes=[False, False]) 2026-02-21T08:19:16.6930642Z [35s] Timeout after 30s compiling Config(block_sizes=[2048, 512], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=64, num_sm_multiplier=16, num_stages=8, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[1, 0], range_unroll_factors=[4, 2], range_warp_specializes=[False, False]) 2026-02-21T08:19:16.6945203Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s 
2026-02-21T08:19:17.3394435Z module { 2026-02-21T08:19:17.3399186Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:19:17.3403909Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:19:17.3407380Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:19:17.3411308Z %c296_i32 = arith.constant 296 : i32 2026-02-21T08:19:17.3412642Z %cst = arith.constant dense<0.000000e+00> : tensor<64x1024xf32> 2026-02-21T08:19:17.3412900Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:19:17.3413117Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:19:17.3413320Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T08:19:17.3413528Z %c32768_i64 = arith.constant 32768 : i64 2026-02-21T08:19:17.3413717Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:19:17.3414062Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:19:17.3414534Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:19:17.3414857Z %2 = tt.get_program_id x : i32 2026-02-21T08:19:17.3415085Z scf.for %arg5 = %2 to %c64_i32 step %c296_i32 : i32 { 2026-02-21T08:19:17.3415294Z %3 = arith.muli %arg5, %c64_i32 : i32 2026-02-21T08:19:17.3415639Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:19:17.3415915Z %5 = tt.splat %3 : i32 -> tensor<64xi32> 2026-02-21T08:19:17.3416113Z %6 = arith.addi %5, %4 : tensor<64xi32> 2026-02-21T08:19:17.3416443Z %7 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<64x1024xf32>) : i32 { 2026-02-21T08:19:17.3416866Z %11 = tt.descriptor_load %0[%3, %arg6] : !tt.tensordesc> -> tensor<64x1024xf32> 2026-02-21T08:19:17.3417258Z %12 = tt.descriptor_load %1[%3, %arg6] : !tt.tensordesc> -> tensor<64x1024xf32> 
2026-02-21T08:19:17.3417561Z %13 = scf.if %arg3 -> (tensor<64x1024xf32>) { 2026-02-21T08:19:17.3417950Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x1024xf32>) -> tensor<64x1024xf32> 2026-02-21T08:19:17.3418374Z %16 = arith.subf %12, %11 : tensor<64x1024xf32> 2026-02-21T08:19:17.3418596Z %17 = arith.mulf %15, %16 : tensor<64x1024xf32> 2026-02-21T08:19:17.3418824Z %18 = arith.addf %17, %cst : tensor<64x1024xf32> 2026-02-21T08:19:17.3419036Z scf.yield %18 : tensor<64x1024xf32> 2026-02-21T08:19:17.3419213Z } else { 2026-02-21T08:19:17.3419389Z %15 = tt.splat %arg4 : f32 -> tensor<64x1024xf32> 2026-02-21T08:19:17.3419614Z %16 = arith.cmpf ogt, %12, %15 : tensor<64x1024xf32> 2026-02-21T08:19:17.3419847Z %17 = arith.cmpf une, %12, %12 : tensor<64x1024xf32> 2026-02-21T08:19:17.3420067Z %18 = arith.ori %16, %17 : tensor<64x1024xi1> 2026-02-21T08:19:17.3420323Z %19 = arith.select %18, %12, %15 : tensor<64x1024xi1>, tensor<64x1024xf32> 2026-02-21T08:19:17.3420580Z %20 = math.log %19 : tensor<64x1024xf32> 2026-02-21T08:19:17.3420786Z %21 = arith.subf %20, %11 : tensor<64x1024xf32> 2026-02-21T08:19:17.3421002Z %22 = arith.mulf %12, %21 : tensor<64x1024xf32> 2026-02-21T08:19:17.3421220Z %23 = arith.addf %22, %cst : tensor<64x1024xf32> 2026-02-21T08:19:17.3421429Z scf.yield %23 : tensor<64x1024xf32> 2026-02-21T08:19:17.3421604Z } 2026-02-21T08:19:17.3421768Z %14 = arith.addf %arg7, %13 : tensor<64x1024xf32> 2026-02-21T08:19:17.3421998Z scf.yield %14 : tensor<64x1024xf32> 2026-02-21T08:19:17.3422193Z } {tt.warp_specialize} 2026-02-21T08:19:17.3422374Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:19:17.3422558Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:19:17.3422741Z %11 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:19:17.3422922Z tt.reduce.return %11 : f32 2026-02-21T08:19:17.3423116Z }) : (tensor<64x1024xf32>) -> tensor<64xf32> 2026-02-21T08:19:17.3423343Z %9 = tt.splat %arg2 : !tt.ptr -> 
tensor<64x!tt.ptr> 2026-02-21T08:19:17.3423610Z %10 = tt.addptr %9, %6 : tensor<64x!tt.ptr>, tensor<64xi32> 2026-02-21T08:19:17.3423842Z tt.store %10, %8 : tensor<64x!tt.ptr> 2026-02-21T08:19:17.3424172Z } {tt.disallow_acc_multi_buffer} 2026-02-21T08:19:17.3424353Z tt.return 2026-02-21T08:19:17.3424484Z } 2026-02-21T08:19:17.3424620Z } 2026-02-21T08:19:17.3424692Z 2026-02-21T08:19:17.3424746Z {-# 2026-02-21T08:19:17.3424892Z external_resources: { 2026-02-21T08:19:17.3425053Z mlir_reproducer: { 2026-02-21T08:19:17.3429394Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, 
tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:19:17.3433709Z disable_threading: false, 2026-02-21T08:19:17.3433880Z verify_each: true 2026-02-21T08:19:17.3434023Z } 2026-02-21T08:19:17.3434135Z } 2026-02-21T08:19:17.3434250Z #-} 2026-02-21T08:19:17.3434652Z /tmp/torchinductor_root/cl/cclsdp73pnkd5ihcd7ywluctzniw5lm2m64ipuaqnio43c7av6wv.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:19:17.3435827Z /tmp/torchinductor_root/cl/cclsdp73pnkd5ihcd7ywluctzniw5lm2m64ipuaqnio43c7av6wv.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:19:17.3436769Z [35s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:19:17.3437831Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], num_sm_multiplier=2, num_stages=6, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[False, True]), static_shapes=True) 2026-02-21T08:19:17.3438799Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:19:17.3439049Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:19:18.0472416Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:19:18.0477575Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:19:18.0481735Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:19:18.0483230Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:19:18.0483446Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:19:18.0483639Z %c4736_i32 = arith.constant 4736 : i32 2026-02-21T08:19:18.0483848Z %cst = arith.constant dense<32768> : tensor<4x1xi32> 2026-02-21T08:19:18.0484101Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<4x64xf32> 2026-02-21T08:19:18.0484322Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:19:18.0484503Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:19:18.0484689Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T08:19:18.0484894Z %c32768_i64 = arith.constant 32768 : i64 2026-02-21T08:19:18.0485066Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:19:18.0485378Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:19:18.0485693Z %1 = tt.get_program_id x : i32 2026-02-21T08:19:18.0486168Z scf.for %arg5 = %1 to %c1024_i32 step %c4736_i32 : i32 { 2026-02-21T08:19:18.0486422Z %2 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:19:18.0486649Z %3 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:19:18.0486898Z %4 = tt.splat %2 : i32 -> tensor<4xi32> 2026-02-21T08:19:18.0487078Z %5 = arith.addi %4, %3 : tensor<4xi32> 2026-02-21T08:19:18.0487261Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:19:18.0487569Z %6 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c128_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x64xf32>) : i32 { 2026-02-21T08:19:18.0487948Z %10 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 
2026-02-21T08:19:18.0488199Z %11 = tt.splat %arg6 : i32 -> tensor<64xi32> 2026-02-21T08:19:18.0488402Z %12 = arith.addi %11, %10 : tensor<64xi32> 2026-02-21T08:19:18.0488641Z %13 = tt.expand_dims %5 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:19:18.0488896Z %14 = arith.muli %13, %cst : tensor<4x1xi32> 2026-02-21T08:19:18.0489141Z %15 = tt.expand_dims %12 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T08:19:18.0489423Z %16 = tt.broadcast %14 : tensor<4x1xi32> -> tensor<4x64xi32> 2026-02-21T08:19:18.0489669Z %17 = tt.broadcast %15 : tensor<1x64xi32> -> tensor<4x64xi32> 2026-02-21T08:19:18.0489895Z %18 = arith.addi %16, %17 : tensor<4x64xi32> 2026-02-21T08:19:18.0490127Z %19 = tt.splat %arg0 : !tt.ptr -> tensor<4x64x!tt.ptr> 2026-02-21T08:19:18.0490389Z %20 = tt.addptr %19, %18 : tensor<4x64x!tt.ptr>, tensor<4x64xi32> 2026-02-21T08:19:18.0490638Z %21 = tt.load %20 : tensor<4x64x!tt.ptr> 2026-02-21T08:19:18.0490916Z %22 = tt.descriptor_load %0[%2, %arg6] : !tt.tensordesc> -> tensor<4x64xf32> 2026-02-21T08:19:18.0491199Z %23 = scf.if %arg3 -> (tensor<4x64xf32>) { 2026-02-21T08:19:18.0491572Z %42 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x64xf32>) -> tensor<4x64xf32> 2026-02-21T08:19:18.0491991Z %43 = arith.subf %22, %21 : tensor<4x64xf32> 2026-02-21T08:19:18.0492198Z %44 = arith.mulf %42, %43 : tensor<4x64xf32> 2026-02-21T08:19:18.0492397Z %45 = arith.addf %44, %cst_0 : tensor<4x64xf32> 2026-02-21T08:19:18.0492597Z scf.yield %45 : tensor<4x64xf32> 2026-02-21T08:19:18.0492762Z } else { 2026-02-21T08:19:18.0492928Z %42 = tt.splat %arg4 : f32 -> tensor<4x64xf32> 2026-02-21T08:19:18.0493147Z %43 = arith.cmpf ogt, %22, %42 : tensor<4x64xf32> 2026-02-21T08:19:18.0493361Z %44 = arith.cmpf une, %22, %22 : tensor<4x64xf32> 2026-02-21T08:19:18.0493569Z %45 = arith.ori %43, %44 : tensor<4x64xi1> 2026-02-21T08:19:18.0493797Z %46 = arith.select %45, %22, %42 : tensor<4x64xi1>, 
tensor<4x64xf32> 2026-02-21T08:19:18.0494036Z %47 = math.log %46 : tensor<4x64xf32> 2026-02-21T08:19:18.0494322Z %48 = arith.subf %47, %21 : tensor<4x64xf32> 2026-02-21T08:19:18.0494521Z %49 = arith.mulf %22, %48 : tensor<4x64xf32> 2026-02-21T08:19:18.0494725Z %50 = arith.addf %49, %cst_0 : tensor<4x64xf32> 2026-02-21T08:19:18.0494915Z scf.yield %50 : tensor<4x64xf32> 2026-02-21T08:19:18.0495084Z } 2026-02-21T08:19:18.0495224Z %24 = arith.addf %arg7, %23 : tensor<4x64xf32> 2026-02-21T08:19:18.0495428Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:19:18.0495609Z %25 = arith.muli %c64_i32, %c1_i32 : i32 2026-02-21T08:19:18.0495796Z %26 = arith.addi %arg6, %25 : i32 2026-02-21T08:19:18.0496014Z %27 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:19:18.0496261Z %28 = tt.splat %26 : i32 -> tensor<64xi32> 2026-02-21T08:19:18.0496461Z %29 = arith.addi %28, %27 : tensor<64xi32> 2026-02-21T08:19:18.0496764Z %30 = tt.expand_dims %5 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:19:18.0497033Z %31 = arith.muli %30, %cst : tensor<4x1xi32> 2026-02-21T08:19:18.0497267Z %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T08:19:18.0497546Z %33 = tt.broadcast %31 : tensor<4x1xi32> -> tensor<4x64xi32> 2026-02-21T08:19:18.0497787Z %34 = tt.broadcast %32 : tensor<1x64xi32> -> tensor<4x64xi32> 2026-02-21T08:19:18.0498013Z %35 = arith.addi %33, %34 : tensor<4x64xi32> 2026-02-21T08:19:18.0498239Z %36 = tt.splat %arg0 : !tt.ptr -> tensor<4x64x!tt.ptr> 2026-02-21T08:19:18.0498492Z %37 = tt.addptr %36, %35 : tensor<4x64x!tt.ptr>, tensor<4x64xi32> 2026-02-21T08:19:18.0498735Z %38 = tt.load %37 : tensor<4x64x!tt.ptr> 2026-02-21T08:19:18.0499002Z %39 = tt.descriptor_load %0[%2, %26] : !tt.tensordesc> -> tensor<4x64xf32> 2026-02-21T08:19:18.0499282Z %40 = scf.if %arg3 -> (tensor<4x64xf32>) { 2026-02-21T08:19:18.0499631Z %42 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = 
"__nv_expf"} : (tensor<4x64xf32>) -> tensor<4x64xf32> 2026-02-21T08:19:18.0499986Z %43 = arith.subf %39, %38 : tensor<4x64xf32> 2026-02-21T08:19:18.0500185Z %44 = arith.mulf %42, %43 : tensor<4x64xf32> 2026-02-21T08:19:18.0500384Z %45 = arith.addf %44, %cst_0 : tensor<4x64xf32> 2026-02-21T08:19:18.0500581Z scf.yield %45 : tensor<4x64xf32> 2026-02-21T08:19:18.0500745Z } else { 2026-02-21T08:19:18.0500906Z %42 = tt.splat %arg4 : f32 -> tensor<4x64xf32> 2026-02-21T08:19:18.0501111Z %43 = arith.cmpf ogt, %39, %42 : tensor<4x64xf32> 2026-02-21T08:19:18.0501323Z %44 = arith.cmpf une, %39, %39 : tensor<4x64xf32> 2026-02-21T08:19:18.0501527Z %45 = arith.ori %43, %44 : tensor<4x64xi1> 2026-02-21T08:19:18.0501747Z %46 = arith.select %45, %39, %42 : tensor<4x64xi1>, tensor<4x64xf32> 2026-02-21T08:19:18.0502028Z %47 = math.log %46 : tensor<4x64xf32> 2026-02-21T08:19:18.0502217Z %48 = arith.subf %47, %38 : tensor<4x64xf32> 2026-02-21T08:19:18.0502410Z %49 = arith.mulf %39, %48 : tensor<4x64xf32> 2026-02-21T08:19:18.0502604Z %50 = arith.addf %49, %cst_0 : tensor<4x64xf32> 2026-02-21T08:19:18.0502799Z scf.yield %50 : tensor<4x64xf32> 2026-02-21T08:19:18.0502964Z } 2026-02-21T08:19:18.0503099Z %41 = arith.addf %24, %40 : tensor<4x64xf32> 2026-02-21T08:19:18.0503285Z scf.yield %41 : tensor<4x64xf32> 2026-02-21T08:19:18.0503492Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:19:18.0503715Z %7 = "tt.reduce"(%6) <{axis = 1 : i32}> ({ 2026-02-21T08:19:18.0503894Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:19:18.0504074Z %10 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:19:18.0504250Z tt.reduce.return %10 : f32 2026-02-21T08:19:18.0504436Z }) : (tensor<4x64xf32>) -> tensor<4xf32> 2026-02-21T08:19:18.0504656Z %8 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:19:18.0504967Z %9 = tt.addptr %8, %5 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:19:18.0505192Z tt.store %9, %7 : tensor<4x!tt.ptr> 2026-02-21T08:19:18.0505469Z } 
{tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:19:18.0505740Z tt.return 2026-02-21T08:19:18.0505861Z } 2026-02-21T08:19:18.0505982Z } 2026-02-21T08:19:18.0506048Z 2026-02-21T08:19:18.0506103Z {-# 2026-02-21T08:19:18.0506223Z external_resources: { 2026-02-21T08:19:18.0506380Z mlir_reproducer: { 2026-02-21T08:19:18.0510681Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, 
triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:19:18.0515064Z disable_threading: false, 2026-02-21T08:19:18.0515233Z verify_each: true 2026-02-21T08:19:18.0515383Z } 2026-02-21T08:19:18.0515500Z } 2026-02-21T08:19:18.0515624Z #-} 2026-02-21T08:19:18.0516036Z /tmp/torchinductor_root/2p/c2p632ghn6dygk7gjcvqdffs6626zjbyiy4i6kevgfmfqap3huip.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:19:18.0517207Z /tmp/torchinductor_root/2p/c2p632ghn6dygk7gjcvqdffs6626zjbyiy4i6kevgfmfqap3huip.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:19:18.0518160Z [36s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:19:18.0519211Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 4], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'first'], maxnreg=128, num_sm_multiplier=32, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:19:18.0520158Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:19:18.0520411Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:19:19.2010095Z Initial population exploring neighbors 28% ━━━╸ 28/100 12.2 configs/s
2026-02-21T08:19:19.2014258Z
2026-02-21T08:19:19.2016798Z 50%|█████ | 3/6 [11:07<11:07, 222.35s/it]
2026-02-21T08:19:19.2021911Z WARNING:tritonbench.utils.triton_op:Caught exception on backend helion_kl_div_tritonbench, terminating early with partial results
2026-02-21T08:19:19.2025871Z Traceback (most recent call last):
2026-02-21T08:19:19.2041104Z   File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 1199, in run
2026-02-21T08:19:19.2041571Z     y_vals: Dict[str, BenchmarkOperatorMetrics] = functools.reduce(
2026-02-21T08:19:19.2041833Z                                                   ^^^^^^^^^^^^^^^^^
2026-02-21T08:19:19.2042321Z   File "/__w/helion/helion/benchmarks/tritonbench/tritonbench/utils/triton_op.py", line 1188, in _reduce_benchmarks
2026-02-21T08:19:19.2042683Z     torch.accelerator.synchronize()
2026-02-21T08:19:19.2043266Z   File "/__w/helion/helion/.venv/lib/python3.12/site-packages/torch/accelerator/__init__.py", line 235, in synchronize
2026-02-21T08:19:19.2043658Z     torch._C._accelerator_synchronizeDevice(device_index)
2026-02-21T08:19:19.2043910Z torch.AcceleratorError: CUDA error: misaligned address
2026-02-21T08:19:19.2044349Z Search for `cudaErrorMisalignedAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
2026-02-21T08:19:19.2044892Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2026-02-21T08:19:19.2045272Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1
2026-02-21T08:19:19.2045539Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2026-02-21T08:19:19.2045709Z
2026-02-21T08:19:19.2045919Z WARNING:tritonbench.utils.triton_op:Failing input: --input-id 3 --num-inputs 1 --input-sample-mode first-k
2026-02-21T08:19:19.2046338Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmprfy49r42.csv
2026-02-21T08:19:19.8589813Z (B, T, V)          liger_kl_div-speedup    liger_kl_div-accuracy    torch_compile_kl_div-speedup    torch_compile_kl_div-accuracy    helion_kl_div_tritonbench-speedup    helion_kl_div_tritonbench-accuracy
2026-02-21T08:19:19.8591496Z ---------------  ----------------------  -----------------------  ------------------------------  -------------------------------  -----------------------------------  ------------------------------------
2026-02-21T08:19:19.8592288Z (8, 512, 4096)                  3.10454                        1                         3.03078                                1                              3.58413                                     1
2026-02-21T08:19:19.8596901Z (8, 512, 8192)                  3.52643                        1                         3.20763                                1                              4.03754                                     1
2026-02-21T08:19:19.8598427Z (8, 512, 16384)                 4.00574                        1                         3.27739                                1                              3.96176                                     1
2026-02-21T08:19:19.8598986Z average                         3.54557                        1                         3.17193                                1                              3.86114                                     1
2026-02-21T08:21:26.9323454Z Using num_inputs=20 for kl_div
2026-02-21T08:21:27.2957645Z Running kl_div benchmark with Helion implementation...
2026-02-21T08:21:27.2958684Z
2026-02-21T08:21:27.6420488Z Warning: Requested 20 inputs but only 6 available. Using all available inputs.
2026-02-21T08:21:27.6425505Z Equally-spaced-k mode: Selected 6 equally spaced inputs (total available: 6) 2026-02-21T08:21:27.6430080Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 1, 2, 3, 4, 5] 2026-02-21T08:21:27.6434649Z 2026-02-21T08:21:27.6446782Z 0%| | 0/6 [00:00 {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:22:37.2745529Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:22:37.2745756Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:22:37.2745935Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:22:37.2746116Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:22:37.2746329Z %cst = arith.constant dense<0.000000e+00> : tensor<256x32xf32> 2026-02-21T08:22:37.2746569Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:22:37.2746765Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:22:37.2746984Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:22:37.2747470Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:22:37.2747784Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:22:37.2748211Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:22:37.2748510Z %2 = tt.get_program_id x : i32 2026-02-21T08:22:37.2748697Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:22:37.2748882Z %4 = arith.minsi %3, %c16_i32 : i32 2026-02-21T08:22:37.2749056Z %5 = arith.subi %4, %2 : i32 2026-02-21T08:22:37.2749232Z %c1_i32_0 = arith.constant 1 : i32 2026-02-21T08:22:37.2749409Z %6 = arith.subi %c1_i32, %c1_i32_0 : i32 2026-02-21T08:22:37.2749587Z %7 = arith.addi %5, %6 : i32 2026-02-21T08:22:37.2749745Z %8 = arith.divui %7, %c1_i32 : i32 2026-02-21T08:22:37.2749922Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:22:37.2750197Z %9 = arith.remsi %8, %c2_i32 : i32 2026-02-21T08:22:37.2750378Z %10 = arith.subi %8, %9 : i32 
2026-02-21T08:22:37.2750542Z %11 = arith.muli %10, %c1_i32 : i32 2026-02-21T08:22:37.2750723Z %12 = arith.addi %2, %11 : i32 2026-02-21T08:22:37.2750900Z %13 = arith.muli %c1_i32, %c2_i32 : i32 2026-02-21T08:22:37.2751090Z scf.for %arg5 = %2 to %12 step %13 : i32 { 2026-02-21T08:22:37.2751290Z %14 = arith.muli %arg5, %c256_i32 : i32 2026-02-21T08:22:37.2751516Z %15 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:22:37.2751774Z %16 = tt.splat %14 : i32 -> tensor<256xi32> 2026-02-21T08:22:37.2752074Z %17 = arith.addi %16, %15 : tensor<256xi32> 2026-02-21T08:22:37.2752395Z %18 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<256x32xf32>) : i32 { 2026-02-21T08:22:37.2752809Z %32 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<256x32xf32> 2026-02-21T08:22:37.2753182Z %33 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc> -> tensor<256x32xf32> 2026-02-21T08:22:37.2753487Z %34 = scf.if %arg3 -> (tensor<256x32xf32>) { 2026-02-21T08:22:37.2753854Z %36 = tt.extern_elementwise %33 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x32xf32>) -> tensor<256x32xf32> 2026-02-21T08:22:37.2754234Z %37 = arith.subf %33, %32 : tensor<256x32xf32> 2026-02-21T08:22:37.2754448Z %38 = arith.mulf %36, %37 : tensor<256x32xf32> 2026-02-21T08:22:37.2754652Z %39 = arith.addf %38, %cst : tensor<256x32xf32> 2026-02-21T08:22:37.2754857Z scf.yield %39 : tensor<256x32xf32> 2026-02-21T08:22:37.2755025Z } else { 2026-02-21T08:22:37.2755189Z %36 = tt.splat %arg4 : f32 -> tensor<256x32xf32> 2026-02-21T08:22:37.2755402Z %37 = arith.cmpf ogt, %33, %36 : tensor<256x32xf32> 2026-02-21T08:22:37.2755628Z %38 = arith.cmpf une, %33, %33 : tensor<256x32xf32> 2026-02-21T08:22:37.2755848Z %39 = arith.ori %37, %38 : tensor<256x32xi1> 2026-02-21T08:22:37.2756086Z %40 = arith.select %39, %33, %36 : tensor<256x32xi1>, tensor<256x32xf32> 2026-02-21T08:22:37.2756327Z %41 = math.log %40 : 
tensor<256x32xf32> 2026-02-21T08:22:37.2756521Z %42 = arith.subf %41, %32 : tensor<256x32xf32> 2026-02-21T08:22:37.2756723Z %43 = arith.mulf %33, %42 : tensor<256x32xf32> 2026-02-21T08:22:37.2756923Z %44 = arith.addf %43, %cst : tensor<256x32xf32> 2026-02-21T08:22:37.2757120Z scf.yield %44 : tensor<256x32xf32> 2026-02-21T08:22:37.2757284Z } 2026-02-21T08:22:37.2757435Z %35 = arith.addf %arg7, %34 : tensor<256x32xf32> 2026-02-21T08:22:37.2757632Z scf.yield %35 : tensor<256x32xf32> 2026-02-21T08:22:37.2757828Z } {tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:22:37.2758039Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:22:37.2758226Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:22:37.2758494Z %32 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:22:37.2758675Z tt.reduce.return %32 : f32 2026-02-21T08:22:37.2758859Z }) : (tensor<256x32xf32>) -> tensor<256xf32> 2026-02-21T08:22:37.2759088Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<256x!tt.ptr> 2026-02-21T08:22:37.2759344Z %21 = tt.addptr %20, %17 : tensor<256x!tt.ptr>, tensor<256xi32> 2026-02-21T08:22:37.2759588Z tt.store %21, %19 : tensor<256x!tt.ptr> 2026-02-21T08:22:37.2759781Z %c1_i32_1 = arith.constant 1 : i32 2026-02-21T08:22:37.2759974Z %22 = arith.muli %c1_i32, %c1_i32_1 : i32 2026-02-21T08:22:37.2760155Z %23 = arith.addi %arg5, %22 : i32 2026-02-21T08:22:37.2760334Z %24 = arith.muli %23, %c256_i32 : i32 2026-02-21T08:22:37.2760551Z %25 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:22:37.2760797Z %26 = tt.splat %24 : i32 -> tensor<256xi32> 2026-02-21T08:22:37.2761047Z %27 = arith.addi %26, %25 : tensor<256xi32> 2026-02-21T08:22:37.2761350Z %28 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<256x32xf32>) : i32 { 2026-02-21T08:22:37.2761744Z %32 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc> -> tensor<256x32xf32> 2026-02-21T08:22:37.2762149Z %33 = tt.descriptor_load %1[%24, %arg6] : !tt.tensordesc> -> 
tensor<256x32xf32> 2026-02-21T08:22:37.2762441Z %34 = scf.if %arg3 -> (tensor<256x32xf32>) { 2026-02-21T08:22:37.2762806Z %36 = tt.extern_elementwise %33 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x32xf32>) -> tensor<256x32xf32> 2026-02-21T08:22:37.2763170Z %37 = arith.subf %33, %32 : tensor<256x32xf32> 2026-02-21T08:22:37.2763379Z %38 = arith.mulf %36, %37 : tensor<256x32xf32> 2026-02-21T08:22:37.2763583Z %39 = arith.addf %38, %cst : tensor<256x32xf32> 2026-02-21T08:22:37.2763788Z scf.yield %39 : tensor<256x32xf32> 2026-02-21T08:22:37.2763956Z } else { 2026-02-21T08:22:37.2764124Z %36 = tt.splat %arg4 : f32 -> tensor<256x32xf32> 2026-02-21T08:22:37.2764348Z %37 = arith.cmpf ogt, %33, %36 : tensor<256x32xf32> 2026-02-21T08:22:37.2764565Z %38 = arith.cmpf une, %33, %33 : tensor<256x32xf32> 2026-02-21T08:22:37.2764778Z %39 = arith.ori %37, %38 : tensor<256x32xi1> 2026-02-21T08:22:37.2765014Z %40 = arith.select %39, %33, %36 : tensor<256x32xi1>, tensor<256x32xf32> 2026-02-21T08:22:37.2765259Z %41 = math.log %40 : tensor<256x32xf32> 2026-02-21T08:22:37.2765449Z %42 = arith.subf %41, %32 : tensor<256x32xf32> 2026-02-21T08:22:37.2765649Z %43 = arith.mulf %33, %42 : tensor<256x32xf32> 2026-02-21T08:22:37.2765856Z %44 = arith.addf %43, %cst : tensor<256x32xf32> 2026-02-21T08:22:37.2766044Z scf.yield %44 : tensor<256x32xf32> 2026-02-21T08:22:37.2766212Z } 2026-02-21T08:22:37.2766356Z %35 = arith.addf %arg7, %34 : tensor<256x32xf32> 2026-02-21T08:22:37.2766552Z scf.yield %35 : tensor<256x32xf32> 2026-02-21T08:22:37.2766746Z } {tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:22:37.2766953Z %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({ 2026-02-21T08:22:37.2767141Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:22:37.2767328Z %32 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:22:37.2767519Z tt.reduce.return %32 : f32 2026-02-21T08:22:37.2767704Z }) : (tensor<256x32xf32>) -> tensor<256xf32> 2026-02-21T08:22:37.2767945Z %30 = tt.splat %arg2 
: !tt.ptr -> tensor<256x!tt.ptr> 2026-02-21T08:22:37.2768215Z %31 = tt.addptr %30, %27 : tensor<256x!tt.ptr>, tensor<256xi32> 2026-02-21T08:22:37.2768462Z tt.store %31, %29 : tensor<256x!tt.ptr> 2026-02-21T08:22:37.2768665Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:22:37.2768881Z scf.for %arg5 = %12 to %4 step %c1_i32 : i32 { 2026-02-21T08:22:37.2769154Z %14 = arith.muli %arg5, %c256_i32 : i32 2026-02-21T08:22:37.2769390Z %15 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:22:37.2769648Z %16 = tt.splat %14 : i32 -> tensor<256xi32> 2026-02-21T08:22:37.2769846Z %17 = arith.addi %16, %15 : tensor<256xi32> 2026-02-21T08:22:37.2770165Z %18 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<256x32xf32>) : i32 { 2026-02-21T08:22:37.2770578Z %22 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<256x32xf32> 2026-02-21T08:22:37.2770964Z %23 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc> -> tensor<256x32xf32> 2026-02-21T08:22:37.2771268Z %24 = scf.if %arg3 -> (tensor<256x32xf32>) { 2026-02-21T08:22:37.2771696Z %26 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x32xf32>) -> tensor<256x32xf32> 2026-02-21T08:22:37.2772114Z %27 = arith.subf %23, %22 : tensor<256x32xf32> 2026-02-21T08:22:37.2772322Z %28 = arith.mulf %26, %27 : tensor<256x32xf32> 2026-02-21T08:22:37.2772539Z %29 = arith.addf %28, %cst : tensor<256x32xf32> 2026-02-21T08:22:37.2772748Z scf.yield %29 : tensor<256x32xf32> 2026-02-21T08:22:37.2772921Z } else { 2026-02-21T08:22:37.2773091Z %26 = tt.splat %arg4 : f32 -> tensor<256x32xf32> 2026-02-21T08:22:37.2773316Z %27 = arith.cmpf ogt, %23, %26 : tensor<256x32xf32> 2026-02-21T08:22:37.2773552Z %28 = arith.cmpf une, %23, %23 : tensor<256x32xf32> 2026-02-21T08:22:37.2773766Z %29 = arith.ori %27, %28 : tensor<256x32xi1> 2026-02-21T08:22:37.2774021Z %30 = arith.select %29, %23, %26 : tensor<256x32xi1>, 
tensor<256x32xf32> 2026-02-21T08:22:37.2774278Z %31 = math.log %30 : tensor<256x32xf32> 2026-02-21T08:22:37.2774481Z %32 = arith.subf %31, %22 : tensor<256x32xf32> 2026-02-21T08:22:37.2774698Z %33 = arith.mulf %23, %32 : tensor<256x32xf32> 2026-02-21T08:22:37.2774910Z %34 = arith.addf %33, %cst : tensor<256x32xf32> 2026-02-21T08:22:37.2775117Z scf.yield %34 : tensor<256x32xf32> 2026-02-21T08:22:37.2775280Z } 2026-02-21T08:22:37.2775430Z %25 = arith.addf %arg7, %24 : tensor<256x32xf32> 2026-02-21T08:22:37.2775617Z scf.yield %25 : tensor<256x32xf32> 2026-02-21T08:22:37.2775816Z } {tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:22:37.2776019Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:22:37.2776199Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:22:37.2776375Z %22 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:22:37.2776550Z tt.reduce.return %22 : f32 2026-02-21T08:22:37.2776733Z }) : (tensor<256x32xf32>) -> tensor<256xf32> 2026-02-21T08:22:37.2776958Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<256x!tt.ptr> 2026-02-21T08:22:37.2777224Z %21 = tt.addptr %20, %17 : tensor<256x!tt.ptr>, tensor<256xi32> 2026-02-21T08:22:37.2777465Z tt.store %21, %19 : tensor<256x!tt.ptr> 2026-02-21T08:22:37.2777656Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:22:37.2777829Z tt.return 2026-02-21T08:22:37.2777948Z } 2026-02-21T08:22:37.2778068Z } 2026-02-21T08:22:37.2778133Z 2026-02-21T08:22:37.2778182Z {-# 2026-02-21T08:22:37.2778310Z external_resources: { 2026-02-21T08:22:37.2778463Z mlir_reproducer: { 2026-02-21T08:22:37.2782928Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, 
triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:22:37.2787426Z disable_threading: false, 2026-02-21T08:22:37.2787591Z verify_each: true 2026-02-21T08:22:37.2787739Z } 2026-02-21T08:22:37.2787862Z } 2026-02-21T08:22:37.2787973Z #-} 2026-02-21T08:22:37.2788411Z /tmp/torchinductor_root/g7/cg7uh4kqlcjratoh3u7e26ozcc5fu6il4yz5rozohovifnytn3wd.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:22:37.2789639Z /tmp/torchinductor_root/g7/cg7uh4kqlcjratoh3u7e26ozcc5fu6il4yz5rozohovifnytn3wd.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, 
`TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:22:37.2790653Z [65s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:22:37.2791801Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last'], maxnreg=128, num_sm_multiplier=16, num_stages=6, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:22:37.2792822Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:22:37.2793069Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:22:37.7676007Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T08:22:37.7676858Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:22:37.7677530Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:22:37.7677750Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:22:37.7677951Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:22:37.7678193Z %cst = arith.constant dense<0.000000e+00> : tensor<64x256xf32> 2026-02-21T08:22:37.7678452Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:22:37.7678638Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:22:37.7678833Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:22:37.7679036Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:22:37.7679419Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:22:37.7679929Z %1 = tt.make_tensor_descriptor %arg1, 
[%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:22:37.7680294Z %2 = tt.get_program_id x : i32 2026-02-21T08:22:37.7680881Z scf.for %arg5 = %2 to %c64_i32 step %c9472_i32 : i32 { 2026-02-21T08:22:37.7681103Z %3 = arith.muli %arg5, %c64_i32 : i32 2026-02-21T08:22:37.7681344Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:22:37.7681592Z %5 = tt.splat %3 : i32 -> tensor<64xi32> 2026-02-21T08:22:37.7681784Z %6 = arith.addi %5, %4 : tensor<64xi32> 2026-02-21T08:22:37.7682051Z %c3840_i32 = arith.constant 3840 : i32 2026-02-21T08:22:37.7682231Z %c768_i32 = arith.constant 768 : i32 2026-02-21T08:22:37.7682544Z %7 = scf.for %arg6 = %c0_i32 to %c3840_i32 step %c768_i32 iter_args(%arg7 = %cst) -> (tensor<64x256xf32>) : i32 { 2026-02-21T08:22:37.7682944Z %15 = tt.descriptor_load %0[%3, %arg6] : !tt.tensordesc> -> tensor<64x256xf32> 2026-02-21T08:22:37.7683320Z %16 = tt.descriptor_load %1[%3, %arg6] : !tt.tensordesc> -> tensor<64x256xf32> 2026-02-21T08:22:37.7683707Z %17 = scf.if %arg3 -> (tensor<64x256xf32>) { 2026-02-21T08:22:37.7684086Z %31 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x256xf32>) -> tensor<64x256xf32> 2026-02-21T08:22:37.7684462Z %32 = arith.subf %16, %15 : tensor<64x256xf32> 2026-02-21T08:22:37.7684665Z %33 = arith.mulf %31, %32 : tensor<64x256xf32> 2026-02-21T08:22:37.7684877Z %34 = arith.addf %33, %cst : tensor<64x256xf32> 2026-02-21T08:22:37.7685073Z scf.yield %34 : tensor<64x256xf32> 2026-02-21T08:22:37.7685250Z } else { 2026-02-21T08:22:37.7685409Z %31 = tt.splat %arg4 : f32 -> tensor<64x256xf32> 2026-02-21T08:22:37.7685633Z %32 = arith.cmpf ogt, %16, %31 : tensor<64x256xf32> 2026-02-21T08:22:37.7685858Z %33 = arith.cmpf une, %16, %16 : tensor<64x256xf32> 2026-02-21T08:22:37.7686064Z %34 = arith.ori %32, %33 : tensor<64x256xi1> 2026-02-21T08:22:37.7686311Z %35 = arith.select %34, %16, %31 : tensor<64x256xi1>, tensor<64x256xf32> 
2026-02-21T08:22:37.7686553Z %36 = math.log %35 : tensor<64x256xf32> 2026-02-21T08:22:37.7686756Z %37 = arith.subf %36, %15 : tensor<64x256xf32> 2026-02-21T08:22:37.7686952Z %38 = arith.mulf %16, %37 : tensor<64x256xf32> 2026-02-21T08:22:37.7687158Z %39 = arith.addf %38, %cst : tensor<64x256xf32> 2026-02-21T08:22:37.7687356Z scf.yield %39 : tensor<64x256xf32> 2026-02-21T08:22:37.7687525Z } 2026-02-21T08:22:37.7687675Z %18 = arith.addf %arg7, %17 : tensor<64x256xf32> 2026-02-21T08:22:37.7687866Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:22:37.7688056Z %19 = arith.muli %c256_i32, %c1_i32 : i32 2026-02-21T08:22:37.7688237Z %20 = arith.addi %arg6, %19 : i32 2026-02-21T08:22:37.7688507Z %21 = tt.descriptor_load %0[%3, %20] : !tt.tensordesc> -> tensor<64x256xf32> 2026-02-21T08:22:37.7688872Z %22 = tt.descriptor_load %1[%3, %20] : !tt.tensordesc> -> tensor<64x256xf32> 2026-02-21T08:22:37.7689157Z %23 = scf.if %arg3 -> (tensor<64x256xf32>) { 2026-02-21T08:22:37.7689517Z %31 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x256xf32>) -> tensor<64x256xf32> 2026-02-21T08:22:37.7689874Z %32 = arith.subf %22, %21 : tensor<64x256xf32> 2026-02-21T08:22:37.7690077Z %33 = arith.mulf %31, %32 : tensor<64x256xf32> 2026-02-21T08:22:37.7690276Z %34 = arith.addf %33, %cst : tensor<64x256xf32> 2026-02-21T08:22:37.7690472Z scf.yield %34 : tensor<64x256xf32> 2026-02-21T08:22:37.7690645Z } else { 2026-02-21T08:22:37.7690800Z %31 = tt.splat %arg4 : f32 -> tensor<64x256xf32> 2026-02-21T08:22:37.7691017Z %32 = arith.cmpf ogt, %22, %31 : tensor<64x256xf32> 2026-02-21T08:22:37.7691232Z %33 = arith.cmpf une, %22, %22 : tensor<64x256xf32> 2026-02-21T08:22:37.7691450Z %34 = arith.ori %32, %33 : tensor<64x256xi1> 2026-02-21T08:22:37.7691792Z %35 = arith.select %34, %22, %31 : tensor<64x256xi1>, tensor<64x256xf32> 2026-02-21T08:22:37.7692072Z %36 = math.log %35 : tensor<64x256xf32> 2026-02-21T08:22:37.7692269Z %37 = arith.subf %36, %21 : 
tensor<64x256xf32> 2026-02-21T08:22:37.7692465Z %38 = arith.mulf %22, %37 : tensor<64x256xf32> 2026-02-21T08:22:37.7692675Z %39 = arith.addf %38, %cst : tensor<64x256xf32> 2026-02-21T08:22:37.7692866Z scf.yield %39 : tensor<64x256xf32> 2026-02-21T08:22:37.7693035Z } 2026-02-21T08:22:37.7693173Z %24 = arith.addf %18, %23 : tensor<64x256xf32> 2026-02-21T08:22:37.7693371Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:22:37.7693561Z %25 = arith.muli %c256_i32, %c2_i32 : i32 2026-02-21T08:22:37.7693744Z %26 = arith.addi %arg6, %25 : i32 2026-02-21T08:22:37.7694078Z %27 = tt.descriptor_load %0[%3, %26] : !tt.tensordesc> -> tensor<64x256xf32> 2026-02-21T08:22:37.7694431Z %28 = tt.descriptor_load %1[%3, %26] : !tt.tensordesc> -> tensor<64x256xf32> 2026-02-21T08:22:37.7694711Z %29 = scf.if %arg3 -> (tensor<64x256xf32>) { 2026-02-21T08:22:37.7695058Z %31 = tt.extern_elementwise %28 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x256xf32>) -> tensor<64x256xf32> 2026-02-21T08:22:37.7695417Z %32 = arith.subf %28, %27 : tensor<64x256xf32> 2026-02-21T08:22:37.7695619Z %33 = arith.mulf %31, %32 : tensor<64x256xf32> 2026-02-21T08:22:37.7695814Z %34 = arith.addf %33, %cst : tensor<64x256xf32> 2026-02-21T08:22:37.7696009Z scf.yield %34 : tensor<64x256xf32> 2026-02-21T08:22:37.7696172Z } else { 2026-02-21T08:22:37.7696334Z %31 = tt.splat %arg4 : f32 -> tensor<64x256xf32> 2026-02-21T08:22:37.7696544Z %32 = arith.cmpf ogt, %28, %31 : tensor<64x256xf32> 2026-02-21T08:22:37.7696763Z %33 = arith.cmpf une, %28, %28 : tensor<64x256xf32> 2026-02-21T08:22:37.7696972Z %34 = arith.ori %32, %33 : tensor<64x256xi1> 2026-02-21T08:22:37.7697200Z %35 = arith.select %34, %28, %31 : tensor<64x256xi1>, tensor<64x256xf32> 2026-02-21T08:22:37.7697440Z %36 = math.log %35 : tensor<64x256xf32> 2026-02-21T08:22:37.7697627Z %37 = arith.subf %36, %27 : tensor<64x256xf32> 2026-02-21T08:22:37.7697835Z %38 = arith.mulf %28, %37 : tensor<64x256xf32> 2026-02-21T08:22:37.7698040Z %39 = 
arith.addf %38, %cst : tensor<64x256xf32> 2026-02-21T08:22:37.7698245Z scf.yield %39 : tensor<64x256xf32> 2026-02-21T08:22:37.7698421Z } 2026-02-21T08:22:37.7698565Z %30 = arith.addf %24, %29 : tensor<64x256xf32> 2026-02-21T08:22:37.7698766Z scf.yield %30 : tensor<64x256xf32> 2026-02-21T08:22:37.7698951Z } {tt.num_stages = 1 : i32} 2026-02-21T08:22:37.7699244Z %8 = tt.descriptor_load %0[%3, %c3840_i32] : !tt.tensordesc> -> tensor<64x256xf32> 2026-02-21T08:22:37.7699638Z %9 = tt.descriptor_load %1[%3, %c3840_i32] : !tt.tensordesc> -> tensor<64x256xf32> 2026-02-21T08:22:37.7699943Z %10 = scf.if %arg3 -> (tensor<64x256xf32>) { 2026-02-21T08:22:37.7700322Z %15 = tt.extern_elementwise %9 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x256xf32>) -> tensor<64x256xf32> 2026-02-21T08:22:37.7700694Z %16 = arith.subf %9, %8 : tensor<64x256xf32> 2026-02-21T08:22:37.7700909Z %17 = arith.mulf %15, %16 : tensor<64x256xf32> 2026-02-21T08:22:37.7701117Z %18 = arith.addf %17, %cst : tensor<64x256xf32> 2026-02-21T08:22:37.7701322Z scf.yield %18 : tensor<64x256xf32> 2026-02-21T08:22:37.7701494Z } else { 2026-02-21T08:22:37.7701658Z %15 = tt.splat %arg4 : f32 -> tensor<64x256xf32> 2026-02-21T08:22:37.7701899Z %16 = arith.cmpf ogt, %9, %15 : tensor<64x256xf32> 2026-02-21T08:22:37.7702130Z %17 = arith.cmpf une, %9, %9 : tensor<64x256xf32> 2026-02-21T08:22:37.7702410Z %18 = arith.ori %16, %17 : tensor<64x256xi1> 2026-02-21T08:22:37.7702646Z %19 = arith.select %18, %9, %15 : tensor<64x256xi1>, tensor<64x256xf32> 2026-02-21T08:22:37.7702897Z %20 = math.log %19 : tensor<64x256xf32> 2026-02-21T08:22:37.7703099Z %21 = arith.subf %20, %8 : tensor<64x256xf32> 2026-02-21T08:22:37.7703309Z %22 = arith.mulf %9, %21 : tensor<64x256xf32> 2026-02-21T08:22:37.7703517Z %23 = arith.addf %22, %cst : tensor<64x256xf32> 2026-02-21T08:22:37.7703719Z scf.yield %23 : tensor<64x256xf32> 2026-02-21T08:22:37.7703898Z } 2026-02-21T08:22:37.7704043Z %11 = arith.addf %7, %10 : 
tensor<64x256xf32> 2026-02-21T08:22:37.7704274Z %12 = "tt.reduce"(%11) <{axis = 1 : i32}> ({ 2026-02-21T08:22:37.7704465Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:22:37.7704653Z %15 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:22:37.7704839Z tt.reduce.return %15 : f32 2026-02-21T08:22:37.7705088Z }) : (tensor<64x256xf32>) -> tensor<64xf32> 2026-02-21T08:22:37.7705333Z %13 = tt.splat %arg2 : !tt.ptr -> tensor<64x!tt.ptr> 2026-02-21T08:22:37.7705600Z %14 = tt.addptr %13, %6 : tensor<64x!tt.ptr>, tensor<64xi32> 2026-02-21T08:22:37.7705836Z tt.store %14, %12 : tensor<64x!tt.ptr> 2026-02-21T08:22:37.7706091Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:22:37.7706339Z tt.return 2026-02-21T08:22:37.7706458Z } 2026-02-21T08:22:37.7706578Z } 2026-02-21T08:22:37.7706643Z 2026-02-21T08:22:37.7706697Z {-# 2026-02-21T08:22:37.7706817Z external_resources: { 2026-02-21T08:22:37.7706972Z mlir_reproducer: { 2026-02-21T08:22:37.7711254Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, 
tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:22:37.7715605Z disable_threading: false, 2026-02-21T08:22:37.7715774Z verify_each: true 2026-02-21T08:22:37.7715910Z } 2026-02-21T08:22:37.7716029Z } 2026-02-21T08:22:37.7716133Z #-} 2026-02-21T08:22:37.7716545Z /tmp/torchinductor_root/zw/czwn57nyont3ac4ro5t4qpyubljjxltznojztyn6lmiuy36skcnd.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:22:37.7717725Z /tmp/torchinductor_root/zw/czwn57nyont3ac4ro5t4qpyubljjxltznojztyn6lmiuy36skcnd.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:22:37.7718737Z [66s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:22:37.7719805Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=256, num_sm_multiplier=64, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[4, 4], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:22:37.7720771Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:22:37.7721014Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:22:37.7989146Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:22:37.7994543Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:22:37.7995467Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:22:37.7995671Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:22:37.7995892Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:22:37.7996110Z %cst = arith.constant dense<0.000000e+00> : tensor<128x32xf32> 2026-02-21T08:22:37.7996339Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:22:37.7996517Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:22:37.7996702Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:22:37.7996868Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:22:37.7997188Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:22:37.7997622Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:22:37.7997923Z %2 = tt.get_program_id x : i32 2026-02-21T08:22:37.7998099Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:22:37.7998271Z %4 = arith.minsi %3, %c32_i32 : i32 2026-02-21T08:22:37.7998468Z scf.for %arg5 = %2 
to %4 step %c1_i32 : i32 { 2026-02-21T08:22:37.7998666Z %5 = arith.muli %arg5, %c128_i32 : i32 2026-02-21T08:22:37.7998896Z %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:22:37.7999139Z %7 = tt.splat %5 : i32 -> tensor<128xi32> 2026-02-21T08:22:37.7999332Z %8 = arith.addi %7, %6 : tensor<128xi32> 2026-02-21T08:22:37.7999637Z %9 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<128x32xf32>) : i32 { 2026-02-21T08:22:37.8000033Z %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc> -> tensor<128x32xf32> 2026-02-21T08:22:37.8000405Z %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc> -> tensor<128x32xf32> 2026-02-21T08:22:37.8000693Z %15 = scf.if %arg3 -> (tensor<128x32xf32>) { 2026-02-21T08:22:37.8001063Z %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x32xf32>) -> tensor<128x32xf32> 2026-02-21T08:22:37.8001432Z %18 = arith.subf %14, %13 : tensor<128x32xf32> 2026-02-21T08:22:37.8001635Z %19 = arith.mulf %17, %18 : tensor<128x32xf32> 2026-02-21T08:22:37.8001846Z %20 = arith.addf %19, %cst : tensor<128x32xf32> 2026-02-21T08:22:37.8002120Z scf.yield %20 : tensor<128x32xf32> 2026-02-21T08:22:37.8002297Z } else { 2026-02-21T08:22:37.8002459Z %17 = tt.splat %arg4 : f32 -> tensor<128x32xf32> 2026-02-21T08:22:37.8002686Z %18 = arith.cmpf ogt, %14, %17 : tensor<128x32xf32> 2026-02-21T08:22:37.8003120Z %19 = arith.cmpf une, %14, %14 : tensor<128x32xf32> 2026-02-21T08:22:37.8003328Z %20 = arith.ori %18, %19 : tensor<128x32xi1> 2026-02-21T08:22:37.8003569Z %21 = arith.select %20, %14, %17 : tensor<128x32xi1>, tensor<128x32xf32> 2026-02-21T08:22:37.8003807Z %22 = math.log %21 : tensor<128x32xf32> 2026-02-21T08:22:37.8004011Z %23 = arith.subf %22, %13 : tensor<128x32xf32> 2026-02-21T08:22:37.8004209Z %24 = arith.mulf %14, %23 : tensor<128x32xf32> 2026-02-21T08:22:37.8004417Z %25 = arith.addf %24, %cst : tensor<128x32xf32> 
2026-02-21T08:22:37.8004617Z scf.yield %25 : tensor<128x32xf32> 2026-02-21T08:22:37.8004780Z } 2026-02-21T08:22:37.8004932Z %16 = arith.addf %arg7, %15 : tensor<128x32xf32> 2026-02-21T08:22:37.8005122Z scf.yield %16 : tensor<128x32xf32> 2026-02-21T08:22:37.8005399Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:22:37.8005740Z %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({ 2026-02-21T08:22:37.8005955Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:22:37.8006138Z %13 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:22:37.8006324Z tt.reduce.return %13 : f32 2026-02-21T08:22:37.8006522Z }) : (tensor<128x32xf32>) -> tensor<128xf32> 2026-02-21T08:22:37.8006762Z %11 = tt.splat %arg2 : !tt.ptr -> tensor<128x!tt.ptr> 2026-02-21T08:22:37.8007021Z %12 = tt.addptr %11, %8 : tensor<128x!tt.ptr>, tensor<128xi32> 2026-02-21T08:22:37.8007266Z tt.store %12, %10 : tensor<128x!tt.ptr> 2026-02-21T08:22:37.8007456Z } {tt.loop_unroll_factor = 1 : i32} 2026-02-21T08:22:37.8007627Z tt.return 2026-02-21T08:22:37.8007747Z } 2026-02-21T08:22:37.8007865Z } 2026-02-21T08:22:37.8007930Z 2026-02-21T08:22:37.8007978Z {-# 2026-02-21T08:22:37.8008108Z external_resources: { 2026-02-21T08:22:37.8008261Z mlir_reproducer: { 2026-02-21T08:22:37.8012716Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, 
tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=1}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=1}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=1}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:22:37.8017302Z disable_threading: false, 2026-02-21T08:22:37.8017467Z verify_each: true 2026-02-21T08:22:37.8017618Z } 2026-02-21T08:22:37.8017732Z } 2026-02-21T08:22:37.8017849Z #-} 2026-02-21T08:22:37.8018304Z /tmp/torchinductor_root/eg/cegvfk24qtnxvmxn4bycu6f6zzbkjopadvnuuaxwludmaet2ilma.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:22:37.8019682Z /tmp/torchinductor_root/eg/cegvfk24qtnxvmxn4bycu6f6zzbkjopadvnuuaxwludmaet2ilma.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:22:37.8020708Z [66s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:22:37.8021916Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=2, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, False], range_num_stages=[0, 2], range_unroll_factors=[1, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:22:37.8022941Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:22:37.8023195Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:22:38.2299361Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:22:38.2301548Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:22:38.2302398Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:22:38.2302590Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:22:38.2302782Z %c2368_i32 = arith.constant 2368 : i32 2026-02-21T08:22:38.2302988Z %cst = arith.constant dense<4096> : tensor<4x1xi32> 2026-02-21T08:22:38.2303240Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<4x4xf32> 2026-02-21T08:22:38.2303491Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:22:38.2303687Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:22:38.2303860Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:22:38.2304038Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:22:38.2304343Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:22:38.2304645Z %1 = tt.get_program_id x : i32 2026-02-21T08:22:38.2304820Z %2 = arith.subi %c1024_i32, %1 : i32 2026-02-21T08:22:38.2304989Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:22:38.2305166Z %3 = arith.subi %c2368_i32, %c1_i32 : i32 
2026-02-21T08:22:38.2305353Z %4 = arith.addi %2, %3 : i32 2026-02-21T08:22:38.2305524Z %5 = arith.divui %4, %c2368_i32 : i32 2026-02-21T08:22:38.2305979Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:22:38.2306149Z %6 = arith.remsi %5, %c3_i32 : i32 2026-02-21T08:22:38.2306325Z %7 = arith.subi %5, %6 : i32 2026-02-21T08:22:38.2306490Z %8 = arith.muli %7, %c2368_i32 : i32 2026-02-21T08:22:38.2306670Z %9 = arith.addi %1, %8 : i32 2026-02-21T08:22:38.2306840Z %10 = arith.muli %c2368_i32, %c3_i32 : i32 2026-02-21T08:22:38.2307037Z scf.for %arg5 = %1 to %9 step %10 : i32 { 2026-02-21T08:22:38.2307231Z %11 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:22:38.2307451Z %12 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:22:38.2307699Z %13 = tt.splat %11 : i32 -> tensor<4xi32> 2026-02-21T08:22:38.2307885Z %14 = arith.addi %13, %12 : tensor<4xi32> 2026-02-21T08:22:38.2308191Z %15 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:22:38.2308497Z %39 = tt.splat %arg6 : i32 -> tensor<4xi32> 2026-02-21T08:22:38.2308702Z %40 = arith.addi %39, %12 : tensor<4xi32> 2026-02-21T08:22:38.2308994Z %41 = tt.expand_dims %14 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:22:38.2312740Z %42 = arith.muli %41, %cst : tensor<4x1xi32> 2026-02-21T08:22:38.2317431Z %43 = tt.expand_dims %40 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32> 2026-02-21T08:22:38.2319466Z %44 = tt.broadcast %42 : tensor<4x1xi32> -> tensor<4x4xi32> 2026-02-21T08:22:38.2319807Z %45 = tt.broadcast %43 : tensor<1x4xi32> -> tensor<4x4xi32> 2026-02-21T08:22:38.2324561Z %46 = arith.addi %44, %45 : tensor<4x4xi32> 2026-02-21T08:22:38.2324922Z %47 = tt.splat %arg0 : !tt.ptr -> tensor<4x4x!tt.ptr> 2026-02-21T08:22:38.2325231Z %48 = tt.addptr %47, %46 : tensor<4x4x!tt.ptr>, tensor<4x4xi32> 2026-02-21T08:22:38.2330391Z %49 = tt.load %48 evictionPolicy = evict_last : tensor<4x4x!tt.ptr> 
2026-02-21T08:22:38.2334419Z %50 = tt.descriptor_load %0[%11, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:22:38.2336447Z %51 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:22:38.2337137Z %53 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:22:38.2337537Z %54 = arith.subf %50, %49 : tensor<4x4xf32> 2026-02-21T08:22:38.2341995Z %55 = arith.mulf %53, %54 : tensor<4x4xf32> 2026-02-21T08:22:38.2345806Z %56 = arith.addf %55, %cst_0 : tensor<4x4xf32> 2026-02-21T08:22:38.2350283Z scf.yield %56 : tensor<4x4xf32> 2026-02-21T08:22:38.2351906Z } else { 2026-02-21T08:22:38.2352128Z %53 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:22:38.2352426Z %54 = arith.cmpf ogt, %50, %53 : tensor<4x4xf32> 2026-02-21T08:22:38.2356903Z %55 = arith.cmpf une, %50, %50 : tensor<4x4xf32> 2026-02-21T08:22:38.2357225Z %56 = arith.ori %54, %55 : tensor<4x4xi1> 2026-02-21T08:22:38.2357494Z %57 = arith.select %56, %50, %53 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:22:38.2363296Z %58 = math.log %57 : tensor<4x4xf32> 2026-02-21T08:22:38.2365194Z %59 = arith.subf %58, %49 : tensor<4x4xf32> 2026-02-21T08:22:38.2365432Z %60 = arith.mulf %50, %59 : tensor<4x4xf32> 2026-02-21T08:22:38.2365645Z %61 = arith.addf %60, %cst_0 : tensor<4x4xf32> 2026-02-21T08:22:38.2365842Z scf.yield %61 : tensor<4x4xf32> 2026-02-21T08:22:38.2366016Z } 2026-02-21T08:22:38.2366161Z %52 = arith.addf %arg7, %51 : tensor<4x4xf32> 2026-02-21T08:22:38.2366357Z scf.yield %52 : tensor<4x4xf32> 2026-02-21T08:22:38.2366607Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:22:38.2366882Z %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({ 2026-02-21T08:22:38.2367075Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:22:38.2367247Z %39 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:22:38.2367431Z tt.reduce.return %39 : f32 2026-02-21T08:22:38.2367612Z }) : (tensor<4x4xf32>) -> 
tensor<4xf32> 2026-02-21T08:22:38.2367854Z %17 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:22:38.2368111Z %18 = tt.addptr %17, %14 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:22:38.2368346Z tt.store %18, %16 : tensor<4x!tt.ptr> 2026-02-21T08:22:38.2368544Z %c1_i32_1 = arith.constant 1 : i32 2026-02-21T08:22:38.2368724Z %19 = arith.muli %c2368_i32, %c1_i32_1 : i32 2026-02-21T08:22:38.2368917Z %20 = arith.addi %arg5, %19 : i32 2026-02-21T08:22:38.2369088Z %21 = arith.muli %20, %c4_i32 : i32 2026-02-21T08:22:38.2369308Z %22 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:22:38.2369539Z %23 = tt.splat %21 : i32 -> tensor<4xi32> 2026-02-21T08:22:38.2369767Z %24 = arith.addi %23, %22 : tensor<4xi32> 2026-02-21T08:22:38.2370076Z %25 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:22:38.2370389Z %39 = tt.splat %arg6 : i32 -> tensor<4xi32> 2026-02-21T08:22:38.2370586Z %40 = arith.addi %39, %22 : tensor<4xi32> 2026-02-21T08:22:38.2371044Z %41 = tt.expand_dims %24 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:22:38.2371300Z %42 = arith.muli %41, %cst : tensor<4x1xi32> 2026-02-21T08:22:38.2371546Z %43 = tt.expand_dims %40 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32> 2026-02-21T08:22:38.2371835Z %44 = tt.broadcast %42 : tensor<4x1xi32> -> tensor<4x4xi32> 2026-02-21T08:22:38.2372154Z %45 = tt.broadcast %43 : tensor<1x4xi32> -> tensor<4x4xi32> 2026-02-21T08:22:38.2372393Z %46 = arith.addi %44, %45 : tensor<4x4xi32> 2026-02-21T08:22:38.2372622Z %47 = tt.splat %arg0 : !tt.ptr -> tensor<4x4x!tt.ptr> 2026-02-21T08:22:38.2372895Z %48 = tt.addptr %47, %46 : tensor<4x4x!tt.ptr>, tensor<4x4xi32> 2026-02-21T08:22:38.2373179Z %49 = tt.load %48 evictionPolicy = evict_last : tensor<4x4x!tt.ptr> 2026-02-21T08:22:38.2373592Z %50 = tt.descriptor_load %0[%21, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:22:38.2373888Z %51 = scf.if 
%arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:22:38.2374242Z %53 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:22:38.2374601Z %54 = arith.subf %50, %49 : tensor<4x4xf32> 2026-02-21T08:22:38.2374795Z %55 = arith.mulf %53, %54 : tensor<4x4xf32> 2026-02-21T08:22:38.2375005Z %56 = arith.addf %55, %cst_0 : tensor<4x4xf32> 2026-02-21T08:22:38.2375195Z scf.yield %56 : tensor<4x4xf32> 2026-02-21T08:22:38.2375365Z } else { 2026-02-21T08:22:38.2375526Z %53 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:22:38.2375731Z %54 = arith.cmpf ogt, %50, %53 : tensor<4x4xf32> 2026-02-21T08:22:38.2375947Z %55 = arith.cmpf une, %50, %50 : tensor<4x4xf32> 2026-02-21T08:22:38.2376145Z %56 = arith.ori %54, %55 : tensor<4x4xi1> 2026-02-21T08:22:38.2376377Z %57 = arith.select %56, %50, %53 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:22:38.2376603Z %58 = math.log %57 : tensor<4x4xf32> 2026-02-21T08:22:38.2376794Z %59 = arith.subf %58, %49 : tensor<4x4xf32> 2026-02-21T08:22:38.2376985Z %60 = arith.mulf %50, %59 : tensor<4x4xf32> 2026-02-21T08:22:38.2377179Z %61 = arith.addf %60, %cst_0 : tensor<4x4xf32> 2026-02-21T08:22:38.2377371Z scf.yield %61 : tensor<4x4xf32> 2026-02-21T08:22:38.2377530Z } 2026-02-21T08:22:38.2377670Z %52 = arith.addf %arg7, %51 : tensor<4x4xf32> 2026-02-21T08:22:38.2377851Z scf.yield %52 : tensor<4x4xf32> 2026-02-21T08:22:38.2378093Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:22:38.2378346Z %26 = "tt.reduce"(%25) <{axis = 1 : i32}> ({ 2026-02-21T08:22:38.2378532Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:22:38.2378706Z %39 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:22:38.2378883Z tt.reduce.return %39 : f32 2026-02-21T08:22:38.2379065Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:22:38.2379275Z %27 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:22:38.2379527Z %28 = tt.addptr %27, %24 : 
tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:22:38.2379748Z tt.store %28, %26 : tensor<4x!tt.ptr> 2026-02-21T08:22:38.2379940Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:22:38.2380123Z %29 = arith.muli %c2368_i32, %c2_i32 : i32 2026-02-21T08:22:38.2380300Z %30 = arith.addi %arg5, %29 : i32 2026-02-21T08:22:38.2380476Z %31 = arith.muli %30, %c4_i32 : i32 2026-02-21T08:22:38.2380684Z %32 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:22:38.2380918Z %33 = tt.splat %31 : i32 -> tensor<4xi32> 2026-02-21T08:22:38.2381101Z %34 = arith.addi %33, %32 : tensor<4xi32> 2026-02-21T08:22:38.2381403Z %35 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:22:38.2381784Z %39 = tt.splat %arg6 : i32 -> tensor<4xi32> 2026-02-21T08:22:38.2382021Z %40 = arith.addi %39, %32 : tensor<4xi32> 2026-02-21T08:22:38.2382265Z %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:22:38.2382515Z %42 = arith.muli %41, %cst : tensor<4x1xi32> 2026-02-21T08:22:38.2382758Z %43 = tt.expand_dims %40 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32> 2026-02-21T08:22:38.2383024Z %44 = tt.broadcast %42 : tensor<4x1xi32> -> tensor<4x4xi32> 2026-02-21T08:22:38.2383269Z %45 = tt.broadcast %43 : tensor<1x4xi32> -> tensor<4x4xi32> 2026-02-21T08:22:38.2383492Z %46 = arith.addi %44, %45 : tensor<4x4xi32> 2026-02-21T08:22:38.2383717Z %47 = tt.splat %arg0 : !tt.ptr -> tensor<4x4x!tt.ptr> 2026-02-21T08:22:38.2384042Z %48 = tt.addptr %47, %46 : tensor<4x4x!tt.ptr>, tensor<4x4xi32> 2026-02-21T08:22:38.2384318Z %49 = tt.load %48 evictionPolicy = evict_last : tensor<4x4x!tt.ptr> 2026-02-21T08:22:38.2384638Z %50 = tt.descriptor_load %0[%31, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:22:38.2384914Z %51 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:22:38.2385264Z %53 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} 
: (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:22:38.2385619Z %54 = arith.subf %50, %49 : tensor<4x4xf32> 2026-02-21T08:22:38.2385812Z %55 = arith.mulf %53, %54 : tensor<4x4xf32> 2026-02-21T08:22:38.2386017Z %56 = arith.addf %55, %cst_0 : tensor<4x4xf32> 2026-02-21T08:22:38.2386205Z scf.yield %56 : tensor<4x4xf32> 2026-02-21T08:22:38.2386373Z } else { 2026-02-21T08:22:38.2386526Z %53 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:22:38.2386742Z %54 = arith.cmpf ogt, %50, %53 : tensor<4x4xf32> 2026-02-21T08:22:38.2386958Z %55 = arith.cmpf une, %50, %50 : tensor<4x4xf32> 2026-02-21T08:22:38.2387156Z %56 = arith.ori %54, %55 : tensor<4x4xi1> 2026-02-21T08:22:38.2387383Z %57 = arith.select %56, %50, %53 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:22:38.2387607Z %58 = math.log %57 : tensor<4x4xf32> 2026-02-21T08:22:38.2387800Z %59 = arith.subf %58, %49 : tensor<4x4xf32> 2026-02-21T08:22:38.2387985Z %60 = arith.mulf %50, %59 : tensor<4x4xf32> 2026-02-21T08:22:38.2388187Z %61 = arith.addf %60, %cst_0 : tensor<4x4xf32> 2026-02-21T08:22:38.2388377Z scf.yield %61 : tensor<4x4xf32> 2026-02-21T08:22:38.2388537Z } 2026-02-21T08:22:38.2388683Z %52 = arith.addf %arg7, %51 : tensor<4x4xf32> 2026-02-21T08:22:38.2388867Z scf.yield %52 : tensor<4x4xf32> 2026-02-21T08:22:38.2389115Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:22:38.2389402Z %36 = "tt.reduce"(%35) <{axis = 1 : i32}> ({ 2026-02-21T08:22:38.2389597Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:22:38.2389771Z %39 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:22:38.2389957Z tt.reduce.return %39 : f32 2026-02-21T08:22:38.2390135Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:22:38.2390356Z %37 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:22:38.2390615Z %38 = tt.addptr %37, %34 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:22:38.2390840Z tt.store %38, %36 : tensor<4x!tt.ptr> 2026-02-21T08:22:38.2391035Z } {tt.num_stages = 1 : 
i32} 2026-02-21T08:22:38.2391233Z scf.for %arg5 = %9 to %c1024_i32 step %c2368_i32 : i32 { 2026-02-21T08:22:38.2391451Z %11 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:22:38.2391670Z %12 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:22:38.2391947Z %13 = tt.splat %11 : i32 -> tensor<4xi32> 2026-02-21T08:22:38.2392206Z %14 = arith.addi %13, %12 : tensor<4xi32> 2026-02-21T08:22:38.2392507Z %15 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:22:38.2392820Z %19 = tt.splat %arg6 : i32 -> tensor<4xi32> 2026-02-21T08:22:38.2393021Z %20 = arith.addi %19, %12 : tensor<4xi32> 2026-02-21T08:22:38.2393262Z %21 = tt.expand_dims %14 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:22:38.2393508Z %22 = arith.muli %21, %cst : tensor<4x1xi32> 2026-02-21T08:22:38.2393747Z %23 = tt.expand_dims %20 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32> 2026-02-21T08:22:38.2394014Z %24 = tt.broadcast %22 : tensor<4x1xi32> -> tensor<4x4xi32> 2026-02-21T08:22:38.2394249Z %25 = tt.broadcast %23 : tensor<1x4xi32> -> tensor<4x4xi32> 2026-02-21T08:22:38.2394467Z %26 = arith.addi %24, %25 : tensor<4x4xi32> 2026-02-21T08:22:38.2394754Z %27 = tt.splat %arg0 : !tt.ptr -> tensor<4x4x!tt.ptr> 2026-02-21T08:22:38.2395016Z %28 = tt.addptr %27, %26 : tensor<4x4x!tt.ptr>, tensor<4x4xi32> 2026-02-21T08:22:38.2395285Z %29 = tt.load %28 evictionPolicy = evict_last : tensor<4x4x!tt.ptr> 2026-02-21T08:22:38.2395605Z %30 = tt.descriptor_load %0[%11, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:22:38.2395885Z %31 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:22:38.2396226Z %33 = tt.extern_elementwise %30 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:22:38.2396576Z %34 = arith.subf %30, %29 : tensor<4x4xf32> 2026-02-21T08:22:38.2396767Z %35 = arith.mulf %33, %34 : tensor<4x4xf32> 
2026-02-21T08:22:38.2396974Z %36 = arith.addf %35, %cst_0 : tensor<4x4xf32> 2026-02-21T08:22:38.2397165Z scf.yield %36 : tensor<4x4xf32> 2026-02-21T08:22:38.2397337Z } else { 2026-02-21T08:22:38.2397499Z %33 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:22:38.2397703Z %34 = arith.cmpf ogt, %30, %33 : tensor<4x4xf32> 2026-02-21T08:22:38.2397917Z %35 = arith.cmpf une, %30, %30 : tensor<4x4xf32> 2026-02-21T08:22:38.2398112Z %36 = arith.ori %34, %35 : tensor<4x4xi1> 2026-02-21T08:22:38.2398341Z %37 = arith.select %36, %30, %33 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:22:38.2398564Z %38 = math.log %37 : tensor<4x4xf32> 2026-02-21T08:22:38.2398756Z %39 = arith.subf %38, %29 : tensor<4x4xf32> 2026-02-21T08:22:38.2398951Z %40 = arith.mulf %30, %39 : tensor<4x4xf32> 2026-02-21T08:22:38.2399144Z %41 = arith.addf %40, %cst_0 : tensor<4x4xf32> 2026-02-21T08:22:38.2399338Z scf.yield %41 : tensor<4x4xf32> 2026-02-21T08:22:38.2399500Z } 2026-02-21T08:22:38.2399645Z %32 = arith.addf %arg7, %31 : tensor<4x4xf32> 2026-02-21T08:22:38.2399831Z scf.yield %32 : tensor<4x4xf32> 2026-02-21T08:22:38.2400084Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:22:38.2400340Z %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({ 2026-02-21T08:22:38.2400537Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:22:38.2400713Z %19 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:22:38.2400888Z tt.reduce.return %19 : f32 2026-02-21T08:22:38.2401069Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:22:38.2401279Z %17 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:22:38.2401531Z %18 = tt.addptr %17, %14 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:22:38.2401752Z tt.store %18, %16 : tensor<4x!tt.ptr> 2026-02-21T08:22:38.2401977Z } {tt.num_stages = 1 : i32} 2026-02-21T08:22:38.2402140Z tt.return 2026-02-21T08:22:38.2402260Z } 2026-02-21T08:22:38.2402378Z } 2026-02-21T08:22:38.2402445Z 2026-02-21T08:22:38.2402495Z {-# 
2026-02-21T08:22:38.2402629Z external_resources: { 2026-02-21T08:22:38.2402834Z mlir_reproducer: { 2026-02-21T08:22:38.2407137Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:22:38.2411450Z disable_threading: false, 2026-02-21T08:22:38.2411616Z verify_each: true 
2026-02-21T08:22:38.2411753Z } 2026-02-21T08:22:38.2411911Z } 2026-02-21T08:22:38.2412026Z #-} 2026-02-21T08:22:38.2412440Z /tmp/torchinductor_root/5a/c5aid2qwpcuu4qg3gno4tcuqrrl3ww6r6k5sx3ichpbpqpwiqtw4.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:22:38.2413624Z /tmp/torchinductor_root/5a/c5aid2qwpcuu4qg3gno4tcuqrrl3ww6r6k5sx3ichpbpqpwiqtw4.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:22:38.2414589Z [66s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:22:38.2415651Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 4], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[3, 2], range_unroll_factors=[3, 1], range_warp_specializes=[False, True]), static_shapes=True) 2026-02-21T08:22:38.2416620Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:22:38.2416864Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:22:38.5164816Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:22:38.5169054Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:22:38.5173236Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:22:38.5176714Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:22:38.5181408Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:22:38.5183413Z %c592_i32 = arith.constant 592 : i32 2026-02-21T08:22:38.5183754Z %cst = arith.constant dense<4096> : tensor<32x1xi32> 2026-02-21T08:22:38.5184326Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<32x8xf32> 2026-02-21T08:22:38.5189217Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:22:38.5189491Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:22:38.5189727Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:22:38.5189947Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:22:38.5190290Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:22:38.5190635Z %1 = tt.get_program_id x : i32 2026-02-21T08:22:38.5190854Z scf.for %arg5 = %1 to %c128_i32 step %c592_i32 : i32 { 2026-02-21T08:22:38.5191070Z %2 = arith.muli %arg5, %c32_i32 : i32 2026-02-21T08:22:38.5191302Z %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:22:38.5191540Z %4 = tt.splat %2 : i32 -> tensor<32xi32> 2026-02-21T08:22:38.5192110Z %5 = arith.addi %4, %3 : tensor<32xi32> 2026-02-21T08:22:38.5192318Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:22:38.5192631Z %6 = scf.for %arg6 = %c0_i32 to %c4096_i32 step %c16_i32 iter_args(%arg7 = %cst_0) -> (tensor<32x8xf32>) : i32 { 2026-02-21T08:22:38.5192986Z %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:22:38.5193231Z %11 = tt.splat %arg6 : i32 -> tensor<8xi32> 2026-02-21T08:22:38.5193436Z 
%12 = arith.addi %11, %10 : tensor<8xi32> 2026-02-21T08:22:38.5193677Z %13 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:22:38.5193938Z %14 = arith.muli %13, %cst : tensor<32x1xi32> 2026-02-21T08:22:38.5194177Z %15 = tt.expand_dims %12 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:22:38.5194456Z %16 = tt.broadcast %14 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:22:38.5194707Z %17 = tt.broadcast %15 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:22:38.5194930Z %18 = arith.addi %16, %17 : tensor<32x8xi32> 2026-02-21T08:22:38.5195161Z %19 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:22:38.5195417Z %20 = tt.addptr %19, %18 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:22:38.5195704Z %21 = tt.load %20 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 2026-02-21T08:22:38.5196035Z %22 = tt.descriptor_load %0[%2, %arg6] : !tt.tensordesc> -> tensor<32x8xf32> 2026-02-21T08:22:38.5196334Z %23 = scf.if %arg3 -> (tensor<32x8xf32>) { 2026-02-21T08:22:38.5196693Z %42 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:22:38.5197053Z %43 = arith.subf %22, %21 : tensor<32x8xf32> 2026-02-21T08:22:38.5197250Z %44 = arith.mulf %42, %43 : tensor<32x8xf32> 2026-02-21T08:22:38.5197461Z %45 = arith.addf %44, %cst_0 : tensor<32x8xf32> 2026-02-21T08:22:38.5197656Z scf.yield %45 : tensor<32x8xf32> 2026-02-21T08:22:38.5197827Z } else { 2026-02-21T08:22:38.5197985Z %42 = tt.splat %arg4 : f32 -> tensor<32x8xf32> 2026-02-21T08:22:38.5198204Z %43 = arith.cmpf ogt, %22, %42 : tensor<32x8xf32> 2026-02-21T08:22:38.5198420Z %44 = arith.cmpf une, %22, %22 : tensor<32x8xf32> 2026-02-21T08:22:38.5198617Z %45 = arith.ori %43, %44 : tensor<32x8xi1> 2026-02-21T08:22:38.5198849Z %46 = arith.select %45, %22, %42 : tensor<32x8xi1>, tensor<32x8xf32> 2026-02-21T08:22:38.5199076Z %47 = math.log %46 : 
tensor<32x8xf32> 2026-02-21T08:22:38.5199269Z %48 = arith.subf %47, %21 : tensor<32x8xf32> 2026-02-21T08:22:38.5199456Z %49 = arith.mulf %22, %48 : tensor<32x8xf32> 2026-02-21T08:22:38.5199659Z %50 = arith.addf %49, %cst_0 : tensor<32x8xf32> 2026-02-21T08:22:38.5199848Z scf.yield %50 : tensor<32x8xf32> 2026-02-21T08:22:38.5200016Z } 2026-02-21T08:22:38.5200262Z %24 = arith.addf %arg7, %23 : tensor<32x8xf32> 2026-02-21T08:22:38.5200450Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:22:38.5200639Z %25 = arith.muli %c8_i32, %c1_i32 : i32 2026-02-21T08:22:38.5200817Z %26 = arith.addi %arg6, %25 : i32 2026-02-21T08:22:38.5201039Z %27 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:22:38.5201272Z %28 = tt.splat %26 : i32 -> tensor<8xi32> 2026-02-21T08:22:38.5201469Z %29 = arith.addi %28, %27 : tensor<8xi32> 2026-02-21T08:22:38.5201710Z %30 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:22:38.5202003Z %31 = arith.muli %30, %cst : tensor<32x1xi32> 2026-02-21T08:22:38.5202247Z %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:22:38.5202513Z %33 = tt.broadcast %31 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:22:38.5202840Z %34 = tt.broadcast %32 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:22:38.5203063Z %35 = arith.addi %33, %34 : tensor<32x8xi32> 2026-02-21T08:22:38.5203291Z %36 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:22:38.5203552Z %37 = tt.addptr %36, %35 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:22:38.5203831Z %38 = tt.load %37 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 2026-02-21T08:22:38.5204156Z %39 = tt.descriptor_load %0[%2, %26] : !tt.tensordesc> -> tensor<32x8xf32> 2026-02-21T08:22:38.5204428Z %40 = scf.if %arg3 -> (tensor<32x8xf32>) { 2026-02-21T08:22:38.5204784Z %42 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> 
tensor<32x8xf32> 2026-02-21T08:22:38.5205145Z %43 = arith.subf %39, %38 : tensor<32x8xf32> 2026-02-21T08:22:38.5205346Z %44 = arith.mulf %42, %43 : tensor<32x8xf32> 2026-02-21T08:22:38.5205566Z %45 = arith.addf %44, %cst_0 : tensor<32x8xf32> 2026-02-21T08:22:38.5205758Z scf.yield %45 : tensor<32x8xf32> 2026-02-21T08:22:38.5205929Z } else { 2026-02-21T08:22:38.5206080Z %42 = tt.splat %arg4 : f32 -> tensor<32x8xf32> 2026-02-21T08:22:38.5206292Z %43 = arith.cmpf ogt, %39, %42 : tensor<32x8xf32> 2026-02-21T08:22:38.5206495Z %44 = arith.cmpf une, %39, %39 : tensor<32x8xf32> 2026-02-21T08:22:38.5206699Z %45 = arith.ori %43, %44 : tensor<32x8xi1> 2026-02-21T08:22:38.5206928Z %46 = arith.select %45, %39, %42 : tensor<32x8xi1>, tensor<32x8xf32> 2026-02-21T08:22:38.5207152Z %47 = math.log %46 : tensor<32x8xf32> 2026-02-21T08:22:38.5207342Z %48 = arith.subf %47, %38 : tensor<32x8xf32> 2026-02-21T08:22:38.5207528Z %49 = arith.mulf %39, %48 : tensor<32x8xf32> 2026-02-21T08:22:38.5207730Z %50 = arith.addf %49, %cst_0 : tensor<32x8xf32> 2026-02-21T08:22:38.5207918Z scf.yield %50 : tensor<32x8xf32> 2026-02-21T08:22:38.5208090Z } 2026-02-21T08:22:38.5208230Z %41 = arith.addf %24, %40 : tensor<32x8xf32> 2026-02-21T08:22:38.5208412Z scf.yield %41 : tensor<32x8xf32> 2026-02-21T08:22:38.5208626Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:22:38.5208842Z %7 = "tt.reduce"(%6) <{axis = 1 : i32}> ({ 2026-02-21T08:22:38.5209031Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:22:38.5209200Z %10 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:22:38.5209382Z tt.reduce.return %10 : f32 2026-02-21T08:22:38.5209558Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:22:38.5209780Z %8 = tt.splat %arg2 : !tt.ptr -> tensor<32x!tt.ptr> 2026-02-21T08:22:38.5210030Z %9 = tt.addptr %8, %5 : tensor<32x!tt.ptr>, tensor<32xi32> 2026-02-21T08:22:38.5210250Z tt.store %9, %7 : tensor<32x!tt.ptr> 2026-02-21T08:22:38.5210448Z } {tt.flatten, tt.warp_specialize} 
2026-02-21T08:22:38.5210671Z tt.return 2026-02-21T08:22:38.5210798Z } 2026-02-21T08:22:38.5210911Z } 2026-02-21T08:22:38.5210985Z 2026-02-21T08:22:38.5211032Z {-# 2026-02-21T08:22:38.5211154Z external_resources: { 2026-02-21T08:22:38.5211309Z mlir_reproducer: { 2026-02-21T08:22:38.5215654Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 
region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:22:38.5220307Z disable_threading: false, 2026-02-21T08:22:38.5220480Z verify_each: true 2026-02-21T08:22:38.5220623Z } 2026-02-21T08:22:38.5220748Z } 2026-02-21T08:22:38.5220860Z #-} 2026-02-21T08:22:38.5221293Z /tmp/torchinductor_root/5y/c5yk235kd52a3fiyoi3qqhw5kqfyvszntwziwlt5uvp3ycrgotdh.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:22:38.5222556Z /tmp/torchinductor_root/5y/c5yk235kd52a3fiyoi3qqhw5kqfyvszntwziwlt5uvp3ycrgotdh.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:22:38.5223558Z [66s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:22:38.5224661Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', ''], maxnreg=32, num_sm_multiplier=4, num_stages=5, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:22:38.5225620Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:22:38.5225864Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:22:39.2821589Z module { 2026-02-21T08:22:39.2823024Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:22:39.2823617Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:22:39.2823844Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:22:39.2824123Z %cst = arith.constant dense<0.000000e+00> : tensor<1024x8xf32> 2026-02-21T08:22:39.2824757Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:22:39.2824981Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:22:39.2825184Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:22:39.2825397Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:22:39.2825772Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:22:39.2826282Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:22:39.2826624Z %2 = tt.get_program_id x : i32 2026-02-21T08:22:39.2826822Z %3 = arith.muli %2, %c1024_i32 : i32 2026-02-21T08:22:39.2827069Z %4 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T08:22:39.2827357Z %5 = tt.splat %3 : i32 -> tensor<1024xi32> 2026-02-21T08:22:39.2827675Z %6 = arith.addi %5, %4 : tensor<1024xi32> 2026-02-21T08:22:39.2827997Z %7 = scf.for %arg5 = %c0_i32 to %c4096_i32 step %c8_i32 iter_args(%arg6 = %cst) -> (tensor<1024x8xf32>) : i32 { 2026-02-21T08:22:39.2828390Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc> -> tensor<1024x8xf32> 2026-02-21T08:22:39.2828757Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc> -> tensor<1024x8xf32> 2026-02-21T08:22:39.2829037Z %13 = scf.if %arg3 -> (tensor<1024x8xf32>) { 2026-02-21T08:22:39.2829409Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1024x8xf32>) -> tensor<1024x8xf32> 
2026-02-21T08:22:39.2829784Z %16 = arith.subf %12, %11 : tensor<1024x8xf32> 2026-02-21T08:22:39.2829983Z %17 = arith.mulf %15, %16 : tensor<1024x8xf32> 2026-02-21T08:22:39.2830192Z %18 = arith.addf %17, %cst : tensor<1024x8xf32> 2026-02-21T08:22:39.2830383Z scf.yield %18 : tensor<1024x8xf32> 2026-02-21T08:22:39.2830557Z } else { 2026-02-21T08:22:39.2830710Z %15 = tt.splat %arg4 : f32 -> tensor<1024x8xf32> 2026-02-21T08:22:39.2830930Z %16 = arith.cmpf ogt, %12, %15 : tensor<1024x8xf32> 2026-02-21T08:22:39.2831141Z %17 = arith.cmpf une, %12, %12 : tensor<1024x8xf32> 2026-02-21T08:22:39.2831357Z %18 = arith.ori %16, %17 : tensor<1024x8xi1> 2026-02-21T08:22:39.2831596Z %19 = arith.select %18, %12, %15 : tensor<1024x8xi1>, tensor<1024x8xf32> 2026-02-21T08:22:39.2831833Z %20 = math.log %19 : tensor<1024x8xf32> 2026-02-21T08:22:39.2832076Z %21 = arith.subf %20, %11 : tensor<1024x8xf32> 2026-02-21T08:22:39.2832269Z %22 = arith.mulf %12, %21 : tensor<1024x8xf32> 2026-02-21T08:22:39.2832474Z %23 = arith.addf %22, %cst : tensor<1024x8xf32> 2026-02-21T08:22:39.2832664Z scf.yield %23 : tensor<1024x8xf32> 2026-02-21T08:22:39.2832835Z } 2026-02-21T08:22:39.2832983Z %14 = arith.addf %arg6, %13 : tensor<1024x8xf32> 2026-02-21T08:22:39.2833174Z scf.yield %14 : tensor<1024x8xf32> 2026-02-21T08:22:39.2833430Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:22:39.2833689Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:22:39.2833878Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:22:39.2834046Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:22:39.2834233Z tt.reduce.return %11 : f32 2026-02-21T08:22:39.2834412Z }) : (tensor<1024x8xf32>) -> tensor<1024xf32> 2026-02-21T08:22:39.2834646Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<1024x!tt.ptr> 2026-02-21T08:22:39.2834917Z %10 = tt.addptr %9, %6 : tensor<1024x!tt.ptr>, tensor<1024xi32> 2026-02-21T08:22:39.2835153Z tt.store %10, %8 : tensor<1024x!tt.ptr> 2026-02-21T08:22:39.2835345Z tt.return 
2026-02-21T08:22:39.2835467Z } 2026-02-21T08:22:39.2835586Z } 2026-02-21T08:22:39.2835652Z 2026-02-21T08:22:39.2835699Z {-# 2026-02-21T08:22:39.2835830Z external_resources: { 2026-02-21T08:22:39.2836062Z mlir_reproducer: { 2026-02-21T08:22:39.2840399Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal 
test-convergence=false top-down=true})", 2026-02-21T08:22:39.2844760Z disable_threading: false, 2026-02-21T08:22:39.2844925Z verify_each: true 2026-02-21T08:22:39.2845073Z } 2026-02-21T08:22:39.2845187Z } 2026-02-21T08:22:39.2845311Z #-} 2026-02-21T08:22:39.2845743Z /tmp/torchinductor_root/xg/cxgm3kagergn5wzsdsmbbpk6fkefzss35oywbsrjzcxa47fbxywa.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:22:39.2846984Z /tmp/torchinductor_root/xg/cxgm3kagergn5wzsdsmbbpk6fkefzss35oywbsrjzcxa47fbxywa.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:22:39.2847981Z [67s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:22:39.2849003Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=4, num_warps=32, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:22:39.2849902Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:22:39.2850156Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:22:42.0238511Z module attributes {ttg.maxnreg = 64 : i32} { 2026-02-21T08:22:42.0242849Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:22:42.0244380Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:22:42.0244611Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:22:42.0244797Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:22:42.0244987Z %c592_i32 = arith.constant 592 : i32 2026-02-21T08:22:42.0245212Z %cst = arith.constant dense<0.000000e+00> : tensor<8x64xf32> 2026-02-21T08:22:42.0245477Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:22:42.0245985Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:22:42.0246170Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:22:42.0246362Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:22:42.0246674Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:22:42.0247104Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:22:42.0247406Z %2 = tt.get_program_id x : i32 2026-02-21T08:22:42.0247616Z scf.for %arg5 = %2 to %c512_i32 step %c592_i32 : i32 { 2026-02-21T08:22:42.0247832Z %3 = arith.muli %arg5, %c8_i32 : i32 2026-02-21T08:22:42.0248050Z %4 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:22:42.0248301Z %5 = tt.splat %3 : i32 -> tensor<8xi32> 2026-02-21T08:22:42.0248579Z %6 = arith.addi %5, %4 : tensor<8xi32> 2026-02-21T08:22:42.0248774Z %c4032_i32 = arith.constant 4032 : i32 2026-02-21T08:22:42.0248952Z %c192_i32 = arith.constant 192 : i32 2026-02-21T08:22:42.0249257Z %7 = scf.for %arg6 = %c0_i32 to %c4032_i32 step %c192_i32 iter_args(%arg7 = %cst) -> (tensor<8x64xf32>) : i32 { 2026-02-21T08:22:42.0249658Z %15 = tt.descriptor_load %0[%3, %arg6] : !tt.tensordesc> -> 
tensor<8x64xf32> 2026-02-21T08:22:42.0250011Z %16 = tt.descriptor_load %1[%3, %arg6] : !tt.tensordesc> -> tensor<8x64xf32> 2026-02-21T08:22:42.0250296Z %17 = scf.if %arg3 -> (tensor<8x64xf32>) { 2026-02-21T08:22:42.0250656Z %31 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x64xf32>) -> tensor<8x64xf32> 2026-02-21T08:22:42.0251020Z %32 = arith.subf %16, %15 : tensor<8x64xf32> 2026-02-21T08:22:42.0251219Z %33 = arith.mulf %31, %32 : tensor<8x64xf32> 2026-02-21T08:22:42.0251463Z %34 = arith.addf %33, %cst : tensor<8x64xf32> 2026-02-21T08:22:42.0251666Z scf.yield %34 : tensor<8x64xf32> 2026-02-21T08:22:42.0251830Z } else { 2026-02-21T08:22:42.0252090Z %31 = tt.splat %arg4 : f32 -> tensor<8x64xf32> 2026-02-21T08:22:42.0252313Z %32 = arith.cmpf ogt, %16, %31 : tensor<8x64xf32> 2026-02-21T08:22:42.0252526Z %33 = arith.cmpf une, %16, %16 : tensor<8x64xf32> 2026-02-21T08:22:42.0252738Z %34 = arith.ori %32, %33 : tensor<8x64xi1> 2026-02-21T08:22:42.0252972Z %35 = arith.select %34, %16, %31 : tensor<8x64xi1>, tensor<8x64xf32> 2026-02-21T08:22:42.0253216Z %36 = math.log %35 : tensor<8x64xf32> 2026-02-21T08:22:42.0253406Z %37 = arith.subf %36, %15 : tensor<8x64xf32> 2026-02-21T08:22:42.0253606Z %38 = arith.mulf %16, %37 : tensor<8x64xf32> 2026-02-21T08:22:42.0253811Z %39 = arith.addf %38, %cst : tensor<8x64xf32> 2026-02-21T08:22:42.0253999Z scf.yield %39 : tensor<8x64xf32> 2026-02-21T08:22:42.0254174Z } 2026-02-21T08:22:42.0254313Z %18 = arith.addf %arg7, %17 : tensor<8x64xf32> 2026-02-21T08:22:42.0254506Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:22:42.0254685Z %19 = arith.muli %c64_i32, %c1_i32 : i32 2026-02-21T08:22:42.0254870Z %20 = arith.addi %arg6, %19 : i32 2026-02-21T08:22:42.0255126Z %21 = tt.descriptor_load %0[%3, %20] : !tt.tensordesc> -> tensor<8x64xf32> 2026-02-21T08:22:42.0255470Z %22 = tt.descriptor_load %1[%3, %20] : !tt.tensordesc> -> tensor<8x64xf32> 2026-02-21T08:22:42.0255745Z %23 = scf.if %arg3 
-> (tensor<8x64xf32>) { 2026-02-21T08:22:42.0256094Z %31 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x64xf32>) -> tensor<8x64xf32> 2026-02-21T08:22:42.0256446Z %32 = arith.subf %22, %21 : tensor<8x64xf32> 2026-02-21T08:22:42.0256643Z %33 = arith.mulf %31, %32 : tensor<8x64xf32> 2026-02-21T08:22:42.0256964Z %34 = arith.addf %33, %cst : tensor<8x64xf32> 2026-02-21T08:22:42.0257160Z scf.yield %34 : tensor<8x64xf32> 2026-02-21T08:22:42.0257328Z } else { 2026-02-21T08:22:42.0257495Z %31 = tt.splat %arg4 : f32 -> tensor<8x64xf32> 2026-02-21T08:22:42.0257712Z %32 = arith.cmpf ogt, %22, %31 : tensor<8x64xf32> 2026-02-21T08:22:42.0257936Z %33 = arith.cmpf une, %22, %22 : tensor<8x64xf32> 2026-02-21T08:22:42.0258140Z %34 = arith.ori %32, %33 : tensor<8x64xi1> 2026-02-21T08:22:42.0258386Z %35 = arith.select %34, %22, %31 : tensor<8x64xi1>, tensor<8x64xf32> 2026-02-21T08:22:42.0258634Z %36 = math.log %35 : tensor<8x64xf32> 2026-02-21T08:22:42.0258825Z %37 = arith.subf %36, %21 : tensor<8x64xf32> 2026-02-21T08:22:42.0259023Z %38 = arith.mulf %22, %37 : tensor<8x64xf32> 2026-02-21T08:22:42.0259218Z %39 = arith.addf %38, %cst : tensor<8x64xf32> 2026-02-21T08:22:42.0259502Z scf.yield %39 : tensor<8x64xf32> 2026-02-21T08:22:42.0259671Z } 2026-02-21T08:22:42.0259815Z %24 = arith.addf %18, %23 : tensor<8x64xf32> 2026-02-21T08:22:42.0259998Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:22:42.0260185Z %25 = arith.muli %c64_i32, %c2_i32 : i32 2026-02-21T08:22:42.0260371Z %26 = arith.addi %arg6, %25 : i32 2026-02-21T08:22:42.0260625Z %27 = tt.descriptor_load %0[%3, %26] : !tt.tensordesc> -> tensor<8x64xf32> 2026-02-21T08:22:42.0260969Z %28 = tt.descriptor_load %1[%3, %26] : !tt.tensordesc> -> tensor<8x64xf32> 2026-02-21T08:22:42.0261235Z %29 = scf.if %arg3 -> (tensor<8x64xf32>) { 2026-02-21T08:22:42.0261589Z %31 = tt.extern_elementwise %28 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x64xf32>) -> 
tensor<8x64xf32> 2026-02-21T08:22:42.0261972Z %32 = arith.subf %28, %27 : tensor<8x64xf32> 2026-02-21T08:22:42.0262176Z %33 = arith.mulf %31, %32 : tensor<8x64xf32> 2026-02-21T08:22:42.0262382Z %34 = arith.addf %33, %cst : tensor<8x64xf32> 2026-02-21T08:22:42.0262570Z scf.yield %34 : tensor<8x64xf32> 2026-02-21T08:22:42.0262739Z } else { 2026-02-21T08:22:42.0262894Z %31 = tt.splat %arg4 : f32 -> tensor<8x64xf32> 2026-02-21T08:22:42.0263109Z %32 = arith.cmpf ogt, %28, %31 : tensor<8x64xf32> 2026-02-21T08:22:42.0263319Z %33 = arith.cmpf une, %28, %28 : tensor<8x64xf32> 2026-02-21T08:22:42.0263525Z %34 = arith.ori %32, %33 : tensor<8x64xi1> 2026-02-21T08:22:42.0263758Z %35 = arith.select %34, %28, %31 : tensor<8x64xi1>, tensor<8x64xf32> 2026-02-21T08:22:42.0263992Z %36 = math.log %35 : tensor<8x64xf32> 2026-02-21T08:22:42.0264187Z %37 = arith.subf %36, %27 : tensor<8x64xf32> 2026-02-21T08:22:42.0264379Z %38 = arith.mulf %28, %37 : tensor<8x64xf32> 2026-02-21T08:22:42.0264584Z %39 = arith.addf %38, %cst : tensor<8x64xf32> 2026-02-21T08:22:42.0264771Z scf.yield %39 : tensor<8x64xf32> 2026-02-21T08:22:42.0264938Z } 2026-02-21T08:22:42.0265073Z %30 = arith.addf %24, %29 : tensor<8x64xf32> 2026-02-21T08:22:42.0265261Z scf.yield %30 : tensor<8x64xf32> 2026-02-21T08:22:42.0265443Z } {tt.num_stages = 1 : i32} 2026-02-21T08:22:42.0265705Z %8 = tt.descriptor_load %0[%3, %c4032_i32] : !tt.tensordesc> -> tensor<8x64xf32> 2026-02-21T08:22:42.0266070Z %9 = tt.descriptor_load %1[%3, %c4032_i32] : !tt.tensordesc> -> tensor<8x64xf32> 2026-02-21T08:22:42.0266350Z %10 = scf.if %arg3 -> (tensor<8x64xf32>) { 2026-02-21T08:22:42.0266720Z %15 = tt.extern_elementwise %9 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x64xf32>) -> tensor<8x64xf32> 2026-02-21T08:22:42.0267089Z %16 = arith.subf %9, %8 : tensor<8x64xf32> 2026-02-21T08:22:42.0267293Z %17 = arith.mulf %15, %16 : tensor<8x64xf32> 2026-02-21T08:22:42.0267513Z %18 = arith.addf %17, %cst : tensor<8x64xf32> 
2026-02-21T08:22:42.0267782Z scf.yield %18 : tensor<8x64xf32> 2026-02-21T08:22:42.0267959Z } else { 2026-02-21T08:22:42.0268112Z %15 = tt.splat %arg4 : f32 -> tensor<8x64xf32> 2026-02-21T08:22:42.0268337Z %16 = arith.cmpf ogt, %9, %15 : tensor<8x64xf32> 2026-02-21T08:22:42.0268551Z %17 = arith.cmpf une, %9, %9 : tensor<8x64xf32> 2026-02-21T08:22:42.0268766Z %18 = arith.ori %16, %17 : tensor<8x64xi1> 2026-02-21T08:22:42.0269014Z %19 = arith.select %18, %9, %15 : tensor<8x64xi1>, tensor<8x64xf32> 2026-02-21T08:22:42.0269253Z %20 = math.log %19 : tensor<8x64xf32> 2026-02-21T08:22:42.0269466Z %21 = arith.subf %20, %8 : tensor<8x64xf32> 2026-02-21T08:22:42.0269673Z %22 = arith.mulf %9, %21 : tensor<8x64xf32> 2026-02-21T08:22:42.0269897Z %23 = arith.addf %22, %cst : tensor<8x64xf32> 2026-02-21T08:22:42.0270093Z scf.yield %23 : tensor<8x64xf32> 2026-02-21T08:22:42.0270339Z } 2026-02-21T08:22:42.0270491Z %11 = arith.addf %7, %10 : tensor<8x64xf32> 2026-02-21T08:22:42.0270688Z %12 = "tt.reduce"(%11) <{axis = 1 : i32}> ({ 2026-02-21T08:22:42.0270886Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:22:42.0271065Z %15 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:22:42.0271281Z tt.reduce.return %15 : f32 2026-02-21T08:22:42.0271474Z }) : (tensor<8x64xf32>) -> tensor<8xf32> 2026-02-21T08:22:42.0271706Z %13 = tt.splat %arg2 : !tt.ptr -> tensor<8x!tt.ptr> 2026-02-21T08:22:42.0272001Z %14 = tt.addptr %13, %6 : tensor<8x!tt.ptr>, tensor<8xi32> 2026-02-21T08:22:42.0272239Z tt.store %14, %12 : tensor<8x!tt.ptr> 2026-02-21T08:22:42.0272441Z } {tt.flatten, tt.warp_specialize} 2026-02-21T08:22:42.0272621Z tt.return 2026-02-21T08:22:42.0272747Z } 2026-02-21T08:22:42.0272872Z } 2026-02-21T08:22:42.0272941Z 2026-02-21T08:22:42.0272992Z {-# 2026-02-21T08:22:42.0273131Z external_resources: { 2026-02-21T08:22:42.0273290Z mlir_reproducer: { 2026-02-21T08:22:42.0277570Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 
threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:22:42.0281958Z disable_threading: false, 2026-02-21T08:22:42.0282122Z verify_each: true 2026-02-21T08:22:42.0282289Z } 2026-02-21T08:22:42.0282422Z } 2026-02-21T08:22:42.0282648Z #-} 2026-02-21T08:22:42.0283153Z /tmp/torchinductor_root/2f/c2ftjbl75eyrgfb27icab6zcwzvurrffx5xrfg7hnekeoijqcnby.py:21:0: error: Failures have been detected 
while processing an MLIR pass pipeline 2026-02-21T08:22:42.0284496Z /tmp/torchinductor_root/2f/c2ftjbl75eyrgfb27icab6zcwzvurrffx5xrfg7hnekeoijqcnby.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:22:42.0285528Z [70s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:22:42.0286799Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 8], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'last'], maxnreg=64, num_sm_multiplier=4, num_stages=7, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:22:42.0287858Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:22:42.0288126Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:22:42.0288725Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 15.2 configs/s 2026-02-21T08:22:42.0289095Z [70s] Adaptive compile timeout: 30s (90% percentile=24.9s, bounds=[30.0s, 60s]) 2026-02-21T08:22:42.6896593Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1510.5 configs/s 2026-02-21T08:22:42.7343628Z [70s] Initial random population of 100, 5 starting points: 2026-02-21T08:22:42.7345340Z error=15 2026-02-21T08:22:42.7345520Z timeout=4 2026-02-21T08:22:42.7345677Z ok=81 2026-02-21T08:22:42.7345821Z min=0.0471 2026-02-21T08:22:42.7345976Z mid=0.6176 2026-02-21T08:22:42.7346119Z max=37.2368 2026-02-21T08:22:42.7346341Z best={'block_sizes': [2048, 2], 2026-02-21T08:22:42.7346598Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:22:42.7346840Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:22:42.7347060Z 'num_stages': 5, 2026-02-21T08:22:42.7347215Z 'num_warps': 4, 2026-02-21T08:22:42.7347388Z 'pid_type': 'flat', 2026-02-21T08:22:42.7347562Z 'range_flattens': [None, None], 2026-02-21T08:22:42.7347747Z 'range_multi_buffers': [None, None], 2026-02-21T08:22:42.7347946Z 'range_num_stages': [0, 3], 2026-02-21T08:22:42.7348123Z 'range_unroll_factors': [0, 1], 2026-02-21T08:22:42.7348322Z 'range_warp_specializes': [None, True]} 2026-02-21T08:22:42.7359051Z [70s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:22:43.8964203Z [72s] Generation 1 starting: 87 neighbors, 5 active search path(s) 2026-02-21T08:22:48.8908961Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 11.7 configs/s 2026-02-21T08:22:54.5204447Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 16.5 configs/s 2026-02-21T08:23:00.9891489Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 159.0 2026-02-21T08:23:00.9893336Z configs/s 2026-02-21T08:23:01.3797600Z [89s] Generation 1 complete: 2026-02-21T08:23:01.3799154Z error=1 2026-02-21T08:23:01.3799309Z ok=92 
2026-02-21T08:23:01.3799434Z min=0.0441 2026-02-21T08:23:01.3799563Z mid=0.0562 2026-02-21T08:23:01.3799677Z max=0.2325 2026-02-21T08:23:01.3799820Z best={'block_sizes': [2048, 2], 2026-02-21T08:23:01.3800059Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:23:01.3800323Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:23:01.3800506Z 'num_stages': 7, 2026-02-21T08:23:01.3800650Z 'num_warps': 32, 2026-02-21T08:23:01.3800788Z 'pid_type': 'flat', 2026-02-21T08:23:01.3800938Z 'range_flattens': [None, True], 2026-02-21T08:23:01.3801115Z 'range_multi_buffers': [None, None], 2026-02-21T08:23:01.3801328Z 'range_num_stages': [0, 1], 2026-02-21T08:23:01.3802103Z 'range_unroll_factors': [0, 3], 2026-02-21T08:23:01.3802284Z 'range_warp_specializes': [None, False]} 2026-02-21T08:23:01.3810431Z [89s] Fitting surrogate: 193 points, 193 targets 2026-02-21T08:23:02.4552061Z [90s] Generation 2 starting: 71 neighbors, 5 active search path(s) 2026-02-21T08:23:05.4714568Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 27.0 configs/s 2026-02-21T08:23:06.3911684Z module { 2026-02-21T08:23:06.3916573Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:23:06.3921086Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:23:06.3926038Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:23:06.3931066Z %cst = arith.constant dense<0.000000e+00> : tensor<4x256xf32> 2026-02-21T08:23:06.3935814Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:23:06.3937424Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:23:06.3937647Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:23:06.3937834Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:23:06.3938145Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 
2026-02-21T08:23:06.3938579Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:23:06.3938892Z %2 = tt.get_program_id x : i32 2026-02-21T08:23:06.3939066Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T08:23:06.3939287Z %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:23:06.3939517Z %5 = tt.splat %3 : i32 -> tensor<4xi32> 2026-02-21T08:23:06.3939707Z %6 = arith.addi %5, %4 : tensor<4xi32> 2026-02-21T08:23:06.3940014Z %7 = scf.for %arg5 = %c0_i32 to %c4096_i32 step %c256_i32 iter_args(%arg6 = %cst) -> (tensor<4x256xf32>) : i32 { 2026-02-21T08:23:06.3940420Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc> -> tensor<4x256xf32> 2026-02-21T08:23:06.3940779Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc> -> tensor<4x256xf32> 2026-02-21T08:23:06.3941057Z %13 = scf.if %arg3 -> (tensor<4x256xf32>) { 2026-02-21T08:23:06.3941420Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x256xf32>) -> tensor<4x256xf32> 2026-02-21T08:23:06.3941780Z %16 = arith.subf %12, %11 : tensor<4x256xf32> 2026-02-21T08:23:06.3942077Z %17 = arith.mulf %15, %16 : tensor<4x256xf32> 2026-02-21T08:23:06.3942284Z %18 = arith.addf %17, %cst : tensor<4x256xf32> 2026-02-21T08:23:06.3942473Z scf.yield %18 : tensor<4x256xf32> 2026-02-21T08:23:06.3942647Z } else { 2026-02-21T08:23:06.3942804Z %15 = tt.splat %arg4 : f32 -> tensor<4x256xf32> 2026-02-21T08:23:06.3943034Z %16 = arith.cmpf ogt, %12, %15 : tensor<4x256xf32> 2026-02-21T08:23:06.3943246Z %17 = arith.cmpf une, %12, %12 : tensor<4x256xf32> 2026-02-21T08:23:06.3943472Z %18 = arith.ori %16, %17 : tensor<4x256xi1> 2026-02-21T08:23:06.3943710Z %19 = arith.select %18, %12, %15 : tensor<4x256xi1>, tensor<4x256xf32> 2026-02-21T08:23:06.3943943Z %20 = math.log %19 : tensor<4x256xf32> 2026-02-21T08:23:06.3944140Z %21 = arith.subf %20, %11 : tensor<4x256xf32> 2026-02-21T08:23:06.3944329Z %22 = 
arith.mulf %12, %21 : tensor<4x256xf32> 2026-02-21T08:23:06.3944530Z %23 = arith.addf %22, %cst : tensor<4x256xf32> 2026-02-21T08:23:06.3944725Z scf.yield %23 : tensor<4x256xf32> 2026-02-21T08:23:06.3944899Z } 2026-02-21T08:23:06.3945050Z %14 = arith.addf %arg6, %13 : tensor<4x256xf32> 2026-02-21T08:23:06.3945236Z scf.yield %14 : tensor<4x256xf32> 2026-02-21T08:23:06.3945489Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:23:06.3946044Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:23:06.3946230Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:23:06.3946398Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:23:06.3946585Z tt.reduce.return %11 : f32 2026-02-21T08:23:06.3946759Z }) : (tensor<4x256xf32>) -> tensor<4xf32> 2026-02-21T08:23:06.3946986Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:23:06.3947246Z %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:23:06.3947467Z tt.store %10, %8 : tensor<4x!tt.ptr> 2026-02-21T08:23:06.3947659Z tt.return 2026-02-21T08:23:06.3947777Z } 2026-02-21T08:23:06.3947899Z } 2026-02-21T08:23:06.3947963Z 2026-02-21T08:23:06.3948011Z {-# 2026-02-21T08:23:06.3948137Z external_resources: { 2026-02-21T08:23:06.3948291Z mlir_reproducer: { 2026-02-21T08:23:06.3952556Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, 
tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:23:06.3956898Z disable_threading: false, 2026-02-21T08:23:06.3957067Z verify_each: true 2026-02-21T08:23:06.3957207Z } 2026-02-21T08:23:06.3957327Z } 2026-02-21T08:23:06.3957435Z #-} 2026-02-21T08:23:06.3957874Z /tmp/torchinductor_root/55/c55vipmor3vwbduj4eavmyyhpqhttt56s25htj5r6oae4vfksynf.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:23:06.3959197Z /tmp/torchinductor_root/55/c55vipmor3vwbduj4eavmyyhpqhttt56s25htj5r6oae4vfksynf.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:23:06.3960249Z [94s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:23:06.3961337Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'first'], num_stages=6, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:23:06.3962320Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:23:06.3962645Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:23:06.5902045Z module { 2026-02-21T08:23:06.5906370Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:23:06.5907643Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:23:06.5907874Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:23:06.5910950Z %cst = arith.constant dense<0.000000e+00> : tensor<4x1024xf32> 2026-02-21T08:23:06.5911262Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:23:06.5916040Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:23:06.5919276Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T08:23:06.5921337Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:23:06.5922003Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:23:06.5922481Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c4096_i32], [%c4096_i64, %c1_i64] : , > 2026-02-21T08:23:06.5922795Z %2 = tt.get_program_id x : i32 2026-02-21T08:23:06.5922979Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T08:23:06.5923208Z %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:23:06.5923440Z %5 = tt.splat %3 : i32 -> tensor<4xi32> 2026-02-21T08:23:06.5923631Z %6 = arith.addi %5, %4 : tensor<4xi32> 
2026-02-21T08:23:06.5923934Z %7 = scf.for %arg5 = %c0_i32 to %c4096_i32 step %c1024_i32 iter_args(%arg6 = %cst) -> (tensor<4x1024xf32>) : i32 { 2026-02-21T08:23:06.5924338Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc> -> tensor<4x1024xf32> 2026-02-21T08:23:06.5924710Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc> -> tensor<4x1024xf32> 2026-02-21T08:23:06.5924991Z %13 = scf.if %arg3 -> (tensor<4x1024xf32>) { 2026-02-21T08:23:06.5925364Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32> 2026-02-21T08:23:06.5925726Z %16 = arith.subf %12, %11 : tensor<4x1024xf32> 2026-02-21T08:23:06.5925931Z %17 = arith.mulf %15, %16 : tensor<4x1024xf32> 2026-02-21T08:23:06.5926131Z %18 = arith.addf %17, %cst : tensor<4x1024xf32> 2026-02-21T08:23:06.5926331Z scf.yield %18 : tensor<4x1024xf32> 2026-02-21T08:23:06.5926501Z } else { 2026-02-21T08:23:06.5926658Z %15 = tt.splat %arg4 : f32 -> tensor<4x1024xf32> 2026-02-21T08:23:06.5926878Z %16 = arith.cmpf ogt, %12, %15 : tensor<4x1024xf32> 2026-02-21T08:23:06.5927090Z %17 = arith.cmpf une, %12, %12 : tensor<4x1024xf32> 2026-02-21T08:23:06.5927302Z %18 = arith.ori %16, %17 : tensor<4x1024xi1> 2026-02-21T08:23:06.5927540Z %19 = arith.select %18, %12, %15 : tensor<4x1024xi1>, tensor<4x1024xf32> 2026-02-21T08:23:06.5927786Z %20 = math.log %19 : tensor<4x1024xf32> 2026-02-21T08:23:06.5927981Z %21 = arith.subf %20, %11 : tensor<4x1024xf32> 2026-02-21T08:23:06.5928171Z %22 = arith.mulf %12, %21 : tensor<4x1024xf32> 2026-02-21T08:23:06.5928395Z %23 = arith.addf %22, %cst : tensor<4x1024xf32> 2026-02-21T08:23:06.5928591Z scf.yield %23 : tensor<4x1024xf32> 2026-02-21T08:23:06.5928760Z } 2026-02-21T08:23:06.5928922Z %14 = arith.addf %arg6, %13 : tensor<4x1024xf32> 2026-02-21T08:23:06.5929114Z scf.yield %14 : tensor<4x1024xf32> 2026-02-21T08:23:06.5929360Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32, tt.warp_specialize} 
2026-02-21T08:23:06.5929631Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:23:06.5929809Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:23:06.5929985Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:23:06.5930172Z tt.reduce.return %11 : f32 2026-02-21T08:23:06.5930450Z }) : (tensor<4x1024xf32>) -> tensor<4xf32> 2026-02-21T08:23:06.5930673Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:23:06.5930916Z %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:23:06.5931146Z tt.store %10, %8 : tensor<4x!tt.ptr> 2026-02-21T08:23:06.5931317Z tt.return 2026-02-21T08:23:06.5931444Z } 2026-02-21T08:23:06.5931558Z } 2026-02-21T08:23:06.5931632Z 2026-02-21T08:23:06.5931680Z {-# 2026-02-21T08:23:06.5931807Z external_resources: { 2026-02-21T08:23:06.5932001Z mlir_reproducer: { 2026-02-21T08:23:06.5936352Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal 
test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:23:06.5940644Z disable_threading: false, 2026-02-21T08:23:06.5940807Z verify_each: true 2026-02-21T08:23:06.5940942Z } 2026-02-21T08:23:06.5941062Z } 2026-02-21T08:23:06.5941168Z #-} 2026-02-21T08:23:06.5941569Z /tmp/torchinductor_root/h2/ch2ffeze3ynkvw7chal4c5bahlgouvx7edgegll52ryqk4xw6kh7.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:23:06.5942793Z /tmp/torchinductor_root/h2/ch2ffeze3ynkvw7chal4c5bahlgouvx7edgegll52ryqk4xw6kh7.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:23:06.5943756Z [94s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:23:06.5944707Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'first'], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:23:06.5945561Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:23:06.5945804Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:23:09.6154394Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 71/71 17.3 configs/s 2026-02-21T08:23:15.7767856Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 165.5 2026-02-21T08:23:15.7768642Z configs/s 2026-02-21T08:23:16.1488227Z [104s] Generation 2 complete: 2026-02-21T08:23:16.1493182Z error=2 2026-02-21T08:23:16.1494525Z ok=74 2026-02-21T08:23:16.1494688Z min=0.0439 2026-02-21T08:23:16.1494814Z mid=0.0480 2026-02-21T08:23:16.1494939Z max=0.1137 2026-02-21T08:23:16.1495070Z best={'block_sizes': [256, 1], 2026-02-21T08:23:16.1495335Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:23:16.1495598Z 'load_eviction_policies': ['', 'first'], 2026-02-21T08:23:16.1495779Z 'num_stages': 5, 2026-02-21T08:23:16.1495916Z 'num_warps': 1, 2026-02-21T08:23:16.1496064Z 'pid_type': 'flat', 2026-02-21T08:23:16.1496224Z 'range_flattens': [None, False], 2026-02-21T08:23:16.1496398Z 'range_multi_buffers': [None, False], 2026-02-21T08:23:16.1496581Z 'range_num_stages': [0, 1], 2026-02-21T08:23:16.1497046Z 'range_unroll_factors': [0, 1], 2026-02-21T08:23:16.1497267Z 'range_warp_specializes': [None, False]} 2026-02-21T08:23:16.1504214Z [104s] Fitting surrogate: 269 points, 269 targets 2026-02-21T08:23:17.3596444Z [105s] Generation 3 starting: 68 neighbors, 5 active search path(s) 2026-02-21T08:23:20.0443414Z Generation 
3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 48.9 configs/s 2026-02-21T08:23:24.0639764Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 17.4 configs/s 2026-02-21T08:23:30.3983306Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 165.0 2026-02-21T08:23:30.3983827Z configs/s 2026-02-21T08:23:30.8063304Z [119s] Generation 3 complete: 2026-02-21T08:23:30.8067523Z error=1 2026-02-21T08:23:30.8067755Z ok=72 2026-02-21T08:23:30.8067899Z min=0.0419 2026-02-21T08:23:30.8068024Z mid=0.0461 2026-02-21T08:23:30.8068197Z max=0.0891 2026-02-21T08:23:30.8068383Z best={'block_sizes': [256, 1], 2026-02-21T08:23:30.8069062Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:23:30.8069380Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:23:30.8069574Z 'num_stages': 5, 2026-02-21T08:23:30.8072673Z 'num_warps': 1, 2026-02-21T08:23:30.8072913Z 'pid_type': 'flat', 2026-02-21T08:23:30.8073120Z 'range_flattens': [None, False], 2026-02-21T08:23:30.8073337Z 'range_multi_buffers': [None, False], 2026-02-21T08:23:30.8073544Z 'range_num_stages': [0, 1], 2026-02-21T08:23:30.8073740Z 'range_unroll_factors': [0, 1], 2026-02-21T08:23:30.8073924Z 'range_warp_specializes': [None, False]} 2026-02-21T08:23:30.8075338Z [119s] Fitting surrogate: 342 points, 342 targets 2026-02-21T08:23:31.9201173Z [120s] Generation 4 starting: 64 neighbors, 5 active search path(s) 2026-02-21T08:23:34.4654667Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 48.4 configs/s 2026-02-21T08:23:38.2996783Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 17.1 configs/s 2026-02-21T08:23:44.0829286Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 175.4 2026-02-21T08:23:44.0829620Z configs/s 2026-02-21T08:23:44.4997514Z [132s] Generation 4 complete: 2026-02-21T08:23:44.4999211Z ok=70 2026-02-21T08:23:44.4999386Z min=0.0419 2026-02-21T08:23:44.4999517Z mid=0.0460 
2026-02-21T08:23:44.4999649Z max=0.2673 2026-02-21T08:23:44.4999787Z best={'block_sizes': [256, 1], 2026-02-21T08:23:44.5000057Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:23:44.5000334Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:23:44.5000524Z 'num_stages': 5, 2026-02-21T08:23:44.5000666Z 'num_warps': 1, 2026-02-21T08:23:44.5000803Z 'pid_type': 'flat', 2026-02-21T08:23:44.5000962Z 'range_flattens': [None, False], 2026-02-21T08:23:44.5001134Z 'range_multi_buffers': [None, False], 2026-02-21T08:23:44.5001356Z 'range_num_stages': [0, 1], 2026-02-21T08:23:44.5002263Z 'range_unroll_factors': [0, 1], 2026-02-21T08:23:44.5002446Z 'range_warp_specializes': [None, False]} 2026-02-21T08:23:44.5025592Z [132s] Fitting surrogate: 412 points, 412 targets 2026-02-21T08:23:45.5938998Z [133s] Generation 5 starting: 39 neighbors, 3 active search path(s) 2026-02-21T08:23:47.3373944Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 61.4 configs/s 2026-02-21T08:23:49.6600338Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 39/39 17.1 configs/s 2026-02-21T08:23:53.3166462Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 280.8 2026-02-21T08:23:53.3170302Z configs/s 2026-02-21T08:23:53.5229371Z [141s] Generation 5 complete: 2026-02-21T08:23:53.5229754Z ok=43 2026-02-21T08:23:53.5230002Z min=0.0419 2026-02-21T08:23:53.5230159Z mid=0.0460 2026-02-21T08:23:53.5230300Z max=0.0624 2026-02-21T08:23:53.5230824Z best={'block_sizes': [256, 2], 2026-02-21T08:23:53.5231116Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:23:53.5231380Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:23:53.5231587Z 'num_stages': 5, 2026-02-21T08:23:53.5231730Z 'num_warps': 4, 2026-02-21T08:23:53.5232081Z 'pid_type': 'flat', 2026-02-21T08:23:53.5232254Z 'range_flattens': [None, False], 2026-02-21T08:23:53.5232430Z 'range_multi_buffers': [None, True], 
2026-02-21T08:23:53.5232615Z 'range_num_stages': [0, 1], 2026-02-21T08:23:53.5232773Z 'range_unroll_factors': [0, 0], 2026-02-21T08:23:53.5232954Z 'range_warp_specializes': [None, False]} 2026-02-21T08:23:53.5247198Z [141s] Fitting surrogate: 455 points, 455 targets 2026-02-21T08:23:54.0003923Z [142s] Generation 6 starting: 27 neighbors, 2 active search path(s) 2026-02-21T08:23:55.2545270Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 28/28 37.5 configs/s 2026-02-21T08:23:56.8989081Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 28/28 17.5 configs/s 2026-02-21T08:23:59.5726979Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 413.6 2026-02-21T08:23:59.5727650Z configs/s 2026-02-21T08:23:59.7140133Z [147s] Generation 6 complete: 2026-02-21T08:23:59.7144224Z ok=29 2026-02-21T08:23:59.7145657Z min=0.0420 2026-02-21T08:23:59.7145819Z mid=0.0440 2026-02-21T08:23:59.7145944Z max=0.0603 2026-02-21T08:23:59.7146081Z best={'block_sizes': [256, 1], 2026-02-21T08:23:59.7146336Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:23:59.7146607Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:23:59.7146801Z 'num_stages': 5, 2026-02-21T08:23:59.7146939Z 'num_warps': 8, 2026-02-21T08:23:59.7147085Z 'pid_type': 'flat', 2026-02-21T08:23:59.7147240Z 'range_flattens': [None, False], 2026-02-21T08:23:59.7147421Z 'range_multi_buffers': [None, False], 2026-02-21T08:23:59.7147637Z 'range_num_stages': [0, 1], 2026-02-21T08:23:59.7148237Z 'range_unroll_factors': [0, 0], 2026-02-21T08:23:59.7148425Z 'range_warp_specializes': [None, False]} 2026-02-21T08:23:59.7154519Z [147s] Fitting surrogate: 484 points, 484 targets 2026-02-21T08:24:00.1964705Z [148s] Generation 7 starting: 27 neighbors, 2 active search path(s) 2026-02-21T08:24:01.4861311Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 60.2 configs/s 2026-02-21T08:24:03.0847372Z Generation 7: exploring neighbors 100% 
━━━━━━━━━━━━━━━━━━━━ 27/27 17.4 configs/s 2026-02-21T08:24:05.2310964Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 471.2 2026-02-21T08:24:05.2311296Z configs/s 2026-02-21T08:24:05.3567109Z [153s] Generation 7 complete: 2026-02-21T08:24:05.3568822Z ok=29 2026-02-21T08:24:05.3568986Z min=0.0419 2026-02-21T08:24:05.3569110Z mid=0.0420 2026-02-21T08:24:05.3569237Z max=0.2732 2026-02-21T08:24:05.3569688Z best={'block_sizes': [256, 1], 2026-02-21T08:24:05.3569969Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:24:05.3570227Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:24:05.3570415Z 'num_stages': 5, 2026-02-21T08:24:05.3570550Z 'num_warps': 8, 2026-02-21T08:24:05.3570691Z 'pid_type': 'flat', 2026-02-21T08:24:05.3570844Z 'range_flattens': [None, False], 2026-02-21T08:24:05.3571025Z 'range_multi_buffers': [None, False], 2026-02-21T08:24:05.3571228Z 'range_num_stages': [0, 1], 2026-02-21T08:24:05.3571399Z 'range_unroll_factors': [0, 0], 2026-02-21T08:24:05.3571570Z 'range_warp_specializes': [None, False]} 2026-02-21T08:24:05.3583956Z [153s] Fitting surrogate: 513 points, 513 targets 2026-02-21T08:24:05.6214372Z [153s] Autotuning complete in 153.9s after searching 487 configs. 
2026-02-21T08:24:05.6216256Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:24:05.6217244Z @helion.kernel(config=helion.Config(block_sizes=[256, 1], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=5, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T08:24:05.6218213Z 2026-02-21T08:24:05.6218481Z [153s] Code of selected kernel: /tmp/torchinductor_root/2e/c2ef76cyfenisfioqqaqdn2hszipb6ekljuztdrkll7knehpfj34.py 2026-02-21T08:24:05.6400780Z from __future__ import annotations 2026-02-21T08:24:05.6401001Z 2026-02-21T08:24:05.6405609Z import torch 2026-02-21T08:24:05.6409558Z import triton 2026-02-21T08:24:05.6411075Z import triton.language as tl 2026-02-21T08:24:05.6411363Z from torch._inductor.runtime import triton_helpers 2026-02-21T08:24:05.6411641Z from torch._inductor.runtime.triton_helpers import math as tl_math 2026-02-21T08:24:05.6416897Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T08:24:05.6418446Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T08:24:05.6418650Z 2026-02-21T08:24:05.6418724Z _BLOCK_SIZE_1 = tl.constexpr(1) 2026-02-21T08:24:05.6418910Z _BLOCK_SIZE_0 = tl.constexpr(256) 2026-02-21T08:24:05.6419064Z 2026-02-21T08:24:05.6419127Z @triton.jit 2026-02-21T08:24:05.6419327Z def _helion_kl_div_forward(y_pred, y_true, loss, log_target, eps): 2026-02-21T08:24:05.6424300Z # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m): 2026-02-21T08:24:05.6426341Z pid_0 = tl.program_id(0) 2026-02-21T08:24:05.6426590Z offset_1 = pid_0 2026-02-21T08:24:05.6426775Z indices_1 = offset_1 + tl.zeros([1], tl.int32) 2026-02-21T08:24:05.6431687Z # src[kl_div.py:90]: loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32) 
2026-02-21T08:24:05.6433378Z loss_sum = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T08:24:05.6433703Z # src[kl_div.py:92]: for tile_v in hl.tile(V, block_size=block_size_n): 2026-02-21T08:24:05.6434097Z # src[kl_div.py:93]: kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32) 2026-02-21T08:24:05.6434682Z # src[kl_div.py:92-112]: ... 2026-02-21T08:24:05.6439260Z for offset_0 in tl.range(0, 4096, _BLOCK_SIZE_0, warp_specialize=False, num_stages=1, disallow_acc_multi_buffer=True, flatten=False): 2026-02-21T08:24:05.6440748Z indices_0 = offset_0 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32) 2026-02-21T08:24:05.6441021Z loss_sum_copy = loss_sum 2026-02-21T08:24:05.6441213Z loss_sum_copy_0 = loss_sum_copy 2026-02-21T08:24:05.6441509Z # src[kl_div.py:93]: kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32) 2026-02-21T08:24:05.6441846Z kl_loss = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T08:24:05.6442207Z # src[kl_div.py:95]: y_pred_val = y_pred[tile_bt, tile_v] 2026-02-21T08:24:05.6442800Z y_pred_val = tl.load(y_pred + (indices_1[:, None] * 4096 + indices_0[None, :] * 1), None, eviction_policy='evict_first') 2026-02-21T08:24:05.6443192Z # src[kl_div.py:96]: y_true_val = y_true[tile_bt, tile_v] 2026-02-21T08:24:05.6443549Z y_true_val = tl.load(y_true + (indices_1[:, None] * 4096 + indices_0[None, :] * 1), None, eviction_policy='evict_first') 2026-02-21T08:24:05.6443885Z # src[kl_div.py:98]: if log_target: 2026-02-21T08:24:05.6444173Z # src[kl_div.py:99]: # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space 2026-02-21T08:24:05.6444483Z # src[kl_div.py:100]: prob_true = torch.exp(y_true_val) 2026-02-21T08:24:05.6444713Z # src[kl_div.py:98-106]: ... 
2026-02-21T08:24:05.6444888Z if log_target: 2026-02-21T08:24:05.6445057Z y_true_val_copy = y_true_val 2026-02-21T08:24:05.6445252Z y_pred_val_copy = y_pred_val 2026-02-21T08:24:05.6445432Z kl_loss_copy = kl_loss 2026-02-21T08:24:05.6445627Z y_true_val_copy_0 = y_true_val_copy 2026-02-21T08:24:05.6445828Z y_pred_val_copy_0 = y_pred_val_copy 2026-02-21T08:24:05.6446024Z kl_loss_copy_0 = kl_loss_copy 2026-02-21T08:24:05.6446238Z # src[kl_div.py:100]: prob_true = torch.exp(y_true_val) 2026-02-21T08:24:05.6446476Z v_0 = libdevice.exp(y_true_val_copy_0) 2026-02-21T08:24:05.6446725Z # src[kl_div.py:101]: kl_loss += prob_true * (y_true_val - y_pred_val) 2026-02-21T08:24:05.6446990Z v_1 = y_true_val_copy_0 - y_pred_val_copy_0 2026-02-21T08:24:05.6447191Z v_2 = v_0 * v_1 2026-02-21T08:24:05.6447357Z kl_loss = kl_loss_copy_0 + v_2 2026-02-21T08:24:05.6447554Z # src[kl_div.py:98]: if log_target: 2026-02-21T08:24:05.6447811Z # src[kl_div.py:99]: # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space 2026-02-21T08:24:05.6448113Z # src[kl_div.py:100]: prob_true = torch.exp(y_true_val) 2026-02-21T08:24:05.6448326Z # src[kl_div.py:98-106]: ... 
2026-02-21T08:24:05.6448510Z _not = not log_target 2026-02-21T08:24:05.6448677Z if _not: 2026-02-21T08:24:05.6448823Z y_true_val_copy_1 = y_true_val 2026-02-21T08:24:05.6449011Z y_pred_val_copy_1 = y_pred_val 2026-02-21T08:24:05.6449188Z kl_loss_copy_1 = kl_loss 2026-02-21T08:24:05.6449387Z y_true_val_copy_1_0 = y_true_val_copy_1 2026-02-21T08:24:05.6449580Z y_pred_val_copy_1_0 = y_pred_val_copy_1 2026-02-21T08:24:05.6449776Z kl_loss_copy_1_0 = kl_loss_copy_1 2026-02-21T08:24:05.6450014Z # src[kl_div.py:105]: log_true = torch.log(torch.clamp(y_true_val, min=eps)) 2026-02-21T08:24:05.6450302Z v_4 = triton_helpers.maximum(y_true_val_copy_1_0, eps) 2026-02-21T08:24:05.6450513Z v_5 = tl_math.log(v_4) 2026-02-21T08:24:05.6450725Z # src[kl_div.py:106]: kl_loss += y_true_val * (log_true - y_pred_val) 2026-02-21T08:24:05.6450962Z v_6 = v_5 - y_pred_val_copy_1_0 2026-02-21T08:24:05.6451138Z v_7 = y_true_val_copy_1_0 * v_6 2026-02-21T08:24:05.6451455Z kl_loss = kl_loss_copy_1_0 + v_7 2026-02-21T08:24:05.6451639Z # src[kl_div.py:112]: loss_sum += kl_loss 2026-02-21T08:24:05.6451834Z loss_sum = loss_sum_copy_0 + kl_loss 2026-02-21T08:24:05.6452092Z # src[kl_div.py:115]: loss[tile_bt] = loss_sum.sum(dim=-1) 2026-02-21T08:24:05.6452317Z sum_1 = tl.cast(tl.sum(loss_sum, 1), tl.float32) 2026-02-21T08:24:05.6452525Z tl.store(loss + indices_1 * 1, sum_1, None) 2026-02-21T08:24:05.6452652Z 2026-02-21T08:24:05.6452954Z def kl_div_forward(y_pred: Tensor, y_true: Tensor, log_target: bool=False, reduction: str='batchmean', eps: float=1e-10, *, _launcher=_default_launcher): 2026-02-21T08:24:05.6453339Z """ 2026-02-21T08:24:05.6453476Z Compute KL Divergence loss. 
2026-02-21T08:24:05.6453585Z 2026-02-21T08:24:05.6453637Z Args: 2026-02-21T08:24:05.6453807Z y_pred: Input predictions in log-space, shape (BT, V) 2026-02-21T08:24:05.6454156Z y_true: Target values (probabilities or log-probabilities), shape (BT, V) 2026-02-21T08:24:05.6454492Z log_target: If True, y_true is in log-space; if False, y_true is probabilities 2026-02-21T08:24:05.6454803Z reduction: Reduction mode ('none', 'sum', 'mean', 'batchmean') 2026-02-21T08:24:05.6455040Z eps: Small value to avoid numerical issues 2026-02-21T08:24:05.6455171Z 2026-02-21T08:24:05.6455231Z Returns: 2026-02-21T08:24:05.6455364Z loss: KL divergence loss 2026-02-21T08:24:05.6455522Z """ 2026-02-21T08:24:05.6455657Z # src[kl_div.py:74]: BT, V = y_pred.shape 2026-02-21T08:24:05.6455842Z BT, V = y_pred.shape 2026-02-21T08:24:05.6456031Z # src[kl_div.py:75]: assert y_true.shape == y_pred.shape, ( 2026-02-21T08:24:05.6456305Z # src[kl_div.py:76]: f"Shape mismatch: {y_true.shape} != {y_pred.shape}" 2026-02-21T08:24:05.6456548Z # src[kl_div.py:77]: ) 2026-02-21T08:24:05.6456793Z assert y_true.shape == y_pred.shape, f'Shape mismatch: {y_true.shape} != {y_pred.shape}' 2026-02-21T08:24:05.6457087Z # src[kl_div.py:80]: if reduction == "none": 2026-02-21T08:24:05.6457303Z # src[kl_div.py:81]: loss = torch.zeros_like(y_pred) 2026-02-21T08:24:05.6457508Z # src[kl_div.py:82]: else: 2026-02-21T08:24:05.6457661Z # src[kl_div.py:80-83]: ... 
2026-02-21T08:24:05.6457823Z if reduction == 'none': 2026-02-21T08:24:05.6458004Z # src[kl_div.py:81]: loss = torch.zeros_like(y_pred) 2026-02-21T08:24:05.6458210Z loss = torch.zeros_like(y_pred) 2026-02-21T08:24:05.6458378Z else: 2026-02-21T08:24:05.6458592Z # src[kl_div.py:83]: loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device) 2026-02-21T08:24:05.6458920Z loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device) 2026-02-21T08:24:05.6459200Z # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m): 2026-02-21T08:24:05.6459515Z # src[kl_div.py:90]: loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32) 2026-02-21T08:24:05.6459773Z # src[kl_div.py:89-115]: ... 2026-02-21T08:24:05.6460075Z _launcher(_helion_kl_div_forward, (4096,), y_pred, y_true, loss, log_target, eps, num_warps=8, num_stages=5) 2026-02-21T08:24:05.6460412Z # src[kl_div.py:118]: if reduction == "batchmean": 2026-02-21T08:24:05.6460641Z # src[kl_div.py:119]: final_loss = torch.sum(loss) / BT 2026-02-21T08:24:05.6460872Z # src[kl_div.py:120]: elif reduction == "sum": 2026-02-21T08:24:05.6461057Z # src[kl_div.py:118-125]: ... 
2026-02-21T08:24:05.6461226Z if reduction == 'batchmean': 2026-02-21T08:24:05.6461417Z # src[kl_div.py:119]: final_loss = torch.sum(loss) / BT 2026-02-21T08:24:05.6461630Z final_loss = torch.sum(loss) / BT 2026-02-21T08:24:05.6461811Z elif reduction == 'sum': 2026-02-21T08:24:05.6462030Z # src[kl_div.py:121]: final_loss = torch.sum(loss, dim=0) 2026-02-21T08:24:05.6462246Z final_loss = torch.sum(loss, dim=0) 2026-02-21T08:24:05.6462422Z elif reduction == 'mean': 2026-02-21T08:24:05.6462626Z # src[kl_div.py:123]: final_loss = torch.sum(loss) / (BT * V) 2026-02-21T08:24:05.6462902Z final_loss = torch.sum(loss) / (BT * V) 2026-02-21T08:24:05.6463075Z else: 2026-02-21T08:24:05.6463210Z # src[kl_div.py:125]: final_loss = loss 2026-02-21T08:24:05.6463389Z final_loss = loss 2026-02-21T08:24:05.6463551Z # src[kl_div.py:127]: return final_loss 2026-02-21T08:24:05.6463717Z return final_loss 2026-02-21T08:24:06.3863360Z WARNING:tritonbench.utils.triton_op:Completed input ID 0: 2026-02-21T08:24:06.3866358Z (B, T, V) 2026-02-21T08:24:06.3870909Z -------------- 2026-02-21T08:24:06.3875641Z (8, 512, 4096) 2026-02-21T08:24:06.3880090Z 2026-02-21T08:24:06.3885679Z 17%|█▋ | 1/6 [02:38<13:13, 158.74s/it]WARNING:tritonbench.utils.triton_op:Running input ID 1: 2026-02-21T08:24:06.3889671Z (B, T, V) 2026-02-21T08:24:06.3891075Z -------------- 2026-02-21T08:24:06.3891266Z (8, 512, 8192) 2026-02-21T08:24:06.3891834Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for torch_kl_div 2026-02-21T08:24:07.5717779Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for liger_kl_div 2026-02-21T08:24:08.6712898Z INFO:tritonbench.utils.triton_op:Took 2.53ms to get benchmark function for torch_compile_kl_div 2026-02-21T08:24:10.0853200Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:24:10.0854717Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:24:10.0854951Z 'dtype': 'torch.float32', 2026-02-21T08:24:10.0855149Z 'shape': (4096, 8192), 
2026-02-21T08:24:10.0855324Z 'stride': (8192, 1)}, 2026-02-21T08:24:10.0855510Z { 'device': 'cuda:0', 2026-02-21T08:24:10.0855686Z 'dtype': 'torch.float32', 2026-02-21T08:24:10.0855872Z 'shape': (4096, 8192), 2026-02-21T08:24:10.0856051Z 'stride': (8192, 1)}), 2026-02-21T08:24:10.0856215Z 'kwargs': {}} 2026-02-21T08:24:10.0870307Z INFO:tritonbench.utils.triton_op:Took 1.99ms to get benchmark function for helion_kl_div_tritonbench 2026-02-21T08:24:10.2818123Z [0s] Autotune random seed: 2135561342 2026-02-21T08:24:10.3158497Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:24:42.7911097Z [32s] Timeout after 30s compiling Config(block_sizes=[128, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=256, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[0, 0], range_unroll_factors=[3, 3], range_warp_specializes=[None, None]) 2026-02-21T08:24:42.9785705Z [32s] Timeout after 30s compiling Config(block_sizes=[128, 512], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], maxnreg=32, num_sm_multiplier=2, num_stages=5, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[True, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[False, None]) 2026-02-21T08:24:43.2263283Z [32s] Timeout after 30s compiling Config(block_sizes=[64, 1024], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T08:24:43.5086750Z [33s] Timeout after 30s compiling 
Config(block_sizes=[4096, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', ''], maxnreg=32, num_sm_multiplier=128, num_stages=6, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[2, 1], range_unroll_factors=[2, 2], range_warp_specializes=[False, None]) 2026-02-21T08:24:43.8508309Z [33s] Timeout after 30s compiling Config(block_sizes=[512, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'last'], maxnreg=128, num_sm_multiplier=1, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[1, 3], range_unroll_factors=[3, 3], range_warp_specializes=[None, None]) 2026-02-21T08:24:43.9744928Z [33s] Timeout after 30s compiling Config(block_sizes=[256, 128], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['last', ''], maxnreg=32, num_sm_multiplier=32, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, False], range_num_stages=[4, 4], range_unroll_factors=[3, 0], range_warp_specializes=[None, False]) 2026-02-21T08:24:44.2789379Z [33s] Timeout after 30s compiling Config(block_sizes=[256, 256], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_sm_multiplier=16, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, False], range_num_stages=[3, 4], range_unroll_factors=[4, 3], range_warp_specializes=[None, False]) 2026-02-21T08:24:44.3474285Z [34s] Timeout after 30s compiling Config(block_sizes=[4096, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], maxnreg=128, num_sm_multiplier=128, num_stages=4, num_warps=2, pid_type='persistent_interleaved', 
range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[4, 3], range_unroll_factors=[3, 1], range_warp_specializes=[None, None]) 2026-02-21T08:24:44.3829933Z [34s] Timeout after 30s compiling Config(block_sizes=[512, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, False]) 2026-02-21T08:24:44.3848072Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.1 configs/s 2026-02-21T08:24:46.7521025Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:24:46.7526336Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:24:46.7531805Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:24:46.7537460Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:24:46.7539100Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:24:46.7539299Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:24:46.7542678Z %cst = arith.constant dense<0.000000e+00> : tensor<256x32xf32> 2026-02-21T08:24:46.7542952Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:24:46.7543138Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:24:46.7543351Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:24:46.7543588Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:24:46.7545972Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:24:46.7546396Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:24:46.7546853Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:24:46.7547175Z %2 = tt.get_program_id x : i32 2026-02-21T08:24:46.7547369Z 
%3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:24:46.7547557Z %4 = arith.minsi %3, %c16_i32 : i32 2026-02-21T08:24:46.7547748Z %5 = arith.subi %4, %2 : i32 2026-02-21T08:24:46.7547927Z %c1_i32_0 = arith.constant 1 : i32 2026-02-21T08:24:46.7548122Z %6 = arith.subi %c1_i32, %c1_i32_0 : i32 2026-02-21T08:24:46.7548305Z %7 = arith.addi %5, %6 : i32 2026-02-21T08:24:46.7548532Z %8 = arith.divui %7, %c1_i32 : i32 2026-02-21T08:24:46.7549108Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:24:46.7549286Z %9 = arith.remsi %8, %c2_i32 : i32 2026-02-21T08:24:46.7549468Z %10 = arith.subi %8, %9 : i32 2026-02-21T08:24:46.7549642Z %11 = arith.muli %10, %c1_i32 : i32 2026-02-21T08:24:46.7549828Z %12 = arith.addi %2, %11 : i32 2026-02-21T08:24:46.7550007Z %13 = arith.muli %c1_i32, %c2_i32 : i32 2026-02-21T08:24:46.7550221Z scf.for %arg5 = %2 to %12 step %13 : i32 { 2026-02-21T08:24:46.7550422Z %14 = arith.muli %arg5, %c256_i32 : i32 2026-02-21T08:24:46.7550667Z %15 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:24:46.7550942Z %16 = tt.splat %14 : i32 -> tensor<256xi32> 2026-02-21T08:24:46.7551133Z %17 = arith.addi %16, %15 : tensor<256xi32> 2026-02-21T08:24:46.7551449Z %18 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<256x32xf32>) : i32 { 2026-02-21T08:24:46.7552026Z %32 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<256x32xf32> 2026-02-21T08:24:46.7552447Z %33 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc> -> tensor<256x32xf32> 2026-02-21T08:24:46.7552743Z %34 = scf.if %arg3 -> (tensor<256x32xf32>) { 2026-02-21T08:24:46.7553111Z %36 = tt.extern_elementwise %33 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x32xf32>) -> tensor<256x32xf32> 2026-02-21T08:24:46.7553491Z %37 = arith.subf %33, %32 : tensor<256x32xf32> 2026-02-21T08:24:46.7553702Z %38 = arith.mulf %36, %37 : tensor<256x32xf32> 2026-02-21T08:24:46.7553910Z %39 = arith.addf %38, %cst : 
tensor<256x32xf32> 2026-02-21T08:24:46.7554118Z scf.yield %39 : tensor<256x32xf32> 2026-02-21T08:24:46.7554291Z } else { 2026-02-21T08:24:46.7554461Z %36 = tt.splat %arg4 : f32 -> tensor<256x32xf32> 2026-02-21T08:24:46.7554688Z %37 = arith.cmpf ogt, %33, %36 : tensor<256x32xf32> 2026-02-21T08:24:46.7554924Z %38 = arith.cmpf une, %33, %33 : tensor<256x32xf32> 2026-02-21T08:24:46.7555141Z %39 = arith.ori %37, %38 : tensor<256x32xi1> 2026-02-21T08:24:46.7555380Z %40 = arith.select %39, %33, %36 : tensor<256x32xi1>, tensor<256x32xf32> 2026-02-21T08:24:46.7555626Z %41 = math.log %40 : tensor<256x32xf32> 2026-02-21T08:24:46.7555827Z %42 = arith.subf %41, %32 : tensor<256x32xf32> 2026-02-21T08:24:46.7556037Z %43 = arith.mulf %33, %42 : tensor<256x32xf32> 2026-02-21T08:24:46.7556241Z %44 = arith.addf %43, %cst : tensor<256x32xf32> 2026-02-21T08:24:46.7556447Z scf.yield %44 : tensor<256x32xf32> 2026-02-21T08:24:46.7556615Z } 2026-02-21T08:24:46.7556770Z %35 = arith.addf %arg7, %34 : tensor<256x32xf32> 2026-02-21T08:24:46.7556969Z scf.yield %35 : tensor<256x32xf32> 2026-02-21T08:24:46.7557171Z } {tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:24:46.7557382Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:24:46.7557572Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:46.7557753Z %32 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:24:46.7557935Z tt.reduce.return %32 : f32 2026-02-21T08:24:46.7558124Z }) : (tensor<256x32xf32>) -> tensor<256xf32> 2026-02-21T08:24:46.7558358Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<256x!tt.ptr> 2026-02-21T08:24:46.7558614Z %21 = tt.addptr %20, %17 : tensor<256x!tt.ptr>, tensor<256xi32> 2026-02-21T08:24:46.7558854Z tt.store %21, %19 : tensor<256x!tt.ptr> 2026-02-21T08:24:46.7559046Z %c1_i32_1 = arith.constant 1 : i32 2026-02-21T08:24:46.7559239Z %22 = arith.muli %c1_i32, %c1_i32_1 : i32 2026-02-21T08:24:46.7559422Z %23 = arith.addi %arg5, %22 : i32 2026-02-21T08:24:46.7559603Z %24 = arith.muli %23, %c256_i32 : i32 
2026-02-21T08:24:46.7559826Z %25 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:24:46.7560074Z %26 = tt.splat %24 : i32 -> tensor<256xi32> 2026-02-21T08:24:46.7560357Z %27 = arith.addi %26, %25 : tensor<256xi32> 2026-02-21T08:24:46.7560653Z %28 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<256x32xf32>) : i32 { 2026-02-21T08:24:46.7561053Z %32 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc> -> tensor<256x32xf32> 2026-02-21T08:24:46.7561415Z %33 = tt.descriptor_load %1[%24, %arg6] : !tt.tensordesc> -> tensor<256x32xf32> 2026-02-21T08:24:46.7561708Z %34 = scf.if %arg3 -> (tensor<256x32xf32>) { 2026-02-21T08:24:46.7562100Z %36 = tt.extern_elementwise %33 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x32xf32>) -> tensor<256x32xf32> 2026-02-21T08:24:46.7562458Z %37 = arith.subf %33, %32 : tensor<256x32xf32> 2026-02-21T08:24:46.7562661Z %38 = arith.mulf %36, %37 : tensor<256x32xf32> 2026-02-21T08:24:46.7562922Z %39 = arith.addf %38, %cst : tensor<256x32xf32> 2026-02-21T08:24:46.7563128Z scf.yield %39 : tensor<256x32xf32> 2026-02-21T08:24:46.7563291Z } else { 2026-02-21T08:24:46.7563454Z %36 = tt.splat %arg4 : f32 -> tensor<256x32xf32> 2026-02-21T08:24:46.7563672Z %37 = arith.cmpf ogt, %33, %36 : tensor<256x32xf32> 2026-02-21T08:24:46.7563883Z %38 = arith.cmpf une, %33, %33 : tensor<256x32xf32> 2026-02-21T08:24:46.7564092Z %39 = arith.ori %37, %38 : tensor<256x32xi1> 2026-02-21T08:24:46.7564322Z %40 = arith.select %39, %33, %36 : tensor<256x32xi1>, tensor<256x32xf32> 2026-02-21T08:24:46.7564562Z %41 = math.log %40 : tensor<256x32xf32> 2026-02-21T08:24:46.7564779Z %42 = arith.subf %41, %32 : tensor<256x32xf32> 2026-02-21T08:24:46.7564980Z %43 = arith.mulf %33, %42 : tensor<256x32xf32> 2026-02-21T08:24:46.7565182Z %44 = arith.addf %43, %cst : tensor<256x32xf32> 2026-02-21T08:24:46.7565376Z scf.yield %44 : tensor<256x32xf32> 2026-02-21T08:24:46.7565550Z } 
2026-02-21T08:24:46.7565691Z %35 = arith.addf %arg7, %34 : tensor<256x32xf32> 2026-02-21T08:24:46.7565888Z scf.yield %35 : tensor<256x32xf32> 2026-02-21T08:24:46.7566083Z } {tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:24:46.7566293Z %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({ 2026-02-21T08:24:46.7566476Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:46.7566654Z %32 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:24:46.7566841Z tt.reduce.return %32 : f32 2026-02-21T08:24:46.7567024Z }) : (tensor<256x32xf32>) -> tensor<256xf32> 2026-02-21T08:24:46.7567253Z %30 = tt.splat %arg2 : !tt.ptr -> tensor<256x!tt.ptr> 2026-02-21T08:24:46.7567513Z %31 = tt.addptr %30, %27 : tensor<256x!tt.ptr>, tensor<256xi32> 2026-02-21T08:24:46.7567748Z tt.store %31, %29 : tensor<256x!tt.ptr> 2026-02-21T08:24:46.7567942Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:24:46.7568147Z scf.for %arg5 = %12 to %4 step %c1_i32 : i32 { 2026-02-21T08:24:46.7568348Z %14 = arith.muli %arg5, %c256_i32 : i32 2026-02-21T08:24:46.7568574Z %15 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:24:46.7568814Z %16 = tt.splat %14 : i32 -> tensor<256xi32> 2026-02-21T08:24:46.7569000Z %17 = arith.addi %16, %15 : tensor<256xi32> 2026-02-21T08:24:46.7569304Z %18 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<256x32xf32>) : i32 { 2026-02-21T08:24:46.7569693Z %22 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<256x32xf32> 2026-02-21T08:24:46.7570064Z %23 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc> -> tensor<256x32xf32> 2026-02-21T08:24:46.7570353Z %24 = scf.if %arg3 -> (tensor<256x32xf32>) { 2026-02-21T08:24:46.7570709Z %26 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x32xf32>) -> tensor<256x32xf32> 2026-02-21T08:24:46.7571130Z %27 = arith.subf %23, %22 : tensor<256x32xf32> 2026-02-21T08:24:46.7571331Z %28 = arith.mulf %26, %27 : 
tensor<256x32xf32> 2026-02-21T08:24:46.7571542Z %29 = arith.addf %28, %cst : tensor<256x32xf32> 2026-02-21T08:24:46.7571756Z scf.yield %29 : tensor<256x32xf32> 2026-02-21T08:24:46.7571951Z } else { 2026-02-21T08:24:46.7572120Z %26 = tt.splat %arg4 : f32 -> tensor<256x32xf32> 2026-02-21T08:24:46.7572337Z %27 = arith.cmpf ogt, %23, %26 : tensor<256x32xf32> 2026-02-21T08:24:46.7572562Z %28 = arith.cmpf une, %23, %23 : tensor<256x32xf32> 2026-02-21T08:24:46.7572770Z %29 = arith.ori %27, %28 : tensor<256x32xi1> 2026-02-21T08:24:46.7573012Z %30 = arith.select %29, %23, %26 : tensor<256x32xi1>, tensor<256x32xf32> 2026-02-21T08:24:46.7573263Z %31 = math.log %30 : tensor<256x32xf32> 2026-02-21T08:24:46.7573519Z %32 = arith.subf %31, %22 : tensor<256x32xf32> 2026-02-21T08:24:46.7573729Z %33 = arith.mulf %23, %32 : tensor<256x32xf32> 2026-02-21T08:24:46.7573932Z %34 = arith.addf %33, %cst : tensor<256x32xf32> 2026-02-21T08:24:46.7574130Z scf.yield %34 : tensor<256x32xf32> 2026-02-21T08:24:46.7574294Z } 2026-02-21T08:24:46.7574445Z %25 = arith.addf %arg7, %24 : tensor<256x32xf32> 2026-02-21T08:24:46.7574635Z scf.yield %25 : tensor<256x32xf32> 2026-02-21T08:24:46.7574838Z } {tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:24:46.7575046Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:24:46.7575229Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:46.7575411Z %22 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:24:46.7575591Z tt.reduce.return %22 : f32 2026-02-21T08:24:46.7575780Z }) : (tensor<256x32xf32>) -> tensor<256xf32> 2026-02-21T08:24:46.7576009Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<256x!tt.ptr> 2026-02-21T08:24:46.7576285Z %21 = tt.addptr %20, %17 : tensor<256x!tt.ptr>, tensor<256xi32> 2026-02-21T08:24:46.7576520Z tt.store %21, %19 : tensor<256x!tt.ptr> 2026-02-21T08:24:46.7576711Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:24:46.7576884Z tt.return 2026-02-21T08:24:46.7577004Z } 2026-02-21T08:24:46.7577126Z } 2026-02-21T08:24:46.7577192Z 
2026-02-21T08:24:46.7577242Z {-# 2026-02-21T08:24:46.7577372Z external_resources: { 2026-02-21T08:24:46.7577522Z mlir_reproducer: { 2026-02-21T08:24:46.7581738Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:24:46.7586125Z disable_threading: false, 
2026-02-21T08:24:46.7586289Z verify_each: true 2026-02-21T08:24:46.7586433Z } 2026-02-21T08:24:46.7586560Z } 2026-02-21T08:24:46.7586673Z #-} 2026-02-21T08:24:46.7587094Z /tmp/torchinductor_root/mr/cmro5mf7dlfwrt33obhsbtlfyttej6je6h7p2i3nwbklsmwo5qaf.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:24:46.7588352Z /tmp/torchinductor_root/mr/cmro5mf7dlfwrt33obhsbtlfyttej6je6h7p2i3nwbklsmwo5qaf.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:24:46.7589314Z [36s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:24:46.7590380Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last'], maxnreg=128, num_sm_multiplier=16, num_stages=6, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:24:46.7596244Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:24:46.7596535Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:24:47.5541600Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:24:47.5542493Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:24:47.5543152Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:24:47.5543354Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:24:47.5547427Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:24:47.5549530Z %cst = arith.constant dense<0.000000e+00> : tensor<128x32xf32> 2026-02-21T08:24:47.5549799Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:24:47.5549984Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:24:47.5550172Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:24:47.5550344Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:24:47.5550529Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:24:47.5550841Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:24:47.5551290Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:24:47.5551609Z %2 = tt.get_program_id x : i32 2026-02-21T08:24:47.5551779Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:24:47.5552043Z %4 = arith.minsi %3, %c32_i32 : i32 2026-02-21T08:24:47.5552242Z scf.for %arg5 = %2 to %4 step %c1_i32 : i32 { 2026-02-21T08:24:47.5552451Z %5 = arith.muli %arg5, %c128_i32 : i32 2026-02-21T08:24:47.5552679Z %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:24:47.5552949Z %7 = tt.splat %5 : i32 -> tensor<128xi32> 2026-02-21T08:24:47.5553152Z %8 = arith.addi %7, %6 : tensor<128xi32> 2026-02-21T08:24:47.5553460Z %9 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<128x32xf32>) : i32 { 2026-02-21T08:24:47.5553877Z %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc> -> 
tensor<128x32xf32> 2026-02-21T08:24:47.5554255Z %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc> -> tensor<128x32xf32> 2026-02-21T08:24:47.5554902Z %15 = scf.if %arg3 -> (tensor<128x32xf32>) { 2026-02-21T08:24:47.5555281Z %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x32xf32>) -> tensor<128x32xf32> 2026-02-21T08:24:47.5555649Z %18 = arith.subf %14, %13 : tensor<128x32xf32> 2026-02-21T08:24:47.5555867Z %19 = arith.mulf %17, %18 : tensor<128x32xf32> 2026-02-21T08:24:47.5556073Z %20 = arith.addf %19, %cst : tensor<128x32xf32> 2026-02-21T08:24:47.5556277Z scf.yield %20 : tensor<128x32xf32> 2026-02-21T08:24:47.5556445Z } else { 2026-02-21T08:24:47.5556612Z %17 = tt.splat %arg4 : f32 -> tensor<128x32xf32> 2026-02-21T08:24:47.5556833Z %18 = arith.cmpf ogt, %14, %17 : tensor<128x32xf32> 2026-02-21T08:24:47.5557175Z %19 = arith.cmpf une, %14, %14 : tensor<128x32xf32> 2026-02-21T08:24:47.5557397Z %20 = arith.ori %18, %19 : tensor<128x32xi1> 2026-02-21T08:24:47.5557630Z %21 = arith.select %20, %14, %17 : tensor<128x32xi1>, tensor<128x32xf32> 2026-02-21T08:24:47.5557882Z %22 = math.log %21 : tensor<128x32xf32> 2026-02-21T08:24:47.5558094Z %23 = arith.subf %22, %13 : tensor<128x32xf32> 2026-02-21T08:24:47.5558296Z %24 = arith.mulf %14, %23 : tensor<128x32xf32> 2026-02-21T08:24:47.5558532Z %25 = arith.addf %24, %cst : tensor<128x32xf32> 2026-02-21T08:24:47.5558722Z scf.yield %25 : tensor<128x32xf32> 2026-02-21T08:24:47.5558893Z } 2026-02-21T08:24:47.5559036Z %16 = arith.addf %arg7, %15 : tensor<128x32xf32> 2026-02-21T08:24:47.5559230Z scf.yield %16 : tensor<128x32xf32> 2026-02-21T08:24:47.5559508Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:24:47.5559801Z %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({ 2026-02-21T08:24:47.5559999Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:47.5560172Z %13 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:24:47.5560360Z 
tt.reduce.return %13 : f32 2026-02-21T08:24:47.5560542Z }) : (tensor<128x32xf32>) -> tensor<128xf32> 2026-02-21T08:24:47.5560778Z %11 = tt.splat %arg2 : !tt.ptr -> tensor<128x!tt.ptr> 2026-02-21T08:24:47.5561041Z %12 = tt.addptr %11, %8 : tensor<128x!tt.ptr>, tensor<128xi32> 2026-02-21T08:24:47.5561273Z tt.store %12, %10 : tensor<128x!tt.ptr> 2026-02-21T08:24:47.5561473Z } {tt.loop_unroll_factor = 1 : i32} 2026-02-21T08:24:47.5561633Z tt.return 2026-02-21T08:24:47.5561761Z } 2026-02-21T08:24:47.5561910Z } 2026-02-21T08:24:47.5561984Z 2026-02-21T08:24:47.5562031Z {-# 2026-02-21T08:24:47.5562152Z external_resources: { 2026-02-21T08:24:47.5562305Z mlir_reproducer: { 2026-02-21T08:24:47.5566590Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=1}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=1}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=1}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, 
tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:24:47.5571103Z disable_threading: false, 2026-02-21T08:24:47.5571273Z verify_each: true 2026-02-21T08:24:47.5571413Z } 2026-02-21T08:24:47.5571532Z } 2026-02-21T08:24:47.5571640Z #-} 2026-02-21T08:24:47.5572195Z /tmp/torchinductor_root/jh/cjh3kqlworq67kiavguonnx5vbnedwyzfa3s7tsnyxsk2v2fxuls.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:24:47.5573390Z /tmp/torchinductor_root/jh/cjh3kqlworq67kiavguonnx5vbnedwyzfa3s7tsnyxsk2v2fxuls.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:24:47.5574347Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:24:47.5575431Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=2, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, False], range_num_stages=[0, 2], range_unroll_factors=[1, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:24:47.5576426Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:24:47.5576672Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:24:48.0930839Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:24:48.0933199Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:24:48.0933809Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:24:48.0934021Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:24:48.0934212Z %c2368_i32 = arith.constant 2368 : i32 2026-02-21T08:24:48.0934440Z %cst = arith.constant dense<8192> : tensor<4x1xi32> 2026-02-21T08:24:48.0934690Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<4x4xf32> 2026-02-21T08:24:48.0934916Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:24:48.0935120Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:24:48.0935315Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:24:48.0935486Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:24:48.0935665Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:24:48.0935972Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:24:48.0936280Z %1 = tt.get_program_id x : i32 2026-02-21T08:24:48.0936460Z %2 = arith.subi %c1024_i32, %1 : i32 2026-02-21T08:24:48.0936632Z %c1_i32 = arith.constant 1 : i32 
2026-02-21T08:24:48.0936814Z %3 = arith.subi %c2368_i32, %c1_i32 : i32 2026-02-21T08:24:48.0936997Z %4 = arith.addi %2, %3 : i32 2026-02-21T08:24:48.0937173Z %5 = arith.divui %4, %c2368_i32 : i32 2026-02-21T08:24:48.0937346Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:24:48.0937521Z %6 = arith.remsi %5, %c3_i32 : i32 2026-02-21T08:24:48.0937693Z %7 = arith.subi %5, %6 : i32 2026-02-21T08:24:48.0937872Z %8 = arith.muli %7, %c2368_i32 : i32 2026-02-21T08:24:48.0938341Z %9 = arith.addi %1, %8 : i32 2026-02-21T08:24:48.0938510Z %10 = arith.muli %c2368_i32, %c3_i32 : i32 2026-02-21T08:24:48.0938748Z scf.for %arg5 = %1 to %9 step %10 : i32 { 2026-02-21T08:24:48.0938940Z %11 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:24:48.0939155Z %12 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:24:48.0939402Z %13 = tt.splat %11 : i32 -> tensor<4xi32> 2026-02-21T08:24:48.0939596Z %14 = arith.addi %13, %12 : tensor<4xi32> 2026-02-21T08:24:48.0939894Z %15 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c4_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:24:48.0940207Z %39 = tt.splat %arg6 : i32 -> tensor<4xi32> 2026-02-21T08:24:48.0940403Z %40 = arith.addi %39, %12 : tensor<4xi32> 2026-02-21T08:24:48.0940649Z %41 = tt.expand_dims %14 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:24:48.0940995Z %42 = arith.muli %41, %cst : tensor<4x1xi32> 2026-02-21T08:24:48.0941235Z %43 = tt.expand_dims %40 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32> 2026-02-21T08:24:48.0941512Z %44 = tt.broadcast %42 : tensor<4x1xi32> -> tensor<4x4xi32> 2026-02-21T08:24:48.0941751Z %45 = tt.broadcast %43 : tensor<1x4xi32> -> tensor<4x4xi32> 2026-02-21T08:24:48.0942042Z %46 = arith.addi %44, %45 : tensor<4x4xi32> 2026-02-21T08:24:48.0942266Z %47 = tt.splat %arg0 : !tt.ptr -> tensor<4x4x!tt.ptr> 2026-02-21T08:24:48.0942528Z %48 = tt.addptr %47, %46 : tensor<4x4x!tt.ptr>, tensor<4x4xi32> 2026-02-21T08:24:48.0942814Z %49 = tt.load %48 
evictionPolicy = evict_last : tensor<4x4x!tt.ptr> 2026-02-21T08:24:48.0943139Z %50 = tt.descriptor_load %0[%11, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:24:48.0943424Z %51 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:24:48.0943776Z %53 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:24:48.0944141Z %54 = arith.subf %50, %49 : tensor<4x4xf32> 2026-02-21T08:24:48.0944340Z %55 = arith.mulf %53, %54 : tensor<4x4xf32> 2026-02-21T08:24:48.0944555Z %56 = arith.addf %55, %cst_0 : tensor<4x4xf32> 2026-02-21T08:24:48.0944760Z scf.yield %56 : tensor<4x4xf32> 2026-02-21T08:24:48.0944930Z } else { 2026-02-21T08:24:48.0945094Z %53 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:24:48.0945300Z %54 = arith.cmpf ogt, %50, %53 : tensor<4x4xf32> 2026-02-21T08:24:48.0945511Z %55 = arith.cmpf une, %50, %50 : tensor<4x4xf32> 2026-02-21T08:24:48.0945707Z %56 = arith.ori %54, %55 : tensor<4x4xi1> 2026-02-21T08:24:48.0945939Z %57 = arith.select %56, %50, %53 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:24:48.0946178Z %58 = math.log %57 : tensor<4x4xf32> 2026-02-21T08:24:48.0946365Z %59 = arith.subf %58, %49 : tensor<4x4xf32> 2026-02-21T08:24:48.0946561Z %60 = arith.mulf %50, %59 : tensor<4x4xf32> 2026-02-21T08:24:48.0946756Z %61 = arith.addf %60, %cst_0 : tensor<4x4xf32> 2026-02-21T08:24:48.0946949Z scf.yield %61 : tensor<4x4xf32> 2026-02-21T08:24:48.0947107Z } 2026-02-21T08:24:48.0947251Z %52 = arith.addf %arg7, %51 : tensor<4x4xf32> 2026-02-21T08:24:48.0947434Z scf.yield %52 : tensor<4x4xf32> 2026-02-21T08:24:48.0947683Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:24:48.0947950Z %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({ 2026-02-21T08:24:48.0948134Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:48.0948312Z %39 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:24:48.0948489Z tt.reduce.return %39 : f32 
2026-02-21T08:24:48.0948671Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:24:48.0948886Z %17 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:24:48.0949253Z %18 = tt.addptr %17, %14 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:24:48.0949486Z tt.store %18, %16 : tensor<4x!tt.ptr> 2026-02-21T08:24:48.0949675Z %c1_i32_1 = arith.constant 1 : i32 2026-02-21T08:24:48.0949865Z %19 = arith.muli %c2368_i32, %c1_i32_1 : i32 2026-02-21T08:24:48.0950049Z %20 = arith.addi %arg5, %19 : i32 2026-02-21T08:24:48.0950225Z %21 = arith.muli %20, %c4_i32 : i32 2026-02-21T08:24:48.0950436Z %22 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:24:48.0950708Z %23 = tt.splat %21 : i32 -> tensor<4xi32> 2026-02-21T08:24:48.0950892Z %24 = arith.addi %23, %22 : tensor<4xi32> 2026-02-21T08:24:48.0951191Z %25 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c4_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:24:48.0951495Z %39 = tt.splat %arg6 : i32 -> tensor<4xi32> 2026-02-21T08:24:48.0951774Z %40 = arith.addi %39, %22 : tensor<4xi32> 2026-02-21T08:24:48.0952079Z %41 = tt.expand_dims %24 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:24:48.0952342Z %42 = arith.muli %41, %cst : tensor<4x1xi32> 2026-02-21T08:24:48.0952597Z %43 = tt.expand_dims %40 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32> 2026-02-21T08:24:48.0952882Z %44 = tt.broadcast %42 : tensor<4x1xi32> -> tensor<4x4xi32> 2026-02-21T08:24:48.0953132Z %45 = tt.broadcast %43 : tensor<1x4xi32> -> tensor<4x4xi32> 2026-02-21T08:24:48.0953362Z %46 = arith.addi %44, %45 : tensor<4x4xi32> 2026-02-21T08:24:48.0953593Z %47 = tt.splat %arg0 : !tt.ptr -> tensor<4x4x!tt.ptr> 2026-02-21T08:24:48.0953866Z %48 = tt.addptr %47, %46 : tensor<4x4x!tt.ptr>, tensor<4x4xi32> 2026-02-21T08:24:48.0954151Z %49 = tt.load %48 evictionPolicy = evict_last : tensor<4x4x!tt.ptr> 2026-02-21T08:24:48.0954497Z %50 = tt.descriptor_load %0[%21, %arg6] : !tt.tensordesc> -> 
tensor<4x4xf32> 2026-02-21T08:24:48.0954799Z %51 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:24:48.0955161Z %53 = tt.extern_elementwise %50 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:24:48.0955532Z %54 = arith.subf %50, %49 : tensor<4x4xf32> 2026-02-21T08:24:48.0955736Z %55 = arith.mulf %53, %54 : tensor<4x4xf32> 2026-02-21T08:24:48.0955953Z %56 = arith.addf %55, %cst_0 : tensor<4x4xf32> 2026-02-21T08:24:48.0956153Z scf.yield %56 : tensor<4x4xf32> 2026-02-21T08:24:48.0956334Z } else { 2026-02-21T08:24:48.0956503Z %53 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:24:48.0956717Z %54 = arith.cmpf ogt, %50, %53 : tensor<4x4xf32> 2026-02-21T08:24:48.0956940Z %55 = arith.cmpf une, %50, %50 : tensor<4x4xf32> 2026-02-21T08:24:48.0957146Z %56 = arith.ori %54, %55 : tensor<4x4xi1> 2026-02-21T08:24:48.0957386Z %57 = arith.select %56, %50, %53 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:24:48.0957619Z %58 = math.log %57 : tensor<4x4xf32> 2026-02-21T08:24:48.0957821Z %59 = arith.subf %58, %49 : tensor<4x4xf32> 2026-02-21T08:24:48.0958023Z %60 = arith.mulf %50, %59 : tensor<4x4xf32> 2026-02-21T08:24:48.0958226Z %61 = arith.addf %60, %cst_0 : tensor<4x4xf32> 2026-02-21T08:24:48.0958427Z scf.yield %61 : tensor<4x4xf32> 2026-02-21T08:24:48.0958593Z } 2026-02-21T08:24:48.0958745Z %52 = arith.addf %arg7, %51 : tensor<4x4xf32> 2026-02-21T08:24:48.0958938Z scf.yield %52 : tensor<4x4xf32> 2026-02-21T08:24:48.0959197Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:24:48.0959475Z %26 = "tt.reduce"(%25) <{axis = 1 : i32}> ({ 2026-02-21T08:24:48.0959667Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:48.0959852Z %39 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:24:48.0960110Z tt.reduce.return %39 : f32 2026-02-21T08:24:48.0960312Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:24:48.0960529Z %27 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 
2026-02-21T08:24:48.0960785Z %28 = tt.addptr %27, %24 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:24:48.0961010Z tt.store %28, %26 : tensor<4x!tt.ptr> 2026-02-21T08:24:48.0961208Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:24:48.0961396Z %29 = arith.muli %c2368_i32, %c2_i32 : i32 2026-02-21T08:24:48.0961578Z %30 = arith.addi %arg5, %29 : i32 2026-02-21T08:24:48.0961757Z %31 = arith.muli %30, %c4_i32 : i32 2026-02-21T08:24:48.0962004Z %32 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:24:48.0962235Z %33 = tt.splat %31 : i32 -> tensor<4xi32> 2026-02-21T08:24:48.0962416Z %34 = arith.addi %33, %32 : tensor<4xi32> 2026-02-21T08:24:48.0962772Z %35 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c4_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:24:48.0963086Z %39 = tt.splat %arg6 : i32 -> tensor<4xi32> 2026-02-21T08:24:48.0963279Z %40 = arith.addi %39, %32 : tensor<4xi32> 2026-02-21T08:24:48.0963519Z %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:24:48.0963765Z %42 = arith.muli %41, %cst : tensor<4x1xi32> 2026-02-21T08:24:48.0964007Z %43 = tt.expand_dims %40 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32> 2026-02-21T08:24:48.0964269Z %44 = tt.broadcast %42 : tensor<4x1xi32> -> tensor<4x4xi32> 2026-02-21T08:24:48.0964513Z %45 = tt.broadcast %43 : tensor<1x4xi32> -> tensor<4x4xi32> 2026-02-21T08:24:48.0964730Z %46 = arith.addi %44, %45 : tensor<4x4xi32> 2026-02-21T08:24:48.0964951Z %47 = tt.splat %arg0 : !tt.ptr -> tensor<4x4x!tt.ptr> 2026-02-21T08:24:48.0965212Z %48 = tt.addptr %47, %46 : tensor<4x4x!tt.ptr>, tensor<4x4xi32> 2026-02-21T08:24:48.0965486Z %49 = tt.load %48 evictionPolicy = evict_last : tensor<4x4x!tt.ptr> 2026-02-21T08:24:48.0965813Z %50 = tt.descriptor_load %0[%31, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:24:48.0966090Z %51 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:24:48.0966438Z %53 = tt.extern_elementwise %50 {libname 
= "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:24:48.0966791Z %54 = arith.subf %50, %49 : tensor<4x4xf32> 2026-02-21T08:24:48.0966981Z %55 = arith.mulf %53, %54 : tensor<4x4xf32> 2026-02-21T08:24:48.0967189Z %56 = arith.addf %55, %cst_0 : tensor<4x4xf32> 2026-02-21T08:24:48.0967378Z scf.yield %56 : tensor<4x4xf32> 2026-02-21T08:24:48.0967552Z } else { 2026-02-21T08:24:48.0967710Z %53 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:24:48.0967925Z %54 = arith.cmpf ogt, %50, %53 : tensor<4x4xf32> 2026-02-21T08:24:48.0968140Z %55 = arith.cmpf une, %50, %50 : tensor<4x4xf32> 2026-02-21T08:24:48.0968334Z %56 = arith.ori %54, %55 : tensor<4x4xi1> 2026-02-21T08:24:48.0968587Z %57 = arith.select %56, %50, %53 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:24:48.0968819Z %58 = math.log %57 : tensor<4x4xf32> 2026-02-21T08:24:48.0969005Z %59 = arith.subf %58, %49 : tensor<4x4xf32> 2026-02-21T08:24:48.0969199Z %60 = arith.mulf %50, %59 : tensor<4x4xf32> 2026-02-21T08:24:48.0969399Z %61 = arith.addf %60, %cst_0 : tensor<4x4xf32> 2026-02-21T08:24:48.0969586Z scf.yield %61 : tensor<4x4xf32> 2026-02-21T08:24:48.0969751Z } 2026-02-21T08:24:48.0969888Z %52 = arith.addf %arg7, %51 : tensor<4x4xf32> 2026-02-21T08:24:48.0970076Z scf.yield %52 : tensor<4x4xf32> 2026-02-21T08:24:48.0970316Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:24:48.0970636Z %36 = "tt.reduce"(%35) <{axis = 1 : i32}> ({ 2026-02-21T08:24:48.0970817Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:48.0970993Z %39 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:24:48.0971174Z tt.reduce.return %39 : f32 2026-02-21T08:24:48.0971370Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:24:48.0971586Z %37 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:24:48.0971838Z %38 = tt.addptr %37, %34 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:24:48.0972091Z tt.store %38, %36 : tensor<4x!tt.ptr> 
2026-02-21T08:24:48.0972282Z } {tt.num_stages = 1 : i32} 2026-02-21T08:24:48.0972479Z scf.for %arg5 = %9 to %c1024_i32 step %c2368_i32 : i32 { 2026-02-21T08:24:48.0972697Z %11 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:24:48.0972915Z %12 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:24:48.0973209Z %13 = tt.splat %11 : i32 -> tensor<4xi32> 2026-02-21T08:24:48.0973409Z %14 = arith.addi %13, %12 : tensor<4xi32> 2026-02-21T08:24:48.0973697Z %15 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c4_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x4xf32>) : i32 { 2026-02-21T08:24:48.0974005Z %19 = tt.splat %arg6 : i32 -> tensor<4xi32> 2026-02-21T08:24:48.0974194Z %20 = arith.addi %19, %12 : tensor<4xi32> 2026-02-21T08:24:48.0974433Z %21 = tt.expand_dims %14 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:24:48.0974677Z %22 = arith.muli %21, %cst : tensor<4x1xi32> 2026-02-21T08:24:48.0974917Z %23 = tt.expand_dims %20 {axis = 0 : i32} : tensor<4xi32> -> tensor<1x4xi32> 2026-02-21T08:24:48.0975196Z %24 = tt.broadcast %22 : tensor<4x1xi32> -> tensor<4x4xi32> 2026-02-21T08:24:48.0975437Z %25 = tt.broadcast %23 : tensor<1x4xi32> -> tensor<4x4xi32> 2026-02-21T08:24:48.0975660Z %26 = arith.addi %24, %25 : tensor<4x4xi32> 2026-02-21T08:24:48.0975887Z %27 = tt.splat %arg0 : !tt.ptr -> tensor<4x4x!tt.ptr> 2026-02-21T08:24:48.0976151Z %28 = tt.addptr %27, %26 : tensor<4x4x!tt.ptr>, tensor<4x4xi32> 2026-02-21T08:24:48.0976419Z %29 = tt.load %28 evictionPolicy = evict_last : tensor<4x4x!tt.ptr> 2026-02-21T08:24:48.0976739Z %30 = tt.descriptor_load %0[%11, %arg6] : !tt.tensordesc> -> tensor<4x4xf32> 2026-02-21T08:24:48.0977021Z %31 = scf.if %arg3 -> (tensor<4x4xf32>) { 2026-02-21T08:24:48.0977358Z %33 = tt.extern_elementwise %30 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4xf32>) -> tensor<4x4xf32> 2026-02-21T08:24:48.0977724Z %34 = arith.subf %30, %29 : tensor<4x4xf32> 2026-02-21T08:24:48.0977919Z %35 = 
arith.mulf %33, %34 : tensor<4x4xf32> 2026-02-21T08:24:48.0978122Z %36 = arith.addf %35, %cst_0 : tensor<4x4xf32> 2026-02-21T08:24:48.0978317Z scf.yield %36 : tensor<4x4xf32> 2026-02-21T08:24:48.0978529Z } else { 2026-02-21T08:24:48.0978724Z %33 = tt.splat %arg4 : f32 -> tensor<4x4xf32> 2026-02-21T08:24:48.0978939Z %34 = arith.cmpf ogt, %30, %33 : tensor<4x4xf32> 2026-02-21T08:24:48.0979145Z %35 = arith.cmpf une, %30, %30 : tensor<4x4xf32> 2026-02-21T08:24:48.0979347Z %36 = arith.ori %34, %35 : tensor<4x4xi1> 2026-02-21T08:24:48.0979568Z %37 = arith.select %36, %30, %33 : tensor<4x4xi1>, tensor<4x4xf32> 2026-02-21T08:24:48.0979798Z %38 = math.log %37 : tensor<4x4xf32> 2026-02-21T08:24:48.0979982Z %39 = arith.subf %38, %29 : tensor<4x4xf32> 2026-02-21T08:24:48.0980175Z %40 = arith.mulf %30, %39 : tensor<4x4xf32> 2026-02-21T08:24:48.0980374Z %41 = arith.addf %40, %cst_0 : tensor<4x4xf32> 2026-02-21T08:24:48.0980562Z scf.yield %41 : tensor<4x4xf32> 2026-02-21T08:24:48.0980729Z } 2026-02-21T08:24:48.0980864Z %32 = arith.addf %arg7, %31 : tensor<4x4xf32> 2026-02-21T08:24:48.0981057Z scf.yield %32 : tensor<4x4xf32> 2026-02-21T08:24:48.0981354Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:24:48.0981618Z %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({ 2026-02-21T08:24:48.0981799Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:48.0982007Z %19 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:24:48.0982193Z tt.reduce.return %19 : f32 2026-02-21T08:24:48.0982372Z }) : (tensor<4x4xf32>) -> tensor<4xf32> 2026-02-21T08:24:48.0982595Z %17 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:24:48.0982845Z %18 = tt.addptr %17, %14 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:24:48.0983078Z tt.store %18, %16 : tensor<4x!tt.ptr> 2026-02-21T08:24:48.0983262Z } {tt.num_stages = 1 : i32} 2026-02-21T08:24:48.0983426Z tt.return 2026-02-21T08:24:48.0983552Z } 2026-02-21T08:24:48.0983677Z } 2026-02-21T08:24:48.0983746Z 
2026-02-21T08:24:48.0983808Z {-# 2026-02-21T08:24:48.0984013Z external_resources: { 2026-02-21T08:24:48.0984177Z mlir_reproducer: { 2026-02-21T08:24:48.0988438Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:24:48.0992813Z disable_threading: false, 
2026-02-21T08:24:48.0992972Z verify_each: true 2026-02-21T08:24:48.0993116Z } 2026-02-21T08:24:48.0993230Z } 2026-02-21T08:24:48.0993349Z #-} 2026-02-21T08:24:48.0993754Z /tmp/torchinductor_root/7g/c7gjsy6mlluox5unwj6ock57hszytyp5zlzgitn3szsuyjxwnppz.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:24:48.0994963Z /tmp/torchinductor_root/7g/c7gjsy6mlluox5unwj6ock57hszytyp5zlzgitn3szsuyjxwnppz.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:24:48.0995945Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:24:48.0997040Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 4], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[3, 2], range_unroll_factors=[3, 1], range_warp_specializes=[False, True]), static_shapes=True) 2026-02-21T08:24:48.0998085Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:24:48.0998351Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:24:48.3769190Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:24:48.3772801Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:24:48.3773753Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:24:48.3773950Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:24:48.3774146Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:24:48.3774370Z %c592_i32 = arith.constant 592 : i32 2026-02-21T08:24:48.3774929Z %cst = arith.constant dense<8192> : tensor<32x1xi32> 2026-02-21T08:24:48.3775214Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<32x8xf32> 2026-02-21T08:24:48.3775432Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:24:48.3775618Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:24:48.3775797Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:24:48.3775975Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:24:48.3776155Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:24:48.3776456Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:24:48.3776772Z %1 = tt.get_program_id x : i32 2026-02-21T08:24:48.3776975Z scf.for %arg5 = %1 to %c128_i32 step %c592_i32 : i32 { 2026-02-21T08:24:48.3777192Z %2 = arith.muli %arg5, %c32_i32 : i32 2026-02-21T08:24:48.3777416Z %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:24:48.3777671Z %4 = tt.splat %2 : i32 -> tensor<32xi32> 2026-02-21T08:24:48.3777869Z %5 = arith.addi %4, %3 : tensor<32xi32> 2026-02-21T08:24:48.3778047Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:24:48.3778352Z %6 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c16_i32 iter_args(%arg7 = %cst_0) -> (tensor<32x8xf32>) : i32 { 2026-02-21T08:24:48.3778696Z %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:24:48.3778942Z %11 
= tt.splat %arg6 : i32 -> tensor<8xi32> 2026-02-21T08:24:48.3779139Z %12 = arith.addi %11, %10 : tensor<8xi32> 2026-02-21T08:24:48.3779390Z %13 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:24:48.3779653Z %14 = arith.muli %13, %cst : tensor<32x1xi32> 2026-02-21T08:24:48.3779897Z %15 = tt.expand_dims %12 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:24:48.3780180Z %16 = tt.broadcast %14 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:24:48.3780431Z %17 = tt.broadcast %15 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:24:48.3780675Z %18 = arith.addi %16, %17 : tensor<32x8xi32> 2026-02-21T08:24:48.3780902Z %19 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:24:48.3781178Z %20 = tt.addptr %19, %18 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:24:48.3781466Z %21 = tt.load %20 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 2026-02-21T08:24:48.3781792Z %22 = tt.descriptor_load %0[%2, %arg6] : !tt.tensordesc> -> tensor<32x8xf32> 2026-02-21T08:24:48.3782298Z %23 = scf.if %arg3 -> (tensor<32x8xf32>) { 2026-02-21T08:24:48.3782653Z %42 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:24:48.3783025Z %43 = arith.subf %22, %21 : tensor<32x8xf32> 2026-02-21T08:24:48.3783235Z %44 = arith.mulf %42, %43 : tensor<32x8xf32> 2026-02-21T08:24:48.3783574Z %45 = arith.addf %44, %cst_0 : tensor<32x8xf32> 2026-02-21T08:24:48.3783775Z scf.yield %45 : tensor<32x8xf32> 2026-02-21T08:24:48.3783941Z } else { 2026-02-21T08:24:48.3784129Z %42 = tt.splat %arg4 : f32 -> tensor<32x8xf32> 2026-02-21T08:24:48.3784345Z %43 = arith.cmpf ogt, %22, %42 : tensor<32x8xf32> 2026-02-21T08:24:48.3784571Z %44 = arith.cmpf une, %22, %22 : tensor<32x8xf32> 2026-02-21T08:24:48.3784785Z %45 = arith.ori %43, %44 : tensor<32x8xi1> 2026-02-21T08:24:48.3785021Z %46 = arith.select %45, %22, %42 : tensor<32x8xi1>, 
tensor<32x8xf32> 2026-02-21T08:24:48.3785266Z %47 = math.log %46 : tensor<32x8xf32> 2026-02-21T08:24:48.3785458Z %48 = arith.subf %47, %21 : tensor<32x8xf32> 2026-02-21T08:24:48.3785662Z %49 = arith.mulf %22, %48 : tensor<32x8xf32> 2026-02-21T08:24:48.3785863Z %50 = arith.addf %49, %cst_0 : tensor<32x8xf32> 2026-02-21T08:24:48.3786128Z scf.yield %50 : tensor<32x8xf32> 2026-02-21T08:24:48.3786300Z } 2026-02-21T08:24:48.3786453Z %24 = arith.addf %arg7, %23 : tensor<32x8xf32> 2026-02-21T08:24:48.3786655Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:24:48.3786845Z %25 = arith.muli %c8_i32, %c1_i32 : i32 2026-02-21T08:24:48.3787035Z %26 = arith.addi %arg6, %25 : i32 2026-02-21T08:24:48.3787256Z %27 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:24:48.3787502Z %28 = tt.splat %26 : i32 -> tensor<8xi32> 2026-02-21T08:24:48.3787699Z %29 = arith.addi %28, %27 : tensor<8xi32> 2026-02-21T08:24:48.3787947Z %30 = tt.expand_dims %5 {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> 2026-02-21T08:24:48.3788213Z %31 = arith.muli %30, %cst : tensor<32x1xi32> 2026-02-21T08:24:48.3788460Z %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:24:48.3788747Z %33 = tt.broadcast %31 : tensor<32x1xi32> -> tensor<32x8xi32> 2026-02-21T08:24:48.3789004Z %34 = tt.broadcast %32 : tensor<1x8xi32> -> tensor<32x8xi32> 2026-02-21T08:24:48.3789246Z %35 = arith.addi %33, %34 : tensor<32x8xi32> 2026-02-21T08:24:48.3789468Z %36 = tt.splat %arg0 : !tt.ptr -> tensor<32x8x!tt.ptr> 2026-02-21T08:24:48.3789734Z %37 = tt.addptr %36, %35 : tensor<32x8x!tt.ptr>, tensor<32x8xi32> 2026-02-21T08:24:48.3790021Z %38 = tt.load %37 evictionPolicy = evict_first : tensor<32x8x!tt.ptr> 2026-02-21T08:24:48.3790346Z %39 = tt.descriptor_load %0[%2, %26] : !tt.tensordesc> -> tensor<32x8xf32> 2026-02-21T08:24:48.3790634Z %40 = scf.if %arg3 -> (tensor<32x8xf32>) { 2026-02-21T08:24:48.3790979Z %42 = tt.extern_elementwise %39 {libname = "", libpath = "", 
pure = true, symbol = "__nv_expf"} : (tensor<32x8xf32>) -> tensor<32x8xf32> 2026-02-21T08:24:48.3791343Z %43 = arith.subf %39, %38 : tensor<32x8xf32> 2026-02-21T08:24:48.3791549Z %44 = arith.mulf %42, %43 : tensor<32x8xf32> 2026-02-21T08:24:48.3791755Z %45 = arith.addf %44, %cst_0 : tensor<32x8xf32> 2026-02-21T08:24:48.3791986Z scf.yield %45 : tensor<32x8xf32> 2026-02-21T08:24:48.3792150Z } else { 2026-02-21T08:24:48.3792314Z %42 = tt.splat %arg4 : f32 -> tensor<32x8xf32> 2026-02-21T08:24:48.3792522Z %43 = arith.cmpf ogt, %39, %42 : tensor<32x8xf32> 2026-02-21T08:24:48.3792743Z %44 = arith.cmpf une, %39, %39 : tensor<32x8xf32> 2026-02-21T08:24:48.3792949Z %45 = arith.ori %43, %44 : tensor<32x8xi1> 2026-02-21T08:24:48.3793195Z %46 = arith.select %45, %39, %42 : tensor<32x8xi1>, tensor<32x8xf32> 2026-02-21T08:24:48.3793440Z %47 = math.log %46 : tensor<32x8xf32> 2026-02-21T08:24:48.3793634Z %48 = arith.subf %47, %38 : tensor<32x8xf32> 2026-02-21T08:24:48.3793845Z %49 = arith.mulf %39, %48 : tensor<32x8xf32> 2026-02-21T08:24:48.3794048Z %50 = arith.addf %49, %cst_0 : tensor<32x8xf32> 2026-02-21T08:24:48.3794307Z scf.yield %50 : tensor<32x8xf32> 2026-02-21T08:24:48.3794466Z } 2026-02-21T08:24:48.3794609Z %41 = arith.addf %24, %40 : tensor<32x8xf32> 2026-02-21T08:24:48.3794796Z scf.yield %41 : tensor<32x8xf32> 2026-02-21T08:24:48.3795003Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:24:48.3795227Z %7 = "tt.reduce"(%6) <{axis = 1 : i32}> ({ 2026-02-21T08:24:48.3795405Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:48.3795580Z %10 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:24:48.3795756Z tt.reduce.return %10 : f32 2026-02-21T08:24:48.3795941Z }) : (tensor<32x8xf32>) -> tensor<32xf32> 2026-02-21T08:24:48.3796156Z %8 = tt.splat %arg2 : !tt.ptr -> tensor<32x!tt.ptr> 2026-02-21T08:24:48.3796413Z %9 = tt.addptr %8, %5 : tensor<32x!tt.ptr>, tensor<32xi32> 2026-02-21T08:24:48.3796694Z tt.store %9, %7 : tensor<32x!tt.ptr> 2026-02-21T08:24:48.3796888Z 
} {tt.flatten, tt.warp_specialize} 2026-02-21T08:24:48.3797060Z tt.return 2026-02-21T08:24:48.3797179Z } 2026-02-21T08:24:48.3797298Z } 2026-02-21T08:24:48.3797363Z 2026-02-21T08:24:48.3797411Z {-# 2026-02-21T08:24:48.3797572Z external_resources: { 2026-02-21T08:24:48.3797721Z mlir_reproducer: { 2026-02-21T08:24:48.3801932Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ 
max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:24:48.3806268Z disable_threading: false, 2026-02-21T08:24:48.3806425Z verify_each: true 2026-02-21T08:24:48.3806573Z } 2026-02-21T08:24:48.3806686Z } 2026-02-21T08:24:48.3806800Z #-} 2026-02-21T08:24:48.3807212Z /tmp/torchinductor_root/zc/czcgaxxqjni443xocez7ve6cni2qufwsdli2zj3hwqhryjh4lgxx.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:24:48.3808384Z /tmp/torchinductor_root/zc/czcgaxxqjni443xocez7ve6cni2qufwsdli2zj3hwqhryjh4lgxx.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:24:48.3809336Z [38s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:24:48.3810456Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 32], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', ''], maxnreg=32, num_sm_multiplier=4, num_stages=5, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:24:48.3811409Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:24:48.3811657Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:24:49.2414568Z module { 2026-02-21T08:24:49.2416155Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:24:49.2416732Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:24:49.2417244Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:24:49.2417505Z %cst = arith.constant dense<0.000000e+00> : tensor<1024x8xf32> 2026-02-21T08:24:49.2417732Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:24:49.2417923Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:24:49.2418102Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:24:49.2418275Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:24:49.2418452Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:24:49.2418757Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:24:49.2419185Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:24:49.2419485Z %2 = tt.get_program_id x : i32 2026-02-21T08:24:49.2419674Z %3 = arith.muli %2, %c1024_i32 : i32 2026-02-21T08:24:49.2419911Z %4 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T08:24:49.2420167Z %5 = tt.splat %3 : i32 -> tensor<1024xi32> 2026-02-21T08:24:49.2420359Z %6 = arith.addi %5, %4 : tensor<1024xi32> 2026-02-21T08:24:49.2420661Z %7 = scf.for %arg5 = %c0_i32 to %c8192_i32 step %c8_i32 iter_args(%arg6 = %cst) -> (tensor<1024x8xf32>) : i32 { 2026-02-21T08:24:49.2421058Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc> -> tensor<1024x8xf32> 2026-02-21T08:24:49.2421424Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc> -> tensor<1024x8xf32> 2026-02-21T08:24:49.2421703Z %13 = scf.if %arg3 -> (tensor<1024x8xf32>) { 2026-02-21T08:24:49.2422159Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, 
symbol = "__nv_expf"} : (tensor<1024x8xf32>) -> tensor<1024x8xf32> 2026-02-21T08:24:49.2422518Z %16 = arith.subf %12, %11 : tensor<1024x8xf32> 2026-02-21T08:24:49.2422727Z %17 = arith.mulf %15, %16 : tensor<1024x8xf32> 2026-02-21T08:24:49.2422935Z %18 = arith.addf %17, %cst : tensor<1024x8xf32> 2026-02-21T08:24:49.2423143Z scf.yield %18 : tensor<1024x8xf32> 2026-02-21T08:24:49.2423315Z } else { 2026-02-21T08:24:49.2423472Z %15 = tt.splat %arg4 : f32 -> tensor<1024x8xf32> 2026-02-21T08:24:49.2423696Z %16 = arith.cmpf ogt, %12, %15 : tensor<1024x8xf32> 2026-02-21T08:24:49.2423913Z %17 = arith.cmpf une, %12, %12 : tensor<1024x8xf32> 2026-02-21T08:24:49.2424124Z %18 = arith.ori %16, %17 : tensor<1024x8xi1> 2026-02-21T08:24:49.2424360Z %19 = arith.select %18, %12, %15 : tensor<1024x8xi1>, tensor<1024x8xf32> 2026-02-21T08:24:49.2424601Z %20 = math.log %19 : tensor<1024x8xf32> 2026-02-21T08:24:49.2424797Z %21 = arith.subf %20, %11 : tensor<1024x8xf32> 2026-02-21T08:24:49.2424991Z %22 = arith.mulf %12, %21 : tensor<1024x8xf32> 2026-02-21T08:24:49.2425193Z %23 = arith.addf %22, %cst : tensor<1024x8xf32> 2026-02-21T08:24:49.2425383Z scf.yield %23 : tensor<1024x8xf32> 2026-02-21T08:24:49.2425737Z } 2026-02-21T08:24:49.2425880Z %14 = arith.addf %arg6, %13 : tensor<1024x8xf32> 2026-02-21T08:24:49.2426078Z scf.yield %14 : tensor<1024x8xf32> 2026-02-21T08:24:49.2426353Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:24:49.2426614Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:24:49.2426805Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:24:49.2426972Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:24:49.2427157Z tt.reduce.return %11 : f32 2026-02-21T08:24:49.2427333Z }) : (tensor<1024x8xf32>) -> tensor<1024xf32> 2026-02-21T08:24:49.2427568Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<1024x!tt.ptr> 2026-02-21T08:24:49.2427822Z %10 = tt.addptr %9, %6 : tensor<1024x!tt.ptr>, tensor<1024xi32> 2026-02-21T08:24:49.2428061Z tt.store 
%10, %8 : tensor<1024x!tt.ptr> 2026-02-21T08:24:49.2428242Z tt.return 2026-02-21T08:24:49.2428429Z } 2026-02-21T08:24:49.2428557Z } 2026-02-21T08:24:49.2428621Z 2026-02-21T08:24:49.2428669Z {-# 2026-02-21T08:24:49.2428796Z external_resources: { 2026-02-21T08:24:49.2428944Z mlir_reproducer: { 2026-02-21T08:24:49.2433147Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ 
max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:24:49.2437615Z disable_threading: false, 2026-02-21T08:24:49.2437782Z verify_each: true 2026-02-21T08:24:49.2437924Z } 2026-02-21T08:24:49.2438036Z } 2026-02-21T08:24:49.2438152Z #-} 2026-02-21T08:24:49.2438565Z /tmp/torchinductor_root/nr/cnr7f25gh6e2u5sisdbfibfm75rhxvditarqdcbkxerpiisvqpft.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:24:49.2439759Z /tmp/torchinductor_root/nr/cnr7f25gh6e2u5sisdbfibfm75rhxvditarqdcbkxerpiisvqpft.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:24:49.2440755Z [38s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:24:49.2441723Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 1024], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=4, num_warps=32, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:24:49.2442671Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:24:49.2442918Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:24:49.7262703Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T08:24:49.7266589Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:24:49.7270818Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:24:49.7272693Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:24:49.7272914Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:24:49.7273412Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:24:49.7273651Z %cst = arith.constant dense<8192> : tensor<2048x1xi32> 2026-02-21T08:24:49.7273916Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<2048x8xf32> 2026-02-21T08:24:49.7274143Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T08:24:49.7274337Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:24:49.7274522Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:24:49.7274698Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:24:49.7274878Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:24:49.7275185Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:24:49.7275509Z %1 = tt.get_program_id x : i32 2026-02-21T08:24:49.7275680Z %2 = arith.addi %1, %c1_i32 : i32 2026-02-21T08:24:49.7275859Z %3 = arith.minsi %2, %c2_i32 : i32 2026-02-21T08:24:49.7276054Z scf.for %arg5 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T08:24:49.7276258Z %4 = arith.muli %arg5, %c2048_i32 : i32 2026-02-21T08:24:49.7276509Z %5 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32> 2026-02-21T08:24:49.7276758Z %6 = tt.splat %4 : i32 -> tensor<2048xi32> 2026-02-21T08:24:49.7276953Z %7 = arith.addi %6, %5 : tensor<2048xi32> 2026-02-21T08:24:49.7277140Z %c8184_i32 = arith.constant 8184 : i32 2026-02-21T08:24:49.7277333Z %c24_i32 = arith.constant 24 : i32 2026-02-21T08:24:49.7277635Z %8 = scf.for %arg6 = %c0_i32 to %c8184_i32 step 
%c24_i32 iter_args(%arg7 = %cst_0) -> (tensor<2048x8xf32>) : i32 { 2026-02-21T08:24:49.7277988Z %27 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:24:49.7278234Z %28 = tt.splat %arg6 : i32 -> tensor<8xi32> 2026-02-21T08:24:49.7278430Z %29 = arith.addi %28, %27 : tensor<8xi32> 2026-02-21T08:24:49.7278720Z %30 = tt.expand_dims %7 {axis = 1 : i32} : tensor<2048xi32> -> tensor<2048x1xi32> 2026-02-21T08:24:49.7278992Z %31 = arith.muli %30, %cst : tensor<2048x1xi32> 2026-02-21T08:24:49.7279242Z %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:24:49.7279527Z %33 = tt.broadcast %31 : tensor<2048x1xi32> -> tensor<2048x8xi32> 2026-02-21T08:24:49.7279786Z %34 = tt.broadcast %32 : tensor<1x8xi32> -> tensor<2048x8xi32> 2026-02-21T08:24:49.7280012Z %35 = arith.addi %33, %34 : tensor<2048x8xi32> 2026-02-21T08:24:49.7280247Z %36 = tt.splat %arg0 : !tt.ptr -> tensor<2048x8x!tt.ptr> 2026-02-21T08:24:49.7280518Z %37 = tt.addptr %36, %35 : tensor<2048x8x!tt.ptr>, tensor<2048x8xi32> 2026-02-21T08:24:49.7280816Z %38 = tt.load %37 evictionPolicy = evict_first : tensor<2048x8x!tt.ptr> 2026-02-21T08:24:49.7281156Z %39 = tt.descriptor_load %0[%4, %arg6] : !tt.tensordesc> -> tensor<2048x8xf32> 2026-02-21T08:24:49.7281446Z %40 = scf.if %arg3 -> (tensor<2048x8xf32>) { 2026-02-21T08:24:49.7281820Z %76 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<2048x8xf32>) -> tensor<2048x8xf32> 2026-02-21T08:24:49.7282409Z %77 = arith.subf %39, %38 : tensor<2048x8xf32> 2026-02-21T08:24:49.7282620Z %78 = arith.mulf %76, %77 : tensor<2048x8xf32> 2026-02-21T08:24:49.7282827Z %79 = arith.addf %78, %cst_0 : tensor<2048x8xf32> 2026-02-21T08:24:49.7283036Z scf.yield %79 : tensor<2048x8xf32> 2026-02-21T08:24:49.7283211Z } else { 2026-02-21T08:24:49.7283372Z %76 = tt.splat %arg4 : f32 -> tensor<2048x8xf32> 2026-02-21T08:24:49.7283595Z %77 = arith.cmpf ogt, %39, %76 : tensor<2048x8xf32> 
2026-02-21T08:24:49.7283810Z %78 = arith.cmpf une, %39, %39 : tensor<2048x8xf32> 2026-02-21T08:24:49.7284023Z %79 = arith.ori %77, %78 : tensor<2048x8xi1> 2026-02-21T08:24:49.7284254Z %80 = arith.select %79, %39, %76 : tensor<2048x8xi1>, tensor<2048x8xf32> 2026-02-21T08:24:49.7284575Z %81 = math.log %80 : tensor<2048x8xf32> 2026-02-21T08:24:49.7284782Z %82 = arith.subf %81, %38 : tensor<2048x8xf32> 2026-02-21T08:24:49.7284979Z %83 = arith.mulf %39, %82 : tensor<2048x8xf32> 2026-02-21T08:24:49.7285187Z %84 = arith.addf %83, %cst_0 : tensor<2048x8xf32> 2026-02-21T08:24:49.7285379Z scf.yield %84 : tensor<2048x8xf32> 2026-02-21T08:24:49.7285548Z } 2026-02-21T08:24:49.7285689Z %41 = arith.addf %arg7, %40 : tensor<2048x8xf32> 2026-02-21T08:24:49.7285891Z %c1_i32_1 = arith.constant 1 : i32 2026-02-21T08:24:49.7286085Z %42 = arith.muli %c8_i32, %c1_i32_1 : i32 2026-02-21T08:24:49.7286271Z %43 = arith.addi %arg6, %42 : i32 2026-02-21T08:24:49.7286505Z %44 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:24:49.7286739Z %45 = tt.splat %43 : i32 -> tensor<8xi32> 2026-02-21T08:24:49.7286929Z %46 = arith.addi %45, %44 : tensor<8xi32> 2026-02-21T08:24:49.7287170Z %47 = tt.expand_dims %7 {axis = 1 : i32} : tensor<2048xi32> -> tensor<2048x1xi32> 2026-02-21T08:24:49.7287437Z %48 = arith.muli %47, %cst : tensor<2048x1xi32> 2026-02-21T08:24:49.7287675Z %49 = tt.expand_dims %46 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:24:49.7287954Z %50 = tt.broadcast %48 : tensor<2048x1xi32> -> tensor<2048x8xi32> 2026-02-21T08:24:49.7288213Z %51 = tt.broadcast %49 : tensor<1x8xi32> -> tensor<2048x8xi32> 2026-02-21T08:24:49.7288434Z %52 = arith.addi %50, %51 : tensor<2048x8xi32> 2026-02-21T08:24:49.7288662Z %53 = tt.splat %arg0 : !tt.ptr -> tensor<2048x8x!tt.ptr> 2026-02-21T08:24:49.7288925Z %54 = tt.addptr %53, %52 : tensor<2048x8x!tt.ptr>, tensor<2048x8xi32> 2026-02-21T08:24:49.7289221Z %55 = tt.load %54 evictionPolicy = evict_first : 
tensor<2048x8x!tt.ptr> 2026-02-21T08:24:49.7289557Z %56 = tt.descriptor_load %0[%4, %43] : !tt.tensordesc> -> tensor<2048x8xf32> 2026-02-21T08:24:49.7289846Z %57 = scf.if %arg3 -> (tensor<2048x8xf32>) { 2026-02-21T08:24:49.7290207Z %76 = tt.extern_elementwise %56 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<2048x8xf32>) -> tensor<2048x8xf32> 2026-02-21T08:24:49.7290568Z %77 = arith.subf %56, %55 : tensor<2048x8xf32> 2026-02-21T08:24:49.7290775Z %78 = arith.mulf %76, %77 : tensor<2048x8xf32> 2026-02-21T08:24:49.7290976Z %79 = arith.addf %78, %cst_0 : tensor<2048x8xf32> 2026-02-21T08:24:49.7291182Z scf.yield %79 : tensor<2048x8xf32> 2026-02-21T08:24:49.7291354Z } else { 2026-02-21T08:24:49.7291508Z %76 = tt.splat %arg4 : f32 -> tensor<2048x8xf32> 2026-02-21T08:24:49.7291726Z %77 = arith.cmpf ogt, %56, %76 : tensor<2048x8xf32> 2026-02-21T08:24:49.7291976Z %78 = arith.cmpf une, %56, %56 : tensor<2048x8xf32> 2026-02-21T08:24:49.7292191Z %79 = arith.ori %77, %78 : tensor<2048x8xi1> 2026-02-21T08:24:49.7292485Z %80 = arith.select %79, %56, %76 : tensor<2048x8xi1>, tensor<2048x8xf32> 2026-02-21T08:24:49.7292726Z %81 = math.log %80 : tensor<2048x8xf32> 2026-02-21T08:24:49.7292923Z %82 = arith.subf %81, %55 : tensor<2048x8xf32> 2026-02-21T08:24:49.7293119Z %83 = arith.mulf %56, %82 : tensor<2048x8xf32> 2026-02-21T08:24:49.7293325Z %84 = arith.addf %83, %cst_0 : tensor<2048x8xf32> 2026-02-21T08:24:49.7293520Z scf.yield %84 : tensor<2048x8xf32> 2026-02-21T08:24:49.7293691Z } 2026-02-21T08:24:49.7293827Z %58 = arith.addf %41, %57 : tensor<2048x8xf32> 2026-02-21T08:24:49.7294021Z %c2_i32_2 = arith.constant 2 : i32 2026-02-21T08:24:49.7294210Z %59 = arith.muli %c8_i32, %c2_i32_2 : i32 2026-02-21T08:24:49.7294391Z %60 = arith.addi %arg6, %59 : i32 2026-02-21T08:24:49.7294611Z %61 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:24:49.7294922Z %62 = tt.splat %60 : i32 -> tensor<8xi32> 2026-02-21T08:24:49.7295123Z %63 = 
arith.addi %62, %61 : tensor<8xi32> 2026-02-21T08:24:49.7295369Z %64 = tt.expand_dims %7 {axis = 1 : i32} : tensor<2048xi32> -> tensor<2048x1xi32> 2026-02-21T08:24:49.7295639Z %65 = arith.muli %64, %cst : tensor<2048x1xi32> 2026-02-21T08:24:49.7295891Z %66 = tt.expand_dims %63 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:24:49.7296168Z %67 = tt.broadcast %65 : tensor<2048x1xi32> -> tensor<2048x8xi32> 2026-02-21T08:24:49.7296437Z %68 = tt.broadcast %66 : tensor<1x8xi32> -> tensor<2048x8xi32> 2026-02-21T08:24:49.7296696Z %69 = arith.addi %67, %68 : tensor<2048x8xi32> 2026-02-21T08:24:49.7296936Z %70 = tt.splat %arg0 : !tt.ptr -> tensor<2048x8x!tt.ptr> 2026-02-21T08:24:49.7297210Z %71 = tt.addptr %70, %69 : tensor<2048x8x!tt.ptr>, tensor<2048x8xi32> 2026-02-21T08:24:49.7297514Z %72 = tt.load %71 evictionPolicy = evict_first : tensor<2048x8x!tt.ptr> 2026-02-21T08:24:49.7297856Z %73 = tt.descriptor_load %0[%4, %60] : !tt.tensordesc> -> tensor<2048x8xf32> 2026-02-21T08:24:49.7298145Z %74 = scf.if %arg3 -> (tensor<2048x8xf32>) { 2026-02-21T08:24:49.7298508Z %76 = tt.extern_elementwise %73 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<2048x8xf32>) -> tensor<2048x8xf32> 2026-02-21T08:24:49.7298868Z %77 = arith.subf %73, %72 : tensor<2048x8xf32> 2026-02-21T08:24:49.7299079Z %78 = arith.mulf %76, %77 : tensor<2048x8xf32> 2026-02-21T08:24:49.7299288Z %79 = arith.addf %78, %cst_0 : tensor<2048x8xf32> 2026-02-21T08:24:49.7299497Z scf.yield %79 : tensor<2048x8xf32> 2026-02-21T08:24:49.7299670Z } else { 2026-02-21T08:24:49.7299830Z %76 = tt.splat %arg4 : f32 -> tensor<2048x8xf32> 2026-02-21T08:24:49.7300055Z %77 = arith.cmpf ogt, %73, %76 : tensor<2048x8xf32> 2026-02-21T08:24:49.7300274Z %78 = arith.cmpf une, %73, %73 : tensor<2048x8xf32> 2026-02-21T08:24:49.7300491Z %79 = arith.ori %77, %78 : tensor<2048x8xi1> 2026-02-21T08:24:49.7300727Z %80 = arith.select %79, %73, %76 : tensor<2048x8xi1>, tensor<2048x8xf32> 
2026-02-21T08:24:49.7300976Z %81 = math.log %80 : tensor<2048x8xf32> 2026-02-21T08:24:49.7301179Z %82 = arith.subf %81, %72 : tensor<2048x8xf32> 2026-02-21T08:24:49.7301377Z %83 = arith.mulf %73, %82 : tensor<2048x8xf32> 2026-02-21T08:24:49.7301591Z %84 = arith.addf %83, %cst_0 : tensor<2048x8xf32> 2026-02-21T08:24:49.7301789Z scf.yield %84 : tensor<2048x8xf32> 2026-02-21T08:24:49.7301984Z } 2026-02-21T08:24:49.7302120Z %75 = arith.addf %58, %74 : tensor<2048x8xf32> 2026-02-21T08:24:49.7302309Z scf.yield %75 : tensor<2048x8xf32> 2026-02-21T08:24:49.7302492Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:24:49.7302719Z %9 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:24:49.7303023Z %10 = tt.splat %c8184_i32 : i32 -> tensor<8xi32> 2026-02-21T08:24:49.7303214Z %11 = arith.addi %10, %9 : tensor<8xi32> 2026-02-21T08:24:49.7303459Z %12 = tt.expand_dims %7 {axis = 1 : i32} : tensor<2048xi32> -> tensor<2048x1xi32> 2026-02-21T08:24:49.7303717Z %13 = arith.muli %12, %cst : tensor<2048x1xi32> 2026-02-21T08:24:49.7303984Z %14 = tt.expand_dims %11 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:24:49.7304252Z %15 = tt.broadcast %13 : tensor<2048x1xi32> -> tensor<2048x8xi32> 2026-02-21T08:24:49.7304510Z %16 = tt.broadcast %14 : tensor<1x8xi32> -> tensor<2048x8xi32> 2026-02-21T08:24:49.7304759Z %17 = arith.addi %15, %16 : tensor<2048x8xi32> 2026-02-21T08:24:49.7304992Z %18 = tt.splat %arg0 : !tt.ptr -> tensor<2048x8x!tt.ptr> 2026-02-21T08:24:49.7305276Z %19 = tt.addptr %18, %17 : tensor<2048x8x!tt.ptr>, tensor<2048x8xi32> 2026-02-21T08:24:49.7305635Z %20 = tt.load %19 evictionPolicy = evict_first : tensor<2048x8x!tt.ptr> 2026-02-21T08:24:49.7306010Z %21 = tt.descriptor_load %0[%4, %c8184_i32] : !tt.tensordesc> -> tensor<2048x8xf32> 2026-02-21T08:24:49.7306330Z %22 = scf.if %arg3 -> (tensor<2048x8xf32>) { 2026-02-21T08:24:49.7306702Z %27 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = 
"__nv_expf"} : (tensor<2048x8xf32>) -> tensor<2048x8xf32> 2026-02-21T08:24:49.7307083Z %28 = arith.subf %21, %20 : tensor<2048x8xf32> 2026-02-21T08:24:49.7307289Z %29 = arith.mulf %27, %28 : tensor<2048x8xf32> 2026-02-21T08:24:49.7307509Z %30 = arith.addf %29, %cst_0 : tensor<2048x8xf32> 2026-02-21T08:24:49.7307712Z scf.yield %30 : tensor<2048x8xf32> 2026-02-21T08:24:49.7307891Z } else { 2026-02-21T08:24:49.7308060Z %27 = tt.splat %arg4 : f32 -> tensor<2048x8xf32> 2026-02-21T08:24:49.7308284Z %28 = arith.cmpf ogt, %21, %27 : tensor<2048x8xf32> 2026-02-21T08:24:49.7308527Z %29 = arith.cmpf une, %21, %21 : tensor<2048x8xf32> 2026-02-21T08:24:49.7308752Z %30 = arith.ori %28, %29 : tensor<2048x8xi1> 2026-02-21T08:24:49.7309005Z %31 = arith.select %30, %21, %27 : tensor<2048x8xi1>, tensor<2048x8xf32> 2026-02-21T08:24:49.7309251Z %32 = math.log %31 : tensor<2048x8xf32> 2026-02-21T08:24:49.7309456Z %33 = arith.subf %32, %20 : tensor<2048x8xf32> 2026-02-21T08:24:49.7309667Z %34 = arith.mulf %21, %33 : tensor<2048x8xf32> 2026-02-21T08:24:49.7309880Z %35 = arith.addf %34, %cst_0 : tensor<2048x8xf32> 2026-02-21T08:24:49.7310090Z scf.yield %35 : tensor<2048x8xf32> 2026-02-21T08:24:49.7310261Z } 2026-02-21T08:24:49.7310414Z %23 = arith.addf %8, %22 : tensor<2048x8xf32> 2026-02-21T08:24:49.7310616Z %24 = "tt.reduce"(%23) <{axis = 1 : i32}> ({ 2026-02-21T08:24:49.7310816Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:49.7310995Z %27 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:24:49.7311193Z tt.reduce.return %27 : f32 2026-02-21T08:24:49.7311392Z }) : (tensor<2048x8xf32>) -> tensor<2048xf32> 2026-02-21T08:24:49.7311631Z %25 = tt.splat %arg2 : !tt.ptr -> tensor<2048x!tt.ptr> 2026-02-21T08:24:49.7311936Z %26 = tt.addptr %25, %7 : tensor<2048x!tt.ptr>, tensor<2048xi32> 2026-02-21T08:24:49.7312185Z tt.store %26, %24 : tensor<2048x!tt.ptr> 2026-02-21T08:24:49.7312391Z } {tt.warp_specialize} 2026-02-21T08:24:49.7312545Z tt.return 2026-02-21T08:24:49.7312681Z } 
2026-02-21T08:24:49.7312800Z } 2026-02-21T08:24:49.7312876Z 2026-02-21T08:24:49.7312926Z {-# 2026-02-21T08:24:49.7313060Z external_resources: { 2026-02-21T08:24:49.7313218Z mlir_reproducer: { 2026-02-21T08:24:49.7317548Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 
2026-02-21T08:24:49.7321923Z disable_threading: false, 2026-02-21T08:24:49.7322093Z verify_each: true 2026-02-21T08:24:49.7322232Z } 2026-02-21T08:24:49.7322351Z } 2026-02-21T08:24:49.7322459Z #-} 2026-02-21T08:24:49.7322886Z /tmp/torchinductor_root/7q/c7qvgckpip6quizl7wwune2zn2qif7xj2v2ry4kmed7ypp353pjy.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:24:49.7324052Z /tmp/torchinductor_root/7q/c7qvgckpip6quizl7wwune2zn2qif7xj2v2ry4kmed7ypp353pjy.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:24:49.7325004Z [39s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:24:49.7326045Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 2048], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', ''], maxnreg=256, num_sm_multiplier=128, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:24:49.7326976Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:24:49.7327225Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:24:52.2024985Z module { 2026-02-21T08:24:52.2029646Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:24:52.2033725Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:24:52.2035184Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:24:52.2035401Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:24:52.2035584Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:24:52.2035793Z %cst = arith.constant dense<8192> : tensor<4x1xi32> 2026-02-21T08:24:52.2036051Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<4x32xf32> 2026-02-21T08:24:52.2036291Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:24:52.2036473Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:24:52.2036656Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T08:24:52.2036842Z %c8192_i64 = arith.constant 8192 : i64 2026-02-21T08:24:52.2037363Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:24:52.2037664Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c8192_i32], [%c8192_i64, %c1_i64] : , > 2026-02-21T08:24:52.2037982Z %1 = tt.get_program_id x : i32 2026-02-21T08:24:52.2038162Z %2 = arith.muli %1, %c4_i32 : i32 2026-02-21T08:24:52.2038329Z %3 = arith.addi %2, %c4_i32 : i32 2026-02-21T08:24:52.2038506Z %4 = arith.minsi %3, %c1024_i32 : i32 2026-02-21T08:24:52.2038678Z %5 = arith.subi %4, %2 : i32 2026-02-21T08:24:52.2038847Z %c1_i32_1 = arith.constant 1 : i32 2026-02-21T08:24:52.2039022Z %6 = arith.subi %c1_i32, %c1_i32_1 : i32 2026-02-21T08:24:52.2039198Z %7 = arith.addi %5, %6 : i32 2026-02-21T08:24:52.2039355Z %8 = arith.divui %7, %c1_i32 : i32 2026-02-21T08:24:52.2039527Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:24:52.2039695Z %9 = arith.remsi %8, %c3_i32 : i32 2026-02-21T08:24:52.2039855Z %10 = arith.subi %8, %9 : i32 2026-02-21T08:24:52.2040126Z %11 = arith.muli %10, %c1_i32 : i32 
2026-02-21T08:24:52.2040302Z %12 = arith.addi %2, %11 : i32 2026-02-21T08:24:52.2040478Z %13 = arith.muli %c1_i32, %c3_i32 : i32 2026-02-21T08:24:52.2040667Z scf.for %arg5 = %2 to %12 step %13 : i32 { 2026-02-21T08:24:52.2040865Z %14 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:24:52.2041083Z %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:24:52.2041334Z %16 = tt.splat %14 : i32 -> tensor<4xi32> 2026-02-21T08:24:52.2041526Z %17 = arith.addi %16, %15 : tensor<4xi32> 2026-02-21T08:24:52.2041831Z %18 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x32xf32>) : i32 { 2026-02-21T08:24:52.2042366Z %42 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:24:52.2042612Z %43 = tt.splat %arg6 : i32 -> tensor<32xi32> 2026-02-21T08:24:52.2042823Z %44 = arith.addi %43, %42 : tensor<32xi32> 2026-02-21T08:24:52.2043075Z %45 = tt.expand_dims %17 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:24:52.2043324Z %46 = arith.muli %45, %cst : tensor<4x1xi32> 2026-02-21T08:24:52.2043574Z %47 = tt.expand_dims %44 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:24:52.2043850Z %48 = tt.broadcast %46 : tensor<4x1xi32> -> tensor<4x32xi32> 2026-02-21T08:24:52.2044102Z %49 = tt.broadcast %47 : tensor<1x32xi32> -> tensor<4x32xi32> 2026-02-21T08:24:52.2044319Z %50 = arith.addi %48, %49 : tensor<4x32xi32> 2026-02-21T08:24:52.2044550Z %51 = tt.splat %arg0 : !tt.ptr -> tensor<4x32x!tt.ptr> 2026-02-21T08:24:52.2044816Z %52 = tt.addptr %51, %50 : tensor<4x32x!tt.ptr>, tensor<4x32xi32> 2026-02-21T08:24:52.2045096Z %53 = tt.load %52 evictionPolicy = evict_last : tensor<4x32x!tt.ptr> 2026-02-21T08:24:52.2045430Z %54 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<4x32xf32> 2026-02-21T08:24:52.2045712Z %55 = scf.if %arg3 -> (tensor<4x32xf32>) { 2026-02-21T08:24:52.2046074Z %57 = tt.extern_elementwise %54 {libname = "", libpath = "", pure = true, 
symbol = "__nv_expf"} : (tensor<4x32xf32>) -> tensor<4x32xf32> 2026-02-21T08:24:52.2046436Z %58 = arith.subf %54, %53 : tensor<4x32xf32> 2026-02-21T08:24:52.2046635Z %59 = arith.mulf %57, %58 : tensor<4x32xf32> 2026-02-21T08:24:52.2046845Z %60 = arith.addf %59, %cst_0 : tensor<4x32xf32> 2026-02-21T08:24:52.2047039Z scf.yield %60 : tensor<4x32xf32> 2026-02-21T08:24:52.2047208Z } else { 2026-02-21T08:24:52.2047412Z %57 = tt.splat %arg4 : f32 -> tensor<4x32xf32> 2026-02-21T08:24:52.2047623Z %58 = arith.cmpf ogt, %54, %57 : tensor<4x32xf32> 2026-02-21T08:24:52.2047841Z %59 = arith.cmpf une, %54, %54 : tensor<4x32xf32> 2026-02-21T08:24:52.2048047Z %60 = arith.ori %58, %59 : tensor<4x32xi1> 2026-02-21T08:24:52.2048278Z %61 = arith.select %60, %54, %57 : tensor<4x32xi1>, tensor<4x32xf32> 2026-02-21T08:24:52.2048608Z %62 = math.log %61 : tensor<4x32xf32> 2026-02-21T08:24:52.2048797Z %63 = arith.subf %62, %53 : tensor<4x32xf32> 2026-02-21T08:24:52.2048995Z %64 = arith.mulf %54, %63 : tensor<4x32xf32> 2026-02-21T08:24:52.2049195Z %65 = arith.addf %64, %cst_0 : tensor<4x32xf32> 2026-02-21T08:24:52.2049397Z scf.yield %65 : tensor<4x32xf32> 2026-02-21T08:24:52.2049567Z } 2026-02-21T08:24:52.2049709Z %56 = arith.addf %arg7, %55 : tensor<4x32xf32> 2026-02-21T08:24:52.2049903Z scf.yield %56 : tensor<4x32xf32> 2026-02-21T08:24:52.2050085Z } {tt.flatten, tt.warp_specialize} 2026-02-21T08:24:52.2050278Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:24:52.2050458Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:52.2050633Z %42 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:24:52.2050870Z tt.reduce.return %42 : f32 2026-02-21T08:24:52.2051058Z }) : (tensor<4x32xf32>) -> tensor<4xf32> 2026-02-21T08:24:52.2051279Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:24:52.2051526Z %21 = tt.addptr %20, %17 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:24:52.2051755Z tt.store %21, %19 : tensor<4x!tt.ptr> 2026-02-21T08:24:52.2051973Z %c1_i32_2 = arith.constant 1 : 
i32 2026-02-21T08:24:52.2052158Z %22 = arith.muli %c1_i32, %c1_i32_2 : i32 2026-02-21T08:24:52.2052336Z %23 = arith.addi %arg5, %22 : i32 2026-02-21T08:24:52.2052512Z %24 = arith.muli %23, %c4_i32 : i32 2026-02-21T08:24:52.2052721Z %25 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:24:52.2052957Z %26 = tt.splat %24 : i32 -> tensor<4xi32> 2026-02-21T08:24:52.2053147Z %27 = arith.addi %26, %25 : tensor<4xi32> 2026-02-21T08:24:52.2053445Z %28 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x32xf32>) : i32 { 2026-02-21T08:24:52.2053793Z %42 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:24:52.2054031Z %43 = tt.splat %arg6 : i32 -> tensor<32xi32> 2026-02-21T08:24:52.2054231Z %44 = arith.addi %43, %42 : tensor<32xi32> 2026-02-21T08:24:52.2054475Z %45 = tt.expand_dims %27 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:24:52.2054722Z %46 = arith.muli %45, %cst : tensor<4x1xi32> 2026-02-21T08:24:52.2054963Z %47 = tt.expand_dims %44 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:24:52.2055235Z %48 = tt.broadcast %46 : tensor<4x1xi32> -> tensor<4x32xi32> 2026-02-21T08:24:52.2055485Z %49 = tt.broadcast %47 : tensor<1x32xi32> -> tensor<4x32xi32> 2026-02-21T08:24:52.2055703Z %50 = arith.addi %48, %49 : tensor<4x32xi32> 2026-02-21T08:24:52.2055927Z %51 = tt.splat %arg0 : !tt.ptr -> tensor<4x32x!tt.ptr> 2026-02-21T08:24:52.2056192Z %52 = tt.addptr %51, %50 : tensor<4x32x!tt.ptr>, tensor<4x32xi32> 2026-02-21T08:24:52.2056471Z %53 = tt.load %52 evictionPolicy = evict_last : tensor<4x32x!tt.ptr> 2026-02-21T08:24:52.2056805Z %54 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc> -> tensor<4x32xf32> 2026-02-21T08:24:52.2057085Z %55 = scf.if %arg3 -> (tensor<4x32xf32>) { 2026-02-21T08:24:52.2057442Z %57 = tt.extern_elementwise %54 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x32xf32>) -> tensor<4x32xf32> 
2026-02-21T08:24:52.2057802Z %58 = arith.subf %54, %53 : tensor<4x32xf32> 2026-02-21T08:24:52.2058000Z %59 = arith.mulf %57, %58 : tensor<4x32xf32> 2026-02-21T08:24:52.2058216Z %60 = arith.addf %59, %cst_0 : tensor<4x32xf32> 2026-02-21T08:24:52.2058442Z scf.yield %60 : tensor<4x32xf32> 2026-02-21T08:24:52.2058619Z } else { 2026-02-21T08:24:52.2058786Z %57 = tt.splat %arg4 : f32 -> tensor<4x32xf32> 2026-02-21T08:24:52.2059080Z %58 = arith.cmpf ogt, %54, %57 : tensor<4x32xf32> 2026-02-21T08:24:52.2059299Z %59 = arith.cmpf une, %54, %54 : tensor<4x32xf32> 2026-02-21T08:24:52.2059516Z %60 = arith.ori %58, %59 : tensor<4x32xi1> 2026-02-21T08:24:52.2059761Z %61 = arith.select %60, %54, %57 : tensor<4x32xi1>, tensor<4x32xf32> 2026-02-21T08:24:52.2060002Z %62 = math.log %61 : tensor<4x32xf32> 2026-02-21T08:24:52.2060206Z %63 = arith.subf %62, %53 : tensor<4x32xf32> 2026-02-21T08:24:52.2060404Z %64 = arith.mulf %54, %63 : tensor<4x32xf32> 2026-02-21T08:24:52.2060619Z %65 = arith.addf %64, %cst_0 : tensor<4x32xf32> 2026-02-21T08:24:52.2060816Z scf.yield %65 : tensor<4x32xf32> 2026-02-21T08:24:52.2060996Z } 2026-02-21T08:24:52.2061147Z %56 = arith.addf %arg7, %55 : tensor<4x32xf32> 2026-02-21T08:24:52.2061338Z scf.yield %56 : tensor<4x32xf32> 2026-02-21T08:24:52.2061637Z } {tt.flatten, tt.warp_specialize} 2026-02-21T08:24:52.2061837Z %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({ 2026-02-21T08:24:52.2062068Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:52.2062245Z %42 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:24:52.2062438Z tt.reduce.return %42 : f32 2026-02-21T08:24:52.2062624Z }) : (tensor<4x32xf32>) -> tensor<4xf32> 2026-02-21T08:24:52.2062857Z %30 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:24:52.2063124Z %31 = tt.addptr %30, %27 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:24:52.2063356Z tt.store %31, %29 : tensor<4x!tt.ptr> 2026-02-21T08:24:52.2063558Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:24:52.2063741Z %32 = arith.muli %c1_i32, 
%c2_i32 : i32 2026-02-21T08:24:52.2063929Z %33 = arith.addi %arg5, %32 : i32 2026-02-21T08:24:52.2064104Z %34 = arith.muli %33, %c4_i32 : i32 2026-02-21T08:24:52.2064331Z %35 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:24:52.2064577Z %36 = tt.splat %34 : i32 -> tensor<4xi32> 2026-02-21T08:24:52.2064766Z %37 = arith.addi %36, %35 : tensor<4xi32> 2026-02-21T08:24:52.2065082Z %38 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x32xf32>) : i32 { 2026-02-21T08:24:52.2065435Z %42 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:24:52.2065689Z %43 = tt.splat %arg6 : i32 -> tensor<32xi32> 2026-02-21T08:24:52.2065926Z %44 = arith.addi %43, %42 : tensor<32xi32> 2026-02-21T08:24:52.2066167Z %45 = tt.expand_dims %37 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:24:52.2066420Z %46 = arith.muli %45, %cst : tensor<4x1xi32> 2026-02-21T08:24:52.2066656Z %47 = tt.expand_dims %44 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:24:52.2066936Z %48 = tt.broadcast %46 : tensor<4x1xi32> -> tensor<4x32xi32> 2026-02-21T08:24:52.2067178Z %49 = tt.broadcast %47 : tensor<1x32xi32> -> tensor<4x32xi32> 2026-02-21T08:24:52.2067406Z %50 = arith.addi %48, %49 : tensor<4x32xi32> 2026-02-21T08:24:52.2067621Z %51 = tt.splat %arg0 : !tt.ptr -> tensor<4x32x!tt.ptr> 2026-02-21T08:24:52.2067883Z %52 = tt.addptr %51, %50 : tensor<4x32x!tt.ptr>, tensor<4x32xi32> 2026-02-21T08:24:52.2068162Z %53 = tt.load %52 evictionPolicy = evict_last : tensor<4x32x!tt.ptr> 2026-02-21T08:24:52.2068481Z %54 = tt.descriptor_load %0[%34, %arg6] : !tt.tensordesc> -> tensor<4x32xf32> 2026-02-21T08:24:52.2068764Z %55 = scf.if %arg3 -> (tensor<4x32xf32>) { 2026-02-21T08:24:52.2069108Z %57 = tt.extern_elementwise %54 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x32xf32>) -> tensor<4x32xf32> 2026-02-21T08:24:52.2069465Z %58 = arith.subf %54, %53 : 
tensor<4x32xf32> 2026-02-21T08:24:52.2069672Z %59 = arith.mulf %57, %58 : tensor<4x32xf32> 2026-02-21T08:24:52.2069935Z %60 = arith.addf %59, %cst_0 : tensor<4x32xf32> 2026-02-21T08:24:52.2070134Z scf.yield %60 : tensor<4x32xf32> 2026-02-21T08:24:52.2070296Z } else { 2026-02-21T08:24:52.2070456Z %57 = tt.splat %arg4 : f32 -> tensor<4x32xf32> 2026-02-21T08:24:52.2070664Z %58 = arith.cmpf ogt, %54, %57 : tensor<4x32xf32> 2026-02-21T08:24:52.2070882Z %59 = arith.cmpf une, %54, %54 : tensor<4x32xf32> 2026-02-21T08:24:52.2071091Z %60 = arith.ori %58, %59 : tensor<4x32xi1> 2026-02-21T08:24:52.2071315Z %61 = arith.select %60, %54, %57 : tensor<4x32xi1>, tensor<4x32xf32> 2026-02-21T08:24:52.2071549Z %62 = math.log %61 : tensor<4x32xf32> 2026-02-21T08:24:52.2071735Z %63 = arith.subf %62, %53 : tensor<4x32xf32> 2026-02-21T08:24:52.2071955Z %64 = arith.mulf %54, %63 : tensor<4x32xf32> 2026-02-21T08:24:52.2072151Z %65 = arith.addf %64, %cst_0 : tensor<4x32xf32> 2026-02-21T08:24:52.2072418Z scf.yield %65 : tensor<4x32xf32> 2026-02-21T08:24:52.2072584Z } 2026-02-21T08:24:52.2072731Z %56 = arith.addf %arg7, %55 : tensor<4x32xf32> 2026-02-21T08:24:52.2072922Z scf.yield %56 : tensor<4x32xf32> 2026-02-21T08:24:52.2073100Z } {tt.flatten, tt.warp_specialize} 2026-02-21T08:24:52.2073294Z %39 = "tt.reduce"(%38) <{axis = 1 : i32}> ({ 2026-02-21T08:24:52.2073477Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:52.2073656Z %42 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:24:52.2073832Z tt.reduce.return %42 : f32 2026-02-21T08:24:52.2074016Z }) : (tensor<4x32xf32>) -> tensor<4xf32> 2026-02-21T08:24:52.2074231Z %40 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:24:52.2074488Z %41 = tt.addptr %40, %37 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:24:52.2074716Z tt.store %41, %39 : tensor<4x!tt.ptr> 2026-02-21T08:24:52.2074959Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:24:52.2075217Z scf.for %arg5 = %12 to %4 step %c1_i32 : i32 { 
2026-02-21T08:24:52.2075408Z %14 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:24:52.2075630Z %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:24:52.2075858Z %16 = tt.splat %14 : i32 -> tensor<4xi32> 2026-02-21T08:24:52.2076049Z %17 = arith.addi %16, %15 : tensor<4xi32> 2026-02-21T08:24:52.2076351Z %18 = scf.for %arg6 = %c0_i32 to %c8192_i32 step %c32_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x32xf32>) : i32 { 2026-02-21T08:24:52.2076692Z %22 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> 2026-02-21T08:24:52.2076934Z %23 = tt.splat %arg6 : i32 -> tensor<32xi32> 2026-02-21T08:24:52.2077124Z %24 = arith.addi %23, %22 : tensor<32xi32> 2026-02-21T08:24:52.2077365Z %25 = tt.expand_dims %17 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:24:52.2077617Z %26 = arith.muli %25, %cst : tensor<4x1xi32> 2026-02-21T08:24:52.2077851Z %27 = tt.expand_dims %24 {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> 2026-02-21T08:24:52.2078128Z %28 = tt.broadcast %26 : tensor<4x1xi32> -> tensor<4x32xi32> 2026-02-21T08:24:52.2078372Z %29 = tt.broadcast %27 : tensor<1x32xi32> -> tensor<4x32xi32> 2026-02-21T08:24:52.2078596Z %30 = arith.addi %28, %29 : tensor<4x32xi32> 2026-02-21T08:24:52.2078812Z %31 = tt.splat %arg0 : !tt.ptr -> tensor<4x32x!tt.ptr> 2026-02-21T08:24:52.2079076Z %32 = tt.addptr %31, %30 : tensor<4x32x!tt.ptr>, tensor<4x32xi32> 2026-02-21T08:24:52.2079361Z %33 = tt.load %32 evictionPolicy = evict_last : tensor<4x32x!tt.ptr> 2026-02-21T08:24:52.2079683Z %34 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<4x32xf32> 2026-02-21T08:24:52.2079970Z %35 = scf.if %arg3 -> (tensor<4x32xf32>) { 2026-02-21T08:24:52.2080318Z %37 = tt.extern_elementwise %34 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x32xf32>) -> tensor<4x32xf32> 2026-02-21T08:24:52.2080736Z %38 = arith.subf %34, %33 : tensor<4x32xf32> 2026-02-21T08:24:52.2080931Z %39 = arith.mulf %37, %38 : 
tensor<4x32xf32> 2026-02-21T08:24:52.2081144Z %40 = arith.addf %39, %cst_0 : tensor<4x32xf32> 2026-02-21T08:24:52.2081349Z scf.yield %40 : tensor<4x32xf32> 2026-02-21T08:24:52.2081513Z } else { 2026-02-21T08:24:52.2081675Z %37 = tt.splat %arg4 : f32 -> tensor<4x32xf32> 2026-02-21T08:24:52.2081918Z %38 = arith.cmpf ogt, %34, %37 : tensor<4x32xf32> 2026-02-21T08:24:52.2082151Z %39 = arith.cmpf une, %34, %34 : tensor<4x32xf32> 2026-02-21T08:24:52.2082357Z %40 = arith.ori %38, %39 : tensor<4x32xi1> 2026-02-21T08:24:52.2082590Z %41 = arith.select %40, %34, %37 : tensor<4x32xi1>, tensor<4x32xf32> 2026-02-21T08:24:52.2082823Z %42 = math.log %41 : tensor<4x32xf32> 2026-02-21T08:24:52.2083068Z %43 = arith.subf %42, %33 : tensor<4x32xf32> 2026-02-21T08:24:52.2083272Z %44 = arith.mulf %34, %43 : tensor<4x32xf32> 2026-02-21T08:24:52.2083470Z %45 = arith.addf %44, %cst_0 : tensor<4x32xf32> 2026-02-21T08:24:52.2083668Z scf.yield %45 : tensor<4x32xf32> 2026-02-21T08:24:52.2083829Z } 2026-02-21T08:24:52.2083978Z %36 = arith.addf %arg7, %35 : tensor<4x32xf32> 2026-02-21T08:24:52.2084163Z scf.yield %36 : tensor<4x32xf32> 2026-02-21T08:24:52.2084347Z } {tt.flatten, tt.warp_specialize} 2026-02-21T08:24:52.2084537Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:24:52.2084716Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:24:52.2084893Z %22 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:24:52.2085070Z tt.reduce.return %22 : f32 2026-02-21T08:24:52.2085252Z }) : (tensor<4x32xf32>) -> tensor<4xf32> 2026-02-21T08:24:52.2085467Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:24:52.2085726Z %21 = tt.addptr %20, %17 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:24:52.2085952Z tt.store %21, %19 : tensor<4x!tt.ptr> 2026-02-21T08:24:52.2086192Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:24:52.2086420Z tt.return 2026-02-21T08:24:52.2086540Z } 2026-02-21T08:24:52.2086662Z } 2026-02-21T08:24:52.2086729Z 
2026-02-21T08:24:52.2086777Z {-# 2026-02-21T08:24:52.2086904Z external_resources: { 2026-02-21T08:24:52.2087052Z mlir_reproducer: { 2026-02-21T08:24:52.2091314Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:24:52.2095705Z disable_threading: false, 
2026-02-21T08:24:52.2095866Z verify_each: true 2026-02-21T08:24:52.2096037Z } 2026-02-21T08:24:52.2096161Z } 2026-02-21T08:24:52.2096303Z #-} 2026-02-21T08:24:52.2096827Z /tmp/torchinductor_root/ut/cutkrmuratgdy4w4xrkpvokkovuoxsp5tyx4ojndcaoy767utpnl.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:24:52.2098243Z /tmp/torchinductor_root/ut/cutkrmuratgdy4w4xrkpvokkovuoxsp5tyx4ojndcaoy767utpnl.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:24:52.2099336Z [41s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:24:52.2100436Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 4], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', ''], num_sm_multiplier=2, num_stages=7, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[2, 0], range_unroll_factors=[3, 0], range_warp_specializes=[False, True]), static_shapes=True) 2026-02-21T08:24:52.2101400Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:24:52.2101673Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:24:52.2714478Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 12.7 configs/s 2026-02-21T08:24:52.2725302Z [41s] Adaptive compile timeout: 30s (90% percentile=25.3s, bounds=[30.0s, 30s]) 2026-02-21T08:24:53.2595370Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1009.9 configs/s 2026-02-21T08:24:53.3118534Z [42s] Initial random population of 100, 5 starting points: 2026-02-21T08:24:53.3124007Z error=14 2026-02-21T08:24:53.3127877Z timeout=9 2026-02-21T08:24:53.3129403Z ok=77 2026-02-21T08:24:53.3129604Z min=0.0707 2026-02-21T08:24:53.3135156Z mid=1.2359 2026-02-21T08:24:53.3138328Z max=78.4087 2026-02-21T08:24:53.3142288Z best={'block_sizes': [256, 2], 2026-02-21T08:24:53.3144174Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:24:53.3144397Z 'load_eviction_policies': ['first', 'last'], 2026-02-21T08:24:53.3144588Z 'num_stages': 6, 2026-02-21T08:24:53.3144725Z 'num_warps': 1, 2026-02-21T08:24:53.3144869Z 'pid_type': 'flat', 2026-02-21T08:24:53.3145026Z 'range_flattens': [None, True], 2026-02-21T08:24:53.3145206Z 'range_multi_buffers': [None, None], 2026-02-21T08:24:53.3145391Z 'range_num_stages': [0, 0], 2026-02-21T08:24:53.3145582Z 'range_unroll_factors': [0, 1], 2026-02-21T08:24:53.3145777Z 'range_warp_specializes': [None, True]} 2026-02-21T08:24:53.3145982Z [42s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:24:54.6128768Z [44s] Generation 1 starting: 80 neighbors, 5 active search path(s) 2026-02-21T08:24:59.7567037Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 8.3 configs/s 2026-02-21T08:25:04.6039369Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 83/83 17.3 configs/s 2026-02-21T08:25:12.8341328Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 127.0 2026-02-21T08:25:12.8341828Z configs/s 2026-02-21T08:25:13.1924694Z [62s] Generation 1 complete: 2026-02-21T08:25:13.1928341Z ok=86 2026-02-21T08:25:13.1932445Z min=0.0685 
2026-02-21T08:25:13.1936341Z mid=0.0769 2026-02-21T08:25:13.1940318Z max=0.8306 2026-02-21T08:25:13.1944249Z best={'block_sizes': [2048, 1], 2026-02-21T08:25:13.1948713Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:25:13.1950259Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:25:13.1950584Z 'num_stages': 5, 2026-02-21T08:25:13.1950829Z 'num_warps': 8, 2026-02-21T08:25:13.1951034Z 'pid_type': 'flat', 2026-02-21T08:25:13.1951353Z 'range_flattens': [None, None], 2026-02-21T08:25:13.1951620Z 'range_multi_buffers': [None, None], 2026-02-21T08:25:13.1951962Z 'range_num_stages': [0, 3], 2026-02-21T08:25:13.1952194Z 'range_unroll_factors': [0, 1], 2026-02-21T08:25:13.1952456Z 'range_warp_specializes': [None, True]} 2026-02-21T08:25:13.1952745Z [62s] Fitting surrogate: 186 points, 186 targets 2026-02-21T08:25:14.1738084Z [63s] Generation 2 starting: 69 neighbors, 5 active search path(s) 2026-02-21T08:25:26.6790989Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 1.7 configs/s 2026-02-21T08:25:30.8929143Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 72/72 17.3 configs/s 2026-02-21T08:25:37.3164139Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 157.3 2026-02-21T08:25:37.3164906Z configs/s 2026-02-21T08:25:37.5962284Z [87s] Generation 2 complete: 2026-02-21T08:25:37.5962595Z ok=75 2026-02-21T08:25:37.5962749Z min=0.0624 2026-02-21T08:25:37.5962876Z mid=0.0728 2026-02-21T08:25:37.5963037Z max=0.3326 2026-02-21T08:25:37.5963191Z best={'block_sizes': [1024, 1], 2026-02-21T08:25:37.5963456Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:25:37.5963729Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:25:37.5963916Z 'num_stages': 7, 2026-02-21T08:25:37.5964068Z 'num_warps': 4, 2026-02-21T08:25:37.5964205Z 'pid_type': 'flat', 2026-02-21T08:25:37.5964365Z 'range_flattens': [None, None], 2026-02-21T08:25:37.5964538Z 'range_multi_buffers': [None, None], 
2026-02-21T08:25:37.5964719Z 'range_num_stages': [0, 1], 2026-02-21T08:25:37.5964878Z 'range_unroll_factors': [0, 0], 2026-02-21T08:25:37.5965104Z 'range_warp_specializes': [None, True]} 2026-02-21T08:25:37.5978748Z [87s] Fitting surrogate: 261 points, 261 targets 2026-02-21T08:25:38.6348352Z [88s] Generation 3 starting: 69 neighbors, 5 active search path(s) 2026-02-21T08:25:50.6669015Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 1.9 configs/s 2026-02-21T08:25:54.7470804Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 17.3 configs/s 2026-02-21T08:26:01.8268185Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 142.8 2026-02-21T08:26:01.8268708Z configs/s 2026-02-21T08:26:02.1449869Z [111s] Generation 3 complete: 2026-02-21T08:26:02.1454303Z ok=74 2026-02-21T08:26:02.1458869Z min=0.0624 2026-02-21T08:26:02.1462916Z mid=0.0687 2026-02-21T08:26:02.1467476Z max=0.6073 2026-02-21T08:26:02.1472003Z best={'block_sizes': [1024, 1], 2026-02-21T08:26:02.1475804Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:26:02.1476146Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:26:02.1476359Z 'num_stages': 6, 2026-02-21T08:26:02.1481542Z 'num_warps': 1, 2026-02-21T08:26:02.1485400Z 'pid_type': 'flat', 2026-02-21T08:26:02.1489343Z 'range_flattens': [None, True], 2026-02-21T08:26:02.1493311Z 'range_multi_buffers': [None, None], 2026-02-21T08:26:02.1495374Z 'range_num_stages': [0, 0], 2026-02-21T08:26:02.1495585Z 'range_unroll_factors': [0, 1], 2026-02-21T08:26:02.1495768Z 'range_warp_specializes': [None, True]} 2026-02-21T08:26:02.1496052Z [111s] Fitting surrogate: 335 points, 335 targets 2026-02-21T08:26:03.1052805Z [112s] Generation 4 starting: 67 neighbors, 5 active search path(s) 2026-02-21T08:26:06.4328393Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 17.2 configs/s 2026-02-21T08:26:10.4187957Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 17.5 configs/s 
2026-02-21T08:26:17.7003457Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 138.8 2026-02-21T08:26:17.7004609Z configs/s 2026-02-21T08:26:18.1189959Z [127s] Generation 4 complete: 2026-02-21T08:26:18.1193424Z ok=73 2026-02-21T08:26:18.1194973Z min=0.0624 2026-02-21T08:26:18.1195124Z mid=0.0646 2026-02-21T08:26:18.1195252Z max=0.3154 2026-02-21T08:26:18.1195382Z best={'block_sizes': [1024, 1], 2026-02-21T08:26:18.1195600Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:26:18.1195816Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:26:18.1196003Z 'num_stages': 6, 2026-02-21T08:26:18.1196148Z 'num_warps': 1, 2026-02-21T08:26:18.1196283Z 'pid_type': 'flat', 2026-02-21T08:26:18.1196446Z 'range_flattens': [None, True], 2026-02-21T08:26:18.1196620Z 'range_multi_buffers': [None, None], 2026-02-21T08:26:18.1196805Z 'range_num_stages': [0, 0], 2026-02-21T08:26:18.1196966Z 'range_unroll_factors': [0, 0], 2026-02-21T08:26:18.1197484Z 'range_warp_specializes': [None, True]} 2026-02-21T08:26:18.1207024Z [127s] Fitting surrogate: 408 points, 408 targets 2026-02-21T08:26:18.9072947Z [128s] Generation 5 starting: 51 neighbors, 4 active search path(s) 2026-02-21T08:26:30.7792752Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 1.3 configs/s 2026-02-21T08:26:33.8577812Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 17.5 configs/s 2026-02-21T08:26:39.6299365Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 186.3 2026-02-21T08:26:39.6300805Z configs/s 2026-02-21T08:26:39.8992321Z [149s] Generation 5 complete: 2026-02-21T08:26:39.8993695Z ok=55 2026-02-21T08:26:39.8993850Z min=0.0624 2026-02-21T08:26:39.8993984Z mid=0.0644 2026-02-21T08:26:39.8994103Z max=0.3040 2026-02-21T08:26:39.8994244Z best={'block_sizes': [2048, 1], 2026-02-21T08:26:39.8994497Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:26:39.8994794Z 'load_eviction_policies': ['first', 
'first'], 2026-02-21T08:26:39.8995002Z 'num_stages': 5, 2026-02-21T08:26:39.8995138Z 'num_warps': 8, 2026-02-21T08:26:39.8995278Z 'pid_type': 'flat', 2026-02-21T08:26:39.8995426Z 'range_flattens': [None, False], 2026-02-21T08:26:39.8995605Z 'range_multi_buffers': [None, False], 2026-02-21T08:26:39.8995780Z 'range_num_stages': [0, 1], 2026-02-21T08:26:39.8995943Z 'range_unroll_factors': [0, 1], 2026-02-21T08:26:39.8996114Z 'range_warp_specializes': [None, True]} 2026-02-21T08:26:39.9007177Z [149s] Fitting surrogate: 463 points, 463 targets 2026-02-21T08:26:40.6200495Z [150s] Generation 6 starting: 36 neighbors, 3 active search path(s) 2026-02-21T08:26:42.2896442Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 35.1 configs/s 2026-02-21T08:26:44.3610002Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 17.8 configs/s 2026-02-21T08:26:48.3941142Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 250.4 2026-02-21T08:26:48.3942208Z configs/s 2026-02-21T08:26:48.6182201Z [158s] Generation 6 complete: 2026-02-21T08:26:48.6186212Z ok=39 2026-02-21T08:26:48.6190082Z min=0.0624 2026-02-21T08:26:48.6191490Z mid=0.0626 2026-02-21T08:26:48.6191644Z max=0.1363 2026-02-21T08:26:48.6191795Z best={'block_sizes': [2048, 1], 2026-02-21T08:26:48.6192122Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:26:48.6192401Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:26:48.6192593Z 'num_stages': 5, 2026-02-21T08:26:48.6192744Z 'num_warps': 8, 2026-02-21T08:26:48.6192880Z 'pid_type': 'flat', 2026-02-21T08:26:48.6193041Z 'range_flattens': [None, False], 2026-02-21T08:26:48.6193211Z 'range_multi_buffers': [None, False], 2026-02-21T08:26:48.6193397Z 'range_num_stages': [0, 1], 2026-02-21T08:26:48.6193563Z 'range_unroll_factors': [0, 1], 2026-02-21T08:26:48.6193734Z 'range_warp_specializes': [None, True]} 2026-02-21T08:26:48.6201733Z [158s] Fitting surrogate: 502 points, 502 targets 
2026-02-21T08:26:49.0287000Z [158s] Generation 7 starting: 11 neighbors, 1 active search path(s) 2026-02-21T08:26:49.6899402Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 44.6 configs/s 2026-02-21T08:26:50.3414894Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 11/11 18.2 configs/s 2026-02-21T08:26:51.9889422Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 699.8 2026-02-21T08:26:51.9889764Z configs/s 2026-02-21T08:26:52.0612134Z [161s] Generation 7 complete: 2026-02-21T08:26:52.0614182Z ok=13 2026-02-21T08:26:52.0618840Z min=0.0624 2026-02-21T08:26:52.0620180Z mid=0.0624 2026-02-21T08:26:52.0620338Z max=0.0932 2026-02-21T08:26:52.0620475Z best={'block_sizes': [2048, 1], 2026-02-21T08:26:52.0620726Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:26:52.0621023Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:26:52.0621230Z 'num_stages': 5, 2026-02-21T08:26:52.0621369Z 'num_warps': 8, 2026-02-21T08:26:52.0621502Z 'pid_type': 'flat', 2026-02-21T08:26:52.0621654Z 'range_flattens': [None, False], 2026-02-21T08:26:52.0621824Z 'range_multi_buffers': [None, False], 2026-02-21T08:26:52.0622513Z 'range_num_stages': [0, 1], 2026-02-21T08:26:52.0622675Z 'range_unroll_factors': [0, 1], 2026-02-21T08:26:52.0622855Z 'range_warp_specializes': [None, True]} 2026-02-21T08:26:52.0633097Z [161s] Fitting surrogate: 515 points, 515 targets 2026-02-21T08:26:52.3349124Z [162s] Autotuning complete in 162.0s after searching 485 configs. 
2026-02-21T08:26:52.3350614Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:26:52.3351567Z @helion.kernel(config=helion.Config(block_sizes=[2048, 1], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=5, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:26:52.3352613Z 2026-02-21T08:26:52.3352864Z [162s] Code of selected kernel: /tmp/torchinductor_root/im/cimnzdq4zypdmhpvdsxfhsdli2ldhkgwquhjbqrdqltuunfy3gzp.py 2026-02-21T08:26:52.3534268Z from __future__ import annotations 2026-02-21T08:26:52.3534475Z 2026-02-21T08:26:52.3534545Z import torch 2026-02-21T08:26:52.3534679Z import triton 2026-02-21T08:26:52.3534829Z import triton.language as tl 2026-02-21T08:26:52.3535026Z from torch._inductor.runtime import triton_helpers 2026-02-21T08:26:52.3535329Z from torch._inductor.runtime.triton_helpers import math as tl_math 2026-02-21T08:26:52.3535613Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T08:26:52.3535888Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T08:26:52.3536057Z 2026-02-21T08:26:52.3536123Z _BLOCK_SIZE_1 = tl.constexpr(1) 2026-02-21T08:26:52.3536323Z _BLOCK_SIZE_0 = tl.constexpr(2048) 2026-02-21T08:26:52.3536783Z 2026-02-21T08:26:52.3536850Z @triton.jit 2026-02-21T08:26:52.3537033Z def _helion_kl_div_forward(y_pred, y_true, loss, log_target, eps): 2026-02-21T08:26:52.3537322Z # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m): 2026-02-21T08:26:52.3537553Z pid_0 = tl.program_id(0) 2026-02-21T08:26:52.3537712Z offset_1 = pid_0 2026-02-21T08:26:52.3537873Z indices_1 = offset_1 + tl.zeros([1], tl.int32) 2026-02-21T08:26:52.3538153Z # src[kl_div.py:90]: loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32) 
2026-02-21T08:26:52.3538465Z loss_sum = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T08:26:52.3538730Z # src[kl_div.py:92]: for tile_v in hl.tile(V, block_size=block_size_n): 2026-02-21T08:26:52.3539038Z # src[kl_div.py:93]: kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32) 2026-02-21T08:26:52.3539404Z # src[kl_div.py:92-112]: ... 2026-02-21T08:26:52.3539808Z for offset_0 in tl.range(0, 8192, _BLOCK_SIZE_0, loop_unroll_factor=1, warp_specialize=True, num_stages=1, disallow_acc_multi_buffer=True, flatten=False): 2026-02-21T08:26:52.3540255Z indices_0 = offset_0 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32) 2026-02-21T08:26:52.3540478Z loss_sum_copy = loss_sum 2026-02-21T08:26:52.3540653Z loss_sum_copy_0 = loss_sum_copy 2026-02-21T08:26:52.3540909Z # src[kl_div.py:93]: kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32) 2026-02-21T08:26:52.3541221Z kl_loss = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T08:26:52.3541475Z # src[kl_div.py:95]: y_pred_val = y_pred[tile_bt, tile_v] 2026-02-21T08:26:52.3541828Z y_pred_val = tl.load(y_pred + (indices_1[:, None] * 8192 + indices_0[None, :] * 1), None, eviction_policy='evict_first') 2026-02-21T08:26:52.3542273Z # src[kl_div.py:96]: y_true_val = y_true[tile_bt, tile_v] 2026-02-21T08:26:52.3542632Z y_true_val = tl.load(y_true + (indices_1[:, None] * 8192 + indices_0[None, :] * 1), None, eviction_policy='evict_first') 2026-02-21T08:26:52.3542978Z # src[kl_div.py:98]: if log_target: 2026-02-21T08:26:52.3543266Z # src[kl_div.py:99]: # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space 2026-02-21T08:26:52.3543580Z # src[kl_div.py:100]: prob_true = torch.exp(y_true_val) 2026-02-21T08:26:52.3543809Z # src[kl_div.py:98-106]: ... 
2026-02-21T08:26:52.3543980Z if log_target: 2026-02-21T08:26:52.3544142Z y_true_val_copy = y_true_val 2026-02-21T08:26:52.3544326Z y_pred_val_copy = y_pred_val 2026-02-21T08:26:52.3544512Z kl_loss_copy = kl_loss 2026-02-21T08:26:52.3544689Z y_true_val_copy_0 = y_true_val_copy 2026-02-21T08:26:52.3544891Z y_pred_val_copy_0 = y_pred_val_copy 2026-02-21T08:26:52.3545077Z kl_loss_copy_0 = kl_loss_copy 2026-02-21T08:26:52.3545305Z # src[kl_div.py:100]: prob_true = torch.exp(y_true_val) 2026-02-21T08:26:52.3545544Z v_0 = libdevice.exp(y_true_val_copy_0) 2026-02-21T08:26:52.3545787Z # src[kl_div.py:101]: kl_loss += prob_true * (y_true_val - y_pred_val) 2026-02-21T08:26:52.3546054Z v_1 = y_true_val_copy_0 - y_pred_val_copy_0 2026-02-21T08:26:52.3546245Z v_2 = v_0 * v_1 2026-02-21T08:26:52.3546415Z kl_loss = kl_loss_copy_0 + v_2 2026-02-21T08:26:52.3546600Z # src[kl_div.py:98]: if log_target: 2026-02-21T08:26:52.3546861Z # src[kl_div.py:99]: # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space 2026-02-21T08:26:52.3547162Z # src[kl_div.py:100]: prob_true = torch.exp(y_true_val) 2026-02-21T08:26:52.3547374Z # src[kl_div.py:98-106]: ... 
2026-02-21T08:26:52.3547553Z _not = not log_target 2026-02-21T08:26:52.3547705Z if _not: 2026-02-21T08:26:52.3547857Z y_true_val_copy_1 = y_true_val 2026-02-21T08:26:52.3548120Z y_pred_val_copy_1 = y_pred_val 2026-02-21T08:26:52.3548302Z kl_loss_copy_1 = kl_loss 2026-02-21T08:26:52.3548493Z y_true_val_copy_1_0 = y_true_val_copy_1 2026-02-21T08:26:52.3548707Z y_pred_val_copy_1_0 = y_pred_val_copy_1 2026-02-21T08:26:52.3548917Z kl_loss_copy_1_0 = kl_loss_copy_1 2026-02-21T08:26:52.3549169Z # src[kl_div.py:105]: log_true = torch.log(torch.clamp(y_true_val, min=eps)) 2026-02-21T08:26:52.3549467Z v_4 = triton_helpers.maximum(y_true_val_copy_1_0, eps) 2026-02-21T08:26:52.3549684Z v_5 = tl_math.log(v_4) 2026-02-21T08:26:52.3549913Z # src[kl_div.py:106]: kl_loss += y_true_val * (log_true - y_pred_val) 2026-02-21T08:26:52.3550156Z v_6 = v_5 - y_pred_val_copy_1_0 2026-02-21T08:26:52.3550346Z v_7 = y_true_val_copy_1_0 * v_6 2026-02-21T08:26:52.3550530Z kl_loss = kl_loss_copy_1_0 + v_7 2026-02-21T08:26:52.3550799Z # src[kl_div.py:112]: loss_sum += kl_loss 2026-02-21T08:26:52.3551008Z loss_sum = loss_sum_copy_0 + kl_loss 2026-02-21T08:26:52.3551226Z # src[kl_div.py:115]: loss[tile_bt] = loss_sum.sum(dim=-1) 2026-02-21T08:26:52.3551473Z sum_1 = tl.cast(tl.sum(loss_sum, 1), tl.float32) 2026-02-21T08:26:52.3551689Z tl.store(loss + indices_1 * 1, sum_1, None) 2026-02-21T08:26:52.3551826Z 2026-02-21T08:26:52.3552187Z def kl_div_forward(y_pred: Tensor, y_true: Tensor, log_target: bool=False, reduction: str='batchmean', eps: float=1e-10, *, _launcher=_default_launcher): 2026-02-21T08:26:52.3552641Z """ 2026-02-21T08:26:52.3552778Z Compute KL Divergence loss. 
2026-02-21T08:26:52.3552892Z 2026-02-21T08:26:52.3552956Z Args: 2026-02-21T08:26:52.3553128Z y_pred: Input predictions in log-space, shape (BT, V) 2026-02-21T08:26:52.3553459Z y_true: Target values (probabilities or log-probabilities), shape (BT, V) 2026-02-21T08:26:52.3553781Z log_target: If True, y_true is in log-space; if False, y_true is probabilities 2026-02-21T08:26:52.3554089Z reduction: Reduction mode ('none', 'sum', 'mean', 'batchmean') 2026-02-21T08:26:52.3554334Z eps: Small value to avoid numerical issues 2026-02-21T08:26:52.3554461Z 2026-02-21T08:26:52.3554515Z Returns: 2026-02-21T08:26:52.3554654Z loss: KL divergence loss 2026-02-21T08:26:52.3554804Z """ 2026-02-21T08:26:52.3554945Z # src[kl_div.py:74]: BT, V = y_pred.shape 2026-02-21T08:26:52.3555121Z BT, V = y_pred.shape 2026-02-21T08:26:52.3555315Z # src[kl_div.py:75]: assert y_true.shape == y_pred.shape, ( 2026-02-21T08:26:52.3555575Z # src[kl_div.py:76]: f"Shape mismatch: {y_true.shape} != {y_pred.shape}" 2026-02-21T08:26:52.3555811Z # src[kl_div.py:77]: ) 2026-02-21T08:26:52.3556060Z assert y_true.shape == y_pred.shape, f'Shape mismatch: {y_true.shape} != {y_pred.shape}' 2026-02-21T08:26:52.3556338Z # src[kl_div.py:80]: if reduction == "none": 2026-02-21T08:26:52.3556555Z # src[kl_div.py:81]: loss = torch.zeros_like(y_pred) 2026-02-21T08:26:52.3556752Z # src[kl_div.py:82]: else: 2026-02-21T08:26:52.3556912Z # src[kl_div.py:80-83]: ... 
2026-02-21T08:26:52.3557066Z if reduction == 'none': 2026-02-21T08:26:52.3557254Z # src[kl_div.py:81]: loss = torch.zeros_like(y_pred) 2026-02-21T08:26:52.3557462Z loss = torch.zeros_like(y_pred) 2026-02-21T08:26:52.3557624Z else: 2026-02-21T08:26:52.3557845Z # src[kl_div.py:83]: loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device) 2026-02-21T08:26:52.3558162Z loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device) 2026-02-21T08:26:52.3558454Z # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m): 2026-02-21T08:26:52.3558761Z # src[kl_div.py:90]: loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32) 2026-02-21T08:26:52.3559020Z # src[kl_div.py:89-115]: ... 2026-02-21T08:26:52.3559317Z _launcher(_helion_kl_div_forward, (4096,), y_pred, y_true, loss, log_target, eps, num_warps=8, num_stages=5) 2026-02-21T08:26:52.3559708Z # src[kl_div.py:118]: if reduction == "batchmean": 2026-02-21T08:26:52.3559939Z # src[kl_div.py:119]: final_loss = torch.sum(loss) / BT 2026-02-21T08:26:52.3560162Z # src[kl_div.py:120]: elif reduction == "sum": 2026-02-21T08:26:52.3560353Z # src[kl_div.py:118-125]: ... 
2026-02-21T08:26:52.3560515Z if reduction == 'batchmean': 2026-02-21T08:26:52.3560713Z # src[kl_div.py:119]: final_loss = torch.sum(loss) / BT 2026-02-21T08:26:52.3560929Z final_loss = torch.sum(loss) / BT 2026-02-21T08:26:52.3561105Z elif reduction == 'sum': 2026-02-21T08:26:52.3561301Z # src[kl_div.py:121]: final_loss = torch.sum(loss, dim=0) 2026-02-21T08:26:52.3561512Z final_loss = torch.sum(loss, dim=0) 2026-02-21T08:26:52.3561696Z elif reduction == 'mean': 2026-02-21T08:26:52.3561943Z # src[kl_div.py:123]: final_loss = torch.sum(loss) / (BT * V) 2026-02-21T08:26:52.3562229Z final_loss = torch.sum(loss) / (BT * V) 2026-02-21T08:26:52.3562402Z else: 2026-02-21T08:26:52.3562546Z # src[kl_div.py:125]: final_loss = loss 2026-02-21T08:26:52.3562741Z final_loss = loss 2026-02-21T08:26:52.3562908Z # src[kl_div.py:127]: return final_loss 2026-02-21T08:26:52.3563096Z return final_loss 2026-02-21T08:26:53.2457624Z WARNING:tritonbench.utils.triton_op:Completed input ID 1: 2026-02-21T08:26:53.2462326Z (B, T, V) 2026-02-21T08:26:53.2466526Z -------------- 2026-02-21T08:26:53.2468594Z (8, 512, 8192) 2026-02-21T08:26:53.2468714Z 2026-02-21T08:26:53.2469278Z 33%|███▎ | 2/6 [05:25<10:54, 163.52s/it]WARNING:tritonbench.utils.triton_op:Running input ID 2: 2026-02-21T08:26:53.2469574Z (B, T, V) 2026-02-21T08:26:53.2469710Z --------------- 2026-02-21T08:26:53.2469845Z (8, 512, 16384) 2026-02-21T08:26:53.2470086Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for torch_kl_div 2026-02-21T08:26:54.3674728Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for liger_kl_div 2026-02-21T08:26:55.4523673Z INFO:tritonbench.utils.triton_op:Took 3.20ms to get benchmark function for torch_compile_kl_div 2026-02-21T08:26:56.7705818Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:26:56.7706090Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:26:56.7706286Z 'dtype': 'torch.float32', 2026-02-21T08:26:56.7706466Z 'shape': (4096, 16384), 
2026-02-21T08:26:56.7706650Z 'stride': (16384, 1)}, 2026-02-21T08:26:56.7706826Z { 'device': 'cuda:0', 2026-02-21T08:26:56.7706987Z 'dtype': 'torch.float32', 2026-02-21T08:26:56.7707163Z 'shape': (4096, 16384), 2026-02-21T08:26:56.7707321Z 'stride': (16384, 1)}), 2026-02-21T08:26:56.7707484Z 'kwargs': {}} 2026-02-21T08:26:56.7718103Z INFO:tritonbench.utils.triton_op:Took 1.65ms to get benchmark function for helion_kl_div_tritonbench 2026-02-21T08:26:56.9671141Z [0s] Autotune random seed: 2135561342 2026-02-21T08:26:57.0108165Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:27:29.3092854Z [32s] Timeout after 30s compiling Config(block_sizes=[128, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=256, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[0, 0], range_unroll_factors=[3, 3], range_warp_specializes=[None, None]) 2026-02-21T08:27:29.5604744Z [32s] Timeout after 30s compiling Config(block_sizes=[128, 512], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], maxnreg=32, num_sm_multiplier=2, num_stages=5, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[True, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[False, None]) 2026-02-21T08:27:29.9480009Z [32s] Timeout after 30s compiling Config(block_sizes=[64, 1024], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T08:27:30.1333046Z [33s] Timeout after 30s compiling 
Config(block_sizes=[16, 2048], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=128, num_sm_multiplier=1, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[3, 3], range_unroll_factors=[3, 3], range_warp_specializes=[False, False]) 2026-02-21T08:27:30.6582980Z [33s] Timeout after 30s compiling Config(block_sizes=[512, 512], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], num_stages=5, num_warps=32, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[None, False]) 2026-02-21T08:27:30.8233278Z [33s] Timeout after 30s compiling Config(block_sizes=[32, 2048], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], num_stages=6, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]) 2026-02-21T08:27:31.3254793Z [34s] Timeout after 30s compiling Config(block_sizes=[512, 512], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last'], maxnreg=256, num_sm_multiplier=128, num_stages=7, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[4, 3], range_unroll_factors=[1, 3], range_warp_specializes=[None, None]) 2026-02-21T08:27:31.6744315Z [34s] Timeout after 30s compiling Config(block_sizes=[512, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], 
range_warp_specializes=[None, False]) 2026-02-21T08:27:31.6770099Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.9 configs/s 2026-02-21T08:27:35.0579206Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:27:35.0580139Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:27:35.0581018Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:27:35.0581295Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:27:35.0581599Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:27:35.0582177Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:27:35.0582502Z %cst = arith.constant dense<0.000000e+00> : tensor<256x32xf32> 2026-02-21T08:27:35.0582856Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:27:35.0583122Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:27:35.0583405Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:27:35.0583716Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T08:27:35.0583977Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:27:35.0584458Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:27:35.0585180Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:27:35.0585692Z %2 = tt.get_program_id x : i32 2026-02-21T08:27:35.0585967Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:27:35.0586247Z %4 = arith.minsi %3, %c16_i32 : i32 2026-02-21T08:27:35.0586914Z %5 = arith.subi %4, %2 : i32 2026-02-21T08:27:35.0587175Z %c1_i32_0 = arith.constant 1 : i32 2026-02-21T08:27:35.0587474Z %6 = arith.subi %c1_i32, %c1_i32_0 : i32 2026-02-21T08:27:35.0587755Z %7 = arith.addi %5, %6 : i32 2026-02-21T08:27:35.0588049Z %8 = arith.divui %7, %c1_i32 : i32 2026-02-21T08:27:35.0588337Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:27:35.0588612Z 
%9 = arith.remsi %8, %c2_i32 : i32 2026-02-21T08:27:35.0588890Z %10 = arith.subi %8, %9 : i32 2026-02-21T08:27:35.0589153Z %11 = arith.muli %10, %c1_i32 : i32 2026-02-21T08:27:35.0589431Z %12 = arith.addi %2, %11 : i32 2026-02-21T08:27:35.0589696Z %13 = arith.muli %c1_i32, %c2_i32 : i32 2026-02-21T08:27:35.0590006Z scf.for %arg5 = %2 to %12 step %13 : i32 { 2026-02-21T08:27:35.0590318Z %14 = arith.muli %arg5, %c256_i32 : i32 2026-02-21T08:27:35.0590813Z %15 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:27:35.0591239Z %16 = tt.splat %14 : i32 -> tensor<256xi32> 2026-02-21T08:27:35.0591546Z %17 = arith.addi %16, %15 : tensor<256xi32> 2026-02-21T08:27:35.0592118Z %18 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<256x32xf32>) : i32 { 2026-02-21T08:27:35.0592795Z %32 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<256x32xf32> 2026-02-21T08:27:35.0593432Z %33 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc> -> tensor<256x32xf32> 2026-02-21T08:27:35.0593910Z %34 = scf.if %arg3 -> (tensor<256x32xf32>) { 2026-02-21T08:27:35.0594522Z %36 = tt.extern_elementwise %33 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x32xf32>) -> tensor<256x32xf32> 2026-02-21T08:27:35.0595160Z %37 = arith.subf %33, %32 : tensor<256x32xf32> 2026-02-21T08:27:35.0595484Z %38 = arith.mulf %36, %37 : tensor<256x32xf32> 2026-02-21T08:27:35.0595829Z %39 = arith.addf %38, %cst : tensor<256x32xf32> 2026-02-21T08:27:35.0596143Z scf.yield %39 : tensor<256x32xf32> 2026-02-21T08:27:35.0596412Z } else { 2026-02-21T08:27:35.0596658Z %36 = tt.splat %arg4 : f32 -> tensor<256x32xf32> 2026-02-21T08:27:35.0597019Z %37 = arith.cmpf ogt, %33, %36 : tensor<256x32xf32> 2026-02-21T08:27:35.0597389Z %38 = arith.cmpf une, %33, %33 : tensor<256x32xf32> 2026-02-21T08:27:35.0597719Z %39 = arith.ori %37, %38 : tensor<256x32xi1> 2026-02-21T08:27:35.0598121Z %40 = arith.select %39, %33, %36 : 
tensor<256x32xi1>, tensor<256x32xf32> 2026-02-21T08:27:35.0598512Z %41 = math.log %40 : tensor<256x32xf32> 2026-02-21T08:27:35.0598836Z %42 = arith.subf %41, %32 : tensor<256x32xf32> 2026-02-21T08:27:35.0599158Z %43 = arith.mulf %33, %42 : tensor<256x32xf32> 2026-02-21T08:27:35.0599503Z %44 = arith.addf %43, %cst : tensor<256x32xf32> 2026-02-21T08:27:35.0599829Z scf.yield %44 : tensor<256x32xf32> 2026-02-21T08:27:35.0600092Z } 2026-02-21T08:27:35.0600318Z %35 = arith.addf %arg7, %34 : tensor<256x32xf32> 2026-02-21T08:27:35.0600633Z scf.yield %35 : tensor<256x32xf32> 2026-02-21T08:27:35.0600966Z } {tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:27:35.0601289Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:27:35.0601592Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:27:35.0601907Z %32 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:27:35.0602191Z tt.reduce.return %32 : f32 2026-02-21T08:27:35.0602509Z }) : (tensor<256x32xf32>) -> tensor<256xf32> 2026-02-21T08:27:35.0602871Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<256x!tt.ptr> 2026-02-21T08:27:35.0603305Z %21 = tt.addptr %20, %17 : tensor<256x!tt.ptr>, tensor<256xi32> 2026-02-21T08:27:35.0603690Z tt.store %21, %19 : tensor<256x!tt.ptr> 2026-02-21T08:27:35.0604001Z %c1_i32_1 = arith.constant 1 : i32 2026-02-21T08:27:35.0604399Z %22 = arith.muli %c1_i32, %c1_i32_1 : i32 2026-02-21T08:27:35.0604700Z %23 = arith.addi %arg5, %22 : i32 2026-02-21T08:27:35.0604984Z %24 = arith.muli %23, %c256_i32 : i32 2026-02-21T08:27:35.0605340Z %25 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:27:35.0605747Z %26 = tt.splat %24 : i32 -> tensor<256xi32> 2026-02-21T08:27:35.0606039Z %27 = arith.addi %26, %25 : tensor<256xi32> 2026-02-21T08:27:35.0606512Z %28 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<256x32xf32>) : i32 { 2026-02-21T08:27:35.0607133Z %32 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc> -> tensor<256x32xf32> 
2026-02-21T08:27:35.0607711Z %33 = tt.descriptor_load %1[%24, %arg6] : !tt.tensordesc> -> tensor<256x32xf32> 2026-02-21T08:27:35.0608241Z %34 = scf.if %arg3 -> (tensor<256x32xf32>) { 2026-02-21T08:27:35.0608814Z %36 = tt.extern_elementwise %33 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x32xf32>) -> tensor<256x32xf32> 2026-02-21T08:27:35.0609361Z %37 = arith.subf %33, %32 : tensor<256x32xf32> 2026-02-21T08:27:35.0609675Z %38 = arith.mulf %36, %37 : tensor<256x32xf32> 2026-02-21T08:27:35.0609982Z %39 = arith.addf %38, %cst : tensor<256x32xf32> 2026-02-21T08:27:35.0610289Z scf.yield %39 : tensor<256x32xf32> 2026-02-21T08:27:35.0610534Z } else { 2026-02-21T08:27:35.0610769Z %36 = tt.splat %arg4 : f32 -> tensor<256x32xf32> 2026-02-21T08:27:35.0611105Z %37 = arith.cmpf ogt, %33, %36 : tensor<256x32xf32> 2026-02-21T08:27:35.0611432Z %38 = arith.cmpf une, %33, %33 : tensor<256x32xf32> 2026-02-21T08:27:35.0611755Z %39 = arith.ori %37, %38 : tensor<256x32xi1> 2026-02-21T08:27:35.0612190Z %40 = arith.select %39, %33, %36 : tensor<256x32xi1>, tensor<256x32xf32> 2026-02-21T08:27:35.0612594Z %41 = math.log %40 : tensor<256x32xf32> 2026-02-21T08:27:35.0612910Z %42 = arith.subf %41, %32 : tensor<256x32xf32> 2026-02-21T08:27:35.0613242Z %43 = arith.mulf %33, %42 : tensor<256x32xf32> 2026-02-21T08:27:35.0613582Z %44 = arith.addf %43, %cst : tensor<256x32xf32> 2026-02-21T08:27:35.0613890Z scf.yield %44 : tensor<256x32xf32> 2026-02-21T08:27:35.0614161Z } 2026-02-21T08:27:35.0614385Z %35 = arith.addf %arg7, %34 : tensor<256x32xf32> 2026-02-21T08:27:35.0614704Z scf.yield %35 : tensor<256x32xf32> 2026-02-21T08:27:35.0615017Z } {tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:27:35.0615350Z %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({ 2026-02-21T08:27:35.0615643Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:27:35.0615923Z %32 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:27:35.0616217Z tt.reduce.return %32 : f32 2026-02-21T08:27:35.0616504Z }) : 
(tensor<256x32xf32>) -> tensor<256xf32> 2026-02-21T08:27:35.0616877Z %30 = tt.splat %arg2 : !tt.ptr -> tensor<256x!tt.ptr> 2026-02-21T08:27:35.0617302Z %31 = tt.addptr %30, %27 : tensor<256x!tt.ptr>, tensor<256xi32> 2026-02-21T08:27:35.0617692Z tt.store %31, %29 : tensor<256x!tt.ptr> 2026-02-21T08:27:35.0618007Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:27:35.0618329Z scf.for %arg5 = %12 to %4 step %c1_i32 : i32 { 2026-02-21T08:27:35.0618658Z %14 = arith.muli %arg5, %c256_i32 : i32 2026-02-21T08:27:35.0619027Z %15 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:27:35.0619423Z %16 = tt.splat %14 : i32 -> tensor<256xi32> 2026-02-21T08:27:35.0619731Z %17 = arith.addi %16, %15 : tensor<256xi32> 2026-02-21T08:27:35.0620249Z %18 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<256x32xf32>) : i32 { 2026-02-21T08:27:35.0620919Z %22 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<256x32xf32> 2026-02-21T08:27:35.0621645Z %23 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc> -> tensor<256x32xf32> 2026-02-21T08:27:35.0622159Z %24 = scf.if %arg3 -> (tensor<256x32xf32>) { 2026-02-21T08:27:35.0622760Z %26 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x32xf32>) -> tensor<256x32xf32> 2026-02-21T08:27:35.0623384Z %27 = arith.subf %23, %22 : tensor<256x32xf32> 2026-02-21T08:27:35.0623714Z %28 = arith.mulf %26, %27 : tensor<256x32xf32> 2026-02-21T08:27:35.0624062Z %29 = arith.addf %28, %cst : tensor<256x32xf32> 2026-02-21T08:27:35.0624397Z scf.yield %29 : tensor<256x32xf32> 2026-02-21T08:27:35.0624664Z } else { 2026-02-21T08:27:35.0624922Z %26 = tt.splat %arg4 : f32 -> tensor<256x32xf32> 2026-02-21T08:27:35.0625364Z %27 = arith.cmpf ogt, %23, %26 : tensor<256x32xf32> 2026-02-21T08:27:35.0625731Z %28 = arith.cmpf une, %23, %23 : tensor<256x32xf32> 2026-02-21T08:27:35.0626073Z %29 = arith.ori %27, %28 : tensor<256x32xi1> 
2026-02-21T08:27:35.0626468Z %30 = arith.select %29, %23, %26 : tensor<256x32xi1>, tensor<256x32xf32> 2026-02-21T08:27:35.0626866Z %31 = math.log %30 : tensor<256x32xf32> 2026-02-21T08:27:35.0627181Z %32 = arith.subf %31, %22 : tensor<256x32xf32> 2026-02-21T08:27:35.0627517Z %33 = arith.mulf %23, %32 : tensor<256x32xf32> 2026-02-21T08:27:35.0627845Z %34 = arith.addf %33, %cst : tensor<256x32xf32> 2026-02-21T08:27:35.0628168Z scf.yield %34 : tensor<256x32xf32> 2026-02-21T08:27:35.0628432Z } 2026-02-21T08:27:35.0628660Z %25 = arith.addf %arg7, %24 : tensor<256x32xf32> 2026-02-21T08:27:35.0628969Z scf.yield %25 : tensor<256x32xf32> 2026-02-21T08:27:35.0629295Z } {tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:27:35.0629634Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:27:35.0629934Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:27:35.0630215Z %22 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:27:35.0630502Z tt.reduce.return %22 : f32 2026-02-21T08:27:35.0630790Z }) : (tensor<256x32xf32>) -> tensor<256xf32> 2026-02-21T08:27:35.0631144Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<256x!tt.ptr> 2026-02-21T08:27:35.0631548Z %21 = tt.addptr %20, %17 : tensor<256x!tt.ptr>, tensor<256xi32> 2026-02-21T08:27:35.0631958Z tt.store %21, %19 : tensor<256x!tt.ptr> 2026-02-21T08:27:35.0632280Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:27:35.0632558Z tt.return 2026-02-21T08:27:35.0632742Z } 2026-02-21T08:27:35.0632918Z } 2026-02-21T08:27:35.0633018Z 2026-02-21T08:27:35.0633087Z {-# 2026-02-21T08:27:35.0633277Z external_resources: { 2026-02-21T08:27:35.0633510Z mlir_reproducer: { 2026-02-21T08:27:35.0640588Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, 
tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:27:35.0648007Z disable_threading: false, 2026-02-21T08:27:35.0648249Z verify_each: true 2026-02-21T08:27:35.0648446Z } 2026-02-21T08:27:35.0648611Z } 2026-02-21T08:27:35.0648761Z #-} 2026-02-21T08:27:35.0649543Z /tmp/torchinductor_root/lt/cltaacf64k627mu66er2ezv42z4eihhvqghotubql3xm55hfpgyi.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:27:35.0651474Z /tmp/torchinductor_root/lt/cltaacf64k627mu66er2ezv42z4eihhvqghotubql3xm55hfpgyi.py:21:0: note: Pipeline failed while executing 
[`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:27:35.0653118Z [38s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:27:35.0654862Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last'], maxnreg=128, num_sm_multiplier=16, num_stages=6, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[2, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:27:35.0656444Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:27:35.0656819Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:27:36.4903151Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T08:27:36.4903929Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:27:36.4904549Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:27:36.4904744Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:27:36.4904935Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:27:36.4905162Z %cst = arith.constant dense<0.000000e+00> : tensor<64x256xf32> 2026-02-21T08:27:36.4905395Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:27:36.4905608Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:27:36.4905819Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:27:36.4906008Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T08:27:36.4906181Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:27:36.4906497Z %0 = 
tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:27:36.4906923Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:27:36.4907231Z %2 = tt.get_program_id x : i32 2026-02-21T08:27:36.4907430Z scf.for %arg5 = %2 to %c64_i32 step %c9472_i32 : i32 { 2026-02-21T08:27:36.4907644Z %3 = arith.muli %arg5, %c64_i32 : i32 2026-02-21T08:27:36.4907873Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:27:36.4908115Z %5 = tt.splat %3 : i32 -> tensor<64xi32> 2026-02-21T08:27:36.4908313Z %6 = arith.addi %5, %4 : tensor<64xi32> 2026-02-21T08:27:36.4908510Z %c16128_i32 = arith.constant 16128 : i32 2026-02-21T08:27:36.4909174Z %c768_i32 = arith.constant 768 : i32 2026-02-21T08:27:36.4909665Z %7 = scf.for %arg6 = %c0_i32 to %c16128_i32 step %c768_i32 iter_args(%arg7 = %cst) -> (tensor<64x256xf32>) : i32 { 2026-02-21T08:27:36.4910339Z %15 = tt.descriptor_load %0[%3, %arg6] : !tt.tensordesc> -> tensor<64x256xf32> 2026-02-21T08:27:36.4910959Z %16 = tt.descriptor_load %1[%3, %arg6] : !tt.tensordesc> -> tensor<64x256xf32> 2026-02-21T08:27:36.4911422Z %17 = scf.if %arg3 -> (tensor<64x256xf32>) { 2026-02-21T08:27:36.4912127Z %31 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x256xf32>) -> tensor<64x256xf32> 2026-02-21T08:27:36.4912747Z %32 = arith.subf %16, %15 : tensor<64x256xf32> 2026-02-21T08:27:36.4913081Z %33 = arith.mulf %31, %32 : tensor<64x256xf32> 2026-02-21T08:27:36.4913543Z %34 = arith.addf %33, %cst : tensor<64x256xf32> 2026-02-21T08:27:36.4913862Z scf.yield %34 : tensor<64x256xf32> 2026-02-21T08:27:36.4914136Z } else { 2026-02-21T08:27:36.4914380Z %31 = tt.splat %arg4 : f32 -> tensor<64x256xf32> 2026-02-21T08:27:36.4914740Z %32 = arith.cmpf ogt, %16, %31 : tensor<64x256xf32> 2026-02-21T08:27:36.4915097Z %33 = arith.cmpf une, %16, %16 : tensor<64x256xf32> 
2026-02-21T08:27:36.4915439Z %34 = arith.ori %32, %33 : tensor<64x256xi1> 2026-02-21T08:27:36.4915831Z %35 = arith.select %34, %16, %31 : tensor<64x256xi1>, tensor<64x256xf32> 2026-02-21T08:27:36.4916218Z %36 = math.log %35 : tensor<64x256xf32> 2026-02-21T08:27:36.4916539Z %37 = arith.subf %36, %15 : tensor<64x256xf32> 2026-02-21T08:27:36.4916866Z %38 = arith.mulf %16, %37 : tensor<64x256xf32> 2026-02-21T08:27:36.4917203Z %39 = arith.addf %38, %cst : tensor<64x256xf32> 2026-02-21T08:27:36.4917523Z scf.yield %39 : tensor<64x256xf32> 2026-02-21T08:27:36.4917798Z } 2026-02-21T08:27:36.4918023Z %18 = arith.addf %arg7, %17 : tensor<64x256xf32> 2026-02-21T08:27:36.4918346Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:27:36.4918651Z %19 = arith.muli %c256_i32, %c1_i32 : i32 2026-02-21T08:27:36.4918933Z %20 = arith.addi %arg6, %19 : i32 2026-02-21T08:27:36.4919366Z %21 = tt.descriptor_load %0[%3, %20] : !tt.tensordesc> -> tensor<64x256xf32> 2026-02-21T08:27:36.4919992Z %22 = tt.descriptor_load %1[%3, %20] : !tt.tensordesc> -> tensor<64x256xf32> 2026-02-21T08:27:36.4920463Z %23 = scf.if %arg3 -> (tensor<64x256xf32>) { 2026-02-21T08:27:36.4921072Z %31 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x256xf32>) -> tensor<64x256xf32> 2026-02-21T08:27:36.4921688Z %32 = arith.subf %22, %21 : tensor<64x256xf32> 2026-02-21T08:27:36.4922059Z %33 = arith.mulf %31, %32 : tensor<64x256xf32> 2026-02-21T08:27:36.4922417Z %34 = arith.addf %33, %cst : tensor<64x256xf32> 2026-02-21T08:27:36.4922731Z scf.yield %34 : tensor<64x256xf32> 2026-02-21T08:27:36.4923002Z } else { 2026-02-21T08:27:36.4923246Z %31 = tt.splat %arg4 : f32 -> tensor<64x256xf32> 2026-02-21T08:27:36.4923609Z %32 = arith.cmpf ogt, %22, %31 : tensor<64x256xf32> 2026-02-21T08:27:36.4923981Z %33 = arith.cmpf une, %22, %22 : tensor<64x256xf32> 2026-02-21T08:27:36.4924316Z %34 = arith.ori %32, %33 : tensor<64x256xi1> 2026-02-21T08:27:36.4924717Z %35 = arith.select %34, 
%22, %31 : tensor<64x256xi1>, tensor<64x256xf32> 2026-02-21T08:27:36.4925105Z %36 = math.log %35 : tensor<64x256xf32> 2026-02-21T08:27:36.4925426Z %37 = arith.subf %36, %21 : tensor<64x256xf32> 2026-02-21T08:27:36.4925752Z %38 = arith.mulf %22, %37 : tensor<64x256xf32> 2026-02-21T08:27:36.4926091Z %39 = arith.addf %38, %cst : tensor<64x256xf32> 2026-02-21T08:27:36.4926508Z scf.yield %39 : tensor<64x256xf32> 2026-02-21T08:27:36.4926768Z } 2026-02-21T08:27:36.4926994Z %24 = arith.addf %18, %23 : tensor<64x256xf32> 2026-02-21T08:27:36.4927303Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:27:36.4927603Z %25 = arith.muli %c256_i32, %c2_i32 : i32 2026-02-21T08:27:36.4927893Z %26 = arith.addi %arg6, %25 : i32 2026-02-21T08:27:36.4928333Z %27 = tt.descriptor_load %0[%3, %26] : !tt.tensordesc> -> tensor<64x256xf32> 2026-02-21T08:27:36.4928925Z %28 = tt.descriptor_load %1[%3, %26] : !tt.tensordesc> -> tensor<64x256xf32> 2026-02-21T08:27:36.4929395Z %29 = scf.if %arg3 -> (tensor<64x256xf32>) { 2026-02-21T08:27:36.4930002Z %31 = tt.extern_elementwise %28 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x256xf32>) -> tensor<64x256xf32> 2026-02-21T08:27:36.4930726Z %32 = arith.subf %28, %27 : tensor<64x256xf32> 2026-02-21T08:27:36.4931059Z %33 = arith.mulf %31, %32 : tensor<64x256xf32> 2026-02-21T08:27:36.4931394Z %34 = arith.addf %33, %cst : tensor<64x256xf32> 2026-02-21T08:27:36.4931712Z scf.yield %34 : tensor<64x256xf32> 2026-02-21T08:27:36.4932081Z } else { 2026-02-21T08:27:36.4932327Z %31 = tt.splat %arg4 : f32 -> tensor<64x256xf32> 2026-02-21T08:27:36.4932688Z %32 = arith.cmpf ogt, %28, %31 : tensor<64x256xf32> 2026-02-21T08:27:36.4933043Z %33 = arith.cmpf une, %28, %28 : tensor<64x256xf32> 2026-02-21T08:27:36.4933391Z %34 = arith.ori %32, %33 : tensor<64x256xi1> 2026-02-21T08:27:36.4933785Z %35 = arith.select %34, %28, %31 : tensor<64x256xi1>, tensor<64x256xf32> 2026-02-21T08:27:36.4934166Z %36 = math.log %35 : tensor<64x256xf32> 
2026-02-21T08:27:36.4934487Z %37 = arith.subf %36, %27 : tensor<64x256xf32> 2026-02-21T08:27:36.4934815Z %38 = arith.mulf %28, %37 : tensor<64x256xf32> 2026-02-21T08:27:36.4935149Z %39 = arith.addf %38, %cst : tensor<64x256xf32> 2026-02-21T08:27:36.4935463Z scf.yield %39 : tensor<64x256xf32> 2026-02-21T08:27:36.4935732Z } 2026-02-21T08:27:36.4935954Z %30 = arith.addf %24, %29 : tensor<64x256xf32> 2026-02-21T08:27:36.4936259Z scf.yield %30 : tensor<64x256xf32> 2026-02-21T08:27:36.4936550Z } {tt.num_stages = 1 : i32} 2026-02-21T08:27:36.4936998Z %8 = tt.descriptor_load %0[%3, %c16128_i32] : !tt.tensordesc> -> tensor<64x256xf32> 2026-02-21T08:27:36.4937651Z %9 = tt.descriptor_load %1[%3, %c16128_i32] : !tt.tensordesc> -> tensor<64x256xf32> 2026-02-21T08:27:36.4938136Z %10 = scf.if %arg3 -> (tensor<64x256xf32>) { 2026-02-21T08:27:36.4938741Z %15 = tt.extern_elementwise %9 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x256xf32>) -> tensor<64x256xf32> 2026-02-21T08:27:36.4939330Z %16 = arith.subf %9, %8 : tensor<64x256xf32> 2026-02-21T08:27:36.4939654Z %17 = arith.mulf %15, %16 : tensor<64x256xf32> 2026-02-21T08:27:36.4939992Z %18 = arith.addf %17, %cst : tensor<64x256xf32> 2026-02-21T08:27:36.4940301Z scf.yield %18 : tensor<64x256xf32> 2026-02-21T08:27:36.4940569Z } else { 2026-02-21T08:27:36.4940803Z %15 = tt.splat %arg4 : f32 -> tensor<64x256xf32> 2026-02-21T08:27:36.4941156Z %16 = arith.cmpf ogt, %9, %15 : tensor<64x256xf32> 2026-02-21T08:27:36.4941510Z %17 = arith.cmpf une, %9, %9 : tensor<64x256xf32> 2026-02-21T08:27:36.4941841Z %18 = arith.ori %16, %17 : tensor<64x256xi1> 2026-02-21T08:27:36.4942271Z %19 = arith.select %18, %9, %15 : tensor<64x256xi1>, tensor<64x256xf32> 2026-02-21T08:27:36.4942655Z %20 = math.log %19 : tensor<64x256xf32> 2026-02-21T08:27:36.4942983Z %21 = arith.subf %20, %8 : tensor<64x256xf32> 2026-02-21T08:27:36.4943306Z %22 = arith.mulf %9, %21 : tensor<64x256xf32> 2026-02-21T08:27:36.4943648Z %23 = arith.addf 
%22, %cst : tensor<64x256xf32> 2026-02-21T08:27:36.4944058Z scf.yield %23 : tensor<64x256xf32> 2026-02-21T08:27:36.4944326Z } 2026-02-21T08:27:36.4944553Z %11 = arith.addf %7, %10 : tensor<64x256xf32> 2026-02-21T08:27:36.4944879Z %12 = "tt.reduce"(%11) <{axis = 1 : i32}> ({ 2026-02-21T08:27:36.4945188Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:27:36.4945474Z %15 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:27:36.4945780Z tt.reduce.return %15 : f32 2026-02-21T08:27:36.4946075Z }) : (tensor<64x256xf32>) -> tensor<64xf32> 2026-02-21T08:27:36.4946457Z %13 = tt.splat %arg2 : !tt.ptr -> tensor<64x!tt.ptr> 2026-02-21T08:27:36.4946893Z %14 = tt.addptr %13, %6 : tensor<64x!tt.ptr>, tensor<64xi32> 2026-02-21T08:27:36.4947272Z tt.store %14, %12 : tensor<64x!tt.ptr> 2026-02-21T08:27:36.4947692Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:27:36.4948167Z tt.return 2026-02-21T08:27:36.4948364Z } 2026-02-21T08:27:36.4948537Z } 2026-02-21T08:27:36.4948652Z 2026-02-21T08:27:36.4948722Z {-# 2026-02-21T08:27:36.4948907Z external_resources: { 2026-02-21T08:27:36.4949150Z mlir_reproducer: { 2026-02-21T08:27:36.4956743Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, 
tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:27:36.4964776Z disable_threading: false, 2026-02-21T08:27:36.4965044Z verify_each: true 2026-02-21T08:27:36.4965262Z } 2026-02-21T08:27:36.4965440Z } 2026-02-21T08:27:36.4965605Z #-} 2026-02-21T08:27:36.4966327Z /tmp/torchinductor_root/hh/chhlewbekkztkmhfeiv3trprbofzgt5kx7k3voaanknl5ztkortp.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:27:36.4968469Z /tmp/torchinductor_root/hh/chhlewbekkztkmhfeiv3trprbofzgt5kx7k3voaanknl5ztkortp.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:27:36.4970203Z [39s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:27:36.4972146Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=256, num_sm_multiplier=64, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[4, 4], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:27:36.4973944Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:27:36.4974360Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:27:36.5452407Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:27:36.5453333Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:27:36.5454234Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:27:36.5454656Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:27:36.5454939Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:27:36.5455306Z %cst = arith.constant dense<0.000000e+00> : tensor<128x32xf32> 2026-02-21T08:27:36.5455670Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:27:36.5455961Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:27:36.5456259Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:27:36.5456559Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T08:27:36.5456855Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:27:36.5457355Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:27:36.5458088Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:27:36.5458597Z %2 = tt.get_program_id x : i32 2026-02-21T08:27:36.5458858Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:27:36.5459136Z %4 = 
arith.minsi %3, %c32_i32 : i32 2026-02-21T08:27:36.5459429Z scf.for %arg5 = %2 to %4 step %c1_i32 : i32 { 2026-02-21T08:27:36.5459743Z %5 = arith.muli %arg5, %c128_i32 : i32 2026-02-21T08:27:36.5460102Z %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:27:36.5460500Z %7 = tt.splat %5 : i32 -> tensor<128xi32> 2026-02-21T08:27:36.5460807Z %8 = arith.addi %7, %6 : tensor<128xi32> 2026-02-21T08:27:36.5461307Z %9 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c32_i32 iter_args(%arg7 = %cst) -> (tensor<128x32xf32>) : i32 { 2026-02-21T08:27:36.5462017Z %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc> -> tensor<128x32xf32> 2026-02-21T08:27:36.5462650Z %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc> -> tensor<128x32xf32> 2026-02-21T08:27:36.5463130Z %15 = scf.if %arg3 -> (tensor<128x32xf32>) { 2026-02-21T08:27:36.5463735Z %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x32xf32>) -> tensor<128x32xf32> 2026-02-21T08:27:36.5464349Z %18 = arith.subf %14, %13 : tensor<128x32xf32> 2026-02-21T08:27:36.5464679Z %19 = arith.mulf %17, %18 : tensor<128x32xf32> 2026-02-21T08:27:36.5465011Z %20 = arith.addf %19, %cst : tensor<128x32xf32> 2026-02-21T08:27:36.5465330Z scf.yield %20 : tensor<128x32xf32> 2026-02-21T08:27:36.5465592Z } else { 2026-02-21T08:27:36.5465843Z %17 = tt.splat %arg4 : f32 -> tensor<128x32xf32> 2026-02-21T08:27:36.5466206Z %18 = arith.cmpf ogt, %14, %17 : tensor<128x32xf32> 2026-02-21T08:27:36.5466559Z %19 = arith.cmpf une, %14, %14 : tensor<128x32xf32> 2026-02-21T08:27:36.5466900Z %20 = arith.ori %18, %19 : tensor<128x32xi1> 2026-02-21T08:27:36.5467273Z %21 = arith.select %20, %14, %17 : tensor<128x32xi1>, tensor<128x32xf32> 2026-02-21T08:27:36.5467665Z %22 = math.log %21 : tensor<128x32xf32> 2026-02-21T08:27:36.5467984Z %23 = arith.subf %22, %13 : tensor<128x32xf32> 2026-02-21T08:27:36.5468415Z %24 = arith.mulf %14, %23 : tensor<128x32xf32> 
2026-02-21T08:27:36.5468759Z %25 = arith.addf %24, %cst : tensor<128x32xf32> 2026-02-21T08:27:36.5469075Z scf.yield %25 : tensor<128x32xf32> 2026-02-21T08:27:36.5469348Z } 2026-02-21T08:27:36.5469573Z %16 = arith.addf %arg7, %15 : tensor<128x32xf32> 2026-02-21T08:27:36.5469895Z scf.yield %16 : tensor<128x32xf32> 2026-02-21T08:27:36.5470342Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:27:36.5470816Z %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({ 2026-02-21T08:27:36.5471110Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:27:36.5471385Z %13 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:27:36.5471676Z tt.reduce.return %13 : f32 2026-02-21T08:27:36.5472001Z }) : (tensor<128x32xf32>) -> tensor<128xf32> 2026-02-21T08:27:36.5472455Z %11 = tt.splat %arg2 : !tt.ptr -> tensor<128x!tt.ptr> 2026-02-21T08:27:36.5472889Z %12 = tt.addptr %11, %8 : tensor<128x!tt.ptr>, tensor<128xi32> 2026-02-21T08:27:36.5473287Z tt.store %12, %10 : tensor<128x!tt.ptr> 2026-02-21T08:27:36.5473575Z } {tt.loop_unroll_factor = 1 : i32} 2026-02-21T08:27:36.5473848Z tt.return 2026-02-21T08:27:36.5474039Z } 2026-02-21T08:27:36.5474206Z } 2026-02-21T08:27:36.5474309Z 2026-02-21T08:27:36.5474387Z {-# 2026-02-21T08:27:36.5474568Z external_resources: { 2026-02-21T08:27:36.5474812Z mlir_reproducer: { 2026-02-21T08:27:36.5482429Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, 
tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=1}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=1}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=1}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:27:36.5490317Z disable_threading: false, 2026-02-21T08:27:36.5490618Z verify_each: true 2026-02-21T08:27:36.5490883Z } 2026-02-21T08:27:36.5491090Z } 2026-02-21T08:27:36.5491301Z #-} 2026-02-21T08:27:36.5492133Z /tmp/torchinductor_root/zc/czcqwzi5sssdgaf7drc4sspkirwdodgk7w2szpqbcbt5ahalb7yk.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:27:36.5494370Z /tmp/torchinductor_root/zc/czcqwzi5sssdgaf7drc4sspkirwdodgk7w2szpqbcbt5ahalb7yk.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:27:36.5496220Z [39s] Triton compile failed. 
This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:27:36.5498128Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=2, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, False], range_num_stages=[0, 2], range_unroll_factors=[1, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:27:36.5499856Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:27:36.5500256Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:27:37.9793557Z module { 2026-02-21T08:27:37.9796158Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:27:37.9796889Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:27:37.9797077Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:27:37.9797252Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:27:37.9797475Z %cst = arith.constant dense<0.000000e+00> : tensor<64x8xf32> 2026-02-21T08:27:37.9797693Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:27:37.9797876Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:27:37.9798054Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:27:37.9798236Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T08:27:37.9798407Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:27:37.9798718Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:27:37.9799157Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:27:37.9799468Z %2 = tt.get_program_id x : i32 2026-02-21T08:27:37.9799641Z %3 = arith.subi %c64_i32, %2 : i32 
2026-02-21T08:27:37.9799809Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:27:37.9799990Z %4 = arith.subi %c9472_i32, %c1_i32 : i32 2026-02-21T08:27:37.9800166Z %5 = arith.addi %3, %4 : i32 2026-02-21T08:27:37.9800337Z %6 = arith.divui %5, %c9472_i32 : i32 2026-02-21T08:27:37.9800516Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:27:37.9800707Z %7 = arith.remsi %6, %c3_i32 : i32 2026-02-21T08:27:37.9819089Z %8 = arith.subi %6, %7 : i32 2026-02-21T08:27:37.9819320Z %9 = arith.muli %8, %c9472_i32 : i32 2026-02-21T08:27:37.9819518Z %10 = arith.addi %2, %9 : i32 2026-02-21T08:27:37.9819721Z %11 = arith.muli %c9472_i32, %c3_i32 : i32 2026-02-21T08:27:37.9819937Z scf.for %arg5 = %2 to %10 step %11 : i32 { 2026-02-21T08:27:37.9820169Z %12 = arith.muli %arg5, %c64_i32 : i32 2026-02-21T08:27:37.9820443Z %13 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:27:37.9820714Z %14 = tt.splat %12 : i32 -> tensor<64xi32> 2026-02-21T08:27:37.9820935Z %15 = arith.addi %14, %13 : tensor<64xi32> 2026-02-21T08:27:37.9821261Z %16 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c8_i32 iter_args(%arg7 = %cst) -> (tensor<64x8xf32>) : i32 { 2026-02-21T08:27:37.9821693Z %40 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc> -> tensor<64x8xf32> 2026-02-21T08:27:37.9822121Z %41 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc> -> tensor<64x8xf32> 2026-02-21T08:27:37.9822414Z %42 = scf.if %arg3 -> (tensor<64x8xf32>) { 2026-02-21T08:27:37.9822801Z %44 = tt.extern_elementwise %41 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x8xf32>) -> tensor<64x8xf32> 2026-02-21T08:27:37.9823186Z %45 = arith.subf %41, %40 : tensor<64x8xf32> 2026-02-21T08:27:37.9823615Z %46 = arith.mulf %44, %45 : tensor<64x8xf32> 2026-02-21T08:27:37.9823830Z %47 = arith.addf %46, %cst : tensor<64x8xf32> 2026-02-21T08:27:37.9824050Z scf.yield %47 : tensor<64x8xf32> 2026-02-21T08:27:37.9824238Z } else { 2026-02-21T08:27:37.9824408Z %44 = tt.splat %arg4 : f32 -> 
tensor<64x8xf32> 2026-02-21T08:27:37.9824645Z %45 = arith.cmpf ogt, %41, %44 : tensor<64x8xf32> 2026-02-21T08:27:37.9824876Z %46 = arith.cmpf une, %41, %41 : tensor<64x8xf32> 2026-02-21T08:27:37.9825097Z %47 = arith.ori %45, %46 : tensor<64x8xi1> 2026-02-21T08:27:37.9825340Z %48 = arith.select %47, %41, %44 : tensor<64x8xi1>, tensor<64x8xf32> 2026-02-21T08:27:37.9825600Z %49 = math.log %48 : tensor<64x8xf32> 2026-02-21T08:27:37.9825810Z %50 = arith.subf %49, %40 : tensor<64x8xf32> 2026-02-21T08:27:37.9826083Z %51 = arith.mulf %41, %50 : tensor<64x8xf32> 2026-02-21T08:27:37.9826311Z %52 = arith.addf %51, %cst : tensor<64x8xf32> 2026-02-21T08:27:37.9826514Z scf.yield %52 : tensor<64x8xf32> 2026-02-21T08:27:37.9826698Z } 2026-02-21T08:27:37.9826849Z %43 = arith.addf %arg7, %42 : tensor<64x8xf32> 2026-02-21T08:27:37.9827058Z scf.yield %43 : tensor<64x8xf32> 2026-02-21T08:27:37.9827324Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:27:37.9827615Z %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({ 2026-02-21T08:27:37.9827820Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:27:37.9828007Z %40 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:27:37.9828210Z tt.reduce.return %40 : f32 2026-02-21T08:27:37.9828401Z }) : (tensor<64x8xf32>) -> tensor<64xf32> 2026-02-21T08:27:37.9828654Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<64x!tt.ptr> 2026-02-21T08:27:37.9828918Z %19 = tt.addptr %18, %15 : tensor<64x!tt.ptr>, tensor<64xi32> 2026-02-21T08:27:37.9829163Z tt.store %19, %17 : tensor<64x!tt.ptr> 2026-02-21T08:27:37.9829369Z %c1_i32_0 = arith.constant 1 : i32 2026-02-21T08:27:37.9829557Z %20 = arith.muli %c9472_i32, %c1_i32_0 : i32 2026-02-21T08:27:37.9829751Z %21 = arith.addi %arg5, %20 : i32 2026-02-21T08:27:37.9829926Z %22 = arith.muli %21, %c64_i32 : i32 2026-02-21T08:27:37.9830153Z %23 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:27:37.9830389Z %24 = tt.splat %22 : i32 -> tensor<64xi32> 
2026-02-21T08:27:37.9830587Z %25 = arith.addi %24, %23 : tensor<64xi32> 2026-02-21T08:27:37.9830887Z %26 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c8_i32 iter_args(%arg7 = %cst) -> (tensor<64x8xf32>) : i32 { 2026-02-21T08:27:37.9831318Z %40 = tt.descriptor_load %0[%22, %arg6] : !tt.tensordesc> -> tensor<64x8xf32> 2026-02-21T08:27:37.9831694Z %41 = tt.descriptor_load %1[%22, %arg6] : !tt.tensordesc> -> tensor<64x8xf32> 2026-02-21T08:27:37.9832022Z %42 = scf.if %arg3 -> (tensor<64x8xf32>) { 2026-02-21T08:27:37.9832388Z %44 = tt.extern_elementwise %41 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x8xf32>) -> tensor<64x8xf32> 2026-02-21T08:27:37.9832744Z %45 = arith.subf %41, %40 : tensor<64x8xf32> 2026-02-21T08:27:37.9832956Z %46 = arith.mulf %44, %45 : tensor<64x8xf32> 2026-02-21T08:27:37.9833173Z %47 = arith.addf %46, %cst : tensor<64x8xf32> 2026-02-21T08:27:37.9833370Z scf.yield %47 : tensor<64x8xf32> 2026-02-21T08:27:37.9833550Z } else { 2026-02-21T08:27:37.9833714Z %44 = tt.splat %arg4 : f32 -> tensor<64x8xf32> 2026-02-21T08:27:37.9833941Z %45 = arith.cmpf ogt, %41, %44 : tensor<64x8xf32> 2026-02-21T08:27:37.9834161Z %46 = arith.cmpf une, %41, %41 : tensor<64x8xf32> 2026-02-21T08:27:37.9834380Z %47 = arith.ori %45, %46 : tensor<64x8xi1> 2026-02-21T08:27:37.9834634Z %48 = arith.select %47, %41, %44 : tensor<64x8xi1>, tensor<64x8xf32> 2026-02-21T08:27:37.9834974Z %49 = math.log %48 : tensor<64x8xf32> 2026-02-21T08:27:37.9835182Z %50 = arith.subf %49, %40 : tensor<64x8xf32> 2026-02-21T08:27:37.9835382Z %51 = arith.mulf %41, %50 : tensor<64x8xf32> 2026-02-21T08:27:37.9835596Z %52 = arith.addf %51, %cst : tensor<64x8xf32> 2026-02-21T08:27:37.9835790Z scf.yield %52 : tensor<64x8xf32> 2026-02-21T08:27:37.9835966Z } 2026-02-21T08:27:37.9836112Z %43 = arith.addf %arg7, %42 : tensor<64x8xf32> 2026-02-21T08:27:37.9836319Z scf.yield %43 : tensor<64x8xf32> 2026-02-21T08:27:37.9836581Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, 
tt.warp_specialize} 2026-02-21T08:27:37.9836848Z %27 = "tt.reduce"(%26) <{axis = 1 : i32}> ({ 2026-02-21T08:27:37.9837049Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:27:37.9837226Z %40 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:27:37.9837486Z tt.reduce.return %40 : f32 2026-02-21T08:27:37.9837675Z }) : (tensor<64x8xf32>) -> tensor<64xf32> 2026-02-21T08:27:37.9837912Z %28 = tt.splat %arg2 : !tt.ptr -> tensor<64x!tt.ptr> 2026-02-21T08:27:37.9838178Z %29 = tt.addptr %28, %25 : tensor<64x!tt.ptr>, tensor<64xi32> 2026-02-21T08:27:37.9838416Z tt.store %29, %27 : tensor<64x!tt.ptr> 2026-02-21T08:27:37.9838625Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:27:37.9838811Z %30 = arith.muli %c9472_i32, %c2_i32 : i32 2026-02-21T08:27:37.9839008Z %31 = arith.addi %arg5, %30 : i32 2026-02-21T08:27:37.9839184Z %32 = arith.muli %31, %c64_i32 : i32 2026-02-21T08:27:37.9839416Z %33 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:27:37.9839652Z %34 = tt.splat %32 : i32 -> tensor<64xi32> 2026-02-21T08:27:37.9839850Z %35 = arith.addi %34, %33 : tensor<64xi32> 2026-02-21T08:27:37.9840164Z %36 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c8_i32 iter_args(%arg7 = %cst) -> (tensor<64x8xf32>) : i32 { 2026-02-21T08:27:37.9840564Z %40 = tt.descriptor_load %0[%32, %arg6] : !tt.tensordesc> -> tensor<64x8xf32> 2026-02-21T08:27:37.9840928Z %41 = tt.descriptor_load %1[%32, %arg6] : !tt.tensordesc> -> tensor<64x8xf32> 2026-02-21T08:27:37.9841203Z %42 = scf.if %arg3 -> (tensor<64x8xf32>) { 2026-02-21T08:27:37.9841563Z %44 = tt.extern_elementwise %41 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x8xf32>) -> tensor<64x8xf32> 2026-02-21T08:27:37.9841986Z %45 = arith.subf %41, %40 : tensor<64x8xf32> 2026-02-21T08:27:37.9842182Z %46 = arith.mulf %44, %45 : tensor<64x8xf32> 2026-02-21T08:27:37.9842389Z %47 = arith.addf %46, %cst : tensor<64x8xf32> 2026-02-21T08:27:37.9842578Z scf.yield %47 : tensor<64x8xf32> 2026-02-21T08:27:37.9842751Z 
} else { 2026-02-21T08:27:37.9842910Z %44 = tt.splat %arg4 : f32 -> tensor<64x8xf32> 2026-02-21T08:27:37.9843133Z %45 = arith.cmpf ogt, %41, %44 : tensor<64x8xf32> 2026-02-21T08:27:37.9843354Z %46 = arith.cmpf une, %41, %41 : tensor<64x8xf32> 2026-02-21T08:27:37.9843556Z %47 = arith.ori %45, %46 : tensor<64x8xi1> 2026-02-21T08:27:37.9843798Z %48 = arith.select %47, %41, %44 : tensor<64x8xi1>, tensor<64x8xf32> 2026-02-21T08:27:37.9844029Z %49 = math.log %48 : tensor<64x8xf32> 2026-02-21T08:27:37.9844223Z %50 = arith.subf %49, %40 : tensor<64x8xf32> 2026-02-21T08:27:37.9844415Z %51 = arith.mulf %41, %50 : tensor<64x8xf32> 2026-02-21T08:27:37.9844624Z %52 = arith.addf %51, %cst : tensor<64x8xf32> 2026-02-21T08:27:37.9844819Z scf.yield %52 : tensor<64x8xf32> 2026-02-21T08:27:37.9844981Z } 2026-02-21T08:27:37.9845132Z %43 = arith.addf %arg7, %42 : tensor<64x8xf32> 2026-02-21T08:27:37.9845320Z scf.yield %43 : tensor<64x8xf32> 2026-02-21T08:27:37.9845577Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:27:37.9845891Z %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({ 2026-02-21T08:27:37.9846084Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:27:37.9846257Z %40 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:27:37.9846447Z tt.reduce.return %40 : f32 2026-02-21T08:27:37.9846632Z }) : (tensor<64x8xf32>) -> tensor<64xf32> 2026-02-21T08:27:37.9846852Z %38 = tt.splat %arg2 : !tt.ptr -> tensor<64x!tt.ptr> 2026-02-21T08:27:37.9847110Z %39 = tt.addptr %38, %35 : tensor<64x!tt.ptr>, tensor<64xi32> 2026-02-21T08:27:37.9847336Z tt.store %39, %37 : tensor<64x!tt.ptr> 2026-02-21T08:27:37.9847533Z } {tt.num_stages = 1 : i32} 2026-02-21T08:27:37.9847731Z scf.for %arg5 = %10 to %c64_i32 step %c9472_i32 : i32 { 2026-02-21T08:27:37.9847951Z %12 = arith.muli %arg5, %c64_i32 : i32 2026-02-21T08:27:37.9848235Z %13 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:27:37.9848473Z %14 = tt.splat %12 : i32 -> tensor<64xi32> 
2026-02-21T08:27:37.9848666Z %15 = arith.addi %14, %13 : tensor<64xi32> 2026-02-21T08:27:37.9848961Z %16 = scf.for %arg6 = %c0_i32 to %c16384_i32 step %c8_i32 iter_args(%arg7 = %cst) -> (tensor<64x8xf32>) : i32 { 2026-02-21T08:27:37.9849355Z %20 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc> -> tensor<64x8xf32> 2026-02-21T08:27:37.9849704Z %21 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc> -> tensor<64x8xf32> 2026-02-21T08:27:37.9849986Z %22 = scf.if %arg3 -> (tensor<64x8xf32>) { 2026-02-21T08:27:37.9850343Z %24 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x8xf32>) -> tensor<64x8xf32> 2026-02-21T08:27:37.9850691Z %25 = arith.subf %21, %20 : tensor<64x8xf32> 2026-02-21T08:27:37.9850903Z %26 = arith.mulf %24, %25 : tensor<64x8xf32> 2026-02-21T08:27:37.9851108Z %27 = arith.addf %26, %cst : tensor<64x8xf32> 2026-02-21T08:27:37.9851308Z scf.yield %27 : tensor<64x8xf32> 2026-02-21T08:27:37.9851474Z } else { 2026-02-21T08:27:37.9851641Z %24 = tt.splat %arg4 : f32 -> tensor<64x8xf32> 2026-02-21T08:27:37.9851894Z %25 = arith.cmpf ogt, %21, %24 : tensor<64x8xf32> 2026-02-21T08:27:37.9852109Z %26 = arith.cmpf une, %21, %21 : tensor<64x8xf32> 2026-02-21T08:27:37.9852324Z %27 = arith.ori %25, %26 : tensor<64x8xi1> 2026-02-21T08:27:37.9852555Z %28 = arith.select %27, %21, %24 : tensor<64x8xi1>, tensor<64x8xf32> 2026-02-21T08:27:37.9852794Z %29 = math.log %28 : tensor<64x8xf32> 2026-02-21T08:27:37.9852982Z %30 = arith.subf %29, %20 : tensor<64x8xf32> 2026-02-21T08:27:37.9853186Z %31 = arith.mulf %21, %30 : tensor<64x8xf32> 2026-02-21T08:27:37.9853390Z %32 = arith.addf %31, %cst : tensor<64x8xf32> 2026-02-21T08:27:37.9853578Z scf.yield %32 : tensor<64x8xf32> 2026-02-21T08:27:37.9853752Z } 2026-02-21T08:27:37.9853910Z %23 = arith.addf %arg7, %22 : tensor<64x8xf32> 2026-02-21T08:27:37.9854108Z scf.yield %23 : tensor<64x8xf32> 2026-02-21T08:27:37.9854351Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, 
tt.warp_specialize} 2026-02-21T08:27:37.9854619Z %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({ 2026-02-21T08:27:37.9854813Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:27:37.9854985Z %20 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:27:37.9855172Z tt.reduce.return %20 : f32 2026-02-21T08:27:37.9855350Z }) : (tensor<64x8xf32>) -> tensor<64xf32> 2026-02-21T08:27:37.9855580Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<64x!tt.ptr> 2026-02-21T08:27:37.9855832Z %19 = tt.addptr %18, %15 : tensor<64x!tt.ptr>, tensor<64xi32> 2026-02-21T08:27:37.9856070Z tt.store %19, %17 : tensor<64x!tt.ptr> 2026-02-21T08:27:37.9856273Z } {tt.num_stages = 1 : i32} 2026-02-21T08:27:37.9856490Z tt.return 2026-02-21T08:27:37.9856624Z } 2026-02-21T08:27:37.9856742Z } 2026-02-21T08:27:37.9856813Z 2026-02-21T08:27:37.9856875Z {-# 2026-02-21T08:27:37.9857000Z external_resources: { 2026-02-21T08:27:37.9857162Z mlir_reproducer: { 2026-02-21T08:27:37.9861523Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, 
triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:27:37.9866207Z disable_threading: false, 2026-02-21T08:27:37.9866380Z verify_each: true 2026-02-21T08:27:37.9866547Z } 2026-02-21T08:27:37.9866700Z } 2026-02-21T08:27:37.9866840Z #-} 2026-02-21T08:27:37.9867383Z /tmp/torchinductor_root/ep/cep5rg3g5xg5gfzsazcgokkggs64aoxaafehishm262xa2jmr7tz.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:27:37.9868809Z /tmp/torchinductor_root/ep/cep5rg3g5xg5gfzsazcgokkggs64aoxaafehishm262xa2jmr7tz.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:27:37.9869873Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:27:37.9871070Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, False], range_num_stages=[4, 3], range_unroll_factors=[3, 0], range_warp_specializes=[False, True]), static_shapes=True) 2026-02-21T08:27:37.9872119Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:27:37.9872387Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:27:42.1171165Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 9.6 configs/s 2026-02-21T08:27:42.1180959Z [45s] Adaptive compile timeout: 30s (90% percentile=16.5s, bounds=[30.0s, 30s]) 2026-02-21T08:27:43.1528880Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 950.5 configs/s 2026-02-21T08:27:43.2066427Z [46s] Initial random population of 100, 5 starting points: 2026-02-21T08:27:43.2070256Z error=14 2026-02-21T08:27:43.2071387Z timeout=8 2026-02-21T08:27:43.2071517Z ok=78 2026-02-21T08:27:43.2071635Z min=0.1157 2026-02-21T08:27:43.2071765Z mid=1.2339 2026-02-21T08:27:43.2071936Z max=148.8250 2026-02-21T08:27:43.2072096Z best={'block_sizes': [2048, 1], 2026-02-21T08:27:43.2072344Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:27:43.2072579Z 'load_eviction_policies': ['', 'first'], 2026-02-21T08:27:43.2072768Z 'num_stages': 8, 2026-02-21T08:27:43.2072910Z 'num_warps': 4, 2026-02-21T08:27:43.2073042Z 'pid_type': 'flat', 2026-02-21T08:27:43.2073198Z 'range_flattens': [None, False], 2026-02-21T08:27:43.2073368Z 'range_multi_buffers': [None, False], 2026-02-21T08:27:43.2073549Z 'range_num_stages': [0, 4], 2026-02-21T08:27:43.2073706Z 'range_unroll_factors': [0, 0], 2026-02-21T08:27:43.2073885Z 'range_warp_specializes': [None, True]} 2026-02-21T08:27:43.2080914Z [46s] 
Fitting surrogate: 100 points, 100 targets 2026-02-21T08:27:44.2433360Z [47s] Generation 1 starting: 78 neighbors, 5 active search path(s) 2026-02-21T08:27:56.9809316Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80/80 1.4 configs/s 2026-02-21T08:28:00.1843083Z module { 2026-02-21T08:28:00.1844997Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:28:00.1845671Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:28:00.1845890Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:28:00.1846156Z %cst = arith.constant dense<0.000000e+00> : tensor<8x256xf32> 2026-02-21T08:28:00.1846416Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:28:00.1846632Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:28:00.1846857Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:28:00.1847069Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T08:28:00.1847314Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:28:00.1847689Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:28:00.1848194Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:28:00.1848543Z %2 = tt.get_program_id x : i32 2026-02-21T08:28:00.1848746Z %3 = arith.muli %2, %c8_i32 : i32 2026-02-21T08:28:00.1849001Z %4 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:28:00.1849268Z %5 = tt.splat %3 : i32 -> tensor<8xi32> 2026-02-21T08:28:00.1849482Z %6 = arith.addi %5, %4 : tensor<8xi32> 2026-02-21T08:28:00.1849827Z %7 = scf.for %arg5 = %c0_i32 to %c16384_i32 step %c256_i32 iter_args(%arg6 = %cst) -> (tensor<8x256xf32>) : i32 { 2026-02-21T08:28:00.1850301Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc> -> tensor<8x256xf32> 2026-02-21T08:28:00.1850715Z %12 = 
tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc> -> tensor<8x256xf32> 2026-02-21T08:28:00.1851054Z %13 = scf.if %arg3 -> (tensor<8x256xf32>) { 2026-02-21T08:28:00.1851471Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x256xf32>) -> tensor<8x256xf32> 2026-02-21T08:28:00.1852118Z %16 = arith.subf %12, %11 : tensor<8x256xf32> 2026-02-21T08:28:00.1852368Z %17 = arith.mulf %15, %16 : tensor<8x256xf32> 2026-02-21T08:28:00.1852601Z %18 = arith.addf %17, %cst : tensor<8x256xf32> 2026-02-21T08:28:00.1852839Z scf.yield %18 : tensor<8x256xf32> 2026-02-21T08:28:00.1853034Z } else { 2026-02-21T08:28:00.1853230Z %15 = tt.splat %arg4 : f32 -> tensor<8x256xf32> 2026-02-21T08:28:00.1853491Z %16 = arith.cmpf ogt, %12, %15 : tensor<8x256xf32> 2026-02-21T08:28:00.1853743Z %17 = arith.cmpf une, %12, %12 : tensor<8x256xf32> 2026-02-21T08:28:00.1853995Z %18 = arith.ori %16, %17 : tensor<8x256xi1> 2026-02-21T08:28:00.1854700Z %19 = arith.select %18, %12, %15 : tensor<8x256xi1>, tensor<8x256xf32> 2026-02-21T08:28:00.1854994Z %20 = math.log %19 : tensor<8x256xf32> 2026-02-21T08:28:00.1855215Z %21 = arith.subf %20, %11 : tensor<8x256xf32> 2026-02-21T08:28:00.1855455Z %22 = arith.mulf %12, %21 : tensor<8x256xf32> 2026-02-21T08:28:00.1855690Z %23 = arith.addf %22, %cst : tensor<8x256xf32> 2026-02-21T08:28:00.1855907Z scf.yield %23 : tensor<8x256xf32> 2026-02-21T08:28:00.1856107Z } 2026-02-21T08:28:00.1856266Z %14 = arith.addf %arg6, %13 : tensor<8x256xf32> 2026-02-21T08:28:00.1856492Z scf.yield %14 : tensor<8x256xf32> 2026-02-21T08:28:00.1856777Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:28:00.1857104Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:28:00.1857318Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:28:00.1857631Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:28:00.1857856Z tt.reduce.return %11 : f32 2026-02-21T08:28:00.1858059Z }) : (tensor<8x256xf32>) -> 
tensor<8xf32> 2026-02-21T08:28:00.1858315Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<8x!tt.ptr> 2026-02-21T08:28:00.1858603Z %10 = tt.addptr %9, %6 : tensor<8x!tt.ptr>, tensor<8xi32> 2026-02-21T08:28:00.1858869Z tt.store %10, %8 : tensor<8x!tt.ptr> 2026-02-21T08:28:00.1859074Z tt.return 2026-02-21T08:28:00.1859213Z } 2026-02-21T08:28:00.1859352Z } 2026-02-21T08:28:00.1859429Z 2026-02-21T08:28:00.1859483Z {-# 2026-02-21T08:28:00.1859630Z external_resources: { 2026-02-21T08:28:00.1859802Z mlir_reproducer: { 2026-02-21T08:28:00.1864783Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, 
tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:28:00.1869605Z disable_threading: false, 2026-02-21T08:28:00.1869778Z verify_each: true 2026-02-21T08:28:00.1869932Z } 2026-02-21T08:28:00.1870052Z } 2026-02-21T08:28:00.1870174Z #-} 2026-02-21T08:28:00.1870618Z /tmp/torchinductor_root/eb/cebfo3qanzurb5hreek6etdls6omorkgf56ltlrepkgmmlylpopl.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:28:00.1871930Z /tmp/torchinductor_root/eb/cebfo3qanzurb5hreek6etdls6omorkgf56ltlrepkgmmlylpopl.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:28:00.1873189Z [63s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:28:00.1874285Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 8], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:28:00.1875223Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:28:00.1875494Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:28:01.6397620Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 80/80 17.3 configs/s 2026-02-21T08:28:14.1025729Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 82.6 configs/s 2026-02-21T08:28:14.4585743Z [77s] Generation 1 complete: 2026-02-21T08:28:14.4589900Z error=1 2026-02-21T08:28:14.4591384Z ok=83 2026-02-21T08:28:14.4591582Z min=0.1177 2026-02-21T08:28:14.4597242Z mid=0.1361 2026-02-21T08:28:14.4599460Z max=1.1249 2026-02-21T08:28:14.4599675Z best={'block_sizes': [2048, 1], 2026-02-21T08:28:14.4603063Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:28:14.4606295Z 'load_eviction_policies': ['', 'first'], 2026-02-21T08:28:14.4610755Z 'num_stages': 8, 2026-02-21T08:28:14.4614578Z 'num_warps': 4, 2026-02-21T08:28:14.4618959Z 'pid_type': 'flat', 2026-02-21T08:28:14.4620574Z 'range_flattens': [None, False], 2026-02-21T08:28:14.4620855Z 'range_multi_buffers': [None, False], 2026-02-21T08:28:14.4626244Z 'range_num_stages': [0, 4], 2026-02-21T08:28:14.4627844Z 'range_unroll_factors': [0, 0], 2026-02-21T08:28:14.4628151Z 'range_warp_specializes': [None, True]} 2026-02-21T08:28:14.4633332Z [77s] Fitting surrogate: 184 points, 184 targets 2026-02-21T08:28:15.4274941Z [78s] Generation 2 starting: 66 neighbors, 5 active search path(s) 2026-02-21T08:28:20.7571119Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67/67 3.7 configs/s 2026-02-21T08:28:23.1239762Z module { 2026-02-21T08:28:23.1244633Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:28:23.1245758Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:28:23.1246006Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:28:23.1246231Z %cst = arith.constant dense<0.000000e+00> : tensor<4x256xf32> 2026-02-21T08:28:23.1246466Z %c4_i32 = arith.constant 4 : 
i32 2026-02-21T08:28:23.1246645Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:28:23.1246869Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:28:23.1247065Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T08:28:23.1247242Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:28:23.1247546Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:28:23.1247975Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:28:23.1248284Z %2 = tt.get_program_id x : i32 2026-02-21T08:28:23.1248456Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T08:28:23.1248674Z %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:28:23.1248903Z %5 = tt.splat %3 : i32 -> tensor<4xi32> 2026-02-21T08:28:23.1249091Z %6 = arith.addi %5, %4 : tensor<4xi32> 2026-02-21T08:28:23.1249395Z %7 = scf.for %arg5 = %c0_i32 to %c16384_i32 step %c256_i32 iter_args(%arg6 = %cst) -> (tensor<4x256xf32>) : i32 { 2026-02-21T08:28:23.1250132Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc> -> tensor<4x256xf32> 2026-02-21T08:28:23.1250502Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc> -> tensor<4x256xf32> 2026-02-21T08:28:23.1250784Z %13 = scf.if %arg3 -> (tensor<4x256xf32>) { 2026-02-21T08:28:23.1251156Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x256xf32>) -> tensor<4x256xf32> 2026-02-21T08:28:23.1251530Z %16 = arith.subf %12, %11 : tensor<4x256xf32> 2026-02-21T08:28:23.1251731Z %17 = arith.mulf %15, %16 : tensor<4x256xf32> 2026-02-21T08:28:23.1251981Z %18 = arith.addf %17, %cst : tensor<4x256xf32> 2026-02-21T08:28:23.1252180Z scf.yield %18 : tensor<4x256xf32> 2026-02-21T08:28:23.1252351Z } else { 2026-02-21T08:28:23.1252507Z %15 = tt.splat %arg4 : f32 -> tensor<4x256xf32> 2026-02-21T08:28:23.1252838Z %16 = arith.cmpf ogt, %12, %15 : tensor<4x256xf32> 2026-02-21T08:28:23.1253066Z %17 = 
arith.cmpf une, %12, %12 : tensor<4x256xf32> 2026-02-21T08:28:23.1253265Z %18 = arith.ori %16, %17 : tensor<4x256xi1> 2026-02-21T08:28:23.1253507Z %19 = arith.select %18, %12, %15 : tensor<4x256xi1>, tensor<4x256xf32> 2026-02-21T08:28:23.1253738Z %20 = math.log %19 : tensor<4x256xf32> 2026-02-21T08:28:23.1253932Z %21 = arith.subf %20, %11 : tensor<4x256xf32> 2026-02-21T08:28:23.1254124Z %22 = arith.mulf %12, %21 : tensor<4x256xf32> 2026-02-21T08:28:23.1254324Z %23 = arith.addf %22, %cst : tensor<4x256xf32> 2026-02-21T08:28:23.1254516Z scf.yield %23 : tensor<4x256xf32> 2026-02-21T08:28:23.1254679Z } 2026-02-21T08:28:23.1254827Z %14 = arith.addf %arg6, %13 : tensor<4x256xf32> 2026-02-21T08:28:23.1255014Z scf.yield %14 : tensor<4x256xf32> 2026-02-21T08:28:23.1255263Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:28:23.1255521Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:28:23.1255710Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:28:23.1255875Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:28:23.1256061Z tt.reduce.return %11 : f32 2026-02-21T08:28:23.1256245Z }) : (tensor<4x256xf32>) -> tensor<4xf32> 2026-02-21T08:28:23.1256459Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:28:23.1256716Z %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:28:23.1256935Z tt.store %10, %8 : tensor<4x!tt.ptr> 2026-02-21T08:28:23.1257111Z tt.return 2026-02-21T08:28:23.1257229Z } 2026-02-21T08:28:23.1257346Z } 2026-02-21T08:28:23.1257413Z 2026-02-21T08:28:23.1257469Z {-# 2026-02-21T08:28:23.1257598Z external_resources: { 2026-02-21T08:28:23.1257753Z mlir_reproducer: { 2026-02-21T08:28:23.1262173Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, 
tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:28:23.1266816Z disable_threading: false, 2026-02-21T08:28:23.1266984Z verify_each: true 2026-02-21T08:28:23.1267122Z } 2026-02-21T08:28:23.1267243Z } 2026-02-21T08:28:23.1267352Z #-} 2026-02-21T08:28:23.1267829Z /tmp/torchinductor_root/7a/c7ahhktrmlgwjp4px77jizrsvzaxhjcxlkqp6bijd6wtzgezrfa6.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:28:23.1269038Z /tmp/torchinductor_root/7a/c7ahhktrmlgwjp4px77jizrsvzaxhjcxlkqp6bijd6wtzgezrfa6.py:21:0: note: 
Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:28:23.1270020Z [86s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:28:23.1271008Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:28:23.1271927Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:28:23.1272181Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:28:23.5600017Z module { 2026-02-21T08:28:23.5600953Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:28:23.5602177Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:28:23.5602478Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:28:23.5602845Z %cst = arith.constant dense<0.000000e+00> : tensor<8x512xf32> 2026-02-21T08:28:23.5603200Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:28:23.5603489Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:28:23.5603819Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:28:23.5604112Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T08:28:23.5604398Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:28:23.5604932Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:28:23.5605660Z %1 = 
tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:28:23.5606162Z %2 = tt.get_program_id x : i32 2026-02-21T08:28:23.5606438Z %3 = arith.muli %2, %c8_i32 : i32 2026-02-21T08:28:23.5606785Z %4 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:28:23.5607153Z %5 = tt.splat %3 : i32 -> tensor<8xi32> 2026-02-21T08:28:23.5607446Z %6 = arith.addi %5, %4 : tensor<8xi32> 2026-02-21T08:28:23.5607937Z %7 = scf.for %arg5 = %c0_i32 to %c16384_i32 step %c512_i32 iter_args(%arg6 = %cst) -> (tensor<8x512xf32>) : i32 { 2026-02-21T08:28:23.5608596Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc> -> tensor<8x512xf32> 2026-02-21T08:28:23.5609193Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc> -> tensor<8x512xf32> 2026-02-21T08:28:23.5609996Z %13 = scf.if %arg3 -> (tensor<8x512xf32>) { 2026-02-21T08:28:23.5610595Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<8x512xf32>) -> tensor<8x512xf32> 2026-02-21T08:28:23.5611193Z %16 = arith.subf %12, %11 : tensor<8x512xf32> 2026-02-21T08:28:23.5611519Z %17 = arith.mulf %15, %16 : tensor<8x512xf32> 2026-02-21T08:28:23.5611837Z %18 = arith.addf %17, %cst : tensor<8x512xf32> 2026-02-21T08:28:23.5612187Z scf.yield %18 : tensor<8x512xf32> 2026-02-21T08:28:23.5612453Z } else { 2026-02-21T08:28:23.5612705Z %15 = tt.splat %arg4 : f32 -> tensor<8x512xf32> 2026-02-21T08:28:23.5613070Z %16 = arith.cmpf ogt, %12, %15 : tensor<8x512xf32> 2026-02-21T08:28:23.5613422Z %17 = arith.cmpf une, %12, %12 : tensor<8x512xf32> 2026-02-21T08:28:23.5613766Z %18 = arith.ori %16, %17 : tensor<8x512xi1> 2026-02-21T08:28:23.5614277Z %19 = arith.select %18, %12, %15 : tensor<8x512xi1>, tensor<8x512xf32> 2026-02-21T08:28:23.5614681Z %20 = math.log %19 : tensor<8x512xf32> 2026-02-21T08:28:23.5614992Z %21 = arith.subf %20, %11 : tensor<8x512xf32> 2026-02-21T08:28:23.5615316Z %22 = arith.mulf %12, %21 : tensor<8x512xf32> 
2026-02-21T08:28:23.5615651Z %23 = arith.addf %22, %cst : tensor<8x512xf32> 2026-02-21T08:28:23.5615949Z scf.yield %23 : tensor<8x512xf32> 2026-02-21T08:28:23.5616218Z } 2026-02-21T08:28:23.5616434Z %14 = arith.addf %arg6, %13 : tensor<8x512xf32> 2026-02-21T08:28:23.5616740Z scf.yield %14 : tensor<8x512xf32> 2026-02-21T08:28:23.5617147Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:28:23.5617582Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:28:23.5617872Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:28:23.5618152Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:28:23.5618451Z tt.reduce.return %11 : f32 2026-02-21T08:28:23.5618738Z }) : (tensor<8x512xf32>) -> tensor<8xf32> 2026-02-21T08:28:23.5619100Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<8x!tt.ptr> 2026-02-21T08:28:23.5619523Z %10 = tt.addptr %9, %6 : tensor<8x!tt.ptr>, tensor<8xi32> 2026-02-21T08:28:23.5619897Z tt.store %10, %8 : tensor<8x!tt.ptr> 2026-02-21T08:28:23.5620179Z tt.return 2026-02-21T08:28:23.5620356Z } 2026-02-21T08:28:23.5620532Z } 2026-02-21T08:28:23.5620630Z 2026-02-21T08:28:23.5620699Z {-# 2026-02-21T08:28:23.5620892Z external_resources: { 2026-02-21T08:28:23.5621126Z mlir_reproducer: { 2026-02-21T08:28:23.5628734Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, 
tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:28:23.5636723Z disable_threading: false, 2026-02-21T08:28:23.5636987Z verify_each: true 2026-02-21T08:28:23.5637199Z } 2026-02-21T08:28:23.5637374Z } 2026-02-21T08:28:23.5637537Z #-} 2026-02-21T08:28:23.5638248Z /tmp/torchinductor_root/e5/ce54lljssnh2prtaef7tqwd3wn6vl4knfn4dpqe6vnzyl4to3czs.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:28:23.5640414Z /tmp/torchinductor_root/e5/ce54lljssnh2prtaef7tqwd3wn6vl4knfn4dpqe6vnzyl4to3czs.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:28:23.5642166Z [86s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:28:23.5643885Z Config: @helion.kernel(config=helion.Config(block_sizes=[512, 8], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:28:23.5645412Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:28:23.5645826Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:28:24.5473753Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 67/67 17.9 configs/s 2026-02-21T08:28:35.9013091Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 90.6 configs/s 2026-02-21T08:28:36.2323135Z [99s] Generation 2 complete: 2026-02-21T08:28:36.2327411Z error=2 2026-02-21T08:28:36.2331363Z ok=70 2026-02-21T08:28:36.2335191Z min=0.1160 2026-02-21T08:28:36.2336686Z mid=0.1258 2026-02-21T08:28:36.2336847Z max=0.5642 2026-02-21T08:28:36.2336984Z best={'block_sizes': [2048, 1], 2026-02-21T08:28:36.2337216Z 'indexing': ['pointer', 'tensor_descriptor', 'pointer'], 2026-02-21T08:28:36.2337450Z 'load_eviction_policies': ['', 'first'], 2026-02-21T08:28:36.2337620Z 'num_stages': 8, 2026-02-21T08:28:36.2337764Z 'num_warps': 1, 2026-02-21T08:28:36.2337901Z 'pid_type': 'flat', 2026-02-21T08:28:36.2338057Z 'range_flattens': [None, False], 2026-02-21T08:28:36.2338230Z 'range_multi_buffers': [None, None], 2026-02-21T08:28:36.2338415Z 'range_num_stages': [0, 4], 2026-02-21T08:28:36.2338576Z 'range_unroll_factors': [0, 0], 2026-02-21T08:28:36.2338762Z 'range_warp_specializes': [None, True]} 2026-02-21T08:28:36.2338997Z [99s] Fitting surrogate: 256 points, 256 targets 2026-02-21T08:28:37.0553238Z [100s] Generation 3 starting: 57 neighbors, 4 active search path(s) 2026-02-21T08:28:41.2143421Z Generation 3: precompiling 100% 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 6.2 configs/s 2026-02-21T08:28:43.2977658Z module { 2026-02-21T08:28:43.2978258Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:28:43.2978880Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:28:43.2979072Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:28:43.2979301Z %cst = arith.constant dense<0.000000e+00> : tensor<4x256xf32> 2026-02-21T08:28:43.2979519Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:28:43.2979699Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:28:43.2979890Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:28:43.2980517Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T08:28:43.2980695Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:28:43.2981013Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:28:43.2981455Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:28:43.2981761Z %2 = tt.get_program_id x : i32 2026-02-21T08:28:43.2982150Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T08:28:43.2982368Z %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:28:43.2982609Z %5 = tt.splat %3 : i32 -> tensor<4xi32> 2026-02-21T08:28:43.2982793Z %6 = arith.addi %5, %4 : tensor<4xi32> 2026-02-21T08:28:43.2983100Z %7 = scf.for %arg5 = %c0_i32 to %c16384_i32 step %c256_i32 iter_args(%arg6 = %cst) -> (tensor<4x256xf32>) : i32 { 2026-02-21T08:28:43.2983608Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc> -> tensor<4x256xf32> 2026-02-21T08:28:43.2983983Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc> -> tensor<4x256xf32> 2026-02-21T08:28:43.2984276Z %13 = scf.if %arg3 -> (tensor<4x256xf32>) { 2026-02-21T08:28:43.2984639Z %15 = tt.extern_elementwise %12 {libname 
= "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x256xf32>) -> tensor<4x256xf32> 2026-02-21T08:28:43.2985011Z %16 = arith.subf %12, %11 : tensor<4x256xf32> 2026-02-21T08:28:43.2985247Z %17 = arith.mulf %15, %16 : tensor<4x256xf32> 2026-02-21T08:28:43.2985448Z %18 = arith.addf %17, %cst : tensor<4x256xf32> 2026-02-21T08:28:43.2985645Z scf.yield %18 : tensor<4x256xf32> 2026-02-21T08:28:43.2985808Z } else { 2026-02-21T08:28:43.2985970Z %15 = tt.splat %arg4 : f32 -> tensor<4x256xf32> 2026-02-21T08:28:43.2986182Z %16 = arith.cmpf ogt, %12, %15 : tensor<4x256xf32> 2026-02-21T08:28:43.2986409Z %17 = arith.cmpf une, %12, %12 : tensor<4x256xf32> 2026-02-21T08:28:43.2986621Z %18 = arith.ori %16, %17 : tensor<4x256xi1> 2026-02-21T08:28:43.2986853Z %19 = arith.select %18, %12, %15 : tensor<4x256xi1>, tensor<4x256xf32> 2026-02-21T08:28:43.2987095Z %20 = math.log %19 : tensor<4x256xf32> 2026-02-21T08:28:43.2987283Z %21 = arith.subf %20, %11 : tensor<4x256xf32> 2026-02-21T08:28:43.2987485Z %22 = arith.mulf %12, %21 : tensor<4x256xf32> 2026-02-21T08:28:43.2987681Z %23 = arith.addf %22, %cst : tensor<4x256xf32> 2026-02-21T08:28:43.2987878Z scf.yield %23 : tensor<4x256xf32> 2026-02-21T08:28:43.2988050Z } 2026-02-21T08:28:43.2988189Z %14 = arith.addf %arg6, %13 : tensor<4x256xf32> 2026-02-21T08:28:43.2988385Z scf.yield %14 : tensor<4x256xf32> 2026-02-21T08:28:43.2988634Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:28:43.2988901Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:28:43.2989087Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:28:43.2989261Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:28:43.2989439Z tt.reduce.return %11 : f32 2026-02-21T08:28:43.2989619Z }) : (tensor<4x256xf32>) -> tensor<4xf32> 2026-02-21T08:28:43.2989839Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:28:43.2990085Z %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:28:43.2990312Z 
tt.store %10, %8 : tensor<4x!tt.ptr> 2026-02-21T08:28:43.2990484Z tt.return 2026-02-21T08:28:43.2990608Z } 2026-02-21T08:28:43.2990721Z } 2026-02-21T08:28:43.2990796Z 2026-02-21T08:28:43.2990845Z {-# 2026-02-21T08:28:43.2990967Z external_resources: { 2026-02-21T08:28:43.2991125Z mlir_reproducer: { 2026-02-21T08:28:43.2995480Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ 
max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:28:43.2999840Z disable_threading: false, 2026-02-21T08:28:43.3000006Z verify_each: true 2026-02-21T08:28:43.3000144Z } 2026-02-21T08:28:43.3000263Z } 2026-02-21T08:28:43.3000369Z #-} 2026-02-21T08:28:43.3000779Z /tmp/torchinductor_root/yz/cyzks6vbmk67lvqrfhanze723cfpvacoohlj5tj54zfrsorbhg3o.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:28:43.3001977Z /tmp/torchinductor_root/yz/cyzks6vbmk67lvqrfhanze723cfpvacoohlj5tj54zfrsorbhg3o.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:28:43.3002948Z [106s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:28:43.3003919Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'first'], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:28:43.3004788Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:28:43.3005034Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:28:43.6554426Z module { 2026-02-21T08:28:43.6555087Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:28:43.6560213Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:28:43.6564348Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:28:43.6564672Z %cst = arith.constant dense<0.000000e+00> : tensor<4x256xf32> 2026-02-21T08:28:43.6569102Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:28:43.6573645Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:28:43.6576945Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:28:43.6577254Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T08:28:43.6577485Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:28:43.6577866Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:28:43.6578640Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:28:43.6583742Z %2 = tt.get_program_id x : i32 2026-02-21T08:28:43.6585791Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T08:28:43.6586066Z %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:28:43.6586316Z %5 = tt.splat %3 : i32 -> tensor<4xi32> 2026-02-21T08:28:43.6586505Z %6 = arith.addi %5, %4 : tensor<4xi32> 2026-02-21T08:28:43.6586817Z %7 = scf.for %arg5 = %c0_i32 to %c16384_i32 step %c256_i32 iter_args(%arg6 = %cst) -> (tensor<4x256xf32>) : i32 { 2026-02-21T08:28:43.6587213Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc> -> tensor<4x256xf32> 2026-02-21T08:28:43.6587577Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc> -> tensor<4x256xf32> 2026-02-21T08:28:43.6588084Z %13 = scf.if %arg3 -> (tensor<4x256xf32>) { 2026-02-21T08:28:43.6588463Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = 
"__nv_expf"} : (tensor<4x256xf32>) -> tensor<4x256xf32> 2026-02-21T08:28:43.6588828Z %16 = arith.subf %12, %11 : tensor<4x256xf32> 2026-02-21T08:28:43.6589054Z %17 = arith.mulf %15, %16 : tensor<4x256xf32> 2026-02-21T08:28:43.6589257Z %18 = arith.addf %17, %cst : tensor<4x256xf32> 2026-02-21T08:28:43.6589468Z scf.yield %18 : tensor<4x256xf32> 2026-02-21T08:28:43.6589632Z } else { 2026-02-21T08:28:43.6589794Z %15 = tt.splat %arg4 : f32 -> tensor<4x256xf32> 2026-02-21T08:28:43.6590010Z %16 = arith.cmpf ogt, %12, %15 : tensor<4x256xf32> 2026-02-21T08:28:43.6590222Z %17 = arith.cmpf une, %12, %12 : tensor<4x256xf32> 2026-02-21T08:28:43.6590448Z %18 = arith.ori %16, %17 : tensor<4x256xi1> 2026-02-21T08:28:43.6590688Z %19 = arith.select %18, %12, %15 : tensor<4x256xi1>, tensor<4x256xf32> 2026-02-21T08:28:43.6590938Z %20 = math.log %19 : tensor<4x256xf32> 2026-02-21T08:28:43.6591133Z %21 = arith.subf %20, %11 : tensor<4x256xf32> 2026-02-21T08:28:43.6591326Z %22 = arith.mulf %12, %21 : tensor<4x256xf32> 2026-02-21T08:28:43.6591528Z %23 = arith.addf %22, %cst : tensor<4x256xf32> 2026-02-21T08:28:43.6591715Z scf.yield %23 : tensor<4x256xf32> 2026-02-21T08:28:43.6592006Z } 2026-02-21T08:28:43.6592151Z %14 = arith.addf %arg6, %13 : tensor<4x256xf32> 2026-02-21T08:28:43.6592343Z scf.yield %14 : tensor<4x256xf32> 2026-02-21T08:28:43.6592547Z } {tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:28:43.6592744Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:28:43.6592933Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:28:43.6593103Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:28:43.6593291Z tt.reduce.return %11 : f32 2026-02-21T08:28:43.6593473Z }) : (tensor<4x256xf32>) -> tensor<4xf32> 2026-02-21T08:28:43.6593703Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:28:43.6593953Z %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:28:43.6594182Z tt.store %10, %8 : tensor<4x!tt.ptr> 2026-02-21T08:28:43.6594360Z tt.return 
2026-02-21T08:28:43.6594477Z } 2026-02-21T08:28:43.6594612Z } 2026-02-21T08:28:43.6594677Z 2026-02-21T08:28:43.6594724Z {-# 2026-02-21T08:28:43.6594850Z external_resources: { 2026-02-21T08:28:43.6594997Z mlir_reproducer: { 2026-02-21T08:28:43.6599265Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal 
test-convergence=false top-down=true})", 2026-02-21T08:28:43.6603664Z disable_threading: false, 2026-02-21T08:28:43.6603846Z verify_each: true 2026-02-21T08:28:43.6603997Z } 2026-02-21T08:28:43.6604144Z } 2026-02-21T08:28:43.6604305Z #-} 2026-02-21T08:28:43.6604824Z /tmp/torchinductor_root/uu/cuucfqp53u36oyzbhddfv4rgl27hwc7yimcic2whdoumdwksxue4.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:28:43.6606165Z /tmp/torchinductor_root/uu/cuucfqp53u36oyzbhddfv4rgl27hwc7yimcic2whdoumdwksxue4.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:28:43.6607230Z [106s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:28:43.6608314Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['', 'first'], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:28:43.6609279Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:28:43.6609534Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:28:44.5186198Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 18.1 configs/s 2026-02-21T08:28:52.9784791Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 118.8 2026-02-21T08:28:52.9786213Z configs/s 2026-02-21T08:28:53.2397662Z [116s] Generation 3 complete: 2026-02-21T08:28:53.2402036Z error=2 2026-02-21T08:28:53.2403318Z ok=59 2026-02-21T08:28:53.2403474Z min=0.1136 2026-02-21T08:28:53.2403610Z mid=0.1238 2026-02-21T08:28:53.2403724Z max=0.3698 2026-02-21T08:28:53.2403864Z best={'block_sizes': [1024, 2], 2026-02-21T08:28:53.2404122Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:28:53.2404402Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:28:53.2404589Z 'num_stages': 5, 2026-02-21T08:28:53.2404724Z 'num_warps': 1, 2026-02-21T08:28:53.2404865Z 'pid_type': 'flat', 2026-02-21T08:28:53.2405015Z 'range_flattens': [None, False], 2026-02-21T08:28:53.2405192Z 'range_multi_buffers': [None, True], 2026-02-21T08:28:53.2405365Z 'range_num_stages': [0, 1], 2026-02-21T08:28:53.2405529Z 'range_unroll_factors': [0, 1], 2026-02-21T08:28:53.2405725Z 'range_warp_specializes': [None, True]} 2026-02-21T08:28:53.2414148Z [116s] Fitting surrogate: 317 points, 317 targets 2026-02-21T08:28:53.9812475Z [116s] Generation 4 starting: 43 neighbors, 3 active search path(s) 2026-02-21T08:28:56.0748019Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44/44 22.1 configs/s 2026-02-21T08:28:58.6039366Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 44/44 17.7 configs/s 2026-02-21T08:29:05.9807670Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 143.4 2026-02-21T08:29:05.9810942Z configs/s 2026-02-21T08:29:06.2051060Z [129s] Generation 4 complete: 2026-02-21T08:29:06.2051339Z ok=46 2026-02-21T08:29:06.2051474Z min=0.1076 2026-02-21T08:29:06.2051627Z mid=0.1199 2026-02-21T08:29:06.2051750Z max=0.1894 2026-02-21T08:29:06.2053407Z best={'block_sizes': 
[1024, 1], 2026-02-21T08:29:06.2054093Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:29:06.2054414Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:29:06.2054615Z 'num_stages': 5, 2026-02-21T08:29:06.2054755Z 'num_warps': 1, 2026-02-21T08:29:06.2054905Z 'pid_type': 'flat', 2026-02-21T08:29:06.2055065Z 'range_flattens': [None, False], 2026-02-21T08:29:06.2055249Z 'range_multi_buffers': [None, True], 2026-02-21T08:29:06.2055427Z 'range_num_stages': [0, 2], 2026-02-21T08:29:06.2055597Z 'range_unroll_factors': [0, 1], 2026-02-21T08:29:06.2055777Z 'range_warp_specializes': [None, True]} 2026-02-21T08:29:06.2065609Z [129s] Fitting surrogate: 363 points, 363 targets 2026-02-21T08:29:07.0483884Z [130s] Generation 5 starting: 42 neighbors, 3 active search path(s) 2026-02-21T08:29:09.0796819Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43/43 41.5 configs/s 2026-02-21T08:29:11.5514359Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 43/43 17.7 configs/s 2026-02-21T08:29:18.4312394Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 147.1 2026-02-21T08:29:18.4312948Z configs/s 2026-02-21T08:29:18.6500814Z [141s] Generation 5 complete: 2026-02-21T08:29:18.6504386Z ok=46 2026-02-21T08:29:18.6508276Z min=0.1076 2026-02-21T08:29:18.6511476Z mid=0.1199 2026-02-21T08:29:18.6515560Z max=0.2203 2026-02-21T08:29:18.6518859Z best={'block_sizes': [1024, 1], 2026-02-21T08:29:18.6522824Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:29:18.6524153Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:29:18.6524352Z 'num_stages': 5, 2026-02-21T08:29:18.6524496Z 'num_warps': 1, 2026-02-21T08:29:18.6524640Z 'pid_type': 'flat', 2026-02-21T08:29:18.6524804Z 'range_flattens': [None, False], 2026-02-21T08:29:18.6524985Z 'range_multi_buffers': [None, True], 2026-02-21T08:29:18.6525166Z 'range_num_stages': [0, 3], 2026-02-21T08:29:18.6525337Z 
'range_unroll_factors': [0, 1], 2026-02-21T08:29:18.6525542Z 'range_warp_specializes': [None, True]} 2026-02-21T08:29:18.6525765Z [141s] Fitting surrogate: 409 points, 409 targets 2026-02-21T08:29:19.0303301Z [142s] Generation 6 starting: 20 neighbors, 2 active search path(s) 2026-02-21T08:29:20.5111784Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 33.4 configs/s 2026-02-21T08:29:21.2905523Z module { 2026-02-21T08:29:21.2907863Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:29:21.2908582Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:29:21.2908793Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:29:21.2909051Z %cst = arith.constant dense<0.000000e+00> : tensor<4x1024xf32> 2026-02-21T08:29:21.2909296Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:29:21.2909494Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:29:21.2909728Z %c16384_i32 = arith.constant 16384 : i32 2026-02-21T08:29:21.2910372Z %c16384_i64 = arith.constant 16384 : i64 2026-02-21T08:29:21.2910573Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:29:21.2910908Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:29:21.2911390Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c16384_i32], [%c16384_i64, %c1_i64] : , > 2026-02-21T08:29:21.2913044Z %2 = tt.get_program_id x : i32 2026-02-21T08:29:21.2913288Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T08:29:21.2917887Z %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:29:21.2922155Z %5 = tt.splat %3 : i32 -> tensor<4xi32> 2026-02-21T08:29:21.2926943Z %6 = arith.addi %5, %4 : tensor<4xi32> 2026-02-21T08:29:21.2931435Z %7 = scf.for %arg5 = %c0_i32 to %c16384_i32 step %c1024_i32 iter_args(%arg6 = %cst) -> (tensor<4x1024xf32>) : i32 { 
2026-02-21T08:29:21.2936793Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc> -> tensor<4x1024xf32> 2026-02-21T08:29:21.2938309Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc> -> tensor<4x1024xf32> 2026-02-21T08:29:21.2938654Z %13 = scf.if %arg3 -> (tensor<4x1024xf32>) { 2026-02-21T08:29:21.2939054Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32> 2026-02-21T08:29:21.2939457Z %16 = arith.subf %12, %11 : tensor<4x1024xf32> 2026-02-21T08:29:21.2939659Z %17 = arith.mulf %15, %16 : tensor<4x1024xf32> 2026-02-21T08:29:21.2939873Z %18 = arith.addf %17, %cst : tensor<4x1024xf32> 2026-02-21T08:29:21.2940073Z scf.yield %18 : tensor<4x1024xf32> 2026-02-21T08:29:21.2940257Z } else { 2026-02-21T08:29:21.2940427Z %15 = tt.splat %arg4 : f32 -> tensor<4x1024xf32> 2026-02-21T08:29:21.2940677Z %16 = arith.cmpf ogt, %12, %15 : tensor<4x1024xf32> 2026-02-21T08:29:21.2940918Z %17 = arith.cmpf une, %12, %12 : tensor<4x1024xf32> 2026-02-21T08:29:21.2941147Z %18 = arith.ori %16, %17 : tensor<4x1024xi1> 2026-02-21T08:29:21.2941406Z %19 = arith.select %18, %12, %15 : tensor<4x1024xi1>, tensor<4x1024xf32> 2026-02-21T08:29:21.2941662Z %20 = math.log %19 : tensor<4x1024xf32> 2026-02-21T08:29:21.2942062Z %21 = arith.subf %20, %11 : tensor<4x1024xf32> 2026-02-21T08:29:21.2942279Z %22 = arith.mulf %12, %21 : tensor<4x1024xf32> 2026-02-21T08:29:21.2942502Z %23 = arith.addf %22, %cst : tensor<4x1024xf32> 2026-02-21T08:29:21.2942707Z scf.yield %23 : tensor<4x1024xf32> 2026-02-21T08:29:21.2942932Z } 2026-02-21T08:29:21.2943098Z %14 = arith.addf %arg6, %13 : tensor<4x1024xf32> 2026-02-21T08:29:21.2943286Z scf.yield %14 : tensor<4x1024xf32> 2026-02-21T08:29:21.2943543Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:29:21.2944084Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:29:21.2944280Z ^bb0(%arg5: f32, %arg6: f32): 
2026-02-21T08:29:21.2944459Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:29:21.2944656Z tt.reduce.return %11 : f32 2026-02-21T08:29:21.2944844Z }) : (tensor<4x1024xf32>) -> tensor<4xf32> 2026-02-21T08:29:21.2945065Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:29:21.2945325Z %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:29:21.2945558Z tt.store %10, %8 : tensor<4x!tt.ptr> 2026-02-21T08:29:21.2945747Z tt.return 2026-02-21T08:29:21.2945875Z } 2026-02-21T08:29:21.2946012Z } 2026-02-21T08:29:21.2946084Z 2026-02-21T08:29:21.2946136Z {-# 2026-02-21T08:29:21.2946274Z external_resources: { 2026-02-21T08:29:21.2946433Z mlir_reproducer: { 2026-02-21T08:29:21.2950793Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, 
tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:29:21.2955157Z disable_threading: false, 2026-02-21T08:29:21.2955330Z verify_each: true 2026-02-21T08:29:21.2955485Z } 2026-02-21T08:29:21.2955605Z } 2026-02-21T08:29:21.2955733Z #-} 2026-02-21T08:29:21.2956162Z /tmp/torchinductor_root/qh/cqhg3hekzjiacdff4zpxjzwm4m5mixo7gnd7d65mn2s4lw3jh4q3.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:29:21.2957398Z /tmp/torchinductor_root/qh/cqhg3hekzjiacdff4zpxjzwm4m5mixo7gnd7d65mn2s4lw3jh4q3.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:29:21.2958394Z [144s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:29:21.2959430Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:29:21.2960363Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:29:21.2960686Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:29:21.6948476Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 18.5 configs/s 2026-02-21T08:29:25.4521024Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 291.9 2026-02-21T08:29:25.4521573Z configs/s 2026-02-21T08:29:25.5820980Z [148s] Generation 6 complete: 2026-02-21T08:29:25.5822622Z error=1 2026-02-21T08:29:25.5822780Z ok=22 2026-02-21T08:29:25.5822907Z min=0.1095 2026-02-21T08:29:25.5823047Z mid=0.1158 2026-02-21T08:29:25.5823168Z max=0.1669 2026-02-21T08:29:25.5823318Z best={'block_sizes': [1024, 1], 2026-02-21T08:29:25.5823576Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:29:25.5823866Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:29:25.5824053Z 'num_stages': 5, 2026-02-21T08:29:25.5824563Z 'num_warps': 2, 2026-02-21T08:29:25.5824736Z 'pid_type': 'flat', 2026-02-21T08:29:25.5824889Z 'range_flattens': [None, False], 2026-02-21T08:29:25.5825072Z 'range_multi_buffers': [None, True], 2026-02-21T08:29:25.5825247Z 'range_num_stages': [0, 3], 2026-02-21T08:29:25.5825417Z 'range_unroll_factors': [0, 1], 2026-02-21T08:29:25.5825591Z 'range_warp_specializes': [None, True]} 2026-02-21T08:29:25.5832889Z [148s] Fitting surrogate: 432 points, 432 targets 2026-02-21T08:29:25.7421770Z [148s] Autotuning complete in 148.7s after searching 406 configs. 
2026-02-21T08:29:25.7422359Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:29:25.7423308Z @helion.kernel(config=helion.Config(block_sizes=[1024, 1], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=5, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:29:25.7424147Z 2026-02-21T08:29:25.7717755Z [148s] Code of selected kernel: /tmp/torchinductor_root/ev/cevbvrakdvigvaxviegoca6q5acz4nqw7apu6ysxypfng7qcpeon.py 2026-02-21T08:29:25.7718151Z from __future__ import annotations 2026-02-21T08:29:25.7719290Z 2026-02-21T08:29:25.7719451Z import torch 2026-02-21T08:29:25.7719611Z import triton 2026-02-21T08:29:25.7719765Z import triton.language as tl 2026-02-21T08:29:25.7719964Z from torch._inductor.runtime import triton_helpers 2026-02-21T08:29:25.7720234Z from torch._inductor.runtime.triton_helpers import math as tl_math 2026-02-21T08:29:25.7720515Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T08:29:25.7720792Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T08:29:25.7720958Z 2026-02-21T08:29:25.7721033Z _BLOCK_SIZE_1 = tl.constexpr(1) 2026-02-21T08:29:25.7721204Z _BLOCK_SIZE_0 = tl.constexpr(1024) 2026-02-21T08:29:25.7721313Z 2026-02-21T08:29:25.7721387Z @triton.jit 2026-02-21T08:29:25.7721579Z def _helion_kl_div_forward(y_pred, y_true, loss, log_target, eps): 2026-02-21T08:29:25.7721935Z # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m): 2026-02-21T08:29:25.7722179Z pid_0 = tl.program_id(0) 2026-02-21T08:29:25.7722353Z offset_1 = pid_0 2026-02-21T08:29:25.7722535Z indices_1 = offset_1 + tl.zeros([1], tl.int32) 2026-02-21T08:29:25.7722815Z # src[kl_div.py:90]: loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32) 
2026-02-21T08:29:25.7723141Z loss_sum = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T08:29:25.7723420Z # src[kl_div.py:92]: for tile_v in hl.tile(V, block_size=block_size_n): 2026-02-21T08:29:25.7723740Z # src[kl_div.py:93]: kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32) 2026-02-21T08:29:25.7724007Z # src[kl_div.py:92-112]: ... 2026-02-21T08:29:25.7724444Z for offset_0 in tl.range(0, 16384, _BLOCK_SIZE_0, loop_unroll_factor=1, warp_specialize=True, num_stages=3, disallow_acc_multi_buffer=False, flatten=False): 2026-02-21T08:29:25.7725182Z indices_0 = offset_0 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32) 2026-02-21T08:29:25.7725415Z loss_sum_copy = loss_sum 2026-02-21T08:29:25.7725584Z loss_sum_copy_0 = loss_sum_copy 2026-02-21T08:29:25.7725846Z # src[kl_div.py:93]: kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32) 2026-02-21T08:29:25.7726152Z kl_loss = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T08:29:25.7726420Z # src[kl_div.py:95]: y_pred_val = y_pred[tile_bt, tile_v] 2026-02-21T08:29:25.7726772Z y_pred_val = tl.load(y_pred + (indices_1[:, None] * 16384 + indices_0[None, :] * 1), None, eviction_policy='evict_first') 2026-02-21T08:29:25.7727120Z # src[kl_div.py:96]: y_true_val = y_true[tile_bt, tile_v] 2026-02-21T08:29:25.7727545Z y_true_val = tl.load(y_true + (indices_1[:, None] * 16384 + indices_0[None, :] * 1), None, eviction_policy='evict_first') 2026-02-21T08:29:25.7727871Z # src[kl_div.py:98]: if log_target: 2026-02-21T08:29:25.7728131Z # src[kl_div.py:99]: # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space 2026-02-21T08:29:25.7728418Z # src[kl_div.py:100]: prob_true = torch.exp(y_true_val) 2026-02-21T08:29:25.7728635Z # src[kl_div.py:98-106]: ... 
2026-02-21T08:29:25.7728809Z if log_target: 2026-02-21T08:29:25.7728961Z y_true_val_copy = y_true_val 2026-02-21T08:29:25.7729144Z y_pred_val_copy = y_pred_val 2026-02-21T08:29:25.7729316Z kl_loss_copy = kl_loss 2026-02-21T08:29:25.7729495Z y_true_val_copy_0 = y_true_val_copy 2026-02-21T08:29:25.7729681Z y_pred_val_copy_0 = y_pred_val_copy 2026-02-21T08:29:25.7729867Z kl_loss_copy_0 = kl_loss_copy 2026-02-21T08:29:25.7730070Z # src[kl_div.py:100]: prob_true = torch.exp(y_true_val) 2026-02-21T08:29:25.7730297Z v_0 = libdevice.exp(y_true_val_copy_0) 2026-02-21T08:29:25.7730540Z # src[kl_div.py:101]: kl_loss += prob_true * (y_true_val - y_pred_val) 2026-02-21T08:29:25.7730789Z v_1 = y_true_val_copy_0 - y_pred_val_copy_0 2026-02-21T08:29:25.7730979Z v_2 = v_0 * v_1 2026-02-21T08:29:25.7731136Z kl_loss = kl_loss_copy_0 + v_2 2026-02-21T08:29:25.7731319Z # src[kl_div.py:98]: if log_target: 2026-02-21T08:29:25.7731562Z # src[kl_div.py:99]: # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space 2026-02-21T08:29:25.7731889Z # src[kl_div.py:100]: prob_true = torch.exp(y_true_val) 2026-02-21T08:29:25.7732117Z # src[kl_div.py:98-106]: ... 
2026-02-21T08:29:25.7732290Z _not = not log_target 2026-02-21T08:29:25.7732456Z if _not: 2026-02-21T08:29:25.7732603Z y_true_val_copy_1 = y_true_val 2026-02-21T08:29:25.7732796Z y_pred_val_copy_1 = y_pred_val 2026-02-21T08:29:25.7732977Z kl_loss_copy_1 = kl_loss 2026-02-21T08:29:25.7733176Z y_true_val_copy_1_0 = y_true_val_copy_1 2026-02-21T08:29:25.7733381Z y_pred_val_copy_1_0 = y_pred_val_copy_1 2026-02-21T08:29:25.7733588Z kl_loss_copy_1_0 = kl_loss_copy_1 2026-02-21T08:29:25.7733848Z # src[kl_div.py:105]: log_true = torch.log(torch.clamp(y_true_val, min=eps)) 2026-02-21T08:29:25.7734142Z v_4 = triton_helpers.maximum(y_true_val_copy_1_0, eps) 2026-02-21T08:29:25.7734374Z v_5 = tl_math.log(v_4) 2026-02-21T08:29:25.7734597Z # src[kl_div.py:106]: kl_loss += y_true_val * (log_true - y_pred_val) 2026-02-21T08:29:25.7734844Z v_6 = v_5 - y_pred_val_copy_1_0 2026-02-21T08:29:25.7735025Z v_7 = y_true_val_copy_1_0 * v_6 2026-02-21T08:29:25.7735212Z kl_loss = kl_loss_copy_1_0 + v_7 2026-02-21T08:29:25.7735411Z # src[kl_div.py:112]: loss_sum += kl_loss 2026-02-21T08:29:25.7735609Z loss_sum = loss_sum_copy_0 + kl_loss 2026-02-21T08:29:25.7735913Z # src[kl_div.py:115]: loss[tile_bt] = loss_sum.sum(dim=-1) 2026-02-21T08:29:25.7736157Z sum_1 = tl.cast(tl.sum(loss_sum, 1), tl.float32) 2026-02-21T08:29:25.7736383Z tl.store(loss + indices_1 * 1, sum_1, None) 2026-02-21T08:29:25.7736519Z 2026-02-21T08:29:25.7736819Z def kl_div_forward(y_pred: Tensor, y_true: Tensor, log_target: bool=False, reduction: str='batchmean', eps: float=1e-10, *, _launcher=_default_launcher): 2026-02-21T08:29:25.7737233Z """ 2026-02-21T08:29:25.7737381Z Compute KL Divergence loss. 
2026-02-21T08:29:25.7737496Z 2026-02-21T08:29:25.7737553Z Args: 2026-02-21T08:29:25.7737739Z y_pred: Input predictions in log-space, shape (BT, V) 2026-02-21T08:29:25.7738036Z y_true: Target values (probabilities or log-probabilities), shape (BT, V) 2026-02-21T08:29:25.7738386Z log_target: If True, y_true is in log-space; if False, y_true is probabilities 2026-02-21T08:29:25.7738767Z reduction: Reduction mode ('none', 'sum', 'mean', 'batchmean') 2026-02-21T08:29:25.7739031Z eps: Small value to avoid numerical issues 2026-02-21T08:29:25.7739176Z 2026-02-21T08:29:25.7739235Z Returns: 2026-02-21T08:29:25.7739367Z loss: KL divergence loss 2026-02-21T08:29:25.7739525Z """ 2026-02-21T08:29:25.7739659Z # src[kl_div.py:74]: BT, V = y_pred.shape 2026-02-21T08:29:25.7739842Z BT, V = y_pred.shape 2026-02-21T08:29:25.7740030Z # src[kl_div.py:75]: assert y_true.shape == y_pred.shape, ( 2026-02-21T08:29:25.7740298Z # src[kl_div.py:76]: f"Shape mismatch: {y_true.shape} != {y_pred.shape}" 2026-02-21T08:29:25.7740523Z # src[kl_div.py:77]: ) 2026-02-21T08:29:25.7740770Z assert y_true.shape == y_pred.shape, f'Shape mismatch: {y_true.shape} != {y_pred.shape}' 2026-02-21T08:29:25.7741057Z # src[kl_div.py:80]: if reduction == "none": 2026-02-21T08:29:25.7741272Z # src[kl_div.py:81]: loss = torch.zeros_like(y_pred) 2026-02-21T08:29:25.7741479Z # src[kl_div.py:82]: else: 2026-02-21T08:29:25.7741636Z # src[kl_div.py:80-83]: ... 
2026-02-21T08:29:25.7741796Z if reduction == 'none': 2026-02-21T08:29:25.7742005Z # src[kl_div.py:81]: loss = torch.zeros_like(y_pred) 2026-02-21T08:29:25.7742214Z loss = torch.zeros_like(y_pred) 2026-02-21T08:29:25.7742381Z else: 2026-02-21T08:29:25.7742594Z # src[kl_div.py:83]: loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device) 2026-02-21T08:29:25.7742915Z loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device) 2026-02-21T08:29:25.7743197Z # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m): 2026-02-21T08:29:25.7743508Z # src[kl_div.py:90]: loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32) 2026-02-21T08:29:25.7743757Z # src[kl_div.py:89-115]: ... 2026-02-21T08:29:25.7744052Z _launcher(_helion_kl_div_forward, (4096,), y_pred, y_true, loss, log_target, eps, num_warps=4, num_stages=5) 2026-02-21T08:29:25.7744387Z # src[kl_div.py:118]: if reduction == "batchmean": 2026-02-21T08:29:25.7744619Z # src[kl_div.py:119]: final_loss = torch.sum(loss) / BT 2026-02-21T08:29:25.7744850Z # src[kl_div.py:120]: elif reduction == "sum": 2026-02-21T08:29:25.7745041Z # src[kl_div.py:118-125]: ... 
2026-02-21T08:29:25.7745212Z if reduction == 'batchmean': 2026-02-21T08:29:25.7745402Z # src[kl_div.py:119]: final_loss = torch.sum(loss) / BT 2026-02-21T08:29:25.7745621Z final_loss = torch.sum(loss) / BT 2026-02-21T08:29:25.7745805Z elif reduction == 'sum': 2026-02-21T08:29:25.7745997Z # src[kl_div.py:121]: final_loss = torch.sum(loss, dim=0) 2026-02-21T08:29:25.7746213Z final_loss = torch.sum(loss, dim=0) 2026-02-21T08:29:25.7746389Z elif reduction == 'mean': 2026-02-21T08:29:25.7746589Z # src[kl_div.py:123]: final_loss = torch.sum(loss) / (BT * V) 2026-02-21T08:29:25.7746806Z final_loss = torch.sum(loss) / (BT * V) 2026-02-21T08:29:25.7746980Z else: 2026-02-21T08:29:25.7747113Z # src[kl_div.py:125]: final_loss = loss 2026-02-21T08:29:25.7747357Z final_loss = loss 2026-02-21T08:29:25.7747518Z # src[kl_div.py:127]: return final_loss 2026-02-21T08:29:25.7747687Z return final_loss 2026-02-21T08:29:26.8022786Z WARNING:tritonbench.utils.triton_op:Completed input ID 2: 2026-02-21T08:29:26.8026726Z (B, T, V) 2026-02-21T08:29:26.8026923Z --------------- 2026-02-21T08:29:26.8027118Z (8, 512, 16384) 2026-02-21T08:29:26.8027225Z 2026-02-21T08:29:26.8376710Z 50%|█████ | 3/6 [07:59<07:56, 158.97s/it]WARNING:tritonbench.utils.triton_op:Running input ID 3: 2026-02-21T08:29:26.8381071Z (B, T, V) 2026-02-21T08:29:26.8383066Z --------------- 2026-02-21T08:29:26.8383297Z (8, 512, 32768) 2026-02-21T08:29:26.8387587Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for torch_kl_div 2026-02-21T08:29:28.0374680Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for liger_kl_div 2026-02-21T08:29:29.1633193Z INFO:tritonbench.utils.triton_op:Took 3.97ms to get benchmark function for torch_compile_kl_div 2026-02-21T08:29:30.5049934Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:29:30.5050298Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:29:30.5050592Z 'dtype': 'torch.float32', 2026-02-21T08:29:30.5050868Z 'shape': (4096, 32768), 
2026-02-21T08:29:30.5051144Z 'stride': (32768, 1)}, 2026-02-21T08:29:30.5051417Z { 'device': 'cuda:0', 2026-02-21T08:29:30.5051677Z 'dtype': 'torch.float32', 2026-02-21T08:29:30.5052298Z 'shape': (4096, 32768), 2026-02-21T08:29:30.5052566Z 'stride': (32768, 1)}), 2026-02-21T08:29:30.5052826Z 'kwargs': {}} 2026-02-21T08:29:30.5062052Z INFO:tritonbench.utils.triton_op:Took 1.64ms to get benchmark function for helion_kl_div_tritonbench 2026-02-21T08:29:30.7376594Z [0s] Autotune random seed: 2135561342 2026-02-21T08:29:30.7732030Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:30:03.6654858Z [32s] Timeout after 30s compiling Config(block_sizes=[2048, 128], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'last'], num_sm_multiplier=64, num_stages=8, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[3, 0], range_unroll_factors=[0, 4], range_warp_specializes=[True, None]) 2026-02-21T08:30:03.7081163Z [32s] Timeout after 30s compiling Config(block_sizes=[2048, 32], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_stages=7, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]) 2026-02-21T08:30:04.2249891Z [33s] Timeout after 30s compiling Config(block_sizes=[2048, 8], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['first', ''], num_stages=8, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[None, None]) 2026-02-21T08:30:04.3136854Z [33s] Timeout after 30s compiling Config(block_sizes=[2048, 512], indexing=['pointer', 'tensor_descriptor', 'pointer'], 
load_eviction_policies=['', 'last'], num_sm_multiplier=16, num_stages=8, num_warps=32, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[False, False], range_num_stages=[2, 2], range_unroll_factors=[1, 4], range_warp_specializes=[False, None]) 2026-02-21T08:30:04.6716066Z [33s] Timeout after 30s compiling Config(block_sizes=[64, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', ''], maxnreg=256, num_sm_multiplier=128, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]) 2026-02-21T08:30:05.8680310Z [35s] Timeout after 30s compiling Config(block_sizes=[128, 1024], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], maxnreg=32, num_sm_multiplier=16, num_stages=7, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[3, 1], range_unroll_factors=[1, 1], range_warp_specializes=[None, True]) 2026-02-21T08:30:05.8695971Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s 2026-02-21T08:30:05.9579640Z module { 2026-02-21T08:30:05.9584414Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:30:05.9585658Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:30:05.9586274Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:30:05.9586512Z %cst = arith.constant dense<32768> : tensor<16x1xi32> 2026-02-21T08:30:05.9586775Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<16x8xf32> 2026-02-21T08:30:05.9586994Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:30:05.9587180Z %c4096_i32 = arith.constant 4096 : i32 
2026-02-21T08:30:05.9587371Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T08:30:05.9587550Z %c32768_i64 = arith.constant 32768 : i64 2026-02-21T08:30:05.9587731Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:30:05.9588039Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:30:05.9588355Z %1 = tt.get_program_id x : i32 2026-02-21T08:30:05.9588524Z %2 = arith.muli %1, %c16_i32 : i32 2026-02-21T08:30:05.9588749Z %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:30:05.9588989Z %4 = tt.splat %2 : i32 -> tensor<16xi32> 2026-02-21T08:30:05.9589173Z %5 = arith.addi %4, %3 : tensor<16xi32> 2026-02-21T08:30:05.9589481Z %6 = scf.for %arg5 = %c0_i32 to %c32768_i32 step %c8_i32 iter_args(%arg6 = %cst_0) -> (tensor<16x8xf32>) : i32 { 2026-02-21T08:30:05.9589820Z %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:30:05.9590064Z %11 = tt.splat %arg5 : i32 -> tensor<8xi32> 2026-02-21T08:30:05.9590256Z %12 = arith.addi %11, %10 : tensor<8xi32> 2026-02-21T08:30:05.9590533Z %13 = tt.descriptor_load %0[%2, %arg5] : !tt.tensordesc> -> tensor<16x8xf32> 2026-02-21T08:30:05.9590869Z %14 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T08:30:05.9591118Z %15 = arith.muli %14, %cst : tensor<16x1xi32> 2026-02-21T08:30:05.9591363Z %16 = tt.expand_dims %12 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:30:05.9591631Z %17 = tt.broadcast %15 : tensor<16x1xi32> -> tensor<16x8xi32> 2026-02-21T08:30:05.9591932Z %18 = tt.broadcast %16 : tensor<1x8xi32> -> tensor<16x8xi32> 2026-02-21T08:30:05.9592162Z %19 = arith.addi %17, %18 : tensor<16x8xi32> 2026-02-21T08:30:05.9592399Z %20 = tt.splat %arg1 : !tt.ptr -> tensor<16x8x!tt.ptr> 2026-02-21T08:30:05.9592671Z %21 = tt.addptr %20, %19 : tensor<16x8x!tt.ptr>, tensor<16x8xi32> 2026-02-21T08:30:05.9592957Z %22 = tt.load %21 evictionPolicy = evict_first : tensor<16x8x!tt.ptr> 
2026-02-21T08:30:05.9593220Z %23 = scf.if %arg3 -> (tensor<16x8xf32>) { 2026-02-21T08:30:05.9593572Z %25 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x8xf32>) -> tensor<16x8xf32> 2026-02-21T08:30:05.9593939Z %26 = arith.subf %22, %13 : tensor<16x8xf32> 2026-02-21T08:30:05.9594148Z %27 = arith.mulf %25, %26 : tensor<16x8xf32> 2026-02-21T08:30:05.9594352Z %28 = arith.addf %27, %cst_0 : tensor<16x8xf32> 2026-02-21T08:30:05.9594556Z scf.yield %28 : tensor<16x8xf32> 2026-02-21T08:30:05.9594845Z } else { 2026-02-21T08:30:05.9595004Z %25 = tt.splat %arg4 : f32 -> tensor<16x8xf32> 2026-02-21T08:30:05.9595213Z %26 = arith.cmpf ogt, %22, %25 : tensor<16x8xf32> 2026-02-21T08:30:05.9595428Z %27 = arith.cmpf une, %22, %22 : tensor<16x8xf32> 2026-02-21T08:30:05.9595621Z %28 = arith.ori %26, %27 : tensor<16x8xi1> 2026-02-21T08:30:05.9595853Z %29 = arith.select %28, %22, %25 : tensor<16x8xi1>, tensor<16x8xf32> 2026-02-21T08:30:05.9596086Z %30 = math.log %29 : tensor<16x8xf32> 2026-02-21T08:30:05.9596272Z %31 = arith.subf %30, %13 : tensor<16x8xf32> 2026-02-21T08:30:05.9596468Z %32 = arith.mulf %22, %31 : tensor<16x8xf32> 2026-02-21T08:30:05.9596662Z %33 = arith.addf %32, %cst_0 : tensor<16x8xf32> 2026-02-21T08:30:05.9596856Z scf.yield %33 : tensor<16x8xf32> 2026-02-21T08:30:05.9597016Z } 2026-02-21T08:30:05.9597162Z %24 = arith.addf %arg6, %23 : tensor<16x8xf32> 2026-02-21T08:30:05.9597412Z scf.yield %24 : tensor<16x8xf32> 2026-02-21T08:30:05.9597590Z } {tt.warp_specialize} 2026-02-21T08:30:05.9597759Z %7 = "tt.reduce"(%6) <{axis = 1 : i32}> ({ 2026-02-21T08:30:05.9597934Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:30:05.9598107Z %10 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:30:05.9598281Z tt.reduce.return %10 : f32 2026-02-21T08:30:05.9598457Z }) : (tensor<16x8xf32>) -> tensor<16xf32> 2026-02-21T08:30:05.9598669Z %8 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:30:05.9598922Z %9 = tt.addptr %8, %5 : 
tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:30:05.9599160Z tt.store %9, %7 : tensor<16x!tt.ptr> 2026-02-21T08:30:05.9599332Z tt.return 2026-02-21T08:30:05.9599458Z } 2026-02-21T08:30:05.9599571Z } 2026-02-21T08:30:05.9599636Z 2026-02-21T08:30:05.9599693Z {-# 2026-02-21T08:30:05.9599817Z external_resources: { 2026-02-21T08:30:05.9599973Z mlir_reproducer: { 2026-02-21T08:30:05.9604313Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, 
triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:30:05.9608697Z disable_threading: false, 2026-02-21T08:30:05.9608879Z verify_each: true 2026-02-21T08:30:05.9609029Z } 2026-02-21T08:30:05.9609163Z } 2026-02-21T08:30:05.9609294Z #-} 2026-02-21T08:30:05.9609791Z /tmp/torchinductor_root/hc/chc5bnnpsov5qsmmbqdu5bumoylsvdxz3pcs4sl2kb2c3a5myqiv.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:30:05.9611199Z /tmp/torchinductor_root/hc/chc5bnnpsov5qsmmbqdu5bumoylsvdxz3pcs4sl2kb2c3a5myqiv.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:30:05.9612320Z [35s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:30:05.9613428Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 16], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=6, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:30:05.9614373Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:30:05.9614619Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:30:11.4559625Z module attributes {ttg.maxnreg = 64 : i32} { 2026-02-21T08:30:11.4560538Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:30:11.4561338Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:30:11.4561588Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:30:11.4561812Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:30:11.4562186Z %cst = arith.constant dense<0.000000e+00> : tensor<64x16xf32> 2026-02-21T08:30:11.4562410Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:30:11.4562604Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:30:11.4562836Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T08:30:11.4563028Z %c32768_i64 = arith.constant 32768 : i64 2026-02-21T08:30:11.4563208Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:30:11.4563512Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:30:11.4563944Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:30:11.4564244Z %2 = tt.get_program_id x : i32 2026-02-21T08:30:11.4564441Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:30:11.4564620Z %4 = arith.minsi %3, %c64_i32 : i32 2026-02-21T08:30:11.4564791Z %5 = arith.subi %4, %2 : i32 2026-02-21T08:30:11.4564965Z %c1_i32_0 = arith.constant 1 : i32 2026-02-21T08:30:11.4565139Z %6 = arith.subi %c1_i32, %c1_i32_0 : i32 2026-02-21T08:30:11.4565317Z %7 = arith.addi %5, %6 : i32 2026-02-21T08:30:11.4565476Z %8 = arith.divui %7, %c1_i32 : i32 2026-02-21T08:30:11.4565660Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:30:11.4565826Z %9 = arith.remsi %8, %c3_i32 : i32 2026-02-21T08:30:11.4565995Z %10 = arith.subi %8, %9 : i32 2026-02-21T08:30:11.4566159Z %11 = arith.muli %10, %c1_i32 : i32 2026-02-21T08:30:11.4566333Z %12 = arith.addi %2, %11 : 
i32 2026-02-21T08:30:11.4566505Z %13 = arith.muli %c1_i32, %c3_i32 : i32 2026-02-21T08:30:11.4566692Z scf.for %arg5 = %2 to %12 step %13 : i32 { 2026-02-21T08:30:11.4566891Z %14 = arith.muli %arg5, %c64_i32 : i32 2026-02-21T08:30:11.4567115Z %15 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:30:11.4567371Z %16 = tt.splat %14 : i32 -> tensor<64xi32> 2026-02-21T08:30:11.4567560Z %17 = arith.addi %16, %15 : tensor<64xi32> 2026-02-21T08:30:11.4567871Z %18 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c16_i32 iter_args(%arg7 = %cst) -> (tensor<64x16xf32>) : i32 { 2026-02-21T08:30:11.4568275Z %42 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<64x16xf32> 2026-02-21T08:30:11.4568964Z %43 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc> -> tensor<64x16xf32> 2026-02-21T08:30:11.4569262Z %44 = scf.if %arg3 -> (tensor<64x16xf32>) { 2026-02-21T08:30:11.4569623Z %46 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x16xf32>) -> tensor<64x16xf32> 2026-02-21T08:30:11.4569998Z %47 = arith.subf %43, %42 : tensor<64x16xf32> 2026-02-21T08:30:11.4570210Z %48 = arith.mulf %46, %47 : tensor<64x16xf32> 2026-02-21T08:30:11.4570412Z %49 = arith.addf %48, %cst : tensor<64x16xf32> 2026-02-21T08:30:11.4570615Z scf.yield %49 : tensor<64x16xf32> 2026-02-21T08:30:11.4570780Z } else { 2026-02-21T08:30:11.4570942Z %46 = tt.splat %arg4 : f32 -> tensor<64x16xf32> 2026-02-21T08:30:11.4571158Z %47 = arith.cmpf ogt, %43, %46 : tensor<64x16xf32> 2026-02-21T08:30:11.4571503Z %48 = arith.cmpf une, %43, %43 : tensor<64x16xf32> 2026-02-21T08:30:11.4571718Z %49 = arith.ori %47, %48 : tensor<64x16xi1> 2026-02-21T08:30:11.4571997Z %50 = arith.select %49, %43, %46 : tensor<64x16xi1>, tensor<64x16xf32> 2026-02-21T08:30:11.4572239Z %51 = math.log %50 : tensor<64x16xf32> 2026-02-21T08:30:11.4572429Z %52 = arith.subf %51, %42 : tensor<64x16xf32> 2026-02-21T08:30:11.4572637Z %53 = arith.mulf %43, %52 : 
tensor<64x16xf32> 2026-02-21T08:30:11.4572839Z %54 = arith.addf %53, %cst : tensor<64x16xf32> 2026-02-21T08:30:11.4573038Z scf.yield %54 : tensor<64x16xf32> 2026-02-21T08:30:11.4573208Z } 2026-02-21T08:30:11.4573359Z %45 = arith.addf %arg7, %44 : tensor<64x16xf32> 2026-02-21T08:30:11.4573553Z scf.yield %45 : tensor<64x16xf32> 2026-02-21T08:30:11.4573750Z } {tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:30:11.4573963Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:30:11.4574152Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:11.4574330Z %42 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:30:11.4574506Z tt.reduce.return %42 : f32 2026-02-21T08:30:11.4574687Z }) : (tensor<64x16xf32>) -> tensor<64xf32> 2026-02-21T08:30:11.4574908Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<64x!tt.ptr> 2026-02-21T08:30:11.4575166Z %21 = tt.addptr %20, %17 : tensor<64x!tt.ptr>, tensor<64xi32> 2026-02-21T08:30:11.4575400Z tt.store %21, %19 : tensor<64x!tt.ptr> 2026-02-21T08:30:11.4575587Z %c1_i32_1 = arith.constant 1 : i32 2026-02-21T08:30:11.4575774Z %22 = arith.muli %c1_i32, %c1_i32_1 : i32 2026-02-21T08:30:11.4575953Z %23 = arith.addi %arg5, %22 : i32 2026-02-21T08:30:11.4576132Z %24 = arith.muli %23, %c64_i32 : i32 2026-02-21T08:30:11.4576347Z %25 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:30:11.4576587Z %26 = tt.splat %24 : i32 -> tensor<64xi32> 2026-02-21T08:30:11.4576779Z %27 = arith.addi %26, %25 : tensor<64xi32> 2026-02-21T08:30:11.4577075Z %28 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c16_i32 iter_args(%arg7 = %cst) -> (tensor<64x16xf32>) : i32 { 2026-02-21T08:30:11.4577467Z %42 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc> -> tensor<64x16xf32> 2026-02-21T08:30:11.4577819Z %43 = tt.descriptor_load %1[%24, %arg6] : !tt.tensordesc> -> tensor<64x16xf32> 2026-02-21T08:30:11.4578103Z %44 = scf.if %arg3 -> (tensor<64x16xf32>) { 2026-02-21T08:30:11.4578458Z %46 = tt.extern_elementwise %43 {libname = "", libpath = 
"", pure = true, symbol = "__nv_expf"} : (tensor<64x16xf32>) -> tensor<64x16xf32> 2026-02-21T08:30:11.4578813Z %47 = arith.subf %43, %42 : tensor<64x16xf32> 2026-02-21T08:30:11.4579025Z %48 = arith.mulf %46, %47 : tensor<64x16xf32> 2026-02-21T08:30:11.4579227Z %49 = arith.addf %48, %cst : tensor<64x16xf32> 2026-02-21T08:30:11.4579496Z scf.yield %49 : tensor<64x16xf32> 2026-02-21T08:30:11.4579658Z } else { 2026-02-21T08:30:11.4579819Z %46 = tt.splat %arg4 : f32 -> tensor<64x16xf32> 2026-02-21T08:30:11.4580034Z %47 = arith.cmpf ogt, %43, %46 : tensor<64x16xf32> 2026-02-21T08:30:11.4580250Z %48 = arith.cmpf une, %43, %43 : tensor<64x16xf32> 2026-02-21T08:30:11.4580456Z %49 = arith.ori %47, %48 : tensor<64x16xi1> 2026-02-21T08:30:11.4580683Z %50 = arith.select %49, %43, %46 : tensor<64x16xi1>, tensor<64x16xf32> 2026-02-21T08:30:11.4580920Z %51 = math.log %50 : tensor<64x16xf32> 2026-02-21T08:30:11.4581111Z %52 = arith.subf %51, %42 : tensor<64x16xf32> 2026-02-21T08:30:11.4581312Z %53 = arith.mulf %43, %52 : tensor<64x16xf32> 2026-02-21T08:30:11.4581513Z %54 = arith.addf %53, %cst : tensor<64x16xf32> 2026-02-21T08:30:11.4581735Z scf.yield %54 : tensor<64x16xf32> 2026-02-21T08:30:11.4582137Z } 2026-02-21T08:30:11.4582289Z %45 = arith.addf %arg7, %44 : tensor<64x16xf32> 2026-02-21T08:30:11.4582476Z scf.yield %45 : tensor<64x16xf32> 2026-02-21T08:30:11.4582678Z } {tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:30:11.4582875Z %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({ 2026-02-21T08:30:11.4583064Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:11.4583241Z %42 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:30:11.4583417Z tt.reduce.return %42 : f32 2026-02-21T08:30:11.4583601Z }) : (tensor<64x16xf32>) -> tensor<64xf32> 2026-02-21T08:30:11.4583815Z %30 = tt.splat %arg2 : !tt.ptr -> tensor<64x!tt.ptr> 2026-02-21T08:30:11.4584074Z %31 = tt.addptr %30, %27 : tensor<64x!tt.ptr>, tensor<64xi32> 2026-02-21T08:30:11.4584298Z tt.store %31, %29 : tensor<64x!tt.ptr> 
2026-02-21T08:30:11.4584493Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:30:11.4584680Z %32 = arith.muli %c1_i32, %c2_i32 : i32 2026-02-21T08:30:11.4584858Z %33 = arith.addi %arg5, %32 : i32 2026-02-21T08:30:11.4585035Z %34 = arith.muli %33, %c64_i32 : i32 2026-02-21T08:30:11.4585247Z %35 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:30:11.4585484Z %36 = tt.splat %34 : i32 -> tensor<64xi32> 2026-02-21T08:30:11.4585663Z %37 = arith.addi %36, %35 : tensor<64xi32> 2026-02-21T08:30:11.4585991Z %38 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c16_i32 iter_args(%arg7 = %cst) -> (tensor<64x16xf32>) : i32 { 2026-02-21T08:30:11.4586390Z %42 = tt.descriptor_load %0[%34, %arg6] : !tt.tensordesc> -> tensor<64x16xf32> 2026-02-21T08:30:11.4586750Z %43 = tt.descriptor_load %1[%34, %arg6] : !tt.tensordesc> -> tensor<64x16xf32> 2026-02-21T08:30:11.4587038Z %44 = scf.if %arg3 -> (tensor<64x16xf32>) { 2026-02-21T08:30:11.4587395Z %46 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x16xf32>) -> tensor<64x16xf32> 2026-02-21T08:30:11.4587768Z %47 = arith.subf %43, %42 : tensor<64x16xf32> 2026-02-21T08:30:11.4587969Z %48 = arith.mulf %46, %47 : tensor<64x16xf32> 2026-02-21T08:30:11.4588174Z %49 = arith.addf %48, %cst : tensor<64x16xf32> 2026-02-21T08:30:11.4588372Z scf.yield %49 : tensor<64x16xf32> 2026-02-21T08:30:11.4588537Z } else { 2026-02-21T08:30:11.4588699Z %46 = tt.splat %arg4 : f32 -> tensor<64x16xf32> 2026-02-21T08:30:11.4588914Z %47 = arith.cmpf ogt, %43, %46 : tensor<64x16xf32> 2026-02-21T08:30:11.4589147Z %48 = arith.cmpf une, %43, %43 : tensor<64x16xf32> 2026-02-21T08:30:11.4589346Z %49 = arith.ori %47, %48 : tensor<64x16xi1> 2026-02-21T08:30:11.4589579Z %50 = arith.select %49, %43, %46 : tensor<64x16xi1>, tensor<64x16xf32> 2026-02-21T08:30:11.4589814Z %51 = math.log %50 : tensor<64x16xf32> 2026-02-21T08:30:11.4590003Z %52 = arith.subf %51, %42 : tensor<64x16xf32> 
2026-02-21T08:30:11.4590261Z %53 = arith.mulf %43, %52 : tensor<64x16xf32> 2026-02-21T08:30:11.4590455Z %54 = arith.addf %53, %cst : tensor<64x16xf32> 2026-02-21T08:30:11.4590651Z scf.yield %54 : tensor<64x16xf32> 2026-02-21T08:30:11.4590813Z } 2026-02-21T08:30:11.4590961Z %45 = arith.addf %arg7, %44 : tensor<64x16xf32> 2026-02-21T08:30:11.4591147Z scf.yield %45 : tensor<64x16xf32> 2026-02-21T08:30:11.4591347Z } {tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:30:11.4591577Z %39 = "tt.reduce"(%38) <{axis = 1 : i32}> ({ 2026-02-21T08:30:11.4591759Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:11.4591972Z %42 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:30:11.4592158Z tt.reduce.return %42 : f32 2026-02-21T08:30:11.4592338Z }) : (tensor<64x16xf32>) -> tensor<64xf32> 2026-02-21T08:30:11.4592568Z %40 = tt.splat %arg2 : !tt.ptr -> tensor<64x!tt.ptr> 2026-02-21T08:30:11.4592921Z %41 = tt.addptr %40, %37 : tensor<64x!tt.ptr>, tensor<64xi32> 2026-02-21T08:30:11.4593162Z tt.store %41, %39 : tensor<64x!tt.ptr> 2026-02-21T08:30:11.4593347Z } {tt.num_stages = 1 : i32} 2026-02-21T08:30:11.4593529Z scf.for %arg5 = %12 to %4 step %c1_i32 : i32 { 2026-02-21T08:30:11.4593728Z %14 = arith.muli %arg5, %c64_i32 : i32 2026-02-21T08:30:11.4593943Z %15 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:30:11.4594180Z %16 = tt.splat %14 : i32 -> tensor<64xi32> 2026-02-21T08:30:11.4594363Z %17 = arith.addi %16, %15 : tensor<64xi32> 2026-02-21T08:30:11.4594667Z %18 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c16_i32 iter_args(%arg7 = %cst) -> (tensor<64x16xf32>) : i32 { 2026-02-21T08:30:11.4595057Z %22 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<64x16xf32> 2026-02-21T08:30:11.4595437Z %23 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc> -> tensor<64x16xf32> 2026-02-21T08:30:11.4595725Z %24 = scf.if %arg3 -> (tensor<64x16xf32>) { 2026-02-21T08:30:11.4596074Z %26 = tt.extern_elementwise %23 {libname = "", libpath = "", 
pure = true, symbol = "__nv_expf"} : (tensor<64x16xf32>) -> tensor<64x16xf32> 2026-02-21T08:30:11.4596436Z %27 = arith.subf %23, %22 : tensor<64x16xf32> 2026-02-21T08:30:11.4596636Z %28 = arith.mulf %26, %27 : tensor<64x16xf32> 2026-02-21T08:30:11.4596845Z %29 = arith.addf %28, %cst : tensor<64x16xf32> 2026-02-21T08:30:11.4597043Z scf.yield %29 : tensor<64x16xf32> 2026-02-21T08:30:11.4597206Z } else { 2026-02-21T08:30:11.4597366Z %26 = tt.splat %arg4 : f32 -> tensor<64x16xf32> 2026-02-21T08:30:11.4597578Z %27 = arith.cmpf ogt, %23, %26 : tensor<64x16xf32> 2026-02-21T08:30:11.4597799Z %28 = arith.cmpf une, %23, %23 : tensor<64x16xf32> 2026-02-21T08:30:11.4598006Z %29 = arith.ori %27, %28 : tensor<64x16xi1> 2026-02-21T08:30:11.4598246Z %30 = arith.select %29, %23, %26 : tensor<64x16xi1>, tensor<64x16xf32> 2026-02-21T08:30:11.4598475Z %31 = math.log %30 : tensor<64x16xf32> 2026-02-21T08:30:11.4598670Z %32 = arith.subf %31, %22 : tensor<64x16xf32> 2026-02-21T08:30:11.4598867Z %33 = arith.mulf %23, %32 : tensor<64x16xf32> 2026-02-21T08:30:11.4599062Z %34 = arith.addf %33, %cst : tensor<64x16xf32> 2026-02-21T08:30:11.4599256Z scf.yield %34 : tensor<64x16xf32> 2026-02-21T08:30:11.4599418Z } 2026-02-21T08:30:11.4599562Z %25 = arith.addf %arg7, %24 : tensor<64x16xf32> 2026-02-21T08:30:11.4599746Z scf.yield %25 : tensor<64x16xf32> 2026-02-21T08:30:11.4599943Z } {tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:30:11.4600143Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:30:11.4600324Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:11.4600499Z %22 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:30:11.4600736Z tt.reduce.return %22 : f32 2026-02-21T08:30:11.4600918Z }) : (tensor<64x16xf32>) -> tensor<64xf32> 2026-02-21T08:30:11.4601134Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<64x!tt.ptr> 2026-02-21T08:30:11.4601390Z %21 = tt.addptr %20, %17 : tensor<64x!tt.ptr>, tensor<64xi32> 2026-02-21T08:30:11.4601611Z tt.store %21, %19 : tensor<64x!tt.ptr> 
2026-02-21T08:30:11.4601801Z } {tt.num_stages = 1 : i32} 2026-02-21T08:30:11.4601990Z tt.return 2026-02-21T08:30:11.4602113Z } 2026-02-21T08:30:11.4602234Z } 2026-02-21T08:30:11.4602301Z 2026-02-21T08:30:11.4602350Z {-# 2026-02-21T08:30:11.4602482Z external_resources: { 2026-02-21T08:30:11.4602630Z mlir_reproducer: { 2026-02-21T08:30:11.4607014Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, 
canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:30:11.4611555Z disable_threading: false, 2026-02-21T08:30:11.4611759Z verify_each: true 2026-02-21T08:30:11.4611955Z } 2026-02-21T08:30:11.4612105Z } 2026-02-21T08:30:11.4612262Z #-} 2026-02-21T08:30:11.4612791Z /tmp/torchinductor_root/2l/c2lby4mn655jcmwxwsdua65gj62bm6qvop2az7nfbzwxlpel5xur.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:30:11.4614185Z /tmp/torchinductor_root/2l/c2lby4mn655jcmwxwsdua65gj62bm6qvop2az7nfbzwxlpel5xur.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:30:11.4615327Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:30:11.4616503Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'first'], maxnreg=64, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[3, 1], range_unroll_factors=[3, 0], range_warp_specializes=[False, True]), static_shapes=True) 2026-02-21T08:30:11.4617572Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:30:11.4617821Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:30:11.4921674Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:30:11.4925311Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:30:11.4930261Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:30:11.4934840Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:30:11.4938752Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:30:11.4943854Z %c592_i32 = arith.constant 592 : i32 2026-02-21T08:30:11.4947124Z %cst = arith.constant dense<0.000000e+00> : tensor<128x16xf32> 2026-02-21T08:30:11.4950301Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:30:11.4955501Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:30:11.4955790Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T08:30:11.4959945Z %c32768_i64 = arith.constant 32768 : i64 2026-02-21T08:30:11.4964140Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:30:11.4964612Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:30:11.4965428Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:30:11.4970340Z %2 = tt.get_program_id x : i32 2026-02-21T08:30:11.4974912Z scf.for %arg5 = %2 to %c32_i32 step %c592_i32 : i32 { 2026-02-21T08:30:11.4978670Z %3 = arith.muli %arg5, %c128_i32 : i32 2026-02-21T08:30:11.4983177Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:30:11.4985110Z %5 = tt.splat %3 : i32 -> tensor<128xi32> 2026-02-21T08:30:11.4985353Z %6 = arith.addi %5, %4 : tensor<128xi32> 2026-02-21T08:30:11.4985669Z %7 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c16_i32 iter_args(%arg7 = %cst) -> (tensor<128x16xf32>) : i32 { 2026-02-21T08:30:11.4986089Z %11 = tt.descriptor_load %0[%3, %arg6] : !tt.tensordesc> -> tensor<128x16xf32> 2026-02-21T08:30:11.4986464Z %12 
= tt.descriptor_load %1[%3, %arg6] : !tt.tensordesc> -> tensor<128x16xf32> 2026-02-21T08:30:11.4986749Z %13 = scf.if %arg3 -> (tensor<128x16xf32>) { 2026-02-21T08:30:11.4987118Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x16xf32>) -> tensor<128x16xf32> 2026-02-21T08:30:11.4987477Z %16 = arith.subf %12, %11 : tensor<128x16xf32> 2026-02-21T08:30:11.4987685Z %17 = arith.mulf %15, %16 : tensor<128x16xf32> 2026-02-21T08:30:11.4987888Z %18 = arith.addf %17, %cst : tensor<128x16xf32> 2026-02-21T08:30:11.4988091Z scf.yield %18 : tensor<128x16xf32> 2026-02-21T08:30:11.4988266Z } else { 2026-02-21T08:30:11.4988425Z %15 = tt.splat %arg4 : f32 -> tensor<128x16xf32> 2026-02-21T08:30:11.4988654Z %16 = arith.cmpf ogt, %12, %15 : tensor<128x16xf32> 2026-02-21T08:30:11.4988873Z %17 = arith.cmpf une, %12, %12 : tensor<128x16xf32> 2026-02-21T08:30:11.4989085Z %18 = arith.ori %16, %17 : tensor<128x16xi1> 2026-02-21T08:30:11.4989321Z %19 = arith.select %18, %12, %15 : tensor<128x16xi1>, tensor<128x16xf32> 2026-02-21T08:30:11.4989570Z %20 = math.log %19 : tensor<128x16xf32> 2026-02-21T08:30:11.4989774Z %21 = arith.subf %20, %11 : tensor<128x16xf32> 2026-02-21T08:30:11.4989976Z %22 = arith.mulf %12, %21 : tensor<128x16xf32> 2026-02-21T08:30:11.4990183Z %23 = arith.addf %22, %cst : tensor<128x16xf32> 2026-02-21T08:30:11.4990376Z scf.yield %23 : tensor<128x16xf32> 2026-02-21T08:30:11.4990561Z } 2026-02-21T08:30:11.4990702Z %14 = arith.addf %arg7, %13 : tensor<128x16xf32> 2026-02-21T08:30:11.4990897Z scf.yield %14 : tensor<128x16xf32> 2026-02-21T08:30:11.4991069Z } 2026-02-21T08:30:11.4991204Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:30:11.4991612Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:11.4991786Z %11 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:30:11.4992055Z tt.reduce.return %11 : f32 2026-02-21T08:30:11.4992237Z }) : (tensor<128x16xf32>) -> tensor<128xf32> 2026-02-21T08:30:11.4992469Z %9 = 
tt.splat %arg2 : !tt.ptr -> tensor<128x!tt.ptr> 2026-02-21T08:30:11.4992727Z %10 = tt.addptr %9, %6 : tensor<128x!tt.ptr>, tensor<128xi32> 2026-02-21T08:30:11.4992966Z tt.store %10, %8 : tensor<128x!tt.ptr> 2026-02-21T08:30:11.4993203Z } {tt.flatten, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:30:11.4993409Z tt.return 2026-02-21T08:30:11.4993538Z } 2026-02-21T08:30:11.4993650Z } 2026-02-21T08:30:11.4993725Z 2026-02-21T08:30:11.4993773Z {-# 2026-02-21T08:30:11.4993895Z external_resources: { 2026-02-21T08:30:11.4994054Z mlir_reproducer: { 2026-02-21T08:30:11.4998455Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, 
triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:30:11.5002708Z disable_threading: false, 2026-02-21T08:30:11.5002872Z verify_each: true 2026-02-21T08:30:11.5003008Z } 2026-02-21T08:30:11.5003124Z } 2026-02-21T08:30:11.5003229Z #-} 2026-02-21T08:30:11.5003643Z /tmp/torchinductor_root/kv/ckvgqnmj4yegtswbnyslacdbdlyyi26tqlv5jhjogcicpk66757j.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:30:11.5004821Z /tmp/torchinductor_root/kv/ckvgqnmj4yegtswbnyslacdbdlyyi26tqlv5jhjogcicpk66757j.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:30:11.5005776Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:30:11.5006848Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=32, num_sm_multiplier=4, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:30:11.5007858Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:30:11.5008103Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:30:11.5236647Z module attributes {ttg.maxnreg = 64 : i32} { 2026-02-21T08:30:11.5238497Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:30:11.5239134Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:30:11.5243343Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:30:11.5247943Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:30:11.5251840Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:30:11.5253774Z %cst = arith.constant dense<0.000000e+00> : tensor<4x128xf32> 2026-02-21T08:30:11.5254021Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:30:11.5254407Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:30:11.5254649Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T08:30:11.5258402Z %c32768_i64 = arith.constant 32768 : i64 2026-02-21T08:30:11.5262927Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:30:11.5266929Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:30:11.5268309Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:30:11.5268664Z %2 = tt.get_program_id x : i32 2026-02-21T08:30:11.5268848Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:30:11.5269046Z %4 = arith.minsi %3, %c1024_i32 : i32 2026-02-21T08:30:11.5269248Z scf.for %arg5 = %2 to %4 step %c1_i32 : i32 { 2026-02-21T08:30:11.5269457Z %5 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:30:11.5269699Z %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:30:11.5269942Z %7 = tt.splat %5 : i32 -> tensor<4xi32> 2026-02-21T08:30:11.5270133Z %8 = arith.addi %7, %6 : tensor<4xi32> 2026-02-21T08:30:11.5270435Z %9 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c128_i32 iter_args(%arg7 = %cst) -> (tensor<4x128xf32>) : i32 { 2026-02-21T08:30:11.5270842Z %13 
= tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc> -> tensor<4x128xf32> 2026-02-21T08:30:11.5271227Z %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc> -> tensor<4x128xf32> 2026-02-21T08:30:11.5271519Z %15 = scf.if %arg3 -> (tensor<4x128xf32>) { 2026-02-21T08:30:11.5271998Z %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x128xf32>) -> tensor<4x128xf32> 2026-02-21T08:30:11.5272564Z %18 = arith.subf %14, %13 : tensor<4x128xf32> 2026-02-21T08:30:11.5272883Z %19 = arith.mulf %17, %18 : tensor<4x128xf32> 2026-02-21T08:30:11.5273142Z %20 = arith.addf %19, %cst : tensor<4x128xf32> 2026-02-21T08:30:11.5273370Z scf.yield %20 : tensor<4x128xf32> 2026-02-21T08:30:11.5273544Z } else { 2026-02-21T08:30:11.5273721Z %17 = tt.splat %arg4 : f32 -> tensor<4x128xf32> 2026-02-21T08:30:11.5273977Z %18 = arith.cmpf ogt, %14, %17 : tensor<4x128xf32> 2026-02-21T08:30:11.5274202Z %19 = arith.cmpf une, %14, %14 : tensor<4x128xf32> 2026-02-21T08:30:11.5274424Z %20 = arith.ori %18, %19 : tensor<4x128xi1> 2026-02-21T08:30:11.5274668Z %21 = arith.select %20, %14, %17 : tensor<4x128xi1>, tensor<4x128xf32> 2026-02-21T08:30:11.5274918Z %22 = math.log %21 : tensor<4x128xf32> 2026-02-21T08:30:11.5275123Z %23 = arith.subf %22, %13 : tensor<4x128xf32> 2026-02-21T08:30:11.5275326Z %24 = arith.mulf %14, %23 : tensor<4x128xf32> 2026-02-21T08:30:11.5275539Z %25 = arith.addf %24, %cst : tensor<4x128xf32> 2026-02-21T08:30:11.5275741Z scf.yield %25 : tensor<4x128xf32> 2026-02-21T08:30:11.5276169Z } 2026-02-21T08:30:11.5276319Z %16 = arith.addf %arg7, %15 : tensor<4x128xf32> 2026-02-21T08:30:11.5276524Z scf.yield %16 : tensor<4x128xf32> 2026-02-21T08:30:11.5276817Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:30:11.5277121Z %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({ 2026-02-21T08:30:11.5277323Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:11.5277503Z %13 = arith.addf %arg6, %arg7 : 
f32 2026-02-21T08:30:11.5277698Z tt.reduce.return %13 : f32 2026-02-21T08:30:11.5277887Z }) : (tensor<4x128xf32>) -> tensor<4xf32> 2026-02-21T08:30:11.5278123Z %11 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:30:11.5278385Z %12 = tt.addptr %11, %8 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:30:11.5278624Z tt.store %12, %10 : tensor<4x!tt.ptr> 2026-02-21T08:30:11.5279011Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32} 2026-02-21T08:30:11.5279318Z tt.return 2026-02-21T08:30:11.5279465Z } 2026-02-21T08:30:11.5279584Z } 2026-02-21T08:30:11.5279662Z 2026-02-21T08:30:11.5279713Z {-# 2026-02-21T08:30:11.5279842Z external_resources: { 2026-02-21T08:30:11.5280001Z mlir_reproducer: { 2026-02-21T08:30:11.5284444Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, 
tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:30:11.5288774Z disable_threading: false, 2026-02-21T08:30:11.5288943Z verify_each: true 2026-02-21T08:30:11.5289080Z } 2026-02-21T08:30:11.5289198Z } 2026-02-21T08:30:11.5289303Z #-} 2026-02-21T08:30:11.5289718Z /tmp/torchinductor_root/cw/ccwb7dn2gwrggtpkebrg3wqmhspmdsywjh5dp3wmoysq6jmdayeb.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:30:11.5290897Z /tmp/torchinductor_root/cw/ccwb7dn2gwrggtpkebrg3wqmhspmdsywjh5dp3wmoysq6jmdayeb.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:30:11.5291914Z [40s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:30:11.5293183Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', ''], maxnreg=64, num_sm_multiplier=16, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, False], range_num_stages=[2, 1], range_unroll_factors=[1, 0], range_warp_specializes=[False, True]), static_shapes=True) 2026-02-21T08:30:11.5294162Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:30:11.5294408Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:30:12.7843000Z module { 2026-02-21T08:30:12.7843638Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:30:12.7846018Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:30:12.7846243Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:30:12.7846471Z %cst = arith.constant dense<0.000000e+00> : tensor<64x64xf32> 2026-02-21T08:30:12.7846705Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:30:12.7846891Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:30:12.7847069Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T08:30:12.7847252Z %c32768_i64 = arith.constant 32768 : i64 2026-02-21T08:30:12.7847424Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:30:12.7847752Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:30:12.7849153Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:30:12.7849473Z %2 = tt.get_program_id x : i32 2026-02-21T08:30:12.7849654Z %3 = arith.subi %c64_i32, %2 : i32 2026-02-21T08:30:12.7849863Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:30:12.7853963Z %4 = arith.subi %c9472_i32, %c1_i32 : 
i32 2026-02-21T08:30:12.7854248Z %5 = arith.addi %3, %4 : i32 2026-02-21T08:30:12.7858626Z %6 = arith.divui %5, %c9472_i32 : i32 2026-02-21T08:30:12.7864036Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:30:12.7867307Z %7 = arith.remsi %6, %c3_i32 : i32 2026-02-21T08:30:12.7870712Z %8 = arith.subi %6, %7 : i32 2026-02-21T08:30:12.7872673Z %9 = arith.muli %8, %c9472_i32 : i32 2026-02-21T08:30:12.7872903Z %10 = arith.addi %2, %9 : i32 2026-02-21T08:30:12.7873089Z %11 = arith.muli %c9472_i32, %c3_i32 : i32 2026-02-21T08:30:12.7873296Z scf.for %arg5 = %2 to %10 step %11 : i32 { 2026-02-21T08:30:12.7873490Z %12 = arith.muli %arg5, %c64_i32 : i32 2026-02-21T08:30:12.7873726Z %13 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:30:12.7873975Z %14 = tt.splat %12 : i32 -> tensor<64xi32> 2026-02-21T08:30:12.7874185Z %15 = arith.addi %14, %13 : tensor<64xi32> 2026-02-21T08:30:12.7874509Z %16 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c64_i32 iter_args(%arg7 = %cst) -> (tensor<64x64xf32>) : i32 { 2026-02-21T08:30:12.7874907Z %40 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc> -> tensor<64x64xf32> 2026-02-21T08:30:12.7875273Z %41 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc> -> tensor<64x64xf32> 2026-02-21T08:30:12.7875551Z %42 = scf.if %arg3 -> (tensor<64x64xf32>) { 2026-02-21T08:30:12.7875921Z %44 = tt.extern_elementwise %41 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x64xf32>) -> tensor<64x64xf32> 2026-02-21T08:30:12.7876297Z %45 = arith.subf %41, %40 : tensor<64x64xf32> 2026-02-21T08:30:12.7876507Z %46 = arith.mulf %44, %45 : tensor<64x64xf32> 2026-02-21T08:30:12.7876728Z %47 = arith.addf %46, %cst : tensor<64x64xf32> 2026-02-21T08:30:12.7876929Z scf.yield %47 : tensor<64x64xf32> 2026-02-21T08:30:12.7877110Z } else { 2026-02-21T08:30:12.7877588Z %44 = tt.splat %arg4 : f32 -> tensor<64x64xf32> 2026-02-21T08:30:12.7877827Z %45 = arith.cmpf ogt, %41, %44 : tensor<64x64xf32> 
2026-02-21T08:30:12.7878061Z %46 = arith.cmpf une, %41, %41 : tensor<64x64xf32> 2026-02-21T08:30:12.7878274Z %47 = arith.ori %45, %46 : tensor<64x64xi1> 2026-02-21T08:30:12.7878525Z %48 = arith.select %47, %41, %44 : tensor<64x64xi1>, tensor<64x64xf32> 2026-02-21T08:30:12.7878769Z %49 = math.log %48 : tensor<64x64xf32> 2026-02-21T08:30:12.7878970Z %50 = arith.subf %49, %40 : tensor<64x64xf32> 2026-02-21T08:30:12.7879166Z %51 = arith.mulf %41, %50 : tensor<64x64xf32> 2026-02-21T08:30:12.7879372Z %52 = arith.addf %51, %cst : tensor<64x64xf32> 2026-02-21T08:30:12.7879569Z scf.yield %52 : tensor<64x64xf32> 2026-02-21T08:30:12.7879733Z } 2026-02-21T08:30:12.7879949Z %43 = arith.addf %arg7, %42 : tensor<64x64xf32> 2026-02-21T08:30:12.7880147Z scf.yield %43 : tensor<64x64xf32> 2026-02-21T08:30:12.7880435Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:30:12.7880710Z %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({ 2026-02-21T08:30:12.7880894Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:12.7881073Z %40 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:30:12.7881258Z tt.reduce.return %40 : f32 2026-02-21T08:30:12.7881437Z }) : (tensor<64x64xf32>) -> tensor<64xf32> 2026-02-21T08:30:12.7881665Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<64x!tt.ptr> 2026-02-21T08:30:12.7882073Z %19 = tt.addptr %18, %15 : tensor<64x!tt.ptr>, tensor<64xi32> 2026-02-21T08:30:12.7882310Z tt.store %19, %17 : tensor<64x!tt.ptr> 2026-02-21T08:30:12.7882500Z %c1_i32_0 = arith.constant 1 : i32 2026-02-21T08:30:12.7882691Z %20 = arith.muli %c9472_i32, %c1_i32_0 : i32 2026-02-21T08:30:12.7882878Z %21 = arith.addi %arg5, %20 : i32 2026-02-21T08:30:12.7883059Z %22 = arith.muli %21, %c64_i32 : i32 2026-02-21T08:30:12.7883282Z %23 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:30:12.7883511Z %24 = tt.splat %22 : i32 -> tensor<64xi32> 2026-02-21T08:30:12.7883703Z %25 = arith.addi %24, %23 : tensor<64xi32> 2026-02-21T08:30:12.7884000Z %26 
= scf.for %arg6 = %c0_i32 to %c32768_i32 step %c64_i32 iter_args(%arg7 = %cst) -> (tensor<64x64xf32>) : i32 { 2026-02-21T08:30:12.7884397Z %40 = tt.descriptor_load %0[%22, %arg6] : !tt.tensordesc> -> tensor<64x64xf32> 2026-02-21T08:30:12.7884751Z %41 = tt.descriptor_load %1[%22, %arg6] : !tt.tensordesc> -> tensor<64x64xf32> 2026-02-21T08:30:12.7885026Z %42 = scf.if %arg3 -> (tensor<64x64xf32>) { 2026-02-21T08:30:12.7885381Z %44 = tt.extern_elementwise %41 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x64xf32>) -> tensor<64x64xf32> 2026-02-21T08:30:12.7885737Z %45 = arith.subf %41, %40 : tensor<64x64xf32> 2026-02-21T08:30:12.7885943Z %46 = arith.mulf %44, %45 : tensor<64x64xf32> 2026-02-21T08:30:12.7886143Z %47 = arith.addf %46, %cst : tensor<64x64xf32> 2026-02-21T08:30:12.7886339Z scf.yield %47 : tensor<64x64xf32> 2026-02-21T08:30:12.7886507Z } else { 2026-02-21T08:30:12.7886660Z %44 = tt.splat %arg4 : f32 -> tensor<64x64xf32> 2026-02-21T08:30:12.7886877Z %45 = arith.cmpf ogt, %41, %44 : tensor<64x64xf32> 2026-02-21T08:30:12.7887089Z %46 = arith.cmpf une, %41, %41 : tensor<64x64xf32> 2026-02-21T08:30:12.7887300Z %47 = arith.ori %45, %46 : tensor<64x64xi1> 2026-02-21T08:30:12.7887528Z %48 = arith.select %47, %41, %44 : tensor<64x64xi1>, tensor<64x64xf32> 2026-02-21T08:30:12.7887769Z %49 = math.log %48 : tensor<64x64xf32> 2026-02-21T08:30:12.7887962Z %50 = arith.subf %49, %40 : tensor<64x64xf32> 2026-02-21T08:30:12.7888161Z %51 = arith.mulf %41, %50 : tensor<64x64xf32> 2026-02-21T08:30:12.7888449Z %52 = arith.addf %51, %cst : tensor<64x64xf32> 2026-02-21T08:30:12.7888645Z scf.yield %52 : tensor<64x64xf32> 2026-02-21T08:30:12.7888820Z } 2026-02-21T08:30:12.7888966Z %43 = arith.addf %arg7, %42 : tensor<64x64xf32> 2026-02-21T08:30:12.7889169Z scf.yield %43 : tensor<64x64xf32> 2026-02-21T08:30:12.7889424Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:30:12.7889705Z %27 = "tt.reduce"(%26) <{axis = 
1 : i32}> ({ 2026-02-21T08:30:12.7889906Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:12.7890083Z %40 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:30:12.7890273Z tt.reduce.return %40 : f32 2026-02-21T08:30:12.7890453Z }) : (tensor<64x64xf32>) -> tensor<64xf32> 2026-02-21T08:30:12.7890685Z %28 = tt.splat %arg2 : !tt.ptr -> tensor<64x!tt.ptr> 2026-02-21T08:30:12.7890993Z %29 = tt.addptr %28, %25 : tensor<64x!tt.ptr>, tensor<64xi32> 2026-02-21T08:30:12.7891235Z tt.store %29, %27 : tensor<64x!tt.ptr> 2026-02-21T08:30:12.7891432Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:30:12.7891608Z %30 = arith.muli %c9472_i32, %c2_i32 : i32 2026-02-21T08:30:12.7891792Z %31 = arith.addi %arg5, %30 : i32 2026-02-21T08:30:12.7892001Z %32 = arith.muli %31, %c64_i32 : i32 2026-02-21T08:30:12.7892226Z %33 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:30:12.7892458Z %34 = tt.splat %32 : i32 -> tensor<64xi32> 2026-02-21T08:30:12.7892649Z %35 = arith.addi %34, %33 : tensor<64xi32> 2026-02-21T08:30:12.7892946Z %36 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c64_i32 iter_args(%arg7 = %cst) -> (tensor<64x64xf32>) : i32 { 2026-02-21T08:30:12.7893341Z %40 = tt.descriptor_load %0[%32, %arg6] : !tt.tensordesc> -> tensor<64x64xf32> 2026-02-21T08:30:12.7893705Z %41 = tt.descriptor_load %1[%32, %arg6] : !tt.tensordesc> -> tensor<64x64xf32> 2026-02-21T08:30:12.7893988Z %42 = scf.if %arg3 -> (tensor<64x64xf32>) { 2026-02-21T08:30:12.7894348Z %44 = tt.extern_elementwise %41 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x64xf32>) -> tensor<64x64xf32> 2026-02-21T08:30:12.7894704Z %45 = arith.subf %41, %40 : tensor<64x64xf32> 2026-02-21T08:30:12.7894911Z %46 = arith.mulf %44, %45 : tensor<64x64xf32> 2026-02-21T08:30:12.7895117Z %47 = arith.addf %46, %cst : tensor<64x64xf32> 2026-02-21T08:30:12.7895307Z scf.yield %47 : tensor<64x64xf32> 2026-02-21T08:30:12.7895477Z } else { 2026-02-21T08:30:12.7895627Z %44 = tt.splat %arg4 : f32 -> 
tensor<64x64xf32> 2026-02-21T08:30:12.7895843Z %45 = arith.cmpf ogt, %41, %44 : tensor<64x64xf32> 2026-02-21T08:30:12.7896054Z %46 = arith.cmpf une, %41, %41 : tensor<64x64xf32> 2026-02-21T08:30:12.7896265Z %47 = arith.ori %45, %46 : tensor<64x64xi1> 2026-02-21T08:30:12.7896502Z %48 = arith.select %47, %41, %44 : tensor<64x64xi1>, tensor<64x64xf32> 2026-02-21T08:30:12.7896734Z %49 = math.log %48 : tensor<64x64xf32> 2026-02-21T08:30:12.7896953Z %50 = arith.subf %49, %40 : tensor<64x64xf32> 2026-02-21T08:30:12.7897156Z %51 = arith.mulf %41, %50 : tensor<64x64xf32> 2026-02-21T08:30:12.7897353Z %52 = arith.addf %51, %cst : tensor<64x64xf32> 2026-02-21T08:30:12.7897548Z scf.yield %52 : tensor<64x64xf32> 2026-02-21T08:30:12.7897710Z } 2026-02-21T08:30:12.7897855Z %43 = arith.addf %arg7, %42 : tensor<64x64xf32> 2026-02-21T08:30:12.7898039Z scf.yield %43 : tensor<64x64xf32> 2026-02-21T08:30:12.7898288Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:30:12.7898552Z %37 = "tt.reduce"(%36) <{axis = 1 : i32}> ({ 2026-02-21T08:30:12.7898733Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:12.7898970Z %40 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:30:12.7899145Z tt.reduce.return %40 : f32 2026-02-21T08:30:12.7899325Z }) : (tensor<64x64xf32>) -> tensor<64xf32> 2026-02-21T08:30:12.7899549Z %38 = tt.splat %arg2 : !tt.ptr -> tensor<64x!tt.ptr> 2026-02-21T08:30:12.7899824Z %39 = tt.addptr %38, %35 : tensor<64x!tt.ptr>, tensor<64xi32> 2026-02-21T08:30:12.7900071Z tt.store %39, %37 : tensor<64x!tt.ptr> 2026-02-21T08:30:12.7900265Z } {tt.num_stages = 1 : i32} 2026-02-21T08:30:12.7900473Z scf.for %arg5 = %10 to %c64_i32 step %c9472_i32 : i32 { 2026-02-21T08:30:12.7900690Z %12 = arith.muli %arg5, %c64_i32 : i32 2026-02-21T08:30:12.7900926Z %13 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:30:12.7901165Z %14 = tt.splat %12 : i32 -> tensor<64xi32> 2026-02-21T08:30:12.7901366Z %15 = arith.addi %14, %13 : 
tensor<64xi32> 2026-02-21T08:30:12.7901737Z %16 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c64_i32 iter_args(%arg7 = %cst) -> (tensor<64x64xf32>) : i32 { 2026-02-21T08:30:12.7902182Z %20 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc> -> tensor<64x64xf32> 2026-02-21T08:30:12.7902563Z %21 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc> -> tensor<64x64xf32> 2026-02-21T08:30:12.7902858Z %22 = scf.if %arg3 -> (tensor<64x64xf32>) { 2026-02-21T08:30:12.7903235Z %24 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x64xf32>) -> tensor<64x64xf32> 2026-02-21T08:30:12.7903609Z %25 = arith.subf %21, %20 : tensor<64x64xf32> 2026-02-21T08:30:12.7903821Z %26 = arith.mulf %24, %25 : tensor<64x64xf32> 2026-02-21T08:30:12.7904036Z %27 = arith.addf %26, %cst : tensor<64x64xf32> 2026-02-21T08:30:12.7904235Z scf.yield %27 : tensor<64x64xf32> 2026-02-21T08:30:12.7904414Z } else { 2026-02-21T08:30:12.7904578Z %24 = tt.splat %arg4 : f32 -> tensor<64x64xf32> 2026-02-21T08:30:12.7904811Z %25 = arith.cmpf ogt, %21, %24 : tensor<64x64xf32> 2026-02-21T08:30:12.7905033Z %26 = arith.cmpf une, %21, %21 : tensor<64x64xf32> 2026-02-21T08:30:12.7905253Z %27 = arith.ori %25, %26 : tensor<64x64xi1> 2026-02-21T08:30:12.7905501Z %28 = arith.select %27, %21, %24 : tensor<64x64xi1>, tensor<64x64xf32> 2026-02-21T08:30:12.7905745Z %29 = math.log %28 : tensor<64x64xf32> 2026-02-21T08:30:12.7905951Z %30 = arith.subf %29, %20 : tensor<64x64xf32> 2026-02-21T08:30:12.7906153Z %31 = arith.mulf %21, %30 : tensor<64x64xf32> 2026-02-21T08:30:12.7906367Z %32 = arith.addf %31, %cst : tensor<64x64xf32> 2026-02-21T08:30:12.7906562Z scf.yield %32 : tensor<64x64xf32> 2026-02-21T08:30:12.7906740Z } 2026-02-21T08:30:12.7906886Z %23 = arith.addf %arg7, %22 : tensor<64x64xf32> 2026-02-21T08:30:12.7907099Z scf.yield %23 : tensor<64x64xf32> 2026-02-21T08:30:12.7907353Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32, tt.warp_specialize} 
2026-02-21T08:30:12.7907611Z %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({ 2026-02-21T08:30:12.7907802Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:12.7907971Z %20 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:30:12.7908157Z tt.reduce.return %20 : f32 2026-02-21T08:30:12.7908335Z }) : (tensor<64x64xf32>) -> tensor<64xf32> 2026-02-21T08:30:12.7908562Z %18 = tt.splat %arg2 : !tt.ptr -> tensor<64x!tt.ptr> 2026-02-21T08:30:12.7908819Z %19 = tt.addptr %18, %15 : tensor<64x!tt.ptr>, tensor<64xi32> 2026-02-21T08:30:12.7909047Z tt.store %19, %17 : tensor<64x!tt.ptr> 2026-02-21T08:30:12.7909236Z } {tt.num_stages = 1 : i32} 2026-02-21T08:30:12.7909386Z tt.return 2026-02-21T08:30:12.7909512Z } 2026-02-21T08:30:12.7909625Z } 2026-02-21T08:30:12.7909698Z 2026-02-21T08:30:12.7909747Z {-# 2026-02-21T08:30:12.7909870Z external_resources: { 2026-02-21T08:30:12.7910086Z mlir_reproducer: { 2026-02-21T08:30:12.7914407Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=2}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=2}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=2}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ 
max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:30:12.7918788Z disable_threading: false, 2026-02-21T08:30:12.7918956Z verify_each: true 2026-02-21T08:30:12.7919095Z } 2026-02-21T08:30:12.7919221Z } 2026-02-21T08:30:12.7919333Z #-} 2026-02-21T08:30:12.7919740Z /tmp/torchinductor_root/4e/c4e2f3hrdolelqb36u3a232yzjx6227te325rbipesnksnfjniyl.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:30:12.7920924Z /tmp/torchinductor_root/4e/c4e2f3hrdolelqb36u3a232yzjx6227te325rbipesnksnfjniyl.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:30:12.7921926Z [42s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:30:12.7923019Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, False], range_num_stages=[4, 3], range_unroll_factors=[3, 0], range_warp_specializes=[False, True]), static_shapes=True) 2026-02-21T08:30:12.7924004Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:30:12.7924252Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:30:13.5490679Z module { 2026-02-21T08:30:13.5491613Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:30:13.5492631Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:30:13.5492927Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:30:13.5493296Z %cst = arith.constant dense<0.000000e+00> : tensor<128x32xf32> 2026-02-21T08:30:13.5493665Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:30:13.5493999Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:30:13.5494636Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T08:30:13.5494943Z %c32768_i64 = arith.constant 32768 : i64 2026-02-21T08:30:13.5495233Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:30:13.5495745Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:30:13.5496493Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:30:13.5496998Z %2 = tt.get_program_id x : i32 2026-02-21T08:30:13.5497273Z %3 = arith.muli %2, %c128_i32 : i32 2026-02-21T08:30:13.5497628Z %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:30:13.5498027Z %5 = 
tt.splat %3 : i32 -> tensor<128xi32> 2026-02-21T08:30:13.5498334Z %6 = arith.addi %5, %4 : tensor<128xi32> 2026-02-21T08:30:13.5498980Z %7 = scf.for %arg5 = %c0_i32 to %c32768_i32 step %c32_i32 iter_args(%arg6 = %cst) -> (tensor<128x32xf32>) : i32 { 2026-02-21T08:30:13.5499668Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc> -> tensor<128x32xf32> 2026-02-21T08:30:13.5500297Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc> -> tensor<128x32xf32> 2026-02-21T08:30:13.5500768Z %13 = scf.if %arg3 -> (tensor<128x32xf32>) { 2026-02-21T08:30:13.5501386Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x32xf32>) -> tensor<128x32xf32> 2026-02-21T08:30:13.5502045Z %16 = arith.subf %12, %11 : tensor<128x32xf32> 2026-02-21T08:30:13.5502379Z %17 = arith.mulf %15, %16 : tensor<128x32xf32> 2026-02-21T08:30:13.5502713Z %18 = arith.addf %17, %cst : tensor<128x32xf32> 2026-02-21T08:30:13.5503036Z scf.yield %18 : tensor<128x32xf32> 2026-02-21T08:30:13.5503311Z } else { 2026-02-21T08:30:13.5503560Z %15 = tt.splat %arg4 : f32 -> tensor<128x32xf32> 2026-02-21T08:30:13.5503930Z %16 = arith.cmpf ogt, %12, %15 : tensor<128x32xf32> 2026-02-21T08:30:13.5504287Z %17 = arith.cmpf une, %12, %12 : tensor<128x32xf32> 2026-02-21T08:30:13.5504634Z %18 = arith.ori %16, %17 : tensor<128x32xi1> 2026-02-21T08:30:13.5505025Z %19 = arith.select %18, %12, %15 : tensor<128x32xi1>, tensor<128x32xf32> 2026-02-21T08:30:13.5505427Z %20 = math.log %19 : tensor<128x32xf32> 2026-02-21T08:30:13.5505751Z %21 = arith.subf %20, %11 : tensor<128x32xf32> 2026-02-21T08:30:13.5506076Z %22 = arith.mulf %12, %21 : tensor<128x32xf32> 2026-02-21T08:30:13.5506412Z %23 = arith.addf %22, %cst : tensor<128x32xf32> 2026-02-21T08:30:13.5506721Z scf.yield %23 : tensor<128x32xf32> 2026-02-21T08:30:13.5506995Z } 2026-02-21T08:30:13.5507211Z %14 = arith.addf %arg6, %13 : tensor<128x32xf32> 2026-02-21T08:30:13.5507526Z scf.yield %14 : tensor<128x32xf32> 
2026-02-21T08:30:13.5507939Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:30:13.5508376Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:30:13.5508673Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:30:13.5508943Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:30:13.5509237Z tt.reduce.return %11 : f32 2026-02-21T08:30:13.5509519Z }) : (tensor<128x32xf32>) -> tensor<128xf32> 2026-02-21T08:30:13.5509890Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<128x!tt.ptr> 2026-02-21T08:30:13.5510312Z %10 = tt.addptr %9, %6 : tensor<128x!tt.ptr>, tensor<128xi32> 2026-02-21T08:30:13.5510695Z tt.store %10, %8 : tensor<128x!tt.ptr> 2026-02-21T08:30:13.5510982Z tt.return 2026-02-21T08:30:13.5511165Z } 2026-02-21T08:30:13.5511340Z } 2026-02-21T08:30:13.5511443Z 2026-02-21T08:30:13.5511514Z {-# 2026-02-21T08:30:13.5511709Z external_resources: { 2026-02-21T08:30:13.5511983Z mlir_reproducer: { 2026-02-21T08:30:13.5519659Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, 
triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:30:13.5527550Z disable_threading: false, 2026-02-21T08:30:13.5527812Z verify_each: true 2026-02-21T08:30:13.5528033Z } 2026-02-21T08:30:13.5528200Z } 2026-02-21T08:30:13.5528368Z #-} 2026-02-21T08:30:13.5529073Z /tmp/torchinductor_root/3y/c3ylosf5o3jtxcrwzpobw5iszyv4zvcreo5umlfav3tb2yj6e6zh.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:30:13.5531186Z /tmp/torchinductor_root/3y/c3ylosf5o3jtxcrwzpobw5iszyv4zvcreo5umlfav3tb2yj6e6zh.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:30:13.5532966Z [42s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:30:13.5534659Z Config: @helion.kernel(config=helion.Config(block_sizes=[32, 128], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], num_stages=4, num_warps=32, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:30:13.5536172Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:30:13.5536593Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:30:15.7120605Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:30:15.7121319Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:30:15.7122000Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:30:15.7122190Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:30:15.7122373Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:30:15.7122585Z %cst = arith.constant dense<0.000000e+00> : tensor<64x128xf32> 2026-02-21T08:30:15.7122819Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:30:15.7122997Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:30:15.7123225Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T08:30:15.7123727Z %c32768_i64 = arith.constant 32768 : i64 2026-02-21T08:30:15.7123903Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:30:15.7124218Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:30:15.7124649Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:30:15.7124966Z %2 = tt.get_program_id x : i32 2026-02-21T08:30:15.7125139Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:30:15.7125321Z %4 = arith.minsi %3, %c64_i32 : i32 
2026-02-21T08:30:15.7125519Z scf.for %arg5 = %2 to %4 step %c1_i32 : i32 { 2026-02-21T08:30:15.7125711Z %5 = arith.muli %arg5, %c64_i32 : i32 2026-02-21T08:30:15.7125940Z %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:30:15.7126183Z %7 = tt.splat %5 : i32 -> tensor<64xi32> 2026-02-21T08:30:15.7126469Z %8 = arith.addi %7, %6 : tensor<64xi32> 2026-02-21T08:30:15.7126658Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:30:15.7126970Z %9 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c256_i32 iter_args(%arg7 = %cst) -> (tensor<64x128xf32>) : i32 { 2026-02-21T08:30:15.7127376Z %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc> -> tensor<64x128xf32> 2026-02-21T08:30:15.7127743Z %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc> -> tensor<64x128xf32> 2026-02-21T08:30:15.7128040Z %15 = scf.if %arg3 -> (tensor<64x128xf32>) { 2026-02-21T08:30:15.7128409Z %23 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32> 2026-02-21T08:30:15.7128793Z %24 = arith.subf %14, %13 : tensor<64x128xf32> 2026-02-21T08:30:15.7129018Z %25 = arith.mulf %23, %24 : tensor<64x128xf32> 2026-02-21T08:30:15.7129229Z %26 = arith.addf %25, %cst : tensor<64x128xf32> 2026-02-21T08:30:15.7129431Z scf.yield %26 : tensor<64x128xf32> 2026-02-21T08:30:15.7129596Z } else { 2026-02-21T08:30:15.7129760Z %23 = tt.splat %arg4 : f32 -> tensor<64x128xf32> 2026-02-21T08:30:15.7129973Z %24 = arith.cmpf ogt, %14, %23 : tensor<64x128xf32> 2026-02-21T08:30:15.7130194Z %25 = arith.cmpf une, %14, %14 : tensor<64x128xf32> 2026-02-21T08:30:15.7130400Z %26 = arith.ori %24, %25 : tensor<64x128xi1> 2026-02-21T08:30:15.7130643Z %27 = arith.select %26, %14, %23 : tensor<64x128xi1>, tensor<64x128xf32> 2026-02-21T08:30:15.7130884Z %28 = math.log %27 : tensor<64x128xf32> 2026-02-21T08:30:15.7131076Z %29 = arith.subf %28, %13 : tensor<64x128xf32> 2026-02-21T08:30:15.7131280Z %30 = arith.mulf %14, %29 : 
tensor<64x128xf32> 2026-02-21T08:30:15.7131479Z %31 = arith.addf %30, %cst : tensor<64x128xf32> 2026-02-21T08:30:15.7131679Z scf.yield %31 : tensor<64x128xf32> 2026-02-21T08:30:15.7131842Z } 2026-02-21T08:30:15.7132108Z %16 = arith.addf %arg7, %15 : tensor<64x128xf32> 2026-02-21T08:30:15.7132305Z %c1_i32_0 = arith.constant 1 : i32 2026-02-21T08:30:15.7132523Z %17 = arith.muli %c128_i32, %c1_i32_0 : i32 2026-02-21T08:30:15.7132709Z %18 = arith.addi %arg6, %17 : i32 2026-02-21T08:30:15.7132991Z %19 = tt.descriptor_load %0[%5, %18] : !tt.tensordesc> -> tensor<64x128xf32> 2026-02-21T08:30:15.7133353Z %20 = tt.descriptor_load %1[%5, %18] : !tt.tensordesc> -> tensor<64x128xf32> 2026-02-21T08:30:15.7133636Z %21 = scf.if %arg3 -> (tensor<64x128xf32>) { 2026-02-21T08:30:15.7134006Z %23 = tt.extern_elementwise %20 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x128xf32>) -> tensor<64x128xf32> 2026-02-21T08:30:15.7134362Z %24 = arith.subf %20, %19 : tensor<64x128xf32> 2026-02-21T08:30:15.7134569Z %25 = arith.mulf %23, %24 : tensor<64x128xf32> 2026-02-21T08:30:15.7134858Z %26 = arith.addf %25, %cst : tensor<64x128xf32> 2026-02-21T08:30:15.7135052Z scf.yield %26 : tensor<64x128xf32> 2026-02-21T08:30:15.7135231Z } else { 2026-02-21T08:30:15.7135395Z %23 = tt.splat %arg4 : f32 -> tensor<64x128xf32> 2026-02-21T08:30:15.7135619Z %24 = arith.cmpf ogt, %20, %23 : tensor<64x128xf32> 2026-02-21T08:30:15.7135840Z %25 = arith.cmpf une, %20, %20 : tensor<64x128xf32> 2026-02-21T08:30:15.7136057Z %26 = arith.ori %24, %25 : tensor<64x128xi1> 2026-02-21T08:30:15.7136295Z %27 = arith.select %26, %20, %23 : tensor<64x128xi1>, tensor<64x128xf32> 2026-02-21T08:30:15.7136544Z %28 = math.log %27 : tensor<64x128xf32> 2026-02-21T08:30:15.7136747Z %29 = arith.subf %28, %19 : tensor<64x128xf32> 2026-02-21T08:30:15.7136947Z %30 = arith.mulf %20, %29 : tensor<64x128xf32> 2026-02-21T08:30:15.7137214Z %31 = arith.addf %30, %cst : tensor<64x128xf32> 2026-02-21T08:30:15.7137410Z 
scf.yield %31 : tensor<64x128xf32> 2026-02-21T08:30:15.7137577Z } 2026-02-21T08:30:15.7137714Z %22 = arith.addf %16, %21 : tensor<64x128xf32> 2026-02-21T08:30:15.7137906Z scf.yield %22 : tensor<64x128xf32> 2026-02-21T08:30:15.7138092Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:30:15.7138278Z %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({ 2026-02-21T08:30:15.7138462Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:15.7138632Z %13 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:30:15.7138813Z tt.reduce.return %13 : f32 2026-02-21T08:30:15.7138987Z }) : (tensor<64x128xf32>) -> tensor<64xf32> 2026-02-21T08:30:15.7139216Z %11 = tt.splat %arg2 : !tt.ptr -> tensor<64x!tt.ptr> 2026-02-21T08:30:15.7139470Z %12 = tt.addptr %11, %8 : tensor<64x!tt.ptr>, tensor<64xi32> 2026-02-21T08:30:15.7139703Z tt.store %12, %10 : tensor<64x!tt.ptr> 2026-02-21T08:30:15.7140051Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:30:15.7140379Z tt.return 2026-02-21T08:30:15.7140507Z } 2026-02-21T08:30:15.7140621Z } 2026-02-21T08:30:15.7140695Z 2026-02-21T08:30:15.7140743Z {-# 2026-02-21T08:30:15.7140863Z external_resources: { 2026-02-21T08:30:15.7141019Z mlir_reproducer: { 2026-02-21T08:30:15.7145338Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, 
tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:30:15.7149891Z disable_threading: false, 2026-02-21T08:30:15.7150068Z verify_each: true 2026-02-21T08:30:15.7150211Z } 2026-02-21T08:30:15.7150335Z } 2026-02-21T08:30:15.7150447Z #-} 2026-02-21T08:30:15.7150875Z /tmp/torchinductor_root/4f/c4fdvapk55kjtm4nftfpzewslzr3q745k423xfsdvr6boffxkbgg.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:30:15.7152135Z /tmp/torchinductor_root/4f/c4fdvapk55kjtm4nftfpzewslzr3q745k423xfsdvr6boffxkbgg.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:30:15.7153205Z [44s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:30:15.7154368Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', ''], maxnreg=32, num_sm_multiplier=64, num_stages=8, num_warps=4, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[1, 1], range_unroll_factors=[1, 2], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:30:15.7155409Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:30:15.7155694Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:30:17.5356182Z module { 2026-02-21T08:30:17.5360904Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:30:17.5364501Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:30:17.5368940Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:30:17.5372449Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:30:17.5372787Z %cst = arith.constant dense<0.000000e+00> : tensor<1024x4xf32> 2026-02-21T08:30:17.5373049Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:30:17.5373258Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:30:17.5373445Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T08:30:17.5373635Z %c32768_i64 = arith.constant 32768 : i64 2026-02-21T08:30:17.5373833Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:30:17.5378140Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:30:17.5382714Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:30:17.5384388Z %2 = tt.get_program_id x : i32 2026-02-21T08:30:17.5384625Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:30:17.5384820Z %4 = arith.minsi %3, %c4_i32 : i32 
2026-02-21T08:30:17.5385017Z scf.for %arg5 = %2 to %4 step %c1_i32 : i32 { 2026-02-21T08:30:17.5385228Z %5 = arith.muli %arg5, %c1024_i32 : i32 2026-02-21T08:30:17.5385465Z %6 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T08:30:17.5385712Z %7 = tt.splat %5 : i32 -> tensor<1024xi32> 2026-02-21T08:30:17.5385908Z %8 = arith.addi %7, %6 : tensor<1024xi32> 2026-02-21T08:30:17.5386210Z %9 = scf.for %arg6 = %c0_i32 to %c32768_i32 step %c4_i32 iter_args(%arg7 = %cst) -> (tensor<1024x4xf32>) : i32 { 2026-02-21T08:30:17.5386610Z %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc> -> tensor<1024x4xf32> 2026-02-21T08:30:17.5386966Z %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc> -> tensor<1024x4xf32> 2026-02-21T08:30:17.5387262Z %15 = scf.if %arg3 -> (tensor<1024x4xf32>) { 2026-02-21T08:30:17.5387973Z %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1024x4xf32>) -> tensor<1024x4xf32> 2026-02-21T08:30:17.5388336Z %18 = arith.subf %14, %13 : tensor<1024x4xf32> 2026-02-21T08:30:17.5388548Z %19 = arith.mulf %17, %18 : tensor<1024x4xf32> 2026-02-21T08:30:17.5388751Z %20 = arith.addf %19, %cst : tensor<1024x4xf32> 2026-02-21T08:30:17.5388956Z scf.yield %20 : tensor<1024x4xf32> 2026-02-21T08:30:17.5389128Z } else { 2026-02-21T08:30:17.5389286Z %17 = tt.splat %arg4 : f32 -> tensor<1024x4xf32> 2026-02-21T08:30:17.5389511Z %18 = arith.cmpf ogt, %14, %17 : tensor<1024x4xf32> 2026-02-21T08:30:17.5389722Z %19 = arith.cmpf une, %14, %14 : tensor<1024x4xf32> 2026-02-21T08:30:17.5389930Z %20 = arith.ori %18, %19 : tensor<1024x4xi1> 2026-02-21T08:30:17.5390257Z %21 = arith.select %20, %14, %17 : tensor<1024x4xi1>, tensor<1024x4xf32> 2026-02-21T08:30:17.5390510Z %22 = math.log %21 : tensor<1024x4xf32> 2026-02-21T08:30:17.5390702Z %23 = arith.subf %22, %13 : tensor<1024x4xf32> 2026-02-21T08:30:17.5390906Z %24 = arith.mulf %14, %23 : tensor<1024x4xf32> 2026-02-21T08:30:17.5391117Z %25 = 
arith.addf %24, %cst : tensor<1024x4xf32> 2026-02-21T08:30:17.5391311Z scf.yield %25 : tensor<1024x4xf32> 2026-02-21T08:30:17.5391507Z } 2026-02-21T08:30:17.5391656Z %16 = arith.addf %arg7, %15 : tensor<1024x4xf32> 2026-02-21T08:30:17.5391954Z scf.yield %16 : tensor<1024x4xf32> 2026-02-21T08:30:17.5392149Z } {tt.num_stages = 1 : i32} 2026-02-21T08:30:17.5392331Z %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({ 2026-02-21T08:30:17.5392528Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:30:17.5392706Z %13 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:30:17.5392901Z tt.reduce.return %13 : f32 2026-02-21T08:30:17.5393089Z }) : (tensor<1024x4xf32>) -> tensor<1024xf32> 2026-02-21T08:30:17.5393332Z %11 = tt.splat %arg2 : !tt.ptr -> tensor<1024x!tt.ptr> 2026-02-21T08:30:17.5393592Z %12 = tt.addptr %11, %8 : tensor<1024x!tt.ptr>, tensor<1024xi32> 2026-02-21T08:30:17.5393838Z tt.store %12, %10 : tensor<1024x!tt.ptr> 2026-02-21T08:30:17.5394084Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.warp_specialize} 2026-02-21T08:30:17.5394301Z tt.return 2026-02-21T08:30:17.5394434Z } 2026-02-21T08:30:17.5394547Z } 2026-02-21T08:30:17.5394620Z 2026-02-21T08:30:17.5394668Z {-# 2026-02-21T08:30:17.5394791Z external_resources: { 2026-02-21T08:30:17.5394946Z mlir_reproducer: { 2026-02-21T08:30:17.5399171Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, 
tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=5}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=5}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=5}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:30:17.5403750Z disable_threading: false, 2026-02-21T08:30:17.5403930Z verify_each: true 2026-02-21T08:30:17.5404075Z } 2026-02-21T08:30:17.5404204Z } 2026-02-21T08:30:17.5404310Z #-} 2026-02-21T08:30:17.5404723Z /tmp/torchinductor_root/4m/c4moxtqczhzhvxwf6zuzg4uslqz5jjoud3lscas2hcnwmlxlqjkw.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:30:17.5405986Z /tmp/torchinductor_root/4m/c4moxtqczhzhvxwf6zuzg4uslqz5jjoud3lscas2hcnwmlxlqjkw.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:30:17.5406943Z [46s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:30:17.5407982Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 1024], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=64, num_stages=5, num_warps=32, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[False, True], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:30:17.5408908Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:30:17.5409151Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:30:17.7902974Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 8.3 configs/s 2026-02-21T08:30:17.7911312Z [47s] Adaptive compile timeout: 30s (90% percentile=7.7s, bounds=[30.0s, 30s]) 2026-02-21T08:30:20.7499619Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 916/916 305.9 configs/s 2026-02-21T08:30:20.8381254Z [50s] Initial random population of 100, 5 starting points: 2026-02-21T08:30:20.8381549Z error=17 2026-02-21T08:30:20.8381691Z timeout=6 2026-02-21T08:30:20.8381833Z ok=77 2026-02-21T08:30:20.8382246Z min=0.2264 2026-02-21T08:30:20.8382390Z mid=1.7909 2026-02-21T08:30:20.8382517Z max=220.4436 2026-02-21T08:30:20.8382673Z best={'block_sizes': [512, 1], 2026-02-21T08:30:20.8382924Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:30:20.8383179Z 'load_eviction_policies': ['', 'first'], 2026-02-21T08:30:20.8383403Z 'maxnreg': 32, 2026-02-21T08:30:20.8383551Z 'num_sm_multiplier': 64, 2026-02-21T08:30:20.8383714Z 'num_stages': 6, 2026-02-21T08:30:20.8383851Z 'num_warps': 2, 2026-02-21T08:30:20.8384038Z 'pid_type': 'persistent_interleaved', 2026-02-21T08:30:20.8384244Z 'range_flattens': [False, False], 2026-02-21T08:30:20.8384434Z 'range_multi_buffers': [False, True], 2026-02-21T08:30:20.8384620Z 'range_num_stages': [0, 0], 2026-02-21T08:30:20.8384795Z 
'range_unroll_factors': [3, 0], 2026-02-21T08:30:20.8384972Z 'range_warp_specializes': [None, True]} 2026-02-21T08:30:20.8403131Z [50s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:30:21.9389721Z [51s] Generation 1 starting: 83 neighbors, 5 active search path(s) 2026-02-21T08:30:28.7519312Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 7.3 configs/s 2026-02-21T08:30:33.8826935Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 87/87 17.1 configs/s 2026-02-21T08:30:52.0907472Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━ 992/992 55.5 configs/s 2026-02-21T08:30:52.4249142Z [81s] Generation 1 complete: 2026-02-21T08:30:52.4253201Z ok=88 2026-02-21T08:30:52.4256598Z min=0.2038 2026-02-21T08:30:52.4260513Z mid=0.2652 2026-02-21T08:30:52.4264376Z max=1.6531 2026-02-21T08:30:52.4268287Z best={'block_sizes': [1024, 1], 2026-02-21T08:30:52.4271813Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:30:52.4277547Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:30:52.4277890Z 'num_stages': 6, 2026-02-21T08:30:52.4278120Z 'num_warps': 1, 2026-02-21T08:30:52.4278370Z 'pid_type': 'flat', 2026-02-21T08:30:52.4278632Z 'range_flattens': [None, False], 2026-02-21T08:30:52.4278921Z 'range_multi_buffers': [None, None], 2026-02-21T08:30:52.4279207Z 'range_num_stages': [0, 4], 2026-02-21T08:30:52.4279464Z 'range_unroll_factors': [0, 0], 2026-02-21T08:30:52.4279760Z 'range_warp_specializes': [None, True]} 2026-02-21T08:30:52.4280104Z [81s] Fitting surrogate: 188 points, 188 targets 2026-02-21T08:30:53.3653720Z [82s] Generation 2 starting: 66 neighbors, 5 active search path(s) 2026-02-21T08:30:59.2351035Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 3.5 configs/s 2026-02-21T08:31:02.2089531Z module { 2026-02-21T08:31:02.2090160Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr 
{tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:31:02.2090703Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:31:02.2090899Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:31:02.2091120Z %cst = arith.constant dense<0.000000e+00> : tensor<4x1024xf32> 2026-02-21T08:31:02.2091358Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:31:02.2091538Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:31:02.2091721Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T08:31:02.2092117Z %c32768_i64 = arith.constant 32768 : i64 2026-02-21T08:31:02.2092296Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:31:02.2092649Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:31:02.2093091Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c32768_i32], [%c32768_i64, %c1_i64] : , > 2026-02-21T08:31:02.2093402Z %2 = tt.get_program_id x : i32 2026-02-21T08:31:02.2093580Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T08:31:02.2093794Z %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:31:02.2094033Z %5 = tt.splat %3 : i32 -> tensor<4xi32> 2026-02-21T08:31:02.2094212Z %6 = arith.addi %5, %4 : tensor<4xi32> 2026-02-21T08:31:02.2094522Z %7 = scf.for %arg5 = %c0_i32 to %c32768_i32 step %c1024_i32 iter_args(%arg6 = %cst) -> (tensor<4x1024xf32>) : i32 { 2026-02-21T08:31:02.2094918Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc> -> tensor<4x1024xf32> 2026-02-21T08:31:02.2095282Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc> -> tensor<4x1024xf32> 2026-02-21T08:31:02.2095577Z %13 = scf.if %arg3 -> (tensor<4x1024xf32>) { 2026-02-21T08:31:02.2095937Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32> 2026-02-21T08:31:02.2096305Z %16 = arith.subf %12, %11 : tensor<4x1024xf32> 2026-02-21T08:31:02.2096503Z %17 = arith.mulf %15, %16 : tensor<4x1024xf32> 
2026-02-21T08:31:02.2096709Z %18 = arith.addf %17, %cst : tensor<4x1024xf32> 2026-02-21T08:31:02.2096907Z scf.yield %18 : tensor<4x1024xf32> 2026-02-21T08:31:02.2097071Z } else { 2026-02-21T08:31:02.2097232Z %15 = tt.splat %arg4 : f32 -> tensor<4x1024xf32> 2026-02-21T08:31:02.2097446Z %16 = arith.cmpf ogt, %12, %15 : tensor<4x1024xf32> 2026-02-21T08:31:02.2097669Z %17 = arith.cmpf une, %12, %12 : tensor<4x1024xf32> 2026-02-21T08:31:02.2097874Z %18 = arith.ori %16, %17 : tensor<4x1024xi1> 2026-02-21T08:31:02.2098124Z %19 = arith.select %18, %12, %15 : tensor<4x1024xi1>, tensor<4x1024xf32> 2026-02-21T08:31:02.2098677Z %20 = math.log %19 : tensor<4x1024xf32> 2026-02-21T08:31:02.2098868Z %21 = arith.subf %20, %11 : tensor<4x1024xf32> 2026-02-21T08:31:02.2099073Z %22 = arith.mulf %12, %21 : tensor<4x1024xf32> 2026-02-21T08:31:02.2099275Z %23 = arith.addf %22, %cst : tensor<4x1024xf32> 2026-02-21T08:31:02.2099473Z scf.yield %23 : tensor<4x1024xf32> 2026-02-21T08:31:02.2099639Z } 2026-02-21T08:31:02.2099786Z %14 = arith.addf %arg6, %13 : tensor<4x1024xf32> 2026-02-21T08:31:02.2099974Z scf.yield %14 : tensor<4x1024xf32> 2026-02-21T08:31:02.2100178Z } {tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:31:02.2100393Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:31:02.2100574Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:31:02.2100754Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:31:02.2100933Z tt.reduce.return %11 : f32 2026-02-21T08:31:02.2101204Z }) : (tensor<4x1024xf32>) -> tensor<4xf32> 2026-02-21T08:31:02.2101428Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:31:02.2101678Z %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:31:02.2101934Z tt.store %10, %8 : tensor<4x!tt.ptr> 2026-02-21T08:31:02.2102112Z tt.return 2026-02-21T08:31:02.2102238Z } 2026-02-21T08:31:02.2102354Z } 2026-02-21T08:31:02.2102420Z 2026-02-21T08:31:02.2102479Z {-# 2026-02-21T08:31:02.2102602Z external_resources: { 
2026-02-21T08:31:02.2102758Z mlir_reproducer: { 2026-02-21T08:31:02.2106943Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:31:02.2111253Z disable_threading: false, 2026-02-21T08:31:02.2111414Z verify_each: true 2026-02-21T08:31:02.2111557Z } 
2026-02-21T08:31:02.2111674Z } 2026-02-21T08:31:02.2111782Z #-} 2026-02-21T08:31:02.2112234Z /tmp/torchinductor_root/jp/cjp7lrgyeevtgjtetd7rzaqljywnn7cyuy3rdfzuigyauoom3p2n.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:31:02.2113423Z /tmp/torchinductor_root/jp/cjp7lrgyeevtgjtetd7rzaqljywnn7cyuy3rdfzuigyauoom3p2n.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:31:02.2114514Z [91s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:31:02.2115510Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=6, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:31:02.2116406Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:31:02.2116670Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:31:03.0808486Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 17.9 configs/s 2026-02-21T08:31:21.9657799Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 54.0 configs/s 2026-02-21T08:31:22.4658807Z [111s] Generation 2 complete: 2026-02-21T08:31:22.4660902Z error=1 2026-02-21T08:31:22.4661104Z ok=71 2026-02-21T08:31:22.4661282Z min=0.2100 2026-02-21T08:31:22.4661455Z mid=0.2263 2026-02-21T08:31:22.4661634Z max=0.6544 2026-02-21T08:31:22.4661819Z best={'block_sizes': [1024, 1], 2026-02-21T08:31:22.4662507Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:31:22.4662922Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:31:22.4663218Z 'num_stages': 6, 2026-02-21T08:31:22.4663420Z 'num_warps': 4, 2026-02-21T08:31:22.4663635Z 'pid_type': 'flat', 2026-02-21T08:31:22.4663863Z 'range_flattens': [None, False], 2026-02-21T08:31:22.4664140Z 'range_multi_buffers': [None, True], 2026-02-21T08:31:22.4664418Z 'range_num_stages': [0, 4], 2026-02-21T08:31:22.4664651Z 'range_unroll_factors': [0, 0], 2026-02-21T08:31:22.4664918Z 'range_warp_specializes': [None, True]} 2026-02-21T08:31:22.4674888Z [111s] Fitting surrogate: 260 points, 260 targets 2026-02-21T08:31:23.5767809Z [112s] Generation 3 starting: 66 neighbors, 5 active search path(s) 2026-02-21T08:31:27.5864826Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67/67 41.5 configs/s 2026-02-21T08:31:31.3911557Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 67/67 17.8 configs/s 2026-02-21T08:31:51.1861057Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 50.7 configs/s 2026-02-21T08:31:51.5823617Z [140s] Generation 3 complete: 2026-02-21T08:31:51.5824005Z ok=72 2026-02-21T08:31:51.5826718Z min=0.2100 2026-02-21T08:31:51.5826916Z mid=0.2222 2026-02-21T08:31:51.5827099Z max=0.3532 2026-02-21T08:31:51.5827302Z best={'block_sizes': [1024, 1], 2026-02-21T08:31:51.5827740Z 'indexing': 
['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:31:51.5830048Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:31:51.5830360Z 'num_stages': 6, 2026-02-21T08:31:51.5830584Z 'num_warps': 4, 2026-02-21T08:31:51.5830832Z 'pid_type': 'flat', 2026-02-21T08:31:51.5831089Z 'range_flattens': [None, False], 2026-02-21T08:31:51.5831423Z 'range_multi_buffers': [None, True], 2026-02-21T08:31:51.5833364Z 'range_num_stages': [0, 4], 2026-02-21T08:31:51.5833631Z 'range_unroll_factors': [0, 0], 2026-02-21T08:31:51.5833894Z 'range_warp_specializes': [None, True]} 2026-02-21T08:31:51.5849959Z [140s] Fitting surrogate: 332 points, 332 targets 2026-02-21T08:31:52.5970740Z [141s] Generation 4 starting: 52 neighbors, 5 active search path(s) 2026-02-21T08:31:57.0729511Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53/53 7.6 configs/s 2026-02-21T08:32:00.1009171Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 53/53 17.8 configs/s 2026-02-21T08:32:14.5749143Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 69.6 configs/s 2026-02-21T08:32:14.8670745Z [164s] Generation 4 complete: 2026-02-21T08:32:14.8675806Z ok=58 2026-02-21T08:32:14.8679592Z min=0.2121 2026-02-21T08:32:14.8683480Z mid=0.2222 2026-02-21T08:32:14.8687386Z max=0.4756 2026-02-21T08:32:14.8690570Z best={'block_sizes': [2048, 1], 2026-02-21T08:32:14.8694523Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:32:14.8698319Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:32:14.8698605Z 'num_stages': 6, 2026-02-21T08:32:14.8698774Z 'num_warps': 8, 2026-02-21T08:32:14.8698946Z 'pid_type': 'flat', 2026-02-21T08:32:14.8699120Z 'range_flattens': [None, None], 2026-02-21T08:32:14.8705444Z 'range_multi_buffers': [None, True], 2026-02-21T08:32:14.8709511Z 'range_num_stages': [0, 0], 2026-02-21T08:32:14.8710762Z 'range_unroll_factors': [0, 0], 2026-02-21T08:32:14.8710956Z 'range_warp_specializes': [None, True]} 
2026-02-21T08:32:14.8711176Z [164s] Fitting surrogate: 390 points, 390 targets 2026-02-21T08:32:15.8751043Z [165s] Generation 5 starting: 62 neighbors, 5 active search path(s) 2026-02-21T08:32:18.9424954Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64/64 29.6 configs/s 2026-02-21T08:32:22.6223862Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 64/64 17.6 configs/s 2026-02-21T08:32:40.0366995Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 58.5 configs/s 2026-02-21T08:32:40.3961371Z [189s] Generation 5 complete: 2026-02-21T08:32:40.3962771Z ok=67 2026-02-21T08:32:40.3962931Z min=0.2099 2026-02-21T08:32:40.3963064Z mid=0.2180 2026-02-21T08:32:40.3963179Z max=0.5397 2026-02-21T08:32:40.3963325Z best={'block_sizes': [2048, 1], 2026-02-21T08:32:40.3963578Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:32:40.3963838Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:32:40.3964029Z 'num_stages': 6, 2026-02-21T08:32:40.3964166Z 'num_warps': 8, 2026-02-21T08:32:40.3964311Z 'pid_type': 'flat', 2026-02-21T08:32:40.3964462Z 'range_flattens': [None, True], 2026-02-21T08:32:40.3964640Z 'range_multi_buffers': [None, True], 2026-02-21T08:32:40.3964849Z 'range_num_stages': [0, 0], 2026-02-21T08:32:40.3965043Z 'range_unroll_factors': [0, 0], 2026-02-21T08:32:40.3965243Z 'range_warp_specializes': [None, True]} 2026-02-21T08:32:40.3974231Z [189s] Fitting surrogate: 457 points, 457 targets 2026-02-21T08:32:40.9507361Z [190s] Generation 6 starting: 34 neighbors, 3 active search path(s) 2026-02-21T08:32:45.0083164Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 1.9 configs/s 2026-02-21T08:32:47.0085239Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 17.9 configs/s 2026-02-21T08:32:56.0879194Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 110.6 2026-02-21T08:32:56.0880321Z configs/s 2026-02-21T08:32:56.2911274Z [205s] Generation 6 
complete: 2026-02-21T08:32:56.2912726Z ok=38 2026-02-21T08:32:56.2912917Z min=0.2100 2026-02-21T08:32:56.2913072Z mid=0.2221 2026-02-21T08:32:56.2913205Z max=1.0025 2026-02-21T08:32:56.2913369Z best={'block_sizes': [2048, 1], 2026-02-21T08:32:56.2913677Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:32:56.2913995Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:32:56.2914205Z 'num_stages': 6, 2026-02-21T08:32:56.2914366Z 'num_warps': 8, 2026-02-21T08:32:56.2914519Z 'pid_type': 'flat', 2026-02-21T08:32:56.2914703Z 'range_flattens': [None, True], 2026-02-21T08:32:56.2914906Z 'range_multi_buffers': [None, True], 2026-02-21T08:32:56.2915111Z 'range_num_stages': [0, 0], 2026-02-21T08:32:56.2915298Z 'range_unroll_factors': [0, 0], 2026-02-21T08:32:56.2915497Z 'range_warp_specializes': [None, True]} 2026-02-21T08:32:56.2932908Z [205s] Fitting surrogate: 495 points, 495 targets 2026-02-21T08:32:57.0400204Z [206s] Generation 7 starting: 45 neighbors, 3 active search path(s) 2026-02-21T08:32:59.5501424Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47/47 29.1 configs/s 2026-02-21T08:33:02.2413116Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 47/47 17.8 configs/s 2026-02-21T08:33:14.7091789Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 82.4 configs/s 2026-02-21T08:33:14.9691044Z [224s] Generation 7 complete: 2026-02-21T08:33:14.9691342Z ok=49 2026-02-21T08:33:14.9691493Z min=0.2100 2026-02-21T08:33:14.9691650Z mid=0.2243 2026-02-21T08:33:14.9691799Z max=1.2893 2026-02-21T08:33:14.9692066Z best={'block_sizes': [2048, 1], 2026-02-21T08:33:14.9692281Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:33:14.9692516Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:33:14.9692708Z 'num_stages': 2, 2026-02-21T08:33:14.9692861Z 'num_warps': 8, 2026-02-21T08:33:14.9693002Z 'pid_type': 'flat', 2026-02-21T08:33:14.9693165Z 'range_flattens': [None, False], 
2026-02-21T08:33:14.9693341Z 'range_multi_buffers': [None, True], 2026-02-21T08:33:14.9693523Z 'range_num_stages': [0, 4], 2026-02-21T08:33:14.9693691Z 'range_unroll_factors': [0, 1], 2026-02-21T08:33:14.9693867Z 'range_warp_specializes': [None, True]} 2026-02-21T08:33:14.9706557Z [224s] Fitting surrogate: 544 points, 544 targets 2026-02-21T08:33:15.5782142Z [224s] Generation 8 starting: 28 neighbors, 2 active search path(s) 2026-02-21T08:33:17.5818722Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 29/29 17.2 configs/s 2026-02-21T08:33:19.2484986Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 29/29 17.9 configs/s 2026-02-21T08:33:26.6110744Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 136.4 2026-02-21T08:33:26.6114562Z configs/s 2026-02-21T08:33:26.7714481Z [235s] Generation 8 complete: 2026-02-21T08:33:26.7714730Z ok=30 2026-02-21T08:33:26.7714861Z min=0.2056 2026-02-21T08:33:26.7714988Z mid=0.2388 2026-02-21T08:33:26.7715105Z max=0.6790 2026-02-21T08:33:26.7715243Z best={'block_sizes': [2048, 1], 2026-02-21T08:33:26.7715440Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:33:26.7715662Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:33:26.7715881Z 'num_stages': 3, 2026-02-21T08:33:26.7716039Z 'num_warps': 8, 2026-02-21T08:33:26.7716171Z 'pid_type': 'flat', 2026-02-21T08:33:26.7716327Z 'range_flattens': [None, True], 2026-02-21T08:33:26.7716494Z 'range_multi_buffers': [None, None], 2026-02-21T08:33:26.7716675Z 'range_num_stages': [0, 3], 2026-02-21T08:33:26.7716836Z 'range_unroll_factors': [0, 0], 2026-02-21T08:33:26.7717005Z 'range_warp_specializes': [None, True]} 2026-02-21T08:33:26.7733432Z [236s] Fitting surrogate: 574 points, 574 targets 2026-02-21T08:33:27.1784300Z [236s] Generation 9 starting: 12 neighbors, 1 active search path(s) 2026-02-21T08:33:28.1266056Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 24.3 configs/s 2026-02-21T08:33:28.8172824Z Generation 9: 
exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 12/12 18.7 configs/s 2026-02-21T08:33:31.5497216Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 363.2 2026-02-21T08:33:31.5497596Z configs/s 2026-02-21T08:33:31.6353060Z [240s] Generation 9 complete: 2026-02-21T08:33:31.6353365Z ok=13 2026-02-21T08:33:31.6353562Z min=0.2098 2026-02-21T08:33:31.6353745Z mid=0.2200 2026-02-21T08:33:31.6353928Z max=0.3554 2026-02-21T08:33:31.6354134Z best={'block_sizes': [2048, 1], 2026-02-21T08:33:31.6354440Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:33:31.6354794Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:33:31.6355090Z 'num_stages': 4, 2026-02-21T08:33:31.6355308Z 'num_warps': 8, 2026-02-21T08:33:31.6355515Z 'pid_type': 'flat', 2026-02-21T08:33:31.6355755Z 'range_flattens': [None, True], 2026-02-21T08:33:31.6356025Z 'range_multi_buffers': [None, None], 2026-02-21T08:33:31.6356312Z 'range_num_stages': [0, 4], 2026-02-21T08:33:31.6356572Z 'range_unroll_factors': [0, 0], 2026-02-21T08:33:31.6356851Z 'range_warp_specializes': [None, True]} 2026-02-21T08:33:31.6378387Z [240s] Fitting surrogate: 587 points, 587 targets 2026-02-21T08:33:31.8074022Z [241s] Autotuning complete in 241.0s after searching 556 configs. 
2026-02-21T08:33:31.8074494Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:33:31.8075905Z @helion.kernel(config=helion.Config(block_sizes=[2048, 1], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=4, num_warps=8, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:33:31.8077169Z 2026-02-21T08:33:31.8077565Z [241s] Code of selected kernel: /tmp/torchinductor_root/62/c624ylltgvkjwjg4c7ypjk65kagjycjwkdkbdjnh2wdf7knnrj6m.py 2026-02-21T08:33:31.8365277Z from __future__ import annotations 2026-02-21T08:33:31.8365472Z 2026-02-21T08:33:31.8365566Z import torch 2026-02-21T08:33:31.8365755Z import triton 2026-02-21T08:33:31.8365972Z import triton.language as tl 2026-02-21T08:33:31.8366275Z from torch._inductor.runtime import triton_helpers 2026-02-21T08:33:31.8366739Z from torch._inductor.runtime.triton_helpers import math as tl_math 2026-02-21T08:33:31.8367200Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T08:33:31.8367629Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T08:33:31.8367909Z 2026-02-21T08:33:31.8368005Z _BLOCK_SIZE_1 = tl.constexpr(1) 2026-02-21T08:33:31.8368267Z _BLOCK_SIZE_0 = tl.constexpr(2048) 2026-02-21T08:33:31.8368440Z 2026-02-21T08:33:31.8368515Z @triton.jit 2026-02-21T08:33:31.8368798Z def _helion_kl_div_forward(y_pred, y_true, loss, log_target, eps): 2026-02-21T08:33:31.8369247Z # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m): 2026-02-21T08:33:31.8369629Z pid_0 = tl.program_id(0) 2026-02-21T08:33:31.8369863Z offset_1 = pid_0 2026-02-21T08:33:31.8370119Z indices_1 = offset_1 + tl.zeros([1], tl.int32) 2026-02-21T08:33:31.8370558Z # src[kl_div.py:90]: loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32) 2026-02-21T08:33:31.8371063Z 
loss_sum = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T08:33:31.8371504Z # src[kl_div.py:92]: for tile_v in hl.tile(V, block_size=block_size_n): 2026-02-21T08:33:31.8372290Z # src[kl_div.py:93]: kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32) 2026-02-21T08:33:31.8372700Z # src[kl_div.py:92-112]: ... 2026-02-21T08:33:31.8373110Z for offset_0 in tl.range(0, 32768, _BLOCK_SIZE_0, warp_specialize=True, num_stages=4, flatten=True): 2026-02-21T08:33:31.8373624Z indices_0 = offset_0 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32) 2026-02-21T08:33:31.8373958Z loss_sum_copy = loss_sum 2026-02-21T08:33:31.8374212Z loss_sum_copy_0 = loss_sum_copy 2026-02-21T08:33:31.8374635Z # src[kl_div.py:93]: kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32) 2026-02-21T08:33:31.8375105Z kl_loss = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T08:33:31.8375517Z # src[kl_div.py:95]: y_pred_val = y_pred[tile_bt, tile_v] 2026-02-21T08:33:31.8376246Z y_pred_val = tl.load(y_pred + (indices_1[:, None] * 32768 + indices_0[None, :] * 1), None, eviction_policy='evict_first') 2026-02-21T08:33:31.8376784Z # src[kl_div.py:96]: y_true_val = y_true[tile_bt, tile_v] 2026-02-21T08:33:31.8377341Z y_true_val = tl.load(y_true + (indices_1[:, None] * 32768 + indices_0[None, :] * 1), None, eviction_policy='evict_first') 2026-02-21T08:33:31.8377860Z # src[kl_div.py:98]: if log_target: 2026-02-21T08:33:31.8378277Z # src[kl_div.py:99]: # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space 2026-02-21T08:33:31.8378750Z # src[kl_div.py:100]: prob_true = torch.exp(y_true_val) 2026-02-21T08:33:31.8379093Z # src[kl_div.py:98-106]: ... 
2026-02-21T08:33:31.8379357Z if log_target: 2026-02-21T08:33:31.8379586Z y_true_val_copy = y_true_val 2026-02-21T08:33:31.8379958Z y_pred_val_copy = y_pred_val 2026-02-21T08:33:31.8380231Z kl_loss_copy = kl_loss 2026-02-21T08:33:31.8380513Z y_true_val_copy_0 = y_true_val_copy 2026-02-21T08:33:31.8380818Z y_pred_val_copy_0 = y_pred_val_copy 2026-02-21T08:33:31.8381114Z kl_loss_copy_0 = kl_loss_copy 2026-02-21T08:33:31.8381445Z # src[kl_div.py:100]: prob_true = torch.exp(y_true_val) 2026-02-21T08:33:31.8381805Z v_0 = libdevice.exp(y_true_val_copy_0) 2026-02-21T08:33:31.8382242Z # src[kl_div.py:101]: kl_loss += prob_true * (y_true_val - y_pred_val) 2026-02-21T08:33:31.8382643Z v_1 = y_true_val_copy_0 - y_pred_val_copy_0 2026-02-21T08:33:31.8382939Z v_2 = v_0 * v_1 2026-02-21T08:33:31.8383179Z kl_loss = kl_loss_copy_0 + v_2 2026-02-21T08:33:31.8383463Z # src[kl_div.py:98]: if log_target: 2026-02-21T08:33:31.8383856Z # src[kl_div.py:99]: # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space 2026-02-21T08:33:31.8384321Z # src[kl_div.py:100]: prob_true = torch.exp(y_true_val) 2026-02-21T08:33:31.8384662Z # src[kl_div.py:98-106]: ... 
2026-02-21T08:33:31.8384920Z _not = not log_target 2026-02-21T08:33:31.8385151Z if _not: 2026-02-21T08:33:31.8385357Z y_true_val_copy_1 = y_true_val 2026-02-21T08:33:31.8385634Z y_pred_val_copy_1 = y_pred_val 2026-02-21T08:33:31.8385900Z kl_loss_copy_1 = kl_loss 2026-02-21T08:33:31.8386185Z y_true_val_copy_1_0 = y_true_val_copy_1 2026-02-21T08:33:31.8386493Z y_pred_val_copy_1_0 = y_pred_val_copy_1 2026-02-21T08:33:31.8386794Z kl_loss_copy_1_0 = kl_loss_copy_1 2026-02-21T08:33:31.8387185Z # src[kl_div.py:105]: log_true = torch.log(torch.clamp(y_true_val, min=eps)) 2026-02-21T08:33:31.8387637Z v_4 = triton_helpers.maximum(y_true_val_copy_1_0, eps) 2026-02-21T08:33:31.8387978Z v_5 = tl_math.log(v_4) 2026-02-21T08:33:31.8388317Z # src[kl_div.py:106]: kl_loss += y_true_val * (log_true - y_pred_val) 2026-02-21T08:33:31.8388695Z v_6 = v_5 - y_pred_val_copy_1_0 2026-02-21T08:33:31.8388968Z v_7 = y_true_val_copy_1_0 * v_6 2026-02-21T08:33:31.8389255Z kl_loss = kl_loss_copy_1_0 + v_7 2026-02-21T08:33:31.8389544Z # src[kl_div.py:112]: loss_sum += kl_loss 2026-02-21T08:33:31.8389848Z loss_sum = loss_sum_copy_0 + kl_loss 2026-02-21T08:33:31.8390184Z # src[kl_div.py:115]: loss[tile_bt] = loss_sum.sum(dim=-1) 2026-02-21T08:33:31.8390546Z sum_1 = tl.cast(tl.sum(loss_sum, 1), tl.float32) 2026-02-21T08:33:31.8390873Z tl.store(loss + indices_1 * 1, sum_1, None) 2026-02-21T08:33:31.8391079Z 2026-02-21T08:33:31.8391537Z def kl_div_forward(y_pred: Tensor, y_true: Tensor, log_target: bool=False, reduction: str='batchmean', eps: float=1e-10, *, _launcher=_default_launcher): 2026-02-21T08:33:31.8392230Z """ 2026-02-21T08:33:31.8392430Z Compute KL Divergence loss. 
2026-02-21T08:33:31.8392605Z 2026-02-21T08:33:31.8392764Z Args: 2026-02-21T08:33:31.8393030Z y_pred: Input predictions in log-space, shape (BT, V) 2026-02-21T08:33:31.8393482Z y_true: Target values (probabilities or log-probabilities), shape (BT, V) 2026-02-21T08:33:31.8394023Z log_target: If True, y_true is in log-space; if False, y_true is probabilities 2026-02-21T08:33:31.8394506Z reduction: Reduction mode ('none', 'sum', 'mean', 'batchmean') 2026-02-21T08:33:31.8394898Z eps: Small value to avoid numerical issues 2026-02-21T08:33:31.8395108Z 2026-02-21T08:33:31.8395192Z Returns: 2026-02-21T08:33:31.8395396Z loss: KL divergence loss 2026-02-21T08:33:31.8395633Z """ 2026-02-21T08:33:31.8395832Z # src[kl_div.py:74]: BT, V = y_pred.shape 2026-02-21T08:33:31.8396115Z BT, V = y_pred.shape 2026-02-21T08:33:31.8396406Z # src[kl_div.py:75]: assert y_true.shape == y_pred.shape, ( 2026-02-21T08:33:31.8396915Z # src[kl_div.py:76]: f"Shape mismatch: {y_true.shape} != {y_pred.shape}" 2026-02-21T08:33:31.8397285Z # src[kl_div.py:77]: ) 2026-02-21T08:33:31.8397687Z assert y_true.shape == y_pred.shape, f'Shape mismatch: {y_true.shape} != {y_pred.shape}' 2026-02-21T08:33:31.8398149Z # src[kl_div.py:80]: if reduction == "none": 2026-02-21T08:33:31.8398492Z # src[kl_div.py:81]: loss = torch.zeros_like(y_pred) 2026-02-21T08:33:31.8398814Z # src[kl_div.py:82]: else: 2026-02-21T08:33:31.8399052Z # src[kl_div.py:80-83]: ... 
2026-02-21T08:33:31.8399295Z if reduction == 'none': 2026-02-21T08:33:31.8399576Z # src[kl_div.py:81]: loss = torch.zeros_like(y_pred) 2026-02-21T08:33:31.8399900Z loss = torch.zeros_like(y_pred) 2026-02-21T08:33:31.8400153Z else: 2026-02-21T08:33:31.8400485Z # src[kl_div.py:83]: loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device) 2026-02-21T08:33:31.8401005Z loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device) 2026-02-21T08:33:31.8401459Z # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m): 2026-02-21T08:33:31.8402546Z # src[kl_div.py:90]: loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32) 2026-02-21T08:33:31.8402963Z # src[kl_div.py:89-115]: ... 2026-02-21T08:33:31.8403443Z _launcher(_helion_kl_div_forward, (4096,), y_pred, y_true, loss, log_target, eps, num_warps=8, num_stages=4) 2026-02-21T08:33:31.8403987Z # src[kl_div.py:118]: if reduction == "batchmean": 2026-02-21T08:33:31.8404351Z # src[kl_div.py:119]: final_loss = torch.sum(loss) / BT 2026-02-21T08:33:31.8404716Z # src[kl_div.py:120]: elif reduction == "sum": 2026-02-21T08:33:31.8405012Z # src[kl_div.py:118-125]: ... 
2026-02-21T08:33:31.8405277Z if reduction == 'batchmean': 2026-02-21T08:33:31.8405582Z # src[kl_div.py:119]: final_loss = torch.sum(loss) / BT 2026-02-21T08:33:31.8405926Z final_loss = torch.sum(loss) / BT 2026-02-21T08:33:31.8406202Z elif reduction == 'sum': 2026-02-21T08:33:31.8406510Z # src[kl_div.py:121]: final_loss = torch.sum(loss, dim=0) 2026-02-21T08:33:31.8406857Z final_loss = torch.sum(loss, dim=0) 2026-02-21T08:33:31.8407131Z elif reduction == 'mean': 2026-02-21T08:33:31.8407447Z # src[kl_div.py:123]: final_loss = torch.sum(loss) / (BT * V) 2026-02-21T08:33:31.8407802Z final_loss = torch.sum(loss) / (BT * V) 2026-02-21T08:33:31.8408074Z else: 2026-02-21T08:33:31.8408276Z # src[kl_div.py:125]: final_loss = loss 2026-02-21T08:33:31.8408563Z final_loss = loss 2026-02-21T08:33:31.8408809Z # src[kl_div.py:127]: return final_loss 2026-02-21T08:33:31.8409077Z return final_loss 2026-02-21T08:33:33.0024955Z WARNING:tritonbench.utils.triton_op:Completed input ID 3: 2026-02-21T08:33:33.0028992Z (B, T, V) 2026-02-21T08:33:33.0030975Z --------------- 2026-02-21T08:33:33.0031216Z (8, 512, 32768) 2026-02-21T08:33:33.0035875Z 2026-02-21T08:33:33.0044481Z 67%|██████▋ | 4/6 [12:05<06:26, 193.41s/it]WARNING:tritonbench.utils.triton_op:Running input ID 4: 2026-02-21T08:33:33.0049333Z (B, T, V) 2026-02-21T08:33:33.0050829Z --------------- 2026-02-21T08:33:33.0051021Z (8, 512, 65536) 2026-02-21T08:33:33.0057576Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for torch_kl_div 2026-02-21T08:33:34.0779906Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for liger_kl_div 2026-02-21T08:33:35.2863482Z INFO:tritonbench.utils.triton_op:Took 2.69ms to get benchmark function for torch_compile_kl_div 2026-02-21T08:33:39.0012667Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:33:39.0016792Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:33:39.0020797Z 'dtype': 'torch.float32', 2026-02-21T08:33:39.0023767Z 'shape': (4096, 65536), 
2026-02-21T08:33:39.0027746Z 'stride': (65536, 1)}, 2026-02-21T08:33:39.0031636Z { 'device': 'cuda:0', 2026-02-21T08:33:39.0036142Z 'dtype': 'torch.float32', 2026-02-21T08:33:39.0039488Z 'shape': (4096, 65536), 2026-02-21T08:33:39.0040607Z 'stride': (65536, 1)}), 2026-02-21T08:33:39.0040794Z 'kwargs': {}} 2026-02-21T08:33:39.0041084Z INFO:tritonbench.utils.triton_op:Took 2.55ms to get benchmark function for helion_kl_div_tritonbench 2026-02-21T08:33:39.2205507Z [0s] Autotune random seed: 2135561342 2026-02-21T08:33:39.3918736Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:34:12.6724283Z [33s] Timeout after 30s compiling Config(block_sizes=[1024, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'first'], num_stages=8, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[None, None]) 2026-02-21T08:34:13.0652796Z [33s] Timeout after 30s compiling Config(block_sizes=[4096, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], num_stages=2, num_warps=32, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[None, None]) 2026-02-21T08:34:14.8944745Z [35s] Timeout after 30s compiling Config(block_sizes=[65536, 8], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], maxnreg=64, num_sm_multiplier=8, num_stages=7, num_warps=32, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[1, 0], range_unroll_factors=[2, 0], range_warp_specializes=[False, None]) 2026-02-21T08:34:15.1111322Z [35s] Timeout after 30s compiling Config(block_sizes=[512, 128], indexing=['pointer', 
'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=64, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[1, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]) 2026-02-21T08:34:15.1126468Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s 2026-02-21T08:34:15.2078138Z module { 2026-02-21T08:34:15.2083040Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:34:15.2084593Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:34:15.2084820Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:34:15.2085036Z %cst = arith.constant dense<65536> : tensor<16x1xi32> 2026-02-21T08:34:15.2085303Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<16x8xf32> 2026-02-21T08:34:15.2085533Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:34:15.2085732Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:34:15.2085949Z %c65536_i32 = arith.constant 65536 : i32 2026-02-21T08:34:15.2086572Z %c65536_i64 = arith.constant 65536 : i64 2026-02-21T08:34:15.2086769Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:34:15.2087086Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c65536_i32], [%c65536_i64, %c1_i64] : , > 2026-02-21T08:34:15.2087416Z %1 = tt.get_program_id x : i32 2026-02-21T08:34:15.2087591Z %2 = arith.muli %1, %c16_i32 : i32 2026-02-21T08:34:15.2087823Z %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:34:15.2088066Z %4 = tt.splat %2 : i32 -> tensor<16xi32> 2026-02-21T08:34:15.2088246Z %5 = arith.addi %4, %3 : tensor<16xi32> 2026-02-21T08:34:15.2088551Z %6 = scf.for %arg5 = %c0_i32 to %c65536_i32 step %c8_i32 iter_args(%arg6 = %cst_0) -> (tensor<16x8xf32>) : 
i32 { 2026-02-21T08:34:15.2088887Z %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:34:15.2089238Z %11 = tt.splat %arg5 : i32 -> tensor<8xi32> 2026-02-21T08:34:15.2089530Z %12 = arith.addi %11, %10 : tensor<8xi32> 2026-02-21T08:34:15.2089842Z %13 = tt.descriptor_load %0[%2, %arg5] : !tt.tensordesc> -> tensor<16x8xf32> 2026-02-21T08:34:15.2090177Z %14 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T08:34:15.2090429Z %15 = arith.muli %14, %cst : tensor<16x1xi32> 2026-02-21T08:34:15.2090676Z %16 = tt.expand_dims %12 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:34:15.2091000Z %17 = tt.broadcast %15 : tensor<16x1xi32> -> tensor<16x8xi32> 2026-02-21T08:34:15.2091263Z %18 = tt.broadcast %16 : tensor<1x8xi32> -> tensor<16x8xi32> 2026-02-21T08:34:15.2091490Z %19 = arith.addi %17, %18 : tensor<16x8xi32> 2026-02-21T08:34:15.2091715Z %20 = tt.splat %arg1 : !tt.ptr -> tensor<16x8x!tt.ptr> 2026-02-21T08:34:15.2094986Z %21 = tt.addptr %20, %19 : tensor<16x8x!tt.ptr>, tensor<16x8xi32> 2026-02-21T08:34:15.2099232Z %22 = tt.load %21 evictionPolicy = evict_first : tensor<16x8x!tt.ptr> 2026-02-21T08:34:15.2102982Z %23 = scf.if %arg3 -> (tensor<16x8xf32>) { 2026-02-21T08:34:15.2106427Z %25 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x8xf32>) -> tensor<16x8xf32> 2026-02-21T08:34:15.2110364Z %26 = arith.subf %22, %13 : tensor<16x8xf32> 2026-02-21T08:34:15.2114342Z %27 = arith.mulf %25, %26 : tensor<16x8xf32> 2026-02-21T08:34:15.2116227Z %28 = arith.addf %27, %cst_0 : tensor<16x8xf32> 2026-02-21T08:34:15.2116440Z scf.yield %28 : tensor<16x8xf32> 2026-02-21T08:34:15.2116617Z } else { 2026-02-21T08:34:15.2116783Z %25 = tt.splat %arg4 : f32 -> tensor<16x8xf32> 2026-02-21T08:34:15.2117016Z %26 = arith.cmpf ogt, %22, %25 : tensor<16x8xf32> 2026-02-21T08:34:15.2117233Z %27 = arith.cmpf une, %22, %22 : tensor<16x8xf32> 
2026-02-21T08:34:15.2117444Z %28 = arith.ori %26, %27 : tensor<16x8xi1> 2026-02-21T08:34:15.2117688Z %29 = arith.select %28, %22, %25 : tensor<16x8xi1>, tensor<16x8xf32> 2026-02-21T08:34:15.2117926Z %30 = math.log %29 : tensor<16x8xf32> 2026-02-21T08:34:15.2118113Z %31 = arith.subf %30, %13 : tensor<16x8xf32> 2026-02-21T08:34:15.2118314Z %32 = arith.mulf %22, %31 : tensor<16x8xf32> 2026-02-21T08:34:15.2118511Z %33 = arith.addf %32, %cst_0 : tensor<16x8xf32> 2026-02-21T08:34:15.2118703Z scf.yield %33 : tensor<16x8xf32> 2026-02-21T08:34:15.2118867Z } 2026-02-21T08:34:15.2119017Z %24 = arith.addf %arg6, %23 : tensor<16x8xf32> 2026-02-21T08:34:15.2119212Z scf.yield %24 : tensor<16x8xf32> 2026-02-21T08:34:15.2119387Z } {tt.warp_specialize} 2026-02-21T08:34:15.2119562Z %7 = "tt.reduce"(%6) <{axis = 1 : i32}> ({ 2026-02-21T08:34:15.2119749Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:34:15.2119933Z %10 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:34:15.2120122Z tt.reduce.return %10 : f32 2026-02-21T08:34:15.2120538Z }) : (tensor<16x8xf32>) -> tensor<16xf32> 2026-02-21T08:34:15.2120766Z %8 = tt.splat %arg2 : !tt.ptr -> tensor<16x!tt.ptr> 2026-02-21T08:34:15.2121042Z %9 = tt.addptr %8, %5 : tensor<16x!tt.ptr>, tensor<16xi32> 2026-02-21T08:34:15.2121280Z tt.store %9, %7 : tensor<16x!tt.ptr> 2026-02-21T08:34:15.2121458Z tt.return 2026-02-21T08:34:15.2121597Z } 2026-02-21T08:34:15.2121719Z } 2026-02-21T08:34:15.2121785Z 2026-02-21T08:34:15.2121994Z {-# 2026-02-21T08:34:15.2122124Z external_resources: { 2026-02-21T08:34:15.2122284Z mlir_reproducer: { 2026-02-21T08:34:15.2126614Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, 
tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:34:15.2131132Z disable_threading: false, 2026-02-21T08:34:15.2131308Z verify_each: true 2026-02-21T08:34:15.2131450Z } 2026-02-21T08:34:15.2131574Z } 2026-02-21T08:34:15.2131686Z #-} 2026-02-21T08:34:15.2132144Z /tmp/torchinductor_root/23/c23utnd4szzwwjmnl75xrskyirxil2q5pn33uhnulddun2ry4cua.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:34:15.2133359Z /tmp/torchinductor_root/23/c23utnd4szzwwjmnl75xrskyirxil2q5pn33uhnulddun2ry4cua.py:21:0: note: Pipeline failed while executing 
[`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:34:15.2134353Z [35s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:34:15.2135381Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 16], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=6, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:34:15.2136215Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:34:15.2136459Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:34:23.7560551Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T08:34:23.7562480Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:34:23.7563517Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:34:23.7568154Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:34:23.7569703Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:34:23.7570022Z %cst = arith.constant dense<0.000000e+00> : tensor<4x4096xf32> 2026-02-21T08:34:23.7570259Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:34:23.7576267Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:34:23.7580827Z %c65536_i32 = arith.constant 65536 : i32 2026-02-21T08:34:23.7585485Z %c65536_i64 = arith.constant 65536 : i64 2026-02-21T08:34:23.7589961Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:34:23.7592211Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c65536_i32], [%c65536_i64, 
%c1_i64] : , > 2026-02-21T08:34:23.7592981Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c65536_i32], [%c65536_i64, %c1_i64] : , > 2026-02-21T08:34:23.7593374Z %2 = tt.get_program_id x : i32 2026-02-21T08:34:23.7598305Z scf.for %arg5 = %2 to %c1024_i32 step %c9472_i32 : i32 { 2026-02-21T08:34:23.7598642Z %3 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:34:23.7598912Z %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:34:23.7603678Z %5 = tt.splat %3 : i32 -> tensor<4xi32> 2026-02-21T08:34:23.7608100Z %6 = arith.addi %5, %4 : tensor<4xi32> 2026-02-21T08:34:23.7609632Z %c61440_i32 = arith.constant 61440 : i32 2026-02-21T08:34:23.7609862Z %c12288_i32 = arith.constant 12288 : i32 2026-02-21T08:34:23.7610257Z %7 = scf.for %arg6 = %c0_i32 to %c61440_i32 step %c12288_i32 iter_args(%arg7 = %cst) -> (tensor<4x4096xf32>) : i32 { 2026-02-21T08:34:23.7615514Z %15 = tt.descriptor_load %0[%3, %arg6] : !tt.tensordesc> -> tensor<4x4096xf32> 2026-02-21T08:34:23.7619957Z %16 = tt.descriptor_load %1[%3, %arg6] : !tt.tensordesc> -> tensor<4x4096xf32> 2026-02-21T08:34:23.7621816Z %17 = scf.if %arg3 -> (tensor<4x4096xf32>) { 2026-02-21T08:34:23.7622309Z %31 = tt.extern_elementwise %16 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4096xf32>) -> tensor<4x4096xf32> 2026-02-21T08:34:23.7622691Z %32 = arith.subf %16, %15 : tensor<4x4096xf32> 2026-02-21T08:34:23.7622895Z %33 = arith.mulf %31, %32 : tensor<4x4096xf32> 2026-02-21T08:34:23.7623114Z %34 = arith.addf %33, %cst : tensor<4x4096xf32> 2026-02-21T08:34:23.7623313Z scf.yield %34 : tensor<4x4096xf32> 2026-02-21T08:34:23.7623489Z } else { 2026-02-21T08:34:23.7623652Z %31 = tt.splat %arg4 : f32 -> tensor<4x4096xf32> 2026-02-21T08:34:23.7623877Z %32 = arith.cmpf ogt, %16, %31 : tensor<4x4096xf32> 2026-02-21T08:34:23.7624108Z %33 = arith.cmpf une, %16, %16 : tensor<4x4096xf32> 2026-02-21T08:34:23.7624319Z %34 = arith.ori %32, %33 : tensor<4x4096xi1> 
2026-02-21T08:34:23.7624561Z %35 = arith.select %34, %16, %31 : tensor<4x4096xi1>, tensor<4x4096xf32> 2026-02-21T08:34:23.7624799Z %36 = math.log %35 : tensor<4x4096xf32> 2026-02-21T08:34:23.7624997Z %37 = arith.subf %36, %15 : tensor<4x4096xf32> 2026-02-21T08:34:23.7625190Z %38 = arith.mulf %16, %37 : tensor<4x4096xf32> 2026-02-21T08:34:23.7625394Z %39 = arith.addf %38, %cst : tensor<4x4096xf32> 2026-02-21T08:34:23.7625589Z scf.yield %39 : tensor<4x4096xf32> 2026-02-21T08:34:23.7625763Z } 2026-02-21T08:34:23.7625915Z %18 = arith.addf %arg7, %17 : tensor<4x4096xf32> 2026-02-21T08:34:23.7626106Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:34:23.7626298Z %19 = arith.muli %c4096_i32, %c1_i32 : i32 2026-02-21T08:34:23.7626477Z %20 = arith.addi %arg6, %19 : i32 2026-02-21T08:34:23.7626751Z %21 = tt.descriptor_load %0[%3, %20] : !tt.tensordesc> -> tensor<4x4096xf32> 2026-02-21T08:34:23.7627332Z %22 = tt.descriptor_load %1[%3, %20] : !tt.tensordesc> -> tensor<4x4096xf32> 2026-02-21T08:34:23.7627622Z %23 = scf.if %arg3 -> (tensor<4x4096xf32>) { 2026-02-21T08:34:23.7627985Z %31 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4096xf32>) -> tensor<4x4096xf32> 2026-02-21T08:34:23.7628341Z %32 = arith.subf %22, %21 : tensor<4x4096xf32> 2026-02-21T08:34:23.7628547Z %33 = arith.mulf %31, %32 : tensor<4x4096xf32> 2026-02-21T08:34:23.7628748Z %34 = arith.addf %33, %cst : tensor<4x4096xf32> 2026-02-21T08:34:23.7628949Z scf.yield %34 : tensor<4x4096xf32> 2026-02-21T08:34:23.7629123Z } else { 2026-02-21T08:34:23.7629313Z %31 = tt.splat %arg4 : f32 -> tensor<4x4096xf32> 2026-02-21T08:34:23.7629605Z %32 = arith.cmpf ogt, %22, %31 : tensor<4x4096xf32> 2026-02-21T08:34:23.7629828Z %33 = arith.cmpf une, %22, %22 : tensor<4x4096xf32> 2026-02-21T08:34:23.7630047Z %34 = arith.ori %32, %33 : tensor<4x4096xi1> 2026-02-21T08:34:23.7630289Z %35 = arith.select %34, %22, %31 : tensor<4x4096xi1>, tensor<4x4096xf32> 
2026-02-21T08:34:23.7630522Z %36 = math.log %35 : tensor<4x4096xf32> 2026-02-21T08:34:23.7630725Z %37 = arith.subf %36, %21 : tensor<4x4096xf32> 2026-02-21T08:34:23.7630920Z %38 = arith.mulf %22, %37 : tensor<4x4096xf32> 2026-02-21T08:34:23.7631131Z %39 = arith.addf %38, %cst : tensor<4x4096xf32> 2026-02-21T08:34:23.7631323Z scf.yield %39 : tensor<4x4096xf32> 2026-02-21T08:34:23.7631498Z } 2026-02-21T08:34:23.7631641Z %24 = arith.addf %18, %23 : tensor<4x4096xf32> 2026-02-21T08:34:23.7631828Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:34:23.7632063Z %25 = arith.muli %c4096_i32, %c2_i32 : i32 2026-02-21T08:34:23.7632245Z %26 = arith.addi %arg6, %25 : i32 2026-02-21T08:34:23.7632516Z %27 = tt.descriptor_load %0[%3, %26] : !tt.tensordesc> -> tensor<4x4096xf32> 2026-02-21T08:34:23.7632862Z %28 = tt.descriptor_load %1[%3, %26] : !tt.tensordesc> -> tensor<4x4096xf32> 2026-02-21T08:34:23.7633145Z %29 = scf.if %arg3 -> (tensor<4x4096xf32>) { 2026-02-21T08:34:23.7633503Z %31 = tt.extern_elementwise %28 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4096xf32>) -> tensor<4x4096xf32> 2026-02-21T08:34:23.7633860Z %32 = arith.subf %28, %27 : tensor<4x4096xf32> 2026-02-21T08:34:23.7634065Z %33 = arith.mulf %31, %32 : tensor<4x4096xf32> 2026-02-21T08:34:23.7634266Z %34 = arith.addf %33, %cst : tensor<4x4096xf32> 2026-02-21T08:34:23.7634466Z scf.yield %34 : tensor<4x4096xf32> 2026-02-21T08:34:23.7634634Z } else { 2026-02-21T08:34:23.7634801Z %31 = tt.splat %arg4 : f32 -> tensor<4x4096xf32> 2026-02-21T08:34:23.7635021Z %32 = arith.cmpf ogt, %28, %31 : tensor<4x4096xf32> 2026-02-21T08:34:23.7635234Z %33 = arith.cmpf une, %28, %28 : tensor<4x4096xf32> 2026-02-21T08:34:23.7635444Z %34 = arith.ori %32, %33 : tensor<4x4096xi1> 2026-02-21T08:34:23.7635671Z %35 = arith.select %34, %28, %31 : tensor<4x4096xi1>, tensor<4x4096xf32> 2026-02-21T08:34:23.7635911Z %36 = math.log %35 : tensor<4x4096xf32> 2026-02-21T08:34:23.7636099Z %37 = arith.subf %36, %27 : 
tensor<4x4096xf32> 2026-02-21T08:34:23.7636298Z %38 = arith.mulf %28, %37 : tensor<4x4096xf32> 2026-02-21T08:34:23.7636498Z %39 = arith.addf %38, %cst : tensor<4x4096xf32> 2026-02-21T08:34:23.7636686Z scf.yield %39 : tensor<4x4096xf32> 2026-02-21T08:34:23.7636855Z } 2026-02-21T08:34:23.7636989Z %30 = arith.addf %24, %29 : tensor<4x4096xf32> 2026-02-21T08:34:23.7637182Z scf.yield %30 : tensor<4x4096xf32> 2026-02-21T08:34:23.7637360Z } {tt.num_stages = 1 : i32} 2026-02-21T08:34:23.7637707Z %8 = tt.descriptor_load %0[%3, %c61440_i32] : !tt.tensordesc> -> tensor<4x4096xf32> 2026-02-21T08:34:23.7638097Z %9 = tt.descriptor_load %1[%3, %c61440_i32] : !tt.tensordesc> -> tensor<4x4096xf32> 2026-02-21T08:34:23.7638389Z %10 = scf.if %arg3 -> (tensor<4x4096xf32>) { 2026-02-21T08:34:23.7638751Z %15 = tt.extern_elementwise %9 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x4096xf32>) -> tensor<4x4096xf32> 2026-02-21T08:34:23.7639109Z %16 = arith.subf %9, %8 : tensor<4x4096xf32> 2026-02-21T08:34:23.7639320Z %17 = arith.mulf %15, %16 : tensor<4x4096xf32> 2026-02-21T08:34:23.7639526Z %18 = arith.addf %17, %cst : tensor<4x4096xf32> 2026-02-21T08:34:23.7639730Z scf.yield %18 : tensor<4x4096xf32> 2026-02-21T08:34:23.7639908Z } else { 2026-02-21T08:34:23.7640118Z %15 = tt.splat %arg4 : f32 -> tensor<4x4096xf32> 2026-02-21T08:34:23.7640337Z %16 = arith.cmpf ogt, %9, %15 : tensor<4x4096xf32> 2026-02-21T08:34:23.7640547Z %17 = arith.cmpf une, %9, %9 : tensor<4x4096xf32> 2026-02-21T08:34:23.7640754Z %18 = arith.ori %16, %17 : tensor<4x4096xi1> 2026-02-21T08:34:23.7640984Z %19 = arith.select %18, %9, %15 : tensor<4x4096xi1>, tensor<4x4096xf32> 2026-02-21T08:34:23.7641229Z %20 = math.log %19 : tensor<4x4096xf32> 2026-02-21T08:34:23.7641431Z %21 = arith.subf %20, %8 : tensor<4x4096xf32> 2026-02-21T08:34:23.7641625Z %22 = arith.mulf %9, %21 : tensor<4x4096xf32> 2026-02-21T08:34:23.7641829Z %23 = arith.addf %22, %cst : tensor<4x4096xf32> 2026-02-21T08:34:23.7642045Z 
scf.yield %23 : tensor<4x4096xf32> 2026-02-21T08:34:23.7642214Z } 2026-02-21T08:34:23.7642350Z %11 = arith.addf %7, %10 : tensor<4x4096xf32> 2026-02-21T08:34:23.7642553Z %12 = "tt.reduce"(%11) <{axis = 1 : i32}> ({ 2026-02-21T08:34:23.7642739Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:34:23.7642920Z %15 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:34:23.7643102Z tt.reduce.return %15 : f32 2026-02-21T08:34:23.7643281Z }) : (tensor<4x4096xf32>) -> tensor<4xf32> 2026-02-21T08:34:23.7643510Z %13 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:34:23.7643761Z %14 = tt.addptr %13, %6 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:34:23.7643993Z tt.store %14, %12 : tensor<4x!tt.ptr> 2026-02-21T08:34:23.7644248Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:34:23.7644496Z tt.return 2026-02-21T08:34:23.7644618Z } 2026-02-21T08:34:23.7644737Z } 2026-02-21T08:34:23.7644804Z 2026-02-21T08:34:23.7644860Z {-# 2026-02-21T08:34:23.7644981Z external_resources: { 2026-02-21T08:34:23.7645135Z mlir_reproducer: { 2026-02-21T08:34:23.7649363Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=16 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, 
tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:34:23.7653796Z disable_threading: false, 2026-02-21T08:34:23.7653960Z verify_each: true 2026-02-21T08:34:23.7654120Z } 2026-02-21T08:34:23.7654256Z } 2026-02-21T08:34:23.7654395Z #-} 2026-02-21T08:34:23.7655001Z /tmp/torchinductor_root/by/cbyrkaomiylxwyh72dcm7phdezoaifof2wfpdtzmnlfnjxf2eehc.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:34:23.7656282Z /tmp/torchinductor_root/by/cbyrkaomiylxwyh72dcm7phdezoaifof2wfpdtzmnlfnjxf2eehc.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:34:23.7657329Z [44s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:34:23.7658491Z Config: @helion.kernel(config=helion.Config(block_sizes=[4096, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=256, num_sm_multiplier=64, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[4, 4], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:34:23.7659513Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:34:23.7659798Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:34:23.7891157Z module attributes {ttg.maxnreg = 128 : i32} { 2026-02-21T08:34:23.7896135Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:34:23.7900288Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:34:23.7904212Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:34:23.7906384Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:34:23.7906702Z %cst = arith.constant dense<0.000000e+00> : tensor<64x1024xf32> 2026-02-21T08:34:23.7906948Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:34:23.7911392Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:34:23.7915356Z %c65536_i32 = arith.constant 65536 : i32 2026-02-21T08:34:23.7919492Z %c65536_i64 = arith.constant 65536 : i64 2026-02-21T08:34:23.7921198Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:34:23.7921579Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c65536_i32], [%c65536_i64, %c1_i64] : , > 2026-02-21T08:34:23.7922084Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c65536_i32], [%c65536_i64, %c1_i64] : , > 2026-02-21T08:34:23.7922411Z %2 = tt.get_program_id x : i32 2026-02-21T08:34:23.7922605Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:34:23.7922790Z %4 = 
arith.minsi %3, %c64_i32 : i32 2026-02-21T08:34:23.7923002Z scf.for %arg5 = %2 to %4 step %c1_i32 : i32 { 2026-02-21T08:34:23.7923202Z %5 = arith.muli %arg5, %c64_i32 : i32 2026-02-21T08:34:23.7923433Z %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:34:23.7923672Z %7 = tt.splat %5 : i32 -> tensor<64xi32> 2026-02-21T08:34:23.7923879Z %8 = arith.addi %7, %6 : tensor<64xi32> 2026-02-21T08:34:23.7924426Z %9 = scf.for %arg6 = %c0_i32 to %c65536_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<64x1024xf32>) : i32 { 2026-02-21T08:34:23.7924842Z %13 = tt.descriptor_load %0[%5, %arg6] : !tt.tensordesc> -> tensor<64x1024xf32> 2026-02-21T08:34:23.7925216Z %14 = tt.descriptor_load %1[%5, %arg6] : !tt.tensordesc> -> tensor<64x1024xf32> 2026-02-21T08:34:23.7925506Z %15 = scf.if %arg3 -> (tensor<64x1024xf32>) { 2026-02-21T08:34:23.7925885Z %17 = tt.extern_elementwise %14 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<64x1024xf32>) -> tensor<64x1024xf32> 2026-02-21T08:34:23.7926261Z %18 = arith.subf %14, %13 : tensor<64x1024xf32> 2026-02-21T08:34:23.7926476Z %19 = arith.mulf %17, %18 : tensor<64x1024xf32> 2026-02-21T08:34:23.7926690Z %20 = arith.addf %19, %cst : tensor<64x1024xf32> 2026-02-21T08:34:23.7926961Z scf.yield %20 : tensor<64x1024xf32> 2026-02-21T08:34:23.7927144Z } else { 2026-02-21T08:34:23.7927309Z %17 = tt.splat %arg4 : f32 -> tensor<64x1024xf32> 2026-02-21T08:34:23.7927537Z %18 = arith.cmpf ogt, %14, %17 : tensor<64x1024xf32> 2026-02-21T08:34:23.7927756Z %19 = arith.cmpf une, %14, %14 : tensor<64x1024xf32> 2026-02-21T08:34:23.7927974Z %20 = arith.ori %18, %19 : tensor<64x1024xi1> 2026-02-21T08:34:23.7928221Z %21 = arith.select %20, %14, %17 : tensor<64x1024xi1>, tensor<64x1024xf32> 2026-02-21T08:34:23.7928479Z %22 = math.log %21 : tensor<64x1024xf32> 2026-02-21T08:34:23.7928678Z %23 = arith.subf %22, %13 : tensor<64x1024xf32> 2026-02-21T08:34:23.7928872Z %24 = arith.mulf %14, %23 : 
tensor<64x1024xf32> 2026-02-21T08:34:23.7929077Z %25 = arith.addf %24, %cst : tensor<64x1024xf32> 2026-02-21T08:34:23.7929275Z scf.yield %25 : tensor<64x1024xf32> 2026-02-21T08:34:23.7929438Z } 2026-02-21T08:34:23.7929590Z %16 = arith.addf %arg7, %15 : tensor<64x1024xf32> 2026-02-21T08:34:23.7929782Z scf.yield %16 : tensor<64x1024xf32> 2026-02-21T08:34:23.7930061Z } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:34:23.7930345Z %10 = "tt.reduce"(%9) <{axis = 1 : i32}> ({ 2026-02-21T08:34:23.7930535Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:34:23.7930712Z %13 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:34:23.7930887Z tt.reduce.return %13 : f32 2026-02-21T08:34:23.7931071Z }) : (tensor<64x1024xf32>) -> tensor<64xf32> 2026-02-21T08:34:23.7931295Z %11 = tt.splat %arg2 : !tt.ptr -> tensor<64x!tt.ptr> 2026-02-21T08:34:23.7931552Z %12 = tt.addptr %11, %8 : tensor<64x!tt.ptr>, tensor<64xi32> 2026-02-21T08:34:23.7931773Z tt.store %12, %10 : tensor<64x!tt.ptr> 2026-02-21T08:34:23.7932017Z } {tt.loop_unroll_factor = 1 : i32} 2026-02-21T08:34:23.7932180Z tt.return 2026-02-21T08:34:23.7932314Z } 2026-02-21T08:34:23.7932434Z } 2026-02-21T08:34:23.7932500Z 2026-02-21T08:34:23.7932551Z {-# 2026-02-21T08:34:23.7932684Z external_resources: { 2026-02-21T08:34:23.7932835Z mlir_reproducer: { 2026-02-21T08:34:23.7937081Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false 
top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=1}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=1}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=1}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:34:23.7941531Z disable_threading: false, 2026-02-21T08:34:23.7941717Z verify_each: true 2026-02-21T08:34:23.7941900Z } 2026-02-21T08:34:23.7942012Z } 2026-02-21T08:34:23.7942127Z #-} 2026-02-21T08:34:23.7942545Z /tmp/torchinductor_root/4b/c4bilkrdypptccswjo2372mrikzya4kjnax776j2kcpz7owgfsfp.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:34:23.7943693Z /tmp/torchinductor_root/4b/c4bilkrdypptccswjo2372mrikzya4kjnax776j2kcpz7owgfsfp.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:34:23.7944643Z [44s] Triton compile failed. 
This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:34:23.7945725Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 64], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], maxnreg=128, num_sm_multiplier=2, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, False], range_num_stages=[0, 2], range_unroll_factors=[1, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:34:23.7946707Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:34:23.7946958Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:34:24.7708492Z module attributes {ttg.maxnreg = 32 : i32} { 2026-02-21T08:34:24.7710190Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:34:24.7710880Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:34:24.7711073Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:34:24.7711300Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:34:24.7711497Z %c2368_i32 = arith.constant 2368 : i32 2026-02-21T08:34:24.7711712Z %cst = arith.constant dense<65536> : tensor<4x1xi32> 2026-02-21T08:34:24.7712204Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<4x16xf32> 2026-02-21T08:34:24.7712435Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:34:24.7712617Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:34:24.7712801Z %c65536_i32 = arith.constant 65536 : i32 2026-02-21T08:34:24.7712991Z %c65536_i64 = arith.constant 65536 : i64 2026-02-21T08:34:24.7713164Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:34:24.7713482Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c65536_i32], [%c65536_i64, %c1_i64] : , > 2026-02-21T08:34:24.7713798Z %1 = tt.get_program_id x : 
i32 2026-02-21T08:34:24.7713976Z %2 = arith.subi %c1024_i32, %1 : i32 2026-02-21T08:34:24.7714148Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:34:24.7714337Z %3 = arith.subi %c2368_i32, %c1_i32 : i32 2026-02-21T08:34:24.7714885Z %4 = arith.addi %2, %3 : i32 2026-02-21T08:34:24.7715058Z %5 = arith.divui %4, %c2368_i32 : i32 2026-02-21T08:34:24.7715246Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:34:24.7715421Z %6 = arith.remsi %5, %c3_i32 : i32 2026-02-21T08:34:24.7715603Z %7 = arith.subi %5, %6 : i32 2026-02-21T08:34:24.7715767Z %8 = arith.muli %7, %c2368_i32 : i32 2026-02-21T08:34:24.7715950Z %9 = arith.addi %1, %8 : i32 2026-02-21T08:34:24.7716121Z %10 = arith.muli %c2368_i32, %c3_i32 : i32 2026-02-21T08:34:24.7716326Z scf.for %arg5 = %1 to %9 step %10 : i32 { 2026-02-21T08:34:24.7716525Z %11 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:34:24.7716749Z %12 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:34:24.7717007Z %13 = tt.splat %11 : i32 -> tensor<4xi32> 2026-02-21T08:34:24.7717198Z %14 = arith.addi %13, %12 : tensor<4xi32> 2026-02-21T08:34:24.7717617Z %15 = scf.for %arg6 = %c0_i32 to %c65536_i32 step %c16_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x16xf32>) : i32 { 2026-02-21T08:34:24.7717973Z %39 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:34:24.7718227Z %40 = tt.splat %arg6 : i32 -> tensor<16xi32> 2026-02-21T08:34:24.7718432Z %41 = arith.addi %40, %39 : tensor<16xi32> 2026-02-21T08:34:24.7718678Z %42 = tt.expand_dims %14 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:34:24.7718940Z %43 = arith.muli %42, %cst : tensor<4x1xi32> 2026-02-21T08:34:24.7719186Z %44 = tt.expand_dims %41 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:34:24.7719481Z %45 = tt.broadcast %43 : tensor<4x1xi32> -> tensor<4x16xi32> 2026-02-21T08:34:24.7719735Z %46 = tt.broadcast %44 : tensor<1x16xi32> -> tensor<4x16xi32> 2026-02-21T08:34:24.7719971Z %47 = arith.addi %45, %46 
: tensor<4x16xi32> 2026-02-21T08:34:24.7720203Z %48 = tt.splat %arg0 : !tt.ptr -> tensor<4x16x!tt.ptr> 2026-02-21T08:34:24.7720508Z %49 = tt.addptr %48, %47 : tensor<4x16x!tt.ptr>, tensor<4x16xi32> 2026-02-21T08:34:24.7720787Z %50 = tt.load %49 evictionPolicy = evict_last : tensor<4x16x!tt.ptr> 2026-02-21T08:34:24.7721124Z %51 = tt.descriptor_load %0[%11, %arg6] : !tt.tensordesc> -> tensor<4x16xf32> 2026-02-21T08:34:24.7721410Z %52 = scf.if %arg3 -> (tensor<4x16xf32>) { 2026-02-21T08:34:24.7721760Z %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x16xf32>) -> tensor<4x16xf32> 2026-02-21T08:34:24.7722158Z %55 = arith.subf %51, %50 : tensor<4x16xf32> 2026-02-21T08:34:24.7722354Z %56 = arith.mulf %54, %55 : tensor<4x16xf32> 2026-02-21T08:34:24.7722568Z %57 = arith.addf %56, %cst_0 : tensor<4x16xf32> 2026-02-21T08:34:24.7722765Z scf.yield %57 : tensor<4x16xf32> 2026-02-21T08:34:24.7722941Z } else { 2026-02-21T08:34:24.7723110Z %54 = tt.splat %arg4 : f32 -> tensor<4x16xf32> 2026-02-21T08:34:24.7723321Z %55 = arith.cmpf ogt, %51, %54 : tensor<4x16xf32> 2026-02-21T08:34:24.7723544Z %56 = arith.cmpf une, %51, %51 : tensor<4x16xf32> 2026-02-21T08:34:24.7723745Z %57 = arith.ori %55, %56 : tensor<4x16xi1> 2026-02-21T08:34:24.7723983Z %58 = arith.select %57, %51, %54 : tensor<4x16xi1>, tensor<4x16xf32> 2026-02-21T08:34:24.7724217Z %59 = math.log %58 : tensor<4x16xf32> 2026-02-21T08:34:24.7724412Z %60 = arith.subf %59, %50 : tensor<4x16xf32> 2026-02-21T08:34:24.7724610Z %61 = arith.mulf %51, %60 : tensor<4x16xf32> 2026-02-21T08:34:24.7724808Z %62 = arith.addf %61, %cst_0 : tensor<4x16xf32> 2026-02-21T08:34:24.7725005Z scf.yield %62 : tensor<4x16xf32> 2026-02-21T08:34:24.7725167Z } 2026-02-21T08:34:24.7725312Z %53 = arith.addf %arg7, %52 : tensor<4x16xf32> 2026-02-21T08:34:24.7725500Z scf.yield %53 : tensor<4x16xf32> 2026-02-21T08:34:24.7725819Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, 
tt.warp_specialize} 2026-02-21T08:34:24.7726073Z %16 = "tt.reduce"(%15) <{axis = 1 : i32}> ({ 2026-02-21T08:34:24.7726260Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:34:24.7726443Z %39 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:34:24.7726629Z tt.reduce.return %39 : f32 2026-02-21T08:34:24.7726821Z }) : (tensor<4x16xf32>) -> tensor<4xf32> 2026-02-21T08:34:24.7727047Z %17 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:34:24.7727312Z %18 = tt.addptr %17, %14 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:34:24.7727545Z tt.store %18, %16 : tensor<4x!tt.ptr> 2026-02-21T08:34:24.7727748Z %c1_i32_1 = arith.constant 1 : i32 2026-02-21T08:34:24.7727945Z %19 = arith.muli %c2368_i32, %c1_i32_1 : i32 2026-02-21T08:34:24.7728195Z %20 = arith.addi %arg5, %19 : i32 2026-02-21T08:34:24.7728387Z %21 = arith.muli %20, %c4_i32 : i32 2026-02-21T08:34:24.7728608Z %22 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:34:24.7728855Z %23 = tt.splat %21 : i32 -> tensor<4xi32> 2026-02-21T08:34:24.7729047Z %24 = arith.addi %23, %22 : tensor<4xi32> 2026-02-21T08:34:24.7729369Z %25 = scf.for %arg6 = %c0_i32 to %c65536_i32 step %c16_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x16xf32>) : i32 { 2026-02-21T08:34:24.7729744Z %39 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:34:24.7729995Z %40 = tt.splat %arg6 : i32 -> tensor<16xi32> 2026-02-21T08:34:24.7730207Z %41 = arith.addi %40, %39 : tensor<16xi32> 2026-02-21T08:34:24.7730455Z %42 = tt.expand_dims %24 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:34:24.7730723Z %43 = arith.muli %42, %cst : tensor<4x1xi32> 2026-02-21T08:34:24.7730978Z %44 = tt.expand_dims %41 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:34:24.7731284Z %45 = tt.broadcast %43 : tensor<4x1xi32> -> tensor<4x16xi32> 2026-02-21T08:34:24.7731554Z %46 = tt.broadcast %44 : tensor<1x16xi32> -> tensor<4x16xi32> 2026-02-21T08:34:24.7731781Z %47 = arith.addi %45, 
%46 : tensor<4x16xi32> 2026-02-21T08:34:24.7732049Z %48 = tt.splat %arg0 : !tt.ptr -> tensor<4x16x!tt.ptr> 2026-02-21T08:34:24.7732318Z %49 = tt.addptr %48, %47 : tensor<4x16x!tt.ptr>, tensor<4x16xi32> 2026-02-21T08:34:24.7732615Z %50 = tt.load %49 evictionPolicy = evict_last : tensor<4x16x!tt.ptr> 2026-02-21T08:34:24.7732957Z %51 = tt.descriptor_load %0[%21, %arg6] : !tt.tensordesc> -> tensor<4x16xf32> 2026-02-21T08:34:24.7733257Z %52 = scf.if %arg3 -> (tensor<4x16xf32>) { 2026-02-21T08:34:24.7733632Z %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x16xf32>) -> tensor<4x16xf32> 2026-02-21T08:34:24.7734007Z %55 = arith.subf %51, %50 : tensor<4x16xf32> 2026-02-21T08:34:24.7734220Z %56 = arith.mulf %54, %55 : tensor<4x16xf32> 2026-02-21T08:34:24.7734431Z %57 = arith.addf %56, %cst_0 : tensor<4x16xf32> 2026-02-21T08:34:24.7734641Z scf.yield %57 : tensor<4x16xf32> 2026-02-21T08:34:24.7734812Z } else { 2026-02-21T08:34:24.7734983Z %54 = tt.splat %arg4 : f32 -> tensor<4x16xf32> 2026-02-21T08:34:24.7735212Z %55 = arith.cmpf ogt, %51, %54 : tensor<4x16xf32> 2026-02-21T08:34:24.7735423Z %56 = arith.cmpf une, %51, %51 : tensor<4x16xf32> 2026-02-21T08:34:24.7735632Z %57 = arith.ori %55, %56 : tensor<4x16xi1> 2026-02-21T08:34:24.7735860Z %58 = arith.select %57, %51, %54 : tensor<4x16xi1>, tensor<4x16xf32> 2026-02-21T08:34:24.7736095Z %59 = math.log %58 : tensor<4x16xf32> 2026-02-21T08:34:24.7736280Z %60 = arith.subf %59, %50 : tensor<4x16xf32> 2026-02-21T08:34:24.7736479Z %61 = arith.mulf %51, %60 : tensor<4x16xf32> 2026-02-21T08:34:24.7736745Z %62 = arith.addf %61, %cst_0 : tensor<4x16xf32> 2026-02-21T08:34:24.7736931Z scf.yield %62 : tensor<4x16xf32> 2026-02-21T08:34:24.7737099Z } 2026-02-21T08:34:24.7737237Z %53 = arith.addf %arg7, %52 : tensor<4x16xf32> 2026-02-21T08:34:24.7737427Z scf.yield %53 : tensor<4x16xf32> 2026-02-21T08:34:24.7737663Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, 
tt.warp_specialize} 2026-02-21T08:34:24.7737923Z %26 = "tt.reduce"(%25) <{axis = 1 : i32}> ({ 2026-02-21T08:34:24.7738110Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:34:24.7738279Z %39 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:34:24.7738463Z tt.reduce.return %39 : f32 2026-02-21T08:34:24.7738641Z }) : (tensor<4x16xf32>) -> tensor<4xf32> 2026-02-21T08:34:24.7738863Z %27 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:34:24.7739167Z %28 = tt.addptr %27, %24 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:34:24.7739400Z tt.store %28, %26 : tensor<4x!tt.ptr> 2026-02-21T08:34:24.7739588Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:34:24.7739770Z %29 = arith.muli %c2368_i32, %c2_i32 : i32 2026-02-21T08:34:24.7739957Z %30 = arith.addi %arg5, %29 : i32 2026-02-21T08:34:24.7740126Z %31 = arith.muli %30, %c4_i32 : i32 2026-02-21T08:34:24.7740345Z %32 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:34:24.7740576Z %33 = tt.splat %31 : i32 -> tensor<4xi32> 2026-02-21T08:34:24.7740768Z %34 = arith.addi %33, %32 : tensor<4xi32> 2026-02-21T08:34:24.7741067Z %35 = scf.for %arg6 = %c0_i32 to %c65536_i32 step %c16_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x16xf32>) : i32 { 2026-02-21T08:34:24.7741423Z %39 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:34:24.7741673Z %40 = tt.splat %arg6 : i32 -> tensor<16xi32> 2026-02-21T08:34:24.7741925Z %41 = arith.addi %40, %39 : tensor<16xi32> 2026-02-21T08:34:24.7742177Z %42 = tt.expand_dims %34 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:34:24.7742421Z %43 = arith.muli %42, %cst : tensor<4x1xi32> 2026-02-21T08:34:24.7742662Z %44 = tt.expand_dims %41 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:34:24.7742927Z %45 = tt.broadcast %43 : tensor<4x1xi32> -> tensor<4x16xi32> 2026-02-21T08:34:24.7743173Z %46 = tt.broadcast %44 : tensor<1x16xi32> -> tensor<4x16xi32> 2026-02-21T08:34:24.7743393Z %47 = arith.addi %45, %46 : 
tensor<4x16xi32> 2026-02-21T08:34:24.7743610Z %48 = tt.splat %arg0 : !tt.ptr -> tensor<4x16x!tt.ptr> 2026-02-21T08:34:24.7743871Z %49 = tt.addptr %48, %47 : tensor<4x16x!tt.ptr>, tensor<4x16xi32> 2026-02-21T08:34:24.7744143Z %50 = tt.load %49 evictionPolicy = evict_last : tensor<4x16x!tt.ptr> 2026-02-21T08:34:24.7744470Z %51 = tt.descriptor_load %0[%31, %arg6] : !tt.tensordesc> -> tensor<4x16xf32> 2026-02-21T08:34:24.7744756Z %52 = scf.if %arg3 -> (tensor<4x16xf32>) { 2026-02-21T08:34:24.7745106Z %54 = tt.extern_elementwise %51 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x16xf32>) -> tensor<4x16xf32> 2026-02-21T08:34:24.7745465Z %55 = arith.subf %51, %50 : tensor<4x16xf32> 2026-02-21T08:34:24.7745662Z %56 = arith.mulf %54, %55 : tensor<4x16xf32> 2026-02-21T08:34:24.7745871Z %57 = arith.addf %56, %cst_0 : tensor<4x16xf32> 2026-02-21T08:34:24.7746063Z scf.yield %57 : tensor<4x16xf32> 2026-02-21T08:34:24.7746236Z } else { 2026-02-21T08:34:24.7746399Z %54 = tt.splat %arg4 : f32 -> tensor<4x16xf32> 2026-02-21T08:34:24.7746609Z %55 = arith.cmpf ogt, %51, %54 : tensor<4x16xf32> 2026-02-21T08:34:24.7746826Z %56 = arith.cmpf une, %51, %51 : tensor<4x16xf32> 2026-02-21T08:34:24.7747028Z %57 = arith.ori %55, %56 : tensor<4x16xi1> 2026-02-21T08:34:24.7747319Z %58 = arith.select %57, %51, %54 : tensor<4x16xi1>, tensor<4x16xf32> 2026-02-21T08:34:24.7747547Z %59 = math.log %58 : tensor<4x16xf32> 2026-02-21T08:34:24.7747744Z %60 = arith.subf %59, %50 : tensor<4x16xf32> 2026-02-21T08:34:24.7747941Z %61 = arith.mulf %51, %60 : tensor<4x16xf32> 2026-02-21T08:34:24.7748138Z %62 = arith.addf %61, %cst_0 : tensor<4x16xf32> 2026-02-21T08:34:24.7748333Z scf.yield %62 : tensor<4x16xf32> 2026-02-21T08:34:24.7748491Z } 2026-02-21T08:34:24.7748636Z %53 = arith.addf %arg7, %52 : tensor<4x16xf32> 2026-02-21T08:34:24.7748818Z scf.yield %53 : tensor<4x16xf32> 2026-02-21T08:34:24.7749063Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, 
tt.warp_specialize} 2026-02-21T08:34:24.7749317Z %36 = "tt.reduce"(%35) <{axis = 1 : i32}> ({ 2026-02-21T08:34:24.7749506Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:34:24.7749764Z %39 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:34:24.7749945Z tt.reduce.return %39 : f32 2026-02-21T08:34:24.7750125Z }) : (tensor<4x16xf32>) -> tensor<4xf32> 2026-02-21T08:34:24.7750342Z %37 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:34:24.7750595Z %38 = tt.addptr %37, %34 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:34:24.7750816Z tt.store %38, %36 : tensor<4x!tt.ptr> 2026-02-21T08:34:24.7751005Z } {tt.num_stages = 1 : i32} 2026-02-21T08:34:24.7751204Z scf.for %arg5 = %9 to %c1024_i32 step %c2368_i32 : i32 { 2026-02-21T08:34:24.7751424Z %11 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:34:24.7751678Z %12 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:34:24.7751952Z %13 = tt.splat %11 : i32 -> tensor<4xi32> 2026-02-21T08:34:24.7752137Z %14 = arith.addi %13, %12 : tensor<4xi32> 2026-02-21T08:34:24.7752440Z %15 = scf.for %arg6 = %c0_i32 to %c65536_i32 step %c16_i32 iter_args(%arg7 = %cst_0) -> (tensor<4x16xf32>) : i32 { 2026-02-21T08:34:24.7752780Z %19 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:34:24.7753025Z %20 = tt.splat %arg6 : i32 -> tensor<16xi32> 2026-02-21T08:34:24.7753225Z %21 = arith.addi %20, %19 : tensor<16xi32> 2026-02-21T08:34:24.7753457Z %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<4xi32> -> tensor<4x1xi32> 2026-02-21T08:34:24.7753707Z %23 = arith.muli %22, %cst : tensor<4x1xi32> 2026-02-21T08:34:24.7753943Z %24 = tt.expand_dims %21 {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> 2026-02-21T08:34:24.7754219Z %25 = tt.broadcast %23 : tensor<4x1xi32> -> tensor<4x16xi32> 2026-02-21T08:34:24.7754463Z %26 = tt.broadcast %24 : tensor<1x16xi32> -> tensor<4x16xi32> 2026-02-21T08:34:24.7754693Z %27 = arith.addi %25, %26 : tensor<4x16xi32> 2026-02-21T08:34:24.7754924Z %28 = 
tt.splat %arg0 : !tt.ptr -> tensor<4x16x!tt.ptr> 2026-02-21T08:34:24.7755181Z %29 = tt.addptr %28, %27 : tensor<4x16x!tt.ptr>, tensor<4x16xi32> 2026-02-21T08:34:24.7755468Z %30 = tt.load %29 evictionPolicy = evict_last : tensor<4x16x!tt.ptr> 2026-02-21T08:34:24.7755787Z %31 = tt.descriptor_load %0[%11, %arg6] : !tt.tensordesc> -> tensor<4x16xf32> 2026-02-21T08:34:24.7756071Z %32 = scf.if %arg3 -> (tensor<4x16xf32>) { 2026-02-21T08:34:24.7756417Z %34 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x16xf32>) -> tensor<4x16xf32> 2026-02-21T08:34:24.7756776Z %35 = arith.subf %31, %30 : tensor<4x16xf32> 2026-02-21T08:34:24.7756979Z %36 = arith.mulf %34, %35 : tensor<4x16xf32> 2026-02-21T08:34:24.7757176Z %37 = arith.addf %36, %cst_0 : tensor<4x16xf32> 2026-02-21T08:34:24.7757374Z scf.yield %37 : tensor<4x16xf32> 2026-02-21T08:34:24.7757535Z } else { 2026-02-21T08:34:24.7757700Z %34 = tt.splat %arg4 : f32 -> tensor<4x16xf32> 2026-02-21T08:34:24.7757964Z %35 = arith.cmpf ogt, %31, %34 : tensor<4x16xf32> 2026-02-21T08:34:24.7758180Z %36 = arith.cmpf une, %31, %31 : tensor<4x16xf32> 2026-02-21T08:34:24.7758387Z %37 = arith.ori %35, %36 : tensor<4x16xi1> 2026-02-21T08:34:24.7758612Z %38 = arith.select %37, %31, %34 : tensor<4x16xi1>, tensor<4x16xf32> 2026-02-21T08:34:24.7758863Z %39 = math.log %38 : tensor<4x16xf32> 2026-02-21T08:34:24.7759046Z %40 = arith.subf %39, %30 : tensor<4x16xf32> 2026-02-21T08:34:24.7759244Z %41 = arith.mulf %31, %40 : tensor<4x16xf32> 2026-02-21T08:34:24.7759448Z %42 = arith.addf %41, %cst_0 : tensor<4x16xf32> 2026-02-21T08:34:24.7759638Z scf.yield %42 : tensor<4x16xf32> 2026-02-21T08:34:24.7759808Z } 2026-02-21T08:34:24.7759945Z %33 = arith.addf %arg7, %32 : tensor<4x16xf32> 2026-02-21T08:34:24.7760190Z scf.yield %33 : tensor<4x16xf32> 2026-02-21T08:34:24.7760431Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:34:24.7760693Z %16 = 
"tt.reduce"(%15) <{axis = 1 : i32}> ({ 2026-02-21T08:34:24.7760880Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:34:24.7761049Z %19 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:34:24.7761230Z tt.reduce.return %19 : f32 2026-02-21T08:34:24.7761405Z }) : (tensor<4x16xf32>) -> tensor<4xf32> 2026-02-21T08:34:24.7761625Z %17 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:34:24.7761919Z %18 = tt.addptr %17, %14 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:34:24.7762156Z tt.store %18, %16 : tensor<4x!tt.ptr> 2026-02-21T08:34:24.7762339Z } {tt.num_stages = 1 : i32} 2026-02-21T08:34:24.7762501Z tt.return 2026-02-21T08:34:24.7762635Z } 2026-02-21T08:34:24.7762754Z } 2026-02-21T08:34:24.7762825Z 2026-02-21T08:34:24.7762885Z {-# 2026-02-21T08:34:24.7763010Z external_resources: { 2026-02-21T08:34:24.7763172Z mlir_reproducer: { 2026-02-21T08:34:24.7767372Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 
region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:34:24.7771644Z disable_threading: false, 2026-02-21T08:34:24.7771835Z verify_each: true 2026-02-21T08:34:24.7772025Z } 2026-02-21T08:34:24.7772170Z } 2026-02-21T08:34:24.7772373Z #-} 2026-02-21T08:34:24.7772866Z /tmp/torchinductor_root/hg/chgh4z3oq5aytli2qovqw2k2dzxgymzdxa2ewtfhxxv62w56enwr.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:34:24.7774123Z /tmp/torchinductor_root/hg/chgh4z3oq5aytli2qovqw2k2dzxgymzdxa2ewtfhxxv62w56enwr.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:34:24.7775156Z [45s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:34:24.7776416Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 4], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=32, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[3, 2], range_unroll_factors=[3, 1], range_warp_specializes=[False, True]), static_shapes=True) 2026-02-21T08:34:24.7777441Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:34:24.7777686Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:34:30.0213438Z module attributes {ttg.maxnreg = 256 : i32} { 2026-02-21T08:34:30.0218518Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:34:30.0219145Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:34:30.0219353Z %c64_i32 = arith.constant 64 : i32 2026-02-21T08:34:30.0219544Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:34:30.0219710Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:34:30.0219954Z %cst = arith.constant dense<65536> : tensor<1024x1xi32> 2026-02-21T08:34:30.0220225Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<1024x64xf32> 2026-02-21T08:34:30.0220462Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:34:30.0220639Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:34:30.0220826Z %c65536_i32 = arith.constant 65536 : i32 2026-02-21T08:34:30.0221011Z %c65536_i64 = arith.constant 65536 : i64 2026-02-21T08:34:30.0221184Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:34:30.0221499Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c65536_i32], [%c65536_i64, %c1_i64] : , > 2026-02-21T08:34:30.0221808Z %1 = tt.get_program_id x : i32 2026-02-21T08:34:30.0222231Z %2 = arith.addi %1, %c1_i32 : i32 
2026-02-21T08:34:30.0222406Z %3 = arith.minsi %2, %c4_i32 : i32 2026-02-21T08:34:30.0222608Z scf.for %arg5 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T08:34:30.0222810Z %4 = arith.muli %arg5, %c1024_i32 : i32 2026-02-21T08:34:30.0223060Z %5 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> 2026-02-21T08:34:30.0223328Z %6 = tt.splat %4 : i32 -> tensor<1024xi32> 2026-02-21T08:34:30.0223518Z %7 = arith.addi %6, %5 : tensor<1024xi32> 2026-02-21T08:34:30.0223741Z %c65472_i32 = arith.constant 65472 : i32 2026-02-21T08:34:30.0223927Z %c192_i32 = arith.constant 192 : i32 2026-02-21T08:34:30.0224244Z %8 = scf.for %arg6 = %c0_i32 to %c65472_i32 step %c192_i32 iter_args(%arg7 = %cst_0) -> (tensor<1024x64xf32>) : i32 { 2026-02-21T08:34:30.0224608Z %27 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:34:30.0224861Z %28 = tt.splat %arg6 : i32 -> tensor<64xi32> 2026-02-21T08:34:30.0225062Z %29 = arith.addi %28, %27 : tensor<64xi32> 2026-02-21T08:34:30.0225322Z %30 = tt.expand_dims %7 {axis = 1 : i32} : tensor<1024xi32> -> tensor<1024x1xi32> 2026-02-21T08:34:30.0225585Z %31 = arith.muli %30, %cst : tensor<1024x1xi32> 2026-02-21T08:34:30.0225844Z %32 = tt.expand_dims %29 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T08:34:30.0226490Z %33 = tt.broadcast %31 : tensor<1024x1xi32> -> tensor<1024x64xi32> 2026-02-21T08:34:30.0226757Z %34 = tt.broadcast %32 : tensor<1x64xi32> -> tensor<1024x64xi32> 2026-02-21T08:34:30.0226998Z %35 = arith.addi %33, %34 : tensor<1024x64xi32> 2026-02-21T08:34:30.0227230Z %36 = tt.splat %arg0 : !tt.ptr -> tensor<1024x64x!tt.ptr> 2026-02-21T08:34:30.0227510Z %37 = tt.addptr %36, %35 : tensor<1024x64x!tt.ptr>, tensor<1024x64xi32> 2026-02-21T08:34:30.0227808Z %38 = tt.load %37 evictionPolicy = evict_first : tensor<1024x64x!tt.ptr> 2026-02-21T08:34:30.0228163Z %39 = tt.descriptor_load %0[%4, %arg6] : !tt.tensordesc> -> tensor<1024x64xf32> 2026-02-21T08:34:30.0228465Z %40 = scf.if %arg3 -> 
(tensor<1024x64xf32>) { 2026-02-21T08:34:30.0228935Z %76 = tt.extern_elementwise %39 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1024x64xf32>) -> tensor<1024x64xf32> 2026-02-21T08:34:30.0229323Z %77 = arith.subf %39, %38 : tensor<1024x64xf32> 2026-02-21T08:34:30.0229527Z %78 = arith.mulf %76, %77 : tensor<1024x64xf32> 2026-02-21T08:34:30.0229754Z %79 = arith.addf %78, %cst_0 : tensor<1024x64xf32> 2026-02-21T08:34:30.0229963Z scf.yield %79 : tensor<1024x64xf32> 2026-02-21T08:34:30.0230149Z } else { 2026-02-21T08:34:30.0230330Z %76 = tt.splat %arg4 : f32 -> tensor<1024x64xf32> 2026-02-21T08:34:30.0230561Z %77 = arith.cmpf ogt, %39, %76 : tensor<1024x64xf32> 2026-02-21T08:34:30.0230801Z %78 = arith.cmpf une, %39, %39 : tensor<1024x64xf32> 2026-02-21T08:34:30.0231023Z %79 = arith.ori %77, %78 : tensor<1024x64xi1> 2026-02-21T08:34:30.0231277Z %80 = arith.select %79, %39, %76 : tensor<1024x64xi1>, tensor<1024x64xf32> 2026-02-21T08:34:30.0231524Z %81 = math.log %80 : tensor<1024x64xf32> 2026-02-21T08:34:30.0231736Z %82 = arith.subf %81, %38 : tensor<1024x64xf32> 2026-02-21T08:34:30.0231986Z %83 = arith.mulf %39, %82 : tensor<1024x64xf32> 2026-02-21T08:34:30.0232191Z %84 = arith.addf %83, %cst_0 : tensor<1024x64xf32> 2026-02-21T08:34:30.0232397Z scf.yield %84 : tensor<1024x64xf32> 2026-02-21T08:34:30.0232565Z } 2026-02-21T08:34:30.0232722Z %41 = arith.addf %arg7, %40 : tensor<1024x64xf32> 2026-02-21T08:34:30.0232919Z %c1_i32_1 = arith.constant 1 : i32 2026-02-21T08:34:30.0233110Z %42 = arith.muli %c64_i32, %c1_i32_1 : i32 2026-02-21T08:34:30.0233303Z %43 = arith.addi %arg6, %42 : i32 2026-02-21T08:34:30.0233522Z %44 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:34:30.0233763Z %45 = tt.splat %43 : i32 -> tensor<64xi32> 2026-02-21T08:34:30.0233953Z %46 = arith.addi %45, %44 : tensor<64xi32> 2026-02-21T08:34:30.0234208Z %47 = tt.expand_dims %7 {axis = 1 : i32} : tensor<1024xi32> -> tensor<1024x1xi32> 
2026-02-21T08:34:30.0234481Z %48 = arith.muli %47, %cst : tensor<1024x1xi32> 2026-02-21T08:34:30.0234745Z %49 = tt.expand_dims %46 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T08:34:30.0235044Z %50 = tt.broadcast %48 : tensor<1024x1xi32> -> tensor<1024x64xi32> 2026-02-21T08:34:30.0235317Z %51 = tt.broadcast %49 : tensor<1x64xi32> -> tensor<1024x64xi32> 2026-02-21T08:34:30.0235561Z %52 = arith.addi %50, %51 : tensor<1024x64xi32> 2026-02-21T08:34:30.0235797Z %53 = tt.splat %arg0 : !tt.ptr -> tensor<1024x64x!tt.ptr> 2026-02-21T08:34:30.0236086Z %54 = tt.addptr %53, %52 : tensor<1024x64x!tt.ptr>, tensor<1024x64xi32> 2026-02-21T08:34:30.0236391Z %55 = tt.load %54 evictionPolicy = evict_first : tensor<1024x64x!tt.ptr> 2026-02-21T08:34:30.0236752Z %56 = tt.descriptor_load %0[%4, %43] : !tt.tensordesc> -> tensor<1024x64xf32> 2026-02-21T08:34:30.0237164Z %57 = scf.if %arg3 -> (tensor<1024x64xf32>) { 2026-02-21T08:34:30.0237540Z %76 = tt.extern_elementwise %56 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1024x64xf32>) -> tensor<1024x64xf32> 2026-02-21T08:34:30.0237934Z %77 = arith.subf %56, %55 : tensor<1024x64xf32> 2026-02-21T08:34:30.0238149Z %78 = arith.mulf %76, %77 : tensor<1024x64xf32> 2026-02-21T08:34:30.0238376Z %79 = arith.addf %78, %cst_0 : tensor<1024x64xf32> 2026-02-21T08:34:30.0238596Z scf.yield %79 : tensor<1024x64xf32> 2026-02-21T08:34:30.0238778Z } else { 2026-02-21T08:34:30.0238959Z %76 = tt.splat %arg4 : f32 -> tensor<1024x64xf32> 2026-02-21T08:34:30.0239191Z %77 = arith.cmpf ogt, %56, %76 : tensor<1024x64xf32> 2026-02-21T08:34:30.0239428Z %78 = arith.cmpf une, %56, %56 : tensor<1024x64xf32> 2026-02-21T08:34:30.0239648Z %79 = arith.ori %77, %78 : tensor<1024x64xi1> 2026-02-21T08:34:30.0239968Z %80 = arith.select %79, %56, %76 : tensor<1024x64xi1>, tensor<1024x64xf32> 2026-02-21T08:34:30.0240228Z %81 = math.log %80 : tensor<1024x64xf32> 2026-02-21T08:34:30.0240430Z %82 = arith.subf %81, %55 : 
tensor<1024x64xf32> 2026-02-21T08:34:30.0240646Z %83 = arith.mulf %56, %82 : tensor<1024x64xf32> 2026-02-21T08:34:30.0240859Z %84 = arith.addf %83, %cst_0 : tensor<1024x64xf32> 2026-02-21T08:34:30.0241071Z scf.yield %84 : tensor<1024x64xf32> 2026-02-21T08:34:30.0241245Z } 2026-02-21T08:34:30.0241403Z %58 = arith.addf %41, %57 : tensor<1024x64xf32> 2026-02-21T08:34:30.0241607Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:34:30.0241811Z %59 = arith.muli %c64_i32, %c2_i32 : i32 2026-02-21T08:34:30.0242043Z %60 = arith.addi %arg6, %59 : i32 2026-02-21T08:34:30.0242271Z %61 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:34:30.0242526Z %62 = tt.splat %60 : i32 -> tensor<64xi32> 2026-02-21T08:34:30.0242725Z %63 = arith.addi %62, %61 : tensor<64xi32> 2026-02-21T08:34:30.0242982Z %64 = tt.expand_dims %7 {axis = 1 : i32} : tensor<1024xi32> -> tensor<1024x1xi32> 2026-02-21T08:34:30.0243247Z %65 = arith.muli %64, %cst : tensor<1024x1xi32> 2026-02-21T08:34:30.0243505Z %66 = tt.expand_dims %63 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T08:34:30.0243806Z %67 = tt.broadcast %65 : tensor<1024x1xi32> -> tensor<1024x64xi32> 2026-02-21T08:34:30.0244076Z %68 = tt.broadcast %66 : tensor<1x64xi32> -> tensor<1024x64xi32> 2026-02-21T08:34:30.0244312Z %69 = arith.addi %67, %68 : tensor<1024x64xi32> 2026-02-21T08:34:30.0244537Z %70 = tt.splat %arg0 : !tt.ptr -> tensor<1024x64x!tt.ptr> 2026-02-21T08:34:30.0244815Z %71 = tt.addptr %70, %69 : tensor<1024x64x!tt.ptr>, tensor<1024x64xi32> 2026-02-21T08:34:30.0245107Z %72 = tt.load %71 evictionPolicy = evict_first : tensor<1024x64x!tt.ptr> 2026-02-21T08:34:30.0245455Z %73 = tt.descriptor_load %0[%4, %60] : !tt.tensordesc> -> tensor<1024x64xf32> 2026-02-21T08:34:30.0245750Z %74 = scf.if %arg3 -> (tensor<1024x64xf32>) { 2026-02-21T08:34:30.0246105Z %76 = tt.extern_elementwise %73 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1024x64xf32>) -> tensor<1024x64xf32> 
2026-02-21T08:34:30.0246475Z %77 = arith.subf %73, %72 : tensor<1024x64xf32> 2026-02-21T08:34:30.0246678Z %78 = arith.mulf %76, %77 : tensor<1024x64xf32> 2026-02-21T08:34:30.0246892Z %79 = arith.addf %78, %cst_0 : tensor<1024x64xf32> 2026-02-21T08:34:30.0247097Z scf.yield %79 : tensor<1024x64xf32> 2026-02-21T08:34:30.0247264Z } else { 2026-02-21T08:34:30.0247426Z %76 = tt.splat %arg4 : f32 -> tensor<1024x64xf32> 2026-02-21T08:34:30.0247639Z %77 = arith.cmpf ogt, %73, %76 : tensor<1024x64xf32> 2026-02-21T08:34:30.0247859Z %78 = arith.cmpf une, %73, %73 : tensor<1024x64xf32> 2026-02-21T08:34:30.0248173Z %79 = arith.ori %77, %78 : tensor<1024x64xi1> 2026-02-21T08:34:30.0248409Z %80 = arith.select %79, %73, %76 : tensor<1024x64xi1>, tensor<1024x64xf32> 2026-02-21T08:34:30.0248644Z %81 = math.log %80 : tensor<1024x64xf32> 2026-02-21T08:34:30.0248896Z %82 = arith.subf %81, %72 : tensor<1024x64xf32> 2026-02-21T08:34:30.0249101Z %83 = arith.mulf %73, %82 : tensor<1024x64xf32> 2026-02-21T08:34:30.0249322Z %84 = arith.addf %83, %cst_0 : tensor<1024x64xf32> 2026-02-21T08:34:30.0249535Z scf.yield %84 : tensor<1024x64xf32> 2026-02-21T08:34:30.0249703Z } 2026-02-21T08:34:30.0249860Z %75 = arith.addf %58, %74 : tensor<1024x64xf32> 2026-02-21T08:34:30.0250050Z scf.yield %75 : tensor<1024x64xf32> 2026-02-21T08:34:30.0250241Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:34:30.0250519Z %9 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32> 2026-02-21T08:34:30.0250773Z %10 = tt.splat %c65472_i32 : i32 -> tensor<64xi32> 2026-02-21T08:34:30.0250981Z %11 = arith.addi %10, %9 : tensor<64xi32> 2026-02-21T08:34:30.0251214Z %12 = tt.expand_dims %7 {axis = 1 : i32} : tensor<1024xi32> -> tensor<1024x1xi32> 2026-02-21T08:34:30.0251478Z %13 = arith.muli %12, %cst : tensor<1024x1xi32> 2026-02-21T08:34:30.0251715Z %14 = tt.expand_dims %11 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32> 2026-02-21T08:34:30.0252025Z %15 = tt.broadcast %13 : tensor<1024x1xi32> -> 
tensor<1024x64xi32> 2026-02-21T08:34:30.0252280Z %16 = tt.broadcast %14 : tensor<1x64xi32> -> tensor<1024x64xi32> 2026-02-21T08:34:30.0252513Z %17 = arith.addi %15, %16 : tensor<1024x64xi32> 2026-02-21T08:34:30.0252744Z %18 = tt.splat %arg0 : !tt.ptr -> tensor<1024x64x!tt.ptr> 2026-02-21T08:34:30.0253011Z %19 = tt.addptr %18, %17 : tensor<1024x64x!tt.ptr>, tensor<1024x64xi32> 2026-02-21T08:34:30.0253314Z %20 = tt.load %19 evictionPolicy = evict_first : tensor<1024x64x!tt.ptr> 2026-02-21T08:34:30.0253661Z %21 = tt.descriptor_load %0[%4, %c65472_i32] : !tt.tensordesc> -> tensor<1024x64xf32> 2026-02-21T08:34:30.0253961Z %22 = scf.if %arg3 -> (tensor<1024x64xf32>) { 2026-02-21T08:34:30.0254313Z %27 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1024x64xf32>) -> tensor<1024x64xf32> 2026-02-21T08:34:30.0254678Z %28 = arith.subf %21, %20 : tensor<1024x64xf32> 2026-02-21T08:34:30.0254879Z %29 = arith.mulf %27, %28 : tensor<1024x64xf32> 2026-02-21T08:34:30.0255082Z %30 = arith.addf %29, %cst_0 : tensor<1024x64xf32> 2026-02-21T08:34:30.0255282Z scf.yield %30 : tensor<1024x64xf32> 2026-02-21T08:34:30.0255445Z } else { 2026-02-21T08:34:30.0255603Z %27 = tt.splat %arg4 : f32 -> tensor<1024x64xf32> 2026-02-21T08:34:30.0255818Z %28 = arith.cmpf ogt, %21, %27 : tensor<1024x64xf32> 2026-02-21T08:34:30.0256038Z %29 = arith.cmpf une, %21, %21 : tensor<1024x64xf32> 2026-02-21T08:34:30.0256245Z %30 = arith.ori %28, %29 : tensor<1024x64xi1> 2026-02-21T08:34:30.0256476Z %31 = arith.select %30, %21, %27 : tensor<1024x64xi1>, tensor<1024x64xf32> 2026-02-21T08:34:30.0256719Z %32 = math.log %31 : tensor<1024x64xf32> 2026-02-21T08:34:30.0256909Z %33 = arith.subf %32, %20 : tensor<1024x64xf32> 2026-02-21T08:34:30.0257108Z %34 = arith.mulf %21, %33 : tensor<1024x64xf32> 2026-02-21T08:34:30.0257305Z %35 = arith.addf %34, %cst_0 : tensor<1024x64xf32> 2026-02-21T08:34:30.0257505Z scf.yield %35 : tensor<1024x64xf32> 2026-02-21T08:34:30.0257674Z 
} 2026-02-21T08:34:30.0257814Z %23 = arith.addf %8, %22 : tensor<1024x64xf32> 2026-02-21T08:34:30.0258013Z %24 = "tt.reduce"(%23) <{axis = 1 : i32}> ({ 2026-02-21T08:34:30.0258195Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:34:30.0258374Z %27 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:34:30.0258615Z tt.reduce.return %27 : f32 2026-02-21T08:34:30.0258805Z }) : (tensor<1024x64xf32>) -> tensor<1024xf32> 2026-02-21T08:34:30.0259030Z %25 = tt.splat %arg2 : !tt.ptr -> tensor<1024x!tt.ptr> 2026-02-21T08:34:30.0259295Z %26 = tt.addptr %25, %7 : tensor<1024x!tt.ptr>, tensor<1024xi32> 2026-02-21T08:34:30.0259542Z tt.store %26, %24 : tensor<1024x!tt.ptr> 2026-02-21T08:34:30.0259735Z } {tt.warp_specialize} 2026-02-21T08:34:30.0259895Z tt.return 2026-02-21T08:34:30.0260017Z } 2026-02-21T08:34:30.0260139Z } 2026-02-21T08:34:30.0260205Z 2026-02-21T08:34:30.0260253Z {-# 2026-02-21T08:34:30.0260391Z external_resources: { 2026-02-21T08:34:30.0260541Z mlir_reproducer: { 2026-02-21T08:34:30.0264955Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, 
tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:34:30.0269407Z disable_threading: false, 2026-02-21T08:34:30.0269570Z verify_each: true 2026-02-21T08:34:30.0269717Z } 2026-02-21T08:34:30.0269833Z } 2026-02-21T08:34:30.0269950Z #-} 2026-02-21T08:34:30.0270362Z /tmp/torchinductor_root/4r/c4rovp3bv6fv3fr5la5e2rov22fm2kkuysyfoxnyh52ztqjqxk3h.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:34:30.0271552Z /tmp/torchinductor_root/4r/c4rovp3bv6fv3fr5la5e2rov22fm2kkuysyfoxnyh52ztqjqxk3h.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:34:30.0272546Z [50s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:34:30.0273617Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 1024], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', ''], maxnreg=256, num_sm_multiplier=128, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:34:30.0274580Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:34:30.0274839Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:34:36.3901802Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 4.6 configs/s 2026-02-21T08:34:36.3910938Z [56s] Adaptive compile timeout: 30s (90% percentile=4.8s, bounds=[30.0s, 30s]) 2026-02-21T08:34:39.7926325Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 535/535 170.9 configs/s 2026-02-21T08:34:39.9123003Z [60s] Initial random population of 100, 5 starting points: 2026-02-21T08:34:39.9124288Z error=12 2026-02-21T08:34:39.9124503Z timeout=4 2026-02-21T08:34:39.9124680Z ok=84 2026-02-21T08:34:39.9124874Z min=0.4066 2026-02-21T08:34:39.9125050Z mid=3.2093 2026-02-21T08:34:39.9125235Z max=448.7332 2026-02-21T08:34:39.9125448Z best={'block_sizes': [4096, 1], 2026-02-21T08:34:39.9125841Z 'indexing': ['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:34:39.9126282Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:34:39.9126580Z 'num_stages': 6, 2026-02-21T08:34:39.9127142Z 'num_warps': 32, 2026-02-21T08:34:39.9127378Z 'pid_type': 'flat', 2026-02-21T08:34:39.9127626Z 'range_flattens': [None, True], 2026-02-21T08:34:39.9127903Z 'range_multi_buffers': [None, False], 2026-02-21T08:34:39.9128183Z 'range_num_stages': [0, 1], 2026-02-21T08:34:39.9128435Z 'range_unroll_factors': [0, 1], 2026-02-21T08:34:39.9128759Z 'range_warp_specializes': [None, False]} 
2026-02-21T08:34:39.9148215Z [60s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:34:41.4274740Z [62s] Generation 1 starting: 89 neighbors, 5 active search path(s) 2026-02-21T08:35:16.2960526Z [96s] Timeout after 30s compiling Config(block_sizes=[4096, 4], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['', 'last'], num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, False], range_num_stages=[4, 1], range_unroll_factors=[0, 1], range_warp_specializes=[False, None]) 2026-02-21T08:35:16.2977203Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 0.5 configs/s 2026-02-21T08:35:21.7469551Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 17.0 configs/s 2026-02-21T08:35:41.0794890Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━ 535/535 27.7 configs/s 2026-02-21T08:35:41.3383499Z [121s] Generation 1 complete: 2026-02-21T08:35:41.3385136Z timeout=1 2026-02-21T08:35:41.3385338Z ok=93 2026-02-21T08:35:41.3385511Z min=0.4158 2026-02-21T08:35:41.3385673Z mid=0.5090 2026-02-21T08:35:41.3385836Z max=3.1733 2026-02-21T08:35:41.3386016Z best={'block_sizes': [4096, 1], 2026-02-21T08:35:41.3386348Z 'indexing': ['pointer', 'tensor_descriptor', 'tensor_descriptor'], 2026-02-21T08:35:41.3386691Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:35:41.3386945Z 'num_stages': 6, 2026-02-21T08:35:41.3387129Z 'num_warps': 32, 2026-02-21T08:35:41.3387317Z 'pid_type': 'flat', 2026-02-21T08:35:41.3387522Z 'range_flattens': [None, True], 2026-02-21T08:35:41.3387797Z 'range_multi_buffers': [None, True], 2026-02-21T08:35:41.3388037Z 'range_num_stages': [0, 1], 2026-02-21T08:35:41.3388196Z 'range_unroll_factors': [0, 1], 2026-02-21T08:35:41.3388377Z 'range_warp_specializes': [None, False]} 2026-02-21T08:35:41.3405201Z [121s] Fitting surrogate: 194 points, 194 targets 2026-02-21T08:35:42.5488272Z [123s] Generation 2 
starting: 74 neighbors, 5 active search path(s) 2026-02-21T08:35:47.0082679Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 11.8 configs/s 2026-02-21T08:35:51.3448029Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 17.2 configs/s 2026-02-21T08:36:09.1641509Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━ 537/537 30.9 configs/s 2026-02-21T08:36:09.4279403Z [150s] Generation 2 complete: 2026-02-21T08:36:09.4283552Z ok=79 2026-02-21T08:36:09.4285485Z min=0.4227 2026-02-21T08:36:09.4285645Z mid=0.4516 2026-02-21T08:36:09.4285766Z max=1.5099 2026-02-21T08:36:09.4285910Z best={'block_sizes': [1024, 1], 2026-02-21T08:36:09.4286174Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:36:09.4286915Z 'load_eviction_policies': ['', 'last'], 2026-02-21T08:36:09.4287092Z 'num_stages': 1, 2026-02-21T08:36:09.4287244Z 'num_warps': 1, 2026-02-21T08:36:09.4287383Z 'pid_type': 'flat', 2026-02-21T08:36:09.4287540Z 'range_flattens': [None, False], 2026-02-21T08:36:09.4287713Z 'range_multi_buffers': [None, False], 2026-02-21T08:36:09.4287895Z 'range_num_stages': [0, 1], 2026-02-21T08:36:09.4288061Z 'range_unroll_factors': [0, 1], 2026-02-21T08:36:09.4288231Z 'range_warp_specializes': [None, True]} 2026-02-21T08:36:09.4293951Z [150s] Fitting surrogate: 273 points, 273 targets 2026-02-21T08:36:10.3138608Z [150s] Generation 3 starting: 61 neighbors, 5 active search path(s) 2026-02-21T08:36:13.4705099Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 38.1 configs/s 2026-02-21T08:36:17.0030558Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 61/61 17.5 configs/s 2026-02-21T08:36:33.8969338Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━ 537/537 31.8 configs/s 2026-02-21T08:36:34.1413032Z [174s] Generation 3 complete: 2026-02-21T08:36:34.1417030Z ok=66 2026-02-21T08:36:34.1420421Z min=0.4208 2026-02-21T08:36:34.1424363Z mid=0.4354 2026-02-21T08:36:34.1428218Z max=0.8530 
2026-02-21T08:36:34.1432613Z best={'block_sizes': [2048, 2], 2026-02-21T08:36:34.1436243Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:36:34.1439546Z 'load_eviction_policies': ['', 'first'], 2026-02-21T08:36:34.1439831Z 'num_stages': 6, 2026-02-21T08:36:34.1440014Z 'num_warps': 32, 2026-02-21T08:36:34.1440200Z 'pid_type': 'flat', 2026-02-21T08:36:34.1440393Z 'range_flattens': [None, False], 2026-02-21T08:36:34.1440587Z 'range_multi_buffers': [None, False], 2026-02-21T08:36:34.1440768Z 'range_num_stages': [0, 0], 2026-02-21T08:36:34.1440941Z 'range_unroll_factors': [0, 0], 2026-02-21T08:36:34.1446362Z 'range_warp_specializes': [None, True]} 2026-02-21T08:36:34.1446650Z [174s] Fitting surrogate: 339 points, 339 targets 2026-02-21T08:36:34.9291284Z [175s] Generation 4 starting: 49 neighbors, 4 active search path(s) 2026-02-21T08:36:37.6158598Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50/50 35.7 configs/s 2026-02-21T08:36:40.5115912Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 50/50 17.5 configs/s 2026-02-21T08:36:53.3666940Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 541/541 42.0 configs/s 2026-02-21T08:36:53.5639210Z [194s] Generation 4 complete: 2026-02-21T08:36:53.5642845Z ok=53 2026-02-21T08:36:53.5647945Z min=0.4158 2026-02-21T08:36:53.5649571Z mid=0.4332 2026-02-21T08:36:53.5649778Z max=0.9440 2026-02-21T08:36:53.5652755Z best={'block_sizes': [2048, 1], 2026-02-21T08:36:53.5652999Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:36:53.5653211Z 'load_eviction_policies': ['last', 'first'], 2026-02-21T08:36:53.5653448Z 'num_stages': 7, 2026-02-21T08:36:53.5653593Z 'num_warps': 8, 2026-02-21T08:36:53.5653765Z 'pid_type': 'flat', 2026-02-21T08:36:53.5658517Z 'range_flattens': [None, True], 2026-02-21T08:36:53.5662324Z 'range_multi_buffers': [None, None], 2026-02-21T08:36:53.5666704Z 'range_num_stages': [0, 1], 2026-02-21T08:36:53.5668202Z 'range_unroll_factors': [0, 1], 
2026-02-21T08:36:53.5668438Z 'range_warp_specializes': [None, False]} 2026-02-21T08:36:53.5668719Z [194s] Fitting surrogate: 392 points, 392 targets 2026-02-21T08:36:54.2506544Z [194s] Generation 5 starting: 44 neighbors, 4 active search path(s) 2026-02-21T08:36:57.1962355Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44/44 11.5 configs/s 2026-02-21T08:36:59.7190837Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 44/44 17.8 configs/s 2026-02-21T08:37:12.1869304Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━ 541/541 43.3 configs/s 2026-02-21T08:37:12.3947837Z [213s] Generation 5 complete: 2026-02-21T08:37:12.3949585Z ok=48 2026-02-21T08:37:12.3949747Z min=0.4127 2026-02-21T08:37:12.3949889Z mid=0.4392 2026-02-21T08:37:12.3950049Z max=0.7652 2026-02-21T08:37:12.3950969Z best={'block_sizes': [2048, 2], 2026-02-21T08:37:12.3951215Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:37:12.3951430Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:37:12.3951619Z 'num_stages': 7, 2026-02-21T08:37:12.3952833Z 'num_warps': 32, 2026-02-21T08:37:12.3953032Z 'pid_type': 'flat', 2026-02-21T08:37:12.3953222Z 'range_flattens': [None, False], 2026-02-21T08:37:12.3953426Z 'range_multi_buffers': [None, False], 2026-02-21T08:37:12.3953640Z 'range_num_stages': [0, 1], 2026-02-21T08:37:12.3953830Z 'range_unroll_factors': [0, 0], 2026-02-21T08:37:12.3954029Z 'range_warp_specializes': [None, True]} 2026-02-21T08:37:12.3966389Z [213s] Fitting surrogate: 440 points, 440 targets 2026-02-21T08:37:12.8664910Z [213s] Generation 6 starting: 23 neighbors, 2 active search path(s) 2026-02-21T08:37:15.1785038Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24/24 10.4 configs/s 2026-02-21T08:37:16.5905949Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 24/24 17.6 configs/s 2026-02-21T08:37:22.2230610Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 550/550 96.7 configs/s 2026-02-21T08:37:22.3499726Z 
[222s] Generation 6 complete: 2026-02-21T08:37:22.3503246Z ok=25 2026-02-21T08:37:22.3507131Z min=0.4128 2026-02-21T08:37:22.3511016Z mid=0.4793 2026-02-21T08:37:22.3514928Z max=0.8183 2026-02-21T08:37:22.3518917Z best={'block_sizes': [2048, 2], 2026-02-21T08:37:22.3520332Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:37:22.3520557Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:37:22.3520752Z 'num_stages': 7, 2026-02-21T08:37:22.3520895Z 'num_warps': 32, 2026-02-21T08:37:22.3521050Z 'pid_type': 'flat', 2026-02-21T08:37:22.3521211Z 'range_flattens': [None, False], 2026-02-21T08:37:22.3521385Z 'range_multi_buffers': [None, False], 2026-02-21T08:37:22.3521569Z 'range_num_stages': [0, 1], 2026-02-21T08:37:22.3521732Z 'range_unroll_factors': [0, 0], 2026-02-21T08:37:22.3522129Z 'range_warp_specializes': [None, True]} 2026-02-21T08:37:22.3522355Z [222s] Fitting surrogate: 465 points, 465 targets 2026-02-21T08:37:22.7608723Z [223s] Generation 7 starting: 21 neighbors, 2 active search path(s) 2026-02-21T08:37:25.8397991Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 4.2 configs/s 2026-02-21T08:37:27.1254987Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 22/22 17.7 configs/s 2026-02-21T08:37:32.5693543Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━ 550/550 99.9 configs/s 2026-02-21T08:37:32.6940091Z [233s] Generation 7 complete: 2026-02-21T08:37:32.6942528Z ok=23 2026-02-21T08:37:32.6942726Z min=0.4153 2026-02-21T08:37:32.6942914Z mid=0.4374 2026-02-21T08:37:32.6943083Z max=0.9636 2026-02-21T08:37:32.6943274Z best={'block_sizes': [2048, 2], 2026-02-21T08:37:32.6943565Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:37:32.6943905Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:37:32.6944244Z 'num_stages': 7, 2026-02-21T08:37:32.6944924Z 'num_warps': 32, 2026-02-21T08:37:32.6945136Z 'pid_type': 'flat', 2026-02-21T08:37:32.6945366Z 'range_flattens': [None, False], 
2026-02-21T08:37:32.6945645Z 'range_multi_buffers': [None, False], 2026-02-21T08:37:32.6945922Z 'range_num_stages': [0, 0], 2026-02-21T08:37:32.6946173Z 'range_unroll_factors': [0, 0], 2026-02-21T08:37:32.6946441Z 'range_warp_specializes': [None, True]} 2026-02-21T08:37:32.6961327Z [233s] Fitting surrogate: 488 points, 488 targets 2026-02-21T08:37:33.1877003Z [233s] Generation 8 starting: 20 neighbors, 2 active search path(s) 2026-02-21T08:37:34.4299739Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 31.0 configs/s 2026-02-21T08:37:35.6118333Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 20/20 17.6 configs/s 2026-02-21T08:37:41.4480141Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━ 556/556 94.5 configs/s 2026-02-21T08:37:41.5749823Z [242s] Generation 8 complete: 2026-02-21T08:37:41.5754568Z ok=22 2026-02-21T08:37:41.5755769Z min=0.4147 2026-02-21T08:37:41.5755928Z mid=0.4495 2026-02-21T08:37:41.5756047Z max=0.6380 2026-02-21T08:37:41.5756192Z best={'block_sizes': [2048, 1], 2026-02-21T08:37:41.5756395Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T08:37:41.5756614Z 'load_eviction_policies': ['first', 'first'], 2026-02-21T08:37:41.5756803Z 'num_stages': 7, 2026-02-21T08:37:41.5756939Z 'num_warps': 8, 2026-02-21T08:37:41.5757081Z 'pid_type': 'flat', 2026-02-21T08:37:41.5757230Z 'range_flattens': [None, True], 2026-02-21T08:37:41.5757411Z 'range_multi_buffers': [None, False], 2026-02-21T08:37:41.5757585Z 'range_num_stages': [0, 1], 2026-02-21T08:37:41.5757752Z 'range_unroll_factors': [0, 0], 2026-02-21T08:37:41.5757924Z 'range_warp_specializes': [None, True]} 2026-02-21T08:37:41.5766274Z [242s] Fitting surrogate: 510 points, 510 targets 2026-02-21T08:37:41.8548884Z [242s] Autotuning complete in 242.5s after searching 482 configs. 
2026-02-21T08:37:41.8552215Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:37:41.8557062Z @helion.kernel(config=helion.Config(block_sizes=[2048, 1], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=7, num_warps=8, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:37:41.8557866Z 2026-02-21T08:37:41.8558124Z [242s] Code of selected kernel: /tmp/torchinductor_root/p6/cp6bl3k74ixwykbkjt4ihezx34drzzpnpmzbaq6zywbrex3k4wwg.py 2026-02-21T08:37:41.8742276Z from __future__ import annotations 2026-02-21T08:37:41.8742545Z 2026-02-21T08:37:41.8746751Z import torch 2026-02-21T08:37:41.8746974Z import triton 2026-02-21T08:37:41.8747124Z import triton.language as tl 2026-02-21T08:37:41.8747334Z from torch._inductor.runtime import triton_helpers 2026-02-21T08:37:41.8747626Z from torch._inductor.runtime.triton_helpers import math as tl_math 2026-02-21T08:37:41.8747923Z from torch._inductor.runtime.triton_compat import libdevice 2026-02-21T08:37:41.8748197Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T08:37:41.8748368Z 2026-02-21T08:37:41.8748438Z _BLOCK_SIZE_1 = tl.constexpr(1) 2026-02-21T08:37:41.8748615Z _BLOCK_SIZE_0 = tl.constexpr(2048) 2026-02-21T08:37:41.8748725Z 2026-02-21T08:37:41.8748780Z @triton.jit 2026-02-21T08:37:41.8748967Z def _helion_kl_div_forward(y_pred, y_true, loss, log_target, eps): 2026-02-21T08:37:41.8749256Z # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m): 2026-02-21T08:37:41.8749493Z pid_0 = tl.program_id(0) 2026-02-21T08:37:41.8749661Z offset_1 = pid_0 2026-02-21T08:37:41.8749828Z indices_1 = offset_1 + tl.zeros([1], tl.int32) 2026-02-21T08:37:41.8750105Z # src[kl_div.py:90]: loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32) 2026-02-21T08:37:41.8750420Z 
loss_sum = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T08:37:41.8751015Z # src[kl_div.py:92]: for tile_v in hl.tile(V, block_size=block_size_n): 2026-02-21T08:37:41.8751335Z # src[kl_div.py:93]: kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32) 2026-02-21T08:37:41.8751603Z # src[kl_div.py:92-112]: ... 2026-02-21T08:37:41.8752036Z for offset_0 in tl.range(0, 65536, _BLOCK_SIZE_0, warp_specialize=True, num_stages=1, disallow_acc_multi_buffer=True, flatten=True): 2026-02-21T08:37:41.8752444Z indices_0 = offset_0 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32) 2026-02-21T08:37:41.8752676Z loss_sum_copy = loss_sum 2026-02-21T08:37:41.8752843Z loss_sum_copy_0 = loss_sum_copy 2026-02-21T08:37:41.8753111Z # src[kl_div.py:93]: kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32) 2026-02-21T08:37:41.8753425Z kl_loss = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T08:37:41.8753774Z # src[kl_div.py:95]: y_pred_val = y_pred[tile_bt, tile_v] 2026-02-21T08:37:41.8754139Z y_pred_val = tl.load(y_pred + (indices_1[:, None] * 65536 + indices_0[None, :] * 1), None, eviction_policy='evict_first') 2026-02-21T08:37:41.8754486Z # src[kl_div.py:96]: y_true_val = y_true[tile_bt, tile_v] 2026-02-21T08:37:41.8754828Z y_true_val = tl.load(y_true + (indices_1[:, None] * 65536 + indices_0[None, :] * 1), None, eviction_policy='evict_first') 2026-02-21T08:37:41.8755153Z # src[kl_div.py:98]: if log_target: 2026-02-21T08:37:41.8755415Z # src[kl_div.py:99]: # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space 2026-02-21T08:37:41.8755716Z # src[kl_div.py:100]: prob_true = torch.exp(y_true_val) 2026-02-21T08:37:41.8755929Z # src[kl_div.py:98-106]: ... 
2026-02-21T08:37:41.8756101Z if log_target: 2026-02-21T08:37:41.8756253Z y_true_val_copy = y_true_val 2026-02-21T08:37:41.8756437Z y_pred_val_copy = y_pred_val 2026-02-21T08:37:41.8756615Z kl_loss_copy = kl_loss 2026-02-21T08:37:41.8756806Z y_true_val_copy_0 = y_true_val_copy 2026-02-21T08:37:41.8757026Z y_pred_val_copy_0 = y_pred_val_copy 2026-02-21T08:37:41.8757207Z kl_loss_copy_0 = kl_loss_copy 2026-02-21T08:37:41.8757419Z # src[kl_div.py:100]: prob_true = torch.exp(y_true_val) 2026-02-21T08:37:41.8757640Z v_0 = libdevice.exp(y_true_val_copy_0) 2026-02-21T08:37:41.8757886Z # src[kl_div.py:101]: kl_loss += prob_true * (y_true_val - y_pred_val) 2026-02-21T08:37:41.8758137Z v_1 = y_true_val_copy_0 - y_pred_val_copy_0 2026-02-21T08:37:41.8758329Z v_2 = v_0 * v_1 2026-02-21T08:37:41.8758493Z kl_loss = kl_loss_copy_0 + v_2 2026-02-21T08:37:41.8758670Z # src[kl_div.py:98]: if log_target: 2026-02-21T08:37:41.8758922Z # src[kl_div.py:99]: # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space 2026-02-21T08:37:41.8759210Z # src[kl_div.py:100]: prob_true = torch.exp(y_true_val) 2026-02-21T08:37:41.8759424Z # src[kl_div.py:98-106]: ... 
2026-02-21T08:37:41.8759592Z _not = not log_target 2026-02-21T08:37:41.8759747Z if _not: 2026-02-21T08:37:41.8759885Z y_true_val_copy_1 = y_true_val 2026-02-21T08:37:41.8760066Z y_pred_val_copy_1 = y_pred_val 2026-02-21T08:37:41.8760243Z kl_loss_copy_1 = kl_loss 2026-02-21T08:37:41.8760425Z y_true_val_copy_1_0 = y_true_val_copy_1 2026-02-21T08:37:41.8760627Z y_pred_val_copy_1_0 = y_pred_val_copy_1 2026-02-21T08:37:41.8760816Z kl_loss_copy_1_0 = kl_loss_copy_1 2026-02-21T08:37:41.8761064Z # src[kl_div.py:105]: log_true = torch.log(torch.clamp(y_true_val, min=eps)) 2026-02-21T08:37:41.8761348Z v_4 = triton_helpers.maximum(y_true_val_copy_1_0, eps) 2026-02-21T08:37:41.8761563Z v_5 = tl_math.log(v_4) 2026-02-21T08:37:41.8761784Z # src[kl_div.py:106]: kl_loss += y_true_val * (log_true - y_pred_val) 2026-02-21T08:37:41.8762135Z v_6 = v_5 - y_pred_val_copy_1_0 2026-02-21T08:37:41.8762320Z v_7 = y_true_val_copy_1_0 * v_6 2026-02-21T08:37:41.8762495Z kl_loss = kl_loss_copy_1_0 + v_7 2026-02-21T08:37:41.8762683Z # src[kl_div.py:112]: loss_sum += kl_loss 2026-02-21T08:37:41.8762866Z loss_sum = loss_sum_copy_0 + kl_loss 2026-02-21T08:37:41.8763079Z # src[kl_div.py:115]: loss[tile_bt] = loss_sum.sum(dim=-1) 2026-02-21T08:37:41.8763313Z sum_1 = tl.cast(tl.sum(loss_sum, 1), tl.float32) 2026-02-21T08:37:41.8763522Z tl.store(loss + indices_1 * 1, sum_1, None) 2026-02-21T08:37:41.8763650Z 2026-02-21T08:37:41.8763937Z def kl_div_forward(y_pred: Tensor, y_true: Tensor, log_target: bool=False, reduction: str='batchmean', eps: float=1e-10, *, _launcher=_default_launcher): 2026-02-21T08:37:41.8764315Z """ 2026-02-21T08:37:41.8764452Z Compute KL Divergence loss. 
2026-02-21T08:37:41.8764560Z 2026-02-21T08:37:41.8764611Z Args: 2026-02-21T08:37:41.8764845Z y_pred: Input predictions in log-space, shape (BT, V) 2026-02-21T08:37:41.8765138Z y_true: Target values (probabilities or log-probabilities), shape (BT, V) 2026-02-21T08:37:41.8765458Z log_target: If True, y_true is in log-space; if False, y_true is probabilities 2026-02-21T08:37:41.8765767Z reduction: Reduction mode ('none', 'sum', 'mean', 'batchmean') 2026-02-21T08:37:41.8766003Z eps: Small value to avoid numerical issues 2026-02-21T08:37:41.8766139Z 2026-02-21T08:37:41.8766191Z Returns: 2026-02-21T08:37:41.8766323Z loss: KL divergence loss 2026-02-21T08:37:41.8766484Z """ 2026-02-21T08:37:41.8766626Z # src[kl_div.py:74]: BT, V = y_pred.shape 2026-02-21T08:37:41.8766803Z BT, V = y_pred.shape 2026-02-21T08:37:41.8767004Z # src[kl_div.py:75]: assert y_true.shape == y_pred.shape, ( 2026-02-21T08:37:41.8767267Z # src[kl_div.py:76]: f"Shape mismatch: {y_true.shape} != {y_pred.shape}" 2026-02-21T08:37:41.8767513Z # src[kl_div.py:77]: ) 2026-02-21T08:37:41.8767765Z assert y_true.shape == y_pred.shape, f'Shape mismatch: {y_true.shape} != {y_pred.shape}' 2026-02-21T08:37:41.8768052Z # src[kl_div.py:80]: if reduction == "none": 2026-02-21T08:37:41.8768272Z # src[kl_div.py:81]: loss = torch.zeros_like(y_pred) 2026-02-21T08:37:41.8768469Z # src[kl_div.py:82]: else: 2026-02-21T08:37:41.8768633Z # src[kl_div.py:80-83]: ... 
2026-02-21T08:37:41.8768788Z if reduction == 'none': 2026-02-21T08:37:41.8768977Z # src[kl_div.py:81]: loss = torch.zeros_like(y_pred) 2026-02-21T08:37:41.8769175Z loss = torch.zeros_like(y_pred) 2026-02-21T08:37:41.8769343Z else: 2026-02-21T08:37:41.8769556Z # src[kl_div.py:83]: loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device) 2026-02-21T08:37:41.8769879Z loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device) 2026-02-21T08:37:41.8770172Z # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m): 2026-02-21T08:37:41.8770500Z # src[kl_div.py:90]: loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32) 2026-02-21T08:37:41.8770760Z # src[kl_div.py:89-115]: ... 2026-02-21T08:37:41.8771047Z _launcher(_helion_kl_div_forward, (4096,), y_pred, y_true, loss, log_target, eps, num_warps=8, num_stages=7) 2026-02-21T08:37:41.8771383Z # src[kl_div.py:118]: if reduction == "batchmean": 2026-02-21T08:37:41.8771618Z # src[kl_div.py:119]: final_loss = torch.sum(loss) / BT 2026-02-21T08:37:41.8771838Z # src[kl_div.py:120]: elif reduction == "sum": 2026-02-21T08:37:41.8772069Z # src[kl_div.py:118-125]: ... 
2026-02-21T08:37:41.8772233Z if reduction == 'batchmean': 2026-02-21T08:37:41.8772433Z # src[kl_div.py:119]: final_loss = torch.sum(loss) / BT 2026-02-21T08:37:41.8772639Z final_loss = torch.sum(loss) / BT 2026-02-21T08:37:41.8772823Z elif reduction == 'sum': 2026-02-21T08:37:41.8773011Z # src[kl_div.py:121]: final_loss = torch.sum(loss, dim=0) 2026-02-21T08:37:41.8773285Z final_loss = torch.sum(loss, dim=0) 2026-02-21T08:37:41.8773469Z elif reduction == 'mean': 2026-02-21T08:37:41.8773664Z # src[kl_div.py:123]: final_loss = torch.sum(loss) / (BT * V) 2026-02-21T08:37:41.8773887Z final_loss = torch.sum(loss) / (BT * V) 2026-02-21T08:37:41.8774055Z else: 2026-02-21T08:37:41.8774194Z # src[kl_div.py:125]: final_loss = loss 2026-02-21T08:37:41.8774365Z final_loss = loss 2026-02-21T08:37:41.8774529Z # src[kl_div.py:127]: return final_loss 2026-02-21T08:37:41.8774693Z return final_loss 2026-02-21T08:37:43.2186227Z WARNING:tritonbench.utils.triton_op:Completed input ID 4: 2026-02-21T08:37:43.2188082Z (B, T, V) 2026-02-21T08:37:43.2188320Z --------------- 2026-02-21T08:37:43.2192737Z (8, 512, 65536) 2026-02-21T08:37:43.2196557Z 2026-02-21T08:37:43.2525628Z 83%|████████▎ | 5/6 [16:15<03:33, 213.89s/it]WARNING:tritonbench.utils.triton_op:Running input ID 5: 2026-02-21T08:37:43.2529806Z (B, T, V) 2026-02-21T08:37:43.2533419Z ---------------- 2026-02-21T08:37:43.2536371Z (8, 512, 131072) 2026-02-21T08:37:43.2549572Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for torch_kl_div 2026-02-21T08:37:44.5169098Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for liger_kl_div 2026-02-21T08:37:45.5232461Z INFO:tritonbench.utils.triton_op:Took 4.58ms to get benchmark function for torch_compile_kl_div 2026-02-21T08:37:49.7770973Z WARNING:__main__:Input tensor metadata: 2026-02-21T08:37:49.7771423Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T08:37:49.7775897Z 'dtype': 'torch.float32', 2026-02-21T08:37:49.7779977Z 'shape': (4096, 131072), 
2026-02-21T08:37:49.7783761Z 'stride': (131072, 1)}, 2026-02-21T08:37:49.7788150Z { 'device': 'cuda:0', 2026-02-21T08:37:49.7789636Z 'dtype': 'torch.float32', 2026-02-21T08:37:49.7789852Z 'shape': (4096, 131072), 2026-02-21T08:37:49.7790119Z 'stride': (131072, 1)}), 2026-02-21T08:37:49.7790300Z 'kwargs': {}} 2026-02-21T08:37:49.7795344Z INFO:tritonbench.utils.triton_op:Took 2.69ms to get benchmark function for helion_kl_div_tritonbench 2026-02-21T08:37:49.9972296Z [0s] Autotune random seed: 2135561342 2026-02-21T08:37:50.1548037Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T08:38:23.2642791Z [33s] Timeout after 30s compiling Config(block_sizes=[65536, 1], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', ''], maxnreg=128, num_sm_multiplier=1, num_stages=6, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, False], range_num_stages=[1, 0], range_unroll_factors=[0, 1], range_warp_specializes=[True, None]) 2026-02-21T08:38:23.6352722Z [33s] Timeout after 30s compiling Config(block_sizes=[512, 256], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=32, num_sm_multiplier=4, num_stages=5, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, False], range_num_stages=[1, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]) 2026-02-21T08:38:24.2169797Z [34s] Timeout after 30s compiling Config(block_sizes=[128, 512], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=128, num_sm_multiplier=1, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[3, 3], range_unroll_factors=[3, 3], range_warp_specializes=[False, False]) 
2026-02-21T08:38:24.6411773Z [34s] Timeout after 30s compiling Config(block_sizes=[4096, 4], indexing=['pointer', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', ''], num_stages=5, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[None, False]) 2026-02-21T08:38:25.1597895Z [35s] Timeout after 30s compiling Config(block_sizes=[1024, 128], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', 'last'], maxnreg=128, num_sm_multiplier=2, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[2, 4], range_unroll_factors=[4, 0], range_warp_specializes=[False, False]) 2026-02-21T08:38:25.9038097Z [35s] Timeout after 30s compiling Config(block_sizes=[512, 1024], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['', ''], num_stages=7, num_warps=32, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[None, None]) 2026-02-21T08:38:26.6282787Z [36s] Timeout after 30s compiling Config(block_sizes=[2048, 8], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['last', 'first'], num_sm_multiplier=4, num_stages=5, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[0, 3], range_unroll_factors=[3, 0], range_warp_specializes=[False, False]) 2026-02-21T08:38:26.6835125Z [36s] Timeout after 30s compiling Config(block_sizes=[4096, 32], indexing=['pointer', 'pointer', 'pointer'], load_eviction_policies=['first', 'last'], num_stages=8, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[None, 
False]) 2026-02-21T08:38:27.3763688Z [37s] Timeout after 30s compiling Config(block_sizes=[65536, 8], indexing=['pointer', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], maxnreg=64, num_sm_multiplier=8, num_stages=7, num_warps=32, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[1, 0], range_unroll_factors=[2, 0], range_warp_specializes=[False, None]) 2026-02-21T08:38:27.3778979Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.0 configs/s 2026-02-21T08:38:27.4877576Z module { 2026-02-21T08:38:27.4878227Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:38:27.4878818Z %c8_i32 = arith.constant 8 : i32 2026-02-21T08:38:27.4879008Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:38:27.4879243Z %cst = arith.constant dense<131072> : tensor<16x1xi32> 2026-02-21T08:38:27.4879515Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<16x8xf32> 2026-02-21T08:38:27.4879765Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:38:27.4879964Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:38:27.4880197Z %c131072_i32 = arith.constant 131072 : i32 2026-02-21T08:38:27.4880433Z %c131072_i64 = arith.constant 131072 : i64 2026-02-21T08:38:27.4880609Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:38:27.4880923Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<16x8xf32>> 2026-02-21T08:38:27.4881235Z %1 = tt.get_program_id x : i32 2026-02-21T08:38:27.4881418Z %2 = arith.muli %1, %c16_i32 : i32 2026-02-21T08:38:27.4881646Z %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:38:27.4882126Z %4 = tt.splat %2 : i32 -> tensor<16xi32> 2026-02-21T08:38:27.4882335Z %5 = arith.addi %4, %3 : tensor<16xi32> 
2026-02-21T08:38:27.4882644Z %6 = scf.for %arg5 = %c0_i32 to %c131072_i32 step %c8_i32 iter_args(%arg6 = %cst_0) -> (tensor<16x8xf32>) : i32 { 2026-02-21T08:38:27.4883006Z %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32> 2026-02-21T08:38:27.4883259Z %11 = tt.splat %arg5 : i32 -> tensor<8xi32> 2026-02-21T08:38:27.4883777Z %12 = arith.addi %11, %10 : tensor<8xi32> 2026-02-21T08:38:27.4884069Z %13 = tt.descriptor_load %0[%2, %arg5] : !tt.tensordesc<tensor<16x8xf32>> -> tensor<16x8xf32> 2026-02-21T08:38:27.4884398Z %14 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32> -> tensor<16x1xi32> 2026-02-21T08:38:27.4884658Z %15 = arith.muli %14, %cst : tensor<16x1xi32> 2026-02-21T08:38:27.4884898Z %16 = tt.expand_dims %12 {axis = 0 : i32} : tensor<8xi32> -> tensor<1x8xi32> 2026-02-21T08:38:27.4885176Z %17 = tt.broadcast %15 : tensor<16x1xi32> -> tensor<16x8xi32> 2026-02-21T08:38:27.4885420Z %18 = tt.broadcast %16 : tensor<1x8xi32> -> tensor<16x8xi32> 2026-02-21T08:38:27.4885644Z %19 = arith.addi %17, %18 : tensor<16x8xi32> 2026-02-21T08:38:27.4885877Z %20 = tt.splat %arg1 : !tt.ptr<f32> -> tensor<16x8x!tt.ptr<f32>> 2026-02-21T08:38:27.4886141Z %21 = tt.addptr %20, %19 : tensor<16x8x!tt.ptr<f32>>, tensor<16x8xi32> 2026-02-21T08:38:27.4886523Z %22 = tt.load %21 evictionPolicy = evict_first : tensor<16x8x!tt.ptr<f32>> 2026-02-21T08:38:27.4886775Z %23 = scf.if %arg3 -> (tensor<16x8xf32>) { 2026-02-21T08:38:27.4887126Z %25 = tt.extern_elementwise %22 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x8xf32>) -> tensor<16x8xf32> 2026-02-21T08:38:27.4887487Z %26 = arith.subf %22, %13 : tensor<16x8xf32> 2026-02-21T08:38:27.4887684Z %27 = arith.mulf %25, %26 : tensor<16x8xf32> 2026-02-21T08:38:27.4887891Z %28 = arith.addf %27, %cst_0 : tensor<16x8xf32> 2026-02-21T08:38:27.4888081Z scf.yield %28 : tensor<16x8xf32> 2026-02-21T08:38:27.4888254Z } else { 2026-02-21T08:38:27.4888408Z %25 = tt.splat %arg4 : f32 -> tensor<16x8xf32> 2026-02-21T08:38:27.4888627Z %26 = arith.cmpf 
ogt, %22, %25 : tensor<16x8xf32> 2026-02-21T08:38:27.4888838Z %27 = arith.cmpf une, %22, %22 : tensor<16x8xf32> 2026-02-21T08:38:27.4889046Z %28 = arith.ori %26, %27 : tensor<16x8xi1> 2026-02-21T08:38:27.4889278Z %29 = arith.select %28, %22, %25 : tensor<16x8xi1>, tensor<16x8xf32> 2026-02-21T08:38:27.4889504Z %30 = math.log %29 : tensor<16x8xf32> 2026-02-21T08:38:27.4889692Z %31 = arith.subf %30, %13 : tensor<16x8xf32> 2026-02-21T08:38:27.4889877Z %32 = arith.mulf %22, %31 : tensor<16x8xf32> 2026-02-21T08:38:27.4890078Z %33 = arith.addf %32, %cst_0 : tensor<16x8xf32> 2026-02-21T08:38:27.4890261Z scf.yield %33 : tensor<16x8xf32> 2026-02-21T08:38:27.4890426Z } 2026-02-21T08:38:27.4890568Z %24 = arith.addf %arg6, %23 : tensor<16x8xf32> 2026-02-21T08:38:27.4890750Z scf.yield %24 : tensor<16x8xf32> 2026-02-21T08:38:27.4890946Z } {tt.warp_specialize} 2026-02-21T08:38:27.4891117Z %7 = "tt.reduce"(%6) <{axis = 1 : i32}> ({ 2026-02-21T08:38:27.4891294Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:38:27.4891471Z %10 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:38:27.4891644Z tt.reduce.return %10 : f32 2026-02-21T08:38:27.4891830Z }) : (tensor<16x8xf32>) -> tensor<16xf32> 2026-02-21T08:38:27.4892084Z %8 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>> 2026-02-21T08:38:27.4892342Z %9 = tt.addptr %8, %5 : tensor<16x!tt.ptr<f32>>, tensor<16xi32> 2026-02-21T08:38:27.4892564Z tt.store %9, %7 : tensor<16x!tt.ptr<f32>> 2026-02-21T08:38:27.4892742Z tt.return 2026-02-21T08:38:27.4892870Z } 2026-02-21T08:38:27.4892987Z } 2026-02-21T08:38:27.4893053Z 2026-02-21T08:38:27.4893112Z {-# 2026-02-21T08:38:27.4893252Z external_resources: { 2026-02-21T08:38:27.4893409Z mlir_reproducer: { 2026-02-21T08:38:27.4897761Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=8 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, 
tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:38:27.4902396Z disable_threading: false, 2026-02-21T08:38:27.4902576Z verify_each: true 2026-02-21T08:38:27.4902732Z } 2026-02-21T08:38:27.4902871Z } 2026-02-21T08:38:27.4902991Z #-} 2026-02-21T08:38:27.4903544Z /tmp/torchinductor_root/xo/cxoplxv4egu44ahe6hembpyhmwszytytmjhzlnwzjl6cai3rj64b.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:38:27.4904965Z /tmp/torchinductor_root/xo/cxoplxv4egu44ahe6hembpyhmwszytytmjhzlnwzjl6cai3rj64b.py:21:0: note: 
Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:38:27.4906040Z [37s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:38:27.4907073Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 16], indexing=['tensor_descriptor', 'pointer', 'pointer'], load_eviction_policies=['first', 'first'], num_stages=6, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:38:27.4908012Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:38:27.4908262Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:38:35.9154087Z module attributes {ttg.maxnreg = 64 : i32} { 2026-02-21T08:38:35.9155705Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:38:35.9156300Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:38:35.9156484Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:38:35.9156678Z %c4736_i32 = arith.constant 4736 : i32 2026-02-21T08:38:35.9156909Z %cst = arith.constant dense<0.000000e+00> : tensor<16x256xf32> 2026-02-21T08:38:35.9157131Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:38:35.9157313Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:38:35.9157497Z %c131072_i32 = arith.constant 131072 : i32 2026-02-21T08:38:35.9157689Z %c131072_i64 = arith.constant 131072 : i64 2026-02-21T08:38:35.9157866Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:38:35.9158186Z %0 = tt.make_tensor_descriptor %arg0, 
[%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<16x256xf32>> 2026-02-21T08:38:35.9158630Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<16x256xf32>> 2026-02-21T08:38:35.9159352Z %2 = tt.get_program_id x : i32 2026-02-21T08:38:35.9159564Z scf.for %arg5 = %2 to %c256_i32 step %c4736_i32 : i32 { 2026-02-21T08:38:35.9159782Z %3 = arith.muli %arg5, %c16_i32 : i32 2026-02-21T08:38:35.9160013Z %4 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> 2026-02-21T08:38:35.9160252Z %5 = tt.splat %3 : i32 -> tensor<16xi32> 2026-02-21T08:38:35.9160449Z %6 = arith.addi %5, %4 : tensor<16xi32> 2026-02-21T08:38:35.9160634Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:38:35.9160939Z %7 = scf.for %arg6 = %c0_i32 to %c131072_i32 step %c512_i32 iter_args(%arg7 = %cst) -> (tensor<16x256xf32>) : i32 { 2026-02-21T08:38:35.9161351Z %11 = tt.descriptor_load %0[%3, %arg6] : !tt.tensordesc<tensor<16x256xf32>> -> tensor<16x256xf32> 2026-02-21T08:38:35.9161810Z %12 = tt.descriptor_load %1[%3, %arg6] : !tt.tensordesc<tensor<16x256xf32>> -> tensor<16x256xf32> 2026-02-21T08:38:35.9162342Z %13 = scf.if %arg3 -> (tensor<16x256xf32>) { 2026-02-21T08:38:35.9162719Z %21 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32> 2026-02-21T08:38:35.9173751Z %22 = arith.subf %12, %11 : tensor<16x256xf32> 2026-02-21T08:38:35.9178003Z %23 = arith.mulf %21, %22 : tensor<16x256xf32> 2026-02-21T08:38:35.9179334Z %24 = arith.addf %23, %cst : tensor<16x256xf32> 2026-02-21T08:38:35.9179574Z scf.yield %24 : tensor<16x256xf32> 2026-02-21T08:38:35.9179759Z } else { 2026-02-21T08:38:35.9179934Z %21 = tt.splat %arg4 : f32 -> tensor<16x256xf32> 2026-02-21T08:38:35.9180170Z %22 = arith.cmpf ogt, %12, %21 : tensor<16x256xf32> 2026-02-21T08:38:35.9180389Z %23 = arith.cmpf une, %12, %12 : tensor<16x256xf32> 2026-02-21T08:38:35.9180615Z %24 = arith.ori %22, %23 : tensor<16x256xi1> 2026-02-21T08:38:35.9180872Z %25 = arith.select 
%24, %12, %21 : tensor<16x256xi1>, tensor<16x256xf32> 2026-02-21T08:38:35.9181111Z %26 = math.log %25 : tensor<16x256xf32> 2026-02-21T08:38:35.9181315Z %27 = arith.subf %26, %11 : tensor<16x256xf32> 2026-02-21T08:38:35.9181515Z %28 = arith.mulf %12, %27 : tensor<16x256xf32> 2026-02-21T08:38:35.9181724Z %29 = arith.addf %28, %cst : tensor<16x256xf32> 2026-02-21T08:38:35.9182117Z scf.yield %29 : tensor<16x256xf32> 2026-02-21T08:38:35.9182294Z } 2026-02-21T08:38:35.9182443Z %14 = arith.addf %arg7, %13 : tensor<16x256xf32> 2026-02-21T08:38:35.9182650Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:38:35.9182847Z %15 = arith.muli %c256_i32, %c1_i32 : i32 2026-02-21T08:38:35.9183034Z %16 = arith.addi %arg6, %15 : i32 2026-02-21T08:38:35.9183316Z %17 = tt.descriptor_load %0[%3, %16] : !tt.tensordesc<tensor<16x256xf32>> -> tensor<16x256xf32> 2026-02-21T08:38:35.9183673Z %18 = tt.descriptor_load %1[%3, %16] : !tt.tensordesc<tensor<16x256xf32>> -> tensor<16x256xf32> 2026-02-21T08:38:35.9183960Z %19 = scf.if %arg3 -> (tensor<16x256xf32>) { 2026-02-21T08:38:35.9184331Z %21 = tt.extern_elementwise %18 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<16x256xf32>) -> tensor<16x256xf32> 2026-02-21T08:38:35.9184687Z %22 = arith.subf %18, %17 : tensor<16x256xf32> 2026-02-21T08:38:35.9184892Z %23 = arith.mulf %21, %22 : tensor<16x256xf32> 2026-02-21T08:38:35.9185093Z %24 = arith.addf %23, %cst : tensor<16x256xf32> 2026-02-21T08:38:35.9185290Z scf.yield %24 : tensor<16x256xf32> 2026-02-21T08:38:35.9185453Z } else { 2026-02-21T08:38:35.9185633Z %21 = tt.splat %arg4 : f32 -> tensor<16x256xf32> 2026-02-21T08:38:35.9185853Z %22 = arith.cmpf ogt, %18, %21 : tensor<16x256xf32> 2026-02-21T08:38:35.9186070Z %23 = arith.cmpf une, %18, %18 : tensor<16x256xf32> 2026-02-21T08:38:35.9186538Z %24 = arith.ori %22, %23 : tensor<16x256xi1> 2026-02-21T08:38:35.9186774Z %25 = arith.select %24, %18, %21 : tensor<16x256xi1>, tensor<16x256xf32> 2026-02-21T08:38:35.9187017Z %26 = math.log %25 : tensor<16x256xf32> 
2026-02-21T08:38:35.9187216Z %27 = arith.subf %26, %17 : tensor<16x256xf32> 2026-02-21T08:38:35.9187414Z %28 = arith.mulf %18, %27 : tensor<16x256xf32> 2026-02-21T08:38:35.9187621Z %29 = arith.addf %28, %cst : tensor<16x256xf32> 2026-02-21T08:38:35.9187814Z scf.yield %29 : tensor<16x256xf32> 2026-02-21T08:38:35.9187984Z } 2026-02-21T08:38:35.9188123Z %20 = arith.addf %14, %19 : tensor<16x256xf32> 2026-02-21T08:38:35.9188319Z scf.yield %20 : tensor<16x256xf32> 2026-02-21T08:38:35.9188503Z } {tt.flatten, tt.num_stages = 1 : i32} 2026-02-21T08:38:35.9188698Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:38:35.9188948Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:38:35.9189127Z %11 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:38:35.9189313Z tt.reduce.return %11 : f32 2026-02-21T08:38:35.9189493Z }) : (tensor<16x256xf32>) -> tensor<16xf32> 2026-02-21T08:38:35.9189722Z %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<16x!tt.ptr<f32>> 2026-02-21T08:38:35.9189978Z %10 = tt.addptr %9, %6 : tensor<16x!tt.ptr<f32>>, tensor<16xi32> 2026-02-21T08:38:35.9190216Z tt.store %10, %8 : tensor<16x!tt.ptr<f32>> 2026-02-21T08:38:35.9190455Z } {tt.flatten, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:38:35.9190662Z tt.return 2026-02-21T08:38:35.9190801Z } 2026-02-21T08:38:35.9190918Z } 2026-02-21T08:38:35.9190985Z 2026-02-21T08:38:35.9191042Z {-# 2026-02-21T08:38:35.9191164Z external_resources: { 2026-02-21T08:38:35.9191320Z mlir_reproducer: { 2026-02-21T08:38:35.9195570Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ 
max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=6}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=6}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=6}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:38:35.9199972Z disable_threading: false, 2026-02-21T08:38:35.9200139Z verify_each: true 2026-02-21T08:38:35.9200282Z } 2026-02-21T08:38:35.9200408Z } 2026-02-21T08:38:35.9200520Z #-} 2026-02-21T08:38:35.9200953Z /tmp/torchinductor_root/u5/cu5cqukk7327wqhkcxuvresgfsw2rabe3azrdt6atotry4lstclq.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:38:35.9202279Z /tmp/torchinductor_root/u5/cu5cqukk7327wqhkcxuvresgfsw2rabe3azrdt6atotry4lstclq.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above 
with Triton project.` 2026-02-21T08:38:35.9203278Z [45s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:38:35.9204390Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 16], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'last'], maxnreg=64, num_sm_multiplier=32, num_stages=6, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[True, True], range_num_stages=[3, 2], range_unroll_factors=[0, 2], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:38:35.9205455Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:38:35.9205723Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:38:46.7982836Z module { 2026-02-21T08:38:46.7983871Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:38:46.7984914Z %c16_i32 = arith.constant 16 : i32 2026-02-21T08:38:46.7985254Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:38:46.7985624Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:38:46.7985917Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:38:46.7986299Z %cst = arith.constant dense<0.000000e+00> : tensor<256x1024xf32> 2026-02-21T08:38:46.7986691Z %c256_i32 = arith.constant 256 : i32 2026-02-21T08:38:46.7987011Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:38:46.7987362Z %c131072_i32 = arith.constant 131072 : i32 2026-02-21T08:38:46.7987713Z %c131072_i64 = arith.constant 131072 : i64 2026-02-21T08:38:46.7988029Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:38:46.7988586Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : , > 2026-02-21T08:38:46.7989415Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c131072_i32], 
[%c131072_i64, %c1_i64] : , > 2026-02-21T08:38:46.7989974Z %2 = tt.get_program_id x : i32 2026-02-21T08:38:46.7990276Z %3 = arith.addi %2, %c1_i32 : i32 2026-02-21T08:38:46.7990569Z %4 = arith.minsi %3, %c16_i32 : i32 2026-02-21T08:38:46.7990872Z %5 = arith.subi %4, %2 : i32 2026-02-21T08:38:46.7991162Z %c1_i32_0 = arith.constant 1 : i32 2026-02-21T08:38:46.7991469Z %6 = arith.subi %c1_i32, %c1_i32_0 : i32 2026-02-21T08:38:46.7991774Z %7 = arith.addi %5, %6 : i32 2026-02-21T08:38:46.7992138Z %8 = arith.divui %7, %c1_i32 : i32 2026-02-21T08:38:46.7992461Z %c3_i32 = arith.constant 3 : i32 2026-02-21T08:38:46.7992752Z %9 = arith.remsi %8, %c3_i32 : i32 2026-02-21T08:38:46.7993060Z %10 = arith.subi %8, %9 : i32 2026-02-21T08:38:46.7993343Z %11 = arith.muli %10, %c1_i32 : i32 2026-02-21T08:38:46.7993651Z %12 = arith.addi %2, %11 : i32 2026-02-21T08:38:46.7993951Z %13 = arith.muli %c1_i32, %c3_i32 : i32 2026-02-21T08:38:46.7994280Z scf.for %arg5 = %2 to %12 step %13 : i32 { 2026-02-21T08:38:46.7994620Z %14 = arith.muli %arg5, %c256_i32 : i32 2026-02-21T08:38:46.7995018Z %15 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:38:46.7995462Z %16 = tt.splat %14 : i32 -> tensor<256xi32> 2026-02-21T08:38:46.7995797Z %17 = arith.addi %16, %15 : tensor<256xi32> 2026-02-21T08:38:46.7996376Z %18 = scf.for %arg6 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<256x1024xf32>) : i32 { 2026-02-21T08:38:46.7997150Z %42 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<256x1024xf32> 2026-02-21T08:38:46.7998232Z %43 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc> -> tensor<256x1024xf32> 2026-02-21T08:38:46.7998772Z %44 = scf.if %arg3 -> (tensor<256x1024xf32>) { 2026-02-21T08:38:46.7999451Z %46 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x1024xf32>) -> tensor<256x1024xf32> 2026-02-21T08:38:46.8000147Z %47 = arith.subf %43, %42 : tensor<256x1024xf32> 
2026-02-21T08:38:46.8000524Z %48 = arith.mulf %46, %47 : tensor<256x1024xf32> 2026-02-21T08:38:46.8000897Z %49 = arith.addf %48, %cst : tensor<256x1024xf32> 2026-02-21T08:38:46.8001261Z scf.yield %49 : tensor<256x1024xf32> 2026-02-21T08:38:46.8001554Z } else { 2026-02-21T08:38:46.8001838Z %46 = tt.splat %arg4 : f32 -> tensor<256x1024xf32> 2026-02-21T08:38:46.8002488Z %47 = arith.cmpf ogt, %43, %46 : tensor<256x1024xf32> 2026-02-21T08:38:46.8002891Z %48 = arith.cmpf une, %43, %43 : tensor<256x1024xf32> 2026-02-21T08:38:46.8003274Z %49 = arith.ori %47, %48 : tensor<256x1024xi1> 2026-02-21T08:38:46.8003701Z %50 = arith.select %49, %43, %46 : tensor<256x1024xi1>, tensor<256x1024xf32> 2026-02-21T08:38:46.8004154Z %51 = math.log %50 : tensor<256x1024xf32> 2026-02-21T08:38:46.8004502Z %52 = arith.subf %51, %42 : tensor<256x1024xf32> 2026-02-21T08:38:46.8004863Z %53 = arith.mulf %43, %52 : tensor<256x1024xf32> 2026-02-21T08:38:46.8005232Z %54 = arith.addf %53, %cst : tensor<256x1024xf32> 2026-02-21T08:38:46.8005578Z scf.yield %54 : tensor<256x1024xf32> 2026-02-21T08:38:46.8005866Z } 2026-02-21T08:38:46.8006110Z %45 = arith.addf %arg7, %44 : tensor<256x1024xf32> 2026-02-21T08:38:46.8006462Z scf.yield %45 : tensor<256x1024xf32> 2026-02-21T08:38:46.8006827Z } {tt.flatten, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:38:46.8007231Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:38:46.8007550Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:38:46.8007980Z %42 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:38:46.8017554Z tt.reduce.return %42 : f32 2026-02-21T08:38:46.8017931Z }) : (tensor<256x1024xf32>) -> tensor<256xf32> 2026-02-21T08:38:46.8018360Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<256x!tt.ptr> 2026-02-21T08:38:46.8018850Z %21 = tt.addptr %20, %17 : tensor<256x!tt.ptr>, tensor<256xi32> 2026-02-21T08:38:46.8019285Z tt.store %21, %19 : tensor<256x!tt.ptr> 2026-02-21T08:38:46.8019632Z %c1_i32_1 = arith.constant 1 : i32 2026-02-21T08:38:46.8019977Z %22 
= arith.muli %c1_i32, %c1_i32_1 : i32 2026-02-21T08:38:46.8020307Z %23 = arith.addi %arg5, %22 : i32 2026-02-21T08:38:46.8020627Z %24 = arith.muli %23, %c256_i32 : i32 2026-02-21T08:38:46.8021039Z %25 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:38:46.8021500Z %26 = tt.splat %24 : i32 -> tensor<256xi32> 2026-02-21T08:38:46.8021842Z %27 = arith.addi %26, %25 : tensor<256xi32> 2026-02-21T08:38:46.8022510Z %28 = scf.for %arg6 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<256x1024xf32>) : i32 { 2026-02-21T08:38:46.8023313Z %42 = tt.descriptor_load %0[%24, %arg6] : !tt.tensordesc> -> tensor<256x1024xf32> 2026-02-21T08:38:46.8024037Z %43 = tt.descriptor_load %1[%24, %arg6] : !tt.tensordesc> -> tensor<256x1024xf32> 2026-02-21T08:38:46.8024586Z %44 = scf.if %arg3 -> (tensor<256x1024xf32>) { 2026-02-21T08:38:46.8025274Z %46 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x1024xf32>) -> tensor<256x1024xf32> 2026-02-21T08:38:46.8025969Z %47 = arith.subf %43, %42 : tensor<256x1024xf32> 2026-02-21T08:38:46.8026350Z %48 = arith.mulf %46, %47 : tensor<256x1024xf32> 2026-02-21T08:38:46.8026946Z %49 = arith.addf %48, %cst : tensor<256x1024xf32> 2026-02-21T08:38:46.8027319Z scf.yield %49 : tensor<256x1024xf32> 2026-02-21T08:38:46.8027619Z } else { 2026-02-21T08:38:46.8027906Z %46 = tt.splat %arg4 : f32 -> tensor<256x1024xf32> 2026-02-21T08:38:46.8028304Z %47 = arith.cmpf ogt, %43, %46 : tensor<256x1024xf32> 2026-02-21T08:38:46.8028717Z %48 = arith.cmpf une, %43, %43 : tensor<256x1024xf32> 2026-02-21T08:38:46.8029110Z %49 = arith.ori %47, %48 : tensor<256x1024xi1> 2026-02-21T08:38:46.8029547Z %50 = arith.select %49, %43, %46 : tensor<256x1024xi1>, tensor<256x1024xf32> 2026-02-21T08:38:46.8030003Z %51 = math.log %50 : tensor<256x1024xf32> 2026-02-21T08:38:46.8030365Z %52 = arith.subf %51, %42 : tensor<256x1024xf32> 2026-02-21T08:38:46.8030741Z %53 = arith.mulf 
%43, %52 : tensor<256x1024xf32> 2026-02-21T08:38:46.8031213Z %54 = arith.addf %53, %cst : tensor<256x1024xf32> 2026-02-21T08:38:46.8031587Z scf.yield %54 : tensor<256x1024xf32> 2026-02-21T08:38:46.8031947Z } 2026-02-21T08:38:46.8032190Z %45 = arith.addf %arg7, %44 : tensor<256x1024xf32> 2026-02-21T08:38:46.8032553Z scf.yield %45 : tensor<256x1024xf32> 2026-02-21T08:38:46.8032945Z } {tt.flatten, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:38:46.8033356Z %29 = "tt.reduce"(%28) <{axis = 1 : i32}> ({ 2026-02-21T08:38:46.8033672Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:38:46.8033982Z %42 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:38:46.8034297Z tt.reduce.return %42 : f32 2026-02-21T08:38:46.8034629Z }) : (tensor<256x1024xf32>) -> tensor<256xf32> 2026-02-21T08:38:46.8035044Z %30 = tt.splat %arg2 : !tt.ptr -> tensor<256x!tt.ptr> 2026-02-21T08:38:46.8035510Z %31 = tt.addptr %30, %27 : tensor<256x!tt.ptr>, tensor<256xi32> 2026-02-21T08:38:46.8035940Z tt.store %31, %29 : tensor<256x!tt.ptr> 2026-02-21T08:38:46.8036284Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:38:46.8036606Z %32 = arith.muli %c1_i32, %c2_i32 : i32 2026-02-21T08:38:46.8036917Z %33 = arith.addi %arg5, %32 : i32 2026-02-21T08:38:46.8037227Z %34 = arith.muli %33, %c256_i32 : i32 2026-02-21T08:38:46.8037629Z %35 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:38:46.8038046Z %36 = tt.splat %34 : i32 -> tensor<256xi32> 2026-02-21T08:38:46.8038391Z %37 = arith.addi %36, %35 : tensor<256xi32> 2026-02-21T08:38:46.8038961Z %38 = scf.for %arg6 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<256x1024xf32>) : i32 { 2026-02-21T08:38:46.8039736Z %42 = tt.descriptor_load %0[%34, %arg6] : !tt.tensordesc> -> tensor<256x1024xf32> 2026-02-21T08:38:46.8040435Z %43 = tt.descriptor_load %1[%34, %arg6] : !tt.tensordesc> -> tensor<256x1024xf32> 2026-02-21T08:38:46.8040985Z %44 = scf.if %arg3 -> (tensor<256x1024xf32>) { 
2026-02-21T08:38:46.8041665Z %46 = tt.extern_elementwise %43 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x1024xf32>) -> tensor<256x1024xf32> 2026-02-21T08:38:46.8042411Z %47 = arith.subf %43, %42 : tensor<256x1024xf32> 2026-02-21T08:38:46.8042791Z %48 = arith.mulf %46, %47 : tensor<256x1024xf32> 2026-02-21T08:38:46.8043169Z %49 = arith.addf %48, %cst : tensor<256x1024xf32> 2026-02-21T08:38:46.8043536Z scf.yield %49 : tensor<256x1024xf32> 2026-02-21T08:38:46.8043845Z } else { 2026-02-21T08:38:46.8044119Z %46 = tt.splat %arg4 : f32 -> tensor<256x1024xf32> 2026-02-21T08:38:46.8044522Z %47 = arith.cmpf ogt, %43, %46 : tensor<256x1024xf32> 2026-02-21T08:38:46.8044931Z %48 = arith.cmpf une, %43, %43 : tensor<256x1024xf32> 2026-02-21T08:38:46.8045327Z %49 = arith.ori %47, %48 : tensor<256x1024xi1> 2026-02-21T08:38:46.8045854Z %50 = arith.select %49, %43, %46 : tensor<256x1024xi1>, tensor<256x1024xf32> 2026-02-21T08:38:46.8046305Z %51 = math.log %50 : tensor<256x1024xf32> 2026-02-21T08:38:46.8046666Z %52 = arith.subf %51, %42 : tensor<256x1024xf32> 2026-02-21T08:38:46.8047030Z %53 = arith.mulf %43, %52 : tensor<256x1024xf32> 2026-02-21T08:38:46.8047411Z %54 = arith.addf %53, %cst : tensor<256x1024xf32> 2026-02-21T08:38:46.8047762Z scf.yield %54 : tensor<256x1024xf32> 2026-02-21T08:38:46.8048060Z } 2026-02-21T08:38:46.8048313Z %45 = arith.addf %arg7, %44 : tensor<256x1024xf32> 2026-02-21T08:38:46.8048676Z scf.yield %45 : tensor<256x1024xf32> 2026-02-21T08:38:46.8049062Z } {tt.flatten, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:38:46.8049471Z %39 = "tt.reduce"(%38) <{axis = 1 : i32}> ({ 2026-02-21T08:38:46.8049804Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:38:46.8050190Z %42 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:38:46.8050519Z tt.reduce.return %42 : f32 2026-02-21T08:38:46.8050840Z }) : (tensor<256x1024xf32>) -> tensor<256xf32> 2026-02-21T08:38:46.8051243Z %40 = tt.splat %arg2 : !tt.ptr -> tensor<256x!tt.ptr> 
2026-02-21T08:38:46.8051711Z %41 = tt.addptr %40, %37 : tensor<256x!tt.ptr>, tensor<256xi32> 2026-02-21T08:38:46.8052200Z tt.store %41, %39 : tensor<256x!tt.ptr> 2026-02-21T08:38:46.8052608Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:38:46.8053005Z scf.for %arg5 = %12 to %4 step %c1_i32 : i32 { 2026-02-21T08:38:46.8053363Z %14 = arith.muli %arg5, %c256_i32 : i32 2026-02-21T08:38:46.8053764Z %15 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> 2026-02-21T08:38:46.8054203Z %16 = tt.splat %14 : i32 -> tensor<256xi32> 2026-02-21T08:38:46.8054534Z %17 = arith.addi %16, %15 : tensor<256xi32> 2026-02-21T08:38:46.8055111Z %18 = scf.for %arg6 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<256x1024xf32>) : i32 { 2026-02-21T08:38:46.8055889Z %22 = tt.descriptor_load %0[%14, %arg6] : !tt.tensordesc> -> tensor<256x1024xf32> 2026-02-21T08:38:46.8056591Z %23 = tt.descriptor_load %1[%14, %arg6] : !tt.tensordesc> -> tensor<256x1024xf32> 2026-02-21T08:38:46.8057133Z %24 = scf.if %arg3 -> (tensor<256x1024xf32>) { 2026-02-21T08:38:46.8057799Z %26 = tt.extern_elementwise %23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<256x1024xf32>) -> tensor<256x1024xf32> 2026-02-21T08:38:46.8058492Z %27 = arith.subf %23, %22 : tensor<256x1024xf32> 2026-02-21T08:38:46.8058861Z %28 = arith.mulf %26, %27 : tensor<256x1024xf32> 2026-02-21T08:38:46.8059234Z %29 = arith.addf %28, %cst : tensor<256x1024xf32> 2026-02-21T08:38:46.8059600Z scf.yield %29 : tensor<256x1024xf32> 2026-02-21T08:38:46.8059902Z } else { 2026-02-21T08:38:46.8060186Z %26 = tt.splat %arg4 : f32 -> tensor<256x1024xf32> 2026-02-21T08:38:46.8060578Z %27 = arith.cmpf ogt, %23, %26 : tensor<256x1024xf32> 2026-02-21T08:38:46.8060985Z %28 = arith.cmpf une, %23, %23 : tensor<256x1024xf32> 2026-02-21T08:38:46.8061375Z %29 = arith.ori %27, %28 : tensor<256x1024xi1> 2026-02-21T08:38:46.8061804Z %30 = arith.select %29, %23, %26 : 
tensor<256x1024xi1>, tensor<256x1024xf32> 2026-02-21T08:38:46.8062289Z %31 = math.log %30 : tensor<256x1024xf32> 2026-02-21T08:38:46.8062649Z %32 = arith.subf %31, %22 : tensor<256x1024xf32> 2026-02-21T08:38:46.8063025Z %33 = arith.mulf %23, %32 : tensor<256x1024xf32> 2026-02-21T08:38:46.8063397Z %34 = arith.addf %33, %cst : tensor<256x1024xf32> 2026-02-21T08:38:46.8063772Z scf.yield %34 : tensor<256x1024xf32> 2026-02-21T08:38:46.8064075Z } 2026-02-21T08:38:46.8064349Z %25 = arith.addf %arg7, %24 : tensor<256x1024xf32> 2026-02-21T08:38:46.8064799Z scf.yield %25 : tensor<256x1024xf32> 2026-02-21T08:38:46.8065192Z } {tt.flatten, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:38:46.8065601Z %19 = "tt.reduce"(%18) <{axis = 1 : i32}> ({ 2026-02-21T08:38:46.8065925Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:38:46.8066240Z %22 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:38:46.8066561Z tt.reduce.return %22 : f32 2026-02-21T08:38:46.8066880Z }) : (tensor<256x1024xf32>) -> tensor<256xf32> 2026-02-21T08:38:46.8067300Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<256x!tt.ptr> 2026-02-21T08:38:46.8067768Z %21 = tt.addptr %20, %17 : tensor<256x!tt.ptr>, tensor<256xi32> 2026-02-21T08:38:46.8068196Z tt.store %21, %19 : tensor<256x!tt.ptr> 2026-02-21T08:38:46.8068595Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:38:46.8068955Z tt.return 2026-02-21T08:38:46.8069237Z } 2026-02-21T08:38:46.8069439Z } 2026-02-21T08:38:46.8069552Z 2026-02-21T08:38:46.8069639Z {-# 2026-02-21T08:38:46.8069845Z external_resources: { 2026-02-21T08:38:46.8070114Z mlir_reproducer: { 2026-02-21T08:38:46.8078310Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=32 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, 
tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:38:46.8086849Z disable_threading: false, 2026-02-21T08:38:46.8087128Z verify_each: true 2026-02-21T08:38:46.8087373Z } 2026-02-21T08:38:46.8087557Z } 2026-02-21T08:38:46.8087744Z #-} 2026-02-21T08:38:46.8088513Z /tmp/torchinductor_root/76/c76eam7pb3egtsi7cxrddznxiy4ady6vqx25xgh2kxuulbqy77jo.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:38:46.8090809Z /tmp/torchinductor_root/76/c76eam7pb3egtsi7cxrddznxiy4ady6vqx25xgh2kxuulbqy77jo.py:21:0: note: Pipeline failed while executing 
[`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:38:46.8092719Z [56s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:38:46.8094745Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 256], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['first', 'last'], num_sm_multiplier=128, num_stages=3, num_warps=32, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[4, 4], range_unroll_factors=[3, 0], range_warp_specializes=[False, True]), static_shapes=True) 2026-02-21T08:38:46.8096635Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:38:46.8097090Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:38:53.8150925Z module attributes {ttg.maxnreg = 64 : i32} { 2026-02-21T08:38:53.8152815Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:38:53.8153633Z %c32_i32 = arith.constant 32 : i32 2026-02-21T08:38:53.8153892Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:38:53.8154586Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:38:53.8154803Z %c9472_i32 = arith.constant 9472 : i32 2026-02-21T08:38:53.8155028Z %cst = arith.constant dense<131072> : tensor<128x1xi32> 2026-02-21T08:38:53.8155282Z %cst_0 = arith.constant dense<0.000000e+00> : tensor<128x512xf32> 2026-02-21T08:38:53.8155527Z %c128_i32 = arith.constant 128 : i32 2026-02-21T08:38:53.8155716Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:38:53.8155902Z %c131072_i32 = arith.constant 131072 : i32 2026-02-21T08:38:53.8156098Z 
%c131072_i64 = arith.constant 131072 : i64 2026-02-21T08:38:53.8156280Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:38:53.8156604Z %0 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : , > 2026-02-21T08:38:53.8156928Z %1 = tt.get_program_id x : i32 2026-02-21T08:38:53.8157147Z scf.for %arg5 = %1 to %c32_i32 step %c9472_i32 : i32 { 2026-02-21T08:38:53.8157360Z %2 = arith.muli %arg5, %c128_i32 : i32 2026-02-21T08:38:53.8157604Z %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32> 2026-02-21T08:38:53.8157869Z %4 = tt.splat %2 : i32 -> tensor<128xi32> 2026-02-21T08:38:53.8158057Z %5 = arith.addi %4, %3 : tensor<128xi32> 2026-02-21T08:38:53.8158257Z %c130560_i32 = arith.constant 130560 : i32 2026-02-21T08:38:53.8158445Z %c1536_i32 = arith.constant 1536 : i32 2026-02-21T08:38:53.8158773Z %6 = scf.for %arg6 = %c0_i32 to %c130560_i32 step %c1536_i32 iter_args(%arg7 = %cst_0) -> (tensor<128x512xf32>) : i32 { 2026-02-21T08:38:53.8159146Z %25 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:38:53.8159400Z %26 = tt.splat %arg6 : i32 -> tensor<512xi32> 2026-02-21T08:38:53.8159600Z %27 = arith.addi %26, %25 : tensor<512xi32> 2026-02-21T08:38:53.8159850Z %28 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:38:53.8160109Z %29 = arith.muli %28, %cst : tensor<128x1xi32> 2026-02-21T08:38:53.8160364Z %30 = tt.expand_dims %27 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:38:53.8160650Z %31 = tt.broadcast %29 : tensor<128x1xi32> -> tensor<128x512xi32> 2026-02-21T08:38:53.8160907Z %32 = tt.broadcast %30 : tensor<1x512xi32> -> tensor<128x512xi32> 2026-02-21T08:38:53.8161148Z %33 = arith.addi %31, %32 : tensor<128x512xi32> 2026-02-21T08:38:53.8161384Z %34 = tt.splat %arg0 : !tt.ptr -> tensor<128x512x!tt.ptr> 2026-02-21T08:38:53.8161664Z %35 = tt.addptr %34, %33 : tensor<128x512x!tt.ptr>, tensor<128x512xi32> 
2026-02-21T08:38:53.8162041Z %36 = tt.load %35 evictionPolicy = evict_last : tensor<128x512x!tt.ptr> 2026-02-21T08:38:53.8162400Z %37 = tt.descriptor_load %0[%2, %arg6] : !tt.tensordesc> -> tensor<128x512xf32> 2026-02-21T08:38:53.8162714Z %38 = scf.if %arg3 -> (tensor<128x512xf32>) { 2026-02-21T08:38:53.8163091Z %74 = tt.extern_elementwise %37 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x512xf32>) -> tensor<128x512xf32> 2026-02-21T08:38:53.8163634Z %75 = arith.subf %37, %36 : tensor<128x512xf32> 2026-02-21T08:38:53.8163850Z %76 = arith.mulf %74, %75 : tensor<128x512xf32> 2026-02-21T08:38:53.8164080Z %77 = arith.addf %76, %cst_0 : tensor<128x512xf32> 2026-02-21T08:38:53.8164301Z scf.yield %77 : tensor<128x512xf32> 2026-02-21T08:38:53.8164478Z } else { 2026-02-21T08:38:53.8164656Z %74 = tt.splat %arg4 : f32 -> tensor<128x512xf32> 2026-02-21T08:38:53.8164879Z %75 = arith.cmpf ogt, %37, %74 : tensor<128x512xf32> 2026-02-21T08:38:53.8165109Z %76 = arith.cmpf une, %37, %37 : tensor<128x512xf32> 2026-02-21T08:38:53.8165316Z %77 = arith.ori %75, %76 : tensor<128x512xi1> 2026-02-21T08:38:53.8165579Z %78 = arith.select %77, %37, %74 : tensor<128x512xi1>, tensor<128x512xf32> 2026-02-21T08:38:53.8165903Z %79 = math.log %78 : tensor<128x512xf32> 2026-02-21T08:38:53.8166108Z %80 = arith.subf %79, %36 : tensor<128x512xf32> 2026-02-21T08:38:53.8166317Z %81 = arith.mulf %37, %80 : tensor<128x512xf32> 2026-02-21T08:38:53.8166521Z %82 = arith.addf %81, %cst_0 : tensor<128x512xf32> 2026-02-21T08:38:53.8166722Z scf.yield %82 : tensor<128x512xf32> 2026-02-21T08:38:53.8166885Z } 2026-02-21T08:38:53.8167038Z %39 = arith.addf %arg7, %38 : tensor<128x512xf32> 2026-02-21T08:38:53.8167247Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:38:53.8167432Z %40 = arith.muli %c512_i32, %c1_i32 : i32 2026-02-21T08:38:53.8167623Z %41 = arith.addi %arg6, %40 : i32 2026-02-21T08:38:53.8167843Z %42 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 
2026-02-21T08:38:53.8168087Z %43 = tt.splat %41 : i32 -> tensor<512xi32> 2026-02-21T08:38:53.8168280Z %44 = arith.addi %43, %42 : tensor<512xi32> 2026-02-21T08:38:53.8168531Z %45 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:38:53.8168797Z %46 = arith.muli %45, %cst : tensor<128x1xi32> 2026-02-21T08:38:53.8169040Z %47 = tt.expand_dims %44 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:38:53.8169329Z %48 = tt.broadcast %46 : tensor<128x1xi32> -> tensor<128x512xi32> 2026-02-21T08:38:53.8169586Z %49 = tt.broadcast %47 : tensor<1x512xi32> -> tensor<128x512xi32> 2026-02-21T08:38:53.8169824Z %50 = arith.addi %48, %49 : tensor<128x512xi32> 2026-02-21T08:38:53.8170051Z %51 = tt.splat %arg0 : !tt.ptr -> tensor<128x512x!tt.ptr> 2026-02-21T08:38:53.8170323Z %52 = tt.addptr %51, %50 : tensor<128x512x!tt.ptr>, tensor<128x512xi32> 2026-02-21T08:38:53.8170615Z %53 = tt.load %52 evictionPolicy = evict_last : tensor<128x512x!tt.ptr> 2026-02-21T08:38:53.8170943Z %54 = tt.descriptor_load %0[%2, %41] : !tt.tensordesc> -> tensor<128x512xf32> 2026-02-21T08:38:53.8171237Z %55 = scf.if %arg3 -> (tensor<128x512xf32>) { 2026-02-21T08:38:53.8171595Z %74 = tt.extern_elementwise %54 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x512xf32>) -> tensor<128x512xf32> 2026-02-21T08:38:53.8172009Z %75 = arith.subf %54, %53 : tensor<128x512xf32> 2026-02-21T08:38:53.8172215Z %76 = arith.mulf %74, %75 : tensor<128x512xf32> 2026-02-21T08:38:53.8172421Z %77 = arith.addf %76, %cst_0 : tensor<128x512xf32> 2026-02-21T08:38:53.8172625Z scf.yield %77 : tensor<128x512xf32> 2026-02-21T08:38:53.8172792Z } else { 2026-02-21T08:38:53.8172953Z %74 = tt.splat %arg4 : f32 -> tensor<128x512xf32> 2026-02-21T08:38:53.8173168Z %75 = arith.cmpf ogt, %54, %74 : tensor<128x512xf32> 2026-02-21T08:38:53.8173391Z %76 = arith.cmpf une, %54, %54 : tensor<128x512xf32> 2026-02-21T08:38:53.8173603Z %77 = arith.ori %75, %76 : 
tensor<128x512xi1> 2026-02-21T08:38:53.8173911Z %78 = arith.select %77, %54, %74 : tensor<128x512xi1>, tensor<128x512xf32> 2026-02-21T08:38:53.8174159Z %79 = math.log %78 : tensor<128x512xf32> 2026-02-21T08:38:53.8174360Z %80 = arith.subf %79, %53 : tensor<128x512xf32> 2026-02-21T08:38:53.8174568Z %81 = arith.mulf %54, %80 : tensor<128x512xf32> 2026-02-21T08:38:53.8174775Z %82 = arith.addf %81, %cst_0 : tensor<128x512xf32> 2026-02-21T08:38:53.8174983Z scf.yield %82 : tensor<128x512xf32> 2026-02-21T08:38:53.8175152Z } 2026-02-21T08:38:53.8175307Z %56 = arith.addf %39, %55 : tensor<128x512xf32> 2026-02-21T08:38:53.8175508Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:38:53.8175695Z %57 = arith.muli %c512_i32, %c2_i32 : i32 2026-02-21T08:38:53.8175910Z %58 = arith.addi %arg6, %57 : i32 2026-02-21T08:38:53.8176143Z %59 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:38:53.8176439Z %60 = tt.splat %58 : i32 -> tensor<512xi32> 2026-02-21T08:38:53.8176646Z %61 = arith.addi %60, %59 : tensor<512xi32> 2026-02-21T08:38:53.8176882Z %62 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 2026-02-21T08:38:53.8177151Z %63 = arith.muli %62, %cst : tensor<128x1xi32> 2026-02-21T08:38:53.8177396Z %64 = tt.expand_dims %61 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:38:53.8177690Z %65 = tt.broadcast %63 : tensor<128x1xi32> -> tensor<128x512xi32> 2026-02-21T08:38:53.8177957Z %66 = tt.broadcast %64 : tensor<1x512xi32> -> tensor<128x512xi32> 2026-02-21T08:38:53.8178191Z %67 = arith.addi %65, %66 : tensor<128x512xi32> 2026-02-21T08:38:53.8178427Z %68 = tt.splat %arg0 : !tt.ptr -> tensor<128x512x!tt.ptr> 2026-02-21T08:38:53.8178693Z %69 = tt.addptr %68, %67 : tensor<128x512x!tt.ptr>, tensor<128x512xi32> 2026-02-21T08:38:53.8178989Z %70 = tt.load %69 evictionPolicy = evict_last : tensor<128x512x!tt.ptr> 2026-02-21T08:38:53.8179326Z %71 = tt.descriptor_load %0[%2, %58] : !tt.tensordesc> -> 
tensor<128x512xf32> 2026-02-21T08:38:53.8179610Z %72 = scf.if %arg3 -> (tensor<128x512xf32>) { 2026-02-21T08:38:53.8179972Z %74 = tt.extern_elementwise %71 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x512xf32>) -> tensor<128x512xf32> 2026-02-21T08:38:53.8180331Z %75 = arith.subf %71, %70 : tensor<128x512xf32> 2026-02-21T08:38:53.8180536Z %76 = arith.mulf %74, %75 : tensor<128x512xf32> 2026-02-21T08:38:53.8180740Z %77 = arith.addf %76, %cst_0 : tensor<128x512xf32> 2026-02-21T08:38:53.8180943Z scf.yield %77 : tensor<128x512xf32> 2026-02-21T08:38:53.8181115Z } else { 2026-02-21T08:38:53.8181271Z %74 = tt.splat %arg4 : f32 -> tensor<128x512xf32> 2026-02-21T08:38:53.8181496Z %75 = arith.cmpf ogt, %71, %74 : tensor<128x512xf32> 2026-02-21T08:38:53.8181711Z %76 = arith.cmpf une, %71, %71 : tensor<128x512xf32> 2026-02-21T08:38:53.8181959Z %77 = arith.ori %75, %76 : tensor<128x512xi1> 2026-02-21T08:38:53.8182193Z %78 = arith.select %77, %71, %74 : tensor<128x512xi1>, tensor<128x512xf32> 2026-02-21T08:38:53.8182435Z %79 = math.log %78 : tensor<128x512xf32> 2026-02-21T08:38:53.8182636Z %80 = arith.subf %79, %70 : tensor<128x512xf32> 2026-02-21T08:38:53.8182833Z %81 = arith.mulf %71, %80 : tensor<128x512xf32> 2026-02-21T08:38:53.8183044Z %82 = arith.addf %81, %cst_0 : tensor<128x512xf32> 2026-02-21T08:38:53.8183239Z scf.yield %82 : tensor<128x512xf32> 2026-02-21T08:38:53.8183410Z } 2026-02-21T08:38:53.8183549Z %73 = arith.addf %56, %72 : tensor<128x512xf32> 2026-02-21T08:38:53.8183742Z scf.yield %73 : tensor<128x512xf32> 2026-02-21T08:38:53.8183919Z } {tt.num_stages = 1 : i32} 2026-02-21T08:38:53.8184139Z %7 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32> 2026-02-21T08:38:53.8184445Z %8 = tt.splat %c130560_i32 : i32 -> tensor<512xi32> 2026-02-21T08:38:53.8184638Z %9 = arith.addi %8, %7 : tensor<512xi32> 2026-02-21T08:38:53.8184877Z %10 = tt.expand_dims %5 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32> 
2026-02-21T08:38:53.8185126Z %11 = arith.muli %10, %cst : tensor<128x1xi32> 2026-02-21T08:38:53.8185367Z %12 = tt.expand_dims %9 {axis = 0 : i32} : tensor<512xi32> -> tensor<1x512xi32> 2026-02-21T08:38:53.8185707Z %13 = tt.broadcast %11 : tensor<128x1xi32> -> tensor<128x512xi32> 2026-02-21T08:38:53.8185960Z %14 = tt.broadcast %12 : tensor<1x512xi32> -> tensor<128x512xi32> 2026-02-21T08:38:53.8186193Z %15 = arith.addi %13, %14 : tensor<128x512xi32> 2026-02-21T08:38:53.8186418Z %16 = tt.splat %arg0 : !tt.ptr<f32> -> tensor<128x512x!tt.ptr<f32>> 2026-02-21T08:38:53.8186841Z %17 = tt.addptr %16, %15 : tensor<128x512x!tt.ptr<f32>>, tensor<128x512xi32> 2026-02-21T08:38:53.8187248Z %18 = tt.load %17 evictionPolicy = evict_last : tensor<128x512x!tt.ptr<f32>> 2026-02-21T08:38:53.8187776Z %19 = tt.descriptor_load %0[%2, %c130560_i32] : !tt.tensordesc<tensor<128x512xf32>> -> tensor<128x512xf32> 2026-02-21T08:38:53.8188201Z %20 = scf.if %arg3 -> (tensor<128x512xf32>) { 2026-02-21T08:38:53.8188611Z %25 = tt.extern_elementwise %19 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<128x512xf32>) -> tensor<128x512xf32> 2026-02-21T08:38:53.8188986Z %26 = arith.subf %19, %18 : tensor<128x512xf32> 2026-02-21T08:38:53.8189193Z %27 = arith.mulf %25, %26 : tensor<128x512xf32> 2026-02-21T08:38:53.8189418Z %28 = arith.addf %27, %cst_0 : tensor<128x512xf32> 2026-02-21T08:38:53.8189622Z scf.yield %28 : tensor<128x512xf32> 2026-02-21T08:38:53.8189804Z } else { 2026-02-21T08:38:53.8189968Z %25 = tt.splat %arg4 : f32 -> tensor<128x512xf32> 2026-02-21T08:38:53.8190185Z %26 = arith.cmpf ogt, %19, %25 : tensor<128x512xf32> 2026-02-21T08:38:53.8190467Z %27 = arith.cmpf une, %19, %19 : tensor<128x512xf32> 2026-02-21T08:38:53.8190749Z %28 = arith.ori %26, %27 : tensor<128x512xi1> 2026-02-21T08:38:53.8191074Z %29 = arith.select %28, %19, %25 : tensor<128x512xi1>, tensor<128x512xf32> 2026-02-21T08:38:53.8191423Z %30 = math.log %29 : tensor<128x512xf32> 2026-02-21T08:38:53.8191689Z %31 = arith.subf %30, %18 : 
tensor<128x512xf32> 2026-02-21T08:38:53.8191990Z %32 = arith.mulf %19, %31 : tensor<128x512xf32> 2026-02-21T08:38:53.8192257Z %33 = arith.addf %32, %cst_0 : tensor<128x512xf32> 2026-02-21T08:38:53.8192520Z scf.yield %33 : tensor<128x512xf32> 2026-02-21T08:38:53.8192736Z } 2026-02-21T08:38:53.8192930Z %21 = arith.addf %6, %20 : tensor<128x512xf32> 2026-02-21T08:38:53.8193183Z %22 = "tt.reduce"(%21) <{axis = 1 : i32}> ({ 2026-02-21T08:38:53.8193429Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:38:53.8193657Z %25 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:38:53.8193904Z tt.reduce.return %25 : f32 2026-02-21T08:38:53.8194149Z }) : (tensor<128x512xf32>) -> tensor<128xf32> 2026-02-21T08:38:53.8194440Z %23 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<128x!tt.ptr<f32>> 2026-02-21T08:38:53.8194781Z %24 = tt.addptr %23, %5 : tensor<128x!tt.ptr<f32>>, tensor<128xi32> 2026-02-21T08:38:53.8195082Z tt.store %24, %22 : tensor<128x!tt.ptr<f32>> 2026-02-21T08:38:53.8195425Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32, tt.warp_specialize} 2026-02-21T08:38:53.8195737Z tt.return 2026-02-21T08:38:53.8195899Z } 2026-02-21T08:38:53.8196049Z } 2026-02-21T08:38:53.8196136Z 2026-02-21T08:38:53.8196197Z {-# 2026-02-21T08:38:53.8196362Z external_resources: { 2026-02-21T08:38:53.8196554Z mlir_reproducer: { 2026-02-21T08:38:53.8201108Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, 
tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=4}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=4}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=4}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:38:53.8206156Z disable_threading: false, 2026-02-21T08:38:53.8206326Z verify_each: true 2026-02-21T08:38:53.8206463Z } 2026-02-21T08:38:53.8206584Z } 2026-02-21T08:38:53.8206692Z #-} 2026-02-21T08:38:53.8207122Z /tmp/torchinductor_root/eo/ceodyuyt7ys736a4ooyt23geblu5dmn43rrkpaapfpn63ttb3xo2.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:38:53.8208308Z /tmp/torchinductor_root/eo/ceodyuyt7ys736a4ooyt23geblu5dmn43rrkpaapfpn63ttb3xo2.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPUPartitionLoops` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:38:53.8209264Z [63s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:38:53.8210364Z Config: @helion.kernel(config=helion.Config(block_sizes=[512, 128], indexing=['pointer', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['last', 'first'], maxnreg=64, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[1, 3], range_unroll_factors=[0, 3], range_warp_specializes=[True, None]), static_shapes=True) 2026-02-21T08:38:53.8211356Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:38:53.8211608Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:38:54.5017307Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 3.7 configs/s 2026-02-21T08:38:54.5037505Z [64s] Adaptive compile timeout: 30s (90% percentile=13.3s, bounds=[30.0s, 30s]) 2026-02-21T08:38:58.3073061Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 270/270 68.6 configs/s 2026-02-21T08:38:58.4775243Z [68s] Initial random population of 100, 5 starting points: 2026-02-21T08:38:58.4775596Z error=11 2026-02-21T08:38:58.4775760Z timeout=9 2026-02-21T08:38:58.4775905Z ok=80 2026-02-21T08:38:58.4776057Z min=0.8233 2026-02-21T08:38:58.4776202Z mid=5.1088 2026-02-21T08:38:58.4776351Z max=380.8112 2026-02-21T08:38:58.4776522Z best={'block_sizes': [512, 2], 2026-02-21T08:38:58.4776787Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:38:58.4777052Z 'load_eviction_policies': ['', ''], 2026-02-21T08:38:58.4777313Z 'num_sm_multiplier': 128, 2026-02-21T08:38:58.4777896Z 'num_stages': 3, 2026-02-21T08:38:58.4778064Z 'num_warps': 1, 2026-02-21T08:38:58.4778219Z 'pid_type': 'persistent_blocked', 2026-02-21T08:38:58.4778394Z 'range_flattens': [None, False], 2026-02-21T08:38:58.4778574Z 'range_multi_buffers': [True, False], 2026-02-21T08:38:58.4778747Z 'range_num_stages': [1, 2], 2026-02-21T08:38:58.4778913Z 'range_unroll_factors': [4, 0], 2026-02-21T08:38:58.4779082Z 
'range_warp_specializes': [False, True]} 2026-02-21T08:38:58.4792935Z [68s] Fitting surrogate: 100 points, 100 targets 2026-02-21T08:38:59.6372205Z [69s] Generation 1 starting: 79 neighbors, 5 active search path(s) 2026-02-21T08:39:28.5617781Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 0.3 configs/s 2026-02-21T08:39:33.7881166Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 83/83 16.0 configs/s 2026-02-21T08:39:51.3974920Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━ 276/276 15.6 configs/s 2026-02-21T08:39:51.6301418Z [121s] Generation 1 complete: 2026-02-21T08:39:51.6303705Z ok=85 2026-02-21T08:39:51.6305057Z min=0.8048 2026-02-21T08:39:51.6305237Z mid=0.9575 2026-02-21T08:39:51.6305362Z max=4.8620 2026-02-21T08:39:51.6305514Z best={'block_sizes': [1024, 2], 2026-02-21T08:39:51.6305739Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:39:51.6305976Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:39:51.6306160Z 'num_stages': 7, 2026-02-21T08:39:51.6306294Z 'num_warps': 8, 2026-02-21T08:39:51.6306433Z 'pid_type': 'flat', 2026-02-21T08:39:51.6306582Z 'range_flattens': [None, None], 2026-02-21T08:39:51.6306762Z 'range_multi_buffers': [None, True], 2026-02-21T08:39:51.6306937Z 'range_num_stages': [0, 0], 2026-02-21T08:39:51.6307104Z 'range_unroll_factors': [0, 0], 2026-02-21T08:39:51.6307273Z 'range_warp_specializes': [None, True]} 2026-02-21T08:39:51.6322916Z [121s] Fitting surrogate: 185 points, 185 targets 2026-02-21T08:39:52.6770220Z [122s] Generation 2 starting: 74 neighbors, 5 active search path(s) 2026-02-21T08:39:58.0359501Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 28.6 configs/s 2026-02-21T08:40:01.1090317Z module { 2026-02-21T08:40:01.1094561Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes 
{noinline = false} { 2026-02-21T08:40:01.1095534Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:40:01.1095773Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:40:01.1096015Z %cst = arith.constant dense<0.000000e+00> : tensor<4x1024xf32> 2026-02-21T08:40:01.1096252Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:40:01.1096433Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:40:01.1096632Z %c131072_i32 = arith.constant 131072 : i32 2026-02-21T08:40:01.1096821Z %c131072_i64 = arith.constant 131072 : i64 2026-02-21T08:40:01.1097042Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:40:01.1097370Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>> 2026-02-21T08:40:01.1097816Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>> 2026-02-21T08:40:01.1098132Z %2 = tt.get_program_id x : i32 2026-02-21T08:40:01.1098307Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T08:40:01.1098527Z %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:40:01.1098760Z %5 = tt.splat %3 : i32 -> tensor<4xi32> 2026-02-21T08:40:01.1098946Z %6 = arith.addi %5, %4 : tensor<4xi32> 2026-02-21T08:40:01.1099254Z %7 = scf.for %arg5 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg6 = %cst) -> (tensor<4x1024xf32>) : i32 { 2026-02-21T08:40:01.1099673Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32> 2026-02-21T08:40:01.1100050Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32> 2026-02-21T08:40:01.1100771Z %13 = scf.if %arg3 -> (tensor<4x1024xf32>) { 2026-02-21T08:40:01.1101142Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32> 2026-02-21T08:40:01.1101510Z %16 = arith.subf %12, %11 : tensor<4x1024xf32> 2026-02-21T08:40:01.1101719Z %17 = arith.mulf %15, %16 : tensor<4x1024xf32> 2026-02-21T08:40:01.1102114Z %18 = arith.addf %17, %cst 
: tensor<4x1024xf32> 2026-02-21T08:40:01.1102312Z scf.yield %18 : tensor<4x1024xf32> 2026-02-21T08:40:01.1102488Z } else { 2026-02-21T08:40:01.1102651Z %15 = tt.splat %arg4 : f32 -> tensor<4x1024xf32> 2026-02-21T08:40:01.1102885Z %16 = arith.cmpf ogt, %12, %15 : tensor<4x1024xf32> 2026-02-21T08:40:01.1103108Z %17 = arith.cmpf une, %12, %12 : tensor<4x1024xf32> 2026-02-21T08:40:01.1103441Z %18 = arith.ori %16, %17 : tensor<4x1024xi1> 2026-02-21T08:40:01.1103687Z %19 = arith.select %18, %12, %15 : tensor<4x1024xi1>, tensor<4x1024xf32> 2026-02-21T08:40:01.1103940Z %20 = math.log %19 : tensor<4x1024xf32> 2026-02-21T08:40:01.1104145Z %21 = arith.subf %20, %11 : tensor<4x1024xf32> 2026-02-21T08:40:01.1104343Z %22 = arith.mulf %12, %21 : tensor<4x1024xf32> 2026-02-21T08:40:01.1104549Z %23 = arith.addf %22, %cst : tensor<4x1024xf32> 2026-02-21T08:40:01.1104736Z scf.yield %23 : tensor<4x1024xf32> 2026-02-21T08:40:01.1104906Z } 2026-02-21T08:40:01.1105045Z %14 = arith.addf %arg6, %13 : tensor<4x1024xf32> 2026-02-21T08:40:01.1105238Z scf.yield %14 : tensor<4x1024xf32> 2026-02-21T08:40:01.1105556Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:40:01.1105873Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:40:01.1106061Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:40:01.1106232Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:40:01.1106415Z tt.reduce.return %11 : f32 2026-02-21T08:40:01.1106589Z }) : (tensor<4x1024xf32>) -> tensor<4xf32> 2026-02-21T08:40:01.1106812Z %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>> 2026-02-21T08:40:01.1107093Z %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr<f32>>, tensor<4xi32> 2026-02-21T08:40:01.1107311Z tt.store %10, %8 : tensor<4x!tt.ptr<f32>> 2026-02-21T08:40:01.1107490Z tt.return 2026-02-21T08:40:01.1107608Z } 2026-02-21T08:40:01.1107745Z } 2026-02-21T08:40:01.1107814Z 2026-02-21T08:40:01.1107864Z {-# 2026-02-21T08:40:01.1107996Z external_resources: { 
2026-02-21T08:40:01.1108151Z mlir_reproducer: { 2026-02-21T08:40:01.1112644Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:40:01.1117223Z disable_threading: false, 2026-02-21T08:40:01.1117387Z verify_each: true 2026-02-21T08:40:01.1117524Z } 
2026-02-21T08:40:01.1117644Z } 2026-02-21T08:40:01.1117748Z #-} 2026-02-21T08:40:01.1118166Z /tmp/torchinductor_root/fj/cfjjwp2d5ymqupw6azxeyvudspbobtgib35peteoayy42bebl5rg.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:40:01.1119386Z /tmp/torchinductor_root/fj/cfjjwp2d5ymqupw6azxeyvudspbobtgib35peteoayy42bebl5rg.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:40:01.1120347Z [130s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:40:01.1121298Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:40:01.1122202Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:40:01.1122448Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:40:01.3004798Z module { 2026-02-21T08:40:01.3005416Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:40:01.3006032Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:40:01.3006221Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:40:01.3006452Z %cst = arith.constant dense<0.000000e+00> : tensor<4x1024xf32> 2026-02-21T08:40:01.3006674Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:40:01.3006862Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:40:01.3007051Z %c131072_i32 = arith.constant 131072 : i32 2026-02-21T08:40:01.3007247Z %c131072_i64 = arith.constant 131072 : i64 2026-02-21T08:40:01.3007426Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:40:01.3007750Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>> 2026-02-21T08:40:01.3008203Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>> 2026-02-21T08:40:01.3008509Z %2 = tt.get_program_id x : i32 2026-02-21T08:40:01.3008688Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T08:40:01.3008902Z %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:40:01.3009140Z %5 = tt.splat %3 : i32 -> tensor<4xi32> 2026-02-21T08:40:01.3009328Z %6 = arith.addi %5, %4 : tensor<4xi32> 2026-02-21T08:40:01.3009633Z %7 = scf.for %arg5 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg6 = %cst) -> (tensor<4x1024xf32>) : i32 { 2026-02-21T08:40:01.3010044Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32> 2026-02-21T08:40:01.3010402Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32> 2026-02-21T08:40:01.3010691Z %13 = scf.if %arg3 -> (tensor<4x1024xf32>) { 2026-02-21T08:40:01.3011324Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, 
symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32> 2026-02-21T08:40:01.3011696Z %16 = arith.subf %12, %11 : tensor<4x1024xf32> 2026-02-21T08:40:01.3012073Z %17 = arith.mulf %15, %16 : tensor<4x1024xf32> 2026-02-21T08:40:01.3012280Z %18 = arith.addf %17, %cst : tensor<4x1024xf32> 2026-02-21T08:40:01.3012482Z scf.yield %18 : tensor<4x1024xf32> 2026-02-21T08:40:01.3012648Z } else { 2026-02-21T08:40:01.3012813Z %15 = tt.splat %arg4 : f32 -> tensor<4x1024xf32> 2026-02-21T08:40:01.3013029Z %16 = arith.cmpf ogt, %12, %15 : tensor<4x1024xf32> 2026-02-21T08:40:01.3013257Z %17 = arith.cmpf une, %12, %12 : tensor<4x1024xf32> 2026-02-21T08:40:01.3013468Z %18 = arith.ori %16, %17 : tensor<4x1024xi1> 2026-02-21T08:40:01.3013703Z %19 = arith.select %18, %12, %15 : tensor<4x1024xi1>, tensor<4x1024xf32> 2026-02-21T08:40:01.3014022Z %20 = math.log %19 : tensor<4x1024xf32> 2026-02-21T08:40:01.3014217Z %21 = arith.subf %20, %11 : tensor<4x1024xf32> 2026-02-21T08:40:01.3014411Z %22 = arith.mulf %12, %21 : tensor<4x1024xf32> 2026-02-21T08:40:01.3014609Z %23 = arith.addf %22, %cst : tensor<4x1024xf32> 2026-02-21T08:40:01.3014808Z scf.yield %23 : tensor<4x1024xf32> 2026-02-21T08:40:01.3014979Z } 2026-02-21T08:40:01.3015118Z %14 = arith.addf %arg6, %13 : tensor<4x1024xf32> 2026-02-21T08:40:01.3015312Z scf.yield %14 : tensor<4x1024xf32> 2026-02-21T08:40:01.3015614Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32, tt.warp_specialize} 2026-02-21T08:40:01.3015936Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:40:01.3016115Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:40:01.3016288Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:40:01.3016468Z tt.reduce.return %11 : f32 2026-02-21T08:40:01.3016653Z }) : (tensor<4x1024xf32>) -> tensor<4xf32> 2026-02-21T08:40:01.3016877Z %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>> 2026-02-21T08:40:01.3017123Z %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr<f32>>, tensor<4xi32> 
2026-02-21T08:40:01.3017351Z tt.store %10, %8 : tensor<4x!tt.ptr> 2026-02-21T08:40:01.3017522Z tt.return 2026-02-21T08:40:01.3017650Z } 2026-02-21T08:40:01.3017764Z } 2026-02-21T08:40:01.3017839Z 2026-02-21T08:40:01.3017887Z {-# 2026-02-21T08:40:01.3018009Z external_resources: { 2026-02-21T08:40:01.3018166Z mlir_reproducer: { 2026-02-21T08:40:01.3022391Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, 
sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:40:01.3026763Z disable_threading: false, 2026-02-21T08:40:01.3026954Z verify_each: true 2026-02-21T08:40:01.3027102Z } 2026-02-21T08:40:01.3027251Z } 2026-02-21T08:40:01.3027394Z #-} 2026-02-21T08:40:01.3027939Z /tmp/torchinductor_root/fj/cfjjwp2d5ymqupw6azxeyvudspbobtgib35peteoayy42bebl5rg.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:40:01.3029325Z /tmp/torchinductor_root/fj/cfjjwp2d5ymqupw6azxeyvudspbobtgib35peteoayy42bebl5rg.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:40:01.3030436Z [131s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T08:40:01.3031500Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], num_stages=8, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:40:01.3032480Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:40:01.3032744Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T08:40:02.6545168Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 16.4 configs/s 2026-02-21T08:40:18.8577900Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━ 276/276 17.0 configs/s 2026-02-21T08:40:19.0813938Z [148s] Generation 2 complete: 2026-02-21T08:40:19.0817765Z error=2 2026-02-21T08:40:19.0821635Z ok=77 2026-02-21T08:40:19.0826165Z min=0.8273 2026-02-21T08:40:19.0830869Z mid=0.9144 2026-02-21T08:40:19.0835437Z max=2.6329 2026-02-21T08:40:19.0835664Z best={'block_sizes': [1024, 1], 2026-02-21T08:40:19.0840574Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:40:19.0841987Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:40:19.0842221Z 'num_stages': 7, 2026-02-21T08:40:19.0842368Z 'num_warps': 2, 2026-02-21T08:40:19.0842524Z 'pid_type': 'flat', 2026-02-21T08:40:19.0842698Z 'range_flattens': [None, None], 2026-02-21T08:40:19.0842872Z 'range_multi_buffers': [None, None], 2026-02-21T08:40:19.0843057Z 'range_num_stages': [0, 0], 2026-02-21T08:40:19.0843215Z 'range_unroll_factors': [0, 0], 2026-02-21T08:40:19.0843394Z 'range_warp_specializes': [None, True]} 2026-02-21T08:40:19.0843693Z [148s] Fitting surrogate: 264 points, 264 targets 2026-02-21T08:40:20.0657962Z [149s] Generation 3 starting: 66 neighbors, 5 active search path(s) 2026-02-21T08:40:25.9010677Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 2.7 configs/s 2026-02-21T08:40:30.2407067Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 15.8 configs/s 2026-02-21T08:40:47.2681171Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━ 276/276 16.2 configs/s 2026-02-21T08:40:47.4999309Z [177s] Generation 3 complete: 2026-02-21T08:40:47.5002492Z ok=72 2026-02-21T08:40:47.5006534Z min=0.8499 2026-02-21T08:40:47.5010379Z mid=0.8683 2026-02-21T08:40:47.5015750Z max=3.1497 2026-02-21T08:40:47.5017757Z best={'block_sizes': [1024, 1], 2026-02-21T08:40:47.5018019Z 'indexing': ['pointer', 'pointer', 
'tensor_descriptor'], 2026-02-21T08:40:47.5018271Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:40:47.5018470Z 'num_stages': 7, 2026-02-21T08:40:47.5018614Z 'num_warps': 2, 2026-02-21T08:40:47.5018797Z 'pid_type': 'flat', 2026-02-21T08:40:47.5019290Z 'range_flattens': [None, None], 2026-02-21T08:40:47.5019482Z 'range_multi_buffers': [None, None], 2026-02-21T08:40:47.5019666Z 'range_num_stages': [0, 0], 2026-02-21T08:40:47.5019843Z 'range_unroll_factors': [0, 0], 2026-02-21T08:40:47.5020020Z 'range_warp_specializes': [None, True]} 2026-02-21T08:40:47.5020244Z [177s] Fitting surrogate: 336 points, 336 targets 2026-02-21T08:40:48.3869507Z [178s] Generation 4 starting: 61 neighbors, 5 active search path(s) 2026-02-21T08:40:51.6807098Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 20.7 configs/s 2026-02-21T08:40:52.4356041Z module { 2026-02-21T08:40:52.4356645Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:40:52.4361039Z %c512_i32 = arith.constant 512 : i32 2026-02-21T08:40:52.4362869Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:40:52.4363183Z %cst = arith.constant dense<0.000000e+00> : tensor<4x512xf32> 2026-02-21T08:40:52.4363411Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:40:52.4363596Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:40:52.4363785Z %c131072_i32 = arith.constant 131072 : i32 2026-02-21T08:40:52.4363982Z %c131072_i64 = arith.constant 131072 : i64 2026-02-21T08:40:52.4364167Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:40:52.4364483Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : , > 2026-02-21T08:40:52.4364923Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : , > 2026-02-21T08:40:52.4365229Z %2 = tt.get_program_id x : i32 
2026-02-21T08:40:52.4365406Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T08:40:52.4365629Z %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:40:52.4365870Z %5 = tt.splat %3 : i32 -> tensor<4xi32> 2026-02-21T08:40:52.4366076Z %6 = arith.addi %5, %4 : tensor<4xi32> 2026-02-21T08:40:52.4366383Z %7 = scf.for %arg5 = %c0_i32 to %c131072_i32 step %c512_i32 iter_args(%arg6 = %cst) -> (tensor<4x512xf32>) : i32 { 2026-02-21T08:40:52.4366794Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<4x512xf32>> -> tensor<4x512xf32> 2026-02-21T08:40:52.4367146Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<4x512xf32>> -> tensor<4x512xf32> 2026-02-21T08:40:52.4367430Z %13 = scf.if %arg3 -> (tensor<4x512xf32>) { 2026-02-21T08:40:52.4367858Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x512xf32>) -> tensor<4x512xf32> 2026-02-21T08:40:52.4372245Z %16 = arith.subf %12, %11 : tensor<4x512xf32> 2026-02-21T08:40:52.4376592Z %17 = arith.mulf %15, %16 : tensor<4x512xf32> 2026-02-21T08:40:52.4381720Z %18 = arith.addf %17, %cst : tensor<4x512xf32> 2026-02-21T08:40:52.4383476Z scf.yield %18 : tensor<4x512xf32> 2026-02-21T08:40:52.4383696Z } else { 2026-02-21T08:40:52.4383869Z %15 = tt.splat %arg4 : f32 -> tensor<4x512xf32> 2026-02-21T08:40:52.4384104Z %16 = arith.cmpf ogt, %12, %15 : tensor<4x512xf32> 2026-02-21T08:40:52.4384321Z %17 = arith.cmpf une, %12, %12 : tensor<4x512xf32> 2026-02-21T08:40:52.4384539Z %18 = arith.ori %16, %17 : tensor<4x512xi1> 2026-02-21T08:40:52.4384778Z %19 = arith.select %18, %12, %15 : tensor<4x512xi1>, tensor<4x512xf32> 2026-02-21T08:40:52.4385021Z %20 = math.log %19 : tensor<4x512xf32> 2026-02-21T08:40:52.4385223Z %21 = arith.subf %20, %11 : tensor<4x512xf32> 2026-02-21T08:40:52.4385425Z %22 = arith.mulf %12, %21 : tensor<4x512xf32> 2026-02-21T08:40:52.4385629Z %23 = arith.addf %22, %cst : tensor<4x512xf32> 2026-02-21T08:40:52.4385820Z scf.yield %23 : tensor<4x512xf32> 
2026-02-21T08:40:52.4386006Z } 2026-02-21T08:40:52.4386156Z %14 = arith.addf %arg6, %13 : tensor<4x512xf32> 2026-02-21T08:40:52.4386350Z scf.yield %14 : tensor<4x512xf32> 2026-02-21T08:40:52.4386661Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:40:52.4386988Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:40:52.4387176Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:40:52.4387343Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:40:52.4387527Z tt.reduce.return %11 : f32 2026-02-21T08:40:52.4387703Z }) : (tensor<4x512xf32>) -> tensor<4xf32> 2026-02-21T08:40:52.4387929Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<4x!tt.ptr> 2026-02-21T08:40:52.4388179Z %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr>, tensor<4xi32> 2026-02-21T08:40:52.4388402Z tt.store %10, %8 : tensor<4x!tt.ptr> 2026-02-21T08:40:52.4388578Z tt.return 2026-02-21T08:40:52.4388696Z } 2026-02-21T08:40:52.4389009Z } 2026-02-21T08:40:52.4389090Z 2026-02-21T08:40:52.4389138Z {-# 2026-02-21T08:40:52.4389269Z external_resources: { 2026-02-21T08:40:52.4389421Z mlir_reproducer: { 2026-02-21T08:40:52.4393760Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=8}, tritongpu-schedule-loops, 
tritongpu-automatic-warp-specialization{num-stages=8}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=8}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:40:52.4398154Z disable_threading: false, 2026-02-21T08:40:52.4398326Z verify_each: true 2026-02-21T08:40:52.4398563Z } 2026-02-21T08:40:52.4398675Z } 2026-02-21T08:40:52.4398792Z #-} 2026-02-21T08:40:52.4399199Z /tmp/torchinductor_root/av/cavgjzn4xdul3dvrx4qwewvpikdjh7yah47dgg3c7k5theizcaka.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:40:52.4400372Z /tmp/torchinductor_root/av/cavgjzn4xdul3dvrx4qwewvpikdjh7yah47dgg3c7k5theizcaka.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:40:52.4401324Z [182s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:40:52.4402343Z Config: @helion.kernel(config=helion.Config(block_sizes=[512, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=8, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:40:52.4403230Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:40:52.4403489Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:40:55.4619133Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 63/63 16.9 configs/s 2026-02-21T08:41:10.3109027Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 276/276 18.5 configs/s 2026-02-21T08:41:10.5334893Z [200s] Generation 4 complete: 2026-02-21T08:41:10.5336658Z error=2 2026-02-21T08:41:10.5336807Z ok=65 2026-02-21T08:41:10.5336944Z min=0.8469 2026-02-21T08:41:10.5337069Z mid=0.8653 2026-02-21T08:41:10.5337200Z max=3.2273 2026-02-21T08:41:10.5337342Z best={'block_sizes': [1024, 1], 2026-02-21T08:41:10.5337903Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:41:10.5338172Z 'load_eviction_policies': ['last', ''], 2026-02-21T08:41:10.5338361Z 'num_stages': 7, 2026-02-21T08:41:10.5338515Z 'num_warps': 1, 2026-02-21T08:41:10.5338653Z 'pid_type': 'flat', 2026-02-21T08:41:10.5338812Z 'range_flattens': [None, None], 2026-02-21T08:41:10.5338981Z 'range_multi_buffers': [None, False], 2026-02-21T08:41:10.5339161Z 'range_num_stages': [0, 1], 2026-02-21T08:41:10.5339317Z 'range_unroll_factors': [0, 0], 2026-02-21T08:41:10.5339497Z 'range_warp_specializes': [None, True]} 2026-02-21T08:41:10.5355016Z [200s] Fitting surrogate: 403 points, 403 targets 2026-02-21T08:41:11.3702252Z [201s] Generation 5 starting: 59 neighbors, 5 active search path(s) 2026-02-21T08:41:15.7046711Z Generation 5: precompiling 100% 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 6.1 configs/s 2026-02-21T08:41:19.5675653Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 62/62 16.2 configs/s 2026-02-21T08:41:34.0593390Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━ 276/276 19.0 configs/s 2026-02-21T08:41:34.2772428Z [224s] Generation 5 complete: 2026-02-21T08:41:34.2774361Z ok=64 2026-02-21T08:41:34.2774523Z min=0.8443 2026-02-21T08:41:34.2774650Z mid=0.9020 2026-02-21T08:41:34.2774774Z max=4.4452 2026-02-21T08:41:34.2774905Z best={'block_sizes': [1024, 1], 2026-02-21T08:41:34.2775134Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:41:34.2775360Z 'load_eviction_policies': ['', ''], 2026-02-21T08:41:34.2775526Z 'num_stages': 8, 2026-02-21T08:41:34.2775667Z 'num_warps': 1, 2026-02-21T08:41:34.2775801Z 'pid_type': 'flat', 2026-02-21T08:41:34.2775960Z 'range_flattens': [None, None], 2026-02-21T08:41:34.2776132Z 'range_multi_buffers': [None, False], 2026-02-21T08:41:34.2776312Z 'range_num_stages': [0, 1], 2026-02-21T08:41:34.2776472Z 'range_unroll_factors': [0, 0], 2026-02-21T08:41:34.2776651Z 'range_warp_specializes': [None, True]} 2026-02-21T08:41:34.2791712Z [224s] Fitting surrogate: 467 points, 467 targets 2026-02-21T08:41:35.1952760Z [225s] Generation 6 starting: 61 neighbors, 5 active search path(s) 2026-02-21T08:41:40.6770857Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 2.4 configs/s 2026-02-21T08:41:44.3835586Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 61/61 16.6 configs/s 2026-02-21T08:41:59.6353095Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 276/276 18.0 configs/s 2026-02-21T08:41:59.8609392Z [249s] Generation 6 complete: 2026-02-21T08:41:59.8609651Z ok=66 2026-02-21T08:41:59.8609837Z min=0.8591 2026-02-21T08:41:59.8609989Z mid=0.8776 2026-02-21T08:41:59.8610139Z max=3.6290 2026-02-21T08:41:59.8610308Z best={'block_sizes': [1024, 1], 2026-02-21T08:41:59.8610561Z 'indexing': ['pointer', 'pointer', 
'tensor_descriptor'], 2026-02-21T08:41:59.8610820Z 'load_eviction_policies': ['', ''], 2026-02-21T08:41:59.8615536Z 'num_stages': 8, 2026-02-21T08:41:59.8615735Z 'num_warps': 1, 2026-02-21T08:41:59.8615885Z 'pid_type': 'flat', 2026-02-21T08:41:59.8616356Z 'range_flattens': [None, None], 2026-02-21T08:41:59.8616564Z 'range_multi_buffers': [None, False], 2026-02-21T08:41:59.8616748Z 'range_num_stages': [0, 1], 2026-02-21T08:41:59.8616909Z 'range_unroll_factors': [0, 1], 2026-02-21T08:41:59.8617090Z 'range_warp_specializes': [None, True]} 2026-02-21T08:41:59.8630597Z [249s] Fitting surrogate: 533 points, 533 targets 2026-02-21T08:42:00.6755238Z [250s] Generation 7 starting: 50 neighbors, 5 active search path(s) 2026-02-21T08:42:03.5758623Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 30.8 configs/s 2026-02-21T08:42:04.3683478Z module { 2026-02-21T08:42:04.3684169Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:42:04.3684871Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:42:04.3685108Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:42:04.3685384Z %cst = arith.constant dense<0.000000e+00> : tensor<4x1024xf32> 2026-02-21T08:42:04.3685656Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:42:04.3685838Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:42:04.3686030Z %c131072_i32 = arith.constant 131072 : i32 2026-02-21T08:42:04.3686218Z %c131072_i64 = arith.constant 131072 : i64 2026-02-21T08:42:04.3686401Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:42:04.3686714Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>> 2026-02-21T08:42:04.3687156Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>> 2026-02-21T08:42:04.3687458Z %2 = tt.get_program_id x : i32 
2026-02-21T08:42:04.3687634Z %3 = arith.muli %2, %c4_i32 : i32 2026-02-21T08:42:04.3687852Z %4 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:42:04.3688088Z %5 = tt.splat %3 : i32 -> tensor<4xi32> 2026-02-21T08:42:04.3688281Z %6 = arith.addi %5, %4 : tensor<4xi32> 2026-02-21T08:42:04.3688586Z %7 = scf.for %arg5 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg6 = %cst) -> (tensor<4x1024xf32>) : i32 { 2026-02-21T08:42:04.3688993Z %11 = tt.descriptor_load %0[%3, %arg5] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32> 2026-02-21T08:42:04.3689352Z %12 = tt.descriptor_load %1[%3, %arg5] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32> 2026-02-21T08:42:04.3689641Z %13 = scf.if %arg3 -> (tensor<4x1024xf32>) { 2026-02-21T08:42:04.3690003Z %15 = tt.extern_elementwise %12 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32> 2026-02-21T08:42:04.3690362Z %16 = arith.subf %12, %11 : tensor<4x1024xf32> 2026-02-21T08:42:04.3690566Z %17 = arith.mulf %15, %16 : tensor<4x1024xf32> 2026-02-21T08:42:04.3690768Z %18 = arith.addf %17, %cst : tensor<4x1024xf32> 2026-02-21T08:42:04.3691310Z scf.yield %18 : tensor<4x1024xf32> 2026-02-21T08:42:04.3691481Z } else { 2026-02-21T08:42:04.3691645Z %15 = tt.splat %arg4 : f32 -> tensor<4x1024xf32> 2026-02-21T08:42:04.3692023Z %16 = arith.cmpf ogt, %12, %15 : tensor<4x1024xf32> 2026-02-21T08:42:04.3692249Z %17 = arith.cmpf une, %12, %12 : tensor<4x1024xf32> 2026-02-21T08:42:04.3692461Z %18 = arith.ori %16, %17 : tensor<4x1024xi1> 2026-02-21T08:42:04.3692696Z %19 = arith.select %18, %12, %15 : tensor<4x1024xi1>, tensor<4x1024xf32> 2026-02-21T08:42:04.3692945Z %20 = math.log %19 : tensor<4x1024xf32> 2026-02-21T08:42:04.3693136Z %21 = arith.subf %20, %11 : tensor<4x1024xf32> 2026-02-21T08:42:04.3693337Z %22 = arith.mulf %12, %21 : tensor<4x1024xf32> 2026-02-21T08:42:04.3693546Z %23 = arith.addf %22, %cst : tensor<4x1024xf32> 2026-02-21T08:42:04.3693737Z scf.yield %23 : 
tensor<4x1024xf32> 2026-02-21T08:42:04.3693907Z } 2026-02-21T08:42:04.3694151Z %14 = arith.addf %arg6, %13 : tensor<4x1024xf32> 2026-02-21T08:42:04.3694357Z scf.yield %14 : tensor<4x1024xf32> 2026-02-21T08:42:04.3694660Z } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32, tt.warp_specialize} 2026-02-21T08:42:04.3694984Z %8 = "tt.reduce"(%7) <{axis = 1 : i32}> ({ 2026-02-21T08:42:04.3695171Z ^bb0(%arg5: f32, %arg6: f32): 2026-02-21T08:42:04.3695342Z %11 = arith.addf %arg5, %arg6 : f32 2026-02-21T08:42:04.3695528Z tt.reduce.return %11 : f32 2026-02-21T08:42:04.3695707Z }) : (tensor<4x1024xf32>) -> tensor<4xf32> 2026-02-21T08:42:04.3695933Z %9 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>> 2026-02-21T08:42:04.3696179Z %10 = tt.addptr %9, %6 : tensor<4x!tt.ptr<f32>>, tensor<4xi32> 2026-02-21T08:42:04.3696413Z tt.store %10, %8 : tensor<4x!tt.ptr<f32>> 2026-02-21T08:42:04.3696584Z tt.return 2026-02-21T08:42:04.3696712Z } 2026-02-21T08:42:04.3696836Z } 2026-02-21T08:42:04.3696903Z 2026-02-21T08:42:04.3696955Z {-# 2026-02-21T08:42:04.3697082Z external_resources: { 2026-02-21T08:42:04.3697230Z mlir_reproducer: { 2026-02-21T08:42:04.3701444Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, 
tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:42:04.3705806Z disable_threading: false, 2026-02-21T08:42:04.3705970Z verify_each: true 2026-02-21T08:42:04.3706195Z } 2026-02-21T08:42:04.3706311Z } 2026-02-21T08:42:04.3706430Z #-} 2026-02-21T08:42:04.3706846Z /tmp/torchinductor_root/x7/cx77nrsi5h5ftxazbuunxznwcmli6oncasidotgp6j4mzzin64mi.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:42:04.3708118Z /tmp/torchinductor_root/x7/cx77nrsi5h5ftxazbuunxznwcmli6oncasidotgp6j4mzzin64mi.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:42:04.3709147Z [254s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:42:04.3710288Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'tensor_descriptor'], load_eviction_policies=['first', 'first'], num_stages=7, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:42:04.3711263Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:42:04.3711522Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:42:06.8374889Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 54/54 16.8 configs/s 2026-02-21T08:42:18.7811842Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━ 276/276 23.0 configs/s 2026-02-21T08:42:18.9866785Z [268s] Generation 7 complete: 2026-02-21T08:42:18.9870703Z error=2 2026-02-21T08:42:18.9875305Z ok=54 2026-02-21T08:42:18.9879607Z min=0.8439 2026-02-21T08:42:18.9884044Z mid=0.9154 2026-02-21T08:42:18.9889126Z max=4.9690 2026-02-21T08:42:18.9893815Z best={'block_sizes': [1024, 1], 2026-02-21T08:42:18.9894150Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:42:18.9898800Z 'load_eviction_policies': ['first', ''], 2026-02-21T08:42:18.9902558Z 'num_stages': 8, 2026-02-21T08:42:18.9907058Z 'num_warps': 1, 2026-02-21T08:42:18.9908888Z 'pid_type': 'flat', 2026-02-21T08:42:18.9909089Z 'range_flattens': [None, None], 2026-02-21T08:42:18.9909271Z 'range_multi_buffers': [None, False], 2026-02-21T08:42:18.9909458Z 'range_num_stages': [0, 1], 2026-02-21T08:42:18.9909620Z 'range_unroll_factors': [0, 1], 2026-02-21T08:42:18.9909800Z 'range_warp_specializes': [None, True]} 2026-02-21T08:42:18.9910084Z [268s] Fitting surrogate: 589 points, 589 targets 2026-02-21T08:42:19.8464143Z [269s] Generation 8 starting: 48 neighbors, 4 active search path(s) 2026-02-21T08:42:23.9218831Z Generation 8: precompiling 100% 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49/49 3.1 configs/s 2026-02-21T08:42:26.9104551Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 49/49 16.6 configs/s 2026-02-21T08:42:38.4361108Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━ 276/276 23.8 configs/s 2026-02-21T08:42:38.6367717Z [288s] Generation 8 complete: 2026-02-21T08:42:38.6373337Z ok=52 2026-02-21T08:42:38.6375028Z min=0.8320 2026-02-21T08:42:38.6375193Z mid=0.8786 2026-02-21T08:42:38.6375313Z max=2.6164 2026-02-21T08:42:38.6375458Z best={'block_sizes': [2048, 2], 2026-02-21T08:42:38.6375681Z 'indexing': ['pointer', 'pointer', 'tensor_descriptor'], 2026-02-21T08:42:38.6375915Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:42:38.6376089Z 'num_stages': 7, 2026-02-21T08:42:38.6376232Z 'num_warps': 32, 2026-02-21T08:42:38.6376373Z 'pid_type': 'flat', 2026-02-21T08:42:38.6376526Z 'range_flattens': [None, False], 2026-02-21T08:42:38.6376711Z 'range_multi_buffers': [None, False], 2026-02-21T08:42:38.6376891Z 'range_num_stages': [0, 0], 2026-02-21T08:42:38.6377090Z 'range_unroll_factors': [0, 0], 2026-02-21T08:42:38.6377263Z 'range_warp_specializes': [None, False]} 2026-02-21T08:42:38.6390875Z [288s] Fitting surrogate: 641 points, 641 targets 2026-02-21T08:42:39.2907877Z [289s] Generation 9 starting: 31 neighbors, 3 active search path(s) 2026-02-21T08:42:42.9391711Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 2.9 configs/s 2026-02-21T08:42:43.7004085Z module attributes {ttg.maxnreg = 64 : i32} { 2026-02-21T08:42:43.7008454Z tt.func public @_helion_kl_div_forward(%arg0: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg3: i1, %arg4: f32) attributes {noinline = false} { 2026-02-21T08:42:43.7012781Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T08:42:43.7014214Z %c0_i32 = arith.constant 0 : i32 2026-02-21T08:42:43.7014445Z %c148_i32 = arith.constant 148 : i32 
2026-02-21T08:42:43.7014674Z %cst = arith.constant dense<0.000000e+00> : tensor<4x1024xf32> 2026-02-21T08:42:43.7014907Z %c4_i32 = arith.constant 4 : i32 2026-02-21T08:42:43.7015087Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T08:42:43.7015596Z %c131072_i32 = arith.constant 131072 : i32 2026-02-21T08:42:43.7015820Z %c131072_i64 = arith.constant 131072 : i64 2026-02-21T08:42:43.7015998Z %c1_i64 = arith.constant 1 : i64 2026-02-21T08:42:43.7016317Z %0 = tt.make_tensor_descriptor %arg0, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>> 2026-02-21T08:42:43.7016757Z %1 = tt.make_tensor_descriptor %arg1, [%c4096_i32, %c131072_i32], [%c131072_i64, %c1_i64] : <f32>, <tensor<4x1024xf32>> 2026-02-21T08:42:43.7017069Z %2 = tt.get_program_id x : i32 2026-02-21T08:42:43.7017242Z %3 = arith.subi %c1024_i32, %2 : i32 2026-02-21T08:42:43.7017411Z %c1_i32 = arith.constant 1 : i32 2026-02-21T08:42:43.7017588Z %4 = arith.subi %c148_i32, %c1_i32 : i32 2026-02-21T08:42:43.7017758Z %5 = arith.addi %3, %4 : i32 2026-02-21T08:42:43.7017924Z %6 = arith.divui %5, %c148_i32 : i32 2026-02-21T08:42:43.7018089Z %c2_i32 = arith.constant 2 : i32 2026-02-21T08:42:43.7018259Z %7 = arith.remsi %6, %c2_i32 : i32 2026-02-21T08:42:43.7018427Z %8 = arith.subi %6, %7 : i32 2026-02-21T08:42:43.7018598Z %9 = arith.muli %8, %c148_i32 : i32 2026-02-21T08:42:43.7018770Z %10 = arith.addi %2, %9 : i32 2026-02-21T08:42:43.7018957Z %11 = arith.muli %c148_i32, %c2_i32 : i32 2026-02-21T08:42:43.7019153Z scf.for %arg5 = %2 to %10 step %11 : i32 { 2026-02-21T08:42:43.7019342Z %12 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:42:43.7019568Z %13 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:42:43.7019805Z %14 = tt.splat %12 : i32 -> tensor<4xi32> 2026-02-21T08:42:43.7019996Z %15 = arith.addi %14, %13 : tensor<4xi32> 2026-02-21T08:42:43.7020314Z %16 = scf.for %arg6 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<4x1024xf32>) : i32 { 2026-02-21T08:42:43.7020723Z %30 = 
tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32> 2026-02-21T08:42:43.7021095Z %31 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32> 2026-02-21T08:42:43.7021390Z %32 = scf.if %arg3 -> (tensor<4x1024xf32>) { 2026-02-21T08:42:43.7021760Z %34 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32> 2026-02-21T08:42:43.7022218Z %35 = arith.subf %31, %30 : tensor<4x1024xf32> 2026-02-21T08:42:43.7022425Z %36 = arith.mulf %34, %35 : tensor<4x1024xf32> 2026-02-21T08:42:43.7022643Z %37 = arith.addf %36, %cst : tensor<4x1024xf32> 2026-02-21T08:42:43.7022844Z scf.yield %37 : tensor<4x1024xf32> 2026-02-21T08:42:43.7023021Z } else { 2026-02-21T08:42:43.7023232Z %34 = tt.splat %arg4 : f32 -> tensor<4x1024xf32> 2026-02-21T08:42:43.7023460Z %35 = arith.cmpf ogt, %31, %34 : tensor<4x1024xf32> 2026-02-21T08:42:43.7023688Z %36 = arith.cmpf une, %31, %31 : tensor<4x1024xf32> 2026-02-21T08:42:43.7023897Z %37 = arith.ori %35, %36 : tensor<4x1024xi1> 2026-02-21T08:42:43.7024298Z %38 = arith.select %37, %31, %34 : tensor<4x1024xi1>, tensor<4x1024xf32> 2026-02-21T08:42:43.7024551Z %39 = math.log %38 : tensor<4x1024xf32> 2026-02-21T08:42:43.7024754Z %40 = arith.subf %39, %30 : tensor<4x1024xf32> 2026-02-21T08:42:43.7024963Z %41 = arith.mulf %31, %40 : tensor<4x1024xf32> 2026-02-21T08:42:43.7025164Z %42 = arith.addf %41, %cst : tensor<4x1024xf32> 2026-02-21T08:42:43.7025366Z scf.yield %42 : tensor<4x1024xf32> 2026-02-21T08:42:43.7025534Z } 2026-02-21T08:42:43.7025686Z %33 = arith.addf %arg7, %32 : tensor<4x1024xf32> 2026-02-21T08:42:43.7025877Z scf.yield %33 : tensor<4x1024xf32> 2026-02-21T08:42:43.7026134Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:42:43.7026407Z %17 = "tt.reduce"(%16) <{axis = 1 : i32}> ({ 2026-02-21T08:42:43.7026667Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:42:43.7026855Z %30 = arith.addf 
%arg6, %arg7 : f32 2026-02-21T08:42:43.7027032Z tt.reduce.return %30 : f32 2026-02-21T08:42:43.7027215Z }) : (tensor<4x1024xf32>) -> tensor<4xf32> 2026-02-21T08:42:43.7027436Z %18 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>> 2026-02-21T08:42:43.7027695Z %19 = tt.addptr %18, %15 : tensor<4x!tt.ptr<f32>>, tensor<4xi32> 2026-02-21T08:42:43.7027924Z tt.store %19, %17 : tensor<4x!tt.ptr<f32>> 2026-02-21T08:42:43.7028132Z %c1_i32_0 = arith.constant 1 : i32 2026-02-21T08:42:43.7028332Z %20 = arith.muli %c148_i32, %c1_i32_0 : i32 2026-02-21T08:42:43.7028519Z %21 = arith.addi %arg5, %20 : i32 2026-02-21T08:42:43.7028693Z %22 = arith.muli %21, %c4_i32 : i32 2026-02-21T08:42:43.7028911Z %23 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:42:43.7029140Z %24 = tt.splat %22 : i32 -> tensor<4xi32> 2026-02-21T08:42:43.7029330Z %25 = arith.addi %24, %23 : tensor<4xi32> 2026-02-21T08:42:43.7029637Z %26 = scf.for %arg6 = %c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<4x1024xf32>) : i32 { 2026-02-21T08:42:43.7030047Z %30 = tt.descriptor_load %0[%22, %arg6] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32> 2026-02-21T08:42:43.7030412Z %31 = tt.descriptor_load %1[%22, %arg6] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32> 2026-02-21T08:42:43.7030697Z %32 = scf.if %arg3 -> (tensor<4x1024xf32>) { 2026-02-21T08:42:43.7031059Z %34 = tt.extern_elementwise %31 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32> 2026-02-21T08:42:43.7031424Z %35 = arith.subf %31, %30 : tensor<4x1024xf32> 2026-02-21T08:42:43.7031629Z %36 = arith.mulf %34, %35 : tensor<4x1024xf32> 2026-02-21T08:42:43.7031827Z %37 = arith.addf %36, %cst : tensor<4x1024xf32> 2026-02-21T08:42:43.7032075Z scf.yield %37 : tensor<4x1024xf32> 2026-02-21T08:42:43.7032250Z } else { 2026-02-21T08:42:43.7032406Z %34 = tt.splat %arg4 : f32 -> tensor<4x1024xf32> 2026-02-21T08:42:43.7032631Z %35 = arith.cmpf ogt, %31, %34 : tensor<4x1024xf32> 
2026-02-21T08:42:43.7032848Z %36 = arith.cmpf une, %31, %31 : tensor<4x1024xf32> 2026-02-21T08:42:43.7033064Z %37 = arith.ori %35, %36 : tensor<4x1024xi1> 2026-02-21T08:42:43.7033301Z %38 = arith.select %37, %31, %34 : tensor<4x1024xi1>, tensor<4x1024xf32> 2026-02-21T08:42:43.7033546Z %39 = math.log %38 : tensor<4x1024xf32> 2026-02-21T08:42:43.7033748Z %40 = arith.subf %39, %30 : tensor<4x1024xf32> 2026-02-21T08:42:43.7033946Z %41 = arith.mulf %31, %40 : tensor<4x1024xf32> 2026-02-21T08:42:43.7034156Z %42 = arith.addf %41, %cst : tensor<4x1024xf32> 2026-02-21T08:42:43.7034351Z scf.yield %42 : tensor<4x1024xf32> 2026-02-21T08:42:43.7034526Z } 2026-02-21T08:42:43.7034674Z %33 = arith.addf %arg7, %32 : tensor<4x1024xf32> 2026-02-21T08:42:43.7034949Z scf.yield %33 : tensor<4x1024xf32> 2026-02-21T08:42:43.7035190Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:42:43.7035454Z %27 = "tt.reduce"(%26) <{axis = 1 : i32}> ({ 2026-02-21T08:42:43.7035641Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:42:43.7035812Z %30 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:42:43.7035997Z tt.reduce.return %30 : f32 2026-02-21T08:42:43.7036174Z }) : (tensor<4x1024xf32>) -> tensor<4xf32> 2026-02-21T08:42:43.7036397Z %28 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>> 2026-02-21T08:42:43.7036646Z %29 = tt.addptr %28, %25 : tensor<4x!tt.ptr<f32>>, tensor<4xi32> 2026-02-21T08:42:43.7036879Z tt.store %29, %27 : tensor<4x!tt.ptr<f32>> 2026-02-21T08:42:43.7037104Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:42:43.7037409Z scf.for %arg5 = %10 to %c1024_i32 step %c148_i32 : i32 { 2026-02-21T08:42:43.7037631Z %12 = arith.muli %arg5, %c4_i32 : i32 2026-02-21T08:42:43.7037850Z %13 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32> 2026-02-21T08:42:43.7038093Z %14 = tt.splat %12 : i32 -> tensor<4xi32> 2026-02-21T08:42:43.7038284Z %15 = arith.addi %14, %13 : tensor<4xi32> 2026-02-21T08:42:43.7038607Z %16 = scf.for %arg6 = 
%c0_i32 to %c131072_i32 step %c1024_i32 iter_args(%arg7 = %cst) -> (tensor<4x1024xf32>) : i32 { 2026-02-21T08:42:43.7039029Z %20 = tt.descriptor_load %0[%12, %arg6] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32> 2026-02-21T08:42:43.7039404Z %21 = tt.descriptor_load %1[%12, %arg6] : !tt.tensordesc<tensor<4x1024xf32>> -> tensor<4x1024xf32> 2026-02-21T08:42:43.7039705Z %22 = scf.if %arg3 -> (tensor<4x1024xf32>) { 2026-02-21T08:42:43.7040074Z %24 = tt.extern_elementwise %21 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<4x1024xf32>) -> tensor<4x1024xf32> 2026-02-21T08:42:43.7040447Z %25 = arith.subf %21, %20 : tensor<4x1024xf32> 2026-02-21T08:42:43.7040668Z %26 = arith.mulf %24, %25 : tensor<4x1024xf32> 2026-02-21T08:42:43.7040881Z %27 = arith.addf %26, %cst : tensor<4x1024xf32> 2026-02-21T08:42:43.7041095Z scf.yield %27 : tensor<4x1024xf32> 2026-02-21T08:42:43.7041263Z } else { 2026-02-21T08:42:43.7041428Z %24 = tt.splat %arg4 : f32 -> tensor<4x1024xf32> 2026-02-21T08:42:43.7041646Z %25 = arith.cmpf ogt, %21, %24 : tensor<4x1024xf32> 2026-02-21T08:42:43.7041912Z %26 = arith.cmpf une, %21, %21 : tensor<4x1024xf32> 2026-02-21T08:42:43.7042119Z %27 = arith.ori %25, %26 : tensor<4x1024xi1> 2026-02-21T08:42:43.7042360Z %28 = arith.select %27, %21, %24 : tensor<4x1024xi1>, tensor<4x1024xf32> 2026-02-21T08:42:43.7042605Z %29 = math.log %28 : tensor<4x1024xf32> 2026-02-21T08:42:43.7042799Z %30 = arith.subf %29, %20 : tensor<4x1024xf32> 2026-02-21T08:42:43.7043008Z %31 = arith.mulf %21, %30 : tensor<4x1024xf32> 2026-02-21T08:42:43.7043207Z %32 = arith.addf %31, %cst : tensor<4x1024xf32> 2026-02-21T08:42:43.7043408Z scf.yield %32 : tensor<4x1024xf32> 2026-02-21T08:42:43.7043571Z } 2026-02-21T08:42:43.7043720Z %23 = arith.addf %arg7, %22 : tensor<4x1024xf32> 2026-02-21T08:42:43.7043916Z scf.yield %23 : tensor<4x1024xf32> 2026-02-21T08:42:43.7044156Z } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32, tt.warp_specialize} 2026-02-21T08:42:43.7044424Z %17 = 
"tt.reduce"(%16) <{axis = 1 : i32}> ({ 2026-02-21T08:42:43.7044606Z ^bb0(%arg6: f32, %arg7: f32): 2026-02-21T08:42:43.7044780Z %20 = arith.addf %arg6, %arg7 : f32 2026-02-21T08:42:43.7044955Z tt.reduce.return %20 : f32 2026-02-21T08:42:43.7045139Z }) : (tensor<4x1024xf32>) -> tensor<4xf32> 2026-02-21T08:42:43.7045358Z %18 = tt.splat %arg2 : !tt.ptr<f32> -> tensor<4x!tt.ptr<f32>> 2026-02-21T08:42:43.7045608Z %19 = tt.addptr %18, %15 : tensor<4x!tt.ptr<f32>>, tensor<4xi32> 2026-02-21T08:42:43.7045906Z tt.store %19, %17 : tensor<4x!tt.ptr<f32>> 2026-02-21T08:42:43.7046118Z } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32} 2026-02-21T08:42:43.7046317Z tt.return 2026-02-21T08:42:43.7046436Z } 2026-02-21T08:42:43.7046553Z } 2026-02-21T08:42:43.7046619Z 2026-02-21T08:42:43.7046667Z {-# 2026-02-21T08:42:43.7046796Z external_resources: { 2026-02-21T08:42:43.7046947Z mlir_reproducer: { 2026-02-21T08:42:43.7051204Z pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc{hoist-out-of-if=false}, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=7}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=7}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=7}, tritongpu-combine-tensor-select-and-if, tritongpu-hoist-tmem-alloc{hoist-out-of-if=true}, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{ 
max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=100}, triton-nvidia-mma-lowering, sccp, cse, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})", 2026-02-21T08:42:43.7055618Z disable_threading: false, 2026-02-21T08:42:43.7055790Z verify_each: true 2026-02-21T08:42:43.7055933Z } 2026-02-21T08:42:43.7056060Z } 2026-02-21T08:42:43.7056173Z #-} 2026-02-21T08:42:43.7056606Z /tmp/torchinductor_root/g5/cg5nwozgggubgr466sqq55w5riqw7l6mhmslcmevptxxg2gvpnsu.py:21:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T08:42:43.7057844Z /tmp/torchinductor_root/g5/cg5nwozgggubgr466sqq55w5riqw7l6mhmslcmevptxxg2gvpnsu.py:21:0: note: Pipeline failed while executing [`TritonGPUAutomaticWarpSpecialization` on 'builtin.module' operation, `TritonGPULoadMMASpecialization` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T08:42:43.7058857Z [293s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T08:42:43.7059976Z Config: @helion.kernel(config=helion.Config(block_sizes=[1024, 4], indexing=['tensor_descriptor', 'tensor_descriptor', 'pointer'], load_eviction_policies=['last', 'first'], maxnreg=64, num_sm_multiplier=1, num_stages=7, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[1, 3], range_unroll_factors=[2, 1], range_warp_specializes=[None, True]), static_shapes=True) 2026-02-21T08:42:43.7060993Z Error: RuntimeError: PassManager::run failed 2026-02-21T08:42:43.7061254Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T08:42:44.9776074Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 32/32 16.0 configs/s 2026-02-21T08:42:52.1028582Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━ 279/279 38.6 configs/s 2026-02-21T08:42:52.2820885Z [302s] Generation 9 complete: 2026-02-21T08:42:52.2826009Z error=1 2026-02-21T08:42:52.2831065Z ok=33 2026-02-21T08:42:52.2832449Z min=0.8337 2026-02-21T08:42:52.2832610Z mid=0.8899 2026-02-21T08:42:52.2832727Z max=8.5688 2026-02-21T08:42:52.2832869Z best={'block_sizes': [2048, 2], 2026-02-21T08:42:52.2833110Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:42:52.2833369Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:42:52.2833547Z 'num_stages': 7, 2026-02-21T08:42:52.2833687Z 'num_warps': 32, 2026-02-21T08:42:52.2833824Z 'pid_type': 'flat', 2026-02-21T08:42:52.2833974Z 'range_flattens': [None, False], 2026-02-21T08:42:52.2834148Z 'range_multi_buffers': [None, False], 2026-02-21T08:42:52.2834321Z 'range_num_stages': [0, 0], 2026-02-21T08:42:52.2834485Z 'range_unroll_factors': [0, 0], 2026-02-21T08:42:52.2834655Z 'range_warp_specializes': [None, False]} 2026-02-21T08:42:52.2842979Z [302s] Fitting surrogate: 675 points, 675 targets 2026-02-21T08:42:52.8175470Z [302s] Generation 10 starting: 22 neighbors, 2 active search path(s) 
2026-02-21T08:42:54.2309654Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 26.7 configs/s 2026-02-21T08:42:55.5637298Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 22/22 17.1 configs/s 2026-02-21T08:43:00.3182865Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━ 279/279 57.4 configs/s 2026-02-21T08:43:00.4798522Z [310s] Generation 10 complete: 2026-02-21T08:43:00.4802870Z error=1 2026-02-21T08:43:00.4804278Z ok=23 2026-02-21T08:43:00.4804441Z min=0.8344 2026-02-21T08:43:00.4804574Z mid=0.8929 2026-02-21T08:43:00.4804702Z max=3.5113 2026-02-21T08:43:00.4804843Z best={'block_sizes': [2048, 2], 2026-02-21T08:43:00.4805102Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:43:00.4805357Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:43:00.4805548Z 'num_stages': 7, 2026-02-21T08:43:00.4805718Z 'num_warps': 32, 2026-02-21T08:43:00.4805868Z 'pid_type': 'flat', 2026-02-21T08:43:00.4806026Z 'range_flattens': [None, False], 2026-02-21T08:43:00.4806199Z 'range_multi_buffers': [None, False], 2026-02-21T08:43:00.4806380Z 'range_num_stages': [0, 0], 2026-02-21T08:43:00.4806539Z 'range_unroll_factors': [0, 0], 2026-02-21T08:43:00.4806718Z 'range_warp_specializes': [None, False]} 2026-02-21T08:43:00.4818307Z [310s] Fitting surrogate: 699 points, 699 targets 2026-02-21T08:43:00.8480201Z [310s] Generation 11 starting: 8 neighbors, 1 active search path(s) 2026-02-21T08:43:01.4792740Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8/8 54.7 configs/s 2026-02-21T08:43:01.9829196Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 8/8 17.4 configs/s 2026-02-21T08:43:03.9848713Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━ 279/279 131.1 configs/s 2026-02-21T08:43:04.1325024Z [313s] Generation 11 complete: 2026-02-21T08:43:04.1329396Z ok=9 2026-02-21T08:43:04.1331097Z min=0.8335 2026-02-21T08:43:04.1331260Z mid=0.8551 2026-02-21T08:43:04.1331386Z max=1.2503 
2026-02-21T08:43:04.1331515Z best={'block_sizes': [2048, 2], 2026-02-21T08:43:04.1331765Z 'indexing': ['tensor_descriptor', 'pointer', 'tensor_descriptor'], 2026-02-21T08:43:04.1332277Z 'load_eviction_policies': ['last', 'last'], 2026-02-21T08:43:04.1332458Z 'num_stages': 7, 2026-02-21T08:43:04.1332602Z 'num_warps': 32, 2026-02-21T08:43:04.1332735Z 'pid_type': 'flat', 2026-02-21T08:43:04.1332897Z 'range_flattens': [None, False], 2026-02-21T08:43:04.1333070Z 'range_multi_buffers': [None, False], 2026-02-21T08:43:04.1333250Z 'range_num_stages': [0, 0], 2026-02-21T08:43:04.1333408Z 'range_unroll_factors': [0, 0], 2026-02-21T08:43:04.1333587Z 'range_warp_specializes': [None, False]} 2026-02-21T08:43:04.1347478Z [313s] Fitting surrogate: 708 points, 708 targets 2026-02-21T08:43:04.4168295Z [314s] Autotuning complete in 314.3s after searching 668 configs. 2026-02-21T08:43:04.4168704Z One can hardcode the best config and skip autotuning with: 2026-02-21T08:43:04.4169974Z @helion.kernel(config=helion.Config(block_sizes=[2048, 2], indexing=['tensor_descriptor', 'pointer', 'tensor_descriptor'], load_eviction_policies=['last', 'last'], num_stages=7, num_warps=32, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[None, False]), static_shapes=True) 2026-02-21T08:43:04.4170784Z 2026-02-21T08:43:04.4171025Z [314s] Code of selected kernel: /tmp/torchinductor_root/xt/cxtwlkxgiqf2akxmt2wa6geuqdah3nhx4n5v3r2xiooicgcsodh2.py 2026-02-21T08:43:04.4355732Z from __future__ import annotations 2026-02-21T08:43:04.4355965Z 2026-02-21T08:43:04.4356125Z import torch 2026-02-21T08:43:04.4356260Z import triton 2026-02-21T08:43:04.4356465Z import triton.language as tl 2026-02-21T08:43:04.4356705Z from torch._inductor.runtime import triton_helpers 2026-02-21T08:43:04.4357259Z from torch._inductor.runtime.triton_helpers import math as tl_math 2026-02-21T08:43:04.4357594Z from 
torch._inductor.runtime.triton_compat import libdevice 2026-02-21T08:43:04.4357871Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T08:43:04.4358051Z 2026-02-21T08:43:04.4358121Z _BLOCK_SIZE_1 = tl.constexpr(2) 2026-02-21T08:43:04.4358292Z _BLOCK_SIZE_0 = tl.constexpr(2048) 2026-02-21T08:43:04.4358411Z 2026-02-21T08:43:04.4358465Z @triton.jit 2026-02-21T08:43:04.4358645Z def _helion_kl_div_forward(y_pred, y_true, loss, log_target, eps): 2026-02-21T08:43:04.4358960Z # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m): 2026-02-21T08:43:04.4359210Z pid_0 = tl.program_id(0) 2026-02-21T08:43:04.4359384Z offset_1 = pid_0 * _BLOCK_SIZE_1 2026-02-21T08:43:04.4359618Z indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32) 2026-02-21T08:43:04.4359919Z # src[kl_div.py:90]: loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32) 2026-02-21T08:43:04.4360234Z loss_sum = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T08:43:04.4360511Z # src[kl_div.py:92]: for tile_v in hl.tile(V, block_size=block_size_n): 2026-02-21T08:43:04.4360816Z # src[kl_div.py:93]: kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32) 2026-02-21T08:43:04.4361090Z # src[kl_div.py:92-112]: ... 
2026-02-21T08:43:04.4361415Z for offset_0 in tl.range(0, 131072, _BLOCK_SIZE_0, warp_specialize=False, disallow_acc_multi_buffer=True, flatten=False): 2026-02-21T08:43:04.4361811Z indices_0 = offset_0 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32) 2026-02-21T08:43:04.4362128Z loss_sum_copy = loss_sum 2026-02-21T08:43:04.4362305Z loss_sum_copy_0 = loss_sum_copy 2026-02-21T08:43:04.4362602Z # src[kl_div.py:93]: kl_loss = hl.zeros([block_size_m, block_size_n], dtype=torch.float32) 2026-02-21T08:43:04.4362906Z kl_loss = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_0], 0.0, tl.float32) 2026-02-21T08:43:04.4363177Z # src[kl_div.py:95]: y_pred_val = y_pred[tile_bt, tile_v] 2026-02-21T08:43:04.4363540Z y_pred_val = tl.load(y_pred + (indices_1[:, None] * 131072 + indices_0[None, :] * 1), None, eviction_policy='evict_last') 2026-02-21T08:43:04.4363888Z # src[kl_div.py:96]: y_true_val = y_true[tile_bt, tile_v] 2026-02-21T08:43:04.4364231Z y_true_val = tl.load(y_true + (indices_1[:, None] * 131072 + indices_0[None, :] * 1), None, eviction_policy='evict_last') 2026-02-21T08:43:04.4364552Z # src[kl_div.py:98]: if log_target: 2026-02-21T08:43:04.4364816Z # src[kl_div.py:99]: # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space 2026-02-21T08:43:04.4365116Z # src[kl_div.py:100]: prob_true = torch.exp(y_true_val) 2026-02-21T08:43:04.4365324Z # src[kl_div.py:98-106]: ... 
2026-02-21T08:43:04.4365514Z if log_target: 2026-02-21T08:43:04.4365671Z y_true_val_copy = y_true_val 2026-02-21T08:43:04.4365862Z y_pred_val_copy = y_pred_val 2026-02-21T08:43:04.4366040Z kl_loss_copy = kl_loss 2026-02-21T08:43:04.4366337Z y_true_val_copy_0 = y_true_val_copy 2026-02-21T08:43:04.4366535Z y_pred_val_copy_0 = y_pred_val_copy 2026-02-21T08:43:04.4366731Z kl_loss_copy_0 = kl_loss_copy 2026-02-21T08:43:04.4366964Z # src[kl_div.py:100]: prob_true = torch.exp(y_true_val) 2026-02-21T08:43:04.4367195Z v_0 = libdevice.exp(y_true_val_copy_0) 2026-02-21T08:43:04.4367450Z # src[kl_div.py:101]: kl_loss += prob_true * (y_true_val - y_pred_val) 2026-02-21T08:43:04.4367713Z v_1 = y_true_val_copy_0 - y_pred_val_copy_0 2026-02-21T08:43:04.4367912Z v_2 = v_0 * v_1 2026-02-21T08:43:04.4368078Z kl_loss = kl_loss_copy_0 + v_2 2026-02-21T08:43:04.4368273Z # src[kl_div.py:98]: if log_target: 2026-02-21T08:43:04.4368539Z # src[kl_div.py:99]: # KL(P || Q) = exp(y_true) * (y_true - y_pred) when both in log-space 2026-02-21T08:43:04.4368909Z # src[kl_div.py:100]: prob_true = torch.exp(y_true_val) 2026-02-21T08:43:04.4369137Z # src[kl_div.py:98-106]: ... 
2026-02-21T08:43:04.4369311Z _not = not log_target 2026-02-21T08:43:04.4369477Z if _not: 2026-02-21T08:43:04.4369625Z y_true_val_copy_1 = y_true_val 2026-02-21T08:43:04.4369811Z y_pred_val_copy_1 = y_pred_val 2026-02-21T08:43:04.4369986Z kl_loss_copy_1 = kl_loss 2026-02-21T08:43:04.4370177Z y_true_val_copy_1_0 = y_true_val_copy_1 2026-02-21T08:43:04.4370393Z y_pred_val_copy_1_0 = y_pred_val_copy_1 2026-02-21T08:43:04.4370597Z kl_loss_copy_1_0 = kl_loss_copy_1 2026-02-21T08:43:04.4370864Z # src[kl_div.py:105]: log_true = torch.log(torch.clamp(y_true_val, min=eps)) 2026-02-21T08:43:04.4371161Z v_4 = triton_helpers.maximum(y_true_val_copy_1_0, eps) 2026-02-21T08:43:04.4371384Z v_5 = tl_math.log(v_4) 2026-02-21T08:43:04.4371608Z # src[kl_div.py:106]: kl_loss += y_true_val * (log_true - y_pred_val) 2026-02-21T08:43:04.4371918Z v_6 = v_5 - y_pred_val_copy_1_0 2026-02-21T08:43:04.4372121Z v_7 = y_true_val_copy_1_0 * v_6 2026-02-21T08:43:04.4372313Z kl_loss = kl_loss_copy_1_0 + v_7 2026-02-21T08:43:04.4372524Z # src[kl_div.py:112]: loss_sum += kl_loss 2026-02-21T08:43:04.4372723Z loss_sum = loss_sum_copy_0 + kl_loss 2026-02-21T08:43:04.4372953Z # src[kl_div.py:115]: loss[tile_bt] = loss_sum.sum(dim=-1) 2026-02-21T08:43:04.4373191Z sum_1 = tl.cast(tl.sum(loss_sum, 1), tl.float32) 2026-02-21T08:43:04.4373417Z tl.store(loss + indices_1 * 1, sum_1, None) 2026-02-21T08:43:04.4373551Z 2026-02-21T08:43:04.4373854Z def kl_div_forward(y_pred: Tensor, y_true: Tensor, log_target: bool=False, reduction: str='batchmean', eps: float=1e-10, *, _launcher=_default_launcher): 2026-02-21T08:43:04.4374262Z """ 2026-02-21T08:43:04.4374402Z Compute KL Divergence loss. 
2026-02-21T08:43:04.4374508Z 2026-02-21T08:43:04.4374561Z Args: 2026-02-21T08:43:04.4374739Z y_pred: Input predictions in log-space, shape (BT, V) 2026-02-21T08:43:04.4375021Z y_true: Target values (probabilities or log-probabilities), shape (BT, V) 2026-02-21T08:43:04.4375351Z log_target: If True, y_true is in log-space; if False, y_true is probabilities 2026-02-21T08:43:04.4375662Z reduction: Reduction mode ('none', 'sum', 'mean', 'batchmean') 2026-02-21T08:43:04.4375901Z eps: Small value to avoid numerical issues 2026-02-21T08:43:04.4376039Z 2026-02-21T08:43:04.4376092Z Returns: 2026-02-21T08:43:04.4376225Z loss: KL divergence loss 2026-02-21T08:43:04.4376380Z """ 2026-02-21T08:43:04.4376515Z # src[kl_div.py:74]: BT, V = y_pred.shape 2026-02-21T08:43:04.4376698Z BT, V = y_pred.shape 2026-02-21T08:43:04.4376890Z # src[kl_div.py:75]: assert y_true.shape == y_pred.shape, ( 2026-02-21T08:43:04.4377156Z # src[kl_div.py:76]: f"Shape mismatch: {y_true.shape} != {y_pred.shape}" 2026-02-21T08:43:04.4377394Z # src[kl_div.py:77]: ) 2026-02-21T08:43:04.4377709Z assert y_true.shape == y_pred.shape, f'Shape mismatch: {y_true.shape} != {y_pred.shape}' 2026-02-21T08:43:04.4377991Z # src[kl_div.py:80]: if reduction == "none": 2026-02-21T08:43:04.4378206Z # src[kl_div.py:81]: loss = torch.zeros_like(y_pred) 2026-02-21T08:43:04.4378411Z # src[kl_div.py:82]: else: 2026-02-21T08:43:04.4378567Z # src[kl_div.py:80-83]: ... 
2026-02-21T08:43:04.4378728Z if reduction == 'none': 2026-02-21T08:43:04.4378913Z # src[kl_div.py:81]: loss = torch.zeros_like(y_pred) 2026-02-21T08:43:04.4379114Z loss = torch.zeros_like(y_pred) 2026-02-21T08:43:04.4379283Z else: 2026-02-21T08:43:04.4379498Z # src[kl_div.py:83]: loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device) 2026-02-21T08:43:04.4379824Z loss = torch.zeros((BT,), dtype=torch.float32, device=y_pred.device) 2026-02-21T08:43:04.4380168Z # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m): 2026-02-21T08:43:04.4380405Z _BLOCK_SIZE_1 = 2 2026-02-21T08:43:04.4380605Z # src[kl_div.py:89]: for tile_bt in hl.tile(BT, block_size=block_size_m): 2026-02-21T08:43:04.4380907Z # src[kl_div.py:90]: loss_sum = hl.zeros([tile_bt, block_size_n], dtype=torch.float32) 2026-02-21T08:43:04.4381167Z # src[kl_div.py:89-115]: ... 2026-02-21T08:43:04.4381512Z _launcher(_helion_kl_div_forward, (triton.cdiv(4096, _BLOCK_SIZE_1),), y_pred, y_true, loss, log_target, eps, num_warps=32, num_stages=7) 2026-02-21T08:43:04.4381925Z # src[kl_div.py:118]: if reduction == "batchmean": 2026-02-21T08:43:04.4382156Z # src[kl_div.py:119]: final_loss = torch.sum(loss) / BT 2026-02-21T08:43:04.4382391Z # src[kl_div.py:120]: elif reduction == "sum": 2026-02-21T08:43:04.4382590Z # src[kl_div.py:118-125]: ... 
2026-02-21T08:43:04.4382758Z if reduction == 'batchmean': 2026-02-21T08:43:04.4382960Z # src[kl_div.py:119]: final_loss = torch.sum(loss) / BT 2026-02-21T08:43:04.4383171Z final_loss = torch.sum(loss) / BT 2026-02-21T08:43:04.4383352Z elif reduction == 'sum': 2026-02-21T08:43:04.4383538Z # src[kl_div.py:121]: final_loss = torch.sum(loss, dim=0) 2026-02-21T08:43:04.4383756Z final_loss = torch.sum(loss, dim=0) 2026-02-21T08:43:04.4383938Z elif reduction == 'mean': 2026-02-21T08:43:04.4384132Z # src[kl_div.py:123]: final_loss = torch.sum(loss) / (BT * V) 2026-02-21T08:43:04.4384359Z final_loss = torch.sum(loss) / (BT * V) 2026-02-21T08:43:04.4384528Z else: 2026-02-21T08:43:04.4384670Z # src[kl_div.py:125]: final_loss = loss 2026-02-21T08:43:04.4384845Z final_loss = loss 2026-02-21T08:43:04.4385010Z # src[kl_div.py:127]: return final_loss 2026-02-21T08:43:04.4385179Z return final_loss 2026-02-21T08:43:05.7162929Z WARNING:tritonbench.utils.triton_op:Completed input ID 5: 2026-02-21T08:43:05.7164747Z (B, T, V) 2026-02-21T08:43:05.7164904Z ---------------- 2026-02-21T08:43:05.7165068Z (8, 512, 131072) 2026-02-21T08:43:05.7165164Z 2026-02-21T08:43:05.7165485Z 100%|██████████| 6/6 [21:38<00:00, 250.82s/it] 2026-02-21T08:43:05.7170044Z 100%|██████████| 6/6 [21:38<00:00, 216.35s/it] 2026-02-21T08:43:05.7179192Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmpcy2mmull.csv 2026-02-21T08:43:06.3849077Z (B, T, V) liger_kl_div-speedup liger_kl_div-accuracy torch_compile_kl_div-speedup torch_compile_kl_div-accuracy helion_kl_div_tritonbench-speedup helion_kl_div_tritonbench-accuracy 2026-02-21T08:43:06.3853430Z ---------------- ---------------------- ----------------------- ------------------------------ ------------------------------- ----------------------------------- ------------------------------------ 2026-02-21T08:43:06.3855460Z (8, 512, 4096) 3.11483 1 3.03404 1 3.36283 1 2026-02-21T08:43:06.3855969Z (8, 512, 8192) 3.50195 1 3.18197 1 3.94311 1 
2026-02-21T08:43:06.3856772Z (8, 512, 16384) 4.03663 1 3.25016 1 4.16629 1 2026-02-21T08:43:06.3857223Z (8, 512, 32768) 4.0457 1 3.15498 1 3.98916 1 2026-02-21T08:43:06.3857666Z (8, 512, 65536) 3.99696 1 3.44325 1 3.88139 1 2026-02-21T08:43:06.3858230Z (8, 512, 131072) 3.74916 1 3.49813 1 3.61356 1 2026-02-21T08:43:06.3858689Z average 3.74087 1 3.26042 1 3.82605 1 2026-02-21T08:43:08.4984482Z ✅ Completed benchmark for kernel: kl_div 2026-02-21T08:43:08.4994259Z [ 2026-02-21T08:43:08.4994499Z { 2026-02-21T08:43:08.4994663Z "benchmark": { 2026-02-21T08:43:08.4994841Z "name": "Helion Benchmark", 2026-02-21T08:43:08.4995100Z "extra_info": { 2026-02-21T08:43:08.4995255Z "device": "NVIDIA B200" 2026-02-21T08:43:08.4995476Z } 2026-02-21T08:43:08.4995604Z }, 2026-02-21T08:43:08.4995764Z "model": { 2026-02-21T08:43:08.4995917Z "name": "kl_div" 2026-02-21T08:43:08.4996055Z }, 2026-02-21T08:43:08.4996231Z "metric": { 2026-02-21T08:43:08.4996408Z "name": "triton_speedup", 2026-02-21T08:43:08.4996650Z "benchmark_values": [ 2026-02-21T08:43:08.4996812Z 3.1148345231848293, 2026-02-21T08:43:08.4996952Z 3.5019458271068076, 2026-02-21T08:43:08.4997100Z 4.036633414300421, 2026-02-21T08:43:08.4997243Z 4.045695489846165, 2026-02-21T08:43:08.4997392Z 3.99695912395291, 2026-02-21T08:43:08.4997532Z 3.7491585331985173 2026-02-21T08:43:08.4997672Z ] 2026-02-21T08:43:08.4997783Z }, 2026-02-21T08:43:08.4997904Z "shape": [ 2026-02-21T08:43:08.4998029Z "(8, 512, 4096)", 2026-02-21T08:43:08.4998170Z "(8, 512, 8192)", 2026-02-21T08:43:08.4998309Z "(8, 512, 16384)", 2026-02-21T08:43:08.4998446Z "(8, 512, 32768)", 2026-02-21T08:43:08.4998585Z "(8, 512, 65536)", 2026-02-21T08:43:08.4998716Z "(8, 512, 131072)" 2026-02-21T08:43:08.4998851Z ] 2026-02-21T08:43:08.4998962Z }, 2026-02-21T08:43:08.4999079Z { 2026-02-21T08:43:08.4999194Z "benchmark": { 2026-02-21T08:43:08.4999347Z "name": "Helion Benchmark", 2026-02-21T08:43:08.4999511Z "extra_info": { 2026-02-21T08:43:08.4999659Z "device": "NVIDIA B200" 
2026-02-21T08:43:08.4999806Z } 2026-02-21T08:43:08.4999925Z }, 2026-02-21T08:43:08.5000047Z "model": { 2026-02-21T08:43:08.5000172Z "name": "kl_div" 2026-02-21T08:43:08.5000312Z }, 2026-02-21T08:43:08.5000424Z "metric": { 2026-02-21T08:43:08.5000567Z "name": "triton_accuracy", 2026-02-21T08:43:08.5000728Z "benchmark_values": [ 2026-02-21T08:43:08.5000879Z 1.0, 2026-02-21T08:43:08.5000999Z 1.0, 2026-02-21T08:43:08.5001126Z 1.0, 2026-02-21T08:43:08.5001242Z 1.0, 2026-02-21T08:43:08.5001366Z 1.0, 2026-02-21T08:43:08.5001481Z 1.0 2026-02-21T08:43:08.5001602Z ] 2026-02-21T08:43:08.5001716Z }, 2026-02-21T08:43:08.5001827Z "shape": [ 2026-02-21T08:43:08.5002195Z "(8, 512, 4096)", 2026-02-21T08:43:08.5002334Z "(8, 512, 8192)", 2026-02-21T08:43:08.5002485Z "(8, 512, 16384)", 2026-02-21T08:43:08.5003095Z "(8, 512, 32768)", 2026-02-21T08:43:08.5003236Z "(8, 512, 65536)", 2026-02-21T08:43:08.5003367Z "(8, 512, 131072)" 2026-02-21T08:43:08.5003530Z ] 2026-02-21T08:43:08.5003661Z }, 2026-02-21T08:43:08.5003779Z { 2026-02-21T08:43:08.5003912Z "benchmark": { 2026-02-21T08:43:08.5004059Z "name": "Helion Benchmark", 2026-02-21T08:43:08.5004230Z "extra_info": { 2026-02-21T08:43:08.5004368Z "device": "NVIDIA B200" 2026-02-21T08:43:08.5004520Z } 2026-02-21T08:43:08.5004628Z }, 2026-02-21T08:43:08.5004749Z "model": { 2026-02-21T08:43:08.5004872Z "name": "kl_div" 2026-02-21T08:43:08.5005010Z }, 2026-02-21T08:43:08.5005122Z "metric": { 2026-02-21T08:43:08.5005270Z "name": "torch_compile_speedup", 2026-02-21T08:43:08.5005451Z "benchmark_values": [ 2026-02-21T08:43:08.5005597Z 3.034036673978964, 2026-02-21T08:43:08.5005743Z 3.181974407448598, 2026-02-21T08:43:08.5005978Z 3.2501613834972476, 2026-02-21T08:43:08.5006134Z 3.15497766353392, 2026-02-21T08:43:08.5006272Z 3.443246731995977, 2026-02-21T08:43:08.5006414Z 3.4981284879927594 2026-02-21T08:43:08.5006545Z ] 2026-02-21T08:43:08.5006662Z }, 2026-02-21T08:43:08.5006772Z "shape": [ 2026-02-21T08:43:08.5006902Z "(8, 512, 4096)", 
2026-02-21T08:43:08.5007043Z "(8, 512, 8192)", 2026-02-21T08:43:08.5007174Z "(8, 512, 16384)", 2026-02-21T08:43:08.5007316Z "(8, 512, 32768)", 2026-02-21T08:43:08.5007449Z "(8, 512, 65536)", 2026-02-21T08:43:08.5007588Z "(8, 512, 131072)" 2026-02-21T08:43:08.5007714Z ] 2026-02-21T08:43:08.5007832Z }, 2026-02-21T08:43:08.5007938Z { 2026-02-21T08:43:08.5008058Z "benchmark": { 2026-02-21T08:43:08.5008194Z "name": "Helion Benchmark", 2026-02-21T08:43:08.5008356Z "extra_info": { 2026-02-21T08:43:08.5008494Z "device": "NVIDIA B200" 2026-02-21T08:43:08.5008643Z } 2026-02-21T08:43:08.5008791Z }, 2026-02-21T08:43:08.5008904Z "model": { 2026-02-21T08:43:08.5009034Z "name": "kl_div" 2026-02-21T08:43:08.5009162Z }, 2026-02-21T08:43:08.5009282Z "metric": { 2026-02-21T08:43:08.5009420Z "name": "torch_compile_accuracy", 2026-02-21T08:43:08.5009598Z "benchmark_values": [ 2026-02-21T08:43:08.5009748Z 1.0, 2026-02-21T08:43:08.5009865Z 1.0, 2026-02-21T08:43:08.5009992Z 1.0, 2026-02-21T08:43:08.5010108Z 1.0, 2026-02-21T08:43:08.5010232Z 1.0, 2026-02-21T08:43:08.5010350Z 1.0 2026-02-21T08:43:08.5010472Z ] 2026-02-21T08:43:08.5010580Z }, 2026-02-21T08:43:08.5010701Z "shape": [ 2026-02-21T08:43:08.5010823Z "(8, 512, 4096)", 2026-02-21T08:43:08.5010962Z "(8, 512, 8192)", 2026-02-21T08:43:08.5011091Z "(8, 512, 16384)", 2026-02-21T08:43:08.5011230Z "(8, 512, 32768)", 2026-02-21T08:43:08.5011368Z "(8, 512, 65536)", 2026-02-21T08:43:08.5011501Z "(8, 512, 131072)" 2026-02-21T08:43:08.5011639Z ] 2026-02-21T08:43:08.5011750Z }, 2026-02-21T08:43:08.5011911Z { 2026-02-21T08:43:08.5012025Z "benchmark": { 2026-02-21T08:43:08.5012172Z "name": "Helion Benchmark", 2026-02-21T08:43:08.5012329Z "extra_info": { 2026-02-21T08:43:08.5012480Z "device": "NVIDIA B200" 2026-02-21T08:43:08.5012626Z } 2026-02-21T08:43:08.5012746Z }, 2026-02-21T08:43:08.5012859Z "model": { 2026-02-21T08:43:08.5012993Z "name": "kl_div" 2026-02-21T08:43:08.5013131Z }, 2026-02-21T08:43:08.5013259Z "metric": { 
2026-02-21T08:43:08.5013409Z "name": "helion_speedup", 2026-02-21T08:43:08.5013571Z "benchmark_values": [ 2026-02-21T08:43:08.5013732Z 3.3628256221234514, 2026-02-21T08:43:08.5013875Z 3.9431083058633747, 2026-02-21T08:43:08.5014024Z 4.1662855246496715, 2026-02-21T08:43:08.5014166Z 3.9891636594329802, 2026-02-21T08:43:08.5014316Z 3.881386565950531, 2026-02-21T08:43:08.5014459Z 3.6135570456389985 2026-02-21T08:43:08.5014607Z ] 2026-02-21T08:43:08.5014820Z }, 2026-02-21T08:43:08.5014943Z "shape": [ 2026-02-21T08:43:08.5015086Z "(8, 512, 4096)", 2026-02-21T08:43:08.5015232Z "(8, 512, 8192)", 2026-02-21T08:43:08.5015385Z "(8, 512, 16384)", 2026-02-21T08:43:08.5015530Z "(8, 512, 32768)", 2026-02-21T08:43:08.5015684Z "(8, 512, 65536)", 2026-02-21T08:43:08.5015829Z "(8, 512, 131072)" 2026-02-21T08:43:08.5015978Z ] 2026-02-21T08:43:08.5016100Z }, 2026-02-21T08:43:08.5016229Z { 2026-02-21T08:43:08.5016357Z "benchmark": { 2026-02-21T08:43:08.5016520Z "name": "Helion Benchmark", 2026-02-21T08:43:08.5016699Z "extra_info": { 2026-02-21T08:43:08.5016851Z "device": "NVIDIA B200" 2026-02-21T08:43:08.5017016Z } 2026-02-21T08:43:08.5017138Z }, 2026-02-21T08:43:08.5017274Z "model": { 2026-02-21T08:43:08.5017413Z "name": "kl_div" 2026-02-21T08:43:08.5017566Z }, 2026-02-21T08:43:08.5017692Z "metric": { 2026-02-21T08:43:08.5017922Z "name": "helion_accuracy", 2026-02-21T08:43:08.5018097Z "benchmark_values": [ 2026-02-21T08:43:08.5018258Z 1.0, 2026-02-21T08:43:08.5018386Z 1.0, 2026-02-21T08:43:08.5018520Z 1.0, 2026-02-21T08:43:08.5018652Z 1.0, 2026-02-21T08:43:08.5018779Z 1.0, 2026-02-21T08:43:08.5018913Z 1.0 2026-02-21T08:43:08.5019038Z ] 2026-02-21T08:43:08.5019169Z }, 2026-02-21T08:43:08.5019293Z "shape": [ 2026-02-21T08:43:08.5019435Z "(8, 512, 4096)", 2026-02-21T08:43:08.5019581Z "(8, 512, 8192)", 2026-02-21T08:43:08.5019734Z "(8, 512, 16384)", 2026-02-21T08:43:08.5019879Z "(8, 512, 32768)", 2026-02-21T08:43:08.5020031Z "(8, 512, 65536)", 2026-02-21T08:43:08.5020174Z "(8, 512, 
131072)" 2026-02-21T08:43:08.5020324Z ] 2026-02-21T08:43:08.5020449Z } 2026-02-21T08:43:08.5036264Z ] 2026-02-21T08:43:08.5089991Z ##[group]Run pytorch/test-infra/.github/actions/gather-benchmark-metadata@main 2026-02-21T08:43:08.5090266Z with: 2026-02-21T08:43:08.5090634Z github-token: *** 2026-02-21T08:43:08.5090794Z venv: .venv/bin/activate 2026-02-21T08:43:08.5090951Z schema-version: v3 2026-02-21T08:43:08.5091099Z env: 2026-02-21T08:43:08.5091234Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:43:08.5091447Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:08.5091714Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:43:08.5092017Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:08.5092234Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:08.5092446Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:08.5092794Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:43:08.5093221Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:43:08.5093446Z ##[endgroup] 2026-02-21T08:43:08.5149819Z ##[group]Run set -eux 2026-02-21T08:43:08.5150004Z set -eux 2026-02-21T08:43:08.5150160Z  2026-02-21T08:43:08.5150318Z if [[ -z "${GITHUB_TOKEN}" ]]; then 2026-02-21T08:43:08.5150525Z  echo "Missing github-token input" 2026-02-21T08:43:08.5150714Z  exit 1 2026-02-21T08:43:08.5150843Z fi 2026-02-21T08:43:08.5151779Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T08:43:08.5152060Z env: 2026-02-21T08:43:08.5152205Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:43:08.5152411Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:08.5152671Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:43:08.5152925Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:08.5153141Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:08.5153361Z 
Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:08.5153735Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:43:08.5154248Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:43:08.5154614Z GITHUB_TOKEN: *** 2026-02-21T08:43:08.5154756Z ##[endgroup] 2026-02-21T08:43:08.5713459Z + [[ -z *** ]] 2026-02-21T08:43:08.5777369Z ##[group]Run pytorch/test-infra/.github/actions/get-workflow-job-id@main 2026-02-21T08:43:08.5777635Z with: 2026-02-21T08:43:08.5777877Z github-token: *** 2026-02-21T08:43:08.5778027Z env: 2026-02-21T08:43:08.5778159Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:43:08.5778368Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:08.5778618Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:43:08.5778855Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:08.5779075Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:08.5779302Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:08.5779653Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:43:08.5780029Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:43:08.5780246Z ##[endgroup] 2026-02-21T08:43:08.5789603Z ##[group]Run set -eux 2026-02-21T08:43:08.5789784Z set -eux 2026-02-21T08:43:08.5789939Z  2026-02-21T08:43:08.5790245Z python3 "${GITHUB_ACTION_PATH}/../../scripts/get_workflow_job_id.py" "${GITHUB_RUN_ID}" "${RUNNER_NAME}" 2026-02-21T08:43:08.5790695Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T08:43:08.5790898Z env: 2026-02-21T08:43:08.5791045Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:43:08.5791252Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:08.5791511Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:43:08.5791971Z 
Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:08.5792198Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:08.5792425Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:08.5792801Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:43:08.5793210Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:43:08.5793545Z GITHUB_TOKEN: *** 2026-02-21T08:43:08.5793704Z ##[endgroup] 2026-02-21T08:43:08.6340797Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/get-workflow-job-id/../../scripts/get_workflow_job_id.py 22253280836 dgxb200-04-1007 2026-02-21T08:43:09.9433141Z setting job-id=64380329773 2026-02-21T08:43:09.9435304Z setting job-name=run-b200 (kl_div) / benchmark-cu130-kl_div-py3.12-b200 2026-02-21T08:43:09.9579657Z ##[group]Run set -eux 2026-02-21T08:43:09.9579831Z set -eux 2026-02-21T08:43:09.9579959Z  2026-02-21T08:43:09.9580129Z if [[ -n ".venv/bin/activate" ]]; then 2026-02-21T08:43:09.9580331Z  source ".venv/bin/activate" 2026-02-21T08:43:09.9580494Z fi 2026-02-21T08:43:09.9580621Z  2026-02-21T08:43:09.9580840Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_metadata.py" \ 2026-02-21T08:43:09.9581141Z  --schema-version "${SCHEMA_VERSION}" \ 2026-02-21T08:43:09.9581337Z  --repo "${REPO}" \ 2026-02-21T08:43:09.9581514Z  --head-branch "${HEAD_BRANCH}" \ 2026-02-21T08:43:09.9581706Z  --head-sha "${HEAD_SHA}" \ 2026-02-21T08:43:09.9581965Z  --workflow-id "${WORKFLOW_RUN_ID}" \ 2026-02-21T08:43:09.9582164Z  --run-attempt "${RUN_ATTEMPT}" \ 2026-02-21T08:43:09.9582340Z  --job-id "${JOB_ID}" \ 2026-02-21T08:43:09.9582517Z  --job-name "${JOB_NAME}" 2026-02-21T08:43:09.9582768Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T08:43:09.9582953Z env: 2026-02-21T08:43:09.9583085Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:43:09.9583282Z pythonLocation: /__w/_tool/Python/3.12.12/x64 
2026-02-21T08:43:09.9583519Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:43:09.9583858Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:09.9584070Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:09.9584271Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:09.9584619Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:43:09.9584993Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:43:09.9585203Z SCHEMA_VERSION: v3 2026-02-21T08:43:09.9585359Z REPO: pytorch/helion 2026-02-21T08:43:09.9585509Z HEAD_BRANCH: refs/heads/main 2026-02-21T08:43:09.9585711Z HEAD_SHA: 874a7d0cadab18218a84ad3579d329dc95c51820 2026-02-21T08:43:09.9585909Z WORKFLOW_RUN_ID: 22253280836 2026-02-21T08:43:09.9586077Z RUN_ATTEMPT: 1 2026-02-21T08:43:09.9586211Z JOB_ID: 64380329773 2026-02-21T08:43:09.9586409Z JOB_NAME: run-b200 (kl_div) / benchmark-cu130-kl_div-py3.12-b200 2026-02-21T08:43:09.9586637Z ##[endgroup] 2026-02-21T08:43:10.0174548Z + [[ -n .venv/bin/activate ]] 2026-02-21T08:43:10.0174767Z + source .venv/bin/activate 2026-02-21T08:43:10.0174935Z ++ '[' -z '' ']' 2026-02-21T08:43:10.0175081Z ++ '[' -n x ']' 2026-02-21T08:43:10.0175232Z ++ SCRIPT_PATH=.venv/bin/activate 2026-02-21T08:43:10.0175493Z ++ '[' .venv/bin/activate = /__w/_temp/297db34b-cc20-4609-b220-031348e5e286.sh ']' 2026-02-21T08:43:10.0175757Z ++ deactivate nondestructive 2026-02-21T08:43:10.0175921Z ++ unset -f pydoc 2026-02-21T08:43:10.0176052Z ++ '[' -z '' ']' 2026-02-21T08:43:10.0176183Z ++ '[' -z '' ']' 2026-02-21T08:43:10.0176303Z ++ hash -r 2026-02-21T08:43:10.0176426Z ++ '[' -z '' ']' 2026-02-21T08:43:10.0176553Z ++ unset VIRTUAL_ENV 2026-02-21T08:43:10.0176706Z ++ unset VIRTUAL_ENV_PROMPT 2026-02-21T08:43:10.0177153Z ++ '[' '!' 
nondestructive = nondestructive ']' 2026-02-21T08:43:10.0177368Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv 2026-02-21T08:43:10.0177646Z ++ '[' linux-gnu = cygwin ']' 2026-02-21T08:43:10.0177839Z ++ '[' linux-gnu = msys ']' 2026-02-21T08:43:10.0178029Z ++ export VIRTUAL_ENV 2026-02-21T08:43:10.0178169Z ++ '[' -z '' ']' 2026-02-21T08:43:10.0178313Z ++ unset SCRIPT_PATH 2026-02-21T08:43:10.0178921Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:43:10.0180009Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:43:10.0180646Z ++ export PATH 2026-02-21T08:43:10.0180784Z ++ '[' xhelion '!=' x ']' 2026-02-21T08:43:10.0180956Z ++ VIRTUAL_ENV_PROMPT=helion 2026-02-21T08:43:10.0181120Z ++ export VIRTUAL_ENV_PROMPT 2026-02-21T08:43:10.0181267Z ++ '[' -z '' ']' 2026-02-21T08:43:10.0181395Z ++ '[' -z '' ']' 2026-02-21T08:43:10.0181521Z ++ _OLD_VIRTUAL_PS1= 2026-02-21T08:43:10.0181667Z ++ PS1='(helion) ' 2026-02-21T08:43:10.0181802Z ++ export PS1 2026-02-21T08:43:10.0182018Z ++ alias pydoc 2026-02-21T08:43:10.0182146Z ++ true 2026-02-21T08:43:10.0182273Z ++ hash -r 2026-02-21T08:43:10.0183220Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/gather-benchmark-metadata/../../scripts/benchmarks/gather_metadata.py --schema-version v3 --repo pytorch/helion --head-branch refs/heads/main --head-sha 874a7d0cadab18218a84ad3579d329dc95c51820 --workflow-id 22253280836 --run-attempt 1 --job-id 64380329773 --job-name 'run-b200 (kl_div) / benchmark-cu130-kl_div-py3.12-b200' 
2026-02-21T08:43:10.0543606Z ##[group]Run pytorch/test-infra/.github/actions/gather-runners-info@main 2026-02-21T08:43:10.0543866Z with: 2026-02-21T08:43:10.0544006Z venv: .venv/bin/activate 2026-02-21T08:43:10.0544154Z env: 2026-02-21T08:43:10.0544292Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:43:10.0544570Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:10.0544818Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:43:10.0545050Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:10.0545269Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:10.0545476Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:10.0545829Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:43:10.0546223Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:43:10.0546432Z ##[endgroup] 2026-02-21T08:43:10.0555367Z ##[group]Run set -eux 2026-02-21T08:43:10.0555537Z set -eux 2026-02-21T08:43:10.0555679Z  2026-02-21T08:43:10.0555815Z if command -v nvidia-smi; then 2026-02-21T08:43:10.0556007Z  DEVICE_NAME=cuda 2026-02-21T08:43:10.0556167Z  nvidia-smi 2026-02-21T08:43:10.0556327Z elif command -v rocm-smi; then 2026-02-21T08:43:10.0556509Z  DEVICE_NAME=rocm 2026-02-21T08:43:10.0556657Z  rocm-smi 2026-02-21T08:43:10.0556809Z elif command -v hl-smi; then 2026-02-21T08:43:10.0556981Z  DEVICE_NAME=hpu 2026-02-21T08:43:10.0557133Z  hl-smi 2026-02-21T08:43:10.0557258Z else 2026-02-21T08:43:10.0557397Z  arch=$(uname -m) 2026-02-21T08:43:10.0557545Z  2026-02-21T08:43:10.0557667Z  case "$arch" in 2026-02-21T08:43:10.0557822Z  aarch64|arm64) 2026-02-21T08:43:10.0557978Z  DEVICE_NAME=arm64-cpu 2026-02-21T08:43:10.0558146Z  ;; 2026-02-21T08:43:10.0558271Z  *) 2026-02-21T08:43:10.0558409Z  DEVICE_NAME=cpu 2026-02-21T08:43:10.0558558Z  ;; 2026-02-21T08:43:10.0558686Z  esac 2026-02-21T08:43:10.0558810Z  lscpu 
2026-02-21T08:43:10.0558945Z fi 2026-02-21T08:43:10.0559112Z echo "DEVICE_NAME=$DEVICE_NAME" >> $GITHUB_ENV 2026-02-21T08:43:10.0559397Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T08:43:10.0559591Z env: 2026-02-21T08:43:10.0559719Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:43:10.0559915Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:10.0560148Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:43:10.0560385Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:10.0560596Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:10.0560799Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:10.0561151Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:43:10.0561518Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:43:10.0561732Z ##[endgroup] 2026-02-21T08:43:10.1111408Z /usr/bin/nvidia-smi 2026-02-21T08:43:10.1112953Z + command -v nvidia-smi 2026-02-21T08:43:10.1113134Z + DEVICE_NAME=cuda 2026-02-21T08:43:10.1113277Z + nvidia-smi 2026-02-21T08:43:10.1256870Z Sat Feb 21 08:43:10 2026 2026-02-21T08:43:10.1257220Z +-----------------------------------------------------------------------------------------+ 2026-02-21T08:43:10.1257618Z | NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 | 2026-02-21T08:43:10.1258001Z +-----------------------------------------+------------------------+----------------------+ 2026-02-21T08:43:10.1258381Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2026-02-21T08:43:10.1259367Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2026-02-21T08:43:10.1259741Z | | | MIG M. 
| 2026-02-21T08:43:10.1259970Z |=========================================+========================+======================| 2026-02-21T08:43:10.1318187Z | 0 NVIDIA B200 Off | 00000000:9D:00.0 Off | 0 | 2026-02-21T08:43:10.1318604Z | N/A 40C P0 197W / 750W | 0MiB / 183359MiB | 0% Default | 2026-02-21T08:43:10.1318932Z | | | Disabled | 2026-02-21T08:43:10.1319264Z +-----------------------------------------+------------------------+----------------------+ 2026-02-21T08:43:10.1319460Z 2026-02-21T08:43:10.1319585Z +-----------------------------------------------------------------------------------------+ 2026-02-21T08:43:10.1319883Z | Processes: | 2026-02-21T08:43:10.1320169Z | GPU GI CI PID Type Process name GPU Memory | 2026-02-21T08:43:10.1320429Z | ID ID Usage | 2026-02-21T08:43:10.1320668Z |=========================================================================================| 2026-02-21T08:43:10.1320930Z | No running processes found | 2026-02-21T08:43:10.1321228Z +-----------------------------------------------------------------------------------------+ 2026-02-21T08:43:10.1592502Z + echo DEVICE_NAME=cuda 2026-02-21T08:43:10.1627426Z ##[group]Run set -eux 2026-02-21T08:43:10.1627619Z set -eux 2026-02-21T08:43:10.1627756Z  2026-02-21T08:43:10.1627915Z if [[ "${DEVICE_NAME}" == "cuda" ]]; then 2026-02-21T08:43:10.1628146Z  # Return the same device name as PyTorch 2026-02-21T08:43:10.1628467Z  DEVICE_TYPE=$(nvidia-smi -i 0 --query-gpu=name --format=csv,noheader) 2026-02-21T08:43:10.1628746Z elif [[ "${DEVICE_NAME}" == "rocm" ]]; then 2026-02-21T08:43:10.1629047Z  DEVICE_TYPE=$(rocminfo | grep "Marketing Name" | tail -n1 | awk -F':' '{print $2}' | xargs) 2026-02-21T08:43:10.1629365Z elif [[ "${DEVICE_NAME}" == "hpu" ]]; then 2026-02-21T08:43:10.1629713Z  DEVICE_TYPE="Intel Gaudi3 "$(hl-smi -q | grep "Product Name" | head -n 1 | awk -F ':' '{print $2}' | sed 's/^ *//') 2026-02-21T08:43:10.1630055Z elif [[ "${DEVICE_NAME}" == "cpu" ]]; then 
2026-02-21T08:43:10.1630724Z  DEVICE_TYPE="$(lscpu | grep "Model name" | sed -E 's/.*Model name:[[:space:]]*//; s/Intel\(R\)//g; s/\(R\)//g; s/\(TM\)//g; s/CPU//g; s/Processor//g; s/[[:space:]]+/ /g; s/^ //; s/ $//; s/ /_/g')_$(awk -F: '/Core\(s\) per socket/ {c=$2} /Socket\(s\)/ {s=$2} END {gsub(/ /,"",c); gsub(/ /,"",s); printf "%sc", c*s}' < <(lscpu))" 2026-02-21T08:43:10.1631373Z elif [[ "${DEVICE_NAME}" == "arm64-cpu" ]]; then 2026-02-21T08:43:10.1631679Z  DEVICE_TYPE=$(lscpu | grep 'Vendor ID' | cut -f 2 -d ":" | awk '{$1=$1}1' | cut -f 2 -d " ") 2026-02-21T08:43:10.1632014Z fi 2026-02-21T08:43:10.1632178Z echo "DEVICE_TYPE=$DEVICE_TYPE" >> $GITHUB_ENV 2026-02-21T08:43:10.1632464Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T08:43:10.1632665Z env: 2026-02-21T08:43:10.1632795Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:43:10.1632989Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:10.1633222Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:43:10.1633457Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:10.1633660Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:10.1633868Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:10.1634297Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:43:10.1634667Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:43:10.1634884Z DEVICE_NAME: cuda 2026-02-21T08:43:10.1635023Z ##[endgroup] 2026-02-21T08:43:10.2121025Z + [[ cuda == \c\u\d\a ]] 2026-02-21T08:43:10.2125611Z ++ nvidia-smi -i 0 --query-gpu=name --format=csv,noheader 2026-02-21T08:43:10.2305549Z + DEVICE_TYPE='NVIDIA B200' 2026-02-21T08:43:10.2307203Z + echo 'DEVICE_TYPE=NVIDIA B200' 2026-02-21T08:43:10.2339998Z ##[group]Run set -eux 2026-02-21T08:43:10.2340165Z set -eux 2026-02-21T08:43:10.2340294Z  2026-02-21T08:43:10.2340446Z if [[ -n ".venv/bin/activate" 
]]; then 2026-02-21T08:43:10.2340642Z  source ".venv/bin/activate" 2026-02-21T08:43:10.2340813Z fi 2026-02-21T08:43:10.2340932Z  2026-02-21T08:43:10.2341123Z python3 -mpip install psutil==7.0.0 nvidia-ml-py==13.580.82 2026-02-21T08:43:10.2341461Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_runners_info.py" 2026-02-21T08:43:10.2341838Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T08:43:10.2342104Z env: 2026-02-21T08:43:10.2342233Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:43:10.2342433Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:10.2342669Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:43:10.2342906Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:10.2343118Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:10.2343317Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:10.2343662Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:43:10.2344027Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:43:10.2344240Z DEVICE_NAME: cuda 2026-02-21T08:43:10.2344379Z DEVICE_TYPE: NVIDIA B200 2026-02-21T08:43:10.2344533Z ##[endgroup] 2026-02-21T08:43:10.2822696Z + [[ -n .venv/bin/activate ]] 2026-02-21T08:43:10.2822974Z + source .venv/bin/activate 2026-02-21T08:43:10.2823168Z ++ '[' -z '' ']' 2026-02-21T08:43:10.2823323Z ++ '[' -n x ']' 2026-02-21T08:43:10.2823476Z ++ SCRIPT_PATH=.venv/bin/activate 2026-02-21T08:43:10.2823735Z ++ '[' .venv/bin/activate = /__w/_temp/f0e3475c-e481-4719-a418-fb2e6ecaf6e2.sh ']' 2026-02-21T08:43:10.2824007Z ++ deactivate nondestructive 2026-02-21T08:43:10.2824171Z ++ unset -f pydoc 2026-02-21T08:43:10.2824303Z ++ '[' -z '' ']' 2026-02-21T08:43:10.2824440Z ++ '[' -z '' ']' 2026-02-21T08:43:10.2824576Z ++ hash -r 2026-02-21T08:43:10.2824705Z ++ '[' -z '' ']' 2026-02-21T08:43:10.2824834Z ++ unset VIRTUAL_ENV 
2026-02-21T08:43:10.2824994Z ++ unset VIRTUAL_ENV_PROMPT 2026-02-21T08:43:10.2825173Z ++ '[' '!' nondestructive = nondestructive ']' 2026-02-21T08:43:10.2825373Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv 2026-02-21T08:43:10.2825555Z ++ '[' linux-gnu = cygwin ']' 2026-02-21T08:43:10.2825708Z ++ '[' linux-gnu = msys ']' 2026-02-21T08:43:10.2825884Z ++ export VIRTUAL_ENV 2026-02-21T08:43:10.2826018Z ++ '[' -z '' ']' 2026-02-21T08:43:10.2826152Z ++ unset SCRIPT_PATH 2026-02-21T08:43:10.2826758Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:43:10.2827858Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T08:43:10.2828501Z ++ export PATH 2026-02-21T08:43:10.2828641Z ++ '[' xhelion '!=' x ']' 2026-02-21T08:43:10.2828808Z ++ VIRTUAL_ENV_PROMPT=helion 2026-02-21T08:43:10.2828975Z ++ export VIRTUAL_ENV_PROMPT 2026-02-21T08:43:10.2829134Z ++ '[' -z '' ']' 2026-02-21T08:43:10.2829450Z ++ '[' -z '' ']' 2026-02-21T08:43:10.2829587Z ++ _OLD_VIRTUAL_PS1= 2026-02-21T08:43:10.2829737Z ++ PS1='(helion) ' 2026-02-21T08:43:10.2829874Z ++ export PS1 2026-02-21T08:43:10.2830014Z ++ alias pydoc 2026-02-21T08:43:10.2830144Z ++ true 2026-02-21T08:43:10.2830331Z ++ hash -r 2026-02-21T08:43:10.2830562Z + python3 -mpip install psutil==7.0.0 nvidia-ml-py==13.580.82 2026-02-21T08:43:10.9333144Z Collecting psutil==7.0.0 2026-02-21T08:43:10.9844739Z Downloading psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB) 
2026-02-21T08:43:11.0051804Z Collecting nvidia-ml-py==13.580.82 2026-02-21T08:43:11.0108076Z Downloading nvidia_ml_py-13.580.82-py3-none-any.whl.metadata (9.6 kB) 2026-02-21T08:43:11.0200804Z Downloading psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (277 kB) 2026-02-21T08:43:11.0388459Z Downloading nvidia_ml_py-13.580.82-py3-none-any.whl (49 kB) 2026-02-21T08:43:11.1221699Z Installing collected packages: nvidia-ml-py, psutil 2026-02-21T08:43:11.1229116Z Attempting uninstall: nvidia-ml-py 2026-02-21T08:43:11.1246995Z Found existing installation: nvidia-ml-py 13.590.48 2026-02-21T08:43:11.1257486Z Uninstalling nvidia-ml-py-13.590.48: 2026-02-21T08:43:11.1898196Z Successfully uninstalled nvidia-ml-py-13.590.48 2026-02-21T08:43:11.2359727Z Attempting uninstall: psutil 2026-02-21T08:43:11.2389655Z Found existing installation: psutil 7.2.2 2026-02-21T08:43:11.2404589Z Uninstalling psutil-7.2.2: 2026-02-21T08:43:11.2408592Z Successfully uninstalled psutil-7.2.2 2026-02-21T08:43:11.3524495Z 2026-02-21T08:43:11.3556459Z Successfully installed nvidia-ml-py-13.580.82 psutil-7.0.0 2026-02-21T08:43:11.4756596Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/gather-runners-info/../../scripts/benchmarks/gather_runners_info.py 2026-02-21T08:43:13.1112720Z ##[group]Run pytorch/test-infra/.github/actions/gather-dependencies@main 2026-02-21T08:43:13.1112993Z with: 2026-02-21T08:43:13.1113137Z venv: .venv/bin/activate 2026-02-21T08:43:13.1113288Z env: 2026-02-21T08:43:13.1113431Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:43:13.1113629Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:13.1113897Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:43:13.1114130Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:13.1114345Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:13.1114551Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 
2026-02-21T08:43:13.1114901Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:43:13.1115278Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:43:13.1115488Z DEVICE_NAME: cuda 2026-02-21T08:43:13.1115662Z DEVICE_TYPE: NVIDIA B200 2026-02-21T08:43:13.1115812Z ##[endgroup] 2026-02-21T08:43:13.1124363Z ##[group]Run set -eux 2026-02-21T08:43:13.1124530Z set -eux 2026-02-21T08:43:13.1124667Z  2026-02-21T08:43:13.1124815Z # TODO (huydhn): Implement this part 2026-02-21T08:43:13.1125047Z echo "dependencies={}" >> "${GITHUB_OUTPUT}" 2026-02-21T08:43:13.1125358Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T08:43:13.1125547Z env: 2026-02-21T08:43:13.1125685Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:43:13.1125875Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:13.1126114Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:43:13.1126345Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:13.1126559Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:13.1126769Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:13.1127123Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:43:13.1127502Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:43:13.1127710Z DEVICE_NAME: cuda 2026-02-21T08:43:13.1127858Z DEVICE_TYPE: NVIDIA B200 2026-02-21T08:43:13.1128005Z ##[endgroup] 2026-02-21T08:43:13.1706392Z + echo 'dependencies={}' 2026-02-21T08:43:13.1755454Z ##[group]Run actions/upload-artifact@v6 2026-02-21T08:43:13.1755658Z with: 2026-02-21T08:43:13.1755811Z name: benchmark-results-b200-kl_div 2026-02-21T08:43:13.1755997Z path: test/test-reports 2026-02-21T08:43:13.1756163Z if-no-files-found: warn 2026-02-21T08:43:13.1756318Z compression-level: 6 2026-02-21T08:43:13.1756470Z 
overwrite: false 2026-02-21T08:43:13.1756611Z include-hidden-files: false 2026-02-21T08:43:13.1756774Z env: 2026-02-21T08:43:13.1756903Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T08:43:13.1757103Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:13.1757341Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T08:43:13.1757583Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:13.1757796Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:13.1758003Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T08:43:13.1758377Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64 2026-02-21T08:43:13.1758773Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T08:43:13.1758988Z DEVICE_NAME: cuda 2026-02-21T08:43:13.1759128Z DEVICE_TYPE: NVIDIA B200 2026-02-21T08:43:13.1759285Z ##[endgroup] 2026-02-21T08:43:13.1761398Z ##[command]/usr/bin/docker exec 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:43:13.3962312Z With the provided path, there will be 1 file uploaded 2026-02-21T08:43:13.3966787Z Artifact name is valid! 2026-02-21T08:43:13.3971460Z Root directory input is valid! 2026-02-21T08:43:13.6655149Z Beginning upload of artifact content to blob storage 2026-02-21T08:43:14.0133362Z Uploaded bytes 622 2026-02-21T08:43:14.1034365Z Finished uploading artifact content to blob storage! 2026-02-21T08:43:14.1036020Z SHA256 digest of uploaded artifact zip is 5753666ca7007086ebe962314e4f20502e13d355286efeb938ff9ecc486f419a 2026-02-21T08:43:14.1036416Z Finalizing artifact upload 2026-02-21T08:43:14.4186289Z Artifact benchmark-results-b200-kl_div.zip successfully finalized. Artifact ID 5600481450 2026-02-21T08:43:14.4186860Z Artifact benchmark-results-b200-kl_div has been successfully uploaded! Final size is 622 bytes. 
Artifact ID is 5600481450 2026-02-21T08:43:14.4190494Z Artifact download URL: https://github.com/pytorch/helion/actions/runs/22253280836/artifacts/5600481450 2026-02-21T08:43:14.4311744Z Post job cleanup. 2026-02-21T08:43:14.4315561Z ##[command]/usr/bin/docker exec 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:43:14.6199438Z UV_PYTHON_INSTALL_DIR is already set to /github/home/.local/share/uv/python 2026-02-21T08:43:14.6199931Z (node:83677) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead. 2026-02-21T08:43:14.6200366Z (Use `node --trace-deprecation ...` to show where the warning was created) 2026-02-21T08:43:14.6294229Z Post job cleanup. 2026-02-21T08:43:14.6296587Z ##[command]/usr/bin/docker exec 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:43:14.8236290Z Post job cleanup. 2026-02-21T08:43:14.8239367Z ##[command]/usr/bin/docker exec 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 sh -c "cat /etc/*release | grep ^ID" 2026-02-21T08:43:14.9928874Z [command]/usr/bin/git version 2026-02-21T08:43:14.9958720Z git version 2.43.0 2026-02-21T08:43:14.9989677Z Temporarily overriding HOME='/__w/_temp/f15b19ce-d71f-469b-8167-b80cbddb17e7' before making global git config changes 2026-02-21T08:43:14.9991691Z Adding repository directory to the temporary git global config as a safe directory 2026-02-21T08:43:14.9992186Z [command]/usr/bin/git config --global --add safe.directory /__w/helion/helion 2026-02-21T08:43:15.0028149Z Removing SSH command configuration 2026-02-21T08:43:15.0028471Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2026-02-21T08:43:15.0054257Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || 
:" 2026-02-21T08:43:15.0275987Z Removing HTTP extra header 2026-02-21T08:43:15.0276432Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2026-02-21T08:43:15.0297702Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2026-02-21T08:43:15.0504823Z Removing includeIf entries pointing to credentials config files 2026-02-21T08:43:15.0513433Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2026-02-21T08:43:15.0527265Z includeif.gitdir:/__w/helion/helion/.git.path 2026-02-21T08:43:15.0527565Z includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path 2026-02-21T08:43:15.0527837Z includeif.gitdir:/github/workspace/.git.path 2026-02-21T08:43:15.0528079Z includeif.gitdir:/github/workspace/.git/worktrees/*.path 2026-02-21T08:43:15.0534023Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/__w/helion/helion/.git.path 2026-02-21T08:43:15.0550998Z /__w/_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config 2026-02-21T08:43:15.0558638Z [command]/usr/bin/git config --local --unset includeif.gitdir:/__w/helion/helion/.git.path /__w/_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config 2026-02-21T08:43:15.0580410Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path 2026-02-21T08:43:15.0607188Z /__w/_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config 2026-02-21T08:43:15.0612191Z [command]/usr/bin/git config --local --unset includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path /__w/_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config 2026-02-21T08:43:15.0636745Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/github/workspace/.git.path 2026-02-21T08:43:15.0653591Z 
/github/runner_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config 2026-02-21T08:43:15.0657675Z [command]/usr/bin/git config --local --unset includeif.gitdir:/github/workspace/.git.path /github/runner_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config 2026-02-21T08:43:15.0684498Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/github/workspace/.git/worktrees/*.path 2026-02-21T08:43:15.0704700Z /github/runner_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config 2026-02-21T08:43:15.0711788Z [command]/usr/bin/git config --local --unset includeif.gitdir:/github/workspace/.git/worktrees/*.path /github/runner_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config 2026-02-21T08:43:15.0742613Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2026-02-21T08:43:15.0965489Z Removing credentials config '/__w/_temp/git-credentials-54f21934-3188-4617-90d1-6c095be51146.config' 2026-02-21T08:43:15.1058465Z Stop and remove container: d92e73d387994e1e949d78541a22b449_nvidiacuda1301develubuntu2404_f71b97 2026-02-21T08:43:15.1061532Z ##[command]/usr/bin/docker rm --force 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 2026-02-21T08:43:18.2876758Z 227825efe019125b70005c6ebc31886ea4d33207d7c9141b934d29ece02d9a12 2026-02-21T08:43:18.2905225Z Remove container network: github_network_635fe730ba6e422c803643553ff1a973 2026-02-21T08:43:18.2907943Z ##[command]/usr/bin/docker network rm github_network_635fe730ba6e422c803643553ff1a973 2026-02-21T08:43:18.7397906Z github_network_635fe730ba6e422c803643553ff1a973 2026-02-21T08:43:18.7452398Z Evaluate and set job outputs 2026-02-21T08:43:18.7457545Z Set output 'benchmark-metadata' 2026-02-21T08:43:18.7458985Z Set output 'runners-info' 2026-02-21T08:43:18.7459631Z Set output 'dependencies' 2026-02-21T08:43:18.7460035Z Cleaning up orphan processes